Detection API
pretok.detection.LanguageDetector
Bases: Protocol
Protocol for language detection backends.
All language detectors must implement this protocol to be usable with pretok's detection system.
Example
    >>> from collections.abc import Sequence
    >>> from pretok.detection import DetectionResult
    >>> class MyDetector:
    ...     @property
    ...     def name(self) -> str:
    ...         return "my_detector"
    ...
    ...     def detect(self, text: str) -> DetectionResult:
    ...         # Detection logic here
    ...         return DetectionResult(
    ...             language="en",
    ...             confidence=0.95,
    ...             detector=self.name,
    ...         )
    ...
    ...     def detect_batch(self, texts: Sequence[str]) -> list[DetectionResult]:
    ...         return [self.detect(t) for t in texts]
Source code in src/pretok/detection/__init__.py
name
property
Return the detector's unique name identifier.
detect(text)
Detect the language of a single text.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| text | str | The text to detect language for | required |

Returns:
| Type | Description |
|---|---|
| DetectionResult | DetectionResult with language code and confidence |
detect_batch(texts)
Detect languages for multiple texts.
Default implementation calls detect() for each text, but backends may override for batch optimization.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| texts | Sequence[str] | Sequence of texts to detect | required |

Returns:
| Type | Description |
|---|---|
| list[DetectionResult] | List of DetectionResult for each input text |
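Because LanguageDetector is a Protocol, a backend only has to match its three members; it does not need to inherit from anything. The following is a minimal, self-contained sketch of that idea. `DetectionResultSketch`, `LanguageDetectorSketch`, and `AlwaysEnglish` are hypothetical stand-ins rather than pretok's classes, and marking the protocol `runtime_checkable` is an assumption made here so the check can be shown with `isinstance`.

```python
from collections.abc import Sequence
from dataclasses import dataclass
from typing import Protocol, runtime_checkable


@dataclass
class DetectionResultSketch:
    """Stand-in for pretok's DetectionResult (illustration only)."""
    language: str
    confidence: float
    detector: str


@runtime_checkable  # assumption: lets isinstance() verify the shape below
class LanguageDetectorSketch(Protocol):
    """Stand-in protocol with the three documented members."""

    @property
    def name(self) -> str: ...

    def detect(self, text: str) -> DetectionResultSketch: ...

    def detect_batch(self, texts: Sequence[str]) -> list[DetectionResultSketch]: ...


class AlwaysEnglish:
    """Hypothetical backend: no inheritance, it only matches the protocol's shape."""

    @property
    def name(self) -> str:
        return "always_english"

    def detect(self, text: str) -> DetectionResultSketch:
        return DetectionResultSketch(language="en", confidence=1.0, detector=self.name)

    def detect_batch(self, texts: Sequence[str]) -> list[DetectionResultSketch]:
        # Same per-text loop that the documented default behavior describes.
        return [self.detect(t) for t in texts]


backend = AlwaysEnglish()
results = backend.detect_batch(["hello", "world"])
```

A backend with a real batch API would replace the loop in `detect_batch` with a single call, which is the optimization the description above allows for.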
pretok.detection.DetectionResult
dataclass
Result from language detection.
Attributes:
| Name | Type | Description |
|---|---|---|
| language | str | ISO 639-1 language code (e.g., 'en', 'zh', 'ja') |
| confidence | float | Confidence score between 0.0 and 1.0 |
| detector | str | Name of the detector that produced this result |
| raw_output | dict[str, Any] \| None | Optional raw output from the detector for debugging |
Source code in src/pretok/detection/__init__.py
__post_init__()
Validate that confidence is within the valid range (0.0 to 1.0).
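The validation step can be pictured with a small stand-in dataclass. This is a sketch under assumptions, not pretok's source: `DetectionResultSketch` is a hypothetical name, the `None` default for `raw_output` is assumed, and the exception type raised for an out-of-range score (`ValueError` here) is not stated in this reference.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class DetectionResultSketch:
    """Stand-in mirroring the documented attributes (not pretok's source)."""
    language: str
    confidence: float
    detector: str
    raw_output: Optional[dict[str, Any]] = None  # default is an assumption

    def __post_init__(self) -> None:
        # Reject scores outside the documented 0.0 to 1.0 range.
        # The exception type is an assumption made for this sketch.
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError(
                f"confidence must be in [0.0, 1.0], got {self.confidence}"
            )


ok = DetectionResultSketch(language="en", confidence=0.95, detector="demo")

try:
    DetectionResultSketch(language="en", confidence=1.5, detector="demo")
    rejected = False
except ValueError:
    rejected = True
```

Validating in `__post_init__` means an invalid result cannot be constructed at all, so downstream code never has to re-check the range.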
pretok.detection.fasttext_backend.FastTextDetector
Bases: BaseDetector
Language detector using FastText library.
FastText provides fast and accurate language identification. Requires a pretrained language identification model.
Example
    >>> from pretok.detection.fasttext_backend import FastTextDetector
    >>> detector = FastTextDetector()
    >>> result = detector.detect("Bonjour le monde!")
    >>> result.language
    'fr'
Source code in src/pretok/detection/fasttext_backend.py
name
property
Return detector name.
__init__(config=None)
Initialize the FastText detector.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| config | FastTextConfig \| None | Optional configuration for FastText | None |

Raises:
| Type | Description |
|---|---|
| ImportError | If fasttext is not installed |
detect(text)
Detect language using FastText.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| text | str | Text to detect language for | required |

Returns:
| Type | Description |
|---|---|
| DetectionResult | DetectionResult with detected language |

Raises:
| Type | Description |
|---|---|
| DetectionError | If detection fails |
detect_batch(texts)
Detect languages for multiple texts efficiently.
FastText can process multiple texts more efficiently in batch.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| texts | Sequence[str] | Sequence of texts to detect | required |

Returns:
| Type | Description |
|---|---|
| list[DetectionResult] | List of DetectionResult for each input text |
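Two details tend to matter when batching with fastText's language-identification models: each input has to be a single line, and predicted labels come back with a `__label__` prefix. The sketch below shows only the pre- and post-processing around such a call. It loads no model, the prediction values are hard-coded stand-ins, and `normalize_text` and `parse_prediction` are hypothetical helpers; how FastTextDetector handles these steps internally is not shown in this reference.

```python
from collections.abc import Sequence

LABEL_PREFIX = "__label__"


def normalize_text(text: str) -> str:
    """Collapse all whitespace (including newlines) into single spaces."""
    return " ".join(text.split())


def parse_prediction(labels: Sequence[str], probs: Sequence[float]) -> tuple[str, float]:
    """Turn a top-1 fastText-style prediction into (language, confidence)."""
    language = labels[0].removeprefix(LABEL_PREFIX)
    # Reported probabilities may drift slightly above 1.0, so clamp into [0.0, 1.0].
    confidence = min(max(float(probs[0]), 0.0), 1.0)
    return language, confidence


texts = ["Bonjour\nle monde!", "Hello, world!"]
cleaned = [normalize_text(t) for t in texts]

# Hard-coded stand-in for what a model's batch prediction might return.
fake_labels = [["__label__fr"], ["__label__en"]]
fake_probs = [[1.00001], [0.98]]

parsed = [
    parse_prediction(labels, probs)
    for labels, probs in zip(fake_labels, fake_probs)
]
```

Clamping keeps the parsed score compatible with the 0.0 to 1.0 range that DetectionResult validates.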
pretok.detection.langdetect_backend.LangDetectDetector
Bases: BaseDetector
Language detector using langdetect library.
langdetect is a port of Google's language-detection library to Python. It's lightweight and doesn't require model files.
Example
    >>> from pretok.detection.langdetect_backend import LangDetectDetector
    >>> detector = LangDetectDetector()
    >>> result = detector.detect("Hello, world!")
    >>> result.language
    'en'
Source code in src/pretok/detection/langdetect_backend.py
name
property
Return detector name.
__init__(config=None)
Initialize the langdetect detector.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| config | LangDetectConfig \| None | Optional configuration for langdetect | None |

Raises:
| Type | Description |
|---|---|
| ImportError | If langdetect is not installed |
detect(text)
Detect language using langdetect.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| text | str | Text to detect language for | required |

Returns:
| Type | Description |
|---|---|
| DetectionResult | DetectionResult with detected language |

Raises:
| Type | Description |
|---|---|
| DetectionError | If detection fails |
detect_with_alternatives(text, *, top_k=3)
Detect language with alternative possibilities.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| text | str | Text to detect language for | required |
| top_k | int | Number of top results to return | 3 |

Returns:
| Type | Description |
|---|---|
| list[DetectionResult] | List of DetectionResult ordered by confidence |
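The ordering and truncation implied by top_k can be illustrated on plain scores. This is a sketch only: `top_k_candidates` is a hypothetical helper, the scores are made up, and the sketch assumes results are sorted from highest to lowest confidence, which is the natural reading of "ordered by confidence".

```python
def top_k_candidates(scores: dict[str, float], top_k: int = 3) -> list[tuple[str, float]]:
    """Return up to top_k (language, confidence) pairs, highest confidence first."""
    if top_k < 1:
        raise ValueError("top_k must be at least 1")
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_k]


# Made-up candidate scores for a single text.
scores = {"en": 0.71, "nl": 0.14, "de": 0.09, "af": 0.06}

best_three = top_k_candidates(scores)
best_one = top_k_candidates(scores, top_k=1)
```

When fewer than top_k candidates exist, the slice simply returns all of them.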
pretok.detection.composite.CompositeDetector
Bases: BaseDetector
Detector that combines multiple backends for improved accuracy.
Supports multiple aggregation strategies:
- voting: Use majority vote among detectors
- weighted_average: Use weighted average of confidences
- fallback_chain: Use first successful detection
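The three strategies can be sketched on plain `(language, confidence)` pairs. This is one plausible reading of each strategy, not pretok's implementation: the function names, the per-backend weights, the tie handling, and the use of `None` to mark a failed backend are all assumptions made for illustration.

```python
from collections import Counter, defaultdict
from typing import Optional

# Each vote is (language, confidence), or None when a backend failed.
Vote = Optional[tuple[str, float]]


def vote_majority(votes: list[Vote]) -> str:
    """voting: the language named by the most backends wins."""
    counts = Counter(v[0] for v in votes if v is not None)
    return counts.most_common(1)[0][0]


def vote_weighted(votes: list[Vote], weights: list[float]) -> tuple[str, float]:
    """weighted_average: weighted confidence per language, divided by total weight."""
    totals: dict[str, float] = defaultdict(float)
    used = 0.0
    for vote, weight in zip(votes, weights):
        if vote is None:
            continue  # failed backends contribute nothing
        totals[vote[0]] += vote[1] * weight
        used += weight
    language = max(totals, key=totals.__getitem__)
    return language, totals[language] / used


def vote_fallback(votes: list[Vote]) -> tuple[str, float]:
    """fallback_chain: the first backend that produced a result wins."""
    for vote in votes:
        if vote is not None:
            return vote
    raise RuntimeError("all backends failed")


# Made-up outputs from four backends; the first one failed.
votes = [None, ("en", 0.6), ("fr", 0.9), ("en", 0.8)]
weights = [1.0, 1.0, 1.0, 2.0]
```

With these inputs, majority voting and the weighted average both pick "en" even though "fr" has the single highest confidence, while the fallback chain returns the first non-failing vote unchanged.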
Example
    >>> from pretok.detection.composite import CompositeDetector
    >>> from pretok.detection.langdetect_backend import LangDetectDetector
    >>> detector = CompositeDetector([LangDetectDetector()])
    >>> result = detector.detect("Hello, world!")
    >>> result.language
    'en'
Source code in src/pretok/detection/composite.py
detectors
property
Return list of backend detectors.
name
property
Return detector name.
__init__(detectors, config=None)
Initialize composite detector.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| detectors | Sequence[BaseDetector] | List of detector backends to combine | required |
| config | CompositeDetectorConfig \| None | Optional configuration | None |

Raises:
| Type | Description |
|---|---|
| ValueError | If no detectors provided |
detect(text)
Detect language using combined backends.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| text | str | Text to detect language for | required |

Returns:
| Type | Description |
|---|---|
| DetectionResult | DetectionResult with detected language |

Raises:
| Type | Description |
|---|---|
| DetectionError | If all detectors fail |