Detection API

`pretok.detection.LanguageDetector`

Bases: Protocol

Protocol for language detection backends.

All language detectors must implement this protocol to be usable with pretok's detection system.

Example

    >>> class MyDetector:
    ...     @property
    ...     def name(self) -> str:
    ...         return "my_detector"
    ...
    ...     def detect(self, text: str) -> DetectionResult:
    ...         # Detection logic here
    ...         return DetectionResult(
    ...             language="en",
    ...             confidence=0.95,
    ...             detector=self.name
    ...         )
    ...
    ...     def detect_batch(self, texts: Sequence[str]) -> list[DetectionResult]:
    ...         return [self.detect(t) for t in texts]

Source code in src/pretok/detection/__init__.py

@runtime_checkable
class LanguageDetector(Protocol):
    """Protocol for language detection backends.

    All language detectors must implement this protocol to be usable
    with pretok's detection system.

    Example:
        >>> class MyDetector:
        ...     @property
        ...     def name(self) -> str:
        ...         return "my_detector"
        ...
        ...     def detect(self, text: str) -> DetectionResult:
        ...         # Detection logic here
        ...         return DetectionResult(
        ...             language="en",
        ...             confidence=0.95,
        ...             detector=self.name
        ...         )
        ...
        ...     def detect_batch(self, texts: Sequence[str]) -> list[DetectionResult]:
        ...         return [self.detect(t) for t in texts]
    """

    @property
    def name(self) -> str:
        """Return the detector's unique name identifier."""
        ...

    def detect(self, text: str) -> DetectionResult:
        """Detect the language of a single text.

        Args:
            text: The text to detect language for

        Returns:
            DetectionResult with language code and confidence
        """
        ...

    def detect_batch(self, texts: Sequence[str]) -> list[DetectionResult]:
        """Detect languages for multiple texts.

        Default implementation calls detect() for each text,
        but backends may override for batch optimization.

        Args:
            texts: Sequence of texts to detect

        Returns:
            List of DetectionResult for each input text
        """
        ...

`name` `property`

Return the detector's unique name identifier.

`detect(text)`

Detect the language of a single text.

Parameters:

Name	Type	Description	Default
`text`	`str`	The text to detect language for	required

Returns:

Type	Description
`DetectionResult`	DetectionResult with language code and confidence

Source code in src/pretok/detection/__init__.py

def detect(self, text: str) -> DetectionResult:
    """Detect the language of a single text.

    Args:
        text: The text to detect language for

    Returns:
        DetectionResult with language code and confidence
    """
    ...

`detect_batch(texts)`

Detect languages for multiple texts.

Default implementation calls detect() for each text, but backends may override for batch optimization.

Parameters:

Name	Type	Description	Default
`texts`	`Sequence[str]`	Sequence of texts to detect	required

Returns:

Type	Description
`list[DetectionResult]`	List of DetectionResult for each input text

Source code in src/pretok/detection/__init__.py

def detect_batch(self, texts: Sequence[str]) -> list[DetectionResult]:
    """Detect languages for multiple texts.

    Default implementation calls detect() for each text,
    but backends may override for batch optimization.

    Args:
        texts: Sequence of texts to detect

    Returns:
        List of DetectionResult for each input text
    """
    ...

`pretok.detection.DetectionResult` `dataclass`

Result from language detection.

Attributes:

Name	Type	Description
`language`	`str`	ISO 639-1 language code (e.g., 'en', 'zh', 'ja')
`confidence`	`float`	Confidence score between 0.0 and 1.0
`detector`	`str`	Name of the detector that produced this result
`raw_output`	`dict[str, Any] \| None`	Optional raw output from the detector for debugging

Source code in src/pretok/detection/__init__.py

@dataclass(frozen=True, slots=True)
class DetectionResult:
    """Result from language detection.

    Attributes:
        language: ISO 639-1 language code (e.g., 'en', 'zh', 'ja')
        confidence: Confidence score between 0.0 and 1.0
        detector: Name of the detector that produced this result
        raw_output: Optional raw output from the detector for debugging
    """

    language: str
    confidence: float
    detector: str
    raw_output: dict[str, Any] | None = None

    def __post_init__(self) -> None:
        """Validate confidence is in valid range."""
        if not 0.0 <= self.confidence <= 1.0:
            msg = f"confidence must be between 0.0 and 1.0, got {self.confidence}"
            raise ValueError(msg)

`__post_init__()`

Validate confidence is in valid range.

Source code in src/pretok/detection/__init__.py

def __post_init__(self) -> None:
    """Validate confidence is in valid range."""
    if not 0.0 <= self.confidence <= 1.0:
        msg = f"confidence must be between 0.0 and 1.0, got {self.confidence}"
        raise ValueError(msg)

`pretok.detection.fasttext_backend.FastTextDetector`

Bases: BaseDetector

Language detector using FastText library.

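FastText provides fast and accurate language identification. It requires a pretrained language identification model: if `config.model_path` is not set, `lid.176.bin` is downloaded from `DEFAULT_MODEL_URL` to `~/.cache/pretok/models/` the first time the model is needed.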

Example

    >>> detector = FastTextDetector()
    >>> result = detector.detect("Bonjour le monde!")
    >>> result.language
    'fr'

Source code in src/pretok/detection/fasttext_backend.py

class FastTextDetector(BaseDetector):
    """Language detector using FastText library.

    FastText provides fast and accurate language identification.
    Requires a pretrained language identification model.

    Example:
        >>> detector = FastTextDetector()
        >>> result = detector.detect("Bonjour le monde!")
        >>> result.language
        'fr'
    """

    # Default model URL from Facebook/Meta
    DEFAULT_MODEL_URL = "https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin"

    def __init__(self, config: FastTextConfig | None = None) -> None:
        """Initialize the FastText detector.

        Args:
            config: Optional configuration for FastText

        Raises:
            ImportError: If fasttext is not installed
        """
        try:
            import fasttext
        except ImportError as e:
            msg = (
                "fasttext is required for FastTextDetector. "
                "Install it with: pip install fasttext-wheel"
            )
            raise ImportError(msg) from e

        self._fasttext = fasttext
        self._config = config
        self._model = None

        # Disable FastText's own warning messages
        fasttext.FastText.eprint = lambda *_args, **_kwargs: None

    @property
    def name(self) -> str:
        """Return detector name."""
        return "fasttext"

    def _ensure_model(self) -> None:
        """Ensure the model is loaded, downloading if necessary."""
        if self._model is not None:
            return

        model_path = self._get_model_path()

        if not model_path.exists():
            self._download_model(model_path)

        self._model = self._fasttext.load_model(str(model_path))
        logger.info("Loaded FastText model from %s", model_path)

    def _get_model_path(self) -> Path:
        """Get the path to the FastText model."""
        if self._config and self._config.model_path:
            return Path(self._config.model_path).expanduser()

        # Default path in user's cache directory
        cache_dir = Path.home() / ".cache" / "pretok" / "models"
        cache_dir.mkdir(parents=True, exist_ok=True)
        return cache_dir / "lid.176.bin"

    def _download_model(self, path: Path) -> None:
        """Download the FastText language identification model.

        Args:
            path: Path to save the model to
        """
        import urllib.request

        logger.info("Downloading FastText model to %s...", path)
        path.parent.mkdir(parents=True, exist_ok=True)

        try:
            urllib.request.urlretrieve(self.DEFAULT_MODEL_URL, path)
            logger.info("FastText model downloaded successfully")
        except Exception as e:
            msg = f"Failed to download FastText model: {e}"
            raise DetectionError(msg, detector=self.name) from e

    def detect(self, text: str) -> DetectionResult:
        """Detect language using FastText.

        Args:
            text: Text to detect language for

        Returns:
            DetectionResult with detected language

        Raises:
            DetectionError: If detection fails
        """
        if not text or not text.strip():
            raise DetectionError(
                "Cannot detect language of empty text",
                text=text,
                detector=self.name,
            )

        self._ensure_model()

        # FastText expects single-line text
        text = text.replace("\n", " ").strip()

        try:
            k = self._config.k if self._config else 1
            threshold = self._config.threshold if self._config else 0.0

            # Model is guaranteed to be loaded after _ensure_model()
            assert self._model is not None
            labels, probs = self._model.predict(text, k=k, threshold=threshold)

            if not labels:
                raise DetectionError(
                    "No language detected",
                    text=text,
                    detector=self.name,
                )

            # FastText returns labels like "__label__en"
            language = self._parse_fasttext_label(labels[0])
            confidence = float(probs[0])

            return DetectionResult(
                language=language,
                confidence=confidence,
                detector=self.name,
                raw_output={
                    "all_labels": list(labels),
                    "all_probs": [float(p) for p in probs],
                },
            )

        except Exception as e:
            if isinstance(e, DetectionError):
                raise
            raise DetectionError(
                f"Language detection failed: {e}",
                text=text,
                detector=self.name,
            ) from e

    def detect_batch(self, texts: Sequence[str]) -> list[DetectionResult]:
        """Detect languages for multiple texts efficiently.

        FastText can process multiple texts more efficiently in batch.

        Args:
            texts: Sequence of texts to detect

        Returns:
            List of DetectionResult for each input text
        """
        if not texts:
            return []

        self._ensure_model()

        results = []
        for text in texts:
            try:
                result = self.detect(text)
                results.append(result)
            except DetectionError:
                # Return unknown for failed detections
                results.append(
                    DetectionResult(
                        language="unknown",
                        confidence=0.0,
                        detector=self.name,
                    )
                )

        return results

    def _parse_fasttext_label(self, label: str) -> str:
        """Parse FastText label format to language code.

        Args:
            label: FastText label like "__label__en"

        Returns:
            Normalized language code
        """
        # Remove __label__ prefix
        code = label.replace("__label__", "")
        return self._normalize_language_code(code)

`name` `property`

Return detector name.

`__init__(config=None)`

Initialize the FastText detector.

Parameters:

Name	Type	Description	Default
`config`	`FastTextConfig \| None`	Optional configuration for FastText	`None`

Raises:

Type	Description
`ImportError`	If fasttext is not installed

Source code in src/pretok/detection/fasttext_backend.py

def __init__(self, config: FastTextConfig | None = None) -> None:
    """Initialize the FastText detector.

    Args:
        config: Optional configuration for FastText

    Raises:
        ImportError: If fasttext is not installed
    """
    try:
        import fasttext
    except ImportError as e:
        msg = (
            "fasttext is required for FastTextDetector. "
            "Install it with: pip install fasttext-wheel"
        )
        raise ImportError(msg) from e

    self._fasttext = fasttext
    self._config = config
    self._model = None

    # Disable FastText's own warning messages
    fasttext.FastText.eprint = lambda *_args, **_kwargs: None

`detect(text)`

Detect language using FastText.

Parameters:

Name	Type	Description	Default
`text`	`str`	Text to detect language for	required

Returns:

Type	Description
`DetectionResult`	DetectionResult with detected language

Raises:

Type	Description
`DetectionError`	If detection fails

Source code in src/pretok/detection/fasttext_backend.py

def detect(self, text: str) -> DetectionResult:
    """Detect language using FastText.

    Args:
        text: Text to detect language for

    Returns:
        DetectionResult with detected language

    Raises:
        DetectionError: If detection fails
    """
    if not text or not text.strip():
        raise DetectionError(
            "Cannot detect language of empty text",
            text=text,
            detector=self.name,
        )

    self._ensure_model()

    # FastText expects single-line text
    text = text.replace("\n", " ").strip()

    try:
        k = self._config.k if self._config else 1
        threshold = self._config.threshold if self._config else 0.0

        # Model is guaranteed to be loaded after _ensure_model()
        assert self._model is not None
        labels, probs = self._model.predict(text, k=k, threshold=threshold)

        if not labels:
            raise DetectionError(
                "No language detected",
                text=text,
                detector=self.name,
            )

        # FastText returns labels like "__label__en"
        language = self._parse_fasttext_label(labels[0])
        confidence = float(probs[0])

        return DetectionResult(
            language=language,
            confidence=confidence,
            detector=self.name,
            raw_output={
                "all_labels": list(labels),
                "all_probs": [float(p) for p in probs],
            },
        )

    except Exception as e:
        if isinstance(e, DetectionError):
            raise
        raise DetectionError(
            f"Language detection failed: {e}",
            text=text,
            detector=self.name,
        ) from e
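The pre- and post-processing around the model call in `detect()` can be sketched without FastText installed. In the example below, `FakeModel` is a hypothetical stand-in that returns a fixed prediction and is not the real FastText API; it only imitates the single-line input requirement and the `__label__` prefix mentioned in the listing. The real method also passes the code through `_normalize_language_code`, which is not shown on this page and is omitted here.

```python
class FakeModel:
    """Hypothetical stand-in for a loaded FastText model (not the real API)."""

    def predict(self, text: str, k: int = 1, threshold: float = 0.0):
        # Imitates the single-line input requirement noted in the listing.
        if "\n" in text:
            raise ValueError("input must be a single line")
        # Fixed prediction, for illustration only.
        return ("__label__fr",), [0.98]


def detect(model: FakeModel, text: str) -> tuple[str, float]:
    # Step 1: flatten newlines so the model sees one line.
    flat = text.replace("\n", " ").strip()
    labels, probs = model.predict(flat, k=1, threshold=0.0)
    # Step 2: strip the "__label__" prefix from the top label.
    language = labels[0].replace("__label__", "")
    return language, float(probs[0])


language, confidence = detect(FakeModel(), "Bonjour\nle monde!")
```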

`detect_batch(texts)`

Detect languages for multiple texts efficiently.

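The current implementation loads the model once and then calls `detect()` on each text in turn. A text that raises `DetectionError` (for example, an empty or whitespace-only string) does not abort the batch; it yields a placeholder result with language `"unknown"` and confidence `0.0`.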

Parameters:

Name	Type	Description	Default
`texts`	`Sequence[str]`	Sequence of texts to detect	required

Returns:

Type	Description
`list[DetectionResult]`	List of DetectionResult for each input text

Source code in src/pretok/detection/fasttext_backend.py

def detect_batch(self, texts: Sequence[str]) -> list[DetectionResult]:
    """Detect languages for multiple texts efficiently.

    FastText can process multiple texts more efficiently in batch.

    Args:
        texts: Sequence of texts to detect

    Returns:
        List of DetectionResult for each input text
    """
    if not texts:
        return []

    self._ensure_model()

    results = []
    for text in texts:
        try:
            result = self.detect(text)
            results.append(result)
        except DetectionError:
            # Return unknown for failed detections
            results.append(
                DetectionResult(
                    language="unknown",
                    confidence=0.0,
                    detector=self.name,
                )
            )

    return results
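Since `detect_batch()` swallows `DetectionError` and substitutes an `"unknown"` placeholder, callers may want to filter those entries out. The sketch below reproduces only that error-handling pattern with stand-ins: `DetectionError` is redefined locally, `detect` is a hypothetical function that returns a fixed answer, and results are plain `(language, confidence)` tuples rather than `DetectionResult` objects.

```python
class DetectionError(Exception):
    """Local stand-in for pretok's DetectionError."""


def detect(text: str) -> tuple[str, float]:
    """Hypothetical single-text detector that rejects blank input."""
    if not text or not text.strip():
        raise DetectionError("Cannot detect language of empty text")
    return ("fr", 0.98)  # fixed answer, for illustration only


def detect_batch(texts: list[str]) -> list[tuple[str, float]]:
    results = []
    for text in texts:
        try:
            results.append(detect(text))
        except DetectionError:
            # Same fallback as the listing above: placeholder instead of an error.
            results.append(("unknown", 0.0))
    return results


batch = detect_batch(["Bonjour le monde!", "   ", ""])

# The output stays aligned with the input; failed items are marked "unknown".
usable = [item for item in batch if item[0] != "unknown"]
```

Keeping one result per input preserves positional alignment with `texts`, at the cost that `"unknown"` is not an ISO 639-1 code and has to be handled explicitly downstream.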

`pretok.detection.langdetect_backend.LangDetectDetector`

Bases: BaseDetector

Language detector using langdetect library.

langdetect is a port of Google's language-detection library to Python. It's lightweight and doesn't require model files.

Example

    >>> detector = LangDetectDetector()
    >>> result = detector.detect("Hello, world!")
    >>> result.language
    'en'

Source code in src/pretok/detection/langdetect_backend.py

class LangDetectDetector(BaseDetector):
    """Language detector using langdetect library.

    langdetect is a port of Google's language-detection library to Python.
    It's lightweight and doesn't require model files.

    Example:
        >>> detector = LangDetectDetector()
        >>> result = detector.detect("Hello, world!")
        >>> result.language
        'en'
    """

    def __init__(self, config: LangDetectConfig | None = None) -> None:
        """Initialize the langdetect detector.

        Args:
            config: Optional configuration for langdetect

        Raises:
            ImportError: If langdetect is not installed
        """
        try:
            import langdetect
            from langdetect import DetectorFactory
        except ImportError as e:
            msg = (
                "langdetect is required for LangDetectDetector. "
                "Install it with: pip install langdetect"
            )
            raise ImportError(msg) from e

        self._langdetect = langdetect
        self._config = config

        # Set seed for reproducibility if configured
        if config and config.seed is not None:
            DetectorFactory.seed = config.seed

    @property
    def name(self) -> str:
        """Return detector name."""
        return "langdetect"

    def detect(self, text: str) -> DetectionResult:
        """Detect language using langdetect.

        Args:
            text: Text to detect language for

        Returns:
            DetectionResult with detected language

        Raises:
            DetectionError: If detection fails
        """
        if not text or not text.strip():
            raise DetectionError(
                "Cannot detect language of empty text",
                text=text,
                detector=self.name,
            )

        try:
            # Get probabilities for all detected languages
            probs = self._langdetect.detect_langs(text)

            if not probs:
                raise DetectionError(
                    "No language detected",
                    text=text,
                    detector=self.name,
                )

            # Get the top result
            top = probs[0]
            language = self._normalize_language_code(top.lang)

            return DetectionResult(
                language=language,
                confidence=top.prob,
                detector=self.name,
                raw_output={"all_probs": [(p.lang, p.prob) for p in probs]},
            )

        except self._langdetect.LangDetectException as e:
            raise DetectionError(
                f"Language detection failed: {e}",
                text=text,
                detector=self.name,
            ) from e

    def detect_with_alternatives(self, text: str, *, top_k: int = 3) -> list[DetectionResult]:
        """Detect language with alternative possibilities.

        Args:
            text: Text to detect language for
            top_k: Number of top results to return

        Returns:
            List of DetectionResult ordered by confidence
        """
        if not text or not text.strip():
            return []

        try:
            probs = self._langdetect.detect_langs(text)
            results = []

            for prob in probs[:top_k]:
                language = self._normalize_language_code(prob.lang)
                results.append(
                    DetectionResult(
                        language=language,
                        confidence=prob.prob,
                        detector=self.name,
                    )
                )

            return results

        except self._langdetect.LangDetectException:
            return []

`name` `property`

Return detector name.

`__init__(config=None)`

Initialize the langdetect detector.

Parameters:

Name	Type	Description	Default
`config`	`LangDetectConfig \| None`	Optional configuration for langdetect	`None`

Raises:

Type	Description
`ImportError`	If langdetect is not installed

Source code in src/pretok/detection/langdetect_backend.py

def __init__(self, config: LangDetectConfig | None = None) -> None:
    """Initialize the langdetect detector.

    Args:
        config: Optional configuration for langdetect

    Raises:
        ImportError: If langdetect is not installed
    """
    try:
        import langdetect
        from langdetect import DetectorFactory
    except ImportError as e:
        msg = (
            "langdetect is required for LangDetectDetector. "
            "Install it with: pip install langdetect"
        )
        raise ImportError(msg) from e

    self._langdetect = langdetect
    self._config = config

    # Set seed for reproducibility if configured
    if config and config.seed is not None:
        DetectorFactory.seed = config.seed

`detect(text)`

Detect language using langdetect.

Parameters:

Name	Type	Description	Default
`text`	`str`	Text to detect language for	required

Returns:

Type	Description
`DetectionResult`	DetectionResult with detected language

Raises:

Type	Description
`DetectionError`	If detection fails

Source code in src/pretok/detection/langdetect_backend.py

def detect(self, text: str) -> DetectionResult:
    """Detect language using langdetect.

    Args:
        text: Text to detect language for

    Returns:
        DetectionResult with detected language

    Raises:
        DetectionError: If detection fails
    """
    if not text or not text.strip():
        raise DetectionError(
            "Cannot detect language of empty text",
            text=text,
            detector=self.name,
        )

    try:
        # Get probabilities for all detected languages
        probs = self._langdetect.detect_langs(text)

        if not probs:
            raise DetectionError(
                "No language detected",
                text=text,
                detector=self.name,
            )

        # Get the top result
        top = probs[0]
        language = self._normalize_language_code(top.lang)

        return DetectionResult(
            language=language,
            confidence=top.prob,
            detector=self.name,
            raw_output={"all_probs": [(p.lang, p.prob) for p in probs]},
        )

    except self._langdetect.LangDetectException as e:
        raise DetectionError(
            f"Language detection failed: {e}",
            text=text,
            detector=self.name,
        ) from e

`detect_with_alternatives(text, *, top_k=3)`

Detect language with alternative possibilities.

Parameters:

Name	Type	Description	Default
`text`	`str`	Text to detect language for	required
`top_k`	`int`	Number of top results to return	`3`

Returns:

Type	Description
`list[DetectionResult]`	List of DetectionResult ordered by confidence

Source code in src/pretok/detection/langdetect_backend.py

def detect_with_alternatives(self, text: str, *, top_k: int = 3) -> list[DetectionResult]:
    """Detect language with alternative possibilities.

    Args:
        text: Text to detect language for
        top_k: Number of top results to return

    Returns:
        List of DetectionResult ordered by confidence
    """
    if not text or not text.strip():
        return []

    try:
        probs = self._langdetect.detect_langs(text)
        results = []

        for prob in probs[:top_k]:
            language = self._normalize_language_code(prob.lang)
            results.append(
                DetectionResult(
                    language=language,
                    confidence=prob.prob,
                    detector=self.name,
                )
            )

        return results

    except self._langdetect.LangDetectException:
        return []
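Unlike `detect()`, which raises `DetectionError`, `detect_with_alternatives()` returns an empty list for blank input or a failed detection, and otherwise truncates the candidate list to `top_k`. The following sketch illustrates that contract without langdetect installed. `Language` and `fake_detect_langs` are hypothetical stand-ins for what the listing obtains from `detect_langs()`, the probabilities are made up, and results are plain tuples rather than `DetectionResult` objects.

```python
from typing import NamedTuple


class Language(NamedTuple):
    """Stand-in for a candidate with .lang and .prob attributes."""

    lang: str
    prob: float


def fake_detect_langs(text: str) -> list[Language]:
    # Made-up candidates, already ordered by descending probability.
    return [Language("en", 0.57), Language("nl", 0.29), Language("de", 0.14)]


def detect_with_alternatives(text: str, *, top_k: int = 3) -> list[tuple[str, float]]:
    # Blank input yields an empty list instead of an exception.
    if not text or not text.strip():
        return []
    # Keep only the first top_k candidates.
    return [(p.lang, p.prob) for p in fake_detect_langs(text)[:top_k]]


top_two = detect_with_alternatives("Hello, world!", top_k=2)
nothing = detect_with_alternatives("   ")
```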

`pretok.detection.composite.CompositeDetector`

Bases: BaseDetector

Detector that combines multiple backends for improved accuracy.

Supports multiple aggregation strategies:

- voting: Use majority vote among detectors
- weighted_average: Use weighted average of confidences
- fallback_chain: Use first successful detection

Example

    >>> from pretok.detection.langdetect_backend import LangDetectDetector
    >>> detector = CompositeDetector([LangDetectDetector()])
    >>> result = detector.detect("Hello, world!")
    >>> result.language
    'en'

Source code in src/pretok/detection/composite.py

class CompositeDetector(BaseDetector):
    """Detector that combines multiple backends for improved accuracy.

    Supports multiple aggregation strategies:
    - voting: Use majority vote among detectors
    - weighted_average: Use weighted average of confidences
    - fallback_chain: Use first successful detection

    Example:
        >>> from pretok.detection.langdetect_backend import LangDetectDetector
        >>> detector = CompositeDetector([LangDetectDetector()])
        >>> result = detector.detect("Hello, world!")
        >>> result.language
        'en'
    """

    def __init__(
        self,
        detectors: Sequence[BaseDetector],
        config: CompositeDetectorConfig | None = None,
    ) -> None:
        """Initialize composite detector.

        Args:
            detectors: List of detector backends to combine
            config: Optional configuration

        Raises:
            ValueError: If no detectors provided
        """
        if not detectors:
            msg = "At least one detector must be provided"
            raise ValueError(msg)

        self._detectors = list(detectors)
        self._config = config
        self._strategy = config.strategy if config else "voting"
        self._weights = config.weights if config else {}

    @property
    def name(self) -> str:
        """Return detector name."""
        return "composite"

    @property
    def detectors(self) -> list[BaseDetector]:
        """Return list of backend detectors."""
        return self._detectors

    def detect(self, text: str) -> DetectionResult:
        """Detect language using combined backends.

        Args:
            text: Text to detect language for

        Returns:
            DetectionResult with detected language

        Raises:
            DetectionError: If all detectors fail
        """
        if self._strategy == "fallback_chain":
            return self._detect_fallback_chain(text)
        elif self._strategy == "weighted_average":
            return self._detect_weighted_average(text)
        else:  # voting (default)
            return self._detect_voting(text)

    def _detect_voting(self, text: str) -> DetectionResult:
        """Use majority voting strategy.

        Args:
            text: Text to detect

        Returns:
            DetectionResult based on majority vote
        """
        results: list[DetectionResult] = []
        errors: list[str] = []

        for detector in self._detectors:
            try:
                result = detector.detect(text)
                results.append(result)
            except DetectionError as e:
                errors.append(f"{detector.name}: {e}")

        if not results:
            raise DetectionError(
                f"All detectors failed: {'; '.join(errors)}",
                text=text,
                detector=self.name,
            )

        # Count votes for each language
        votes = Counter(r.language for r in results)
        winner, vote_count = votes.most_common(1)[0]

        # Calculate confidence as agreement ratio
        agreement = vote_count / len(results)

        # Get average confidence from detectors that voted for winner
        winning_confidences = [r.confidence for r in results if r.language == winner]
        avg_confidence = sum(winning_confidences) / len(winning_confidences)

        # Final confidence combines agreement and detector confidences
        final_confidence = agreement * avg_confidence

        return DetectionResult(
            language=winner,
            confidence=final_confidence,
            detector=self.name,
            raw_output={
                "votes": dict(votes),
                "agreement": agreement,
                "results": [
                    {"detector": r.detector, "language": r.language, "confidence": r.confidence}
                    for r in results
                ],
            },
        )

    def _detect_weighted_average(self, text: str) -> DetectionResult:
        """Use weighted average strategy.

        Args:
            text: Text to detect

        Returns:
            DetectionResult based on weighted confidence
        """
        results: list[DetectionResult] = []
        errors: list[str] = []

        for detector in self._detectors:
            try:
                result = detector.detect(text)
                results.append(result)
            except DetectionError as e:
                errors.append(f"{detector.name}: {e}")

        if not results:
            raise DetectionError(
                f"All detectors failed: {'; '.join(errors)}",
                text=text,
                detector=self.name,
            )

        # Calculate weighted scores for each language
        language_scores: dict[str, float] = {}
        total_weight = 0.0

        for result in results:
            weight = self._weights.get(result.detector, 1.0)
            score = weight * result.confidence
            total_weight += weight

            if result.language in language_scores:
                language_scores[result.language] += score
            else:
                language_scores[result.language] = score

        # Normalize scores
        if total_weight > 0:
            language_scores = {
                lang: score / total_weight for lang, score in language_scores.items()
            }

        # Get winner
        winner = max(language_scores, key=language_scores.get)  # type: ignore[arg-type]
        confidence = language_scores[winner]

        return DetectionResult(
            language=winner,
            confidence=min(confidence, 1.0),  # Cap at 1.0
            detector=self.name,
            raw_output={
                "scores": language_scores,
                "weights": self._weights,
                "results": [
                    {"detector": r.detector, "language": r.language, "confidence": r.confidence}
                    for r in results
                ],
            },
        )

    def _detect_fallback_chain(self, text: str) -> DetectionResult:
        """Use fallback chain strategy.

        Try each detector in order until one succeeds.

        Args:
            text: Text to detect

        Returns:
            DetectionResult from first successful detector
        """
        errors: list[str] = []

        for detector in self._detectors:
            try:
                return detector.detect(text)
            except DetectionError as e:
                errors.append(f"{detector.name}: {e}")

        raise DetectionError(
            f"All detectors in chain failed: {'; '.join(errors)}",
            text=text,
            detector=self.name,
        )
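The confidence arithmetic in `_detect_voting` may be easier to follow with numbers. The sketch below re-implements just that calculation on plain `(language, confidence)` tuples; the function name `vote` and the input values are made up for illustration, and it is not the library's code path.

```python
from collections import Counter


def vote(results: list[tuple[str, float]]) -> tuple[str, float]:
    # Count one vote per detector result.
    votes = Counter(language for language, _ in results)
    winner, vote_count = votes.most_common(1)[0]

    # Share of detectors that agreed with the winner.
    agreement = vote_count / len(results)

    # Mean confidence among the detectors that voted for the winner.
    winning = [conf for language, conf in results if language == winner]
    avg_confidence = sum(winning) / len(winning)

    return winner, agreement * avg_confidence


# Two of three detectors say "en": agreement = 2/3, average = 0.85.
winner, confidence = vote([("en", 0.9), ("en", 0.8), ("de", 0.6)])

# A 1-1 tie: Counter.most_common keeps first-seen order, so "de" wins.
tie_winner, tie_confidence = vote([("de", 0.6), ("en", 0.9)])
```

In the first call the result is `"en"` with confidence 2/3 × 0.85 ≈ 0.567, lower than either individual `"en"` score because of the dissenting vote. In the tie, the earlier detector wins even though its confidence is lower, so the order of `detectors` matters under this strategy.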

`detectors` `property`

Return list of backend detectors.

`name` `property`

Return detector name.

`__init__(detectors, config=None)`

Initialize composite detector.

Parameters:

Name	Type	Description	Default
`detectors`	`Sequence[BaseDetector]`	List of detector backends to combine	required
`config`	`CompositeDetectorConfig \| None`	Optional configuration	`None`

Raises:

Type	Description
`ValueError`	If no detectors provided

Source code in src/pretok/detection/composite.py

def __init__(
    self,
    detectors: Sequence[BaseDetector],
    config: CompositeDetectorConfig | None = None,
) -> None:
    """Initialize composite detector.

    Args:
        detectors: List of detector backends to combine
        config: Optional configuration

    Raises:
        ValueError: If no detectors provided
    """
    if not detectors:
        msg = "At least one detector must be provided"
        raise ValueError(msg)

    self._detectors = list(detectors)
    self._config = config
    self._strategy = config.strategy if config else "voting"
    self._weights = config.weights if config else {}

`detect(text)`

Detect language using combined backends.

Parameters:

Name	Type	Description	Default
`text`	`str`	Text to detect language for	required

Returns:

Type	Description
`DetectionResult`	DetectionResult with detected language

Raises:

Type	Description
`DetectionError`	If all detectors fail

Source code in src/pretok/detection/composite.py

def detect(self, text: str) -> DetectionResult:
    """Detect language using combined backends.

    Args:
        text: Text to detect language for

    Returns:
        DetectionResult with detected language

    Raises:
        DetectionError: If all detectors fail
    """
    if self._strategy == "fallback_chain":
        return self._detect_fallback_chain(text)
    elif self._strategy == "weighted_average":
        return self._detect_weighted_average(text)
    else:  # voting (default)
        return self._detect_voting(text)
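When `strategy` is `"weighted_average"`, `detect()` dispatches to `_detect_weighted_average`, whose scoring can be illustrated with a small worked example. The sketch below re-implements only that arithmetic on plain `(detector, language, confidence)` tuples. The function name and the input values are made up, and the weight of 2.0 for `"fasttext"` is an assumed configuration, not a pretok default.

```python
def weighted_average(
    results: list[tuple[str, str, float]],
    weights: dict[str, float],
) -> tuple[str, float, dict[str, float]]:
    scores: dict[str, float] = {}
    total_weight = 0.0

    for detector, language, confidence in results:
        # Detectors without a configured weight default to 1.0.
        weight = weights.get(detector, 1.0)
        total_weight += weight
        scores[language] = scores.get(language, 0.0) + weight * confidence

    # Normalize by the total weight of all detectors that returned a result.
    if total_weight > 0:
        scores = {lang: score / total_weight for lang, score in scores.items()}

    winner = max(scores, key=lambda lang: scores[lang])
    return winner, min(scores[winner], 1.0), scores


winner, confidence, scores = weighted_average(
    [("fasttext", "en", 0.9), ("langdetect", "de", 0.8)],
    weights={"fasttext": 2.0},
)
```

Here `"en"` scores 2.0 × 0.9 / 3.0 = 0.6 and `"de"` scores 1.0 × 0.8 / 3.0 ≈ 0.267, so `"en"` wins with confidence 0.6. Because the divisor includes the weight of dissenting detectors, disagreement pulls the winner's confidence below its raw score.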
