Skip to content

Translation API

pretok.translation.Translator

Bases: Protocol

Protocol for translation backends.

All translators must implement this protocol to be usable with pretok's translation system.

Source code in src/pretok/translation/base.py
@runtime_checkable
class Translator(Protocol):
    """Protocol for translation backends.

    All translators must implement this protocol to be usable
    with pretok's translation system.

    ``@runtime_checkable`` permits ``isinstance(obj, Translator)``
    checks; note these verify only the presence of the members,
    not their signatures.
    """

    @property
    def name(self) -> str:
        """Return the translator's unique name identifier."""
        ...

    @property
    def supported_languages(self) -> list[str]:
        """Return list of supported language codes."""
        ...

    def translate(
        self,
        text: str,
        target_language: str,
        source_language: str | None = None,
    ) -> TranslationResult:
        """Translate text to target language.

        Args:
            text: Text to translate
            target_language: Target language code (ISO 639-1)
            source_language: Source language code (optional, will auto-detect)

        Returns:
            TranslationResult with translated text
        """
        ...

    def translate_batch(
        self,
        texts: Sequence[str],
        target_language: str,
        source_language: str | None = None,
    ) -> list[TranslationResult]:
        """Translate multiple texts to target language.

        Args:
            texts: List of texts to translate
            target_language: Target language code
            source_language: Source language code (optional)

        Returns:
            List of TranslationResult for each input
        """
        ...

name property

Return the translator's unique name identifier.

supported_languages property

Return list of supported language codes.

translate(text, target_language, source_language=None)

Translate text to target language.

Parameters:

Name Type Description Default
text str

Text to translate

required
target_language str

Target language code (ISO 639-1)

required
source_language str | None

Source language code (optional, will auto-detect)

None

Returns:

Type Description
TranslationResult

TranslationResult with translated text

Source code in src/pretok/translation/base.py
def translate(
    self,
    text: str,
    target_language: str,
    source_language: str | None = None,
) -> TranslationResult:
    """Translate text to target language.

    Protocol stub: concrete translator backends supply the behavior.

    Args:
        text: Text to translate
        target_language: Target language code (ISO 639-1)
        source_language: Source language code (optional, will auto-detect)

    Returns:
        TranslationResult with translated text
    """
    ...

translate_batch(texts, target_language, source_language=None)

Translate multiple texts to target language.

Parameters:

Name Type Description Default
texts Sequence[str]

List of texts to translate

required
target_language str

Target language code

required
source_language str | None

Source language code (optional)

None

Returns:

Type Description
list[TranslationResult]

List of TranslationResult for each input

Source code in src/pretok/translation/base.py
def translate_batch(
    self,
    texts: Sequence[str],
    target_language: str,
    source_language: str | None = None,
) -> list[TranslationResult]:
    """Translate multiple texts to target language.

    Protocol stub: concrete translator backends supply the behavior.

    Args:
        texts: List of texts to translate
        target_language: Target language code
        source_language: Source language code (optional)

    Returns:
        List of TranslationResult for each input
    """
    ...

pretok.translation.TranslationResult dataclass

Result from translation operation.

Attributes:

Name Type Description
source_text str

Original text

translated_text str

Translated text

source_language str

Detected or specified source language

target_language str

Target language

translator str

Name of the translator used

confidence float | None

Translation confidence (if available)

metadata dict[str, Any]

Additional translation metadata

Source code in src/pretok/translation/base.py
@dataclass(frozen=True, slots=True)
class TranslationResult:
    """Result from translation operation.

    Attributes:
        source_text: Original text
        translated_text: Translated text
        source_language: Detected or specified source language
        target_language: Target language
        translator: Name of the translator used
        confidence: Translation confidence (if available)
        metadata: Additional translation metadata
    """

    source_text: str
    translated_text: str
    source_language: str
    target_language: str
    translator: str
    confidence: float | None = None
    metadata: dict[str, Any] = field(default_factory=dict)

    @property
    def was_translated(self) -> bool:
        """Check if text was actually translated."""
        return self.source_text != self.translated_text

was_translated property

Check if text was actually translated.

pretok.translation.TranslationError

Bases: Exception

Raised when translation fails.

Source code in src/pretok/translation/base.py
class TranslationError(Exception):
    """Raised when translation fails."""

    def __init__(
        self,
        message: str,
        *,
        source_text: str | None = None,
        source_language: str | None = None,
        target_language: str | None = None,
        translator: str | None = None,
        cause: Exception | None = None,
    ) -> None:
        super().__init__(message)
        self.source_text = source_text
        self.source_language = source_language
        self.target_language = target_language
        self.translator = translator
        self.cause = cause

pretok.translation.llm.LLMTranslator

Bases: BaseTranslator

Translator using OpenAI-compatible APIs.

This translator works with any API that implements the OpenAI chat completions interface, including:

- OpenAI API
- OpenRouter
- Ollama
- vLLM
- LM Studio
- Azure OpenAI
- Together AI
- Groq
- etc.

Example

>>> from pretok.config import LLMTranslatorConfig
>>> config = LLMTranslatorConfig(
...     base_url="https://api.openai.com/v1",
...     model="gpt-4o-mini",
...     api_key="sk-xxx"
... )
>>> translator = LLMTranslator(config)
>>> result = translator.translate("Hello", "zh")
>>> result.translated_text
'你好'

With Ollama

>>> config = LLMTranslatorConfig(
...     base_url="http://localhost:11434/v1",
...     model="llama3.2",
... )
>>> translator = LLMTranslator(config)

Source code in src/pretok/translation/llm.py
class LLMTranslator(BaseTranslator):
    """Translator using OpenAI-compatible APIs.

    This translator works with any API that implements the OpenAI
    chat completions interface, including:
    - OpenAI API
    - OpenRouter
    - Ollama
    - vLLM
    - LM Studio
    - Azure OpenAI
    - Together AI
    - Groq
    - etc.

    Example:
        >>> from pretok.config import LLMTranslatorConfig
        >>> config = LLMTranslatorConfig(
        ...     base_url="https://api.openai.com/v1",
        ...     model="gpt-4o-mini",
        ...     api_key="sk-xxx"
        ... )
        >>> translator = LLMTranslator(config)
        >>> result = translator.translate("Hello", "zh")
        >>> result.translated_text
        '你好'

    With Ollama:
        >>> config = LLMTranslatorConfig(
        ...     base_url="http://localhost:11434/v1",
        ...     model="llama3.2",
        ... )
        >>> translator = LLMTranslator(config)
    """

    #: ISO 639-1 code -> English language name, used when building prompts.
    #: Class-level constant so the table is built once, not on every call.
    _LANGUAGE_NAMES: dict[str, str] = {
        "en": "English",
        "zh": "Chinese",
        "ja": "Japanese",
        "ko": "Korean",
        "es": "Spanish",
        "fr": "French",
        "de": "German",
        "it": "Italian",
        "pt": "Portuguese",
        "ru": "Russian",
        "ar": "Arabic",
        "hi": "Hindi",
        "th": "Thai",
        "vi": "Vietnamese",
        "nl": "Dutch",
        "pl": "Polish",
        "tr": "Turkish",
        "he": "Hebrew",
        "id": "Indonesian",
        "cs": "Czech",
        "sv": "Swedish",
        "da": "Danish",
        "fi": "Finnish",
        "no": "Norwegian",
        "el": "Greek",
        "hu": "Hungarian",
        "ro": "Romanian",
        "uk": "Ukrainian",
        "bg": "Bulgarian",
        "hr": "Croatian",
        "sk": "Slovak",
        "sl": "Slovenian",
        "lt": "Lithuanian",
        "lv": "Latvian",
        "et": "Estonian",
        "ms": "Malay",
        "bn": "Bengali",
        "ta": "Tamil",
        "te": "Telugu",
        "mr": "Marathi",
        "gu": "Gujarati",
        "kn": "Kannada",
        "ml": "Malayalam",
        "pa": "Punjabi",
        "fa": "Persian",
        "ur": "Urdu",
        "sw": "Swahili",
        "af": "Afrikaans",
        "ca": "Catalan",
        "eu": "Basque",
        "gl": "Galician",
    }

    def __init__(self, config: LLMTranslatorConfig) -> None:
        """Initialize LLM translator.

        Args:
            config: LLM translator configuration

        Raises:
            ImportError: If openai package is not installed
            TranslationError: If configuration is invalid
        """
        try:
            from openai import OpenAI
        except ImportError as e:
            msg = (
                "openai package is required for LLMTranslator. Install it with: pip install openai"
            )
            raise ImportError(msg) from e

        self._config = config
        self._api_key = config.get_api_key()

        # Initialize OpenAI client
        client_kwargs: dict[str, Any] = {}

        if config.base_url:
            client_kwargs["base_url"] = config.base_url

        if self._api_key:
            client_kwargs["api_key"] = self._api_key
        else:
            # Some local APIs don't require an API key, but the client
            # still wants a non-empty value.
            client_kwargs["api_key"] = "not-required"

        self._client = OpenAI(**client_kwargs)

        # Prompts: fall back to the built-in defaults when unconfigured.
        self._system_prompt = config.system_prompt or DEFAULT_SYSTEM_PROMPT
        self._user_prompt_template = config.user_prompt_template

        logger.info(
            "Initialized LLMTranslator with model=%s, base_url=%s",
            config.model,
            config.base_url or "default",
        )

    @property
    def name(self) -> str:
        """Return translator name."""
        return f"llm:{self._config.model}"

    @property
    def supported_languages(self) -> list[str]:
        """Return supported languages.

        LLM translators generally support many languages,
        so we return an empty list to indicate "all supported".
        """
        return []

    def translate(
        self,
        text: str,
        target_language: str,
        source_language: str | None = None,
    ) -> TranslationResult:
        """Translate text using LLM.

        Args:
            text: Text to translate
            target_language: Target language code
            source_language: Source language code (optional)

        Returns:
            TranslationResult with translated text

        Raises:
            TranslationError: If all retry attempts fail
        """
        # Empty/whitespace-only input: return it unchanged without an API call.
        if not text or not text.strip():
            return TranslationResult(
                source_text=text,
                translated_text=text,
                source_language=source_language or "unknown",
                target_language=target_language,
                translator=self.name,
            )

        # Build user prompt; a user-supplied template takes precedence.
        if self._user_prompt_template:
            user_prompt = self._user_prompt_template.format(
                text=text,
                source_language=source_language or "auto-detect",
                target_language=target_language,
            )
        elif source_language:
            user_prompt = DEFAULT_USER_PROMPT.format(
                text=text,
                source_language=self._get_language_name(source_language),
                target_language=self._get_language_name(target_language),
            )
        else:
            user_prompt = DEFAULT_USER_PROMPT_AUTO.format(
                text=text,
                target_language=self._get_language_name(target_language),
            )

        # Calculate max_tokens: use explicit value if set, otherwise use multiplier
        if self._config.max_tokens is not None:
            max_tokens = self._config.max_tokens
        else:
            # The multiplier may be a float; the chat completions API
            # requires an integer token budget, so truncate and keep at
            # least one token for non-empty input.
            max_tokens = max(1, int(len(text) * self._config.max_tokens_multiplier))

        logger.debug(
            "Translation request: text_len=%d, max_tokens=%d",
            len(text),
            max_tokens,
        )

        max_retries = self._config.max_retries
        # retry_delay may not exist on older configs; default to 1 second.
        retry_delay = getattr(self._config, "retry_delay", 1.0)
        last_error: Exception | None = None

        for attempt in range(max_retries + 1):
            try:
                response = self._client.chat.completions.create(
                    model=self._config.model,
                    messages=[
                        {"role": "system", "content": self._system_prompt},
                        {"role": "user", "content": user_prompt},
                    ],
                    temperature=self._config.temperature,
                    max_tokens=max_tokens,
                )

                translated_text = response.choices[0].message.content or ""

                # Clean up the response - remove common LLM artifacts
                translated_text = self._clean_translation(translated_text, text)

                return TranslationResult(
                    source_text=text,
                    translated_text=translated_text,
                    source_language=source_language or "auto",
                    target_language=target_language,
                    translator=self.name,
                    metadata={
                        "model": response.model,
                        "usage": {
                            "prompt_tokens": response.usage.prompt_tokens if response.usage else 0,
                            "completion_tokens": response.usage.completion_tokens
                            if response.usage
                            else 0,
                        },
                        "attempts": attempt + 1,
                    },
                )

            except Exception as e:
                last_error = e
                if attempt < max_retries:
                    logger.warning(
                        "Translation attempt %d/%d failed: %s. Retrying in %.1fs...",
                        attempt + 1,
                        max_retries + 1,
                        e,
                        retry_delay,
                    )
                    time.sleep(retry_delay)
                    continue

        raise TranslationError(
            f"LLM translation failed after {max_retries + 1} attempts: {last_error}",
            source_text=text,
            source_language=source_language,
            target_language=target_language,
            translator=self.name,
            cause=last_error,
        )

    def translate_batch(
        self,
        texts: Sequence[str],
        target_language: str,
        source_language: str | None = None,
    ) -> list[TranslationResult]:
        """Translate multiple texts.

        For LLM translators, we translate texts one by one
        to maintain quality and handle errors individually.
        """
        results = []
        for text in texts:
            try:
                result = self.translate(text, target_language, source_language)
                results.append(result)
            except TranslationError:
                # Return original on failure so output stays aligned
                # with the input order.
                results.append(
                    TranslationResult(
                        source_text=text,
                        translated_text=text,
                        source_language=source_language or "unknown",
                        target_language=target_language,
                        translator=self.name,
                    )
                )
        return results

    def _get_language_name(self, code: str) -> str:
        """Convert language code to full name for prompts.

        Args:
            code: ISO 639-1 language code

        Returns:
            Full language name, or the code itself when unknown
        """
        return self._LANGUAGE_NAMES.get(code.lower(), code)

    def _clean_translation(self, translated: str, original: str) -> str:
        """Clean up LLM translation output.

        Removes common artifacts like notes, explanations, and preambles
        that LLMs sometimes add despite instructions.

        Args:
            translated: Raw translation from LLM
            original: Original text (to preserve whitespace patterns)

        Returns:
            Cleaned translation text
        """
        import re

        result = translated

        # Remove common preamble patterns
        preamble_patterns = [
            r"^(?:Here(?:'s| is) (?:the )?translation:?\s*)",
            r"^(?:Translation:?\s*)",
            r"^(?:Translated(?: text)?:?\s*)",
            r"^(?:In English:?\s*)",
        ]
        for pattern in preamble_patterns:
            result = re.sub(pattern, "", result, flags=re.IGNORECASE)

        # Remove trailing notes/explanations in parentheses
        # Match patterns like "(Note: ...)" or "(I translated ...)" at the end
        result = re.sub(
            r"\s*\((?:Note|I translated|Translation note|Translator'?s? note)[^)]*\)\s*$",
            "",
            result,
            flags=re.IGNORECASE,
        )

        # Remove trailing notes after newlines
        # Handles cases like "\n\nNote: I translated..."
        result = re.sub(
            r"\n+(?:Note|N\.B\.|PS|P\.S\.):?\s+.*$",
            "",
            result,
            flags=re.IGNORECASE | re.DOTALL,
        )

        # Preserve original whitespace structure
        # If original started with newline, ensure result does too
        original_starts_with_newline = original.startswith("\n")
        original_ends_with_newline = original.endswith("\n")

        result = result.strip()

        if original_starts_with_newline and not result.startswith("\n"):
            result = "\n" + result
        if original_ends_with_newline and not result.endswith("\n"):
            result = result + "\n"

        return result

name property

Return translator name.

supported_languages property

Return supported languages.

LLM translators generally support many languages, so we return an empty list to indicate "all supported".

__init__(config)

Initialize LLM translator.

Parameters:

Name Type Description Default
config LLMTranslatorConfig

LLM translator configuration

required

Raises:

Type Description
ImportError

If openai package is not installed

TranslationError

If configuration is invalid

Source code in src/pretok/translation/llm.py
def __init__(self, config: LLMTranslatorConfig) -> None:
    """Initialize LLM translator.

    Args:
        config: LLM translator configuration

    Raises:
        ImportError: If openai package is not installed
        TranslationError: If configuration is invalid
    """
    try:
        from openai import OpenAI
    except ImportError as e:
        msg = (
            "openai package is required for LLMTranslator. Install it with: pip install openai"
        )
        raise ImportError(msg) from e

    self._config = config
    self._api_key = config.get_api_key()

    # Assemble the OpenAI client arguments. Some local APIs don't
    # require an API key, so a placeholder value is used when none
    # is configured.
    client_kwargs: dict[str, Any] = {
        "api_key": self._api_key if self._api_key else "not-required",
    }
    if config.base_url:
        client_kwargs["base_url"] = config.base_url

    self._client = OpenAI(**client_kwargs)

    # Prompts: fall back to the built-in defaults when unconfigured.
    self._system_prompt = config.system_prompt or DEFAULT_SYSTEM_PROMPT
    self._user_prompt_template = config.user_prompt_template

    logger.info(
        "Initialized LLMTranslator with model=%s, base_url=%s",
        config.model,
        config.base_url or "default",
    )

translate(text, target_language, source_language=None)

Translate text using LLM.

Parameters:

Name Type Description Default
text str

Text to translate

required
target_language str

Target language code

required
source_language str | None

Source language code (optional)

None

Returns:

Type Description
TranslationResult

TranslationResult with translated text

Source code in src/pretok/translation/llm.py
def translate(
    self,
    text: str,
    target_language: str,
    source_language: str | None = None,
) -> TranslationResult:
    """Translate text using LLM.

    Args:
        text: Text to translate
        target_language: Target language code
        source_language: Source language code (optional)

    Returns:
        TranslationResult with translated text

    Raises:
        TranslationError: If all retry attempts fail
    """
    # Empty/whitespace-only input: return it unchanged without an API call.
    if not text or not text.strip():
        return TranslationResult(
            source_text=text,
            translated_text=text,
            source_language=source_language or "unknown",
            target_language=target_language,
            translator=self.name,
        )

    # Build user prompt; a user-supplied template takes precedence.
    if self._user_prompt_template:
        user_prompt = self._user_prompt_template.format(
            text=text,
            source_language=source_language or "auto-detect",
            target_language=target_language,
        )
    elif source_language:
        user_prompt = DEFAULT_USER_PROMPT.format(
            text=text,
            source_language=self._get_language_name(source_language),
            target_language=self._get_language_name(target_language),
        )
    else:
        user_prompt = DEFAULT_USER_PROMPT_AUTO.format(
            text=text,
            target_language=self._get_language_name(target_language),
        )

    # Calculate max_tokens: use explicit value if set, otherwise use multiplier
    if self._config.max_tokens is not None:
        max_tokens = self._config.max_tokens
    else:
        # The multiplier may be a float; the chat completions API
        # requires an integer token budget, so truncate and keep at
        # least one token for non-empty input.
        max_tokens = max(1, int(len(text) * self._config.max_tokens_multiplier))

    logger.debug(
        "Translation request: text_len=%d, max_tokens=%d",
        len(text),
        max_tokens,
    )

    max_retries = self._config.max_retries
    # retry_delay may not exist on older configs; default to 1 second.
    retry_delay = getattr(self._config, "retry_delay", 1.0)
    last_error: Exception | None = None

    for attempt in range(max_retries + 1):
        try:
            response = self._client.chat.completions.create(
                model=self._config.model,
                messages=[
                    {"role": "system", "content": self._system_prompt},
                    {"role": "user", "content": user_prompt},
                ],
                temperature=self._config.temperature,
                max_tokens=max_tokens,
            )

            translated_text = response.choices[0].message.content or ""

            # Clean up the response - remove common LLM artifacts
            translated_text = self._clean_translation(translated_text, text)

            return TranslationResult(
                source_text=text,
                translated_text=translated_text,
                source_language=source_language or "auto",
                target_language=target_language,
                translator=self.name,
                metadata={
                    "model": response.model,
                    "usage": {
                        "prompt_tokens": response.usage.prompt_tokens if response.usage else 0,
                        "completion_tokens": response.usage.completion_tokens
                        if response.usage
                        else 0,
                    },
                    "attempts": attempt + 1,
                },
            )

        except Exception as e:
            last_error = e
            if attempt < max_retries:
                logger.warning(
                    "Translation attempt %d/%d failed: %s. Retrying in %.1fs...",
                    attempt + 1,
                    max_retries + 1,
                    e,
                    retry_delay,
                )
                time.sleep(retry_delay)
                continue

    raise TranslationError(
        f"LLM translation failed after {max_retries + 1} attempts: {last_error}",
        source_text=text,
        source_language=source_language,
        target_language=target_language,
        translator=self.name,
        cause=last_error,
    )

translate_batch(texts, target_language, source_language=None)

Translate multiple texts.

For LLM translators, we translate texts one by one to maintain quality and handle errors individually.

Source code in src/pretok/translation/llm.py
def translate_batch(
    self,
    texts: Sequence[str],
    target_language: str,
    source_language: str | None = None,
) -> list[TranslationResult]:
    """Translate multiple texts.

    For LLM translators, we translate texts one by one
    to maintain quality and handle errors individually.
    """
    results: list[TranslationResult] = []
    for item in texts:
        try:
            results.append(self.translate(item, target_language, source_language))
        except TranslationError:
            # A failed item degrades gracefully: keep the original text
            # so the batch result stays aligned with the input order.
            fallback = TranslationResult(
                source_text=item,
                translated_text=item,
                source_language=source_language or "unknown",
                target_language=target_language,
                translator=self.name,
            )
            results.append(fallback)
    return results

pretok.translation.factory.create_translator(backend, config=None)

Create a translator instance by backend name.

Parameters:

Name Type Description Default
backend str

Backend name ('llm', 'huggingface', 'google', 'deepl')

required
config TranslationConfig | None

Translation configuration

None

Returns:

Type Description
BaseTranslator

Configured translator instance

Raises:

Type Description
ValueError

If unknown backend specified

ImportError

If required dependencies not installed

Source code in src/pretok/translation/factory.py
def create_translator(
    backend: str,
    config: TranslationConfig | None = None,
) -> BaseTranslator:
    """Create a translator instance by backend name.

    Args:
        backend: Backend name ('llm', 'huggingface', 'google', 'deepl')
        config: Translation configuration

    Returns:
        Configured translator instance

    Raises:
        ValueError: If unknown backend specified
        ImportError: If required dependencies not installed
    """
    if backend == "llm":
        from pretok.translation.llm import LLMTranslator

        # The LLM backend has no usable defaults (model/base_url are
        # deployment-specific), so explicit configuration is required.
        if config is None or config.llm is None:
            raise ValueError("LLM translator requires configuration")

        return LLMTranslator(config.llm)

    if backend == "huggingface":
        from pretok.translation.huggingface import HuggingFaceTranslator

        # Fall back to defaults when the sub-config is absent, mirroring
        # the explicit None check used for the LLM backend.
        # NOTE(review): assumes sub-configs may be None on a provided
        # TranslationConfig, as the llm branch implies — confirm.
        if config is None or config.huggingface is None:
            from pretok.config import HuggingFaceTranslatorConfig

            hf_config = HuggingFaceTranslatorConfig()
        else:
            hf_config = config.huggingface

        return HuggingFaceTranslator(hf_config)

    if backend == "google":
        from pretok.translation.google import GoogleTranslator

        if config is None or config.google is None:
            from pretok.config import GoogleTranslatorConfig

            google_config = GoogleTranslatorConfig()
        else:
            google_config = config.google

        return GoogleTranslator(google_config)

    if backend == "deepl":
        from pretok.translation.deepl import DeepLTranslator

        if config is None or config.deepl is None:
            from pretok.config import DeepLTranslatorConfig

            deepl_config = DeepLTranslatorConfig()
        else:
            deepl_config = config.deepl

        return DeepLTranslator(deepl_config)

    raise ValueError(f"Unknown translator backend: {backend}")