pretok

Universal pre-token language adaptation layer for text-based LLMs.

pretok enables any Large Language Model to receive input in any human language by automatically translating input text into a language the model supports—all before tokenization, without modifying the model or tokenizer.

Features

Model-Agnostic: Works with any text-based LLM (local, remote, open-source, proprietary)
Pre-Token Boundary: All transformations occur on raw text before tokenization
Prompt Structure Preservation: Role markers, delimiters, code blocks, and control tokens are preserved
Pluggable Backends: Support for multiple detection and translation engines
Flexible Translation: Use any LLM via OpenAI-compatible APIs (OpenRouter, Ollama, vLLM, etc.)
Explicit Capability Contracts: Models declare their supported languages

Installation

pip install pretok

Quick Start

from pretok import Pretok, create_pretok

# Create with default settings (targets English)
pretok = Pretok(target_language="en")

# Process text
result = pretok.process("Bonjour, comment ça va?")

print(result.processed_text)  # "Hello, how are you?"
print(result.was_modified)    # True

With Model-Specific Optimization

# Auto-detect optimal language from model capabilities
pretok = create_pretok(model_id="gpt-4")

result = pretok.process("Hello World")
# Uses GPT-4's primary language (English)

With Custom Translation Backend

from pretok import Pretok
from pretok.config import LLMTranslatorConfig
from pretok.translation.llm import LLMTranslator

# Use any OpenAI-compatible API
config = LLMTranslatorConfig(
    base_url="https://api.openai.com/v1",  # Or OpenRouter, Ollama, vLLM, etc.
    model="gpt-4o-mini",
)
translator = LLMTranslator(config)
pretok = Pretok(target_language="en", translator=translator)