Architecture
Overview
pretok is designed as a modular, pluggable system for pre-token language adaptation.
┌─────────────────────────────────────────────────────────────────────────┐
│ PreTok Pipeline │
├─────────────────────────────────────────────────────────────────────────┤
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Language │ │ Model │ │ Segment │ │ Translation │ │
│ │ Detection │──│ Capability │──│ Processing │──│ Engine │ │
│ │ Module │ │ Registry │ │ Module │ │ Module │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ └──────────────┘ │
│ │ │ │ │ │
│ ▼ ▼ ▼ ▼ │
│ ┌──────────────────────────────────────────────────────────────────┐ │
│ │ Configuration System │ │
│ └──────────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────┘
Processing Flow
- Input - Raw text in any language
- Segment Parsing - Split into typed segments
- Language Detection - Identify source language
- Capability Check - Determine if translation needed
- Translation - Translate content segments
- Reconstruction - Reassemble the prompt
- Output - Text ready for tokenization
Module Design
Plugin Architecture
Detection and translation backends implement protocols:
class LanguageDetector(Protocol):
def detect(self, text: str) -> DetectionResult: ...
class Translator(Protocol):
async def translate(
self, text: str, source: str, target: str
) -> TranslationResult: ...
Configuration System
Three-tier configuration hierarchy: 1. Built-in defaults 2. Configuration file 3. Runtime overrides
Caching Strategy
Translation caching with pluggable backends: - In-memory LRU (default) - Redis (distributed) - SQLite (persistent)
Design Principles
- Model-Agnostic - No dependency on specific LLMs
- Pre-Token Boundary - Raw text transformations only
- Structure Preservation - Maintain prompt integrity
- Pluggable Backends - Easy to extend
- Explicit Contracts - Declared capabilities