Architecture

Overview

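pretok is a modular, pluggable system for pre-token language adaptation: it transforms raw prompt text, translating it where the target model requires, before the text reaches the tokenizer.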

┌─────────────────────────────────────────────────────────────────────────┐
│                              PreTok Pipeline                             │
├─────────────────────────────────────────────────────────────────────────┤
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐ │
│  │   Language   │  │    Model     │  │   Segment    │  │ Translation  │ │
│  │  Detection   │──│  Capability  │──│  Processing  │──│   Engine     │ │
│  │   Module     │  │   Registry   │  │    Module    │  │    Module    │ │
│  └──────────────┘  └──────────────┘  └──────────────┘  └──────────────┘ │
│         │                 │                 │                 │         │
│         ▼                 ▼                 ▼                 ▼         │
│  ┌──────────────────────────────────────────────────────────────────┐  │
│  │                      Configuration System                         │  │
│  └──────────────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────────────┘

Processing Flow

1. Input - Raw text in any language
2. Segment Parsing - Split the text into typed segments
3. Language Detection - Identify the source language
4. Capability Check - Determine whether translation is needed
5. Translation - Translate content segments
6. Reconstruction - Reassemble the prompt
7. Output - Text ready for tokenization
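
The flow above can be sketched end to end as a minimal, self-contained example. Everything in it is an assumption made for illustration: the `Segment` type, the fenced-code splitting rule, the toy dictionary detector and translator, and the `SUPPORTED` set are hypothetical and are not pretok's actual API.

```python
import re
from dataclasses import dataclass

# Hypothetical segment type: "content" may be translated, "code" is passed through.
@dataclass
class Segment:
    kind: str
    text: str

FENCE = re.compile(r"(```.*?```)", re.DOTALL)
SUPPORTED = {"en"}  # languages the (hypothetical) target model handles
TOY_DICTIONARY = {"hallo": "hello", "welt": "world"}  # stand-in for a real backend

def parse_segments(prompt: str) -> list[Segment]:
    """Segment Parsing: split fenced code from surrounding content."""
    segments = []
    for part in FENCE.split(prompt):
        if part:
            kind = "code" if part.startswith("```") else "content"
            segments.append(Segment(kind, part))
    return segments

def detect_language(text: str) -> str:
    """Language Detection (toy): 'de' if any known German word appears, else 'en'."""
    words = re.findall(r"\w+", text.lower())
    return "de" if any(w in TOY_DICTIONARY for w in words) else "en"

def translate(text: str, source: str, target: str) -> str:
    """Translation (toy): word-for-word lookup that keeps spacing and punctuation."""
    return re.sub(
        r"\w+", lambda m: TOY_DICTIONARY.get(m.group().lower(), m.group()), text
    )

def run_pipeline(prompt: str, target: str = "en") -> str:
    """Run parsing, detection, capability check, translation, and reconstruction."""
    parts = []
    for seg in parse_segments(prompt):
        if seg.kind == "content":
            source = detect_language(seg.text)
            if source not in SUPPORTED:  # Capability Check
                seg = Segment(seg.kind, translate(seg.text, source, target))
        parts.append(seg.text)
    return "".join(parts)  # Reconstruction
```

In this sketch only content segments are ever rewritten, so the fenced code inside a prompt comes out byte-for-byte identical.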

Module Design

Plugin Architecture

Detection and translation backends implement protocols:

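from __future__ import annotations  # result types are defined elsewhere in pretok

from typing import Protocol

class LanguageDetector(Protocol):
    """Structural interface that detection backends satisfy."""
    def detect(self, text: str) -> DetectionResult: ...

class Translator(Protocol):
    """Structural interface that translation backends satisfy."""
    async def translate(
        self, text: str, source: str, target: str
    ) -> TranslationResult: ...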
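
Because these are `typing.Protocol` classes, a backend does not need to inherit from anything; it only needs methods with matching signatures. The sketch below shows what that might look like. The fields of `DetectionResult` and `TranslationResult`, the two toy backends, and the `adapt` helper are all assumptions for illustration, not pretok's real definitions.

```python
import asyncio
from dataclasses import dataclass
from typing import Protocol

@dataclass
class DetectionResult:  # hypothetical fields
    language: str
    confidence: float

@dataclass
class TranslationResult:  # hypothetical fields
    text: str
    source: str
    target: str

class LanguageDetector(Protocol):
    def detect(self, text: str) -> DetectionResult: ...

class Translator(Protocol):
    async def translate(
        self, text: str, source: str, target: str
    ) -> TranslationResult: ...

class AsciiDetector:
    """Toy backend: pure-ASCII text is reported as 'en', anything else as 'und'."""
    def detect(self, text: str) -> DetectionResult:
        if text.isascii():
            return DetectionResult("en", 0.5)
        return DetectionResult("und", 0.0)

class UppercaseTranslator:
    """Stand-in backend that 'translates' by upper-casing the text."""
    async def translate(
        self, text: str, source: str, target: str
    ) -> TranslationResult:
        return TranslationResult(text.upper(), source, target)

async def adapt(
    text: str, detector: LanguageDetector, translator: Translator, target: str = "en"
) -> str:
    """Translate only when the detected language differs from the target."""
    detected = detector.detect(text)
    if detected.language == target:
        return text
    result = await translator.translate(text, detected.language, target)
    return result.text
```

A caller would drive it with, for example, `asyncio.run(adapt("héllo", AsciiDetector(), UppercaseTranslator()))`. Swapping in a different backend requires no change to `adapt`, which is the point of the structural typing.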

Configuration System

Three-tier configuration hierarchy:

1. Built-in defaults
2. Configuration file
3. Runtime overrides
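
One way to resolve the three tiers is a layered lookup in which each later tier shadows the earlier ones. The sketch below assumes that precedence (runtime overrides win over the file, which wins over defaults) and uses made-up keys; neither the key names nor `resolve_config` come from pretok itself.

```python
from collections import ChainMap

# Hypothetical built-in defaults; the key names are illustrative only.
DEFAULTS = {
    "target_language": "en",
    "cache_backend": "memory",
    "cache_size": 1024,
}

def resolve_config(file_config: dict, overrides: dict) -> dict:
    """Merge the three tiers; ChainMap returns the first mapping that has a key."""
    return dict(ChainMap(overrides, file_config, DEFAULTS))
```

For example, `resolve_config({"cache_backend": "sqlite"}, {"cache_size": 16})` keeps the default `target_language`, takes `cache_backend` from the file tier, and takes `cache_size` from the runtime tier.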

Caching Strategy

Translation caching with pluggable backends:

- In-memory LRU (default)
- Redis (distributed)
- SQLite (persistent)
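
As a rough illustration of what a pluggable cache backend could look like, the sketch below defines a small protocol and an in-memory LRU implementation on top of `collections.OrderedDict`. The `TranslationCache` protocol, its `get`/`set` methods, and the `cache_key` format are assumptions, not pretok's actual interface; a Redis or SQLite backend would implement the same two methods.

```python
from collections import OrderedDict
from typing import Optional, Protocol

class TranslationCache(Protocol):
    """Hypothetical contract shared by all cache backends."""
    def get(self, key: str) -> Optional[str]: ...
    def set(self, key: str, value: str) -> None: ...

class MemoryLRUCache:
    """In-memory cache that evicts the least recently used entry when full."""

    def __init__(self, max_size: int = 1024) -> None:
        self._max_size = max_size
        self._items = OrderedDict()

    def get(self, key: str) -> Optional[str]:
        if key not in self._items:
            return None
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def set(self, key: str, value: str) -> None:
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self._max_size:
            self._items.popitem(last=False)  # drop the oldest entry

def cache_key(text: str, source: str, target: str) -> str:
    """Build a key that separates the same text across language pairs."""
    return f"{source}:{target}:{text}"
```

Including the language pair in the key matters because the same source text translated into two different targets must not collide.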

Design Principles

Model-Agnostic - No dependency on specific LLMs
Pre-Token Boundary - Raw text transformations only
Structure Preservation - Maintain prompt integrity
Pluggable Backends - Easy to extend
Explicit Contracts - Declared capabilities
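
To make the "Explicit Contracts" principle concrete, here is a minimal sketch of how a model's capabilities might be declared and queried. The `ModelCapability` fields, the `CapabilityRegistry` class, and the model id `example-model` are hypothetical and only illustrate the idea of declaring capabilities up front rather than inferring them.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ModelCapability:
    """Hypothetical declaration of which languages a model handles."""
    model_id: str
    supported_languages: frozenset
    primary_language: str = "en"

class CapabilityRegistry:
    """Stores declared capabilities and answers the capability check."""

    def __init__(self) -> None:
        self._models = {}

    def register(self, capability: ModelCapability) -> None:
        self._models[capability.model_id] = capability

    def needs_translation(self, model_id: str, language: str) -> bool:
        # Unknown models raise KeyError rather than silently guessing.
        capability = self._models[model_id]
        return language not in capability.supported_languages

registry = CapabilityRegistry()
registry.register(ModelCapability("example-model", frozenset({"en", "fr"})))
```

With this registration, `registry.needs_translation("example-model", "de")` is true and the same call with `"fr"` is false, so the decision to translate depends only on what was declared.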