Configuration
pretok uses a hierarchical configuration system with sensible defaults.
Configuration Hierarchy
- Built-in defaults - Sensible out-of-box behavior
- Configuration file -
pretok.yamlorpretok.toml - Runtime overrides - Programmatic configuration
Configuration File
pretok automatically searches for configuration files in this order:
pretok.yamlpretok.ymlpretok.tomlpretok.json.pretok.yaml
Full Configuration Example
version: "1.0"
# Pipeline settings
pipeline:
default_detector: langdetect
cache_enabled: true
# Language detection configuration
detection:
fasttext:
model_path: /path/to/lid.176.bin
langdetect:
seed: 42
composite:
detectors: [fasttext, langdetect]
strategy: voting
# Translation configuration
translation:
# LLM-based translation (recommended)
llm:
base_url: "https://api.openai.com/v1"
model: "gpt-4o-mini"
api_key_env: OPENAI_API_KEY
temperature: 0.3
max_tokens: 1024
system_prompt: "You are a professional translator."
user_prompt_template: "Translate from {source_lang} to {target_lang}:\n{text}"
# HuggingFace models for local translation
huggingface:
model: facebook/nllb-200-distilled-600M
device: cuda
max_length: 512
num_beams: 4
# Cache configuration
cache:
memory:
max_size: 1000
ttl: 3600
# Model capabilities
models:
default:
supported_languages: [en]
primary_language: en
gpt-4:
supported_languages: [en, zh, ja, ko, fr, de, es, it, pt, ru]
primary_language: en
qwen-7b:
supported_languages: [zh, en]
primary_language: zh
llama-2-7b:
supported_languages: [en]
primary_language: en
# Segment processing
segment:
preserve_code_blocks: true
translate_code_comments: false
translate_json_strings: false
Environment Variables
Sensitive values can reference environment variables:
Or inline with ${VAR} syntax:
Programmatic Configuration
from pretok import Pipeline
from pretok.config import PretokConfig, PipelineConfig, TranslationConfig
config = PretokConfig(
pipeline=PipelineConfig(
default_detector="langdetect",
default_translator="openai",
confidence_threshold=0.8,
),
translation=TranslationConfig(
openai={"model": "gpt-4"},
),
)
pipeline = Pipeline(config=config)
Configuration Validation
Configuration is validated on load: