Segment Processing
pretok intelligently parses prompts to preserve structure during translation.
Segment Types
| Type | Description | Translatable |
|---|---|---|
CONTENT |
Regular text content | Yes |
ROLE_MARKER |
Role indicators (system, user, assistant) | No |
CONTROL_TOKEN |
Special tokens like <\|endoftext\|> |
No |
DELIMITER |
Structural separators | No |
CODE_BLOCK |
Fenced code blocks | No |
JSON_SCHEMA |
JSON/structured data | No |
WHITESPACE |
Whitespace-only segments | No |
Supported Formats
ChatML
Llama
Alpaca
Custom Markers
Register custom markers for non-standard formats:
from pretok.segment import SegmentProcessor
processor = SegmentProcessor()
processor.register_role_marker(r"\[ROLE:(\w+)\]")
processor.register_control_token(r"<\|custom\|>")
Code Block Handling
Code blocks are preserved by default:
'''result = pipeline.adapt(prompt, model_id="llama-2-7b")