
Segment Processing

pretok splits each prompt into typed segments and translates only the translatable ones, so the prompt's structure is preserved during translation.

Segment Types

| Type | Description | Translatable |
| --- | --- | --- |
| CONTENT | Regular text content | Yes |
| ROLE_MARKER | Role indicators (system, user, assistant) | No |
| CONTROL_TOKEN | Special tokens like <\|endoftext\|> | No |
| DELIMITER | Structural separators | No |
| CODE_BLOCK | Fenced code blocks | No |
| JSON_SCHEMA | JSON/structured data | No |
| WHITESPACE | Whitespace-only segments | No |
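
The table above can be read as a simple rule: translate CONTENT, pass everything else through untouched. The following is a minimal sketch of that rule, not pretok's actual implementation; the names `SegmentType`, `is_translatable`, and `translate_segments` are hypothetical and chosen only for illustration.

```python
from enum import Enum


class SegmentType(Enum):
    """Hypothetical mirror of the segment types listed in the table."""
    CONTENT = "content"
    ROLE_MARKER = "role_marker"
    CONTROL_TOKEN = "control_token"
    DELIMITER = "delimiter"
    CODE_BLOCK = "code_block"
    JSON_SCHEMA = "json_schema"
    WHITESPACE = "whitespace"


def is_translatable(segment_type):
    # Per the table, CONTENT is the only translatable type.
    return segment_type is SegmentType.CONTENT


def translate_segments(segments, translate):
    """Apply `translate` to CONTENT text and leave all other text unchanged."""
    return "".join(
        translate(text) if is_translatable(kind) else text
        for kind, text in segments
    )


segments = [
    (SegmentType.CONTROL_TOKEN, "<|im_start|>"),
    (SegmentType.ROLE_MARKER, "user"),
    (SegmentType.WHITESPACE, "\n"),
    (SegmentType.CONTENT, "Hello!"),
    (SegmentType.WHITESPACE, "\n"),
    (SegmentType.CONTROL_TOKEN, "<|im_end|>"),
]

# str.upper stands in for a real translation function.
result = translate_segments(segments, str.upper)
```

Here only `Hello!` is changed; the control tokens, role marker, and whitespace are emitted as they came in.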

Supported Formats

ChatML

<|im_start|>system
You are a helpful assistant.
<|im_end|>
<|im_start|>user
Hello!
<|im_end|>
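
To make the segmentation concrete, here is a rough sketch of how a ChatML prompt like the one above could be split with a regular expression. It is an illustration under assumptions, not pretok's parser: `split_chatml` is a hypothetical helper, and the segment kinds are plain strings borrowed from the table.

```python
import re

# Capturing group so that re.split keeps the tokens in its output.
CHATML_TOKEN = re.compile(r"(<\|im_start\|>|<\|im_end\|>)")


def split_chatml(prompt):
    """Return (kind, text) pairs whose texts concatenate back to `prompt`."""
    segments = []
    expect_role = False
    for piece in CHATML_TOKEN.split(prompt):
        if not piece:
            continue
        if CHATML_TOKEN.fullmatch(piece):
            segments.append(("CONTROL_TOKEN", piece))
            # The role name directly follows an opening token.
            expect_role = piece == "<|im_start|>"
        elif expect_role:
            role, newline, rest = piece.partition("\n")
            segments.append(("ROLE_MARKER", role))
            if newline:
                segments.append(("WHITESPACE", newline))
            if rest:
                kind = "CONTENT" if rest.strip() else "WHITESPACE"
                segments.append((kind, rest))
            expect_role = False
        else:
            kind = "CONTENT" if piece.strip() else "WHITESPACE"
            segments.append((kind, piece))
    return segments


prompt = (
    "<|im_start|>system\nYou are a helpful assistant.\n<|im_end|>\n"
    "<|im_start|>user\nHello!\n<|im_end|>\n"
)
segments = split_chatml(prompt)
roles = [text for kind, text in segments if kind == "ROLE_MARKER"]
contents = [text for kind, text in segments if kind == "CONTENT"]
```

Because no text is dropped, joining the segment texts reproduces the original prompt, which is the property that lets translated content be put back into an unchanged structure.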

Llama

[INST] <<SYS>>
You are a helpful assistant.
<</SYS>>

Hello! [/INST]

Alpaca

### Instruction:
Translate this text.

### Input:
Bonjour

### Response:
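
Alpaca prompts are line-oriented, so the headers can be located with a multiline regular expression. The sketch below is only an illustration of that idea; `parse_alpaca` is a hypothetical helper, not a pretok function.

```python
import re

# Match a header that occupies a whole line; the group keeps it in the split output.
ALPACA_HEADER = re.compile(
    r"^(### (?:Instruction|Input|Response):)[ \t]*$", re.MULTILINE
)


def parse_alpaca(prompt):
    """Map each Alpaca header name to the text that follows it."""
    parts = ALPACA_HEADER.split(prompt)
    # parts is [preamble, header1, body1, header2, body2, ...]
    fields = {}
    for header, body in zip(parts[1::2], parts[2::2]):
        name = header[len("### "):-1]  # drop the "### " prefix and trailing ":"
        fields[name] = body.strip()
    return fields


prompt = (
    "### Instruction:\nTranslate this text.\n\n"
    "### Input:\nBonjour\n\n"
    "### Response:\n"
)
fields = parse_alpaca(prompt)
```

In this example the header lines would be treated as structure, while the text under `Instruction` and `Input` is the part a translator would see.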

Custom Markers

Register custom markers for non-standard formats:

from pretok.segment import SegmentProcessor

processor = SegmentProcessor()
processor.register_role_marker(r"\[ROLE:(\w+)\]")
processor.register_control_token(r"<\|custom\|>")
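
The two patterns above look like Python regular expressions. Assuming they are interpreted with the standard `re` module (the source does not say so explicitly), this sketch shows what each one would match in a sample string; the sample text is made up for illustration.

```python
import re

# Same pattern strings as passed to the register_* calls above.
role_pattern = re.compile(r"\[ROLE:(\w+)\]")
control_pattern = re.compile(r"<\|custom\|>")

text = "[ROLE:system] Be concise. <|custom|> [ROLE:user] Hi"

# findall returns the captured group, i.e. the role name inside the brackets.
roles = role_pattern.findall(text)

# The escaped pipes make the pattern match the literal token only.
control_matches = [m.group(0) for m in control_pattern.finditer(text)]
```

The backslashes matter: unescaped, `[` would start a character class and `|` would mean alternation, so neither pattern would match the literal marker.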

Code Block Handling

Code blocks are preserved by default:

# `pipeline` is assumed to be an already-configured pretok pipeline.
prompt = '''Explain this code:

```python
def hello():
    print("Hello")
```
'''

result = pipeline.adapt(prompt, model_id="llama-2-7b")
# The code block is preserved; only the surrounding text is translated.
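
One way to picture this behavior is to split the prompt on fenced blocks and translate only what lies between them. The following is a minimal sketch under that assumption, not pretok's implementation; `split_code_blocks` is a hypothetical helper.

```python
import re

# Built programmatically so this example does not nest literal fences.
TICKS = "`" * 3
# A fence line, then anything (non-greedy) up to the next fence marker.
FENCE = re.compile(rf"({TICKS}[^\n]*\n.*?{TICKS})", re.DOTALL)


def split_code_blocks(text):
    """Return (kind, text) pairs, separating fenced blocks from other text."""
    segments = []
    for index, piece in enumerate(FENCE.split(text)):
        if not piece:
            continue
        # With one capturing group, odd indices hold the fenced blocks.
        kind = "CODE_BLOCK" if index % 2 == 1 else "CONTENT"
        segments.append((kind, piece))
    return segments


prompt = "\n".join([
    "Explain this code:",
    "",
    TICKS + "python",
    "def hello():",
    '    print("Hello")',
    TICKS,
    "Thanks!",
])

segments = split_code_blocks(prompt)
# str.upper stands in for a real translation function.
translated = "".join(
    text.upper() if kind == "CONTENT" else text for kind, text in segments
)
```

The fenced block passes through byte for byte, while the sentences before and after it are handed to the translator.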

JSON Handling

JSON structures are preserved:

# `pipeline` is assumed to be an already-configured pretok pipeline.
prompt = '''Parse this JSON:

{"name": "John", "city": "Paris"}
'''

result = pipeline.adapt(prompt, model_id="llama-2-7b")
# The JSON is preserved exactly.
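
As an illustration of how embedded JSON could be located, the sketch below scans for `{` or `[` and asks the standard library's `json.JSONDecoder.raw_decode` whether a valid JSON value starts there. This is an assumed approach for demonstration, not a description of how pretok detects JSON; `split_json` is a hypothetical helper, and for brevity all non-JSON text is labeled CONTENT.

```python
import json


def split_json(text):
    """Return (kind, text) pairs, separating valid JSON values from other text."""
    decoder = json.JSONDecoder()
    segments = []
    pos = 0
    pending = 0  # start of the non-JSON text not yet emitted
    while pos < len(text):
        if text[pos] in "{[":
            try:
                # raw_decode returns the parsed value and the end index.
                _, end = decoder.raw_decode(text, pos)
            except json.JSONDecodeError:
                pos += 1
                continue
            if pending < pos:
                segments.append(("CONTENT", text[pending:pos]))
            segments.append(("JSON_SCHEMA", text[pos:end]))
            pos = pending = end
        else:
            pos += 1
    if pending < len(text):
        segments.append(("CONTENT", text[pending:]))
    return segments


prompt = 'Parse this JSON:\n\n{"name": "John", "city": "Paris"}\n'
segments = split_json(prompt)
json_parts = [text for kind, text in segments if kind == "JSON_SCHEMA"]
```

Slicing the original string, rather than re-serializing the parsed value, is what keeps the JSON exactly as written, including its spacing and key order.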