Quality > Quantity: Building DarkForest
How I designed a custom 128K Byte-BPE tokenizer for DarkForest, and the data curation strategy that turned a noisy corpus into something a model can learn from.
The model is DarkForest by BlackwoodAI.
Two decisions you make before training starts will quietly shape the whole run. The first is the tokenizer, which decides how the model sees text. The second is the data, which decides what it sees and how often. Neither is glamorous. Nobody writes blog posts about tokenizer audits. But those are the two pieces I've been heads-down on for the last month, and most of the value gets made there.
This post is about both. Less of a tutorial, more of a record of what I had to figure out the hard way.
The tokenizer is the eye
Every language model has the same problem on day one: the network does math on integers, and text isn't integers. Something has to do the translation.
That something is the tokenizer. It turns "hello world" into a list of numbers and reads them back into characters. Most people treat it like infrastructure you grab off a shelf and forget about. I think that's a mistake when you're training from scratch.
A tokenizer is, in practical terms, a compression algorithm with opinions. It looks at your data, decides which substrings are common enough to earn a single token, and spells out everything else across multiple tokens. The model never gets to revisit those decisions. Whatever the tokenizer commits to on day one, the model is stuck with forever.
If you train on Python and your tokenizer encodes "def " as four tokens, you're paying a 4× tax on every function definition in the training corpus. If you have a lot of LaTeX and the tokenizer doesn't know that \sqrt{ is a unit, the model has to learn it the long way. If you're planning on tool calling later and the tokenizer fragments <|tool_call|> into eight pieces, the model has eight chances to mess it up instead of one.
So I rolled my own. About three weeks of engineering. Not fun while doing it. Glad afterwards.
Byte-level BPE, 128K vocab
The algorithm is Byte-Pair Encoding. The "byte-level" qualifier matters more than people give it credit for. Plain BPE works on characters. Byte-level BPE works on raw bytes. That's a small change in description and a big change in behavior.
Unicode is a minefield. In UTF-8, CJK characters typically take three bytes, sometimes four. Most emoji take four bytes per code point, and many are sequences of several code points glued together with zero-width joiners. Code in mixed encodings, log files with stray binary garbage, mathematical symbols: all of these break naive tokenizers in ways that show up as silently dropped characters. Byte-level BPE sidesteps all of it by treating the 256-value byte alphabet as its primitive. There is no UNK token, ever. Whatever you put in, you get back.
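To make the byte-alphabet point concrete, here is a minimal sketch of just the byte fallback layer, with no learned merges. The function names and sample strings are illustrative assumptions, not DarkForest's code. It shows that any string maps to IDs in the 0-255 range and comes back unchanged.

```python
# Byte fallback only: every string is a sequence of IDs in [0, 255].
# Names and samples are hypothetical; a real tokenizer applies BPE merges on top.

def to_byte_ids(text: str) -> list[int]:
    """Encode text as raw UTF-8 byte values."""
    return list(text.encode("utf-8"))


def from_byte_ids(ids: list[int]) -> str:
    """Decode raw byte values back into text."""
    return bytes(ids).decode("utf-8")


samples = {
    "ascii": "def ",
    "cjk": "\u68ee",                                   # one CJK ideograph
    "emoji": "\U0001F332",                             # one emoji code point
    "zwj_sequence": "\U0001F469\u200d\U0001F4BB",      # emoji + ZWJ + emoji
}

for name, text in samples.items():
    ids = to_byte_ids(text)
    assert all(0 <= i < 256 for i in ids)   # nothing falls outside the alphabet
    assert from_byte_ids(ids) == text       # lossless round trip
    print(f"{name}: {len(ids)} bytes")
```

The learned merges only ever combine these byte IDs into longer units, which is why nothing needs an UNK fallback.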
Then there's the vocab size question, which is a genuine trade-off and scales with model size. The embedding table is vocab_size × d_model parameters. That's the tied-embeddings number, where the input embedding and the output (unembedding) projection share weights. If you untie them, double the count. At 7B-class dimensions (d_model = 4096), the tied math is:
- 32K vocab → ~131M embedding params (about 1.9% of a 7B model, tied)
- 64K vocab → ~262M (about 3.7%, tied)
- 128K vocab → ~524M (about 7.5%, tied)
If you untie, 128K at d_model=4096 lands at roughly 1.05B params just for the embedding stack, which puts that single decision at ~15% of a 7B's total budget. Worth knowing before you commit.
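The arithmetic above is easy to re-run for other shapes. This is a small sketch, assuming "K" means 1,000 (which is what reproduces the ~131M figure) and a round 7B total; the function name is illustrative.

```python
# Embedding-table cost as a share of the parameter budget.
# Assumptions: "32K" = 32,000 and a round 7,000,000,000-parameter model.

def embedding_params(vocab_size: int, d_model: int, tied: bool = True) -> int:
    """Parameters in the embedding stack; untied doubles the table."""
    table = vocab_size * d_model
    return table if tied else 2 * table


D_MODEL = 4096
TOTAL_PARAMS = 7_000_000_000

for vocab in (32_000, 64_000, 128_000):
    for tied in (True, False):
        n = embedding_params(vocab, D_MODEL, tied=tied)
        label = "tied" if tied else "untied"
        print(f"{vocab:>7} vocab, {label:>6}: {n / 1e6:7.0f}M "
              f"({100 * n / TOTAL_PARAMS:.1f}% of 7B)")
```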
At 7B scale with tied embeddings, 128K costs you somewhere around 7-8% of the parameter budget. Llama 3 chose the same regime and the public report supports better compression overall, especially on non-English text. Qwen 2.5 pushed to ~152K. As for why smaller 32K vocabs persist (Phi-3-mini, older Llama 2): at smaller scales a 128K embedding table can eat 25%+ of the whole model, so a smaller vocab keeps room for attention and MLP capacity. At 7B that pressure largely lifts.
So 128K it is. Two things that buys: better compression on the corpus (on the held-out text and math I've measured, 128K cuts inputs into roughly 10-15% fewer tokens than 32K would; my code measurement is still re-running, so I'm not committing to a code number yet), and 226 unused special-token slots reserved for chat structure, tools, FIM, plus any modality I might want to add later. Reserved IDs aren't a free pass to multimodality. You still have to train embeddings, adapters, the rest of it. They just save you from having to regrow and re-initialize the embedding table when you do.
The 128K total breaks down into three layers. At the bottom, a 256-slot byte alphabet so anything is encodable. Above that, 256 reserved special-token slots (I'm using 30 of them, for chat structure, tools, FIM, language tags). The remaining ~127K are merges learned during tokenizer training on the data mix itself.
Where the merges came from
The BPE merges weren't trained on a generic web crawl. They were trained on a representative subset of the same mixture the model will eventually see, sampled in roughly the proportions the corpus uses. The merges reflect what the model is going to encounter, not whatever happens to be in some generic corpus.
Source classes that went into the tokenizer-training subset are below. (I'm holding the model-pretraining manifest private until a license/attribution review wraps. Tokenizer training has a lower memorization surface than full model pretraining, though it's not zero. BPE can and does learn real substrings: names, file paths, URLs, snippets. So I treat it as a different bar, but not a free pass.)
Two things worth flagging. First, the proportions in the tokenizer training subset don't match the proportions in the full pretraining mix. Sub-sampling is deliberately biased toward making sure rare-but-important categories (math symbols, CJK, code idioms) get enough representation for BPE to learn them as single tokens. Use raw mix proportions and the long tail of math symbols would be underrepresented and rarely merge.
Second, "openly licensed" isn't a hall pass. Public-domain books and permissive-license code still come with author attribution and downstream-use practices I'm tracking. Tokenizer training is lower-risk than model pretraining because the memorization surface is smaller, but it isn't zero. BPE will happily learn frequent literal substrings if they appear often enough in the corpus. I treat the two bars as different, not as one being absent.
The pre-tokenizer almost broke me
This part isn't obvious until you've built one. BPE on its own doesn't know about word boundaries. Let it merge whatever it wants and you end up with junk merges like "theand" and " end." showing up as single tokens. They look fine in isolation. They waste vocab slots and the model has to learn to ignore them.
The pre-tokenizer fixes this by splitting raw text into chunks first. BPE then only gets to merge inside each chunk. Cross-chunk merges are forbidden.
I spent a week on the regex.
It has to handle words. It has to handle numbers, but you want to cap digit runs at 1-3 so the model learns place value instead of memorizing every 4-digit string ("2026" becomes two tokens, not one). Punctuation runs need to go as their own units. Whitespace runs stay separate. All of that, while still letting BPE find the common patterns inside each chunk, because that's where the compression comes from.
The pattern I ended up with is the Llama 3 / tiktoken-style one, which is more elaborate than the toy regex you see in BPE tutorials:
(?i:'s|'t|'re|'ve|'m|'ll|'d)
| [^\r\n\p{L}\p{N}]?+\p{L}+
| \p{N}{1,3}
| ?[^\s\p{L}\p{N}]++[\r\n]*
| \s*[\r\n]
| \s+(?!\S)
| \s+
A few things worth pointing out. The leading-apostrophe alternation ('s, 't, 're, etc.) splits the apostrophe-suffix off as its own pre-token chunk, so "don't" ends up as ["don", "'t"] for BPE to consider, not one undifferentiated blob. It doesn't keep the full contraction together. The suffix is just stable enough across the corpus that BPE will reliably learn "'t" as a single token. The [^\r\n\p{L}\p{N}]?+\p{L}+ part attaches one optional leading non-alphanumeric (usually a space) to a word run, which is what gives you single tokens like " hello" instead of two tokens for the space and the word. Digit runs cap at 3 so "2026" becomes two tokens, not one. That generalizes to any 4-digit number instead of memorizing specific ones. Trailing whitespace splits into separate cases for "whitespace ending in newline," "whitespace not followed by content," and "general whitespace," because indentation in code matters.
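To see the chunking behavior without installing anything, here is a rough standard-library approximation of the pattern. It is a sketch under stated assumptions, not the production regex: Python's built-in `re` has no `\p{L}` or `\p{N}`, so letters are approximated as `[^\W\d_]` and numbers as `\d`, and the possessive quantifiers are replaced with ordinary ones. The third-party `regex` module would be needed for the real Unicode classes.

```python
import re

# Approximations (assumptions): letters ~ [^\W\d_], numbers ~ \d.
LETTER = r"[^\W\d_]"
NOT_ALNUM_OR_NEWLINE = r"(?:[^\r\n\w]|_)"   # stands in for [^\r\n\p{L}\p{N}]
PUNCT = r"(?:[^\s\w]|_)"                    # stands in for [^\s\p{L}\p{N}]

PRETOKENIZE = re.compile(
    r"(?i:'s|'t|'re|'ve|'m|'ll|'d)"         # apostrophe suffixes
    rf"|{NOT_ALNUM_OR_NEWLINE}?{LETTER}+"   # optional leading char + word
    r"|\d{1,3}"                             # digit runs capped at 3
    rf"| ?{PUNCT}+[\r\n]*"                  # punctuation runs
    r"|\s*[\r\n]"                           # whitespace ending in a newline
    r"|\s+(?!\S)"                           # whitespace not followed by content
    r"|\s+"                                 # any remaining whitespace
)


def pretokenize(text: str) -> list[str]:
    """Split text into chunks; BPE would only merge inside each chunk."""
    return PRETOKENIZE.findall(text)


print(pretokenize("don't stop in 2026"))
```

Joining the chunks back together should always reproduce the input, which makes a cheap sanity check for any pattern change.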
Separate gotcha. You have to set use_regex=False on the byte-level encoder layer underneath the pre-tokenizer. Leave it at the default and the encoder applies its own built-in regex on top of yours, double-splitting everything and producing nonsense merges. I lost a day figuring out why my merges looked off. Read the tokenizer library's source if it's at all opinionated.
Special tokens are not decoration
The 128K vocab includes 256 reserved special-token slots. I used 30 of them.
These are the structural handles the model learns to emit during fine-tuning. Chat structure (<|user|>, <|assistant|>, <|system|>). Tool calling (<|tool_call|>, <|tool_result|>). Reasoning markers (<|think|>, <|end_think|>). Code language hints. Fill-in-the-middle markers for code completion.
Each of these is exactly one token in the vocabulary. That gives you a clean, unambiguous boundary at decode time. No string-matching fragility, no protocol hacks layered on top of generic text. The model itself doesn't know what a <|tool_call|> token means at the start of pretraining. It just sees an ID. The protocol meaning ("this token signals a structured tool invocation, here's how to format the JSON inside") gets taught later, in SFT and reinforced in any RL stage. Pretraining just makes sure the boundary token is atomic, so the post-training signal isn't fighting the tokenizer when the time comes.
The remaining 226 reserved slots are blank space. If next year I want to add image or audio modalities, the IDs are pre-allocated, which means I don't have to grow and re-initialize the embedding table to make room. Reserved IDs don't make multimodal training easy on their own. You still have to train embeddings, projection layers, and any modality adapters end-to-end. They just remove one ugly piece of plumbing from a future migration.
Validating that it works
Before trusting a tokenizer with months of training, you have to run every check you can think of. Quick summary of what I ran:
Compression on held-out data. The most important number. You measure bytes per token on text the tokenizer never saw during training. Higher is better, because for every byte of source you produce fewer tokens, and that means fewer compute steps for the same content.
| Domain | Bytes per token | Notes |
|---|---|---|
| English prose | 4.49 | Held-out books, articles, encyclopedic content |
| Math | 4.04 | LaTeX-heavy, expected to be a touch lower |
| Code | behavior-passed; comparative bpt re-running | Round-trip, indentation, FIM markers, single-token specials all validated. The cross-domain bpt comparison hit a schema mismatch on the held-out reader and is being regenerated. |
To be specific about what's known and what isn't: the tokenizer passed every behavior check on code that I cared enough to test. Round-trip on real source files (Python with 4-space indent, TypeScript, Rust, SQL with quoted strings, makefiles with tabs) was 10/10 exact. FIM and chat special tokens encode/decode atomically. The pre-tokenizer regex splits indentation and digit runs the way the real Llama 3 / tiktoken pattern does. What's pending is the comparative number, the held-out code bytes-per-token I want to put on a table next to the prose and math figures. The first-pass validation reader had a schema mismatch and the run is going again. So the model went into pretraining on a tokenizer whose code behavior is validated; I just don't have the clean comparative number yet to publish next to 4.49 and 4.04.
A swing of 0.5 bpt on the domains I've measured is roughly 10% fewer tokens for the same text on those domains. Over a multi-hundred-billion-token run, that compounds to real money, but only if the compression number holds across the whole mix, including code.
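For reference, the measurement itself is small. The sketch below assumes an `encode` callable that returns a list of token IDs; `toy_encode` is a stand-in so the example runs on its own and is not a real tokenizer. `token_reduction` converts a bytes-per-token change into the fraction of tokens saved on the same text.

```python
from typing import Callable, Iterable


def bytes_per_token(texts: Iterable[str],
                    encode: Callable[[str], list[int]]) -> float:
    """UTF-8 bytes of source divided by tokens produced (higher is better)."""
    total_bytes = 0
    total_tokens = 0
    for text in texts:
        total_bytes += len(text.encode("utf-8"))
        total_tokens += len(encode(text))
    return total_bytes / total_tokens


def token_reduction(bpt_before: float, bpt_after: float) -> float:
    """Fraction of tokens saved on the same bytes when bpt improves."""
    return 1.0 - bpt_before / bpt_after


def toy_encode(text: str) -> list[int]:
    """Hypothetical stand-in encoder: one token per whitespace-separated word."""
    return list(range(len(text.split())))


print(bytes_per_token(["hello world"], toy_encode))
print(f"{token_reduction(4.0, 4.5):.1%} fewer tokens")
```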
Special token integrity. Every active special token must encode as exactly one token, and decode back to the original string when you ask the decoder to preserve specials. Sounds trivial. It's the bug that kills people. Mine: 30/30 specials passed.
Byte alphabet coverage. Every one of the 256 possible bytes must be addressable as a single token. This is what makes round-trip lossless. 256/256 covered.
Round-trip on diverse samples. Code with 4-space indentation. LaTeX with fractions and integrals. CJK paragraphs. Emoji. URLs with query strings. HTML. SQL with quoted strings. Tab characters. Anything weird I could think of. 10/10 round-trip exactly.
Vocab inspection. Look at the lowest token IDs (the earliest learned merges). They should be common subword fragments. Look at the highest IDs (the last merges to be learned). They should be rare-but-real words or technical terms. If your highest IDs are gibberish, something went wrong upstream.
The first audit I wrote was broken. It tested round-trips with the default decode settings, which strip special tokens. So every special token "failed" the round-trip even though they were perfectly atomic. An hour of confusion later I realized the audit was lying, not the tokenizer. If your audit doesn't pass skip_special_tokens=False to the decoder, your audit is wrong. Mine was, briefly.
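The failure mode is easy to reproduce with a toy. The sketch below is not a real tokenizer library: it is a hypothetical byte-level encoder with three special tokens, and its `skip_special_tokens` flag only mimics the decoder option named above. The audit checks that each special encodes to exactly one ID and round-trips when specials are preserved.

```python
import re

# Hypothetical toy: bytes are IDs 0-255, specials start at 256.
SPECIALS = ["<|user|>", "<|assistant|>", "<|tool_call|>"]
SPECIAL_TO_ID = {tok: 256 + i for i, tok in enumerate(SPECIALS)}
ID_TO_SPECIAL = {i: tok for tok, i in SPECIAL_TO_ID.items()}
SPLIT = re.compile("(" + "|".join(re.escape(t) for t in SPECIALS) + ")")


def encode(text: str) -> list[int]:
    """Specials become one ID each; everything else becomes raw bytes."""
    ids: list[int] = []
    for piece in SPLIT.split(text):
        if piece in SPECIAL_TO_ID:
            ids.append(SPECIAL_TO_ID[piece])
        else:
            ids.extend(piece.encode("utf-8"))
    return ids


def decode(ids: list[int], skip_special_tokens: bool = True) -> str:
    """Mirror the common default of dropping specials unless asked not to."""
    out = bytearray()
    for i in ids:
        if i < 256:
            out.append(i)
        elif not skip_special_tokens:
            out.extend(ID_TO_SPECIAL[i].encode("utf-8"))
    return out.decode("utf-8")


def audit_specials() -> list[str]:
    """Return the specials that are not atomic or do not round-trip."""
    failures = []
    for tok in SPECIALS:
        ids = encode(tok)
        if len(ids) != 1 or decode(ids, skip_special_tokens=False) != tok:
            failures.append(tok)
    return failures


print(audit_specials())
```

Swap `skip_special_tokens=False` for the default inside `audit_specials` and every special "fails", which is exactly the misleading audit described above.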
Now the data
The tokenizer is the eye. The corpus is what you put in front of it.
A note before I get into specifics. I'm holding the exact pretraining manifest private until a license/attribution review wraps up. The landscape around training data has gotten genuinely thorny over the last year, and I'd rather under-share than misrepresent a source's terms. What I can talk about is the shape of the curation strategy: how I tier sources, how I weight them, how I decontaminate. That part is universal, and it's where the lesson is.
The conventional wisdom on data is "more is better." For small models that's incomplete at best. Every parameter the model has is a small piece of representational space that you have to teach something useful. A 1B model has roughly a billion of those slots. You don't have so many tokens to spare that you can afford to fill them with noise.
Two framings are worth knowing. They optimize for different things and people confuse them.
The first is Chinchilla (Hoffmann et al., DeepMind, 2022). For compute-optimal training, given a fixed FLOP budget, the optimal split between model size and token count works out to roughly 20 tokens per parameter: D ≈ 20 × N, where N is the parameter count and D is the number of training tokens.
So a 1B model wants ~20B tokens, a 7B wants ~140B. Train past that and you're "wasting compute" in the sense that the same FLOP budget could have built a bigger model with less data and lower final loss.
That law is real and at the time DeepMind published it, it was the right answer. The catch is that it optimizes for training cost, not inference cost. If you're going to serve the model billions of times, inference dwarfs training and the calculus flips.
The second framing is the one that flipped the calculus. People call it inference-optimal scaling or, less formally, over-training. Llama 3 is the public example most people point to: it shows the modern preference for heavily over-trained smaller models when inference cost matters. Phi-3 pushed in the same direction, combining over-training with aggressive curation: smaller model, more tokens, only the highest-quality ones, with the best slices repeated. Quality matters more, and it can be repeated.
The combined rule of thumb that's emerged for from-scratch models in 2026 looks something like this:
- Train as small a model as you can get away with for your downstream eval target.
- Train it on 50 to 100× the Chinchilla-optimal token count. For a 7B that's 7 to 14T tokens (Llama 3 8B did 15T). For a 1B, 1 to 2T tokens. Don't confuse this with 5 to 10× Chinchilla, which is a much weaker over-training regime.
- Make the tokens count: aggressive quality weighting, decontamination, stage-2 cooldown.
- Replay the best slices a few times. Repeating high-quality data helps. Repeating raw web hurts.
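The multiples in that list are quick to check. A minimal sketch, assuming the round 20-tokens-per-parameter figure and nominal parameter counts; the function names are illustrative.

```python
# Over-training factor relative to the ~20 tokens/param Chinchilla rule of thumb.
CHINCHILLA_TOKENS_PER_PARAM = 20


def chinchilla_tokens(n_params: int) -> int:
    """Compute-optimal token count under the 20x rule of thumb."""
    return CHINCHILLA_TOKENS_PER_PARAM * n_params


def overtraining_factor(n_params: int, n_tokens: int) -> float:
    """How many times the Chinchilla-optimal token count a run uses."""
    return n_tokens / chinchilla_tokens(n_params)


B, T = 1_000_000_000, 1_000_000_000_000

print(chinchilla_tokens(7 * B) / B, "B tokens is Chinchilla-optimal for 7B")
print(overtraining_factor(7 * B, 7 * T), "x for 7B on 7T tokens")
print(overtraining_factor(8 * B, 15 * T), "x for an 8B model on 15T tokens")
```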
To anchor those numbers in real models, here's where a few public training runs sit on the over-training spectrum:
That's the frame everything below sits inside.
Bigger models can absorb noise. They have enough capacity that a few hundred billion noisy tokens average out into something useful. Small models can't. Every token is precious. The question stops being "do I have enough tokens" and starts being "how educational is each token."
That changes the strategy from filtering to weighting.
From filter to weight
The old way of curating data:
- Crawl a lot.
- Run a quality classifier.
- Keep the top X%.
- Train uniformly on the survivors.
The new way:
- Identify your sources by category and quality tier.
- Assign each tier a multiplier. How many times more often the model should see this slice during training.
- Build a training manifest where each shard has a sampling weight that combines the domain ratio with the quality multiplier.
- The trainer samples shards by weight, so the model effectively sees high-quality content multiple times per epoch, while still being exposed to the broader distribution.
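The math: a shard's sampling weight is its domain's share of the mix, multiplied by that shard's share of the multiplier-weighted tokens inside the domain (multiplier × token count, normalized across the domain's shards).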
What this lever controls is the per-token sampling rate inside a domain. In code, for example, a token from a ×4 shard gets four times the sampling probability of a token from a ×1 shard. That's not the same as saying the shard itself gets sampled four times as often. Bigger shards still contribute more total probability mass because they have more tokens. The multiplier reweights each of those tokens.
So in practice, a small high-quality shard at ×4 ends up over-sampled (the model sees its tokens roughly four times as often as tokens from a ×1 shard in the same domain), while a noisy shard at ×0.5 stays under-saturated even if it's enormous.
The framework matters more than the specific multipliers. The exact number I assign to each slice depends on my corpus. Yours will be different. The pattern itself, "tag each shard, weight by tier, sample by weight," is universal.
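As a sketch of the "tag each shard, weight by tier, sample by weight" pattern, the snippet below computes per-shard sampling probabilities. Everything in it is an illustrative assumption: the `Shard` fields, the example shards, the multipliers, and the domain ratios are made up, and the ratios are assumed to sum to 1.

```python
from dataclasses import dataclass


@dataclass
class Shard:
    name: str
    domain: str
    tokens: int
    multiplier: float   # quality-tier multiplier (hypothetical values below)


def shard_probabilities(shards: list[Shard],
                        domain_ratios: dict[str, float]) -> dict[str, float]:
    """P(shard) = domain ratio * (multiplier * tokens) / domain's weighted mass."""
    mass: dict[str, float] = {}
    for s in shards:
        mass[s.domain] = mass.get(s.domain, 0.0) + s.multiplier * s.tokens
    return {
        s.name: domain_ratios[s.domain] * s.multiplier * s.tokens / mass[s.domain]
        for s in shards
    }


shards = [
    Shard("code_hq", "code", 1_000_000, 4.0),
    Shard("code_raw", "code", 10_000_000, 1.0),
    Shard("web_noisy", "web", 50_000_000, 0.5),
]
probs = shard_probabilities(shards, {"code": 0.3, "web": 0.7})
for name, p in probs.items():
    print(f"{name}: {p:.4f}")
```

Note that `code_raw` still gets more total probability than `code_hq` because it has ten times the tokens; the multiplier changes the per-token rate, not the ranking of shards by size.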
Stage-2 cooldown
You don't train on the same mix the entire run. Nobody told me this. I had to find it in published recipes.
For the first 80 to 90 percent of training tokens, you train on the full quality-weighted distribution above. The model needs breadth early on. It's still figuring out what English is, what code is, what numbers mean. Some of the noisier tiers do real work here. They teach the model the wider distribution of human writing, the ugly parts included. You want that. A model that's only seen polished text panics on real-world inputs.
For the final 10 to 20 percent, you flip a switch. Drop the lower-quality tiers. Concentrate sampling on the top two tiers. Decay the learning rate. The terminology across labs is a mess. OLMo 2 makes this explicit and calls it "mid-training" with a documented mid-training mix; the AI2 paper is the cleanest public report. Qwen 2.5 discusses staged shifts in data mixture and quality emphasis toward the end of pretraining without using the same name. People informally call the whole family "annealing" because the learning rate decays at the same time, but that's borrowing language from the LR schedule. The data-mix shift is what's doing the work.
The published gains are real but I'd be careful giving you tight numbers. OLMo 2's mid-training intervention adds several points across knowledge and reasoning benchmarks; the magnitude depends heavily on the base mix you started from and how aggressive your stage-2 multipliers are. "A few points on MMLU-style knowledge tasks, a meaningfully larger jump on coding tasks where the high-quality slice is denser" is closer to the truth than any precise table I could write before running it myself.
The intuition I've come to: late in training the model has stable representations of the world. What it needs at that point is to overfit lightly to the highest-quality patterns, the ones it'll be evaluated on and used for. The early raw exposure gave it the margin to handle weird inputs. The late cooldown sharpens it on the patterns that matter.
The mistake people make is doing only the cooldown mix the entire run. You end up with a model that's eloquent inside the textbook distribution and brittle outside it. The early raw exposure is what gives you that robustness margin. Skip it and your model is fragile in deployment.
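The stage switch can be sketched in a few lines. This is a hypothetical illustration, not the DarkForest manifest code: tier numbers (1 is best), multipliers, and the 15% cooldown fraction are assumptions chosen from inside the 10 to 20 percent range above.

```python
def stage_for(tokens_seen: int, total_tokens: int,
              cooldown_fraction: float = 0.15) -> str:
    """Return which manifest applies at this point in the token budget."""
    return "cooldown" if tokens_seen / total_tokens >= 1.0 - cooldown_fraction else "main"


def cooldown_multipliers(multipliers_by_tier: dict[int, float],
                         keep_top: int = 2) -> dict[int, float]:
    """Zero out every tier except the best `keep_top` (lower number = better)."""
    kept = set(sorted(multipliers_by_tier)[:keep_top])
    return {tier: (m if tier in kept else 0.0)
            for tier, m in multipliers_by_tier.items()}


main_mix = {1: 4.0, 2: 2.0, 3: 1.0, 4: 0.5}   # hypothetical tier multipliers
print(stage_for(50, 100), main_mix)
print(stage_for(90, 100), cooldown_multipliers(main_mix))
```

The learning-rate decay that accompanies the switch lives in the optimizer schedule and is deliberately left out here.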
Decontamination, briefly
There's one more thing that has to happen before any of this matters. If your training data contains the exact text of the benchmark you're evaluating on, your benchmark numbers are a lie. The model isn't reasoning. It's remembering.
This isn't theoretical. I scanned the entire training pool against the standard small-model benchmarks. Found documents where the literal benchmark question appeared verbatim, every n-gram in place, in some random repository someone had pushed to the public internet. If I'd let those into training, the resulting benchmark scores would have been inflated by pure recall. They wouldn't have meant anything.
The decontamination step itself is conceptually simple, but it has one subtlety that's easy to get wrong. Take the eval items, compute their word 13-grams, hash each with blake2b truncated to 64 bits. The naive approach is to dump those hashes into one big global set, then for every training document compute its own 13-grams, hash, and check membership. That tells you whether the doc shares some n-grams with some eval item. It can't tell you whether 80% of any one specific eval item showed up in a single doc, which is the contamination criterion that matters.
The fix is to keep an inverted index from hash to the list of eval items containing that 13-gram, also known as a postings list. For each training document, walk its 13-grams, look each one up in the index, and increment a per-eval-item hit counter. If any single eval item's counter exceeds 80% of its own 13-gram count, flag the doc.
There's one more thing. Hash collisions exist; a 64-bit hash on 13 words has a very low but nonzero false-positive rate. After a doc trips the 80% threshold against an eval item, do an exact-text verification pass: pull the actual 13-gram strings and confirm they really do appear verbatim in the document. This rules out the case where you flag a doc for a hash collision instead of real contamination. In practice this affects a tiny fraction of flagged docs, but it's nearly free at this point in the pipeline and worth doing.
The brute-force alternative, comparing every document against every eval item, is roughly O(N_documents × N_eval_items). For a corpus of 250 million documents and 30,000 eval items, that projected to weeks of compute. The inverted-index approach makes each doc's check time proportional to its own length, regardless of eval-set size. The vast majority of documents have zero eval n-grams and exit on the fast path; only the rare hits go through the per-item counter and the exact-text confirmation step.
Took my scan from "weeks, infeasible" to "33 hours, done." I almost shipped without doing this step at the right level of rigor. I'm glad I didn't.
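The inverted-index scan described in this section can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not the production scanner: words are split on whitespace with no normalization, function names are made up, and everything is held in memory.

```python
import hashlib
from collections import Counter, defaultdict

N = 13            # word n-gram size
THRESHOLD = 0.80  # fraction of one eval item's n-grams that must appear


def ngrams(text: str, n: int = N) -> list[str]:
    """Word n-grams; assumes plain whitespace tokenization."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def h64(gram: str) -> int:
    """blake2b truncated to 64 bits."""
    digest = hashlib.blake2b(gram.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def build_index(eval_items: dict[str, str]):
    """Postings list: hash -> ids of eval items containing that n-gram."""
    index: dict[int, set[str]] = defaultdict(set)
    gram_sets: dict[str, set[str]] = {}
    for item_id, text in eval_items.items():
        gram_sets[item_id] = set(ngrams(text))
        for gram in gram_sets[item_id]:
            index[h64(gram)].add(item_id)
    return index, gram_sets


def contaminated_items(doc: str, index, gram_sets) -> list[str]:
    """Eval items with more than THRESHOLD of their n-grams in this doc."""
    doc_grams = set(ngrams(doc))
    hits: Counter = Counter()
    for hashed in {h64(g) for g in doc_grams}:
        for item_id in index.get(hashed, ()):   # empty for most documents
            hits[item_id] += 1
    flagged = []
    for item_id, count in hits.items():
        total = len(gram_sets[item_id])
        if count > THRESHOLD * total:
            # Exact-text pass: rule out hash collisions before flagging.
            exact = len(gram_sets[item_id] & doc_grams)
            if exact > THRESHOLD * total:
                flagged.append(item_id)
    return flagged


eval_items = {"q1": " ".join(f"w{i}" for i in range(20))}
index, gram_sets = build_index(eval_items)
print(contaminated_items("intro " + eval_items["q1"] + " outro", index, gram_sets))
```

Eval items shorter than 13 words produce no n-grams in this sketch and would need a separate rule.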
What it looks like at the end
You end up with a pipeline that has these pieces:
- A tokenizer designed for your data, audited end-to-end
- A decontamination pass against your eval set
- Tagged shards, each with a domain and quality tier
- A weighted manifest that the trainer reads to sample shards
- A second cooldown manifest for the final phase
It's a lot of plumbing, and plumbing isn't glamorous. Skip any of these pieces and your final model degrades silently. You won't know until you're three weeks into training and your benchmarks look mediocre and you can't tell why.
As of this writing, DarkForest is in the middle of its pretraining run, about halfway through its planned token budget. The loss curve is behaving, throughput is stable, and the stage-2 cooldown manifest is staged for the final phase. If the multipliers turn out to be wrong, or the cooldown doesn't transfer the way the published numbers suggest, I'll see it in the next stretch of the run, not in some hypothetical future one. None of it is guaranteed yet. But the data pipeline is no longer the unknown it was a month ago.
The engineering pieces are reusable, even if your conclusions about your own data end up different from mine. A custom 128K Byte-BPE tokenizer designed for your specific mix is a one-time cost that pays off for years. A quality-weighted manifest is a 200-line Python script. A cooldown manifest is the same script with different multipliers. Decontamination is the difference between a model and a model that's lying about its scores.
The thing nobody else can do for you is the judgment about which of your sources belong in which tier. That's the real work of training from scratch in 2026. Everything else is engineering, and I've shown you most of it.
Next post: the architecture, the optimizer, the proxy sweep numbers. Including which of these multipliers survive contact with reality and which don't.
DarkForest by BlackwoodAI. Mid-training, May 2026.