Training a 7B LLM From Scratch
No fine-tuning. No LoRA. No base model. Training a 7 billion parameter language model from random weights, on rented H100s. General-purpose first, then specialized for code through SFT and RL on real agent trajectories.
Training a 7B parameter LLM from scratch. Not fine-tuning someone else's model. From random weights, raw data, rented GPUs.
Not a code model. Not a chat model. General-purpose base model first, trained on code, text, and math. Code specialization comes later through SFT and reinforcement learning on real coding trajectories I've been collecting for over a year. Build the foundation right, then sharpen it.
First post in a series. Architecture, data, training, failures, what actually works.
Why from scratch
You don't understand a thing until you build it. I read the transformer paper multiple times. Didn't really get attention until I wrote the RoPE rotation matrices myself and spent two hours debugging why GQA produced NaN after 200 steps.
Fine-tuning teaches you the API. Training from scratch teaches you the machine.
Every open model carries someone else's decisions baked in. Llama's tokenizer was optimized for Meta's crawl data. Qwen's architecture reflects Alibaba's priorities. When you fine-tune, you inherit all of that. You don't even know what decisions were made, let alone why.
I wanted to own the whole stack. Tokenizer, architecture, data mix, hyperparameters, training recipe. Every choice mine. When something breaks I know exactly where to look.
Architecture
Decoder-only transformer. Same family as Llama, Mistral, most modern open LLMs. Here's what I picked and why.
The core building blocks, starting with attention:
RoPE for position encoding. Original transformer used sinusoidal positional encoding. RoPE rotates query and key vectors in attention instead. Each dimension pair gets rotated by an angle proportional to position. Attention scores between two tokens depend on their relative distance, not absolute positions. Helps the model generalize to unseen sequence lengths. Nearly every recent open LLM uses it.
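A minimal numpy sketch of the rotation, using the half-split pairing convention (implementations differ on how dimensions are paired):

```python
import numpy as np

def rope(x, base=10000.0):
    """Rotary position embedding for x of shape (seq_len, dim).

    Half-split pairing: dimension i is paired with dimension i + dim//2,
    and each pair is rotated by angle position * base**(-2i/dim)."""
    seq_len, dim = x.shape
    half = dim // 2
    freqs = base ** (-np.arange(half) * 2.0 / dim)     # per-pair frequency
    angles = np.outer(np.arange(seq_len), freqs)       # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin,        # 2D rotation per pair
                           x1 * sin + x2 * cos], axis=-1)
```

The property that matters: the dot product between two rotated vectors depends only on their relative offset, not their absolute positions.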
SwiGLU in the feed-forward blocks. Standard FFN: project up, ReLU/GELU, project down. SwiGLU adds a gate:
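A sketch of the gated block, assuming the common formulation `down(silu(gate(x)) * up(x))`:

```python
import numpy as np

def swiglu_ffn(x, w_gate, w_up, w_down):
    """SwiGLU FFN: the SiLU-activated gate path multiplies the up path
    elementwise before the down projection."""
    silu = lambda z: z / (1.0 + np.exp(-z))   # SiLU / Swish activation
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down
```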
Three weight matrices instead of two. FFN multiplier drops to ~2.67x to keep param count the same. Small consistent improvements across benchmarks. Llama, Mistral, Gemma, PaLM, DeepSeek all use it. Free gains.
RMSNorm instead of LayerNorm. Drops the mean computation. Just divides by root mean square. ~5-10% faster, works comparably. One less thing to go wrong.
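The whole operation fits in three lines:

```python
import numpy as np

def rmsnorm(x, gain, eps=1e-6):
    """RMSNorm: divide by the root mean square of the features and apply a
    learned per-dimension gain. No mean subtraction, no bias."""
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms * gain
```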
GQA (Grouped Query Attention) is the interesting one. Standard multi-head attention has N query heads and N KV heads. GQA shares KV heads across multiple query heads. 10 query heads, 5 KV heads means each KV head serves 2 query heads. KV cache halved. Llama 3 70B uses 8:1. Mistral 7B uses 4:1.
I started at 5:1. At 100M scale the model doesn't have enough capacity to absorb that much KV compression. Going to 2:1 (10 query heads, 5 KV heads) gave a massive improvement. At 7B I'll move to 4:1, where the memory savings actually matter.
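The sharing itself is just a repeat of KV heads across query-head groups. A sketch at the 2:1 ratio:

```python
import numpy as np

def gqa_scores(q, k):
    """q: (n_q_heads, seq, d_head), k: (n_kv_heads, seq, d_head).
    Each KV head serves n_q_heads // n_kv_heads query heads, so the KV
    cache shrinks by that factor."""
    group = q.shape[0] // k.shape[0]
    k_shared = np.repeat(k, group, axis=0)          # [k0, k0, k1, k1, ...]
    return q @ k_shared.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
```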
Tied embeddings. Input embedding matrix and output projection share the same weights. Saves 32M params at 100M scale. 200M+ at 7B. Model uses the same representation space for reading and generating. Not free though. The input embedding and output projection want different things. One maps tokens into continuous space, the other maps hidden states back to probabilities. Llama and Mistral untie them at 7B for this reason. I'm starting tied because at 100M the param savings matter. At 7B the 262M saved is only ~3.7% of total params so I'll untie if validation shows it helps.
Full architecture:
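The 100M configuration from this post, collected into one sketch. `ffn_hidden` and `vocab_size` are my assumptions (roughly 2.67 × d_model for SwiGLU, and the 64K tokenizer mentioned later), not stated values:

```python
# 100M-scale config as described in the post. ffn_hidden and vocab_size
# are assumptions, not confirmed numbers.
config = {
    "n_layers": 16,
    "d_model": 640,
    "n_query_heads": 10,
    "n_kv_heads": 5,             # 2:1 GQA
    "ffn_hidden": 1707,          # assumed: ~2.67 x d_model for SwiGLU
    "vocab_size": 64_000,        # assumed: the custom 64K BPE tokenizer
    "position_encoding": "rope",
    "activation": "swiglu",
    "norm": "rmsnorm",
    "tied_embeddings": True,
}
```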
The scaling ladder
Not jumping straight to 7B. One bad 7B run costs thousands of dollars. A 100M run costs $2.
The path: 100M, 350M, 1B, 7B.
All runs on rented GPUs across Verda (formerly DataCrunch) and Vast.ai. H100 SXM5 spot instances run $1.49-2.25/hr depending on availability. H200 clusters for the larger runs. No reserved clusters, no long-term contracts. Scale up when needed, shut down when done.
100M is where you explore. 16 layers, d_model=640, 10 heads. ~101M params. 35 minutes on a single H100, roughly $2. I test 20 ideas per day at this scale. Learning rate schedules, data mix ratios, architectural tweaks. Iterate fast, fail cheap.
Nobody cares about a 100M model. The goal is finding a recipe that transfers when you scale up.
How do I know if something transfers? Hard part. Some things that improve 100M don't improve 350M. The signal: if a change improves validation AND tightens the train-val gap AND doesn't rely on memorization, it's probably real.
350M is the transfer test. 24 layers, d_model=1024, 16 heads. Take the best 100M recipe and see if the ordering holds. If change A beat change B at 100M, does it still win at 350M? If yes, the recipe works.
1B is where the recipe gets locked. No experimentation. Confirmation, not discovery.
Skipping 3B. I've spent years building distributed systems. Comfortable jumping from 1B straight to 7B. A 3B run would catch distributed training bugs that never show at 1B. Bad throughput from parallelism and sharding choices. Checkpoint, restart, fault tolerance issues. Activation memory surprises at scale. Whether the 1B recipe still transfers before you spend real money. I know those failure modes. Built systems that hit them. If something unexpected breaks at 7B I'll slot a 3B run in the middle. But the plan is to skip it.
7B is the real thing. Multiple days, multiple GPUs, 2-3T tokens. Every mistake at 100M that you didn't catch costs 70x more here.
What I've found so far
GQA ratio matters more than expected. Going from 5:1 to 2:1 gave a 0.13 improvement in validation loss. Enormous at this scale. At small model sizes every parameter matters. Aggressively compressing KV representations throws away capacity the model needs.
LR schedule bugs don't crash. They just make everything subtly worse. I found the LR update was applied after optimizer.step() instead of before, so every update ran on the previous step's learning rate. Model still trained. Loss still went down. Fixing the order gave a noticeable improvement. Worst kind of bug. Invisible unless you're comparing against a correct baseline.
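For reference, a warmup-then-cosine schedule of the kind commonly used here, with illustrative values (not the post's). The point of the fix: the LR for step t must be in place before the optimizer update at step t consumes it.

```python
import math

def lr_at(step, max_lr=3e-4, min_lr=3e-5, warmup=100, total=1000):
    """Warmup-then-cosine decay. Hyperparameters are illustrative."""
    if step < warmup:
        return max_lr * (step + 1) / warmup           # linear warmup
    t = (step - warmup) / max(1, total - warmup)      # 0 -> 1 over decay
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * t))

# Correct ordering per training step:
#   1. lr = lr_at(step); set it on the optimizer
#   2. optimizer.step()  -- the update uses this step's LR
# The bug class above: setting the LR after step(), so every update
# ran one step behind the schedule.
```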
Not every trick from every paper works. Gemma uses embedding scaling, multiplying the embedding output by ~25x. Their paper reports improvements. In my setup with tied embeddings and pre-norm RMSNorm, it made things significantly worse. Pre-norm already handles the scale. Adding another scaling factor blows up activation magnitudes.
Read papers. Test everything yourself. Experimental conditions are never identical.
Train-val gap tells you more than raw loss. A model with low training loss but much higher validation loss is memorizing. At 7B with trillions of tokens the gap naturally shrinks. At 100M a tight gap means the recipe is sound.
Automated experiment loops. Karpathy popularized this with his autoresearch repo. Give an AI agent a training script, let it propose changes, run experiments, keep what works, revert what doesn't. I do something similar but with multiple agents. Claude and GPT propose changes independently, review each other's work, then submit to H100s on Modal in parallel. Runs 24/7 across scales.
Code trajectories: the data nobody talks about
This is where it gets interesting.
I've been collecting real code trajectories since Cursor started getting good in early 2025. Then Cline came along. Claude Code. Windsurf, Roo Code, Codex, Devin. The whole explosion of AI coding agents.
Every one of these tools generates something valuable: complete trajectories. Human describes a task. Agent thinks about it, reads files, writes code, runs tests, hits errors, debugs, retries, produces working code. That whole sequence is a trajectory.
Most people throw away the process. I saved everything.
| Source | Collecting Since | What I Get |
|---|---|---|
| Cursor | Early 2025 | IDE-integrated trajectories, tab completions, inline edits |
| Cline (Claude Dev) | Early 2025 | First-gen autonomous agent trajectories, basic edit sequences |
| Windsurf | Mid 2025 | Cascade agent flows, deep codebase reasoning |
| Claude Code | Mid 2025 | Extended thinking chains, multi-file reasoning |
| Roo Code | Mid 2025 | Alternative agent strategies, different tool usage |
| Codex (autonomous) | 2026 | Clean autonomous edit sequences with explanations |
Good runs where the agent nails it first try. Messy runs where it takes four attempts and a stack trace. Failures. All of it.
Why trajectories matter more than static code. Code in a repo teaches a model what correct code looks like. Trajectories teach it how to get there. The thinking. The debugging loop. When to run tests, how to recover from a wrong approach. Difference between knowing the answer and knowing how to find it.
These tools generate data at a quality level that didn't exist a year ago. Claude Code with extended thinking produces reasoning chains that are genuinely instructive. Codex autonomous mode gives clean edit sequences. Real multi-step problem solving with real codebases. Not synthetic "write a function that..." prompts.
The RL opportunity. Each trajectory is a gym episode. Start state: codebase before changes. Action space: read file, edit file, run command. Reward: tests pass or fail.
You can replay trajectories for reward models. Successful ones for rejection sampling. Set up execution-based RL where the agent runs code and gets feedback. The tools did the hard part. Thousands of expert demonstrations in real codebases. That data trains the next generation.
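A sketch of what one stored episode might look like. The field names here are hypothetical, not a real schema:

```python
from dataclasses import dataclass

@dataclass
class Step:
    action: str        # hypothetical: "read_file" | "edit_file" | "run_command"
    args: dict         # action parameters
    observation: str   # file contents, diff, or command output

@dataclass
class Trajectory:
    task: str          # the human's task description
    start_commit: str  # codebase state before any changes
    steps: list        # ordered agent actions
    tests_passed: bool # terminal outcome

def terminal_reward(traj: Trajectory) -> float:
    """Binary execution-grounded reward: did the suite pass at the end?"""
    return 1.0 if traj.tests_passed else 0.0
```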
This is the real edge. Everyone has GitHub repos. Everyone can download Stack v2. But curated, real-world coding trajectories with thinking traces, debugging loops, verified outcomes? Barely exists in the open. I've been collecting it for over a year.
General first, code second
This is not a code model. General-purpose base model first, specialized for code through post-training.
Base model sees code, text, math during pretraining. Broad language understanding, reasoning, world knowledge, coding patterns all at once. Much stronger foundation than training exclusively on code.
Specialization happens after:
| Phase | What | Data | Key Detail |
|---|---|---|---|
| SFT Phase 1 | General instruction following | Instruction-response pairs across all domains | Turns base model into instruction follower |
| SFT Phase 2 | Code-specific fine-tuning | Generation, debugging, refactoring, test writing | 20%+ general/math interleaved to prevent forgetting |
| RL | Execution-based reinforcement learning | Code trajectories with test outcomes | Reward from actual execution, not learned reward model |
Phase 2 keeps 20%+ general and math data alongside code. Interleaved replay. Without it the model catastrophically forgets general capabilities. I've seen this happen in test runs. Drop the replay and within 500 steps the model can't answer basic questions anymore.
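A sketch of what interleaved replay sampling could look like, under assumed fractions (80% code, the rest split between general and math):

```python
import random

def sample_domains(n, code_frac=0.8, seed=0):
    """Pick a training domain per example so replay data stays interleaved
    throughout training rather than front-loaded. Fractions illustrative."""
    rng = random.Random(seed)
    out = []
    for _ in range(n):
        r = rng.random()
        if r < code_frac:
            out.append("code")
        elif r < code_frac + 0.1:
            out.append("general")
        else:
            out.append("math")
    return out
```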
RL: Code has verifiable outcomes. Tests pass or they don't. Use execution-based rewards:
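One plausible shaping, assuming per-trajectory test counts are recorded; the bonus term is my assumption, not the post's:

```python
def execution_reward(passed: int, total: int, green_bonus: float = 0.5) -> float:
    """Reward grounded in execution: fraction of tests passing, plus a bonus
    for a fully green suite. The shaping is illustrative."""
    if total == 0:
        return 0.0              # no tests -> no signal
    return passed / total + (green_bonus if passed == total else 0.0)
```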
Cleaner than RLHF for code because the reward is grounded in execution. Harder for the model to game, though reward hacking still happens with weak test suites. KL penalty against the SFT model prevents policy drift. Rejection sampling: generate multiple solutions, execute all, train on the ones that pass.
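The rejection-sampling loop in miniature. `generate` and `execute` are stand-ins for the model and a sandboxed test runner, assumptions about the pipeline's shape:

```python
def rejection_sample(prompt, generate, execute, k=8):
    """Sample k candidate solutions, execute each, keep only the passers.
    Kept candidates become training targets for the next round."""
    candidates = [generate(prompt) for _ in range(k)]
    return [c for c in candidates if execute(c)]
```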
General model with broad capabilities. SFT adjusts the distribution toward code. RL refines for correctness. Interleaved replay maintains the foundation. Model writes correct code because it understands language and logic, not despite lacking those skills.
Targets (aspirational). Honestly hard to set. Dense 7B models score in the low single digits on SWE-bench Verified today. SWE-Llama 7B got 3-4%. The good scores from small models come from MoE architectures with 80B+ total params and ~3B active. A dense 7B trained from scratch with execution-based RL and real trajectories is uncharted. I'm hoping for 10-15% standalone, 20-30% with scaffolding. That would be a real result for a from-scratch 7B. Anything higher and I'll be surprised.
Data
Petabytes of raw data collected across three domains. Most of it is noise. After dedup and quality filtering, training uses 2-3T tokens of the best stuff. Massive funnel, tiny output.
| Domain | Mix % | Sources | Purpose |
|---|---|---|---|
| Code | 55% | Quality GitHub repos, Stack v2, private sources | Core competency |
| Text | 25% | Web text, books, technical writing, private collections | Language understanding, world knowledge |
| Math | 20% | Formal proofs, competitive programming, arXiv, private sources | Reasoning backbone |
Filtered aggressively. No minified bundles, no generated code, no vendored deps, no copy-pasted boilerplate. File size, line length, byte entropy, directory pattern checks on everything.
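A sketch of a filter in that spirit; the thresholds are mine, not the actual pipeline's:

```python
import math
from collections import Counter

def byte_entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte; minified or binary blobs run high."""
    counts = Counter(data)
    n = len(data)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def keep_file(path: str, text: str,
              max_bytes=1_000_000, max_line_len=1_000, max_entropy=6.0):
    """Illustrative quality gate: directory patterns, size, line length,
    byte entropy. Thresholds are assumptions."""
    if "/vendor/" in path or "/node_modules/" in path:
        return False                        # vendored dependencies
    raw = text.encode("utf-8")
    if len(raw) > max_bytes:
        return False                        # oversized files
    if max(map(len, text.splitlines()), default=0) > max_line_len:
        return False                        # minified bundles
    return byte_entropy(raw) <= max_entropy
```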
Over-Chinchilla. Chinchilla says optimal is ~20 tokens per parameter:
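In symbols, with N parameters, the compute-optimal token count D from the Chinchilla paper (Hoffmann et al.) is roughly:

```latex
D_{\mathrm{opt}} \approx 20\,N
```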
For 7B that's ~140B tokens. Petabytes collected, but after dedup and filtering, training on 2-3T tokens. ~300-400 tokens per parameter.
Chinchilla optimizes for training compute. I care about inference. Smaller models that see more data outperform larger models that see less at inference time. Phi-3 showed this at 3.8B with 3.3T tokens. Llama 3 at 8B with 15T.
Quality is the word. Overtraining on garbage memorizes garbage. Every token goes through exact hash + MinHash dedup, quality filtering, cross-source dedup.
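MinHash in miniature, using salted SHA-1 as the hash family. Real pipelines use faster hashes and LSH banding on top; this just shows the estimator:

```python
import hashlib

def shingles(text, k=5):
    """Word k-grams used as the document's feature set."""
    words = text.split()
    return {" ".join(words[i:i + k]) for i in range(max(0, len(words) - k + 1))}

def minhash(text, num_hashes=64):
    """MinHash signature: per hash function, the minimum hashed shingle."""
    sig = []
    for seed in range(num_hashes):
        salt = str(seed).encode()
        sig.append(min(
            int.from_bytes(hashlib.sha1(salt + s.encode()).digest()[:8], "big")
            for s in shingles(text)))
    return sig

def est_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```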
What this isn't
Learning project. Not competing with Llama or DeepSeek. Meta has hundreds of thousands of GPUs and billion-dollar clusters. I'm renting H100s and H200s by the hour on Verda and Vast.ai.
Training from scratch teaches things fine-tuning never will. Why certain learning rates work. Why some data mixes produce better models. Why architectural choices that look good on paper fall apart in practice. Papers tell you what works. Building it tells you why.
Next
100M recipe converged. 350M scaling test done, architecture validated. Moved to 1B. Custom BPE tokenizer (64K vocab) trained on the actual data mix.
Next articles: data pipeline, deduplication at scale, building a custom tokenizer, scaling training across multiple servers.