How Matrix Multiplication Learned to Refactor Code
- How Matrix Multiplication Learned to Refactor Code
- A Brief History of Language Models
- Inside the Machine: How Text Becomes Tokens
- The Transformer
- From Model to Agent — The Loop
- Tool Calling - How Models Reach the Real World
- RAG is DEAD
- The Context Window - Memory and Its Limits
- Caching - Why Your Second Message Is Faster
- Reasoning Tokens — Thinking Before Answering
- Harnesses, Tooling, and the Ecosystem
- Stayin' Grounded
- Epilogue: It Is All Matrix Multiplication
A Brief History of Language Models
If you have used the internet in the last couple of years, you have probably used a large language model. Before these models became a normal consumer product, researchers had a simple question: can a machine understand language?
The answer, it turns out, depends entirely on what you mean by "understand."
The Age of Rules
In 1966, Joseph Weizenbaum at MIT created ELIZA,Weizenbaum, J. "ELIZA — A Computer Program for the Study of Natural Language Communication Between Man and Machine." Communications of the ACM, 9(1), 1966. a program that acted like a Rogerian psychotherapist. ELIZA did pattern matching. It scanned your input for keywords and applied rewrite rules. If you typed "I am sad," it might respond "How long have you been sad?" It did not know what sadness is. It did not know what therapy is. It did not even "know" English.
People still got attached to it. Weizenbaum wrote that his own secretary asked him to leave the room so she could talk to the program in private. The lesson is simple: humans project meaning onto text very easily.
ELIZA
A tiny ELIZA clone (1966). It does pattern matching and rewrites. Type a few lines and see what your brain does with it.
A few years later, Terry Winograd built SHRDLU (1971), which could follow commands inside a simulated world of colored blocks. "Pick up the big red block" worked. SHRDLU was brittle. It only worked inside its block world. Extending it to open-ended language did not scale. The rule set ballooned, edge cases piled up, and by the early 1980s this approach hit a wall.
SHRDLU
A browser port of SHRDLU (1971). Click to run it.
Source: santiontanon/SHRDLU. A text adventure based on SHRDLU (1971).
The problem was scale. Language is messy and context-heavy. You cannot write rules for all of it.
The Statistical Revolution
In the 1990s, a lot of NLP shifted to counting. Instead of writing grammar rules, people measured co-occurrence. "Bank" near "river" looks different from "bank" near "money." "New York" is often followed by "City."
N-gram models were common. A bigram model estimates . A trigram model estimates . These models showed up in early speech recognition, machine translation, and spell-checking. They were rough, but they got better as data grew.
The bag-of-words model took an even simpler approach: represent a document as nothing more than the count of each word it contains, ignoring order entirely. "The dog bit the man" and "The man bit the dog" would have identical representations. This is obviously wrong in important ways, and yet bag-of-words powered surprisingly effective spam filters and search engines throughout the 2000s.
These models treated words as IDs. "Cat" was no closer to "kitten" than it was to "democracy." The model had no learned notion of meaning.
Intents, Entities, and Slots
Before LLM chat, a lot of production "chatbots" were task systems. The job was to get a small set of things done: book a flight, reset a password, check an order, schedule an appointment.
The core NLU loop in these systems is intent classification and entity extraction. An intent is the coarse action ("book_flight"). An entity is a specific value in the message ("Paris"). Rasa uses the same primitives: intents and entities are the basic structured output of its NLU pipeline.Rasa docs: "Intents and Entities".
Entities usually end up in slots. A slot is a key-value store that tracks information across the conversation, like destination=Paris or date=tomorrow. Rasa calls this out directly in its glossary.Rasa docs: glossary entry "Slot".
In older NLU literature, "slot filling" is usually token tagging. The model labels each token with a tag like B-destination, I-destination, or O. Named entity recognition is similar. It labels spans with types like PERSON or ORG. In a chatbot, those spans become entities, and entities become slot values.
Once you have intents and slots, you need dialogue management. Some systems do this with hand-written state machines. Some learn a policy from example conversations. Rasa supports both styles with rules and stories, and it trains a dialogue policy over that data.Rasa docs: "Stories" and "Rules".
If you want a research name for the same idea, you will often see "intent detection" and "slot filling" as paired tasks, usually benchmarked on datasets like ATIS.Liu, B. and Lane, I. "Attention-Based Recurrent Neural Network Models for Joint Intent Detection and Slot Filling." arXiv:1609.01454, 2016.
I even wrote a post about making your own crude CoreML powered chatbot in Swift..
Words as Vectors
In 2013, Tomas Mikolov and colleagues at Google published Word2Vec.Mikolov, T., Chen, K., Corrado, G., & Dean, J. "Efficient Estimation of Word Representations in Vector Space." arXiv:1301.3781, 2013. The idea is straightforward. Train a neural network to predict a word from its context (or predict context from a word). The model learns internal representations that cluster related words.
Each word becomes a vector (often a few hundred numbers). Similar words land near each other in that space. "Cat" ends up close to "kitten" and "dog," and far from "democracy."
Some relationships show up as directions. The vector from "man" to "woman" is close to the vector from "king" to "queen." That makes the following arithmetic work in practice:
You can find the same pattern in other pairs. A common example is capitals and countries (paris - france + germany ≈ berlin). Another is verb forms (walking - walk + swim ≈ swimming). This is a learned geometry over words.
The words are plotted in two dimensions (reduced from their original high-dimensional space).
Word Embedding Explorer
Words plotted in 2D. Hover to highlight a cluster. Use the dropdowns for `A - B + C`.
This kicked off a lot of what came next.
Learning to Sequence
Word2Vec gives vectors for words, but word order matters. "The dog chased the cat" is different from "The cat chased the dog." To handle order, researchers used Recurrent Neural Networks (RNNs). If you read Karpathy's blog on RNNs, this section will feel familiar.Karpathy, A. "The Unreasonable Effectiveness of Recurrent Neural Networks." karpathy.github.io/2015/05/21/rnn-effectiveness/
An RNN reads text one word at a time and updates a hidden state. The hidden state is its memory. In theory the state can carry information forward across a whole sentence. In practice, basic RNNs forget earlier tokens as sequences get longer. Gradients shrink during training and earlier information fades out.
Long Short-Term Memory networks (LSTMs), introduced by Hochreiter and Schmidhuber in 1997Hochreiter, S. & Schmidhuber, J. "Long Short-Term Memory." Neural Computation, 9(8), 1997. use gates that decide what to keep and what to drop. LSTMs were widely used in NLP for years, including early neural machine translation systems.
The sequence-to-sequence (seq2seq) architecture, introduced by Sutskever, Vinyals, and Le in 2014,Sutskever, I., Vinyals, O., & Le, Q. V. "Sequence to Sequence Learning with Neural Networks." NeurIPS, 2014. used one LSTM to encode an input sequence and another to decode it into an output sequence. This was the first general architecture for tasks like translation, summarisation, and question answering. But even LSTMs struggled with very long sequences. A 500-word paragraph was pushing the limits. A full document was out of reach.
Attention Please
In 2017, folks at Google published "Attention Is All You Need,"Vaswani, A., Shazeer, N., Parmar, N., et al. "Attention Is All You Need." NeurIPS, 2017.. The transformer architecture they introduced abandoned recurrence entirely. Instead of processing words one at a time, the transformer processed all words simultaneously, using a mechanism called self-attention to let each word attend to every other word in the input.
Self-attention computes, for each word, how relevant every other word is to it. When processing the word "it" in the sentence "The animal didn't cross the street because it was too tired," self-attention learns to strongly connect "it" with "animal." This happens in parallel across all positions, making transformers dramatically faster to train than RNNs and far better at capturing long-range dependencies.
Transformers became the default architecture for language models. BERT and GPT are transformer models.
BERTDevlin, J., Chang, M.-W., Lee, K., & Toutanova, K. "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." NAACL, 2019. (2018) was trained to predict masked words, using both left and right context. GPT-1 and GPT-2 (2018–2019) trained left-to-right, predicting the next token. OpenAI delayed releasing the full GPT-2 weights due to misuse concerns which in hindsight seems like such a marketing stunt.
The Scaling Era
Then the scale changed. GPT-3,Brown, T., Mann, B., Ryder, N., et al. "Language Models are Few-Shot Learners." NeurIPS, 2020. released in 2020, had 175 billion parameters, over 100 times more than GPT-2. Bigger models started picking up behaviours that were weak or missing in smaller ones. GPT-3 could write code, translate between languages, do basic arithmetic, and answer questions about a wide range of topics. A lot of this came from scale and data, not a new training objective for each skill.
The years that followed brought GPT-4, Claude, Gemini, and a proliferation of models both open and closed. Scaling laws, formalized by Kaplan et al. at OpenAI,Kaplan, J., McCandlish, S., Henighan, T., et al. "Scaling Laws for Neural Language Models." arXiv:2001.08361, 2020. showed that model performance improved predictably with more parameters, more data, and more compute. The recipe, it seemed, was simply: make it bigger.
Size was not the whole story. Reinforcement learning from human feedback (RLHF) and similar methods trained models to follow instructions and avoid unsafe or unwanted outputs. This is a big reason chat-style assistants feel usable.
The Agent Era
By 2024, the top models could write and analyse text well. Most of the common interfaces were still conversational. You type a prompt and you get text back.
Agents add tools. A coding agent can edit files, run commands, read the output, and try again. That changes the workflow. I stopped copy-pasting snippets and started caring more about whether the agent can run tests and recover from errors.
The model still generates text, but now the text can trigger actions. The core behaviour is a loop: decide, act, read results, and update the next step. That loop is what the rest of this post covers.
But first, we need to understand the machinery that makes all of this possible: how a transformer actually processes text.
Inside the Machine: How Text Becomes Tokens
Models operate on numbers. The first step is turning text into numbers.
This is the problem of tokenisation, and it is less trivial than it sounds.
Splitting on spaces is not enough. "don't" has punctuation. "New York" has a space but acts like one unit in some contexts. "unhappiness" has parts a model can reuse. You also want a fallback when the model sees a new term or a misspelling.
Modern language models use a technique called Byte Pair Encoding (BPE), originally developed as a data compression algorithm and adapted for NLP by Sennrich et al. in 2016.Sennrich, R., Haddow, B., & Birch, A. "Neural Machine Translation of Rare Words with Subword Units." ACL, 2016. BPE starts with individual characters and iteratively merges the most frequent pairs into new tokens. After thousands of merge operations, the vocabulary converges on a set of subword units that balance two goals: common words like "the" or "and" get their own tokens (efficient), while rare words get broken into recognizable pieces like "un" + "happi" + "ness" (flexible).
The result is a vocabulary of typically 30,000 to 100,000 tokens. Each token gets mapped to an integer ID, and the sequence of IDs is what the model actually processes. The text you type is not what the model sees. The model sees a sequence of numbers. It's just Math!
This is where text becomes token IDs.
Tokeniser Playground
Type text and see how this toy BPE tokeniser splits it into subword pieces.
If you play with the tokeniser above, you will see a few patterns. Very common words like "the" tend to be single tokens. Rarer words split into pieces. "unhappiness" might become something like "un" + "happi" + "ness." That gives the model reusable chunks, so it can share information across related words.
Token count matters because it sets cost. A rough rule is one token is about three-quarters of a word in English. When someone says a model has a 100,000 token context window, that is around 75,000 words. More tokens means more memory, more compute, and more money.
After tokenisation, each token ID maps to an embedding vector. These embeddings are the input to the transformer layers that follow.
The Transformer
The previous section ended with token embeddings: one vector per token. By themselves, those vectors do not include much context. "bank" in "river bank" and "bank" in "bank account" need different context. "it" in "The cat sat on the mat because it was tired" refers to the cat.
Transformers add context with self-attention.
How Self-Attention Works
Self-attention computes a score between each token and every other token. Those scores become weights. The model uses them to combine information from other tokens into an updated vector for the current token.
Example: "The cat sat on the mat because it was tired." When the model updates the token "it," it needs a guess for what "it" refers to. Self-attention computes scores between "it" and every other token. The score for "cat" should be high, since "it" refers to the cat in this sentence. Scores for unrelated tokens should be low. The updated vector for "it" is a weighted mix of other token vectors, based on those scores.
The scores come from three learned projections of each embedding: a query, a key, and a value. The score between two tokens uses the dot product of one token's query and the other token's key. The model normalises scores with softmax, then uses the result as weights to average the value vectors. That weighted average is the attention output.
The mathematical formula is compact:
In plain terms, queries match keys, and the resulting weights mix the values.
Self-Attention Visualiser
Pick a sentence. Click a word to see its attention weights. Darker arcs mean higher weight.
Try clicking "it" in the first sentence. You should see arcs connecting it to "cat". In that sentence, "it" refers to the cat. Now click "bank" in the second sentence. It attends to words like "river" and "eroded," because "bank" here means a riverbank. The same token can map to different context depending on nearby words.
Multi-Head Attention
Self-attention also has to track different relationships at the same time. In "She gave her dog a bone", the token "gave" connects to who did the action and what happened to what. A single attention pattern can miss some of this.
The solution is multi-head attention. Instead of computing one attention pattern, the model computes multiple patterns in parallel. Each head has its own learned query, key, and value projections. In practice, some heads focus on syntax and some focus on reference links like pronouns. After heads compute their outputs, the model concatenates them and projects back to the hidden dimension.
Heads learn patterns through training. People who have inspected heads have found some that track syntax, some that track coreference, and some that focus on local neighborhoods. Some heads are hard to interpret.
The demo below shows three attention heads over the same sentence. Each head focuses on different connections.
Multi-Head Attention Visualiser
Three attention heads on the sentence "She gave her dog a bone". Click a word to compare the heads.
The Full Transformer Block
Self-attention is the star of the show, but it does not work alone. A transformer block is a carefully designed sequence of operations, each serving a specific purpose.
Input embeddings map each token ID to a dense vector. If the hidden size is 4,096, each token becomes 4,096 numbers. These embeddings are learned during training.
Positional encoding adds location in the sequence. Self-attention does not carry position by default, so the model needs a signal for order. Without position information, "the cat sat on the mat" and "mat the on sat cat the" can look too similar.
Multi-head attention runs self-attention in parallel heads and combines the results.
Add & normalise usually means a residual connection plus layer normalisation. The residual path helps training stay stable and helps gradients flow.
The feed-forward network is an MLP applied per token position. It transforms each token vector after attention.
Then, we repeat the residual + normalisation step and pass the output to the next block.
Modern models stack many of these blocks. GPT-3 had 96 transformer blocks. With depth, representations change layer by layer. You can think of it as repeated mixing (attention) and per-token transformation (MLP).
Why Transformers Won
Transformers trained well on modern hardware and handled long contexts better than older sequence models. Self-attention is parallel across tokens, so training uses GPUs efficiently. Attention also gives a direct path between far-apart tokens, which helps with long-range references.
Transformers also scaled. Performance improved in a fairly predictable way as people increased parameters, data, and compute. Scaling laws captured some of this behaviour.The key finding: model loss scales as a power law with compute , dataset size , and parameters (for parameters alone, one reported fit is roughly ).
Now connect this back to a chat or coding agent. Your text is tokenized, embedded, and processed through many transformer blocks. The model outputs a probability distribution for the next token, samples a token, appends it, and repeats. That is text generation. An agent adds a loop around generation.
The playground below lets you watch this process unfold. Pick a prompt, and step through generation one token at a time. At each step, the model produces a probability distribution over candidate next tokens — the bar chart shows the top contenders. Adjust the temperature to see how it reshapes the distribution: at zero, the model always picks the most likely token; turn it up, and less probable tokens get a fighting chance.
Next-Token Prediction
Pick a prompt. Click Next token to sample. Temperature changes randomness.
From Model to Agent — The Loop
A coding agent is mostly boring plumbing. The core is a model call inside a loop, plus some tool execution. As the people at Amp put it: "an LLM, a loop, and sufficient tokens."See "How to Build an Agent".
The pseudocode for a coding agent fits on a napkin:
messages = [system_prompt, user_request]
while True:
response = call_llm(messages)
messages.append(response)
if response.has_tool_calls:
for tool_call in response.tool_calls:
result = execute_tool(tool_call)
messages.append(result)
else:
break # model is done
That is the whole agent. A loop calls the model, executes any requested tool calls, appends tool results, and repeats. When the model stops requesting tools, the loop ends.
The key data structure here is the messages array. This is the conversation history, and it is everything. Every user message, every assistant response, every tool call, every tool result — they all get appended to this array. The array grows with every iteration of the loop.
One detail matters a lot: the model is stateless. It has no memory between API calls. Every API call includes the full messages array, including tool results. The model reads it and decides what to do next. Even though OpenAI's Responses API and Google/Gemini's Interactions API can be stateful, the model itself is stateless.
This is why the loop works. If a tool call fails, the error output is in the conversation history. The model sees it on the next turn and can try a different approach. The conversation history is the state.
The loop gives the model room to do multi-step work. It can read, edit, run, and repeat instead of trying to dump a single answer. It can also adjust when it learns something new from a search result or an error message.
The loop itself is simple. The hard parts are token limits, tool design and descriptions, prompting, and safety.
Watch both the conversation history on the left and the effects on your machine on the right.
Notice what happened there. The model made four round trips through the loop. It read a file, made an edit, ran tests, and then summarised the result. Each tool result went into the conversation history, which the model saw on the next turn. The messages array grew from 2 entries to 12.
The loop works because the model can take small steps. It can read, decide, act, and verify. The harness runs the loop. The model decides what to do at each step.
Tool Calling - How Models Reach the Real World
A language model by itself generates text. Tools let it interact with a real environment: read files, run commands, and edit code.
A tool in an agent system has a name, a description, and an input schema. The name is something like read_file, edit_file, or bash. The description explains what the tool does. The schema defines the parameters it accepts.
When the model calls a tool, it outputs a structured JSON request. The harness receives that request and runs the tool. The tool result goes back to the model on the next iteration.
This is also the security boundary. The model can suggest actions, but the harness decides what actually runs and when to ask for approval.
Here are some exampe tools:
Tool descriptions matter. The model reads them as instructions. The wording affects when the model picks a tool and how it formats inputs. If the description is vague, the model will call the tool at the wrong time or with the wrong shape.
Compare a tool described as "run a command" with one described as "execute a shell command in the project's working directory and return stdout/stderr; use it to run tests or check git status." The second description gives the model both the API shape and the intended use.
Tools have a cost. Tool definitions sit in the system prompt and consume context window tokens. A small set is cheap. A large set adds overhead. This is why many agents ship with a small toolkit that composes. For example, you can avoid a special create_react_component tool if you already have edit_file. You can avoid a run_tests tool if you already have a generic bash tool.
The best tool sets follow the Unix philosophy: small, sharp tools that compose. Read a file. Edit a file. Search for text. Run a command. List files. That is enough to build almost anything. This is why adding too many MCPs to your setup would wreak havoc to your context window!
RAG is DEAD
Retrieval-Augmented Generation (RAG) is a common pattern. You split documents into chunks, embed each chunk into a vector, store vectors in a database, and at query time retrieve the closest chunks to the user's question. You add those chunks to the model's context.
RAG works well for things like searching a knowledge base and answering questions over large document collections.
For code, RAG has limitations.
Code has dependencies across files. A function in auth.py can call helpers in utils.py, which imports constants from config.py, which reads an environment variable. If you retrieve a single chunk, you can miss the call path.
Embedding similarity does not guarantee logical relevance. You might search for "login returns 403" and retrieve text that mentions those words, while the bug sits in a permission check that does not mention either string. Chunking can also split a function across boundaries and separate a signature from the logic that matters.
Another approach is to give the model tools to search and read the repository. The model can form a hypothesis, search for symbols, open files, follow references, and update the hypothesis as it reads more context.
Below we have a toy debugging scenario with both approaches.
In the example above, the RAG flow retrieved chunks that mentioned the login function and authentication in general. It missed the middleware bug because that code never mentions "login" or "403." The retrieval step did not surface check_permissions.
The agent flow found the bug by following the call chain. It searched for login, read the file, saw check_permissions, searched for that function, and read middleware.py.
When people say "the filesystem is all you need," they mean a model can do real work with basic repo tools like search and file reads. The model forms a hypothesis, checks code, and updates the next step based on what it finds. All the *nix neckbeards have already given enough training data up on the internet on "oh, you can just run this command."
This is similar to how a new developer works. They search, open files, follow symbols, and build context over time. The agent loop enables this because each tool result goes back into the same messages history.
RAG still helps in some scenarios.
RAG can help when scale or latency dominates. If you have millions of documents or a repository with tens of thousands of files, grepping everything can cost time and tokens. A vector index can narrow the search quickly.
Tool calling has round trips. The model generates a call, the harness executes it, and the result returns to the model. Multiple calls add latency.
If you already have structured mappings (for example, a dependency graph), you can inject that directly instead of having the model rediscover it.
In practice, an agent can have both styles available. Vector search can be one tool alongside grep and file reads, and the model can pick based on the problem.
The Context Window - Memory and Its Limits
Before talking about capability, it helps to talk about capacity. Every request runs inside a context window. The context window includes the system prompt, tool definitions, conversation history, files you paste or read, and the model's own output.
The unit of measurement here is the token. A token is roughly four characters of English text, or about three-quarters of a word. The word "function" is two tokens. A typical line of code is 10-20 tokens. A 500-line source file might be 3,000-5,000 tokens. These numbers matter because the context window is measured in tokens, and it fills up faster than you might expect.
Context windows grew quickly over the last few years. GPT-3 (2020) shipped with 4,096 tokens, roughly 3,000 words. GPT-4 started at 8,192 tokens and later expanded. Claude pushed higher, and some current models support up to 1 million tokens. At that size, you can fit a lot of code, but it is still a budget.
A large context window still has overhead. The system prompt for an agent can take a few thousand tokens. Tool definitions can take thousands more. Conversation history grows every turn. Reading a 500-line file adds a few thousand tokens. A big search result adds more. If you enable extended thinking, that also consumes tokens.
Long sessions fill the window. After enough turns, earlier context gets crowded out, even if it was important.
When the window fills up, systems either drop older content, summarise it, or keep a sliding window of recent turns. All of these lose information.
This is one reason long coding sessions degrade. The model can look worse later because key details from earlier turns got dropped or compressed. If you see this happening, you can restart with a clean prompt and re-provide the key context.
Context Window Sandbox
Add blocks to the window. When it overflows, older content gets summarised or dropped.
Try adding a system prompt, tool definitions, and then a series of turns and file reads. Switch to 4K and watch it fill up. Switch to 1M and the same blocks take a smaller fraction of the window. A larger window makes tool overhead and long histories easier to tolerate.
Caching - Why Your Second Message Is Faster
If you have used a coding agent, you have probably noticed the first message is slower than later ones. The main reason is caching at multiple layers.
KV Cache: The Engine-Level Optimization
KV caching comes from how attention works. For each token at each layer, the model computes a query (Q), a key (K), and a value (V). Attention uses dot products between queries and keys, then uses the result as weights over values.
Without caching, each new token would require recomputing K and V for all previous tokens. For a sequence of length , total work for K/V recomputation grows like .
The KV cache stores K and V for previous tokens. When generating the next token, the model computes K and V for the new token and reuses cached K/V for earlier tokens.
This is why you see two speeds. The time to first token is slow because the model processes the full prompt in one pass. After that, subsequent tokens can reuse the KV cache and stream faster.
Prompt Caching: The Provider-Level Optimization
Coding agents have large shared prefixes between requests. The system prompt and tool definitions stay the same. Conversation history up to the new user message stays the same.
Prompt caching (prefix caching) takes advantage of this. Some providers cache computed results for prompt prefixes. If two requests share the same prefix, the provider can reuse work from the previous request. OpenAI's Responses API provides this for free, whereas you have to pay Anthropic for this privilege.
This can reduce latency and cost for repeated prefixes.
Application-Level Caching
The third layer of caching happens in the harness. If the agent reads a file and later needs it again, the harness can reuse the previous content if the file did not change. This avoids repeated tool calls and repeated context window usage.
Harnesses can also cache directory listings, search results, and build outputs. Some keep a summarised project context across sessions.
KV Cache: Generating Token by Token
Text generation happens one token at a time. Attention needs to look back over earlier tokens. The grid below is a toy attention matrix (row = current token, column = earlier token). Click Next token and compare the work.
Recomputes K/V for the full sequence. Highlighted cells show work for this step.
Stores old K/V. Only the new row is computed.
Prompt Caching: Reusing the Prefix
Agent requests share a big prefix (system prompt, tool definitions, and history). Some providers cache work for that prefix, so later requests can skip it.
Caching makes coding agents feel fast even when the underlying model is expensive to run. The KV cache speeds up token generation after the initial prefill. Prompt caching speeds up repeated prefixes. Application-level caching reduces how much content the app sends.
Reasoning Tokens — Thinking Before Answering
When a problem is hard, people usually pause and think before acting. Models can do a similar thing when you give them space to generate intermediate tokens before writing the final answer. This affects hard tasks more than easy ones.
Chain-of-thought prompting was an early version of this idea.Wei, J., Wang, X., Schuurmans, D., et al. "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models." NeurIPS, 2022. If you ask a model to "think step by step," accuracy often improves on tasks like math and multi-step code analysis. The model uses the intermediate text as scratch space, and later steps can attend to earlier steps.
Extended thinking or reasoning tokens take this further. The model gets a dedicated thinking phase where it generates tokens that may be hidden from the user. These tokens are internal working memory. The model can explore approaches, compare tradeoffs, catch mistakes, and form a plan before writing the visible response.
This matters for coding tasks. "Rename the variable x to userName" is straightforward. "Refactor this auth system to support OAuth2 alongside the existing JWT flow" requires reading code, mapping dependencies, planning changes, and coordinating edits across files.
There is a tradeoff. Reasoning tokens use context window space. They also cost money and add latency, since the model generates them before the visible output. For small edits, a long reasoning trace is overhead. For a large refactor, it can help.
This ties back to the agent loop. At each step, the model chooses actions like searching, reading files, editing, running tests, or asking a question. Better decisions early reduce wasted work later.
If you go back to the context window demo above and click "Extended Thinking (+8K)" a few times, you can see the tradeoff. On a 32K context window, two rounds of extended thinking use about half the window. On a 128K window, the same thinking uses about 12%.
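The arithmetic behind that comparison, as a small sketch (the 8K-tokens-per-round figure mirrors the demo's button, not any real model's budget):

```python
def thinking_share(window_tokens, rounds, tokens_per_round=8_000):
    """Fraction of the context window consumed by extended-thinking rounds."""
    return rounds * tokens_per_round / window_tokens

print(f"{thinking_share(32_000, 2):.0%}")   # → 50%
print(f"{thinking_share(128_000, 2):.1%}")  # → 12.5%
```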
Harnesses, Tooling, and the Ecosystem
Many coding agents use the same underlying model. The experience changes because the harness changes.
The harness is the system prompt, the tool set, the agent loop, permissions, UI, and safety checks.
You can see this in current tools. Claude Code is terminal-first and can run commands and read/write files. Cursor is an IDE and uses editor context like the current file and selection. GitHub Copilot originally focused on inline completions as you typed.
These differences come from different choices about how humans and tools work together.
The system prompt sets how an agent behaves. It can include constraints, tool usage rules, and output format. Different products ship different prompts. That changes how the same base model behaves.
Tools are what let an agent read files, run commands, search code, and edit code. Tool quality matters. Error handling, permissions, and how results are returned all affect outcomes.
Permissions and sandboxing are harness responsibilities. A coding agent with filesystem access can delete files or run destructive commands. Most harnesses limit access, require approval for risky actions, and log what the agent does. The harness is the boundary between model output and your machine. I personally live in the --yolo world, so I don't really care about this.
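A minimal sketch of such a gate (Python; the `RISKY` patterns and the `gate`/`approve` names are invented for illustration, and real harnesses do much more than pattern matching):

```python
import re

# Commands that should require explicit user approval before running.
RISKY = [r"\brm\b", r"\bgit\s+push\b", r"\bsudo\b"]

def gate(command, approve):
    """Toy permission gate: risky shell commands need approval, safe ones run."""
    if any(re.search(pat, command) for pat in RISKY):
        return approve(command)  # ask the human (or the policy) first
    return True                  # safe commands run without asking

# Usage: the approval callback here always says no.
print(gate("ls -la", lambda c: False))        # → True
print(gate("rm -rf build", lambda c: False))  # → False
```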
Stayin' Grounded
Coding agents can look like they understand a codebase. They can read files, make changes, run tests, fix failures, and produce working code. Under the hood, the model is still doing next-token prediction. Being clear about that helps you decide when to trust outputs and when to verify.See Mihail Eric's "The Emperor Has No Clothes".
Models Predict, They Do Not Understand
Large language models predict the next token. With enough data and scale, the output often looks like expert writing. The internal process is pattern prediction, not a human-style mental model, though at that point it becomes a philosophical question of what a human-style mental model even is.
This matters when you hit something new or under-documented. A model can struggle when the right solution is rare in its training data, even if the fix is simple once you understand the system. You can mitigate some of this by giving the agent a way to research, test hypotheses, and collaborate with other agents.
Context Window Degradation
Even with large context windows, attention is not uniform. "Lost in the middle" results show that models often use the start and end of the context more than the middle.Liu, N.F., Lin, K., Hewitt, J., et al. "Lost in the Middle: How Language Models Use Long Contexts." TACL, 2024. If an important detail sits in the middle of a long history, the model can miss it.
This can happen even when nothing was truncated. The text is still present, but it is harder for the model to use. If a detail matters, repeat it near the current request.
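One way to act on that advice, sketched in Python (`assemble` and the message format are made up; real harnesses manage context far more carefully):

```python
def assemble(history, pinned, max_msgs):
    """Toy context builder: keep recent history, and re-state pinned details
    right before the newest message so they sit near the end of the context."""
    kept = history[-max_msgs:]
    reminders = [f"reminder: {p}" for p in pinned]
    return kept[:-1] + reminders + kept[-1:]

history = [f"msg {i}" for i in range(20)]
ctx = assemble(history, pinned=["tests must stay green"], max_msgs=5)
print(ctx[-2])  # → reminder: tests must stay green
```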
The practical skill is knowing when to give more context, when to let the agent run, and when to step in.
Understanding the loop, tools, caching, context limits, and reasoning helps you use agents better. You can keep prompts short, re-state key details when needed, and stop the agent when it starts thrashing.
Epilogue: It Is All Matrix Multiplication
Every breakthrough, every billion-dollar model, every coding agent that can navigate a codebase and fix bugs while you get coffee — reduces to the same operation your graphics card has been doing since the 1990s: matrix multiplication.
Attention uses matrix multiplications. The feed-forward network uses matrix multiplications. The final projection that produces logits over the vocabulary is another matrix multiplication. You can write the basic shape as:
Multiply inputs by weights, add a bias, apply a non-linearity, and repeat many times. The behaviour comes from the learned weights. Training adjusts those weights with gradient descent.
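That recipe fits in a few lines of plain Python (untrained random weights, so the output is meaningless; the point is the shape of the computation):

```python
import random

random.seed(0)

def matmul(x, W):
    # One matrix multiplication: row vector x times matrix W, column by column.
    return [sum(xi * wij for xi, wij in zip(x, col)) for col in zip(*W)]

def layer(x, W, b):
    # Multiply inputs by weights, add a bias, apply a non-linearity (ReLU).
    return [max(0.0, v + bi) for v, bi in zip(matmul(x, W), b)]

# "Repeat many times": stack a few layers with random (untrained) weights.
n = 8
x = [random.gauss(0, 1) for _ in range(n)]
for _ in range(4):
    W = [[random.gauss(0, 0.3) for _ in range(n)] for _ in range(n)]
    b = [0.0] * n
    x = layer(x, W, b)

print(len(x), all(v >= 0 for v in x))  # → 8 True
```

Swap the random weights for trained ones, widen the matrices by a few orders of magnitude, and add attention, and you have the skeleton of every model in this book.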
We did not discover the secret of intelligence. We discovered that if you multiply enough matrices together, with enough data, on enough hardware, the result behaves as if it understands. The distinction between "understands" and "behaves as if it understands" is one of the great unresolved questions of our time, and we are not going to settle it here.
What we can say is this: fifty years from now, when whatever comes after transformers has long since made our current models look like pocket calculators, someone will look back at this era and marvel. They will marvel that we took the entirety of human written knowledge, compressed it into a few terabytes of matrix weights through an optimization procedure we do not fully understand, wrapped it in a while loop with some JSON parsing, and called it an "agent." They will marvel that it worked at all. And they will marvel, most of all, that the thing powering the most sophisticated code generation systems ever built was, at the end of the day, the same operation a nineteen-year-old learns in week three of linear algebra.
We curve-fit the world. And somehow, it worked.