Transformer Basics

This page explains the transformer architecture at a conceptual level. You don't need to understand the math to build GenAI systems, but understanding why transformers handle sequences differently helps with design decisions.

The Evolution from RNNs to Transformers

Before transformers, sequence modeling relied on Recurrent Neural Networks (RNNs) like LSTMs and GRUs.

RNN Processing: Sequential and Slow

The problem: To understand Token 100, the model had to process Tokens 1-99 first. Information from early tokens had to "travel" through every intermediate step, getting diluted or lost along the way.

Transformer Processing: Parallel and Direct

The breakthrough: Transformers process all tokens simultaneously. Any token can directly "attend to" any other token in a single step—no sequential passing required.


Self-Attention: The Key Innovation

Self-attention is the mechanism that allows each token to "look at" all other tokens simultaneously.

How Attention Works: Query, Key, Value

Each token is represented by three vectors during attention:

  • Query (Q) — what this token is looking for in other tokens
  • Key (K) — what this token offers for other tokens to match against
  • Value (V) — the information this token contributes when attended to

Attention scores come from comparing each token's Query against every other token's Key; those scores then weight a sum of the Values.
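As a rough sketch (not any specific model's implementation), scaled dot-product self-attention for a single head can be written in a few lines of NumPy. The matrix sizes here are toy values chosen for illustration:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention for one head.

    X: (seq_len, d_model) token embeddings
    Wq, Wk, Wv: (d_model, d_head) learned projection matrices
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv          # each token gets Q, K, V vectors
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # every token scores every other token
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ V                        # weighted mix of Value vectors

rng = np.random.default_rng(0)
seq_len, d_model, d_head = 6, 8, 4
X = rng.normal(size=(seq_len, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_head)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # (6, 4): one output vector per token
```

Note that `scores` is computed for all token pairs in one matrix multiply, which is exactly why attention parallelizes where RNNs could not.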

Why This Matters for Language Understanding

In the sentence "The cat sat on the mat", attention captures relationships such as:

  • "sat" attending to "cat" to resolve who performed the action
  • "sat" attending to "mat" (via "on") to resolve where it happened

Without sequential processing, the model can directly learn that "sat" relates to "cat" (subject-verb agreement), even though "cat" comes before "sat".


Tokens and Positional Information

What Are Tokens?

LLMs don't process words directly—they process tokens. A token is typically a whole common word, a piece of a longer or rarer word, or a chunk of characters (for punctuation, code, or emoji):

| Text | Approximate Tokens | Why |
|---|---|---|
| "The" | 1 | Common word |
| "cat" | 1 | Common word |
| "AgentFlow" | 2-3 | Uncommon compound |
| "transformer" | 2-3 | Longer word |
| "🚀" | 2-5 | Emoji (varies by model) |

Rule of thumb: 1 token ≈ 4 characters in English ≈ 0.75 words
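The rule of thumb above can be turned into a quick estimator. This is only a heuristic for ballpark budgeting; real counts come from the model's actual tokenizer:

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate using the ~4 characters/token heuristic.

    Real counts depend on the model's tokenizer; use this only for
    ballpark planning, never for billing or hard limits.
    """
    return max(1, round(len(text) / 4))

print(estimate_tokens("The cat sat on the mat"))  # 6 (22 characters / 4)
```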

Why Tokenization Matters

Different tokenizers behave differently: the same text can yield noticeably different token counts across model families, so never assume counts transfer from one provider to another.

Implication: Tokenization affects:

  • Cost (more tokens = more money)
  • Latency (more tokens = slower)
  • Chunk boundaries (where you split documents)
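Because cost scales with token counts, even a small helper makes the tradeoff concrete. The prices below are made up purely for illustration; real per-token prices vary by provider and model:

```python
def prompt_cost_usd(input_tokens: int, output_tokens: int,
                    input_price_per_mtok: float,
                    output_price_per_mtok: float) -> float:
    """Compute a request's cost from per-million-token prices.

    The prices passed in are hypothetical; check your provider's
    current pricing page for real numbers.
    """
    return (input_tokens * input_price_per_mtok
            + output_tokens * output_price_per_mtok) / 1_000_000

# Example with made-up prices: $3 input / $15 output per million tokens.
# A 100K-token prompt costs far more than the 1K-token answer.
print(prompt_cost_usd(100_000, 1_000, 3.0, 15.0))  # 0.315
```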

Positional Encoding: Adding Order Awareness

Self-attention has no inherent sense of order—it treats tokens as a "bag of words". Positional encoding fixes this by adding position information:

Important: Without positional encoding, "Hello World" and "World Hello" would be treated identically. With it, the model knows the difference.
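One classic scheme is the sinusoidal encoding from the original Transformer paper: each position gets a unique pattern of sines and cosines that is added to the token embedding. (Many modern models use learned or rotary position embeddings instead; this is just the simplest concrete example.)

```python
import numpy as np

def sinusoidal_positions(seq_len: int, d_model: int) -> np.ndarray:
    """Sinusoidal positional encodings ("Attention Is All You Need").

    Each position gets a distinct sine/cosine pattern; adding it to
    token embeddings makes word order visible to attention.
    """
    pos = np.arange(seq_len)[:, None]              # (seq_len, 1)
    i = np.arange(0, d_model, 2)[None, :]          # even dimensions
    angles = pos / (10000 ** (i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                   # even dims: sine
    pe[:, 1::2] = np.cos(angles)                   # odd dims: cosine
    return pe

pe = sinusoidal_positions(4, 8)
# Position 0 and position 1 get different vectors, which is exactly
# what distinguishes "Hello World" from "World Hello".
print(np.allclose(pe[0], pe[1]))  # False
```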


Why Transformers Handle Long Sequences Better

The Path Length Problem

| Model | Path Length | Consequence |
|---|---|---|
| RNN | O(N) | Long-range dependencies dilute |
| Transformer | O(1) | Direct connections regardless of distance |

This is why modern LLMs can work with context windows of 128K+ tokens.


Why Long Context Still Has Tradeoffs

Despite transformers handling longer sequences better, long context has real quality and cost tradeoffs.

Quality Tradeoffs

| Issue | Description | Mitigation |
|---|---|---|
| Lost in the middle | Models often pay less attention to middle content | Put important info at beginning or end |
| Attention dilution | More tokens = less attention per token | Use retrieval instead of stuffing context |
| Stale representations | Models trained on shorter context may not generalize | Test with actual long context |

Cost Tradeoffs

| Context Size | Approximate Cost | Latency Impact |
|---|---|---|
| 4K tokens | 1x baseline | Baseline |
| 32K tokens | ~3-4x | +50-100% |
| 128K tokens | ~10-15x | +200-400% |

Design implication: Don't use 128K when 4K suffices. More tokens cost more and often produce worse results for focused tasks.
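One way to act on this is to pick the smallest context tier that actually fits your tokens. The tier sizes below mirror the illustrative table above; real model limits vary by provider:

```python
def smallest_sufficient_context(needed_tokens: int,
                                tiers=(4_000, 32_000, 128_000)) -> int:
    """Return the smallest context tier that fits the tokens you need.

    Tier sizes mirror the illustrative table above (4K/32K/128K);
    substitute your provider's real limits.
    """
    for tier in tiers:
        if needed_tokens <= tier:
            return tier
    raise ValueError(f"{needed_tokens} tokens exceeds largest tier {tiers[-1]}")

print(smallest_sufficient_context(2_500))   # 4000: no need to pay for 128K
print(smallest_sufficient_context(40_000))  # 128000: 32K would truncate
```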


Architecture Variants: Encoder, Decoder, and Combinations

There are three main transformer variants:

| Variant | Processing | Best For | Examples |
|---|---|---|---|
| Encoder-only | Bidirectional (sees all tokens) | Classification, extraction, embeddings | BERT, RoBERTa |
| Decoder-only | Causal (sees only past tokens) | Text generation, chat | GPT-4, Claude, Llama |
| Encoder-decoder | Encoder for input, decoder for output | Translation, summarization | T5, BART |

Most modern LLMs are decoder-only because they're optimized for generative tasks like chat and text completion.
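The "causal" in decoder-only means attention is masked so a token can only see the past, never the future. A minimal sketch of that mask:

```python
import numpy as np

def causal_mask(seq_len: int) -> np.ndarray:
    """Causal (decoder-only) attention mask.

    True means "may attend": token i can see tokens 0..i but never
    later ones. Encoder-only models skip this mask entirely, so every
    token sees every other token in both directions.
    """
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

print(causal_mask(4).astype(int))
# [[1 0 0 0]
#  [1 1 0 0]
#  [1 1 1 0]
#  [1 1 1 1]]
```

In practice the mask is applied by setting disallowed attention scores to a large negative value before the softmax, so their weights become effectively zero.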


Practical Implications for GenAI Systems

Design Guidelines

| Guideline | Reason |
|---|---|
| Minimize context when possible | Every token costs money and latency |
| Put important info at start or end | "Lost in the middle" phenomenon |
| Use retrieval for specific information | Targeted retrieval > long context |
| Test with actual context lengths | Quality degrades non-linearly |
| Cache stable context | System prompts can often be cached |
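The first three guidelines can be combined into a simple prompt-assembly pattern. This is a hypothetical helper, not a library API: the stable system prompt goes first (cacheable), background fills the middle, and the critical facts plus the question sit at the end, where models attend most reliably:

```python
def assemble_context(system: str, background: list[str],
                     important: str, question: str) -> str:
    """Order prompt sections to sidestep "lost in the middle".

    Hypothetical layout: cacheable system prompt first, retrieved
    background in the middle, critical facts and the question last.
    """
    parts = [system, *background, important, question]
    return "\n\n".join(p for p in parts if p)

prompt = assemble_context(
    system="You are a support assistant.",
    background=["FAQ excerpt A", "FAQ excerpt B"],
    important="Customer's plan: Pro, renews 2025-01-01.",
    question="When does my plan renew?",
)
print(prompt.startswith("You are a support assistant."))  # True
```

Keeping the system prompt byte-identical across requests is also what makes provider-side prompt caching effective.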

Key Takeaways

  1. Transformers use self-attention — Each token can attend to all other tokens in parallel, enabling direct long-range dependencies.

  2. Positional encoding adds order awareness — Without it, "Hello World" and "World Hello" are identical.

  3. Long context has real tradeoffs — More tokens mean higher cost, more latency, and often lower per-token quality.

  4. Retrieval beats stuffing for specific queries — Targeted retrieval is usually cheaper and more accurate than long context.

  5. Architecture affects capability — Most modern LLMs are decoder-only transformers optimized for generation.


What You Learned

  • Self-attention allows parallel processing and direct long-range dependencies
  • Positional encoding gives transformers order awareness
  • Long context has quality (lost in middle) and cost tradeoffs
  • Most modern LLMs are decoder-only transformers
  • Context design decisions directly impact cost and quality

Prerequisites Map

This page supports these lessons:

| Course | Lesson | Dependency |
|---|---|---|
| Beginner | Lesson 1: Use cases, models, and the LLM app lifecycle | Self-attention, context windows |
| Beginner | Lesson 2: Prompting and structured outputs | Cost and latency implications |
| Advanced | Lesson 3: Context engineering, long context, and caching | Full page |
| Advanced | Lesson 4: Advanced RAG | Why retrieval often beats long context |

Next Step

Continue to Tokenization and context windows to understand how text is split into tokens and how this affects your GenAI applications.
