Transformer Basics

This page explains the transformer architecture at a conceptual level. You don't need to understand the math to build GenAI systems, but understanding why transformers handle sequences differently helps with design decisions.

The Evolution from RNNs to Transformers

Before transformers, sequence modeling relied on Recurrent Neural Networks (RNNs) like LSTMs and GRUs.

RNN Processing: Sequential and Slow

The problem: To understand Token 100, the model had to process Tokens 1-99 first. Information from early tokens had to "travel" through every intermediate step, getting diluted or lost along the way.

Transformer Processing: Parallel and Direct

The breakthrough: Transformers process all tokens simultaneously. Any token can directly "attend to" any other token in a single step—no sequential passing required.


Self-Attention: The Key Innovation

Self-attention is the mechanism that allows each token to "look at" all other tokens simultaneously.

How Attention Works: Query, Key, Value

Each token is represented by three vectors during attention:

  • Query (Q) — what this token is looking for in other tokens
  • Key (K) — what this token offers for other tokens to match against
  • Value (V) — the information this token contributes when attended to

Attention scores come from comparing each token's Query against every other token's Key; those scores then weight a sum of the Values.
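As a rough sketch (not any specific model's implementation), scaled dot-product self-attention for a single head can be written in a few lines of NumPy. The matrix sizes here are toy values chosen for illustration:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention for one head.

    X: (seq_len, d_model) token embeddings
    Wq, Wk, Wv: (d_model, d_head) learned projection matrices
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv          # each token gets Q, K, V vectors
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # every token scores every other token
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ V                        # weighted mix of Value vectors

rng = np.random.default_rng(0)
seq_len, d_model, d_head = 6, 8, 4
X = rng.normal(size=(seq_len, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_head)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # (6, 4): one output vector per token
```

Note that `scores` is computed for all token pairs in one matrix multiply, which is exactly why attention parallelizes where RNNs could not.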

Why This Matters for Language Understanding

In the sentence "The cat sat on the mat", attention captures relationships such as:

  • "sat" attending to "cat" to resolve who performed the action
  • "sat" attending to "mat" (via "on") to resolve where it happened

Without sequential processing, the model can directly learn that "sat" relates to "cat" (subject-verb agreement), even though "cat" comes before "sat".


Tokens and Positional Information

What Are Tokens?

LLMs don't process words directly—they process tokens. A token is typically a whole common word, a piece of a longer or rarer word, or a chunk of characters (for punctuation, code, or emoji):

| Text | Approximate Tokens | Why |
|---|---|---|
| "The" | 1 | Common word |
| "cat" | 1 | Common word |
| "AgentFlow" | 2-3 | Uncommon compound |
| "transformer" | 2-3 | Longer word |
| "🚀" | 2-5 | Emoji (varies by model) |

Rule of thumb: 1 token ≈ 4 characters in English ≈ 0.75 words
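The rule of thumb above can be turned into a quick estimator. This is only a heuristic for ballpark budgeting; real counts come from the model's actual tokenizer:

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate using the ~4 characters/token heuristic.

    Real counts depend on the model's tokenizer; use this only for
    ballpark planning, never for billing or hard limits.
    """
    return max(1, round(len(text) / 4))

print(estimate_tokens("The cat sat on the mat"))  # 6 (22 characters / 4)
```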

Why Tokenization Matters

Different tokenizers behave differently: the same text can yield noticeably different token counts across model families, so never assume counts transfer from one provider to another.

Implication: Tokenization affects:

  • Cost (more tokens = more money)
  • Latency (more tokens = slower)
  • Chunk boundaries (where you split documents)
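Because cost scales with token counts, even a small helper makes the tradeoff concrete. The prices below are made up purely for illustration; real per-token prices vary by provider and model:

```python
def prompt_cost_usd(input_tokens: int, output_tokens: int,
                    input_price_per_mtok: float,
                    output_price_per_mtok: float) -> float:
    """Compute a request's cost from per-million-token prices.

    The prices passed in are hypothetical; check your provider's
    current pricing page for real numbers.
    """
    return (input_tokens * input_price_per_mtok
            + output_tokens * output_price_per_mtok) / 1_000_000

# Example with made-up prices: $3 input / $15 output per million tokens.
# A 100K-token prompt costs far more than the 1K-token answer.
print(prompt_cost_usd(100_000, 1_000, 3.0, 15.0))  # 0.315
```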

Positional Encoding: Adding Order Awareness

Self-attention has no inherent sense of order—it treats tokens as a "bag of words". Positional encoding fixes this by adding position information:

Important: Without positional encoding, "Hello World" and "World Hello" would be treated identically. With it, the model knows the difference.
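One classic scheme is the sinusoidal encoding from the original Transformer paper: each position gets a unique pattern of sines and cosines that is added to the token embedding. (Many modern models use learned or rotary position embeddings instead; this is just the simplest concrete example.)

```python
import numpy as np

def sinusoidal_positions(seq_len: int, d_model: int) -> np.ndarray:
    """Sinusoidal positional encodings ("Attention Is All You Need").

    Each position gets a distinct sine/cosine pattern; adding it to
    token embeddings makes word order visible to attention.
    """
    pos = np.arange(seq_len)[:, None]              # (seq_len, 1)
    i = np.arange(0, d_model, 2)[None, :]          # even dimensions
    angles = pos / (10000 ** (i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                   # even dims: sine
    pe[:, 1::2] = np.cos(angles)                   # odd dims: cosine
    return pe

pe = sinusoidal_positions(4, 8)
# Position 0 and position 1 get different vectors, which is exactly
# what distinguishes "Hello World" from "World Hello".
print(np.allclose(pe[0], pe[1]))  # False
```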


Why Transformers Handle Long Sequences Better

The Path Length Problem

| Model | Path Length | Consequence |
|---|---|---|
| RNN | O(N) | Long-range dependencies dilute |
| Transformer | O(1) | Direct connections regardless of distance |

This is why modern LLMs can work with context windows of 128K+ tokens.


Why Long Context Still Has Tradeoffs

Despite transformers handling longer sequences better, long context has real quality and cost tradeoffs.

Quality Tradeoffs

| Issue | Description | Mitigation |
|---|---|---|
| Lost in the middle | Models often pay less attention to middle content | Put important info at beginning or end |
| Attention dilution | More tokens = less attention per token | Use retrieval instead of stuffing context |
| Stale representations | Models trained on shorter context may not generalize | Test with actual long context |

Cost Tradeoffs

| Context Size | Approximate Cost | Latency Impact |
|---|---|---|
| 4K tokens | 1x baseline | Baseline |
| 32K tokens | ~3-4x | +50-100% |
| 128K tokens | ~10-15x | +200-400% |

Design implication: Don't use 128K when 4K suffices. More tokens cost more and often produce worse results for focused tasks.
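One way to act on this is to pick the smallest context tier that actually fits your tokens. The tier sizes below mirror the illustrative table above; real model limits vary by provider:

```python
def smallest_sufficient_context(needed_tokens: int,
                                tiers=(4_000, 32_000, 128_000)) -> int:
    """Return the smallest context tier that fits the tokens you need.

    Tier sizes mirror the illustrative table above (4K/32K/128K);
    substitute your provider's real limits.
    """
    for tier in tiers:
        if needed_tokens <= tier:
            return tier
    raise ValueError(f"{needed_tokens} tokens exceeds largest tier {tiers[-1]}")

print(smallest_sufficient_context(2_500))   # 4000: no need to pay for 128K
print(smallest_sufficient_context(40_000))  # 128000: 32K would truncate
```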


Architecture Variants: Encoder, Decoder, and Combinations

There are three main transformer variants:

| Variant | Processing | Best For | Examples |
|---|---|---|---|
| Encoder-only | Bidirectional (sees all tokens) | Classification, extraction, embeddings | BERT, RoBERTa |
| Decoder-only | Causal (sees only past tokens) | Text generation, chat | GPT-4, Claude, Llama |
| Encoder-decoder | Encoder for input, decoder for output | Translation, summarization | T5, BART |

Most modern LLMs are decoder-only because they're optimized for generative tasks like chat and text completion.
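The "causal" in decoder-only means attention is masked so a token can only see the past, never the future. A minimal sketch of that mask:

```python
import numpy as np

def causal_mask(seq_len: int) -> np.ndarray:
    """Causal (decoder-only) attention mask.

    True means "may attend": token i can see tokens 0..i but never
    later ones. Encoder-only models skip this mask entirely, so every
    token sees every other token in both directions.
    """
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

print(causal_mask(4).astype(int))
# [[1 0 0 0]
#  [1 1 0 0]
#  [1 1 1 0]
#  [1 1 1 1]]
```

In practice the mask is applied by setting disallowed attention scores to a large negative value before the softmax, so their weights become effectively zero.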


Practical Implications for GenAI Systems

Design Guidelines

| Guideline | Reason |
|---|---|
| Minimize context when possible | Every token costs money and latency |
| Put important info at start or end | "Lost in the middle" phenomenon |
| Use retrieval for specific information | Targeted retrieval > long context |
| Test with actual context lengths | Quality degrades non-linearly |
| Cache stable context | System prompts can often be cached |
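The first three guidelines can be combined into a simple prompt-assembly pattern. This is a hypothetical helper, not a library API: the stable system prompt goes first (cacheable), background fills the middle, and the critical facts plus the question sit at the end, where models attend most reliably:

```python
def assemble_context(system: str, background: list[str],
                     important: str, question: str) -> str:
    """Order prompt sections to sidestep "lost in the middle".

    Hypothetical layout: cacheable system prompt first, retrieved
    background in the middle, critical facts and the question last.
    """
    parts = [system, *background, important, question]
    return "\n\n".join(p for p in parts if p)

prompt = assemble_context(
    system="You are a support assistant.",
    background=["FAQ excerpt A", "FAQ excerpt B"],
    important="Customer's plan: Pro, renews 2025-01-01.",
    question="When does my plan renew?",
)
print(prompt.startswith("You are a support assistant."))  # True
```

Keeping the system prompt byte-identical across requests is also what makes provider-side prompt caching effective.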

Key Takeaways

  1. Transformers use self-attention — Each token can attend to all other tokens in parallel, enabling direct long-range dependencies.

  2. Positional encoding adds order awareness — Without it, "Hello World" and "World Hello" are identical.

  3. Long context has real tradeoffs — More tokens mean higher cost, more latency, and often lower per-token quality.

  4. Retrieval beats stuffing for specific queries — Targeted retrieval is usually cheaper and more accurate than long context.

  5. Architecture affects capability — Most modern LLMs are decoder-only transformers optimized for generation.


What You Learned

  • Self-attention allows parallel processing and direct long-range dependencies
  • Positional encoding gives transformers order awareness
  • Long context has quality (lost in middle) and cost tradeoffs
  • Most modern LLMs are decoder-only transformers
  • Context design decisions directly impact cost and quality

Prerequisites Map

This page supports these lessons:

| Course | Lesson | Dependency |
|---|---|---|
| Beginner | Lesson 1: Use cases, models, and the LLM app lifecycle | Self-attention, context windows |
| Beginner | Lesson 2: Prompting and structured outputs | Cost and latency implications |
| Advanced | Lesson 3: Context engineering, long context, and caching | Full page |
| Advanced | Lesson 4: Advanced RAG | Why retrieval often beats long context |

Next Step

Continue to Tokenization and context windows to understand how text is split into tokens and how this affects your GenAI applications.
