11  LLM Building Blocks: Tokenization and Markov Models

A First Generative Model From Scratch

Open the live notebook in Google Colab or download the live notebook.

In this set of notes, we’ll take our first peek “under the hood” at a model for text generation which is simple enough for us to code from scratch and train almost instantly. This model is called a Markov model.

We’ll also introduce tokenization, which prepares text data for use in machine learning models by breaking it down into smaller units.

Here’s a simple piece of text for us to consider:

import pandas as pd
from collections import defaultdict
import random
import urllib.request 

url = "https://raw.githubusercontent.com/middcs/data-science-notes/refs/heads/main/data/seuss/places.txt"
text = "\n".join([line.decode('utf-8').strip() for line in urllib.request.urlopen(url)])

This is the complete text of Dr. Seuss’s Oh, the Places You’ll Go!. Here’s the first tidbit.

print(text[0:193])
Congratulations!
Today is your day.
You're off to Great Places!
You're off and away!

You have brains in your head.
You have feet in your shoes.
You can steer yourself
any direction you choose.

We’d like to build a model that can generate text which is “similar to this.” The idea of “similarity” here requires specification, and the task of operationalizing the idea of “similarity” is one of the core parts of language modeling.

Byte-Pair Tokenization

Before we get into our first concept of “similarity,” we need to grapple with our units. One of the fundamental aspects of human language is that it is composed by arranging many small units into larger structures. But what are the units? One reasonable answer to this question, at least in the context of written language, is that the units are characters. Python makes it very easy to split a given piece of text into its constituent characters:


char_tokens = list(text)

print(char_tokens[0:20])
['C', 'o', 'n', 'g', 'r', 'a', 't', 'u', 'l', 'a', 't', 'i', 'o', 'n', 's', '!', '\n', 'T', 'o', 'd']

Another reasonable choice is to split the text into words:


word_tokens = text.split() # split on whitespace

print(word_tokens[0:20])
['Congratulations!', 'Today', 'is', 'your', 'day.', "You're", 'off', 'to', 'Great', 'Places!', "You're", 'off', 'and', 'away!', 'You', 'have', 'brains', 'in', 'your', 'head.']

Note that this approach has some limitations:

  • Punctuation is attached to words (e.g., “Congratulations!” is a single token)
  • Capitalization is preserved (e.g., “You” and “you” are different tokens)
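
If we wanted to stick with word-level tokens, one way to soften both limitations would be to lowercase the text and split punctuation off into its own tokens. Here’s a minimal sketch using a regular expression (this normalization step is purely illustrative and is not part of the pipeline we use below):

import re

# lowercase the text, then treat runs of letters (plus apostrophes) and
# individual punctuation marks as separate tokens
normalized_tokens = re.findall(r"[a-z']+|[^a-z'\s]", text.lower())

print(normalized_tokens[0:10])
# expected: ['congratulations', '!', 'today', 'is', 'your', 'day', '.', "you're", 'off', 'to']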

Modern tokenizers implement complex pipelines for breaking text into tokens. A typical API for a tokenizer includes a vocabulary (the set of all tokens the tokenizer can produce), a mapping of the vocabulary to a set of integer ids, and methods for encoding text into tokens/ids and decoding tokens/ids back into text. Here is a simple example of a character-level tokenizer that implements this API:

class CharTokenizer: 
    
    def __init__(self):
        self.token_to_id = {}
        self.id_to_token = {}

    def train(self, text):
        unique_tokens = sorted(set(text))
        self.token_to_id = {char: idx for idx, char in enumerate(unique_tokens)}
        self.id_to_token = {idx: char for char, idx in self.token_to_id.items()}
        
    def encode(self, text):
        return [self.token_to_id[char] for char in text]
    
    def decode(self, ids):
        return ''.join([self.id_to_token[idx] for idx in ids])

Here’s this tokenizer in action:

small_text = "Character tokens!"

tokenizer = CharTokenizer()
tokenizer.train(small_text)

The association between integer ids and characters is stored in the tokenizer’s vocabulary:

vocab = tokenizer.token_to_id
print(vocab)
{' ': 0, '!': 1, 'C': 2, 'a': 3, 'c': 4, 'e': 5, 'h': 6, 'k': 7, 'n': 8, 'o': 9, 'r': 10, 's': 11, 't': 12}

We can encode small_text into integer ids:

encoded = tokenizer.encode(small_text)
print(encoded)
[2, 6, 3, 10, 3, 4, 12, 5, 10, 0, 12, 9, 7, 5, 8, 11, 1]

And decode the text back from the ids:

decoded = tokenizer.decode(encoded)
print(decoded)
Character tokens!
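
One limitation of this simple tokenizer is that it can only handle characters that appeared in its training text. For instance, the letter “N” never appears in small_text, so trying to encode it raises an error. Here’s a quick sketch of the failure mode; production tokenizers handle this with special unknown tokens or byte-level fallbacks:

try:
    tokenizer.encode("New tokens?")
except KeyError as e:
    print(f"Character not in vocabulary: {e}")
Character not in vocabulary: 'N'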

It’s also possible to use complete words as tokens, as we did earlier. Modern tokenizers used in practice, however, often use subword units, which fall somewhere between characters and words in size. As one important example, we’ll consider Byte-Pair Encoding (BPE) tokenization, which is widely used in practice. The BPE training algorithm begins with each character in the training text as its own token. It then repeatedly finds the most common pair of adjacent tokens in the training text and merges them into a single new token. This process repeats until some specified stopping criterion is reached. In the HuggingFace example later in this section, we’ll specify that only pairs which appear at least 10 times in the training text should be merged; once no pair meets this criterion, the algorithm stops.

Here’s a complete example of BPE in action on the following piece of text.

do dogs like dogs?  

The most common pair of adjacent tokens in this text is d followed by o, which appears 3 times (once in “do” and once in each “dogs”). To start BPE, we merge this pair into a new token, which we’ll call Z. The new string now reads

Z Zgs like Zgs?

The next most common pair of tokens is Zg, which appears twice. We replace Zg with a new token, which we’ll call Y. The new string now reads

Z Ys like Ys?

The next most common pair of tokens is Ys, which also appears twice. We replace Ys with a new token, which we’ll call X. The new string now reads

Z X like X?

Setting aside pairs that include the space character (practical BPE implementations typically avoid merging across word boundaries), there are no more pairs of tokens which appear multiple times in the text, so we stop here. The final token vocabulary can be represented in the following lookup table:

{
    "dogs" : "X",
    "dog"  : "Y",
    "do"   : "Z",
    "l" : "l", 
    "i" : "i",
    "k" : "k",
    "e" : "e",
    " " : " ",
    "?" : "?"
}

Note that the intermediate token Y (“dog”) stays in the vocabulary even though it no longer appears in the final tokenization of the text.

In practice, we would assign an integer id to each of these tokens; the integer ids would be the actual objects manipulated by a language model.
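
To make the merging procedure concrete, here is a minimal from-scratch sketch of the BPE training loop, run on the toy string above. It is a simplified illustration (for instance, we skip pairs containing the space character to mirror the hand-worked example, and the helper bpe_train is our own); the library implementation we’ll use next handles many more details:

from collections import Counter

def bpe_train(text, min_frequency=2):
    tokens = list(text)   # start with one token per character
    merges = []           # record the merges we perform, in order
    while True:
        # count adjacent pairs, skipping pairs that contain a space
        pair_counts = Counter(p for p in zip(tokens, tokens[1:]) if ' ' not in p)
        if not pair_counts:
            break
        pair, count = max(pair_counts.items(), key=lambda kv: kv[1])
        if count < min_frequency:   # stop once no pair is frequent enough
            break
        merges.append(pair)
        # replace every occurrence of the pair with a single merged token
        merged, i = [], 0
        while i < len(tokens):
            if i < len(tokens) - 1 and (tokens[i], tokens[i + 1]) == pair:
                merged.append(tokens[i] + tokens[i + 1])
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens, merges

tokens, merges = bpe_train("do dogs like dogs?")
print(merges)
print(tokens)
[('d', 'o'), ('do', 'g'), ('dog', 's')]
['do', ' ', 'dogs', ' ', 'l', 'i', 'k', 'e', ' ', 'dogs', '?']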

HuggingFace implements BPE and many other tokenization algorithms in their tokenizers library, which we’ll now demonstrate:

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer


tokenizer = Tokenizer(BPE())
trainer   = BpeTrainer(min_frequency=10)
tokenizer.train_from_iterator([text], trainer)


Let’s take a look at the vocabulary learned by this tokenizer:


vocab = tokenizer.get_vocab()

# visualize the first 20 tokens in the vocabulary
pd.DataFrame(vocab.items(), columns=['token', 'id']).head(20)
   token   id
0      o   53
1      n   52
2      S   33
3      R   32
4     no  136
5      Y   38
6      4   12
7      j   48
8     pe  137
9     ch  117
10     .    9
11     9   14
12    wh  114
13     u   59
14    at  116
15    la  113
16     P   31
17   And  115
18     -    8
19   are  128

Now that we’ve trained the tokenizer, we can encode our text into integer ids:


encoded = tokenizer.encode(text).ids

print(encoded[0:20])
len(encoded) # total number of tokens in the encoded text
[18, 78, 45, 138, 58, 59, 113, 58, 47, 78, 57, 99, 34, 53, 42, 39, 81, 47, 74, 101]
2764
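
We can also decode ids back into text. One caveat: since we haven’t configured a decoder component that knows how to glue subword tokens back together, the decoded text typically contains some extra whitespace between tokens. We’ll clean this up with a small helper function when we generate text later. A quick check (the exact output depends on the merges the tokenizer learned):

print(tokenizer.decode(encoded[0:20]))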

A First Language Model: Markov Models

We’ll now introduce Markov models, one of the simplest types of generative language models. Markov models are relatively easy to describe mathematically, and they can be implemented from scratch with a manageable amount of code.

In Markov models, we generate new text one token at a time, and the probability of each new token depends on the tokens generated before it. Suppose that we have generated a sequence of \(k-1\) tokens \(T_1, T_2, \ldots, T_{k-1}\). The probability that we generate token \(t\) at position \(k\) can, in full generality, be described by the conditional probability distribution over the entire history:

Here, the notation \(\mathbb{P}(T_k = t)\) can be interpreted as “the probability that the undetermined token \(T_k\) in position \(k\) turns out to have value \(t\).” The notation in the equation below can be read: “Assuming it were the case that \(T_{k-1} = t_{k-1},\ldots,T_{1} = t_1\), then the probability that \(T_k = t\) would be…”

\[ \begin{aligned} \mathbb{P}(T_k = t\mid T_{k-1} = t_{k-1}, T_{k-2} = t_{k-2}, \ldots , T_1 = t_1) \end{aligned} \]

As written, the \(k\)th token depends on all previous tokens. This is a very general formulation, but it’s also very difficult to work with in practice: for example, storing a table containing all possible combinations of previous tokens of arbitrary length is infeasible. Fortunately, it’s also not really necessary: we might reasonably assume that the next token doesn’t depend, or depends only very weakly, on a token thousands of positions back.

To create a Markov model, we make the simplifying assumption that the next token depends only on the previous \(n\) tokens, where \(n\) is a fixed integer called the context length. This gives us the following simplified conditional probability distribution:

\[ \begin{aligned} &\mathbb{P}(T_k = t\mid T_{k-1} = t_{k-1}, T_{k-2} = t_{k-2}, \ldots , T_1 = t_1)= \\ &\mathbb{P}(T_k = t\mid T_{k-1} = t_{k-1}, T_{k-2} = t_{k-2}, \ldots , T_{k-n} = t_{k-n})\;. \end{aligned} \]

The idea here is that we can now store a table of probabilities for each possible sequence of \(n\) tokens, rather than for every possible sequence of arbitrary length.

Let’s go ahead and begin considering our implementation. Suppose that, in our text, we have the sequence of tokens

ids = tokenizer.encode("you will ").ids
ids
[80, 129]

Suppose further that our context length is \(n = 2\). We now need to figure out a table of probabilities of the form

\[ \begin{aligned} \mathbb{P}(T_k = t_k \mid T_{k-1} = \text{will}, T_{k-2} = \text{you}) \text{ for all possible tokens } t_k \text{ in the vocabulary.} \end{aligned} \]

The simplest way to construct this table of probabilities is to look through the training text and inspect what happens after the sequence “you will” in the text. Here’s a simple manual loop which does exactly this:

context_length = len(ids)

following_ids = [encoded[ix + context_length] 
                    for ix in range(len(encoded) - context_length) 
                    if encoded[ix:ix+context_length] == ids]

following_tokens = tokenizer.decode(following_ids).split()
following_tokens
['s', 't', 'f', 'go', 'go', 'go', 'h']

We can then sample from these possible following tokens to generate a new one. To sample, we just select a given following token with probability equal to the fraction of times it appears in the list of following tokens. In our working example, we can compute the counts and probabilities of each possible following token like so. To match our upcoming implementation, we’ll collect each of the following tokens in a dictionary that maps from token to count:


following_counts = defaultdict(int)
for token in following_tokens:
    following_counts[token] += 1

print(following_counts)
defaultdict(<class 'int'>, {'s': 1, 't': 1, 'f': 1, 'go': 3, 'h': 1})

This dictionary collects facts like, for example, that the token “go” followed “you will” 3 times in the training text, while the token “h” followed it only once.
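
For example, under this scheme, the probability that we assign to the token “go” following “you will” is its count divided by the total count of all observed continuations:

\[ \begin{aligned} \mathbb{P}(T_k = \text{go} \mid T_{k-1} = \text{will}, T_{k-2} = \text{you}) = \frac{3}{1 + 1 + 1 + 3 + 1} = \frac{3}{7}\;. \end{aligned} \]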

The following function samples from such a dictionary proportional to the number of counts for each token:

def sample_from_counter(counter):
    total = sum(counter.values())
    probabilities = [count / total for count in counter.values()]
    return random.choices(list(counter.keys()), probabilities)[0]
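
As a quick sanity check on this sampling scheme, we can draw a large number of samples and confirm that the empirical frequency of “go” is close to the 3/7 we’d expect from its counts (the exact number will vary from run to run):

samples = [sample_from_counter(following_counts) for _ in range(10000)]
print(samples.count("go") / len(samples))   # should be close to 3/7 ≈ 0.43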

Rerunning this function will give us different samples of tokens which could follow our input sequence:

for _ in range(5):
    print("you will " +  sample_from_counter(following_counts))
you will go
you will t
you will go
you will t
you will h

From here, we could imagine simply enclosing this entire process in a loop which repeatedly samples new tokens based on the previous \(n\) tokens. This is exactly what a Markov model does.

Formal Specification

The distinguishing feature of a Markov model is that the next generated token depends only on the previous \(n\) tokens, where \(n\) is a fixed integer called the context length. So, to run a Markov model, we check the previous \(n\) tokens, look up all the possible next tokens which follow that sequence in the training text, and sample one of those tokens to produce the next token.

Like every language model, a Markov model needs three core ingredients:

  1. Some internal states which describe what it has “learned” in training.
  2. A train method which updates the internal states based on supplied training text.
  3. A generate method which produces new text based on the internal states.

Let’s take these three pieces one by one.

Internal States

Our implementation of a Markov model will keep track of three internal states: the context length \(n\), a mapping from contexts (tuples of \(n\) tokens) to possible next tokens and their counts, and a count of how many times each context appears in the training text.


class MarkovModel:
    def __init__(self, n):
        self.n = n
        self.transitions = defaultdict(lambda: defaultdict(int))
        self.context_counts = defaultdict(int)

The internal state transitions is a nested dictionary which maps from context tuples to dictionaries of next-token counts. To query it, we’ll write something like self.transitions[context][next_token], which returns the count of how many times next_token follows context in the training text. The internal state context_counts maps from context tuples to the total number of times that context appears in the training text; we can use this for sampling a new context in case the user doesn’t provide one.
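
Here is a toy illustration of this structure, using word strings rather than integer ids for readability (the real model will store whatever token ids we train it on). The counts shown mirror the following_counts dictionary we built by hand above:

from collections import defaultdict

# with n = 2, transitions maps a 2-token context to a counter of next tokens
transitions = defaultdict(lambda: defaultdict(int))
transitions[("you", "will")]["go"] += 3
transitions[("you", "will")]["h"] += 1

print(transitions[("you", "will")]["go"])   # 3
print(transitions[("you", "will")]["fly"])  # 0: unseen continuations default to zero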

Training

The train method of the Markov model simply takes a list of tokens, loops through them, and updates the internal states. Each time we see a sequence of n tokens, we look at the next token and increment its count in the transitions dictionary. We also increment the count of how many times we’ve seen that context in the context_counts dictionary. Here’s the code:

def train(self, tokens):

    for i in range(len(tokens) - self.n):                                  # (1)
        context = tuple() if self.n == 0 else tuple(tokens[i:i + self.n])  # (2)
        next_token = tokens[i + self.n]

        self.transitions[context][next_token] += 1                         # (3)
        self.context_counts[context] += 1

  1. Loop through the entire list of tokens, stopping n tokens before the end to avoid indexing errors.
  2. Set the context equal to the current sequence of n tokens. If n is zero, the context is the empty tuple.
  3. Look at the next token after the context. Increment its count in the transitions dictionary and increment the count of how many times we’ve seen this context.

Generation

The implementation of the generate method requires the most code, but it is essentially just the manual sampling approach we followed above, wrapped in a loop. Our implementation below allows the user to specify a starting context; if none is provided, we sample one from the context_counts dictionary. We then repeatedly sample new tokens based on the previous \(n\) tokens until we’ve generated the desired number of tokens. Here’s the code:

def generate(self, start_context = None, n_to_generate = 100):

    # randomly sample a starting context if none is provided
    if start_context is None: 
        start_context = sample_from_counter(self.context_counts)
    
    # use the starting context (provided or sampled), converted to a tuple
    context = tuple(start_context)

    # build up a sequence of generated tokens, starting with the current context
    result = list(context)

    # until we've generated enough tokens
    for _ in range(n_to_generate):
        
        # generate a new token based on the context and add to result
        next_token = sample_from_counter(self.transitions[context])
        result.append(next_token)

        # update the context: most recent n tokens
        context = tuple(result[-self.n:]) if self.n > 0 else tuple()
    return result

Putting It All Together

Here’s the complete implementation of the Markov model. The full implementation of generate includes error handling to deal with the case where the current context has never been seen in training; in that case, we simply resample a new context from the training data.

class MarkovModel:
    def __init__(self, n):
        self.n = n
        self.transitions = defaultdict(lambda: defaultdict(int))
        self.context_counts = defaultdict(int)

    def train(self, tokens):

        for i in range(len(tokens) - self.n):
            context = tuple() if self.n == 0 else tuple(tokens[i:i + self.n])
            next_token = tokens[i + self.n]
            self.transitions[context][next_token] += 1
            self.context_counts[context] += 1
                
    def generate(self, start_context = None, n_to_generate = 100):

        if start_context is None:
            start_context = sample_from_counter(self.context_counts)
        
        context = tuple(start_context)
        result = list(context)
        for _ in range(n_to_generate):
            
            try: 
                next_token = sample_from_counter(self.transitions[context])
                result.append(next_token)
                context = tuple(result[-self.n:]) if self.n > 0 else tuple()
            except IndexError:
                context = sample_from_counter(self.context_counts)

        return result
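
Before we train on our BPE-encoded Seuss text, here’s a quick sketch of using the class directly. Nothing in the model is specific to BPE ids, so we can just as well train on the word-level tokens from earlier (since generation is random, your output will differ from run to run):

# train a word-level Markov model with context length 2 and generate 30 tokens
word_model = MarkovModel(n=2)
word_model.train(word_tokens)

print(" ".join(word_model.generate(n_to_generate=30)))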

Training and Generating Text

We can build, train, and generate text with a Markov model like this. We’ve implemented a format_seuss function to clean up the generated text a bit for readability. We’ve also implemented a generation_experiment function which trains a Markov model with a given context length n and generates some text for convenient experimentation with many context lengths.

import re 

def format_seuss(text):
    # the decoded text contains extra spaces between subword tokens;
    # remove whitespace that immediately precedes a letter, an apostrophe,
    # or a punctuation mark, then collapse any remaining double spaces
    output = re.sub(r'(\s)(?=[A-Za-z])', r'', text)
    output = re.sub(r'(\s)(?=\')', r'', output)
    output = re.sub(r'(\s)(?=[!.?,)()])', r'', output)
    output = output.replace('  ', ' ')
    return output

def generation_experiment(n, n_to_generate=200):
    model = MarkovModel(n)
    model.train(encoded)

    generated_ids = model.generate(n_to_generate=n_to_generate)
    generated_tokens = tokenizer.decode(generated_ids)
    print("-"*40)
    print(format_seuss(generated_tokens))
    print("-"*40)

Let’s go ahead and try generating some text with different context lengths. We can first try a context length of \(n = 0\), which corresponds to simply sampling tokens according to their overall frequency in the training text:

generation_experiment(n=0, n_to_generate=200)
----------------------------------------
P, d oad oowo er 're   .
r cOyfe.
're ing a gelashnthuthe t gfSslstcsoingKt feg or Youycht to g doA t d dUgoq e yfor hrwhpeth y m, ariothat t for cest rt  cppe you soto in to s ghKecl.
ing lthe d And ay ingleroao Ys .

gowhouto y 4, cooyouereyyou owin gesO iaman inaner glo'eeuceout 
io 
k are 
cYaeisoSsyour ayour ereouonxply Yt ry n
----------------------------------------

We can see some common sequences of English characters in here, but overall this is not very readable.

Let’s try increasing the context length to 1:

generation_experiment(n=1, n_to_generate=200)
----------------------------------------
ound for you.

Bick you will you lscout whay staran'tay ded.

And It,
And thrairightecos 
that some! and 3 / 4 to rainntaust in to rshuneas or waitings you're not.

You'll place,
you're nott.

NSis ppening for thould you cak and onfen staroun 
dow.
AChieve are and stakeh you'll hos 
the, not bacces that you'll be ause to your plirs 
bituxbuner of your disangulaces besnay go com
----------------------------------------

Still pretty difficult to read, although we can see the “words” becoming slightly more reminiscent of English. Let’s try increasing the context length to 2:

generation_experiment(n=2, n_to_generate=200)
----------------------------------------
ay long.
You'll pase,
as fun to you.

You happen 
and foost worry to sads at a break.

OH!
THE PLACES YOUNTAINS!
 
So..for people just waiting, percent go around for Fride high!
Read.
A place where the street,
you're in a Slumplane to go.

You'll get on you will fly get all the requarants.
You're off and awatching you won't 
and foget alancing Ac
----------------------------------------

At context length \(n = 2\), observe that the generated text begins to consist largely of real English words, although none of the grammar works even for Dr. Seuss.

When we dial up to \(n = 4\), suddenly the text looks correctly structured, with approximately the right line lengths, and many comprehensible phrases. Most of the sentences are still nonsense, but some of them are quite close to real English.

generation_experiment(n=4, n_to_generate=200)
----------------------------------------
mes too.
Games you can't win 
'cause you're that kind of a guy!
 
Oh, the places you'll go! There are games too.
Games you can't win 
'cause you're that kind of a guy!
 
Oh, the places you'll go! There are points to be scored. There is fun to be done!
There are some, down the road between hither and yon,
that can scare you right out of your pants.
There are points to be scored
----------------------------------------

When we increase the context length sufficiently, something interesting happens:

generation_experiment(n=15, n_to_generate=200)
----------------------------------------
ay night or waiting, perhaps, for their Uncle Jake or a pot to boil, or a Better Break or a string of pearls, or a pair of pants or a wig with curls, or Another Chance.
Everyone is just waiting.

Waiting for the fish to bite or waiting for the wind to fly a kite or waiting around for Friday night or waiting, perhaps, for their Uncle Jake or a pot to boil, or a Better Break or a string of pearls
----------------------------------------

This text is quite coherent and could perhaps have been written by Dr. Seuss himself.

The Problem of Memorization

In fact, this text was written by Dr. Seuss himself – it’s an exact copy of a portion of the training text. When the context length is very large, the Markov model simply memorizes long sequences of tokens from the training text and reproduces them verbatim. This is another example of overfitting – the model has fit to the fine details in the data, not the overall picture.
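
In our small setting, we can test for this kind of memorization directly: check whether a generated id sequence appears contiguously in the training ids. Here’s a rough sketch (it only detects exact, contiguous copies, and the result will vary from run to run since generation is random):

model = MarkovModel(15)
model.train(encoded)
generated = model.generate(n_to_generate=100)

# True if the generated ids form a contiguous sub-sequence of the training ids
def is_verbatim(sub, full):
    return any(full[i:i + len(sub)] == sub for i in range(len(full) - len(sub) + 1))

print(is_verbatim(generated, encoded))   # usually True at this context length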

An important legal and social question is whether contemporary language models memorize their training data. If models do memorize their data, and if this data is under copyright (as much of the training data used by LLMs is), then this may constitute copyright infringement.

Developers of LLMs of course have a vested interest in claiming that their models do not memorize their training data, or that models learn from their training data in ways similar to humans. However, recent work suggests that even modern production LLMs do in fact memorize significant portions of their training data. The scientific consensus, as well as the legal landscape, is rapidly evolving.