10 Prediction to Generation

Transitioning from Predictive Models to Generative Models

Open the live notebook in Google Colab or download the live notebook.

Prediction to generation

Let’s start today by loading a “small” pre-trained language model. We will use the HuggingFace transformers library which provides model definitions and associated tools for a wide variety of transformer models. The library makes it (relatively) easy to load and work with pre-trained models. Here will we use the GPT2 model, the smallest variant of the GPT2 family of models released by OpenAI (Radford et al. 2019). The original model card is available here. We chose this model because because it just trained for next-token prediction without the chat interface (and because it is small and fast to work with).

Note that small still means 124 million parameters in the core model!

Note that the initial model loading may take a few minutes as the model weights are downloaded from HuggingFace’s model repository.

import torch # Common imports for Pytorch
import torch.nn.functional as F

from transformers import AutoTokenizer, AutoModelForCausalLM
checkpoint = "openai-community/gpt2"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)
model.eval();  # Put model in evaluation mode

We downloaded two components: the model itself and a tokenizer. The tokenizer is responsible for converting text into the integer token IDs that form the model’s vocabulary. This tokenizer has a vocabulary size of 50257 tokens, a sample of which is shown below.

The ‘Ġ’ character is how a space is encoded with this tokenizer, i.e., many of the tokens include a preceding space.

import itertools

# Extract first 10 tokens in vocabulary (without creating intermediate data structures)
dict(itertools.islice(tokenizer.vocab.items(), 10))

{'1988': 26709,
 'bnb': 31971,
 'ĠRandy': 21248,
 '*****': 35625,
 'ĠEdit': 5312,
 'Ġforgiveness': 26027,
 'each': 27379,
 'ttp': 29281,
 'ĠBur': 5481,
 'Ġoriginally': 6198}

Tokens and tokenization?

We will learn more about tokenization as shortly, but for now, let’s just think of the tokens as the model’s vocabulary. Tokens are not always whole words, and purposely so. The GPT2 tokenizer uses a byte-based encoding scheme that includes all possible bytes, e.g., individual letters, and common “merges” of those bytes into larger tokens, e.g., common word fragments or whole words. For example, the word “alligator” is broken into the following tokens. The ability to break words into sub-word tokens, including individual letters, allows the model to represent (both as input and generated output) any possible text, even if it includes words that were not observed during training.


example_tokens = tokenizer("alligator")
tokenizer.convert_ids_to_tokens(example_tokens["input_ids"])

['all', 'igator']

The GPT2 model was trained to predict the next token in a sequence given the previous tokens. The version we are using here includes the base GPT2 model, and a causal language modeling “head” that maps the base model’s outputs to a probability distribution over the vocabulary, i.e., \(p(t_i|t_1, \ldots, t_{i-1})\), the conditional probability of each token in the vocabulary being the next token given the previous tokens.

A conditional probability distribution, \(p(X|Y)\), is the probability of one event given another event(s), shown on the right side of the \(|\), has occurred.

Let’s consider the input text “See you”. What words (tokens) would you expect to come next? We can ask the model the same question. We tokenize the input (as PyTorch tensors) and pass those token IDs as the inputs to the model. Among the outputs are the logits, the unnormalized log-probabilities for each token in the vocabulary being the next token.

In statistics, logits are formally defined as the log-odds. In this context and with neural networks generally, the term is used more loosely to refer to the unnormalized outputs of a layer prior to applying a normalization function like softmax to convert them to a probability distribution.

inputs = tokenizer("See you", return_tensors="pt")
logits = model(inputs["input_ids"]).logits  # Run model's forward pass with input IDs

The shape of the logits tensor is (batch_size, sequence_length, vocab_size), that is we get a prediction for each “next token” in our sequence, i.e., after “See” and after “See you”, for each input in the batch (here just one), over the entire vocabulary.

logits.shape

torch.Size([1, 2, 50257])

The last index in the sequence dimension corresponds to the prediction after the full input sequence, i.e., after “See you”. We can extract those logits and convert them to a probability distribution using the softmax function.

softmax, \(\text{Softmax}(x_i) = \frac{\exp(x_i)}{\sum_j \exp(x_j)}\), converts unnormalized log-probabilities (logits) to a probability distribution by computing the exponential of each value and normalizing by the sum of all exponentials.


# Extract last logits in the 'sequence' dimension, recall -1 is the last index in Python
next_token_logits = logits[:, -1, :] 
probs = F.softmax(next_token_logits, dim=-1)  # Convert logits to (batch_size, vocab_size) probabilities

The largest of these probabilities corresponds to the most likely next token according to the model. We can find that token and convert it back to text using the tokenizer. Seems like a pretty reasonable prediction!


next_token_id = torch.argmax(probs, dim=-1)  # argmax returns the index, i.e., token ID, of the largest value
tokenizer.decode(next_token_id)

' soon'

The next-token prediction is the “heart” of the generative process. We can use this predictor to generate new text auto-regressively, i.e., we append each predicted token to the input sequence to be used to predict the next token, repeating this process for the desired number of tokens. By incorporating the generated tokens into the input sequence, each next-token prediction incorporates both the original input, e.g., “See you”, and the subsequently generated text.

input_ids = tokenizer("See you", return_tensors="pt")["input_ids"]
for _ in range(20):  # Predict 20 tokens
    logits = model(input_ids).logits
    
    next_token_logits = logits[:, -1, :]
    next_token_id = torch.argmax(next_token_logits, dim=-1)
    # Append predicted token to input_ids to auto-regressively generate the next token.
    # next_token_id[:, None] makes 1-D tensor of shape (batch_size,) (batch_size, 1) so
    # we can concatenate it to the input_ids
    input_ids = torch.cat((input_ids, next_token_id[:, None]), dim=-1)
    
input_ids

tensor([[ 6214,   345,  2582,    11,   198,   198,   464,  4453,   286,   262,
         26028,   198,   198,   464,  4453,   286,   262, 26028,    25,   383,
          5172,  3776]])

# Decode entire batch (i.e., don't need to strip the outer batch dimension)
tokenizer.batch_decode(input_ids)

['See you soon,\n\nThe Lord of the Rings\n\nThe Lord of the Rings: The Card Game']

I suspect you find the results a little underwhelming. The algorithm we just implemented is named “greedy decoding” because at each step it deterministically picks the most likely next token. This approach often leads to repetitive and not-very-interesting text. Recall that the output of the model is not a single word, but actually a probability distribution over all possible next tokens. For example:

import pandas as pd

(values, indices) = torch.topk(probs, k=10)  # Find top 10 most probable next tokens
pd.DataFrame({
    "token": [tokenizer.decode(t) for t in indices[0]],
    "probability": values[0].detach().numpy(), # Detach from gradient tracking graph to convert to numpy
})

	token	probability
0	soon	0.084681
1	there	0.073554
2	all	0.072775
3	next	0.072505
4	guys	0.061731
5	in	0.055900
6	out	0.053875
7	on	0.040310
8	later	0.028322
9	,	0.027074

How might be we use this distribution to generate more interesting text? One approach is to sample from the distribution rather than deterministically taking the most likely token. This approach is called “sampling decoding”. We can think of sampling from the multinomial token distribution as a “raffle” in which each token gets a number of tickets proportional to its probability. We then randomly draw one ticket from the raffle to select the next token. Thus while the mostly likely token is still the most likely sample, all other tokens also have a chance of being selected, leading to more diverse and interesting text. Notice that highly probably next token like “soon”, “later”, “guys”, etc. are still likely to be selected, but we observe some other interesting options as well.

Alternately you could think of multinomial sampling as rolling a biased \(n\)-sided die, where the probability of each side is determined by the distribution.


next_token_ids = torch.multinomial(probs, num_samples=20, replacement=True)  # Sample from the distribution
for token_id in next_token_ids[0]:
    print(tokenizer.decode(token_id))

 at
 guys
 out
?!
 within
 Vector
 live
 soon
 next
 in
 next
 here
 at
 here
 to
 guys
 there
 soon
 around
 all

We can now update our generation loop to use sampling decoding rather than greedy decoding.

input_ids = tokenizer("See you", return_tensors="pt")["input_ids"]
for _ in range(20):  # Predict 20 tokens
    logits = model(input_ids).logits
    next_token_logits = logits[:, -1, :]
    
    next_token_ids = torch.multinomial(F.softmax(next_token_logits, dim=-1), num_samples=1)  # Sample one token ID
    input_ids = torch.cat((input_ids, next_token_ids), dim=-1) # next_token_ids is already (batch_size, 1)
    
input_ids

tensor([[ 6214,   345,   319,   262,  3470,   400,  7427,  1711,   286,  3909,
          5265,  7547,  1584,    25,     8, 50256,  9980,  1909,   546,   734,
         17366,   547]])

tokenizer.batch_decode(input_ids)

['See you on the hundredth awesome hour of Saturday Night Live 2016:)<|endoftext|>News today about two teenagers were']

As a practical matter, the transformers library provides higher-level APIs for generating text, including additional decoding procedures, options to generate multiple sequences and more. As an example:

inputs = tokenizer("See you", return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=20, do_sample=True, num_return_sequences=5, pad_token_id=tokenizer.eos_token_id)
tokenizer.batch_decode(output, skip_special_tokens=True)

Results still underwhelming?

I suspect that even with sampling decoding, the results are still a bit underwhelming. Keep in mind that by modern standards, this model, released in 2019, is very small (124 million parameters) and trained on a relatively small dataset (40 GB of internet text). State-of-the-art models today have billions of parameters and are trained on datasets that are orders of magnitude larger. We chose it here for size and speed. We will explore larger and more capable models as we proceed.

Interlude: Generative vs. discriminative models

A statistical generative model, like we used above, is a probability distribution \(p(x)\) that incorporates data and some form prior knowledge about that data. We want this distribution to have the following properties:

If we sample from \(p(x)\), i.e., \(x_{new} \sim p(x)\), then \(x_{new}\) should “look like” the data we used to train the model, e.g., be valid english text.
\(p(x)\) should assign high probability to data that “looks like” the training data, i.e., to real text, and low probability to data that does not, e.g., gibberish.

The first property underlies the generative process, i.e., generating new data, the second enables anomaly detection and other applications.

How is this different than the discriminative models we learned about before? In a discriminative setting, we have both data \(x\) and associated labels \(y\), e.g., text and sentiment labels (“positive” or “negative”), or images and the digit represented in the image. The goal is to predict the label \(y\) given some new data point \(x\), i.e., a discriminative model is a conditional probability distribution \(p(Y|X)\). A discriminative model doesn’t need or attempt to model the distribution of the data itself, \(p(X)\), only the features needed to discriminate (hence the name) between the labels. Thus the discriminative model is simpler, but also limited. It can only be used for tasks involving that conditional probability distribution (i.e., for which we have an input \(x\) and want to predict a label \(y\)).

In contrast, a generative model for the same context learns the joint distribution \(p(X, Y)\) over both the data and labels. This joint distribution can be used to compute the distribution of the data \(p(X)\) (by “summing” or marginalizing out the labels) and the conditional distribution \(p(Y|X)\) (using Bayes’ rule, below). Thus a generative model is more flexible than its discriminative counterpart as it can be used for both generation of new data points and for prediction of labels.

\[ p(Y|X) = \frac{p(X,Y)}{p(X)} \]

We can, then, solve our prediction problems two different ways: using a discriminative model that directly models \(p(Y|X)\), or using a generative model that models \(p(X,Y)\) and then uses Bayes’ rule to compute \(p(Y|X)\). Theoretical and empirical results show that generative models can outperform discriminative models when the amount of training data is limited (i.e., “converge” faster), but that discriminative models often outperform generative models when large amounts of training data are available (Ng and Jordan 2001). The latter are often simpler to build and more computationally efficient, making them a popular choice in practice for solving purely discriminative tasks. Alternately, if you need to generate new data points, or perform unsupervised learning (where you don’t have \(y\)) or estimate the probability of some data, \(p(x)\) (e.g., for anomaly detection), then a generative model is required.

Models from the same family trained to optimize the joint distribution or to optimize the conditional probability \(p(Y|X)\) (or its equivalent), are termed Generative-Discriminative pairs. Interested readers are pointed to follow-on courses to learn about about this duality.

Applying a generative model: Comparing sequences

We noted that a generative model should assign high probability to data that “looks like” the training data, e.g., valid text, and low probability to data that does not. That is \(p(x)\) should be larger for sequences that “look like” the training data. We can use this property to compare different sequences. For example, which of the following two sequences would you expect to be more probable?

“See you later, alligator.”
“See you later, crocodile.”

Both are valid English sentences, but the first is a common slang phrase (with “in a while, crocodile” is a common response), while the second is not. Let’s see what the GPT2 model thinks. We can estimate the probability of each sequence by computing the product of the probability of each token given the previous tokens, i.e., apply the chain rule \(P(t_1, t_2, \ldots) = \prod_i P(t_i | t_1, \ldots, t_{i-1})\). This is also described as the product of the transition scores.

Because these probabilities are often very small, we will compute the log-probability of each sequence (so values close to 0 become large negative numbers) by computing the log-probability of each token given the previous tokens and computing the sum of those log probabilities over all tokens.

Recall that \(log(a\cdot b) = log(a) + log(b)\)

# Insert the "beginning-of-sequence" and "end-of-sequence" tokens to create a "complete" input
alligator_ids = tokenizer(f"{tokenizer.bos_token}See you later, alligator.{tokenizer.eos_token}", return_tensors="pt")["input_ids"]
crocodile_ids = tokenizer(f"{tokenizer.bos_token}See you later, crocodile.{tokenizer.eos_token}", return_tensors="pt")["input_ids"]

def sequence_log_prob(token_ids):
    """Compute the log-probability of a sequence given the model logits."""
    logits = model(token_ids).logits
    
    # Log-encoded probability distribution over vocabulary with shape (batch_size, seq_length, vocab_size)
    probs = F.log_softmax(logits, dim=-1)  
    # prob[:, i, :] is the log-probabilities of transitioning to token[i+1] given token[:i]. Use gather
    # to get the log-probabilities of the actual tokens in the sequence. We skip the first token since it
    # does not have preceding tokens. 
    # We need to add an extra dimension to token_ids for gather, i.e., (batch_size, seq_length) -> (batch_size, seq_length, 1)
    # to gather the values at the token indices in dim=2, the vocab_size dimension.
    token_log_probs = torch.gather(probs, 2, token_ids[:, 1:, None]).squeeze(-1)
    return token_log_probs.sum(dim=-1)  # Sum log-probabilities over sequence length
    

print("'See you later, alligator.' log-probability:", sequence_log_prob(alligator_ids).item())
print("'See you later, crocodile.' log-probability:", sequence_log_prob(crocodile_ids).item())

'See you later, alligator.' log-probability: -31.038818359375
'See you later, crocodile.' log-probability: -35.54851531982422

As we expected, the model assigns a higher probability to the more common saying (it has a less negative log-probability). And compared to gibberish, both of the actual sentences are more likely!

gibberish_ids = tokenizer(f"{tokenizer.bos_token}See you later, asdf.{tokenizer.eos_token}", return_tensors="pt")["input_ids"]
print("'See you later, asdf.' log-probability:", sequence_log_prob(gibberish_ids).item())

'See you later, asdf.' log-probability: -42.26846694946289

The ability to compare sequences this way is a key feature of generative models. It is also key part of some more sophisticated decoding strategies such as beam search. Instead of “committing” to the next token at each step (whether via greedy or sample decoding), beam search keeps track of the \(k\) most likely sequences at each step, expanding each of those sequences with all possible next tokens, and then selecting the \(k\) most likely resulting sequences. Beam search can improve generation quality by not requiring the decoding to “commit” to a token choice until it has seen more (or all) of the generated sequence.

Ng, Andrew, and Michael Jordan. 2001. “On Discriminative Vs. Generative Classifiers: A Comparison of Logistic Regression and Naive Bayes.” In Advances in Neural Information Processing Systems, edited by T. Dietterich, S. Becker, and Z. Ghahramani. Vol. 14. MIT Press. https://proceedings.neurips.cc/paper_files/paper/2001/file/7b7a53e239400a13bd6be6c91c4f6c4e-Paper.pdf.

Radford, Alec, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. “Language Models Are Unsupervised Multitask Learners.” https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf.