Building a GPT-3.5/GPT-4 compatible BPE tokenizer for the .NET Framework 4.8

In this post I will present a GPT-3.5/GPT-4 compatible BPE tokenizer I have written for the .NET Framework 4.8. I am going to explain how the GPT models perform tokenization and how we can mimic the tokenization process locally. My tokenizer is available on my GitHub.

The basic building blocks the GPT models use when processing text are small chunks of letters called tokens. So before processing a text, the GPT models (including ChatGPT) tokenize it using what is called BPE (byte pair encoding) tokenization, creating a list of tokens.

The GPT 3.5 and 4 models recognize about 100,000 tokens (chunks of letters, including spaces), each identified by a unique ID. The complete list of recognized tokens is called the vocabulary of the GPT model. So, while we are inputting a text into the GPT model, what the GPT model processes is a list of IDs from its vocabulary. When we ask ChatGPT “How are you?”, what the ChatGPT model “sees” is the following list of numbers: [15546, 527, 499, 30] (the tokens |How| are| you|?|).

When generating a response, the GPT model again uses the vocabulary of tokens, stringing the response together one token at a time based on the probabilities of the likely next token. It might answer “I am fine”, but in reality it is returning this list of token IDs: [40, 1097, 7060] (the tokens |I| am| fine|).
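To make the relationship between IDs and text concrete, here is a minimal sketch of decoding such a list of IDs back into text. The dictionary is a hypothetical three-entry slice of the real vocabulary of about 100,000 entries, with the lookup direction reversed (ID to token):

using System;
using System.Collections.Generic;
using System.Linq;

class DecodeSketch
{
    static void Main()
    {
        // Hypothetical slice of the vocabulary, reversed so we can look up ID -> token.
        var idToToken = new Dictionary<int, string>
        {
            { 40, "I" }, { 1097, " am" }, { 7060, " fine" }
        };

        // The IDs the model returns for "I am fine".
        var ids = new[] { 40, 1097, 7060 };

        // Decoding is simply concatenating the looked-up tokens.
        Console.WriteLine(string.Concat(ids.Select(id => idToToken[id])));
    }
}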

On the surface one could say that tokenization turns a “language problem” into a “mathematical problem” that – for a computer – is more manageable: given a list of numbers (a question), produce another list of numbers (an answer). Of course this is a simplification, because producing this list of numbers requires knowledge about language, but viewed in isolation the tokenization does turn texts into numbers.

So, to understand how the GPT models work it is interesting to also understand the process of tokenization.

How BPE tokenization works

As an example, I am going to use the sentence “The lasagne hatter”. This sentence does not make much sense, but every sentence can be tokenized, even sentences containing non-existing words. In this case the sentence is tokenized like this: |The| las|agne| h|atter|. It is worth noticing that a token can be a single character, a chunk of characters, or even a complete word.

The chunks of characters do not need to be grammatically significant, as the tokenization is neither a grammatical analysis nor a syllabification (the process of dividing a word into syllables). When the result of a tokenization sometimes looks grammatically “correct” in English, it is because the preference for one grouping of letters over another is based on the prevalence of that group of letters in the training data used to build the vocabulary. This tends to mean that the tokenizer correctly splits out common English suffixes like -tion, -ness and -er, but not because they are grammatically meaningful – simply because they are common.

This can be illustrated with the word hatter, where the tokenization | h|atter| takes precedence over a grammatically meaningful tokenization like | hat|ter|, simply because the token |atter| is a more common collection of letters in the training data than the token | hat|.

Technically, the way this works is as follows: First the text is split into 1-character initial tokens. The tokenizer then runs through a list of merge rules (which is simply a text file where each line contains two tokens that we are allowed to merge). The merge rules will merge 1-character tokens into 2-character tokens, then merge these into larger chunks – slowly building words where possible. The list of merge rules is prioritized, so that common chunks of characters are built (if possible) before uncommon chunks of characters.

Notice that the list of merge rules is about as long as the vocabulary, because the vocabulary contains all single characters as well as the result of every merge rule. This means that a merge rule can never create a chunk of characters that is not in the vocabulary – and, on the other hand, having a token in the vocabulary that cannot be created by a merge rule does not make sense.
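To illustrate this consistency property, here is a minimal sketch (with a hypothetical vocabulary fragment and made-up IDs) that checks that the output of every merge rule – the left token plus the right token – is itself an entry in the vocabulary:

using System;
using System.Collections.Generic;

class MergeRuleCheck
{
    static void Main()
    {
        // Hypothetical vocabulary fragment (token -> ID); the IDs are made up.
        var vocabulary = new Dictionary<string, int>
        {
            { "a", 64 }, { "t", 83 }, { "at", 266 }, { "ter", 466 }, { "atter", 1683 }
        };

        // Merge rules as (left, right) pairs, in priority order.
        var mergeRules = new[]
        {
            Tuple.Create("a", "t"),
            Tuple.Create("at", "ter")
        };

        // Each rule can only ever produce a token that is already in the vocabulary.
        foreach (var rule in mergeRules)
        {
            var merged = rule.Item1 + rule.Item2;
            Console.WriteLine("{0} + {1} -> {2} ({3})", rule.Item1, rule.Item2,
                merged, vocabulary.ContainsKey(merged) ? "in vocabulary" : "NOT in vocabulary");
        }
    }
}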

So, let’s take this reduced list of merge rules (with Ġ representing a space, which is a common convention in the world of OpenAI):

#1: e r
#2: a t
#3: a s
#4: Ġ h
#5: Ġ l
#6: a g
#7: h e
#8: t er
#9: T he
#10: n e
#11: at ter
#12: Ġh at
#13: Ġl as
#14: ag ne

The first rule says that if we see an e followed by an r, we are allowed to merge the two, creating the token er. This makes sense, as the GPT models have been trained on a lot of English text, where er is a common building block of words. So, we simply take the rules one at a time: if we find two tokens next to each other matching the rule, we merge them; otherwise the merge rule is skipped. This results in the following merge operations:

|T|h|e|Ġ|l|a|s|a|g|n|e|Ġ|h|a|t|t|e|r|
|T|h|e|Ġ|l|a|s|a|g|n|e|Ġ|h|a|t|t|er| ← Applied merge rule #1 (e r)
|T|h|e|Ġ|l|a|s|a|g|n|e|Ġ|h|at|t|er| ← Applied merge rule #2 (a t)
|T|h|e|Ġ|l|as|a|g|n|e|Ġ|h|at|t|er| ← Applied merge rule #3 (a s)
|T|h|e|Ġ|l|as|a|g|n|e|Ġh|at|t|er| ← Applied merge rule #4 (Ġ h)
|T|h|e|Ġl|as|a|g|n|e|Ġh|at|t|er| ← Applied merge rule #5 (Ġ l)
|T|h|e|Ġl|as|ag|n|e|Ġh|at|t|er| ← Applied merge rule #6 (a g)
|T|he|Ġl|as|ag|n|e|Ġh|at|t|er| ← Applied merge rule #7 (h e)
|T|he|Ġl|as|ag|n|e|Ġh|at|ter| ← Applied merge rule #8 (t er)
|The|Ġl|as|ag|n|e|Ġh|at|ter| ← Applied merge rule #9 (T he)
|The|Ġl|as|ag|ne|Ġh|at|ter| ← Applied merge rule #10 (n e)
|The|Ġl|as|ag|ne|Ġh|atter| ← Applied merge rule #11 (at ter)
|The|Ġlas|ag|ne|Ġh|atter| ← Applied merge rule #13 (Ġl as)
|The|Ġlas|agne|Ġh|atter| ← Applied merge rule #14 (ag ne)

Notice that the rules are applied in order. We might merge multiple sets of e r in the list of tokens, but once we are done with a rule, we do not return to it. Also notice that rule #12 has no effect. It would have been able to create the token Ġhat, but because it comes after rule #11, the token at has already been merged with the right-side token ter.

In my C# code this loop looks like this:

// Tokenizes the input by applying the merge rules in priority order.
// ToStringBuilder/FromStringBuilder are helper extension methods from the
// same project that convert between the input string and the internal
// token representation.
public IEnumerable<string> Tokenize(string input, MergeLog mergeLog = null)
{
    var tokens = input.ToStringBuilder();
    mergeLog?.Add(tokens.FromStringBuilder());
    foreach (var mergeRule in MergeRules)
    {
        // Apply returns true if the rule merged at least one pair of tokens.
        if (mergeRule.Apply(tokens, mergeLog != null))
            mergeLog?.Add(tokens.FromStringBuilder(), mergeRule);
    }
    return tokens.FromStringBuilder();
}

As you can see, I keep an optional merge log of applied merge rules. This is simply used for debugging and for understanding what is going on behind the scenes.
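The MergeRule.Apply implementation is not shown here – as the Tokenize method above shows, the actual code works on a StringBuilder. Conceptually, though, applying a single merge rule amounts to one left-to-right scan that joins every adjacent pair of tokens matching the rule. A minimal sketch of that idea on a plain List<string>, with hypothetical names:

// Conceptual sketch only - not the actual StringBuilder-based implementation.
// Scans the token list once, left to right, merging every adjacent pair
// that matches the rule, and returns true if at least one merge was made.
static bool ApplyMergeRule(List<string> tokens, string left, string right)
{
    var merged = false;
    for (var i = 0; i < tokens.Count - 1; i++)
    {
        if (tokens[i] == left && tokens[i + 1] == right)
        {
            tokens[i] = left + right;   // e.g. "at" + "ter" -> "atter"
            tokens.RemoveAt(i + 1);
            merged = true;
        }
    }
    return merged;
}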

Identifying the tokens

Now that we have tokenized the text, we can look up the tokens in the vocabulary. The vocabulary is a JSON file mapping tokens to IDs, and for our example it would include these entries:

{
  "Ġh": 305,
  "The": 791,
  "atter": 1683,
  "Ġlas": 5252,
  "agne": 24812
}

Hence the IDs of the tokens would be [791, 5252, 24812, 305, 1683].
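A minimal sketch of this lookup step, with the five vocabulary entries above hard-coded instead of being loaded from the JSON file:

using System;
using System.Collections.Generic;
using System.Linq;

class VocabularyLookup
{
    static void Main()
    {
        // The five vocabulary entries from the JSON snippet above (token -> ID).
        var vocabulary = new Dictionary<string, int>
        {
            { "Ġh", 305 }, { "The", 791 }, { "atter", 1683 },
            { "Ġlas", 5252 }, { "agne", 24812 }
        };

        // The tokens produced by the merge rules for "The lasagne hatter".
        var tokens = new[] { "The", "Ġlas", "agne", "Ġh", "atter" };

        var ids = tokens.Select(token => vocabulary[token]);
        Console.WriteLine(string.Join(", ", ids));  // 791, 5252, 24812, 305, 1683
    }
}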

Motivation, file formats and credits

The process described above is called BPE (byte pair encoding) tokenization. The vocabulary and merge rules for the GPT 3.5 and 4 models are available as part of the tiktoken tokenizer, written in Python and released by OpenAI.

The file formats I am using are the same ones Microsoft.ML.Tokenizers (a NuGet package from Microsoft) uses. The conversion to this file format was made by Joshua Lochner (Xenova).

Having these files means that we can use the Microsoft.ML.Tokenizers package with the following code:

var tokenizer = new Microsoft.ML.Tokenizers.Tokenizer(
    new Bpe("Data/vocab_GPT3_5_turbo.txt", "Data/merges_GPT3_5_turbo.txt")
);
var tokens = tokenizer.Encode("This is a text".Replace(" ", "Ġ"));

Notice that, to get the Ġ convention working with the Microsoft tokenizer, we need to replace the spaces ourselves.

However, my motivation for creating my own tokenizer instead of using Microsoft’s was mostly to get a better understanding of the process via the merge log – but also to be able to embed the merge rules and vocabulary as resources in my DLL instead of having to load external files, as the Microsoft tokenizer requires.

This means that my tokenizer can be used without loading external files:

// Supported models are GPTModel.GPT4 and GPTModel.GPT3_5_turbo
var tokenizer = new Tokenizer(GPTModel.GPT3_5_turbo);
var tokens = tokenizer.Tokenize("This is a text");

The tokenizer returns the tokens as a collection of strings. If you need the IDs of the tokens, use:

var ids = tokenizer.GetIDs(tokens);
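Putting it all together, a round trip for the example sentence from earlier looks like this; based on the walkthrough above, it should yield the five tokens |The|Ġlas|agne|Ġh|atter| with the IDs 791, 5252, 24812, 305, 1683:

// Tokenize the example sentence and look up the token IDs.
var tokenizer = new Tokenizer(GPTModel.GPT3_5_turbo);
var tokens = tokenizer.Tokenize("The lasagne hatter");
var ids = tokenizer.GetIDs(tokens);

Console.WriteLine(string.Join("|", tokens));
Console.WriteLine(string.Join(", ", ids));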

Final thoughts

Understanding the tokenization process is a first step toward understanding how the GPT models process our prompts and how they generate responses by piecing together text one token at a time, based on the probabilities produced by the model. I hope you found this post interesting ☺