Humans and machines process language in fundamentally different ways. A common misconception, particularly among those outside the technological domain, is that machines interpret language in the same manner as humans. In reality, machines do not possess linguistic understanding; instead, they operate on mathematical and symbolic representations of text.
To enable effective interaction between humans and machines, natural language must be transformed into a form suitable for computation. Modern large language models (LLMs) achieve this through a sequence of pre-processing steps. A foundational step in this process is tokenization, which determines how text is decomposed into basic units and turned into tokens.
A token is an element of a finite set called a vocabulary:
$$ \mathcal{V} = \{ t_1, t_2, \dots, t_{|\mathcal{V}|} \} $$
Each token \(t_i\) can be a word, sub-word, or character, depending on the tokenization process. At a high level, tokenization transforms raw text into a sequence of discrete symbols: indivisible, countable units drawn from a finite set, where each unit has a unique identity that can be processed by a machine learning model. There are several tokenization techniques; current LLMs use a modified version of a process known as Byte-Pair Encoding (BPE).
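To make the idea concrete, here is a toy sketch in Python. The vocabulary and the greedy longest-match strategy below are invented purely for illustration; real LLM tokenizers build their vocabularies differently, as the rest of this post explains.

```python
# Toy illustration: a vocabulary maps text spans to discrete tokens.
# This vocabulary is made up for demonstration; real LLM vocabularies
# contain tens of thousands of sub-word tokens.
vocab = {"token": 0, "ization": 1, "in": 2, "action": 3, " ": 4}

def tokenize(text, vocab):
    """Greedily match the longest token in `vocab` at each position."""
    tokens = []
    i = 0
    while i < len(text):
        for j in range(len(text), i, -1):   # try the longest match first
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            raise ValueError(f"cannot tokenize at position {i}")
    return tokens

print(tokenize("tokenization in action", vocab))
# ['token', 'ization', ' ', 'in', ' ', 'action']
```

Each resulting token can then be replaced by its integer ID from the vocabulary, which is what the model actually consumes.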
What BPE is and how it works
The original algorithm was introduced by Philip Gage in 1994 as a data-compression technique: the most frequent pair of adjacent symbols in the data is repeatedly replaced by a new symbol. Modern tokenizers adapt this idea to build a vocabulary.
One way to visualize it is to take a particular string of text:
$$\text{aabacaacbacb}$$
We define our string of text as a sequence where each \(s_i\) represents an individual character.
\(\mathcal{S}=s_1,s_2,\ldots, s_n\)
In the case of the example, we break down the string we used:
$$\mathcal{S}=a,a,b,a,c,a,a,c,b,a,c,b$$
We also define the variable \(\Sigma\), which represents the set of distinct characters in our string of text:
$$\Sigma = \{ s_1, s_2, \ldots, s_n \}$$
(as a set, repeated characters collapse into a single element). In our example:
$$\Sigma = \{ a, b, c \}$$
The next step is to count the frequency of adjacent pairs in the sequence \(\mathcal{S}\), which we represent by a function \(f(x,y;\mathcal{S})\), and select the pair with maximum frequency, which can be written mathematically as:
$$(x_i, y_i) = \arg\max_{(x,y)\,\in\,\text{adjacent pairs of } \mathcal{S}} f(x,y;\mathcal{S})$$
To define \(f(x,y;\mathcal{S})\), we index the sequence \(\mathcal{S}\) by a position \(k\), where \(s_k\) denotes the character at position \(k\).
In our example, we can write
| \(k\) | \(s_k\) |
|---|---|
| 1 | a |
| 2 | a |
| 3 | b |
| 4 | a |
| 5 | c |
| 6 | a |
| 7 | a |
| 8 | c |
| 9 | b |
| 10 | a |
| 11 | c |
| 12 | b |
Now we want to count adjacent pairs. We write a pair as \((s_k, s_{k+1})\), since we want each character together with the one right next to it, and we must be careful not to count past the end of the sequence. We can do that with an indicator function:
$$\mathbf{1}_{(s_k,s_{k+1})=(x,y)}$$
This function evaluates to 1 when the specified pair is found and 0 otherwise:
$$\mathbf{1}_{(s_k,s_{k+1})=(x,y)} = \begin{cases} 1 & \text{if } (s_k, s_{k+1}) = (x,y) \\[2mm] 0 & \text{otherwise} \end{cases} $$
Now we can define our function \(f(x,y;\mathcal{S})\):
$$f(x,y; S) = \sum_{k=1}^{n-1} \mathbf{1}_{(s_k, s_{k+1}) = (x,y)}$$
In our example, we can work through a prefix to show how it looks in practice: take \(\mathcal{S}=aabacaa\) and count \(f(a,a;\mathcal{S})\):
| \(k\) | \((s_k, s_{k+1})\) | \(\mathbf{1}_{(s_k,s_{k+1}) = (a,a)}\) |
|---|---|---|
| 1 | (a,a) | 1 |
| 2 | (a,b) | 0 |
| 3 | (b,a) | 0 |
| 4 | (a,c) | 0 |
| 5 | (c,a) | 0 |
| 6 | (a,a) | 1 |
We repeat the same process across all pairs of the full string \(\mathcal{S}=aabacaacbacb\) to get our frequency table. We only include the pairs that actually occur in the string:
| Pair | Count |
|---|---|
| (a,a) | 2 |
| (a,b) | 1 |
| (b,a) | 2 |
| (a,c) | 3 |
| (c,a) | 1 |
| (c,b) | 2 |
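The frequency function \(f(x,y;\mathcal{S})\) is easy to sketch in Python: `collections.Counter` over adjacent positions reproduces the table above.

```python
from collections import Counter

# The example sequence S = a,a,b,a,c,a,a,c,b,a,c,b
S = list("aabacaacbacb")

def pair_frequencies(seq):
    """f(x, y; S): how often the pair (x, y) occurs at adjacent positions."""
    return Counter(zip(seq, seq[1:]))

freqs = pair_frequencies(S)
for pair, count in freqs.items():
    print(pair, count)
# ('a', 'a') 2, ('a', 'b') 1, ('b', 'a') 2, ('a', 'c') 3, ('c', 'a') 1, ('c', 'b') 2
```

`zip(seq, seq[1:])` stops at the last valid pair, which plays the role of the indicator function's bounds check.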
We want to merge the most frequent pair and create a new token \(z\), such that
$$z = x\oplus y$$
Here \(\oplus\) represents concatenation (the merge of characters), and \(x, y\) represent the two elements of the pair. For example, in our previous table the most frequent combination is \((a,c)\), so we concatenate it:
$$x\oplus y = a \oplus c=ac$$
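A minimal sketch of the merge step in Python, assuming the common convention of left-to-right, non-overlapping replacement:

```python
def merge_pair(seq, pair):
    """Replace every non-overlapping occurrence of `pair` (left to right)
    with the concatenated token z = x ⊕ y."""
    x, y = pair
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == (x, y):
            out.append(x + y)   # the new token z
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

print(merge_pair(list("aabacaacbacb"), ("a", "c")))
# ['a', 'a', 'b', 'ac', 'a', 'ac', 'b', 'ac', 'b']
```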
At this point, we can begin constructing the vocabulary. Initially, the vocabulary consists of the set of all characters present in the original string.
$$ \mathcal{V}_0 = \Sigma $$
As we concatenate our characters, we keep the \(\mathcal{V}_0\) intact. At each step, the updated vocabulary is the union of the previous vocabulary and the newly merged token:
$$ \mathcal{V}_i = \mathcal{V}_{i-1} \cup \{ x_i \oplus y_i \} $$
So initially, with our first concatenation, it would look like this:
$$ \mathcal{V}_1=\{a,b,c,ac\} $$
And it means we now update our string, applying the merge everywhere the pair occurs:
$$ \mathcal{S} = a, a, b, ac, a, ac, b, ac, b $$
We apply the same process, looking for the most frequent adjacent pair for \(\mathcal{V}_2\). Here \((b,ac)\) and \((ac,b)\) are tied with two occurrences each; breaking the tie by taking the pair that occurs first, we merge \((b,ac)\) and get:
$$ \mathcal{V}_2=\{a,b,c,ac,bac\} $$
And now we have a new string of text:
$$ \mathcal{S} = a, a, bac, a, ac, bac, b $$
For \(\mathcal{V}_3\), every remaining pair occurs exactly once; choosing the first such pair, \((a,a)\), leads us to the final vocabulary:
$$ \mathcal{V}_3=\{a,b,c,ac,bac,aa\} $$
In general, after performing a merge we continue the process iteratively until a stopping criterion is met, such as a target vocabulary size. The final vocabulary is the union of \(\mathcal{V}_0\) and all \(M\) merges performed during training:
$$ \mathcal{V} = \mathcal{V}_0 \cup \bigcup_{i=1}^{M} \{ x_i \oplus y_i \} $$
In our final example, we can write it as
$$ \mathcal{V} = \{ a, b, c, ac, bac, aa \} $$
Here we introduce another element that goes along with the algorithm, \(T\), which is our merge table:
$$ T = [(x_1,y_1),(x_2,y_2),\ldots,(x_M,y_M) ] $$
Each entry \((x_i, y_i)\) corresponds to the pair merged at step \(i\). Order matters here: merges are applied in this exact order during the tokenization of a new string.
At the end of training, \(T\) contains the ordered list of all merges. In our example:
$$ T = [(a,c),(b,ac),(a,a) ] $$
In other words, \(T\) keeps track of the exact merge order needed to tokenize new text consistently.
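Putting the pieces together, here is a toy character-level BPE trainer plus an encoder that replays \(T\) on new text. Note that tie-breaking between equally frequent pairs is implementation-defined (here, the pair counted first wins), so the exact order of tied merges can differ between implementations.

```python
from collections import Counter

def apply_merge(seq, pair):
    """Replace each non-overlapping occurrence of `pair`, left to right."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(pair[0] + pair[1])
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

def train_bpe(text, num_merges):
    """Toy BPE training: V_0 is the character set, T the ordered merge table."""
    seq = list(text)
    vocab = set(seq)          # V_0
    merges = []               # T
    for _ in range(num_merges):
        counts = Counter(zip(seq, seq[1:]))
        if not counts:
            break
        # most frequent pair; ties broken by first occurrence
        best = max(counts, key=counts.get)
        merges.append(best)
        vocab.add(best[0] + best[1])
        seq = apply_merge(seq, best)
    return vocab, merges

def encode(text, merges):
    """Tokenize new text by replaying T in training order."""
    seq = list(text)
    for pair in merges:
        seq = apply_merge(seq, pair)
    return seq

vocab, T = train_bpe("aabacaacbacb", 3)
print(T)                      # [('a', 'c'), ('b', 'ac'), ('a', 'a')]
print(encode("abacab", T))    # ['a', 'bac', 'a', 'b']
```

Running the trainer on the example string recovers the vocabulary \(\{a,b,c,ac,bac,aa\}\), and `encode` shows how \(T\) tokenizes text it has never seen.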
What I have shown so far is a simplification, since it is easier to visualize with characters. In practice there is an extra step: after we separate our string of text, we turn it into bytes, hence the name byte-pair encoding. The reason is that computers do not read characters the way we do; we need to transform text into a system of information that machines can interpret.
How BPE is treated in LLMs
Modern BPE (as used in LLMs) does not operate on characters as we saw in the example. It operates on a finite alphabet of bytes:
$$ \Sigma = \{0,1,\dots,255\} $$
The reason it is 256 lies in the definition of a bit (binary digit), the most basic unit of information in computing and digital communication. A bit represents one binary choice, \(\{0,1\}\): yes/no, on/off, true/false. Integers, characters, images, audio, and video are all collections of bits, and we can group bits to encode more complex information. In our case, a group of 8 bits is defined as a byte, so we have \(2^8 = 256\) possibilities:
$$\text{byte} \in \{0,1\}^8$$
We represent a byte as an integer in \(\{0,1,\ldots,255\}\), where each number corresponds to a different sequence of 8 bits. To interpret that information as text, we need a mapping; current LLMs use UTF-8.
UTF-8 stands for Unicode Transformation Format, where the ‘8’ refers to the use of 8-bit bytes.
Unicode is the encoding standard designed to support text in all of the world’s writing systems that can be digitized. This allows LLMs to process text from different writing systems under a unified encoding. For further reading on how Unicode works at a deeper level, I recommend “Unicode Demystified” by Richard Gillam, which explains the role Unicode and UTF play in computer language; here we will keep the focus on the role encoding plays in tokenization.
The way we use it: after we separate the string of text, we translate it into its UTF-8 bytes, which are conventionally written in hexadecimal notation such as “0x00”. This is simply another way to write a group of 8 bits (see the further reading recommended previously for how this notation works). Each byte already has a token ID equal to its numeric value.
| Byte | Token ID |
|---|---|
| 0x00 | 0 |
| 0x01 | 1 |
| … | … |
| 0x41 | 65 |
| … | … |
| 0xFF | 255 |
So the first 256 Token IDs are reserved for single bytes such as
$$ \mathcal{V}_0 = \{ b_i \mid b_i \in \{0,\dots,255\} \} $$
And the Token ID is defined as a mapping function
$$ \text{ID}_0: \mathcal{V}_0 \to \mathbb{Z} $$
And as we saw in the tables, we can write it like:
$$ \text{ID}_0(b_i)=i $$
so
$$\text{ID}_0(\text{0x00}) = 0,\quad \text{ID}_0(\text{0x01}) = 1,\quad \ldots,\quad \text{ID}_0(\text{0xFF}) = 255$$
During the merging process, when creating new tokens, the token ID will continue to grow accordingly. Therefore, if a merge of 2 bytes occurs, the next assigned token ID will be 256, and so on.
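The byte-to-ID correspondence can be checked directly with Python’s built-in UTF-8 encoder: `str.encode` returns raw bytes, and their integer values are exactly the first 256 token IDs.

```python
# Each byte's token ID is simply its integer value, as in the table above.
text = "Hi!"
byte_ids = list(text.encode("utf-8"))
print(byte_ids)                    # [72, 105, 33]

# Non-ASCII characters span several bytes under UTF-8:
print(list("é".encode("utf-8")))   # [195, 169]

# 0x41 -> 65, matching the table row for byte 0x41 ('A')
print(list("A".encode("utf-8")))   # [65]
```

Note that a single character does not always map to a single byte: under UTF-8, characters outside ASCII occupy two to four bytes, each with its own token ID before any merges apply.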
To get all the tokens needed for an LLM, we train over a finite collection of texts known as a corpus. In the case of modern LLMs, this is all the data fed to the model during training: web data, books, articles, and so on. In models such as GPT-3, training is estimated to have involved roughly 300 billion tokens.
For modern LLMs like GPT-4, estimates range from hundreds of billions to multiple trillions of tokens, though exact figures are not publicly disclosed. This corresponds to terabytes of raw text once encoded in UTF-8 and pre-processed.
We can denote a corpus like:
$$\mathcal{C} = \{ T_1, T_2, \ldots, T_N \}$$
where each \(T_i\) is a document or string of text, and \(L_i\) is the length of document \(T_i\):
$$T_i = c_1, c_2, \dots, c_{L_i}, \quad c_j \in \Sigma$$
So BPE is applied over the corpus, and every token in the resulting vocabulary is assigned a token ID. Training over these massive corpora of billions or trillions of tokens, we end up with the set we defined at the start:
$$ \mathcal{V} = \{ t_1, t_2, \dots, t_{|\mathcal{V}|} \} $$
With the information we have seen, we can now understand how tokenization works.
