Author: Velasquez

  • Chapter 1: Converting Human language into Computer language


    Humans and machines process language in fundamentally different ways. A common misconception, particularly among those outside the technological domain, is that machines interpret language in the same manner as humans. In reality, machines do not possess linguistic understanding; instead, they operate on mathematical and symbolic representations of text.

    To enable effective interaction between humans and machines, natural language must be transformed into a form suitable for computation. Modern large language models (LLMs) achieve this through a sequence of pre-processing steps. A foundational step in this process is tokenization, which determines how text is decomposed into basic units and turned into tokens.

    A token is an element of a finite set called a vocabulary:

    $$ \mathcal{V} = \{ t_1, t_2, \dots, t_{|\mathcal{V}|} \} $$

    Each token \(t_i\) can be a word, sub-word, or character, depending on the tokenization process. At a high level, tokenization transforms raw text into a sequence of discrete symbols: indivisible, countable units drawn from a finite set, where each unit has a unique identity that can be processed by a machine learning model. There are several tokenization techniques; current LLMs use a modified version of a process known as Byte-Pair Encoding (BPE).
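    As a toy illustration of this idea (the vocabulary and IDs below are invented for this example, not taken from any real tokenizer), a sequence of tokens can be mapped to integer IDs through a vocabulary:

    ```python
    # Toy example: a hypothetical vocabulary mapping tokens to integer IDs.
    vocab = {"the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4}

    def to_ids(tokens):
        """Map each token to its unique ID in the vocabulary."""
        return [vocab[t] for t in tokens]

    print(to_ids(["the", "cat", "sat", "on", "the", "mat"]))  # [0, 1, 2, 3, 0, 4]
    ```

    Real vocabularies work the same way, just with tens of thousands of entries learned from data rather than five chosen by hand.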

    What BPE is and how it works

    The original BPE algorithm was introduced as a data-compression technique: it repeatedly replaces the most frequent pair of adjacent symbols in the data with a new symbol.

    One way of visualizing it is to take a particular string of text:

    $$\text{aabacaacbacb}$$

    We define our string of text as a sequence where each \(s_i\) represents an individual character.

    \(\mathcal{S}=s_1,s_2,\ldots, s_n\)


    In the case of the example, we would break down the string we used:

    $$\mathcal{S}=a,a,b,a,c,a,a,c,b,a,c,b$$


    We should also define \(\Sigma\), the set of all distinct characters in our string of text (as a set, repeated characters collapse into a single element):

    $$\Sigma=\{s_1,s_2,\ldots, s_n\}$$


    In our example:

    $$\Sigma=\{a,b,c\}$$

    The next step is to count the frequency of adjacent pairs in the sequence \(\mathcal{S}\), which we represent by a function \(f(x,y;\mathcal{S})\), and select the pair with maximum frequency, which can be written mathematically as:


    $$(x_i, y_i) = \arg\max_{(x,y)\,\in\,\text{adjacent pairs of } \mathcal{S}} f(x,y; \mathcal{S})$$

    To define \(f(x,y;\mathcal{S})\), we index the sequence \(\mathcal{S}\) by a position \(k\), where \(s_k\) denotes the character at position \(k\).

    In our example, we can write

    | \(k\) | \(s_k\) |
    |---|---|
    | 1 | a |
    | 2 | a |
    | 3 | b |
    | 4 | a |
    | 5 | c |
    | 6 | a |
    | 7 | a |
    | 8 | c |
    | 9 | b |
    | 10 | a |
    | 11 | c |
    | 12 | b |

    Now we want to count the adjacent pairs. We can write our pairs as \((s_k,s_{k+1})\), since we want each character together with the one right next to it, and we also want to make sure not to count past the end of the sequence, where there is no pair. We can do that with an indicator function:

    $$\mathbf{1}_{(s_k,s_{k+1})=(x,y)}$$

    This function evaluates to 1 when the specified pair is found and 0 otherwise:

    $$\mathbf{1}_{(s_k,s_{k+1})=(x,y)} = \begin{cases} 1 & \text{if } (s_k, s_{k+1}) = (x,y) \\[2mm] 0 & \text{otherwise} \end{cases} $$

    Now we can define our function \(f(x,y; \mathcal{S})\):

    $$f(x,y; \mathcal{S}) = \sum_{k=1}^{n-1} \mathbf{1}_{(s_k, s_{k+1}) = (x,y)}$$

    In our example, we can work through part of it to show how it would look in practice: take the prefix \(\mathcal{S}=aabacaa\) and count \(f(a,a;\mathcal{S})\):

    | \(k\) | \((s_k, s_{k+1})\) | \(\mathbf{1}_{(s_k,s_{k+1}) = (a,a)}\) |
    |---|---|---|
    | 1 | (a,a) | 1 |
    | 2 | (a,b) | 0 |
    | 3 | (b,a) | 0 |
    | 4 | (a,c) | 0 |
    | 5 | (c,a) | 0 |
    | 6 | (a,a) | 1 |

    We repeat the same process across all pairs in the full sequence \(\mathcal{S}\) to get our frequency table. We only include here the pairs that actually occur in the string:

    | Pair | Count |
    |---|---|
    | (a,a) | 2 |
    | (a,b) | 1 |
    | (b,a) | 2 |
    | (a,c) | 3 |
    | (c,a) | 1 |
    | (c,b) | 2 |
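    As a minimal sketch in plain Python (no external libraries), \(f\) can be implemented as a direct translation of the indicator sum, and the full frequency table recovered with a `Counter`:

    ```python
    from collections import Counter

    def f(x, y, S):
        """f(x, y; S): number of positions k where (s_k, s_{k+1}) = (x, y)."""
        return sum(1 for k in range(len(S) - 1) if (S[k], S[k + 1]) == (x, y))

    S = list("aabacaacbacb")
    print(f("a", "c", S))  # 3

    # The full frequency table, counting every adjacent pair at once:
    freqs = Counter(zip(S, S[1:]))
    print(max(freqs, key=freqs.get))  # ('a', 'c'), the arg max pair
    ```

    The `Counter` version and the indicator sum agree; they are just two ways of writing the same count.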

    We want to merge the most frequent pair and create a new token \(z\), defined as

    $$z = x\oplus y$$

    Here \(\oplus\) represents concatenation (merging) of characters, and \(x,y\) represent each character in the pair. For example, in our previous table we can see that the most frequent pair is \((a,c)\), so we concatenate it:

    $$x\oplus y = a \oplus c=ac$$

    At this point, we can begin constructing the vocabulary. Initially, the vocabulary consists of the set of all characters present in the original string.

    $$ \mathcal{V}_0 = \Sigma $$

    As we concatenate our characters, we keep \(\mathcal{V}_0\) intact. At each step, the updated vocabulary is the union of the previous vocabulary and the newly merged token:

    $$ \mathcal{V}_i= \mathcal{V}_{i-1} \cup \{x_i\oplus y_i\} $$

    So initially, with our first concatenation, it would look like this:

    $$ \mathcal{V}_1=\{a,b,c,ac\} $$

    And it means we now update our string by merging every occurrence of \((a,c)\):

    $$ S = a, a, b, ac, a, ac, b, ac, b $$

    We apply the same process, looking for the most frequent adjacent pair for \(\mathcal{V}_2\). Here \((b,ac)\) and \((ac,b)\) are tied with a count of 2, so we need a tie-breaking rule; a common choice is to take the pair that occurs first in the sequence, which is \((b,ac)\). We get:

    $$ \mathcal{V}_2=\{a,b,c,ac,bac\} $$

    And now we have a new string of text:

    $$S = a, a, bac, a, ac, bac, b$$


    For \(\mathcal{V}_3\), every remaining pair occurs exactly once; breaking the tie the same way, we merge \((a,a)\), which leads us to the final vocabulary:

    $$\mathcal{V}_3=\{a,b,c,ac,bac,aa\}$$

    We could continue this process iteratively until a stopping criterion is met (in practice, a target vocabulary size). The final vocabulary is the union of \(\mathcal{V}_0\) and all merges performed during training:

    $$ \mathcal{V} = \mathcal{V}_0 \cup \bigcup_{i=1}^{M} \{ x_i \oplus y_i \} $$

    In our final example, we can write it as

    $$ \mathcal{V} = \{ a, b, c, ac, bac, aa \} $$

    Here we introduce another element that goes along with the algorithm: \(T\), our merge table (sometimes called a translation table):

    $$ T = [(x_1,y_1),(x_2,y_2),\ldots,(x_m,y_m) ] $$

    Each entry \((x_i, y_i)\) corresponds to the pair merged at step \(i\). In this case, order matters; merges are applied in this exact order during the tokenization of a new string.

    At the end of training, \(T\) contains the ordered list of all merges. In our example:


    $$ T = [(a,c),(b,ac),(a,a) ] $$


    In other words, T keeps track of the exact merge order needed to tokenize new text consistently.
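    The whole procedure above can be sketched in a few lines of Python. This is a minimal illustration of the character-level version described here, not a production tokenizer; note that when several pairs are tied for the highest count, the contents of \(T\) depend on the tie-breaking rule (here, first occurrence in the sequence):

    ```python
    from collections import Counter

    def apply_merge(seq, pair):
        """Replace every non-overlapping occurrence of the pair, left to right."""
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(seq[i] + seq[i + 1])
                i += 2
            else:
                out.append(seq[i])
                i += 1
        return out

    def train_bpe(text, num_merges):
        """Learn a merge table T and vocabulary V from a string."""
        seq = list(text)
        vocab = set(seq)                       # V_0: all distinct characters
        merges = []                            # T: ordered list of merged pairs
        for _ in range(num_merges):
            freqs = Counter(zip(seq, seq[1:]))
            if not freqs:
                break
            pair = freqs.most_common(1)[0][0]  # ties broken by first occurrence
            merges.append(pair)
            vocab.add(pair[0] + pair[1])
            seq = apply_merge(seq, pair)
        return merges, vocab

    def encode(text, merges):
        """Tokenize new text by replaying the merges of T in order."""
        seq = list(text)
        for pair in merges:
            seq = apply_merge(seq, pair)
        return seq

    merges, vocab = train_bpe("aabacaacbacb", num_merges=3)
    print(merges[0])  # ('a', 'c'): the most frequent pair is merged first
    print(sorted(vocab))
    print(encode("aacb", merges))
    ```

    The `encode` function shows why \(T\) must be ordered: a new string is tokenized consistently only if the merges are replayed in the same order in which they were learned.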

    What I have shown so far is a simplification of how it works, since it is easier to visualize with text. However, there is an extra step: after we separate our string of text, we turn it into bytes, hence the name byte-pair encoding. The reason we do this is that computers do not read characters the way we do; we need to transform them into a representation that machines can interpret.

    How BPE is treated in LLMs

    Modern BPE (as used in LLMs) does not operate on characters, as we saw in the example. It operates on a finite alphabet, and that alphabet is:

    $$ \Sigma = \{0,1,\dots,255\} $$

    The reason it is 256 lies in the definition of a bit (binary digit), the most basic unit of information in computing and digital communication. A bit represents one binary choice \(\{0,1\}\): yes/no, on/off, true/false. Integers, characters, images, audio, and video are all just collections of bits, and we can group bits to create more complex representations. In our case, a group of 8 bits is defined as a byte, so we have \(2^8 = 256\) possibilities:

    $$\text{byte} \in \{0,1\}^8$$


    And we represent it as an integer in \(\{0,1,\dots,255\}\), where each number corresponds to a different sequence of 8 bits. To interpret that information as text, we need a mapping. Current LLMs use UTF-8, which stands for Unicode Transformation Format, where the '8' refers to the use of 8-bit bytes.

    Unicode is the encoding standard designed to support text in all of the world’s writing systems that can be digitized. This allows LLMs to process text from different writing systems under a unified encoding. For further reading on how Unicode works at a deeper level, I recommend *Unicode Demystified* by Richard Gillam; it explains how Unicode and UTF fit into computer language. Here we will continue focusing on the role this plays in tokenization.

    The way we use Unicode is that, after we separate the string of text, we translate it into UTF-8 bytes, conventionally written in hexadecimal notation such as 0x00; this is simply another way to write a group of 8 bits (to understand how it takes this shape, see the further reading recommended previously). Each byte already has a token ID equal to its numeric value.

    | Byte | Token ID |
    |---|---|
    | 0x00 | 0 |
    | 0x01 | 1 |
    | 0x41 | 65 |
    | 0xFF | 255 |

    So the first 256 token IDs are reserved for single bytes:

    $$ \mathcal{V}_0 = \{ b_i \mid b_i \in \{0,\dots,255\} \} $$

    And the Token ID is defined as a mapping function

    $$ \text{ID}_0: \mathcal{V}_0 \to \mathbb{Z} $$

    And as we saw in the tables, we can write it like:

    $$ \text{ID}_0(b_i)=i $$

    so

    $$\text{ID}_0(\text{0x00}) = 0,\quad \text{ID}_0(\text{0x01}) = 1,\quad \dots,\quad \text{ID}_0(\text{0xFF}) = 255$$
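    A quick way to see these byte-level token IDs is to encode a string as UTF-8 in Python (illustrative only; real byte-level tokenizers such as GPT-2’s also remap bytes to printable characters before merging):

    ```python
    # Encode text as UTF-8; each byte's numeric value is its initial token ID.
    text = "héllo"
    byte_ids = list(text.encode("utf-8"))
    print(byte_ids)  # [104, 195, 169, 108, 108, 111] (the 'é' takes two bytes)

    # 'A' is byte 0x41, so its initial token ID is 65, matching the table above.
    print(list("A".encode("utf-8")))  # [65]
    ```

    Note that a single on-screen character can map to more than one byte, which is why byte-level BPE works over \(\{0,\dots,255\}\) rather than over characters.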

    During the merging process, when creating new tokens, the token ID will continue to grow accordingly. Therefore, if a merge of 2 bytes occurs, the next assigned token ID will be 256, and so on.

    To get all the tokens needed for an LLM, we use a large collection of text known as a corpus; in the case of modern LLMs, this is all the data used to train the model: internet data, books, articles, and so on. In models such as GPT-3, it is estimated that training involved approximately 500–570 billion tokens.

    For modern LLMs like GPT-4, estimates range from hundreds of billions to multiple trillions of tokens, though exact figures are not publicly disclosed. This corresponds to terabytes of raw text once encoded in UTF-8 and pre-processed.

    We can denote a corpus like:

    $$\mathcal{C}=\{T_1,T_2,\dots,T_N\}$$

    where each \(T_i\) is a document or string of text, and \(L_i\) is the length of document \(T_i\):

    $$T_i = c_1, c_2, \dots, c_{L_i} \quad c_j \in \Sigma$$


    So BPE is applied over the corpus, with each new token assigned an ID, to process these massive lengths of billions and trillions of tokens. At the end, we end up with the set we defined at the start:

    $$ \mathcal{V} = \{ t_1, t_2, \dots, t_{|\mathcal{V}|} \} $$

    With the information we have seen, we can now understand how tokenization works.

  • Chapter 0: Why We Need to Talk About LLMs


    Every day, we see more news about AI and how it is set to change our lives. Traditional information platforms, like Google, are slowly being replaced by AI agents, and AI-generated content is appearing everywhere. It is undeniable that large language models (LLMs) are challenging our status quo and reshaping the way we live and work.

    This text is divided into two parts. First, we will demystify how large language models operate at a fundamental level, breaking down the math and algorithms in an accessible way. Second, we will examine the state of AI today: who is building it, how much they are investing, its environmental costs, and its broader societal impact. The goal is that, by combining technical understanding with critical context, readers will gain the tools to engage with AI knowledgeably.

    As a business student, I find that large language models can seem far from my field of study, bordering more on computer science than business. I have been highly critical of generative AI for reasons beyond the underlying technology, but I also believe it is important to examine this technology even if we are not data scientists. Through this text, I will try to explain the technical aspects of AI in a way that is accessible to most people outside the field and connect them to a holistic view of AI across political, economic, and social domains.

    We cannot discuss our future with LLMs or their role in society if we do not understand how they work. Understanding how large language models operate at a fundamental level acts as a demystifying tool, allowing us to be more critical of the technology we are using.

    Having a critical stance against AI does not mean that we are rejecting progress or new technologies. Only by examining the current state of things through a critical lens can we make better-informed decisions.

    Whenever a new technology emerges, sensationalism spreads across most of the media we consume. This pattern repeats itself with every new innovation. Sensationalism can be harmful because it encourages blind adoption of technology that we do not fully understand. More importantly, the hype and promises shift the conversation away from the real, uncomfortable questions that we should be asking.

    We should be asking ourselves these uncomfortable questions because billions of dollars are being spent on this technology. This investment will influence areas of our lives even if we are not active participants in its development. For example, will society benefit in the long term after the burst of the AI bubble, as it did with the dotcom bubble? During that period, although there was considerable economic damage, society still benefited from the technology that emerged.

    Another important concern is sustainability. The rising global spending on data centres and the increasing energy demands to train larger models raise questions about whether this technology will genuinely improve our lives or if it is primarily a struggle for power. While there have been breakthroughs that allow current LLMs to consume less energy, current incentives are not focused on sustainability. We are building infrastructure that harms the environment for a technology that the general public does not yet understand.

    I want this text to serve as a starting point where we can learn about LLMs at a fundamental level and also explore how AI and governments are using LLMs for power, as well as the challenges and benefits of these models. My goal is to provide readers with an arsenal of information that allows them to address AI more critically.

    This text assumes a basic level of linear algebra and calculus; it is also recommended to be familiar with some set-builder notation.

    However, I will break down the concepts as much as possible to make them accessible. I highly encourage readers to engage with this journey, even if their field is not technological. Current LLMs are becoming a part of many areas of our lives, regardless of the field, and understanding them at a deeper level will allow us all to become more aware of our usage of them in the future.