By: Amir Tadrisi
Published on: 5/26/2025
Last updated on: 5/28/2025
I am the recipe of wisdom, turning words, images, and numbers into signs and secrets made plain. Feed me your histories, and I shall whisper what cometh next.
Models are engines that turn data (images, text, documents, etc.) into numbers. Numbers are what let computers recognize patterns and learn from data. How? By performing arithmetic operations on those numbers to learn the data.
Language models receive text as input, chunk it into smaller pieces (tokens), and convert them into a matrix of numbers (embedding vectors) so they can run mathematical and statistical operations on them to learn about words. Once the model is trained and ready, it can perform language completion tasks (using prediction) such as summarization, translation, and coding.
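To make that pipeline concrete, here is a minimal toy sketch in Python (not how a real model is implemented): a tiny hand-made vocabulary maps tokens to IDs, and a random embedding matrix turns each ID into a vector the model can do arithmetic on.

```python
import numpy as np

# Toy vocabulary: maps each token to an integer ID.
# Real models use vocabularies with tens of thousands of tokens.
vocab = {"I": 0, "love": 1, "language": 2, "models": 3, ".": 4}

# Toy embedding matrix: one row (vector) per token in the vocabulary.
# Real models learn these numbers during training; here they are random.
embedding_dim = 8
embedding_matrix = np.random.rand(len(vocab), embedding_dim)

def embed(text: str) -> np.ndarray:
    """Chunk text into tokens, map them to IDs, and look up their vectors."""
    tokens = text.split()                   # naive whitespace "tokenizer"
    token_ids = [vocab[t] for t in tokens]  # tokens -> integer IDs
    return embedding_matrix[token_ids]      # IDs -> embedding vectors

vectors = embed("I love language models")
print(vectors.shape)  # (4, 8): one 8-dimensional vector per token
```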
Tokens are the atoms of a model: its basic unit. Each model tokenizes words differently; for example, in GPT-4 a token is on average about 3/4 of a word.
You can try tokenizing your own text with the OpenAI Tokenizer.
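If you would rather do this in code, the sketch below uses OpenAI's tiktoken library (pip install tiktoken) with the cl100k_base encoding used by GPT-4 family models; the sample sentence is just an illustration.

```python
import tiktoken

# cl100k_base is the encoding used by GPT-4 / GPT-3.5-turbo models.
enc = tiktoken.get_encoding("cl100k_base")

text = "Language models don't read words, they read tokens."
token_ids = enc.encode(text)

# Decode each ID individually to see the text piece it represents.
pieces = [enc.decode([tid]) for tid in token_ids]

print(pieces)             # e.g. ['Language', ' models', ' don', "'t", ...]
print(len(text.split()))  # number of words
print(len(token_ids))     # number of tokens -- usually a bit higher
```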
As you can see, some tokens are sub-word pieces such as "," or "'t", which helps the model understand the structure of words. Tokenizing also makes the model more efficient, since the number of unique tokens is smaller than the number of unique words.
All of the tokens a model is trained on are called the model's vocabulary. A model's vocabulary is like our alphabet: just as we can form many words from the alphabet, the model can form many outputs from its vocabulary.
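As a rough illustration (again assuming tiktoken is installed), you can inspect the size of one encoding's vocabulary directly:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

# Total number of tokens the encoding knows about -- the model's "alphabet".
print(enc.n_vocab)  # roughly 100k tokens for cl100k_base
```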
"Large Language Model" is not really a scientific term, since there is no precise measure of "large." For example, what was considered a large model last year is no longer considered large, because newer models have come out that are trained on far more data.
One important aspect is supervision. Traditionally, training a model meant providing a huge dataset labeled manually by humans. For example, to train a brain tumor detection model, we had to collect hundreds of thousands of MRI images, label each one as healthy or tumor, and feed that data to the model. This process is time-consuming and costly.
Language models, on the other hand, can learn from raw data without us labeling it for them (self-supervision). For example, we give them text and they turn it into a puzzle: some tokens are hidden, and the model tries to predict them. Since no manual labeling is needed, training time and cost are reduced.
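A quick way to see this "hide a token and predict it" idea in action is Hugging Face's fill-mask pipeline with a BERT model. This is a sketch that assumes the transformers library is installed and will download model weights on first run.

```python
from transformers import pipeline

# BERT was pre-trained with exactly this kind of puzzle (masked language modeling).
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# Hide one token and let the model guess it from the surrounding context.
predictions = fill_mask("The capital of France is [MASK].")

for p in predictions:
    print(f"{p['token_str']:>10}  score={p['score']:.3f}")
```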
These self-supervised models, such as GPT, BERT, and PaLM, are the ones we can build AI applications on top of. They were trained on vast amounts of data; for example, roughly 60% of GPT-3's training data came from Common Crawl. They make AI application development accessible to everyday software developers who don't have the infrastructure or resources to train models themselves, and companies like OpenAI provide their models as a service, so we can use their APIs to interact with different models.
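For example, a minimal sketch of calling OpenAI's API with the official Python SDK might look like this (assuming the openai package is installed, OPENAI_API_KEY is set in the environment, and the model name is just a placeholder):

```python
from openai import OpenAI

# The SDK reads OPENAI_API_KEY from the environment by default.
client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder; use any chat model you have access to
    messages=[
        {"role": "user", "content": "Summarize this in one sentence: "
                                    "Language models turn text into tokens and tokens into numbers."}
    ],
)

print(response.choices[0].message.content)
```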