By: Amir Tadrisi

Published on: 5/26/2025

Last updated on: 5/28/2025

Language Models and Large Language Models

I am the recipe of wisdom, turning words, images, and numbers into signs and secrets made plain. Feed me your histories, and I shall whisper what cometh next.

What are Models?

Models are engines that turn data (images, text, documents, etc.) into numbers. Numbers are what let computers recognize patterns and learn from data. How? By doing arithmetic operations on those numbers.
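As a toy illustration (not how real models represent data), here is one way text could be turned into numbers that a computer can do arithmetic on:

```python
# Toy illustration: turning text into numbers a computer can operate on.
# Real models use learned tokenizers and embeddings, not character codes.
text = "hello"
numbers = [ord(ch) for ch in text]   # each character becomes an integer
print(numbers)                        # [104, 101, 108, 108, 111]

# Once the data is numeric, a model can apply arithmetic to it,
# e.g. scaling every value (a stand-in for a learned operation):
scaled = [n * 0.01 for n in numbers]
print(scaled)
```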

Language Models

Language models receive text as input, chunk it into smaller pieces (tokens), and convert them into a matrix of numbers (embedding vectors) so they can perform mathematical and statistical operations on them and learn about words. Once the model is trained, it can perform language completion tasks (using prediction) such as summarization, translation, and coding.
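Here is a minimal sketch of that pipeline in Python, using a made-up five-word vocabulary and random embedding vectors instead of a trained model:

```python
import numpy as np

# Hypothetical, tiny vocabulary; real models have tens of thousands of tokens.
vocab = {"the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4}

# Embedding matrix: one vector per token (here 4 numbers each).
# In a trained model these vectors are learned, not random.
embeddings = np.random.rand(len(vocab), 4)

text = "the cat sat on the mat"
tokens = text.split()                      # 1. chunk text into tokens
token_ids = [vocab[t] for t in tokens]     # 2. map tokens to ids
vectors = embeddings[token_ids]            # 3. look up embedding vectors

print(token_ids)      # [0, 1, 2, 3, 0, 4]
print(vectors.shape)  # (6, 4): one 4-number vector per token
```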

Tokens

Tokens are the atoms of a model: its basic unit. Each model tokenizes words differently; in GPT-4, for example, one token is roughly 3/4 of a word.

GPT-4 Tokenization

You can try tokenizing your own text with the OpenAI Tokenizer.

As you can see, some tokens are fragments such as "'t", which helps the model understand the structure of words. Tokenizing also makes the model more efficient, since the number of unique tokens is smaller than the number of unique words.
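If you'd rather tokenize in code than in the browser, OpenAI's tiktoken library exposes the same tokenizer; the sample sentence below is just an illustration:

```python
import tiktoken  # pip install tiktoken

# Load the tokenizer used by GPT-4.
enc = tiktoken.encoding_for_model("gpt-4")

text = "Language models don't read words, they read tokens."
token_ids = enc.encode(text)

print(len(text.split()), "words ->", len(token_ids), "tokens")

# Decode each id back to its text piece to see how words are split,
# e.g. "don't" becomes pieces like "don" and "'t".
print([enc.decode([t]) for t in token_ids])
```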

Vocabulary

All of the tokens that a model is trained on are called the model's vocabulary. Vocabulary is to a model what the alphabet is to us: just as we can form many words from the alphabet, the model can generate many outputs from its vocabulary.
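With tiktoken you can also check how many tokens a model's tokenizer knows, which gives a feel for the size of its vocabulary:

```python
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")  # the cl100k_base encoding
print(enc.n_vocab)  # number of tokens the tokenizer knows (roughly 100k)
```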

Large Language Models

We can't treat "Large Language Model" as a precise scientific term, since there is no fixed measure of what counts as large. For example, a model considered large last year is no longer considered large, because newer models have been trained on far more data.

Why are language models the center of attention?

One important aspect is supervision. To train a model, we used to provide a huge dataset labeled manually by humans. For example, to train a brain tumor model, we had to collect hundreds of thousands of MRI images, label each one as healthy or tumor, and feed that data to the model. This process is time-consuming and costly.
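A rough sketch of what such a manually labeled dataset looks like (the file names and labels are made up):

```python
# Hypothetical supervised dataset: every example needs a human-provided label.
labeled_data = [
    ("mri_scan_0001.png", "healthy"),
    ("mri_scan_0002.png", "tumor"),
    ("mri_scan_0003.png", "healthy"),
    # ... hundreds of thousands more, each labeled by hand
]
```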

Language models, on the other hand, can learn from raw data without us labeling it for them (self-supervision). For example, we give them text, and they build a puzzle from it by hiding some tokens and checking whether they can predict them. Since we don't need to manually label data for the model, training time and cost are reduced.
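A minimal sketch of how such a puzzle can be built from raw text, here by hiding the next token at each position:

```python
# Self-supervision sketch: the labels come from the text itself.
tokens = "the cat sat on the mat".split()

# Build (context, hidden token) pairs: the model sees the context
# and is asked to predict the token that was hidden.
examples = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

for context, target in examples:
    print(context, "->", target)
# ['the'] -> cat
# ['the', 'cat'] -> sat
# ...
```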

Foundation models

These are the self-supervised models we can build AI applications on top of, such as GPT, BERT, and PaLM. They were trained on vast amounts of data; for example, about 60% of GPT-3's training data came from Common Crawl. Foundation models make AI application development accessible to ordinary software developers who don't have the infrastructure and resources to train models themselves, and companies like OpenAI provide their models as a service, so we can use their APIs to interact with different models.
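For example, with OpenAI's official Python SDK a few lines are enough to send a prompt to a hosted foundation model (the model name and prompt below are placeholders, and an API key is required):

```python
from openai import OpenAI  # pip install openai

client = OpenAI()  # reads the OPENAI_API_KEY environment variable

# Ask a hosted model to complete a task; no training infrastructure needed.
response = client.chat.completions.create(
    model="gpt-4o",  # placeholder model name
    messages=[
        {"role": "user", "content": "Summarize: Language models predict the next token."}
    ],
)
print(response.choices[0].message.content)
```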