By: Amir Tadrisi
Published on: 5/28/2025
Last updated on: 5/29/2025
Language models are open-ended: given a text prompt, they understand its meaning and can generate virtually any kind of response—poems, explanations, code snippets, answers to questions, or even structured data. There’s no single “correct” output.
By contrast, closed-ended models have a fixed set of possible answers. For example, an image classifier trained to distinguish dogs from cats receives a picture and must choose exactly one label: dog or cat. There is exactly one ground-truth answer, and anything else is considered wrong.
This open-ended capability comes from learning the meaning of text, not just its surface form. Below, we look at a few similarity methods and how well each one captures meaning.
Let's say we have three sentences. The simplest way to compare them is to break each into tokens (words) and count the overlapping tokens between every pair.
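Here is a minimal sketch of that overlap count using the Jaccard index (shared tokens divided by total unique tokens). The three example sentences and the `jaccard` helper are hypothetical stand-ins for illustration, not the article's originals:

```python
def jaccard(a: str, b: str) -> float:
    """Word overlap score: shared tokens / all unique tokens."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)

# Hypothetical example sentences
s1 = "the cat sat on the mat"
s2 = "the cat lay on the mat"
s3 = "a kitten slept on a rug"

print(jaccard(s1, s2))  # ~0.67 -> high overlap ("the", "cat", "on", "mat")
print(jaccard(s1, s3))  # ~0.11 -> almost no overlap, despite similar meaning
```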
This method has some clear limitations: it only rewards exact surface matches, so synonyms like "kitten" and "cat" contribute nothing; it ignores word order entirely; and inflected forms ("nap" vs. "naps") don't match either.
Another way to measure the similarity of two texts is to count how many edits (insertions, deletions, substitutions) are needed to turn one string into another, most commonly via Levenshtein distance.
Example:
In this case, the distance between "boy" and "buy" is small (a single substitution), which suggests they are similar, even though their meanings are unrelated. On the other hand, the distance between "kitten" and "cat" is large (five edits), which suggests they are not similar at all, yet as we know, their meanings are closely related.
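The distance itself is easy to compute with the classic dynamic-programming algorithm; a minimal self-contained sketch:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and substitutions
    needed to turn string `a` into string `b` (dynamic programming)."""
    prev = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca from a
                curr[j - 1] + 1,     # insert cb into a
                prev[j - 1] + cost,  # substitute ca -> cb (free if equal)
            ))
        prev = curr
    return prev[-1]

print(levenshtein("boy", "buy"))     # 1 edit  -> "similar", but unrelated words
print(levenshtein("kitten", "cat"))  # 5 edits -> "dissimilar", but related words
```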
Word overlap and approximate string matching both rely on lexical similarity. As we've seen, lexical similarity is a poor fit for intelligent models like language models, because it compares word forms instead of meanings. We need a way to let the model learn the meaning of the data.
To do this, we need to represent the data in a form that computers can run statistical and arithmetic operations on, so that deeper meaning can be learned. This process is called embedding.
We are going to train our model on one paragraph to see if it can find semantic similarity between cat and kitten. The paragraph is:
My cat loves to play with yarn.
The kitten chased a ball of yarn.
The cat naps in the sun.
The kitten sleeps on a soft blanket.
Both a cat and a kitten have soft fur.
For models to be more efficient, we need to break our input data into tokens. In this example each token is a word, but real tokenizers follow their own rules; in GPT-4, for instance, one token corresponds on average to about 3/4 of a word. Tokenization makes the model more efficient because the vocabulary of unique tokens is much smaller than the vocabulary of unique words.
So we split our sentences into [“My”, “cat”, “loves”, …, “fur”].
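In code, a simple lowercasing tokenizer that strips punctuation is enough for this toy corpus (a sketch; production tokenizers like GPT-4's use learned subword vocabularies instead):

```python
import re

corpus = [
    "My cat loves to play with yarn.",
    "The kitten chased a ball of yarn.",
    "The cat naps in the sun.",
    "The kitten sleeps on a soft blanket.",
    "Both a cat and a kitten have soft fur.",
]

# Lowercase and keep only letter runs, so "yarn." and "yarn" become one token.
tokenized = [re.findall(r"[a-z]+", sentence.lower()) for sentence in corpus]
print(tokenized[0])  # ['my', 'cat', 'loves', 'to', 'play', 'with', 'yarn']
```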
Next, we choose a context window; in our case, a window of 2 words on each side (before and after). The window defines what kind of relationships we capture and has a direct impact on training quality and the final score. In our embedding process, suppose we are at the "loves" token: we check the 2 tokens before ("my", "cat") and the 2 tokens after ("to", "play").
Start with 2–5. That’s a sensible middle ground for many NLP tasks. If you need fine-grained, syntactic, or relational embeddings (parsing, part-of-speech tasks), try smaller windows (1–2). If you care about broad topic modeling or document similarity, experiment with larger windows (8–15).
Every time “cat” appears, look at the 2 tokens before and after and note them as its context.
Do the same for “kitten”.
Both “cat” and “kitten” share contexts like “yarn” and “soft” (and later “fur”). These overlaps mean they occur in similar “company.”
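Here is a sketch of that context-collection step, reusing the `tokenized` corpus from the tokenization sketch above (the `contexts` helper is an illustrative name, not from the article):

```python
from collections import Counter

WINDOW = 2  # tokens on each side, as chosen above

def contexts(target: str, sentences: list[list[str]]) -> Counter:
    """Count every token appearing within WINDOW positions of `target`."""
    counts = Counter()
    for tokens in sentences:
        for i, token in enumerate(tokens):
            if token == target:
                before = tokens[max(0, i - WINDOW):i]
                after = tokens[i + 1:i + 1 + WINDOW]
                counts.update(before + after)
    return counts

cat_ctx = contexts("cat", tokenized)
kitten_ctx = contexts("kitten", tokenized)
print(sorted(cat_ctx & kitten_ctx))  # context words the two targets share
```

Note that with a corpus this tiny and a strict 2-token window, the intersection is dominated by function words such as "the" and "a"; with a larger corpus, a wider window, or stop-word filtering, content words like "yarn," "soft," and "fur" are the ones that surface, which is the intuition this example is after.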
Because “cat” and “kitten” repeatedly share contexts, their vectors get nudged toward the same region in the high-dimensional space.
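To make that nudging concrete, one way to train such embeddings is gensim's Word2Vec. The library choice and every hyperparameter below are assumptions for illustration, not the article's actual setup:

```python
from gensim.models import Word2Vec

# Train a tiny skip-gram model on the five tokenized sentences.
# All hyperparameters here are illustrative guesses; on a corpus this
# small the result swings noticeably with the seed and epoch count.
model = Word2Vec(
    sentences=tokenized,  # the tokenized corpus from earlier
    vector_size=50,       # dimensionality of the embedding space
    window=2,             # the 2-token context window chosen above
    min_count=1,          # keep every word; the corpus is tiny
    sg=1,                 # skip-gram training
    epochs=200,
    seed=42,
)

print(model.wv.similarity("cat", "kitten"))  # cosine similarity of the two vectors
```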
Embedding similarity, most often computed via cosine similarity, yields a score between -1 and 1:
1.0 means the two vectors point in exactly the same direction (a perfect semantic match).
0.0 means the vectors have no meaningful relation.
-1.0 means they point in exactly opposite directions (maximal dissimilarity).
In practice, embedding scores for related words usually fall between 0.3 and 0.9, with higher values indicating stronger semantic similarity.
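The score itself is straightforward to compute by hand with NumPy; this sketch reuses the `model` trained in the Word2Vec sketch above:

```python
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Dot product of the vectors divided by the product of their lengths."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Reuse the vectors learned by the Word2Vec sketch above.
print(cosine_similarity(model.wv["cat"], model.wv["kitten"]))
```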
In our case, 0.34 is a clear positive signal (well above 0) that these two terms share many contexts ("play," "yarn," "soft," "fur"), so they're semantically related. On our tiny five-sentence corpus, a value in the low 0.3s is actually quite good.
Embeddings turn raw data into numerical vectors that capture deep relationships and meaning. We’ve seen how semantic embeddings overcome the limits of word-overlap and approximate string matching (fuzzy matching).
Embeddings are the key to smarter, more flexible AI across language, vision, and beyond.