Q1. What is a Large Language Model (LLM) and what distinguishes it from traditional NLP models like Word2Vec or
LSTMs?
Ans: A Large Language Model (LLM) is a deep learning model characterized by its massive size (billions of parameters) and
its training on vast quantities of text data. The core innovation that distinguishes LLMs is the Transformer architecture,
which uses a mechanism called self-attention. Here’s a breakdown of the key differences:
i) Architecture and Context:
- Traditional Models (LSTMs, RNNs): Process text sequentially (word by word). This creates a bottleneck making it difficult
to capture long-range dependencies and relationships between distant words in a text. Their understanding of context is often
limited to a relatively small window.
- LLMs (Transformers): Process all text tokens simultaneously. The self-attention mechanism allows the model to weigh the
importance of every other word in the input when processing a specific word. This provides a deep, holistic understanding of
context, grammar, and nuance across the entire document.
ii) Scale and Emergent Abilities:
- Traditional Models: Are much smaller and trained on smaller, task-specific datasets for narrow tasks (e.g., sentiment analysis,
named entity recognition).
- LLMs: Are trained on internet-scale text. This massive scale leads to emergent abilities: complex capabilities like zero-shot
learning, in-context learning, and chain-of-thought reasoning that were not explicitly programmed but arise from the model’s
deep understanding of patterns in the data.
iii) Task Generalization:
- Traditional Models: Are typically task-specific. A model trained for translation cannot perform summarization without
significant retraining.
- LLMs: Are general-purpose. A single, pre-trained foundation model can be adapted to a wide variety of tasks (summarization,
translation, question-answering, code generation) through simple prompting or minimal fine-tuning.
In essence, while an LSTM predicts the next word from a compressed summary of the recent sequence, an LLM attends to the
entire context directly and, at scale, learns a rich internal representation of language that lets it reason about the text.
Q2. What are Q, K, and V in Attention?
Answer:
“In attention, we take input embeddings and multiply them by three learned weight matrices to get Query, Key, and
Value. Queries ask ‘what am I looking for,’ Keys say ‘what I offer,’ and Values hold the information. Attention
scores are computed as QK^T, softmaxed, and used to weight the Values.”
Q, K, and V stand for Query, Key, and Value. Here is what each one means:
Query (Q): What we’re looking for.
Key (K): What each word/embedding offers.
Value (V): The actual information we’ll use if the key matches the query.
👉 Analogy:
Think of Google Search:
Your search text = Query (Q)
The keywords in all websites = Keys (K)
The website content = Values (V)
The attention mechanism checks how much each Key matches the Query, then uses that weight to combine the
Values.
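The steps described above can be sketched in a few lines of NumPy. This is a minimal, single-head illustration rather than a production implementation: the sizes, the random weight matrices `W_q`, `W_k`, `W_v`, and the random input `X` are all assumptions standing in for learned parameters and real embeddings. The scores are also divided by sqrt(dk), the scaling step discussed later in these notes.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability before exponentiating.
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def attention(X, W_q, W_k, W_v):
    # Project the same input embeddings into queries, keys, and values.
    Q = X @ W_q
    K = X @ W_k
    V = X @ W_v
    d_k = K.shape[-1]
    # Score every query against every key, scaled by sqrt(d_k).
    scores = Q @ K.T / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)   # each row sums to 1
    # Each output row is a weighted combination of the value vectors.
    return weights @ V, weights

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 4, 8, 8          # illustrative sizes (assumed)
X = rng.normal(size=(seq_len, d_model))  # stand-in for input embeddings
W_q = rng.normal(size=(d_model, d_k))    # random here, learned in a real model
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_k))

output, weights = attention(X, W_q, W_k, W_v)
print(output.shape, weights.shape)
```

Row i of `weights` shows how strongly token i's Query matched each token's Key, which mirrors the search analogy: the match strength decides how much of each Value ends up in the result.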
Q3. What is the role of Softmax in a Transformer?
Answer:
“In a Transformer, Softmax turns raw attention scores into a probability distribution, so each token decides how
much to ‘attend’ to others. It normalizes and highlights the most relevant tokens while keeping weights stable.”
What is the role of Softmax in Transformers (Attention)?
When we compute attention, we first get similarity scores between queries (Q) and keys (K): scores = QK^T / sqrt(dk).
These scores can take any real value: negative, positive, large, or small.
What Softmax Does
1. Normalizes scores into probabilities
o Softmax converts raw scores into values between 0 and 1.
o Sum of each row = 1.
o This makes them interpretable as “how much attention to pay.”
2. Highlights the most relevant tokens
o Higher scores → higher probability.
o Softmax amplifies differences (the highest score becomes dominant).
3. Stabilizes training
o Without Softmax, the raw scores are unbounded, so the attention weights could explode or vanish.
o Softmax keeps every weight between 0 and 1 and gives a smooth, differentiable distribution.
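The three points above can be checked on a small example. The sketch below uses only the Python standard library; the values in `raw_scores` are made up for illustration and stand for one query's scores against four keys.

```python
import math

def softmax(scores):
    # Subtracting the maximum does not change the result but avoids overflow in exp().
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

raw_scores = [2.0, 1.0, -1.0, 0.5]   # hypothetical attention scores for one query
weights = softmax(raw_scores)

# Every weight lies between 0 and 1, and together they sum to 1.
print([round(w, 3) for w in weights])
print(sum(weights))
```

The largest raw score receives the largest weight, and negative scores still map to small positive weights, so no token is ever given a negative contribution.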
Q. Why do we divide by sqrt(dk) before Softmax in Attention?
Answer:
“We divide by sqrt(dk) to keep the dot products from growing large when the key dimension dk is high. Without scaling,
Softmax would saturate, making attention focus too narrowly, shrinking the gradients, and hurting training stability.”
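A quick simulation makes the argument concrete. It assumes the components of q and k are independent with mean 0 and variance 1 (an idealization, not a property guaranteed in a trained model); under that assumption the dot product q·k has variance dk, so its typical size grows like sqrt(dk), and dividing by sqrt(dk) brings it back to roughly 1. The helper name `dot_product_std` and the trial count are illustrative.

```python
import math
import random
import statistics

random.seed(0)

def dot_product_std(d_k, trials=2000):
    # q and k have independent components with mean 0 and variance 1.
    dots = []
    for _ in range(trials):
        q = [random.gauss(0, 1) for _ in range(d_k)]
        k = [random.gauss(0, 1) for _ in range(d_k)]
        dots.append(sum(a * b for a, b in zip(q, k)))
    return statistics.pstdev(dots)

for d_k in (4, 64, 256):
    raw = dot_product_std(d_k)
    scaled = raw / math.sqrt(d_k)
    print(f"d_k={d_k:4d}  std(q.k)={raw:6.2f}  std(q.k/sqrt(d_k))={scaled:4.2f}")
```

The unscaled spread grows with dk, which is what pushes Softmax toward a near one-hot output, while the scaled spread stays close to 1 regardless of dk.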
Q4. Which embeddings do we use in LLM Transformers?
Answer:
“In LLM Transformers, we start with token embeddings from the vocabulary, add positional embeddings to give
word order, and then project these into Q, K, V embeddings for the attention mechanism.”
So in LLM Transformers we use:
1. Token embeddings (semantic meaning of tokens)
2. Positional embeddings (word order)
3. Q/K/V embeddings (projected versions for attention calculation)
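The three stages in this list can be traced with a small NumPy sketch. It assumes learned absolute positional embeddings that are added to the token embeddings, which is only one of several schemes in use (sinusoidal and rotary encodings are alternatives). The table sizes, the token ids, and the random weights are hypothetical stand-ins for trained parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, max_len, d_model = 10, 6, 8      # illustrative sizes (assumed)

# Lookup tables; in a real model these are learned, here they are random.
token_table = rng.normal(size=(vocab_size, d_model))
position_table = rng.normal(size=(max_len, d_model))

token_ids = np.array([3, 7, 1, 7])           # hypothetical token ids
positions = np.arange(len(token_ids))

# Stages 1 and 2: token embedding plus positional embedding at each position.
x = token_table[token_ids] + position_table[positions]

# Stage 3: project the combined embeddings into Q, K, V.
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
Q, K, V = x @ W_q, x @ W_k, x @ W_v

print(x.shape, Q.shape, K.shape, V.shape)
# Token id 7 appears at positions 1 and 3; compare its two input vectors.
print(np.allclose(x[1], x[3]))
```

The repeated token shares one row of `token_table`, so any difference between its two input vectors comes from the positional term alone, which is how the model can tell the two occurrences apart.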
Example-