1. What is the difference between supervised and unsupervised learning?
• Answer: Supervised learning uses labeled data where the algorithm
learns to map inputs to known outputs, while unsupervised learning
works with unlabeled data to identify patterns or structures without
predefined outputs.
2. What is the bias-variance tradeoff?
• Answer: The bias-variance tradeoff is the tension between two sources
of prediction error: bias, the error from overly simple assumptions that
cause underfitting, and variance, the error from sensitivity to the
particular training set that causes overfitting. High-complexity models
tend to have low bias but high variance, while simpler models have higher
bias but lower variance.
3. What is overfitting in machine learning?
• Answer: Overfitting occurs when a model learns the training data too
well, including its noise and outliers, resulting in poor performance on
unseen data. The model essentially memorizes the training examples
rather than learning generalizable patterns.
4. How does regularization help prevent overfitting?
• Answer: Regularization adds a penalty term to the loss function that
discourages complex models by constraining parameter values. This
reduces model variance and improves generalization to new data by
preventing the model from fitting noise in the training data.
5. What is cross-validation and why is it important?
• Answer: Cross-validation is a technique where the dataset is split into
multiple subsets, with different parts used for training and validation in
iterations. It's important because it provides a more reliable estimate of
model performance on unseen data compared to a single train-test split,
helping detect overfitting.
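The splitting step can be sketched in a few lines. This is a minimal illustration, not a library implementation: the function name k_fold_indices is hypothetical, the folds are contiguous, and no shuffling is done (in practice the data is usually shuffled first).

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs for k contiguous folds."""
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size


folds = list(k_fold_indices(10, 3))
for train_idx, val_idx in folds:
    print("train:", train_idx, "validate:", val_idx)
```

Each sample lands in the validation set exactly once, so averaging the k validation scores uses every example for evaluation.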
6. Explain the difference between bagging and boosting.
• Answer: Bagging (Bootstrap Aggregating) trains multiple models in
parallel on bootstrap samples (random subsets drawn with replacement)
and averages their predictions, or takes a majority vote, to reduce
variance. Boosting trains models sequentially, with each model focusing
on the examples that earlier models handled poorly, and combines them
with a weighted vote, primarily to reduce bias.
7. What is the curse of dimensionality?
• Answer: The curse of dimensionality refers to various challenges that
arise when analyzing data in high-dimensional spaces. As dimensions
increase, data becomes sparse, distances between points become less
meaningful, and models require exponentially more data to generalize
effectively.
8. What is the ROC curve and what does AUC represent?
• Answer: The Receiver Operating Characteristic (ROC) curve plots the
true positive rate against the false positive rate at various classification
thresholds. The Area Under the Curve (AUC) represents the probability
that the classifier will rank a randomly chosen positive instance higher
than a randomly chosen negative one, with 1.0 being perfect
classification and 0.5 matching random guessing.
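The ranking interpretation of AUC can be computed directly by comparing every positive score with every negative score. The sketch below assumes binary labels (1 = positive, 0 = negative) and counts ties as half a win; the function name auc_by_ranking and the sample scores are illustrative only.

```python
def auc_by_ranking(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # tie counts as half
    return wins / (len(pos) * len(neg))


scores = [0.9, 0.8, 0.4, 0.3]
labels = [1, 0, 1, 0]
print(auc_by_ranking(scores, labels))  # 3 of 4 pairs ordered correctly -> 0.75
```

This pairwise version is quadratic in the number of samples, so it is meant for understanding rather than for large datasets.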
9. Explain the difference between L1 and L2 regularization.
• Answer: L1 regularization (Lasso) adds the sum of the absolute values of
the coefficients to the loss function, which can drive some coefficients to
exactly zero, performing feature selection. L2 regularization (Ridge) adds
the sum of squared coefficients, which shrinks all coefficients toward
zero but rarely to exactly zero.
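The difference is easiest to see in one dimension. Assuming a single weight w, an unpenalized optimum z, and penalty strength lam (all names chosen for this sketch), the L1-penalized objective 0.5*(w - z)**2 + lam*abs(w) is minimized by soft-thresholding, while the L2-penalized objective 0.5*(w - z)**2 + 0.5*lam*w**2 is minimized by scaling z down.

```python
def l1_shrink(z, lam):
    """Minimizer of 0.5*(w - z)**2 + lam*abs(w): soft-thresholding."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0  # small coefficients are set exactly to zero


def l2_shrink(z, lam):
    """Minimizer of 0.5*(w - z)**2 + 0.5*lam*w**2: scale toward zero."""
    return z / (1.0 + lam)


for z in (2.0, 0.3, -0.3):
    print(z, "L1 ->", l1_shrink(z, 0.5), " L2 ->", l2_shrink(z, 0.5))
```

With lam = 0.5, the small inputs 0.3 and -0.3 become exactly zero under L1 but only shrink under L2, which is the feature-selection behavior the answer describes.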
10. What is the cold start problem in recommendation systems?
• Answer: The cold start problem occurs when a recommendation system
cannot make reliable recommendations due to insufficient data about new
users or items. Without historical interaction data, the system struggles to
identify preferences or similarities needed for accurate recommendations.
11. What are principal components in PCA?
• Answer: Principal components are orthogonal vectors that represent
directions of maximum variance in the data. They are eigenvectors of the
covariance matrix, ranked by their corresponding eigenvalues, and form a
new coordinate system where data dimensions are uncorrelated.
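For two-dimensional data the eigenvectors of the covariance matrix have a closed form, which makes the definition easy to check by hand. The sketch below is limited to that 2D case and uses hypothetical function names and made-up sample points; real code would normally call a linear algebra library instead.

```python
import math


def covariance_2d(points):
    """Return (var_x, cov_xy, var_y) using the sample (n - 1) normalization."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return sxx, sxy, syy


def principal_components_2d(points):
    """Return [(eigenvalue, unit eigenvector), ...], largest eigenvalue first."""
    a, b, c = covariance_2d(points)
    mid = (a + c) / 2.0
    radius = math.sqrt(((a - c) / 2.0) ** 2 + b ** 2)
    components = []
    for lam in (mid + radius, mid - radius):
        if b != 0.0:
            v = (b, lam - a)          # solves (C - lam*I) v = 0
        elif abs(lam - a) <= abs(lam - c):
            v = (1.0, 0.0)            # covariance already diagonal
        else:
            v = (0.0, 1.0)
        norm = math.hypot(v[0], v[1])
        components.append((lam, (v[0] / norm, v[1] / norm)))
    return components


points = [(2.5, 2.4), (0.5, 0.7), (2.2, 2.9), (1.9, 2.2), (3.1, 3.0)]
for eigenvalue, direction in principal_components_2d(points):
    print(eigenvalue, direction)
```

The first component is the direction of largest variance, and the dot product of the two returned directions is zero up to rounding, matching the orthogonality stated above.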
12. Explain the difference between a generative and discriminative model.
• Answer: Generative models learn the joint probability distribution
P(X,Y) to understand how data is generated, allowing them to create new
samples. Discriminative models learn the conditional probability P(Y|X)
to focus on decision boundaries between classes for classification tasks.
13. What is transfer learning and when is it useful?
• Answer: Transfer learning is a technique where a model developed for
one task is reused as the starting point for a model on a second task. It's
useful when the target task has limited training data, when the source and
target tasks share similarities, or when pre-trained models capture
relevant features that transfer well.
14. What is the difference between batch, mini-batch, and stochastic
gradient descent?
• Answer: Batch gradient descent computes gradients using the entire
dataset in each iteration. Mini-batch uses random subsets of data for each
update. Stochastic gradient descent uses just one example per update.
Mini-batch balances computational efficiency with update stability, while
stochastic provides the noisiest but most frequent updates.
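The three variants differ only in how many examples feed each update, so one training loop with a batch_size parameter covers all of them. This is a sketch under simple assumptions: a one-parameter model y_hat = w * x, mean squared error, noise-free toy data generated from w = 3, and illustrative names and hyperparameters.

```python
import random


def gradient(w, batch):
    """Gradient of mean squared error for the model y_hat = w * x."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


def train(data, batch_size, lr=0.05, epochs=200, seed=0):
    """Gradient descent where batch_size selects the variant."""
    rng = random.Random(seed)
    w = 0.0
    for _ in range(epochs):
        shuffled = data[:]
        rng.shuffle(shuffled)
        for i in range(0, len(shuffled), batch_size):
            w -= lr * gradient(w, shuffled[i:i + batch_size])
    return w


data = [(x, 3.0 * x) for x in (0.5, 1.0, 1.5, 2.0)]
w_batch = train(data, batch_size=len(data))  # batch: 1 update per epoch
w_mini = train(data, batch_size=2)           # mini-batch: 2 updates per epoch
w_sgd = train(data, batch_size=1)            # stochastic: 4 updates per epoch
print(w_batch, w_mini, w_sgd)
```

Because the toy data has no noise, all three runs approach w = 3; on real data the smaller batch sizes would show noisier trajectories.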
15. What is the vanishing gradient problem?
• Answer: The vanishing gradient problem occurs when gradients become
extremely small as they propagate backward through many layers of a
deep neural network. This makes it difficult to update weights in earlier
layers, slowing or preventing learning in those parts of the network.
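One common cause is a saturating activation such as the sigmoid, whose derivative never exceeds 0.25. The sketch below multiplies sigmoid derivatives along a chain of layers, assuming every weight equals 1 and every pre-activation equals 0 (the best case for the sigmoid); the function names are illustrative.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)  # maximum value is 0.25, reached at z = 0


def backprop_factor(pre_activations):
    """Product of sigmoid derivatives along a chain (weights assumed 1)."""
    factor = 1.0
    for z in pre_activations:
        factor *= sigmoid_grad(z)
    return factor


shallow = backprop_factor([0.0] * 2)   # 0.25 ** 2
deep = backprop_factor([0.0] * 20)     # 0.25 ** 20
print(shallow, deep)
```

Even in this best case the gradient reaching the first layer of a 20-layer chain is below 1e-12, which is why early layers barely update.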
16. How does batch normalization help in training deep networks?
• Answer: Batch normalization normalizes the inputs to each layer by
subtracting the batch mean and dividing by the batch standard deviation,
then applying a learned scale and shift. This
stabilizes the learning process, allows higher learning rates, reduces the
dependency on careful initialization, acts as a regularizer, and can
accelerate training.
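The normalization step for a single feature can be sketched as follows. The gamma and beta parameters stand for the learned scale and shift used in standard batch normalization, which the answer above does not name, and eps is a small constant assumed here for numerical stability. This covers only the training-time forward pass for one feature.

```python
import math


def batch_norm(values, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature across a batch, then scale and shift."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in values]


batch = [1.0, 2.0, 3.0, 4.0]
normalized = batch_norm(batch)
print(normalized)
```

With the default gamma = 1 and beta = 0, the output has a mean of about 0 and a variance of about 1 regardless of the scale of the input batch.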
17. What is the purpose of activation functions in neural networks?
• Answer: Activation functions introduce non-linearity into neural
networks, allowing them to learn complex patterns. Without non-linear
activation functions, a neural network would behave like a single linear
model regardless of its depth, limiting its ability to represent complex
relationships in data.
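The collapse of stacked linear layers can be verified with scalar layers. In this sketch the weights and biases are arbitrary illustrative values: two linear layers compose into one linear function, while inserting a ReLU between them produces a function that no single linear layer can match.

```python
def linear(w, b):
    """Return a one-dimensional linear layer x -> w * x + b."""
    return lambda x: w * x + b


def relu(x):
    return max(0.0, x)


layer1 = linear(2.0, -1.0)
layer2 = linear(-3.0, 0.5)


def stacked_linear(x):
    return layer2(layer1(x))


# Same function written as one layer: w = w2 * w1, b = w2 * b1 + b2.
collapsed = linear(-3.0 * 2.0, -3.0 * -1.0 + 0.5)


def stacked_relu(x):
    return layer2(relu(layer1(x)))


for x in (-2.0, 0.0, 1.0, 3.0):
    print(x, stacked_linear(x), collapsed(x), stacked_relu(x))
```

The first two output columns agree for every input, while the ReLU version is flat for inputs where layer1 is negative and sloped elsewhere.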
18. Explain the concept of feature importance in decision trees.
• Answer: Feature importance in decision trees measures how much each
feature contributes to decreasing impurity (like Gini or entropy) across all
splits where it's used. Each reduction is weighted by the share of samples
reaching that split, so features that produce larger reductions, or that
split near the root where more samples are affected, are considered more
important for prediction.
19. What is a confusion matrix and what metrics can be derived from it?
• Answer: A confusion matrix is a table showing the counts of true
positives, false positives, true negatives, and false negatives for a
classifier. Metrics derived include accuracy, precision, recall, F1 score,
specificity, and the false positive/negative rates.
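The listed metrics follow directly from the four counts. The sketch below assumes binary labels with 1 as the positive class, uses made-up example labels, and does not guard against division by zero when a class is never predicted or never present.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, TN, FN for binary labels where 1 is positive."""
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 1:
            tp += 1
        elif t == 0 and p == 1:
            fp += 1
        elif t == 0 and p == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def derived_metrics(tp, fp, tn, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)  # also called sensitivity or true positive rate
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "specificity": tn / (tn + fp),
        "false_positive_rate": fp / (fp + tn),
        "false_negative_rate": fn / (fn + tp),
    }


y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
counts = confusion_counts(y_true, y_pred)
print(counts, derived_metrics(*counts))
```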
20. What is the difference between instance-based and model-based
learning?
• Answer: Instance-based learning (like k-NN) stores training examples
and makes predictions based on similarity to these instances without
building an explicit model. Model-based learning (like linear regression)
creates a parametric model from the training data that can make
predictions without referring back to the original examples.
21. What is the kernel trick in SVMs?
• Answer: The kernel trick lets an SVM operate as if the data had been
mapped into a higher-dimensional space without explicitly computing the
coordinates in that space, by using kernel functions that return the inner
products in that space directly from the original inputs. This allows SVMs
to find non-linear decision boundaries while maintaining computational
efficiency.
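A small check makes the idea concrete. For two-dimensional inputs, the degree-2 polynomial kernel k(x, z) = (x . z)**2 equals the inner product of the explicit feature vectors (x1**2, sqrt(2)*x1*x2, x2**2). The kernel choice and function names below are assumptions made for illustration.

```python
import math


def poly2_kernel(x, z):
    """Degree-2 polynomial kernel computed in the original 2D space."""
    dot = x[0] * z[0] + x[1] * z[1]
    return dot ** 2


def explicit_features(x):
    """Explicit 3D feature map whose inner product equals poly2_kernel."""
    return (x[0] ** 2, math.sqrt(2.0) * x[0] * x[1], x[1] ** 2)


def inner(u, v):
    return sum(a * b for a, b in zip(u, v))


x, z = (1.0, 2.0), (3.0, 4.0)
print(poly2_kernel(x, z))                                 # (1*3 + 2*4)**2 = 121
print(inner(explicit_features(x), explicit_features(z)))  # approximately 121
```

The kernel reaches the same value with one dot product in 2D, whereas the explicit route must first build the higher-dimensional vectors; for kernels such as the RBF the explicit space is infinite-dimensional, so only the kernel route is practical.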
22. Explain the concept of information gain in decision trees.
• Answer: Information gain measures the reduction in entropy (or
uncertainty) achieved by splitting the data on a particular feature. It
quantifies how much information about the target variable is gained by