Data Science Interview Questions
Statistics:
1. What is the Central Limit Theorem and why is it important?
“Suppose that we are interested in estimating the average height among all people. Collecting data for
every person in the world is impossible. While we can’t obtain a height measurement from everyone in the
population, we can still sample some people. The question now becomes, what can we say about the
average height of the entire population given a single sample. The Central Limit Theorem addresses this
question exactly.” Read more here.
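The theorem is easy to see by simulation. A minimal NumPy sketch (the exponential population and the sample size of 50 are arbitrary choices for illustration):

```python
import numpy as np

# A deliberately non-normal (skewed, exponential) population.
rng = np.random.default_rng(42)
population = rng.exponential(scale=2.0, size=1_000_000)

# Draw many independent samples and record each sample's mean.
sample_means = [rng.choice(population, size=50).mean() for _ in range(5_000)]

# The sample means cluster around the population mean, and their spread
# shrinks like sigma / sqrt(n), which is exactly what the CLT predicts.
print("population mean:          ", population.mean())
print("mean of sample means:     ", np.mean(sample_means))
print("std of sample means:      ", np.std(sample_means))
print("predicted sigma / sqrt(n):", population.std() / np.sqrt(50))
```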
2. What is sampling? How many sampling methods do you know?
“Data sampling is a statistical analysis technique used to select, manipulate and analyze a representative
subset of data points to identify patterns and trends in the larger data set being examined.” Read the full
answer here.
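Two of the most commonly cited methods are simple random sampling and stratified sampling. A pandas sketch contrasting them (the data frame and its 'segment' column are invented for the example):

```python
import pandas as pd

# Toy dataset: three groups of very different sizes.
df = pd.DataFrame({
    "segment": ["A"] * 700 + ["B"] * 200 + ["C"] * 100,
    "value": range(1000),
})

# Simple random sampling: every row has the same chance of selection.
simple = df.sample(n=100, random_state=0)

# Stratified sampling: sample 10% from each segment, so small groups
# are represented in proportion to their size.
stratified = df.groupby("segment", group_keys=False).sample(frac=0.1, random_state=0)

print(simple["segment"].value_counts())
print(stratified["segment"].value_counts())
```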
3. What is the difference between type I vs type II error?
“A type I error occurs when the null hypothesis is true, but is rejected. A type II error occurs when the null
hypothesis is false, but erroneously fails to be rejected.” Read the full answer here.
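Both error rates can be estimated empirically. A small SciPy simulation (the 0.05 significance level, sample sizes, and effect size are arbitrary choices):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha = 0.05
trials = 2_000

# Type I error: both samples come from the SAME distribution, so any
# rejection of the null hypothesis is a false positive.
type_1 = sum(
    stats.ttest_ind(rng.normal(0, 1, 30), rng.normal(0, 1, 30)).pvalue < alpha
    for _ in range(trials)
) / trials

# Type II error: the samples come from DIFFERENT distributions, so any
# failure to reject the null hypothesis is a false negative.
type_2 = sum(
    stats.ttest_ind(rng.normal(0, 1, 30), rng.normal(0.5, 1, 30)).pvalue >= alpha
    for _ in range(trials)
) / trials

print(f"empirical type I rate:  {type_1:.3f}")  # should be close to alpha
print(f"empirical type II rate: {type_2:.3f}")
```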
4. What is linear regression? What do the terms p-value, coefficient, and r-squared
value mean? What is the significance of each of these components?
Linear regression is a good tool for quick predictive analysis: for example, the price of a house depends
on a myriad of factors, such as its size or its location. In order to see the relationship between these
variables, we can build a linear regression, which fits the line of best fit between them and can
help us conclude whether these factors have a positive or negative relationship. Read
more here and here.
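A minimal sketch of where the coefficient, p-value, and r-squared show up in practice, using statsmodels on synthetic house-price data (all numbers are invented for illustration):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
size = rng.uniform(50, 250, 200)                   # hypothetical house sizes
price = 1_000 * size + rng.normal(0, 20_000, 200)  # noisy linear relationship

X = sm.add_constant(size)        # adds the intercept term
model = sm.OLS(price, X).fit()

print(model.params)     # coefficients: intercept and slope of the best-fit line
print(model.pvalues)    # p-values: is each coefficient significantly non-zero?
print(model.rsquared)   # r-squared: share of price variance explained by size
```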
5. What are the assumptions required for linear regression?
There are four major assumptions:
1. There is a linear relationship between the dependent variable and the regressors, meaning the model
you are creating actually fits the data.
2. The errors or residuals of the data are normally distributed and independent from each other.
3. There is minimal multicollinearity between explanatory variables.
4. Homoscedasticity: the variance around the regression line is the same for all values of the predictor
variable.
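Two of these assumptions can be checked directly from the residuals. A rough sketch using common diagnostics from SciPy and statsmodels (synthetic data; these specific tests are one option among several):

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, 200)
y = 3 * x + rng.normal(0, 1, 200)   # synthetic data that meets the assumptions

X = sm.add_constant(x)
residuals = sm.OLS(y, X).fit().resid

# Assumption 2, normality of residuals: a large Shapiro-Wilk p-value is
# consistent with normally distributed errors.
print("Shapiro-Wilk:", stats.shapiro(residuals))

# Assumption 4, homoscedasticity: a large Breusch-Pagan p-value is
# consistent with constant variance around the regression line.
print("Breusch-Pagan:", het_breuschpagan(residuals, X))
```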
6. What is a statistical interaction?
“Basically, an interaction is when the effect of one factor (input variable) on the dependent variable (output
variable) differs among levels of another factor.” Read more here.
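One way to make this concrete is to fit a regression with an interaction term. In the sketch below (statsmodels formula API; the 'dose' and 'group' variables are invented), the effect of dose on the response deliberately differs by group:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
n = 300
df = pd.DataFrame({
    "dose": rng.uniform(0, 10, n),
    "group": rng.choice(["control", "treatment"], n),
})
# The slope of dose depends on group: that dependence is the interaction.
slope = np.where(df["group"] == "treatment", 2.0, 0.5)
df["response"] = slope * df["dose"] + rng.normal(0, 1, n)

# 'dose * group' expands to dose + group + dose:group; the dose:group
# coefficient captures how the dose effect changes across groups.
model = smf.ols("response ~ dose * group", data=df).fit()
print(model.params)
```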
7. What is selection bias?
“Selection (or ‘sampling’) bias occurs in an ‘active’ sense when the sample data that is gathered and
prepared for modeling has characteristics that are not representative of the true, future population of cases
the model will see. That is, active selection bias occurs when a subset of the data are systematically (i.e.,
non-randomly) excluded from analysis.” Read more here.
8. What is an example of a data set with a non-Gaussian distribution?
“The Gaussian distribution is part of the Exponential family of distributions, but there are a lot more of
them, with the same sort of ease of use, in many cases, and if the person doing the machine learning has
a solid grounding in statistics, they can be utilized where appropriate.” Read more here.
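A few standard non-Gaussian examples, simulated with NumPy (the parameters are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)

# Skewed or discrete distributions, none of them bell-shaped.
wait_times = rng.exponential(scale=5.0, size=10_000)       # time between events
call_counts = rng.poisson(lam=3.0, size=10_000)            # counts per interval
incomes = rng.lognormal(mean=10, sigma=1, size=10_000)     # heavy right tail

for name, data in [("exponential", wait_times),
                   ("poisson", call_counts),
                   ("log-normal", incomes)]:
    # For a Gaussian, mean and median coincide; a gap between them
    # signals skew.
    print(f"{name}: mean={data.mean():.2f}, median={np.median(data):.2f}")
```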
9. What is the Binomial Probability Formula?
“The binomial distribution consists of the probabilities of each of the possible numbers of successes on N
trials for independent events that each have a probability of π (the Greek letter pi) of occurring.” Read more here.
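For completeness, the formula itself in the quote's notation (N trials, k successes, success probability π):

$$P(X = k) = \binom{N}{k}\,\pi^{k}\,(1 - \pi)^{N-k}, \qquad k = 0, 1, \ldots, N$$

For example, the probability of exactly 2 heads in 3 fair coin flips is $\binom{3}{2}(0.5)^{2}(0.5)^{1} = 3/8$.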
Data Science:
Q1. What is Data Science? List the differences between supervised and unsupervised
learning.
Data Science is a blend of various tools, algorithms, and machine learning principles with the goal of discovering
hidden patterns from the raw data. How is this different from what statisticians have been doing for years?
The answer lies in the difference between explaining and predicting.
The differences between supervised and unsupervised learning are as follows:

Supervised Learning                       Unsupervised Learning
Input data is labelled.                   Input data is unlabelled.
Uses a training data set.                 Uses the input data set.
Used for prediction.                      Used for analysis.
Enables classification and regression.    Enables clustering, density estimation, and dimension reduction.
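To ground the table in code, a minimal contrast on the classic iris data (scikit-learn assumed; the choice of logistic regression and k-means is arbitrary):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Supervised: the labels y guide training, and the goal is prediction.
clf = LogisticRegression(max_iter=1000).fit(X, y)
print("supervised predictions:  ", clf.predict(X[:5]))

# Unsupervised: only X is used; the algorithm discovers structure
# (clusters here) without ever seeing a label.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("unsupervised cluster ids:", km.labels_[:5])
```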
Q2. What is Selection Bias?
Selection bias is a kind of error that occurs when the researcher decides who is going to be studied. It is
usually associated with research where the selection of participants isn’t random. It is sometimes referred to
as the selection effect. It is the distortion of statistical analysis, resulting from the method of collecting
samples. If the selection bias is not taken into account, then some conclusions of the study may not be
accurate.
The types of selection bias include:
1. Sampling bias: It is a systematic error due to a non-random sample of a population causing some
members of the population to be less likely to be included than others resulting in a biased sample.
2. Time interval: A trial may be terminated early at an extreme value (often for ethical reasons), but the
extreme value is likely to be reached by the variable with the largest variance, even if all variables
have a similar mean.
3. Data: When specific subsets of data are chosen to support a conclusion or rejection of bad data on
arbitrary grounds, instead of according to previously stated or generally agreed criteria.
4. Attrition: Attrition bias is a kind of selection bias caused by attrition (loss of participants) discounting
trial subjects/tests that did not run to completion.
Q3. What is the bias-variance trade-off?
Bias: Bias is an error introduced in your model due to oversimplification of the machine learning algorithm.
It can lead to underfitting. When you train your model, the model makes simplified assumptions to
make the target function easier to understand.
Low bias machine learning algorithms: Decision Trees, k-NN and SVM.
High bias machine learning algorithms: Linear Regression, Logistic Regression.
Variance: Variance is error introduced in your model by an overly complex machine learning algorithm; the
model also learns the noise in the training data set and performs badly on the test data set. It can lead to
high sensitivity and overfitting.
Normally, as you increase the complexity of your model, you will see a reduction in error due to lower bias
in the model. However, this only happens until a particular point. As you continue to make your model more
complex, you end up over-fitting your model and hence your model will start suffering from high variance.
Bias-Variance trade-off: The goal of any supervised machine learning algorithm is to have low bias and
low variance to achieve good prediction performance.
1. The k-nearest neighbour algorithm has low bias and high variance, but the trade-off can be changed
by increasing the value of k, which increases the number of neighbours that contribute to the prediction
and in turn increases the bias of the model (see the sketch below).
2. The support vector machine algorithm has low bias and high variance, but the trade-off can be
changed by increasing the C parameter that influences the number of violations of the margin allowed
in the training data, which increases the bias but decreases the variance.
There is no escaping the relationship between bias and variance in machine learning. Increasing the bias
will decrease the variance. Increasing the variance will decrease bias.
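The k-NN half of the trade-off is easy to demonstrate. A small scikit-learn experiment on synthetic data (exact scores will vary with the random seed):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=1_000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Small k: flexible model, low bias, high variance (train >> test score).
# Large k: smoother model, higher bias, lower variance (scores converge).
for k in [1, 5, 25, 100]:
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    print(f"k={k:3d}  train={knn.score(X_tr, y_tr):.3f}  "
          f"test={knn.score(X_te, y_te):.3f}")
```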
Q4. What is a confusion matrix?
The confusion matrix is a 2×2 table that contains 4 outputs provided by the binary classifier. Various
measures, such as error rate, accuracy, specificity, sensitivity, precision, and recall, are derived from it.
A data set used for performance evaluation is called a test data set. It should contain the correct labels and
predicted labels.
The predicted labels will be exactly the same as the correct labels if the performance of the binary classifier is perfect.
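A short scikit-learn sketch of the matrix and the measures derived from it (the labels below are fabricated for illustration):

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)

y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]   # correct labels from the test set
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 1]   # labels predicted by the classifier

# Rows are the actual class, columns the predicted class:
# [[TN, FP],
#  [FN, TP]]
print(confusion_matrix(y_true, y_pred))
print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
```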