✔✔Pipelines are useful (in the analytics with Python sense) for what reasons? -
✔✔Make it easy to repeat/replicate steps and run multiple models, help organize the
code you used to clean and treat data, and make it easy to change small things in
the model, like which variables to include.
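A minimal sketch of the pipeline idea above, assuming scikit-learn's Pipeline class is the tool meant (the card does not name it). The data, the step names "scale" and "model", and the choice of StandardScaler are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Made-up data for illustration: 50 rows, 3 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.1, size=50)

# Each step is a (name, object) pair; the names are arbitrary labels.
pipe = Pipeline([
    ("scale", StandardScaler()),   # data treatment step
    ("model", LinearRegression()), # final estimator
])

# One call repeats every step in order: scale the data, then fit the model.
pipe.fit(X, y)
r2_all = pipe.score(X, y)

# Changing which variables to include is a small edit: refit on two columns.
pipe_two = Pipeline([
    ("scale", StandardScaler()),
    ("model", LinearRegression()),
])
pipe_two.fit(X[:, :2], y)
r2_two = pipe_two.score(X[:, :2], y)

print(r2_all, r2_two)
```

Because the cleaning step and the model live in one object, rerunning the whole workflow on different columns or with a different final model only needs the pipeline definition to change.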
✔✔Y and y-hat are a little different. Y is our target vector, and y-hat is an output in our
model that is a..... - ✔✔Estimate or prediction of y
✔✔The basic idea of a regression is very simple. We have some X values (we called
these ___________) and some Y value (this is the variable we are trying to _________).
We could have multiple Y values, but that is not something we have covered. -
✔✔Features; Predict
✔✔When looking at the code in the videos, we sometimes used a variable to hold our
model.
What is the significance of the word "model" in the below code?
model = LinearRegression(fit_intercept=True) - ✔✔'model' is a named variable and is
just holding our linear regression model. It could be renamed anything. The word itself is
not important. It is just a container.
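As a small sketch of the two cards above (the variable name and y versus y-hat), assuming scikit-learn's LinearRegression as in the quoted line. The toy data and the alternative name my_regression are invented for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data (hypothetical): one feature, target follows y = 2x + 1 exactly.
X = np.array([[1.0], [2.0], [3.0], [4.0]])  # features
y = np.array([3.0, 5.0, 7.0, 9.0])          # target vector

# "model" is just a variable name holding the estimator object.
model = LinearRegression(fit_intercept=True)
model.fit(X, y)

# y_hat is the model's estimate (prediction) of y.
y_hat = model.predict(X)

# Any other valid name works the same way; the word "model" is not special.
my_regression = LinearRegression(fit_intercept=True)
my_regression.fit(X, y)
y_hat_other = my_regression.predict(X)

print(y_hat, y_hat_other)
```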
✔✔Which of the below were discussed as being problems with the hold out method for
validation? - ✔✔Outliers can skew the results and the model is not trained on all of the
data
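A rough sketch of the hold out method, assuming scikit-learn's train_test_split is used to make the split (the card does not say how the split was made). The data and the 80/20 ratio are made up. Repeating the split with different seeds hints at how much a single hold out score can depend on which rows land in the test set.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Made-up data for illustration: 100 rows, 2 features.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
y = 3.0 * X[:, 0] + rng.normal(scale=0.5, size=100)

scores = []
for seed in range(5):
    # Hold out 20% of the rows; the model never trains on them.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    model = LinearRegression().fit(X_train, y_train)
    # R-squared on the held-out rows only.
    scores.append(model.score(X_test, y_test))

print(scores)
```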
✔✔Which of the following is a common use case for the random forest algorithm in
machine learning? - ✔✔Classifying data into categories based on input features
✔✔Which of the following is a potential benefit of using decision trees in machine
learning? - ✔✔Can handle both numerical and categorical data
✔✔Which of the following statements best describes an ensemble method in machine
learning? - ✔✔A technique that combines the results of multiple models to improve
overall predictive accuracy
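To make the ensemble idea concrete, here is a minimal sketch using scikit-learn's RandomForestClassifier, which combines many decision trees into one classifier. The synthetic dataset from make_classification and the choice of 25 trees are assumptions made only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic labeled data (hypothetical): 200 rows, 6 features, 2 classes.
X, y = make_classification(
    n_samples=200, n_features=6, n_informative=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# The forest is an ensemble: it fits 25 trees and combines their predictions.
forest = RandomForestClassifier(n_estimators=25, random_state=0)
forest.fit(X_train, y_train)

n_trees = len(forest.estimators_)            # the individual fitted trees
forest_acc = forest.score(X_test, y_test)    # accuracy of the combined model
predicted_classes = forest.predict(X_test)   # one category per input row

print(n_trees, forest_acc)
```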
✔✔Which of the following best describes supervised learning? - ✔✔A machine learning
approach where the algorithm receives labeled data and learns to map inputs to outputs
based on those labels
✔✔Which of the following statements best describes classification in machine learning?
- ✔✔A type of supervised learning where the goal is to assign input data points to
predefined categories or classes
✔✔We want the R-squared value for our regression model to be 100% (true or false) -
✔✔False
✔✔One weakness of cross-validation discussed is that information can sometimes ____
across different periods. A common situation in which this happens is when we are
looking at stock data. - ✔✔Leak
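One common way to limit this kind of leak is to split in time order so that every training row comes before every test row. The sketch below uses scikit-learn's TimeSeriesSplit for that; the card itself does not name this class, so treat it as a swapped-in illustration, and the 12 time-ordered observations are made up.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# 12 time-ordered observations (hypothetical), e.g. 12 trading days.
X = np.arange(12).reshape(-1, 1)

# Each fold trains on the past and tests on the period that follows it.
tscv = TimeSeriesSplit(n_splits=3)
splits = list(tscv.split(X))

for train_idx, test_idx in splits:
    print("train:", train_idx, "test:", test_idx)

# Check that no test row is earlier than any training row in its fold.
no_future_in_training = all(
    train_idx.max() < test_idx.min() for train_idx, test_idx in splits
)
print(no_future_in_training)
```

With an ordinary shuffled k-fold split, rows from later periods can end up in the training set while earlier rows are tested, which is the situation the card warns about.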
✔✔In which of these situations would you want to use a clustering algorithm? - ✔✔You
have a dataset containing customer data for Cheesecake Factory and you want to look
at customer spending at the restaurant in order to find patterns among customers who
share similar characteristics
✔✔What is a potential downside of using linear regression models in machine learning?
- ✔✔They are prone to overfitting the data
✔✔What type of algorithm would you use to segment customers into groups?
Assume the groups are already labeled. - ✔✔Decision trees, regression, random forest,
cluster regression
✔✔Which of the following is true about data validation and cross-validation in machine
learning? - ✔✔Data validation and cross-validation are used to evaluate a model's
performance and prevent overfitting
✔✔What is the role of cluster centers in clustering, and how are they determined during
the algorithm? - ✔✔Cluster centers are the initial data points chosen randomly to begin
clustering, and they are updated iteratively to minimize the within-cluster sum of
squares
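The answer above describes k-means. As a minimal sketch, assuming scikit-learn's KMeans with init="random" (random starting centers, as the card describes), the fitted centers and the within-cluster sum of squares can be read back after fitting. The two blobs of points are made up for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Two made-up blobs of points, centered near (0, 0) and (5, 5).
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=[0.0, 0.0], scale=0.3, size=(30, 2))
blob_b = rng.normal(loc=[5.0, 5.0], scale=0.3, size=(30, 2))
X = np.vstack([blob_a, blob_b])

# init="random" starts from randomly chosen data points; the algorithm then
# alternates between assigning points and moving the centers until it settles.
kmeans = KMeans(n_clusters=2, init="random", n_init=10, random_state=0)
kmeans.fit(X)

centers = kmeans.cluster_centers_  # final cluster centers, shape (2, 2)
wcss = kmeans.inertia_             # within-cluster sum of squares at the end
labels = kmeans.labels_            # cluster index assigned to each point

print(centers, wcss)
```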
✔✔Which of the following machine learning models utilizes supervised learning? -
✔✔Regression
✔✔What is scikit-learn? - ✔✔A machine learning package in Python that has built-in
machine learning algorithms we can use on our dataset
✔✔Which of the following best describes the difference between a supervised and an
unsupervised learning task in machine learning? - ✔✔A supervised learning task
requires labeled data, while an unsupervised learning task does not