LATEST ISYE 6501 Spring '23 Exam 1 QUESTIONS WITH 100% SOLUTIONS 2024
What is modeling? - ANSWER Describing a real-life situation mathematically, analyzing the math, and then translating the mathematical answer back into the real-life situation.

What is a data point? - ANSWER A row of data: all of the information about one observation.

What are some (4) names for columns of data? - ANSWER Attributes, features, covariates, predictors.

Name some common types of structured data (5) - ANSWER Quantitative, categorical, binary, unrelated, time series.

What is binary data? - ANSWER A subset of categorical data (although it can be treated as numerical) that can only take on one of two values (Y/N, M/F, etc.).

What is unstructured data? Example? - ANSWER Data that is not easily described or stored. Ex: text.

When do you use a soft classifier? - ANSWER When you can't draw a line that separates all of the data points, so some misclassification has to be allowed.

What are support vectors? - ANSWER The data points that lie on the two parallel margin lines and so "support" (hold up) the margin.

SVM - what does it mean if the coefficients are near zero? - ANSWER The attributes with those coefficients are probably not relevant for classification.

SVM - Does a classifier need to be a straight line? - ANSWER No. Kernel methods allow for nonlinear classifiers.

What is the most common scaling used? - ANSWER Scaling data between 0 and 1.

What is standardization? - ANSWER Rescaling data to a fixed mean and standard deviation (typically mean = 0, sd = 1). This shifts and rescales the values; it does not by itself make them normally distributed.

When might you use scaling over standardization? - ANSWER When your data needs to be in a bounded range. Ex: neural networks, optimization models, etc.

What are two models that work better with standardization over scaling? - ANSWER Principal component analysis and clustering.

How does KNN determine what a new point's class will be? - ANSWER The new data point's class is the most common class among its k nearest neighbors.

What is the most common way to measure the distance to the k nearest neighbors? - ANSWER Straight-line (Euclidean) distance, although other distance measures can be used.

How can you adjust KNN for attributes that are more important for the classification? - ANSWER Weight each attribute's contribution to the distance and give the more important attributes more weight.

What type of effects explain why we can't measure the model's effectiveness on training data? - ANSWER Random effects.

Why is the observed performance of high-performing models probably too optimistic? - ANSWER High-performing models are the most likely to have benefited from random effects.

Which set of data is used to choose the best model? - ANSWER Validation set.

Which set of data is used for building models? - ANSWER Training set.

Which set of data is used for measuring a model's performance? - ANSWER Test set.

Name two ways to split the data into training and test sets - ANSWER Random and rotation.

Given a set of time series data, which might be a better way to split the data into training and test sets and why? Random or rotation? - ANSWER Rotation, because it spreads the data evenly over time, whereas a random split could select more values from one year.

Which should we use the majority of our data for: training, validation or test? - ANSWER Training. Most experts recommend using 50-70% for training and splitting the rest equally between validation and test.

What is a solution to the problem of more important data points appearing in only validation or test sets? - ANSWER Cross validation.

How do you measure the model's quality using k-fold cross validation? - ANSWER Average the k evaluations to estimate the quality (k = 10 is common).

What are three advantages of k-fold cross validation? - ANSWER Better use of data, a better estimate of model quality, and choosing your model more effectively.

In k-fold cross validation, how many times is each part of the data used for training and validation? - ANSWER k-1 times for training and once for validation.

What type of model might you build to figure out where to build police stations? - ANSWER Clustering.

What are the three steps k-means clustering uses? - ANSWER 1. Pick k cluster centers. 2. Assign each data point to the nearest cluster center. 3. Recalculate the cluster centers (centroids). Repeat steps 2 and 3 until there are no more changes.

What is a heuristic? - ANSWER An algorithm that isn't guaranteed to find the best solution but usually gets close and does so quickly.

What is the best way to deal with an outlier? - ANSWER Find out more about it and what it means in the situation you're working on before deciding to remove it or keep it.

What is a method to pick the number of clusters to use for k-means clustering? - ANSWER An elbow diagram. Find the point where the benefit of adding another cluster gets really small (the curve flattens).

How do you use k-means for predictive analytics? - ANSWER For a new data point, find the nearest cluster center and assign the point to that cluster.
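
The k-means answers above (the three steps, the elbow diagram, and assigning a new point to its nearest center) can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the course's reference implementation: the function names `assign`, `kmeans`, and `total_squared_distance` are hypothetical, the toy data set is made up, and using the total squared distance to the cluster centers as the elbow measure is an assumption.

```python
import numpy as np

def assign(points, centers):
    """Return, for each point, the index of its nearest center (straight-line distance)."""
    dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    return dists.argmin(axis=1)

def kmeans(points, k, n_iter=100, seed=0):
    """Hypothetical helper: plain k-means heuristic on an (n, d) array."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct data points as the initial cluster centers.
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign each data point to its nearest cluster center.
        labels = assign(points, centers)
        # Step 3: recalculate each center as the mean of its assigned points
        # (an empty cluster keeps its old center).
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        # Repeat steps 2 and 3 until the centers stop changing.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, assign(points, centers)

def total_squared_distance(points, centers, labels):
    """Assumed elbow measure: sum of squared distances to the assigned centers."""
    return float(np.sum((points - centers[labels]) ** 2))

# Made-up toy data: two well-separated groups of four points each.
points = np.array([[0, 0], [0, 1], [1, 0], [1, 1],
                   [10, 10], [10, 11], [11, 10], [11, 11]], dtype=float)

# Elbow diagram data: the measure for k = 1, 2, 3 (plot it and look for the flattening).
elbow = {}
for k in (1, 2, 3):
    c, lab = kmeans(points, k)
    elbow[k] = total_squared_distance(points, c, lab)

# Predictive use: assign a new point to the cluster with the nearest center.
centers, labels = kmeans(points, k=2)
new_point = np.array([[9.0, 9.0]])
predicted_cluster = int(assign(new_point, centers)[0])
```

Because k-means is a heuristic, a different `seed` can give a different result on harder data; running it from several starting points and keeping the best clustering is a common precaution.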
Document information: uploaded on February 21, 2024; 6 pages; written in 2023/2024; type: exam (elaborations); contains questions & answers.