QUESTIONS & SOLUTIONS (RATED A+)
What is KDD, and what are the basics of KDD? - ANSWER KDD is Knowledge
Discovery in Databases. Its basic goal is to extract valuable insights, patterns, and
knowledge from large datasets.
What is the KDD pipeline? - ANSWER Data selection, preprocessing, transformation,
data mining, interpretation/evaluation
Dimensionalities of Data Mining - ANSWER
- data to be mined
- knowledge to be mined (data mining functions)
- techniques utilized
- applications adapted
What is a data sample? - ANSWER A subset of data taken from a larger dataset
What is a dataset? - ANSWER A collection of related data points or instances
representing all available data.
Different categories of attributes (Categorical) - ANSWER
Nominal: names of things, categories, or states
Binary: nominal attribute with only 2 states (0, 1)
Ordinal: values have a meaningful order (ranking), but the size of the gap between successive values is not known
Different categories of attributes (Numeric) - ANSWER
Interval: measured on a scale of equal-sized units, with no true zero point
Ratio: has a true zero point, so values can be compared as multiples of one another (order of magnitude)
Statistical description of data - ANSWER Motivation: understand central tendency, variation, and spread
Data dispersion: median, max, min, quantiles, outliers, variance
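The dispersion measures listed above can be computed with Python's standard `statistics` module. This is a minimal sketch: the helper name `describe`, the sample values, and the 1.5 x IQR rule for flagging outliers are assumptions made for illustration, not part of the original answer.

```python
import statistics

def describe(values):
    """Return basic central-tendency and dispersion measures for a list of numbers."""
    ordered = sorted(values)
    # Quartiles (Q1, median, Q3) using the inclusive interpolation method.
    q1, _, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    iqr = q3 - q1
    # Assumption: flag outliers with the common 1.5 * IQR rule.
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return {
        "mean": statistics.mean(ordered),
        "median": statistics.median(ordered),
        "min": ordered[0],
        "max": ordered[-1],
        "q1": q1,
        "q3": q3,
        "variance": statistics.variance(ordered),  # sample variance
        "outliers": [v for v in ordered if v < low or v > high],
    }

summary = describe([2, 4, 4, 5, 6, 7, 8, 40])
print(summary)
```

In this sample the value 40 is far from the rest of the data, so it pulls the mean well above the median and is the only value flagged as an outlier.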
Data Transformation Methods - ANSWER
- scaling
- logarithmic transformation
- aggregation
- encoding
- binning
- dimensionality reduction
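A few of the methods above (scaling, logarithmic transformation, binning, encoding) can be sketched in plain Python. The function names and sample values are hypothetical, and each function shows only one common variant of its method: min-max scaling, `log(1 + x)`, equal-width bins, and one-hot encoding.

```python
import math

def min_max_scale(values):
    """Scaling: map values linearly onto the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def log_transform(values):
    """Logarithmic transformation: log(1 + v); assumes non-negative inputs."""
    return [math.log1p(v) for v in values]

def equal_width_bins(values, n_bins):
    """Binning: assign each value the index of its equal-width bin."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        return [0 for _ in values]
    # The maximum value falls on the upper edge, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

def one_hot(labels):
    """Encoding: turn each categorical label into a 0/1 indicator vector."""
    categories = sorted(set(labels))
    return [[1 if label == c else 0 for c in categories] for label in labels]

incomes = [10, 20, 30, 40, 50]
scaled = min_max_scale(incomes)
bins = equal_width_bins(incomes, 2)
encoded = one_hot(["red", "blue", "red"])
print(scaled, bins, encoded)
```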
What is EDA? (Exploratory Data Analysis) - ANSWER An approach to data analysis used to
gain insight into and understanding of the data before formal modeling or hypothesis testing
Motivation of EDA - ANSWER To explore and summarize the main characteristics,
patterns, and relationships within the data
EDA Methods - ANSWER
- Descriptive Statistics
- Data Visualization
- Correlation Analysis
- Outlier detection
- Missing Data Analysis
- Data Transformation
- Dimensionality Reduction
What is confidence interval estimation? - ANSWER A statistical technique used to
estimate a range within which a population parameter is likely to lie with a specified
level of confidence.
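As an illustration, a confidence interval for a population mean can be sketched with the standard library. The function name and sample values are hypothetical, and the sketch assumes a normal (z) approximation; for small samples a t distribution would typically be used instead.

```python
import math
import statistics

def mean_confidence_interval(sample, confidence=0.95):
    """Return (low, high) bounds for the population mean.

    Assumption: normal approximation, i.e. mean +/- z * (s / sqrt(n)).
    """
    n = len(sample)
    mean = statistics.mean(sample)
    std_err = statistics.stdev(sample) / math.sqrt(n)
    # Two-sided critical value, e.g. about 1.96 for 95% confidence.
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z * std_err
    return mean - margin, mean + margin

sample = [4.8, 5.1, 5.0, 4.9, 5.2, 5.0, 4.7, 5.3]
low, high = mean_confidence_interval(sample, confidence=0.95)
print(low, high)
```

Raising the confidence level (for example from 0.95 to 0.99) widens the interval, because a larger range is needed to be more confident that it contains the parameter.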
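What is cross-validation? - ANSWER Evaluates model performance by splitting data into
k mutually exclusive subsets (folds); each fold is used once for testing while the remaining k-1 folds are used for training, and the results are averaged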
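The splitting step can be sketched as a small index generator. The name `k_fold_indices` is hypothetical, and this sketch does not shuffle the data first, which a practical implementation usually would.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) pairs for k mutually exclusive folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

folds = list(k_fold_indices(10, 3))
for train, test in folds:
    print("train:", train, "test:", test)
```

Every sample appears in exactly one test fold, so each record is used for testing once and for training k-1 times.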
What is overfitting? - ANSWER Occurs when the model tries to fit every possible
trend/structure in the training set, including noise, so it performs well on the training data but poorly on unseen data
Bias-Variance Trade-Off - ANSWER Balance between two sources of model error, bias (error from overly simple assumptions) and
variance (sensitivity to the particular training set), to minimize overall error on unobserved data
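For squared-error loss, the trade-off is commonly written as the decomposition below. The symbols are assumptions introduced here for illustration: $y = f(x) + \varepsilon$ is the true target with noise variance $\sigma^2$, and $\hat{f}(x)$ is the model's prediction.

```latex
\mathbb{E}\left[(y - \hat{f}(x))^2\right]
  = \underbrace{\left(\mathbb{E}[\hat{f}(x)] - f(x)\right)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\left[\left(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\right)^2\right]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible error}}
```

Making a model more flexible usually lowers the bias term and raises the variance term, which is why the two have to be balanced.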
What is KNN (K Nearest Neighbor)? - ANSWER Instance-based (lazy) learning where training
set records are stored first, and a new instance is classified later by comparing it with the stored records
What is the main procedure of KNN? - ANSWER
1. Choose the parameter k = number of nearest neighbors
2. Calculate the distance between the new instance and all the training examples
3. Sort the examples by distance and take as nearest neighbors those within the k-th
minimum distance
4. Gather the category Y of each nearest neighbor
5. Use the simple majority of the categories of the nearest neighbors as the prediction
for the query instance
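The five steps above can be sketched as follows. The function name, the toy data, and the choice of Euclidean distance are assumptions; the procedure itself does not fix a particular distance measure.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training examples."""
    # Step 1: k is supplied by the caller.
    # Step 2: distance between the query and every training example
    # (assumption: Euclidean distance).
    distances = [(math.dist(x, query), label) for x, label in zip(train_X, train_y)]
    # Step 3: sort by distance and keep the k nearest.
    distances.sort(key=lambda pair: pair[0])
    # Step 4: gather the categories of the nearest neighbors.
    nearest_labels = [label for _, label in distances[:k]]
    # Step 5: simple majority vote.
    return Counter(nearest_labels).most_common(1)[0][0]

train_X = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
train_y = ["A", "A", "A", "B", "B", "B"]
prediction = knn_predict(train_X, train_y, (1.1, 1.0), k=3)
print(prediction)
```

An odd k is often chosen for two-class problems so that the vote cannot tie.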
What are decision trees? Execution? - ANSWER Uses a flowchart-like tree structure to
make predictions
Execution:
1. preprocess data
2. split data into training/testing sets
3. train decision tree model on training data
4. evaluate performance on testing data
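The four execution steps can be illustrated with a deliberately tiny model. The sketch below trains a one-level tree (a "decision stump") rather than a full decision tree, scores splits by misclassification count rather than a criterion such as information gain, and uses made-up data. All names in it are hypothetical.

```python
from collections import Counter

def majority(labels):
    """Return the most common label in a list."""
    return Counter(labels).most_common(1)[0][0]

def train_stump(X, y):
    """Find the (feature, threshold) split with the fewest training errors."""
    best = None
    for feature in range(len(X[0])):
        values = sorted({row[feature] for row in X})
        # Candidate thresholds: midpoints between consecutive distinct values.
        for a, b in zip(values, values[1:]):
            threshold = (a + b) / 2
            left = [lab for row, lab in zip(X, y) if row[feature] <= threshold]
            right = [lab for row, lab in zip(X, y) if row[feature] > threshold]
            left_label, right_label = majority(left), majority(right)
            errors = (sum(lab != left_label for lab in left)
                      + sum(lab != right_label for lab in right))
            if best is None or errors < best[0]:
                best = (errors, feature, threshold, left_label, right_label)
    if best is None:
        raise ValueError("no split possible: every feature is constant")
    return best[1:]

def predict(stump, row):
    """Follow the single split to a leaf label."""
    feature, threshold, left_label, right_label = stump
    return left_label if row[feature] <= threshold else right_label

# 1. Preprocess: the toy data is already numeric and complete.
data = [((2.0, 1.0), "no"), ((3.0, 1.5), "no"), ((2.5, 0.5), "no"), ((3.5, 1.2), "no"),
        ((7.0, 1.1), "yes"), ((8.0, 0.9), "yes"), ((7.5, 1.4), "yes"), ((6.5, 0.7), "yes")]
# 2. Split: hold out every fourth record for testing.
train = [d for i, d in enumerate(data) if i % 4 != 3]
test = [d for i, d in enumerate(data) if i % 4 == 3]
# 3. Train on the training data only.
stump = train_stump([x for x, _ in train], [lab for _, lab in train])
# 4. Evaluate on the held-out testing data.
accuracy = sum(predict(stump, x) == lab for x, lab in test) / len(test)
print(stump, accuracy)
```

A full decision tree would apply the same split search recursively to each branch until a stopping condition is met.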