Exam (elaborations)

Data Mining - Fall II | More than 100 Quizzes

Pages
14
Grade
A+
Uploaded on
10-12-2024
Written in
2024/2025

118. What are the two kinds of MDS?: 1. Classical (metric) - preserves the original distances between points, very similar to PCA
2. Non-metric (ordinal MDS) - constructs fitted distances that are in the same rank order as the original distances (can be used with quantitative and qualitative data)
119. What is the main difference between MDS and PCA?: PCA - more focused on the dimensions themselves (maximizing explained variance)
MDS - more focused on relations among the scaled objects
120. What are the measures used to assess the success of metric MDS? Non-metric MDS?: Metric - GOF (goodness of fit) - between 0 and 1 - want a high number
Non-metric - stress - percent between 0 and 100 - want a low number
121. What happens when you have a large # of predictor variables?: Finding the true signal becomes difficult... it can be hidden in all dimensions of the data.


Data Mining - Fall II


1. Why do we split data into training/validation/test?: To make sure the results are generalizable. We don't want to
overfit to the training sample.
2. How should you split into training/validation/test?: You can really split it however you want, but some
suggested methods are:
50-40-10 split for lots of data
70-20-10 split on smaller data sets
3. Do you have to split the data for unsupervised techniques?: Nope! Use whole data set for unsupervised learning
(like clustering)
4. How do you know when you don't have enough data to split it? What approach should you then take?: It is
good to have at least 10 observations per variable
Use cross-validation (with test set) if you don't feel that you have enough data to do a normal split
5. Should you ever report accuracy stats on the training data set to a client?: NO - report on test ideally, validation if you must
6. Steps for model creation: 1. Use training data to build model
2. Evaluate/tune models on validation data (but don't train model on validation, only use it to fine tune)
3. Once a final model is chosen, re-run this model on the training and validation TOGETHER to finalize
parameters.
4. Use this model on the test data set. Report this accuracy statistic!
5. If model is going to be deployed, use ALL data to update to final parameters.
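The five steps above can be sketched in a few lines. This is only an illustration under assumptions not in the notes: the data are simulated, the "model" is a toy 1-D k-nearest-neighbour classifier, and the names `knn_predict`, `accuracy` and `candidate_k` are made up for the example.

```python
import random

def knn_predict(reference, x, k):
    """Majority vote among the k reference points closest to x (toy 1-D model)."""
    nearest = sorted(reference, key=lambda p: abs(p[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if 2 * votes > k else 0

def accuracy(reference, eval_rows, k):
    """Share of eval_rows whose label is predicted correctly."""
    hits = sum(knn_predict(reference, x, k) == y for x, y in eval_rows)
    return hits / len(eval_rows)

# Hypothetical toy data: one numeric predictor, binary label.
rng = random.Random(42)
rows = [(rng.gauss(0, 1), 0) for _ in range(100)]
rows += [(rng.gauss(2, 1), 1) for _ in range(100)]
rng.shuffle(rows)

# 70-20-10 split (integer arithmetic avoids float rounding surprises).
n = len(rows)
n_train, n_valid = n * 7 // 10, n * 2 // 10
train = rows[:n_train]
valid = rows[n_train:n_train + n_valid]
test = rows[n_train + n_valid:]

# Steps 1-2: build on training, tune k on validation only.
candidate_k = [1, 3, 5, 7, 9]
best_k = max(candidate_k, key=lambda k: accuracy(train, valid, k))

# Step 3: refit the chosen model on training + validation together.
final_reference = train + valid

# Step 4: evaluate once on test. This is the number to report.
test_accuracy = accuracy(final_reference, test, best_k)
print(f"chosen k={best_k}, test accuracy={test_accuracy:.2f}")

# Step 5: if deploying, update the model with ALL data.
deployed_reference = rows
```

The test rows are touched exactly once, after the model choice is frozen, so the reported accuracy is not flattered by tuning.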
7. What is k-fold cross-validation?: Divide the data into k equally sized samples (folds). Then, for each fold, train the model on
all data except the one fold that is left out as the validation data. Record accuracy measures for each fold left out,
and at the end take the average/stdev of all accuracy stats. Use the summary to choose a model.
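A minimal sketch of the procedure, under assumptions: the data values are made up, the "model" simply predicts the mean of the training folds, and the score is mean absolute error rather than accuracy (so lower is better). `k_fold_indices` and `cross_validate` are hypothetical helper names.

```python
import random
from statistics import mean, stdev

def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and deal them into k folds of (nearly) equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(y, k):
    """Per-fold mean absolute error of a 'predict the training mean' model."""
    scores = []
    for held_out in k_fold_indices(len(y), k):
        held = set(held_out)
        train_y = [y[i] for i in range(len(y)) if i not in held]
        prediction = mean(train_y)                    # fit on the other k-1 folds
        errors = [abs(y[i] - prediction) for i in held_out]
        scores.append(mean(errors))                   # score on the held-out fold
    return scores

y = [3.1, 2.9, 3.6, 4.0, 2.5, 3.3, 3.8, 2.7, 3.0, 3.5]  # hypothetical values
scores = cross_validate(y, k=5)
print(f"mean MAE={mean(scores):.3f}, stdev={stdev(scores):.3f}")
```

Calling `cross_validate(y, k=len(y))` leaves out one observation at a time, which is the leave-one-out case.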
8. Can you use cross-validation as a splitting technique in any situation?: Yes - can always be used. However, it is
most commonly used when there are not sufficient observations to break into training/validation/test
9. What is Jackknifing?: Leave-one-out cross-validation, i.e. n-fold cross-validation where n = sample size.
Use only one observation as the validation set, and repeat for each observation in the data set.
**Can be really time consuming; use only for really small data sets
10. What is bootstrapping? Why is it used?: A non-parametric procedure that can estimate the standard error,
compute confidence intervals, or perform hypothesis tests on a statistic.
The data is assumed to be the population, and sampling WITH replacement is used to create samples of the same size. A distribution can be made with many of these samples.
It is used when the data set does not meet assumptions (like a normal distribution) and you cannot use other techniques.
11. What are the assumptions of bootstrapping?: Independent observations, and the sample is representative of the
population.
12. What are two applications of bootstrapping that were examples in class?: 1. Finding variability in the median (or any statistic)
2. Test if two medians (or other stats) are significantly different
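Both applications can be sketched with the standard library. The two samples are made up, and the percentile interval is one simple choice among several. `bootstrap` and `percentile_interval` are hypothetical helper names, and this is not necessarily the method used in class.

```python
import random
from statistics import median, stdev

def bootstrap(data, statistic, n_resamples=2000, seed=0):
    """Resample WITH replacement at the original size; return one statistic per resample."""
    rng = random.Random(seed)
    n = len(data)
    return [statistic(rng.choices(data, k=n)) for _ in range(n_resamples)]

def percentile_interval(values, level=0.95):
    """Simple percentile interval from a list of bootstrap statistics."""
    ordered = sorted(values)
    lo = ordered[int((1 - level) / 2 * len(ordered))]
    hi = ordered[int((1 + level) / 2 * len(ordered)) - 1]
    return lo, hi

# Hypothetical skewed samples for two groups.
group_a = [12, 15, 14, 10, 48, 13, 16, 11, 95, 14, 13, 17]
group_b = [22, 25, 21, 30, 27, 24, 110, 26, 23, 29, 28, 25]

# Application 1: variability of the median.
medians_a = bootstrap(group_a, median)
print("bootstrap SE of median(A):", round(stdev(medians_a), 2))
print("95% interval for median(A):", percentile_interval(medians_a))

# Application 2: do the medians differ? Resample each group independently
# and look at the distribution of the difference.
medians_b = bootstrap(group_b, median, seed=1)
diffs = [b - a for a, b in zip(medians_a, medians_b)]
print("95% interval for median(B) - median(A):", percentile_interval(diffs))
```

If the interval for the difference excludes 0, that is evidence the two medians differ at roughly the chosen level.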
13. If you do a lot of hypothesis testing, what happens to your Type I error rates?: They become inflated (the chance of at least one false rejection across all the tests grows), so
you need to adjust the p-values
14. If you do a lot of hypothesis testing, what happens? What is one technique you can use to control the family-wise error rate? How does it work?: Lots of hypothesis testing -> inflated Type I error (rejecting the null when it is
actually true).

You can use the Bonferroni technique - it controls the family-wise error rate. Multiply each p-value by the number of
tests you ran (capping at 1). This inflates the p-values, so the null hypotheses are less likely to be rejected.
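The Bonferroni adjustment is short enough to write out. The p-values below are made up, and `bonferroni_adjust` is a hypothetical helper name.

```python
def bonferroni_adjust(p_values):
    """Multiply each p-value by the number of tests, capping the result at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

raw = [0.001, 0.012, 0.030, 0.200]   # hypothetical raw p-values from 4 tests
adjusted = bonferroni_adjust(raw)

alpha = 0.05
rejected = [p <= alpha for p in adjusted]
print(adjusted)
print(rejected)
```

Here the third test (raw p = 0.030) would be rejected at 0.05 on its own but is not after adjustment. Comparing each raw p-value to alpha divided by the number of tests gives the same decisions.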
15. What is transactional data?: Each row represents a transaction, so the data is very long (one person can have a lot
of rows of different purchases/visits/etc.). You want to roll up the data so it has one row per unit being modeled (e.g., one row per customer).
Transform long -> wide
16. What is feature creation in the context of transactional data?: Create features (columns) for the data that you
can pull from the transactions and that matter. You need to THINK ABOUT THIS; it will be different in each context.
EX of things that might be important:
- date of first/last transaction
- total amount of transactions
- max/min/avg cost of purchases
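A small sketch of rolling long transaction rows up into one row of features per customer. The transactions, the customer ids and the feature names are all made up for illustration. Which features actually matter depends on the context.

```python
from collections import defaultdict
from datetime import date
from statistics import mean

# Hypothetical long-format transactions: (customer_id, purchase_date, amount).
transactions = [
    ("c1", date(2024, 1, 5), 20.0),
    ("c1", date(2024, 3, 9), 35.5),
    ("c2", date(2024, 2, 1), 12.0),
    ("c1", date(2024, 6, 30), 8.25),
    ("c2", date(2024, 2, 14), 60.0),
]

# Group the long rows by customer.
by_customer = defaultdict(list)
for customer_id, when, amount in transactions:
    by_customer[customer_id].append((when, amount))

# Roll up to one wide row of features per customer.
features = {}
for customer_id, rows in by_customer.items():
    dates = [d for d, _ in rows]
    amounts = [a for _, a in rows]
    features[customer_id] = {
        "first_date": min(dates),
        "last_date": max(dates),
        "n_transactions": len(rows),
        "total_amount": sum(amounts),
        "max_amount": max(amounts),
        "min_amount": min(amounts),
        "avg_amount": mean(amounts),
    }

for customer_id, row in features.items():
    print(customer_id, row)
```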
17. What are some different approaches to handling missing values?: - Create a flag variable that indicates whether
a value is missing
- For continuous variables, if you want to keep the variable (more than 50% of its values are present), you can impute values and add a flag
variable, or bin the variable and add a missing bin.
- For categorical variables, create a missing bin
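A minimal sketch of the flag-plus-impute and missing-bin approaches. The notes do not say which value to impute, so the median is used here as an assumption. The data, the `"Missing"` label and the function names are also hypothetical.

```python
from statistics import median

def impute_with_flag(values):
    """Return (imputed values, missing flags); None marks a missing value."""
    observed = [v for v in values if v is not None]
    fill = median(observed)                      # assumption: median imputation
    flags = [1 if v is None else 0 for v in values]
    imputed = [fill if v is None else v for v in values]
    return imputed, flags

def add_missing_bin(categories, label="Missing"):
    """Treat missing categorical values as their own level."""
    return [label if c is None else c for c in categories]

income = [52.0, None, 61.0, 48.0, None, 75.0]    # hypothetical continuous variable
region = ["N", None, "S", "S", "E", None]        # hypothetical categorical variable

income_imputed, income_missing = impute_with_flag(income)
print(income_imputed)
print(income_missing)
print(add_missing_bin(region))
```

Keeping the flag column lets a model pick up on missingness itself, which the imputed value alone would hide.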
18. What are 3 ways to bin numeric variables?: Equal width - each bin covers the same range of the variable's values but
can hold different numbers of observations (EX: 0-10, 11-20, 21-30...)
Equal depth - each bin has the same number of observations (take percentiles of the population and bin them)
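The two binning schemes described above can be compared on a small skewed sample. The ages are made up and the function names are hypothetical. The sketch assumes the values are not all identical (otherwise the equal-width bin width would be zero) and ignores ties, which can straddle equal-depth bin edges.

```python
def equal_width_bins(values, n_bins):
    """Assign each value a bin index 0..n_bins-1 using bins of identical width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    # The maximum value would land in bin n_bins, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

def equal_depth_bins(values, n_bins):
    """Assign bin indices so each bin holds (nearly) the same number of observations."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(values)
    return bins

ages = [18, 19, 21, 22, 23, 25, 31, 40, 58, 90]   # hypothetical, right-skewed
print(equal_width_bins(ages, 3))   # [0, 0, 0, 0, 0, 0, 0, 0, 1, 2]
print(equal_depth_bins(ages, 3))   # [0, 0, 0, 0, 1, 1, 1, 2, 2, 2]
```

With skewed data, equal width puts 8 of the 10 values in the first bin, while equal depth spreads them 4-3-3.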

Document information

Uploaded on
December 10, 2024
Number of pages
14
Written in
2024/2025
Type
Exam (elaborations)
Contains
Questions & answers
