1. Why do we split data into training/validation/test?: To make sure the results are generalizable. We don't want to
overfit to the training sample.
2. How should you split into training/validation/test?: You can really split it however you want, but some
suggested methods are:
50-40-10 split for lots of data
70-20-10 split for smaller data sets
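The 50-40-10 split can be sketched in pure Python by shuffling row indices (the function name and seed are illustrative):

```python
import random

def split_indices(n, train=0.5, val=0.4, seed=42):
    """Shuffle row indices, then cut them into train/validation/test."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_train = int(n * train)
    n_val = int(n * val)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

train_idx, val_idx, test_idx = split_indices(1000)
```

Shuffling before cutting matters: if the rows are sorted (by date, by outcome), a straight slice gives non-representative splits.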
3. Do you have to split the data for unsupervised techniques?: Nope! Use the whole data set for unsupervised learning
(like clustering)
4. How do you know when you don't have enough data to split it? What approach should you then take?: It is
good to have at least 10 observations per variable.
Use cross-validation (with a test set) if you don't feel that you have enough data to do a normal split
5. Should you ever report accuracy stats on the training data set to a client?: NO - report on test ideally, validation if you must
6. Steps for model creation: 1. Use training data to build model
2. Evaluate/tune models on validation data (but don't train model on validation, only use it to fine tune)
3. Once a final model is chosen, re-run this model on the training and validation TOGETHER to finalize
parameters.
4. Use this model on the test data set. Report this accuracy statistic!
5. If model is going to be deployed, use ALL data to update to final parameters.
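The five steps can be sketched with toy constant-predictor "models" (mean vs. median) on made-up data; everything here is illustrative:

```python
import statistics

def mae(pred, data):
    """Mean absolute error of a constant prediction."""
    return sum(abs(pred - x) for x in data) / len(data)

# Hypothetical data, already split
train = [1.0, 2.0, 2.0, 3.0, 10.0]
val = [2.0, 2.0, 3.0]
test = [1.0, 2.0, 4.0]

# Steps 1-2: fit candidates on train, tune (choose) on validation only
candidates = {"mean": statistics.mean, "median": statistics.median}
fits = {name: fit(train) for name, fit in candidates.items()}
best = min(fits, key=lambda name: mae(fits[name], val))

# Step 3: re-fit the chosen model on training and validation TOGETHER
final_pred = candidates[best](train + val)

# Step 4: report accuracy on the untouched test set
test_mae = mae(final_pred, test)

# Step 5: for deployment, re-fit on ALL the data
deployed_pred = candidates[best](train + val + test)
```

The point of the sketch is the data discipline, not the models: the test set is touched exactly once, at step 4.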
7. What is k-fold cross-validation?: Divide the data into k equally sized samples. Then, for each fold, train the model on
all data except the one fold that is left out as the validation data. Record accuracy measures for each fold left out,
and at the end take the average/stdev of all accuracy stats. Use the summary to choose a model.
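A minimal pure-Python sketch of k-fold CV, assuming a toy "model" that just predicts the training mean and is scored by mean absolute error (all names illustrative):

```python
import statistics

def kfold_scores(data, k, fit, score):
    """k-fold CV: train on k-1 folds, score the held-out fold, summarize."""
    fold_size = len(data) // k
    scores = []
    for i in range(k):
        held_out = data[i * fold_size:(i + 1) * fold_size]
        train = data[:i * fold_size] + data[(i + 1) * fold_size:]
        scores.append(score(fit(train), held_out))
    return statistics.mean(scores), statistics.stdev(scores)

# Toy model: predict the training mean; score: mean absolute error
fit = lambda train: sum(train) / len(train)
score = lambda m, held: sum(abs(m - x) for x in held) / len(held)
avg, sd = kfold_scores([1, 2, 3, 4, 5, 6], k=3, fit=fit, score=score)
```

A real version would shuffle the rows before cutting folds; the fixed order here keeps the arithmetic checkable.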
8. Can you use cross-validation as a splitting technique in any situation?: Yes - can always be used. However, it is
most commonly used when there are not sufficient observations to break into training/validation/test
9. What is Jackknifing?: Leave-one-out cross-validation - n-fold cross-validation
where n = sample size.
Use only one observation as the validation set, and repeat for each observation in the data set.
**Can be really time consuming; use only for really small data sets
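Leave-one-out is just cross-validation with one observation per fold; a sketch with the same toy mean-predictor "model" (names illustrative):

```python
def loo_scores(data, fit, score):
    """Leave-one-out CV: n folds, each holding out a single observation."""
    return [
        score(fit(data[:i] + data[i + 1:]), [data[i]])
        for i in range(len(data))
    ]

fit = lambda train: sum(train) / len(train)   # toy model: training mean
score = lambda m, held: abs(m - held[0])      # absolute error on the one held-out point
errors = loo_scores([2.0, 4.0, 6.0], fit, score)
```

Note the cost: the model is re-fit n times, which is why this is reserved for small data sets.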
10. What is bootstrapping? Why is it used?: A non-parametric procedure that can estimate the standard error,
compute confidence intervals, or perform hypothesis tests on a statistic.
The data is assumed to be the population, and sampling WITH replacement is used
to create samples of the same size. A distribution can be made with many of these samples.
It is used when the data set does not meet assumptions (like a normal distribution), so other (parametric) techniques cannot be used.
11.What are the assumptions of bootstrapping?: Independent observations and the samples are representative of the
population.
12. What are two applications of bootstrapping that were examples in class?: 1. Finding variability in the median (or any statistic)
2. Test if two medians (or other stats) are significantly different
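Application 1 can be sketched as a percentile bootstrap for the median; the data, seed, and resample count are made up:

```python
import random
import statistics

def bootstrap_medians(data, n_boot=2000, seed=0):
    """Resample WITH replacement (same size as data); record each resample's median."""
    rng = random.Random(seed)
    return [
        statistics.median(rng.choices(data, k=len(data)))
        for _ in range(n_boot)
    ]

data = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
meds = sorted(bootstrap_medians(data))
lo, hi = meds[49], meds[1949]   # rough 95% percentile interval (2.5% / 97.5% cut points)
```

The spread of `meds` estimates the variability of the median; the percentile cut points give a confidence interval without any normality assumption.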
13. If you do a lot of hypothesis testing, what happens to your Type I error rates?: They become inflated, so
you need to adjust the p-values
14. If you do a lot of hypothesis testing, what happens? What is one technique you can use to control the family-wise error rate? How does it work?: Lots of hypothesis testing -> inflated Type I error (rejecting the null when it is
actually true).
You can use the Bonferroni technique - it controls the family-wise error rate. You multiply each p-value by the number of
tests you ran (inflates p-vals so the null is less likely to be rejected)
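The Bonferroni adjustment described above fits in a few lines (function name is illustrative; adjusted p-values are capped at 1):

```python
def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capping at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

adjusted = bonferroni([0.01, 0.04, 0.30])
```

After adjustment you compare to the usual alpha (e.g., 0.05), so fewer tests cross the threshold.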
15. What is transactional data?: Each row represents a transaction, so the data is very long (one person can have a lot
of rows of different purchases/visits/etc.). Want to roll up the data so it has one row per unit being modeled (e.g., one row per customer).
Transform long -> wide
16. What is feature creation in the context of transactional data?: Create features (columns) for the data that you
can pull from the transactions that matter. You need to THINK ABOUT THIS; it will be different in each context.
EX of things that might be important:
- date of first/last transaction
- total amount of transactions
- max/min/avg cost of purchases
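The long -> wide roll-up with a few of the features above can be sketched in pure Python (the transactions and column names are made up):

```python
from collections import defaultdict

# Hypothetical transactional data: (customer_id, date, amount), one row per purchase
transactions = [
    ("A", "2024-01-05", 20.0),
    ("A", "2024-03-12", 35.0),
    ("B", "2024-02-01", 10.0),
]

# Group the long data by customer
by_customer = defaultdict(list)
for cust, date, amount in transactions:
    by_customer[cust].append((date, amount))

# Roll up: one row (dict) per customer, with created features as columns
features = {
    cust: {
        "first_date": min(d for d, _ in txns),   # ISO dates compare correctly as strings
        "last_date": max(d for d, _ in txns),
        "n_transactions": len(txns),
        "total_amount": sum(a for _, a in txns),
        "max_amount": max(a for _, a in txns),
    }
    for cust, txns in by_customer.items()
}
```

Which aggregates to compute is the thinking step: the right features depend on what is being modeled.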
17. What are some different approaches to handling missing values?: - Create a flag variable that indicates whether
a value is missing
- For continuous variables, if you want to keep the variable (over 50% of values are present), you can impute values and add a flag
variable, or bin the variable and add a missing bin.
- For categorical variables, create a missing bin
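The impute-and-flag approach for a continuous variable can be sketched as follows (`None` stands in for a missing value; names are illustrative):

```python
def impute_with_flag(values):
    """Mean-impute missing (None) values and add a 0/1 missing flag."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    imputed = [mean if v is None else v for v in values]
    flags = [1 if v is None else 0 for v in values]
    return imputed, flags

imputed, flags = impute_with_flag([4.0, None, 8.0])
```

The flag column preserves the information that the value was missing, which the imputed value alone throws away.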
18. What are 3 ways to bin numeric variables?: Equal width - each bin has the same range of the variable value but
different numbers of observations within each bin (EX: 0-10, 11-20, 21-30...)
Equal depth - each bin has the same number of observations (take percentiles of the population and bin them)
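Both binning schemes can be sketched in pure Python (function names are illustrative; the equal-width sketch assumes the variable is not constant):

```python
def equal_width_bins(values, k):
    """Equal width: each bin covers the same range of the variable."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    # Clamp to k-1 so the maximum value lands in the last bin
    return [min(int((v - lo) / width), k - 1) for v in values]

def equal_depth_bins(values, k):
    """Equal depth: each bin holds (roughly) the same number of observations."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = min(int(rank * k / len(values)), k - 1)
    return bins

vals = [1, 2, 3, 10, 20, 30]
```

On skewed data like `vals`, equal width piles most observations into the first bin, while equal depth spreads them evenly - the trade-off between the two schemes.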