Exam (elaborations)

DSCI 4520 Exam 2023 with verified questions and answers

Grade: A+
Uploaded on: 21-03-2023
Written in: 2022/2023

Data Mining (knowledge discovery in databases)
- A process of identifying hidden patterns and relationships within data.
- Extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) information or patterns from data in large databases.

Data explosion problem
The tremendous amounts of data stored in databases, data warehouses, and other information repositories, arising from automated data collection tools and mature database technology.

Required expertise for data mining
- Domain
- Data
- Analytical methods

Data mining: Information technology (IT)
Complicated database queries

Data mining: Machine learning (ML)
Inductive learning from examples

Data mining: Statistics (Stats)
What we were taught not to do

Market analysis and management; risk analysis and management; fraud detection and management
Potential applications of data mining in database analysis and decision support include ______, _____, & _______.

Market Analysis and Management: Data sources for analysis
Credit card transactions, loyalty cards, discount coupons, customer complaint calls, plus (public) lifestyle studies.

Market Analysis and Management: Target Marketing
Finds clusters of "model" customers who share the same characteristics: interest, income level, spending habits, etc.

Market Analysis and Management: Cross-Market Analysis
- Seeks associations/co-relationships between product sales
- Predictions based on association information

Market Analysis and Management: Customer Profiling
Using data mining to reveal what types of customers buy what products (clustering or classification).

Corporate Analysis and Risk Management
Finance planning and asset evaluation, resource planning, and competition are all included in ______ ____ and ____ _______, a sector where data mining is a very valuable resource.

Knowledge Discovery in Databases
The acronym "KDD" stands for what?

Sample, Explore, Modify, Model, Assess
The acronym "SEMMA" stands for what?

Sample
Input Data Source, Sampling, & Data Partition are what type of nodes in SEMMA?

Explore
Distribution Explorer, Association, Multiplot, Variable Selection, Insight, & Link Analysis are what type of nodes in SEMMA?

Modify
Data Set Attributes, Clustering, Transform Variables, Self-Organized Maps/Kohonen Networks, Filter Outliers, Time Series, & Replacement are what type of nodes in SEMMA?

Model
Regression, User Defined Model, Tree, Ensemble, Neural Network, Memory Based Reasoning, Princomp/Dmneural, & Two-Stage Model are what type of nodes in SEMMA?

Assess
Assessment & Reporter are what type of nodes in SEMMA?

Scoring Nodes
Score & C*Score are what type of nodes?

Utility Nodes
Group Processing, Data Mining Database, SAS Code, Control Point, & Subdiagram are what type of nodes?

Categorical variable
______ (variable) is one for which the measurement scale consists of a set of categories.

Nominal
Categorical variables for which levels (categories) do not have a natural ordering are called _____?

Ordinal
Categorical variables that do have a natural ordering of their levels are called ______?

Interval variable
______ (variable) is one that has numerical distances between any two levels of the scale.

Define and measure the target variable
The FIRST step in any data mining project is to __________________ to be predicted by the model that emerges from your analysis of the data.

By comparing the distributions of some key variables in the sample and the target universe.
How do you verify that a sample is a good representation of a target universe?

Observation Weights
When distributions of key characteristics in a sample and a target population are different, sometimes __________ are used to correct for any bias.
*In order to detect the difference between a target population and a sample, it is necessary to have some prior knowledge of the target population.*

Pre-Processing the Data
The following steps are all included in what?
- Eliminating obviously irrelevant data elements (name, SS#, etc.) that clearly have no effect on the target variable
- Converting data to the appropriate measurement scale, especially converting categorical (nominal-scaled) data to interval-scaled when appropriate
- Eliminating variables with highly skewed distributions
- Eliminating inputs that are really target variables disguised as inputs
- Imputing missing values

The modeling tool and the number of inputs under consideration for modeling
What does the choice of modeling strategy depend on?

Input Data Node
- 1st node in any diagram (unless you start with the SAS Code node).
- Node in which you specify the data set that you want to use in the diagram.

MultiPlot & StatExplore
Nodes used in the initial data exploration of a diagram. They enable you to examine the distributions of the input variables and their relationships with the target variable.
**To use these nodes you must create a process flow diagram.

StatExplore Node
Node that can be used to find out which input variables are closely related to the target.
**Results of this node produce:
1) A graph window showing a graph called the Chi-Square Plot, which shows the statistic known as Cramer's V on the vertical axis and the inputs on the horizontal axis.
2) An output window that includes a sub-window called "Results", where you can see the modal value of each input for each target level.

MultiPlot Node
Node that can be used for visualizing the data of a diagram. You are able to examine the distributions of the variables and the relationships among variables.

Impute Node
What node is used for imputing missing values of inputs?

Data Partition Node
What is the node used to partition a sample into Training, Validation, and Test sub-samples?

Training Data
Data used for developing a model using tools such as Regression, Decision Tree, & Neural Network.
These tools generate a number of models.

Validation Data set
Data set used to evaluate the models generated by the tools used in the training process, and then select the best one.

"Fine Tuning"
Common name for the process of selecting the best model (occurs within Validation).

Test Data
Data set used for an independent assessment of a final model (occurs after Validation).

Default, Simple random, Cluster, & Stratified
What are the four partitioning methods that can be specified for use in the Data Partition node?

Filter Node
Node which can be used for eliminating observations with extreme values (outliers) in the variables.
**You should not use this node routinely to eliminate outliers. While it may be reasonable to eliminate some outliers from very large data sets for predictive models, the outliers often contain interesting information that leads to insights about the data and customer behavior.
**Before using it, you should first find out the source of any extreme value. If an error is the source, the error should be corrected. If there is no error, you can truncate the value so that the extreme value doesn't have an undue influence on the model.

Variable Selection Node
- Node which can be used to make a preliminary selection of the variables to be included in predictive modeling.
- Performs initial selection as well as a final variable selection.
*There are a number of alternative methods with various options for selecting variables. The methods of variable selection depend on the measurement scales of the inputs and the targets.

R-Square & Chi-Square selection methods
Two basic techniques used by the Variable Selection node.
**Both techniques select variables based on the strength of their relationship with the target variable.

Interval Targets
Type of target where ONLY the R-Square selection method is available.

Binary Targets
Type of target where BOTH the R-Square and Chi-Square selection methods are available.

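
As a rough illustration of the Training/Validation/Test partitioning and the Stratified method described in the cards above, the sketch below splits records so that each sub-sample keeps about the same share of every target level. It is not how the Data Partition node is implemented; the function name `stratified_partition`, the 40/30/30 fractions, and the sample records are all assumptions made for this example.

```python
import random
from collections import defaultdict


def stratified_partition(records, target_key, fractions=(0.4, 0.3, 0.3), seed=0):
    """Split records into training, validation, and test lists.

    Records are grouped by target level first, so every sub-sample keeps
    roughly the same mix of target levels (a stratified partition).
    The 40/30/30 default is an illustrative assumption, not a SAS default.
    """
    rng = random.Random(seed)  # fixed seed so the split is repeatable

    # Group the records by the level of the target variable.
    by_level = defaultdict(list)
    for rec in records:
        by_level[rec[target_key]].append(rec)

    train, valid, test = [], [], []
    for level_records in by_level.values():
        rng.shuffle(level_records)
        n = len(level_records)
        n_train = round(n * fractions[0])
        n_valid = round(n * fractions[1])
        train.extend(level_records[:n_train])
        valid.extend(level_records[n_train:n_train + n_valid])
        test.extend(level_records[n_train + n_valid:])  # remainder goes to test
    return train, valid, test


# Hypothetical data: 100 records, 20% of which have target = 1.
records = [{"id": i, "target": 1 if i % 5 == 0 else 0} for i in range(100)]
train, valid, test = stratified_partition(records, "target")

for name, part in (("training", train), ("validation", valid), ("test", test)):
    events = sum(r["target"] for r in part)
    print(name, len(part), "records,", events, "with target = 1")
```

In this sketch the training part would be used to fit candidate models, the validation part to choose among them, and the test part only for the final independent assessment.
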
R-Square selection method
Variable selection method which uses these two steps:
1) The Variable Selection node computes an R-Square value based on the relationship between the target and each input variable, and then assigns the Rejected role to those variables that have a value less than the minimum R-Square.
2) The Variable Selection node performs a forward stepwise regression to evaluate the variables chosen in the first step. Those variables that have a stepwise R-Square improvement less than the cut-off criterion are given the role of "Rejected."

R-Square Value (squared correlation coefficient)
The proportion of variation in the target variable explained by a single input variable, ignoring the effect of other input variables.

Combining categories with similar distribution of target levels
For variable selection, the number of categories of a nominal categorical variable can be reduced by _____?

Chi-Square selection method
The method where the Variable Selection node creates a tree based on Chi-Square maximization is the ________ method.
*With this method, the Variable Selection node first bins interval variables and then uses those rather than the original inputs in building the tree. The default number of bins is 50 (can be changed).
*Any split with a ________ below the specified threshold is rejected. The default value for this threshold is 3.84 (can be changed by setting Minimum of the "methods" value to the desired level).
*Inputs that give the best splits are included in the final tree and passed to the next node with the role of "Input."

Transformations for INTERVAL Inputs
The following transformations apply to what type of inputs?
i) Simple Transformations
- Log, Square Root, Inverse, Square, Exponential, and Standardize.
ii) Binning Transformations
- Setting the value of _____ inputs property to Bucket, Quantile, or Optimal.
iii) Best Power Transformations
- Maximum Normal
- Maximum Correlation
- Equalize spread with Target levels
- Optimal Maximum Spread with Target Level

Transformation of CLASS Inputs
The following available transformations apply to which type of inputs?
i) Group Rare Levels transformation
ii) Dummy Indicators transformation

SAS Code Node
Node used to incorporate SAS procedures and external SAS code into the process flow of a project. DATA step programming can be performed in this node.

Class variables
Name given to the inputs when the target is continuous and the inputs are categorical and nominal-scaled?

Continuous Target with Numeric Interval-Scaled Inputs
The Variable Selection node uses R-Square as the default criterion of selection when a target is?

AOV16 *(CHECK TO MAKE SURE THIS IS CORRECT)*
Binned variables in Enterprise Miner are labeled _____?

Continuous Target with Nominal Categorical Inputs
When is R-Square calculated using one-way ANOVA, with the option of using either the original or the grouped variables?

Binary Target with Numeric Interval-Scaled Inputs
What case allows for the following? Either the R-Square or Chi-Square criterion can be used.
i) R-Square: The target variable is treated like a continuous variable, and R-Square with the target is computed for each original and binned input. A two-step procedure is followed:
1) An initial variable set is selected based on the Min R-Square criterion.
2) A sequential forward selection procedure is used to select variables from those selected in Step 1 on the basis of a Stop R-Square criterion.
** Inputs that meet both of these criteria receive the Role of Input in the subsequent modeling tool, such as the Regression node.
ii) Chi-Square: The selection process doesn't have two distinct steps. Instead, a tree is constructed.
- The inputs selected in the construction of the tree are passed to the next node with the assigned Role of Input.

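
To make the first step of the R-Square method concrete, here is a minimal sketch that computes the squared correlation between each input and the target and assigns the Rejected role to inputs that fall below a minimum R-Square. It covers Step 1 only (the forward selection of Step 2 is left out), and the function names, the 0.1 cut-off, and the sample data are assumptions for illustration, not Enterprise Miner defaults.

```python
def r_square(xs, ys):
    """Squared correlation coefficient between one input and the target."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant variable explains no variation
    return sxy * sxy / (sxx * syy)


def screen_inputs(inputs, target, min_r_square):
    """Step 1 only: assign 'Input' or 'Rejected' by comparing each R-Square
    with the minimum R-Square. Each input is evaluated on its own,
    ignoring the effect of the other inputs."""
    roles = {}
    for name, values in inputs.items():
        keep = r_square(values, target) >= min_r_square
        roles[name] = "Input" if keep else "Rejected"
    return roles


# Hypothetical data: one informative input, one weak input, one constant.
target = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
inputs = {
    "income": [2.1, 3.9, 6.2, 8.1, 9.8, 12.0],
    "noise": [1.0, -1.0, 1.0, -1.0, 1.0, -1.0],
    "constant": [7.0] * 6,
}

# 0.1 is an illustrative cut-off chosen for this toy data set.
roles = screen_inputs(inputs, target, min_r_square=0.1)
for name, role in roles.items():
    print(name, round(r_square(inputs[name], target), 4), role)
```

Only the inputs that survive this screen would go on to the forward selection in Step 2, where the Stop R-Square criterion applies.
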
Chi-Squared Automatic Interaction Detection
The acronym (CHAID) that is sometimes used as a label for a tree constructed using the Chi-Squared criterion stands for ________?

Chi-Squared Tree Development stages
What do the following stages portray?
1) The records of the data set are divided into two groups.
2) Each group is further divided into two more groups, and so on.

Recursive partitioning
Name of the process of dividing groups during tree development.

Node
Each group created during the recursive partitioning is called a _____?

Terminal Nodes, Leaf Nodes, or Leaves of the Tree
Nodes that are not divided further are called _____?

Intermediate Nodes
Nodes that are divided further are called _____?

Root Node
The node with all the records of the data set is called the _______?

Tree
Terminal nodes (leaf nodes, or leaves of the tree), intermediate nodes, and the root node all together make what?

Group Rare Levels & Dummy Indicators transformations for the categories
What are the two types of transformations for the categorical (class) variables?

Cross-Industry Standard Process for Data Mining
The acronym CRISP-DM stands for what?

The Six Phases of CRISP-DM
- Business understanding
- Data understanding
- Data preparation
- Modeling
- Evaluation
- Deployment

Business Understanding
The following are all components of _____?
- Determining business objectives
- Assessing the current situation
- Establishing data mining goals
- Developing a project plan

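
The tree development cards above (root node, recursive partitioning, terminal nodes) together with the Chi-Square threshold of 3.84 can be sketched as a small recursive routine for a binary target. This is a simplified illustration, not the Variable Selection node's algorithm: it skips the binning of interval inputs, only tries two-way splits of the form "value <= cut", and every name and data value in it is hypothetical.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson Chi-Square statistic for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, row2, col1, col2 = a + b, c + d, a + c, b + d
    if 0 in (row1, row2, col1, col2):
        return 0.0  # a degenerate table carries no association
    return n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)


def best_split(rows, inputs, target):
    """Return (statistic, input name, cut) for the split with the largest Chi-Square."""
    best = (0.0, None, None)
    for name in inputs:
        # Candidate cuts: every distinct value except the largest.
        for cut in sorted({r[name] for r in rows})[:-1]:
            left = [r for r in rows if r[name] <= cut]
            right = [r for r in rows if r[name] > cut]
            a = sum(r[target] for r in left)
            c = sum(r[target] for r in right)
            stat = chi_square_2x2(a, len(left) - a, c, len(right) - c)
            if stat > best[0]:
                best = (stat, name, cut)
    return best


def grow_tree(rows, inputs, target, threshold=3.84):
    """Recursive partitioning: divide a group in two, then divide each part again."""
    stat, name, cut = best_split(rows, inputs, target)
    if name is None or stat < threshold:
        # No split clears the threshold, so this group is a terminal (leaf) node.
        return {"leaf": True, "n": len(rows), "events": sum(r[target] for r in rows)}
    left = [r for r in rows if r[name] <= cut]
    right = [r for r in rows if r[name] > cut]
    return {
        "leaf": False, "input": name, "cut": cut, "chi_square": stat,
        "left": grow_tree(left, inputs, target, threshold),
        "right": grow_tree(right, inputs, target, threshold),
    }


def inputs_used(tree):
    """Inputs that appear in some split; these would be kept with the role of Input."""
    if tree["leaf"]:
        return set()
    return {tree["input"]} | inputs_used(tree["left"]) | inputs_used(tree["right"])


# Hypothetical records: the target depends on age only, parity is unrelated to it.
rows = [{"age": a, "parity": a % 2, "target": 1 if a > 10 else 0} for a in range(1, 21)]
tree = grow_tree(rows, ["age", "parity"], "target")
print(tree["input"], tree["cut"], tree["chi_square"])
print(sorted(inputs_used(tree)))
```

Here the root node holds all 20 records and the single split on age produces two leaves; parity never yields a split above the threshold, so it is not among the inputs used.
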

Institution

DSCI 4520

Course

DSCI 4520

Document information

Uploaded on: March 21, 2023
Number of pages: 8
Written in: 2022/2023
Type: Exam (elaborations)
Contains: Questions & answers

Seller: Arthurmark
