Exam (worked answers)

Complete test bank for Discovering Knowledge in Data by Daniel T. Larose

Pages

237

Grade

A+

Uploaded on

5 December 2024

Written in

2024/2025

No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008.

Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.

For general information on our other products and services please contact our Customer Care Department.

Wiley also publishes its books in a variety of electronic formats. Some content that appears in print, however, may not be available in electronic format.

Library of Congress Cataloging-in-Publication Data:

Larose, Daniel T.
Discovering knowledge in data : an introduction to data mining / Daniel T. Larose
p. cm.
Includes bibliographical references and index.
ISBN 0-471-66657-2 (cloth)
1. Data mining. I. Title.
QA76.9.D343L38 2005
006.3 12—dc

Printed in the United States of America
10 9 8 7 6 5 4 3 2 1

Dedication

To my parents,
And their parents,
And so on...

For my children,
And their children,
And so on...

2004 Chantal Larose

CONTENTS

PREFACE

1 INTRODUCTION TO DATA MINING
  What Is Data Mining?
  Why Data Mining?
  Need for Human Direction of Data Mining
  Cross-Industry Standard Process: CRISP–DM
  Case Study 1: Analyzing Automobile Warranty Claims: Example of the CRISP–DM Industry Standard Process in Action
  Fallacies of Data Mining
  What Tasks Can Data Mining Accomplish?
  Description
  Estimation
  Prediction
  Classification
  Clustering
  Association
  Case Study 2: Predicting Abnormal Stock Market Returns Using Neural Networks
  Case Study 3: Mining Association Rules from Legal Databases
  Case Study 4: Predicting Corporate Bankruptcies Using Decision Trees
  Case Study 5: Profiling the Tourism Market Using k-Means Clustering Analysis
  References
  Exercises

2 DATA PREPROCESSING
  Why Do We Need to Preprocess the Data?
  Data Cleaning
  Handling Missing Data
  Identifying Misclassifications
  Graphical Methods for Identifying Outliers
  Data Transformation
  Min–Max Normalization
  Z-Score Standardization
  Numerical Methods for Identifying Outliers
  References
  Exercises

3 EXPLORATORY DATA ANALYSIS
  Hypothesis Testing versus Exploratory Data Analysis
  Getting to Know the Data Set
  Dealing with Correlated Variables
  Exploring Categorical Variables
  Using EDA to Uncover Anomalous Fields
  Exploring Numerical Variables
  Exploring Multivariate Relationships
  Selecting Interesting Subsets of the Data for Further Investigation
  Binning
  Summary
  References
  Exercises

4 STATISTICAL APPROACHES TO ESTIMATION AND PREDICTION
  Data Mining Tasks in Discovering Knowledge in Data
  Statistical Approaches to Estimation and Prediction
  Univariate Methods: Measures of Center and Spread
  Statistical Inference
  How Confident Are We in Our Estimates?
  Confidence Interval Estimation
  Bivariate Methods: Simple Linear Regression
  Dangers of Extrapolation
  Confidence Intervals for the Mean Value of y Given x
  Prediction Intervals for a Randomly Chosen Value of y Given x
  Multiple Regression
  Verifying Model Assumptions
  References
  Exercises

5 k-NEAREST NEIGHBOR ALGORITHM
  Supervised versus Unsupervised Methods
  Methodology for Supervised Modeling
  Bias–Variance Trade-Off
  Classification Task
  k-Nearest Neighbor Algorithm
  Distance Function
  Combination Function
  Simple Unweighted Voting
  Weighted Voting
  Quantifying Attribute Relevance: Stretching the Axes
  Database Considerations
  k-Nearest Neighbor Algorithm for Estimation and Prediction
  Choosing k
  Reference
  Exercises

6 DECISION TREES
  Classification and Regression Trees
  C4.5 Algorithm
  Decision Rules
  Comparison of the C5.0 and CART Algorithms Applied to Real Data
  References
  Exercises

7 NEURAL NETWORKS
  Input and Output Encoding
  Neural Networks for Estimation and Prediction
  Simple Example of a Neural Network
  Sigmoid Activation Function
  Back-Propagation
  Gradient Descent Method
  Back-Propagation Rules
  Example of Back-Propagation
  Termination Criteria
  Learning Rate
  Momentum Term
  Sensitivity Analysis
  Application of Neural Network Modeling
  References
  Exercises

8 HIERARCHICAL AND k-MEANS CLUSTERING
  Clustering Task
  Hierarchical Clustering Methods
  Single-Linkage Clustering
  Complete-Linkage Clustering
  k-Means Clustering
  Example of k-Means Clustering at Work
  Application of k-Means Clustering Using SAS Enterprise Miner
  Using Cluster Membership to Predict Churn
  References
  Exercises

9 KOHONEN NETWORKS
  Self-Organizing Maps
  Kohonen Networks
  Example of a Kohonen Network Study
  Cluster Validity
  Application of Clustering Using Kohonen Networks
  Interpreting the Clusters
  Cluster Profiles
  Using Cluster Membership as Input to Downstream Data Mining Models
  References
  Exercises

10 ASSOCIATION RULES
  Affinity Analysis and Market Basket Analysis
  Data Representation for Market Basket Analysis
  Support, Confidence, Frequent Itemsets, and the A Priori Property
  How Does the A Priori Algorithm Work (Part 1)? Generating Frequent Itemsets
  How Does the A Priori Algorithm Work (Part 2)? Generating Association Rules
  Extension from Flag Data to General Categorical Data
  Information-Theoretic Approach: Generalized Rule Induction Method
  J-Measure
  Application of Generalized Rule Induction
  When Not to Use Association Rules
  Do Association Rules Represent Supervised or Unsupervised Learning?
  Local Patterns versus Global Models
  References
  Exercises

11 MODEL EVALUATION TECHNIQUES
  Model Evaluation Techniques for the Description Task
  Model Evaluation Techniques for the Estimation and Prediction Tasks
  Model Evaluation Techniques for the Classification Task
  Error Rate, False Positives, and False Negatives
  Misclassification Cost Adjustment to Reflect Real-World Concerns
  Decision Cost/Benefit Analysis
  Lift Charts and Gains Charts
  Interweaving Model Evaluation with Model Building
  Confluence of Results: Applying a Suite of Models
  Reference
  Exercises

EPILOGUE: “WE’VE ONLY JUST BEGUN”

INDEX

PREFACE

WHAT IS DATA MINING?

Data mining is predicted to be “one of the most revolutionary developments of the next decade,” according to the online technology magazine ZDNET News (February 8, 2001). In fact, the MIT Technology Review chose data mining as one of ten emerging technologies that will change the world. According to the Gartner Group, “Data mining is the process of discovering meaningful new correlations, patterns and trends by sifting through large amounts of data stored in repositories, using pattern recognition technologies as well as statistical and mathematical techniques.”

Because data mining represents such an important field, Wiley-Interscience and Dr. Daniel T. Larose have teamed up to publish a series of volumes on data mining, consisting initially of three volumes. The first volume in the series, Discovering Knowledge in Data: An Introduction to Data Mining, introduces the reader to this rapidly growing field of data mining.

WHY IS THIS BOOK NEEDED?

Human beings are inundated with data in most fields. Unfortunately, these valuable data, which cost firms millions to collect and collate, are languishing in warehouses and repositories.
The problem is that not enough trained human analysts are available who are skilled at translating all of the data into knowledge, and thence up the taxonomy tree into wisdom. This is why this book is needed; it provides readers with:

  - Models and techniques to uncover hidden nuggets of information
  - Insight into how data mining algorithms work
  - The experience of actually performing data mining on large data sets

Data mining is becoming more widespread every day, because it empowers companies to uncover profitable patterns and trends from their existing databases. Companies and institutions have spent millions of dollars to collect megabytes and terabytes of data but are not taking advantage of the valuable and actionable information hidden deep within their data repositories. However, as the practice of data mining becomes more widespread, companies that do not apply these techniques are in danger of falling behind and losing market share, because their competitors are using data mining and are thereby gaining the competitive edge. In Discovering Knowledge in Data, the step-by-step hands-on solutions of real-world business problems using widely available data mining techniques applied to real-world data sets will appeal to managers, CIOs, CEOs, CFOs, and others who need to keep abreast of the latest methods for enhancing return on investment.

DANGER! DATA MINING IS EASY TO DO BADLY

The plethora of new off-the-shelf software platforms for performing data mining has kindled a new kind of danger. The ease with which these GUI-based applications can manipulate data, combined with the power of the formidable data mining algorithms embedded in the black-box software currently available, makes their misuse proportionally more hazardous.

Just as with any new information technology, data mining is easy to do badly. A little knowledge is especially dangerous when it comes to applying powerful models based on large data sets.
For example, analyses carried out on unpreprocessed data can lead to erroneous conclusions, or inappropriate analysis may be applied to data sets that call for a completely different approach, or models may be derived that are built upon wholly specious assumptions. If deployed, these errors in analysis can lead to very expensive failures.

“WHITE BOX” APPROACH: UNDERSTANDING THE UNDERLYING ALGORITHMIC AND MODEL STRUCTURES

The best way to avoid these costly errors, which stem from a blind black-box approach to data mining, is to apply instead a “white-box” methodology, which emphasizes an understanding of the algorithmic and statistical model structures underlying the software. Discovering Knowledge in Data applies this white-box approach by:

  - Walking the reader through the various algorithms
  - Providing examples of the operation of the algorithm on actual large data sets
  - Testing the reader’s level of understanding of the concepts and algorithms
  - Providing an opportunity for the reader to do some real data mining on large data sets

Algorithm Walk-Throughs

Discovering Knowledge in Data walks the reader through the operations and nuances of the various algorithms, using small-sample data sets, so that the reader gets a true appreciation of what is really going on inside the algorithm. For example, in Chapter 8, we see the cluster centers being updated, moving toward the center of their respective clusters. Also, in Chapter 9 we see just which type of network weights will result in a particular network node “winning” a particular record.

Applications of the Algorithms to Large Data Sets

Discovering Knowledge in Data provides examples of the application of various algorithms on actual large data sets. For example, in Chapter 7 a classification problem is attacked using a neural network model on a real-world data set.
The resulting neural network topology is examined along with the network connection weights, as reported by the software. These data sets are included at the book series Web site, so that readers may follow the analytical steps on their own, using data mining software of their choice.

Chapter Exercises: Checking to Make Sure That You Understand It

Discovering Knowledge in Data includes over 90 chapter exercises, which allow readers to assess their depth of understanding of the material, as well as to have a little fun playing with numbers and data. These include conceptual exercises, which help to clarify some of the more challenging concepts in data mining, and “tiny data set” exercises, which challenge the reader to apply the particular data mining algorithm to a small data set and, step by step, to arrive at a computationally sound solution. For example, in Chapter 6 readers are provided with a small data set and asked to construct by hand, using the methods shown in the chapter, a C4.5 decision tree model, as well as a classification and regression tree model, and to compare the

Institution

Discovering Knowledge In Data

Course

Discovering knowledge in data

Content preview

DISCOVERING
KNOWLEDGE IN DATA
An Introduction to Data Mining

DANIEL T. LAROSE
Director of Data Mining
Central Connecticut State University

A JOHN WILEY & SONS, INC., PUBLICATION

Document information

Uploaded on: 5 December 2024
Number of pages: 237
Written in: 2024/2025
Type: Exam (worked answers)
Contains: Questions and answers

Topics

discovering knowledge in data

$18.49

Get access to the full document:

Written by students who passed

Available immediately after payment

Read online or as a PDF

Meet the seller

TESTBANKSPOOL001 Princeton University

Member for

1 year

Last sold

4 months ago

0.0

0 reviews

Why students choose Stuvia

Created by fellow students, verified by reviews

Quality you can trust: written by students who passed and reviewed by others who used this document.

Not satisfied? Choose another document

No worries! For the same money you can immediately choose another document that better fits what you are looking for.

Pay the way you like, start studying right away

No subscription, no commitments. Pay the way you are used to, via iDeal or credit card, and download your PDF document right away.

“Bought, downloaded, and passed. It really can be that easy.”

Alisha Student

Frequently asked questions

What do I get when I buy this document?

You get a PDF that is available immediately after your purchase. The purchased document is accessible anytime, anywhere, and indefinitely through your profile.

Satisfaction guarantee: how does it work?

Our satisfaction guarantee ensures that you always find a study document that suits you. You fill in a form and our customer service takes care of the rest.

Who am I buying this summary from?

Stuvia is a marketplace, so you are not buying this document from us but from the seller TESTBANKSPOOL001. Stuvia facilitates the payment to the seller.

Am I immediately locked into a subscription?

No, you only buy this summary for $18.49. After that you are not tied to anything.

Can Stuvia be trusted?

Rated 4.6 stars on Google & Trustpilot (1,000+ reviews). 49,593 summaries were sold in the last 30 days. Founded in 2010, the place to buy summaries for 16 years.
