Discovering Knowledge in Data

Name: Discovering Knowledge in Data | An Introduction to Data Mining
Brand: Wiley
Price: 86.99 EUR
Availability: OnlineOnly

An Introduction to Data Mining

Daniel T. Larose (Author)

Wiley (Publisher)

2nd Edition

Published on 2 June 2014

336 pages

E-Book

ePUB with Adobe-DRM

978-1-118-87357-1 (ISBN)

€86.99 incl. 7% VAT

E-Book Single Licence

Available for download

Content

Preface xi

Chapter 1 An Introduction to Data Mining 1

1.1 What is Data Mining? 1

1.2 Wanted: Data Miners 2

1.3 The Need for Human Direction of Data Mining 3

1.4 The Cross-Industry Standard Practice for Data Mining 4

1.4.1 CRISP-DM: The Six Phases 5

1.5 Fallacies of Data Mining 6

1.6 What Tasks Can Data Mining Accomplish? 8

1.6.1 Description 8

1.6.2 Estimation 8

1.6.3 Prediction 10

1.6.4 Classification 10

1.6.5 Clustering 12

1.6.6 Association 14

References 14

Exercises 15

Chapter 2 Data Preprocessing 16

2.1 Why do We Need to Preprocess the Data? 17

2.2 Data Cleaning 17

2.3 Handling Missing Data 19

2.4 Identifying Misclassifications 22

2.5 Graphical Methods for Identifying Outliers 22

2.6 Measures of Center and Spread 23

2.7 Data Transformation 26

2.8 Min-Max Normalization 26

2.9 Z-Score Standardization 27

2.10 Decimal Scaling 28

2.11 Transformations to Achieve Normality 28

2.12 Numerical Methods for Identifying Outliers 35

2.13 Flag Variables 36

2.14 Transforming Categorical Variables into Numerical Variables 37

2.15 Binning Numerical Variables 38

2.16 Reclassifying Categorical Variables 39

2.17 Adding an Index Field 39

2.18 Removing Variables that are Not Useful 39

2.19 Variables that Should Probably Not Be Removed 40

2.20 Removal of Duplicate Records 41

2.21 A Word About ID Fields 41

The R Zone 42

References 48

Exercises 48

Hands-On Analysis 50

Chapter 3 Exploratory Data Analysis 51

3.1 Hypothesis Testing Versus Exploratory Data Analysis 51

3.2 Getting to Know the Data Set 52

3.3 Exploring Categorical Variables 55

3.4 Exploring Numeric Variables 62

3.5 Exploring Multivariate Relationships 69

3.6 Selecting Interesting Subsets of the Data for Further Investigation 71

3.7 Using EDA to Uncover Anomalous Fields 71

3.8 Binning Based on Predictive Value 72

3.9 Deriving New Variables: Flag Variables 74

3.10 Deriving New Variables: Numerical Variables 77

3.11 Using EDA to Investigate Correlated Predictor Variables 77

3.12 Summary 80

The R Zone 82

Reference 88

Exercises 88

Hands-On Analysis 89

Chapter 4 Univariate Statistical Analysis 91

4.1 Data Mining Tasks in Discovering Knowledge in Data 91

4.2 Statistical Approaches to Estimation and Prediction 92

4.3 Statistical Inference 93

4.4 How Confident are We in Our Estimates? 94

4.5 Confidence Interval Estimation of the Mean 95

4.6 How to Reduce the Margin of Error 97

4.7 Confidence Interval Estimation of the Proportion 98

4.8 Hypothesis Testing for the Mean 99

4.9 Assessing the Strength of Evidence Against the Null Hypothesis 101

4.10 Using Confidence Intervals to Perform Hypothesis Tests 102

4.11 Hypothesis Testing for the Proportion 104

The R Zone 105

Reference 106

Exercises 106

Chapter 5 Multivariate Statistics 109

5.1 Two-Sample t-Test for Difference in Means 110

5.2 Two-Sample Z-Test for Difference in Proportions 111

5.3 Test for Homogeneity of Proportions 112

5.4 Chi-Square Test for Goodness of Fit of Multinomial Data 114

5.5 Analysis of Variance 115

5.6 Regression Analysis 118

5.7 Hypothesis Testing in Regression 122

5.8 Measuring the Quality of a Regression Model 123

5.9 Dangers of Extrapolation 123

5.10 Confidence Intervals for the Mean Value of y Given x 125

5.11 Prediction Intervals for a Randomly Chosen Value of y Given x 125

5.12 Multiple Regression 126

5.13 Verifying Model Assumptions 127

The R Zone 131

Reference 135

Exercises 135

Hands-On Analysis 136

Chapter 6 Preparing to Model the Data 138

6.1 Supervised Versus Unsupervised Methods 138

6.2 Statistical Methodology and Data Mining Methodology 139

6.3 Cross-Validation 139

6.4 Overfitting 141

6.5 Bias-Variance Trade-Off 142

6.6 Balancing the Training Data Set 144

6.7 Establishing Baseline Performance 145

The R Zone 146

Reference 147

Exercises 147

Chapter 7 K-Nearest Neighbor Algorithm 149

7.1 Classification Task 149

7.2 k-Nearest Neighbor Algorithm 150

7.3 Distance Function 153

7.4 Combination Function 156

7.4.1 Simple Unweighted Voting 156

7.4.2 Weighted Voting 156

7.5 Quantifying Attribute Relevance: Stretching the Axes 158

7.6 Database Considerations 158

7.7 k-Nearest Neighbor Algorithm for Estimation and Prediction 159

7.8 Choosing k 160

7.9 Application of k-Nearest Neighbor Algorithm Using IBM/SPSS Modeler 160

The R Zone 162

Exercises 163

Hands-On Analysis 164

Chapter 8 Decision Trees 165

8.1 What is a Decision Tree? 165

8.2 Requirements for Using Decision Trees 167

8.3 Classification and Regression Trees 168

8.4 C4.5 Algorithm 174

8.5 Decision Rules 179

8.6 Comparison of the C5.0 and CART Algorithms Applied to Real Data 180

The R Zone 183

References 184

Exercises 185

Hands-On Analysis 185

Chapter 9 Neural Networks 187

9.1 Input and Output Encoding 188

9.2 Neural Networks for Estimation and Prediction 190

9.3 Simple Example of a Neural Network 191

9.4 Sigmoid Activation Function 193

9.5 Back-Propagation 194

9.5.1 Gradient Descent Method 194

9.5.2 Back-Propagation Rules 195

9.5.3 Example of Back-Propagation 196

9.6 Termination Criteria 198

9.7 Learning Rate 198

9.8 Momentum Term 199

9.9 Sensitivity Analysis 201

9.10 Application of Neural Network Modeling 202

The R Zone 204

References 207

Exercises 207

Hands-On Analysis 207

Chapter 10 Hierarchical and K-Means Clustering 209

10.1 The Clustering Task 209

10.2 Hierarchical Clustering Methods 212

10.3 Single-Linkage Clustering 213

10.4 Complete-Linkage Clustering 214

10.5 k-Means Clustering 215

10.6 Example of k-Means Clustering at Work 216

10.7 Behavior of MSB, MSE, and Pseudo-F as the k-Means Algorithm Proceeds 219

10.8 Application of k-Means Clustering Using SAS Enterprise Miner 220

10.9 Using Cluster Membership to Predict Churn 223

The R Zone 224

References 226

Exercises 226

Hands-On Analysis 226

Chapter 11 Kohonen Networks 228

11.1 Self-Organizing Maps 228

11.2 Kohonen Networks 230

11.2.1 Kohonen Networks Algorithm 231

11.3 Example of a Kohonen Network Study 231

11.4 Cluster Validity 235

11.5 Application of Clustering Using Kohonen Networks 235

11.6 Interpreting the Clusters 237

11.6.1 Cluster Profiles 240

11.7 Using Cluster Membership as Input to Downstream Data Mining Models 242

The R Zone 243

References 245

Exercises 245

Hands-On Analysis 245

Chapter 12 Association Rules 247

12.1 Affinity Analysis and Market Basket Analysis 247

12.1.1 Data Representation for Market Basket Analysis 248

12.2 Support, Confidence, Frequent Itemsets, and the A Priori Property 249

12.3 How Does the A Priori Algorithm Work? 251

12.3.1 Generating Frequent Itemsets 251

12.3.2 Generating Association Rules 253

12.4 Extension from Flag Data to General Categorical Data 255

12.5 Information-Theoretic Approach: Generalized Rule Induction Method 256

12.5.1 J-Measure 257

12.6 Association Rules are Easy to Do Badly 258

12.7 How Can We Measure the Usefulness of Association Rules? 259

12.8 Do Association Rules Represent Supervised or Unsupervised Learning? 260

12.9 Local Patterns Versus Global Models 261

The R Zone 262

References 263

Exercises 263

Hands-On Analysis 264

Chapter 13 Imputation of Missing Data 266

13.1 Need for Imputation of Missing Data 266

13.2 Imputation of Missing Data: Continuous Variables 267

13.3 Standard Error of the Imputation 270

13.4 Imputation of Missing Data: Categorical Variables 271

13.5 Handling Patterns in Missingness 272

The R Zone 273

Reference 276

Exercises 276

Hands-On Analysis 276

Chapter 14 Model Evaluation Techniques 277

14.1 Model Evaluation Techniques for the Description Task 278

14.2 Model Evaluation Techniques for the Estimation and Prediction Tasks 278

14.3 Model Evaluation Techniques for the Classification Task 280

14.4 Error Rate, False Positives, and False Negatives 280

14.5 Sensitivity and Specificity 283

14.6 Misclassification Cost Adjustment to Reflect Real-World Concerns 284

14.7 Decision Cost/Benefit Analysis 285

14.8 Lift Charts and Gains Charts 286

14.9 Interweaving Model Evaluation with Model Building 289

14.10 Confluence of Results: Applying a Suite of Models 290

The R Zone 291

Reference 291

Exercises 291

Hands-On Analysis 291

Appendix: Data Summarization and Visualization 294

Index 309

Preface

What is Data Mining?

According to the Gartner Group,

Data mining is the process of discovering meaningful new correlations, patterns and trends by sifting through large amounts of data stored in repositories, using pattern recognition technologies as well as statistical and mathematical techniques.

Today, a variety of terms are used to describe this process, including analytics, predictive analytics, big data, machine learning, and knowledge discovery in databases. But these terms all share the objective of mining actionable nuggets of knowledge from large data sets. We shall therefore use the term data mining to represent this process throughout this text.

Why is This Book Needed?

Humans are inundated with data in most fields. Unfortunately, these valuable data, which cost firms millions to collect and collate, are languishing in warehouses and repositories. The problem is that there are not enough trained human analysts available who are skilled at translating all of these data into knowledge, and thence up the taxonomy tree into wisdom. This is why this book is needed.

The McKinsey Global Institute reports:1

There will be a shortage of talent necessary for organizations to take advantage of big data. A significant constraint on realizing value from big data will be a shortage of talent, particularly of people with deep expertise in statistics and machine learning, and the managers and analysts who know how to operate companies by using insights from big data…. We project that demand for deep analytical positions in a big data world could exceed the supply being produced on current trends by 140,000 to 190,000 positions. … In addition, we project a need for 1.5 million additional managers and analysts in the United States who can ask the right questions and consume the results of the analysis of big data effectively.

This book is an attempt to help alleviate this critical shortage of data analysts. Discovering Knowledge in Data: An Introduction to Data Mining provides readers with:

The models and techniques to uncover hidden nuggets of information,
The insight into how the data mining algorithms really work, and
The experience of actually performing data mining on large data sets.

Data mining is becoming more widespread every day, because it empowers companies to uncover profitable patterns and trends from their existing databases. Companies and institutions have spent millions of dollars to collect megabytes and terabytes of data, but are not taking advantage of the valuable and actionable information hidden deep within their data repositories. However, as the practice of data mining becomes more widespread, companies that do not apply these techniques are in danger of falling behind and losing market share, because their competitors are applying data mining and thereby gaining the competitive edge.

In Discovering Knowledge in Data, the step-by-step, hands-on solutions of real-world business problems, using widely available data mining techniques applied to real-world data sets, will appeal to managers, CIOs, CEOs, CFOs, and others who need to keep abreast of the latest methods for enhancing return-on-investment.

What's New for the Second Edition?

The second edition of Discovering Knowledge in Data is enhanced with an abundance of new material and useful features, including:

Nearly 100 pages of new material.
Three new chapters:
- Chapter 5: Multivariate Statistical Analysis covers the hypothesis tests used for verifying whether data partitions are valid, along with analysis of variance, multiple regression, and other topics.
- Chapter 6: Preparing to Model the Data introduces a new formula for balancing the training data set, and examines the importance of establishing baseline performance, among other topics.
- Chapter 13: Imputation of Missing Data addresses one of the most overlooked issues in data analysis, and shows how to impute missing values for continuous variables and for categorical variables, as well as how to handle patterns in missingness.
The R Zone. In most chapters of this book, the reader will find The R Zone, which provides the actual R code needed to obtain the results shown in the chapter, along with screenshots of some of the output, using RStudio.
A host of new topics not covered in the first edition. Here is a sample of these new topics, chapter by chapter:
- Chapter 2: Data Preprocessing. Decimal scaling; Transformations to achieve normality; Flag variables; Transforming categorical variables into numerical variables; Binning numerical variables; Reclassifying categorical variables; Adding an index field; Removal of duplicate records.
- Chapter 3: Exploratory Data Analysis. Binning based on predictive value; Deriving new variables: Flag variables; Deriving new variables: Numerical variables; Using EDA to investigate correlated predictor variables.
- Chapter 4: Univariate Statistical Analysis. How to reduce the margin of error; Confidence interval estimation of the proportion; Hypothesis testing for the mean; Assessing the strength of evidence against the null hypothesis; Using confidence intervals to perform hypothesis tests; Hypothesis testing for the proportion.
- Chapter 5: Multivariate Statistics. Two-sample test for difference in means; Two-sample test for difference in proportions; Test for homogeneity of proportions; Chi-square test for goodness of fit of multinomial data; Analysis of variance; Hypothesis testing in regression; Measuring the quality of a regression model.
- Chapter 6: Preparing to Model the Data. Balancing the training data set; Establishing baseline performance.
- Chapter 7: k-Nearest Neighbor Algorithm. Application of k-nearest neighbor algorithm using IBM/SPSS Modeler.
- Chapter 10: Hierarchical and k-Means Clustering. Behavior of MSB, MSE, and pseudo-F as the k-means algorithm proceeds.
- Chapter 12: Association Rules. How can we measure the usefulness of association rules?
- Chapter 13: Imputation of Missing Data. Need for imputation of missing data; Imputation of missing data for continuous variables; Imputation of missing data for categorical variables; Handling patterns in missingness.
- Chapter 14: Model Evaluation Techniques. Sensitivity and Specificity.
An Appendix on Data Summarization and Visualization. Readers who may be a bit rusty on introductory statistics may find this new feature helpful. Definitions and illustrative examples of introductory statistical concepts are provided here, along with many graphs and tables, as follows:
- Part 1: Summarization 1: Building Blocks of Data Analysis
- Part 2: Visualization: Graphs and Tables for Summarizing and Organizing Data
- Part 3: Summarization 2: Measures of Center, Variability, and Position
- Part 4: Summarization and Visualization of Bivariate Relationships
New Exercises. There are over 100 new chapter exercises in the second edition.

Danger! Data Mining is Easy to Do Badly

The plethora of new off-the-shelf software platforms for performing data mining has kindled a new kind of danger. The ease with which these graphical user interface (GUI)-based applications can manipulate data, combined with the power of the formidable data mining algorithms embedded in the black box software currently available, makes their misuse proportionally more hazardous.

Just as with any new information technology, data mining is easy to do badly. A little knowledge is especially dangerous when it comes to applying powerful models based on large data sets. For example, analyses carried out on unpreprocessed data can lead to erroneous conclusions, or inappropriate analysis may be applied to data sets that call for a completely different approach, or models may be derived that are built upon wholly specious assumptions. These errors in analysis can lead to very expensive failures, if deployed.
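As a small illustration of the first hazard, analysis on unpreprocessed data, the sketch below shows how an unscaled field can dominate a Euclidean distance calculation and change which record counts as the "nearest neighbor". It is not taken from the book: the records, field choices, and function names are hypothetical, and it is written in Python rather than the R used in The R Zone. Min-max normalization rescales each field to the range 0 to 1 via (x - min) / (max - min).

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric records."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def min_max_normalize(records):
    """Rescale every field to [0, 1] using (x - min) / (max - min).

    Assumes each field takes at least two distinct values, so that
    max - min is never zero.
    """
    columns = list(zip(*records))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        tuple((x - lo) / (hi - lo) for x, lo, hi in zip(row, lows, highs))
        for row in records
    ]


def nearest_to_first(rows):
    """Index of the record closest to rows[0], excluding rows[0] itself."""
    return min(range(1, len(rows)), key=lambda i: euclidean(rows[0], rows[i]))


# Hypothetical records: (age in years, income in dollars).
records = [
    (25, 50_000),
    (60, 51_000),
    (27, 58_000),
    (40, 90_000),
]

# On the raw data, income differences swamp age differences.
raw_neighbor = nearest_to_first(records)

# After normalization, both fields contribute on a comparable scale.
scaled_neighbor = nearest_to_first(min_max_normalize(records))

print("nearest on raw data:       ", records[raw_neighbor])
print("nearest on normalized data:", records[scaled_neighbor])
```

On the raw values the 60-year-old is judged closest to the 25-year-old, purely because their incomes differ by only 1,000 dollars; after normalization the 27-year-old is closest. Which answer is appropriate depends on the problem, but the example shows why the scale of the inputs has to be a deliberate choice rather than an accident of the units.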

“White Box” Approach: Understanding the Underlying Algorithmic and Model Structures

The best way to avoid these costly errors, which stem from a blind black-box approach to data mining, is to instead apply a “white-box” methodology, which emphasizes an understanding of the algorithmic and statistical model structures underlying the software.

Discovering Knowledge in Data applies this white-box approach by:

Walking the reader through the various algorithms;
Providing examples of the operation of the algorithm on actual large data sets;
Testing the reader's level of understanding of the concepts and algorithms;
Providing an opportunity for the reader to do some real data mining on large data sets; and
Supplying the reader with the actual R code used to achieve these data mining results, in The R Zone.

Algorithm Walk-Throughs

Discovering Knowledge in Data walks the reader through the operations and nuances of the various algorithms, using small sample data sets, so that the reader gets a true appreciation of what is really...
