
Discovering Knowledge in Data
Content
Preface
Chapter 1 An Introduction to Data Mining
1.1 What is Data Mining?
1.2 Wanted: Data Miners
1.3 The Need for Human Direction of Data Mining
1.4 The Cross-Industry Standard Practice for Data Mining
1.4.1 CRISP-DM: The Six Phases
1.5 Fallacies of Data Mining
1.6 What Tasks Can Data Mining Accomplish?
1.6.1 Description
1.6.2 Estimation
1.6.3 Prediction
1.6.4 Classification
1.6.5 Clustering
1.6.6 Association
References
Exercises
Chapter 2 Data Preprocessing
2.1 Why do We Need to Preprocess the Data?
2.2 Data Cleaning
2.3 Handling Missing Data
2.4 Identifying Misclassifications
2.5 Graphical Methods for Identifying Outliers
2.6 Measures of Center and Spread
2.7 Data Transformation
2.8 Min-Max Normalization
2.9 Z-Score Standardization
2.10 Decimal Scaling
2.11 Transformations to Achieve Normality
2.12 Numerical Methods for Identifying Outliers
2.13 Flag Variables
2.14 Transforming Categorical Variables into Numerical Variables
2.15 Binning Numerical Variables
2.16 Reclassifying Categorical Variables
2.17 Adding an Index Field
2.18 Removing Variables that are Not Useful
2.19 Variables that Should Probably Not Be Removed
2.20 Removal of Duplicate Records
2.21 A Word About ID Fields
The R Zone
References
Exercises
Hands-On Analysis
Chapter 3 Exploratory Data Analysis
3.1 Hypothesis Testing Versus Exploratory Data Analysis
3.2 Getting to Know the Data Set
3.3 Exploring Categorical Variables
3.4 Exploring Numeric Variables
3.5 Exploring Multivariate Relationships
3.6 Selecting Interesting Subsets of the Data for Further Investigation
3.7 Using EDA to Uncover Anomalous Fields
3.8 Binning Based on Predictive Value
3.9 Deriving New Variables: Flag Variables
3.10 Deriving New Variables: Numerical Variables
3.11 Using EDA to Investigate Correlated Predictor Variables
3.12 Summary
The R Zone
Reference
Exercises
Hands-On Analysis
Chapter 4 Univariate Statistical Analysis
4.1 Data Mining Tasks in Discovering Knowledge in Data
4.2 Statistical Approaches to Estimation and Prediction
4.3 Statistical Inference
4.4 How Confident are We in Our Estimates?
4.5 Confidence Interval Estimation of the Mean
4.6 How to Reduce the Margin of Error
4.7 Confidence Interval Estimation of the Proportion
4.8 Hypothesis Testing for the Mean
4.9 Assessing the Strength of Evidence Against the Null Hypothesis
4.10 Using Confidence Intervals to Perform Hypothesis Tests
4.11 Hypothesis Testing for the Proportion
The R Zone
Reference
Exercises
Chapter 5 Multivariate Statistics
5.1 Two-Sample t-Test for Difference in Means
5.2 Two-Sample Z-Test for Difference in Proportions
5.3 Test for Homogeneity of Proportions
5.4 Chi-Square Test for Goodness of Fit of Multinomial Data
5.5 Analysis of Variance
5.6 Regression Analysis
5.7 Hypothesis Testing in Regression
5.8 Measuring the Quality of a Regression Model
5.9 Dangers of Extrapolation
5.10 Confidence Intervals for the Mean Value of y Given x
5.11 Prediction Intervals for a Randomly Chosen Value of y Given x
5.12 Multiple Regression
5.13 Verifying Model Assumptions
The R Zone
Reference
Exercises
Hands-On Analysis
Chapter 6 Preparing to Model the Data
6.1 Supervised Versus Unsupervised Methods
6.2 Statistical Methodology and Data Mining Methodology
6.3 Cross-Validation
6.4 Overfitting
6.5 Bias-Variance Trade-Off
6.6 Balancing the Training Data Set
6.7 Establishing Baseline Performance
The R Zone
Reference
Exercises
Chapter 7 K-Nearest Neighbor Algorithm
7.1 Classification Task
7.2 k-Nearest Neighbor Algorithm
7.3 Distance Function
7.4 Combination Function
7.4.1 Simple Unweighted Voting
7.4.2 Weighted Voting
7.5 Quantifying Attribute Relevance: Stretching the Axes
7.6 Database Considerations
7.7 k-Nearest Neighbor Algorithm for Estimation and Prediction
7.8 Choosing k
7.9 Application of k-Nearest Neighbor Algorithm Using IBM/SPSS Modeler
The R Zone
Exercises
Hands-On Analysis
Chapter 8 Decision Trees
8.1 What is a Decision Tree?
8.2 Requirements for Using Decision Trees
8.3 Classification and Regression Trees
8.4 C4.5 Algorithm
8.5 Decision Rules
8.6 Comparison of the C5.0 and CART Algorithms Applied to Real Data
The R Zone
References
Exercises
Hands-On Analysis
Chapter 9 Neural Networks
9.1 Input and Output Encoding
9.2 Neural Networks for Estimation and Prediction
9.3 Simple Example of a Neural Network
9.4 Sigmoid Activation Function
9.5 Back-Propagation
9.5.1 Gradient Descent Method
9.5.2 Back-Propagation Rules
9.5.3 Example of Back-Propagation
9.6 Termination Criteria
9.7 Learning Rate
9.8 Momentum Term
9.9 Sensitivity Analysis
9.10 Application of Neural Network Modeling
The R Zone
References
Exercises
Hands-On Analysis
Chapter 10 Hierarchical and K-Means Clustering
10.1 The Clustering Task
10.2 Hierarchical Clustering Methods
10.3 Single-Linkage Clustering
10.4 Complete-Linkage Clustering
10.5 k-Means Clustering
10.6 Example of k-Means Clustering at Work
10.7 Behavior of MSB, MSE, and Pseudo-F as the k-Means Algorithm Proceeds
10.8 Application of k-Means Clustering Using SAS Enterprise Miner
10.9 Using Cluster Membership to Predict Churn
The R Zone
References
Exercises
Hands-On Analysis
Chapter 11 Kohonen Networks
11.1 Self-Organizing Maps
11.2 Kohonen Networks
11.2.1 Kohonen Networks Algorithm
11.3 Example of a Kohonen Network Study
11.4 Cluster Validity
11.5 Application of Clustering Using Kohonen Networks
11.6 Interpreting the Clusters
11.6.1 Cluster Profiles
11.7 Using Cluster Membership as Input to Downstream Data Mining Models
The R Zone
References
Exercises
Hands-On Analysis
Chapter 12 Association Rules
12.1 Affinity Analysis and Market Basket Analysis
12.1.1 Data Representation for Market Basket Analysis
12.2 Support, Confidence, Frequent Itemsets, and the A Priori Property
12.3 How Does the A Priori Algorithm Work?
12.3.1 Generating Frequent Itemsets
12.3.2 Generating Association Rules
12.4 Extension from Flag Data to General Categorical Data
12.5 Information-Theoretic Approach: Generalized Rule Induction Method
12.5.1 J-Measure
12.6 Association Rules are Easy to do Badly
12.7 How Can We Measure the Usefulness of Association Rules?
12.8 Do Association Rules Represent Supervised or Unsupervised Learning?
12.9 Local Patterns Versus Global Models
The R Zone
References
Exercises
Hands-On Analysis
Chapter 13 Imputation of Missing Data
13.1 Need for Imputation of Missing Data
13.2 Imputation of Missing Data: Continuous Variables
13.3 Standard Error of the Imputation
13.4 Imputation of Missing Data: Categorical Variables
13.5 Handling Patterns in Missingness
The R Zone
Reference
Exercises
Hands-On Analysis
Chapter 14 Model Evaluation Techniques
14.1 Model Evaluation Techniques for the Description Task
14.2 Model Evaluation Techniques for the Estimation and Prediction Tasks
14.3 Model Evaluation Techniques for the Classification Task
14.4 Error Rate, False Positives, and False Negatives
14.5 Sensitivity and Specificity
14.6 Misclassification Cost Adjustment to Reflect Real-World Concerns
14.7 Decision Cost/Benefit Analysis
14.8 Lift Charts and Gains Charts
14.9 Interweaving Model Evaluation with Model Building
14.10 Confluence of Results: Applying a Suite of Models
The R Zone
Reference
Exercises
Hands-On Analysis
Appendix: Data Summarization and Visualization
Index
Preface
What is Data Mining?
According to the Gartner Group,
Data mining is the process of discovering meaningful new correlations, patterns and trends by sifting through large amounts of data stored in repositories, using pattern recognition technologies as well as statistical and mathematical techniques.
Today, a variety of terms are used to describe this process, including analytics, predictive analytics, big data, machine learning, and knowledge discovery in databases. All of these terms share the objective of mining actionable nuggets of knowledge from large data sets. We shall therefore use the term data mining to represent this process throughout this text.
Why is This Book Needed?
Humans are inundated with data in most fields. Unfortunately, these valuable data, which cost firms millions to collect and collate, are languishing in warehouses and repositories. The problem is that there are not enough trained human analysts available who are skilled at translating all of these data into knowledge, and thence up the taxonomy tree into wisdom. This is why this book is needed.
The McKinsey Global Institute reports:
There will be a shortage of talent necessary for organizations to take advantage of big data. A significant constraint on realizing value from big data will be a shortage of talent, particularly of people with deep expertise in statistics and machine learning, and the managers and analysts who know how to operate companies by using insights from big data…. We project that demand for deep analytical positions in a big data world could exceed the supply being produced on current trends by 140,000 to 190,000 positions. … In addition, we project a need for 1.5 million additional managers and analysts in the United States who can ask the right questions and consume the results of the analysis of big data effectively.
This book is an attempt to help alleviate this critical shortage of data analysts. Discovering Knowledge in Data: An Introduction to Data Mining provides readers with:
- The models and techniques to uncover hidden nuggets of information,
- The insight into how the data mining algorithms really work, and
- The experience of actually performing data mining on large data sets.
Data mining is becoming more widespread every day, because it empowers companies to uncover profitable patterns and trends from their existing databases. Companies and institutions have spent millions of dollars to collect megabytes and terabytes of data, but are not taking advantage of the valuable and actionable information hidden deep within their data repositories. However, as the practice of data mining becomes more widespread, companies that do not apply these techniques are in danger of falling behind and losing market share, because their competitors are applying data mining and thereby gaining the competitive edge.
In Discovering Knowledge in Data, the step-by-step, hands-on solutions of real-world business problems, using widely available data mining techniques applied to real-world data sets, will appeal to managers, CIOs, CEOs, CFOs, and others who need to keep abreast of the latest methods for enhancing return-on-investment.
What's New for the Second Edition?
The second edition of Discovering Knowledge in Data is enhanced with an abundance of new material and useful features, including:
- Nearly 100 pages of new material.
- Three new chapters:
  - Chapter 5: Multivariate Statistical Analysis covers the hypothesis tests used for verifying whether data partitions are valid, along with analysis of variance, multiple regression, and other topics.
  - Chapter 6: Preparing to Model the Data introduces a new formula for balancing the training data set, and examines the importance of establishing baseline performance, among other topics.
  - Chapter 13: Imputation of Missing Data addresses one of the most overlooked issues in data analysis, and shows how to impute missing values for continuous variables and for categorical variables, as well as how to handle patterns in missingness.
- The R Zone. Most chapters of this book include The R Zone, which provides the actual R code needed to obtain the results shown in the chapter, along with screenshots of some of the output, using RStudio.
- A host of new topics not covered in the first edition. Here is a sample of these new topics, chapter by chapter:
  - Chapter 2: Data Preprocessing. Decimal scaling; Transformations to achieve normality; Flag variables; Transforming categorical variables into numerical variables; Binning numerical variables; Reclassifying categorical variables; Adding an index field; Removal of duplicate records.
  - Chapter 3: Exploratory Data Analysis. Binning based on predictive value; Deriving new variables: Flag variables; Deriving new variables: Numerical variables; Using EDA to investigate correlated predictor variables.
  - Chapter 4: Univariate Statistical Analysis. How to reduce the margin of error; Confidence interval estimation of the proportion; Hypothesis testing for the mean; Assessing the strength of evidence against the null hypothesis; Using confidence intervals to perform hypothesis tests; Hypothesis testing for the proportion.
  - Chapter 5: Multivariate Statistics. Two-sample test for difference in means; Two-sample test for difference in proportions; Test for homogeneity of proportions; Chi-square test for goodness of fit of multinomial data; Analysis of variance; Hypothesis testing in regression; Measuring the quality of a regression model.
  - Chapter 6: Preparing to Model the Data. Balancing the training data set; Establishing baseline performance.
  - Chapter 7: k-Nearest Neighbor Algorithm. Application of k-nearest neighbor algorithm using IBM/SPSS Modeler.
  - Chapter 10: Hierarchical and k-Means Clustering. Behavior of MSB, MSE, and pseudo-F as the k-means algorithm proceeds.
  - Chapter 12: Association Rules. How can we measure the usefulness of association rules?
  - Chapter 13: Imputation of Missing Data. Need for imputation of missing data; Imputation of missing data for continuous variables; Imputation of missing data for categorical variables; Handling patterns in missingness.
  - Chapter 14: Model Evaluation Techniques. Sensitivity and Specificity.
- An Appendix on Data Summarization and Visualization. Readers who may be a bit rusty on introductory statistics may find this new feature helpful. Definitions and illustrative examples of introductory statistical concepts are provided here, along with many graphs and tables, as follows:
  - Part 1: Summarization 1: Building Blocks of Data Analysis
  - Part 2: Visualization: Graphs and Tables for Summarizing and Organizing Data
  - Part 3: Summarization 2: Measures of Center, Variability, and Position
  - Part 4: Summarization and Visualization of Bivariate Relationships
- New Exercises. There are over 100 new chapter exercises in the second edition.
Danger! Data Mining is Easy to Do Badly
The plethora of new off-the-shelf software platforms for performing data mining has kindled a new kind of danger. The ease with which these graphical user interface (GUI)-based applications can manipulate data, combined with the power of the formidable data mining algorithms embedded in the black box software currently available, makes their misuse proportionally more hazardous.
Just as with any new information technology, data mining is easy to do badly. A little knowledge is especially dangerous when it comes to applying powerful models based on large data sets. For example, analyses carried out on unpreprocessed data can lead to erroneous conclusions, or inappropriate analysis may be applied to data sets that call for a completely different approach, or models may be derived that are built upon wholly specious assumptions. These errors in analysis can lead to very expensive failures, if deployed.
“White Box” Approach: Understanding the Underlying Algorithmic and Model Structures
The best way to avoid these costly errors, which stem from a blind black-box approach to data mining, is to instead apply a “white-box” methodology, which emphasizes an understanding of the algorithmic and statistical model structures underlying the software.
Discovering Knowledge in Data applies this white-box approach by:
- Walking the reader through the various algorithms;
- Providing examples of the operation of the algorithm on actual large data sets;
- Testing the reader's level of understanding of the concepts and algorithms;
- Providing an opportunity for the reader to do some real data mining on large data sets; and
- Supplying the reader with the actual R code used to achieve these data mining results, in The R Zone.
Algorithm Walk-Throughs
Discovering Knowledge in Data walks the reader through the operations and nuances of the various algorithms, using small sample data sets, so that the reader gets a true appreciation of what is really...
System requirements
File format: ePUB
Copy protection: Adobe-DRM (Digital Rights Management)
System requirements:
- Computer (Windows; MacOS X; Linux): Install the free reader Adobe Digital Editions prior to download (see eBook Help).
- Tablet/smartphone (Android; iOS): Install the free app Adobe Digital Editions or the app PocketBook before downloading (see eBook Help).
- E-reader: Bookeen, Kobo, Pocketbook, Sony, Tolino and many more (not Kindle).
The file format ePUB works well for novels and non-fiction books, i.e., "flowing" text without complex layout. On an e-reader or smartphone, line and page breaks automatically adjust to fit the small displays.
This eBook uses Adobe-DRM, a "hard" copy protection. If the necessary requirements are not met, unfortunately you will not be able to open the eBook. You will therefore need to prepare your reading hardware before downloading.
Please note: We strongly recommend that you authorise using your personal Adobe ID after installation of any reading software.
For more information, see our eBook Help page.