
Applied Predictive Analytics
Content
Introduction
Chapter 1 Overview of Predictive Analytics
What Is Analytics?
What Is Predictive Analytics?
Supervised vs. Unsupervised Learning
Parametric vs. Non-Parametric Models
Business Intelligence
Predictive Analytics vs. Business Intelligence
Do Predictive Models Just State the Obvious?
Similarities between Business Intelligence and Predictive Analytics
Predictive Analytics vs. Statistics
Statistics and Analytics
Predictive Analytics and Statistics Contrasted
Predictive Analytics vs. Data Mining
Who Uses Predictive Analytics?
Challenges in Using Predictive Analytics
Obstacles in Management
Obstacles with Data
Obstacles with Modeling
Obstacles in Deployment
What Educational Background Is Needed to Become a Predictive Modeler?
Chapter 2 Setting Up the Problem
Predictive Analytics Processing Steps: CRISP-DM
Business Understanding
The Three-Legged Stool
Business Objectives
Defining Data for Predictive Modeling
Defining the Columns as Measures
Defining the Unit of Analysis
Which Unit of Analysis?
Defining the Target Variable
Temporal Considerations for Target Variable
Defining Measures of Success for Predictive Models
Success Criteria for Classification
Success Criteria for Estimation
Other Customized Success Criteria
Doing Predictive Modeling Out of Order
Building Models First
Early Model Deployment
Case Study: Recovering Lapsed Donors
Overview
Business Objectives
Data for the Competition
The Target Variables
Modeling Objectives
Model Selection and Evaluation Criteria
Model Deployment
Case Study: Fraud Detection
Overview
Business Objectives
Data for the Project
The Target Variables
Modeling Objectives
Model Selection and Evaluation Criteria
Model Deployment
Summary
Chapter 3 Data Understanding
What the Data Looks Like
Single Variable Summaries
Mean
Standard Deviation
The Normal Distribution
Uniform Distribution
Applying Simple Statistics in Data Understanding
Skewness
Kurtosis
Rank-Ordered Statistics
Categorical Variable Assessment
Data Visualization in One Dimension
Histograms
Multiple Variable Summaries
Hidden Value in Variable Interactions: Simpson's Paradox
The Combinatorial Explosion of Interactions
Correlations
Spurious Correlations
Back to Correlations
Crosstabs
Data Visualization, Two or Higher Dimensions
Scatterplots
Anscombe's Quartet
Scatterplot Matrices
Overlaying the Target Variable in Summary
Scatterplots in More Than Two Dimensions
The Value of Statistical Significance
Pulling It All Together into a Data Audit
Summary
Chapter 4 Data Preparation
Variable Cleaning
Incorrect Values
Consistency in Data Formats
Outliers
Multidimensional Outliers
Missing Values
Fixing Missing Data
Feature Creation
Simple Variable Transformations
Fixing Skew
Binning Continuous Variables
Numeric Variable Scaling
Nominal Variable Transformation
Ordinal Variable Transformations
Date and Time Variable Features
ZIP Code Features
Which Version of a Variable Is Best?
Multidimensional Features
Variable Selection Prior to Modeling
Sampling
Example: Why Normalization Matters for K-Means Clustering
Summary
Chapter 5 Itemsets and Association Rules
Terminology
Condition
Left-Hand-Side, Antecedent(s)
Right-Hand-Side, Consequent, Output, Conclusion
Rule (Item Set)
Support
Antecedent Support
Confidence, Accuracy
Lift
Parameter Settings
How the Data Is Organized
Standard Predictive Modeling Data Format
Transactional Format
Measures of Interesting Rules
Deploying Association Rules
Variable Selection
Interaction Variable Creation
Problems with Association Rules
Redundant Rules
Too Many Rules
Too Few Rules
Building Classification Rules from Association Rules
Summary
Chapter 6 Descriptive Modeling
Data Preparation Issues with Descriptive Modeling
Principal Component Analysis
The PCA Algorithm
Applying PCA to New Data
PCA for Data Interpretation
Additional Considerations before Using PCA
The Effect of Variable Magnitude on PCA Models
Clustering Algorithms
The K-Means Algorithm
Data Preparation for K-Means
Selecting the Number of Clusters
The Kohonen SOM Algorithm
Visualizing Kohonen Maps
Similarities with K-Means
Summary
Chapter 7 Interpreting Descriptive Models
Standard Cluster Model Interpretation
Problems with Interpretation Methods
Identifying Key Variables in Forming Cluster Models
Cluster Prototypes
Cluster Outliers
Summary
Chapter 8 Predictive Modeling
Decision Trees
The Decision Tree Landscape
Building Decision Trees
Decision Tree Splitting Metrics
Decision Tree Knobs and Options
Reweighting Records: Priors
Reweighting Records: Misclassification Costs
Other Practical Considerations for Decision Trees
Logistic Regression
Interpreting Logistic Regression Models
Other Practical Considerations for Logistic Regression
Neural Networks
Building Blocks: The Neuron
Neural Network Training
The Flexibility of Neural Networks
Neural Network Settings
Neural Network Pruning
Interpreting Neural Networks
Neural Network Decision Boundaries
Other Practical Considerations for Neural Networks
K-Nearest Neighbor
The k-NN Learning Algorithm
Distance Metrics for k-NN
Other Practical Considerations for k-NN
Naïve Bayes
Bayes' Theorem
The Naïve Bayes Classifier
Interpreting Naïve Bayes Classifiers
Other Practical Considerations for Naïve Bayes
Regression Models
Linear Regression
Linear Regression Assumptions
Variable Selection in Linear Regression
Interpreting Linear Regression Models
Using Linear Regression for Classification
Other Regression Algorithms
Summary
Chapter 9 Assessing Predictive Models
Batch Approach to Model Assessment
Percent Correct Classification
Rank-Ordered Approach to Model Assessment
Assessing Regression Models
Summary
Chapter 10 Model Ensembles
Motivation for Ensembles
The Wisdom of Crowds
Bias Variance Tradeoff
Bagging
Boosting
Improvements to Bagging and Boosting
Random Forests
Stochastic Gradient Boosting
Heterogeneous Ensembles
Model Ensembles and Occam's Razor
Interpreting Model Ensembles
Summary
Chapter 11 Text Mining
Motivation for Text Mining
A Predictive Modeling Approach to Text Mining
Structured vs. Unstructured Data
Why Text Mining Is Hard
Text Mining Applications
Data Sources for Text Mining
Data Preparation Steps
POS Tagging
Tokens
Stop Word and Punctuation Filters
Character Length and Number Filters
Stemming
Dictionaries
The Sentiment Polarity Movie Data Set
Text Mining Features
Term Frequency
Inverse Document Frequency
Tf-idf
Cosine Similarity
Multi-Word Features: N-Grams
Reducing Keyword Features
Grouping Terms
Modeling with Text Mining Features
Regular Expressions
Uses of Regular Expressions in Text Mining
Summary
Chapter 12 Model Deployment
General Deployment Considerations
Deployment Steps
Summary
Chapter 13 Case Studies
Survey Analysis Case Study: Overview
Business Understanding: Defining the Problem
Data Understanding
Data Preparation
Modeling
Deployment: "What-If" Analysis
Revisit Models
Deployment
Summary and Conclusions
Help Desk Case Study
Data Understanding: Defining the Data
Data Preparation
Modeling
Revisit Business Understanding
Deployment
Summary and Conclusions
Index
Chapter 1
Overview of Predictive Analytics
A small direct response company had developed dozens of programs in cooperation with major brands to sell books and DVDs. These affinity programs were very successful, but required considerable up-front work to develop the creative content and determine which customers, already engaged with the brand, were worth the significant marketing spend to purchase the books or DVDs on subscription. Typically, they first developed test mailings on a moderately sized sample to determine if the expected response rates were high enough to justify a larger program.
One analyst with the company identified a way to help the company become more profitable. What if one could identify the key characteristics of those who responded to the test mailing? Furthermore, what if one could generate a score for these customers and determine what minimum score would result in a high enough response rate to make the campaign profitable? The analyst discovered predictive analytics techniques that could be used for both purposes, finding key customer characteristics and using those characteristics to generate a score that could be used to determine which customers to mail.
Two decades before, the owner of a small company in Virginia had a compelling idea: Improve the accuracy and flexibility of guided munitions using optimal control. The owner and president, Roger Barron, began the process of deriving the complex mathematics behind optimal control using a technique known as variational calculus and hired a graduate student to assist him in the task. Programmers then implemented the mathematics in computer code so they could simulate thousands of scenarios. For each trajectory, the variational calculus minimized the miss distance while maximizing speed at impact as well as the angle of impact.
The variational calculus algorithm succeeded in identifying the optimal sequence of commands: how much the fins (control surfaces) needed to change the path of the munition to follow the optimal path to the target. The concept worked in simulation in the thousands of optimal trajectories that were run. Moreover, the mathematics worked on several munitions, one of which was the MK82 glide bomb, fitted (in simulation) with an inertial guidance unit to control the fins: an early smart-bomb.
There was a problem, however. The variational calculus was so computationally complex that the small computers on-board could not solve the problem in real time. But what if one could estimate the optimal guidance commands at any time during the flight from observable characteristics of the flight? After all, the guidance unit can compute where the bomb is in space, how fast it is going, and the distance to the target that was programmed into the unit when it was launched. If the estimates of the optimal guidance commands were close enough to the actual optimal commands, the resulting path would be near optimal and still succeed. Predictive models were built to do exactly this. The system was called Optimal Path-to-Go guidance.
These two programs designed by two different companies seemingly could not be more different. One program knows characteristics of people, such as demographics and their level of engagement with a brand, and tries to predict a human decision. The second program knows locations of a bomb in space and tries to predict the best physical action for it to hit a target.
But they share something in common: They both need to estimate values that are unknown but tremendously useful. For the affinity programs, the models estimate whether or not an individual will respond to a campaign, and for the guidance program, the models estimate the best guidance command. In this sense, these two programs are very similar because they both involve predicting a value or values that are known historically, but are unknown at the time a decision is needed. Not only are these programs related in this sense, but they are far from unique; there are countless decisions businesses and government agencies make every day that can be improved by using historic data as an aid to making decisions or even to automate the decisions themselves.
This book describes the back-story of how analysts build predictive models like the ones described in these two programs. There is science behind much of what predictive modelers do, yet there is also plenty of art, where no theory can inform us as to the best action, but experience provides principles by which tradeoffs can be made as solutions are found. Without the art, the science would only be able to solve a small subset of the problems we face. Without the science, we would be like a plane without a rudder or a kite without a tail, moving at a rapid pace without any control, unable to achieve our objectives.
What Is Analytics?
Analytics is the process of using computational methods to discover and report influential patterns in data. The goal of analytics is to gain insight and often to affect decisions. Data is necessarily a measure of historic information so, by definition, analytics examines historic data. The term itself rose to prominence in 2005, in large part due to the introduction of Google Analytics. Nevertheless, the ideas behind analytics are not new at all but have been represented by different terms throughout the decades, including cybernetics, data analysis, neural networks, pattern recognition, statistics, knowledge discovery, data mining, and now even data science.
The rise of analytics in recent years is pragmatic: As organizations collect more data and begin to summarize it, there is a natural progression toward using the data to improve estimates, forecasts, decisions, and ultimately, efficiency.
What Is Predictive Analytics?
Predictive analytics is the process of discovering interesting and meaningful patterns in data. It draws from several related disciplines, some of which have been used to discover patterns in data for more than 100 years, including pattern recognition, statistics, machine learning, artificial intelligence, and data mining. What differentiates predictive analytics from other types of analytics?
First, predictive analytics is data-driven, meaning that algorithms derive key characteristics of the models from the data itself rather than from assumptions made by the analyst. Put another way, data-driven algorithms induce models from the data. The induction process can include identification of variables to be included in the model, parameters that define the model, weights or coefficients in the model, or model complexity.
Second, predictive analytics algorithms automate the process of finding the patterns from the data. Powerful induction algorithms not only discover coefficients or weights for the models, but also the very form of the models. Decision tree algorithms, for example, learn which of the candidate inputs best predict a target variable in addition to identifying which values of the variables to use in building predictions. Other algorithms can be modified to perform searches, using exhaustive or greedy searches to find the best set of inputs and model parameters. If a variable helps reduce model error, it is included in the model; otherwise, it is eliminated.
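The greedy search described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the book's own procedure: `greedy_forward_selection`, `toy_model_error`, and the `ERROR_REDUCTION` values are hypothetical, and the toy error function stands in for "fit a model on these inputs and measure its error."

```python
from typing import Callable, List, Sequence


def greedy_forward_selection(
    candidates: Sequence[str],
    model_error: Callable[[List[str]], float],
) -> List[str]:
    """Add one input at a time, keeping it only if model error drops."""
    selected: List[str] = []
    best_error = model_error(selected)
    remaining = list(candidates)
    while remaining:
        # Score each remaining candidate when added to the current set.
        scored = [(model_error(selected + [name]), name) for name in remaining]
        trial_error, best_name = min(scored)
        if trial_error >= best_error:
            break  # no remaining variable reduces the error
        selected.append(best_name)
        remaining.remove(best_name)
        best_error = trial_error
    return selected


# Hypothetical stand-in for fitting a model: each input removes a fixed
# amount of error, and "noise" makes the model slightly worse.
ERROR_REDUCTION = {"recency": 0.30, "frequency": 0.20, "noise": -0.01}


def toy_model_error(inputs: List[str]) -> float:
    return 1.0 - sum(ERROR_REDUCTION[name] for name in inputs)


chosen = greedy_forward_selection(list(ERROR_REDUCTION), toy_model_error)
print(chosen)
```

In practice `model_error` would train and validate a real model for each trial set, which is why greedy search is usually preferred over exhaustive search when there are many candidate inputs.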
Many software packages and algorithms also automate the transformation of input variables so that they can be used effectively in predictive models. For example, if a hundred candidate input variables can be or should be transformed to remove skew, some predictive analytics software can do this in a single step, rather than requiring you to program all one hundred transformations one at a time.
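As a rough illustration of such a single-step transformation, the sketch below applies a log transform to every skewed column in one pass. The function names, the skewness threshold of 1.0, the choice of `log1p`, and the sample columns are all assumptions made for the example; real packages may use different rules and transformations.

```python
import math
from statistics import fmean, pstdev
from typing import Dict, List


def skewness(values: List[float]) -> float:
    """Population skewness: the mean of the cubed z-scores."""
    mu, sigma = fmean(values), pstdev(values)
    if sigma == 0:
        return 0.0
    return fmean([((v - mu) / sigma) ** 3 for v in values])


def unskew_all(
    columns: Dict[str, List[float]], threshold: float = 1.0
) -> Dict[str, List[float]]:
    """Apply log1p to each non-negative column whose skewness exceeds threshold."""
    result = {}
    for name, values in columns.items():
        if min(values) >= 0 and skewness(values) > threshold:
            result[name] = [math.log1p(v) for v in values]
        else:
            result[name] = list(values)  # leave the column unchanged
    return result


# Hypothetical data: one right-skewed column and one symmetric column.
columns = {
    "gift_amount": [1, 2, 4, 8, 16, 32, 64, 128],
    "age": [30, 35, 40, 45, 50, 55, 60, 65],
}
transformed = unskew_all(columns)
for name in columns:
    print(name, round(skewness(columns[name]), 2), round(skewness(transformed[name]), 2))
```

The same call works unchanged whether the dictionary holds two columns or a hundred, which is the point of automating the step.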
Predictive analytics doesn't do anything that any analyst couldn't accomplish with pencil and paper or a spreadsheet if given enough time; the algorithms, while powerful, have no common sense. Consider a supervised learning data set with 50 inputs and a single binary target variable with values 0 and 1. One way to try to identify which of the inputs is most related to the target variable is to plot each variable, one at a time, in a histogram. The target variable can be superimposed on the histogram, as shown in Figure 1.1. With 50 inputs, you need to look at 50 histograms. This is not uncommon for predictive modelers to do.
Figure 1.1 Histogram
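The counts behind a histogram with the target variable overlaid can be computed as follows. This is a minimal sketch with equal-width bins; the function name, the bin count, and the sample `ages` and `responded` values are hypothetical and are not taken from Figure 1.1.

```python
from typing import List, Tuple


def histogram_with_target(
    values: List[float], target: List[int], n_bins: int = 5
) -> List[Tuple[float, float, int, int]]:
    """Return (bin_low, bin_high, record_count, target_1_count) per bin."""
    low, high = min(values), max(values)
    width = (high - low) / n_bins or 1.0  # avoid zero width for constant inputs
    counts = [0] * n_bins
    positives = [0] * n_bins
    for value, label in zip(values, target):
        # The maximum value falls on the top edge, so clamp it into the last bin.
        index = min(int((value - low) / width), n_bins - 1)
        counts[index] += 1
        positives[index] += label
    return [
        (low + i * width, low + (i + 1) * width, counts[i], positives[i])
        for i in range(n_bins)
    ]


# Hypothetical input variable and binary target (1 = responded).
ages = [22, 25, 31, 38, 44, 47, 53, 58, 61, 70]
responded = [0, 0, 0, 1, 0, 1, 1, 1, 1, 1]

bins = histogram_with_target(ages, responded, n_bins=4)
for bin_low, bin_high, count, ones in bins:
    # '#' marks records with target 1, '.' marks records with target 0.
    print(f"{bin_low:5.1f}-{bin_high:5.1f} " + "#" * ones + "." * (count - ones))
```

A bin where most marks are `#` and a bin where most are `.` suggest that the input separates the two target values, which is what a modeler looks for when scanning the 50 histograms.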
If the patterns require examining two variables at a time, you can do so with a scatter plot. For 50 variables, there are 1,225 possible scatter plots to examine. A dedicated predictive modeler might actually do this, although it will take some time. However, if the patterns require that you examine three variables simultaneously, you would need to examine 19,600 3D scatter plots in order to examine all the possible three-way combinations. Even the most dedicated modelers will be hard-pressed to spend the time needed to examine so many plots.
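The plot counts quoted above are binomial coefficients: the number of ways to choose 2 or 3 variables out of 50. A short check, using Python's standard `math.comb` (the helper name `plots_to_inspect` is just for illustration):

```python
from math import comb


def plots_to_inspect(n_inputs: int, variables_per_plot: int) -> int:
    """Number of distinct plots when each plot shows a different subset of inputs."""
    return comb(n_inputs, variables_per_plot)


n_inputs = 50
for k in (1, 2, 3):
    print(f"{k} variable(s) per plot: {plots_to_inspect(n_inputs, k):,} plots")
```

The count grows roughly as the number of inputs raised to the power of the variables per plot, which is why manual inspection stops being practical beyond two-way views.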
You need algorithms to sift through all of the potential combinations of inputs in the data—the patterns—and identify which ones are the most interesting. The analyst can then focus on these patterns, undoubtedly a much smaller number of inputs to examine. Of the 19,600 three-way combinations of inputs, it may be that a predictive model identifies six of the variables as the most significant contributors to accurate models. In addition, of these six variables, the top three are particularly good predictors and much...
System requirements
File format: ePUB
Copy protection: Adobe-DRM (Digital Rights Management)