Applied Predictive Analytics

Name: Applied Predictive Analytics | Principles and Techniques for the Professional Data Analyst
Brand: Wiley
Price: 40.99 EUR
Availability: OnlineOnly

Principles and Techniques for the Professional Data Analyst

Dean Abbott (Author)

Wiley (Publisher)

Published on 31 March 2014

456 pages

E-Book

ePUB with Adobe-DRM

978-1-118-72769-0 (ISBN)

€40.99 incl. 7% VAT

E-Book Single Licence

Available for download

Content

Introduction xxi

Chapter 1 Overview of Predictive Analytics 1

What Is Analytics? 3

What Is Predictive Analytics? 3

Supervised vs. Unsupervised Learning 5

Parametric vs. Non-Parametric Models 6

Business Intelligence 6

Predictive Analytics vs. Business Intelligence 8

Do Predictive Models Just State the Obvious? 9

Similarities between Business Intelligence and Predictive Analytics 9

Predictive Analytics vs. Statistics 10

Statistics and Analytics 11

Predictive Analytics and Statistics Contrasted 12

Predictive Analytics vs. Data Mining 13

Who Uses Predictive Analytics? 13

Challenges in Using Predictive Analytics 14

Obstacles in Management 14

Obstacles with Data 14

Obstacles with Modeling 15

Obstacles in Deployment 16

What Educational Background Is Needed to Become a Predictive Modeler? 16

Chapter 2 Setting Up the Problem 19

Predictive Analytics Processing Steps: CRISP-DM 19

Business Understanding 21

The Three-Legged Stool 22

Business Objectives 23

Defining Data for Predictive Modeling 25

Defining the Columns as Measures 26

Defining the Unit of Analysis 27

Which Unit of Analysis? 28

Defining the Target Variable 29

Temporal Considerations for Target Variable 31

Defining Measures of Success for Predictive Models 32

Success Criteria for Classification 32

Success Criteria for Estimation 33

Other Customized Success Criteria 33

Doing Predictive Modeling Out of Order 34

Building Models First 34

Early Model Deployment 35

Case Study: Recovering Lapsed Donors 35

Overview 36

Business Objectives 36

Data for the Competition 36

The Target Variables 36

Modeling Objectives 37

Model Selection and Evaluation Criteria 38

Model Deployment 39

Case Study: Fraud Detection 39

Overview 39

Business Objectives 39

Data for the Project 40

The Target Variables 40

Modeling Objectives 41

Model Selection and Evaluation Criteria 41

Model Deployment 41

Summary 42

Chapter 3 Data Understanding 43

What the Data Looks Like 44

Single Variable Summaries 44

Mean 45

Standard Deviation 45

The Normal Distribution 45

Uniform Distribution 46

Applying Simple Statistics in Data Understanding 47

Skewness 49

Kurtosis 51

Rank-Ordered Statistics 52

Categorical Variable Assessment 55

Data Visualization in One Dimension 58

Histograms 59

Multiple Variable Summaries 64

Hidden Value in Variable Interactions: Simpson's Paradox 64

The Combinatorial Explosion of Interactions 65

Correlations 66

Spurious Correlations 66

Back to Correlations 67

Crosstabs 68

Data Visualization, Two or Higher Dimensions 69

Scatterplots 69

Anscombe's Quartet 71

Scatterplot Matrices 75

Overlaying the Target Variable in Summary 76

Scatterplots in More Than Two Dimensions 78

The Value of Statistical Significance 80

Pulling It All Together into a Data Audit 81

Summary 82

Chapter 4 Data Preparation 83

Variable Cleaning 84

Incorrect Values 84

Consistency in Data Formats 85

Outliers 85

Multidimensional Outliers 89

Missing Values 90

Fixing Missing Data 91

Feature Creation 98

Simple Variable Transformations 98

Fixing Skew 99

Binning Continuous Variables 103

Numeric Variable Scaling 104

Nominal Variable Transformation 107

Ordinal Variable Transformations 108

Date and Time Variable Features 109

ZIP Code Features 110

Which Version of a Variable Is Best? 110

Multidimensional Features 112

Variable Selection Prior to Modeling 117

Sampling 123

Example: Why Normalization Matters for K-Means Clustering 139

Summary 143

Chapter 5 Itemsets and Association Rules 145

Terminology 146

Condition 147

Left-Hand-Side, Antecedent(s) 148

Right-Hand-Side, Consequent, Output, Conclusion 148

Rule (Item Set) 148

Support 149

Antecedent Support 149

Confidence, Accuracy 150

Lift 150

Parameter Settings 151

How the Data Is Organized 151

Standard Predictive Modeling Data Format 151

Transactional Format 152

Measures of Interesting Rules 154

Deploying Association Rules 156

Variable Selection 157

Interaction Variable Creation 157

Problems with Association Rules 158

Redundant Rules 158

Too Many Rules 158

Too Few Rules 159

Building Classification Rules from Association Rules 159

Summary 161

Chapter 6 Descriptive Modeling 163

Data Preparation Issues with Descriptive Modeling 164

Principal Component Analysis 165

The PCA Algorithm 165

Applying PCA to New Data 169

PCA for Data Interpretation 171

Additional Considerations before Using PCA 172

The Effect of Variable Magnitude on PCA Models 174

Clustering Algorithms 177

The K-Means Algorithm 178

Data Preparation for K-Means 183

Selecting the Number of Clusters 185

The Kohonen SOM Algorithm 192

Visualizing Kohonen Maps 194

Similarities with K-Means 196

Summary 197

Chapter 7 Interpreting Descriptive Models 199

Standard Cluster Model Interpretation 199

Problems with Interpretation Methods 202

Identifying Key Variables in Forming Cluster Models 203

Cluster Prototypes 209

Cluster Outliers 210

Summary 212

Chapter 8 Predictive Modeling 213

Decision Trees 214

The Decision Tree Landscape 215

Building Decision Trees 218

Decision Tree Splitting Metrics 221

Decision Tree Knobs and Options 222

Reweighting Records: Priors 224

Reweighting Records: Misclassification Costs 224

Other Practical Considerations for Decision Trees 229

Logistic Regression 230

Interpreting Logistic Regression Models 233

Other Practical Considerations for Logistic Regression 235

Neural Networks 240

Building Blocks: The Neuron 242

Neural Network Training 244

The Flexibility of Neural Networks 247

Neural Network Settings 249

Neural Network Pruning 251

Interpreting Neural Networks 252

Neural Network Decision Boundaries 253

Other Practical Considerations for Neural Networks 253

K-Nearest Neighbor 254

The k-NN Learning Algorithm 254

Distance Metrics for k-NN 258

Other Practical Considerations for k-NN 259

Naïve Bayes 264

Bayes' Theorem 264

The Naïve Bayes Classifier 268

Interpreting Naïve Bayes Classifiers 268

Other Practical Considerations for Naïve Bayes 269

Regression Models 270

Linear Regression 271

Linear Regression Assumptions 274

Variable Selection in Linear Regression 276

Interpreting Linear Regression Models 278

Using Linear Regression for Classification 279

Other Regression Algorithms 280

Summary 281

Chapter 9 Assessing Predictive Models 283

Batch Approach to Model Assessment 284

Percent Correct Classification 284

Rank-Ordered Approach to Model Assessment 293

Assessing Regression Models 301

Summary 304

Chapter 10 Model Ensembles 307

Motivation for Ensembles 307

The Wisdom of Crowds 308

Bias Variance Tradeoff 309

Bagging 311

Boosting 316

Improvements to Bagging and Boosting 320

Random Forests 320

Stochastic Gradient Boosting 321

Heterogeneous Ensembles 321

Model Ensembles and Occam's Razor 323

Interpreting Model Ensembles 323

Summary 326

Chapter 11 Text Mining 327

Motivation for Text Mining 328

A Predictive Modeling Approach to Text Mining 329

Structured vs. Unstructured Data 329

Why Text Mining Is Hard 330

Text Mining Applications 332

Data Sources for Text Mining 333

Data Preparation Steps 333

POS Tagging 333

Tokens 336

Stop Word and Punctuation Filters 336

Character Length and Number Filters 337

Stemming 337

Dictionaries 338

The Sentiment Polarity Movie Data Set 339

Text Mining Features 340

Term Frequency 341

Inverse Document Frequency 344

Tf-idf 344

Cosine Similarity 346

Multi-Word Features: N-Grams 346

Reducing Keyword Features 347

Grouping Terms 347

Modeling with Text Mining Features 347

Regular Expressions 349

Uses of Regular Expressions in Text Mining 351

Summary 352

Chapter 12 Model Deployment 353

General Deployment Considerations 354

Deployment Steps 355

Summary 375

Chapter 13 Case Studies 377

Survey Analysis Case Study: Overview 377

Business Understanding: Defining the Problem 378

Data Understanding 380

Data Preparation 381

Modeling 385

Deployment: "What-If" Analysis 391

Revisit Models 392

Deployment 401

Summary and Conclusions 401

Help Desk Case Study 402

Data Understanding: Defining the Data 403

Data Preparation 403

Modeling 405

Revisit Business Understanding 407

Deployment 409

Summary and Conclusions 411

Index 413

Chapter 1
Overview of Predictive Analytics

A small direct response company had developed dozens of programs in cooperation with major brands to sell books and DVDs. These affinity programs were very successful, but required considerable up-front work to develop the creative content and determine which customers, already engaged with the brand, were worth the significant marketing spend to purchase the books or DVDs on subscription. Typically, they first developed test mailings on a moderately sized sample to determine if the expected response rates were high enough to justify a larger program.

One analyst with the company identified a way to help the company become more profitable. What if one could identify the key characteristics of those who responded to the test mailing? Furthermore, what if one could generate a score for these customers and determine what minimum score would result in a high enough response rate to make the campaign profitable? The analyst discovered predictive analytics techniques that could be used for both purposes, finding key customer characteristics and using those characteristics to generate a score that could be used to determine which customers to mail.

Two decades before, the owner of a small company in Virginia had a compelling idea: Improve the accuracy and flexibility of guided munitions using optimal control. The owner and president, Roger Barron, began the process of deriving the complex mathematics behind optimal control using a technique known as variational calculus and hired a graduate student to assist him in the task. Programmers then implemented the mathematics in computer code so they could simulate thousands of scenarios. For each trajectory, the variational calculus minimized the miss distance while maximizing speed at impact as well as the angle of impact.

The variational calculus algorithm succeeded in identifying the optimal sequence of commands: how much the fins (control surfaces) needed to change the path of the munition to follow the optimal path to the target. The concept worked in simulation in the thousands of optimal trajectories that were run. Moreover, the mathematics worked on several munitions, one of which was the MK82 glide bomb, fitted (in simulation) with an inertial guidance unit to control the fins: an early smart-bomb.

There was a problem, however. The variational calculus was so computationally complex that the small computers on-board could not solve the problem in real time. But what if one could estimate the optimal guidance commands at any time during the flight from observable characteristics of the flight? After all, the guidance unit can compute where the bomb is in space, how fast it is going, and the distance of the target that was programmed into the unit when it was launched. If the estimates of the optimum guidance commands were close enough to the actual optimal path, it would be near optimal and still succeed. Predictive models were built to do exactly this. The system was called Optimal Path-to-Go guidance.

These two programs designed by two different companies seemingly could not be more different. One program knows characteristics of people, such as demographics and their level of engagement with a brand, and tries to predict a human decision. The second program knows locations of a bomb in space and tries to predict the best physical action for it to hit a target.

But they share something in common: They both need to estimate values that are unknown but tremendously useful. For the affinity programs, the models estimate whether or not an individual will respond to a campaign, and for the guidance program, the models estimate the best guidance command. In this sense, these two programs are very similar because they both involve predicting a value or values that are known historically, but are unknown at the time a decision is needed. Not only are these programs related in this sense, but they are far from unique; there are countless decisions businesses and government agencies make every day that can be improved by using historic data as an aid to making decisions or even to automate the decisions themselves.

This book describes the back-story behind how analysts build the predictive models like the ones described in these two programs. There is science behind much of what predictive modelers do, yet there is also plenty of art, where no theory can inform us as to the best action, but experience provides principles by which tradeoffs can be made as solutions are found. Without the art, the science would only be able to solve a small subset of problems we face. Without the science, we would be like a plane without a rudder or a kite without a tail, moving at a rapid pace without any control, unable to achieve our objectives.

What Is Analytics?

Analytics is the process of using computational methods to discover and report influential patterns in data. The goal of analytics is to gain insight and often to affect decisions. Data is necessarily a measure of historic information so, by definition, analytics examines historic data. The term itself rose to prominence in 2005, in large part due to the introduction of Google Analytics. Nevertheless, the ideas behind analytics are not new at all but have been represented by different terms throughout the decades, including cybernetics, data analysis, neural networks, pattern recognition, statistics, knowledge discovery, data mining, and now even data science.

The rise of analytics in recent years is pragmatic: As organizations collect more data and begin to summarize it, there is a natural progression toward using the data to improve estimates, forecasts, decisions, and ultimately, efficiency.

What Is Predictive Analytics?

Predictive analytics is the process of discovering interesting and meaningful patterns in data. It draws from several related disciplines, some of which have been used to discover patterns in data for more than 100 years, including pattern recognition, statistics, machine learning, artificial intelligence, and data mining. What differentiates predictive analytics from other types of analytics?

First, predictive analytics is data-driven, meaning that algorithms derive key characteristics of the models from the data itself rather than from assumptions made by the analyst. Put another way, data-driven algorithms induce models from the data. The induction process can include identification of variables to be included in the model, parameters that define the model, weights or coefficients in the model, or model complexity.

Second, predictive analytics algorithms automate the process of finding the patterns from the data. Powerful induction algorithms not only discover coefficients or weights for the models, but also the very form of the models. Decision tree algorithms, for example, learn which of the candidate inputs best predict a target variable in addition to identifying which values of the variables to use in building predictions. Other algorithms can be modified to perform searches, using exhaustive or greedy searches to find the best set of inputs and model parameters. If a candidate variable helps reduce model error, it is included in the model; otherwise, it is eliminated.
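The greedy search described above can be sketched as a forward selection loop. This is a minimal illustration, not the book's own procedure: the helper names (`fit_ols`, `mse`, `forward_select`), the `min_gain` threshold, the choice of a linear model, and the synthetic data are all assumptions made for the example. In-sample error never increases when a variable is added, so the sketch requires a minimum improvement before keeping a variable; in practice held-out error would usually be used instead.

```python
import random

def fit_ols(rows, y):
    """Least-squares coefficients via the normal equations (Gauss-Jordan elimination)."""
    n = len(rows[0])
    # Augmented matrix [X^T X | X^T y]
    a = [[sum(r[i] * r[j] for r in rows) for j in range(n)]
         + [sum(r[i] * t for r, t in zip(rows, y))] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(n):
            if r != col:
                f = a[r][col] / a[col][col]
                a[r] = [v - f * w for v, w in zip(a[r], a[col])]
    return [a[i][n] / a[i][i] for i in range(n)]

def mse(columns, data, y):
    """Mean squared error of a linear fit on the chosen columns plus an intercept."""
    rows = [[1.0] + [rec[c] for c in columns] for rec in data]
    coef = fit_ols(rows, y)
    preds = [sum(c * v for c, v in zip(coef, row)) for row in rows]
    return sum((p - t) ** 2 for p, t in zip(preds, y)) / len(y)

def forward_select(data, y, names, min_gain=0.01):
    """Greedily add the variable that most reduces error; stop when the gain is too small."""
    selected, best = [], mse([], data, y)
    while len(selected) < len(names):
        trials = {name: mse(selected + [name], data, y)
                  for name in names if name not in selected}
        winner = min(trials, key=trials.get)
        if best - trials[winner] < min_gain:
            break  # no remaining variable helps enough, so it is left out
        selected.append(winner)
        best = trials[winner]
    return selected, best

# Synthetic data (an assumption for the demo): x3 is pure noise.
random.seed(0)
data = [{"x1": random.gauss(0, 1), "x2": random.gauss(0, 1), "x3": random.gauss(0, 1)}
        for _ in range(200)]
y = [3 * r["x1"] - 2 * r["x2"] + random.gauss(0, 0.1) for r in data]

selected, error = forward_select(data, y, ["x1", "x2", "x3"])
print(selected, round(error, 4))
```

A greedy search like this evaluates far fewer models than an exhaustive search over all subsets, at the cost of possibly missing a better combination.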

Many software packages and algorithms also automate the transformation of input variables so that they can be used effectively in the predictive models. For example, if a hundred candidate input variables can or should be transformed to remove skew, some predictive analytics software can do this in a single step rather than requiring you to program all one hundred transformations one at a time.
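As a rough sketch of what such a single step might look like, the function below scans every column of a small table and applies a log transform to the ones that are strongly right-skewed. The names `skewness` and `fix_skew`, the threshold of 1.0, the choice of `log1p`, and the sample data are assumptions for illustration; real packages offer their own transformations and criteria.

```python
import math

def skewness(values):
    """Population skewness: third central moment divided by variance to the power 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5 if m2 > 0 else 0.0

def fix_skew(table, threshold=1.0):
    """Apply log1p to every non-negative column whose skewness exceeds the threshold."""
    fixed = {}
    for name, values in table.items():
        if min(values) >= 0 and skewness(values) > threshold:
            fixed[name] = [math.log1p(v) for v in values]
        else:
            fixed[name] = list(values)  # left unchanged
    return fixed

# Hypothetical table: "spend" has a long right tail, "age" is symmetric.
table = {
    "spend": [1, 10, 100, 1000, 10000],
    "age": [20, 30, 40, 50, 60],
}
fixed = fix_skew(table)
for name in table:
    print(name, round(skewness(table[name]), 2), "->", round(skewness(fixed[name]), 2))
```

The same call handles five columns or five hundred, which is the point of automating the step.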

Predictive analytics doesn't do anything that any analyst couldn't accomplish with pencil and paper or a spreadsheet if given enough time; the algorithms, while powerful, have no common sense. Consider a supervised learning data set with 50 inputs and a single binary target variable with values 0 and 1. One way to try to identify which of the inputs is most related to the target variable is to plot each variable, one at a time, in a histogram. The target variable can be superimposed on the histogram, as shown in Figure 1.1. With 50 inputs, you need to look at 50 histograms. It is not uncommon for predictive modelers to do this.

Figure 1.1 Histogram
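The numbers behind a histogram with the target superimposed can be tabulated with a few lines of code. The sketch below is illustrative only: the function name `binned_target_counts`, the equal-width binning, and the sample values are assumptions, not taken from the figure.

```python
def binned_target_counts(values, target, n_bins=5):
    """For each equal-width bin, count records with target 0 and with target 1."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # avoid a zero width for constant columns
    counts = [[0, 0] for _ in range(n_bins)]
    for v, t in zip(values, target):
        b = min(int((v - lo) / width), n_bins - 1)  # the maximum falls in the last bin
        counts[b][t] += 1
    return counts

# Hypothetical input variable and binary target.
values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
target = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]

counts = binned_target_counts(values, target, n_bins=2)
for i, (zeros, ones) in enumerate(counts):
    print(f"bin {i}: target=0 {'#' * zeros}  target=1 {'*' * ones}")
```

Bins where the share of 1s differs sharply from the overall share suggest that the input is related to the target.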

If the patterns require examining two variables at a time, you can do so with a scatter plot. For 50 variables, there are 1,225 possible scatter plots to examine. A dedicated predictive modeler might actually do this, although it will take some time. However, if the patterns require that you examine three variables simultaneously, you would need to examine 19,600 3D scatter plots in order to examine all the possible three-way combinations. Even the most dedicated modelers will be hard-pressed to spend the time needed to examine so many plots.
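The plot counts quoted above are the numbers of ways to choose two or three variables out of 50. A short check, using only the Python standard library (the variable names are illustrative):

```python
import math
from itertools import combinations

n_inputs = 50

pairs = math.comb(n_inputs, 2)    # two-variable scatter plots
triples = math.comb(n_inputs, 3)  # three-variable (3D) scatter plots
print(pairs, triples)

# The same count obtained by enumerating the pairs explicitly.
enumerated_pairs = sum(1 for _ in combinations(range(n_inputs), 2))
```

The count grows quickly with the number of variables examined together, which is why manual inspection stops being practical beyond pairs.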

You need algorithms to sift through all of the potential combinations of inputs in the data—the patterns—and identify which ones are the most interesting. The analyst can then focus on these patterns, undoubtedly a much smaller number of inputs to examine. Of the 19,600 three-way combinations of inputs, it may be that a predictive model identifies six of the variables as the most significant contributors to accurate models. In addition, of these six variables, the top three are particularly good predictors and much...
