Schweitzer Fachinformationen
When it comes to professional knowledge, Schweitzer Fachinformationen leads the way. Customers in law and consulting, as well as companies, public administrations, and libraries, receive complete solutions for procuring, managing, and using digital and print media.
Galit Shmueli, PhD, is Distinguished Professor and Institute Director at National Tsing Hua University's Institute of Service Science. She has designed and instructed business analytics courses since 2004 at the University of Maryland, Statistics.com, The Indian School of Business, and National Tsing Hua University, Taiwan.
Peter C. Bruce is Founder of the Institute for Statistics Education at Statistics.com, and Chief Learning Officer at Elder Research, Inc.
Peter Gedeck, PhD, is Senior Data Scientist at Collaborative Drug Discovery and teaches at Statistics.com and the UVA School of Data Science. His specialty is the development of machine learning algorithms to predict biological and physicochemical properties of drug candidates.
Inbal Yahav, PhD, is a Senior Lecturer in The Coller School of Management at Tel Aviv University, Israel. Her work focuses on the development and adaptation of statistical models for use by researchers in the field of information systems.
Nitin R. Patel, PhD, is Co-founder and Lead Researcher at Cytel Inc. He was also a Co-founder of Tata Consultancy Services. A Fellow of the American Statistical Association, Dr. Patel has served as a Visiting Professor at the Massachusetts Institute of Technology and at Harvard University, USA.
Foreword by Ravi Bapna
Foreword by Gareth James
Preface to the Second R Edition
Acknowledgments
Part I Preliminaries
Chapter 1 Introduction
1.1 What Is Business Analytics?
1.2 What Is Machine Learning?
1.3 Machine Learning, AI, and Related Terms
1.4 Big Data
1.5 Data Science
1.6 Why Are There So Many Different Methods?
1.7 Terminology and Notation
1.8 Road Maps to This Book
Order of Topics
Chapter 2 Overview of the Machine Learning Process
2.1 Introduction
2.2 Core Ideas in Machine Learning
Classification
Prediction
Association Rules and Recommendation Systems
Predictive Analytics
Data Reduction and Dimension Reduction
Data Exploration and Visualization
Supervised and Unsupervised Learning
2.3 The Steps in a Machine Learning Project
2.4 Preliminary Steps
Organization of Data
Predicting Home Values in the West Roxbury Neighborhood
Loading and Looking at the Data in R
Sampling from a Database
Oversampling Rare Events in Classification Tasks
Preprocessing and Cleaning the Data
2.5 Predictive Power and Overfitting
Overfitting
Creating and Using Data Partitions
2.6 Building a Predictive Model
Modeling Process
2.7 Using R for Machine Learning on a Local Machine
2.8 Automating Machine Learning Solutions
Predicting Power Generator Failure
Uber's Michelangelo
2.9 Ethical Practice in Machine Learning
Machine Learning Software: The State of the Market (by Herb Edelstein)
Problems
Part II Data Exploration and Dimension Reduction
Chapter 3 Data Visualization
3.1 Uses of Data Visualization
Base R or ggplot?
3.2 Data Examples
Example 1: Boston Housing Data
Example 2: Ridership on Amtrak Trains
3.3 Basic Charts: Bar Charts, Line Charts, and Scatter Plots
Distribution Plots: Boxplots and Histograms
Heatmaps: Visualizing Correlations and Missing Values
3.4 Multidimensional Visualization
Adding Variables: Color, Size, Shape, Multiple Panels, and Animation
Manipulations: Rescaling, Aggregation and Hierarchies, Zooming, Filtering
Reference: Trend Lines and Labels
Scaling Up to Large Datasets
Multivariate Plot: Parallel Coordinates Plot
Interactive Visualization
3.5 Specialized Visualizations
Visualizing Networked Data
Visualizing Hierarchical Data: Treemaps
Visualizing Geographical Data: Map Charts
3.6 Major Visualizations and Operations, by Machine Learning Goal
Prediction
Classification
Time Series Forecasting
Unsupervised Learning
Problems
Chapter 4 Dimension Reduction
4.1 Introduction
4.2 Curse of Dimensionality
4.3 Practical Considerations
Example 1: House Prices in Boston
4.4 Data Summaries
Summary Statistics
Aggregation and Pivot Tables
4.5 Correlation Analysis
4.6 Reducing the Number of Categories in Categorical Variables
4.7 Converting a Categorical Variable to a Numerical Variable
4.8 Principal Component Analysis
Example 2: Breakfast Cereals
Principal Components
Normalizing the Data
Using Principal Components for Classification and Prediction
4.9 Dimension Reduction Using Regression Models
4.10 Dimension Reduction Using Classification and Regression Trees
Problems
Part III Performance Evaluation
Chapter 5 Evaluating Predictive Performance
5.1 Introduction
5.2 Evaluating Predictive Performance
Naive Benchmark: The Average
Prediction Accuracy Measures
Comparing Training and Holdout Performance
Cumulative Gains and Lift Charts
5.3 Judging Classifier Performance
Benchmark: The Naive Rule
Class Separation
The Confusion (Classification) Matrix
Using the Holdout Data
Accuracy Measures
Propensities and Threshold for Classification
Performance in Case of Unequal Importance of Classes
Asymmetric Misclassification Costs
Generalization to More Than Two Classes
5.4 Judging Ranking Performance
Cumulative Gains and Lift Charts for Binary Data
Decile-wise Lift Charts
Beyond Two Classes
Gains and Lift Charts Incorporating Costs and Benefits
Cumulative Gains as a Function of Threshold
5.5 Oversampling
Creating an Over-sampled Training Set
Evaluating Model Performance Using a Non-oversampled Holdout Set
Evaluating Model Performance If Only Oversampled Holdout Set Exists
Problems
Part IV Prediction and Classification Methods
Chapter 6 Multiple Linear Regression
6.1 Introduction
6.2 Explanatory vs. Predictive Modeling
6.3 Estimating the Regression Equation and Prediction
Example: Predicting the Price of Used Toyota Corolla Cars
Cross-validation and caret
6.4 Variable Selection in Linear Regression
Reducing the Number of Predictors
How to Reduce the Number of Predictors
Regularization (Shrinkage Models)
Problems
Chapter 7 k-Nearest Neighbors (kNN)
7.1 The k-NN Classifier (Categorical Outcome)
Determining Neighbors
Classification Rule
Example: Riding Mowers
Choosing k
Weighted k-NN
Setting the Cutoff Value
k-NN with More Than Two Classes
Converting Categorical Variables to Binary Dummies
7.2 k-NN for a Numerical Outcome
7.3 Advantages and Shortcomings of k-NN Algorithms
Problems
Chapter 8 The Naive Bayes Classifier
8.1 Introduction
Threshold Probability Method
Conditional Probability
Example 1: Predicting Fraudulent Financial Reporting
8.2 Applying the Full (Exact) Bayesian Classifier
Using the "Assign to the Most Probable Class" Method
Using the Threshold Probability Method
Practical Difficulty with the Complete (Exact) Bayes Procedure
8.3 Solution: Naive Bayes
The Naive Bayes Assumption of Conditional Independence
Using the Threshold Probability Method
Example 2: Predicting Fraudulent Financial Reports, Two Predictors
Example 3: Predicting Delayed Flights
Working with Continuous Predictors
8.4 Advantages and Shortcomings of the Naive Bayes Classifier
Problems
Chapter 9 Classification and Regression Trees
9.1 Introduction
Tree Structure
Decision Rules
Classifying a New Record
9.2 Classification Trees
Recursive Partitioning
Example 1: Riding Mowers
Measures of Impurity
9.3 Evaluating the Performance of a Classification Tree
Example 2: Acceptance of Personal Loan
9.4 Avoiding Overfitting
Stopping Tree Growth
Pruning the Tree
Best-Pruned Tree
9.5 Classification Rules from Trees
9.6 Classification Trees for More Than Two Classes
9.7 Regression Trees
Prediction
Measuring Impurity
Evaluating Performance
9.8 Advantages and Weaknesses of a Tree
9.9 Improving Prediction: Random Forests and Boosted Trees
Random Forests
Boosted Trees
Problems
Chapter 10 Logistic Regression
10.1 Introduction
10.2 The Logistic Regression Model
10.3 Example: Acceptance of Personal Loan
Model with a Single Predictor
Estimating the Logistic Model from Data: Computing Parameter Estimates
Interpreting Results in Terms of Odds (for a Profiling Goal)
10.4 Evaluating Classification Performance
10.5 Variable Selection
10.6 Logistic Regression for Multi-Class Classification
Ordinal Classes
Nominal Classes
10.7 Example of Complete Analysis: Predicting Delayed Flights
Data Preprocessing
Model-Fitting and Estimation
Model Interpretation
Model Performance
Variable Selection
Problems
Chapter 11 Neural Nets
11.1 Introduction
11.2 Concept and Structure of a Neural Network
11.3 Fitting a Network to Data
Example 1: Tiny Dataset
Computing Output of Nodes
Preprocessing the Data
Training the Model
Example 2: Classifying Accident Severity
Avoiding Overfitting
Using the Output for Prediction and Classification
11.4 Required User Input
11.5 Exploring the Relationship Between Predictors and Outcome
11.6 Deep Learning
Convolutional Neural Networks (CNNs)
Local Feature Map
A Hierarchy of Features
The Learning Process
Unsupervised Learning
Example: Classification of Fashion Images
Conclusion
11.7 Advantages and Weaknesses of Neural Networks
Problems
Chapter 12 Discriminant Analysis
12.1 Introduction
Example 1: Riding Mowers
Example 2: Personal Loan Acceptance
12.2 Distance of a Record from a Class
12.3 Fisher's Linear Classification Functions
12.4 Classification Performance of Discriminant Analysis
12.5 Prior Probabilities
12.6 Unequal Misclassification Costs
12.7 Classifying More Than Two Classes
Example 3: Medical Dispatch to Accident Scenes
12.8 Advantages and Weaknesses
Problems
Chapter 13 Generating, Comparing, and Combining Multiple Models
13.1 Ensembles
Why Ensembles Can Improve Predictive Power
Simple Averaging or Voting
Bagging
Boosting
Bagging and Boosting in R
Stacking
Advantages and Weaknesses of Ensembles
13.2 Automated Machine Learning (AutoML)
AutoML: Explore and Clean Data
AutoML: Determine Machine Learning Task
AutoML: Choose Features and Machine Learning Methods
AutoML: Evaluate Model Performance
AutoML: Model Deployment
Advantages and Weaknesses of Automated Machine Learning
13.3 Explaining Model Predictions
13.4 Summary
Problems
Part V Intervention and User Feedback
Chapter 14 Interventions: Experiments, Uplift Models, and Reinforcement Learning
14.1 A/B Testing
Example: Testing a New Feature in a Photo Sharing App
The Statistical Test for Comparing Two Groups (T-Test)
Multiple Treatment Groups: A/B/n Tests
Multiple A/B Tests and the Danger of Multiple Testing
14.2 Uplift (Persuasion) Modeling
Gathering the Data
A Simple Model
Modeling Individual Uplift
Computing Uplift with R
Using the Results of an Uplift Model
14.3 Reinforcement Learning
Explore-Exploit: Multi-armed Bandits
Example of Using a Contextual Multi-Arm Bandit for Movie Recommendations
Markov Decision Process (MDP)
14.4 Summary
Problems
Part VI Mining Relationships Among Records
Chapter 15 Association Rules and Collaborative Filtering
15.1 Association Rules
Discovering Association Rules in Transaction Databases
Example 1: Synthetic Data on Purchases of Phone Faceplates
Generating Candidate Rules
The Apriori Algorithm
Selecting Strong Rules
Data Format
The Process of Rule Selection
Interpreting the Results
Rules and Chance
Example 2: Rules for Similar Book Purchases
15.2 Collaborative Filtering
Data Type and Format
Example 3: Netflix Prize Contest
User-Based Collaborative Filtering: "People Like You"
Item-Based Collaborative Filtering
Evaluating Performance
Example 4: Predicting Movie Ratings with MovieLens Data
Advantages and Weaknesses of Collaborative Filtering
Collaborative Filtering vs. Association Rules
15.3 Summary
Problems
Chapter 16 Cluster Analysis
16.1 Introduction
Example: Public Utilities
16.2 Measuring Distance Between Two Records
Euclidean Distance
Normalizing Numerical Variables
Other Distance Measures for Numerical Data
Distance Measures for Categorical Data
Distance Measures for Mixed Data
16.3 Measuring Distance Between Two Clusters
Minimum Distance
Maximum Distance
Average Distance
Centroid Distance
16.4 Hierarchical (Agglomerative) Clustering
Single Linkage
Complete Linkage
Average Linkage
Centroid Linkage
Ward's Method
Dendrograms: Displaying Clustering Process and Results
Validating Clusters
Limitations of Hierarchical Clustering
16.5 Non-Hierarchical Clustering: The k-Means Algorithm
Choosing the Number of Clusters (k)
Problems
Part VII Forecasting Time Series
Chapter 17 Handling Time Series
17.1 Introduction
17.2 Descriptive vs. Predictive Modeling
17.3 Popular Forecasting Methods in Business
Problems
Chapter 18 Regression-Based Forecasting
18.1 A Model with Trend
Linear Trend
Exponential Trend
Polynomial Trend
Problems
Chapter 19 Smoothing and Deep Learning Methods for Forecasting
19.1 Smoothing Methods: Introduction
19.2 Moving Average
Centered Moving Average for Visualization
Trailing Moving Average for Forecasting
Choosing Window Width (w)
Problems
Part VIII Data Analytics
Chapter 20 Social Network Analytics
20.1 Introduction
20.2 Directed vs. Undirected Networks
20.3 Visualizing and Analyzing Networks
Plot Layout
Edge List
Adjacency Matrix
Using Network Data in Classification and Prediction
Problems
Chapter 21 Text Mining
21.1 Introduction
21.2 The Tabular Representation of Text
21.3 Bag-of-Words vs. Meaning Extraction at Document Level
Problems
Chapter 22 Responsible Data Science
22.1 Introduction
22.2 Unintentional Harm
22.3 Legal Considerations
22.4 Principles of Responsible Data Science
Non-maleficence
Fairness
Transparency
Accountability
Data Privacy and Security
Problems
Part IX Cases
Chapter 23 Cases
23.1 Charles Book Club
The Book Industry
Database Marketing at Charles
Machine Learning Techniques
Assignment
23.2 German Credit
Background
Data
Assignment
Index
Business Analytics (BA) is the practice and art of bringing quantitative data to bear on decision-making. The term means different things to different organizations.
Consider the role of analytics in helping newspapers survive the transition to a digital world. One tabloid newspaper with a working-class readership in Britain had launched a web version of the paper and ran tests on its home page to determine which images produced more hits: cats, dogs, or monkeys. For this company, this simple application was considered analytics. By contrast, the Washington Post has a highly influential audience that is of interest to big defense contractors: it is perhaps the only newspaper where you routinely see advertisements for aircraft carriers. In the digital environment, the Post can track readers by time of day, location, and user subscription information. In this fashion, the display of the aircraft carrier advertisement in the online paper may be focused on a very small group of individuals, say, the members of the House and Senate Armed Services Committees who will be voting on the Pentagon's budget.
Business Analytics, or more generically, analytics, includes a range of data analysis methods. Many powerful applications involve little more than counting, rule-checking, and basic arithmetic. For some organizations, this is what is meant by analytics.
The next level of business analytics, now termed Business Intelligence (BI), refers to data visualization and reporting for understanding "what happened and what is happening." This is done by use of charts, tables, and dashboards to display, examine, and explore data. BI, which earlier consisted mainly of generating static reports, has evolved into more user-friendly and effective tools and practices, such as creating interactive dashboards that allow the user not only to access real-time data, but also to directly interact with it. Effective dashboards are those that tie directly into company data and give managers a tool to quickly see what might not readily be apparent in a large complex database. One such tool for industrial operations managers displays customer orders in a single two-dimensional display, using color and bubble size as added variables, showing customer name, type of product, size of order, and length of time to produce.
Business Analytics now typically includes BI as well as sophisticated data analysis methods, such as statistical models and machine learning algorithms used for exploring data, quantifying and explaining relationships between measurements, and predicting new records. Methods like regression models are used to describe and quantify "on average" relationships (e.g., between advertising and sales), to predict new records (e.g., whether a new patient will react positively to a medication), and to forecast future values (e.g., next week's web traffic).
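As a rough illustration of the two uses of a regression model mentioned above (describing an "on average" relationship and predicting a new record), the following minimal Python sketch fits a one-predictor least-squares line. The advertising and sales figures, the variable names, and the helper `fit_simple_regression` are all hypothetical and chosen only for illustration; they do not come from the book, which works in R.

```python
from statistics import mean

# Hypothetical weekly data, invented for illustration only.
advertising = [1.0, 2.0, 3.0, 4.0, 5.0]   # ad spend, in $1000s
sales = [3.1, 4.9, 7.2, 8.8, 11.0]        # units sold, in 1000s


def fit_simple_regression(x, y):
    """Ordinary least squares for the line y = intercept + slope * x."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


intercept, slope = fit_simple_regression(advertising, sales)

# Descriptive use: the slope quantifies the "on average" relationship.
print(f"Average change in sales per extra $1000 of advertising: {slope:.2f}")

# Predictive use: apply the fitted line to a new, unseen record.
new_spend = 6.0
predicted_sales = intercept + slope * new_spend
print(f"Predicted sales at ad spend {new_spend}: {predicted_sales:.2f}")
```

The same fitted coefficients serve both purposes; what differs is whether the interest lies in the slope itself or in the value predicted for a particular new record.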
Readers familiar with earlier editions of this book may have noticed that the book title has changed from Data Mining for Business Intelligence to Data Mining for Business Analytics and, finally, in this edition to Machine Learning for Business Analytics. The first change reflected the advent of the term BA, which overtook the earlier term BI to denote advanced analytics. Today, BI is used to refer to data visualization and reporting. The second change reflects how the term machine learning has overtaken the older term data mining.
The widespread adoption of predictive analytics, coupled with the accelerating availability of data, has increased organizations' capabilities throughout the economy. A few examples are as follows:
Credit scoring: One long-established use of predictive modeling techniques for business prediction is credit scoring. A credit score is not some arbitrary judgment of creditworthiness; it is based mainly on a predictive model that uses prior data to predict repayment behavior.
Future purchases: A more recent (and controversial) example is Target's use of predictive modeling to classify sales prospects as "pregnant" or "not-pregnant." Those classified as pregnant could then be sent sales promotions at an early stage of pregnancy, giving Target a head start on a significant purchase stream.
Tax evasion: The US Internal Revenue Service found it was 25 times more likely to find tax evasion when enforcement activity was based on predictive models, allowing agents to focus on the most-likely tax cheats (Siegel, 2013).
The Business Analytics toolkit also includes statistical experiments, the most common of which is known to marketers as A/B testing. These are often used for pricing decisions.
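To make the idea of an A/B test on pricing concrete, here is a minimal Python sketch under stated assumptions. The contents list indicates that the book's own treatment uses a t-test; this sketch swaps in a two-proportion z-test, a common alternative when the outcome is a purchase/no-purchase rate. The visitor and purchase counts, and the function name `two_proportion_z_test`, are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rates between groups A and B."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both groups share one rate.
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical experiment: price A shown to 1000 visitors (120 purchases),
# higher price B shown to another 1000 visitors (90 purchases).
z, p_value = two_proportion_z_test(conv_a=120, n_a=1000, conv_b=90, n_b=1000)
print(f"z = {z:.3f}, p-value = {p_value:.4f}")
```

A small p-value suggests the difference in purchase rates is unlikely to be due to chance alone. Whether the higher price is still worthwhile depends on revenue per visitor, which this sketch does not compute.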
Beware the organizational setting where analytics is a solution in search of a problem: a manager, knowing that business analytics and machine learning are hot areas, decides that her organization must deploy them too, to capture that hidden value that must be lurking somewhere. Successful use of analytics and machine learning requires both an understanding of the business context where value is to be captured and an understanding of exactly what the machine learning methods do.
In this book, machine learning (or data mining) refers to business analytics methods that go beyond counts, descriptive techniques, reporting, and methods based on business rules. While we do introduce data visualization, which is commonly the first step into more advanced analytics, the book focuses mostly on the more advanced data analytics tools. Specifically, it includes statistical and machine learning methods that inform decision-making, often in an automated fashion. Prediction is typically an important component, often at the individual level. Rather than "what is the relationship between advertising and sales," we might be interested in "what specific advertisement, or recommended product, should be shown to a given online shopper at this moment?" Or we might be interested in clustering customers into different "personas" that receive different marketing treatment and then assigning each new prospect to one of these personas.
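The last step described above, assigning a new prospect to an existing persona, can be sketched as a nearest-centroid lookup. This is a minimal Python sketch, not the book's method: it assumes the personas were already found by some clustering procedure and are summarized by centroids. The persona names, the two features, and all numbers are hypothetical.

```python
from math import dist

# Hypothetical persona centroids, assumed to come from an earlier clustering step.
# Features: (average order value in $, store visits per month).
personas = {
    "bargain hunter": (20.0, 8.0),
    "occasional big spender": (150.0, 1.0),
    "loyal regular": (60.0, 5.0),
}


def assign_persona(prospect, centroids):
    """Return the name of the persona whose centroid is closest (Euclidean distance)."""
    return min(centroids, key=lambda name: dist(prospect, centroids[name]))


new_prospect = (55.0, 4.0)
assigned = assign_persona(new_prospect, personas)
print(f"New prospect assigned to persona: {assigned}")
```

In practice the features would usually be normalized first, since a variable measured in dollars would otherwise dominate the distance over one measured in visits.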
The era of Big Data has accelerated the use of machine learning. Machine learning methods, with their power and automaticity, have the ability to cope with huge amounts of data and extract value.
The field of analytics is growing rapidly, both in terms of the breadth of applications and in terms of the number of organizations using advanced analytics. As a result, there is considerable overlap and inconsistency of definitions. Terms have also changed over time.
The older term data mining itself means different things to different people. To the general public, it may have a general, somewhat hazy and pejorative meaning of digging through vast stores of (often personal) data in search of something interesting. Data mining, as it refers to analytic techniques, has largely been superseded by the term machine learning. Other terms that organizations use are predictive analytics, predictive modeling, and most recently machine learning and artificial intelligence (AI).
Many practitioners, particularly those from the IT and computer science communities, use the term AI to refer to all the methods discussed in this book. AI originally referred to the general capability of a machine to act like a human, and, in its earlier days, existed mainly in the realm of science fiction and the unrealized ambitions of computer scientists. More recently, it has come to encompass the methods of statistical and machine learning discussed in this book, as the primary enablers of that grand vision, and sometimes the term is used loosely to mean the same thing as machine learning. More broadly, it includes generative capabilities such as the creation of images, audio, and video.
A variety of techniques for exploring data and building models have been around for a long time in the world of statistics: linear regression, logistic regression, discriminant analysis, and principal components analysis, for example. However, the core tenets of classical statistics (computing is difficult and data are scarce) do not apply in machine learning applications, where both data and computing power are plentiful.
This gives rise to Daryl Pregibon's description of "data mining" (in the sense of machine learning) as "statistics at scale and speed" (Pregibon, 1999). Another major difference between the fields of statistics and machine learning is the focus in statistics on inference from a sample to the population regarding an "average effect": for example, "a $1 price increase will reduce average demand by 2 boxes." In contrast, the focus in machine learning is on predicting individual records: "the predicted demand for person i given a $1 price increase is 1 box, while for person j it is 3 boxes." The emphasis that...
File format: ePUB. Copy protection: Adobe DRM (Digital Rights Management)
System requirements:
The ePUB file format is well suited to novels and non-fiction, that is, to "flowing" text without complex layout. On e-readers and smartphones, line and page breaks adapt automatically to the small displays. Adobe DRM is a "hard" form of copy protection. If the necessary requirements are not met, you will unfortunately be unable to open the e-book, so you need to prepare your reading hardware before downloading. Please note: we strongly recommend authorizing the reading software with your personal Adobe ID after installing it.