Data Mining for Business Analytics

Concepts, Techniques, and Applications with XLMiner
 
 
Wiley (publisher)
  • 3rd edition
  • Published April 22, 2016
  • 552 pages
 
E-book | ePUB with Adobe DRM | System requirements
978-1-118-72924-3 (ISBN)
 
Data Mining for Business Analytics: Concepts, Techniques, and Applications in XLMiner®, Third Edition presents an applied approach to data mining and predictive analytics with clear exposition, hands-on exercises, and real-life case studies. Readers will work with all of the standard data mining methods using the Microsoft® Office Excel® add-in XLMiner® to develop predictive models and learn how to obtain business value from Big Data.
Featuring updated topical coverage on text mining, social network analysis, collaborative filtering, ensemble methods, uplift modeling and more, the Third Edition also includes:
* Real-world examples to build a theoretical and practical understanding of key data mining methods
* End-of-chapter exercises that help readers better understand the presented material
* Data-rich case studies to illustrate various applications of data mining techniques
* Completely new chapters on social network analysis and text mining
* A companion site with additional data sets, instructor materials that include solutions to exercises and case studies, and Microsoft PowerPoint® slides
* Free 140-day license to use XLMiner for Education software
Data Mining for Business Analytics: Concepts, Techniques, and Applications in XLMiner®, Third Edition is an ideal textbook for upper-undergraduate and graduate-level courses as well as professional programs on data mining, predictive modeling, and Big Data analytics. The new edition is also a unique reference for analysts, researchers, and practitioners working with predictive analytics in the fields of business, finance, marketing, computer science, and information technology.
Praise for the Second Edition
"...full of vivid and thought-provoking anecdotes... needs to be read by anyone with a serious interest in research and marketing." - Research Magazine
"Shmueli et al. have done a wonderful job in presenting the field of data mining - a welcome addition to the literature." - ComputingReviews.com
"Excellent choice for business analysts...The book is a perfect fit for its intended audience." - Keith McCormick, Consultant and Author of SPSS Statistics For Dummies, Third Edition and SPSS Statistics for Data Analysis and Visualization
Galit Shmueli, PhD, is Distinguished Professor at National Tsing Hua University's Institute of Service Science. She has designed and instructed data mining courses since 2004 at University of Maryland, Statistics.com, The Indian School of Business, and National Tsing Hua University, Taiwan. Professor Shmueli is known for her research and teaching in business analytics, with a focus on statistical and data mining methods in information systems and healthcare. She has authored over 70 journal articles, books, textbooks and book chapters.
Peter C. Bruce is President and Founder of the Institute for Statistics Education at www.statistics.com. He has written multiple journal articles and is the developer of Resampling Stats software. He is the author of Introductory Statistics and Analytics: A Resampling Perspective, also published by Wiley.
Nitin R. Patel, PhD, is Chairman and cofounder of Cytel, Inc., based in Cambridge, Massachusetts. A Fellow of the American Statistical Association, Dr. Patel has also served as a Visiting Professor at the Massachusetts Institute of Technology and at Harvard University. He is a Fellow of the Computer Society of India and was a professor at the Indian Institute of Management, Ahmedabad for 15 years.
3rd edition
  • English
  • Hoboken, USA
John Wiley & Sons
  • 7.87 MB
978-1-118-72924-3 (9781118729243)
1118729242 (ISBN-10)
  • Cover
  • Title Page
  • Copyright
  • Dedication
  • Contents
  • Foreword
  • Preface to the Third Edition
  • Preface to the First Edition
  • Acknowledgments
  • Part I: Preliminaries
  • Chapter 1: Introduction
  • 1.1 What Is Business Analytics?
  • 1.2 What Is Data Mining?
  • 1.3 Data Mining and Related Terms
  • 1.4 Big Data
  • 1.5 Data Science
  • 1.6 Why Are There So Many Different Methods?
  • 1.7 Terminology and Notation
  • 1.8 Road Maps to This Book
  • Order of Topics
  • Chapter 2: Overview of the Data Mining Process
  • 2.1 Introduction
  • 2.2 Core Ideas in Data Mining
  • Classification
  • Prediction
  • Association Rules and Recommendation Systems
  • Predictive Analytics
  • Data Reduction and Dimension Reduction
  • Data Exploration and Visualization
  • Supervised and Unsupervised Learning
  • 2.3 The Steps in Data Mining
  • 2.4 Preliminary Steps
  • Organization of Datasets
  • Sampling from a Database
  • Oversampling Rare Events in Classification Tasks
  • Preprocessing and Cleaning the Data
  • 2.5 Predictive Power and Overfitting
  • Creation and Use of Data Partitions
  • Overfitting
  • 2.6 Building a Predictive Model with XLMiner
  • Predicting Home Values in the West Roxbury Neighborhood
  • Modeling Process
  • 2.7 Using Excel for Data Mining
  • 2.8 Automating Data Mining Solutions
  • Data Mining Software Tools: The State of the Market
  • Problems
  • Part II: Data Exploration and Dimension Reduction
  • Chapter 3: Data Visualization
  • 3.1 Uses of Data Visualization
  • 3.2 Data Examples
  • Example 1: Boston Housing Data
  • Example 2: Ridership on Amtrak Trains
  • 3.3 Basic Charts: Bar Charts, Line Graphs, and Scatter Plots
  • Distribution Plots: Boxplots and Histograms
  • Heatmaps: Visualizing Correlations and Missing Values
  • 3.4 Multidimensional Visualization
  • Adding Variables: Color, Size, Shape, Multiple Panels, and Animation
  • Manipulations: Rescaling, Aggregation and Hierarchies, Zooming, Filtering
  • Reference: Trend Line and Labels
  • Scaling up to Large Datasets
  • Multivariate Plot: Parallel Coordinates Plot
  • Interactive Visualization
  • 3.5 Specialized Visualizations
  • Visualizing Networked Data
  • Visualizing Hierarchical Data: Treemaps
  • Visualizing Geographical Data: Map Charts
  • 3.6 Summary: Major Visualizations and Operations, by Data Mining Goal
  • Prediction
  • Classification
  • Time Series Forecasting
  • Unsupervised Learning
  • Problems
  • Chapter 4: Dimension Reduction
  • 4.1 Introduction
  • 4.2 Curse of Dimensionality
  • 4.3 Practical Considerations
  • Example 1: House Prices in Boston
  • 4.4 Data Summaries
  • Summary Statistics
  • Pivot Tables
  • 4.5 Correlation Analysis
  • 4.6 Reducing the Number of Categories in Categorical Variables
  • 4.7 Converting a Categorical Variable to a Numerical Variable
  • 4.8 Principal Components Analysis
  • Example 2: Breakfast Cereals
  • Principal Components
  • Normalizing the Data
  • Using Principal Components for Classification and Prediction
  • 4.9 Dimension Reduction Using Regression Models
  • 4.10 Dimension Reduction Using Classification and Regression Trees
  • Problems
  • Part III: Performance Evaluation
  • Chapter 5: Evaluating Predictive Performance
  • 5.1 Introduction
  • 5.2 Evaluating Predictive Performance
  • Benchmark: The Average
  • Prediction Accuracy Measures
  • Comparing Training and Validation Performance
  • Lift Chart
  • 5.3 Judging Classifier Performance
  • Benchmark: The Naive Rule
  • Class Separation
  • The Classification Matrix
  • Using the Validation Data
  • Accuracy Measures
  • Propensities and Cutoff for Classification
  • Performance in Unequal Importance of Classes
  • Asymmetric Misclassification Costs
  • Generalization to More Than Two Classes
  • 5.4 Judging Ranking Performance
  • Lift Charts for Binary Data
  • Decile Lift Charts
  • Beyond Two Classes
  • Lift Charts Incorporating Costs and Benefits
  • Lift as Function of Cutoff
  • 5.5 Oversampling
  • Oversampling the Training Set
  • Evaluating Model Performance Using a Non-oversampled Validation Set
  • Evaluating Model Performance If Only Oversampled Validation Set Exists
  • Problems
  • Part IV: Prediction and Classification Methods
  • Chapter 6: Multiple Linear Regression
  • 6.1 Introduction
  • 6.2 Explanatory vs. Predictive Modeling
  • 6.3 Estimating the Regression Equation and Prediction
  • Example: Predicting the Price of Used Toyota Corolla Cars
  • 6.4 Variable Selection in Linear Regression
  • Reducing the Number of Predictors
  • How to Reduce the Number of Predictors
  • Problems
  • Chapter 7: k-Nearest-Neighbors (k-NN)
  • 7.1 The k-NN Classifier (Categorical Outcome)
  • Determining Neighbors
  • Classification Rule
  • Example: Riding Mowers
  • Choosing k
  • Setting the Cutoff Value
  • k-NN with More Than Two Classes
  • Converting Categorical Variables to Binary Dummies
  • 7.2 k-NN for a Numerical Response
  • 7.3 Advantages and Shortcomings of k-NN Algorithms
  • Problems
  • Chapter 8: The Naive Bayes Classifier
  • 8.1 Introduction
  • Cutoff Probability Method
  • Conditional Probability
  • Example 1: Predicting Fraudulent Financial Reporting
  • 8.2 Applying the Full (Exact) Bayesian Classifier
  • Using the "Assign to the Most Probable Class" Method
  • Using the Cutoff Probability Method
  • Practical Difficulty with the Complete (Exact) Bayes Procedure
  • Solution: Naive Bayes
  • Example 2: Predicting Fraudulent Financial Reports, Two Predictors
  • Example 3: Predicting Delayed Flights
  • 8.3 Advantages and Shortcomings of the Naive Bayes Classifier
  • Problems
  • Chapter 9: Classification and Regression Trees
  • 9.1 Introduction
  • 9.2 Classification Trees
  • Recursive Partitioning
  • Example 1: Riding Mowers
  • Measures of Impurity
  • Tree Structure
  • Classifying a New Observation
  • 9.3 Evaluating the Performance of a Classification Tree
  • Example 2: Acceptance of Personal Loan
  • 9.4 Avoiding Overfitting
  • Stopping Tree Growth: CHAID
  • Pruning the Tree
  • 9.5 Classification Rules from Trees
  • 9.6 Classification Trees for More Than Two Classes
  • 9.7 Regression Trees
  • Prediction
  • Measuring Impurity
  • Evaluating Performance
  • 9.8 Advantages, Weaknesses, and Extensions
  • 9.9 Improving Prediction: Multiple Trees
  • Problems
  • Chapter 10: Logistic Regression
  • 10.1 Introduction
  • 10.2 The Logistic Regression Model
  • Example: Acceptance of Personal Loan
  • Model with a Single Predictor
  • Estimating the Logistic Model from Data: Computing Parameter Estimates
  • Interpreting Results in Terms of Odds (for a Profiling Goal)
  • 10.3 Evaluating Classification Performance
  • Variable Selection
  • 10.4 Example of Complete Analysis: Predicting Delayed Flights
  • Data Preprocessing
  • Model Fitting and Estimation
  • Model Interpretation
  • Model Performance
  • Variable Selection
  • 10.5 Appendix: Logistic Regression for Profiling
  • Appendix A: Why Linear Regression Is Problematic for a Categorical Response
  • Appendix B: Evaluating Explanatory Power
  • Appendix C: Logistic Regression for More Than Two Classes
  • Problems
  • Chapter 11: Neural Nets
  • 11.1 Introduction
  • 11.2 Concept and Structure of a Neural Network
  • 11.3 Fitting a Network to Data
  • Example 1: Tiny Dataset
  • Computing Output of Nodes
  • Preprocessing the Data
  • Training the Model
  • Example 2: Classifying Accident Severity
  • Avoiding Overfitting
  • Using the Output for Prediction and Classification
  • 11.4 Required User Input
  • 11.5 Exploring the Relationship Between Predictors and Response
  • 11.6 Advantages and Weaknesses of Neural Networks
  • Unsupervised Feature Extraction and Deep Learning
  • Problems
  • Chapter 12: Discriminant Analysis
  • 12.1 Introduction
  • Example 1: Riding Mowers
  • Example 2: Personal Loan Acceptance
  • 12.2 Distance of an Observation from a Class
  • 12.3 Fisher's Linear Classification Functions
  • 12.4 Classification Performance of Discriminant Analysis
  • 12.5 Prior Probabilities
  • 12.6 Unequal Misclassification Costs
  • 12.7 Classifying More Than Two Classes
  • Example 3: Medical Dispatch to Accident Scenes
  • 12.8 Advantages and Weaknesses
  • Problems
  • Chapter 13: Combining Methods: Ensembles and Uplift Modeling
  • 13.1 Ensembles
  • Why Ensembles Can Improve Predictive Power
  • Simple Averaging
  • Bagging
  • Boosting
  • Advantages and Weaknesses of Ensembles
  • 13.2 Uplift (Persuasion) Modeling
  • A-B Testing
  • Uplift
  • Gathering the Data
  • A Simple Model
  • Modeling Individual Uplift
  • Using the Results of an Uplift Model
  • 13.3 Summary
  • Problems
  • Part V: Mining Relationships among Records
  • Chapter 14: Association Rules and Collaborative Filtering
  • 14.1 Association Rules
  • Discovering Association Rules in Transaction Databases
  • Example 1: Synthetic Data on Purchases of Phone Faceplates
  • Generating Candidate Rules
  • The Apriori Algorithm
  • Selecting Strong Rules
  • Data Format
  • The Process of Rule Selection
  • Interpreting the Results
  • Rules and Chance
  • Example 2: Rules for Similar Book Purchases
  • 14.2 Collaborative Filtering
  • Data Type and Format
  • Example 3: Netflix Prize Contest
  • User-Based Collaborative Filtering: "People Like You"
  • Item-Based Collaborative Filtering
  • Advantages and Weaknesses of Collaborative Filtering
  • Collaborative Filtering vs. Association Rules
  • 14.3 Summary
  • Problems
  • Chapter 15: Cluster Analysis
  • 15.1 Introduction
  • Example: Public Utilities
  • 15.2 Measuring Distance Between Two Observations
  • Euclidean Distance
  • Normalizing Numerical Measurements
  • Other Distance Measures for Numerical Data
  • Distance Measures for Categorical Data
  • Distance Measures for Mixed Data
  • 15.3 Measuring Distance Between Two Clusters
  • Minimum Distance
  • Maximum Distance
  • Average Distance
  • Centroid Distance
  • 15.4 Hierarchical (Agglomerative) Clustering
  • Single Linkage
  • Complete Linkage
  • Average Linkage (in XLMiner: "Group Average Linkage")
  • Centroid Linkage
  • Ward's Method
  • Dendrograms: Displaying Clustering Process and Results
  • Validating Clusters
  • Limitations of Hierarchical Clustering
  • 15.5 Non-hierarchical Clustering: The k-Means Algorithm
  • Initial Partition into k Clusters
  • Problems
  • Part VI: Forecasting Time Series
  • Chapter 16: Handling Time Series
  • 16.1 Introduction
  • 16.2 Descriptive vs. Predictive Modeling
  • 16.3 Popular Forecasting Methods in Business
  • Combining Methods
  • 16.4 Time Series Components
  • Example: Ridership on Amtrak Trains
  • 16.5 Data Partitioning and Performance Evaluation
  • Benchmark Performance: Naive Forecasts
  • Generating Future Forecasts
  • Problems
  • Chapter 17: Regression-Based Forecasting
  • 17.1 A Model with Trend
  • Linear Trend
  • Exponential Trend
  • Polynomial Trend
  • 17.2 A Model with Seasonality
  • 17.3 A Model with Trend and Seasonality
  • 17.4 Autocorrelation and ARIMA Models
  • Computing Autocorrelation
  • Improving Forecasts by Integrating Autocorrelation Information
  • Evaluating Predictability
  • Problems
  • Chapter 18: Smoothing Methods
  • 18.1 Introduction
  • 18.2 Moving Average
  • Centered Moving Average for Visualization
  • Trailing Moving Average for Forecasting
  • Choosing Window Width (w)
  • 18.3 Simple Exponential Smoothing
  • Choosing Smoothing Parameter α
  • Relation between Moving Average and Simple Exponential Smoothing
  • 18.4 Advanced Exponential Smoothing
  • Series with a Trend
  • Series with a Trend and Seasonality
  • Series with Seasonality (No Trend)
  • Problems
  • Part VII: Data Analytics
  • Chapter 19: Social Network Analytics
  • 19.1 Introduction
  • 19.2 Directed vs. Undirected Networks
  • 19.3 Visualizing and Analyzing Networks
  • Graph Layout
  • Adjacency List
  • Adjacency Matrix
  • Using Network Data in Classification and Prediction
  • 19.4 Social Data Metrics and Taxonomy
  • Node-Level Centrality Metrics
  • Egocentric Network
  • Network Metrics
  • 19.5 Using Network Metrics in Prediction and Classification
  • Link Prediction
  • Entity Resolution
  • Collaborative Filtering
  • 19.6 Advantages and Disadvantages
  • Problems
  • Chapter 20: Text Mining
  • 20.1 Introduction
  • 20.2 The Spreadsheet Representation of Text: "Bag-of-Words"
  • 20.3 Bag-of-Words vs. Meaning Extraction at Document Level
  • 20.4 Preprocessing the Text
  • Tokenization
  • Text Reduction
  • Presence/Absence vs. Frequency
  • Term Frequency-Inverse Document Frequency (TF-IDF)
  • From Terms to Concepts: Latent Semantic Indexing
  • Extracting Meaning
  • 20.5 Implementing Data Mining Methods
  • 20.6 Example: Online Discussions on Autos and Electronics
  • Importing and Labeling the Records
  • Tokenization
  • Text Processing and Reduction
  • Producing a Concept Matrix
  • Labeling the Documents
  • Fitting a Model
  • Prediction
  • 20.7 Summary
  • Problems
  • Part VIII: Cases
  • Chapter 21: Cases
  • 21.1 Charles Book Club
  • The Book Industry
  • Database Marketing at Charles
  • Data Mining Techniques
  • Assignment
  • 21.2 German Credit
  • Background
  • Data
  • Assignment
  • 21.3 Tayko Software Cataloger
  • Background
  • The Mailing Experiment
  • Data
  • Assignment
  • 21.4 Political Persuasion
  • Background
  • Predictive Analytics Arrives in US Politics
  • Political Targeting
  • Uplift
  • Data
  • Assignment
  • 21.5 Taxi Cancellations
  • Business Situation
  • Assignment
  • 21.6 Segmenting Consumers of Bath Soap
  • Business Situation
  • Key Problems
  • Data
  • Measuring Brand Loyalty
  • Assignment
  • Appendix
  • 21.7 Direct-Mail Fundraising
  • Background
  • Data
  • Assignment
  • 21.8 Catalog Cross-Selling
  • Background
  • Assignment
  • 21.9 Predicting Bankruptcy
  • Predicting Corporate Bankruptcy
  • Assignment
  • 21.10 Time Series Case: Forecasting Public Transportation Demand
  • Background
  • Problem Description
  • Available Data
  • Assignment Goal
  • Assignment
  • Tips and Suggested Steps
  • References
  • Data Files Used in the Book
  • Index
  • EULA

Chapter 1
Introduction


1.1 What Is Business Analytics?


Business analytics (BA) is the practice and art of bringing quantitative data to bear on decision making. The term means different things to different organizations.

Consider the role of analytics in helping newspapers survive the transition to a digital world. One tabloid newspaper with a working-class readership in Britain had launched a web version of the paper, and ran tests on its home page to determine which images produced more hits: cats, dogs, or monkeys. This simple application, for this company, was considered analytics. By contrast, the Washington Post has a highly influential audience that is of interest to big defense contractors: it is perhaps the only newspaper where you routinely see advertisements for aircraft carriers. In the digital environment, the Post can track readers by time of day, location, and user subscription information. In this fashion, the display of the aircraft carrier advertisement in the online paper may be focused on a very small group of individuals, say, the members of the House and Senate Armed Services Committees who will be voting on the Pentagon's budget.

Business Analytics, or more generically, analytics, includes a range of data analysis methods. Many powerful applications involve little more than counting, rule checking, and basic arithmetic. For some organizations, this is what is meant by analytics.

The next level of business analytics, now termed business intelligence (BI), refers to data visualization and reporting for understanding "what happened and what is happening." This is done by use of charts, tables, and dashboards to display, examine, and explore data. BI, which earlier consisted mainly of generating static reports, has evolved into more user-friendly and effective tools and practices, such as creating interactive dashboards that allow the user not only to access real-time data but also to directly interact with it. Effective dashboards are those that tie directly into company data, and give managers a tool to quickly see what might not readily be apparent in a large complex database. One such tool for industrial operations managers displays customer orders in a single two-dimensional display, using color and bubble size as added variables, showing customer name, type of product, size of order, and length of time to produce.

Business Analytics now typically includes BI as well as sophisticated data analysis methods, such as statistical models and data mining algorithms used for exploring data, quantifying and explaining relationships between measurements, and predicting new records. Methods like regression models are used to describe and quantify "on average" relationships (e.g., between advertising and sales), to predict new records (e.g., whether a new patient will react positively to a medication), and to forecast future values (e.g., next week's web traffic).
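
The "on average" relationship and the prediction of a new record can be sketched in a few lines. The sketch below is not from the book (which uses XLMiner in Excel): the advertising and sales figures are invented, and a one-predictor least-squares fit is coded by hand to keep it self-contained.

```python
# Minimal sketch: fit y = b0 + b1*x by ordinary least squares and use it
# both to describe an "on average" relationship and to predict a new record.
# All numbers are hypothetical.

def fit_simple_regression(x, y):
    """Return intercept b0 and slope b1 of the least-squares line."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    b1 = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
          / sum((xi - mean_x) ** 2 for xi in x))
    b0 = mean_y - b1 * mean_x
    return b0, b1

ad_spend = [1.0, 2.0, 3.0, 4.0, 5.0]    # hypothetical ad budgets
sales = [2.1, 3.9, 6.2, 7.8, 10.1]      # hypothetical sales

b0, b1 = fit_simple_regression(ad_spend, sales)
predicted = b0 + b1 * 6.0               # predicted sales at a new budget of 6.0
```

Here the fitted slope b1 quantifies the average change in sales per unit of advertising, and the last line predicts a new record: the two uses the text distinguishes.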

Readers familiar with earlier editions of this book might have noticed that the book title changed from Data Mining for Business Intelligence to Data Mining for Business Analytics in this edition. The change reflects the more recent term BA, which overtook the earlier term BI to denote advanced analytics. Today, BI is used to refer to data visualization and reporting.

Who Uses Predictive Analytics?


The widespread adoption of predictive analytics, coupled with the accelerating availability of data, has increased organizations' capabilities throughout the economy. A few examples:

Credit scoring: One long-established use of predictive modeling techniques for business prediction is credit scoring. A credit score is not some arbitrary judgment of creditworthiness; it is based mainly on a predictive model that uses prior data to predict repayment behavior.

Future purchases: A more recent (and controversial) example is Target's use of predictive modeling to classify sales prospects as "pregnant" or "not-pregnant." Those classified as pregnant could then be sent sales promotions at an early stage of pregnancy, giving Target a head start on a significant purchase stream.

Tax evasion: The US Internal Revenue Service found it was 25 times more likely to find tax evasion when enforcement activity was based on predictive models, allowing agents to focus on the most likely tax cheats (Siegel, 2013).

The business analytics toolkit also includes statistical experiments, the most common of which is known to marketers as A-B testing. These are often used for pricing decisions:

  • Orbitz, the travel site, found that it could price hotel options higher for Mac users than for Windows users.
  • Staples' online store found it could charge more for staplers if a customer lived far from a Staples store.
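
A-B tests like these reduce, at their core, to comparing two observed rates. Below is a hedged sketch (the conversion counts are invented) using a permutation test, a resampling approach: shuffle the pooled outcomes many times and ask how often chance alone produces a difference as large as the observed one.

```python
# Sketch of an A-B test on conversion rates via a permutation test.
# Counts are hypothetical: variant A converts 50/1000, variant B 70/1000.
import random

def perm_test_diff(conv_a, n_a, conv_b, n_b, n_perm=2000, seed=1):
    """Two-sided p-value for the observed difference in conversion rates."""
    outcomes = ([1] * conv_a + [0] * (n_a - conv_a)
                + [1] * conv_b + [0] * (n_b - conv_b))
    observed = conv_b / n_b - conv_a / n_a
    rng = random.Random(seed)
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(outcomes)  # re-deal outcomes to the two groups at random
        diff = sum(outcomes[n_a:]) / n_b - sum(outcomes[:n_a]) / n_a
        if abs(diff) >= abs(observed):
            extreme += 1
    return extreme / n_perm

p_value = perm_test_diff(50, 1000, 70, 1000)
```

A small p-value suggests the difference between the variants is unlikely to be chance alone, which is the judgment the pricing decisions above rest on.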

Beware the organizational setting where analytics is a solution in search of a problem: A manager, knowing that business analytics and data mining are hot areas, decides that her organization must deploy them too, to capture that hidden value that must be lurking somewhere. Successful use of analytics and data mining requires both an understanding of the business context where value is to be captured, and an understanding of exactly what the data mining methods do.

1.2 What Is Data Mining?


In this book, data mining refers to business analytics methods that go beyond counts, descriptive techniques, reporting, and methods based on business rules. While we do introduce data visualization, which is commonly the first step into more advanced analytics, the book focuses mostly on the more advanced data analytics tools. Specifically, it includes statistical and machine-learning methods that inform decision making, often in automated fashion. Prediction is typically an important component, often at the individual level. Rather than "what is the relationship between advertising and sales," we might be interested in "what specific advertisement, or recommended product, should be shown to a given online shopper at this moment?" Or we might be interested in clustering customers into different "personas" that receive different marketing treatment, then assigning each new prospect to one of these personas.
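
The persona idea in the last sentence can be made concrete. The code below is an illustration only (not the book's XLMiner workflow): it clusters invented two-feature customer records with a tiny hand-rolled k-means, then assigns a new prospect to the nearest cluster.

```python
# Sketch: cluster customers into "personas" with a tiny k-means, then
# assign a new prospect to the nearest persona. Data are hypothetical
# (age, monthly spend) pairs; initialization is naive (first k points).
import math

def kmeans(points, k, n_iter=20):
    centroids = [tuple(p) for p in points[:k]]
    for _ in range(n_iter):
        clusters = [[] for _ in range(k)]
        for p in points:               # assign each point to its nearest centroid
            i = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[i].append(p)
        centroids = [                  # recompute each centroid as its cluster mean
            tuple(sum(dim) / len(cl) for dim in zip(*cl)) if cl else centroids[j]
            for j, cl in enumerate(clusters)
        ]
    return centroids

def assign(point, centroids):
    """Index of the nearest persona centroid."""
    return min(range(len(centroids)), key=lambda c: math.dist(point, centroids[c]))

customers = [(25, 40), (27, 45), (24, 42), (55, 90), (60, 95), (58, 88)]
personas = kmeans(customers, k=2)
new_prospect_persona = assign((26, 43), personas)  # lands in the lower-spend group
```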

The era of big data has accelerated the use of data mining. Data mining methods, with their power and automaticity, have the ability to cope with huge amounts of data and extract value.

1.3 Data Mining and Related Terms


The field of analytics is growing rapidly, both in terms of the breadth of applications, and in terms of the number of organizations using advanced analytics. As a result there is considerable overlap and inconsistency of definitions.

The term data mining itself means different things to different people. To the general public, it may have a general, somewhat hazy and pejorative meaning of digging through vast stores of (often personal) data in search of something interesting. One major consulting firm has a "data mining department," but its responsibilities are in the area of studying and graphing past data in search of general trends. And, to confuse matters, their more advanced predictive models are the responsibility of an "advanced analytics department." Other terms that organizations use are predictive analytics, predictive modeling, and machine learning.

Data mining stands at the confluence of the fields of statistics and machine learning (also known as artificial intelligence). A variety of techniques for exploring data and building models have been around for a long time in the world of statistics: linear regression, logistic regression, discriminant analysis, and principal components analysis, for example. But the core tenets of classical statistics (computing is difficult and data are scarce) do not apply in data mining applications, where both data and computing power are plentiful.

This gives rise to Daryl Pregibon's description of data mining as "statistics at scale and speed" (Pregibon, 1999). Another major difference between the fields of statistics and machine learning is the focus in statistics on inference from a sample to the population regarding an "average effect": for example, "a $1 price increase will reduce average demand by 2 boxes." In contrast, the focus in machine learning is on predicting individual records: "the predicted demand for person i given a $1 price increase is 1 box, while for person j it is 3 boxes." The emphasis that classical statistics places on inference (determining whether a pattern or interesting result might have happened by chance in our sample) is absent from data mining.

In comparison to statistics, data mining deals with large datasets in an open-ended fashion, making it impossible to put the strict limits around the question being addressed that inference would require. As a result the general approach to data mining is vulnerable to the danger of overfitting, where a model is fit so closely to the available sample of data that it describes not merely structural characteristics of the data but random peculiarities as well. In engineering terms, the model is fitting the noise, not just the signal.
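
Overfitting is easy to demonstrate with a model that memorizes its training data. In the hedged sketch below (all numbers invented), a 1-nearest-neighbor "memorizer" achieves zero error on the training partition, precisely because it reproduces the noise, yet errs on a validation partition it has never seen:

```python
# Sketch of overfitting: a memorizing model (1-nearest-neighbor lookup)
# has zero training error but nonzero validation error.
# The (x, y) pairs are hypothetical noisy observations around a trend.

train = [(1, 2.2), (2, 3.8), (3, 6.3), (4, 7.7), (5, 10.2)]
valid = [(1.5, 3.1), (3.5, 7.0), (4.5, 9.1)]

def memorizer(x, data):
    """Predict the y of the training point whose x is nearest."""
    return min(data, key=lambda p: abs(p[0] - x))[1]

def rmse(model, data):
    return (sum((model(x) - y) ** 2 for x, y in data) / len(data)) ** 0.5

train_err = rmse(lambda x: memorizer(x, train), train)  # exactly 0.0
valid_err = rmse(lambda x: memorizer(x, train), valid)  # clearly above 0
```

The gap between train_err and valid_err is exactly what the book's practice of partitioning data into training and validation sets is designed to expose.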

In this book, we use the term machine learning to refer to algorithms that learn directly from data, especially local patterns, often in layered or iterative fashion. In contrast, we use statistical models to refer to...

File format: EPUB
Copy protection: Adobe DRM (Digital Rights Management)

System requirements:

Computer (Windows; macOS; Linux): Install the free Adobe Digital Editions software before downloading (see e-book help).

Tablet/smartphone (Android; iOS): Install the free Adobe Digital Editions app before downloading (see e-book help).

E-book reader: Bookeen, Kobo, PocketBook, Sony, Tolino, and many others (not Kindle)

The EPUB format is well suited to novels and non-fiction, that is, to "flowing" text without a complex layout. On e-readers and smartphones, line and page breaks adapt automatically to small displays. Adobe DRM applies "hard" copy protection: if the necessary prerequisites are not met, you will not be able to open the e-book, so prepare your reading hardware before downloading.

Further information can be found in our e-book help.


Download (available immediately)

€103.99
incl. 19% VAT
Download / single-user license
ePUB with Adobe DRM
see system requirements