Doing Data Science

Name: Doing Data Science | Straight Talk from the Frontline
Brand: O'Reilly
Price: 31.49 EUR
Availability: OnlineOnly

Straight Talk from the Frontline

Cathy O'Neil (Author)

O'Reilly (Publisher)

Published on October 9, 2013

408 pages

E-Book

ePUB with Adobe DRM

978-1-4493-6389-5 (ISBN)

€31.49 incl. 7% VAT

E-book single license

Available for download

Contents

Intro
Copyright
Table of Contents
Preface
Motivation
Origins of the Class
Origins of the Book
What to Expect from This Book
How This Book Is Organized
How to Read This Book
How Code Is Used in This Book
Who This Book Is For
Prerequisites
Supplemental Reading
About the Contributors
Conventions Used in This Book
Using Code Examples
Safari® Books Online
How to Contact Us
Acknowledgments
Chapter 1. Introduction: What Is Data Science?
Big Data and Data Science Hype
Getting Past the Hype
Why Now?
Datafication
The Current Landscape (with a Little History)
Data Science Jobs
A Data Science Profile
Thought Experiment: Meta-Definition
OK, So What Is a Data Scientist, Really?
In Academia
In Industry
Chapter 2. Statistical Inference, Exploratory Data Analysis, and the Data Science Process
Statistical Thinking in the Age of Big Data
Statistical Inference
Populations and Samples
Populations and Samples of Big Data
Big Data Can Mean Big Assumptions
Modeling
Exploratory Data Analysis
Philosophy of Exploratory Data Analysis
Exercise: EDA
The Data Science Process
A Data Scientist's Role in This Process
Thought Experiment: How Would You Simulate Chaos?
Case Study: RealDirect
How Does RealDirect Make Money?
Exercise: RealDirect Data Strategy
Chapter 3. Algorithms
Machine Learning Algorithms
Three Basic Algorithms
Linear Regression
k-Nearest Neighbors (k-NN)
k-means
Exercise: Basic Machine Learning Algorithms
Solutions
Summing It All Up
Thought Experiment: Automated Statistician
Chapter 4. Spam Filters, Naive Bayes, and Wrangling
Thought Experiment: Learning by Example
Why Won't Linear Regression Work for Filtering Spam?
How About k-nearest Neighbors?
Naive Bayes
Bayes Law
A Spam Filter for Individual Words
A Spam Filter That Combines Words: Naive Bayes
Fancy It Up: Laplace Smoothing
Comparing Naive Bayes to k-NN
Sample Code in bash
Scraping the Web: APIs and Other Tools
Jake's Exercise: Naive Bayes for Article Classification
Sample R Code for Dealing with the NYT API
Chapter 5. Logistic Regression
Thought Experiments
Classifiers
Runtime
You
Interpretability
Scalability
M6D Logistic Regression Case Study
Click Models
The Underlying Math
Estimating α and β
Newton's Method
Stochastic Gradient Descent
Implementation
Evaluation
Media 6 Degrees Exercise
Sample R Code
Chapter 6. Time Stamps and Financial Modeling
Kyle Teague and GetGlue
Timestamps
Exploratory Data Analysis (EDA)
Metrics and New Variables or Features
What's Next?
Cathy O'Neil
Thought Experiment
Financial Modeling
In-Sample, Out-of-Sample, and Causality
Preparing Financial Data
Log Returns
Example: The S&P Index
Working out a Volatility Measurement
Exponential Downweighting
The Financial Modeling Feedback Loop
Why Regression?
Adding Priors
A Baby Model
Exercise: GetGlue and Timestamped Event Data
Exercise: Financial Data
Chapter 7. Extracting Meaning from Data
William Cukierski
Background: Data Science Competitions
Background: Crowdsourcing
The Kaggle Model
A Single Contestant
Their Customers
Thought Experiment: What Are the Ethical Implications of a Robo-Grader?
Feature Selection
Example: User Retention
Filters
Wrappers
Embedded Methods: Decision Trees
Entropy
The Decision Tree Algorithm
Handling Continuous Variables in Decision Trees
Random Forests
User Retention: Interpretability Versus Predictive Power
David Huffaker: Google's Hybrid Approach to Social Research
Moving from Descriptive to Predictive
Social at Google
Privacy
Thought Experiment: What Is the Best Way to Decrease Concern and Increase Understanding and Control?
Chapter 8. Recommendation Engines: Building a User-Facing Data Product at Scale
A Real-World Recommendation Engine
Nearest Neighbor Algorithm Review
Some Problems with Nearest Neighbors
Beyond Nearest Neighbor: Machine Learning Classification
The Dimensionality Problem
Singular Value Decomposition (SVD)
Important Properties of SVD
Principal Component Analysis (PCA)
Alternating Least Squares
Fix V and Update U
Last Thoughts on These Algorithms
Thought Experiment: Filter Bubbles
Exercise: Build Your Own Recommendation System
Sample Code in Python
Chapter 9. Data Visualization and Fraud Detection
Data Visualization History
Gabriel Tarde
Mark's Thought Experiment
What Is Data Science, Redux?
Processing
Franco Moretti
A Sample of Data Visualization Projects
Mark's Data Visualization Projects
New York Times Lobby: Moveable Type
Project Cascade: Lives on a Screen
Cronkite Plaza
eBay Transactions and Books
Public Theater Shakespeare Machine
Goals of These Exhibits
Data Science and Risk
About Square
The Risk Challenge
The Trouble with Performance Estimation
Model Building Tips
Data Visualization at Square
Ian's Thought Experiment
Data Visualization for the Rest of Us
Data Visualization Exercise
Chapter 10. Social Networks and Data Journalism
Social Network Analysis at Morningside Analytics
Case-Attribute Data versus Social Network Data
Social Network Analysis
Terminology from Social Networks
Centrality Measures
The Industry of Centrality Measures
Thought Experiment
Morningside Analytics
How Visualizations Help Us Find Schools of Fish
More Background on Social Network Analysis from a Statistical Point of View
Representations of Networks and Eigenvalue Centrality
A First Example of Random Graphs: The Erdős-Rényi Model
A Second Example of Random Graphs: The Exponential Random Graph Model
Data Journalism
A Bit of History on Data Journalism
Writing Technical Journalism: Advice from an Expert
Chapter 11. Causality
Correlation Doesn't Imply Causation
Asking Causal Questions
Confounders: A Dating Example
OK Cupid's Attempt
The Gold Standard: Randomized Clinical Trials
A/B Tests
Second Best: Observational Studies
Simpson's Paradox
The Rubin Causal Model
Visualizing Causality
Definition: The Causal Effect
Three Pieces of Advice
Chapter 12. Epidemiology
Madigan's Background
Thought Experiment
Modern Academic Statistics
Medical Literature and Observational Studies
Stratification Does Not Solve the Confounder Problem
What Do People Do About Confounding Things in Practice?
Is There a Better Way?
Research Experiment (Observational Medical Outcomes Partnership)
Closing Thought Experiment
Chapter 13. Lessons Learned from Data Competitions: Data Leakage and Model Evaluation
Claudia's Data Scientist Profile
The Life of a Chief Data Scientist
On Being a Female Data Scientist
Data Mining Competitions
How to Be a Good Modeler
Data Leakage
Market Predictions
Amazon Case Study: Big Spenders
A Jewelry Sampling Problem
IBM Customer Targeting
Breast Cancer Detection
Pneumonia Prediction
How to Avoid Leakage
Evaluating Models
Accuracy: Meh
Probabilities Matter, Not 0s and 1s
Choosing an Algorithm
A Final Example
Parting Thoughts
Chapter 14. Data Engineering: MapReduce, Pregel, and Hadoop
About David Crawshaw
Thought Experiment
MapReduce
Word Frequency Problem
Enter MapReduce
Other Examples of MapReduce
What Can't MapReduce Do?
Pregel
About Josh Wills
Thought Experiment
On Being a Data Scientist
Data Abundance Versus Data Scarcity
Designing Models
Economic Interlude: Hadoop
A Brief Introduction to Hadoop
Cloudera
Back to Josh: Workflow
So How to Get Started with Hadoop?
Chapter 15. The Students Speak
Process Thinking
Naive No Longer
Helping Hands
Your Mileage May Vary
Bridging Tunnels
Some of Our Work
Chapter 16. Next-Generation Data Scientists, Hubris, and Ethics
What Just Happened?
What Is Data Science (Again)?
What Are Next-Gen Data Scientists?
Being Problem Solvers
Cultivating Soft Skills
Being Question Askers
Being an Ethical Data Scientist
Career Advice
Index
About the Authors
