"Turn yourself into a Data Head. You'll become a more valuable employee and make your organization more successful." Thomas H. Davenport, Research Fellow, Author of Competing on Analytics, Big Data @ Work, and The AI Advantage
You've heard the hype around data. Now get the facts.
In Becoming a Data Head: How to Think, Speak, and Understand Data Science, Statistics, and Machine Learning, award-winning data scientists Alex Gutman and Jordan Goldmeier pull back the curtain on data science and give you the language and tools necessary to talk and think critically about it.
Becoming a Data Head is a complete guide to data science in the workplace, covering everything from the personalities you'll work with to the math behind the algorithms. The authors have spent years in the data trenches and set out to create a fun, approachable, and eminently readable book. Anyone can become a Data Head: an active participant in data science, statistics, and machine learning. Whether you're a business professional, engineer, executive, or aspiring data scientist, this book is for you.
ALEX J. GUTMAN, PhD, is a Data Scientist, Corporate Trainer, and Accredited Professional Statistician. His professional focus is on statistical and machine learning methods, and he has extensive experience working as a Data Scientist for the Department of Defense and two Fortune 50 companies.
JORDAN GOLDMEIER is a Data Scientist, author, speaker, and community leader. He is a seven-time recipient of the Microsoft Most Valuable Professional Award and he has taught analytics to members of the Pentagon and Fortune 500 companies.
Acknowledgments
Foreword
Introduction
Part One Thinking Like a Data Head
Chapter 1 What Is the Problem?
Questions a Data Head Should Ask
Why Is This Problem Important?
Who Does This Problem Affect?
What If We Don't Have the Right Data?
When Is the Project Over?
What If We Don't Like the Results?
Understanding Why Data Projects Fail
Customer Perception
Discussion
Working on Problems That Matter
Chapter Summary
Chapter 2 What Is Data?
Data vs. Information
An Example Dataset
Data Types
How Data Is Collected and Structured
Observational vs. Experimental Data
Structured vs. Unstructured Data
Basic Summary Statistics
Chapter Summary
Chapter 3 Prepare to Think Statistically
Ask Questions
There Is Variation in All Things
Scenario: Customer Perception (The Sequel)
Case Study: Kidney-Cancer Rates
Probabilities and Statistics
Probability vs. Intuition
Discovery with Statistics
Chapter Summary
Part Two Speaking Like a Data Head
Chapter 4 Argue with the Data
What Would You Do?
Missing Data Disaster
Tell Me the Data Origin Story
Who Collected the Data?
How Was the Data Collected?
Is the Data Representative?
Is There Sampling Bias?
What Did You Do with Outliers?
What Data Am I Not Seeing?
How Did You Deal with Missing Values?
Can the Data Measure What You Want It to Measure?
Argue with Data of All Sizes
Chapter Summary
Chapter 5 Explore the Data
Exploratory Data Analysis and You
Embracing the Exploratory Mindset
Questions to Guide You
The Setup
Can the Data Answer the Question?
Set Expectations and Use Common Sense
Do the Values Make Intuitive Sense?
Watch Out: Outliers and Missing Values
Did You Discover Any Relationships?
Understanding Correlation
Watch Out: Misinterpreting Correlation
Watch Out: Correlation Does Not Imply Causation
Did You Find New Opportunities in the Data?
Chapter Summary
Chapter 6 Examine the Probabilities
Take a Guess
The Rules of the Game
Notation
Conditional Probability and Independent Events
The Probability of Multiple Events
Two Things That Happen Together
One Thing or the Other
Probability Thought Exercise
Next Steps
Be Careful Assuming Independence
Don't Fall for the Gambler's Fallacy
All Probabilities Are Conditional
Don't Swap Dependencies
Bayes' Theorem
Ensure the Probabilities Have Meaning
Calibration
Rare Events Can, and Do, Happen
Chapter Summary
Chapter 7 Challenge the Statistics
Quick Lessons on Inference
Give Yourself Some Wiggle Room
More Data, More Evidence
Challenge the Status Quo
Evidence to the Contrary
Balance Decision Errors
The Process of Statistical Inference
The Questions You Should Ask to Challenge the Statistics
What Is the Context for These Statistics?
What Is the Sample Size?
What Are You Testing?
What Is the Null Hypothesis?
Assuming Equivalence
What Is the Significance Level?
How Many Tests Are You Doing?
Can I See the Confidence Intervals?
Is This Practically Significant?
Are You Assuming Causality?
Chapter Summary
Part Three Understanding the Data Scientist's Toolbox
Chapter 8 Search for Hidden Groups
Unsupervised Learning
Dimensionality Reduction
Creating Composite Features
Principal Component Analysis
Principal Components in Athletic Ability
PCA Summary
Potential Traps
Clustering
k-Means Clustering
Clustering Retail Locations
Potential Traps
Chapter Summary
Chapter 9 Understand the Regression Model
Supervised Learning
Linear Regression: What It Does
Least Squares Regression: Not Just a Clever Name
Linear Regression: What It Gives You
Extending to Many Features
Linear Regression: What Confusion It Causes
Omitted Variables
Multicollinearity
Data Leakage
Extrapolation Failures
Many Relationships Aren't Linear
Are You Explaining or Predicting?
Regression Performance
Other Regression Models
Chapter Summary
Chapter 10 Understand the Classification Model
Introduction to Classification
What You'll Learn
Classification Problem Setup
Logistic Regression
Logistic Regression: So What?
Decision Trees
Ensemble Methods
Random Forests
Gradient Boosted Trees
Interpretability of Ensemble Models
Watch Out for Pitfalls
Misapplication of the Problem
Data Leakage
Not Splitting Your Data
Choosing the Right Decision Threshold
Misunderstanding Accuracy
Confusion Matrices
Chapter Summary
Chapter 11 Understand Text Analytics
Expectations of Text Analytics
How Text Becomes Numbers
A Big Bag of Words
N-Grams
Word Embeddings
Topic Modeling
Text Classification
Naïve Bayes
Sentiment Analysis
Practical Considerations When Working with Text
Big Tech Has the Upper Hand
Chapter Summary
Chapter 12 Conceptualize Deep Learning
Neural Networks
How Are Neural Networks Like the Brain?
A Simple Neural Network
How a Neural Network Learns
A Slightly More Complex Neural Network
Applications of Deep Learning
The Benefits of Deep Learning
How Computers "See" Images
Convolutional Neural Networks
Deep Learning on Language and Sequences
Deep Learning in Practice
Do You Have Data?
Is Your Data Structured?
What Will the Network Look Like?
Artificial Intelligence and You
Big Tech Has the Upper Hand
Ethics in Deep Learning
Chapter Summary
Part Four Ensuring Success
Chapter 13 Watch Out for Pitfalls
Biases and Weird Phenomena in Data
Survivorship Bias
Regression to the Mean
Simpson's Paradox
Confirmation Bias
Effort Bias (aka the "Sunk Cost Fallacy")
Algorithmic Bias
Uncategorized Bias
The Big List of Pitfalls
Statistical and Machine Learning Pitfalls
Project Pitfalls
Chapter Summary
Chapter 14 Know the People and Personalities
Seven Scenes of Communication Breakdowns
The Postmortem
Storytime
The Telephone Game
Into the Weeds
The Reality Check
The Takeover
The Blowhard
Data Personalities
Data Enthusiasts
Data Cynics
Data Heads
Chapter Summary
Chapter 15 What's Next?
Index
Data is perhaps the single most important aspect of your job, whether you want it to be or not. And you're likely reading this book because you want to understand what it's all about.
To begin, it's worth stating what has almost become cliché: we create and consume more information than ever before. Without a doubt, we are in the age of data. And this age of data has created an entire industry of promises, buzzwords, and products, many of which you, your managers, colleagues, and subordinates are or will be using. But despite the claims and the proliferation of data promises and products, data science projects are failing at alarming rates.1
To be sure, we're not saying all data promises are empty or all products are terrible. Rather, to truly get your head around this space, you must embrace a fundamental truth: this stuff is complex. Working with data is about numbers, nuance, and uncertainty. Data is important, yes, but it's rarely simple. And yet, there is an entire industry that would have us think otherwise. An industry that promises certainty in an uncertain world and plays on companies' fear of missing out. We, the authors, call this the Data Science Industrial Complex.
It's a problem for everyone involved. Businesses endlessly pursue products that will do their thinking for them. Managers hire analytics professionals who really aren't. Data scientists are hired to work in companies that aren't ready for them. Executives are forced to listen to technobabble and pretend to understand. Projects stall. Money is wasted.
Meanwhile, the Data Science Industrial Complex is churning out new concepts faster than our ability to define and articulate the opportunities (and problems) they create. Blink, and you'll miss one. When your authors started working together, Big Data was all the rage. As time went on, data science became the hot new topic. Since then, machine learning, deep learning, and artificial intelligence have become the next focus.
To the curious and critical thinkers among us, something doesn't sit well. Are the problems really new? Or are these new definitions just rebranding old problems?
The answer, of course, is yes to both.
But the bigger question we hope you're asking yourself is, How can I think and speak critically about data?
Let us show you how.
By reading this book, you'll learn the tools, terms, and thinking necessary to navigate the Data Science Industrial Complex. You'll understand data and its challenges at a deeper level. You'll be able to think critically about the data and results you come across, and you'll be able to speak intelligently about all things data.
In short, you'll become a Data Head.
Before we get into the details, it's worth discussing why your authors, Alex and Jordan, care so much about this topic. In this section, we share two important examples of how data affected society at large and impacted us personally.
We were fresh out of college when the subprime mortgage crisis hit. We both landed jobs with the Air Force in 2009, at a time when jobs were hard to find. We were both lucky. We had an in-demand skill: working with data. We had our hands in data every single day, working to operationalize research from Air Force analysts and scientists into products the government could use. Our hiring would be a harbinger of the focus the country would soon place on the types of roles we filled. As two data workers, we looked on the mortgage crisis with interest and curiosity.
The subprime mortgage crisis had many contributing factors.2 In offering it as an example here, we don't mean to discount the others. But put simply, we see it as a major data failure. Banks and investors created models to understand the value of mortgage-backed collateralized debt obligations (CDOs). You might remember those as the investment vehicles behind the United States' market collapse.
Mortgage-backed CDOs were thought to be a safe investment because they spread the risk associated with loan default across multiple investment units. The idea was that in a portfolio of mortgages, if only a few went into default, this would not materially affect the underlying value of the entire portfolio.
And yet, upon reflection, we know that some fundamental underlying assumptions were wrong. Chief among them was the assumption that defaults were independent events: if Person A defaulted on a loan, it wouldn't affect Person B's risk of default. We would all soon learn that defaults functioned more like dominoes, where one default could predict further defaults. When one mortgage defaulted, the property values surrounding the home dropped, and the risk of default on those homes increased. The default effectively dragged the neighboring houses down into a sinkhole.
Assuming independence when events are in fact connected is a common error in statistics.
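To make the independence point concrete, here is a minimal simulation sketch. It is our illustration, not the book's, and the portfolio size, baseline default rate, and "contagion" increment are made-up assumptions. It compares how often more than 20% of a 100-loan portfolio defaults when defaults are independent versus when each default raises the risk for the loans that follow.

import random

random.seed(42)

N_LOANS = 100      # mortgages in the portfolio (illustrative assumption)
P_DEFAULT = 0.05   # baseline chance any single loan defaults (illustrative)
CONTAGION = 0.03   # extra risk added after each default (illustrative)
TRIALS = 10_000    # number of simulated portfolios

def share_of_bad_portfolios(correlated):
    # Return the fraction of simulated portfolios in which more than
    # 20% of the loans default.
    bad = 0
    for _ in range(TRIALS):
        p = P_DEFAULT
        defaults = 0
        for _ in range(N_LOANS):
            if random.random() < p:
                defaults += 1
                if correlated:
                    # The domino effect: each default raises the
                    # default risk of every remaining loan.
                    p = min(1.0, p + CONTAGION)
        if defaults > 0.2 * N_LOANS:
            bad += 1
    return bad / TRIALS

print("P(>20% of loans default), independent:", share_of_bad_portfolios(False))
print("P(>20% of loans default), correlated: ", share_of_bad_portfolios(True))

Under independence, a mass default is vanishingly rare; with even mild contagion, it becomes a realistic outcome. That gap is precisely the kind of error the CDO models made.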
But let's go further into this story. Investment banks created models that overvalued these investments. A model, which we'll talk about later in the book, is a deliberate oversimplification of reality. It uses assumptions about the real world in an attempt to understand and make predictions about certain phenomena.
And who were these people who created and understood these models? They were the people who would lay the groundwork for what today we call the data scientist. Our kind of people. Statisticians, economists, physicists: folks who did machine learning, artificial intelligence, and statistics. They worked with data. And they were smart. Super smart.
And yet, something went wrong. Did they not ask the correct questions of their work? Were disclosures of risk lost in a game of telephone from the analysts to the decision makers, with uncertainty being stripped away piece by piece, giving an illusion of a perfectly predictable housing market? Did the people involved flat-out lie about the results?
More personal to us, how could we avoid similar mistakes in our own work?
We had many questions and could only speculate about the answers, but one thing was clear: this was a large-scale data disaster at work. And it wouldn't be the last.
On November 8, 2016, the Republican candidate, Donald J. Trump, won the United States general election, beating the presumed front-runner and Democratic challenger, Hillary Clinton. For political pollsters, this came as a shock. Their models hadn't predicted his win. And this was supposed to be the year for election prediction.
In 2008, Nate Silver's FiveThirtyEight blog (then part of The New York Times) had done a fantastic job predicting Barack Obama's win. At the time, pundits were skeptical that his forecasting algorithm could accurately predict the election. In 2012, once again, Nate Silver was front and center predicting another win for Barack Obama.
By this point, the business world was starting to embrace data and hire data scientists. Nate Silver's successful prediction of Barack Obama's reelection only reinforced the importance, and perhaps the oracle-like abilities, of forecasting with data. Articles in business magazines warned executives to adopt data or be swallowed by a data-driven competitor. The Data Science Industrial Complex was in full force.
By 2016, every major news outlet had invested in a prediction algorithm to forecast the outcome of the general election. The vast majority of them suggested an overwhelming victory for the Democratic candidate, Hillary Clinton. Oh, how wrong they were.
Let's put how wrong they were in perspective by comparing this failure with the subprime mortgage crisis. One could argue that we had learned a lot from the past, and that the interest in data science would help us avoid old mistakes. Yes, it's true: since 2008 (and 2012), news organizations had hired data scientists, invested in polling research, created data teams, and spent more money ensuring they received good data.
Which raises the question: with all that time, money, effort, and education, what happened?3
Why do data problems like this occur? We assign three causes: hard problems, lack of critical thinking, and poor communication.
First (as we said earlier), this stuff is complex. Many data problems are fundamentally difficult. Even with lots of data, the right tools and techniques, and the smartest analysts, mistakes happen. Predictions can and will be wrong. This is not a criticism of data and statistics. It's simply reality.
Second, some analysts and stakeholders stopped thinking critically about data problems. The Data Science Industrial Complex, in its hubris, painted a picture of certainty and simplicity, and a subset of people drank the proverbial "Kool-Aid." Perhaps it's human nature: people don't want to admit they don't know what is going to happen. But a key part of thinking about and using data correctly is recognizing that wrong decisions can happen. This means communicating and understanding risks and uncertainties. Somehow this message got lost. While we'd hope that the tremendous progress in research and methods in data and analysis would sharpen everyone's critical thinking, it caused some to turn theirs off.
The third reason...