"Turn yourself into a Data Head. You'll become a more valuable employee and make your organization more successful." Thomas H. Davenport, Research Fellow, Author of Competing on Analytics, Big Data @ Work, and The AI Advantage
You've heard the hype around data. Now get the facts.
In Becoming a Data Head: How to Think, Speak, and Understand Data Science, Statistics, and Machine Learning, award-winning data scientists Alex Gutman and Jordan Goldmeier pull back the curtain on data science and give you the language and tools necessary to talk and think critically about it.
Becoming a Data Head is a complete guide to data science in the workplace, covering everything from the personalities you'll work with to the math behind the algorithms. The authors have spent years in the data trenches and set out to create a fun, approachable, and eminently readable book. Anyone can become a Data Head: an active participant in data science, statistics, and machine learning. Whether you're a business professional, engineer, executive, or aspiring data scientist, this book is for you.
ALEX J. GUTMAN, PhD, is a Data Scientist, Corporate Trainer, and Accredited Professional Statistician. His professional focus is on statistical and machine learning methods, and he has extensive experience working as a Data Scientist for the Department of Defense and two Fortune 50 companies.
JORDAN GOLDMEIER is a Data Scientist, author, speaker, and community leader. He is a seven-time recipient of the Microsoft Most Valuable Professional Award and he has taught analytics to members of the Pentagon and Fortune 500 companies.
Acknowledgments
Foreword
Introduction
Part One Thinking Like a Data Head
Chapter 1 What Is the Problem?
Questions a Data Head Should Ask
Why Is This Problem Important?
Who Does This Problem Affect?
What If We Don't Have the Right Data?
When Is the Project Over?
What If We Don't Like the Results?
Understanding Why Data Projects Fail
Customer Perception
Discussion
Working on Problems That Matter
Chapter Summary
Chapter 2 What Is Data?
Data vs. Information
An Example Dataset
Data Types
How Data Is Collected and Structured
Observational vs. Experimental Data
Structured vs. Unstructured Data
Basic Summary Statistics
Chapter Summary
Chapter 3 Prepare to Think Statistically
Ask Questions
There Is Variation in All Things
Scenario: Customer Perception (The Sequel)
Case Study: Kidney-Cancer Rates
Probabilities and Statistics
Probability vs. Intuition
Discovery with Statistics
Chapter Summary
Part Two Speaking Like a Data Head
Chapter 4 Argue with the Data
What Would You Do?
Missing Data Disaster
Tell Me the Data Origin Story
Who Collected the Data?
How Was the Data Collected?
Is the Data Representative?
Is There Sampling Bias?
What Did You Do with Outliers?
What Data Am I Not Seeing?
How Did You Deal with Missing Values?
Can the Data Measure What You Want It to Measure?
Argue with Data of All Sizes
Chapter Summary
Chapter 5 Explore the Data
Exploratory Data Analysis and You
Embracing the Exploratory Mindset
Questions to Guide You
The Setup
Can the Data Answer the Question?
Set Expectations and Use Common Sense
Do the Values Make Intuitive Sense?
Watch Out: Outliers and Missing Values
Did You Discover Any Relationships?
Understanding Correlation
Watch Out: Misinterpreting Correlation
Watch Out: Correlation Does Not Imply Causation
Did You Find New Opportunities in the Data?
Chapter Summary
Chapter 6 Examine the Probabilities
Take a Guess
The Rules of the Game
Notation
Conditional Probability and Independent Events
The Probability of Multiple Events
Two Things That Happen Together
One Thing or the Other
Probability Thought Exercise
Next Steps
Be Careful Assuming Independence
Don't Fall for the Gambler's Fallacy
All Probabilities Are Conditional
Don't Swap Dependencies
Bayes' Theorem
Ensure the Probabilities Have Meaning
Calibration
Rare Events Can, and Do, Happen
Chapter Summary
Chapter 7 Challenge the Statistics
Quick Lessons on Inference
Give Yourself Some Wiggle Room
More Data, More Evidence
Challenge the Status Quo
Evidence to the Contrary
Balance Decision Errors
The Process of Statistical Inference
The Questions You Should Ask to Challenge the Statistics
What Is the Context for These Statistics?
What Is the Sample Size?
What Are You Testing?
What Is the Null Hypothesis?
Assuming Equivalence
What Is the Significance Level?
How Many Tests Are You Doing?
Can I See the Confidence Intervals?
Is This Practically Significant?
Are You Assuming Causality?
Chapter Summary
Part Three Understanding the Data Scientist's Toolbox
Chapter 8 Search for Hidden Groups
Unsupervised Learning
Dimensionality Reduction
Creating Composite Features
Principal Component Analysis
Principal Components in Athletic Ability
PCA Summary
Potential Traps
Clustering
k-Means Clustering
Clustering Retail Locations
Potential Traps
Chapter Summary
Chapter 9 Understand the Regression Model
Supervised Learning
Linear Regression: What It Does
Least Squares Regression: Not Just a Clever Name
Linear Regression: What It Gives You
Extending to Many Features
Linear Regression: What Confusion It Causes
Omitted Variables
Multicollinearity
Data Leakage
Extrapolation Failures
Many Relationships Aren't Linear
Are You Explaining or Predicting?
Regression Performance
Other Regression Models
Chapter Summary
Chapter 10 Understand the Classification Model
Introduction to Classification
What You'll Learn
Classification Problem Setup
Logistic Regression
Logistic Regression: So What?
Decision Trees
Ensemble Methods
Random Forests
Gradient Boosted Trees
Interpretability of Ensemble Models
Watch Out for Pitfalls
Misapplication of the Problem
Data Leakage
Not Splitting Your Data
Choosing the Right Decision Threshold
Misunderstanding Accuracy
Confusion Matrices
Chapter Summary
Chapter 11 Understand Text Analytics
Expectations of Text Analytics
How Text Becomes Numbers
A Big Bag of Words
N-Grams
Word Embeddings
Topic Modeling
Text Classification
Naïve Bayes
Sentiment Analysis
Practical Considerations When Working with Text
Big Tech Has the Upper Hand
Chapter Summary
Chapter 12 Conceptualize Deep Learning
Neural Networks
How Are Neural Networks Like the Brain?
A Simple Neural Network
How a Neural Network Learns
A Slightly More Complex Neural Network
Applications of Deep Learning
The Benefits of Deep Learning
How Computers "See" Images
Convolutional Neural Networks
Deep Learning on Language and Sequences
Deep Learning in Practice
Do You Have Data?
Is Your Data Structured?
What Will the Network Look Like?
Artificial Intelligence and You
Big Tech Has the Upper Hand
Ethics in Deep Learning
Chapter Summary
Part Four Ensuring Success
Chapter 13 Watch Out for Pitfalls
Biases and Weird Phenomena in Data
Survivorship Bias
Regression to the Mean
Simpson's Paradox
Confirmation Bias
Effort Bias (aka the "Sunk Cost Fallacy")
Algorithmic Bias
Uncategorized Bias
The Big List of Pitfalls
Statistical and Machine Learning Pitfalls
Project Pitfalls
Chapter Summary
Chapter 14 Know the People and Personalities
Seven Scenes of Communication Breakdowns
The Postmortem
Storytime
The Telephone Game
Into the Weeds
The Reality Check
The Takeover
The Blowhard
Data Personalities
Data Enthusiasts
Data Cynics
Data Heads
Chapter Summary
Chapter 15 What's Next?
Index
Data is perhaps the single most important aspect of your job, whether you want it to be or not. And you're likely reading this book because you want to understand what it's all about.
To begin, it's worth stating what has almost become cliché: we create and consume more information than ever before. Without a doubt, we are in the age of data. And this age of data has created an entire industry of promises, buzzwords, and products, many of which you, your managers, colleagues, and subordinates are or will be using. But despite the claims and the proliferation of data promises and products, data science projects are failing at alarming rates.1
To be sure, we're not saying all data promises are empty or all products are terrible. Rather, to truly get your head around this space, you must embrace a fundamental truth: this stuff is complex. Working with data is about numbers, nuance, and uncertainty. Data is important, yes, but it's rarely simple. And yet, there is an entire industry that would have us think otherwise. An industry that promises certainty in an uncertain world and plays on companies' fear of missing out. We, the authors, call this the Data Science Industrial Complex.
It's a problem for everyone involved. Businesses endlessly pursue products that will do their thinking for them. Managers hire analytics professionals who really aren't. Data scientists are hired to work in companies that aren't ready for them. Executives are forced to listen to technobabble and pretend to understand. Projects stall. Money is wasted.
Meanwhile, the Data Science Industrial Complex is churning out new concepts faster than our ability to define and articulate the opportunities (and problems) they create. Blink, and you'll miss one. When your authors started working together, Big Data was all the rage. As time went on, data science became the hot new topic. Since then, machine learning, deep learning, and artificial intelligence have become the next focus.
To the curious and critical thinkers among us, something doesn't sit well. Are the problems really new? Or are these new definitions just rebranding old problems?
The answer, of course, is yes to both.
But the bigger question we hope you're asking yourself is, How can I think and speak critically about data?
Let us show you how.
By reading this book, you'll learn the tools, terms, and thinking necessary to navigate the Data Science Industrial Complex. You'll understand data and its challenges at a deeper level. You'll be able to think critically about the data and results you come across, and you'll be able to speak intelligently about all things data.
In short, you'll become a Data Head.
Before we get into the details, it's worth discussing why your authors, Alex and Jordan, care so much about this topic. In this section, we share two important examples of how data affected society at large and impacted us personally.
We were fresh out of college when the subprime mortgage crisis hit. We both landed jobs with the Air Force in 2009, at a time when jobs were hard to find. We were both lucky. We had an in-demand skill: working with data. We had our hands in data every single day, working to operationalize research from Air Force analysts and scientists into products the government could use. Our hiring would be a harbinger of the focus the country would soon place on the types of roles we filled. As two data workers, we looked on the mortgage crisis with interest and curiosity.
The subprime mortgage crisis had many contributing factors.2 In offering it as an example here, we don't mean to discount the others. But put simply, we see it as a major data failure. Banks and investors created models to understand the value of mortgage-backed collateralized debt obligations (CDOs). You might remember those as the investment vehicles behind the United States' market collapse.
Mortgage-backed CDOs were thought to be a safe investment because they spread the risk associated with loan default across multiple investment units. The idea was that in a portfolio of mortgages, if only a few went into default, this would not materially affect the underlying value of the entire portfolio.
And yet, upon reflection, we know that some fundamental underlying assumptions were wrong. Chief among them was the assumption that defaults were independent events: if Person A defaulted on a loan, it wouldn't affect Person B's risk of default. We would all soon learn that defaults functioned more like dominoes, where one default could predict further defaults. When one mortgage defaulted, the property values surrounding the home dropped, and the risk of default on those homes increased. The default effectively dragged the neighboring houses down into a sinkhole.
Assuming independence when events are in fact connected is a common error in statistics.
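To make the independence point concrete, here is a minimal simulation sketch. It is our illustration, not the book's, and the portfolio size, baseline default rate, and "contagion" increment are made-up assumptions. It compares how often more than 20% of a 100-loan portfolio defaults when defaults are independent versus when each default raises the risk for the loans that follow.

import random

random.seed(42)

N_LOANS = 100      # mortgages in the portfolio (illustrative assumption)
P_DEFAULT = 0.05   # baseline chance any single loan defaults (illustrative)
CONTAGION = 0.03   # extra risk added after each default (illustrative)
TRIALS = 10_000    # number of simulated portfolios

def share_of_bad_portfolios(correlated):
    # Return the fraction of simulated portfolios in which more than
    # 20% of the loans default.
    bad = 0
    for _ in range(TRIALS):
        p = P_DEFAULT
        defaults = 0
        for _ in range(N_LOANS):
            if random.random() < p:
                defaults += 1
                if correlated:
                    # The domino effect: each default raises the
                    # default risk of every remaining loan.
                    p = min(1.0, p + CONTAGION)
        if defaults > 0.2 * N_LOANS:
            bad += 1
    return bad / TRIALS

print("P(>20% of loans default), independent:", share_of_bad_portfolios(False))
print("P(>20% of loans default), correlated: ", share_of_bad_portfolios(True))

Under independence, a mass default is vanishingly rare; with even mild contagion, it becomes a realistic outcome. That gap is precisely the kind of error the CDO models made.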
But let's go further into this story. Investment banks created models that overvalued these investments. A model, which we'll talk about later in the book, is a deliberate oversimplification of reality. It uses assumptions about the real world in an attempt to understand and make predictions about certain phenomena.
And who were these people who created and understood these models? They were the people who would lay the groundwork for what today we call the data scientist. Our kind of people. Statisticians, economists, physicists: folks who did machine learning, artificial intelligence, and statistics. They worked with data. And they were smart. Super smart.
And yet, something went wrong. Did they not ask the correct questions of their work? Were disclosures of risk lost in a game of telephone from the analysts to the decision makers, with uncertainty being stripped away piece by piece, giving an illusion of a perfectly predictable housing market? Did the people involved flat-out lie about the results?
More personal to us, how could we avoid similar mistakes in our own work?
We had many questions and could only speculate about the answers, but one thing was clear: this was a large-scale data disaster at work. And it wouldn't be the last.
On November 8, 2016, the Republican candidate, Donald J. Trump, won the United States general election, beating the presumed front-runner and Democratic challenger, Hillary Clinton. For political pollsters, this came as a shock. Their models hadn't predicted his win. And this was supposed to be the year for election prediction.
In 2008, Nate Silver's FiveThirtyEight blog (then part of The New York Times) had done a fantastic job predicting Barack Obama's win. At the time, pundits were skeptical that his forecasting algorithm could accurately predict the election. In 2012, once again, Nate Silver was front and center predicting another win for Barack Obama.
By this point, the business world was starting to embrace data and hire data scientists. Nate Silver's successful prediction of Barack Obama's reelection only reinforced the importance, and perhaps the oracle-like abilities, of forecasting with data. Articles in business magazines warned executives to adopt data or be swallowed by a data-driven competitor. The Data Science Industrial Complex was in full force.
By 2016, every major news outlet had invested in a prediction algorithm to forecast the outcome of the general election. The vast majority of them suggested an overwhelming victory for the Democratic candidate, Hillary Clinton. Oh, how wrong they were.
Let's put how wrong they were in perspective by comparing this failure with the subprime mortgage crisis. One could argue that we had learned a lot from the past, and that the interest in data science would help us avoid old mistakes. Yes, it's true: since 2008 (and 2012), news organizations had hired data scientists, invested in polling research, created data teams, and spent more money ensuring they received good data.
Which raises the question: with all that time, money, effort, and education, what happened?3
Why do data problems like this occur? We assign three causes: hard problems, lack of critical thinking, and poor communication.
First (as we said earlier), this stuff is complex. Many data problems are fundamentally difficult. Even with lots of data, the right tools and techniques, and the smartest analysts, mistakes happen. Predictions can and will be wrong. This is not a criticism of data and statistics. It's simply reality.
Second, some analysts and stakeholders stopped thinking critically about data problems. The Data Science Industrial Complex, in its hubris, painted a picture of certainty and simplicity, and a subset of people drank the proverbial "Kool-Aid." Perhaps it's human nature: people don't want to admit they don't know what is going to happen. But a key part of thinking about and using data correctly is recognizing that wrong decisions can happen. This means communicating and understanding risks and uncertainties. Somehow this message got lost. While we'd hope that the tremendous progress in research and methods in data and analysis would sharpen everyone's critical thinking, it caused some to turn theirs off.
The third reason...