Practical, accessible guide to becoming a data scientist, updated to include the latest advances in data science and related fields.
Becoming a data scientist is hard. The job focuses on mathematical tools, but it also demands fluency with software engineering, an understanding of the business situation, and deep knowledge of the data itself. This book provides a crash course in data science, combining all the necessary skills into a unified discipline.
The focus of The Data Science Handbook is on practical applications and the ability to solve real problems, rather than on theoretical formalisms that are rarely needed in practice.
Data science is a quickly evolving field, and this 2nd edition has been updated to reflect the latest developments, including the revolution in AI that has come from Large Language Models and the growth of ML Engineering as its own discipline. Much of data science has become a skillset that anybody can have, making this book not only for aspiring data scientists, but also for professionals in other fields who want to use analytics as a force multiplier in their organization.
Field Cady is a data scientist, researcher, and author based in Seattle, WA, USA. He has worked for a range of companies including Google, the Allen Institute for Artificial Intelligence, and several startups. He received a BS in physics and math from Stanford and did graduate work in computer science at Carnegie Mellon. He is the author of The Data Science Handbook (Wiley, 2017).
Preface to the First Edition xvii
Preface to the Second Edition xix
1 Introduction 1
1.1 What Data Science Is and Isn't 2
1.2 This Book's Slogan: Simple Models Are Easier to Work With 3
1.3 How Is This Book Organized? 4
1.4 How to Use This Book? 4
1.5 Why Is It All in Python, Anyway? 4
1.6 Example Code and Datasets 5
1.7 Parting Words 5
Part I The Stuff You'll Always Use 7
2 The Data Science Road Map 9
2.1 Frame the Problem 10
2.2 Understand the Data: Basic Questions 11
2.3 Understand the Data: Data Wrangling 12
2.4 Understand the Data: Exploratory Analysis 12
2.5 Extract Features 13
2.6 Model 14
2.7 Present Results 14
2.8 Deploy Code 14
2.9 Iterating 15
2.10 Glossary 15
3 Programming Languages 17
3.1 Why Use a Programming Language? What Are the Other Options? 17
3.2 A Survey of Programming Languages for Data Science 18
3.3 Where to Write Code 20
3.4 Python Overview and Example Scripts 21
3.5 Python Data Types 25
3.6 GOTCHA: Hashable and Unhashable Types 30
3.7 Functions and Control Structures 31
3.8 Other Parts of Python 33
3.9 Python's Technical Libraries 35
3.10 Other Python Resources 39
3.11 Further Reading 39
3.12 Glossary 40
3a Interlude: My Personal Toolkit 41
4 Data Munging: String Manipulation, Regular Expressions, and Data Cleaning 43
4.1 The Worst Dataset in the World 43
4.2 How to Identify Pathologies 44
4.3 Problems with Data Content 44
4.4 Formatting Issues 46
4.5 Example Formatting Script 49
4.6 Regular Expressions 50
4.7 Life in the Trenches 53
4.8 Glossary 54
5 Visualizations and Simple Metrics 55
5.1 A Note on Python's Visualization Tools 56
5.2 Example Code 56
5.3 Pie Charts 56
5.4 Bar Charts 58
5.5 Histograms 59
5.6 Means, Standard Deviations, Medians, and Quantiles 61
5.7 Boxplots 62
5.8 Scatterplots 64
5.9 Scatterplots with Logarithmic Axes 65
5.10 Scatter Matrices 67
5.11 Heatmaps 68
5.12 Correlations 69
5.13 Anscombe's Quartet and the Limits of Numbers 71
5.14 Time Series 72
5.15 Further Reading 75
5.16 Glossary 75
6 Overview: Machine Learning and Artificial Intelligence 77
6.1 Historical Context 77
6.2 The Central Paradigm: Learning a Function from Example 78
6.3 Machine Learning Data: Vectors and Feature Extraction 79
6.4 Supervised, Unsupervised, and In-Between 79
6.5 Training Data, Testing Data, and the Great Boogeyman of Overfitting 80
6.6 Reinforcement Learning 81
6.7 ML Models as Building Blocks for AI Systems 82
6.8 ML Engineering as a New Job Role 82
6.9 Further Reading 83
6.10 Glossary 83
7 Interlude: Feature Extraction Ideas 85
7.1 Standard Features 85
7.2 Features that Involve Grouping 86
7.3 Preview of More Sophisticated Features 86
7.4 You Get What You Measure: Defining the Target Variable 87
8 Machine-Learning Classification 89
8.1 What Is a Classifier, and What Can You Do with It? 89
8.2 A Few Practical Concerns 90
8.3 Binary Versus Multiclass 90
8.4 Example Script 91
8.5 Specific Classifiers 92
8.6 Evaluating Classifiers 102
8.7 Selecting Classification Cutoffs 105
8.8 Further Reading 106
8.9 Glossary 106
9 Technical Communication and Documentation 109
9.1 Several Guiding Principles 109
9.2 Slide Decks 112
9.3 Written Reports 114
9.4 Speaking: What Has Worked for Me 115
9.5 Code Documentation 117
9.6 Further Reading 117
9.7 Glossary 117
Part II Stuff You Still Need to Know 119
10 Unsupervised Learning: Clustering and Dimensionality Reduction 121
10.1 The Curse of Dimensionality 121
10.2 Example: Eigenfaces for Dimensionality Reduction 123
10.3 Principal Component Analysis and Factor Analysis 125
10.4 Scree Plots and Understanding Dimensionality 127
10.5 Factor Analysis 127
10.6 Limitations of PCA 128
10.7 Clustering 128
10.8 Further Reading 133
10.9 Glossary 134
11 Regression 135
11.1 Example: Predicting Diabetes Progression 136
11.2 Fitting a Line with Least Squares 137
11.3 Alternatives to Least Squares 139
11.4 Fitting Nonlinear Curves 139
11.5 Goodness of Fit: R² and Correlation 141
11.6 Correlation of Residuals 142
11.7 Linear Regression 142
11.8 LASSO Regression and Feature Selection 144
11.9 Further Reading 145
11.10 Glossary 145
12 Data Encodings and File Formats 147
12.1 Typical File Format Categories 147
12.2 CSV Files 149
12.3 JSON Files 150
12.4 XML Files 151
12.5 HTML Files 153
12.6 Tar Files 154
12.7 GZip Files 155
12.8 Zip Files 155
12.9 Image Files: Rasterized, Vectorized, and/or Compressed 156
12.10 It's All Bytes at the End of the Day 157
12.11 Integers 158
12.12 Floats 158
12.13 Text Data 159
12.14 Further Reading 161
12.15 Glossary 161
13 Big Data 163
13.1 What Is Big Data? 163
13.2 When to Use - and Not Use - Big Data 164
13.3 Hadoop: The File System and the Processor 165
13.4 Example PySpark Script 165
13.5 Spark Overview 166
13.6 Spark Operations 168
13.7 PySpark Data Frames 169
13.8 Two Ways to Run PySpark 170
13.9 Configuring Spark 170
13.10 Under the Hood 172
13.11 Spark Tips and Gotchas 172
13.12 The MapReduce Paradigm 173
13.13 Performance Considerations 174
13.14 Further Reading 175
13.15 Glossary 176
14 Databases 177
14.1 Relational Databases and MySQL® 178
14.2 Key-Value Stores 183
14.3 Wide-Column Stores 183
14.4 Document Stores 184
14.5 Further Reading 186
14.6 Glossary 186
15 Software Engineering Best Practices 187
15.1 Coding Style 187
15.2 Version Control and Git for Data Scientists 189
15.3 Testing Code 191
15.4 Test-Driven Development 193
15.5 AGILE Methodology 194
15.6 Further Reading 194
15.7 Glossary 194
16 Traditional Natural Language Processing 197
16.1 Do I Even Need NLP? 197
16.2 The Great Divide: Language Versus Statistics 198
16.3 Example: Sentiment Analysis on Stock Market Articles 198
16.4 Software and Datasets 200
16.5 Tokenization 201
16.6 Central Concept: Bag-of-Words 201
16.7 Word Weighting: TF-IDF 202
16.8 n-Grams 202
16.9 Stop Words 203
16.10 Lemmatization and Stemming 203
16.11 Synonyms 204
16.12 Part of Speech Tagging 204
16.13 Common Problems 204
16.14 Advanced Linguistic NLP: Syntax Trees, Knowledge, and Understanding 206
16.15 Further Reading 207
16.16 Glossary 207
17 Time Series Analysis 209
17.1 Example: Predicting Wikipedia Page Views 210
17.2 A Typical Workflow 213
17.3 Time Series Versus Time-Stamped Events 213
17.4 Resampling and Interpolation 214
17.5 Smoothing Signals 216
17.6 Logarithms and Other Transformations 217
17.7 Trends and Periodicity 217
17.8 Windowing 217
17.9 Brainstorming Simple Features 218
17.10 Better Features: Time Series as Vectors 219
17.11 Fourier Analysis: Sometimes a Magic Bullet 220
17.12 Time Series in Context: The Whole Suite of Features 222
17.13 Further Reading 222
17.14 Glossary 222
18 Probability 225
18.1 Flipping Coins: Bernoulli Random Variables 225
18.2 Throwing Darts: Uniform Random Variables 226
18.3 The Uniform Distribution and Pseudorandom Numbers 227
18.4 Nondiscrete, Noncontinuous Random Variables 228
18.5 Notation, Expectations, and Standard Deviation 230
18.6 Dependence, Marginal, and Conditional Probability 231
18.7 Understanding the Tails 232
18.8 Binomial Distribution 234
18.9 Poisson Distribution 234
18.10 Normal Distribution 235
18.11 Multivariate Gaussian 236
18.12 Exponential Distribution 237
18.13 Log-Normal Distribution 238
18.14 Entropy 238
18.15 Further Reading 240
18.16 Glossary 240
19 Statistics 243
19.1 Statistics in Perspective 243
19.2 Bayesian Versus Frequentist: Practical Tradeoffs and Differing Philosophies 244
19.3 Hypothesis Testing: Key Idea and Example 245
19.4 Multiple Hypothesis Testing 246
19.5 Parameter Estimation 247
19.6 Hypothesis Testing: t-Test 248
19.7 Confidence Intervals 250
19.8 Bayesian Statistics 252
19.9 Naive Bayesian Statistics 253
19.10 Bayesian Networks 253
19.11 Choosing Priors: Maximum Entropy or Domain Knowledge 254
19.12 Further Reading 255
19.13 Glossary 255
20 Programming Language Concepts 257
20.1 Programming Paradigms 257
20.2 Compilation and Interpretation 264
20.3 Type Systems 266
20.4 Further Reading 267
20.5 Glossary 267
21 Performance and Computer Memory 269
21.1 A Word of Caution 269
21.2 Example Script 270
21.3 Algorithm Performance and Big-O Notation 272
21.4 Some Classic Problems: Sorting a List and Binary Search 273
21.5 Amortized Performance and Average Performance 276
21.6 Two Principles: Reducing Overhead and Managing Memory 277
21.7 Performance Tip: Use Numerical Libraries When Applicable 278
21.8 Performance Tip: Delete Large Structures You Don't Need 280
21.9 Performance Tip: Use Built-In Functions When Possible 280
21.10 Performance Tip: Avoid Superfluous Function Calls 280
21.11 Performance Tip: Avoid Creating Large New Objects 281
21.12 Further Reading 281
21.13 Glossary 281
Part III Specialized or Advanced Topics 283
22 Computer Memory and Data Structures 285
22.1 Virtual Memory, the Stack, and the Heap 285
22.2 Example C Program 286
22.3 Data Types and Arrays in Memory 286
22.4 Structs 287
22.5 Pointers, the Stack, and the Heap 288
22.6 Key Data Structures 292
22.7 Further Reading 297
22.8 Glossary 297
23 Maximum-Likelihood Estimation and Optimization 299
23.1 Maximum-Likelihood Estimation 299
23.2 A Simple Example: Fitting a Line 300
23.3 Another Example: Logistic Regression 301
23.4 Optimization 302
23.5 Gradient Descent 303
23.6 Convex Optimization 306
23.7 Stochastic Gradient Descent 307
23.8 Further Reading 308
23.9 Glossary 308
24 Deep Learning and AI 309
24.1 A Note on Libraries and Hardware 310
24.2 A Note on Training Data 310
24.3 Simple Deep Learning: Perceptrons 311
24.4 What Is a Tensor? 314
24.5 Convolutional Neural Networks 315
24.6 Example: The MNIST Handwriting Dataset 317
24.7 Autoencoders and Latent Vectors 318
24.8 Generative AI and GANs 321
24.9 Diffusion Models 323
24.10 RNNs, Hidden State, and the Encoder-Decoder 324
24.11 Attention and Transformers 325
24.12 Stable Diffusion: Bringing the Parts Together 326
24.13 Large Language Models and Prompt Engineering 327
24.14 Further Reading 328
24.15 Glossary 329
25 Stochastic Modeling 331
25.1 Markov Chains 331
25.2 Two Kinds of Markov Chain, Two Kinds of Questions 333
25.3 Hidden Markov Models and the Viterbi Algorithm 334
25.4 The Viterbi Algorithm 336
25.5 Random Walks 337
25.6 Brownian Motion 338
25.7 ARIMA Models 339
25.8 Continuous-Time Markov Processes 339
25.9 Poisson Processes 340
25.10 Further Reading 341
25.11 Glossary 341
26 Parting Words: Your Future as a Data Scientist 343
Index 345
The goal of this book is to turn you into a data scientist, and there are two parts to this mission. First, there is a set of specific concepts, tools, and techniques that you can go out and solve problems with today. They include buzzwords such as machine learning (ML), Spark, and natural language processing (NLP). They also include concepts that are distinctly less sexy but often more useful, like regular expressions, unit tests, and SQL queries. It would be impossible to give an exhaustive list in any single book, but I cast a wide net.
That brings me to the second part of my goal. Tools are constantly changing, and your long-term future as a data scientist depends less on what you know today and more on what you are able to learn going forward. To that end, I want to help you understand the concepts behind the algorithms and the technological fundamentals that underlie the tools we use. For example, this is why I spend a fair amount of time on computer memory and optimization: they are often the underlying reason that one approach is better than another. If you understand the key concepts, you can make the right trade-offs, and you will be able to see how new ideas are related to older ones.
As the field evolves, data science is becoming not just a discipline in its own right, but also a skillset that anybody can have. The software tools are getting better and easier to use, best practices are becoming widely known, and people are learning many of the key skills in school before they've even started their careers. There will continue to be data science specialists, but there is also a growing number of so-called "citizen data scientists" whose real job is something else. They are engineers, biologists, UX designers, programmers, and economists: professionals from all fields who have learned the techniques of data science and are fruitfully applying them to their main discipline.
This book is aimed at anybody who is entering the field. Depending on your background, some parts of it may be stuff you already know. Especially for citizen data scientists, other parts may be unnecessary for your work. But taken as a whole, this book will give you a practical skillset for today, and a solid foundation for your future in data science.
Despite the fact that "data science" is widely practiced and studied today, the term itself is somewhat elusive. So before we go any further, I'd like to give you the definition that I use. I've found that this one gets right to the heart of what sets it apart from other disciplines. Here goes:
Data science means doing analytically oriented work that, for one reason or another, requires a substantial amount of software engineering skills.
Often the final deliverable is the kind of thing a statistician or business analyst might provide, but achieving that goal demands software skills that your typical analyst simply doesn't have - writing a custom parser for an obscure data format, keeping complex preprocessing logic in order, and so on. Other times the data scientist will need to write production software based on their insights, or perhaps make their model available in real time. Often the dataset itself is so large that just creating a pie chart requires that the work be done in parallel across a cluster of computers. And sometimes, it's just a really gnarly SQL query that most people struggle to wrap their heads around.
Nate Silver, a statistician famous for accurate forecasting of US elections, once said: "I think data scientist is a sexed-up term for statistician." He has a point, but what he said is only partly true. The discipline of statistics deals mostly with rigorous mathematical methods for solving well-defined problems; data scientists spend most of their time getting data and the problem into a form where statistical methods can even be applied. This involves making sure that the analytics problem is a good match to business objectives, choosing what to measure and how to quantify things (typically more the domain of a BI analyst), extracting meaningful features from the raw data, and coping with any pathologies of the data or weird edge cases (which often requires a level of coding more typical of a software engineer). Once that heavy lifting is done, you can apply statistical tools to get the final results - although, in practice, you often don't even need them. Professional statisticians need to do a certain amount of preprocessing themselves, but there is a massive difference in degree.
Historically, statistics focused on rigorous methods to analyze clean datasets, such as those that come out of controlled experiments in medicine and agriculture. Often the data was gathered explicitly to support the statisticians' analysis! In the 2000s, though, a new class of datasets became popular to analyze. "Big Data" used new cluster computing tools to study large, messy, heterogeneous datasets of the sort that would make statisticians shudder: HTML pages, image files, e-mails, raw output logs of web servers, and so on. These datasets don't fit the mold of relational databases or statistical tools, and they were not designed to facilitate any particular statistical analysis; so for decades, they were just piling up without being analyzed. Data science came into being as a way to finally milk them for insights. Most of the first data scientists were computer programmers or ML experts who were working on Big Data problems, not statisticians in the traditional sense.
The lines have now blurred: statisticians do more coding than they used to, Big Data tools are less central to the work of a data scientist, and ML is used by a broad swath of people. And this is healthy: the differences between these fields are, after all, really just a matter of degree and/or historical accident. But, in practical terms, "data scientists" are still the jacks-of-all-trades in the middle. They can do statistics, but if you're looking to tease every last insight out of clinical trial data, you should consult a statistician. They can train and deploy ML models, but if you're trying to eke performance out of a large neural network, an ML engineer would be better. They can turn business questions into math problems, but they may not have the deep business knowledge of an analyst.
There is a common theme in this book that I would like to call out as the book's explicit motto: simple models are easier to work with. Let me explain.
People tend to idolize and gravitate toward complicated analytical models like deep neural nets, Bayesian networks, ARIMA models, and the like. There are good reasons to use these tools: the best-performing models in the world are usually complicated, there may be fancy ways to bake in expert knowledge, and so on. There are also bad reasons to use these tools, like ego and pressure to use the latest buzzwords.
But seasoned data scientists understand that there is more to a model than how accurate it is. Simple models are, above all, easier to reason about. If you're trying to understand what patterns in the data your model is picking up on, simple models are the way to go. Oftentimes this is the whole point of a model anyway: we are just trying to get insights into the system we are studying, and a model's performance is just used to gauge how fully it has captured the relevant patterns in the data.
A related advantage of simple models is supremely mundane: stuff breaks, and they make it easier to find what's broken. Bad training data, perverse inputs to the model, and data that is incorrectly formatted - all of these are liable to cause conspicuous failures, and it's easy to figure out what went wrong by dissecting the model. For this reason, I like "stunt double models," which have the same input/output format as a complicated one and are used to debug the model's integration with other systems.
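To make that concrete, here is a minimal sketch of a stunt double classifier in Python. The class name, the trivial predictions, and the run_pipeline() call are all hypothetical stand-ins rather than code from the book; the point is only that the stub exposes the same fit/predict interface as the real model, so the surrounding plumbing can be exercised and debugged on its own.

```python
# Hypothetical "stunt double" model: same input/output interface as the real
# classifier, but with trivial logic, so failures elsewhere in the pipeline
# are easy to isolate.
import numpy as np


class StuntDoubleClassifier:
    """Stand-in that mimics a scikit-learn-style fit/predict interface."""

    def fit(self, X, y):
        # Remember the label set so predictions are at least plausible values.
        self.classes_ = np.unique(y)
        return self

    def predict(self, X):
        # Always predict the first class; enough to exercise downstream code.
        return np.full(len(X), self.classes_[0])

    def predict_proba(self, X):
        # Uniform probabilities with the expected shape.
        n_classes = len(self.classes_)
        return np.full((len(X), n_classes), 1.0 / n_classes)


# Swap the stunt double into whatever integration you need to debug, e.g.:
# run_pipeline(model=StuntDoubleClassifier())   # run_pipeline is hypothetical
```

If the pipeline still misbehaves with the stunt double in place, the problem lies in the data handling or deployment code rather than the model itself.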
Simple models are less prone to overfitting. If your dataset is small, a fancy model will often actually perform worse: it essentially memorizes the training data, rather than extracting general patterns from it. The simpler a model, the less you have to worry about the size of your dataset (though admittedly this can create a square-peg-in-a-round-hole situation where the model can't fit the data well and performance degrades).
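As a quick illustration (a sketch on synthetic data, not an example from the book), the snippet below fits a plain linear regression and an unconstrained decision tree to a small, noisy dataset. The exact numbers depend on the random seed, but the tree will typically score near-perfectly on the training set and worse than the linear model on held-out data.

```python
# Sketch of overfitting on a small dataset: the unconstrained tree memorizes
# the training data, while the simple linear model generalizes better.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(40, 1))            # only 40 samples
y = 2.0 * X[:, 0] + rng.normal(0, 3, size=40)   # linear signal plus noise

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for model in (LinearRegression(), DecisionTreeRegressor(random_state=0)):
    model.fit(X_train, y_train)
    print(type(model).__name__,
          "train R^2:", round(model.score(X_train, y_train), 2),
          "test R^2:", round(model.score(X_test, y_test), 2))
```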
Simple models are easier to hack and jury-rig. Frequently they have a small number of tunable parameters, with clear meanings that you can adjust to suit the business needs at hand.
The inferior performance of simple models can act as a performance benchmark, a level that the fancier model must meaningfully exceed in order to justify its extra complexity. And if a simple model performs particularly badly, this may suggest that there isn't enough signal in the data to make the problem worthwhile.
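Here is a minimal sketch of that benchmarking habit, assuming scikit-learn and one of its bundled datasets (the dataset and model choices are illustrative, not the book's): score a trivial baseline alongside the candidates, and ask whether the fancier model clears the simpler ones by enough of a margin to justify its complexity.

```python
# Sketch of using simple models as a benchmark: a majority-class baseline,
# a simple logistic regression, and a fancier gradient-boosted ensemble.
from sklearn.datasets import load_breast_cancer
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

models = {
    "majority-class baseline": DummyClassifier(strategy="most_frequent"),
    "logistic regression": LogisticRegression(max_iter=5000),
    "gradient boosting": GradientBoostingClassifier(),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean CV accuracy {scores.mean():.3f}")
# If gradient boosting only matches logistic regression, its extra complexity
# is hard to justify; if even logistic regression barely beats the baseline,
# there may not be much signal in the data.
```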
On the other hand, when there is enough training data and it is representative of what you expect to see, fancier models do perform better. You usually don't want to leave money on the table by deploying grossly inferior models simply because they are easier to debug. And there are many situations, like cutting-edge AI, where the relevant patterns are very complicated, and it takes a complicated model to accurately capture them. Even in...