Practical, accessible guide to becoming a data scientist, updated to include the latest advances in data science and related fields.
Becoming a data scientist is hard. The job focuses on mathematical tools, but it also demands fluency with software engineering, an understanding of the business situation, and deep knowledge of the data itself. This book provides a crash course in data science, combining all the necessary skills into a unified discipline.
The focus of The Data Science Handbook is on practical applications and the ability to solve real problems, rather than on theoretical formalisms that are rarely needed in practice.
Data science is a quickly evolving field, and this 2nd edition has been updated to reflect the latest developments, including the revolution in AI that has come from Large Language Models and the growth of ML Engineering as its own discipline. Much of data science has become a skillset that anybody can have, making this book not only for aspiring data scientists, but also for professionals in other fields who want to use analytics as a force multiplier in their organization.
Field Cady is a data scientist, researcher, and author based in Seattle, WA, USA. He has worked for a range of companies including Google, the Allen Institute for Artificial Intelligence, and several startups. He received a BS in physics and math from Stanford and did graduate work in computer science at Carnegie Mellon. He is the author of The Data Science Handbook (Wiley, 2017).
Preface to the First Edition xvii
Preface to the Second Edition xix
1 Introduction 1
1.1 What Data Science Is and Isn't 2
1.2 This Book's Slogan: Simple Models Are Easier to Work With 3
1.3 How Is This Book Organized? 4
1.4 How to Use This Book? 4
1.5 Why Is It All in Python, Anyway? 4
1.6 Example Code and Datasets 5
1.7 Parting Words 5
Part I The Stuff You'll Always Use 7
2 The Data Science Road Map 9
2.1 Frame the Problem 10
2.2 Understand the Data: Basic Questions 11
2.3 Understand the Data: Data Wrangling 12
2.4 Understand the Data: Exploratory Analysis 12
2.5 Extract Features 13
2.6 Model 14
2.7 Present Results 14
2.8 Deploy Code 14
2.9 Iterating 15
2.10 Glossary 15
3 Programming Languages 17
3.1 Why Use a Programming Language? What Are the Other Options? 17
3.2 A Survey of Programming Languages for Data Science 18
3.3 Where to Write Code 20
3.4 Python Overview and Example Scripts 21
3.5 Python Data Types 25
3.6 GOTCHA: Hashable and Unhashable Types 30
3.7 Functions and Control Structures 31
3.8 Other Parts of Python 33
3.9 Python's Technical Libraries 35
3.10 Other Python Resources 39
3.11 Further Reading 39
3.12 Glossary 40
3a Interlude: My Personal Toolkit 41
4 Data Munging: String Manipulation, Regular Expressions, and Data Cleaning 43
4.1 The Worst Dataset in the World 43
4.2 How to Identify Pathologies 44
4.3 Problems with Data Content 44
4.4 Formatting Issues 46
4.5 Example Formatting Script 49
4.6 Regular Expressions 50
4.7 Life in the Trenches 53
4.8 Glossary 54
5 Visualizations and Simple Metrics 55
5.1 A Note on Python's Visualization Tools 56
5.2 Example Code 56
5.3 Pie Charts 56
5.4 Bar Charts 58
5.5 Histograms 59
5.6 Means, Standard Deviations, Medians, and Quantiles 61
5.7 Boxplots 62
5.8 Scatterplots 64
5.9 Scatterplots with Logarithmic Axes 65
5.10 Scatter Matrices 67
5.11 Heatmaps 68
5.12 Correlations 69
5.13 Anscombe's Quartet and the Limits of Numbers 71
5.14 Time Series 72
5.15 Further Reading 75
5.16 Glossary 75
6 Overview: Machine Learning and Artificial Intelligence 77
6.1 Historical Context 77
6.2 The Central Paradigm: Learning a Function from Example 78
6.3 Machine Learning Data: Vectors and Feature Extraction 79
6.4 Supervised, Unsupervised, and In-Between 79
6.5 Training Data, Testing Data, and the Great Boogeyman of Overfitting 80
6.6 Reinforcement Learning 81
6.7 ML Models as Building Blocks for AI Systems 82
6.8 ML Engineering as a New Job Role 82
6.9 Further Reading 83
6.10 Glossary 83
7 Interlude: Feature Extraction Ideas 85
7.1 Standard Features 85
7.2 Features that Involve Grouping 86
7.3 Preview of More Sophisticated Features 86
7.4 You Get What You Measure: Defining the Target Variable 87
8 Machine-Learning Classification 89
8.1 What Is a Classifier, and What Can You Do with It? 89
8.2 A Few Practical Concerns 90
8.3 Binary Versus Multiclass 90
8.4 Example Script 91
8.5 Specific Classifiers 92
8.6 Evaluating Classifiers 102
8.7 Selecting Classification Cutoffs 105
8.8 Further Reading 106
8.9 Glossary 106
9 Technical Communication and Documentation 109
9.1 Several Guiding Principles 109
9.2 Slide Decks 112
9.3 Written Reports 114
9.4 Speaking: What Has Worked for Me 115
9.5 Code Documentation 117
9.6 Further Reading 117
9.7 Glossary 117
Part II Stuff You Still Need to Know 119
10 Unsupervised Learning: Clustering and Dimensionality Reduction 121
10.1 The Curse of Dimensionality 121
10.2 Example: Eigenfaces for Dimensionality Reduction 123
10.3 Principal Component Analysis and Factor Analysis 125
10.4 Scree Plots and Understanding Dimensionality 127
10.5 Factor Analysis 127
10.6 Limitations of PCA 128
10.7 Clustering 128
10.8 Further Reading 133
10.9 Glossary 134
11 Regression 135
11.1 Example: Predicting Diabetes Progression 136
11.2 Fitting a Line with Least Squares 137
11.3 Alternatives to Least Squares 139
11.4 Fitting Nonlinear Curves 139
11.5 Goodness of Fit: R² and Correlation 141
11.6 Correlation of Residuals 142
11.7 Linear Regression 142
11.8 LASSO Regression and Feature Selection 144
11.9 Further Reading 145
11.10 Glossary 145
12 Data Encodings and File Formats 147
12.1 Typical File Format Categories 147
12.2 CSV Files 149
12.3 JSON Files 150
12.4 XML Files 151
12.5 HTML Files 153
12.6 Tar Files 154
12.7 GZip Files 155
12.8 Zip Files 155
12.9 Image Files: Rasterized, Vectorized, and/or Compressed 156
12.10 It's All Bytes at the End of the Day 157
12.11 Integers 158
12.12 Floats 158
12.13 Text Data 159
12.14 Further Reading 161
12.15 Glossary 161
13 Big Data 163
13.1 What Is Big Data? 163
13.2 When to Use - and Not Use - Big Data 164
13.3 Hadoop: The File System and the Processor 165
13.4 Example PySpark Script 165
13.5 Spark Overview 166
13.6 Spark Operations 168
13.7 PySpark Data Frames 169
13.8 Two Ways to Run PySpark 170
13.9 Configuring Spark 170
13.10 Under the Hood 172
13.11 Spark Tips and Gotchas 172
13.12 The MapReduce Paradigm 173
13.13 Performance Considerations 174
13.14 Further Reading 175
13.15 Glossary 176
14 Databases 177
14.1 Relational Databases and MySQL® 178
14.2 Key-Value Stores 183
14.3 Wide-Column Stores 183
14.4 Document Stores 184
14.5 Further Reading 186
14.6 Glossary 186
15 Software Engineering Best Practices 187
15.1 Coding Style 187
15.2 Version Control and Git for Data Scientists 189
15.3 Testing Code 191
15.4 Test-Driven Development 193
15.5 AGILE Methodology 194
15.6 Further Reading 194
15.7 Glossary 194
16 Traditional Natural Language Processing 197
16.1 Do I Even Need NLP? 197
16.2 The Great Divide: Language Versus Statistics 198
16.3 Example: Sentiment Analysis on Stock Market Articles 198
16.4 Software and Datasets 200
16.5 Tokenization 201
16.6 Central Concept: Bag-of-Words 201
16.7 Word Weighting: TF-IDF 202
16.8 n-Grams 202
16.9 Stop Words 203
16.10 Lemmatization and Stemming 203
16.11 Synonyms 204
16.12 Part of Speech Tagging 204
16.13 Common Problems 204
16.14 Advanced Linguistic NLP: Syntax Trees, Knowledge, and Understanding 206
16.15 Further Reading 207
16.16 Glossary 207
17 Time Series Analysis 209
17.1 Example: Predicting Wikipedia Page Views 210
17.2 A Typical Workflow 213
17.3 Time Series Versus Time-Stamped Events 213
17.4 Resampling and Interpolation 214
17.5 Smoothing Signals 216
17.6 Logarithms and Other Transformations 217
17.7 Trends and Periodicity 217
17.8 Windowing 217
17.9 Brainstorming Simple Features 218
17.10 Better Features: Time Series as Vectors 219
17.11 Fourier Analysis: Sometimes a Magic Bullet 220
17.12 Time Series in Context: The Whole Suite of Features 222
17.13 Further Reading 222
17.14 Glossary 222
18 Probability 225
18.1 Flipping Coins: Bernoulli Random Variables 225
18.2 Throwing Darts: Uniform Random Variables 226
18.3 The Uniform Distribution and Pseudorandom Numbers 227
18.4 Nondiscrete, Noncontinuous Random Variables 228
18.5 Notation, Expectations, and Standard Deviation 230
18.6 Dependence, Marginal, and Conditional Probability 231
18.7 Understanding the Tails 232
18.8 Binomial Distribution 234
18.9 Poisson Distribution 234
18.10 Normal Distribution 235
18.11 Multivariate Gaussian 236
18.12 Exponential Distribution 237
18.13 Log-Normal Distribution 238
18.14 Entropy 238
18.15 Further Reading 240
18.16 Glossary 240
19 Statistics 243
19.1 Statistics in Perspective 243
19.2 Bayesian Versus Frequentist: Practical Tradeoffs and Differing Philosophies 244
19.3 Hypothesis Testing: Key Idea and Example 245
19.4 Multiple Hypothesis Testing 246
19.5 Parameter Estimation 247
19.6 Hypothesis Testing: t-Test 248
19.7 Confidence Intervals 250
19.8 Bayesian Statistics 252
19.9 Naive Bayesian Statistics 253
19.10 Bayesian Networks 253
19.11 Choosing Priors: Maximum Entropy or Domain Knowledge 254
19.12 Further Reading 255
19.13 Glossary 255
20 Programming Language Concepts 257
20.1 Programming Paradigms 257
20.2 Compilation and Interpretation 264
20.3 Type Systems 266
20.4 Further Reading 267
20.5 Glossary 267
21 Performance and Computer Memory 269
21.1 A Word of Caution 269
21.2 Example Script 270
21.3 Algorithm Performance and Big-O Notation 272
21.4 Some Classic Problems: Sorting a List and Binary Search 273
21.5 Amortized Performance and Average Performance 276
21.6 Two Principles: Reducing Overhead and Managing Memory 277
21.7 Performance Tip: Use Numerical Libraries When Applicable 278
21.8 Performance Tip: Delete Large Structures You Don't Need 280
21.9 Performance Tip: Use Built-In Functions When Possible 280
21.10 Performance Tip: Avoid Superfluous Function Calls 280
21.11 Performance Tip: Avoid Creating Large New Objects 281
21.12 Further Reading 281
21.13 Glossary 281
Part III Specialized or Advanced Topics 283
22 Computer Memory and Data Structures 285
22.1 Virtual Memory, the Stack, and the Heap 285
22.2 Example C Program 286
22.3 Data Types and Arrays in Memory 286
22.4 Structs 287
22.5 Pointers, the Stack, and the Heap 288
22.6 Key Data Structures 292
22.7 Further Reading 297
22.8 Glossary 297
23 Maximum-Likelihood Estimation and Optimization 299
23.1 Maximum-Likelihood Estimation 299
23.2 A Simple Example: Fitting a Line 300
23.3 Another Example: Logistic Regression 301
23.4 Optimization 302
23.5 Gradient Descent 303
23.6 Convex Optimization 306
23.7 Stochastic Gradient Descent 307
23.8 Further Reading 308
23.9 Glossary 308
24 Deep Learning and AI 309
24.1 A Note on Libraries and Hardware 310
24.2 A Note on Training Data 310
24.3 Simple Deep Learning: Perceptrons 311
24.4 What Is a Tensor? 314
24.5 Convolutional Neural Networks 315
24.6 Example: The MNIST Handwriting Dataset 317
24.7 Autoencoders and Latent Vectors 318
24.8 Generative AI and GANs 321
24.9 Diffusion Models 323
24.10 RNNs, Hidden State, and the Encoder-Decoder 324
24.11 Attention and Transformers 325
24.12 Stable Diffusion: Bringing the Parts Together 326
24.13 Large Language Models and Prompt Engineering 327
24.14 Further Reading 328
24.15 Glossary 329
25 Stochastic Modeling 331
25.1 Markov Chains 331
25.2 Two Kinds of Markov Chain, Two Kinds of Questions 333
25.3 Hidden Markov Models and the Viterbi Algorithm 334
25.4 The Viterbi Algorithm 336
25.5 Random Walks 337
25.6 Brownian Motion 338
25.7 ARIMA Models 339
25.8 Continuous-Time Markov Processes 339
25.9 Poisson Processes 340
25.10 Further Reading 341
25.11 Glossary 341
26 Parting Words: Your Future as a Data Scientist 343
Index 345
The goal of this book is to turn you into a data scientist, and there are two parts to this mission. First, there is a set of specific concepts, tools, and techniques that you can go out and solve problems with today. They include buzzwords such as machine learning (ML), Spark, and natural language processing (NLP). They also include concepts that are distinctly less sexy but often more useful, like regular expressions, unit tests, and SQL queries. It would be impossible to give an exhaustive list in any single book, but I cast a wide net.
That brings me to the second part of my goal. Tools are constantly changing, and your long-term future as a data scientist depends less on what you know today and more on what you are able to learn going forward. To that end, I want to help you understand the concepts behind the algorithms and the technological fundamentals that underlie the tools we use. For example, this is why I spend a fair amount of time on computer memory and optimization: they are often the underlying reason that one approach is better than another. If you understand the key concepts, you can make the right trade-offs, and you will be able to see how new ideas are related to older ones.
As the field evolves, data science is becoming not just a discipline in its own right, but also a skillset that anybody can have. The software tools are getting better and easier to use, best practices are becoming widely known, and people are learning many of the key skills in school before they've even started their careers. There will continue to be data science specialists, but there is also a growing number of so-called "citizen data scientists" whose real job is something else. They are engineers, biologists, UX designers, programmers, and economists: professionals from all fields who have learned the techniques of data science and are fruitfully applying them to their main discipline.
This book is aimed at anybody who is entering the field. Depending on your background, some parts of it may be stuff you already know. Especially for citizen data scientists, other parts may be unnecessary for your work. But taken as a whole, this book will give you a practical skillset for today, and a solid foundation for your future in data science.
Despite the fact that "data science" is widely practiced and studied today, the term itself is somewhat elusive. So before we go any further, I'd like to give you the definition that I use. I've found that this one gets right to the heart of what sets it apart from other disciplines. Here goes:
Data science means doing analytically oriented work that, for one reason or another, requires a substantial amount of software engineering skills.
Often the final deliverable is the kind of thing a statistician or business analyst might provide, but achieving that goal demands software skills that your typical analyst simply doesn't have - writing a custom parser for an obscure data format, keeping complex preprocessing logic in order, and so on. Other times the data scientist will need to write production software based on their insights, or perhaps make their model available in real time. Often the dataset itself is so large that just creating a pie chart requires that the work be done in parallel across a cluster of computers. And sometimes, it's just a really gnarly SQL query that most people struggle to wrap their heads around.
Nate Silver, a statistician famous for accurate forecasting of US elections, once said: "I think data scientist is a sexed-up term for statistician." He has a point, but what he said is only partly true. The discipline of statistics deals mostly with rigorous mathematical methods for solving well-defined problems; data scientists spend most of their time getting data and the problem into a form where statistical methods can even be applied. This involves making sure that the analytics problem is a good match to business objectives, choosing what to measure and how to quantify things (typically more the domain of a BI analyst), extracting meaningful features from the raw data, and coping with any pathologies of the data or weird edge cases (which often requires a level of coding more typical of a software engineer). Once that heavy lifting is done, you can apply statistical tools to get the final results - although, in practice, you often don't even need them. Professional statisticians need to do a certain amount of preprocessing themselves, but there is a massive difference in degree.
Historically, statistics focused on rigorous methods to analyze clean datasets, such as those that come out of controlled experiments in medicine and agriculture. Often the data was gathered explicitly to support the statisticians' analysis! In the 2000s, though, a new class of datasets became popular to analyze. "Big Data" used new cluster computing tools to study large, messy, heterogeneous datasets of the sort that would make statisticians shudder: HTML pages, image files, e-mails, raw output logs of web servers, and so on. These datasets don't fit the mold of relational databases or statistical tools, and they were not designed to facilitate any particular statistical analysis; so for decades, they were just piling up without being analyzed. Data science came into being as a way to finally milk them for insights. Most of the first data scientists were computer programmers or ML experts who were working on Big Data problems, not statisticians in the traditional sense.
The lines have now blurred: statisticians do more coding than they used to, Big Data tools are less central to the work of a data scientist, and ML is used by a broad swath of people. And this is healthy: the differences between these fields are, after all, really just a matter of degree and/or historical accident. But, in practical terms, "data scientists" are still the jacks-of-all-trades in the middle. They can do statistics, but if you're looking to tease every last insight out of clinical trial data, you should consult a statistician. They can train and deploy ML models, but if you're trying to eke performance out of a large neural network, an ML engineer would be better. They can turn business questions into math problems, but they may not have the deep business knowledge of an analyst.
There is a common theme in this book that I would like to call out as the book's explicit motto: simple models are easier to work with. Let me explain.
People tend to idolize and gravitate toward complicated analytical models like deep neural nets, Bayesian networks, ARIMA models, and the like. There are good reasons to use these tools: the best-performing models in the world are usually complicated, there may be fancy ways to bake in expert knowledge, and so on. There are also bad reasons to use these tools, like ego and pressure to use the latest buzzwords.
But seasoned data scientists understand that there is more to a model than how accurate it is. Simple models are, above all, easier to reason about. If you're trying to understand what patterns in the data your model is picking up on, simple models are the way to go. Oftentimes this is the whole point of a model anyway: we are just trying to get insights into the system we are studying, and a model's performance is just used to gauge how fully it has captured the relevant patterns in the data.
A related advantage of simple models is supremely mundane: stuff breaks, and they make it easier to find what's broken. Bad training data, perverse inputs to the model, and data that is incorrectly formatted - all of these are liable to cause conspicuous failures, and it's easy to figure out what went wrong by dissecting the model. For this reason, I like "stunt double models," which have the same input/output format as a complicated one and are used to debug the model's integration with other systems.
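To make that concrete, here is a minimal sketch of a stunt double classifier in Python. The class name, the trivial predictions, and the run_pipeline() call are all hypothetical stand-ins rather than code from the book; the point is only that the stub exposes the same fit/predict interface as the real model, so the surrounding plumbing can be exercised and debugged on its own.

```python
# Hypothetical "stunt double" model: same input/output interface as the real
# classifier, but with trivial logic, so failures elsewhere in the pipeline
# are easy to isolate.
import numpy as np


class StuntDoubleClassifier:
    """Stand-in that mimics a scikit-learn-style fit/predict interface."""

    def fit(self, X, y):
        # Remember the label set so predictions are at least plausible values.
        self.classes_ = np.unique(y)
        return self

    def predict(self, X):
        # Always predict the first class; enough to exercise downstream code.
        return np.full(len(X), self.classes_[0])

    def predict_proba(self, X):
        # Uniform probabilities with the expected shape.
        n_classes = len(self.classes_)
        return np.full((len(X), n_classes), 1.0 / n_classes)


# Swap the stunt double into whatever integration you need to debug, e.g.:
# run_pipeline(model=StuntDoubleClassifier())   # run_pipeline is hypothetical
```

If the pipeline still misbehaves with the stunt double in place, the problem lies in the data handling or deployment code rather than the model itself.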
Simple models are less prone to overfitting. If your dataset is small, a fancy model will often actually perform worse: it essentially memorizes the training data, rather than extracting general patterns from it. The simpler a model, the less you have to worry about the size of your dataset (though admittedly this can create a square-peg-in-a-round-hole situation where the model can't fit the data well and performance degrades).
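As a quick illustration (a sketch on synthetic data, not an example from the book), the snippet below fits a plain linear regression and an unconstrained decision tree to a small, noisy dataset. The exact numbers depend on the random seed, but the tree will typically score near-perfectly on the training set and worse than the linear model on held-out data.

```python
# Sketch of overfitting on a small dataset: the unconstrained tree memorizes
# the training data, while the simple linear model generalizes better.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(40, 1))            # only 40 samples
y = 2.0 * X[:, 0] + rng.normal(0, 3, size=40)   # linear signal plus noise

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for model in (LinearRegression(), DecisionTreeRegressor(random_state=0)):
    model.fit(X_train, y_train)
    print(type(model).__name__,
          "train R^2:", round(model.score(X_train, y_train), 2),
          "test R^2:", round(model.score(X_test, y_test), 2))
```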
Simple models are easier to hack and jury-rig. Frequently they have a small number of tunable parameters, with clear meanings that you can adjust to suit the business needs at hand.
The inferior performance of simple models can act as a performance benchmark, a level that the fancier model must meaningfully exceed in order to justify its extra complexity. And if a simple model performs particularly badly, this may suggest that there isn't enough signal in the data to make the problem worthwhile.
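Here is a minimal sketch of that benchmarking habit, assuming scikit-learn and one of its bundled datasets (the dataset and model choices are illustrative, not the book's): score a trivial baseline alongside the candidates, and ask whether the fancier model clears the simpler ones by enough of a margin to justify its complexity.

```python
# Sketch of using simple models as a benchmark: a majority-class baseline,
# a simple logistic regression, and a fancier gradient-boosted ensemble.
from sklearn.datasets import load_breast_cancer
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

models = {
    "majority-class baseline": DummyClassifier(strategy="most_frequent"),
    "logistic regression": LogisticRegression(max_iter=5000),
    "gradient boosting": GradientBoostingClassifier(),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean CV accuracy {scores.mean():.3f}")
# If gradient boosting only matches logistic regression, its extra complexity
# is hard to justify; if even logistic regression barely beats the baseline,
# there may not be much signal in the data.
```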
On the other hand, when there is enough training data and it is representative of what you expect to see, fancier models do perform better. You usually don't want to leave money on the table by deploying grossly inferior models simply because they are easier to debug. And there are many situations, like cutting-edge AI, where the relevant patterns are very complicated, and it takes a complicated model to accurately capture them. Even in...