
SQL for Data Scientists
Description
Alles über E-Books | Antworten auf Fragen rund um E-Books, Kopierschutz und Dateiformate finden Sie in unserem Info- & Hilfebereich.
SQL for Data Scientists: A Beginner's Guide for Building Datasets for Analysis is a resource that's dedicated to the Structured Query Language (SQL) and dataset design skills that data scientists use most. Aspiring data scientists will learn how to how to construct datasets for exploration, analysis, and machine learning. You can also discover how to approach query design and develop SQL code to extract data insights while avoiding common pitfalls.
You may be one of many people who are entering the field of Data Science from a range of professions and educational backgrounds, such as business analytics, social science, physics, economics, and computer science. Like many of them, you may have conducted analyses using spreadsheets as data sources, but never retrieved and engineered datasets from a relational database using SQL, which is a programming language designed for managing databases and extracting data.
This guide for data scientists differs from other instructional guides on the subject. It doesn't cover SQL broadly. Instead, you'll learn the subset of SQL skills that data analysts and data scientists use frequently. You'll also gain practical advice and direction on "how to think about constructing your dataset."
* Gain an understanding of relational database structure, query design, and SQL syntax
* Develop queries to construct datasets for use in applications like interactive reports and machine learning algorithms
* Review strategies and approaches so you can design analytical datasets
* Practice your techniques with the provided database and SQL code
In this book, author Renee Teate shares knowledge gained during a 15-year career working with data, in roles ranging from database developer to data analyst to data scientist. She guides you through SQL code and dataset design concepts from an industry practitioner's perspective, moving your data scientist career forward!
More details
Other editions
Additional editions


Person
RENÉE M. P. TEATE is the Director of Data Science at HelioCampus, a higher ed tech startup based in the Washington, DC area. She prepares datasets with SQL, develops predictive models with Python, and designs interactive dashboards in Tableau for university decision-makers. She created the "Becoming a Data Scientist" podcast, helped build the data science learning community on Twitter, and is a sought-after speaker at industry conferences.
Content
Introduction xix
Chapter 1 Data Sources 1
Data Sources 1
Tools for Connecting to Data Sources and Editing SQL 2
Relational Databases 3
Dimensional Data Warehouses 7
Asking Questions About the Data Source 9
Introduction to the Farmer's Market Database 11
A Note on Machine Learning Dataset Terminology 12
Exercises 13
Chapter 2 The SELECT Statement 15
The SELECT Statement 15
The Fundamental Syntax Structure of a SELECT Query 16
Selecting Columns and Limiting the Number of Rows Returned 16
The ORDER BY Clause: Sorting Results 18
Introduction to Simple Inline Calculations 20
More Inline Calculation Examples: Rounding 22
More Inline Calculation Examples: Concatenating Strings 24
Evaluating Query Output 26
SELECT Statement Summary 29
Exercises Using the Included Database 30
Chapter 3 The WHERE Clause 31
The WHERE Clause 31
Filtering SELECT Statement Results 32
Filtering on Multiple Conditions 34
Multi-Column Conditional Filtering 40
More Ways to Filter 41
BETWEEN 41
IN 42
LIKE 43
IS NULL 44
A Warning About Null Comparisons 44
Filtering Using Subqueries 46
Exercises Using the Included Database 47
Chapter 4 CASE Statements 49
CASE Statement Syntax 50
Creating Binary Flags Using CASE 52
Grouping or Binning Continuous Values Using CASE 53
Categorical Encoding Using CASE 56
CASE Statement Summary 59
Exercises Using the Included Database 60
Chapter 5 SQL JOINs 61
Database Relationships and SQL JOINs 61
A Common Pitfall when Filtering Joined Data 71
JOINs with More than Two Tables 74
Exercises Using the Included Database 76
Chapter 6 Aggregating Results for Analysis 79
GROUP BY Syntax 79
Displaying Group Summaries 80
Performing Calculations Inside Aggregate Functions 84
MIN and MAX 88
COUNT and COUNT DISTINCT 90
Average 91
Filtering with HAVING 93
CASE Statements Inside Aggregate Functions 94
Exercises Using the Included Database 96
Chapter 7 Window Functions and Subqueries 97
ROW NUMBER 98
RANK and DENSE RANK 101
NTILE 102
Aggregate Window Functions 103
LAG and LEAD 108
Exercises Using the Included Database 111
Chapter 8 Date and Time Functions 113
Setting datetime Field Values 114
EXTRACT and DATE_PART 115
DATE_ADD and DATE_SUB 116
DATEDIFF 118
TIMESTAMPDIFF 119
Date Functions in Aggregate Summaries and Window Functions 119
Exercises 126
Chapter 9 Exploratory Data Analysis with SQL 127
Demonstrating Exploratory Data Analysis with SQL 128
Exploring the Products Table 128
Exploring Possible Column Values 131
Exploring Changes Over Time 134
Exploring Multiple Tables Simultaneously 135
Exploring Inventory vs. Sales 138
Exercises 142
Chapter 10 Building SQL Datasets for Analytical Reporting 143
Thinking Through Analytical Dataset Requirements 144
Using Custom Analytical Datasets in SQL:
CTEs and Views 149
Taking SQL Reporting Further 153
Exercises 157
Chapter 11 More Advanced Query Structures 159
UNIONs 159
Self-Join to Determine To-Date Maximum 163
Counting New vs. Returning Customers by Week 167
Summary 171
Exercises 171
Chapter 12 Creating Machine Learning Datasets Using SQL 173
Datasets for Time Series Models 174
Datasets for Binary Classification 176
Creating the Dataset 178
Expanding the Feature Set 181
Feature Engineering 185
Taking Things to the Next Level 189
Exercises 189
Chapter 13 Analytical Dataset Development Examples 191
What Factors Correlate with Fresh Produce Sales? 191
How Do Sales Vary by Customer Zip Code,
Market Distance, and Demographic Data? 211
How Does Product Price Distribution Affect
Market Sales? 217
Chapter 14 Storing and Modifying Data 229
Storing SQL Datasets as Tables and Views 229
Adding a Timestamp Column 232
Inserting Rows and Updating Values in Database Tables 233
Using SQL Inside Scripts 236
In Closing 237
Exercises 238
Appendix Answers to Exercises 239
Index 255
Introduction
Who I Am and Why I'm Writing About This Topic
When I was first brainstorming topics for this book, I used two questions to narrow down my list: "Who is my audience?" and "What topic do I know well enough to write a book that would be worth publishing for that audience?"
The first question had an easy initial answer: I already have an audience of data-science-learning Twitter followers with whom I share resources and advice on "Becoming a Data Scientist" that I could keep in mind while narrowing down the topics.
So then I was left to figure out what I know that I could teach to people who want to become data scientists.
I have been designing and querying relational databases professionally for about 17 years: first as a database and web developer, then as a data analyst, and for the last 5 years, as a data scientist. SQL (Structured Query Language) has been a key tool for me throughout-whether I was working with MS Access, MS SQL Server, MySQL, Oracle, or Redshift databases, and whether I was summarizing data into reporting views in a data mart, extracting data to use in a data visualization tool like Tableau, or preparing a dataset for a machine learning project.
Since SQL is a tool I have used throughout my career, and because creating and retrieving datasets for analysis has been such an integral part of my job as a data scientist, I was surprised to learn that some data scientists don't know SQL or don't regularly write SQL code. But in an informal Twitter poll I conducted, which received responses from 979 data scientists, 19% of them reported wanting to learn, or learn more, SQL (74% reported already using SQL professionally). Additionally, 55% of 713 respondents who were working toward becoming data scientists said they wanted to learn, or learn more, SQL. So, my target audience had an interest in this topic.
According to an analysis of online job postings conducted by Jeff Hale of Towards Data Science, SQL is in the top three technology skills that data scientist jobs require. (See towardsdatascience.com/the-most-in-demand-skills-for-data-scientists-4a4a8db896db.) In an Indeed BeSeen article, Joy Garza lists SQL as one of the top-five in-demand tech skills for data scientists. (See https://web.archive.org/web/20200624031802/https://www.beseen.com/blog/talent/data-scientist-skills/.)
After learning how many working and prospective data scientists wanted to learn SQL, and how much of a need there is in the industry for people who know how to use it, SQL dataset development started to move to the top of the list of topics I could share my knowledge of with others.
There are many SQL books on the market that can be used to learn query syntax and advanced SQL functions-after all, the language has been around for 45 years and has been standardized since the late 1980s-but I hadn't found any definitive resources to refer people to when they asked me if I knew of any books that taught how to use SQL to construct datasets for machine learning, so I decided to write this book to cover SQL from a data scientist's point of view.
So, my goal in writing this book is not only to teach you how to write SQL code but to teach you how to think about summarizing data into analytical datasets that can be used for reports and machine learning: to use SQL like a data scientist does. Like I do.
Who This Book Is For
SQL for Data Scientists is designed to be a learning resource for anyone who wants to become (or who already is) a data analyst or data scientist, and wants to be able to pull data from databases to build their own datasets without having to rely on others in the organization to query the source system and transform it into flat files (or spreadsheets) for them.
There are plenty of SQL books out there, but many are either written as syntax references or written for people in other roles that create, query from, and maintain databases. However, this book is written from the perspective of a data scientist and is aimed at those who will primarily be extracting data from existing databases in order to generate datasets for analysis.
I won't assume that you've ever written SQL queries before, and we'll start with the basics, but I do assume that you have some basic understanding of what databases are and a general idea of how data might be used in reports, analyses, and machine learning algorithms. This book is meant to fill in the steps between finding a database that contains the data you need and starting the analysis. I aim to teach you how to think about structuring datasets for analysis and how to use SQL to extract the data from the database and get it into that form.
Why You Should Learn SQL if You Want to Be a Data Scientist
If you can use SQL to pull your own datasets, you don't have to rely on others in your organization to pull it for you, enabling you to work more efficiently. Requesting datasets usually involves a process of filling out a form or ticket describing in detail what data you need, waiting for your request to be fulfilled, then often clarifying your request after seeing the initial results, and then waiting again for modifications. If you can edit your own queries, you can not only design and retrieve your own datasets but then also adjust calculations or add fields as needed.
Additionally, running a SQL query that writes to a database table or exports to a file-effectively snapshotting the data in the form you need it in for your analysis-means you don't have to retrieve and reprocess the data in your machine learning script every time you run your code, speeding up the usually iterative model development process.
Some summaries and calculations can be done more efficiently in SQL than in other types of code, as well, so even if you are running the queries "live" each time you run your script, you may be able to lower the computational cost of your code by doing some of the transformations in SQL.
Finally, because it is a high-demand tech skill in data scientist job postings, learning SQL will increase your marketability and value to employers.
What I Hope You Gain from This Book
My goal is that by the time you finish reading this book and practicing the queries within (ideally both on the provided example database and on another database of your choosing, so you have to modify the example queries and apply them in another context), you will be able to think through the process of creating an analytical dataset and develop the SQL code necessary to generate your intended output.
I hope that even if you end up needing to use a SQL function that's not covered in this book, you will have gained enough baseline knowledge from the book to go look it up online and determine how to best use it in the query you are developing.
I also hope that this book will help you feel confident that you can pull your own data at work and get it into the form you need it in for your report or model without having to wait on others to do it for you.
Conventions
This book uses MySQL version 8.0-style SQL. No matter what type of database system you use (MS SQL Server, Redshift, PostgreSQL, Oracle, etc.), the query design concepts and syntax are very similar, when not identical across platforms. So, if you work with a database system other than MySQL, you might have to search for the equivalent code syntax for a few functions in the book, but the overall dataset design concepts are platform-independent, and the SQL keywords are cross-platform standards.
When you see code displayed in the following style:
SELECT * FROM Product that means it is a complete SQL query that you can use to select data from the Farmer's Market database described in Chapter 1, "Data Sources." If you're reading the printed version of this book, you can go to the book's website to get digital versions of the queries that you can copy and paste to try them out yourself.
Reserved SQL keywords like SELECT will appear in all-uppercase throughout the book, and column names will appear in all-lowercase. This isn't a requirement of SQL syntax (neither are line breaks), but is a convention used for readability.
Be aware that the Farmer's Market database will continue to evolve, and I will likely continue adding rows to its tables after this book goes to print, so the data values you see in the output when you run the queries yourself may not exactly match the screenshots included in the printed book.
Reader Support for This Book
Companion Download Files
As you work through the examples in this book, you may choose either to type in all the code manually or to use the source code files that accompany the book. All the source code used in this book, along with the Farmer's Market database, is available for download from both sqlfordatascientists.com and www.wiley.com/go/sqlfordatascientists.
How to Contact the Publisher
If you believe you've found a mistake in this book, please bring it to our attention. At John Wiley & Sons, we understand how important it is to provide our customers with accurate content, but even with our best efforts an error may occur.
In order to submit your possible errata, please email it to our Customer Service Team at wileysupport@wiley.com with the subject...
System requirements
File format: ePUB
Copy protection: Adobe-DRM (Digital Rights Management)
System requirements:
- Computer (Windows; MacOS X; Linux): Install the free reader Adobe Digital Editions prior to download (see eBook Help).
- Tablet/smartphone (Android; iOS): Install the free app Adobe Digital Editions or the app PocketBook before downloading (see eBook Help).
- E-reader: Bookeen, Kobo, Pocketbook, Sony, Tolino and many more (not Kindle).
The file format ePub works well for novels and non-fiction books – i.e., „flowing” text without complex layout. On an e-reader or smartphone, line and page breaks automatically adjust to fit the small displays.
This eBook uses Adobe-DRM, a „hard” copy protection. If the necessary requirements are not met, unfortunately you will not be able to open the eBook. You will therefore need to prepare your reading hardware before downloading.
Please note: We strongly recommend that you authorise using your personal Adobe ID after installation of any reading software.
For more information, see our ebook Help page.