Data Science Essentials For Dummies

Brand: Wiley
Price: 10.99 EUR
Availability: OnlineOnly

Lillian Pierson (author)

Wiley (publisher)

1st edition

Published November 13, 2024

264 pages

E-book

ePUB with Adobe DRM

978-1-394-29701-6 (ISBN)

10.99 € incl. 7% VAT

E-book single license

Available for download

Contents

Introduction

About This Book

Foolish Assumptions

Icons Used in This Book

Where to Go from Here

Chapter 1: Wrapping Your Head Around Data Science

Seeing Who Can Make Use of Data Science

Inspecting the Pieces of the Data Science Puzzle

Collecting, querying, and consuming data

Applying mathematical modeling to data science tasks

Deriving insights from statistical methods

Coding, coding, coding - it's just part of the game

Applying data science to a subject area

Communicating data insights

Chapter 2: Tapping into Critical Aspects of Data Engineering

Defining the Three Vs

Grappling with data volume

Handling data velocity

Dealing with data variety

Identifying Important Data Sources

Grasping the Differences among Data Approaches

Defining data science

Defining machine learning engineering

Defining data engineering

Comparing machine learning engineers, data scientists, and data engineers

Storing and Processing Data for Data Science

Storing data and doing data science directly in the cloud

Processing data in real-time

Recognizing the Impact of Generative AI

The reshaping of data engineering

Tools and frameworks for supporting AI workloads

Chapter 3: Using a Machine to Learn from Data

Defining Machine Learning and Its Processes

Walking through the steps of the machine learning process

Becoming familiar with machine learning terms

Considering Learning Styles

Learning with supervised algorithms

Learning with unsupervised algorithms

Learning with reinforcement

Seeing What You Can Do

Selecting algorithms based on function

Generating real-time analytics with Spark

Chapter 4: Math, Probability, and Statistical Modeling

Exploring Probability and Inferential Statistics

Probability distributions

Conditional probability with Naïve Bayes

Quantifying Correlation

Calculating correlation with Pearson's r

Ranking variable pairs using Spearman's rank correlation

Reducing Data Dimensionality with Linear Algebra

Decomposing data to reduce dimensionality

Reducing dimensionality with factor analysis

Decreasing dimensionality and removing outliers with PCA

Modeling Decisions with Multiple Criteria Decision-Making

Turning to traditional MCDM

Focusing on fuzzy MCDM

Introducing Regression Methods

Linear regression

Logistic regression

Ordinary least squares regression methods

Detecting Outliers

Analyzing extreme values

Detecting outliers with univariate analysis

Detecting outliers with multivariate analysis

Introducing Time Series Analysis

Identifying patterns in time series

Modeling univariate time series data

Chapter 5: Grouping Your Way into Accurate Predictions

Starting with Clustering Basics

Getting to know clustering algorithms

Examining clustering similarity metrics

Identifying Clusters in Your Data

Clustering with the k-means algorithm

Estimating clusters with kernel density estimation

Clustering with hierarchical algorithms

Dabbling in the DBScan neighborhood

Categorizing Data with Decision Tree and Random Forest Algorithms

Drawing a Line between Clustering and Classification

Introducing instance-based learning classifiers

Getting to know classification algorithms

Making Sense of Data with Nearest Neighbor Analysis

Classifying Data with Average Nearest Neighbor Algorithms

Classifying with K-Nearest Neighbor Algorithms

Understanding how the k-nearest neighbor algorithm works

Knowing when to use the k-nearest neighbor algorithm

Exploring common applications of k-nearest neighbor algorithms

Solving Real-World Problems with Nearest Neighbor Algorithms

Seeing k-nearest neighbor algorithms in action

Seeing average nearest neighbor algorithms in action

Chapter 6: Coding Up Data Insights and Decision Engines

Seeing Where Python Fits into Your Data Science Strategy

Using Python for Data Science

Sorting out the various Python data types

Putting loops to good use in Python

Having fun with functions

Keeping cool with classes

Checking out some useful Python libraries

Chapter 7: Generating Insights with Software Applications

Choosing the Best Tools for Your Data Science Strategy

Getting a Handle on SQL and Relational Databases

Investing Some Effort into Database Design

Defining data types

Designing constraints properly

Normalizing your database

Narrowing the Focus with SQL Functions

Making Life Easier with Excel

Using Excel to quickly get to know your data

Reformatting and summarizing with PivotTables

Automating Excel tasks with macros

Chapter 8: Telling Powerful Stories with Data

Data Visualizations: The Big Three

Data storytelling for decision-makers

Data showcasing for analysts

Designing data art for activists

Designing to Meet the Needs of Your Target Audience

Step 1: Brainstorm (All about Eve)

Step 2: Define the purpose

Step 3: Choose the most functional visualization type for your purpose

Picking the Most Appropriate Design Style

Inducing a calculating, exacting response

Eliciting a strong emotional response

Selecting the Appropriate Data Graphic Type

Standard chart graphics

Comparative graphics

Statistical plots

Topology structures

Spatial plots and maps

Testing Data Graphics

Adding Context

Creating context with data

Creating context with annotations

Creating context with graphical elements

Chapter 9: Ten Free or Low-Cost Data Science Libraries and Platforms

Scraping the Web with Beautiful Soup

Wrangling Data with pandas

Visualizing Data with Looker Studio

Machine Learning with scikit-learn

Creating Interactive Dashboards with Streamlit

Doing Geospatial Data Visualization with Kepler.gl

Making Charts with Tableau Public

Doing Web-Based Data Visualization with RAWGraphs

Making Cool Infographics with Infogram

Making Cool Infographics with Canva

Index

Chapter 1

Wrapping Your Head Around Data Science

IN THIS CHAPTER

Deploying data science methods across various industries

Piecing together the core data science components

Identifying viable data science solutions to business challenges

Exploring data science career alternatives

For over a decade now, everyone has been absolutely deluged by data. It's coming from every computer, every mobile device, every camera, and every imaginable sensor - and now it's even coming from watches and other wearable technologies. Data is generated in every social media interaction we humans make, every file we save, every picture we take, and every query we submit; data is even generated when we do something as simple as ask a favorite search engine for directions to the closest ice cream shop.

If you're anything like I was, you may have wondered, "What's the point of all this data? Why use valuable resources to generate and collect it?" Although just two decades ago no one was in a position to make much use of most of the data being generated, the tide has definitely turned. Specialists known as data engineers are constantly finding innovative and powerful new ways to capture, collate, and condense unimaginably massive volumes of data. Other specialists known as data scientists are leading change by deriving valuable and actionable insights from that data.

In its truest form, data science represents the optimization of processes and resources. Data science produces data insights - actionable, data-informed conclusions or predictions that you can use to understand and improve your business, your investments, your health, and even your lifestyle and social life. Using data science insights is like being able to see in the dark. For any goal or pursuit you can imagine, you can find data science methods to help you predict the most direct route from where you are to where you want to be - and to anticipate every pothole in the road between the two.

In this chapter, I explain the difference between data science and data engineering.

Seeing Who Can Make Use of Data Science

The terms data science and data engineering are often misused and confused, so let me start off by clarifying that these two fields are, in fact, separate and distinct domains of expertise. Data science is the computational science of extracting meaningful insights from raw data and then effectively communicating those insights to generate value. Data engineering, on the other hand, is an engineering domain that's dedicated to building and maintaining systems that overcome data processing bottlenecks and data handling problems for applications that consume, process, and store large volumes, varieties, and velocities of data.

In both data science and data engineering, you commonly work with the following types of data:

Structured data: Data that is stored, processed, and manipulated in a traditional relational database management system (RDBMS). An example of this type of data can be seen in the tabular schema of rows and columns you'd commonly encounter when working with corporate databases.
Unstructured data: Data that is commonly generated from human activities and doesn't fit into a structured database format. Examples of unstructured data include emails, Microsoft Word documents, and audio or video files.
Semistructured data: Data that doesn't fit into a structured database system but is nonetheless organizable by tags that are useful for creating a form of order and hierarchy in the data. XML and JSON files are examples of data that comes in semistructured form.
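The three data types above can be illustrated with a minimal Python sketch. The customer records, field names, and email text here are invented for illustration and do not come from the book; the sketch assumes only Python's standard csv and json modules.

```python
import csv
import io
import json

# Structured: every record shares the same columns, like rows in an RDBMS table.
# (Hypothetical sample data.)
structured_csv = "customer_id,name,city\n1,Ada,London\n2,Grace,New York\n"
rows = list(csv.DictReader(io.StringIO(structured_csv)))

# Semistructured: tags (keys) give the data order and hierarchy,
# but individual records are free to carry different fields.
semistructured_json = """
[
  {"customer_id": 1, "name": "Ada", "orders": [{"sku": "A-100", "qty": 2}]},
  {"customer_id": 2, "name": "Grace", "tags": ["vip"]}
]
"""
records = json.loads(semistructured_json)

# Unstructured: free text with no fields or tags at all.
unstructured_email = "Hi team, the shipment to London arrived late again. Thanks, Ada"


def field_names(record):
    """Return the sorted field names of one record."""
    return sorted(record.keys())


# Count how many distinct record layouts each data set contains.
structured_layouts = {tuple(field_names(r)) for r in rows}
semistructured_layouts = {tuple(field_names(r)) for r in records}

print("Structured layouts:", len(structured_layouts))
print("Semistructured layouts:", len(semistructured_layouts))
```

The structured rows all share one layout, whereas the semistructured records do not, even though both can be navigated by name. The email string offers no such handles, which is why unstructured data typically needs extra processing before analysis.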

In the past, only large tech companies with massive funding had the skills and computing resources required to implement data science methodologies to optimize and improve their business, but that hasn't been the case for quite a while now. The proliferation of data has created a demand for insights, and this demand is embedded in many aspects of modern culture - from the Uber passenger who expects the driver to show up exactly at the time and location predicted by the Uber app to the online shopper who expects the Amazon platform to recommend the best product alternatives for comparing similar goods before making a purchase. Data and the need for data-informed insights are ubiquitous. Because organizations of all sizes are beginning to recognize that they're immersed in a sink-or-swim, data-driven, competitive environment, data know-how has emerged as a core and requisite function in almost every line of business.

What does this mean for the average knowledge worker? It means that everyday employees are increasingly expected to support a progressively advancing set of technological and data requirements. Why? Because almost all industries are reliant on data technologies and the insights they spur. Consequently, many people are in continuous need of upgrading their data skills, or else they face the real possibility of being replaced by a more data-savvy employee.

The good news is that upgrading data skills doesn't usually require people to go back to college or earn a university degree in statistics, computer science, or data science. The bad news is that, even with professional training or self-teaching, it always takes extra work to stay industry-relevant and tech-savvy. In this respect, the data revolution isn't so different from any other change that has hit industry in the past. The fact is, in order to stay relevant, you need to take the time and effort to acquire the skills that keep you current. When you're learning how to do data science, you can take some courses, educate yourself using online resources, read books like this one, and attend events where you can learn what you need to know to stay on top of the game.

Who can use data science? You can. Your organization can. Your employer can. Anyone who has a bit of understanding and training can begin using data insights to improve their lives, their careers, and the well-being of their businesses. Data science represents a change in the way you approach the world. When determining outcomes, people used to make their best guess, act on that guess, and then hope for the desired result. With data insights, however, people now have access to the predictive vision that they need to truly drive change and achieve the results they want.

Here are some examples of ways you can use data insights to make the world, and your company, a better place:

Develop key performance indicators (KPIs) for your business systems. Use KPIs to track performance and optimize the return on investment (ROI) for measurable business activities.
Develop your marketing strategy. Use data insights and predictive analytics to identify marketing strategies that work, eliminate underperforming efforts, and test new marketing strategies.
Keep communities safe. Predictive policing applications help law enforcement personnel predict and prevent local criminal activities.
Help make the world a better place for those less fortunate. Data scientists in developing nations are using social data, mobile data, and data from websites to generate real-time analytics that improve the effectiveness of humanitarian responses to disasters, epidemics, food scarcity issues, and more.

Inspecting the Pieces of the Data Science Puzzle

To practice data science, in the true meaning of the term, you need the analytical know-how of math and statistics, the coding skills necessary to work with data, and an area of subject matter expertise. Without the subject matter expertise, you may as well call yourself a mathematician or a statistician. Similarly, a programmer without subject matter expertise and analytical know-how may better be considered a software engineer or developer, but not a data scientist.

The need for data-informed business and product strategy has been increasing rapidly for about a decade now, pushing all business sectors and industries to adopt a data science approach. As such, different flavors of data science have emerged. The following are just a few titles under which experts in every discipline are required to know and regularly do data science:

Clinical biostatistician
Data and tech policy analyst
Data scientist - geospatial and agriculture analyst
Data scientist - health care
Digital banking product owner
Director of data science - advertising technology
Geotechnical data scientist
Global channel ops - data excellence lead

Nowadays, it's almost impossible to differentiate between a proper data scientist and a subject matter expert (SME) whose success depends heavily on their ability to use data science to generate insights. Looking at a person's job title may or may not be helpful, simply because many roles are titled data scientist when they may as well be labeled data strategist or product manager, based on the actual requirements. In addition, many knowledge workers do data science daily without working under the title of data scientist. The title is an overhyped, often misleading label that's not always helpful if you're trying to find out what a data scientist does by looking at online job boards.

To shed some light,...
