Data Wrangling on AWS

Name: Data Wrangling on AWS | Clean and organize complex data for analysis
Brand: De Gruyter
Price: 28.79 EUR
Availability: OnlineOnly

Clean and organize complex data for analysis

Navnit Shukla Sankar M Sam Palani(Autor*in)

De Gruyter (Verlag)

1. Auflage

Erschienen am 14. August 2023

420 Seiten

E-Book

ePUB mit Adobe-DRM

Systemvoraussetzungen

978-1-80181-766-0 (ISBN)

28,79 €inkl. 7% MwSt.

Systemvoraussetzungen

für ePUB mit Adobe-DRM

E-Book Einzellizenz

Als Download verfügbar

Beschreibung

Weitere Details

Weitere Ausgaben

Personen

Inhalt

Table of Contents - Introduction to Data Wrangling on AWS
- Working with AWS GlueDataBrew
- Introducing AWS Data Wrangler
- Introducing Amazon SageMaker Data Wrangler
- Working with Amazon S3
- Working with AWS Glue
- Working with Athena
- Working with Quicksight
- Perform Pandas operation with AWS Data Wrangler
- Optimizing ML data with AWS SageMaker Data Wrangler
- Security and Monitoring

1 Getting Started with Data Wrangling

In the introductory section of this book, we listed use cases regarding how organizations use data to bring value to customers. Apart from that, organizations collect a lot of other data so that they can understand the finances of customers, which helps them share it with stakeholders, including log data for security, system health checks, and customer data, which is required for working on use cases such as Customer 360s.

We talked about all these use cases and how collecting data from different data sources is required to solve them. However, from collecting data to solving these business use cases, one very important step is to clean the data. That is where data wrangling comes into the picture.

In this chapter, we are going to learn the basics of data wrangling and cover the following topics:

Introducing data wrangling
The steps involved in data wrangling
Best practices for data wrangling
Options available within Amazon Web Services (AWS) to perform data wrangling

Introducing data wrangling

For organizations to become data-driven to provide value to customers or make more informed business decisions, they need to collect a lot of data from different data sources such as clickstreams, log data, transactional systems, and flat files and store them in different data stores such as data lakes, databases, and data warehouses as raw data. Once this data is stored in different data stores, it needs to be cleansed, transformed, organized, and joined from different data sources to provide more meaningful information to downstream applications such as machine learning models to provide product recommendations or look for traffic conditions. Alternatively, it can be used by business or data analytics to extract meaningful business information:

Figure 1.1: Data pipeline

The 80-20 rule of data analysis

When organizations collect data from different data sources, it is not of much use initially. It is estimated that data scientists spend about 80% of their time cleaning data. This means that only 20% of their time will be spent analyzing and creating insights from the data science process:

Figure 1.2: Work distribution of a data scientist

Now that we understand the basic concept of data wrangling, we'll learn why it is essential, and the various benefits we get from it.

Advantages of data wrangling

If we go back to the analogy of oil, when we first extract it, it is in the form of crude oil, which is not of much use. To make it useful, it has to go through a refinery, where the crude oil is put in a distillation unit. In this distillation process, the liquids and vapors are separated into petroleum components called fractions according to their boiling points. Heavy fractions are on the bottom while light fractions are on the top, as seen here:

Figure 1.3: Crude oil processing

The following figure showcases how oil processing correlates to the data wrangling process:

Figure 1.4: The data wrangling process

Data wrangling brings many advantages:

Enhanced data quality: Data wrangling helps improve the overall quality of the data. It involves identifying and handling missing values, outliers, inconsistencies, and errors. By addressing these issues, data wrangling ensures that the data used for analysis is accurate and reliable, leading to more robust and trustworthy results.
Improved data consistency: Raw data often comes from various sources or in different formats, resulting in inconsistencies in naming conventions, units of measurement, or data structure. Data wrangling allows you to standardize and harmonize the data, ensuring consistency across the dataset. Consistent data enables easier integration and comparison of information, facilitating effective analysis and interpretation.
Increased data completeness: Incomplete data can pose challenges during analysis and modeling. Data wrangling methods allow you to handle missing data by applying techniques such as imputation, where missing values are estimated or filled in based on existing information. By dealing with missing data appropriately, data wrangling helps ensure a more complete dataset, reducing potential biases and improving the accuracy of analyses.
Facilitates data integration: Organizations often have data spread across multiple systems and sources, making integration a complex task. Data wrangling helps in merging and integrating data from various sources, allowing analysts to work with a unified dataset. This integration facilitates a holistic view of the data, enabling comprehensive analyses and insights that might not be possible when working with fragmented data.
Streamlined data transformation: Data wrangling provides the tools and techniques to transform raw data into a format suitable for analysis. This transformation includes tasks such as data normalization, aggregation, filtering, and reformatting. By streamlining these processes, data wrangling simplifies the data preparation stage, saving time and effort for analysts and enabling them to focus more on the actual analysis and interpret the results.
Enables effective feature engineering: Feature engineering involves creating new derived variables or transforming existing variables to improve the performance of machine learning models. Data wrangling provides a foundation for feature engineering by preparing the data in a way that allows for meaningful transformations. By performing tasks such as scaling, encoding categorical variables, or creating interaction terms, data wrangling helps derive informative features that enhance the predictive power of models.
Supports data exploration and visualization: Data wrangling often involves exploratory data analysis (EDA), where analysts gain insights and understand patterns in the data before formal modeling. By cleaning and preparing the data, data wrangling enables effective data exploration, helping analysts uncover relationships, identify trends, and visualize the data using charts, graphs, or other visual representations. These exploratory steps are crucial for forming hypotheses, making data-driven decisions, and communicating insights effectively.

Now that we have learned about the advantages of data wrangling, let's understand the steps involved in the data wrangling process.

The steps involved in data wrangling

Similar to crude oil, raw data has to go through multiple data wrangling steps to become meaningful. In this section, we are going to learn the six-step process involved in data wrangling:

Data discovery
Data structuring
Data cleaning
Data enrichment
Data validation
Data publishing

Before we begin, it's important to understand these activities may or may not need to be followed sequentially, or in some cases, you may skip any of these steps.

Also, keep in mind that these steps are iterative and differ for different personas, such as data analysts, data scientists, and data engineers.

As an example, data discovery for data engineers may vary from what data discovery means for a data analyst or data scientist:

Figure 1.5: The steps of the data-wrangling process

Let's start learning about these steps in detail.

Data discovery

The first step of the data wrangling process is data discovery. This is one of the most important steps of data wrangling. In data discovery, we familiarize ourselves with the kind of data we have as raw data, what use case we are looking to solve with that data, what kind of relationships exist between the raw data, what the data format will look like, such as CSV or Parquet, what kind of tools are available for storing, transforming, and querying this data, and how we wish to organize this data, such as by folder structure, file size, partitions, and so on to make it easy to access.

Let's understand this by looking at an example.

In this example, we will try to understand how data discovery varies based on the persona. Let's assume we have two colleagues, James and Jean. James is a data engineer while Jean is a data analyst, and they both work for a car-selling company.

Jean is new to the organization and she is required to analyze car sales numbers for Southern California. She has reached out to James and asked him for data from the sales table from the production system.

Here is the data discovery process for Jane (a data...

Systemvoraussetzungen

Als PDF speichern Als Link merken