Akshay Singh*, Surender Singh and Jyotsna Rathee
Department of Information Technology, Maharaja Surajmal Institute of Technology, Janakpuri, New Delhi, India
Data wrangling is considered a crucial step of the data science lifecycle. The quality of data analysis directly depends on the quality of the data itself, and as data sources multiply at a fast pace, it is essential to organize the data for analysis. Data wrangling is the process of cleaning, structuring, and enriching raw data into the required format so that better judgments can be made in less time. It entails the manual conversion and mapping of data from one raw form to another in order to facilitate data consumption and organization, and is also known as data munging, i.e., making data "digestible." The iterative process of gathering, filtering, converting, exploring, and integrating data comes under the data wrangling pipeline. The foundation of data wrangling is data gathering: the data is extracted, parsed, and scraped before unnecessary information is removed from it. Data filtering or scrubbing removes corrupt and invalid data, keeping only the needed data. The data is then transformed from an unstructured to a more structured form and converted from one format to another; to name a few, common formats are CSV, JSON, XML, and SQL. A preanalysis of the data is performed in the data exploration step, where preliminary queries are applied to get a sense of the available data; hypotheses and statistical analyses can be formed after this basic exploration. After exploring the data, the process of integrating it begins, in which smaller pieces of data are combined to form big data. Validation rules are then applied to the data to verify its quality, consistency, and security. In the end, analysts prepare and publish the wrangled data for further analysis. Platforms available for publishing wrangled data include GitHub, Kaggle, Data Studio, personal blogs, and websites.
Keywords: Data wrangling, big data, data analysis, cleaning, structuring, validating, optimization
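To make the pipeline described in the abstract concrete, the following minimal sketch (assuming the pandas library and hypothetical file and column names) walks through gathering, filtering, transforming, exploring, and converting a small dataset from CSV to JSON:

```python
import pandas as pd

# Gather: read raw data from a CSV source (hypothetical file and columns).
raw = pd.read_csv("customers_raw.csv")

# Filter/scrub: drop corrupt rows (missing IDs) and duplicate records.
clean = raw.dropna(subset=["customer_id"]).drop_duplicates().copy()

# Transform: normalize a date column; invalid dates become NaT.
clean["signup_date"] = pd.to_datetime(clean["signup_date"], errors="coerce")

# Explore: a preliminary query to get a sense of the available data.
print(clean["country"].value_counts().head())

# Convert format: CSV -> JSON, one of the common target formats.
clean.to_json("customers_clean.json", orient="records")
```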
Raw facts and figures that carry no meaning on their own are termed data. When data are analyzed so that they convey meaning, the result is known as information. In the current scenario, we have an ample amount of data that is increasing many fold day by day, and it has to be managed and examined before it can be analyzed meaningfully. To answer analytical questions, we must first wrangle our data into the appropriate format; this wrangling of data is the most time-consuming and essential part of the work [1].
Definition 1-"Data wrangling is the process by which the data required by an application is identified, extracted, cleaned and integrated, to yield a data set that is suitable for exploration and analysis." [2]
Definition 2-"Data wrangling/data munging/data cleaning can be defined as the process of cleaning, organizing, and transforming raw data into the desired format for analysts to use for prompt decision making."
Definition 3-"Data wrangling is defined as an art of data transformation or data preparation." [3]
Definition 4-"Data wrangling term is derived and defined as a process to prepare the data for analysis with data visualization aids that accelerates the faster process." [4]
Definition 5-"Data wrangling is defined as a process of iterative data exploration and transformation that enables analysis." [1]
Although data wrangling is sometimes mistaken for ETL, the two are quite different from each other. Extract, transform, and load (ETL) techniques require manual work from professionals at different levels of the process. The volume, velocity, variety, and veracity of big data, i.e., its 4 V's, become exorbitant to handle with ETL technology [2].
We can categorize value into two sorts along a temporal dimension in any phase of life where we have to deal with data: near-term value and long-term value. We probably have a long list of questions we want to address with our data in the near future. Some of these inquiries may be ambiguous, such as "Are consumers actually shifting toward communicating with us via their mobile devices?" Other, more precise inquiries might include: "When will our clients' interactions largely originate from mobile devices rather than desktops or laptops?" Research work, projects, product sales, a company's new product launches, and other business activities can all be tackled in less time and with more efficiency using data wrangling.
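For instance, once interaction logs have been wrangled into a tidy table, a precise question such as the mobile-versus-desktop one above reduces to a short exploration query; the sketch below assumes a hypothetical log file with timestamp and device_type columns:

```python
import pandas as pd

# Hypothetical interaction log with one row per customer interaction.
interactions = pd.read_csv("interactions.csv", parse_dates=["timestamp"])

# Monthly share of interactions that originate from mobile devices.
monthly_mobile_share = (
    interactions
    .assign(month=interactions["timestamp"].dt.to_period("M"),
            is_mobile=interactions["device_type"].eq("mobile"))
    .groupby("month")["is_mobile"]
    .mean()
)

# Values above 0.5 mean mobile interactions already dominate that month.
print(monthly_mobile_share)
```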
In the first section, we demonstrate the workflow framework of all the activities that fit into the process of data wrangling by providing a workflow structure that integrates actions focused on both sorts of value. The key building blocks for this framework are introduced: data flow, data wrangling activities, roles, and responsibilities [10]. When commencing a project that involves data wrangling, we will consider all of these factors at a high level.
The main aim is to ensure that our efforts are constructive rather than redundant or conflicting, across projects as well as within a single project, by leveraging formal language and processes to boost efficiency and continuity. Effective data wrangling, however, necessitates more than just well-defined workflows and processes.
Another aspect of value to think about is how it will be delivered within an organization. Will the organization use the exact values provided to it and analyze the data with automated tools? Or will it use the values provided to it in an indirect manner, for example by allowing employees to pursue a different course of action than usual?
Data has a long history of providing indirect value. Accounting, insurance risk modeling, experimental design in medical research, and intelligence analytics are all based on it. Data used to generate reports and visualizations falls under the category of indirect value: people read our report or visualization, assimilate the information into their existing knowledge of the world, and then apply that knowledge to improve their actions. The data here has an indirect influence on other people's decisions. The majority of our data's known potential value will be delivered indirectly in the near future.
Handing decisions to data-driven systems for speed, accuracy, or customization provides direct value from data. The most common example is automated resource distribution and routing. In high-frequency trading and modern finance, this resource is primarily money. In some industries, such as at Amazon or Flipkart, physical goods are routed automatically. Hotstar and Netflix, for example, employ automated processes to optimize the distribution of digital content to their customers. On a smaller scale, antilock brakes in automobiles use sensor data to channel braking force to individual wheels. Modern testing systems, such as the GRE graduate school admission exam, dynamically order questions based on the tester's progress. In all of these situations, a considerable percentage of operational choices is handled directly by data-driven systems, with no human input.
In order to derive direct, automated value from our data, we must first derive indirect, human-mediated value. To begin with, human monitoring is essential to determine what is "in" our data and whether its quality is high enough for it to be used in direct, automated ways. We cannot expect valuable outcomes from blindly feeding data into an automated system. To fully comprehend the possibilities of the data, reports must be written and studied. As the potential of the data becomes clearer, automated methods can be built to utilize it directly. This is the logical evolution of datasets: from immediate solutions to identified problems, to longer-term analyses of a dataset's fundamental quality and potential applications, and finally to automated systems that use the data directly. At the heart of this progression is the passage of data through three primary data stages: raw, refined, and production.
In the raw data stage, there are three main actions: data input, generic metadata creation, and proprietary metadata creation. As illustrated in Figure 1.1, based on their production, we can classify these actions into two groups. The two...

Figure 1.1 Actions in the raw data stage.
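As a rough illustration of the generic metadata creation action, the sketch below (assuming pandas and a hypothetical raw CSV file) records structural properties such as file size, record count, and field types:

```python
import os
import pandas as pd

path = "sales_raw.csv"            # hypothetical raw input file
raw = pd.read_csv(path)

# Generic metadata: properties recorded regardless of what the data means.
generic_metadata = {
    "file_size_bytes": os.path.getsize(path),
    "num_records": len(raw),
    "num_fields": raw.shape[1],
    "field_types": raw.dtypes.astype(str).to_dict(),
    "missing_values_per_field": raw.isna().sum().to_dict(),
}
print(generic_metadata)
```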