Data-Driven Modeling

Name: Data-Driven Modeling
Brand: Polity
Price: 168.99 EUR
Availability: OnlineOnly

Arindam Mondal, Souvik Ganguli (Editor)

Polity (Publisher)

1st Edition

Published on 23 December 2025

342 pages

E-Book

ePUB with Adobe-DRM

978-1-394-28790-1 (ISBN)

€168.99 incl. 7% VAT

E-Book Single Licence

Available for download

1 Fundamentals of Data Analysis and Preprocessing

Sudipta Hazra¹* and Arindam Mondal²

¹MCKV Institute of Engineering, Howrah, West Bengal, India

²Dr. B. C. Roy Engineering College, Durgapur, West Bengal, India

Abstract

This chapter proposes a general framework for data curation that covers the main stages of data preparation and preprocessing and can be adapted to many different datasets. Raw data that have not been cleaned and curated are typically unsuitable for drawing accurate conclusions. The chapter covers in detail the most widely used algorithms and strategies for data curation and preparation, including imputation, feature extraction, correlation analysis, and the real-world implementation of these algorithms. We also present methods that we have developed from our own data processing experience. Lastly, we demonstrate with a real-world example how different imputation techniques affect the efficiency and performance of a support vector machine. Overall, the chapter outlines a process for turning unstructured, unprocessed data into well-organized data that can be used with sophisticated machine learning algorithms or other advanced data analysis techniques.

Keywords: Data analysis, data preprocessing, data mining

1.1 Introduction

Research in a wide range of disciplines, including science, engineering, management, and process control, starts with data analysis (DA). Data about a given subject are collected as symbolic and numeric attributes. These data come from a variety of sources, including sensors and people, with varying levels of complexity and reliability. Analyzing them leads to a deeper understanding of the phenomenon of interest. The primary goal of any DA is therefore to find information that can be applied to decision-making or problem-solving [1]. Problems with the data, however, can make this impossible, and most of the time data errors are not detected until the DA phase begins. For instance, in the development of knowledge-based systems, DA is carried out to discover and generate new facts for building a reliable and comprehensive knowledge base. The reliability of the portion of the knowledge base created with DA techniques such as induction therefore depends on the data.

Numerous efforts are being made either to develop analysis tools or to use commercially available solutions for DA. Some of these initiatives have overlooked the fact that real-world data often have problems and that some form of data preparation is typically necessary before the data can be analyzed effectively. Data preparation features should therefore be available in research and commercial tools so that they can be used either before or during the actual DA procedure. Data preparation may have multiple goals. Apart from addressing data problems such as corrupted data and missing or irrelevant attributes, one may also want to learn more about the nature of the data or to alter their structure (for example, their level of granularity) to make them better suited to effective DA.

The authors of [2] draw a parallel between human information processing and data preprocessing (DP): "Think about the information processing mechanism used by humans" [2]. Sensory signals are received and processed through the sense organs. Low-level computing structures carry out the initial phases of analysis, after which the data are passed on to other processing structures. The processing system can be driven by events or by concepts. Event-driven processing usually works from the bottom up, looking for structures into which the input can be integrated, whereas conceptually driven processing usually works in the reverse direction, from the top down, propelled by objectives, motives, and expectations about the data.

Numerous justifications have been offered for the role of and need for DP. In modeling, the desired information can be confounded with variations in the data that result from changes in process or system conditions, as well as from the acquisition and transmission of the data. Appropriate DP can remove these effects in advance, leading to more parsimonious models. Although such models might not be more accurate predictors, they should be more reliable [3]. Fewer phenomena then need to be represented as a result of data preparation, although estimation errors might contribute to an increase in variance. In learning, DP enables users to choose which concepts to learn, how to present the DA results, and how to represent the data so that they are easier to understand and use in practice.

Preprocessing is typically the first operation performed on any batch of data. DP is time-consuming and is frequently only semiautomated. Effective methods for automatic DP are crucial because the volume of data generated by modern process monitoring and data acquisition systems keeps growing and demands more data processing [4]. In this chapter, we address common data-related problems, strategies for resolving them, and the advantages of using these strategies for data preprocessing.

1.2 Data Preprocessing

DP is the act of transforming raw data into a form that can be understood. Real-world data are often incomplete, inconsistent, noisy, or redundant. Data preparation comprises several steps that help to organize raw data into a logical format. Figure 1.1 illustrates the stages of data preparation.

Figure 1.1 Steps in DP.

Data cleaning: Data cleaning (also called data cleansing) is the process of locating incorrect and corrupt records in a record set or database table. Its main goal is to find inconsistent, erroneous, incomplete, and irrelevant data, which are then either corrected or removed.

Data integration: The primary objective of data integration is to unify data from many sources and present them in a coherent way. Conflicts that arise from merging data with different representations are resolved in this step. The process is crucial because of its many industrial and scientific applications, and its importance increases as data volumes grow exponentially.

Data transformation: Data transformation changes raw data into a form that can be understood. It consists of normalization, aggregation, and generalization. Data normalization organizes a database's tables and columns to reduce redundancy, which reduces complexity and processing time. Data aggregation creates a concise summary that gives a quicker overview. Data generalization, also known as rolling up data, builds summary databases at various levels of abstraction.

Data reduction: Data reduction is the practice of organizing and simplifying digital information, most of which is obtained by empirical and experimental methods. It entails condensing massive volumes of data into more manageable and insightful pieces.

Data discretization: Data discretization is crucial when a large amount of numerical data must be classified using only nominal values. The continuous data are divided into discrete intervals, and each interval is assigned a nominal value. In essence, it is the process of mapping continuous attributes onto a defined set of intervals with the least possible loss of information.
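The five stages described above can be illustrated with a small sketch in plain Python. It is not the chapter's own implementation: the toy records, the field names (`temp_c`, `temp_f`), and every function name are hypothetical. Several choices are assumptions made for illustration only: mean imputation as the cleaning rule, min-max scaling as the transformation step (a different sense of "normalization" from the database-design sense mentioned above), systematic sampling as the reduction step, and equal-width binning as the discretization rule.

```python
from statistics import mean

# Hypothetical toy records from two sources; all names are illustrative.
sensor_a = [
    {"id": 1, "temp_c": 21.5},
    {"id": 2, "temp_c": None},   # missing value
    {"id": 2, "temp_c": None},   # exact duplicate of the previous record
    {"id": 3, "temp_c": 24.0},
    {"id": 4, "temp_c": 30.5},
]
sensor_b = [  # second source reports Fahrenheit under a different key
    {"id": 5, "temp_f": 68.0},
    {"id": 6, "temp_f": 77.0},
]


def clean(records):
    """Data cleaning: drop exact duplicates, then mean-impute missing values."""
    seen, unique = set(), []
    for r in records:
        key = (r["id"], r["temp_c"])
        if key not in seen:
            seen.add(key)
            unique.append(dict(r))  # copy so the input is not mutated
    fill = mean(r["temp_c"] for r in unique if r["temp_c"] is not None)
    for r in unique:
        if r["temp_c"] is None:
            r["temp_c"] = fill
    return unique


def integrate(a, b):
    """Data integration: resolve the unit and field-name conflict, then merge."""
    converted = [{"id": r["id"], "temp_c": (r["temp_f"] - 32) * 5 / 9} for r in b]
    return a + converted


def normalize(values):
    """Data transformation (assumed here to be min-max scaling to [0, 1])."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant column: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def reduce_by_sampling(records, step=2):
    """Data reduction (assumed here to be systematic sampling)."""
    return records[::step]


def discretize(values, labels=("low", "medium", "high")):
    """Data discretization: equal-width bins mapped to nominal labels."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / len(labels)
    if width == 0:
        return [labels[0] for _ in values]
    # The maximum value falls on the upper edge, so clamp it into the last bin.
    return [labels[min(int((v - lo) / width), len(labels) - 1)] for v in values]


merged = integrate(clean(sensor_a), sensor_b)
temps = [r["temp_c"] for r in merged]
print(normalize(temps))
print(discretize(temps))
print(reduce_by_sampling(merged))
```

In practice the order and choice of stages depend on the dataset; for example, a median or model-based imputation may be preferable to the mean when the data contain outliers.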

Everything done before the actual DA process begins is known as data preparation. In essence, it is a transformation $T$ that creates a set of new data vectors $Y_{ij}$ from the raw real-world data vectors $X_{ik}$:

$$Y_{ij} = T(X_{ik}) \tag{1.1}$$

such that (i) $Y_{ij}$ preserves the "valuable information" in $X_{ik}$, (ii) $Y_{ij}$ eliminates at least one of the problems in $X_{ik}$, and (iii) $Y_{ij}$ is more useful than $X_{ik}$.

In the above relation:

$i = 1, \ldots, n$, where $n$ is the number of objects,
$j = 1, \ldots, m$, where $m$ is the number of features after preprocessing,
$k = 1, \ldots, l$, where $l$ is the number of attributes/features before preprocessing, and in general $m \neq l$.
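As a minimal sketch of such a transformation, the Python function below maps an n x l raw matrix X to an n x m matrix Y. The particular T chosen here is an assumption for illustration and is not taken from the chapter: it drops constant attributes, which cannot help distinguish objects, and standardizes the remaining ones. The function name `T` follows the notation above; the example data are hypothetical.

```python
from statistics import mean, pstdev


def T(X):
    """Illustrative preprocessing transformation: n x l rows -> n x m rows.

    Assumed rule: drop constant attributes, then standardize each remaining
    attribute to zero mean and unit (population) standard deviation.
    """
    n = len(X)
    n_attrs = len(X[0])  # l, the number of attributes before preprocessing
    columns = [[X[i][k] for i in range(n)] for k in range(n_attrs)]
    scaled = []
    for col in columns:
        sd = pstdev(col)
        if sd == 0:  # constant attribute: removed, so m < l in this case
            continue
        mu = mean(col)
        scaled.append([(v - mu) / sd for v in col])
    # Transpose back so that Y[i][j] is feature j of object i.
    return [[col[i] for col in scaled] for i in range(n)]


# Hypothetical raw data: n = 3 objects, l = 3 attributes (the second is constant).
X = [
    [1.0, 5.0, 10.0],
    [2.0, 5.0, 20.0],
    [3.0, 5.0, 30.0],
]
Y = T(X)
print(len(Y), len(Y[0]))  # n and m
```

Here the constant attribute is the "problem" that the transformation removes, so the output has fewer features than the input had attributes. Whether a given T also preserves the valuable information and makes the data more useful depends on the analysis that follows.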

The aims of DA are to find and present important facts, such as meaningful patterns in the data. Valuable information is knowledge that exists in the data. Meaningful information is characterized by four properties [5]: it is valid, novel, potentially useful, and ultimately understandable. Data problems are circumstances that prevent the effective use of a DA tool or that could lead to unacceptable results. Data may be preprocessed for several reasons: to resolve problems that would otherwise prevent any analysis, to understand the nature of the data and carry out a more meaningful analysis, and to extract deeper insights from a given dataset. Most applications require more than one type of data preparation. Determining the kind of preprocessing to apply is consequently an essential task.

1.2.1 Issues with Data

Real-world data are seldom free of problems. These problems are summarized in Figure 1.2 and discussed below. Problems can vary greatly in nature and severity for a...

Schweitzer Fachinformationen
