
Data Engineering and Data Science
Description
Written and edited by one of the most prolific and well-known experts in the field and his team, this exciting new volume is the "one-stop shop" for the concepts and applications of data science and engineering for data scientists across many industries.
The field of data science is incredibly broad, encompassing everything from cleaning data to deploying predictive models. However, it is rare for any single data scientist to be working across the spectrum day to day. Data scientists usually focus on a few areas and are complemented by a team of other scientists and analysts. Data engineering is also a broad field, but any individual data engineer doesn't need to know the whole spectrum of skills. Data engineering is the aspect of data science that focuses on practical applications of data collection and analysis. For all the work that data scientists do to answer questions using large sets of information, there have to be mechanisms for collecting and validating that information.
In this exciting new volume, the team of editors and contributors sketch the broad outlines of data engineering, then walk through more specific descriptions that illustrate specific data engineering roles. Data-driven discovery is revolutionizing the modeling, prediction, and control of complex systems. This book brings together machine learning, engineering mathematics, and mathematical physics to integrate modeling and control of dynamical systems with modern methods in data science. It highlights many of the recent advances in scientific computing that enable data-driven methods to be applied to a diverse range of complex systems, such as turbulence, the brain, climate, epidemiology, finance, robotics, and autonomy. Whether for the veteran engineer or scientist working in the field or laboratory, or the student or academic, this is a must-have for any library.

Persons
Kukatlapalli Pradeep Kumar, PhD, is an associate professor and the Program Coordinator for Data Science at Christ University, Bangalore, India. He has 13 years of research and academic experience. He has published in many journals and presented numerous conference papers.
Aynur Unal, PhD, educated at Stanford University (class of '73), has taught at Stanford University for almost 40 years and established the Acoustics Institute. Her work on "New Transform Domains for the Onset of Failures" received a prestigious research award.
Vinay Jha Pillai, PhD, is an associate professor in the Department of Electronics and Communication Engineering at CHRIST University, Bangalore, India. He has 12 years of academic experience and holds two patents. He has also completed two funded projects as principal investigator.
Hari Murthy, PhD, is a faculty member in the Department of Electronics and Communication Engineering, CHRIST University, Bengaluru, India. He finished his PhD from the University of Canterbury, New Zealand where his thesis was on novel anticorrosion materials. He has authored book chapters and published papers in international journals and conferences and has served as part of the program committees for several international conferences.
M. Niranjanamurthy, PhD, is an assistant professor in the Department of Computer Applications, M S Ramaiah Institute of Technology, Bangalore, Karnataka. He earned his PhD in computer science at JJTU, Rajasthan, India. He has over 11 years of teaching experience and two years of industry experience as a software engineer. He has published several books, and he is working on numerous books for Scrivener Publishing. He has published over 60 papers for scholarly journals and conferences, and he is working as a reviewer in 22 scientific journals. He also has numerous awards to his credit.
Content
Preface
1 Quality Assurance in Data Science: Need, Challenges and Focus
Jasmine K.S., Ajay D. K. and Aditya Raj
1.1 Introduction
1.2 Testing and Quality Assurance
1.3 Product Quality and Test Efforts
1.4 Data Masking in Data Model and Associated Risks
1.5 Prediction in Data Science
1.6 Role of Metrics in Evaluation
1.7 Quantity of Data in Quality Assurance
1.8 Identifying the Right Data Sources
1.9 Conclusion
2 Design and Implementation of Social Media Mining -- Knowledge Discovery Methods for Effective Digital Marketing Strategies
Prashant Bhat and Pradnya Malaganve
2.1 Introduction
2.2 Literature Review
2.3 Novel Framework for Social Media Data Mining and Knowledge Discovery
2.4 Classification for Comparison Analysis
2.5 Clustering Methodology to Provide Digital Marketing Strategies
2.6 Experimental Results
2.7 Conclusion
3 A Study on Big Data Engineering Using Cloud Data Warehouse
Manjunath T. N., Pushpa S. K., Ravindra S. Hegadi and Ananya Hathwar K. S.
3.1 Introduction
3.2 Comparison Study of Different Cloud Data Warehouses
3.3 Snowflake Cloud Data Warehouse
3.4 Google BigQuery Cloud Data Warehouse
3.5 Microsoft Azure Synapse Cloud Data Warehouse
3.6 Informatica Intelligent Cloud Services (IICS)
3.7 Conclusion
4 Data Mining with Cluster Analysis Through Partitioning Approach of Huge Transaction Data
Sampath Kini K. and Karthik Pai B.H.
4.1 Introduction
4.2 Methodology Used in Proposed Cluster Analysis System
4.3 Literature Survey on Existing Systems
4.4 Conclusion
5 Application of Data Science in Macromodeling of Nonlinear Dynamical Systems
Nagaraj S., Seshachalam D. and Jayalatha G.
5.1 Introduction
5.2 Nonlinear Autonomous Dynamical System
5.3 Nonlinear System - MOR
5.4 Data Science Life Cycle
5.5 Artificial Neural Network in Modeling
5.6 Neuron Spiking Model Using FitzHugh-Nagumo (F-N) System
5.7 Ring Oscillator Model
5.8 Nonlinear VLSI Interconnect Model Using Telegraph Equation
5.9 Macromodel Using Machine Learning
5.10 MOR of Dynamical Systems Using POD-ANN
5.11 Numerical Results
5.12 Conclusion
6 Comparative Analysis of Various Ensemble Approaches for Web Page Classification
J. Dutta, Yong Woon Kim and Dalia Dominic
6.1 Introduction
6.2 Literature Survey
6.3 Material and Methods
6.4 Ensemble Classifiers
6.5 Results
6.6 Conclusion
7 Feature Engineering and Selection Approach Over Malicious Image
P.M. Kavitha and B. Muruganantham
7.1 Introduction
7.2 Feature Engineering Techniques
7.3 Malicious Feature Engineering
7.4 Image Processing Technique
7.5 Image Processing Techniques for Analysis on Malicious Images
7.6 Conclusion
8 Cubic-Regression and Likelihood Based Boosting GAM to Model Drug Sensitivity for Glioblastoma
Satyawant Kumar, Vinai George Biju, Ho-Kyoung Lee and Blessy Baby Mathew
8.1 Introduction
8.2 Literature Survey
8.3 Materials and Methods
8.4 Evaluations, Results and Discussions
9 Unobtrusive Engagement Detection through Semantic Pose Estimation and Lightweight ResNet for an Online Class Environment
Michael Moses Thiruthuvanathan, Balachandran Krishnan and Madhavi Rangaswamy
9.1 Introduction
9.2 Related Work
9.3 Proposed Methodology
9.4 Experimentation
9.5 Results and Discussions
10 Building Rule Base for Decision Making -- A Fuzzy-Rough Approach
Sabu M. K., Neeraj Krishna M. S. and Reshmi R.
10.1 Introduction
10.2 Literature Review
10.3 Discretization of the Dataset Using Fuzzy Set Theory
10.4 Description of the Dataset
10.5 Process Involved in Proposed Work
10.6 Experiment
10.7 Evaluation Result
10.8 Discussion
11 An Effective Machine Learning Approach to Model Healthcare Data
Shaila H. Koppad, S. Anupama Kumar and Mohan Kumar
11.1 Introduction
11.2 Types of Data in Healthcare
11.3 Big Data in Healthcare
11.4 Different V's of Big Data
11.5 About COPD
11.6 Methodology Implemented
12 Recommendation Engine for Retail Domain Using Machine Learning Techniques
Chandrashekhara K. T., Gireesh Babu C. N. and Thungamani M.
12.1 Introduction
12.2 Proposed System
12.3 Results
12.3.1 ARIMA Forecasting
12.4 Conclusion
13 Mining Heterogeneous Lung Cancer from Computer Tomography (CT) Scan with the Confusion Matrix
Denny Dominic and Krishnan Balachandran
13.1 Introduction
13.2 Literature Review
13.3 Methodology
13.4 Result
13.5 Conclusion and Future Scope
References
14 ML Algorithms and Their Approach on COVID-19 Data Analysis
Kambaluru Ashok, Penumalli Anvesh Reddy and Kukatlapalli Pradeep Kumar
14.1 Introduction
14.2 DataSet
14.3 Types of Machine Learning Algorithms
14.4 Conclusion
15 Analysis and Design for the Early Stage Detection of Lung Diseases Using Machine Learning Algorithms
Sindhu Madhuri, Mahesh T. R., Vivek V., Shashikala H. K. and C. Saravanan
15.1 Introduction
15.2 Machine Learning Algorithms
15.3 Evaluation Metrics and Comparative Results for Early Detection of Lung Diseases
15.4 Conclusion
16 Estimation of Cancer Risk through Artificial Neural Network
K. Aditya Shastry, Sanjay H. A., Balaji N. and Karthik Pai B. H.
16.1 Introduction
16.2 Case Studies Related to Cancer Risk Estimation Using ANN
16.3 Datasets Used in Cancer Risk Estimation
16.4 Discussion
16.5 Future Scope
16.6 Conclusion
17 Applications and Advancements in Data Science and Analytics
T. Mamatha, A. Balaram, B. Rama Subba Reddy, C. Shoba Bindu and M. Niranjanamurthy
17.1 Data Science and Analytics in Software Testing
17.2 Applications of Data Science and Analytics
17.3 Selenium Testing Tool in Data Science
17.4 Challenges and Advancements in Data Science
17.5 Data Science and Analytics Tools
17.6 Conclusion
References
About the Editors
Index
1
Quality Assurance in Data Science: Need, Challenges and Focus
Jasmine K.S.*, Ajay D. K. and Aditya Raj
Dept. of MCA, RV College of Engineering®, Bengaluru, Karnataka, India
Abstract
It is widely accepted that product quality is assured through testing. With the rapid development of data science, research continues into the proper management of data; used in the right way, data lets test engineers learn about their users. The associated risks can be predicted, with a focus on data masking based on the data model. Prescriptive and predictive analysis can be made more accurate if suitable techniques are developed and their accuracy is measured using metrics. Preparing data of the required quality and identifying the possible resources are challenging tasks for a data scientist. The effective and systematic use of advanced technologies such as high-speed hardware, network computing, cloud computing and cross-platform tools continues to be an elusive goal for many organizations. In this context, the chapter investigates the feasibility of novel and practical solutions.
Keywords: Quality assurance, testing, data science, data analysis, decision making
1.1 Introduction
1.1.1 Quality Assurance and Testing
In the traditional software development approach, Quality Assurance was done at the later stages of the development process and feedback was collected for improvement. In almost all organizations there exists a Quality Assurance team, responsible for identifying the product defects and resolving them before release of the product.
In the Agile development approach, however, the Quality Assurance (QA) team works collaboratively with the development team and shares responsibility for improving the quality of the product. Irrespective of the project and domain, QA plays an important role in delivering high-quality products. It helps developers establish goals and define quality standards. The process also helps in identifying and resolving defects in products before they are released to market.
1.1.2 Data Science and Quality Assurance
Data is available in structured and unstructured formats. Methods and algorithms are needed to extract knowledge and draw inferences, so that the available data can be converted into wisdom that applies in any application domain. In the context of data science, QA is the process of analyzing and modeling the available data to ensure that it is of high quality and meets quality standards. Models have to be created with a variety of data sets, and trained models have to undergo robustness tests, as with any use case for real-world scenarios [6].
1.1.3 Background
Data science verifies the general consistency of relevant data and applies scientific methods to ensure data quality. Domain knowledge is necessary not only to ensure the quality of the data but also to avoid errors and inconsistencies in the data collected during the quality assurance process [1]. Once the data has been collected, interpreting it is also a major challenge. Quality assurance should therefore be a continuous process throughout product development, from data collection to delivery of the product. Expert opinion adds value here: domain experts play an important role in validating the results of data analysis. Finding and resolving quality-related issues after the product has been delivered wastes time, money and effort [2, 9]. Efficient data analytics requires a variety of data sets [3]. Data analytics also involves frequency, time and many iterations, so systems that focus on data quality are needed [4, 10]. Software quality and hardware quality play an equally important role [5].
1.2 Testing and Quality Assurance
Testing is a process which ensures the intended behavior of any product within the given time frame and also helps to avoid additional efforts, time and cost overruns. System testing ensures not only product quality but also process quality.
1.2.1 Key Terminologies Associated With Testing
1.2.1(a) Defect count: A defect count is one measurement of product quality. It shows the number of undetected errors. The defect rate is the number of defective units observed divided by the number of units tested. For example, if 5 out of 10 tested units are defective, the defect rate is 5/10 = 0.5, or 50%.
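The defect-rate calculation can be sketched in a few lines of Python. This is an illustrative sketch only: the function name `defect_rate` and its input checks are assumptions, not part of the chapter.

```python
def defect_rate(defective_units: int, units_tested: int) -> float:
    """Return the fraction of tested units that were found defective."""
    if units_tested <= 0:
        raise ValueError("units_tested must be positive")
    if not 0 <= defective_units <= units_tested:
        raise ValueError("defective_units must be between 0 and units_tested")
    return defective_units / units_tested


# The example from the text: 5 defective units out of 10 tested.
rate = defect_rate(5, 10)
print(f"Defect rate: {rate:.0%}")  # Defect rate: 50%
```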
1.2.1(b) Test case execution: During test execution, the developed code is run against the generated test cases, and the actual output is compared with the expected output. Afterwards, the bugs found and the test execution status are recorded for follow-up. The steps involved in test execution are as follows:
Step 1: Gathering testing requirements
Step 2: Test plan
Step 3: Test design
Step 4: Test execution
Step 5: Defect reporting and tracking
Step 6: Defect Mapping
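The core of the execution step, comparing actual output with expected output and recording the status, might look like the following minimal sketch. The names `Case`, `execute` and `absolute_difference` are hypothetical, and the function under test is deliberately buggy so that one defect is reported.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class Case:
    name: str
    args: tuple
    expected: Any


def absolute_difference(a: int, b: int) -> int:
    # Hypothetical code under test; it is deliberately buggy (abs() is missing).
    return a - b


def execute(cases: list[Case], func: Callable[..., Any]) -> list[dict]:
    """Run each case, compare actual with expected, and record the status."""
    results = []
    for case in cases:
        actual = func(*case.args)
        status = "pass" if actual == case.expected else "fail"
        results.append({"name": case.name, "status": status,
                        "expected": case.expected, "actual": actual})
    return results


cases = [
    Case("larger first", (5, 3), 2),
    Case("smaller first", (3, 5), 2),
]
results = execute(cases, absolute_difference)

# Defect reporting: keep only the failed cases for tracking.
defects = [r for r in results if r["status"] == "fail"]
for d in defects:
    print(f"DEFECT {d['name']}: expected {d['expected']}, got {d['actual']}")
```

In practice a test framework would handle this bookkeeping. The sketch only makes the comparison of actual and expected results explicit.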
1.2.1(c) Test execution classification: The type of testing can be classified based on the purpose for which it has to be executed: for example, performance testing, defects testing, regression testing, etc. During the test execution process, the main focus is to verify how much the actual results varied from the expected results.
1.2.1(d) Pre-requisites for test execution phase
- Test Plan, test cases development and the test data should be ready
- Test environment setup and its validation
- Clear understanding of exit criteria
1.2.1(e) Post-requirements for test execution phase
- Validation of actual results with expected results
- Defects should be closed or deferred
- Test execution and defects summary report
- Prioritize the test plans based on the identified risks
- Continue the process
1.2.1(f) Test velocity
Test velocity shows the number of tests run per day, week, month, etc. It also shows the difference between the planned and actual time.
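As a rough illustration, both quantities can be computed directly. The function names and the figures used (120 tests, 6 planned days, 8 actual days) are made up for this sketch and do not come from the chapter.

```python
def velocity(tests_run: int, days: float) -> float:
    """Average number of tests run per day over the given period."""
    if days <= 0:
        raise ValueError("days must be positive")
    return tests_run / days


def schedule_variance(planned_days: float, actual_days: float) -> float:
    """Positive values mean testing took longer than planned."""
    return actual_days - planned_days


# Hypothetical figures: 120 tests run in 8 days against a plan of 6 days.
print(velocity(120, 8))         # 15.0 tests per day
print(schedule_variance(6, 8))  # 2 days over plan
```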
1.3 Product Quality and Test Efforts
1.3.1 Testing Metrics
Testing metrics are required to measure and estimate process and product quality and to improve the efficiency of the overall testing process. Two major categories are defect metrics and productivity metrics.
Example of productivity metrics: the mileage of a vehicle compared with the ideal mileage recommended by the manufacturer. Figure 1.1 shows the test velocity and loop count, along with the thread properties.
Figure 1.1 Test velocity, loop count.
Example of defect metrics: Number of defects found, accepted, rejected and deferred in the payment process using a credit card.
A few automated testing tools: Selenium, JMeter, Appium, JUnit
The chapter demonstrates a few testing aspects using JMeter. It helps to test and analyze overall performance under different load types.
Description of Figure 1.1:
- Name: Provide a custom name or, if you prefer, simply leave it named the default "Thread Group".
- Thread count: The number of threads indicates the number of users under test.
- Ramp-up Period: How quickly (in seconds) you want to add users. For example, if set to zero, all users will begin immediately. If set to 10, users will be added one at a time until the full number is reached at the end of the 10 seconds. This example uses 5 seconds.
- Loop Count: It indicates the number of times the test should repeat.
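How these settings interact can be approximated with a short calculation. The sketch assumes that threads are started at evenly spaced intervals of ramp-up period divided by thread count, and that each thread runs one sampler per loop. The function names are hypothetical, and real JMeter timing will vary with load.

```python
def thread_start_offsets(threads: int, ramp_up_seconds: float) -> list[float]:
    """Approximate start time (in seconds) of each thread, assuming even spacing."""
    if threads <= 0:
        raise ValueError("threads must be positive")
    interval = ramp_up_seconds / threads
    return [i * interval for i in range(threads)]


def total_samples(threads: int, loop_count: int, samplers: int = 1) -> int:
    """Number of requests sent if every thread completes every loop."""
    return threads * loop_count * samplers


# Illustrative values: 5 users, 5-second ramp-up, 3 loops.
print(thread_start_offsets(5, 5))  # [0.0, 1.0, 2.0, 3.0, 4.0]
print(total_samples(5, 3))         # 15
```

With a ramp-up period of zero, every offset is 0.0, which matches the statement that all users begin immediately.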
Figure 1.2 demonstrates a thread delay of 300 milliseconds.
Figure 1.2 Thread delay.
In Figure 1.3, successful test results are shown with green icons. Figure 1.3 also shows load time, connect time, errors, the request data, the response data, etc. Figure 1.4 shows the test results in a table, and Figure 1.5 shows the test report summary.
Figure 1.3 Test results.
Figure 1.4 Test results in table.
A CSV file is the source of the data used for testing, and one can see the dynamic results in the View Results Tree (Figure 1.6). In our example, the cities are no longer Boston and London, but Philadelphia and Berlin, Portland and Rome, etc.
Figure 1.5 Test report summary.
Figure 1.6 View results tree.
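The idea of driving a test from a CSV file can be illustrated outside JMeter with a short Python sketch. The column names `from_city` and `to_city`, and the inline CSV text, are assumptions made for illustration. Only the city pairs come from the example above.

```python
import csv
import io

# Stand-in for the CSV data file; the column names are hypothetical.
CSV_TEXT = """from_city,to_city
Philadelphia,Berlin
Portland,Rome
"""


def load_test_data(text: str) -> list[dict]:
    """Parse CSV text into one dictionary of parameters per test iteration."""
    return list(csv.DictReader(io.StringIO(text)))


rows = load_test_data(CSV_TEXT)
for row in rows:
    # Each row supplies the parameters for one request.
    print(f"request: from={row['from_city']} to={row['to_city']}")
```

Each row replaces the fixed values of a single hard-coded request, which is why the results differ from one iteration to the next.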
1.3.2 How to Improve the Business Value to Products Using Test Automation
With advances in automation testing, ensuring software quality means that the developed software has to meet not only the functional requirements specification but also non-functional requirements such as operational efficiency, reliability, security, maintainability, availability and code efficiency. Meeting these requirements increases the business value.
With automation testing one can manage the performance of test activities, thereby ensuring conformance to test requirements. Automation testing has become a crucial part of quality assurance [7].
The business value that testers deliver through their test efforts can be evaluated by measuring the following attributes:
- Percentage of...
System requirements
File format: ePUB
Copy protection: Adobe-DRM (Digital Rights Management)
System requirements:
- Computer (Windows; MacOS X; Linux): Install the free reader Adobe Digital Editions prior to download (see eBook Help).
- Tablet/smartphone (Android; iOS): Install the free app Adobe Digital Editions or the app PocketBook before downloading (see eBook Help).
- E-reader: Bookeen, Kobo, Pocketbook, Sony, Tolino and many more (not Kindle).
The file format ePub works well for novels and non-fiction books, i.e., "flowing" text without complex layout. On an e-reader or smartphone, line and page breaks automatically adjust to fit the small displays.
This eBook uses Adobe-DRM, a "hard" copy protection. If the necessary requirements are not met, unfortunately you will not be able to open the eBook. You will therefore need to prepare your reading hardware before downloading.
Please note: We strongly recommend that you authorise using your personal Adobe ID after installation of any reading software.
For more information, see our ebook Help page.