
Data Science Handbook
Description
DATA SCIENCE HANDBOOK
This desk reference handbook gives researchers working in various domains hands-on experience with the algorithms and popular techniques used in practice in data science.
Data science is one of the leading research-driven areas of the modern era. It plays a critical role in healthcare, engineering, education, mechatronics, and medical robotics. Building models and working with data is not value-neutral: we choose the problems we work on, make assumptions in our models, and decide on metrics and algorithms. The data scientist identifies problems that can be solved with data and with expert modeling and coding tools.
The book starts with introductory concepts in data science such as data munging, data preparation, and data transformation. Chapter 2 discusses data visualization, including drawing various plots and histograms. Chapter 3 covers mathematics and statistics for data science. Chapter 4 focuses on machine learning algorithms. Chapter 5 covers outlier analysis and the DBSCAN algorithm. Chapter 6 focuses on clustering. Chapter 7 discusses network analysis. Chapter 8 focuses on regression and the naive Bayes classifier. Chapter 9 covers web-based data visualizations with Plotly. Chapter 10 discusses web scraping.
The book concludes with a section discussing 19 projects on various subjects in data science.
Audience
The handbook is intended for graduate students and research scholars in computer science and electrical engineering, as well as professionals in industries such as healthcare.


Person
Kolla Bhanu Prakash, PhD, is a Professor and Research Group Head of the A.I. & Data Science Research Group at K L University, India. He has published more than 80 research papers in international and national journals and conferences, and has authored or edited 12 books and seven patents. His research interests include deep learning, data science, and quantum computing.
Content
Acknowledgment
Preface
1 Data Munging Basics
1 Introduction
1.1 Filtering and Selecting Data
1.2 Treating Missing Values
1.3 Removing Duplicates
1.4 Concatenating and Transforming Data
1.5 Grouping and Data Aggregation
References
2 Data Visualization
2.1 Creating Standard Plots (Line, Bar, Pie)
2.2 Defining Elements of a Plot
2.3 Plot Formatting
2.4 Creating Labels and Annotations
2.5 Creating Visualizations from Time Series Data
2.6 Constructing Histograms, Box Plots, and Scatter Plots
References
3 Basic Math and Statistics
3.1 Linear Algebra
3.2 Calculus
3.2.1 Differential Calculus
3.2.2 Integral Calculus
3.3 Inferential Statistics
3.3.1 Central Limit Theorem
3.3.2 Hypothesis Testing
3.3.3 ANOVA
3.3.4 Qualitative Data Analysis
3.4 Using NumPy to Perform Arithmetic Operations on Data
3.5 Generating Summary Statistics Using Pandas and Scipy
3.6 Summarizing Categorical Data Using Pandas
3.7 Starting with Parametric Methods in Pandas and Scipy
3.8 Delving Into Non-Parametric Methods Using Pandas and Scipy
3.9 Transforming Dataset Distributions
References
4 Introduction to Machine Learning
4.1 Introduction to Machine Learning
4.2 Types of Machine Learning Algorithms
4.3 Explanatory Factor Analysis
4.4 Principal Component Analysis (PCA)
References
5 Outlier Analysis
5.1 Extreme Value Analysis Using Univariate Methods
5.2 Multivariate Analysis for Outlier Detection
5.3 DBSCAN Clustering to Identify Outliers
References
6 Cluster Analysis
6.1 K-Means Algorithm
6.2 Hierarchical Methods
6.3 Instance-Based Learning w/ k-Nearest Neighbor
References
7 Network Analysis with NetworkX
7.1 Working with Graph Objects
7.2 Simulating a Social Network (i.e., Directed Network Analysis)
7.3 Analyzing a Social Network
References
8 Basic Algorithmic Learning
8.1 Linear Regression
8.2 Logistic Regression
8.3 Naive Bayes Classifiers
References
9 Web-Based Data Visualizations with Plotly
9.1 Collaborative Analytics
9.2 Basic Charts
9.3 Statistical Charts
9.4 Plotly Maps
References
10 Web Scraping with Beautiful Soup
10.1 The BeautifulSoup Object
10.2 Exploring NavigableString Objects
10.3 Data Parsing
10.4 Web Scraping
10.5 Ensemble Models with Random Forests
References
Data Science Projects
11 Covid19 Detection and Prediction
Bibliography
12 Leaf Disease Detection
Bibliography
13 Brain Tumor Detection with Data Science
Bibliography
14 Color Detection with Python
Bibliography
15 Detecting Parkinson's Disease
Bibliography
16 Sentiment Analysis
Bibliography
17 Road Lane Line Detection
Bibliography
18 Fake News Detection
Bibliography
19 Speech Emotion Recognition
Bibliography
20 Gender and Age Detection with Data Science
Bibliography
21 Diabetic Retinopathy
Bibliography
22 Driver Drowsiness Detection in Python
Bibliography
23 Chatbot Using Python
Bibliography
24 Handwritten Digit Recognition Project
Bibliography
25 Image Caption Generator Project in Python
Bibliography
26 Credit Card Fraud Detection Project
Bibliography
27 Movie Recommendation System
Bibliography
28 Customer Segmentation
Bibliography
29 Breast Cancer Classification
Bibliography
30 Traffic Signs Recognition
Bibliography
1 Data Munging Basics
1 Introduction
Data gains value when it is transformed into useful information. Every firm attaches great significance to the data generated by its assets. A firm's data helps personnel across the organization improve their business tasks and save time and maintenance expenditure. Top-level management fails to take appropriate decisions if it does not treat data as an important factor in understanding the business process. Many poor decisions about advertising company products waste resources and damage the organization's reputation at every level. Companies can avoid squandering money by tracking the success of their marketing channels and concentrating on the ones that provide the best return on investment. As a result, a business can get more leads for less money spent on advertising [1].
Data science is the study of discovering data patterns across interdisciplinary domains such as business, education, and research. Much of the information extracted is unstructured, such as text and images, while the rest is structured, as in tabular formats. The basic functional features of data science include statistical techniques, inference rules, predictive analytics, fundamental machine learning algorithms, and novel methods for gleaning insights from huge data.
The following business use cases show data science serving customers in different domains:
- A banking organization provides a mobile app that sends recommendations on various loan offers to its applicants.
- A car manufacturing firm uses data science to build a 3-D printing screen that guides driverless cars by making the object detection mechanism more accurate.
- An automation solution provider that follows a cognitive approach develops an incident response system to detect failures in the functionality offered to its clients.
- General viewer behaviour across different channel subscribers is analysed with an audience analytics platform to provide groupings of favoured TV channels.
- A cyber police department uses statistical tools to analyse crime incidents in a particular locality, using images captured from CCTV footage, and cautions citizens to be aware of the criminals involved.
- A smart healthcare system uses body sensor information to analyse the health condition of elderly patients with memory loss or paralysis on behalf of their close relatives or caregivers, helping to safeguard those patients.
Data science adopts four popular strategies [8] while exploring data: (i) understanding the problem in the real world (probing reality), (ii) usage patterns of data (discovering patterns), (iii) building predictive data models for the future (predicting future events), and (iv) being empathetic in the business world (understanding people and the world).
- (i) Understanding the problem in the real world: Active and passive methods are used to collect data on a particular problem in the business process so that action can be taken. All the responses collected during the business process matter for the analysis behind an appropriate decision, and they lead to success in subsequent decisions.
- (ii) Usage patterns of data (discovering patterns): A divide-and-conquer mechanism can be used to analyze complex problems, but it is not always the right solution if the purpose of the data is not understood. Much of the data is analyzed by clustering its usage patterns, and this kind of clustering study helps in dealing with real-time digital marketing data.
- (iii) Building predictive models (predicting future events): The study of statistics makes clear that many mathematical techniques evolved to analyze current data and predict the future. Predictive analysis helps decision making in the current data collection scenario, and predictions about the future add valuable knowledge to the current data.
- (iv) Being empathetic in the business world (understanding people and the world): The toughest task for any organization is building teams that understand the real-world people who interact with the organization for multiple reasons. Optimal decision making is possible only by understanding the real-world scenarios behind the data generated during those interactions, which provide supporting evidence for framing the organization's decision-making strategy. Advanced techniques such as deep learning are used for visual object recognition when studying the real world.
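As a rough illustration of strategy (ii), clustering usage patterns can be sketched with a one-dimensional k-means (Lloyd's algorithm) written in plain Python. The data (`minutes`, daily minutes of app usage per customer), the starting centroids, and the helper names `assign` and `kmeans_1d` are assumptions made for this sketch and do not come from the book.

```python
from statistics import mean


def assign(values, centroids):
    """Assignment step: group each value with its nearest centroid."""
    clusters = [[] for _ in centroids]
    for v in values:
        nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
        clusters[nearest].append(v)
    return clusters


def kmeans_1d(values, initial_centroids, max_iter=100):
    """Lloyd's algorithm on 1-D data with caller-supplied starting centroids."""
    centroids = list(initial_centroids)
    for _ in range(max_iter):
        clusters = assign(values, centroids)
        # Update step: move each centroid to the mean of its cluster;
        # an empty cluster keeps its previous centroid.
        updated = [mean(c) if c else centroids[i] for i, c in enumerate(clusters)]
        if updated == centroids:  # converged: no centroid moved
            break
        centroids = updated
    return centroids, assign(values, centroids)


# Hypothetical usage data: daily minutes spent in an app by six customers.
minutes = [2, 3, 4, 40, 42, 44]
centroids, clusters = kmeans_1d(minutes, initial_centroids=[0, 50])
print(centroids, clusters)
```

The two resulting groups separate light users from heavy users. In practice a library implementation with several features per customer would normally replace this hand-written loop.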
Purpose of Data Science
Simple business intelligence tools can analyze unstructured data only when its volume is very small. Most of the data collected in traditional systems was unstructured, generated from sources such as financial reports, text files, multimedia, sensors, and instruments. Business intelligence solutions cannot deal with huge volumes of data in different complex formats. Processing such data requires high processing power together with improved analytical tools and algorithms that yield better insights, and this is the role of data science.
Past and Future of Data Science
In 1962, John Tukey published a paper on the convergence of statistics and computers, showing how they may provide measurable results in hours. In 1974, Peter Naur wrote the book Concise Survey of Computer Methods, in which he used the term data science many times to refer to the processing of data through specific mathematical methods. In 1977, an international association was established for the statistical processing of data, with the purpose of translating data into knowledge by combining modern computer technology, traditional statistical techniques, and domain knowledge. Tukey released Exploratory Data Analysis in the same year, emphasizing the importance of data.
Businesses began collecting enormous volumes of personal data in anticipation of new advertising efforts as early as 1994. Jacob Zahavi emphasized the need for new technology to manage the large volume of data generated by different organizations. William S. Cleveland published an article outlining specialized learning methods and the scope for data scientists, which was used as a case study by businesses and educational institutes.
In 2002, the International Council for Science launched a journal for data science. It focused on data science topics such as data systems modeling and its applications. In 2013, IBM claimed that much of the digital data collected all over the world had been generated in the previous two years. From then on, organizations planned to build up substantial data for decision making and began gaining insights that supported their growth.
According to IDC, global data will exceed 175 zettabytes by 2025. Data science allows businesses to swiftly interpret large amounts of data from a number of sources and turn that data into actionable insights for better data-driven decisions, and it is widely used in marketing, healthcare, finance, banking, policy work, and other fields. The market for data science platforms is expected to reach 178 billion dollars by 2025. Data science gives data scientists a platform to explore many options that help business organizations track the latest developments in data gathering and maintenance for appropriate decision making.
BI (Business Intelligence) Vs DS (Data Science)
Business intelligence is a decision-making process that draws insights from the current data an organization holds about its transactions with all of its stakeholders. It gathers data from sources both internal and external to the organization. BI tools support running queries and displaying results with good visualization mechanisms, for example by analyzing the revenue earned in a quarter and the business challenges faced. BI enables suggestions based on market study, revealing revenue opportunities and improvements to business processes. It is meant purely for building business strategies that earn long-run profits for the organization. Tools such as OLAP and data warehouse ETL are used for storing and visualizing data in BI.
Data science is a multidisciplinary domain that studies data to extract meaningful insights. It also uses data processing tools from machine learning and artificial intelligence to develop predictive models, which are then used to forecast the future growth of an organization's business functions. Python and R are used to build predictive data models by implementing efficient machine learning algorithms, and the results are communicated through advanced visualization techniques.
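To make the contrast concrete, here is a minimal sketch of the kind of predictive model that separates data science from descriptive BI reporting: an ordinary least-squares trend line fitted to past figures and used to forecast the next period. The quarterly revenue numbers and the helper name `fit_line` are invented for illustration and are not taken from the book.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sum of cross-deviations and sum of squared deviations of x.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


# Hypothetical revenue (in millions) for four past quarters.
quarters = [1, 2, 3, 4]
revenue = [10.0, 12.0, 14.0, 16.0]

slope, intercept = fit_line(quarters, revenue)
forecast_q5 = slope * 5 + intercept  # extrapolate the trend to quarter 5
print(f"slope={slope}, intercept={intercept}, forecast={forecast_q5}")
```

A BI report would stop at displaying the four observed quarters. The fitted line adds a forecast, which is the predictive step the paragraph above attributes to data science.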
Data Munging Basics
Data science is a multidisciplinary field that derives its features from artificial intelligence, machine learning, and deep learning to uncover insights from data in different forms, both structured (tabular data) and unstructured (text, images). It studies specific problem domains and finds or defines solutions from the usage patterns in the available input data, revealing useful insights [1, 2].
Data science works with data to answer the questions raised by studying real-world scenarios in the business process. It is different from the business intelligence mechanism which only works on framing good business...
System requirements
File format: ePUB
Copy protection: Adobe-DRM (Digital Rights Management)
System requirements:
- Computer (Windows; MacOS X; Linux): Install the free reader Adobe Digital Editions prior to download (see eBook Help).
- Tablet/smartphone (Android; iOS): Install the free app Adobe Digital Editions or the app PocketBook before downloading (see eBook Help).
- E-reader: Bookeen, Kobo, Pocketbook, Sony, Tolino and many more (not Kindle).
The file format ePub works well for novels and non-fiction books, i.e., "flowing" text without complex layout. On an e-reader or smartphone, line and page breaks automatically adjust to fit the small displays.
This eBook uses Adobe-DRM, a "hard" copy protection. If the necessary requirements are not met, unfortunately you will not be able to open the eBook. You will therefore need to prepare your reading hardware before downloading.
Please note: We strongly recommend that you authorise using your personal Adobe ID after installation of any reading software.
For more information, see our ebook Help page.