DATA SCIENCE HANDBOOK
This desk reference gives researchers across domains hands-on experience with the algorithms and techniques commonly used in data science practice.
Data science is one of the leading research-driven areas of the modern era. It plays a critical role in healthcare, engineering, education, mechatronics, and medical robotics. Building models and working with data is not value-neutral: we choose the problems we work on, make assumptions in our models, and decide on metrics and algorithms. The data scientist identifies problems that can be solved with data and with expert modeling and coding tools.
The book starts with introductory concepts in data science such as data munging, data preparation, and transforming data. Chapter 2 discusses data visualization, including drawing various plots and histograms. Chapter 3 covers mathematics and statistics for data science. Chapter 4 focuses on machine learning algorithms in data science. Chapter 5 covers outlier analysis and the DBSCAN algorithm. Chapter 6 focuses on clustering. Chapter 7 discusses network analysis. Chapter 8 focuses on regression and the naive Bayes classifier. Chapter 9 covers web-based data visualizations with Plotly. Chapter 10 discusses web scraping.
The book concludes with a section discussing 19 projects on various subjects in data science.
Audience
The handbook is intended for graduate students and research scholars in computer science and electrical engineering, as well as professionals in industries such as healthcare.
Kolla Bhanu Prakash, PhD, is a Professor and Research Group Head of the A.I. & Data Science Research Group at K L University, India. He has published more than 80 research papers in international and national journals and conferences, has authored or edited 12 books, and has seven patents. His research interests include deep learning, data science, and quantum computing.
Acknowledgment
Preface
1 Data Munging Basics
1 Introduction
1.1 Filtering and Selecting Data
1.2 Treating Missing Values
1.3 Removing Duplicates
1.4 Concatenating and Transforming Data
1.5 Grouping and Data Aggregation
References
2 Data Visualization
2.1 Creating Standard Plots (Line, Bar, Pie)
2.2 Defining Elements of a Plot
2.3 Plot Formatting
2.4 Creating Labels and Annotations
2.5 Creating Visualizations from Time Series Data
2.6 Constructing Histograms, Box Plots, and Scatter Plots
References
3 Basic Math and Statistics
3.1 Linear Algebra
3.2 Calculus
3.2.1 Differential Calculus
3.2.2 Integral Calculus
3.3 Inferential Statistics
3.3.1 Central Limit Theorem
3.3.2 Hypothesis Testing
3.3.3 ANOVA
3.3.4 Qualitative Data Analysis
3.4 Using NumPy to Perform Arithmetic Operations on Data
3.5 Generating Summary Statistics Using Pandas and Scipy
3.6 Summarizing Categorical Data Using Pandas
3.7 Starting with Parametric Methods in Pandas and Scipy
3.8 Delving Into Non-Parametric Methods Using Pandas and Scipy
3.9 Transforming Dataset Distributions
References
4 Introduction to Machine Learning
4.1 Introduction to Machine Learning
4.2 Types of Machine Learning Algorithms
4.3 Explanatory Factor Analysis
4.4 Principal Component Analysis (PCA)
References
5 Outlier Analysis
5.1 Extreme Value Analysis Using Univariate Methods
5.2 Multivariate Analysis for Outlier Detection
5.3 DBSCAN Clustering to Identify Outliers
References
6 Cluster Analysis
6.1 K-Means Algorithm
6.2 Hierarchical Methods
6.3 Instance-Based Learning w/ k-Nearest Neighbor
References
7 Network Analysis with NetworkX
7.1 Working with Graph Objects
7.2 Simulating a Social Network (i.e., Directed Network Analysis)
7.3 Analyzing a Social Network
References
8 Basic Algorithmic Learning
8.1 Linear Regression
8.2 Logistic Regression
8.3 Naive Bayes Classifiers
References
9 Web-Based Data Visualizations with Plotly
9.1 Collaborative Analytics
9.2 Basic Charts
9.3 Statistical Charts
9.4 Plotly Maps
References
10 Web Scraping with Beautiful Soup
10.1 The BeautifulSoup Object
10.2 Exploring NavigableString Objects
10.3 Data Parsing
10.4 Web Scraping
10.5 Ensemble Models with Random Forests
References
Data Science Projects
11 Covid19 Detection and Prediction
Bibliography
12 Leaf Disease Detection
Bibliography
13 Brain Tumor Detection with Data Science
Bibliography
14 Color Detection with Python
Bibliography
15 Detecting Parkinson's Disease
Bibliography
16 Sentiment Analysis
Bibliography
17 Road Lane Line Detection
Bibliography
18 Fake News Detection
Bibliography
19 Speech Emotion Recognition
Bibliography
20 Gender and Age Detection with Data Science
Bibliography
21 Diabetic Retinopathy
Bibliography
22 Driver Drowsiness Detection in Python
Bibliography
23 Chatbot Using Python
Bibliography
24 Handwritten Digit Recognition Project
Bibliography
25 Image Caption Generator Project in Python
Bibliography
26 Credit Card Fraud Detection Project
Bibliography
27 Movie Recommendation System
Bibliography
28 Customer Segmentation
Bibliography
29 Breast Cancer Classification
Bibliography
30 Traffic Signs Recognition
Bibliography
Data gains value when it is transformed into useful information. Every firm attaches importance to the data generated by its assets. This data helps personnel across the organization improve their business tasks and save time and maintenance costs. Top-level management fails to make appropriate decisions when it does not treat data as an important factor in understanding the business process. Poor decisions about advertising company products waste resources and damage the organization's reputation at every level. Companies can avoid squandering money by tracking the success of their marketing channels and concentrating on the ones that provide the best return on investment. As a result, a business can get more leads for less money spent on advertising [1].
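The channel comparison described above can be sketched in a few lines of Python. The channel names and the spend and revenue figures are hypothetical placeholders, and ROI is taken here as (revenue - spend) / spend, which is one common definition and not necessarily the one the authors have in mind.

```python
# Hypothetical marketing data: channel -> (spend, attributed revenue).
# All names and figures are illustrative assumptions, not taken from the book.
channels = {
    "email": (1000.0, 4000.0),
    "search_ads": (5000.0, 9000.0),
    "print": (3000.0, 2400.0),
}


def roi(spend, revenue):
    """Return on investment as a fraction of spend: (revenue - spend) / spend."""
    return (revenue - spend) / spend


def rank_channels(data):
    """Return (channel, roi) pairs sorted from best to worst ROI."""
    scored = [(name, roi(spend, revenue)) for name, (spend, revenue) in data.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


ranking = rank_channels(channels)
for name, value in ranking:
    print(f"{name}: ROI = {value:.0%}")
```

With these made-up figures, the channel with the largest revenue is not the one with the best return, which is the point of tracking ROI per channel and not raw revenue.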
Data science is the study of discovering data patterns across interdisciplinary domains such as business, education, and research. The information extracted is partly unstructured, such as text and images, and partly structured, such as tabular data. Its basic functional features include statistical techniques, inference rules, predictive analytics, fundamental machine learning algorithms, and novel methods for gleaning insights from huge data sets.
Business use cases in which data science serves customers in different domains.
Data science adopts four popular strategies [8] when exploring data: (i) understanding the problem in the real world (probing reality); (ii) finding usage patterns in the data (discovering patterns); (iii) building predictive data models for the future (predicting future events); and (iv) being empathetic to the business world (understanding people and the world).
Simple business intelligence tools were adequate when data was small and mostly structured. Most of the data collected now is unstructured or semi-structured, generated from sources such as financial reports, text files, multimedia, sensors, and instruments. Business intelligence solutions cannot deal with huge volumes of data in varied, complex formats. Processing such data requires high processing capacity together with improved analytical tools and algorithms that yield better insights, and this work is done as part of data science.
In 1962, John Tukey published a paper on the convergence of statistics and computers, showing how together they could provide measurable results in hours. In 1974, Peter Naur wrote the book Concise Survey of Computer Methods, in which he used the term data science repeatedly to refer to the processing of data through specific mathematical methods. In 1977, an international association was established for the statistical processing of data, with the purpose of translating data into knowledge by combining modern computer technology, traditional statistical techniques, and domain knowledge. Tukey released Exploratory Data Analysis in the same year, emphasizing the importance of data.
Businesses began collecting enormous volumes of personal data in anticipation of new advertising efforts as early as 1994. Jacob Zahavi emphasized the need for new technology to manage the large volume of data generated by different organizations. William S. Cleveland published an article outlining specialized learning methods and the scope for data scientists, which served as a case study for businesses and educational institutions.
In 2002, the International Council for Science launched a data science journal focused on topics such as data systems modeling and its applications. In 2013, IBM claimed that much of the digital data collected worldwide had been generated in the preceding two years. From then on, organizations planned to build up data for decision making and began gaining insights that supported organizational growth.
According to IDC, global data will exceed 175 zettabytes by 2025. Data science allows businesses to swiftly interpret large amounts of data from many sources and turn it into actionable insights for better data-driven decisions, an approach widely used in marketing, healthcare, finance, banking, policy work, and other fields. The market for data science platforms is expected to reach 178 billion dollars by 2025. Data science gives data scientists a platform to explore options that help business organizations track the latest developments in data gathering and maintenance for appropriate decision making.
Business intelligence (BI) is a decision-making process that draws insights from the current data an organization accumulates through its transactions with all stakeholders. It gathers data from all sources, both external and internal to the organization. BI tools support running queries and displaying results with good visualization mechanisms, for example by analyzing the revenue earned in a given quarter in the face of business challenges. BI provides suggestions based on market study, revealing revenue opportunities and improvements to business processes. It is meant purely for building business strategies that earn the organization profits in the long run. Tools such as OLAP and data warehouse ETL are used for storing and visualizing data in BI.
Data science is a multidisciplinary domain that studies data to extract meaningful insights. It also uses data processing tools from machine learning and artificial intelligence to develop predictive models, which are then used to forecast the future growth of the organization's business functions. Python and R are used to build predictive data models by implementing efficient machine learning algorithms, and the results are tracked using high-end visual communication techniques.
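To make the contrast with business intelligence concrete, here is a minimal sketch, assuming a hypothetical series of quarterly revenue figures. The first part summarizes what has already happened, which is the BI-style view. The second part fits a straight-line trend by ordinary least squares and extrapolates one quarter ahead. The linear trend is only a stand-in for the predictive models the text mentions, not the book's own method, and the names `revenue` and `fit_line` are illustrative.

```python
# Hypothetical quarterly revenue (in thousands); figures are illustrative only.
revenue = [100.0, 110.0, 120.0, 130.0]

# BI-style descriptive view: summarize what has already happened.
total = sum(revenue)
average = total / len(revenue)


def fit_line(ys):
    """Ordinary least squares fit of y = intercept + slope * x for x = 0, 1, 2, ..."""
    n = len(ys)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Predictive view: extrapolate the fitted trend to the next quarter (x = 4).
intercept, slope = fit_line(revenue)
forecast = intercept + slope * len(revenue)

print(f"total={total}, average={average}, next-quarter forecast={forecast}")
```

In practice a library such as scikit-learn would replace the hand-written fit, but the plain-Python version keeps the example self-contained.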
Data science is a multidisciplinary field that draws on artificial intelligence, machine learning, and deep learning to uncover further insights from data in different forms, both structured (tabular data) and unstructured (text, images). It studies specific problem domains, finds or defines solutions from the usage patterns in the available input data, and reveals useful insights [1, 2].
Data science works with data to answer the questions that arise from studying real-world business scenarios. It differs from business intelligence, which only works on framing good business...
File format: ePUB
Copy protection: Adobe DRM (Digital Rights Management)
System requirements:
The ePUB file format is well suited to novels and non-fiction, that is, to "flowing" text without complex layout. On e-readers and smartphones, line and page breaks adapt automatically to the small displays. Adobe DRM is a "hard" form of copy protection. If the necessary requirements are not met, you will unfortunately not be able to open the e-book, so you must prepare your reading hardware before downloading. Please note: we strongly recommend authorizing the reading software with your personal Adobe ID once it is installed.