
Big Data
Description
Big Data: Concepts, Technology, and Architecture delivers a comprehensive treatment of Big Data tools, terminology, and technology perfectly suited to a wide range of business professionals, academic researchers, and students. Beginning with a thorough overview of what we mean when we say "Big Data," the book moves on to discuss every stage of the lifecycle of Big Data.
You'll learn about the creation of structured, unstructured, and semi-structured data, data storage solutions, traditional database solutions like SQL, data processing, data analytics, machine learning, and data mining. You'll also discover how specific technologies like Apache Hadoop, SQOOP, and Flume work.
Big Data also covers the central topic of big data visualization with Tableau, and you'll learn how to create scatter plots, histograms, bar, line, and pie charts with that software.
Accessibly organized, Big Data includes illuminating case studies throughout the material, showing you how the included concepts have been applied in real-world settings. Some of those concepts include:
* The common challenges facing big data technology and technologists, like data heterogeneity and incompleteness, data volume and velocity, storage limitations, and privacy concerns
* Relational and non-relational databases, like RDBMS, NoSQL, and NewSQL databases
* Virtualizing Big Data through encapsulation, partitioning, and isolation, as well as big data server virtualization
* Apache software, including Hadoop, Cassandra, Avro, Pig, Mahout, Oozie, and Hive
* The Big Data analytics lifecycle, including business case evaluation, data preparation, extraction, transformation, analysis, and visualization
Perfect for data scientists, data engineers, and database managers, Big Data also belongs on the bookshelves of business intelligence analysts who are required to make decisions based on large volumes of information. Executives and managers who lead teams responsible for keeping or understanding large datasets will also benefit from this book.


Persons
BALAMURUGAN BALUSAMY, PhD, is a Professor with the School of Computing Science and Engineering at Galgotias University, Greater Noida, India.
NANDHINI ABIRAMI. R is an IT Consultant and Research Scholar at VIT University in Vellore.
SEIFEDINE KADRY, PhD, is a Professor of Data Science at the Faculty of Applied Computing and Technology at Noroff University College, Kristiansand, Norway.
AMIR H. GANDOMI, PhD, is a Professor of Data Science at the Faculty of Engineering & Information Technology, University of Technology Sydney, Australia.
Content
Acknowledgments
About the Author
1 Introduction to the World of Big Data
1.1 Understanding Big Data
1.2 Evolution of Big Data
1.3 Failure of Traditional Database in Handling Big Data
1.4 3 Vs of Big Data
1.5 Sources of Big Data
1.6 Different Types of Data
1.7 Big Data Infrastructure
1.8 Big Data Life Cycle
1.9 Big Data Technology
1.10 Big Data Applications
1.11 Big Data Use Cases
Chapter 1 Refresher
2 Big Data Storage Concepts
2.1 Cluster Computing
2.2 Distribution Models
2.3 Distributed File System
2.4 Relational and Non-Relational Databases
2.5 Scaling Up and Scaling Out Storage
Chapter 2 Refresher
3 NoSQL Database
3.1 Introduction to NoSQL
3.2 Why NoSQL
3.3 CAP Theorem
3.4 ACID
3.5 BASE
3.6 Schemaless Databases
3.7 NoSQL (Not Only SQL)
3.8 Migrating from RDBMS to NoSQL
Chapter 3 Refresher
4 Processing, Management Concepts, and Cloud Computing
Part I: Big Data Processing and Management Concepts
4.1 Data Processing
4.2 Shared Everything Architecture
4.3 Shared-Nothing Architecture
4.4 Batch Processing
4.5 Real-Time Data Processing
4.6 Parallel Computing
4.7 Distributed Computing
4.8 Big Data Virtualization
Part II: Managing and Processing Big Data in Cloud Computing
4.9 Introduction
4.10 Cloud Computing Types
4.11 Cloud Services
4.12 Cloud Storage
4.13 Cloud Architecture
Chapter 4 Refresher
5 Driving Big Data with Hadoop Tools and Technologies
5.1 Apache Hadoop
5.2 Hadoop Storage
5.3 Hadoop Computation
5.4 Hadoop 2.0
5.5 HBASE
5.6 Apache Cassandra
5.7 SQOOP
5.8 Flume
5.9 Apache Avro
5.10 Apache Pig
5.11 Apache Mahout
5.12 Apache Oozie
5.13 Apache Hive
5.14 Hive Architecture
5.15 Hadoop Distributions
Chapter 5 Refresher
6 Big Data Analytics
6.1 Terminology of Big Data Analytics
6.2 Big Data Analytics
6.3 Data Analytics Life Cycle
6.4 Big Data Analytics Techniques
6.5 Semantic Analysis
6.6 Visual analysis
6.7 Big Data Business Intelligence
6.8 Big Data Real-Time Analytics Processing
6.9 Enterprise Data Warehouse
Chapter 6 Refresher
7 Big Data Analytics with Machine Learning
7.1 Introduction to Machine Learning
7.2 Machine Learning Use Cases
7.3 Types of Machine Learning
Chapter 7 Refresher
8 Mining Data Streams and Frequent Itemset
8.1 Itemset Mining
8.2 Association Rules
8.3 Frequent Itemset Generation
8.4 Itemset Mining Algorithms
8.5 Maximal and Closed Frequent Itemset
8.6 Mining Maximal Frequent Itemsets: the GenMax Algorithm
8.7 Mining Closed Frequent Itemsets: the Charm Algorithm
8.8 CHARM Algorithm Implementation
8.9 Data Mining Methods
8.10 Prediction
8.11 Important Terms Used in Bayesian Network
8.12 Density Based Clustering Algorithm
8.13 DBSCAN
8.14 Kernel Density Estimation
8.15 Mining Data Streams
8.16 Time Series Forecasting
9 Cluster Analysis
9.1 Clustering
9.2 Distance Measurement Techniques
9.3 Hierarchical Clustering
9.4 Analysis of Protein Patterns in the Human Cancer-Associated Liver
9.5 Recognition Using Biometrics of Hands
9.6 Expectation Maximization Clustering Algorithm
9.7 Representative-Based Clustering
9.8 Methods of Determining the Number of Clusters
9.9 Optimization Algorithm
9.10 Choosing the Number of Clusters
9.11 Bayesian Analysis of Mixtures
9.12 Fuzzy Clustering
9.13 Fuzzy C-Means Clustering
10 Big Data Visualization
10.1 Big Data Visualization
10.2 Conventional Data Visualization Techniques
10.3 Tableau
10.4 Bar Chart in Tableau
10.5 Line Chart
10.6 Pie Chart
10.7 Bubble Chart
10.8 Box Plot
10.9 Tableau Use Cases
10.10 Installing R and Getting Ready
10.11 Data Structures in R
10.12 Importing Data from a File
10.13 Importing Data from a Delimited Text File
10.14 Control Structures in R
10.15 Basic Graphs in R
Index
1 Introduction to the World of Big Data
CHAPTER OBJECTIVE
This chapter introduces big data and defines what the term actually means. It explains the limitations of the traditional database that led to the evolution of big data and introduces the key concepts of big data. A comparison between big data and the traditional database gives a clear picture of the drawbacks of the latter and the advantages of the former. The three Vs of big data (volume, velocity, and variety) that distinguish it from the traditional database are explained. With the evolution of big data, we are no longer limited to structured data. The different types of human- and machine-generated data that big data can handle (structured, semi-structured, and unstructured) are explained, and the various sources contributing to this massive volume of data are described. The chapter then walks through the stages of the big data life cycle, from data generation, acquisition, preprocessing, integration, cleaning, and transformation to analysis and visualization in support of business decisions. Finally, it sheds light on the challenges big data poses because of its heterogeneity, volume, velocity, and more.
1.1 Understanding Big Data
With the rapid growth in the number of Internet users, the amount of data being generated is growing exponentially. The data comes from the millions of messages we send via WhatsApp, Facebook, or Twitter, from the trillions of photos taken, and from the hours and hours of video uploaded to YouTube every single minute. According to a recent survey, 2.5 quintillion (2 500 000 000 000 000 000, or 2.5 × 10^18) bytes of data are generated every day. This enormous amount of generated data is referred to as "big data." Big data does not only mean that the data sets are large; it is a blanket term for data that are too large in size, complex in nature, possibly structured or unstructured, and arriving at high velocity. Of the data available today, 80 percent has been generated in the last few years. The growth of big data is fueled by the fact that more data that need to be captured are generated in every corner of the world.
Capturing this massive data yields only meager value unless the IT value is transformed into business value. Managing and analyzing data have always been beneficial to organizations; converting these data into valuable business insights, on the other hand, has always been the greatest challenge. Data scientists struggled to find pragmatic techniques to analyze the captured data. The data have to be managed at the appropriate speed and time to derive valuable insight from them. These data are so complex that it became difficult to process them using traditional database management systems, which triggered the evolution of the big data era. Additionally, there were constraints on the amount of data that traditional databases could handle: as the size of the data increased, either performance dropped and latency rose, or it became expensive to add more memory units. These limitations have been overcome by big data technologies, which let us capture, store, process, and analyze data in a distributed environment. Examples of big data technologies are Hadoop, a framework for big data processing; the Hadoop Distributed File System (HDFS), for distributed cluster storage; and MapReduce, for processing.
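To make the MapReduce idea mentioned above a little more concrete, here is a minimal single-process sketch of the map, shuffle, and reduce steps applied to a word count. It is plain Python, not the Hadoop API; the function names and the sample input lines are assumptions made for illustration, and a real cluster would run the map and reduce steps in parallel on many machines.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Group emitted values by key, as a framework would between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce sequentially over the input lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical input: each string stands in for one block of a distributed file.
splits = ["big data needs big storage", "data arrives at high velocity"]
counts = word_count(splits)
print(counts)
```

Because each line is mapped independently and each key is reduced independently, both steps could be spread across cluster nodes without changing the result, which is the property that makes this style of processing suit distributed storage.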
1.2 Evolution of Big Data
The first documented appearance of the term big data was in a 1997 paper by NASA scientists describing the problems they faced in visualizing large data sets, which posed a captivating challenge for data scientists. The data sets were large enough to tax memory resources, and this problem was termed big data. Big data as a broader concept was first put forward by the noted consultancy McKinsey. The three dimensions of big data, namely volume, velocity, and variety, were defined by analyst Doug Laney. The processing life cycle of big data can be categorized into acquisition, preprocessing, storage and management, privacy and security, analysis, and visualization.
The broader term big data encompasses everything from web data, such as clickstream data, to patient health data, genomic data from biological research, and so forth.
Figure 1.1 shows the evolution of big data. The growth of data over the years has been massive: from just 600 MB in the 1950s to as much as 100 petabytes by 2010, which is equal to 100 000 000 000 MB.
Figure 1.1 Evolution of Big Data.
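The unit conversion behind these figures can be checked with a few lines of arithmetic. The sketch below assumes decimal (SI) units, where 1 MB is 10^6 bytes and 1 PB is 10^15 bytes; the variable and function names are illustrative only.

```python
# Decimal (SI) byte units, an assumption; binary units would give different figures.
MB = 10**6
PB = 10**15


def petabytes_to_megabytes(petabytes):
    """Convert a whole number of petabytes to megabytes."""
    return petabytes * PB // MB


size_1950s_mb = 600                          # figure quoted in the text
size_2010_mb = petabytes_to_megabytes(100)   # 100 PB expressed in MB
growth_factor = size_2010_mb / size_1950s_mb

print(f"{size_2010_mb:,} MB")
print(f"growth factor: about {growth_factor:.2e}")
```

Under these assumptions, 100 PB comes out to 100 000 000 000 MB, matching the figure in the text, and the growth from 600 MB is a factor on the order of 10^8.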
1.3 Failure of Traditional Database in Handling Big Data
Until recently, the Relational Database Management System (RDBMS) was the most prevalent medium for storing the data generated by organizations. A large number of vendors provide database systems. These RDBMS were devised to store data that were beyond the storage capacity of a single computer. A new technology always arises from limitations in older technologies and the need to overcome them. The limitations of the traditional database in handling big data are listed below.
- The exponential increase in data volume, which now scales to terabytes and petabytes, has become a challenge for RDBMS.
- To address this issue, the RDBMS increased the number of processors and added more memory units, which in turn increased the cost.
- Almost 80% of the data fetched were in semi-structured or unstructured formats, which RDBMS could not deal with.
- RDBMS could not capture the data coming in at high velocity.
Table 1.1 shows the differences in the attributes of RDBMS and big data.
1.3.1 Data Mining vs. Big Data
Table 1.2 shows a comparison between data mining and big data.
Table 1.1 Differences in the attributes of big data and RDBMS.

| Attributes | RDBMS | Big data |
|---|---|---|
| Data volume | Gigabytes to terabytes | Petabytes to zettabytes |
| Organization | Centralized | Distributed |
| Data type | Structured | Unstructured and semi-structured |
| Hardware type | High-end model | Commodity hardware |
| Updates | Read/write many times | Write once, read many times |
| Schema | Static | Dynamic |

Table 1.2 Data Mining vs. Big Data.

| S. No. | Data mining | Big data |
|---|---|---|
| 1) | Data mining is the process of discovering the underlying knowledge from the data sets. | Big data refers to massive volume of data characterized by volume, velocity, and variety. |
| 2) | Structured data retrieved from spread sheets, relational databases, etc. | Structured, unstructured, or semi-structured data retrieved from non-relational databases, such as NoSQL. |
| 3) | Data mining is capable of processing large data sets, but the data processing costs are high. | Big data tools and technologies are capable of storing and processing large volumes of data at a comparatively lower cost. |
| 4) | Data mining can process only data sets that range from gigabytes to terabytes. | Big data technology is capable of storing and processing data that range from petabytes to zettabytes. |

1.4 3 Vs of Big Data
Big data is distinguished by exceptional characteristics along several dimensions, which Figure 1.2 illustrates. The first dimension is the size of the data, referred to as "volume" in big data technology. Data size grows in part because cluster storage on commodity hardware has made storing it cost effective. Commodity hardware is low-cost, low-performance, low-specification functional hardware with no distinctive features. The second dimension is variety, which describes heterogeneity: the ability to accept all data types, be they structured, unstructured, or a mix of both. The third dimension is velocity, which relates to the rate at which data are generated and processed to derive the desired value from the raw, unprocessed data. The complexity of the captured data poses both a new opportunity and a challenge for today's information technology era.
Figure 1.2 3 Vs of big data.
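As a rough illustration of the variety dimension, the sketch below reads one structured record (CSV with a fixed header), one semi-structured record (JSON with optional, nested fields), and one unstructured record (free text). All sample values and field names are invented for this example and do not come from the book.

```python
import csv
import io
import json

# Structured: fixed columns, every row follows the same schema.
structured = "id,name,city\n1,Asha,Vellore\n"

# Semi-structured: self-describing keys, fields may vary from record to record.
semi_structured = '{"id": 2, "name": "Ravi", "tags": ["iot", "sensor"]}'

# Unstructured: free text with no predefined fields.
unstructured = "Customer called to say the delivery arrived late but intact."

rows = list(csv.DictReader(io.StringIO(structured)))  # list of dicts keyed by header
record = json.loads(semi_structured)                  # nested dict and list structure
word_total = len(unstructured.split())                # only crude measures apply directly

print(rows[0]["city"], record["tags"], word_total)
```

The structured row can be queried by column name straight away, the JSON record needs its keys inspected at read time, and the free text needs further processing before any field-like information can be extracted.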
1.4.1 Volume
Data generated and processed by big data are continuously growing at an ever increasing pace. Volume grows exponentially owing to the fact that business enterprises are continuously capturing the data to make better and bigger business...
System requirements
File format: ePUB
Copy protection: Adobe-DRM (Digital Rights Management)
System requirements:
- Computer (Windows; MacOS X; Linux): Install the free reader Adobe Digital Editions prior to download (see eBook Help).
- Tablet/smartphone (Android; iOS): Install the free app Adobe Digital Editions or the app PocketBook before downloading (see eBook Help).
- E-reader: Bookeen, Kobo, Pocketbook, Sony, Tolino and many more (not Kindle).
The file format ePub works well for novels and non-fiction books, i.e., "flowing" text without complex layout. On an e-reader or smartphone, line and page breaks automatically adjust to fit the small displays.
This eBook uses Adobe-DRM, a "hard" copy protection. If the necessary requirements are not met, unfortunately you will not be able to open the eBook. You will therefore need to prepare your reading hardware before downloading.
Please note: We strongly recommend that you authorise using your personal Adobe ID after installation of any reading software.
For more information, see our ebook Help page.