Provides an up-to-date analysis of big data and multi-agent systems
The term Big Data refers to cases where data sets are too large or too complex for traditional data-processing software. With the spread of new concepts such as Edge Computing and the Internet of Things, the production, processing, and consumption of this data become more and more distributed. As a result, applications increasingly require multiple agents that can work together. A multi-agent system (MAS) is a self-organized computer system comprising multiple intelligent agents that interact to solve problems beyond the capacities of any individual agent. Modern Big Data Architectures examines modern concepts and architectures for Big Data processing and analytics.
This unique, up-to-date volume provides a joint analysis of big data and multi-agent systems, with emphasis on distributed, intelligent processing of very large data sets. Each chapter contains practical examples and detailed solutions suitable for a wide variety of applications. The author, an internationally recognized expert in Big Data and distributed Artificial Intelligence, demonstrates how base concepts such as agent, actor, and microservice have reached a point of convergence, enabling next-generation systems to be built by incorporating the best aspects of the field.
Modern Big Data Architectures: A Multi-Agent Systems Perspective is a timely and important resource for data science professionals and students involved in Big Data analytics, machine learning, and artificial intelligence.
DOMINIK RYZKO is an Assistant Professor at the Institute of Computer Science at Warsaw University of Technology. His research interests include Big Data and Distributed Artificial Intelligence. He is widely published, serves on program committees of international conferences, and is Vice President of Artificial Intelligence and Analytics at Adform, a global ad-tech platform provider. He also spent three years at Allegro Group as Chief Data Scientist, where he oversaw Data Science activities, the design and methodology of experiments, and model building.
List of Figures
List of Tables
Preface
Acknowledgments
Acronyms
Chapter 1 Introduction
1.1 Motivation
1.2 Assumptions
1.3 For Whom is This Book?
1.4 Book Structure
Chapter 2 Evolution of IT Architectures and Paradigms
2.1 Evolution of IT Architectures
2.1.1 Monolith
2.1.2 Service Oriented Architecture
2.1.3 Microservices
2.2 Actors and Agents
2.2.1 Actors
2.2.2 Agents
2.3 From ACID to BASE, CAP, and NoSQL - The Database (R)evolution
2.4 The Cloud
2.5 From Distributed Sensor Networks to the Internet of Things and Cyber-Physical Systems
2.6 The Rise of Big Data
Chapter 3 Sources of Data
3.1 The Internet
3.1.1 The Semantic Web
3.1.2 Linked Data
3.1.3 Knowledge Graphs
3.1.4 Social Media
3.1.5 Web Mining
3.2 Scientific Data
3.2.1 Biomedical Data
3.2.2 Physics and Astrophysics Data
3.2.3 Environmental Sciences
3.3 Industrial Data
3.3.1 Smart Factories
3.3.2 SmartGrid
3.3.3 Aviation
3.4 Internet of Things
Chapter 4 Big Data Tasks
4.1 Recommender Systems
4.2 Search
4.3 Ad-tech and RTB Algorithms
4.4 Cross-Device Graph Generation
4.5 Forecasting and Prediction Systems
4.6 Social Media Big Data
4.7 Anomaly and Fraud Detection
4.8 New Drug Discovery
4.9 Smart Grid Control and Monitoring
4.10 IoT and Big Data Applications
Chapter 5 Cloud Computing
5.1 Cloud Enabled Architectures
5.1.1 Cloud Management Platforms
5.1.2 Efficient Cloud Computing
5.1.3 Distributed Storage Systems
5.2 Agents and the Cloud
5.2.1 Multi-agent Versus Cloud Paradigms
5.2.2 Agents in the Cloud
Chapter 6 Big Data Architectures
6.1 Big Data Computation Models
6.1.1 MapReduce
6.1.2 Directed Acyclic Graph Models
6.1.3 All-Pairs
6.1.4 Very Large Bitmap Operations
6.1.5 Message Passing Interface
6.1.6 Graphical Processing Unit Computing
6.2 Publish-Subscribe Systems
6.3 Stream Processing
6.3.1 Information Flow Processing Concepts
6.3.2 Stream Processing Systems
6.4 Higher Level Big Data Architectures
6.4.1 Spark
6.4.2 Lambda
6.4.3 Multi-Agent View of the Lambda Architecture
6.4.4 Questioning the Lambda
6.5 Industry and Other Approaches
6.6 Actor and Agent-Based Big Data Architectures
Chapter 7 Big Data Analytics, Mining, and Machine Learning
7.1 To SQL or Not to SQL
7.1.1 SQL Hadoop Interfaces
7.1.2 From Shark to SparkSQL
7.2 Big Data Mining and Machine Learning
7.2.1 Graph Mining
7.2.2 Agent Based Machine Learning and Data Mining
Chapter 8 Physically Distributed Systems - Mobile Cloud, Internet of Things, Edge Computing
8.1 Mobile Cloud
8.2 Edge and Fog Computing
8.2.1 Business Case: Mobile Context Aware Recommender System
8.3 Internet of Things
8.3.1 IoT Fundamentals
8.3.2 IoT and the Cloud
8.3.3 MAS in IoT
Chapter 9 Summary
Bibliography
Index
Over recent decades, corporate IT architectures have evolved significantly: from the large monolith application, through the introduction of web services and the emergence of Service Oriented Architecture (SOA), which in turn evolved into microservices, to the wide adoption of cloud computing and, most recently, edge computing, the Internet of Things, and cyber-physical systems. Each of these steps required a change in the way we produce, process, store, and analyze data, which will be explored in the subsequent sections of this chapter.
Back in the 1990s, corporate systems were built mainly as large monolith applications, composed of tightly coupled modules with strong interdependencies. This caused high development and maintenance costs. At the beginning of the software development process it is beneficial to have all the building blocks in one place, but as the system grows, tracking all the internal dependencies becomes tedious and the code base becomes hard to manage. The growing size and complexity of a monolith affects every step of the software life cycle: design, development, testing, and deployment.
Each design and development decision taken in a monolith system has long-lasting consequences. This phenomenon is well described by the term technical debt, coined by Cunningham [1993]. The larger the system, the more reluctant we are to introduce necessary changes, and the debt grows.
Figure 2.1 BI in monolith architecture.
In monolith systems, scalability is limited. More instances of the system can be set up to introduce load balancing, but replicating the entire functionality each time is very costly. Demand for different functionalities can vary, and we have no tools to scale them separately.
On the other hand, it is relatively easy to manage and analyze the data processed by such systems. We usually have a single underlying database with a relational schema, which can easily be exported to an analytical environment, typically a data warehouse, where a set of BI tools produces reports, KPI visuals, dashboards, and so on. In the worst case we have to deal with a handful of monolith systems (e.g. ERP, CRM, billing) and introduce some form of Extract Transform Load (ETL) processing in order to combine them before loading into the warehouse, as sketched below. Figure 2.1 shows the overall reporting architecture in the world of monolith systems.
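As a rough illustration of such ETL processing (all table names, columns, and sample data below are hypothetical, not taken from the book), the following sketch extracts customer data from two monolith databases, reconciles their schemas, and loads the result into a warehouse table:

```python
import sqlite3

# Self-contained sketch: in practice these would be separate servers
# (e.g. an ERP, a CRM, and the warehouse), not in-memory databases.
erp = sqlite3.connect(":memory:")
erp.execute("CREATE TABLE customers (id INTEGER, name TEXT, revenue REAL)")
erp.execute("INSERT INTO customers VALUES (1, 'Acme', 1200.0)")

crm = sqlite3.connect(":memory:")
crm.execute("CREATE TABLE support_accounts (customer_id INTEGER, open_tickets INTEGER)")
crm.execute("INSERT INTO support_accounts VALUES (1, 3)")

dwh = sqlite3.connect(":memory:")
dwh.execute("CREATE TABLE dim_customer (id INTEGER, name TEXT, revenue REAL, tickets INTEGER)")

# Extract: pull raw rows from each monolith's relational schema.
erp_rows = erp.execute("SELECT id, name, revenue FROM customers").fetchall()
crm_rows = crm.execute("SELECT customer_id, open_tickets FROM support_accounts").fetchall()

# Transform: reconcile the two source schemas into one customer record.
tickets = dict(crm_rows)
combined = [(cid, name, revenue, tickets.get(cid, 0)) for cid, name, revenue in erp_rows]

# Load: write the unified records into the warehouse table.
dwh.executemany("INSERT INTO dim_customer VALUES (?, ?, ?, ?)", combined)
dwh.commit()
print(dwh.execute("SELECT * FROM dim_customer").fetchall())
```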
The methodology for creating and maintaining a data warehouse is by now well researched. Typically, a number of distinct layers can be identified in such a system.
Figure 2.2 Data warehouse architecture.
Source: Kimball and Ross (2011). Reproduced with permission.
In large organizations, data marts are usually created: subsets of the overall data, limited and optimized for specific groups of users. Data marts are efficient for analysis across multiple predefined dimensions such as time, region, and product [Kimball and Ross, 2011]; a minimal sketch of such dimensional analysis follows below. A data warehouse architecture is shown in Figure 2.2.
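To make dimensional analysis concrete, here is a minimal sketch (the dimension and fact columns are hypothetical) of how a data mart query reduces to grouping a fact by predefined dimensions:

```python
import pandas as pd

# A hypothetical slice of a sales data mart: one fact (amount)
# described by three predefined dimensions (time, region, product).
sales = pd.DataFrame({
    "quarter": ["2023-Q1", "2023-Q1", "2023-Q2", "2023-Q2"],
    "region":  ["EU", "US", "EU", "US"],
    "product": ["A", "A", "B", "A"],
    "amount":  [120.0, 80.0, 200.0, 95.0],
})

# Analysis across dimensions reduces to grouping and aggregating.
report = sales.groupby(["quarter", "region"])["amount"].sum()
print(report)
```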
While ETL processes in a large organization can become quite complex, entities coming from a single monolith system are well structured and related to each other. What remains is managing the relations between the data sets from various monoliths, and from external sources if we wish to include them in our reporting setup.
In the 2000s, Service Oriented Architecture (SOA) paradigms were introduced. The idea was to break large systems into reusable components implementing specific groups of functionalities, accessible through strictly defined APIs. In SOA, services are more loosely coupled than in monolith systems. In other words, services are self-describing, open components that support rapid, low-cost composition of distributed applications [Papazoglou, 2003].
The Open Group formally defines SOA as an architectural style that supports service orientation. According to this definition, a service:
- is a logical representation of a repeatable business activity that has a specified outcome (e.g. check customer credit, provide weather data);
- is self-contained;
- may be composed of other services;
- is a "black box" to consumers of the service.
Such a setup requires a composition layer, which provides coordination, monitoring, conformance, and QoS functionalities in order to deliver composite services to the clients. The backbone of the SOA system which allows it to do this is called the Enterprise Service Bus (ESB). Specific tasks handled by the ESB include connectivity, message routing, and data transformation, among others [Josuttis, 2007]; a sketch of this mediation role follows below.
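The following is a minimal sketch of ESB-style mediation, assuming hypothetical service names and message formats: the bus transforms each incoming message and routes it to the registered service, so the sender never addresses a service directly:

```python
def to_internal_format(message):
    """Transformation: normalize field names before routing (hypothetical schema)."""
    return {"customer_id": message["custNo"], "payload": message["body"]}

class ServiceBus:
    """Minimal stand-in for an ESB: connectivity, routing, transformation."""

    def __init__(self):
        self.routes = {}

    def register(self, message_type, service):
        # Connectivity: services attach themselves to the bus.
        self.routes[message_type] = service

    def send(self, message_type, message):
        # Routing: the bus, not the sender, decides which service is called.
        return self.routes[message_type](to_internal_format(message))

bus = ServiceBus()
bus.register("order.created", lambda m: f"billing handles customer {m['customer_id']}")
print(bus.send("order.created", {"custNo": 42, "body": "..."}))
```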
In order to manage the business processes, specific languages, such as the XML-based BPEL (Business Process Execution Language), and business process servers have been introduced. The services can be built in various technologies as long as their APIs follow Web Service standards.
As the number of services and potential interactions in SOA increases, new problems arise. The dynamic nature of collaborating services means that several issues can surface at run-time: the network can lag, messages can be lost, and services can experience performance problems or crash entirely.
Monitoring of such systems therefore becomes a crucial task. Administrators need to be able to pinpoint quickly where the source of a business process failure lies. Any information that can help to anticipate potential problems in SOA before they arise and have a big impact is of great value.
From the perspective of this book, it is interesting to mention applications of Multi-Agent Systems to the issues described above. For example, Ryzko and Ihnatowicz [2011] propose to distribute intelligent agents throughout the SOA system, tasked with monitoring selected services and following process execution. Whenever a service becomes unavailable or a predefined KPI (e.g. service queue length, response time) crosses its threshold, an alert is raised. This early warning system provides the opportunity to take action before a substantial breach of the overall system SLAs takes place. A sketch of such an agent follows below.
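Here is a minimal sketch of such a monitoring agent; the KPI names, thresholds, and alerting channel are illustrative assumptions rather than the setup from Ryzko and Ihnatowicz [2011]:

```python
class MonitoringAgent:
    """Sketch of an intelligent agent watching one service's KPIs."""

    def __init__(self, service_name, read_kpis, thresholds, alert):
        self.service_name = service_name
        self.read_kpis = read_kpis    # callable returning current KPI values
        self.thresholds = thresholds  # e.g. {"queue_length": 100, "response_time": 2.0}
        self.alert = alert            # callable used to raise early warnings

    def step(self):
        try:
            kpis = self.read_kpis()
        except Exception:
            # Service unreachable: raise an alert immediately.
            self.alert(f"{self.service_name} is unavailable")
            return
        for name, limit in self.thresholds.items():
            if kpis.get(name, 0) > limit:
                self.alert(f"{self.service_name}: {name}={kpis[name]} exceeds {limit}")

agent = MonitoringAgent(
    "billing-service",
    read_kpis=lambda: {"queue_length": 120, "response_time": 0.4},
    thresholds={"queue_length": 100, "response_time": 2.0},
    alert=print,
)
agent.step()  # in practice each agent would run this in a loop near its service
```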
The central idea of SOA is to put the emphasis on well-defined service interfaces and to hide the underlying logic and data. This creates problems if we want to analyze data in the traditional way, as with monolith systems, i.e. by plugging each service into an ETL framework and integrating it into a BI solution.
If we do not want to violate the SOA principles, we can pull the data from the services using the existing data contracts. If done regularly, this allows us to obtain the complete history of the required information. However, this model is not synchronized with the real pace at which the data is produced and can introduce delays and efficiency problems, as the polling sketch below illustrates.
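In code, the pull model amounts to periodic polling of each service's data contract; the service names, the fetch function, and the schedule below are hypothetical assumptions:

```python
def fetch_snapshot(service):
    """Stand-in for a call to the service's data contract (e.g. a Web Service API)."""
    return [{"service": service, "payload": "..."}]

history = []

def poll_all(services):
    """Pull model: the reporting side asks each service for its current data."""
    for service in services:
        # Data produced between two polls only becomes visible at the next
        # poll, so the history lags behind the real pace of data production.
        history.extend(fetch_snapshot(service))

poll_all(["orders-service", "customers-service"])
print(len(history), "records collected")
# In production this would repeat on a schedule, e.g.:
#   while True: poll_all(...); time.sleep(3600)
```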
A way to deal with these problems is to use a push model instead. This approach is called Event-Driven Architecture (EDA): the services publish events, which subscribed entities collect as they appear. This reduces the network load, since each piece of data is published once rather than being requested several times, as in a pull model. A minimal publish-subscribe sketch follows below.
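The push model can be sketched as a simple publish-subscribe mechanism; the topic and subscriber names here are hypothetical:

```python
from collections import defaultdict

class EventBroker:
    """Minimal publish-subscribe sketch for an event-driven architecture."""

    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self.subscribers[topic].append(handler)

    def publish(self, topic, event):
        # The producer publishes once; the broker fans out to all consumers.
        for handler in self.subscribers[topic]:
            handler(event)

broker = EventBroker()
broker.subscribe("order.created", lambda e: print("reporting saw", e))
broker.subscribe("order.created", lambda e: print("billing saw", e))
broker.publish("order.created", {"order_id": 7, "amount": 99.0})
```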
The book SOA Patterns by Rotem-Gal-Oz et al. [2012] describes an aggregated reporting pattern for SOA. The pattern is designed to overcome the distribution of data across services by creating a service that gathers immutable copies of data from multiple services for reporting purposes. The service works as follows. First, the data is transferred from the source services into a raw data store. Then it is processed by the transformation backend and put into the reporting store, which usually contains joined and aggregated data. Finally, an SQL output endpoint is provided, so that ad hoc SQL and reporting tools can be plugged in. This flow is sketched below.
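The pattern's data flow can be sketched as follows; the in-memory stores and the per-customer aggregation are deliberate simplifications, not the pattern's actual implementation:

```python
raw_store = []        # immutable copies of data gathered from services
reporting_store = {}  # joined/aggregated data, ready for reporting tools

def ingest(service_name, records):
    """Step 1: land copies of service data in the raw data store."""
    raw_store.extend({"source": service_name, **r} for r in records)

def transform():
    """Step 2: the transformation backend aggregates raw data per customer."""
    reporting_store.clear()
    for record in raw_store:
        key = record["customer_id"]
        reporting_store[key] = reporting_store.get(key, 0) + record["amount"]

ingest("orders-service", [{"customer_id": 1, "amount": 10.0}])
ingest("billing-service", [{"customer_id": 1, "amount": 5.0}])
transform()
print(reporting_store)  # step 3 would expose this store via an SQL endpoint
```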
Rotem-Gal-Oz et al. [2012] also propose four different ways of getting the data into the aggregated reporting service.