Provides an up-to-date analysis of big data and multi-agent systems
The term Big Data refers to cases where data sets are too large or too complex for traditional data-processing software. With the spread of new concepts such as Edge Computing and the Internet of Things, the production, processing, and consumption of this data become more and more distributed. As a result, applications increasingly require multiple agents that can work together. A multi-agent system (MAS) is a self-organized computer system comprising multiple intelligent agents that interact to solve problems beyond the capacities of any individual agent. Modern Big Data Architectures examines modern concepts and architectures for Big Data processing and analytics.
This unique, up-to-date volume provides a joint analysis of big data and multi-agent systems, with emphasis on distributed, intelligent processing of very large data sets. Each chapter contains practical examples and detailed solutions suitable for a wide variety of applications. The author, an internationally recognized expert in Big Data and distributed Artificial Intelligence, demonstrates how base concepts such as agent, actor, and microservice have reached a point of convergence, enabling next-generation systems to be built by incorporating the best aspects of the field.
Modern Big Data Architectures: A Multi-Agent Systems Perspective is a timely and important resource for data science professionals and students involved in Big Data analytics, machine learning, and artificial intelligence.
DOMINIK RYZKO is an Assistant Professor at the Institute of Computer Science at Warsaw University of Technology. His research interests include Big Data and Distributed Artificial Intelligence. He is widely published, serves on program committees of international conferences, and is Vice President of Artificial Intelligence and Analytics at Adform, a global ad-tech platform provider. He also spent three years at Allegro Group as Chief Data Scientist, where he oversaw Data Science activities, the design and methodology of experiments, and model building.
List of Figures
List of Tables
Preface
Acknowledgments
Acronyms
Chapter 1 Introduction
1.1 Motivation
1.2 Assumptions
1.3 For Whom is This Book?
1.4 Book Structure
Chapter 2 Evolution of IT Architectures and Paradigms
2.1 Evolution of IT Architectures
2.1.1 Monolith
2.1.2 Service Oriented Architecture
2.1.3 Microservices
2.2 Actors and Agents
2.2.1 Actors
2.2.2 Agents
2.3 From ACID to BASE, CAP, and NoSQL - The Database (R)evolution
2.4 The Cloud
2.5 From Distributed Sensor Networks to the Internet of Things and Cyber-Physical Systems
2.6 The Rise of Big Data
Chapter 3 Sources of Data
3.1 The Internet
3.1.1 The Semantic Web
3.1.2 Linked Data
3.1.3 Knowledge Graphs
3.1.4 Social Media
3.1.5 Web Mining
3.2 Scientific Data
3.2.1 Biomedical Data
3.2.2 Physics and Astrophysics Data
3.2.3 Environmental Sciences
3.3 Industrial Data
3.3.1 Smart Factories
3.3.2 SmartGrid
3.3.3 Aviation
3.4 Internet of Things
Chapter 4 Big Data Tasks
4.1 Recommender Systems
4.2 Search
4.3 Ad-tech and RTB Algorithms
4.4 Cross-Device Graph Generation
4.5 Forecasting and Prediction Systems
4.6 Social Media Big Data
4.7 Anomaly and Fraud Detection
4.8 New Drug Discovery
4.9 Smart Grid Control and Monitoring
4.10 IoT and Big Data Applications
Chapter 5 Cloud Computing
5.1 Cloud Enabled Architectures
5.1.1 Cloud Management Platforms
5.1.2 Efficient Cloud Computing
5.1.3 Distributed Storage Systems
5.2 Agents and the Cloud
5.2.1 Multi-agent Versus Cloud Paradigms
5.2.2 Agents in the Cloud
Chapter 6 Big Data Architectures
6.1 Big Data Computation Models
6.1.1 MapReduce
6.1.2 Directed Acyclic Graph Models
6.1.3 All-Pairs
6.1.4 Very Large Bitmap Operations
6.1.5 Message Passing Interface
6.1.6 Graphical Processing Unit Computing
6.2 Publish-Subscribe Systems
6.3 Stream Processing
6.3.1 Information Flow Processing Concepts
6.3.2 Stream Processing Systems
6.4 Higher Level Big Data Architectures
6.4.1 Spark
6.4.2 Lambda
6.4.3 Multi-Agent View of the Lambda Architecture
6.4.4 Questioning the Lambda
6.5 Industry and Other Approaches
6.6 Actor and Agent-Based Big Data Architectures
Chapter 7 Big Data Analytics, Mining, and Machine Learning
7.1 To SQL or Not to SQL
7.1.1 SQL Hadoop Interfaces
7.1.2 From Shark to SparkSQL
7.2 Big Data Mining and Machine Learning
7.2.1 Graph Mining
7.2.2 Agent Based Machine Learning and Data Mining
Chapter 8 Physically Distributed Systems - Mobile Cloud, Internet of Things, Edge Computing
8.1 Mobile Cloud
8.2 Edge and Fog Computing
8.2.1 Business Case: Mobile Context Aware Recommender System
8.3 Internet of Things
8.3.1 IoT Fundamentals
8.3.2 IoT and the Cloud
8.3.3 MAS in IoT
Chapter 9 Summary
Bibliography
Index
Over recent decades, corporate IT architectures have evolved significantly: from the large monolith application, through the introduction of web services and the emergence of Service Oriented Architecture (SOA), which in turn evolved into microservices, to the wide adoption of cloud computing and, most recently, edge computing, the Internet of Things, and cyber-physical systems. Each of these steps required a change in the way we produce, process, store, and analyze data, which will be explored in the subsequent sections of this chapter.
Back in the 1990s, corporate systems were built mainly as large monolith applications, composed of tightly coupled modules with strong interdependencies. This caused high development and maintenance costs. At the beginning of the software development process it is beneficial to have all the building blocks in one place, but as the system grows, tracking all the internal dependencies becomes tedious and the code base becomes hard to manage. The growing size and complexity of a monolith affects every step of the software life cycle: design, development, testing, and deployment.
Each design and development decision taken in a monolith system has long-lasting consequences. This phenomenon is well described by the term technical debt, coined by Cunningham [1993]. The larger the system, the more reluctant we are to introduce necessary changes, and the debt grows.
Figure 2.1 BI in monolith architecture.
In monolith systems, scalability is limited. More instances of the system can be set up to introduce load balancing, but replicating the entire functionality each time is very costly. Demand for different functionalities can vary, and we have no tools to scale them separately.
On the other hand, it is relatively easy to manage and analyze the data processed by such systems. We usually have a single underlying database with a relational schema, which can easily be exported to an analytical environment, typically a data warehouse, where a set of BI tools produces reports, KPI visuals, dashboards, and so on. In the worst case we have to deal with a handful of monolith systems (e.g. ERP, CRM, billing) and introduce some form of Extract Transform Load (ETL) processing in order to combine them before loading into the warehouse, as sketched below. Figure 2.1 shows the overall reporting architecture in the world of monolith systems.
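As a rough illustration of such ETL processing (all table names, columns, and sample data below are hypothetical, not taken from the book), the following sketch extracts customer data from two monolith databases, reconciles their schemas, and loads the result into a warehouse table:

```python
import sqlite3

# Self-contained sketch: in practice these would be separate servers
# (e.g. an ERP, a CRM, and the warehouse), not in-memory databases.
erp = sqlite3.connect(":memory:")
erp.execute("CREATE TABLE customers (id INTEGER, name TEXT, revenue REAL)")
erp.execute("INSERT INTO customers VALUES (1, 'Acme', 1200.0)")

crm = sqlite3.connect(":memory:")
crm.execute("CREATE TABLE support_accounts (customer_id INTEGER, open_tickets INTEGER)")
crm.execute("INSERT INTO support_accounts VALUES (1, 3)")

dwh = sqlite3.connect(":memory:")
dwh.execute("CREATE TABLE dim_customer (id INTEGER, name TEXT, revenue REAL, tickets INTEGER)")

# Extract: pull raw rows from each monolith's relational schema.
erp_rows = erp.execute("SELECT id, name, revenue FROM customers").fetchall()
crm_rows = crm.execute("SELECT customer_id, open_tickets FROM support_accounts").fetchall()

# Transform: reconcile the two source schemas into one customer record.
tickets = dict(crm_rows)
combined = [(cid, name, revenue, tickets.get(cid, 0)) for cid, name, revenue in erp_rows]

# Load: write the unified records into the warehouse table.
dwh.executemany("INSERT INTO dim_customer VALUES (?, ?, ?, ?)", combined)
dwh.commit()
print(dwh.execute("SELECT * FROM dim_customer").fetchall())
```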
The methodology for creating and maintaining a data warehouse is by now well researched. Typically, a number of distinct layers can be identified in such a system.
Figure 2.2 Data warehouse architecture.
Source: Kimball and Ross (2011). Reproduced with permission.
In large organizations, data marts are usually created: subsets of the overall data, limited and optimized for specific groups of users. Data marts are efficient for analysis across multiple predefined dimensions such as time, region, and product [Kimball and Ross, 2011]; a minimal sketch of such dimensional analysis follows below. A data warehouse architecture is shown in Figure 2.2.
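To make dimensional analysis concrete, here is a minimal sketch (the dimension and fact columns are hypothetical) of how a data mart query reduces to grouping a fact by predefined dimensions:

```python
import pandas as pd

# A hypothetical slice of a sales data mart: one fact (amount)
# described by three predefined dimensions (time, region, product).
sales = pd.DataFrame({
    "quarter": ["2023-Q1", "2023-Q1", "2023-Q2", "2023-Q2"],
    "region":  ["EU", "US", "EU", "US"],
    "product": ["A", "A", "B", "A"],
    "amount":  [120.0, 80.0, 200.0, 95.0],
})

# Analysis across dimensions reduces to grouping and aggregating.
report = sales.groupby(["quarter", "region"])["amount"].sum()
print(report)
```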
While ETL processes in a large organization can become quite complex, entities coming from a single monolith system are well structured and related to each other. What remains is managing the relations between the data sets from various monoliths, and from external sources if we wish to include them in our reporting setup.
In the 2000s, Service Oriented Architecture (SOA) paradigms were introduced. The idea was to break large systems into reusable components implementing specific groups of functionalities, accessible through strictly defined APIs. In SOA, services are more loosely coupled than in monolith systems. In other words, services are self-describing, open components that support rapid, low-cost composition of distributed applications [Papazoglou, 2003].
The Open Group formally defines SOA as an architectural style that supports service orientation. According to this definition, a service:
- is a logical representation of a repeatable business activity that has a specified outcome (e.g. check customer credit, provide weather data);
- is self-contained;
- may be composed of other services;
- is a "black box" to consumers of the service.
Such a setup requires a composition layer, which provides coordination, monitoring, conformance, and QoS functionalities in order to deliver composite services to the clients. The backbone of the SOA system which allows it to do this is called the Enterprise Service Bus (ESB). Specific tasks handled by the ESB include connectivity, message routing, and data transformation, among others [Josuttis, 2007]; a sketch of this mediation role follows below.
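The following is a minimal sketch of ESB-style mediation, assuming hypothetical service names and message formats: the bus transforms each incoming message and routes it to the registered service, so the sender never addresses a service directly:

```python
def to_internal_format(message):
    """Transformation: normalize field names before routing (hypothetical schema)."""
    return {"customer_id": message["custNo"], "payload": message["body"]}

class ServiceBus:
    """Minimal stand-in for an ESB: connectivity, routing, transformation."""

    def __init__(self):
        self.routes = {}

    def register(self, message_type, service):
        # Connectivity: services attach themselves to the bus.
        self.routes[message_type] = service

    def send(self, message_type, message):
        # Routing: the bus, not the sender, decides which service is called.
        return self.routes[message_type](to_internal_format(message))

bus = ServiceBus()
bus.register("order.created", lambda m: f"billing handles customer {m['customer_id']}")
print(bus.send("order.created", {"custNo": 42, "body": "..."}))
```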
In order to manage the business processes, specific languages, such as the XML-based BPEL (Business Process Execution Language), and business process servers have been introduced. The services can be built in various technologies as long as their APIs follow Web Service standards.
As the number of services and potential interactions in SOA increases, new problems arise. The dynamic nature of collaborating services means that several issues can surface at run-time: the network can lag, messages can be lost, and services can experience performance problems or crash entirely.
Monitoring of such systems therefore becomes a crucial task. Administrators need to be able to pinpoint quickly where the source of a business process failure lies. Any information that can help to anticipate potential problems in SOA before they arise and have a big impact is of great value.
From the perspective of this book, it is interesting to mention applications of Multi-Agent Systems to the issues described above. For example, Ryzko and Ihnatowicz [2011] propose to distribute intelligent agents throughout the SOA system, tasked with monitoring selected services and following process execution. Whenever a service becomes unavailable or a predefined KPI (e.g. service queue length, response time) crosses its threshold, an alert is raised. This early warning system provides the opportunity to take action before a substantial breach of the overall system SLAs takes place. A sketch of such an agent follows below.
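Here is a minimal sketch of such a monitoring agent; the KPI names, thresholds, and alerting channel are illustrative assumptions rather than the setup from Ryzko and Ihnatowicz [2011]:

```python
class MonitoringAgent:
    """Sketch of an intelligent agent watching one service's KPIs."""

    def __init__(self, service_name, read_kpis, thresholds, alert):
        self.service_name = service_name
        self.read_kpis = read_kpis    # callable returning current KPI values
        self.thresholds = thresholds  # e.g. {"queue_length": 100, "response_time": 2.0}
        self.alert = alert            # callable used to raise early warnings

    def step(self):
        try:
            kpis = self.read_kpis()
        except Exception:
            # Service unreachable: raise an alert immediately.
            self.alert(f"{self.service_name} is unavailable")
            return
        for name, limit in self.thresholds.items():
            if kpis.get(name, 0) > limit:
                self.alert(f"{self.service_name}: {name}={kpis[name]} exceeds {limit}")

agent = MonitoringAgent(
    "billing-service",
    read_kpis=lambda: {"queue_length": 120, "response_time": 0.4},
    thresholds={"queue_length": 100, "response_time": 2.0},
    alert=print,
)
agent.step()  # in practice each agent would run this in a loop near its service
```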
The central idea of SOA is to put the emphasis on well-defined service interfaces and to hide the underlying logic and data. This creates problems if we want to analyze data in the traditional way, as with monolith systems, i.e. by plugging each service into an ETL framework and integrating it into a BI solution.
If we do not want to violate the SOA principles, we can pull the data from the services using the existing data contracts. If done regularly, this allows us to obtain the complete history of the required information. However, this model is not synchronized with the real pace at which the data is produced and can introduce delays and efficiency problems, as the polling sketch below illustrates.
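In code, the pull model amounts to periodic polling of each service's data contract; the service names, the fetch function, and the schedule below are hypothetical assumptions:

```python
def fetch_snapshot(service):
    """Stand-in for a call to the service's data contract (e.g. a Web Service API)."""
    return [{"service": service, "payload": "..."}]

history = []

def poll_all(services):
    """Pull model: the reporting side asks each service for its current data."""
    for service in services:
        # Data produced between two polls only becomes visible at the next
        # poll, so the history lags behind the real pace of data production.
        history.extend(fetch_snapshot(service))

poll_all(["orders-service", "customers-service"])
print(len(history), "records collected")
# In production this would repeat on a schedule, e.g.:
#   while True: poll_all(...); time.sleep(3600)
```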
A way to deal with these problems is to use a push model instead. This approach is called Event-Driven Architecture (EDA): the services publish events, which subscribed entities collect as they appear. This reduces the network load, since each piece of data is published once rather than being requested several times, as in a pull model. A minimal publish-subscribe sketch follows below.
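The push model can be sketched as a simple publish-subscribe mechanism; the topic and subscriber names here are hypothetical:

```python
from collections import defaultdict

class EventBroker:
    """Minimal publish-subscribe sketch for an event-driven architecture."""

    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self.subscribers[topic].append(handler)

    def publish(self, topic, event):
        # The producer publishes once; the broker fans out to all consumers.
        for handler in self.subscribers[topic]:
            handler(event)

broker = EventBroker()
broker.subscribe("order.created", lambda e: print("reporting saw", e))
broker.subscribe("order.created", lambda e: print("billing saw", e))
broker.publish("order.created", {"order_id": 7, "amount": 99.0})
```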
The book SOA Patterns by Rotem-Gal-Oz et al. [2012] describes an aggregated reporting pattern for SOA. The pattern is designed to overcome the distribution of data across services by creating a service that gathers immutable copies of data from multiple services for reporting purposes. The service works as follows. First, the data is transferred from the source services into a raw data store. Then it is processed by the transformation backend and put into the reporting store, which usually contains joined and aggregated data. Finally, an SQL output endpoint is provided, so that ad hoc SQL and reporting tools can be plugged in. This flow is sketched below.
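The pattern's data flow can be sketched as follows; the in-memory stores and the per-customer aggregation are deliberate simplifications, not the pattern's actual implementation:

```python
raw_store = []        # immutable copies of data gathered from services
reporting_store = {}  # joined/aggregated data, ready for reporting tools

def ingest(service_name, records):
    """Step 1: land copies of service data in the raw data store."""
    raw_store.extend({"source": service_name, **r} for r in records)

def transform():
    """Step 2: the transformation backend aggregates raw data per customer."""
    reporting_store.clear()
    for record in raw_store:
        key = record["customer_id"]
        reporting_store[key] = reporting_store.get(key, 0) + record["amount"]

ingest("orders-service", [{"customer_id": 1, "amount": 10.0}])
ingest("billing-service", [{"customer_id": 1, "amount": 5.0}])
transform()
print(reporting_store)  # step 3 would expose this store via an SQL endpoint
```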
Rotem-Gal-Oz et al. [2012] also propose four different ways of getting the data into the aggregated reporting service.