Data Lakes

Name: Data Lakes
Brand: Jossey-Bass
Price: 139.99 EUR
Availability: OnlineOnly

Anne Laurent, Dominique Laurent, Cédrine Madera (Editors)

Jossey-Bass (Publisher)

1st Edition

Published on 9 April 2020

244 pages

E-Book

ePUB with Adobe-DRM

System requirements

978-1-119-72041-6 (ISBN)

€139.99 incl. 7% VAT

E-Book Single Licence

Available for download

Content

Preface

Chapter 1. Introduction to Data Lakes: Definitions and Discussions
Anne LAURENT, Dominique LAURENT and Cédrine MADERA

1.1. Introduction to data lakes

1.2. Literature review and discussion

1.3. The data lake challenges

1.4. Data lakes versus decision-making systems

1.5. Urbanization for data lakes

1.6. Data lake functionalities

1.7. Summary and concluding remarks

Chapter 2. Architecture of Data Lakes
Houssem CHIHOUB, Cédrine MADERA, Christoph QUIX and Rihan HAI

2.1. Introduction

2.2. State of the art and practice

2.2.1. Definition

2.2.2. Architecture

2.2.3. Metadata

2.2.4. Data quality

2.2.5. Schema-on-read

2.3. System architecture

2.3.1. Ingestion layer

2.3.2. Storage layer

2.3.3. Transformation layer

2.3.4. Interaction layer

2.4. Use case: the Constance system

2.4.1. System overview

2.4.2. Ingestion layer

2.4.3. Maintenance layer

2.4.4. Query layer

2.4.5. Data quality control

2.4.6. Extensibility and flexibility

2.5. Concluding remarks

Chapter 3. Exploiting Software Product Lines and Formal Concept Analysis for the Design of Data Lake Architectures
Marianne HUCHARD, Anne LAURENT, Thérèse LIBOUREL, Cédrine MADERA and André MIRALLES

3.1. Our expectations

3.2. Modeling data lake functionalities

3.3. Building the knowledge base of industrial data lakes

3.4. Our formalization approach

3.5. Applying our approach

3.6. Analysis of our first results

3.7. Concluding remarks

Chapter 4. Metadata in Data Lake Ecosystems
Asma ZGOLLI, Christine COLLET+ and Cédrine MADERA

4.1. Definitions and concepts

4.2. Classification of metadata by NISO

4.2.1. Metadata schema

4.2.2. Knowledge base and catalog

4.3. Other categories of metadata

4.3.1. Business metadata

4.3.2. Navigational integration

4.3.3. Operational metadata

4.4. Sources of metadata

4.5. Metadata classification

4.6. Why metadata are needed

4.6.1. Selection of information (re)sources

4.6.2. Organization of information resources

4.6.3. Interoperability and integration

4.6.4. Unique digital identification

4.6.5. Data archiving and preservation

4.7. Business value of metadata

4.8. Metadata architecture

4.8.1. Architecture scenario 1: point-to-point metadata architecture

4.8.2. Architecture scenario 2: hub and spoke metadata architecture

4.8.3. Architecture scenario 3: tool of record metadata architecture

4.8.4. Architecture scenario 4: hybrid metadata architecture

4.8.5. Architecture scenario 5: federated metadata architecture

4.9. Metadata management

4.10. Metadata and data lakes

4.10.1. Application and workload layer

4.10.2. Data layer

4.10.3. System layer

4.10.4. Metadata types

4.11. Metadata management in data lakes

4.11.1. Metadata directory

4.11.2. Metadata storage

4.11.3. Metadata discovery

4.11.4. Metadata lineage

4.11.5. Metadata querying

4.11.6. Data source selection

4.12. Metadata and master data management

4.13. Conclusion

Chapter 5. A Use Case of Data Lake Metadata Management
Imen MEGDICHE, Franck RAVAT and Yan ZHAO

5.1. Context

5.1.1. Data lake definition

5.1.2. Data lake functional architecture

5.2. Related work

5.2.1. Metadata classification

5.2.2. Metadata management

5.3. Metadata model

5.3.1. Metadata classification

5.3.2. Schema of metadata conceptual model

5.4. Metadata implementation

5.4.1. Relational database

5.4.2. Graph database

5.4.3. Comparison of the solutions

5.5. Concluding remarks

Chapter 6. Master Data and Reference Data in Data Lake Ecosystems
Cédrine MADERA

6.1. Introduction to master data management

6.1.1. What is master data?

6.1.2. Basic definitions

6.2. Deciding what to manage

6.2.1. Behavior

6.2.2. Lifecycle

6.2.3. Cardinality

6.2.4. Lifetime

6.2.5. Complexity

6.2.6. Value

6.2.7. Volatility

6.2.8. Reuse

6.3. Why should I manage master data?

6.4. What is master data management?

6.4.1. How do I create a master list?

6.4.2. How do I maintain a master list?

6.4.3. Versioning and auditing

6.4.4. Hierarchy management

6.5. Master data and the data lake

6.6. Conclusion

Chapter 7. Linked Data Principles for Data Lakes
Alessandro ADAMOU and Mathieu D'AQUIN

7.1. Basic principles

7.2. Using Linked Data in data lakes

7.2.1. Distributed data storage and querying with linked data graphs

7.2.2. Describing and profiling data sources

7.2.3. Integrating internal and external data

7.3. Limitations and issues

7.4. The smart cities use case

7.4.1. The MK Data Hub

7.4.2. Linked data in the MK Data Hub

7.5. Take-home message

Chapter 8. Fog Computing
Arnault IOUALALEN

8.1. Introduction

8.2. A little bit of context

8.3. Every machine talks

8.4. The volume paradox

8.5. The fog, a shift in paradigm

8.6. Constraint environment challenges

8.7. Calculations and local drift

8.7.1. A short memo about computer arithmetic

8.7.2. Instability from within

8.7.3. Non-determinism from outside

8.8. Quality is everything

8.9. Fog computing versus cloud computing and edge computing

8.10. Concluding remarks: fog computing and data lake

Chapter 9. The Gravity Principle in Data Lakes
Anne LAURENT, Thérèse LIBOUREL, Cédrine MADERA and André MIRALLES

9.1. Applying the notion of gravitation to information systems

9.1.1. Universal gravitation

9.1.2. Gravitation in information systems

9.2. Impact of gravitation on the architecture of data lakes

9.2.1. The case where data are not moved

9.2.2. The case where processes are not moved

9.2.3. The case where the environment blocks the move

Glossary

References

List of Authors

Index

1
Introduction to Data Lakes: Definitions and Discussions

As stated by Power [POW 08, POW 14], a new component of information systems is emerging in the context of data-driven decision support systems. Enhancing the value of data requires that information systems contain a new data-driven component, instead of an information-driven component. This new component is precisely what is called a data lake.

In this chapter, we first briefly review existing work on data lakes and then introduce a global architecture for information systems in which data lakes appear as a new additional component, when compared to existing systems.

1.1. Introduction to data lakes

The interest in the emerging concept of data lake is increasing, as shown in Figure 1.1, which depicts the number of times the expression "data lake" has been searched for during the last five years on Google. One of the earliest research works on the topic of data lakes was published in 2015 by Fang [FAN 15].

The term data lake was first introduced in 2010 by James Dixon, then CTO of Pentaho, in a blog post [DIX 10]. In this seminal work, Dixon expected that data lakes would be huge sets of raw data, structured or not, which users could access for sampling, mining or analytical purposes.

Figure 1.1. Queries about "data lake" on Google

In 2014, Gartner [GAR 14] considered that the concept of data lake was nothing but a new way of storing data at low cost. However, a few years later, Gartner revised this claim, since data lakes had proved valuable in many companies [MAR 16a]. Consequently, Gartner now considers the data lake to be a kind of holy grail of information management when it comes to innovating through the value of data.

In the following, we review the industrial and academic literature about data lakes, aiming to better understand the emergence of this concept. Note that this review should not be considered an exhaustive state of the art on the topic, given the recent increase in published papers about data lakes.

1.2. Literature review and discussion

In [FAN 15], which is considered one of the earliest academic papers about data lakes, the author lists the following characteristics:

- storing data, in their native form, at low cost. Low cost is achieved because (1) data servers are cheap (typically based on the standard X86 technology) and (2) no data transformation, cleaning and preparation is required (thus avoiding very costly steps);
- storing various types of data, such as blobs, data from relational DBMSs, semi-structured data or multimedia data;
- transforming the data only on exploitation. This makes it possible to reduce the cost of data modeling and integrating, as done in standard data warehouse design. This feature is known as the schema-on-read approach;
- requiring specific analysis tools to use the data. This is required because data lakes store raw data;
- allowing for identifying or eliminating data;
- providing users with information on data provenance, such as the data source, the history of changes or data versioning.
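The schema-on-read characteristic listed above can be illustrated with a minimal sketch. It is not taken from [FAN 15]: the record layout, the field names (`temp_c`, `temperature`) and the helper `read_temperatures` are assumptions made purely for illustration. Raw records are stored untouched, and a schema is imposed only when a given analysis reads them.

```python
import json

# Raw records are kept exactly as received: no cleaning, no common schema.
# (Hypothetical sample data, for illustration only.)
RAW_STORE = [
    '{"sensor": "t1", "temp_c": "21.5", "ts": "2020-04-09T10:00:00"}',
    '{"sensor": "t2", "temperature": 19, "ts": "2020-04-09T10:05:00"}',
    'not valid json',
]


def read_temperatures(raw_lines):
    """Apply a schema at read time: yield (sensor, temperature in Celsius)."""
    for line in raw_lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # unreadable records stay in the store but are skipped here
        if not isinstance(record, dict) or "sensor" not in record:
            continue
        # Two source systems name the same measurement differently.
        value = record.get("temp_c", record.get("temperature"))
        if value is None:
            continue
        yield record["sensor"], float(value)


result = list(read_temperatures(RAW_STORE))
print(result)
```

The store itself is never modified: another analysis could read the same raw lines with a different schema, which is the cost saving the characteristic refers to.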

According to Fang [FAN 15], no particular architecture characterizes data lakes, and creating a data lake is closely related to setting up an Apache Hadoop environment. Moreover, in this same work, the author anticipates the decline of decision-making systems, in favor of data lakes stored in a cloud environment.

As emphasized in [MAD 17], considering data lakes as outlined in [FAN 15] leads to the following four limitations:

1) only Apache Hadoop technology is considered;
2) criteria for preventing the movement of the data are not taken into account;
3) data governance is decoupled from data lakes;
4) data lakes are seen as data warehouse "killers".

In 2016, Bill Inmon published a book on data lake architecture [INM 16] that addresses the issue of storing useless or unusable data. More precisely, Inmon advocates that the data lake architecture should evolve towards information systems, so that it stores not only raw data but also "prepared" data, obtained through a process such as ETL (Extract-Transform-Load), which is widely used in data warehouses. We also stress that this book emphasizes the use of metadata and the specific profile of data lake users (namely, data scientists). It proposes organizing the data into three types: analog data, application data and textual data. However, the issue of how to store the data is not addressed.

In [RUS 17], Russom first mentioned the limitations of Apache Hadoop technology as being the only possible environment of data lakes, which explains why Russom's proposal is based on a hybrid technology, i.e. not only on Apache Hadoop technology but also on relational database technology. Therefore, similar to data warehouses, a few years after Fang's proposal [FAN 15], data lakes are now becoming multi-platform and hybrid software components.

The work in [SUR 16] considers the problems of data lineage and traceability before the data are transformed in the data lake. The authors propose a baseline architecture that can take these features into account in the context of huge volumes of data, and they assess their proposal through a prototype based on Apache Hadoop tools, such as Hadoop HDFS, Spark and Storm. This architecture is shown in Figure 1.3, from which it can be seen that elements of the IBM architecture (as introduced in [IBM 14] and shown in Figure 1.2) are present.

In [ALR 15], the authors introduced what they call a personal data lake, as a means to query and analyze personal data. To this end, the considered option is to store the data in a single place so as to optimize data management and security. This work thus addresses the problem of data confidentiality, a crucial issue with regard to the General Data Protection Regulation.

Figure 1.2. Baseline architecture of a data lake as proposed by IBM [IBM 14]. For a color version of this figure, see www.iste.co.uk/laurent/data.zip

In [MIL 16], the authors referred to the three Vs cited by Gartner [GAR 11] (Volume, Variety, Velocity), considered the additional V (Veracity) introduced by IBM and proposed three more Vs, namely Variability, Value and Visibility. In this context, the authors of [MIL 16] stated that the data lake should be part of IT systems, and then studied the three standard modes for data acquisition, namely batch pseudo real time, real time (or streaming) and hybrid. However, the same authors did not study the impact of these different modes on the data lake architecture. In this work, a data lake is seen as a data pool, gathering historical data along with new data produced by some pseudo real-time processes, in a single place and without specific schema, as long as data is not queried. A catalog containing data lineage is thus necessary in this context.

The most successful work about data lake architecture, components and positioning is presented in [IBM 14], because the emphasis is on data governance and more specifically on the metadata catalog. In [IBM 14], the authors highlighted, in this respect, that the metadata catalog is a major component of data lakes that prevents them from being transformed into data "swamps". This explains why metadata and their catalog currently motivate important research efforts, some of which are mentioned as follows:

- in [NOG 18a], the authors proposed using a data vault (an approach to data modeling for storing historical data coming from different sources) to store data lake metadata;
- in [TER 15], the importance of metadata as a key challenge is emphasized. It is then proposed that semantic information obtained from domain ontologies and vocabularies be part of metadata, in addition to traditional data structure descriptions;
- in [HAI 16], the authors proposed an approach for handling metadata called Constance. This approach focuses on discovering and summarizing structural metadata and their annotation using semantic information;
- in [ANS 18], the author introduced a semantic profiling approach to data lakes to prevent them from being transformed into "data swamps". To this end, it is shown that the semantic web provides improvements to data usability and the detection of integrated data in a data lake.
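To make the role of a metadata catalog more concrete, the following minimal sketch shows a catalog that records, for each dataset, its source, format, descriptive tags and lineage. It does not reproduce the IBM catalog or any of the approaches cited above: the class names `CatalogEntry` and `MetadataCatalog`, their fields and the sample datasets are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """Metadata kept for one dataset (hypothetical, simplified model)."""
    dataset_id: str
    source: str
    data_format: str
    tags: set = field(default_factory=set)
    derived_from: list = field(default_factory=list)  # parent dataset ids


class MetadataCatalog:
    def __init__(self):
        self._entries = {}

    def register(self, entry):
        # Refuse datasets whose parents are unknown, so lineage stays traceable.
        for parent in entry.derived_from:
            if parent not in self._entries:
                raise KeyError(f"unknown parent dataset: {parent}")
        self._entries[entry.dataset_id] = entry

    def find_by_tag(self, tag):
        return sorted(e.dataset_id for e in self._entries.values() if tag in e.tags)

    def lineage(self, dataset_id):
        """Return all ancestors of a dataset, nearest first."""
        seen = []
        queue = list(self._entries[dataset_id].derived_from)
        while queue:
            current = queue.pop(0)
            if current not in seen:
                seen.append(current)
                queue.extend(self._entries[current].derived_from)
        return seen


catalog = MetadataCatalog()
catalog.register(CatalogEntry("raw_sensors", "iot-gateway", "json", {"temperature"}))
catalog.register(CatalogEntry("clean_sensors", "cleaning-job", "parquet",
                              {"temperature"}, ["raw_sensors"]))
catalog.register(CatalogEntry("daily_avg", "aggregation-job", "csv",
                              {"temperature", "report"}, ["clean_sensors"]))

print(catalog.lineage("daily_avg"))
print(catalog.find_by_tag("temperature"))
```

Without such entries, a user has no way of knowing what a stored file contains or where it came from, which is the "swamp" situation described above.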

Regarding data storage, in [IBM 14], it is argued that the exclusive use of Apache Hadoop is now migrating to hybrid approaches for data storage (in particular using relational or NoSQL techniques, in addition to Apache Hadoop), and also for platforms (considering different servers either locally present or in the cloud). As mentioned earlier, these changes were first noted in [RUS 17].

An attempt to unify these different approaches to data lakes can be found in [MAD 17] as the following definition:

A data lake is a collection of data such that:

- the data have no fixed schema;
- all data formats should be possible;
- the data have not been transformed;
- the data are conceptually present in one single place, but...

Schweitzer Fachinformationen