NoSQL Data Models

Name: NoSQL Data Models | Trends and Challenges
Brand: Wiley-IEEE Press
Price: 139.99 EUR
Availability: OnlineOnly

Trends and Challenges

Olivier Pivert (Editor)

Wiley-IEEE Press

1st Edition

Published on 30 July 2018

278 pages

E-Book

ePUB with Adobe-DRM

978-1-119-54414-2 (ISBN)

€139.99 incl. 7% VAT

E-Book Single Licence

Available for download

Content

Foreword xi
Anne LAURENT and Dominique LAURENT

Preface xiii
Olivier PIVERT

Chapter 1. NoSQL Languages and Systems 1
Kim NGUYỄN

1.1. Introduction 1

1.1.1. The rise of NoSQL systems and languages 1

1.1.2. Overview of NoSQL concepts 4

1.1.3. Current trends of French research in NoSQL languages 6

1.2. Join implementations on top of MapReduce 7

1.3. Models for NoSQL languages and systems 12

1.4. New challenges for database research 16

1.5. Bibliography 18

Chapter 2. Distributed SPARQL Query Processing: A Case Study with Apache Spark 21
Bernd AMANN, Olivier CURÉ and Hubert NAACKE

2.1. Introduction 21

2.2. RDF and SPARQL 22

2.2.1. RDF framework and data model 22

2.2.2. SPARQL query language 25

2.3. SPARQL query processing 29

2.3.1. SPARQL with and without RDF/S entailment 29

2.3.2. Query optimization 30

2.3.3. Triple store systems 33

2.4. SPARQL and MapReduce 34

2.4.1. MapReduce-based SPARQL processing 35

2.4.2. Related work 39

2.5. SPARQL on Apache Spark 41

2.5.1. Apache Spark 41

2.5.2. SPARQL on Spark 42

2.5.3. Experimental evaluation 48

2.6. Bibliography 53

Chapter 3. Doing Web Data: from Dataset Recommendation to Data Linking 57
Manel ACHICHI, Mohamed BEN ELLEFI, Zohra BELLAHSENE and Konstantin TODOROV

3.1. Introduction 57

3.1.1. The Semantic Web vision 57

3.1.2. Linked data life cycles 58

3.1.3. Chapter overview 61

3.2. Datasets recommendation for data linking 62

3.2.1. Process definition 63

3.2.2. Dataset recommendation for data linking based on a Semantic Web index 64

3.2.3. Dataset recommendation for data linking based on social networks 64

3.2.4. Dataset recommendation for data linking based on domain-specific keywords 65

3.2.5. Dataset recommendation for data linking based on topic modeling 65

3.2.6. Dataset recommendation for data linking based on topic profiles 66

3.2.7. Dataset recommendation for data linking based on intensional profiling 67

3.2.8. Discussion on dataset recommendation approaches 68

3.3. Challenges of linking data 69

3.3.1. Value dimension 70

3.3.2. Ontological dimension 74

3.3.3. Logical dimension 77

3.4. Techniques applied to the data linking process 78

3.4.1. Data linking techniques 79

3.4.2. Discussion 83

3.5. Conclusion 86

3.6. Bibliography 87

Chapter 4. Big Data Integration in Cloud Environments: Requirements, Solutions and Challenges 93
Rami SELLAMI and Bruno DEFUDE

4.1. Introduction 93

4.2. Big Data integration requirements in Cloud environments 96

4.3. Automatic data store selection and discovery 99

4.3.1. Introduction 99

4.3.2. Model-based approaches 99

4.3.3. Matching-oriented approaches 100

4.3.4. Comparison 102

4.4. Unique access for all data stores 103

4.4.1. Introduction 103

4.4.2. ODBAPI: A unified REST API for relational and NoSQL data stores 104

4.4.3. Other works 105

4.4.4. Comparison 107

4.5. Unified data model and query languages 108

4.5.1. Introduction 108

4.5.2. Data models of classical data integration approaches 109

4.5.3. A global schema to unify the view over relational and NoSQL data stores 110

4.5.4. Other works 113

4.5.5. Comparison 117

4.6. Query processing and optimization 118

4.6.1. Introduction 118

4.6.2. Federated query language approaches 118

4.6.3. Integrated query language approaches 121

4.6.4. Comparison 124

4.7. Summary and open issues 125

4.7.1. Summary 125

4.7.2. Open issues 127

4.8. Conclusion 129

4.9. Bibliography 129

Chapter 5. Querying RDF Data: A Multigraph-based Approach 135
Vijay INGALALLI, Dino IENCO and Pascal PONCELET

5.1. Introduction 135

5.2. Related work 137

5.3. Background and preliminaries 137

5.3.1. RDF data 138

5.3.2. SPARQL query 140

5.3.3. SPARQL querying by adopting multigraph homomorphism 142

5.4. AMBER: A SPARQL querying engine 143

5.5. Index construction 144

5.5.1. Attribute index 144

5.5.2. Vertex signature index 145

5.5.3. Vertex neighborhood index 148

5.6. Query matching procedure 149

5.6.1. Vertex-level processing 151

5.6.2. Processing satellite vertices 152

5.6.3. Arbitrary query processing 154

5.7. Experimental analysis 159

5.7.1. Experimental setup 159

5.7.2. Workload generation 160

5.7.3. Comparison with RDF engines 161

5.8. Conclusion 164

5.9. Acknowledgment 164

5.10. Bibliography 164

Chapter 6. Fuzzy Preference Queries to NoSQL Graph Databases 167
Arnaud CASTELLTORT, Anne LAURENT, Olivier PIVERT, Olfa SLAMA and Virginie THION

6.1. Introduction 167

6.2. Preliminary statements 168

6.2.1. Graph databases 168

6.2.2. Fuzzy set theory 174

6.3. Fuzzy preference queries over graph databases 176

6.3.1. Fuzzy preference queries over crisp graph databases 176

6.3.2. Fuzzy preference queries over fuzzy graph databases 182

6.4. Implementation challenges 193

6.4.1. Modeling fuzzy databases 193

6.4.2. Evaluation of queries with fuzzy preferences 193

6.4.3. Scalability 195

6.5. Related work 197

6.6. Conclusion and perspectives 198

6.7. Acknowledgment 199

6.8. Bibliography 199

Chapter 7. Relevant Filtering in a Distributed Content-based Publish/Subscribe System 203
Cédric DU MOUZA and Nicolas TRAVERS

7.1. Introduction 203

7.2. Related work: novelty and diversity filtering 205

7.3. A Publish/Subscribe data model 206

7.3.1. Data model 206

7.3.2. Weighting terms in textual data flows 207

7.4. Publish/Subscribe relevance 208

7.4.1. Items and histories 208

7.4.2. Novelty 209

7.4.3. Diversity 209

7.4.4. An overview of the filtering process 210

7.4.5. Choices of relevance 210

7.5. Real-time integration of novelty and diversity 212

7.5.1. Centralized implementation 212

7.5.2. Distributed filtering 216

7.6. TDV updates 221

7.6.1. TDV computation techniques 221

7.6.2. Incremental approach 223

7.6.3. TDV in a distributed environment 225

7.7. Experiments 228

7.7.1. Implementation and description of datasets 229

7.7.2. TDV updates 229

7.7.3. Filtering rate 230

7.7.4. Performance evaluation in the centralized environment 234

7.7.5. Performance evaluation in a distributed environment 238

7.7.6. Quality of filtering 240

7.8. Conclusion 241

7.9. Bibliography 242

List of Authors 245

Index 247

Preface

As is well known, a major event in the field of data management was the introduction of the relational model by Codd in the early 1970s, which laid the foundations for a genuine theory of databases. After a somewhat slow start, due to the important Research and Development effort necessary to define efficient systems, relational database management systems reigned supreme for several decades.

However, around the end of the 20th Century, several phenomena modified the data management landscape. First, new types of applications in several domains were introduced to handle data for which the relational model appeared inadequate or inefficient. Typical examples are semi-structured data on the one hand, and graphs on the other (social networks, bibliographic databases, cartographic databases, genomic data, etc.) for which specific models and systems had to be designed. Second, a major event was the rise of the Semantic Web whose aim is, according to the W3C, to "provide a common framework that allows data to be shared and reused across application, enterprise and community boundaries". The Semantic Web uses models and languages specifically designed for linked data, which facilitate automated reasoning on such data. Besides, the amount of useful data in some application domains has become so huge that it cannot be stored or processed by traditional database solutions. This latter phenomenon is commonly referred to as Big Data. In terms of database technology, as a response to these new needs, we have seen the appearance of what have come to be called NoSQL databases.

The term NoSQL was coined by Carlo Strozzi in 1998, who designed a relational database system without an SQL implementation and named it Strozzi NoSQL. However, this system is distinct from the circa-2009 general concept of NoSQL databases, which are typically non-relational. Many data models have been proposed: key-value stores, document stores (key-value stores that restrict values to semi-structured formats such as JSON), wide column stores, RDF, graph databases, XML, etc.

While the management of large volumes of data has always been the subject of many research efforts, recent results in both the distributed systems and database communities have led to an important renewal of interest in this topic. Large-scale distributed file systems such as Google File System and parallel processing paradigms/environments such as MapReduce have been the foundation of a new ecosystem with data management contributions in major database conferences and journals. Different (often open-source) systems have been released, such as Pig, Hive or, more recently, Spark and Flink, making it easier to use data center resources to manage Big Data. However, many research challenges remain, related, for instance, to system efficiency, and query language expressiveness and flexibility.

This book presents a sample of recent works by French research teams active in this domain. As the reader will see, it covers various aspects of NoSQL research, from semantic data management to graph databases, as well as Big Data management in cloud environments, dealing with data models, query languages and implementation issues. The book is organized as follows:

Chapter 1, by Kim Nguyễn, from LRI and the University of Paris-Sud, presents an overview of NoSQL languages and systems. The author highlights some of the technical aspects of NoSQL systems (in particular, distributed computation with MapReduce) before discussing current research trends: join implementations on top of MapReduce, models for NoSQL languages and systems, and the perspective that consists of defining a formal model of NoSQL databases and queries.

Chapter 2, entitled "Distributed SPARQL Query Processing: A Case Study with Apache Spark", by Bernd Amann, Olivier Curé and Hubert Naacke, from the LIP6 laboratory in Paris, is devoted to the issue of evaluating SPARQL queries over large RDF datasets. The authors present a solution that consists of using the MapReduce framework to process SPARQL graph patterns and show how the general-purpose cluster computing platform Apache Spark can be used to this end. They emphasize the importance of the physical data layer for query evaluation efficiency and show that hybrid query plans combining partitioned and broadcast joins improve query performance in almost all cases.

Chapter 3, authored by Manel Achichi, Mohamed Ben Ellefi, Zohra Bellahsene and Konstantin Todorov, from the LIRMM laboratory in Montpellier, is entitled "Doing Web Data: From Dataset Recommendation to Data Linking". It deals with the production of web data and focuses on the data linking stage, seen as an operation which generates a set of links between two different datasets. The authors first study the prior task which consists of discovering relevant datasets leading to the identification of similar resources to support the data linking issue. They provide an overview of recommendation approaches for candidate datasets, then present and classify the different techniques that are applied by the currently available data linking tools. The main challenge faced by all of these techniques is to overcome different heterogeneity problems that may occur between the considered datasets, such as differences in descriptions at different levels (value, ontological or logical) in order to compare the resources efficiently, and the authors show that further research efforts are still needed to better cope with these heterogeneity issues.

Chapter 4, entitled "Big Data Integration in Cloud Environments: Requirements, Solutions and Challenges", by Rami Sellami and Bruno Defude, from CETIC Charleroi and Telecom SudParis respectively, presents and discusses the requirements of Big Data integration in cloud environments. In such a context, applications may need to interact with several heterogeneous data stores, depending on the types of data they have to manage (traditional data, documents, graph data from social networks, simple key-value data, etc.). A first constraint is that, to make these interactions possible, programmers have to be familiar with different APIs. A second difficulty is that the execution of complex queries over heterogeneous data models cannot currently be achieved in a declarative way and therefore requires extra implementation efforts. Moreover, cloud discovery as well as application deployment and execution are generally performed manually by programmers. The authors analyze and discuss the current state of the art regarding four requirements (automatic data store selection and discovery, unique access for all data stores, transparent access for all data stores, global query processing and optimization), provide a global synthesis according to three groups of criteria, and highlight important challenges that remain to be tackled.

Chapter 5 is authored by Vijay Ingalalli, Dino Ienco and Pascal Poncelet, from the LIRMM laboratory in Montpellier, and is entitled "Querying RDF Data: A Multigraph-based Approach". In this chapter, the authors cope with two challenges faced by the RDF data management community: first, automatically generated queries cannot be bounded in their structural complexity and size; second, the queries generated by retrieval systems (or any other application) need to be efficiently answered in a reasonable amount of time. In order to address these challenges, the authors advocate an approach to RDF query processing that involves two steps: an offline step where the RDF database is transformed into a multigraph and indexed, and an online step where the SPARQL query is transformed into a multigraph too, which makes query processing boil down to a subgraph homomorphism problem. An RDF query engine based on this strategy is presented, named AMBER, which exploits structural properties of the multigraph query as well as the indices previously built on the multigraph structure.

Chapter 6 is entitled "Fuzzy Preference Queries to NoSQL Graph Databases" and is authored by Arnaud Castelltort, Anne Laurent, Olivier Pivert, Olfa Slama and Virginie Thion; the first two authors are affiliated with the LIRMM laboratory in Montpellier, and the last three with the IRISA laboratory in Lannion. This chapter deals with flexible querying of graph databases that may involve gradual relationships. The authors first introduce an extension of attributed graphs where edges may represent a fuzzy concept (such as friend in the case of a social network, or co-author in the case of a bibliographic database). Then, they describe an extension of the query language Cypher that makes it possible to express fuzzy requirements, both on attribute values and on structural aspects of the graph (such as the length or the strength of a path). Finally, they deal with implementation issues and outline a query processing strategy based on the derivation of a regular Cypher query from the fuzzy query to be evaluated, through an add-on built on top of a classical graph database management system.

Finally, Chapter 7, by Cédric du Mouza and Nicolas Travers, from CNAM Paris, is entitled "Relevant Filtering in a Distributed Content-based Publish/Subscribe System", and deals with textual data management. More precisely, it considers a crucial challenge faced by Publish/Subscribe systems, which is to efficiently filter feeds' information in real time. Publish/Subscribe systems make it possible to subscribe to flows of items coming from diverse sources and notify the users according to their interests, but the...
