Practitioner's Guide to Graph Data

Applying Graph Thinking and Graph Technologies to Solve Complex Problems
 
 
O'Reilly (publisher)
  • published March 20, 2020
  • 420 pages
 
E-book | ePUB with Adobe DRM | System requirements
978-1-4920-4402-4 (ISBN)
 
Graph data closes the gap between the way humans and computers view the world. While computers rely on static rows and columns of data, people navigate and reason about life through relationships. This practical guide demonstrates how graph data brings these two approaches together. By working with concepts from graph theory, database schema, distributed systems, and data analysis, you'll arrive at a unique intersection known as graph thinking.

Authors Denise Koessler Gosnell and Matthias Broecheler show data engineers, data scientists, and data analysts how to solve complex problems with graph databases. You'll explore templates for building with graph technology, along with examples that demonstrate how teams think about graph data within an application.

  • Build an example application architecture with relational and graph technologies
  • Use graph technology to build a Customer 360 application, the most popular graph data pattern today
  • Dive into hierarchical data and troubleshoot a new paradigm that comes from working with graph data
  • Find paths in graph data and learn why your trust in different paths motivates and informs your preferences
  • Use collaborative filtering to design a Netflix-inspired recommendation system

  • English
  • Sebastopol, USA
  • 27.36 MB
  • Intro
  • Preface
  • Who Should Read This Book
  • Goals of This Book
  • Navigating This Book
  • Conventions Used in This Book
  • Using Code Examples
  • O'Reilly Online Learning
  • How to Contact Us
  • Acknowledgments
  • 1. Graph Thinking
  • Why Now? Putting Database Technologies in Context
  • 1960s-1980s: Hierarchical Data
  • 1980s-2000s: Entity-Relationship
  • 2000s-2020s: NoSQL
  • 2020s-?: Graph
  • Why the 2020s?
  • Connecting the dots
  • What Is Graph Thinking?
  • Complex Problems and Complex Systems
  • Complex Problems in Business
  • Making Technology Decisions to Solve Complex Problems
  • Question 1: Does your problem need graph data?
  • Question 2: Do relationships within your data help you understand your problem?
  • Common missteps in understanding your data
  • So You Have Graph Data. What's Next?
  • Question 3: What are you going to do with the relationships in your data?
  • Question 4: What do you need the results for?
  • Break it down and try again
  • Seeing the Bigger Picture
  • Getting Started on Your Journey with Graph Thinking
  • 2. Evolving from Relational to Graph Thinking
  • Chapter Preview: Translating Relational Concepts to Graph Terminology
  • Relational Versus Graph: What's the Difference?
  • Data for Our Running Example
  • Relational Data Modeling
  • Entities and Attributes
  • Building Up to an ERD
  • Concepts in Graph Data
  • Fundamental Elements of a Graph
  • Adjacency
  • Neighborhoods
  • Distance
  • Degree
  • Implications of vertex degree
  • The Graph Schema Language
  • Vertex Labels and Edge Labels
  • Properties
  • Edge Direction
  • Self-Referencing Edge Labels
  • Multiplicity of Your Graph
  • Modeling multiplicity in the GSL
  • Full Example Graph Model
  • Relational Versus Graph: Decisions to Consider
  • Data Modeling
  • Understanding Graph Data
  • Mixing Database Design with Application Purpose
  • Summary
  • 3. Getting Started: A Simple Customer 360
  • Chapter Preview: Relational Versus Graph
  • The Foundational Use Case for Graph Data: C360
  • Why Do Businesses Care About C360?
  • Implementing a C360 Application in a Relational System
  • Data Models
  • Relational Implementation
  • Example C360 Queries
  • Query: Which credit cards does this customer use?
  • Query: Which accounts does this customer own?
  • Query: Which loans does this customer owe?
  • Query: What do we know about this customer?
  • Implementing a C360 Application in a Graph System
  • Data Models
  • Graph Implementation
  • Creating your graph's schema
  • Inserting your graph data
  • Graph traversals
  • Example C360 Queries
  • Query: Which credit cards does this customer use?
  • Query: Which accounts does this customer own?
  • Query: Which loans does this customer owe?
  • Query: What do we know about this customer?
  • Relational Versus Graph: How to Choose?
  • Relational Versus Graph: Data Modeling
  • Relational Versus Graph: Representing Relationships
  • Relational Versus Graph: Query Languages
  • Relational Versus Graph: Main Points
  • Summary
  • Why Not Relational?
  • Making a Technology Choice for Your C360 Application
  • 4. Exploring Neighborhoods in Development
  • Chapter Preview: Building a More Realistic Customer 360
  • Graph Data Modeling 101
  • Should This Be a Vertex or an Edge?
  • Lost Yet? Let Us Walk You Through Direction
  • An evolution of modeling transactions in a graph
  • When do we use properties?
  • A Graph Has No Name: Common Mistakes in Naming
  • Our Full Development Graph Model
  • Before We Start Building
  • Our Thoughts on the Importance of Data, Queries, and the End User
  • Implementation Details for Exploring Neighborhoods in Development
  • Generating More Data for Our Expanded Example
  • Basic Gremlin Navigation
  • Query 1: What are the most recent 20 transactions involving Michael's account?
  • Query 2: In December 2020, at which vendors did Michael shop, and with what frequency?
  • Query 3: Find and update the transactions that Jamie and Aaliyah most value: their payments from their account to their mortgage loan.
  • Query 3a: Find Aaliyah's transactions that are loan payments
  • Query 3b: Find and update the transactions that Jamie and Aaliyah most value: their payments from their checking account to their mortgage, loan_18
  • Query 3c: Verify that we didn't update every transaction
  • Advanced Gremlin: Shaping Your Query Results
  • Shaping Query Results with the project(), fold(), and unfold() Steps
  • Removing Data from the Results with the where(neq()) Pattern
  • Planning for Robust Result Payloads with the coalesce() Step
  • Moving from Development into Production
  • 5. Exploring Neighborhoods in Production
  • Chapter Preview: Understanding Distributed Graph Data in Apache Cassandra
  • Working with Graph Data in Apache Cassandra
  • The Most Important Topic to Understand About Data Modeling: Primary Keys
  • Partition Keys and Data Locality in a Distributed Environment
  • Partitioning graph data according to access pattern
  • Partitioning according to unique key
  • Final thoughts on partitioning strategies
  • Understanding Edges, Part 1: Edges in Adjacency Lists
  • Understanding Edges, Part 2: Clustering Columns
  • Synthesizing concepts: Edge location in a distributed cluster
  • Understanding Edges, Part 3: Materialized Views for Traversals
  • Materialized views for bidirectional edges
  • How far down do you want to go?
  • Where are we going from here?
  • Graph Data Modeling 201
  • Finding Indexes with an Intelligent Index Recommendation System
  • Production Implementation Details
  • Materialized Views and Adding Time onto Edges
  • Our Final C360 Production Schema
  • Bulk Loading Graph Data
  • Loading vertex data with DataStax Bulk Loader
  • Loading edge data with DataStax Bulk Loader
  • Updating Our Gremlin Queries to Use Time on Edges
  • Query 1: What are the most recent 20 transactions involving Michael's account?
  • Query 2: In December, at which vendors did Michael shop, and with what frequency?
  • Query 3: Find and update the transactions that Jamie and Aaliyah most value: their payments from their account to their mortgage loan.
  • Moving On to More Complex, Distributed Graph Problems
  • Our First 10 Tips to Get from Development to Production
  • 6. Using Trees in Development
  • Chapter Preview: Navigating Trees, Hierarchical Data, and Cycles
  • Seeing Hierarchies and Nested Data: Three Examples
  • Hierarchical Data in a Bill of Materials
  • Hierarchical Data in Version Control Systems
  • Hierarchical Data in Self-Organizing Networks
  • Why Graph Technology for Hierarchical Data?
  • Finding Your Way Through a Forest of Terminology
  • Trees, Roots, and Leaves
  • Depth in Walks, Paths, and Cycles
  • Understanding Hierarchies with Our Sensor Data
  • Understand the Data
  • Seeing hierarchies in data: From the bottom up
  • Seeing hierarchies in data: From the top down
  • Understanding edges in the sensor hierarchies
  • Conceptual Model Using the GSL Notation
  • Implement Schema
  • Loading vertex data with DataStax Bulk Loader
  • Loading edge data with DataStax Bulk Loader
  • Before We Build Our Queries
  • Querying from Leaves to Roots in Development
  • Where Has This Sensor Sent Information To?
  • From This Sensor, What Was Its Path to Any Tower?
  • Using the path() step and manipulating its data structure
  • How to assign labels with as()
  • How to shape path() results with by()
  • From Bottom Up to Top Down
  • Querying from Roots to Leaves in Development
  • Setup Query: Which Tower Has the Most Sensor Connections So That We Could Explore It for Our Example?
  • Which Sensors Have Connected Directly to Georgetown?
  • Find All Sensors That Connected to Georgetown
  • Depth Limiting in Recursion
  • Going Back in Time
  • 7. Using Trees in Production
  • Chapter Preview: Understanding Branching Factor, Depth, and Time on Edges
  • Understanding Time in the Sensor Data
  • Understanding time in hierarchies of data: From the bottom up
  • Valid and invalid paths from the bottom up
  • Understanding time in hierarchies of data: From the top down
  • Valid and invalid paths from the top down
  • Final Thoughts on Time Series Data in Graphs
  • Understanding Branching Factor in Our Example
  • What Is Branching Factor?
  • How Do We Get Around Branching Factor?
  • Production Schema for Our Sensor Data
  • Loading data with DataStax Bulk Loader
  • Querying from Leaves to Roots in Production
  • Where Has This Sensor Sent Information to, and at What Time?
  • From This Sensor, Find All Trees up to a Tower by Time
  • From This Sensor, Find a Valid Tree
  • Advanced Gremlin: Understanding the where().by() Pattern
  • Understanding a common Gremlin mistake: Overloading has()
  • Resolution: The where().by() pattern
  • Querying from Roots to Leaves in Production
  • Which Sensors Have Connected to Georgetown Directly, by Time?
  • What Valid Paths Can We Find from Georgetown Down to All Sensors?
  • Applying Your Queries to Tower Failure Scenarios
  • Get a list of sensors that connected with Georgetown in any time window
  • For each at-risk sensor, find all towers it communicated with
  • Applying the Final Results of Our Complex Problem
  • Seeing the Forest for the Trees
  • 8. Finding Paths in Development
  • Chapter Preview: Quantifying Trust in Networks
  • Thinking About Trust: Three Examples
  • How Much Do You Trust That Open Invitation?
  • How Defensible Is an Investigator's Story?
  • How Do Companies Model Package Delivery?
  • Fundamental Concepts About Paths
  • Shortest Paths
  • Depth-First Search and Breadth-First Search
  • Learning to See Application Features as Different Path Problems
  • Finding Paths in a Trust Network
  • Source Data
  • A Brief Primer on Bitcoin Terminology
  • Creating Our Development Schema
  • Loading Data
  • Exploring Communities of Trust
  • Understanding Traversals with Our Bitcoin Trust Network
  • Which Addresses Are in the First Neighborhood?
  • Which Addresses Are in the Second Neighborhood?
  • Which Addresses Are in the Second Neighborhood, but Not the First?
  • Evaluation Strategies with the Gremlin Query Language
  • Barrier steps in Gremlin
  • Pick a Random Address to Use for Our Example
  • Shortest Path Queries
  • Finding Paths of a Fixed Length
  • Finding Paths of Any Length
  • Connecting concepts: BFS and traversal strategies
  • Augmenting Our Paths with the Trust Scores
  • Using sack() to aggregate trust ratings
  • Do You Trust This Person?
  • 9. Finding Paths in Production
  • Chapter Preview: Understanding Weights, Distance, and Pruning
  • Weighted Paths and Search Algorithms
  • Shortest Weighted Path Problem Definition
  • Shortest Weighted Path Search Optimizations
  • Supernodes in graphs
  • Theoretical limits of supernodes
  • Pseudocode for the search algorithm we will implement
  • Normalization of Edge Weights for Shortest Path Problems
  • Normalizing the Edge Weights
  • Step 1: Shift the scale to the interval [0,1]
  • Step 2: Frame the new scale as a shortest path problem
  • Step 3: Decide how to handle modeling infinity
  • Updating Our Graph
  • Exploring the Normalized Edge Weights
  • Find all paths of length 2, sorted by total trust
  • Find the 15 shortest paths by path length, sorted by total trust
  • Interpreting path distance to total trust with the normalized edge weights
  • Some Thoughts Before Moving On to Shortest Weighted Path Queries
  • Shortest Weighted Path Queries
  • Building a Shortest Weighted Path Query for Production
  • 1) Swap two steps and change our limit
  • 2) Add an object to track the shortest weighted path to a visited vertex
  • 3) Remove a traverser if its path is longer than one already discovered to that vertex
  • 4) Remove traversers for custom reasons, such as to avoid supernodes
  • The and() step in Gremlin
  • sideEffect() in Gremlin
  • Interpreting the results of our shortest weighted path
  • Weighted Paths and Trust in Production
  • 10. Recommendations in Development
  • Chapter Preview: Collaborative Filtering for Movie Recommendations
  • Recommendation System Examples
  • How We Give Recommendations in Healthcare
  • How We Experience Recommendations in Social Media
  • How We Use Deeply Connected Data for Recommendations in Ecommerce
  • An Introduction to Collaborative Filtering
  • Understanding the Problem and Domain
  • Collaborative Filtering with Graph Data
  • Recommendations via Item-Based Collaborative Filtering with Graph Data
  • Three Different Models for Ranking Recommendations
  • Path counting
  • Net Promoter Score-inspired metric
  • Normalized Net Promoter Scores
  • Movie Data: Schema, Loading, and Query Review
  • Data Model for Movie Recommendations
  • Schema Code for Movie Recommendations
  • Loading the Movie Data
  • Loading the vertices
  • Loading the edges
  • Neighborhood Queries in the Movie Data
  • Grouping a user's movie ratings by liked, disliked, or neutral
  • Tree Queries in the Movie Data
  • Path Queries in the Movie Data
  • Item-Based Collaborative Filtering in Gremlin
  • Model 1: Counting Paths in the Recommendation Set
  • Model 2: NPS-Inspired
  • Model 3: Normalized NPS
  • Choosing Your Own Adventure: Movies and Graph Problems Edition
  • 11. Simple Entity Resolution in Graphs
  • Chapter Preview: Merging Multiple Datasets into One Graph
  • Defining a Different Complex Problem: Entity Resolution
  • Seeing the Complex Problem
  • Analyzing the Two Movie Datasets
  • MovieLens Dataset
  • 1) Links
  • 2) Movies
  • 3) Ratings
  • 4) Tags
  • 5) Tag genome
  • Kaggle Dataset
  • Movie details
  • Actors and casting details
  • Development Schema
  • Matching and Merging the Movie Data
  • Our Matching Process
  • Resolving False Positives
  • False Positives Found in the MovieLens Dataset
  • Additional Errors Discovered in the Entity Resolution Process
  • Final Analysis of the Merging Process
  • The Role of Graph Structure in Merging Movie Data
  • 12. Recommendations in Production
  • Chapter Preview: Understanding Shortcut Edges, Precomputation, and Advanced Pruning Techniques
  • Shortcut Edges for Recommendations in Real Time
  • Where Our Development Process Doesn't Scale
  • How We Fix Scaling Issues: Shortcut Edges
  • Seeing What We Designed to Deliver in Production
  • Pruning: Different Ways to Precompute Shortcut Edges
  • Pruning by score thresholds
  • Pruning with hard limits on total recommendations
  • Pruning by applying domain knowledge filters
  • Considerations for Updating Your Recommendations
  • Calculating Shortcut Edges for Our Movie Data
  • Breaking Down the Complex Problem of Precalculating Shortcut Edges
  • Schema required for calculating shortcut edges on our movie data
  • Collaborative-filtering query to calculate shortcut edges
  • Using simple parallelism to divide up the work
  • Addressing the Elephant in the Room: Batch Computation
  • Examples of when batch computation may be better for your environment
  • Examples of when transactional queries may be better for your environment
  • Production Schema and Data Loading for Movie Recommendations
  • Production Schema for Movie Recommendations
  • Production Data Loading for Movie Recommendations
  • Recommendation Queries with Shortcut Edges
  • Confirming Our Edges Loaded Correctly
  • Production Recommendations for Our User
  • Query 1: The top three recommendations for the most recent rating by our user
  • Query 2: The top recommendation for the three most recent ratings by our user
  • Query 3: The top three recommendations for each of the three most recent ratings by our user
  • What are we ignoring that you need to consider?
  • Understanding Response Time in Production by Counting Edge Partitions
  • Partitions traversed for Query 1: The top three recommendations for the most recent rating by a user
  • Partitions traversed for Query 2: The top recommendation for the three most recent ratings by a user
  • Partitions traversed for Query 3: The top three recommendations for each of the three most recent ratings by our user
  • Final Thoughts on Reasoning About Distributed Graph Query Performance
  • 13. Epilogue
  • Where to Go from Here?
  • Graph Algorithms
  • Distributed Graphs
  • Graph Theory
  • Network Theory
  • Stay in Touch
  • Index

File format: ePUB
Copy protection: Adobe DRM (Digital Rights Management)

System requirements:

Computer (Windows; MacOS X; Linux): Install the free Adobe Digital Editions software before downloading (see E-Book Help).

Tablet/smartphone (Android; iOS): Install the free Adobe Digital Editions app before downloading (see E-Book Help).

E-book readers: Bookeen, Kobo, Pocketbook, Sony, Tolino, and many more (not Kindle)

The ePUB file format is well suited to novels and nonfiction, that is, to "flowing" text without complex layout. On e-readers and smartphones, line and page breaks adapt automatically to the small display. Adobe DRM is a "hard" form of copy protection. If the necessary requirements are not met, you will unfortunately be unable to open the e-book, so you must prepare your reading hardware before downloading.

When using the Adobe Digital Editions reading software, we strongly recommend authorizing it with your personal Adobe ID after installation.

Further information is available in our E-Book Help.


Download (available immediately)

€58.49
incl. 5% VAT
Download / single license
ePUB with Adobe DRM