Chapter 2
Memgraph Architecture Deep Dive
Beneath Memgraph's intuitive surface lies a cutting-edge, high-performance engine purpose-built for graph data. This chapter peels back the layers, examining the sophisticated internal mechanics that enable Memgraph to process connected data at unprecedented speed and scale. Discover how state-of-the-art design choices-spanning storage, concurrency, extensibility, and integration-empower Memgraph to tackle the most demanding graph workloads in modern data infrastructure.
2.1 Core System Components
Memgraph's architecture is fundamentally grounded on three primary entities: nodes, relationships, and properties. These form the backbone of its graph data model, and their internal design crucially influences the efficiency of graph storage, traversal, and query execution. A rigorous understanding of their internal representation, memory layout, and metadata handling elucidates how Memgraph achieves high-performance graph operations.
At the lowest level, nodes serve as the vertices of the graph, each representing an individual entity or concept. Internally, a node is identified by a unique 64-bit NodeID, which enables rapid indexing and retrieval. The memory layout encapsulates not only the identifier but also tightly packed metadata, including the node's label and a flexible array of attached properties. This label-to-node binding is managed through a dictionary structure that stores label identifiers as integers, reducing memory overhead and accelerating label-based access.
Relationships-or edges-connect nodes and symbolize associations or interactions between them. Each relationship is represented internally as a compact structure containing a 64-bit RelationshipID, references to the source and target nodes via their identifiers, and a relationship type encoded similarly to node labels. To optimize graph traversal, relationships are stored using adjacency lists indexed by node identifiers; Memgraph employs a dual adjacency representation to facilitate efficient traversals in both outgoing and incoming directions. This design enables constant-time access to adjacent edges, imperative for complex graph algorithms.
Both nodes and relationships support properties, which are key-value pairs that provide descriptive metadata. Properties are managed through a property table that leverages pointer-based indirection to accommodate varying property counts and types per entity without bloating fixed-size records. Each property key is referenced by an integer index, derived from a global property dictionary, enabling quick lookups and minimizing string storage redundancy. Property values are stored in a tagged union format, supporting multiple data types such as integers, floating-point numbers, strings, booleans, and temporal values, all encoded in a manner that maximizes alignment and access speed.
The memory layout for these components closely aligns with cache line boundaries and incorporates prefetching heuristics to expedite sequential scans. Nodes and relationships are allocated in contiguous memory blocks, segmented by labels and relationship types, respectively, which reduces pointer hopping during query execution. Metadata such as the count of relationships per node and property offsets are maintained in auxiliary headers adjacent to entity records. These headers facilitate fast enumeration of connected edges and property retrieval without scanning all stored data.
A critical optimization lies in Memgraph's use of metadata bitmaps. For example, node and relationship records include bitmaps indicating the presence or absence of specific properties. This compact representation allows skipping over non-existent properties during traversal and query filtering, significantly reducing unnecessary data accesses. Similarly, labels and relationship types are encoded with bitmaps supporting multi-label or multi-type assignments, enabling flexible schema representations and rapid type filtering.
To maintain storage efficacy, Memgraph employs a form of variable-length encoding for property values, especially strings and numeric types. Small integer values are stored directly within property slots, avoiding pointer dereferencing, while longer or variable-length data is stored externally in memory pools with fast indexing tables. This tiered approach balances quick access to common, compact data types with the flexibility to handle large and diverse datasets.
The adjacency lists utilize an innovative compressed sparse row (CSR)-like structure internally for storing relationships. This layout exploits locality by storing edges associated with each node consecutively in memory, indexed via offset tables. The CSR model is augmented with dynamic insertion buffers to reconcile the need for efficient writes and reads, enabling Memgraph to maintain responsiveness under mixed workloads without incurring costly reallocations or fragmentation.
Synchronization metadata such as version counters and mutex locks are embedded judiciously to ensure consistency during concurrent access while minimizing contention. Data structures that are heavily read-focused incorporate lock-free or fine-grained locking mechanisms, allowing multiple threads to traverse the graph concurrently with minimal overhead.
All these design choices collectively empower Memgraph's core components to jointly provide a robust foundation that balances storage space, retrieval speed, and dynamic update capabilities. The internal representations are crafted not only to occupy minimal memory but also to exploit modern CPU architectures' characteristics, including cache hierarchies and branch prediction. This holistic approach yields a system optimized for both workload heterogeneity and graph operation complexity.
Memgraph's core system components-nodes, relationships, and properties-are engineered with a deep understanding of graph data semantics and hardware realities. Their internal representation and memory layout, combined with metadata management strategies such as bitmaps, dictionaries, and compression, culminate in a system primed for efficient and scalable graph processing.
2.2 In-Memory Graph Storage Engine
Memgraph's in-memory graph storage engine is architected to address the stringent requirements of modern real-time graph analytics: high throughput, low latency, and scalable concurrency. The decision to utilize an in-memory approach stems from the need to eliminate disk I/O bottlenecks and leverage the speed advantages inherent in RAM, enabling complex graph traversals and pattern matching operations to execute within stringent time budgets. Central to this engine are tailored data indexing strategies, a custom memory allocation framework, and an efficient garbage collection mechanism, all optimized for high-performance graph workloads.
At the core of Memgraph's storage engine lies a hybrid indexing schema, designed to balance flexibility and access speed. Vertices and edges are stored with explicit indices on their unique identifiers and commonly queried attributes. The primary index is a high-performance hash table keyed by vertex or edge IDs, providing