Chapter 2
Data Models, Serialization, and Transformation
MarkLogic's distinctive strength lies in its ability to natively manage and transform diverse data models, which lets architects harmonize complex information landscapes within a single platform. In this chapter, we examine the mechanics of representing, storing, and evolving a spectrum of content types, from XML and JSON to semantic triples, binaries, and spatial data. We also cover the infrastructure and patterns that let enterprises shift, enrich, and interrelate data at scale without sacrificing performance or consistency.
2.1 XML, JSON, and RDF Representation
MarkLogic is architected as a multi-model database that simultaneously supports native and hybrid storage of XML and JSON documents, along with semantic RDF triples. This design enables sophisticated management of structured, semi-structured, and graph data within a unified platform. The following exposition unpacks the internal handling of XML and JSON, their parsing and indexing strategies, and extends into MarkLogic's RDF capability, which facilitates integrated querying over heterogeneous data.
Internally, both XML and JSON documents are ingested and stored in MarkLogic as hierarchical node trees optimized for efficient access and manipulation. XML documents are parsed into a canonical tree structure preserving all facets of XML Infoset, including elements, attributes, namespaces, and processing instructions. This native representation allows XQuery and XPath expressions to traverse and query documents at fine granularity. Similarly, JSON documents are parsed into a structurally analogous tree reflecting objects, arrays, and scalar values, maintaining type fidelity (e.g., numbers, booleans, strings). This internal JSON node model supports JavaScript and XQuery functions designed for JSON processing.
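To make the shared node-tree idea concrete, the following Python sketch parses one XML and one JSON document into a single simplified tree type. It is a toy model under stated assumptions, not MarkLogic's internal format or API: the Node class and the from_xml and from_json helpers are hypothetical names, and namespaces, processing instructions, and mixed content are ignored.

```python
import json
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from typing import Any, List, Optional


@dataclass
class Node:
    """Simplified tree node shared by both formats (hypothetical type)."""
    kind: str                      # "element", "attribute", "text", "object", ...
    name: Optional[str] = None
    value: Any = None
    children: List["Node"] = field(default_factory=list)


def from_xml(elem):
    """Convert an ElementTree element into a Node tree (tail text is ignored)."""
    node = Node("element", elem.tag)
    for key, val in elem.attrib.items():
        node.children.append(Node("attribute", key, val))
    if elem.text and elem.text.strip():
        node.children.append(Node("text", None, elem.text.strip()))
    for child in elem:
        node.children.append(from_xml(child))
    return node


def from_json(value, name=None):
    """Convert a parsed JSON value into a Node tree, keeping scalar types."""
    if isinstance(value, dict):
        return Node("object", name, None, [from_json(v, k) for k, v in value.items()])
    if isinstance(value, list):
        return Node("array", name, None, [from_json(v) for v in value])
    if isinstance(value, bool):    # test bool first: bool is a subclass of int
        return Node("boolean", name, value)
    if isinstance(value, (int, float)):
        return Node("number", name, value)
    if value is None:
        return Node("null", name)
    return Node("text", name, value)


xml_tree = from_xml(ET.fromstring('<order id="7"><total>19.5</total></order>'))
json_tree = from_json(json.loads('{"order": {"id": 7, "total": 19.5, "paid": true}}'))
```

Both trees can then be walked by the same traversal code, which is the property the shared representation is meant to provide. Note that the JSON side keeps 19.5 as a number, whereas the XML side holds it as text.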
MarkLogic's storage format abstracts commonalities in the node tree representation, regardless of source format, enabling hybrid collections where XML and JSON coexist. This hybrid model relies on a shared indexing framework: the universal index and any configured range indexes and lexicons are derived from the parsed node trees, not from the original document serialization. Consequently, queries can be composed across mixed XML/JSON datasets without first converting one format into the other, and they can be combined with SPARQL where triples are involved.
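The effect of indexing parsed content instead of serialized text can be illustrated with a small Python sketch. It is an assumption-laden toy, not MarkLogic's indexing code: xml_pairs, json_pairs, and build_index are hypothetical helpers, the format is guessed from the URI suffix, and only leaf name/value pairs are indexed.

```python
import json
import xml.etree.ElementTree as ET
from collections import defaultdict


def xml_pairs(text):
    """Yield (element name, text value) for every element that has text."""
    for elem in ET.fromstring(text).iter():
        if elem.text and elem.text.strip():
            yield elem.tag, elem.text.strip()


def json_pairs(value, name=None):
    """Yield (property name, string value) for every scalar JSON property."""
    if isinstance(value, dict):
        for key, val in value.items():
            yield from json_pairs(val, key)
    elif isinstance(value, list):
        for item in value:
            yield from json_pairs(item, name)
    elif name is not None:
        yield name, str(value)


def build_index(docs):
    """Map (name, value) to the set of document URIs that contain it."""
    index = defaultdict(set)
    for uri, text in docs.items():
        pairs = xml_pairs(text) if uri.endswith(".xml") else json_pairs(json.loads(text))
        for name, value in pairs:
            index[(name, value)].add(uri)
    return index


docs = {
    "/orders/1.xml": "<order><status>open</status><city>Oslo</city></order>",
    "/orders/2.json": '{"order": {"status": "open", "city": "Lima"}}',
    "/orders/3.json": '{"order": {"status": "closed", "city": "Oslo"}}',
}
index = build_index(docs)
open_orders = sorted(index[("status", "open")])
```

A single lookup on ("status", "open") returns both the XML and the JSON order, because the index entries carry no trace of the source format.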
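Indexing in MarkLogic is designed for performant retrieval at scale. The universal index, which covers words and document structure, is built automatically, whereas range and path range indexes are configured explicitly for the fields that need them. The three primary index types for XML and JSON content are: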
- Word indexes for full-text search,
- Range indexes for ordered data such as numbers and dates,
- Path range indexes that accelerate queries targeting specific locations within node hierarchies.
Path range indexes are critical in multi-model querying, as they allow direct access to particular fields or elements within deeply nested structures, be they XML or JSON. Moreover, MarkLogic supports lexicon-type indexes, such as element value lexicons and attribute value lexicons, that maintain sorted sets of distinct values for statistical and query optimization purposes.
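The behavior of a range index and its value lexicon can be sketched in a few lines of Python. This is a conceptual model only: RangeIndex is a hypothetical class, the path string is used purely as a label, and MarkLogic's on-disk structures are considerably more elaborate.

```python
import bisect


class RangeIndex:
    """Toy ordered index: a sorted list of (value, uri) pairs."""

    def __init__(self, path):
        self.path = path           # label only, e.g. "/order/total"
        self._entries = []

    def add(self, value, uri):
        bisect.insort(self._entries, (value, uri))

    def between(self, low, high):
        """Return URIs whose value v satisfies low <= v <= high."""
        # "" sorts before every URI; U+10FFFF sorts after any ordinary URI.
        start = bisect.bisect_left(self._entries, (low, ""))
        stop = bisect.bisect_right(self._entries, (high, "\U0010ffff"))
        return [uri for _, uri in self._entries[start:stop]]

    def lexicon(self):
        """Return the sorted distinct values, as a value lexicon would."""
        return sorted({value for value, _ in self._entries})


prices = RangeIndex("/order/total")
prices.add(19.5, "/a.json")
prices.add(5.0, "/b.xml")
prices.add(42.0, "/c.json")
prices.add(19.5, "/d.xml")

mid_priced = prices.between(10, 20)
distinct_totals = prices.lexicon()
```

Because the entries are kept in sorted order, a range lookup is two binary searches followed by a slice, and the lexicon falls out of the same structure.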
Parsing and indexing occur as part of the ingesting transaction, so a document is searchable as soon as that transaction commits; background reindexing comes into play when index settings are changed afterward. Document repair and normalization mechanisms help ensure well-formed XML and syntactically valid JSON, and both full documents and document fragments can be organized into collections. This flexibility is pivotal to enterprise scenarios where data heterogeneity and variable schema adherence are common.
Extending beyond traditional document-centric storage, MarkLogic integrates RDF triple management to support semantic graph applications. RDF triples, each comprising a subject, a predicate, and an object, are stored in a dedicated triple index that sits alongside the existing node-tree framework. This graph-native storage enables SPARQL queries and semantics-aware operations to be performed alongside XQuery and JavaScript queries on document content.
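As a rough illustration of what a triple index does, the following Python sketch stores subject-predicate-object tuples and answers pattern queries in which any position may be left open. TripleStore and its methods are hypothetical names; the sketch keeps a single secondary index by predicate and makes no claim about how MarkLogic organizes its own triple index.

```python
from collections import defaultdict


class TripleStore:
    """Toy triple store with one secondary index keyed by predicate."""

    def __init__(self):
        self.triples = set()
        self.by_predicate = defaultdict(set)

    def add(self, subject, predicate, obj):
        triple = (subject, predicate, obj)
        self.triples.add(triple)
        self.by_predicate[predicate].add(triple)

    def match(self, s=None, p=None, o=None):
        """Return sorted triples matching the pattern; None is a wildcard."""
        candidates = self.by_predicate.get(p, set()) if p is not None else self.triples
        return sorted(
            t for t in candidates
            if (s is None or t[0] == s) and (o is None or t[2] == o)
        )


store = TripleStore()
store.add("ex:alice", "ex:knows", "ex:bob")
store.add("ex:bob", "ex:knows", "ex:carol")
store.add("ex:alice", "ex:worksFor", "ex:acme")

about_alice = store.match(s="ex:alice")
who_knows_carol = store.match(p="ex:knows", o="ex:carol")
```

The match method corresponds to a single SPARQL triple pattern; a full query engine would join several such patterns on their shared variables.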
The RDF triple store is tightly coupled with document storage, allowing triples to be embedded within XML or JSON documents or to exist as standalone sets of triples. This coexistence supports hybrid querying strategies in which linked data joins with document content, enabling knowledge graph applications that draw on both rich metadata and unstructured or semi-structured documents. Moreover, MarkLogic's built-in, rule-based inference supports common semantic web vocabularies, with rulesets for RDFS and for subsets of OWL, for reasoning over the stored triples.
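The kind of reasoning meant here can be shown with two standard RDFS entailment rules, rdfs9 (type propagation along rdfs:subClassOf) and rdfs11 (transitivity of rdfs:subClassOf). The Python sketch below applies them by naive forward chaining until nothing new is derived. It illustrates the rules themselves, not MarkLogic's inference implementation, and rdfs_closure is a hypothetical helper.

```python
RDF_TYPE = "rdf:type"
SUBCLASS = "rdfs:subClassOf"


def rdfs_closure(triples):
    """Apply rdfs9 and rdfs11 repeatedly until a fixpoint is reached."""
    closure = set(triples)
    while True:
        subclass_pairs = [(s, o) for s, p, o in closure if p == SUBCLASS]
        derived = set()
        for sub, sup in subclass_pairs:
            for s, p, o in closure:
                # rdfs11: (s subClassOf sub) and (sub subClassOf sup)
                #         imply (s subClassOf sup)
                if p == SUBCLASS and o == sub:
                    derived.add((s, SUBCLASS, sup))
                # rdfs9: (s type sub) and (sub subClassOf sup)
                #        imply (s type sup)
                if p == RDF_TYPE and o == sub:
                    derived.add((s, RDF_TYPE, sup))
        if derived <= closure:     # nothing new: fixpoint reached
            return closure
        closure |= derived


asserted = {
    ("ex:Sedan", SUBCLASS, "ex:Car"),
    ("ex:Car", SUBCLASS, "ex:Vehicle"),
    ("ex:car42", RDF_TYPE, "ex:Sedan"),
}
inferred = rdfs_closure(asserted)
```

After the closure is computed, a query for all instances of ex:Vehicle finds ex:car42 even though that fact was never asserted directly.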
Cross-model query support is realized through flexible query grammars and powerful runtime optimizers that can execute queries spanning XML nodes, JSON nodes, and RDF triples in a unified execution plan. Queries may combine XPath expressions, JSON path-like queries, and SPARQL graph patterns within the same request, facilitating comprehensive evaluation against diverse data modalities. This also supports data interoperability use cases, such as federated data access, linked data publishing, and graph-enhanced search applications.
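A cross-model query can be thought of as a join between graph matches and document predicates. The Python sketch below assumes, purely for illustration, that triple subjects are document URIs and that documents are already parsed into dictionaries; the function names and the ex: vocabulary are hypothetical, and a real optimizer would choose the evaluation order itself.

```python
def subjects(triples, predicate, obj):
    """Graph step: subjects of all triples matching (?, predicate, obj)."""
    return {s for s, p, o in triples if p == predicate and o == obj}


def cross_model_query(triples, documents, predicate, obj, field, value):
    """Join step: keep graph matches whose document has field == value."""
    graph_hits = subjects(triples, predicate, obj)
    return sorted(
        uri for uri in graph_hits
        if documents.get(uri, {}).get(field) == value
    )


triples = {
    ("/people/ada.json", "ex:memberOf", "ex:Engineering"),
    ("/people/bob.json", "ex:memberOf", "ex:Sales"),
    ("/people/cy.json", "ex:memberOf", "ex:Engineering"),
}
documents = {
    "/people/ada.json": {"name": "Ada", "active": True},
    "/people/bob.json": {"name": "Bob", "active": True},
    "/people/cy.json": {"name": "Cy", "active": False},
}

active_engineers = cross_model_query(
    triples, documents, "ex:memberOf", "ex:Engineering", "active", True
)
```

Here the graph pattern narrows the candidates to two engineers and the document predicate removes the inactive one, which mirrors the combination of a SPARQL pattern with a document query in one request.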
In addition to core triple storage, MarkLogic provides robust graph analytics and transformation functions implemented via XQuery and JavaScript modules. These include graph traversal, neighborhood discovery, and property graph pattern matching. Integration with external reasoners and data transformation pipelines complements the platform's capabilities, enabling scalable construction and querying of semantic knowledge graphs within enterprise environments.
MarkLogic's representation and indexing strategies for XML, JSON, and RDF exemplify a cohesive architecture for multi-model data management. Native hierarchical node trees capture the structural complexity of XML and JSON while sharing indexing infrastructures. RDF triples extend the platform's semantic reach, seamlessly incorporating graph data models into the content ecosystem. This unified approach empowers heterogeneous data applications that bridge traditional document management with cutting-edge semantic and linked data paradigms.
2.2 Schema and Schemaless Approaches
Within MarkLogic, data management strategies depend on balancing schema enforcement against schema flexibility, so that complex and evolving information can be handled robustly. MarkLogic's architecture supports both schema-driven and schemaless data paradigms, each suited to specific operational and developmental needs. This section analyzes these paradigms, emphasizing schema enforcement, schema evolution, loose or schemaless data management, and techniques for combining or omitting schemas to optimize agility. It also examines versioning and consistency mechanisms that enable forward and backward compatibility in real-world applications.
Schema Enforcement Mechanisms
Schema enforcement in MarkLogic primarily leverages XML Schema Definition (XSD), JSON Schema, and Semantic shapes (SHACL), offering structured validation frameworks consistent with the repository's multi-model capabilities. Strict schema enforcement ensures data conformity, quality control, and reliable query execution paths by validating incoming documents during ingestion or update.
MarkLogic's schema validation can be applied at the collection or database level. For XML documents, XSD validation is integrated within the ingestion pipeline, which rejects or quarantines non-conforming documents. The following fragment sketches such a setting; treat the element and attribute names as illustrative, not as literal configuration syntax:
<schemaValidation database="yourDatabase" enabled="true" />
Within application code, XML validation is typically invoked explicitly, for example with the XQuery validate expression or xdmp:validate, against schemas held in the database's associated schemas database. For JSON, JSON Schema validation is supported via custom modules or third-party integrations, often combined with application-level validation in server-side JavaScript or XQuery. Semantic triples can be constrained through SHACL shapes evaluated during inserts or updates, providing graph-based schema enforcement for RDF data.
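The reject-or-quarantine pattern described above can be sketched independently of any particular validator. The Python example below checks a deliberately small subset of JSON Schema (required properties and primitive types) and routes failures to a quarantine area. The validate and ingest functions, and the dictionaries standing in for a database and a quarantine collection, are all hypothetical.

```python
TYPES = {"string": str, "boolean": bool, "object": dict, "array": list}


def has_type(value, type_name):
    """Check a value against a primitive JSON Schema type name."""
    if type_name == "number":
        # bool is a subclass of int in Python, so exclude it explicitly.
        return isinstance(value, (int, float)) and not isinstance(value, bool)
    return isinstance(value, TYPES[type_name])


def validate(doc, schema):
    """Return a list of error messages; an empty list means the doc conforms."""
    errors = []
    for name in schema.get("required", []):
        if name not in doc:
            errors.append(f"missing required property: {name}")
    for name, rule in schema.get("properties", {}).items():
        if name in doc and not has_type(doc[name], rule["type"]):
            errors.append(f"property {name} is not of type {rule['type']}")
    return errors


def ingest(uri, doc, schema, database, quarantine):
    """Store conforming documents; quarantine the rest with their errors."""
    errors = validate(doc, schema)
    if errors:
        quarantine[uri] = {"document": doc, "errors": errors}
    else:
        database[uri] = doc
    return not errors


schema = {
    "required": ["id", "total"],
    "properties": {"id": {"type": "string"}, "total": {"type": "number"}},
}
database, quarantine = {}, {}
accepted = ingest("/orders/1.json", {"id": "o-1", "total": 19.5}, schema, database, quarantine)
rejected = ingest("/orders/2.json", {"id": 7}, schema, database, quarantine)
```

Keeping the rejected document together with its error list makes later repair and re-ingestion straightforward, which is the main argument for quarantining over outright rejection.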
Schema Evolution Strategies
Dynamic business environments necessitate schema evolution, that is, modifications to schemas that preserve existing data utility while permitting new capabilities. MarkLogic facilitates schema evolution without requiring rigid downtime or complex migration, through versioned schema management and flexible indexing.
Versioning is achieved by maintaining multiple schema versions within the repository, using collections, semantically annotated metadata, or naming conventions to distinguish them. Documents can reference the schema version they comply with, and validation modules apply rules selectively. For instance, XML documents may carry xsi:schemaLocation attributes pointing to distinct XSD versions.
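Selective validation by declared version amounts to a lookup followed by a version-specific check. The Python sketch below assumes a hypothetical schemaVersion property on each document and a registry named SCHEMAS; documents without a declared version are treated as version 1, which is an assumption made here for backward compatibility and not a rule taken from the text.

```python
# Hypothetical registry: schema version -> properties that version requires.
SCHEMAS = {
    "1": {"required": ["id", "name"]},
    "2": {"required": ["id", "givenName", "familyName"]},
}


def schema_version(doc, default="1"):
    """Read the declared version; undeclared documents fall back to default."""
    return str(doc.get("schemaVersion", default))


def check(doc):
    """Validate a document against the schema version it declares."""
    version = schema_version(doc)
    schema = SCHEMAS.get(version)
    if schema is None:
        return version, [f"unknown schema version: {version}"]
    missing = [name for name in schema["required"] if name not in doc]
    return version, [f"missing: {name}" for name in missing]


legacy = check({"id": 1, "name": "Ada Lovelace"})
current = check({"schemaVersion": 2, "id": 1, "givenName": "Ada", "familyName": "Lovelace"})
stale = check({"schemaVersion": "2", "id": 1, "name": "Ada Lovelace"})
```

Old and new documents validate side by side under their own rules, so migration to version 2 can proceed document by document with no need for a single cut-over.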
The search and indexing framework accommodates gradual transitions by enabling partial application of new indexes or query rules. For example, new indexes...