Chapter 2
Advanced Data Modeling and Schema Evolution
Shape your data to unlock Vespa's true potential. This chapter journeys beyond traditional schema design, empowering you to express complex, evolving datasets through Vespa's uniquely expressive schema language. Discover how thoughtful data modeling and strategic evolution underpin high-performance search, analytics, and AI use cases, without sacrificing agility or uptime.
2.1 Modeling Documents and Fields
Central to Vespa's architecture is the concept of a document as the atomic unit of information storage and retrieval. Designing document types and their constituent fields with precision not only ensures optimal indexing and query performance but also enables adaptability to evolving business needs. This section elucidates the principles and best practices for modeling Vespa documents, translating diverse business requirements into an effective schema design. It proceeds to examine Vespa's advanced type constructs (maps, arrays, structs, and tensors), highlighting their appropriate use cases and performance implications.
The process of schema design begins with a rigorous analysis of the domain model and its essential queries. Mapping business requirements onto a Vespa document type involves identifying discrete entities that represent real-world concepts or aggregates and encapsulating their attributes as fields with appropriately chosen data types. For example, a product catalog might define a product document type capturing attributes such as product_id (string or integer), name (text), price (floating point), and categories (multivalue string).
Each field within a document type specifies a storage and indexing strategy, which directly affects flexibility and query efficiency. The indexing statement of a field can name index, attribute, or summary, or a combination joined with the pipe character (for example, indexing: index | summary). Here index builds a text index for search, attribute keeps the field in an in-memory structure for fast filtering, sorting, and grouping, and summary makes the field available in result summaries.
Selecting the correct data type is crucial: primitive types such as int, long, float, double, string, and bool are the building blocks, but Vespa extends the expressiveness with composite types. Primitive fields that require full-text search should be indexed; for example, string fields declared with indexing: index are tokenized and normalized by Vespa's linguistic processing before matching. Numeric fields that participate in range filters or sorting are best declared as attribute fields for rapid evaluation.
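As a minimal sketch of these indexing modes, the fragment below declares one text field and one numeric field. The schema name item and the field names are illustrative assumptions, not part of any particular application.

```
# Hypothetical schema illustrating the three indexing modes.
schema item {
    document item {
        # Free text: tokenized into a text index and returned in results.
        field title type string {
            indexing: index | summary
        }
        # Numeric value: kept in memory as an attribute for range
        # filters, sorting, and grouping, and returned in results.
        field price type float {
            indexing: attribute | summary
        }
    }
}
```

Keeping title out of attribute and price out of index reflects the division of labor described above: text matching uses the index, while numeric filtering and sorting use the attribute.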
When business requirements specify a collection of values, such as multiple tags or localized descriptions, arrays become essential. Arrays allow multiple values of the same field per document, significantly enhancing modeling expressiveness without schema proliferation.
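A multi-valued field of this kind might be declared as in the sketch below. The field name tags is an assumption chosen for illustration, and the fragment is meant to sit inside a document block.

```
# Hypothetical multi-valued field: each document may carry any number of tags.
field tags type array<string> {
    # Attribute storage supports filtering and grouping on individual values.
    indexing: attribute | summary
    # Optional: builds a dictionary for faster lookups, at some memory cost.
    attribute: fast-search
}
```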
Structs are user-defined tuples of named subfields, offering a mechanism to group heterogeneous fields under a logical unit. They enable encapsulation and reuse, and they make the schema easier to read. For example, an address struct might encapsulate street, city, state, and zip fields as a single cohesive entity attached to a customer document.
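The address example could be sketched as follows. The schema name customer and the field names customer_id and billing_address are assumptions introduced for illustration; the struct field is declared as summary only, on the assumption that it is returned in results rather than searched.

```
# Hypothetical customer schema with a reusable address struct.
schema customer {
    document customer {
        # The struct groups related subfields under one named type.
        struct address {
            field street type string {}
            field city type string {}
            field state type string {}
            field zip type string {}
        }
        field customer_id type string {
            indexing: attribute | summary
        }
        # A single field carrying the whole address as one unit.
        field billing_address type address {
            indexing: summary
        }
    }
}
```

The same struct type can back several fields (for instance, a separate shipping address) without repeating the subfield definitions.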
Maps introduce key-value semantics within fields, allowing dynamic associations where the keys are of a primitive type (most often strings) and values can be any supported type. This is particularly useful for properties with variable sets of attributes that cannot be anticipated up front. However, maps are less efficient than fixed fields and should be applied judiciously, only when flexibility outweighs the cost of slower query performance and potentially increased storage.
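A map of free-form properties might look like the sketch below. The field name properties is hypothetical, and the struct-field declarations are included on the assumption that keys and values need to be filterable; they can be dropped if the map is only returned in results.

```
# Hypothetical map of dynamic string properties, placed inside a document block.
field properties type map<string, string> {
    indexing: summary
    # Keeping keys and values as attributes makes them searchable,
    # and increases memory use accordingly.
    struct-field key {
        indexing: attribute
    }
    struct-field value {
        indexing: attribute
    }
}
```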
Tensors represent n-dimensional arrays of numeric values and constitute a powerful type for modeling complex, high-dimensional data such as embeddings, feature vectors, or matrices. Tensors integrate seamlessly with Vespa's ranking expressions and can be evaluated efficiently at query time. The tensor type specification names each dimension, gives a size for every indexed (dense) dimension, and sets the cell value type, which allows interpretable and optimized operations. For instance, a 128-dimensional embedding vector can be stored as a tensor field to support vector similarity search.
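The 128-dimensional embedding mentioned above could be declared along these lines. The field name embedding, the dimension name x, and the choice of angular distance are assumptions made for the sake of the sketch; the appropriate metric depends on how the embeddings were trained.

```
# Hypothetical dense embedding field, placed inside a document block.
field embedding type tensor<float>(x[128]) {
    # attribute keeps the tensor in memory for ranking;
    # index additionally builds an approximate nearest-neighbor index.
    indexing: attribute | index
    attribute {
        # Assumed metric; choose the one that matches the embedding model.
        distance-metric: angular
    }
}
```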
Translating real-world requirements into schema design often involves trade-offs dictated by query workload, data volatility, and update patterns. For example, a news article document requiring faceted navigation on categories and tags would typically represent these as array fields of strings stored as attributes to enable fast aggregation. If the domain requires flexible metadata schemas per document with variable keys, maps can encode these dynamic properties, but the most frequently queried properties are usually better served by dedicated, explicitly declared fields.
Similarly, for data-intensive applications such as recommender systems or personalization engines, tensors allow embedding vectors to be part of the document schema. The choice of tensor dimension sizes and sparsity significantly impacts storage efficiency and retrieval speed. Vespa also offers lower-precision cell value types, such as bfloat16 and int8, which reduce the memory footprint of tensor data at some cost in numeric precision.
One must also consider update patterns: immutable or append-only fields can be optimized differently from frequently updated attributes. Using structs to group related fields may facilitate batch updates and clearer update logic.
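To make the storage trade-offs concrete, the sketch below contrasts a dense tensor, a sparse (mapped) tensor, and a reduced-precision dense tensor. All field and dimension names are hypothetical, and the fragment is meant to sit inside a document block.

```
# Dense: every one of the 128 cells is stored, 4 bytes each.
field user_vector type tensor<float>(x[128]) {
    indexing: attribute | summary
}
# Sparse: only the labels actually present in a document are stored.
field category_scores type tensor<float>(category{}) {
    indexing: attribute | summary
}
# Dense with a 2-byte cell type: half the memory of float,
# at reduced numeric precision.
field compact_vector type tensor<bfloat16>(x[128]) {
    indexing: attribute | summary
}
```

Which form fits best depends on how many cells are populated per document and how much precision the ranking model actually needs.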
Careful field modeling benefits both runtime performance and development agility. Fixed-schema designs using primitive and array types typically produce efficient, fast indexes with minimal overhead. Introducing structs clarifies domain abstractions and promotes schema maintainability. Maps introduce schema flexibility at the expense of index and memory overhead, making them best suited for optional or infrequently queried metadata.
Tensors open advanced machine learning scenarios within Vespa but require proper dimensioning and awareness of underlying hardware capabilities to ensure throughput and latency goals.
In all cases, explicit field declarations and aligned data types prevent type mismatches and support Vespa's robust validation mechanisms. Defining appropriate summaries and attributes helps keep retrieval and filtering efficient, reducing the need for expensive full-document loads.
Consider the schema excerpt below illustrating multiple advanced fields:
document product {
    field product_id type string {
        indexing: summary | attribute
    }
    field name type string {
        indexing: index | summary
    }
    field price type float {
        ...