Chapter 2
Core Architecture and Components of Ray Serve
Beneath Ray Serve's user-friendly APIs lies a robust, high-performance distributed system built for scalable, resilient inference. This chapter peels back those APIs to expose the architectural building blocks, coordination patterns, and runtime abstractions that let Ray Serve balance elasticity, availability, and observability. By examining each core component and how they are orchestrated, readers will gain the architectural intuition needed to troubleshoot, extend, and optimize Ray Serve for real-world demands.
2.1 Distributed Actor Model in Ray
The distributed actor model in Ray represents a core paradigm for managing stateful computations at scale, enabling efficient and resilient execution of complex workloads. Unlike stateless task-oriented approaches, actors encapsulate both computation and mutable state behind an isolated interface, facilitating concurrency without shared memory and promoting fault tolerance through explicit lifecycle management.
An actor in Ray is an instance of a class created remotely on a cluster node. Each actor maintains its own independent state, which persists across method invocations. Because state is encapsulated within the actor, concurrent access to mutable data requires no explicit synchronization mechanisms such as locks or atomic operations. Instead, method calls on an actor are queued and processed sequentially, preserving isolation and consistency without sacrificing parallelism across the many actors running on different nodes.
Actors are constructed with Ray's remote decorator syntax, enabling seamless instantiation of, and interaction with, distributed objects. Consider the following example of an actor definition in Ray:
@ray.remote
class Counter:
    def __init__(self):
        self.value = 0

    def increment(self):
        self.value += 1
        return self.value

    def get_value(self):
        return self.value

An actor is instantiated remotely through a simple API call:

counter = Counter.remote()

Subsequent method invocations on this actor are asynchronous remote calls, returning ObjectRefs which act as futures:

ref1 = counter.increment.remote()
ref2 = counter.get_value.remote()

These ObjectRefs serve as handles to results that will materialize once the corresponding remote execution completes. Ray's runtime manages the serialization, scheduling, dispatch, and communication underlying this interaction invisibly to the user.
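Continuing the example (and assuming ray has been imported and ray.init() called), ray.get blocks until the referenced results are available. Because calls on a single actor execute in submission order, get_value observes the effect of the preceding increment:

# Retrieve concrete values once the remote calls have completed.
print(ray.get(ref1))  # 1
print(ray.get(ref2))  # 1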
A significant advantage of Ray's actor model is its concurrency model: although each actor serializes method execution, multiple actors can operate concurrently across multiple cluster nodes. This design naturally maps to scalable workloads that decompose stateful logic into independent, isolated components. The constrained single-threaded execution per actor avoids traditional pitfalls of concurrency such as race conditions and deadlocks, while still supporting distributed parallelism by employing numerous actors simultaneously.
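A brief sketch of this property, reusing the Counter actor defined above: calls to any single actor are serialized, but independent actors execute concurrently, potentially on different nodes:

# Four independent actors; Ray may place them on different nodes.
counters = [Counter.remote() for _ in range(4)]

# Each actor processes its own queue of calls; the four streams of
# increments proceed in parallel with no locks or shared memory.
for c in counters:
    for _ in range(10):
        c.increment.remote()

# Per-actor FIFO ordering guarantees each counter has reached 10.
print(ray.get([c.get_value.remote() for c in counters]))  # [10, 10, 10, 10]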
Failures and fault tolerance in the distributed actor model are handled by Ray's runtime. When a node or actor process dies, Ray can restart the actor by re-executing its constructor. Since actor states are mutable, live only in process memory, and are potentially large, restarts alone do not restore them; mechanisms such as checkpointing or external state persistence can be integrated to minimize state loss during recovery. Upon failure, actors can be restarted either automatically or under explicit user control, preserving the computational semantics expected by client processes.
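Ray exposes this restart behavior through actor options. The sketch below combines the max_restarts option with a simple file-based checkpoint; the checkpoint path and format are purely illustrative, and a production system would persist state to a shared store:

import os

# Ray re-creates this actor (re-running __init__) up to 3 times
# if its process or host dies.
@ray.remote(max_restarts=3)
class DurableCounter:
    def __init__(self, path="/tmp/counter.ckpt"):  # illustrative path
        self.path = path
        self.value = 0
        # On (re)start, restore state from the checkpoint if present.
        if os.path.exists(path):
            with open(path) as f:
                self.value = int(f.read())

    def increment(self):
        self.value += 1
        # Persist after each update so a restart loses no progress.
        with open(self.path, "w") as f:
            f.write(str(self.value))
        return self.value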
Remote object references extend the model's flexibility further. They can be passed between tasks and actors, facilitating composition and pipelining of distributed computations without incurring heavy data movement or synchronization overhead. By integrating remote references directly into method signatures and return values, Ray effectively hides the complexities of serialization and location transparency, providing a unified programming model for both stateless tasks and stateful actors.
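As a concrete sketch, the ObjectRef returned by a stateless task can be passed directly into an actor method; Ray resolves the reference at the receiving worker, so the intermediate data flows node-to-node without detouring through the caller:

@ray.remote
def preprocess(batch):
    # A stateless task producing an intermediate result.
    return [x * 2 for x in batch]

@ray.remote
class Accumulator:
    def __init__(self):
        self.total = 0

    def add(self, values):
        # Ray resolved the ObjectRef before invocation; `values`
        # arrives as a plain list.
        self.total += sum(values)
        return self.total

ref = preprocess.remote([1, 2, 3])
acc = Accumulator.remote()
# Pass the ObjectRef directly rather than fetching it first.
print(ray.get(acc.add.remote(ref)))  # 12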
This actor-centric architecture suits machine learning serving workloads exceptionally well. ML serving often entails managing multiple models or model versions concurrently, each with its own state (e.g., weights, configurations, statistics). Actors naturally encapsulate these states, enabling isolated request handling with dedicated concurrency, minimizing interference and improving latency predictability. Furthermore, long-lived actors reduce the overhead of repeated model loading and initialization by maintaining warm state between requests.
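A minimal sketch of this serving pattern, in which load_model is a stand-in for a framework-specific loader rather than a Ray API: the model is loaded once in the constructor and stays warm across requests:

def load_model(path):
    # Stand-in for a real loader (e.g., torch.load); returns a toy
    # model object purely for illustration.
    class Model:
        def predict(self, features):
            return [0.5 * f for f in features]
    return Model()

@ray.remote
class ModelServer:
    def __init__(self, model_path):
        self.model = load_model(model_path)  # expensive, runs once
        self.request_count = 0               # per-actor statistics

    def predict(self, features):
        self.request_count += 1
        return self.model.predict(features)

# One long-lived actor per model version keeps its weights warm.
server_v1 = ModelServer.remote("models/v1")
server_v2 = ModelServer.remote("models/v2")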
Concurrency is essential for high-throughput inference; actors can process requests independently and in parallel, subject only to the serialization of method calls within each actor instance. This isolation also improves system resilience: failures in one actor do not cascade into others, and recovery procedures can be localized and fine-grained. Additionally, applications can balance load dynamically by creating additional actors or terminating idle ones as demand fluctuates, with Ray's distributed scheduler handling their placement across the cluster.
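Ray's ray.util.ActorPool utility illustrates a simple form of this elasticity: a pool of replicas shares a stream of requests, and replicas can be added or removed as demand shifts. A minimal sketch, reusing the ModelServer actor defined above:

from ray.util import ActorPool

# A small pool of warm model replicas.
pool = ActorPool([ModelServer.remote("models/v1") for _ in range(4)])

batches = [[1.0, 2.0], [3.0], [4.0, 5.0, 6.0]]

# map dispatches each batch to the next free actor and yields
# results in input order; a busy replica simply receives no new work.
for result in pool.map(lambda actor, batch: actor.predict.remote(batch), batches):
    print(result)

When demand grows, a freshly created replica can be added to the pool at runtime with pool.push.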
Ray's distributed actor model combines encapsulated mutable state with asynchronous, distributed execution semantics, achieving a balance between isolation, concurrency, and resilience. Its design abstracts away complexities inherent in distributed computing, resulting in a robust foundation that elegantly supports scalable machine learning serving and other stateful applications requiring high availability, concurrent processing, and fault tolerance.
...