Chapter 2
Cilium's Architecture and Core Principles
Cilium emerges as a transformative force in cloud-native networking, marrying the flexibility of eBPF with intuitive, high-performance policy enforcement at scale. This chapter peels back the layers of Cilium's architecture, revealing not only how it integrates seamlessly with Kubernetes and the Linux kernel, but also how it reimagines security boundaries, service connectivity, and observability for complex modern workloads. Prepare to uncover the design philosophies and core mechanisms that position Cilium at the forefront of programmable networking.
2.1 Control Plane and Data Plane Separation
Cilium's architecture fundamentally relies on a clear demarcation between the control plane and the data plane, a design choice that optimizes both scalability and performance in modern cloud-native networking environments. This split assigns responsibilities so that the control plane manages configuration, policy definition, and orchestration, while the data plane enforces those policies by executing programmable logic directly in the Linux kernel via extended Berkeley Packet Filter (eBPF).
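To make the division concrete, the following Go sketch models the two planes as interfaces. The type and method names (PolicyUpdate, ControlPlane, DataPlane) are illustrative inventions for this chapter, not Cilium's actual internal APIs; they capture only the contract: the control plane computes desired state, and the data plane applies it and keeps enforcing the last applied revision on its own.

```go
package planes

// PolicyUpdate is a hypothetical, simplified unit of work flowing from the
// control plane to a node-local data plane: allow-verdicts keyed by
// (source identity, destination identity).
type PolicyUpdate struct {
	Revision uint64 // monotonically increasing policy revision
	Allowed  map[[2]uint32]bool
}

// ControlPlane captures the control-plane role: translate high-level intent
// (Kubernetes resources, service mesh configuration) into concrete updates.
type ControlPlane interface {
	DesiredState() (PolicyUpdate, error)
}

// DataPlane captures the node-local enforcement role: apply updates to the
// kernel (in Cilium's case by rewriting eBPF maps and programs) and report
// back what is currently enforced. If the control plane becomes
// unreachable, the data plane simply keeps enforcing LastApplied.
type DataPlane interface {
	Apply(PolicyUpdate) error
	LastApplied() uint64
}
```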
The control plane operates primarily as a centralized authority, responsible for overseeing network policy state, translating high-level intent into enforceable rules, and distributing these rules to each node's data plane instance. It ingests configuration inputs through Kubernetes APIs, service meshes, or custom control logic, synthesizing network policies, security rules, and monitoring directives. This centralization permits global consistency, extensive visibility, policy dependency resolution, and auditing capabilities. Cilium's control plane components maintain an authoritative representation of the cluster-wide network state, managing dynamic updates in response to service scaling, pod lifecycle events, and security posture changes.
In contrast, the data plane resides on every individual node and acts as the enforcement mechanism for the policies provisioned by the control plane. Cilium's data plane leverages eBPF, a technology for running sandboxed bytecode inside the kernel, enabling packet processing at line rate with minimal latency. The data plane integrates directly with the Linux networking stack, intercepting packets at hook points such as XDP (eXpress Data Path), TC (Traffic Control), and the socket layer to make connectivity decisions, enforce security policies, apply load balancing, and collect telemetry. This localized, programmable processing significantly reduces overhead compared to user-space proxies by eliminating the per-packet copies and kernel-to-user-space transitions such proxies incur.
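The mechanics of loading and attaching a data plane program can be sketched with the open-source github.com/cilium/ebpf library, which the Cilium project maintains. The program below is deliberately trivial, returning XDP_PASS for every packet, and the interface name eth0 is an assumption; running it requires CAP_BPF/CAP_NET_ADMIN. It shows only how bytecode reaches an XDP hook, not Cilium's actual datapath.

```go
package main

import (
	"log"
	"net"

	"github.com/cilium/ebpf"
	"github.com/cilium/ebpf/asm"
	"github.com/cilium/ebpf/link"
)

func main() {
	// A trivial eBPF program: return XDP_PASS (2) for every packet.
	// Cilium's real datapath programs are far richer; this shows only the
	// mechanics of loading bytecode into the kernel.
	prog, err := ebpf.NewProgram(&ebpf.ProgramSpec{
		Type: ebpf.XDP,
		Instructions: asm.Instructions{
			asm.Mov.Imm(asm.R0, 2), // R0 = XDP_PASS
			asm.Return(),
		},
		License: "GPL",
	})
	if err != nil {
		log.Fatalf("loading program: %v", err)
	}
	defer prog.Close()

	iface, err := net.InterfaceByName("eth0") // assumption: eth0 exists
	if err != nil {
		log.Fatalf("finding interface: %v", err)
	}

	// Attach at the XDP hook, the earliest point in the receive path.
	l, err := link.AttachXDP(link.XDPOptions{
		Program:   prog,
		Interface: iface.Index,
	})
	if err != nil {
		log.Fatalf("attaching XDP: %v", err)
	}
	defer l.Close() // detaches the program on exit
}
```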
Operational boundaries between the planes are clearly designed to maintain fault isolation and scalability. The control plane can independently evolve and scale across multiple nodes, handling cluster-wide policy reconciliation and state management without imposing computational load on the packet path. Meanwhile, the data plane continues enforcing the most recently received policies regardless of temporary control plane unavailability, thus maintaining uninterrupted traffic flow and security enforcement. This resilience is crucial for large-scale environments where network dynamics and node churn are frequent.
Communication between the control plane and data plane is typically implemented over reliable agent protocols, often via gRPC or bespoke APIs, facilitating asynchronous updates of policy and configuration data. The control plane dispatches incremental changes encoded as eBPF programs or maps, which the data plane applies atomically to maintain consistent enforcement. Additionally, health and status reports flow upstream, allowing the control plane to monitor node readiness, eBPF program load status, and enforcement metrics in real time. This bidirectional communication underpins adaptive policy responses and dynamic reconfiguration capabilities.
Fault tolerance in this architecture is realized through layered strategies. At the control plane level, redundancy is provided via multiple replicas coordinated with consensus mechanisms or leader election, ensuring continuous availability of policy management functions. The data plane employs fail-safe defaults, such as permissive or restrictive policy fallbacks, to handle control plane disconnections or map update failures. eBPF program loaders perform validation and rollback to guarantee kernel stability, preventing corrupted or incomplete policy states from compromising node operation. Additionally, Cilium utilizes versioned atomic updates to eBPF maps, avoiding transient inconsistencies that could lead to dropped packets or security bypasses.
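Map updates deserve a concrete illustration. The sketch below, again using github.com/cilium/ebpf, creates an illustrative policy map whose key and value layout is an assumption, far simpler than Cilium's real policy maps. Each update or delete is atomic with respect to kernel-side readers, which is the property the versioned-update strategy builds on.

```go
package main

import (
	"log"

	"github.com/cilium/ebpf"
)

func main() {
	// Illustrative policy map: 32-bit security identity -> 32-bit verdict.
	// Cilium's real policy maps carry richer keys (identity, port, proto).
	policyMap, err := ebpf.NewMap(&ebpf.MapSpec{
		Type:       ebpf.Hash,
		KeySize:    4,
		ValueSize:  4,
		MaxEntries: 16384,
	})
	if err != nil {
		log.Fatalf("creating map: %v", err)
	}
	defer policyMap.Close()

	identity, allow := uint32(10042), uint32(1)

	// Each Update is atomic with respect to in-kernel readers: a packet
	// being classified sees either the old entry or the new one, never a
	// torn write. UpdateAny inserts or overwrites.
	if err := policyMap.Update(identity, allow, ebpf.UpdateAny); err != nil {
		log.Fatalf("updating policy map: %v", err)
	}

	// Deletes are likewise atomic. Note what is *not* here: no control
	// plane round trip in the packet path. If the agent's peer vanishes,
	// the entries written last simply remain in force.
	if err := policyMap.Delete(identity); err != nil {
		log.Fatalf("deleting entry: %v", err)
	}
}
```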
The division of functionality also facilitates fine-grained telemetry and observability. Instrumentation hooks embedded within the data plane provide real-time metrics on packet processing, policy hit rates, and traffic flows, while the control plane aggregates and contextualizes this information for higher-level analytics and troubleshooting. This separation supports on-demand debugging and dynamic policy audits without impacting the latency-sensitive data path.
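A sketch of the telemetry side: datapath programs typically increment per-CPU counters, and user space aggregates them without ever touching the packet path. The map layout below (index 0 for forwarded packets, index 1 for denied) is an assumption for illustration, not Cilium's actual metrics map.

```go
package main

import (
	"log"

	"github.com/cilium/ebpf"
)

func main() {
	// Per-CPU counters of the kind a datapath program increments on
	// every policy verdict; writers never contend across CPUs.
	counters, err := ebpf.NewMap(&ebpf.MapSpec{
		Type:       ebpf.PerCPUArray,
		KeySize:    4,
		ValueSize:  8,
		MaxEntries: 2, // index 0 = forwarded, 1 = denied (assumed layout)
	})
	if err != nil {
		log.Fatalf("creating counter map: %v", err)
	}
	defer counters.Close()

	// Reading a per-CPU map yields one value per possible CPU; the agent
	// sums them into a node-level metric outside the packet path.
	var perCPU []uint64
	if err := counters.Lookup(uint32(0), &perCPU); err != nil {
		log.Fatalf("reading counters: %v", err)
	}
	var forwarded uint64
	for _, v := range perCPU {
		forwarded += v
	}
	log.Printf("forwarded packets: %d", forwarded)
}
```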
Cilium's control plane and data plane separation embodies a high-performance, scalable, and robust paradigm. The control plane retains authoritative and centralized governance over network policy and configuration, while the eBPF-powered data plane enforces these policies with kernel-level efficiency and resilience. The integrated communication and fault tolerance mechanisms ensure that network security and connectivity remain consistent in dynamic cloud-native environments, delivering a foundation well-suited for the demands of modern service meshes, container orchestration platforms, and microservices architectures.
2.2 Cilium Agent Internals
The Cilium agent functions as the central orchestrator managing the lifecycle of network endpoints within Kubernetes clusters. It integrates closely with the Linux kernel, leveraging eBPF (extended Berkeley Packet Filter) technology to enforce networking and security policies with high efficiency. At the core of its operation lie endpoint provisioning, continuous state reconciliation, policy resolution, and synchronization of cluster-wide configuration.
Endpoint Lifecycle Management
Endpoints represent the fundamental entity within Cilium's networking model, typically corresponding to individual pods in Kubernetes. The agent is responsible for provisioning each endpoint by allocating identity, attaching appropriate eBPF programs, and configuring associated networking constructs such as interfaces, routes, and iptables rules. Upon detection of a new pod on the cluster network, the agent performs a series of steps:
- Identity allocation: An endpoint identity is determined based on labels and security policies, enabling fine-grained access control; a simplified sketch of label-based identity derivation follows this list.
- eBPF program attachment: Tailored eBPF programs are loaded and pinned into the kernel, enforcing ingress and egress traffic control at the interface level.
- Device and addressing setup: Network interfaces are configured with appropriate IP addresses, routes, and neighbor entries to ensure connectivity.
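The identity-allocation step benefits from a concrete sketch. The function below derives a stable identity from a pod's labels; it is a deliberate simplification, since Cilium's real allocator coordinates cluster-wide (through CRDs or a key-value store) so that identical label sets resolve to the same identity on every node. It does show the essential property: identity follows labels, not IP addresses.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// identityForLabels derives a stable numeric identity from an endpoint's
// security-relevant labels. Hypothetical stand-in for Cilium's allocator.
func identityForLabels(labels map[string]string) uint32 {
	keys := make([]string, 0, len(labels))
	for k := range labels {
		keys = append(keys, k)
	}
	// Canonical order: the identity must not depend on map iteration order.
	sort.Strings(keys)

	h := fnv.New32a()
	for _, k := range keys {
		fmt.Fprintf(h, "%s=%s;", k, labels[k])
	}
	return h.Sum32()
}

func main() {
	// Two pods with the same labels share one identity, which is what
	// lets policy be enforced per label set rather than per IP address.
	a := identityForLabels(map[string]string{"app": "api", "env": "prod"})
	b := identityForLabels(map[string]string{"env": "prod", "app": "api"})
	fmt.Println(a == b) // true
}
```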
The agent continuously monitors endpoint health and status through a local state store. Furthermore, it initiates necessary cleanup when pods are terminated, detaching eBPF programs and releasing associated resources to avoid inconsistencies.
State Reconciliation and Synchronization
Given the dynamic nature of Kubernetes clusters, with frequent pod churn and policy updates, the agent periodically reconciles its internal desired state with the actual cluster and kernel state. This reconciliation proceeds through a state machine that aligns differences between:
- Local endpoint configuration versus the latest Kubernetes state.
- Runtime kernel eBPF program attachments versus intended program lists.
- Policy maps and service maps installed in the kernel with the current configuration.
Such reconciliation minimizes drift in the system and ensures eventual consistency without introducing network downtime. It handles transient errors and performs adaptive retries for resource provisioning, relying on event-driven updates received from the Kubernetes API and local kernel events.
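A simplified version of this reconciliation can be written as a pure diff between desired and actual state. The helper below is illustrative (the agent's real reconciler is event-driven with per-resource backoff rather than a one-shot diff), but it shows the core computation: what to provision and what to clean up.

```go
package main

import "fmt"

// reconcile diffs the desired endpoint set (derived from Kubernetes) against
// the actual set (what is programmed in the kernel), returning the work to
// perform. Names and types are illustrative, not the agent's internal API.
func reconcile(desired, actual map[string]uint32) (add, remove []string) {
	for name, id := range desired {
		if actual[name] != id { // missing endpoint or stale identity
			add = append(add, name)
		}
	}
	for name := range actual {
		if _, ok := desired[name]; !ok { // orphaned kernel state
			remove = append(remove, name)
		}
	}
	return add, remove
}

func main() {
	desired := map[string]uint32{"pod-a": 1001, "pod-b": 1002}
	actual := map[string]uint32{"pod-a": 1001, "pod-c": 1003}

	// In the agent, this diff runs in response to API-server and kernel
	// events, and failed items are retried with backoff.
	add, remove := reconcile(desired, actual)
	fmt.Println("to provision:", add)   // [pod-b]
	fmt.Println("to clean up:", remove) // [pod-c]
}
```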
Policy Resolution and Enforcement
Cilium's security policies, usually specified as Kubernetes Network Policies, are processed by the agent via a multi-step resolution path:
- Policy ingestion: The agent watches for policy resource changes across namespaces.
- Policy merge: Multiple policies targeting a single endpoint are merged to derive effective rules.
- BPF map updates: The resulting policies are converted into eBPF map entries, guiding traffic classification in kernel space; a combined sketch of the merge and map-population steps follows this list.
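The merge and map-population steps can be sketched together. The rule layout below is an invented simplification (a real policy key also carries traffic direction, L7 context, and more), but it shows why merging happens in user space: the kernel should be left with a flat, de-duplicated set of entries answerable by a single hash lookup.

```go
package main

import "fmt"

// rule is a simplified allow-rule: traffic from a source security identity
// to a destination port/protocol. Illustrative, not Cilium's policy model.
type rule struct {
	srcIdentity uint32
	dstPort     uint16
	proto       uint8 // 6 = TCP, 17 = UDP
}

// mergePolicies computes the effective rule set for one endpoint by taking
// the union of all policies that select it, de-duplicating identical rules.
func mergePolicies(policies [][]rule) map[rule]struct{} {
	effective := make(map[rule]struct{})
	for _, p := range policies {
		for _, r := range p {
			effective[r] = struct{}{}
		}
	}
	return effective
}

func main() {
	frontend := []rule{{srcIdentity: 2001, dstPort: 8080, proto: 6}}
	monitoring := []rule{
		{srcIdentity: 3001, dstPort: 9090, proto: 6},
		{srcIdentity: 2001, dstPort: 8080, proto: 6}, // duplicate, merged away
	}

	// Each effective rule would then become one eBPF map entry whose key
	// is (identity, port, proto) and whose value is the verdict, so the
	// kernel classifies a packet with a single hash-map lookup.
	for r := range mergePolicies([][]rule{frontend, monitoring}) {
		fmt.Printf("key={id:%d port:%d proto:%d} -> allow\n",
			r.srcIdentity, r.dstPort, r.proto)
	}
}
```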
Efficient map population is critical because these data structures are accessed at packet processing time. The result is an enforcement strategy that operates at the speed of the kernel datapath, resolving policy verdicts with map lookups in the packet path rather than in user space.