Borg
Google’s Borg system is a large-scale cluster manager that automates the deployment, scaling, and management of applications across tens of thousands of machines in its data centers. It was designed to handle a diverse workload, including long-running services and batch jobs, and to maximize resource utilization while ensuring high availability.
The three basic functionalities that Borg accomplishes are:
1. Resource Management and Failure Handling
Borg hides the complexities of managing machines and their resources, allowing users to focus on application development. It handles resource allocation, task placement, installation of programs and dependencies, monitoring, and restarting failed tasks.
2. High Reliability and Availability
Borg is designed to operate with very high reliability and availability, supporting applications that require the same. It automatically reschedules tasks, spreads them across different machines and failure domains to reduce correlated failures, and limits the number of tasks that can be simultaneously down during upgrades.
3. Efficient Resource Utilization
Borg achieves high utilization by combining admission control, efficient task-packing, over-commitment, and machine sharing with process-level performance isolation. This allows it to co-locate diverse workloads on the same physical machines.
Separation of Duties/Isolation in Borg
Borg achieves separation of duties and isolation through several mechanisms:
Containerization (Linux Containers): Borg relies heavily on Linux containers to provide resource and execution isolation between different tasks running on the same physical machine. Containers isolate processes, ensuring that one application’s resource consumption (CPU, memory, disk I/O) does not negatively impact others. While containers provide a good level of isolation, Google also implements additional security layers (such as virtual machines for external software like Google App Engine and Google Compute Engine) for enhanced protection.
Resource Allocation and Quota: Borg enforces resource limits and quotas for jobs. Users request a certain amount of CPU, memory, and other resources for their jobs, and Borg ensures these allocations are respected. This prevents a single misbehaving or overly resource-hungry job from monopolizing machine resources and affecting other co-located workloads. Quotas are used for admission control, determining which jobs can be scheduled.
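Quota-based admission control as described above might look like the following toy Go sketch (the field names, units, and per-user bookkeeping are illustrative assumptions, not Borg’s actual schema):

```go
package main

import "fmt"

// Quota tracks a user's remaining resource budget (illustrative units).
type Quota struct{ CPU, MemGB int }

// JobReq is a hypothetical job submission: a user plus requested resources.
type JobReq struct {
	User  string
	CPU   int
	MemGB int
}

// admit accepts a job only if it fits within the user's remaining quota,
// debiting the quota on success. Rejected jobs are never scheduled at all.
func admit(remaining map[string]Quota, j JobReq) bool {
	q, ok := remaining[j.User]
	if !ok || j.CPU > q.CPU || j.MemGB > q.MemGB {
		return false // over quota (or no quota): rejected at admission time
	}
	remaining[j.User] = Quota{q.CPU - j.CPU, q.MemGB - j.MemGB}
	return true
}

func main() {
	remaining := map[string]Quota{"alice": {CPU: 10, MemGB: 32}}
	fmt.Println(admit(remaining, JobReq{"alice", 6, 16})) // fits: true
	fmt.Println(admit(remaining, JobReq{"alice", 6, 8}))  // only 4 CPU left: false
}
```

The point of the sketch is that quota is checked once, up front, before any scheduling work happens; the scheduler then only ever sees jobs that are already within budget.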
Priority Levels: Borg has different priority bands (e.g., monitoring, production, batch, best effort). Higher-priority jobs can preempt lower-priority ones if resources become scarce, ensuring that critical services remain operational. This creates a logical separation of importance among workloads.
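The preemption behavior described above can be sketched in Go. This is a minimal, illustrative model: the band names follow the text, but the numeric values, the `Task` fields, and the lowest-priority-first eviction heuristic are assumptions, not Borg’s actual policy:

```go
package main

import (
	"fmt"
	"sort"
)

// Priority bands, roughly mirroring the ordering in the text
// (the numeric values are hypothetical).
const (
	BestEffort = 0
	Batch      = 100
	Production = 200
	Monitoring = 300
)

type Task struct {
	Name     string
	Priority int
	CPU      int // requested CPU, in arbitrary units
}

// preempt returns tasks to evict so that `need` CPU units become free
// for an incoming task at priority `prio`. It only ever evicts tasks
// strictly below the incoming priority, lowest priority first.
func preempt(running []Task, need, prio int) []Task {
	var candidates []Task
	for _, t := range running {
		if t.Priority < prio {
			candidates = append(candidates, t)
		}
	}
	sort.Slice(candidates, func(i, j int) bool {
		return candidates[i].Priority < candidates[j].Priority
	})
	var victims []Task
	freed := 0
	for _, t := range candidates {
		if freed >= need {
			break
		}
		victims = append(victims, t)
		freed += t.CPU
	}
	if freed < need {
		return nil // cannot make room; the incoming task stays pending
	}
	return victims
}

func main() {
	running := []Task{
		{"logs-batch", Batch, 4},
		{"web-frontend", Production, 8},
		{"scratch-job", BestEffort, 2},
	}
	for _, v := range preempt(running, 5, Production) {
		fmt.Println("evict:", v.Name)
	}
}
```

Note the asymmetry this encodes: a production task can displace batch and best-effort work, but a batch task arriving on a full machine simply waits, which is exactly the logical separation of importance the text describes.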
Binary Authorization for Borg (BAB): BAB is a critical security control that helps reduce insider risk and ensures the integrity of code running on Borg. It enforces policies that dictate the security requirements for services, such as:
- Code Review: Production software must be reviewed and approved before deployment.
- Verifiable Builds: Container images can be traced back to their human-readable sources, providing an audit trail.
- Containerized Deployment: All deployments happen with containers.
- Service-based Identity: Jobs are provisioned with cryptographic credentials to prove their identity when communicating with other services.
- Submitted Configurations: Configuration changes also undergo a review and submission process, preventing unauthorized modifications.
- Deploy-time Enforcement and Continuous Validation: BAB blocks non-compliant jobs from deploying, and continuously monitors and alerts on non-compliant jobs that were deployed.
Network Isolation (Implicit): While Borg tasks on a machine share the host’s IP address, Borg manages port assignments as a resource and enforces port isolation. Tasks must declare the ports they require, and the Borglet (node agent) ensures that port conflicts are avoided, providing a form of network isolation within the shared host environment.
Separation of Concerns in Architecture: The evolution of Borg (and its successor, Omega, which influenced Kubernetes) involved breaking down monolithic functionalities into separate, peer components. This architectural decoupling contributes to isolation by limiting the blast radius of failures and allowing individual components to manage their specific state space, improving overall system resilience and manageability.
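The port-as-resource bookkeeping described above (tasks declare ports; the node agent rejects conflicts) can be sketched in Go. The type and method names here are illustrative assumptions, not Borglet internals:

```go
package main

import "fmt"

// PortAllocator models how a node agent might treat host ports as a
// schedulable resource on a machine whose tasks share one IP address.
type PortAllocator struct {
	inUse map[int]string // port -> name of the task holding it
}

func NewPortAllocator() *PortAllocator {
	return &PortAllocator{inUse: make(map[int]string)}
}

// Claim reserves all requested ports for a task, or fails atomically
// if any requested port is already held by another task.
func (a *PortAllocator) Claim(task string, ports []int) error {
	for _, p := range ports {
		if owner, taken := a.inUse[p]; taken {
			return fmt.Errorf("port %d already held by %q", p, owner)
		}
	}
	for _, p := range ports {
		a.inUse[p] = task
	}
	return nil
}

// Release frees every port held by a task, e.g. when it exits.
func (a *PortAllocator) Release(task string) {
	for p, owner := range a.inUse {
		if owner == task {
			delete(a.inUse, p)
		}
	}
}

func main() {
	alloc := NewPortAllocator()
	fmt.Println(alloc.Claim("web-frontend", []int{8080, 8443})) // succeeds
	fmt.Println(alloc.Claim("api-server", []int{8443}))         // conflict
}
```

Because the claim is all-or-nothing, a task that loses the race for one port does not leave half its ports reserved, which keeps the shared host in a consistent state.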
Main Components of Borg and Their Architectural Definition
Borg employs a master-slave architecture within each “cell” (a collection of machines managed as a unit within a data center). The main components are:
BorgMaster:
Role: The “brain” of the Borg cell. It’s a centralized controller that manages the overall state of the cluster, accepts client requests (job submissions), and orchestrates the scheduling and execution of tasks.
Architecture: The BorgMaster is replicated for high availability, often using Paxos for distributed consensus to maintain a consistent view of the cell’s state across replicas. It interacts with both external components (users, other Google services) and the Borglets on individual machines. Its state, including job configurations and task assignments, is persistently recorded in a Paxos store.
Definition of Architecture: The BorgMaster defines the centralized control plane of the Borg system. It acts as the single source of truth for the desired state of the cell, making high-level decisions about job placement and resource allocation. Its replication ensures fault tolerance and continuous operation even if one master fails.
Scheduler:
Role: An independent service responsible for assigning tasks to machines. It continuously scans pending jobs and determines the most suitable machines for their execution based on resource availability, constraints, priorities, and other policies.
Architecture: The scheduler operates asynchronously, pulling state changes from the elected BorgMaster and updating its local copy. It performs feasibility checks (identifying machines with enough resources that meet job constraints) and scoring (selecting the best machine based on various criteria like minimizing preemptions, optimizing packing, and distributing tasks across failure domains). Once a task is assigned, the scheduler informs the BorgMaster.
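The two phases above can be sketched as a toy feasibility-then-scoring pass in Go. This is a minimal sketch: the machine fields, the best-fit packing heuristic, and the failure-domain bonus are illustrative assumptions, not Borg’s real scoring function:

```go
package main

import (
	"fmt"
	"math"
)

type Machine struct {
	Name          string
	FreeCPU       int
	FreeMemGB     int
	FailureDomain string // e.g. a rack or power domain
}

type TaskReq struct {
	CPU   int
	MemGB int
}

// schedule runs a feasibility pass (enough free resources) and then a
// scoring pass: tighter fits score higher (better packing), and machines
// in failure domains the job does not yet use get a spreading bonus.
func schedule(machines []Machine, req TaskReq, usedDomains map[string]bool) (string, bool) {
	best, bestScore := "", math.Inf(-1)
	for _, m := range machines {
		if m.FreeCPU < req.CPU || m.FreeMemGB < req.MemGB {
			continue // infeasible: fails the resource check
		}
		leftover := float64(m.FreeCPU-req.CPU) + float64(m.FreeMemGB-req.MemGB)
		score := -leftover // best fit: less leftover is better
		if !usedDomains[m.FailureDomain] {
			score += 100 // spread the job across failure domains
		}
		if score > bestScore {
			best, bestScore = m.Name, score
		}
	}
	return best, best != ""
}

func main() {
	machines := []Machine{
		{"m1", 10, 16, "rackA"},
		{"m2", 4, 8, "rackB"},
	}
	// The job already has a task on rackA, so m2 wins despite m1's headroom.
	pick, ok := schedule(machines, TaskReq{CPU: 3, MemGB: 4}, map[string]bool{"rackA": true})
	fmt.Println(pick, ok)
}
```

Separating the cheap feasibility filter from the more expensive scoring step is the design point: most machines are eliminated before any scoring work happens.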
Definition of Architecture: The scheduler represents the intelligent decision-making engine of Borg. By being a separate component, it allows for complex scheduling algorithms to be developed and evolved independently of the core state management, contributing to Borg’s efficiency and ability to handle diverse workloads. Its asynchronous nature prevents it from becoming a bottleneck for the BorgMaster.
Borglet:
Role: A local agent process that runs on every machine within a Borg cell. It’s responsible for managing the tasks assigned to its host machine, including starting and stopping containers, monitoring task health, reporting machine status, and managing local resources.
Architecture: The Borglet receives instructions from the BorgMaster regarding which tasks to run. It interacts directly with the underlying Linux kernel’s container functionalities to enforce resource isolation. It also reports the health and resource usage of tasks and of the machine itself back to the BorgMaster for monitoring and scheduling decisions.
Definition of Architecture: The Borglet defines the execution plane of the Borg system. It’s the worker component that directly interacts with the host operating system to implement the decisions made by the BorgMaster and scheduler. Its presence on every machine makes Borg a distributed system, allowing for scalable and fault-tolerant execution of tasks across the entire cluster.
In essence, the BorgMaster maintains the desired state, the Scheduler decides where to realize that state, and the Borglets execute the commands to reach that state on individual machines. This distributed and loosely coupled architecture, with a centralized control plane (BorgMaster) and distributed agents (Borglets) coordinated by a smart scheduler, enables Borg to manage massive clusters efficiently, reliably, and with strong isolation guarantees.
Kubernetes
While Google’s internal Borg system was developed in C++, Kubernetes, which drew heavy inspiration from Borg and brought its core concepts to the open-source world, was primarily built in Go. Both systems offer similar functionalities for managing containerized workloads.
Architecture and Main Components of Kubernetes:
Kubernetes operates on a cluster of machines, typically divided into a control plane (formerly master nodes) and worker nodes.
Control Plane Components (formerly Master Node Components): The control plane components are responsible for managing the cluster’s state, scheduling workloads, and responding to cluster events. They typically run on dedicated “master” nodes (though in production, these are often highly available and distributed across multiple machines).
- Kube-apiserver: The front end of the Kubernetes control plane. It exposes the Kubernetes API. All communication with the cluster (from users, other control plane components, and worker nodes) goes through the API server. It validates and configures data for API objects (pods, services, replication controllers) and persists them in etcd.
- etcd: A consistent and highly available key-value store. It stores all cluster data, including the desired state of the cluster (e.g., which applications should be running, how many replicas, their configurations) and the actual state. It’s crucial for the cluster’s health and consistency.
- Kube-scheduler: Watches for newly created Pods with no assigned node and selects a node for them to run on. It considers various factors for scheduling, including individual and collective resource requirements, hardware/software/policy constraints, affinity and anti-affinity specifications, data locality, inter-workload interference, and deadlines.
- Kube-controller-manager: Runs controller processes. These controllers watch the shared state of the cluster through the API server and make changes attempting to move the current state towards the desired state. Examples include:
- Node Controller: Responsible for noticing and responding when nodes go down.
- Replication Controller: Maintains the correct number of pods for every replication controller object.
- Endpoints Controller: Populates the Endpoints object (which joins Services and Pods).
    - Service Account & Token Controller: Creates default accounts and API access tokens for new namespaces.
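The controller pattern behind all of these can be illustrated with a toy Go function in the spirit of the Replication Controller: compare a desired replica count against the pods currently observed and compute the corrective actions. This is a minimal sketch; real controllers watch the API server and act on the cluster rather than returning a diff:

```go
package main

import "fmt"

// reconcile compares desired state (a replica count) with observed state
// (the pods that currently exist) and returns the actions needed to
// converge: how many pods to create, or which surplus pods to delete.
func reconcile(desired int, current []string) (toCreate int, toDelete []string) {
	if len(current) < desired {
		return desired - len(current), nil
	}
	return 0, current[desired:] // keep the first `desired`, delete the rest
}

func main() {
	// Desired: 3 replicas; observed: only one pod running.
	create, del := reconcile(3, []string{"pod-a"})
	fmt.Println("create:", create, "delete:", del)
}
```

Run in a loop against fresh observations, this converges the current state toward the desired state no matter how the cluster drifted, which is exactly the behavior the text ascribes to kube-controller-manager’s controllers.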
- cloud-controller-manager (Optional): Embeds cloud-specific control logic. This component runs controllers that interact with the underlying cloud provider’s APIs. For example, in a cloud environment, it might handle creating load balancers, managing cloud-specific storage volumes, and routing network traffic. It allows Kubernetes to integrate with the cloud provider without needing to modify the core Kubernetes code.
Worker Node Components (formerly Minion/Node components)
These components run on each worker node and are responsible for running the containerized applications and communicating with the control plane.
kubelet: An agent that runs on each node in the cluster. It ensures that containers are running in a Pod. It registers the node with the API server, reports the node’s status, takes PodSpecs (configurations for pods) from the API server, and ensures the described containers are running and healthy. It does not manage containers not created by Kubernetes.
kube-proxy: A network proxy that runs on each node. It maintains network rules on nodes, allowing network communication to your Pods from inside or outside the cluster. It can perform simple TCP/UDP/SCTP stream forwarding or round-robin forwarding across a set of backend Pods.
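The round-robin mode mentioned above can be sketched as follows (illustrative only; modern kube-proxy usually programs iptables or IPVS rules in the kernel rather than proxying in user space, but the balancing idea is the same):

```go
package main

import "fmt"

// roundRobin cycles through a Service's backend Pod endpoints.
// Assumes a non-empty backend list; the addresses are made up.
type roundRobin struct {
	backends []string
	next     int
}

// pick returns the next backend in rotation.
func (r *roundRobin) pick() string {
	b := r.backends[r.next%len(r.backends)]
	r.next++
	return b
}

func main() {
	lb := &roundRobin{backends: []string{"10.0.0.1:80", "10.0.0.2:80"}}
	for i := 0; i < 3; i++ {
		fmt.Println("forward to:", lb.pick())
	}
}
```

Each connection is handed to the next Pod in turn, so load spreads evenly across the Service’s endpoints without any per-backend state.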
Container Runtime: The software responsible for running containers. Kubernetes supports various container runtimes through the Container Runtime Interface (CRI). Examples include:
containerd: A core container runtime that Docker is built on.
CRI-O: A lightweight runtime specifically for Kubernetes’ CRI.
Docker Engine: While Kubernetes no longer interfaces directly with the Docker daemon, containerd (which Docker itself uses) remains a widely used runtime.
This architectural breakdown illustrates how Kubernetes provides a robust, self-healing, and scalable platform for managing containerized applications.