What is Firmament?
Firmament is a cluster manager, similar to Mesos or Kubernetes, but its key feature is a new approach to task scheduling. Firmament makes high-quality placement decisions at very low scheduling delay.
To make good decisions, Firmament models the scheduling problem as a minimum-cost optimisation over a flow network (as in Quincy). The optimisation considers all possible assignments at the same time, and thus yields optimal results for a given cost model.
Previous work considers the goals of optimality and fast decision time to be mutually incompatible; with Firmament, we show that they are not. Firmament makes scheduling decisions within hundreds of milliseconds even over tens of thousands of machines, and offers improved expressivity (in terms of scheduling policies supported) compared to other, previous schedulers.
What's it good for?
Most current cluster schedulers assume a homogeneous data centre environment: they assume that all placement choices are equally good. However, data centres are not as homogeneous as one might think: different machine types are mixed within the same cluster, and co-located tasks compete for resources, which leads to negative interference.
Firmament is built around the idea of making heterogeneity and interference between co-located tasks explicit. It actively monitors task performance and uses the information collected to make better decisions in the future. Having explicit information about the environment helps us to make optimal choices for communication, co-scheduling and workload partitioning, and yields superior performance on many common workloads.
How does it work?
Firmament models the scheduling problem as an optimisation over a flow network. This technique is itself not novel: the Quincy scheduler first proposed it. The picture below shows an example.
The goal of the optimisation is to route flow generated at tasks (circles) to the sink (hexagon). Depending on which resource nodes (rectangles) are crossed by the flow on the way there, scheduling decisions are made:
- If flow is routed through an unscheduled aggregator (Uj), the task remains unscheduled.
- If flow is routed through a machine node (Mm), the task is scheduled on machine Mm.
The flow network also contains edges for already running tasks (dotted) and tasks' locality preferences (dashed). Values next to the arcs correspond to the cost assigned to flow travelling through an arc. The goal of the optimisation is to minimise the total cost of the flow network.
However, Firmament extends the Quincy approach in several ways:
- It collects detailed resource utilisation and performance profiles from running tasks, measuring e.g. CPU and memory used, cache misses and a task's memory-intensitivity (accesses per instruction). This information is available to the scheduler when it makes decisions.
- It can run as a centralised or as a distributed scheduler, either in a hierarchical, overlapping or partitioned setup.
- It allows a variety of scheduling policies to be expressed using modular cost models on the flow graph.
- It represents the exact micro-architectural topology of each cluster machine in the flow graph and takes it into account when making scheduling decisions to avoid cross-task interference.
Below screenshots shows the Firmament web interface displaying the state of a Firmament a scheduler node, along with the resource utilization on a machine, and various topology views.
|Resource monitor||Simple machine topology view||Cluster topology view (4 machines)|
The code for Firmament is on Github.