Firmament is a new cluster scheduler for warehouse-scale computers. It makes high-quality placement decisions at low scheduling delay.
To make good placement decisions, Firmament models the scheduling problem as a minimum-cost optimisation over a flow network (as in Quincy). The optimisation considers all possible assignments at the same time, and thus yields optimal results for a given cost model.
Previous work considers the goals of optimality and fast decision time to be mutually incompatible; with Firmament, I show that they are not. Firmament makes scheduling decisions within seconds even over tens of thousands of machines, while offering the same expressivity in terms of scheduling policies as other previous schedulers.
What's it good for?
Most current data centre schedulers assume a homogeneous data centre environments: they assume that all placement choices are equally good. However, data centres are not as homogeneous as one might think: different machine types are mixed within the same cluster, and co-located tasks compete for resources, which leads to negative interference.
Firmament is built around the idea of making heterogeneity explicit, actively monitors task performance and uses the information collected to make better decisions in the future. Having explicit information about the environment helps us to make optimal choices for communication, co-scheduling and workload partitioning, and yields superior performance on many common workloads.
Firmament is a CamSaS project, and more information is available on the CamSaS page.
How does it work?
Firmament models the scheduling problem as an optimisation over a flow network. This technique is itself not novel: the Quincy scheduler first proposed it. The picture below shows an example.
The goal of the optimisation is to route flow generated at tasks (circles) to the sink (hexagon). Depending on which resource nodes (rectangles) are crossed by the flow on the way there, scheduling decisions are made:
- If flow is routed through an unscheduled aggregator (Uj), the task remains unscheduled.
- If flow is routed through a machine node (Mm), the task is scheduled on machine Mm.
The flow network also contains edges for already running tasks (dotted) and tasks' locality preferences (dashed). Values next to the arcs correspond to the cost assigned to flow travelling through an arc. The goal of the optimisation is to minimise the total cost of the flow network.
Firmament extends the Quincy approach in several ways:
- It collects detailed resource utilisation and performance profiles from running tasks, measuring e.g. CPU and memory used, cache misses and a task's memory-intensitivity (accesses per instruction). This information is available to the scheduler when it makes decisions.
- It can run as a centralised or as a distributed scheduler, either in a hierarchical, overlapping or partitioned setup.
- It allows a variety of scheduling policies to be expressed using modular cost models on the flow graph.
- It represents the exact micro-architectural topology of each cluster machine in the flow graph and takes it into account when making scheduling decisions to avoid cross-task interference.
Below screenshots shows the Firmament web interface displaying the state of a Firmament a scheduler node, along with the resource utilization on a machine, and various topology views.
|Resource monitor||Simple machine topology view||Cluster topology view (4 machines)|
The code for Firmament is on Github.