The problem: everyone is fencing for themselves.
  • There are too many big data processing systems to understand them all.
    Hence, it's hard for users to choose the right system.
  • Each system encourages lock-in.
    Hence, once a workload is implemented for a system, it cannot easily be ported to another.

Many big data systems have been developed in recent years, but it is hard to decide which one is "best" for a particular workflow. In fact, sometimes a combination of systems performs best or uses resources most efficiently.

Few users can tell the precise trade-offs by intuition; and in addition, porting workflows between systems is tedious, making experimentation time consuming. Hence, users often become "locked into" a system after implementing workflows, despite faster or more efficient systems being (or becoming) available.

This problem is a consequence of the tight coupling between user-facing front-ends (e.g. Hive, Lindi, GraphLINQ) and back-end execution engines (e.g. MapReduce, Spark, PowerGraph, Naiad).

We have evaluated a range of contemporary data processing systems – Hadoop, Spark, Naiad, PowerGraph, Metis and GraphChi   under controlled conditions and found that their performance can vary widely depending on the workflow. Interestingly, no single system consistently outperforms the others, and almost every system performs best under some circumstances.

The solution: all for one, and one for all!
Schematic diagram of Musketeer.
  • Musketeer has a common intermediate representation.
    As a result, it can map from any front-end framework to any back-end execution engine.
  • Musketeer automatically maps workflows to data processing systems.
    By combining optimised templates for different data-flow operators, Musketeer automatically generates job implementations that are competitive with hand-optimized baselines.
  • Musketeer automatically chooses the right mappings.
    Moreover, Musketeer smartly estimates expected performance and automatically chooses a good back-end execution engine (or combination of engines) for a workflow.

With Musketeer, we make it easier for users to experiment with different systems by automatically translating their workflows. In fact, Musketeer can even take decision off their shoulders entirely and guess a suitable system automatically.

Users simply write their workflow in one of the support front-end languages and a smart scheduler chooses the best back-end execution system, or combination of systems, at execution time and translates the workflow appropriately.

Key features

Benefits of using Musketeer

  1. Dynamic systems binding: Musketeer chooses the right data processing system at runtime, depending on data volume, workflow iterativity and composition, reducing the runtime or resource utilisation.

  2. Dynamic job boundaries: Musketeer places job boundaries within a workflow at runtime, adding the opportunity to dynamically decompose the workflow differently depending on the data and the availability of different processing systems.

  3. Flexible adaptation: Musketeer's runtime mapping to data processing systems allows workflows that change over time to adapt dynamically: workflows can be partitioned and mapped differently if they change, or if the properties of data processing systems change (e.g. because of additional optimisations made in new versions of a system).

  4. Easy job migration: Musketeer makes it easy to port workflows to new systems. Once support for the new system has been added to Musketeer, it can translate workflows onto the new system and schedule them.

Example use case: PageRank

Twitter follower graph (42 million vertices, 1.4 billion edges).

PageRank is a common graph computation that is used to measure popularity of web pages, products or users. In this example, we implemented PageRank and ran five iterations on a Twitter follower graphs in order to compute each user's popularity.

We used five different data processing systems: Hadoop, Spark, PowerGraph, GraphLINQ on Naiad and GraphChi. For each system, we show the configuration that yielded the best results. Hadoop, Spark and Naiad ran on 100 EC2 m1.xlarge instances, PowerGraph on 16 instances and GraphChi on a single instance.

It is evident from the graph that the choice of data processing system has a major impact on performance and efficiency. For example, GraphLINQ on Naiad has the shortest runtime, but uses 100 EC2 instances. PowerGraph only uses 16, but takes only slightly longer (i.e. it is more resource-efficient). However, the absolute best resource efficiency comes from GraphChi, as it only uses a single EC2 instance, but takes less than 100x longer than GraphLINQ on 100 instances.

Musketeer (bold) in each setup manages to generate jobs that perform similarly to the best option. For example, Musketeer generates PowerGraph code when running on 16 instances, and it generates Naiad code on 100 instances.

Musketeer's automatically generated job code (found thatpcialls adds 5-20% overhead. This is despite the code being generated entirely automatically, with the user specifying merely a simple SQL-like or Gather-and-Scatter (GAS) workload description.