Many big data systems have been developed in recent years, but it is hard to decide which one is "best" for a particular workflow. In fact, sometimes a combination of systems performs best or uses resources most efficiently.
Few users can judge the precise trade-offs by intuition, and porting workflows between systems is tedious, which makes experimentation time-consuming. Hence, users often become "locked into" a system once they have implemented their workflows, even when faster or more efficient systems are (or become) available.
This problem is a consequence of the tight coupling between user-facing front-ends (e.g. Hive, Lindi, GraphLINQ) and back-end execution engines (e.g. MapReduce, Spark, PowerGraph, Naiad).
We have evaluated a range of contemporary data processing systems (Hadoop, Spark, Naiad, PowerGraph, Metis and GraphChi) under controlled conditions and found that their performance can vary widely depending on the workflow. Interestingly, no single system consistently outperforms the others, and almost every system performs best under some circumstances.
With Musketeer, we make it easier for users to experiment with different systems by automatically translating their workflows. In fact, Musketeer can even take the decision off their shoulders entirely and select a suitable system automatically.
Users simply write their workflow in one of the supported front-end languages. A smart scheduler then chooses the best back-end execution system, or combination of systems, at execution time and translates the workflow appropriately.
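The selection step described above can be pictured as a small dispatch: the workflow is written once, and a scheduler compares estimated costs across the available back-ends before any code is generated. The Python sketch below illustrates only this idea. The `Backend` class, the `choose_backend` function, the back-end names and the cost functions are all hypothetical, and the numbers are invented for illustration; none of this reflects Musketeer's actual cost model or API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Backend:
    """Hypothetical description of a back-end execution engine."""
    name: str
    # Hypothetical estimator: maps workflow properties to an estimated cost.
    estimate_cost: Callable[[Dict[str, float]], float]


def choose_backend(workflow: Dict[str, float], backends: List[Backend]) -> Backend:
    """Return the back-end with the lowest estimated cost for this workflow."""
    if not backends:
        raise ValueError("at least one back-end is required")
    return min(backends, key=lambda b: b.estimate_cost(workflow))


# Made-up cost functions (assumptions, not measurements): one engine is cheap
# to start but slow per iteration, the other has the opposite profile.
backends = [
    Backend("general_purpose", lambda w: 10.0 + 2.0 * w["iterations"]),
    Backend("graph_specialised", lambda w: 15.0 + 0.5 * w["iterations"]),
]

one_pass = choose_backend({"iterations": 1}, backends)
iterative = choose_backend({"iterations": 5}, backends)
print(one_pass.name, iterative.name)
```

With these invented estimates, the single-pass workflow is sent to one engine and the iterative workflow to the other, which mirrors the observation that no single system is best for every workflow.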
PageRank is a common graph computation used to measure the popularity of web pages, products or users. In this example, we implemented PageRank and ran five iterations on a Twitter follower graph in order to compute each user's popularity.
We used five different data processing systems: Hadoop, Spark, PowerGraph, GraphLINQ on Naiad and GraphChi. For each system, we show the configuration that yielded the best results. Hadoop, Spark and Naiad ran on 100 EC2 m1.xlarge instances, PowerGraph on 16 instances and GraphChi on a single instance.
It is evident from the graph that the choice of data processing system has a major impact on performance and efficiency. For example, GraphLINQ on Naiad has the shortest runtime, but uses 100 EC2 instances. PowerGraph uses only 16 instances and takes only slightly longer, so it is more resource-efficient. However, the best resource efficiency overall comes from GraphChi: it uses a single EC2 instance, yet takes less than 100x longer than GraphLINQ on 100 instances.
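The resource-efficiency comparison above comes down to simple arithmetic: total resource usage is the number of instances multiplied by the runtime, so a single instance wins on efficiency whenever its slowdown relative to a cluster is smaller than the cluster size. The sketch below works this through with the cluster runtime normalised to 1. The function names and the example slowdown factors are assumptions chosen for illustration, not measured results.

```python
def instance_time(instances: int, runtime: float) -> float:
    """Total resource usage: number of instances multiplied by runtime."""
    if instances <= 0 or runtime < 0:
        raise ValueError("instances must be positive and runtime non-negative")
    return instances * runtime


def single_instance_is_more_efficient(slowdown: float, cluster_size: int) -> bool:
    """Compare one instance against a cluster whose runtime is normalised to 1.

    `slowdown` is how many times longer the single instance takes than the
    cluster (a hypothetical input, not a measurement).
    """
    cluster_cost = instance_time(cluster_size, 1.0)
    single_cost = instance_time(1, slowdown)
    return single_cost < cluster_cost


# Hypothetical slowdown factors against a 100-instance cluster.
print(single_instance_is_more_efficient(60.0, 100))   # True: 60 < 100
print(single_instance_is_more_efficient(150.0, 100))  # False: 150 > 100
```

The same calculation applies to any pair of configurations: whichever has the smaller product of instances and runtime is the more resource-efficient one, even if it is not the fastest.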
In each setup, Musketeer (bold) generates jobs that perform similarly to the best option. For example, Musketeer generates PowerGraph code when running on 16 instances, and Naiad code when running on 100 instances.
We found that Musketeer's automatically generated job code typically adds 5-20% overhead. This is despite the code being generated entirely automatically, with the user specifying merely a simple SQL-like or Gather-and-Scatter (GAS) workload description.