Figure 2b (page 2) shows that the performance of different data processing systems varies significantly for an I/O-intensive join between two data sets.

For the asymmetric join, a simple serial C implementation is fastest, while a symmetric join is best executed in a distributed system with parallel I/O.

Figure 2b

Under construction: the links in the description below are currently broken; we will fix this shortly when we make the first Musketeer release.

If you are interested in being notified when the data appears, please join our musketeer-announce mailing list.

Thanks for your patience.

-- The Musketeer team.

Experimental setup

This experiment was executed on our small dedicated cluster of seven machines.

We used two different input data sets:

  1. Asymmetric: the 4.8M vertices joined against the 69M edges of the LiveJournal graph.
  2. Symmetric: two uniformly randomly generated sets of integers (generated using of 39M rows each.

We also tuned the following system-specific configuration parameters:

Result data set

The raw results for this experiment are available here.

To plot Figure 2b, run the following command:

experiments/plotting_scripts$ python ../query_proc/data.csv HiveOptimized Hive MusketeerHadoop Hadoop Spark128Parallelism Spark Naiad Lindi MusketeerWildCherry "Serial C" join-motiv-makespan.pdf

The graph will be in join-motiv-makespan.pdf.