Figure 2b (page 2) shows that the performance of different data processing systems varies significantly for an I/O-intensive join between two data sets.
For the asymmetric join, a simple serial C implementation is fastest, while a symmetric join is best executed in a distributed system with parallel I/O.
Under construction: the links in the description below are currently broken; we will fix this shortly when we make the first Musketeer release.
If you are interested in being notified when the data appears, please join our musketeer-announce mailing list.
Thanks for your patience.
-- The Musketeer team.
This experiment was executed on our small dedicated cluster of seven machines.
We used two different input data sets:
gen_join_symmetric.sh
) of 39M rows each.
We also tuned the following system-specific configuration parameters:
For Hive, we set the hive.input.format variable to HiveInputFormat instead of the default CombineHiveInputFormat class. The latter merges several input splits into one and hence reduces the total number of map tasks, which limits parallelism and, in this case, is detrimental. This change improved Hive's performance by over 5x.
For Spark, we set the parallelism parameter to 128 (other values performed worse).
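To illustrate why merging input splits can hurt an I/O-intensive job, the following is a minimal sketch of how the number of map tasks changes when splits are combined. It is not Hive's actual split-combining algorithm: the greedy packing rule, the function names (map_tasks_per_split, map_tasks_combined) and all sizes are assumptions chosen for illustration, not values from this experiment.

```python
def map_tasks_per_split(split_sizes_mb):
    """One map task per input split (the uncombined behaviour)."""
    return len(split_sizes_mb)


def map_tasks_combined(split_sizes_mb, max_combined_mb):
    """Greedily pack consecutive splits into combined splits.

    A combined split is closed as soon as adding the next split would
    exceed max_combined_mb. One map task runs per combined split.
    This is a simplified model, not Hive's real implementation.
    """
    tasks = 0
    current_mb = 0
    for size in split_sizes_mb:
        if current_mb > 0 and current_mb + size > max_combined_mb:
            tasks += 1
            current_mb = 0
        current_mb += size
    if current_mb > 0:
        tasks += 1
    return tasks


if __name__ == "__main__":
    # Hypothetical input: 64 splits of 64 MB each, combined up to 256 MB.
    splits = [64] * 64
    print("uncombined map tasks:", map_tasks_per_split(splits))
    print("combined map tasks:  ", map_tasks_combined(splits, 256))
```

With these illustrative numbers, combining cuts the number of map tasks by a factor of four, so fewer tasks read input in parallel. Whether that helps or hurts depends on the workload; for the join described here, the uncombined setting performed better.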
The raw results for this experiment are available here.
To plot Figure 2b, run the following command:
experiments/plotting_scripts$ python plot_join_barchart.py ../query_proc/data.csv HiveOptimized Hive MusketeerHadoop Hadoop Spark128Parallelism Spark Naiad Lindi MusketeerWildCherry "Serial C" join-motiv-makespan.pdf
The graph will be in join-motiv-makespan.pdf.