Figure 2a (page 2) shows that different data processing systems see wildly different performance even on a simple string processing workload.
At small data sizes, low-overhead single-machine systems (e.g. Metis) perform best, while at larger scale, the parallel I/O of distributed systems offers benefit over their overheads (although some distributed approaches, e.g. Lindi, scale less well than others).
Under construction: the links in the description below are currently broken; we will fix this shortly when we make the first Musketeer release.
If you are interested in being notified when the data appears,
please join our
musketeer-announce
mailing list.
Thanks for your patience.
-- The Musketeer team.
This experiment was executed on our small dedicated cluster of seven machines.
The input data set was a two-column ASCII table, generated using gen_project.sh
,
with data sizes varying between 128 MB and 32 GB.
We also tuned the following system-specific configuration parameters:
hive.input.format
variable to HiveInputFormat
instead of the default CombinedHiveInputFormat
class. The latter merges several input splits and hence reduces the total number of map tasks, which limits parallelism and, in this case, is detrimental. This change improved Hive's performance by over 5x.
The raw results for this experiment are available here.
To plot Figure 2a, run the following command:
experiments/plotting_scripts$ python plot_project_linechart.py ../op_benchmarks/stat/combined_project.stat project-motiv-makespan
The graph will be in project-motiv-makespan.pdf
.