Computer Laboratory

Technical reports

Recomputation-based data reliability for MapReduce using lineage

Sherif Akoush, Ripduman Sohan, Andy Hopper

May 2016, 19 pages

Abstract

Ensuring block-level reliability of MapReduce datasets is expensive because replicating or erasure coding data incurs spatial overhead. As the amount of data processed with MapReduce continues to increase, this cost will increase proportionally. In this paper we introduce Recomputation-Based Reliability in MapReduce (RMR), a system for mitigating the cost of maintaining reliable MapReduce datasets. RMR captures record-level lineage, the relationships between the input and output records of a job, and uses it to support block-level recovery. We show that collecting this lineage imposes low temporal overhead. We further show that, for many MapReduce jobs, the collected lineage is a fraction of the size of the output dataset. Finally, we show that lineage can be used to deterministically reproduce any block in the output. We quantitatively demonstrate that, by ensuring the reliability of the lineage rather than the output, we can achieve data reliability guarantees with a small storage requirement.
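The idea of recovering an output block from lineage can be illustrated with a toy example. The sketch below is not RMR's implementation; it is a minimal in-memory word-count job, and every name in it (`run_job`, `recover_block`, `partition`, the lineage layout) is a hypothetical choice made for illustration. It assumes a deterministic map function, reduce function and partitioner, records which input records contributed to each output block, and then recomputes one block from only those records.

```python
from collections import defaultdict


def map_fn(record):
    # Emit (word, 1) for every word in one input line.
    return [(word, 1) for word in record.split()]


def reduce_fn(key, values):
    # Sum the counts for one key.
    return key, sum(values)


def partition(key, num_blocks):
    # Deterministic partitioner (assumption: a byte sum is enough for a toy;
    # Python's built-in hash() is salted per process for strings).
    return sum(key.encode()) % num_blocks


def run_job(inputs, num_blocks):
    """Run the toy job and return (output_blocks, lineage).

    lineage[b] lists the ids of the input records that contributed
    at least one key to output block b.
    """
    grouped = defaultdict(list)
    lineage = defaultdict(set)
    for rec_id, record in enumerate(inputs):
        for key, value in map_fn(record):
            grouped[key].append(value)
            lineage[partition(key, num_blocks)].add(rec_id)

    blocks = {b: {} for b in range(num_blocks)}
    for key in sorted(grouped):
        k, v = reduce_fn(key, grouped[key])
        blocks[partition(k, num_blocks)][k] = v
    return blocks, {b: sorted(ids) for b, ids in lineage.items()}


def recover_block(inputs, lineage, block_id, num_blocks):
    """Recompute one lost output block from the inputs named in its lineage."""
    grouped = defaultdict(list)
    for rec_id in lineage.get(block_id, []):
        for key, value in map_fn(inputs[rec_id]):
            # Keep only the keys that belong to the block being rebuilt.
            if partition(key, num_blocks) == block_id:
                grouped[key].append(value)
    return dict(reduce_fn(k, grouped[k]) for k in sorted(grouped))


if __name__ == "__main__":
    data = ["a b a", "c d", "b b e"]
    blocks, lineage = run_job(data, num_blocks=3)
    lost = 1
    rebuilt = recover_block(data, lineage, lost, num_blocks=3)
    print(rebuilt == blocks[lost])
```

In this sketch only the lineage map would need to be stored reliably; the output blocks themselves can be rebuilt on demand, provided the inputs are still available and the job is deterministic.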

Full text

PDF (0.5 MB)

BibTeX record

@TechReport{UCAM-CL-TR-888,
  author =      {Akoush, Sherif and Sohan, Ripduman and Hopper, Andy},
  title =       {{Recomputation-based data reliability for MapReduce using
                  lineage}},
  year =        2016,
  month =       may,
  url =         {http://www.cl.cam.ac.uk/techreports/UCAM-CL-TR-888.pdf},
  institution = {University of Cambridge, Computer Laboratory},
  number =      {UCAM-CL-TR-888}
}