Technical reports
Recomputation-based data reliability for MapReduce using lineage
Sherif Akoush, Ripduman Sohan, Andy Hopper
May 2016, 19 pages
DOI: 10.48456/tr-888
Abstract
Ensuring block-level reliability of MapReduce datasets is expensive due to the spatial overheads of replicating or erasure coding data. As the amount of data processed with MapReduce continues to increase, this cost will increase proportionally. In this paper we introduce Recomputation-Based Reliability in MapReduce (RMR), a system for mitigating the cost of maintaining reliable MapReduce datasets. RMR leverages record-level lineage of the relationships between input and output records in the job for the purposes of supporting block-level recovery. We show that collecting this lineage imposes low temporal overhead. We further show that the collected lineage is a fraction of the size of the output dataset for many MapReduce jobs. Finally, we show that lineage can be used to deterministically reproduce any block in the output. We quantitatively demonstrate that, by ensuring the reliability of the lineage rather than the output, we can achieve data reliability guarantees with a small storage requirement.
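The recovery idea in the abstract can be illustrated with a toy, in-memory sketch. This is not the RMR implementation: the word-count job, the `block_of` partitioner, and every function name below are hypothetical and chosen only for illustration. The sketch assumes a deterministic job. It records, for each output record, which input records contributed to it, then rebuilds one lost output block by re-running the job on only those inputs.

```python
from collections import defaultdict


def map_fn(record_id, line):
    # Hypothetical word-count mapper: emits (word, 1) for each token.
    for word in line.split():
        yield word, 1


def reduce_fn(key, values):
    # Hypothetical reducer: sums the counts for one key.
    return key, sum(values)


def block_of(key, num_blocks):
    # Assumed deterministic partitioner. Python's built-in hash() is
    # randomized for strings, so a stable byte sum is used instead.
    return sum(key.encode("utf-8")) % num_blocks


def run_job(inputs, keys=None):
    """Run the toy job over `inputs` (record_id -> line).

    Returns (output, lineage). `lineage` maps each output key to the sorted
    list of input record ids that contributed to it. If `keys` is given,
    only those output keys are produced.
    """
    groups = defaultdict(list)
    lineage = defaultdict(set)
    for record_id in sorted(inputs):
        for key, value in map_fn(record_id, inputs[record_id]):
            if keys is not None and key not in keys:
                continue
            groups[key].append(value)
            lineage[key].add(record_id)
    output = dict(reduce_fn(k, groups[k]) for k in sorted(groups))
    return output, {k: sorted(ids) for k, ids in lineage.items()}


def recover_block(inputs, lineage, block_id, num_blocks):
    """Rebuild one output block from the stored lineage and the inputs."""
    # Output keys that belong to the lost block.
    keys = {k for k in lineage if block_of(k, num_blocks) == block_id}
    # Only the input records named in the lineage need to be re-read.
    needed = {rid for k in keys for rid in lineage[k]}
    subset = {rid: inputs[rid] for rid in needed}
    output, _ = run_job(subset, keys=keys)
    return output
```

In this sketch only the lineage and the input dataset need to be kept reliable: calling `recover_block(inputs, lineage, block_id, num_blocks)` after a block is lost returns the same records the original run placed in that block, provided the map and reduce functions are deterministic.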
Full text
PDF (0.5 MB)
BibTeX record
@TechReport{UCAM-CL-TR-888,
  author =      {Akoush, Sherif and Sohan, Ripduman and Hopper, Andy},
  title =       {{Recomputation-based data reliability for MapReduce using lineage}},
  year =        2016,
  month =       may,
  url =         {https://www.cl.cam.ac.uk/techreports/UCAM-CL-TR-888.pdf},
  institution = {University of Cambridge, Computer Laboratory},
  doi =         {10.48456/tr-888},
  number =      {UCAM-CL-TR-888}
}