International Journal of Scientific & Technology Research



IJSTR, Volume 2, Issue 9, September 2013 Edition


Website: http://www.ijstr.org

ISSN 2277-8616

HDFS+: Erasure Coding Based Hadoop Distributed File System




Fredrick Romanus Ishengoma



Index Terms: Erasure coding, Hadoop, HDFS, I/O performance, node failure, replication, space efficiency.



Abstract: The Hadoop Distributed File System (HDFS) achieves high data reliability through a simple replication-based mechanism. However, replication-based mechanisms impose a high disk-storage requirement, since they copy full blocks regardless of storage size. Studies have shown that erasure coding, used as an alternative to replication, can provide better space efficiency and can also increase write throughput. To improve both the space efficiency and the I/O performance of HDFS while preserving the same level of data reliability, we propose HDFS+, an erasure-coding-based Hadoop Distributed File System. The proposed scheme writes a full block on the primary DataNode and then performs erasure coding with a Vandermonde-based Reed-Solomon algorithm that divides the data into m fragments and encodes them into n fragments (n > m), which are saved on n distinct DataNodes such that the original object can be reconstructed from any m fragments. Experimental results show that our scheme saves up to 33% of storage space while outperforming the original scheme in write performance by a factor of 1.4. Our scheme provides the same read performance as the original scheme, even under single-node or double-node failure, as long as the data can be read from the primary DataNode; otherwise, the read performance of HDFS+ decreases to some extent. However, we show that as the number of fragments increases, this performance degradation becomes negligible.
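The fragment-and-reconstruct idea in the abstract can be sketched in a few lines of Python. This is not the paper's Vandermonde-based Reed-Solomon code: for simplicity it uses a single XOR parity fragment (the n = m + 1 special case), which tolerates the loss of any one fragment and still illustrates the storage arithmetic of an (m, n) erasure code versus k-way replication. All function names and the example block are illustrative, not from the paper.

```python
# Toy (m, n) erasure-coding sketch with n = m + 1 (single XOR parity).
# NOT the paper's Reed-Solomon code; a general (m, n) code would tolerate
# the loss of any n - m fragments instead of just one.

def encode(block: bytes, m: int) -> list:
    """Split `block` into m equal data fragments plus one XOR parity
    fragment, for n = m + 1 fragments total (one per DataNode)."""
    size = -(-len(block) // m)  # fragment size, rounded up
    frags = [bytearray(block[i * size:(i + 1) * size].ljust(size, b"\0"))
             for i in range(m)]
    parity = bytearray(size)
    for frag in frags:
        for i, b in enumerate(frag):
            parity[i] ^= b
    return [bytes(f) for f in frags] + [bytes(parity)]

def reconstruct(frags: list) -> list:
    """Rebuild one missing fragment (marked None) by XOR-ing the
    survivors; any m of the n fragments suffice to recover the data."""
    missing = frags.index(None)
    size = len(next(f for f in frags if f is not None))
    rebuilt = bytearray(size)
    for f in frags:
        if f is not None:
            for i, b in enumerate(f):
                rebuilt[i] ^= b
    out = list(frags)
    out[missing] = bytes(rebuilt)
    return out

block = b"hello hadoop distributed fs!"   # 28 bytes, divides evenly by m
m = 4
frags = encode(block, m)                  # n = 5 fragments of 7 bytes each
frags[2] = None                           # simulate one DataNode failure
frags = reconstruct(frags)
assert b"".join(frags[:m]) == block       # data recovered from m fragments
# Storage cost here: n/m = 1.25x, versus 3x for HDFS's triple replication.
```

The storage comparison follows directly: k-way replication stores k full copies (k times the data size), while an (m, n) erasure code stores only n/m times the data size, at the cost of encoding work and fragment-gathering on degraded reads.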



[1] D. H.-C. Du, “Recent Advancements and Future Challenges of Storage Systems,” Proceedings of the IEEE, 96(11): 1875-1886, 2008.

[2] J. Gantz and D. Reinsel, “The Digital Universe Decade – Are You Ready?,” http://idcdocserv.com/925, 2010.

[3] The Apache Software Foundation,“Welcome to Apache Hadoop,” http://hadoop.apache.org.

[4] T. White, “Hadoop: The Definitive Guide (Third Edition),” O’Reilly & Associates Incorporated, 2012.

[5] J. Dean and S. Ghemawat, “MapReduce: Simplified data processing on large clusters,” Communications of the ACM 51(1), pp. 107–113, January 2008.

[6] The Apache Software Foundation, “Welcome to Hadoop Distributed File System,” http://hadoop.apache.org/hdfs.

[7] Konstantin Shvachko, Hairong Kuang, Sanjay Radia, Robert Chansler, “The Hadoop Distributed File System,” Proceedings of MSST 2010, May 2010.

[8] Bin Fan, Wittawat Tantisiriroj, Lin Xiao, Garth Gibson, “DiskReduce: Replication as a Prelude to Erasure Coding in Data-Intensive Scalable Computing,” CMU Parallel Data Laboratory Technical Report CMU-PDL-11-112, October 2011.

[9] James S. Plank, Jianqiang Luo, Catherine D. Schuman, Lihao Xu, Zooko Wilcox-O'Hearn, “A Performance Evaluation and Examination of Open Source Erasure Coding Libraries for Storage,” FAST 2009.

[10] Zhe Zhang, Amey Deshpande, Xiaosong Ma, Eno Thereska, and Dushyanth Narayanan, “Does Erasure Coding Have a Role to Play in My Data Center?,” Microsoft Research Technical Report MSR-TR-2010-52, May 2010.

[11] A. G. Dimakis, K. Ramchandran, Y. Wu, and C. Suh, “A Survey on Network Codes for Distributed Storage,” Proceedings of the IEEE, vol. 99, no. 3, pp. 476-489, March 2011.

[12] Yasushi Saito, Svend Frolund, Alistair Veitch, Arif Merchant, and Susan Spence, “FAB: Building Distributed Enterprise Disk Arrays from Commodity Components,” SIGOPS Oper. Syst. Rev., 38(5): 48-58, 2004.

[13] Rodrigo Rodrigues, Barbara Liskov, “High Availability in DHTs: Erasure Coding vs. Replication,” In Proceedings of the 4th International Workshop on Peer-to-Peer Systems, 2005.

[14] Andreas Haeberlen, Alan Mislove, Peter Druschel, “Glacier: Highly Durable, Decentralized Storage Despite Massive Correlated Failures,” In NSDI ’05: Proceedings of the 2nd Symposium on Networked Systems Design & Implementation, pages 143-158, Berkeley, CA, USA, 2005. USENIX Association.

[15] John Wilkes, Richard Golding, Carl Staelin, Tim Sullivan, “The HP AutoRAID Hierarchical Storage System,” ACM Trans. Comput. Syst., 14(1): 108-136, 1996.

[16] I. S. Reed and G. Solomon, “Polynomial Codes over Certain Finite Fields,” Journal of the Society for Industrial and Applied Mathematics, 8: 300-304, 1960.

[17] James Hendricks, Gregory R. Ganger, Michael K. Reiter, “Low-overhead Byzantine Fault-tolerant Storage”, In Proceedings of the Twenty-First ACM Symposium on Operating Systems Principles, Stevenson, WA, October 2007.

[18] Maheswaran Sathiamoorthy, Megasthenis Asteris, Dimitris Papailiopoulos, Alexandros G. Dimakis, Ramkumar Vadali, Scott Chen, “XORing Elephants: Novel Erasure Codes for Big Data,” in Proceedings of the VLDB Endowment (PVLDB), pp. 325-336, 2013.