Principal Data Scientist, Practice Lead for Data Science, AI, Big Data Technologies at Teradata. O’Reilly author on distributed computing and machine learning. ​

Natalino leads the definition, design and implementation of data-driven financial and telecom applications. He has previously served as Enterprise Data Architect at ING in the Netherlands, focusing on fraud prevention/detection, SoC, cybersecurity, customer experience, and core banking processes.

​Prior to that, he had worked as senior researcher at Philips Research Laboratories in the Netherlands, on the topics of system-on-a-chip architectures, distributed computing and compilers. All-round Technology Manager, Product Developer, and Innovator with 15+ years track record in research, development and management of distributed architectures, scalable services and data-driven applications.

Thursday, May 30, 2013

Small files on Hadoop

The problem

The hadoop framework (hdsf and map-reduce)  are oriented towards big files. Big files have good data locality, and use the resources of the jmv's running map reduce jobs more efficiently. No to forget that setup times and shuffling and sorting would work better for larger, chunky files than for small files. Furthermore, the indexing provided by the hdfs namenode are not optimized for small files. Since the metadata associated with a hdfs file can sum up to several 100's bytes, storing small file will fill up the memory of the name nodes pretty quickly. So how to deal with a large collection (say a billion) of small files such as audio mp3 files or emails?

Some solutions 

Sequence Files
Sequence files is a Hadoop specific archive file format similar to tar and zip. The concept behind this is to merge the file set with using a key and a value pair and this created files known as ‘Hadoop Sequence Files’. In this method file name is used as the key and the file content is used as value.

Append using Flume
If this files are generated by some other agent, try to append them using flume and use flume settingsin order to aggregarte the files as a single file. Note that it's not always possible to append. For instance if you have a collection of mp3 files you cannot simply append them back-to-back.

HDFS grooming
Similarly to the previous setup, you could use agents to periodically groom on the hdfs and replace small files by larger ones. Since this process will delete the original files, you could in this way keep the pressure of the file count on the hdfs under control.

HAR files
This structure has lowered down the pressure on the memory of namenode being transparent to any application accessing original files by reducing the files at HDFS. But this may be slower than reading HDFS as each HAR file access requires two ‘index’ file reads as well as the data file read.

HBase stores data in MapFiles (indexed Sequence Files) and therefore it is a good choice when it is need to do MapReduce style streaming analyses with the occasional random look up. For small files problems where a large number of files are generated, a different type of storage such as HBase is much appropriate depending on the access pattern.