Apache > Hadoop > Core
 

Hadoop Archives

What are Hadoop archives?

Hadoop archives are special format archives. A Hadoop archive maps to a file system directory. A Hadoop archive always has a *.har extension. A Hadoop archive directory contains metadata (in the form of _index and _masterindex) and data (part-*) files. The _index file contains the name of the files that are part of the archive and the location within the part files.

How to create an archive?

Usage: hadoop archive -archiveName name <src>* <dest>

-archiveName is the name of the archive you would like to create. An example would be foo.har. The name should have a *.har extension. The inputs are file system pathnames which work as usual with regular expressions. The destination directory would contain the archive. Note that this is a Map/Reduce job that creates the archives. You would need a map reduce cluster to run this. The following is an example:

hadoop archive -archiveName foo.har /user/hadoop/dir1 /user/hadoop/dir2 /user/zoo/

In the above example /user/hadoop/dir1 and /user/hadoop/dir2 will be archived in the following file system directory -- /user/zoo/foo.har. The sources are not changed or removed when an archive is created.

How to look up files in archives?

The archive exposes itself as a file system layer. So all the fs shell commands in the archives work but with a different URI. Also, note that archives are immutable. So, rename's, deletes and creates return an error. URI for Hadoop Archives is

har://scheme-hostname:port/archivepath/fileinarchive

If no scheme is provided it assumes the underlying filesystem. In that case the URI would look like

har:///archivepath/fileinarchive

Here is an example of archive. The input to the archives is /dir. The directory dir contains files filea, fileb. To archive /dir to /user/hadoop/foo.har, the command is

hadoop archive -archiveName foo.har /dir /user/hadoop

To get file listing for files in the created archive

hadoop dfs -lsr har:///user/hadoop/foo.har

To cat filea in archive -

hadoop dfs -cat har:///user/hadoop/foo.har/dir/filea