# Copyright 2008 The Apache Software Foundation Licensed under the
# Apache License, Version 2.0 (the "License"); you may not use this
# file except in compliance with the License. You may obtain a copy
# of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless
# required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
# implied. See the License for the specific language governing
# permissions and limitations under the License.
This package implements a Distributed Raid File System. It is used alongwith
an instance of the Hadoop Distributed File System (HDFS). It can be used to
provide better protection against data corruption. It can also be used to
reduce the total storage requirements of HDFS.
Distributed Raid File System consists of two main software components. The first component
is the RaidNode, a daemon that creates parity files from specified HDFS files.
The second component "raidfs" is a software that is layered over a HDFS client and it
intercepts all calls that an application makes to the HDFS client. If HDFS encounters
corrupted data while reading a file, the raidfs client detects it; it uses the
relevant parity blocks to recover the corrupted data (if possible) and returns
the data to the application. The application is completely transparent to the
fact that parity data was used to satisfy it's read request.
The primary use of this feature is to save disk space for HDFS files.
HDFS typically stores data in triplicate.
The Distributed Raid File System can be configured in such a way that a set of
data blocks of a file are combined together to form one or more parity blocks.
This allows one to reduce the replication factor of a HDFS file from 3 to 2
while keeping the failure probabilty relatively same as before. This typically
results in saving 25% to 30% of storage space in a HDFS cluster.
--------------------------------------------------------------------------------
BUILDING:
In HADOOP_PREFIX, run ant package to build Hadoop and its contrib packages.
--------------------------------------------------------------------------------
INSTALLING and CONFIGURING:
The entire code is packaged in the form of a single jar file hadoop-*-raid.jar.
To use HDFS Raid, you need to put the above mentioned jar file on
the CLASSPATH. The easiest way is to copy the hadoop-*-raid.jar
from HADOOP_PREFIX/build/contrib/raid to HADOOP_PREFIX/lib. Alternatively
you can modify HADOOP_CLASSPATH to include this jar, in conf/hadoop-env.sh.
There is a single configuration file named raid.xml that describes the HDFS
path(s) that you want to raid. A sample of this file can be found in
sc/contrib/raid/conf/raid.xml. Please edit the entries in this file to list the
path(s) that you want to raid. Then, edit the hdfs-site.xml file for
your installation to include a reference to this raid.xml. You can add the
following to your hdfs-site.xml
raid.config.file
/mnt/hdfs/DFS/conf/raid.xml
This is needed by the RaidNode
Please add an entry to your hdfs-site.xml to enable hdfs clients to use the
parity bits to recover corrupted data.
fs.hdfs.impl
org.apache.hadoop.dfs.DistributedRaidFileSystem
The FileSystem for hdfs: uris.
--------------------------------------------------------------------------------
OPTIONAL CONFIGIURATION:
The following properties can be set in hdfs-site.xml to further tune you configuration:
Specifies the location where parity files are located.
hdfs.raid.locations
hdfs://newdfs.data:8000/raid
The location for parity files. If this is
is not defined, then defaults to /raid.
Specify the parity stripe length
hdfs.raid.stripeLength
10
The number of blocks in a file to be combined into
a single raid parity block. The default value is 5. The lower
the number the greater is the disk space you will save when you
enable raid.
Specify the size of HAR part-files
raid.har.partfile.size
4294967296
The size of HAR part files that store raid parity
files. The default is 4GB. The higher the number the fewer the
number of files used to store the HAR archive.
Specify which implementation of RaidNode to use.
raid.classname
org.apache.hadoop.raid.DistRaidNode
Specify which implementation of RaidNode to use
(class name).
Specify the periodicy at which the RaidNode re-calculates (if necessary)
the parity blocks
raid.policy.rescan.interval
5000
Specify the periodicity in milliseconds after which
all source paths are rescanned and parity blocks recomputed if
necessary. By default, this value is 1 hour.
By default, the DistributedRaidFileSystem assumes that the underlying file
system is the DistributedFileSystem. If you want to layer the DistributedRaidFileSystem
over some other file system, then define a property named fs.raid.underlyingfs.impl
that specifies the name of the underlying class. For example, if you want to layer
The DistributedRaidFileSystem over an instance of the NewFileSystem, then
fs.raid.underlyingfs.impl
org.apche.hadoop.new.NewFileSystem
Specify the filesystem that is layered immediately below the
DistributedRaidFileSystem. By default, this value is DistributedFileSystem.
--------------------------------------------------------------------------------
ADMINISTRATION:
The Distributed Raid File System provides support for administration at runtime without
any downtime to cluster services. It is possible to add/delete new paths to be raided without
interrupting any load on the cluster. If you change raid.xml, its contents will be
reload within seconds and the new contents will take effect immediately.
Designate one machine in your cluster to run the RaidNode software. You can run this daemon
on any machine irrespective of whether that machine is running any other hadoop daemon or not.
You can start the RaidNode by running the following on the selected machine:
nohup $HADOOP_PREFIX/bin/hadoop org.apache.hadoop.raid.RaidNode >> /xxx/logs/hadoop-root-raidnode-hadoop.xxx.com.log &
Optionally, we provide two scripts to start and stop the RaidNode. Copy the scripts
start-raidnode.sh and stop-raidnode.sh to the directory $HADOOP_PREFIX/bin in the machine
you would like to deploy the daemon. You can start or stop the RaidNode by directly
callying the scripts from that machine. If you want to deploy the RaidNode remotely,
copy start-raidnode-remote.sh and stop-raidnode-remote.sh to $HADOOP_PREFIX/bin at
the machine from which you want to trigger the remote deployment and create a text
file $HADOOP_PREFIX/conf/raidnode at the same machine containing the name of the server
where the RaidNode should run. These scripts run ssh to the specified machine and
invoke start/stop-raidnode.sh there. As an example, you might want to change
start-mapred.sh in the JobTracker machine so that it automatically calls
start-raidnode-remote.sh (and do the equivalent thing for stop-mapred.sh and
stop-raidnode-remote.sh).
To validate the integrity of a file system, run RaidFSCK as follows:
$HADOOP_PREFIX/bin/hadoop org.apache.hadoop.raid.RaidShell -fsck [path]
This will print a list of corrupt files (i.e., files which have lost too many
blocks and can no longer be fixed by Raid).
--------------------------------------------------------------------------------
IMPLEMENTATION:
The RaidNode periodically scans all the specified paths in the configuration
file. For each path, it recursively scans all files that have more than 2 blocks
and that has not been modified during the last few hours (default is 24 hours).
It picks the specified number of blocks (as specified by the stripe size),
from the file, generates a parity block by combining them and
stores the results as another HDFS file in the specified destination
directory. There is a one-to-one mapping between a HDFS
file and its parity file. The RaidNode also periodically finds parity files
that are orphaned and deletes them.
The Distributed Raid FileSystem is layered over a DistributedFileSystem
instance intercepts all calls that go into HDFS. HDFS throws a ChecksumException
or a BlocMissingException when a file read encounters bad data. The layered
Distributed Raid FileSystem catches these exceptions, locates the corresponding
parity file, extract the original data from the parity files and feeds the
extracted data back to the application in a completely transparent way.
The layered Distributed Raid FileSystem does not fix the data-loss that it
encounters while serving data. It merely make the application transparently
use the parity blocks to re-create the original data. A command line tool
"fsckraid" is currently under development that will fix the corrupted files
by extracting the data from the associated parity files. An adminstrator
can run "fsckraid" manually as and when needed.