Apache Hadoop 0.22.0 Release Notes

These release notes cover new developer and user-facing incompatibilities, important issues, features, and major improvements.


Option webinterface.private.actions has been renamed to mapreduce.jobtracker.webinterface.trusted and should be specified in mapred-site.xml instead of core-site.xml


When Hadoop’s Kerberos integration is enabled, it is now required that either {{kinit}} be on the path for user accounts running the Hadoop client, or that the {{hadoop.kerberos.kinit.command}} configuration option be manually set to the absolute path to {{kinit}}.
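
As a quick pre-flight check, the sketch below searches a PATH-style string for an executable named kinit, roughly as a shell does when resolving a bare command name. The class and method names are illustrative only and are not part of Hadoop.

```java
import java.io.File;
import java.util.Optional;

// Illustrative helper (not a Hadoop class): looks for an executable on a PATH-style string.
class KinitLocator {
    static Optional<File> findOnPath(String command, String pathValue) {
        if (pathValue == null || pathValue.isEmpty()) {
            return Optional.empty();
        }
        for (String dir : pathValue.split(File.pathSeparator)) {
            if (dir.isEmpty()) {
                continue; // skip empty entries rather than treating them as the current directory
            }
            File candidate = new File(dir, command);
            if (candidate.isFile() && candidate.canExecute()) {
                return Optional.of(candidate);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        Optional<File> kinit = findOnPath("kinit", System.getenv("PATH"));
        System.out.println(kinit
            .map(f -> "Found " + f.getAbsolutePath())
            .orElse("kinit not on PATH; set hadoop.kerberos.kinit.command to its absolute path"));
    }
}
```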


Updated the help for the touchz command.


Updated the web documentation to reflect the formatting abilities of ‘fs -stat’.


Adds a new configuration option, hadoop.work.around.non.threadsafe.getpwuid, which can be used to enable a mutex around calls to getpwuid_r to work around thread-unsafe implementations. Users should consult http://wiki.apache.org/hadoop/KnownBrokenPwuidImplementations for a list of such systems.


Removed contrib related build targets.


Removed references to the older fs.checkpoint.* properties that resided in core-site.xml


Updates hadoop-config.sh to always resolve symlinks when determining HADOOP_HOME. Bash built-ins or POSIX:2001 compliant commands are now required.


This patch has changed the serialization format of BlockLocation.


N/A


Increments the RPC protocol version in org.apache.hadoop.ipc.Server from 4 to 5. Introduces ArrayPrimitiveWritable for a much more efficient wire format to transmit arrays of primitives over RPC. ObjectWritable uses the new writable for array of primitives for RPC and continues to use existing format for on-disk data.
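
To see why writing the element type once per array is cheaper than tagging every element, consider the simplified sketch below. It is not the actual Hadoop wire format; the per-element tagging scheme, the class name, and the method names are assumptions made purely for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Simplified illustration only; this is NOT the actual Hadoop wire format.
class ArrayWireFormatSketch {
    // Per-element tagging (assumed scheme): every element carries its own type name.
    static byte[] encodePerElement(int[] values) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeInt(values.length);
            for (int v : values) {
                out.writeUTF("int"); // 2-byte length prefix + 3 bytes of text
                out.writeInt(v);     // 4-byte payload
            }
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Compact form: the component type and the length are written once.
    static byte[] encodeCompact(int[] values) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeUTF("int");
            out.writeInt(values.length);
            for (int v : values) {
                out.writeInt(v);
            }
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

In this sketch, an array of 1000 ints takes 9004 bytes with per-element tags and 4009 bytes in the compact form.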


Makes AccessControlList a writable and updates documentation for Job ACLs.


WARNING: No release note provided for this incompatible change.


Processing of concatenated gzip files formerly stopped (quietly) at the end of the first substream/“member”; now processing will continue to the end of the concatenated stream, like gzip(1) does. (bzip2 support is unaffected by this patch.)
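
The notion of gzip "members" can be illustrated with the JDK's own java.util.zip classes, which also read across member boundaries. This sketch does not use Hadoop's codec classes; the class and method names are illustrative.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// JDK-only illustration of gzip members; it does not use Hadoop's codec classes.
class ConcatenatedGzipSketch {
    // Compresses text into one complete gzip member (header, data, trailer).
    static byte[] gzipMember(String text) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (GZIPOutputStream gz = new GZIPOutputStream(bytes)) {
                gz.write(text.getBytes(StandardCharsets.UTF_8));
            }
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Appends one member directly after another, as `cat a.gz b.gz` would.
    static byte[] concat(byte[] first, byte[] second) {
        byte[] joined = new byte[first.length + second.length];
        System.arraycopy(first, 0, joined, 0, first.length);
        System.arraycopy(second, 0, joined, first.length, second.length);
        return joined;
    }

    // Decompresses until end of stream, continuing across member boundaries.
    static String decompressAll(byte[] data) {
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(data))) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A reader that stopped at the end of the first member would return only the first string; a reader that continues returns both, which is the behavior this change adopts.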


WARNING: No release note provided for this incompatible change.


WARNING: No release note provided for this incompatible change.


A new metric “login” of type MetricTimeVaryingRate has been added under the new metrics context name “ugi” and metrics record name “ugi”.


Improve the buffer utilization of ZlibCompressor to avoid invoking a JNI per write request.


Fix EOF exception in BlockDecompressorStream when decompressing a previously compressed empty file.


Split existing RpcMetrics into RpcMetrics and RpcDetailedMetrics. The new RpcDetailedMetrics has per method usage details and is available under context name “rpc” and record name “detailed-metrics”


The native build, when run from trunk, now requires autotools, libtool and openssl dev libraries.


The Trash feature now notifies the user of an over-quota condition rather than silently deleting files/directories; deletion can be forced with “rm -skipTrash”.


Support for reporting metrics to Ganglia 3.1 servers


Adds method to NameNode/ClientProtocol that allows for rude revoke of lease on current lease holder


Removed thriftfs contrib component.


Removed references to the older fs.checkpoint.* properties that resided in core-site.xml


The native build, when run from trunk, now requires autotools, libtool and openssl dev libraries.


The permissions on datanode data directories (configured by dfs.datanode.data.dir.perm) now default to 0700. Upon startup, the datanode will automatically change the permissions to match the configured value.


Add a configuration variable dfs.image.transfer.bandwidthPerSec to allow the user to specify the amount of bandwidth for transferring image and edits. Its default value is 0 indicating no throttling.
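
The meaning of a bytes-per-second limit, with 0 disabling throttling, can be sketched as a pure function that computes how long a sender should pause. This is not the HDFS implementation; the class and method names are hypothetical.

```java
// Illustrative throttle arithmetic; not the HDFS implementation.
class TransferThrottleSketch {
    // Returns how many milliseconds the sender should pause so that bytesSent
    // over the elapsed time does not exceed bandwidthPerSec (bytes per second).
    // A limit of zero (or less) means no throttling.
    static long pauseMillis(long bytesSent, long elapsedMillis, long bandwidthPerSec) {
        if (bandwidthPerSec <= 0) {
            return 0L;
        }
        long minimumMillis = bytesSent * 1000L / bandwidthPerSec;
        return Math.max(0L, minimumMillis - elapsedMillis);
    }
}
```

For example, with a limit of 1,000,000 bytes per second, a sender that has pushed 2,000,000 bytes in 500 ms would pause for 1500 ms.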


This provides an option to store the fsimage compressed. The layout version is bumped to -25. Users can configure whether the fsimage is compressed and which codec to use. By default the fsimage is not compressed.


Resubmitted the patch for HDFS-1318 because Hudson was down last week.


When running fsck, audit log events are not logged for listStatus and open. Instead, a new event with cmd=fsck is logged, with the ugi field set to the user requesting fsck and the src field set to the fsck path.
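
Tools that scan the audit log can pick out fsck runs by the cmd field. The sketch below assumes that each entry is a sequence of tab-separated key=value pairs; that separator, the sample lines used to exercise it, and the class name are assumptions rather than details taken from this note.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative parser; assumes tab-separated key=value pairs (an assumption).
class AuditLineSketch {
    static Map<String, String> parse(String line) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (String token : line.split("\t")) {
            int eq = token.indexOf('=');
            if (eq > 0) {
                // Split on the first '=' only, so values may themselves contain '='.
                fields.put(token.substring(0, eq), token.substring(eq + 1));
            }
        }
        return fields;
    }

    // True when the entry records an fsck invocation.
    static boolean isFsckEvent(String line) {
        return "fsck".equals(parse(line).get("cmd"));
    }
}
```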


WARNING: No release note provided for this incompatible change.


Changed the protocol name (may be used in hadoop-policy.xml) from security.refresh.usertogroups.mappings.protocol.acl to security.refresh.user.mappings.protocol.acl.


WARNING: No release note provided for this incompatible change.


Specific exceptions are thrown from HDFS implementation and protocol per the interface defined in AbstractFileSystem. The compatibility is not affected as the applications catch IOException and will be able to handle specific exceptions that are subclasses of IOException.
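
The compatibility argument rests on ordinary Java exception handling: a handler for IOException also catches its subclasses. The sketch below uses java.io.FileNotFoundException as a stand-in for a more specific exception; the method names are hypothetical and no HDFS classes are involved.

```java
import java.io.FileNotFoundException;
import java.io.IOException;

// Illustrative only: shows that existing IOException handlers keep working.
class ExceptionCompatibilitySketch {
    // Stand-in for a file system call that now throws a specific subclass.
    static void openMissingFile() throws IOException {
        throw new FileNotFoundException("/no/such/file");
    }

    // Existing application code: catches the general IOException.
    static String legacyCaller() {
        try {
            openMissingFile();
            return "opened";
        } catch (IOException e) {
            return "handled " + e.getClass().getSimpleName();
        }
    }

    // Newer code may handle the specific case separately.
    static String newCaller() {
        try {
            openMissingFile();
            return "opened";
        } catch (FileNotFoundException e) {
            return "missing file";
        } catch (IOException e) {
            return "other I/O failure";
        }
    }
}
```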


WARNING: No release note provided for this incompatible change.


Added support to auto-generate the Eclipse .classpath file from ivy.


Store fsimage MD5 checksum in VERSION file. Validate checksum when loading a fsimage. Layout version bumped.
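
The checksum validation can be pictured as follows: compute the MD5 digest of the image bytes and compare it with the recorded value. This is a minimal JDK-only sketch; how the NameNode actually reads the image and the VERSION file is not shown, and the class and method names are hypothetical.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Illustrative checksum check; not the NameNode's implementation.
class ImageChecksumSketch {
    // Returns the MD5 digest of the data as a lowercase hex string.
    static String md5Hex(byte[] data) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest(data)) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // True when the image bytes match the recorded checksum.
    static boolean matches(byte[] imageBytes, String recordedMd5Hex) {
        return md5Hex(imageBytes).equalsIgnoreCase(recordedMd5Hex);
    }
}
```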


Moved the libhdfs package to the HDFS subproject.


Does not currently provide anything but uniform distribution. Uses some older deprecated class interfaces (for mapper and reducer). This was tested on 0.20 and 0.22 (locally), so it should be fairly backwards compatible.


A robots.txt is now in place which will prevent well behaved crawlers from perusing Hadoop web interfaces.


WARNING: No release note provided for this incompatible change.


Confirmed that the problem of finding the ivy file occurs without the patch with ant 1.7, and does not occur with the patch (with either ant 1.7 or 1.8). Other unit tests are still failing in the test steps themselves on my laptop, but that is not due to not finding the ivy file.


Configuration option webinterface.private.actions has been renamed to mapreduce.jobtracker.webinterface.trusted


Add an FAQ entry regarding the differences between Java API and Streaming development of MR programs.


Job ACL files now have permissions set to 600 (previously 700).


The native build, when run from trunk, now requires autotools, libtool and openssl dev libraries.


Remove the now defunct property mapreduce.job.userhistorylocation.


Remove some redundant lines from JobInProgress’s constructor that were re-initializing things unnecessarily.


The TaskTracker now uses the libhadoop JNI library to operate securely on local files when security is enabled. Secure clusters must ensure that libhadoop.so is available to the TaskTracker.


Fix Dynamic Priority Scheduler to work with hierarchical queue names


Fixes a problem where {{TestJobCleanup}} leaves behind files that cause {{TestJobOutputCommitter}} to error out.


Fix a misleading documentation note about the usage of Reporter objects in Reducers.


Moved the API methods public Counter getCounter(Enum<?> counterName) and public Counter getCounter(String groupName, String counterName) from org.apache.hadoop.mapreduce.TaskInputOutputContext to org.apache.hadoop.mapreduce.TaskAttemptContext.


MAPREDUCE-1887. MRAsyncDiskService now properly absolutizes volume root paths. (Aaron Kimball via zshao)


Removed public deprecated class org.apache.hadoop.streaming.UTF8ByteArrayUtils.


Changed the name of the protocol (may be used in hadoop-policy.xml) from security.refresh.usertogroups.mappings.protocol.acl to security.refresh.user.mappings.protocol.acl.


Improved performance of the method JobInProgress.findSpeculativeTask() which is in the critical heartbeat code path.


Fixed an NPE in streaming that occurs when there is no input to reduce and the streaming reducer sends status updates by writing “reporter:status: xxx” statements to stderr.


Added a configuration property “stream.map.input.ignoreKey” to specify whether to ignore the key while writing input for the mapper. This configuration parameter is valid only if stream.map.input.writer.class is org.apache.hadoop.streaming.io.TextInputWriter.class. For all other InputWriters, the key is always written.
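
The effect of the flag on what a streaming mapper reads can be sketched as below. The tab separator and trailing newline are assumptions about the text format, and the class and method names are hypothetical; this is not the TextInputWriter source.

```java
// Illustrative only; separator and line ending are assumptions.
class MapperInputSketch {
    // Builds the line handed to the mapper for one input record.
    static String formatRecord(String key, String value, boolean ignoreKey) {
        if (ignoreKey) {
            return value + "\n";           // key dropped, value only
        }
        return key + "\t" + value + "\n";  // key, separator, value
    }
}
```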


Fixes serialization of job-acls in JobStatus to use AccessControlList.write() instead of AccessControlList.toString().


Improved console messaging for streaming jobs by using the generic JobClient API itself instead of the existing streaming-specific code.


This jira introduces backward incompatibility. Existing pipes applications MUST be recompiled with new hadoop pipes library once the changes in this jira are deployed.


Fixed a bug that caused TaskRunner to hit an NPE while getting the ugi from the TaskTracker and subsequently crash, resulting in the task failing only after the task-timeout period.


Removes JNI calls to get jvm current/max heap usage in ClusterStatus. Any instances of ClusterStatus serialized in a prior version will not be correctly deserialized using the updated class.


Added a metric to track number of heartbeats processed by the JobTracker.



Added support to auto-generate the Eclipse .classpath file from ivy.


New configuration option: hadoop.security.service.user.name.key. This setting points to the server principal for RefreshUserToGroupMappingsProtocol. The value should be either the NN or JT principal, depending on whether it is used in DFSAdmin or MRAdmin. The value is set by the application, so no default value is needed.


Adds the audit logging facility to MapReduce. All authorization/authentication events are logged to audit log. Audit log entries are stored as key=value.


Incremental enhancements to the JobTracker to optimize heartbeat handling.


Adds -background option to run a streaming job in background.


Lazily construct a connection to the JobTracker from the job-submission client.


Incremental enhancements to the JobTracker include a no-lock version of JT.getTaskCompletion events, no lock on the JT while doing i/o during job-submission and several fixes to cut down configuration parsing during heartbeat-handling.


Job names on jobtracker.jsp should be 80 characters long at most.


Add CapacityScheduler servlet to enhance web UI for queue information.


Moved Task log cleanup into a separate thread in TaskTracker.
Added configuration “mapreduce.job.userlog.retain.hours” to specify the time (in hours) for which the user logs are retained after job completion.


Improved streaming job failure handling when #link is missing from the uri format of -cacheArchive. Earlier the job failed when launching individual tasks; now it fails during job submission itself.


Allow map and reduce jvm parameters, environment variables and ulimit to be set separately.

Configuration changes: added mapred.map.child.java.opts, mapred.reduce.child.java.opts, mapred.map.child.env, mapred.reduce.child.env, mapred.map.child.ulimit and mapred.reduce.child.ulimit; deprecated mapred.child.java.opts, mapred.child.env and mapred.child.ulimit.
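
One way to picture the split settings is a lookup that prefers the task-specific key and falls back to the deprecated shared key. That precedence is an assumption and is not stated in this note. The sketch uses java.util.Properties in place of Hadoop's configuration classes, and the class and method names are hypothetical.

```java
import java.util.Properties;

// Illustrative lookup; the fallback order is an assumption.
class ChildOptsSketch {
    static String javaOpts(Properties conf, boolean isMapTask) {
        String specificKey = isMapTask
            ? "mapred.map.child.java.opts"
            : "mapred.reduce.child.java.opts";
        // Deprecated shared setting, used only when no task-specific value exists.
        String shared = conf.getProperty("mapred.child.java.opts", "");
        return conf.getProperty(specificKey, shared);
    }
}
```

With only mapred.child.java.opts set, both task types receive the same options; setting mapred.reduce.child.java.opts as well changes the reduce side only.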


Collect cpu and memory statistics per task.