Hadoop 0.23.0 Release Notes
These release notes include new developer and user-facing incompatibilities, features, and major improvements.
Changes since Hadoop 0.22
- HADOOP-7778.
Major bug reported by tomwhite and fixed by tomwhite
FindBugs warning in Token.getKind()
From https://builds.apache.org/job/PreCommit-HADOOP-Build/330//artifact/trunk/hadoop-common-project/patchprocess/newPatchFindbugsWarningshadoop-common.html
bq. org.apache.hadoop.security.token.Token.getKind() is unsynchronized, org.apache.hadoop.security.token.Token.setKind(Text) is synchronized
Looks like this was introduced by MAPREDUCE-2764.
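The warning describes inconsistent synchronization: a field that is written under a lock but read without one. The sketch below shows the pattern in its consistent form. `KindHolder` is a hypothetical stand-in, not the real Token class, and the actual patch may have resolved the warning differently.

```java
// Hypothetical stand-in for a class with a lock-guarded field; not the real Token.
class KindHolder {
    private String kind = "";

    // Both accessors lock on 'this', so a reader always observes the
    // value most recently published by a writer.
    synchronized String getKind() {
        return kind;
    }

    synchronized void setKind(String newKind) {
        kind = newKind;
    }
}
```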
- HADOOP-7772.
Trivial improvement reported by stevel@apache.org and fixed by stevel@apache.org
javadoc the topology classes
To help people understand and make changes to the Topology classes, their javadocs could be rounded off.
- HADOOP-7771.
Blocker bug reported by johnvijoe and fixed by johnvijoe
NPE when running hdfs dfs -copyToLocal, -get etc
NPE when running hdfs dfs -copyToLocal if the destination directory does not exist. The behavior in branch-0.20-security is to create the directory and copy/get the contents from source.
- HADOOP-7770.
Blocker bug reported by raviprak and fixed by raviprak (fs)
ViewFS getFileChecksum throws FileNotFoundException for files in /tmp and /user
Thanks to Rohini Palaniswamy for discovering this bug. To quote
bq. When doing getFileChecksum for path /user/hadoopqa/somefile, it is trying to fetch checksum for /user/user/hadoopqa/somefile. If /tmp/file, it is trying /tmp/tmp/file. Works fine for other FS operations.
- HADOOP-7766.
Major bug reported by jnp and fixed by jnp
The auth to local mappings are not being respected, with webhdfs and security enabled.
KerberosAuthenticationHandler reloads the KerberosName statically and overrides the auth to local mappings.
- HADOOP-7764.
Blocker bug reported by jeagles and fixed by jeagles
Allow both ACL list and global path spec filters to HttpServer
HttpServer allows setting global path spec filters in one constructor and an ACL list in another constructor. There is no way for a user to set both, either through the public API or through a constructor.
- HADOOP-7763.
Major improvement reported by tomwhite and fixed by tomwhite (documentation)
Add top-level navigation to APT docs
We need navigation menus for the APT docs that have been written so far.
- HADOOP-7753.
Major sub-task reported by tlipcon and fixed by tlipcon (io, native, performance)
Support fadvise and sync_data_range in NativeIO, add ReadaheadPool class
This JIRA adds JNI wrappers for sync_data_range and posix_fadvise. It also implements a ReadaheadPool class for future use from HDFS and MapReduce.
- HADOOP-7749.
Minor improvement reported by tlipcon and fixed by tlipcon (util)
Add NetUtils call which provides more help in exception messages
In setting up MR2, I accidentally had a bad configuration value specified for one of the IP configs. I was getting a NumberFormatException parsing this config, but no indication as to what config value was being fetched. This JIRA is to add an API to NetUtils.createSocketAddr which takes the configuration name, so that any exceptions thrown will point back to where the user needs to fix it.
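As a rough illustration of the idea, the sketch below parses a host:port string and, on failure, names the configuration property in the exception message. The class name, method signature, and message wording are assumptions made for this example and are not taken from NetUtils itself.

```java
import java.net.InetSocketAddress;

class SocketAddrSketch {
    // Hypothetical helper: parses "host:port" and, on failure, names the
    // configuration key so the user knows which setting to fix.
    static InetSocketAddress createSocketAddr(String target, String configName) {
        try {
            int colon = target.lastIndexOf(':');
            if (colon < 0) {
                throw new IllegalArgumentException("missing port");
            }
            String host = target.substring(0, colon);
            int port = Integer.parseInt(target.substring(colon + 1));
            // createUnresolved avoids a DNS lookup in this sketch.
            return InetSocketAddress.createUnresolved(host, port);
        } catch (IllegalArgumentException e) {
            // NumberFormatException is a subclass of IllegalArgumentException.
            throw new IllegalArgumentException(
                "Bad host:port value '" + target + "' (configuration property '"
                + configName + "')", e);
        }
    }
}
```

With a helper like this, a bad value such as `nn1:80x20` produces an error that points at the property name instead of a bare NumberFormatException.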
- HADOOP-7745.
Major bug reported by raviprak and fixed by raviprak
I switched variable names in HADOOP-7509
As Aaron pointed out on https://issues.apache.org/jira/browse/HADOOP-7509?focusedCommentId=13126725&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13126725 I stupidly swapped CommonConfigurationKeys.HADOOP_SECURITY_AUTHENTICATION with CommonConfigurationKeys.HADOOP_SECURITY_AUTHORIZATION.
- HADOOP-7744.
Major bug reported by jeagles and fixed by jeagles (test)
Incorrect exit code for hadoop-core-test tests when exception thrown
Please see MAPREDUCE-3179 for a full description.
- HADOOP-7743.
Major improvement reported by tucu00 and fixed by tucu00 (build)
Add Maven profile to create a full source tarball
Currently we are building binary distributions only.
We should also build a full source distribution from where Hadoop can be built.
- HADOOP-7740.
Minor bug reported by arpitgupta and fixed by arpitgupta (conf)
security audit logger is not on by default, fix the log4j properties to enable the logger
Fixed security audit logger configuration. (Arpit Gupta via Eric Yang)
- HADOOP-7737.
Major improvement reported by tucu00 and fixed by tucu00 (build)
normalize hadoop-mapreduce & hadoop-dist dist/tar build with common/hdfs
Normalize the build of hadoop-mapreduce and hadoop-dist with hadoop-common and hadoop-hdfs, making the -Pdist and -Dtar maven options consistent.
* -Pdist should create the layout
* -Dtar should create the TAR
- HADOOP-7728.
Major bug reported by rramya and fixed by rramya (conf)
hadoop-setup-conf.sh should be modified to enable task memory manager
Enable task memory management to be configurable via hadoop config setup script.
- HADOOP-7724.
Major bug reported by gkesavan and fixed by arpitgupta
hadoop-setup-conf.sh should put proxy user info into the core-site.xml
Fixed hadoop-setup-conf.sh to put proxy user in core-site.xml. (Arpit Gupta via Eric Yang)
- HADOOP-7721.
Major bug reported by arpitgupta and fixed by jnp
dfs.web.authentication.kerberos.principal expects the full hostname and does not replace _HOST with the hostname
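A minimal sketch of the expected substitution follows. The class and method names are hypothetical, and lower-casing the hostname is an assumption of this example rather than something stated in the issue.

```java
import java.util.Locale;

class PrincipalSketch {
    static final String HOST_PATTERN = "_HOST";

    // Replaces the instance component of "service/_HOST@REALM" with the
    // lower-cased fully qualified hostname; other principals are unchanged.
    static String replaceHostPattern(String principal, String fqdn) {
        int slash = principal.indexOf('/');
        int at = principal.indexOf('@');
        if (slash < 0 || at < slash) {
            return principal; // no instance component to substitute
        }
        String instance = principal.substring(slash + 1, at);
        if (!HOST_PATTERN.equals(instance)) {
            return principal;
        }
        return principal.substring(0, slash + 1)
            + fqdn.toLowerCase(Locale.ROOT)
            + principal.substring(at);
    }
}
```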
- HADOOP-7720.
Major improvement reported by arpitgupta and fixed by arpitgupta (conf)
improve the hadoop-setup-conf.sh to read in the hbase user and setup the configs
Added parameter for HBase user to setup config script. (Arpit Gupta via Eric Yang)
- HADOOP-7709.
Major improvement reported by jeagles and fixed by jeagles
Running a set of methods in a Single Test Class
Instead of running every test method in a class, limit the run to specific test methods as described in the link below.
http://maven.apache.org/plugins/maven-surefire-plugin/examples/single-test.html
Upgrade to the latest version of maven-surefire-plugin that has this feature.
- HADOOP-7708.
Critical bug reported by arpitgupta and fixed by eyang (conf)
config generator does not update the properties file if one exists already
Fixed hadoop-setup-conf.sh to handle config file consistently. (Eric Yang)
- HADOOP-7707.
Major improvement reported by arpitgupta and fixed by arpitgupta (conf)
improve config generator to allow users to specify proxy user, turn append on or off, turn webhdfs on or off
Added toggle for dfs.support.append, webhdfs and hadoop proxy user to setup config script. (Arpit Gupta via Eric Yang)
- HADOOP-7705.
Minor new feature reported by stevel@apache.org and fixed by stevel@apache.org (util)
Add a log4j back end that can push out JSON data, one per line
If we had a back end for Log4j that pushed out log events in single line JSON content, we'd have something that is fairly straightforward to machine parse. If: it may be harder to do than expected. Once working HADOOP-6244 could use it.
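To make the one-event-per-line idea concrete, here is a small stand-alone sketch that renders a log event as a single JSON object with all newlines escaped. The field names (`time`, `level`, `name`, `message`) are assumptions for illustration, and the code does not use the Log4j API.

```java
class JsonLineSketch {
    // Escapes the characters that would break a one-line JSON string.
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) {
                        // Remaining control characters use the 4-digit hex form.
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.toString();
    }

    // Renders one log event as a single JSON object with no raw newlines.
    static String toJsonLine(long timeMillis, String level, String logger, String message) {
        return "{\"time\":" + timeMillis
            + ",\"level\":\"" + escape(level)
            + "\",\"name\":\"" + escape(logger)
            + "\",\"message\":\"" + escape(message) + "\"}";
    }
}
```

Because every event occupies exactly one line, a consumer can split the log on newlines and parse each line independently.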
- HADOOP-7691.
Major bug reported by gkesavan and fixed by eyang
hadoop deb pkg should take a diff group id
Fixed conflict uid for install packages. (Eric Yang)
- HADOOP-7684.
Major bug reported by eyang and fixed by eyang (scripts)
jobhistory server and secondarynamenode should have init.d script
Added init.d script for jobhistory server and secondary namenode. (Eric Yang)
- HADOOP-7681.
Minor bug reported by arpitgupta and fixed by arpitgupta (conf)
log4j.properties is missing properties for security audit and hdfs audit should be changed to info
(Arpit Gupta via Eric Yang)
- HADOOP-7671.
Major bug reported by raviprak and fixed by raviprak
Add license headers to hadoop-common/src/main/packages/templates/conf/
hadoop-common/src/main/packages/templates/conf/ is not in the exclude list for the apache-rat plugin. This causes 10 release audit warnings for missing license headers (in the properties and xml files like hdfs-site.xml).
- HADOOP-7668.
Minor improvement reported by sureshms and fixed by stevel@apache.org (util)
Add a NetUtils method that can tell if an InetAddress belongs to local host
closing again
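One plausible way to answer "does this InetAddress belong to the local host" with only JDK calls is sketched below. The class and method names are hypothetical, and the method actually added to NetUtils may differ in detail.

```java
import java.net.InetAddress;
import java.net.NetworkInterface;
import java.net.SocketException;

class LocalAddressSketch {
    // Returns true if the address is a wildcard or loopback address, or is
    // bound to one of this machine's network interfaces.
    static boolean isLocalAddress(InetAddress addr) {
        if (addr.isAnyLocalAddress() || addr.isLoopbackAddress()) {
            return true;
        }
        try {
            return NetworkInterface.getByInetAddress(addr) != null;
        } catch (SocketException e) {
            return false; // treat lookup failures as "not local"
        }
    }
}
```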
- HADOOP-7664.
Minor improvement reported by raviprak and fixed by raviprak (conf)
o.a.h.conf.Configuration complains of overriding a final parameter even if the value with which it is attempting to override is the same.
o.a.h.conf.Configuration complains of overriding a final parameter even if the value with which it is attempting to override is the same.
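The desired behaviour can be sketched with a toy configuration store that counts a warning only when an override of a final key would actually change the value. `FinalParamSketch` is a hypothetical stand-in and does not reproduce the real Configuration loading logic.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class FinalParamSketch {
    private final Map<String, String> values = new HashMap<>();
    private final Set<String> finalKeys = new HashSet<>();
    int warnings = 0; // counts override attempts that would be logged

    void load(String key, String value, boolean isFinal) {
        if (finalKeys.contains(key)) {
            // Only complain when the new value actually differs.
            if (!values.get(key).equals(value)) {
                warnings++;
            }
            return; // a final value is never replaced
        }
        values.put(key, value);
        if (isFinal) {
            finalKeys.add(key);
        }
    }

    String get(String key) {
        return values.get(key);
    }
}
```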
- HADOOP-7663.
Major bug reported by mayank_bansal and fixed by mayank_bansal (test)
TestHDFSTrash failing on 22
Seems to have started failing recently in many commit builds as well as the last two nightly builds of 22:
https://builds.apache.org/hudson/job/Hadoop-Hdfs-22-branch/51/testReport/org.apache.hadoop.hdfs/TestHDFSTrash/testTrashEmptier/
https://issues.apache.org/jira/browse/HDFS-1967
- HADOOP-7662.
Major bug reported by tgraves and fixed by tgraves
logs servlet should use pathspec of /*
The logs servlet in HttpServer should use a pathspec of /* instead of /.
logContext.addServlet(AdminAuthorizedServlet.class, "/*");
In making the changes for the yarn webapps (MAPREDUCE-2999), I registered a webapp to use "/". This blocked the /logs servlet from working because both had a pathSpec of "/" and the guice filter seemed to take precedence. Changing the pathspec of the logs servlet to /* fixes the issue.
- HADOOP-7658.
Major bug reported by gkesavan and fixed by eyang
to fix hadoop config template
By default, the hadoop rpm config template sets HADOOP_SECURE_DN_USER, HADOOP_SECURE_DN_LOG_DIR & HADOOP_SECURE_DN_PID_DIR.
These values should only be set for a secured deployment:
# On secure datanodes, user to run the datanode as after dropping privileges
export HADOOP_SECURE_DN_USER=${HADOOP_HDFS_USER}
# Where log files are stored. $HADOOP_HOME/logs by default.
export HADOOP_LOG_DIR=${HADOOP_LOG_DIR}/$USER
# Where log files are stored in the secure data environment.
export HADOOP_SE...
- HADOOP-7655.
Major improvement reported by arpitgupta and fixed by arpitgupta
provide a small validation script that smoke tests the installed cluster
Committed to trunk and v23, since code reviewed by Eric.
- HADOOP-7642.
Major improvement reported by tucu00 and fixed by tomwhite (build)
create hadoop-dist module where TAR stitching would happen
Instead of having a post build script that stitches common&hdfs&mmr, this should be done as part of the build when running 'mvn package -Pdist -Dtar'
- HADOOP-7639.
Major bug reported by tgraves and fixed by tgraves
yarn ui not properly filtered in HttpServer
Currently HttpServer only has "*.html" and "*.jsp" as user-facing URLs when you add a filter. For the new web framework in yarn, the pages no longer end in *.html or *.jsp and thus are not being filtered properly. The yarn ui just uses paths; for it, the URL would be server:port/yarn/*
- HADOOP-7637.
Major bug reported by eyang and fixed by eyang (build)
Fair scheduler configuration file is not bundled in RPM
205 build of tar is fine, but rpm failed with:
{noformat}
[rpm] Processing files: hadoop-0.20.205.0-1
[rpm] warning: File listed twice: /usr/libexec
[rpm] warning: File listed twice: /usr/libexec/hadoop-config.sh
[rpm] warning: File listed twice: /usr/libexec/jsvc.i386
[rpm] Checking for unpackaged file(s): /usr/lib/rpm/check-files /tmp/hadoop_package_build_hortonfo/BUILD
[rpm] error: Installed (but unpackaged) file(s) found:
[rpm] /etc/hadoop/fai...
- HADOOP-7633.
Major bug reported by arpitgupta and fixed by eyang (conf)
log4j.properties should be added to the hadoop conf on deploy
Currently the log4j properties are not present in the hadoop conf dir. We should add them so that log rotation happens appropriately, and also define the other logs that hadoop can generate, for example the audit and auth logs as well as the mapred summary logs.
- HADOOP-7631.
Major bug reported by rramya and fixed by eyang (conf)
In mapred-site.xml, stream.tmpdir is mapped to ${mapred.temp.dir} which is undeclared.
Streaming jobs seem to fail with the following exception:
{noformat}
Exception in thread "main" java.io.IOException: No such file or directory
at java.io.UnixFileSystem.createFileExclusively(Native Method)
at java.io.File.checkAndCreate(File.java:1704)
at java.io.File.createTempFile(File.java:1792)
at org.apache.hadoop.streaming.StreamJob.packageJobJar(StreamJob.java:603)
at org.apache.hadoop.streaming.StreamJob.setJobConf(StreamJob.java:798)
a...
- HADOOP-7630.
Major bug reported by arpitgupta and fixed by eyang (conf)
hadoop-metrics2.properties should have a property *.period set to a default value for metrics
Currently the hadoop-metrics2.properties file does not have a value set for *.period.
This property is used by metrics to determine how often values are refreshed. We should set it to a default of 60.
- HADOOP-7629.
Major bug reported by phunt and fixed by tlipcon
regression with MAPREDUCE-2289 - setPermission passed immutable FsPermission (rpc failure)
MAPREDUCE-2289 introduced the following change:
{noformat}
+ fs.setPermission(stagingArea, JOB_DIR_PERMISSION);
{noformat}
JOB_DIR_PERMISSION is an immutable FsPermission, which cannot be used in RPC calls; it results in the following exception:
{noformat}
2011-09-08 16:31:45,187 WARN org.apache.hadoop.ipc.Server: Unable to read call parameters for client 127.0.0.1
java.lang.RuntimeException: java.lang.NoSuchMethodException: org.apache.hadoop.fs.permission.FsPermission$2.<init>()
...
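The NoSuchMethodException on `FsPermission$2.<init>()` is consistent with an anonymous subclass that lacks a no-argument constructor, which reflective instantiation on the receiving side would need. The sketch below reproduces that mechanism with a hypothetical `Perm` class; it is an illustration under that assumption and says nothing about how the issue was actually fixed.

```java
class ImmutablePermSketch {
    // Hypothetical stand-in for a permission type that can be created reflectively.
    static class Perm {
        protected short bits;

        Perm() { } // no-argument constructor used for reflective instantiation

        Perm(short bits) { this.bits = bits; }

        void set(short bits) { this.bits = bits; }
    }

    // The anonymous subclass only gets a constructor taking (short), so it
    // has no no-argument constructor of its own.
    static Perm createImmutable(short bits) {
        return new Perm(bits) {
            @Override
            void set(short b) {
                throw new UnsupportedOperationException();
            }
        };
    }

    // Reports whether a class declares a no-argument constructor.
    static boolean hasNoArgConstructor(Class<?> c) {
        try {
            c.getDeclaredConstructor();
            return true;
        } catch (NoSuchMethodException e) {
            return false;
        }
    }
}
```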
- HADOOP-7627.
Minor improvement reported by tlipcon and fixed by tlipcon (metrics, test)
Improve MetricsAsserts to give more understandable output on failure
In developing a test case that uses MetricsAsserts, I had two issues:
1) the error output in the case that an assertion failed does not currently give any information as to the _actual_ value of the metric
2) there is no way to retrieve the metric variable (eg to assert that the sum of a metric over all DNs is equal to some value)
This JIRA is to improve this test class to fix the above issues.
- HADOOP-7626.
Major bug reported by eyang and fixed by eyang (scripts)
Allow overwrite of HADOOP_CLASSPATH and HADOOP_OPTS
Quote email from Ashutosh Chauhan:
bq. There is a bug in hadoop-env.sh which prevents hcatalog server to start in secure settings. Instead of adding classpath, it overrides them. I was not able to verify where the bug belongs to, in HMS or in hadoop scripts. Looks like hadoop-env.sh is generated from hadoop-env.sh.template in installation process by HMS. Hand crafted patch follows:
bq. - export HADOOP_CLASSPATH=$f
bq. +export HADOOP_CLASSPATH=${HADOOP_CLASSPATH}:$f
bq. -export HADOOP_OPTS=...
- HADOOP-7612.
Major improvement reported by tomwhite and fixed by tomwhite (build)
Change test-patch to run tests for all nested modules
HADOOP-7561 changed the behaviour of test-patch to run tests for changed modules, however this was assuming a flat structure. Given the nested maven hierarchy we should always run all the common tests for any common change, all the HDFS tests for any HDFS change, and all the MapReduce tests for any MapReduce change.
In addition, we should do a top-level build to test compilation after any change.
- HADOOP-7610.
Major bug reported by eyang and fixed by eyang (scripts)
/etc/profile.d does not exist on Debian
As part of post installation script, there is a symlink created in /etc/profile.d/hadoop-env.sh to source /etc/hadoop/hadoop-env.sh. Therefore, users do not need to configure HADOOP_* environment. Unfortunately, /etc/profile.d only exists in Ubuntu. [Section 9.9 of the Debian Policy|http://www.debian.org/doc/debian-policy/ch-opersys.html#s9.9] states:
{quote}
A program must not depend on environment variables to get reasonable defaults. (That's because these environment variables would ha...
- HADOOP-7608.
Major bug reported by tucu00 and fixed by tucu00 (io)
SnappyCodec check for Hadoop native lib is wrong
Currently SnappyCodec is doing:
{code}
public static boolean isNativeSnappyLoaded(Configuration conf) {
  return LoadSnappy.isLoaded() && conf.getBoolean(
      CommonConfigurationKeys.IO_NATIVE_LIB_AVAILABLE_KEY,
      CommonConfigurationKeys.IO_NATIVE_LIB_AVAILABLE_DEFAULT);
}
{code}
But the conf check is wrong as it defaults to true. Instead it should use *NativeCodeLoader.isNativeCodeLoaded()*
- HADOOP-7606.
Major bug reported by atm and fixed by tucu00 (test)
Upgrade Jackson to version 1.7.1 to match the version required by Jersey
As of 2 days ago, 13 tests started failing, all with errors in Avro-related tests.
- HADOOP-7604.
Critical bug reported by mahadev and fixed by mahadev
Hadoop Auth examples pom in 0.23 point to 0.24 versions.
hadoop-auth-examples/pom.xml has references to 0.24 in the 0.23 branch.
- HADOOP-7603.
Major bug reported by eyang and fixed by eyang
Set default hdfs, mapred uid, and hadoop group gid for RPM packages
Set hdfs uid, mapred uid, and hadoop gid to fixed numbers (201, 202, and 123, respectively).
- HADOOP-7599.
Major bug reported by eyang and fixed by eyang (scripts)
Improve hadoop setup conf script to setup secure Hadoop cluster
Setting up a secure Hadoop cluster requires a lot of manual setup. The motivation of this jira is to provide setup scripts that automate setting up a secure Hadoop cluster.
- HADOOP-7598.
Major bug reported by revans2 and fixed by revans2 (build)
smart-apply-patch.sh does not handle patching from a sub directory correctly.
In some situations, smart-apply-patch.sh does not apply valid patches from trunk or from git as it was designed to do.
- HADOOP-7595.
Major improvement reported by tucu00 and fixed by tucu00 (build)
Upgrade dependency to Avro 1.5.3
Avro 1.5.3 depends on Snappy-Java 1.5.3 which enables the use of its SO file from the java.library.path
- HADOOP-7594.
Major new feature reported by szetszwo and fixed by szetszwo
Support HTTP REST in HttpServer
Provide an API in HttpServer for supporting HTTP REST.
This is a part of HDFS-2284.
- HADOOP-7593.
Major bug reported by szetszwo and fixed by umamaheswararao (test)
AssertionError in TestHttpServer.testMaxThreads()
TestHttpServer passed but there was an AssertionError in the output.
{noformat}
11/08/30 03:35:56 INFO http.TestHttpServer: HTTP server started: http://localhost:52974/
Exception in thread "pool-1-thread-61" java.lang.AssertionError:
at org.junit.Assert.fail(Assert.java:91)
at org.junit.Assert.assertTrue(Assert.java:43)
at org.junit.Assert.assertTrue(Assert.java:54)
at org.apache.hadoop.http.TestHttpServer$1.run(TestHttpServer.java:164)
at java.util.concurrent.ThreadPoolExecutor$Worker.ru...
- HADOOP-7589.
Major bug reported by revans2 and fixed by revans2 (build)
Prefer mvn test -DskipTests over mvn compile in test-patch.sh
I got a failure running test-patch with a clean .m2 directory.
To quote Alejandro:
{quote}
The reason for this failure is because of how Maven reactor/dependency
resolution works (IMO a bug).
Maven reactor/dependency resolution is smart enough to create the classpath
using the classes from all modules being built.
However, this smartness falls short just a bit. The dependencies are
resolved using the deepest maven phase used by current mvn invocation. If
you are doing 'mvn compile' you don...
- HADOOP-7580.
Major bug reported by sseth and fixed by sseth
Add a version of getLocalPathForWrite to LocalDirAllocator which doesn't create dirs
Required in MR where directories are created by ContainerExecutor (mrv2) / TaskController (0.20) as a specific user.
- HADOOP-7579.
Major task reported by tucu00 and fixed by tucu00 (security)
Rename package names from alfredo to auth
- HADOOP-7578.
Major bug reported by mahadev and fixed by mahadev
Fix test-patch to be able to run on MR patches.
- HADOOP-7576.
Major bug reported by tomwhite and fixed by szetszwo (security)
Fix findbugs warnings in Hadoop Auth (Alfredo)
Found in HADOOP-7567: https://builds.apache.org/job/PreCommit-HADOOP-Build/65//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-alfredo.html
- HADOOP-7575.
Minor bug reported by jeagles and fixed by jeagles (fs)
Support fully qualified paths as part of LocalDirAllocator
Contexts with configuration path strings using fully qualified paths (e.g. file:///tmp instead of /tmp) mistakenly create a directory named 'file:' and sub-directories in the current local file system working directory.
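A small sketch of the kind of normalization involved: a configured string is parsed as a URI, and when its scheme is `file` only the path component is used for local file operations. The helper name is hypothetical, and the sketch assumes the configured value is a syntactically valid URI (for example, it contains no spaces).

```java
import java.net.URI;

class LocalPathSketch {
    // Returns the local file system path for either a plain path ("/tmp")
    // or a fully qualified file URI ("file:///tmp").
    static String toLocalPath(String configured) {
        URI uri = URI.create(configured);
        if ("file".equals(uri.getScheme())) {
            return uri.getPath();
        }
        return configured;
    }
}
```

Passing the raw string `file:///tmp` straight to `java.io.File` would instead be treated as a relative path whose first component is `file:`.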
- HADOOP-7568.
Major bug reported by shv and fixed by zero45 (io)
SequenceFile should not print into stdout
The following line in {{SequenceFile.Reader.initialize()}} should be removed:
{code}
System.out.println("Setting end to " + end);
{code}
- HADOOP-7566.
Major bug reported by mahadev and fixed by tucu00
MR tests are failing webapps/hdfs not found in CLASSPATH
While running ant tests, the tests are failing with the following trace:
{noformat}
webapps/hdfs not found in CLASSPATH
java.io.FileNotFoundException: webapps/hdfs not found in CLASSPATH
at org.apache.hadoop.http.HttpServer.getWebAppsPath(HttpServer.java:470)
at org.apache.hadoop.http.HttpServer.<init>(HttpServer.java:186)
at org.apache.hadoop.http.HttpServer.<init>(HttpServer.java:147)
at org.apache.hadoop.hdfs.server.namenode.NameNodeHttpServer$1.run(NameNo...
- HADOOP-7564.
Major sub-task reported by tomwhite and fixed by tomwhite
Remove test-patch SVN externals
With the new top-level test-patch script in dev-support, the SVN externals for the old test-patch scripts are no longer needed.
- HADOOP-7563.
Major bug reported by eyang and fixed by eyang (scripts)
hadoop-config.sh setup CLASSPATH, HADOOP_HDFS_HOME and HADOOP_MAPRED_HOME incorrectly
HADOOP_HDFS_HOME and HADOOP_MAPRED_HOME were set to HADOOP_PREFIX/share/hadoop/hdfs and HADOOP_PREFIX/share/hadoop/mapreduce. This setup confuses the location of the hdfs and mapred scripts. Instead, the script should look for the hdfs and mapred scripts in HADOOP_PREFIX/bin.
- HADOOP-7561.
Major sub-task reported by tomwhite and fixed by tomwhite
Make test-patch only run tests for changed modules
By running test-patch from trunk we can check that a change in one project (e.g. common) doesn't cause compile errors in other projects (e.g. HDFS). To get this to work we only need to run tests for the modules that are affected by the patch.
- HADOOP-7560.
Major sub-task reported by tucu00 and fixed by tucu00
Make hadoop-common a POM module with sub-modules (common & alfredo)
Currently hadoop-common is a JAR module, thus it cannot aggregate sub-modules.
Changing it to a POM module makes it an aggregator module; all the code under hadoop-common must be moved to a sub-module.
I.e.:
mkdir hadoop-common-project
mv hadoop-common hadoop-common-project
mv hadoop-alfredo hadoop-common-project
hadoop-common-project/pom.xml is a POM module that aggregates common & alfredo
- HADOOP-7555.
Trivial improvement reported by atm and fixed by atm (build)
Add eclipse-generated files to .gitignore
The .gitignore file in the hadoop-mapreduce directory specifically excludes .classpath, .settings, and .project files/dirs. We should move these excludes to the top level .gitignore so that Common and HDFS have these files excluded as well.
- HADOOP-7552.
Minor improvement reported by eli and fixed by eli (fs)
FileUtil#fullyDelete doesn't throw IOE but lists it in the throws clause
FileUtil#fullyDelete doesn't throw IOException so it shouldn't have IOException in its throws clause. Having it listed makes it easy to think you'll get an IOException, e.g. when trying to delete a non-existent file or on an IO error accessing the local file, but you don't.
- HADOOP-7547.
Minor bug reported by umamaheswararao and fixed by umamaheswararao (io)
Fix the warning in writable classes.[ WritableComparable is a raw type. References to generic type WritableComparable<T> should be parameterized ]
WritableComparable is a raw type. References to generic type WritableComparable<T> should be parameterized.
Also address the same in example implementation in WritableComparable interface's javadoc.
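The raw-type warning goes away once the type parameter is supplied. The sketch below uses `java.lang.Comparable` as a stand-in, since `WritableComparable<T>` is parameterized in the same way; `IntKey` is a hypothetical class for illustration.

```java
// Stand-in using java.lang.Comparable; WritableComparable<T> is
// parameterized the same way.
class IntKey implements Comparable<IntKey> {
    private final int value;

    IntKey(int value) {
        this.value = value;
    }

    // Because the interface is parameterized, the argument is already an
    // IntKey and no cast is needed.
    @Override
    public int compareTo(IntKey o) {
        return Integer.compare(this.value, o.value);
    }
}
```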
- HADOOP-7545.
Critical bug reported by tlipcon and fixed by tlipcon (build, test)
common -tests jar should not include properties and configs
This is the cause of HDFS-2242. The -tests jar generated from the common build should only include the test classes, and not the test resources.
- HADOOP-7536.
Major bug reported by kihwal and fixed by tucu00 (build)
Correct the dependency version regressions introduced in HADOOP-6671
I just noticed the versions specified for dependencies have gone backward with HADOOP-6671.
To name a few,
* commons-logging was 1.1.1, now 1.0.4
* commons-logging-api was 1.1, now 1.0.4
* slf4j was 1.5.11, now 1.5.8
There might be more.
- HADOOP-7533.
Major sub-task reported by tomwhite and fixed by tomwhite
Allow test-patch to be run from any subproject directory
Currently dev-support/test-patch.sh can only be run from the top-level (and only for hadoop-common).
- HADOOP-7531.
Major improvement reported by eli and fixed by eli (util)
Add servlet util methods for handling paths in requests
Common side of HDFS-2235.
- HADOOP-7529.
Critical bug reported by tlipcon and fixed by vicaya (metrics)
Possible deadlock in metrics2
Lock cycle detected by jcarder between MetricsSystemImpl and DefaultMetricsSystem
- HADOOP-7528.
Major sub-task reported by tucu00 and fixed by tucu00 (build)
Maven build fails in Windows
Maven does not run on Windows for the following reasons:
* Enforcer plugin restricts build to Unix
* Ant run snippets to create TAR are not cygwin friendly
- HADOOP-7526.
Minor test reported by eli and fixed by eli (fs)
Add TestPath tests for URI conversion and reserved characters
TestPath needs tests that cover URI conversion (eg places where Paths and URIs differ) and handling of URI reserved characters in paths.
- HADOOP-7525.
Major sub-task reported by tomwhite and fixed by tomwhite (scripts)
Make arguments to test-patch optional
Currently you have to specify all the arguments to test-patch.sh, which makes it cumbersome to use. We should make all arguments except the patch file optional.
- HADOOP-7523.
Blocker bug reported by jlee@mindset-media.com and fixed by jlee@mindset-media.com (test)
Test org.apache.hadoop.fs.TestFilterFileSystem fails due to java.lang.NoSuchMethodException
Test org.apache.hadoop.fs.TestFilterFileSystem fails due to java.lang.NoSuchMethodException. Here is the error message:
-------------------------------------------------------------------------------
Test set: org.apache.hadoop.fs.TestFilterFileSystem
-------------------------------------------------------------------------------
Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 0.232 sec <<< FAILURE!
testFilterFileSystem(org.apache.hadoop.fs.TestFilterFileSystem) Time elapsed...
- HADOOP-7520.
Major bug reported by tucu00 and fixed by tucu00 (build)
hadoop-main fails to deploy
Doing a Maven deployment hadoop-main (trunk/pom.xml) fails to deploy because it does not have the distribution management information.
- HADOOP-7515.
Major sub-task reported by tomwhite and fixed by tomwhite (build)
test-patch reports the wrong number of javadoc warnings
- HADOOP-7512.
Trivial task reported by qwertymaniac and fixed by qwertymaniac (documentation)
Fix example mistake in WritableComparable javadocs
From IRC, via uberj:
{code}
[9:58pm] uberj: http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/WritableComparable.html
[9:58pm] uberj: In the example it says "int thatValue = ((IntWritable)o).value;"
[9:59pm] uberj: should 'o' be replaced with 'w'?
[9:59pm] uberj: int thatValue = ((IntWritable)w).value;
{code}
Attaching patch for s/w/o.
- HADOOP-7509.
Trivial improvement reported by raviprak and fixed by raviprak
Improve message when Authentication is required
Thanks Aaron and Suresh!
Marking as resolved fixed since changes have gone in.
- HADOOP-7508.
Major sub-task reported by tucu00 and fixed by tucu00 (build)
compiled nativelib is in wrong directory and it is not picked up by surefire setup
The location of the compiled native libraries differs from the one the surefire plugin (which runs the testcases) is configured to use.
This causes testcases that use nativelibs to fail to load them.
- HADOOP-7507.
Major bug reported by jwfbean and fixed by tucu00 (metrics)
jvm metrics all use the same namespace
JVM metrics published to Ganglia now include the process name as part of the gmetric name.
- HADOOP-7502.
Major sub-task reported by vicaya and fixed by vicaya
Use canonical (IDE friendly) generated-sources directory for generated sources
- HADOOP-7501.
Major sub-task reported by tucu00 and fixed by tomwhite (build)
publish Hadoop Common artifacts (post HADOOP-6671) to Apache SNAPSHOTs repo
A *distributionManagement* section must be added to the hadoop-project POM with the SNAPSHOTs section, then 'mvn deploy' will push the artifacts to it.
- HADOOP-7499.
Major bug reported by naisbitt and fixed by naisbitt (util)
Add method for doing a sanity check on hostnames in NetUtils
As part of MAPREDUCE-2489, we need a method in NetUtils to do a sanity check on hostnames
- HADOOP-7498.
Major sub-task reported by tucu00 and fixed by tucu00 (build)
Remove legacy TAR layout creation
Currently the build creates 2 different tarball layouts.
One is the legacy one, the layout used until 0.22 (ant tar & mvn package -Ptar)
The other is the new one, the layout used in trunk that mimics the Unix layout (ant binary & mvn package -Pbintar).
The legacy layout is of no use as all the scripts have been modified to work with the new layout only.
We should thus remove the legacy layout generation.
In addition we could rename the current 'bintar' to just 'tar'
- HADOOP-7496.
Major sub-task reported by tucu00 and fixed by tucu00 (build)
break Maven TAR & bintar profiles into just LAYOUT & TAR proper
Currently the tar & bintar profiles each create the layout and then create the tarball.
For development it would be convenient to break them into layout and tar, thus not having to pay the overhead of TARing up.
- HADOOP-7493.
Major new feature reported by umamaheswararao and fixed by umamaheswararao (io)
[HDFS-362] Provide ShortWritable class in hadoop.
As part of HDFS-362, Provide the ShortWritable class.
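For orientation, a stand-alone sketch of what a short-valued writable might look like is given below. It mirrors the write/readFields shape of Hadoop's Writable contract using only JDK types; the class name and the `roundTrip` helper are hypothetical, and the real ShortWritable may differ.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Stand-alone sketch; the real class would implement Hadoop's
// WritableComparable interface rather than only Comparable.
class ShortValueSketch implements Comparable<ShortValueSketch> {
    private short value;

    ShortValueSketch() { }

    ShortValueSketch(short value) { this.value = value; }

    short get() { return value; }

    // Serializes the value as two bytes.
    void write(DataOutput out) throws IOException {
        out.writeShort(value);
    }

    // Restores the value from two bytes.
    void readFields(DataInput in) throws IOException {
        value = in.readShort();
    }

    @Override
    public int compareTo(ShortValueSketch o) {
        return Short.compare(value, o.value);
    }

    // Writes then re-reads a value through an in-memory buffer.
    static short roundTrip(short v) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            new ShortValueSketch(v).write(new DataOutputStream(bytes));
            ShortValueSketch copy = new ShortValueSketch();
            copy.readFields(new DataInputStream(
                new ByteArrayInputStream(bytes.toByteArray())));
            return copy.get();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```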
- HADOOP-7491.
Major improvement reported by eli and fixed by eli (scripts)
hadoop command should respect HADOOP_OPTS when given a class name
When using the hadoop command, the HADOOP_OPTS and HADOOP_CLIENT_OPTS options are not passed through.
- HADOOP-7474.
Major improvement reported by jnp and fixed by jnp
Refactor ClientCache out of WritableRpcEngine.
This jira captures the changes in common corresponding to MAPREDUCE-2707.
Moving ClientCache out into its own class makes sense because it can be used by other RpcEngine implementations as well.
- HADOOP-7472.
Minor improvement reported by kihwal and fixed by kihwal (ipc)
RPC client should deal with the IP address changes
The current RPC client implementation and the client-side callers assume that the hostname-address mappings of servers never change. The resolved address is stored in an immutable InetSocketAddress object above/outside RPC, and the reconnect logic in the RPC Connection implementation also trusts the resolved address that was passed down.
If the NN suffers a failure that requires migration, it may be started on a different node with a different IP address. In this case, even if the name-addre...
- HADOOP-7471.
Major bug reported by tucu00 and fixed by tucu00 (build)
the saveVersion.sh script sometimes fails to extract SVN URL
When using an SVN checkout of the source, sometimes the {{svn info}} command outputs a 'Copied from URL: ###' line in addition to the 'URL: ###' line.
This breaks the saveVersion.sh script, which assumes there is only one line in the output of {{svn info}} that contains the word URL.
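The matching rule that avoids the ambiguity is to accept only a line that starts with `URL: `, not any line that contains the word. saveVersion.sh is a shell script, so the Java sketch below only illustrates that rule; the helper name is hypothetical and this is not the actual patch.

```java
class SvnInfoSketch {
    // Returns the value of the line that starts with "URL: ", ignoring
    // lines such as "Copied from URL: ..." that merely contain the word.
    static String extractUrl(String svnInfoOutput) {
        for (String line : svnInfoOutput.split("\\r?\\n")) {
            if (line.startsWith("URL: ")) {
                return line.substring("URL: ".length()).trim();
            }
        }
        return null; // no URL line found
    }
}
```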
- HADOOP-7469.
Minor sub-task reported by stevel@apache.org and fixed by stevel@apache.org (util)
add a standard handler for socket connection problems which improves diagnostics
connection refused, connection timed out, no route to host, etc, are classic IOExceptions that can be raised in a lot of parts of the code. The standard JDK exceptions are useless for debugging as they
# don't include the destination (host, port) that can be used in diagnosing service dead/blocked problems
# don't include any source hostname that can be used to handle routing issues
# assume the reader understands the TCP stack.
It's obvious from the -user lists that a lot of people hit thes...
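A minimal sketch of the idea, assuming a hypothetical helper named ConnectionDiagnostics (the class, method, and message format are illustrative and not taken from the Hadoop source). It rebuilds a connection exception of the same broad kind with the destination and local host in the message, and keeps the original as the cause:

```java
import java.io.IOException;
import java.net.ConnectException;
import java.net.NoRouteToHostException;
import java.net.SocketTimeoutException;

/** Hypothetical helper; names and message format are assumptions. */
class ConnectionDiagnostics {

  /**
   * Returns a new exception of the same broad kind as the original, with the
   * destination (host, port) and the local host added to the message. The
   * original exception is preserved as the cause.
   */
  static IOException wrap(String destHost, int destPort,
                          String localHost, IOException e) {
    String detail = "Call from " + localHost + " to " + destHost + ":"
        + destPort + " failed: " + e;
    IOException wrapped;
    if (e instanceof ConnectException) {
      wrapped = new ConnectException(detail);
    } else if (e instanceof NoRouteToHostException) {
      wrapped = new NoRouteToHostException(detail);
    } else if (e instanceof SocketTimeoutException) {
      wrapped = new SocketTimeoutException(detail);
    } else {
      wrapped = new IOException(detail);
    }
    // None of the constructors above set a cause, so initCause is legal here.
    wrapped.initCause(e);
    return wrapped;
  }
}
```

Keeping the exception type means existing catch blocks for ConnectException and similar classes would continue to work.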
- HADOOP-7465.
Trivial sub-task reported by xiexianshan and fixed by xiexianshan (fs, ipc)
Several tiny improvements for the LOG format
Several log messages are missing space characters.
For instance:
src/java/org/apache/hadoop/ipc/Client.java(248): LOG.debug("The ping interval is" + this.pingInterval + "ms.");
src/java/org/apache/hadoop/fs/LocalDirAllocator.java(235): LOG.warn( localDirs[i] + "is not writable\n", de);
- HADOOP-7463.
Minor improvement reported by mahadev and fixed by mahadev
Adding a configuration parameter to SecurityInfo interface.
HADOOP-6929 made implementations/providers of SecurityInfo configurable via service class loaders. To add security to TunnelProtocols, configuration is needed to figure out which particular interface getKerberosInfo is called for. The class name alone is not enough, since it's always TunnelProtocol for all the interfaces. I propose adding a config to getKerberosInfo, so that it's easy for TunnelProtocols to get the information they need.
- HADOOP-7460.
Major improvement reported by dhruba and fixed by usmanm (fs)
Support for pluggable Trash policies
It would be beneficial to make the Trash policy pluggable. One primary use-case for this is to archive files (in some remote store) when they get removed by Trash emptier.
- HADOOP-7457.
Blocker improvement reported by jghoman and fixed by jghoman (documentation)
Remove out-of-date Chinese language documentation
The Chinese language documentation hasn't been updated (other than copyright years and svn moves) since its original contribution several years ago. Worse than no docs is out-of-date, wrong docs. We should delete them from the source tree.
- HADOOP-7451.
Major improvement reported by mattf and fixed by mattf
merge for MR-279: Generalize StringUtils#join
Fix incomplete merge from yahoo-merge branch to trunk:
-r 1079167: Generalize StringUtils::join (Chris Douglas)
- HADOOP-7449.
Major improvement reported by mattf and fixed by mattf
merge for MR-279: add Data(In,Out)putByteBuffer to work with ByteBuffer similar to Data(In,Out)putBuffer for byte[]
Fix incomplete merge from yahoo-merge branch to trunk:
-r 1079163: Added Data(In,Out)putByteBuffer to work with ByteBuffer similar to Data(In,Out)putBuffer for byte[]. (Chris Douglas)
- HADOOP-7448.
Major improvement reported by mattf and fixed by mattf
merge for MR-279: HttpServer /stacks servlet should use plain text content type
Fix incomplete merge from yahoo-merge branch to trunk:
-r 1079157: Fix content type for /stacks servlet (Luke Lu)
-r 1079164: No need to escape plain text (Luke Lu)
- HADOOP-7446.
Major improvement reported by tlipcon and fixed by tlipcon (native)
Implement CRC32C native code using SSE4.2 instructions
Once HADOOP-7445 is implemented, we can get further performance improvements by implementing CRC32C using the hardware support available in SSE4.2. This support should be dynamically enabled based on CPU feature flags, and of course should be ifdeffed properly so that it doesn't break the build on architectures/platforms where it's not available.
- HADOOP-7445.
Major improvement reported by tlipcon and fixed by tlipcon (native, util)
Implement bulk checksum verification using efficient native code
Once HADOOP-7444 is implemented ("bulk" API for checksums), good performance gains can be had by implementing bulk checksum operations using JNI. This JIRA is to add checksum support to the native libraries. Of course if native libs are not available, it will still fall back to the pure-Java implementations.
- HADOOP-7444.
Major improvement reported by tlipcon and fixed by tlipcon
Add Checksum API to verify and calculate checksums "in bulk"
Currently, the various checksum types only provide the capability to calculate the checksum of a range of a byte array. For HDFS-2080, it's advantageous to provide an API that, given a buffer with some number of "checksum chunks", can either calculate or verify the checksums of all of the chunks. For example, given a 4KB buffer and a 512-byte chunk size, it would calculate or verify 8 CRC32s in one call.
This allows efficient JNI-based checksum implementations since the cost of crossing the ...
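The "bulk" idea can be sketched in plain Java as follows. This is only an illustration using java.util.zip.CRC32; the class and method names (BulkChecksum, calculateChunked, verifyChunked) are hypothetical and are not the API added by this JIRA:

```java
import java.util.zip.CRC32;

/** Hypothetical sketch of chunked ("bulk") checksum calculation. */
class BulkChecksum {

  /** Computes one CRC32 per chunk; a trailing partial chunk gets its own. */
  static int[] calculateChunked(byte[] data, int bytesPerChunk) {
    if (bytesPerChunk <= 0) {
      throw new IllegalArgumentException("bytesPerChunk must be positive");
    }
    int chunks = (data.length + bytesPerChunk - 1) / bytesPerChunk;
    int[] sums = new int[chunks];
    CRC32 crc = new CRC32();
    for (int i = 0; i < chunks; i++) {
      int off = i * bytesPerChunk;
      int len = Math.min(bytesPerChunk, data.length - off);
      crc.reset();
      crc.update(data, off, len);
      sums[i] = (int) crc.getValue();
    }
    return sums;
  }

  /** Returns the index of the first mismatching chunk, or -1 if all match. */
  static int verifyChunked(byte[] data, int bytesPerChunk, int[] expected) {
    int[] actual = calculateChunked(data, bytesPerChunk);
    if (actual.length != expected.length) {
      throw new IllegalArgumentException("checksum count mismatch");
    }
    for (int i = 0; i < actual.length; i++) {
      if (actual[i] != expected[i]) {
        return i;
      }
    }
    return -1;
  }
}
```

With a 4KB buffer and a 512-byte chunk size, calculateChunked returns 8 checksums from a single call, matching the example in the description.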
- HADOOP-7443.
Major new feature reported by tlipcon and fixed by tlipcon (io, util)
Add CRC32C as another DataChecksum implementation
CRC32C is another checksum very similar to our existing CRC32, but with a different polynomial. The chief advantage of this other polynomial is that SSE4.2 includes hardware support for its calculation. HDFS-2080 is the umbrella JIRA which proposes using this new polynomial to save substantial amounts of CPU.
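To illustrate that the two checksums differ only in polynomial, the following sketch runs both over the same input using the JDK classes java.util.zip.CRC32 and java.util.zip.CRC32C. Note the assumptions: CRC32C in the JDK requires Java 9 or later, and this is not the DataChecksum implementation from this JIRA; the class name ChecksumPolynomials is hypothetical:

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;
import java.util.zip.CRC32C;
import java.util.zip.Checksum;

/** Hypothetical demo class comparing CRC32 and CRC32C on the same bytes. */
class ChecksumPolynomials {

  /** Runs the given checksum algorithm over the whole array. */
  static long checksum(Checksum algorithm, byte[] data) {
    algorithm.reset();
    algorithm.update(data, 0, data.length);
    return algorithm.getValue();
  }

  public static void main(String[] args) {
    // "123456789" is the conventional check input for CRC algorithms.
    byte[] data = "123456789".getBytes(StandardCharsets.US_ASCII);
    System.out.printf("CRC32  = %08x%n", checksum(new CRC32(), data));
    System.out.printf("CRC32C = %08x%n", checksum(new CRC32C(), data));
  }
}
```

For the conventional check input "123456789", CRC32 yields 0xCBF43926 and CRC32C yields 0xE3069283, so the two values are not interchangeable on disk or on the wire.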
- HADOOP-7442.
Major bug reported by atm and fixed by atm (conf, documentation)
Docs in core-default.xml still reference deprecated config "topology.script.file.name"
HADOOP-6233 renamed the config "{{topology.script.file.name}}" to "{{net.topology.script.file.name}}" but missed a few spots in the docs of core-default.xml.
- HADOOP-7440.
Major bug reported by tlipcon and fixed by tlipcon
HttpServer.getParameterValues throws NPE for missing parameters
If the requested parameter was not specified in the request, the raw request's getParameterValues function returns null. Thus, trying to access {{unquoteValue.length}} throws NPE.
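A minimal sketch of the null guard, assuming a hypothetical ParameterUnquoting class with a stand-in unquote method (neither is the actual HttpServer code). Returning null for an absent parameter is an assumption that mirrors the servlet contract:

```java
/** Hypothetical sketch; not the actual HttpServer implementation. */
class ParameterUnquoting {

  /** Stand-in for the real unquoting logic. "&amp;" is handled last. */
  static String unquote(String s) {
    return s.replace("&lt;", "<").replace("&gt;", ">").replace("&amp;", "&");
  }

  /**
   * Unquotes every value, or returns null when the parameter was absent
   * from the request (the raw getParameterValues call returned null).
   */
  static String[] unquoteAll(String[] rawValues) {
    if (rawValues == null) {
      return null; // guard: never dereference a missing parameter
    }
    String[] result = new String[rawValues.length];
    for (int i = 0; i < rawValues.length; i++) {
      result[i] = unquote(rawValues[i]);
    }
    return result;
  }
}
```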
- HADOOP-7438.
Major improvement reported by raviprak and fixed by raviprak
Using the hadoop-daemon.sh script to start nodes leads to a deprecated warning
hadoop-daemon.sh calls common/bin/hadoop for hdfs/bin/hdfs tasks, so common/bin/hadoop complains that it's deprecated for those uses.
- HADOOP-7437.
Major bug reported by umamaheswararao and fixed by umamaheswararao (io)
IOUtils.copyBytes will suppress the stream closure exceptions.
{code}
public static void copyBytes(InputStream in, OutputStream out, long count,
    boolean close) throws IOException {
  byte buf[] = new byte[4096];
  long bytesRemaining = count;
  int bytesRead;
  try {
    .............
    .............
  } finally {
    if (close) {
      closeStream(out);
      closeStream(in);
    }
  }
}
{code}
If an exception occurs while closing a stream, it is suppressed here.
So, better to follow the stream closure pattern ...
- HADOOP-7434.
Minor improvement reported by yanjinshuang and fixed by yanjinshuang
Display error when using "daemonlog -setlevel" with illegal level
When the command is used with a nonexistent level such as "nomsg", no error message is displayed, and the level "DEBUG" is set by default.
- HADOOP-7430.
Minor improvement reported by raviprak and fixed by raviprak (fs)
Improve error message when moving to trash fails due to quota issue
-rm command doesn't suggest -skipTrash on failure.
- HADOOP-7428.
Major bug reported by tlipcon and fixed by tlipcon (ipc)
IPC connection is orphaned with null 'out' member
We had a situation where a JT ended up in a state where a certain user could not submit a job, due to an NPE on the following line in {{sendParam}}:
{code}
synchronized (Connection.this.out) {
{code}
Looking at the code, my guess is that an RTE was thrown in setupIOstreams, which only catches IOE. This could leave the connection in a half-setup state which is never cleaned up and also cannot perform IPCs.
- HADOOP-7419.
Major bug reported by tlipcon and fixed by bzheng
new hadoop-config.sh doesn't manage classpath for HADOOP_CONF_DIR correctly
Since the introduction of the RPM packages, hadoop-config.sh incorrectly puts $HADOOP_HDFS_HOME/conf on the classpath regardless of whether HADOOP_CONF_DIR is already defined in the environment.
- HADOOP-7402.
Trivial bug reported by atm and fixed by atm (test)
TestConfiguration doesn't clean up after itself
{{testGetFile}} and {{testGetLocalPath}} both create directories a, b, and c in the working directory from where the tests are run. They should clean up after themselves.
- HADOOP-7392.
Major improvement reported by tanping and fixed by tanping
Implement capability of querying individual property of a mbean using JMXProxyServlet
HADOOP-7144 provides the capability to query all the properties of an MBean using JMXProxyServlet. In addition to this, we add the capability to query an individual property of an MBean. The client sends an HTTP request of the form
http://hostname/jmx?get=beanName::property
to query from server.
- HADOOP-7389.
Major bug reported by atm and fixed by atm (test)
Use of TestingGroups by tests causes subsequent tests to fail
As mentioned in HADOOP-6671, {{UserGroupInformation.createUserForTesting(...)}} manipulates static state which can cause test cases which are run after a call to this function to fail.
- HADOOP-7385.
Minor bug reported by bharathm and fixed by bharathm
Remove StringUtils.stringifyException(ie) in logger functions
Apache logger api has an overloaded function which can take the message and exception. I am proposing to clean the logging code with this api.
For example, change the code from LOG.warn(msg, StringUtils.stringifyException(exception)); to LOG.warn(msg, exception);
- HADOOP-7384.
Major improvement reported by tlipcon and fixed by tlipcon
Allow test-patch to be more flexible about patch format
Right now the test-patch process only accepts patches that are generated as "-p0" relative to common/, hdfs/, or mapreduce/. This has always been annoying for git users where the default patch format is -p1. It's also now annoying for SVN users who may generate a patch relative to trunk/ instead of the subproject subdirectory. We should auto-detect the correct patch level.
- HADOOP-7383.
Blocker bug reported by tlipcon and fixed by tlipcon (build)
HDFS needs to export protobuf library dependency in pom
MR builds are failing since the HDFS protobuf patch went in, since they aren't picking up protobuf as a transitive dependency. I think we just need to add it to the HDFS pom template.
- HADOOP-7380.
Major sub-task reported by atm and fixed by atm (ipc)
Add client failover functionality to o.a.h.io.(ipc|retry)
Implementing client failover will likely require changes to {{o.a.h.io.ipc}} and/or {{o.a.h.io.retry}}. This JIRA is to track those changes.
- HADOOP-7379.
Major improvement reported by tlipcon and fixed by tlipcon (io, ipc)
Add ability to include Protobufs in ObjectWritable
Protocol buffer-generated types may now be used as arguments or return values for Hadoop RPC.
- HADOOP-7377.
Major bug reported by daryn and fixed by daryn (fs)
Fix command name handling affecting DFSAdmin
When an error occurs in the get/set quota commands in DFSAdmin, they are displaying the following:
setQuota: failed to get SetQuotaCommand.NAME
The {{Command}} class expects the {{NAME}} field to be accessible, but for DFSAdmin, it's not.
- HADOOP-7375.
Major improvement reported by sanjay.radia and fixed by sanjay.radia
Add resolvePath method to FileContext
- HADOOP-7374.
Major improvement reported by eli and fixed by eli (scripts)
Don't add tools.jar to the classpath when running Hadoop
The scripts that run Hadoop no longer automatically add tools.jar from the JDK to the classpath (if it is present). If your job depends on tools.jar in the JDK you will need to add this dependency in your job.
- HADOOP-7361.
Minor improvement reported by umamaheswararao and fixed by umamaheswararao (fs)
Provide overwrite option (-overwrite/-f) in put and copyFromLocal command line options
FileSystem has the API
*public void copyFromLocalFile(boolean delSrc, boolean overwrite, Path[] srcs, Path dst)*
This API provides an overwrite option, but the corresponding command line does not. For consistency and ease of use, the put command should also support an overwrite option to force the copy (put [-f] <srcpath> <dstPath>), as should the copyFromLocal command.
- HADOOP-7360.
Major improvement reported by daryn and fixed by kihwal (fs)
FsShell does not preserve relative paths with globs
FsShell currently preserves relative paths that do not contain globs. Unfortunately the method {{fs.globStatus()}} is fully qualifying all returned paths. This is causing inconsistent display of paths.
- HADOOP-7357.
Trivial bug reported by philip and fixed by philip (test)
hadoop.io.compress.TestCodec#main() should exit with non-zero exit code if test failed
It's convenient to run something like
{noformat}
HADOOP_CLASSPATH=hadoop-test-0.20.2.jar bin/hadoop org.apache.hadoop.io.compress.TestCodec -count 3 -codec fo
{noformat}
but the error code it returns isn't interesting.
1-line patch attached fixes that.
- HADOOP-7356.
Blocker bug reported by eyang and fixed by eyang
RPM packages broke bin/hadoop script for hadoop 0.20.205
hadoop-config.sh has been moved to libexec for the binary package, but developers prefer to have hadoop-config.sh in bin. Hadoop shell scripts should be modified to support both scenarios.
- HADOOP-7353.
Major bug reported by daryn and fixed by daryn (fs)
Cleanup FsShell and prevent masking of RTE stacktraces
{{FsShell}}'s top level exception handler catches and displays exceptions. Unfortunately it displays only the first line of an exception, which means an unexpected {{RuntimeException}} like {{NullPointerException}} only displays "{{cmd: NullPointerException}}". The user has no context to understand and/or accurately report the issue.
Found due to bugs such as {{HADOOP-7327}}.
- HADOOP-7342.
Minor bug reported by bharathm and fixed by bharathm
Add a utility API in FileUtil for JDK File.list
Java File.list API can return null when disk is bad or directory is not a directory. This utility API in FileUtil will throw an exception when this happens rather than returning null.
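The behavior described above can be sketched as follows. The class name SafeFileListing and the exception message are assumptions for illustration; only java.io.File.list() and its documented null return are taken as given:

```java
import java.io.File;
import java.io.IOException;

/** Hypothetical sketch of a null-safe wrapper around File.list(). */
class SafeFileListing {

  /**
   * Lists the entries of a directory. File.list() returns null when the
   * path is not a directory or an I/O error occurs; this wrapper turns
   * that null into an IOException so callers cannot hit an NPE later.
   */
  static String[] list(File dir) throws IOException {
    String[] names = dir.list();
    if (names == null) {
      throw new IOException("Invalid directory or I/O error for: " + dir);
    }
    return names;
  }
}
```

Callers can then iterate over the result directly without a separate null check.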
- HADOOP-7341.
Major bug reported by daryn and fixed by daryn (fs)
Fix option parsing in CommandFormat
CommandFormat currently allows options in any location within the args. This is not the intended behavior for FsShell commands. Prior to the redesign, the commands used to expect option processing to stop at the first non-option.
CommandFormat was an existing class prior to the redesign, but it was only used by "count" to find the -q flag. All commands were converted to use this class, and thus inherited the unintended behavior.
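The intended behavior, where option processing stops at the first non-option, can be sketched like this. LeadingOptionParser is a hypothetical class written for illustration and is not the CommandFormat implementation:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch: options are only recognized before the first operand. */
class LeadingOptionParser {
  final Set<String> options = new HashSet<>();
  final List<String> positional = new ArrayList<>();

  LeadingOptionParser(List<String> args) {
    int i = 0;
    // Consume "-x" style flags until the first argument that is not a flag.
    // A bare "-" is treated as an operand, not as an option.
    while (i < args.size()) {
      String arg = args.get(i);
      if (arg.length() < 2 || arg.charAt(0) != '-') {
        break;
      }
      options.add(arg.substring(1));
      i++;
    }
    // Everything after that point is positional, even if it starts with '-'.
    positional.addAll(args.subList(i, args.size()));
  }
}
```

Under this sketch, the arguments -q /a -h produce the option q and the two operands /a and -h.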
- HADOOP-7337.
Minor improvement reported by szetszwo and fixed by szetszwo (util)
Annotate PureJavaCrc32 as a public API
The API of PureJavaCrc32 is stable. It is incorrect to annotate it as private unstable.
- HADOOP-7336.
Minor bug reported by jnp and fixed by jnp
TestFileContextResolveAfs will fail with default test.build.data property.
In TestFileContextResolveAfs if test.build.data property is not set and default is used, the test case will try to create that in the root directory and that will fail. /tmp should be used as default as in many other test cases. Normally, test.build.data will be set and this issue should not occur.
- HADOOP-7333.
Minor improvement reported by ecaspole and fixed by ecaspole (util)
Performance improvement in PureJavaCrc32
I would like to propose a small patch to
org.apache.hadoop.util.PureJavaCrc32.update(byte[] b, int off, int len)
Currently the method stores the intermediate result back into the data member "crc". I noticed this method gets inlined into DataChecksum.update() and that method appears as one of the hotter methods in a simple hprof profile collected while running terasort and gridmix.
If the code is modified to save the temporary result into a local and just once store the final result bac...
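The pattern being proposed can be illustrated with a simple table-driven CRC32. This is a sketch only: LocalAccumulatorCrc32 is a hypothetical class using the standard one-byte-at-a-time algorithm, not the actual PureJavaCrc32 code or its patch:

```java
/** Hypothetical sketch: keep the running CRC in a local inside the hot loop. */
class LocalAccumulatorCrc32 {
  private static final int[] TABLE = new int[256];

  static {
    // Standard reflected CRC-32 table (polynomial 0xEDB88320).
    for (int n = 0; n < 256; n++) {
      int c = n;
      for (int k = 0; k < 8; k++) {
        c = (c & 1) != 0 ? (c >>> 1) ^ 0xEDB88320 : c >>> 1;
      }
      TABLE[n] = c;
    }
  }

  /** Running state, stored in its pre-inversion form. */
  private int crc = 0xFFFFFFFF;

  public void update(byte[] b, int off, int len) {
    int localCrc = crc;                 // read the field once
    for (int i = off; i < off + len; i++) {
      localCrc = TABLE[(localCrc ^ b[i]) & 0xFF] ^ (localCrc >>> 8);
    }
    crc = localCrc;                     // write the field once
  }

  public long getValue() {
    return (~crc) & 0xFFFFFFFFL;
  }

  public void reset() {
    crc = 0xFFFFFFFF;
  }
}
```

The loop body touches only a local variable, so the field is loaded and stored once per call rather than once per byte.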
- HADOOP-7331.
Trivial improvement reported by tanping and fixed by tanping (scripts)
Make hadoop-daemon.sh to return 1 if daemon processes did not get started
hadoop-daemon.sh now returns a non-zero exit code if it detects that the daemon is no longer running after 3 seconds.
- HADOOP-7329.
Minor improvement reported by xiexianshan and fixed by xiexianshan (fs)
incomplete help message is displayed for df -h option
The help message for the command "hdfs dfs -help df" is displayed like this:
"-df [<path> ...]: Shows the capacity, free and used space of the filesystem.
If the filesystem has multiple partitions, and no path to a
particular partition is specified, then the status of the root
partitions will be shown."
and the information about the df -h option is missing, despite the fact that the df -h option is implemented.
Therefore, the expected message should be displayed like this:
"-...
- HADOOP-7327.
Minor bug reported by mattf and fixed by mattf (fs)
FileSystem.listStatus() throws NullPointerException instead of IOException upon access permission failure
Many processes that call listStatus() expect to handle IOException, but instead get the runtime error NullPointerException if the directory being scanned is visible but not accessible to the running user id. For example, if directory foo is drwxr-xr-x, and subdirectory foo/bar is drwx------, then trying to do listStatus(Path(foo/bar)) will cause a NullPointerException.
- HADOOP-7324.
Blocker bug reported by vicaya and fixed by priyomustafi (metrics)
Ganglia plugins for metrics v2
Although all metrics in metrics v2 are exposed via the standard JMX mechanisms, most users are using Ganglia to collect metrics.
- HADOOP-7322.
Minor bug reported by bharathm and fixed by bharathm
Adding a util method in FileUtil for JDK File.listFiles
Use of this new utility method avoids null result from File.listFiles(), and consequent NPEs.
- HADOOP-7320.
Major improvement reported by daryn and fixed by daryn
Refactor FsShell's copy & move commands
Need to refactor the move and copy commands to conform to the FsCommand class.
- HADOOP-7316.
Major improvement reported by jmhsieh and fixed by eli (documentation)
Add public javadocs to FSDataInputStream and FSDataOutputStream
This is a method made public for testing. In comments in HADOOP-7301 after commit, adding javadoc comments was requested. This is a follow up jira to address it.
- HADOOP-7314.
Major improvement reported by naisbitt and fixed by naisbitt
Add support for throwing UnknownHostException when a host doesn't resolve
As part of MAPREDUCE-2489, we need support for having the resolve methods (for DNS mapping) throw UnknownHostExceptions. (Currently, they hide the exception). Since the existing 'resolve' method is ultimately used by several other locations/components, I propose we add a new 'resolveValidHosts' method.
- HADOOP-7306.
Major improvement reported by vicaya and fixed by vicaya (metrics)
Start metrics system even if config files are missing
Per experience and discussion with HDFS-1922, it seems preferable to treat missing metrics config file as empty/default config, which is more compatible with metrics v1 behavior (the MBeans are always registered.)
- HADOOP-7305.
Minor improvement reported by nielsbasjes and fixed by nielsbasjes (build)
Eclipse project files are incomplete
Added missing library during creation of the eclipse project files.
- HADOOP-7301.
Major improvement reported by jmhsieh and fixed by jmhsieh
FSDataInputStream should expose a getWrappedStream method
Ideally FSDataInputStream should expose a getWrappedStream method similarly to how FSDataOutputStream exposes a getWrappedStream method. Exposing this is useful for verifying correctness in test cases. This FSDataInputStream type is the class that the o.a.h.fs.FileSystem.open call returns.
- HADOOP-7298.
Major test reported by tlipcon and fixed by tlipcon (test)
Add test utility for writing multi-threaded tests
A lot of our tests spawn off multiple threads in order to check various synchronization issues, etc. It's often tedious to write these kinds of tests because you have to manually propagate exceptions back to the main thread, etc.
In HBase we have developed a testing utility which makes writing these kinds of tests much easier. I'd like to copy that utility into Hadoop so we can use it here as well.
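The kind of utility described here might look roughly like the sketch below. ThreadTestContext and its methods are hypothetical names chosen for illustration and do not reproduce the HBase or Hadoop utility's API; the point is only that failures in worker threads are captured and rethrown on the main thread:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical sketch of a helper that propagates worker-thread failures. */
class ThreadTestContext {
  private final List<Thread> threads = new ArrayList<>();
  private final AtomicReference<Throwable> firstError = new AtomicReference<>();

  /** Registers a body to run in its own thread; failures are recorded. */
  void addThread(Runnable body) {
    threads.add(new Thread(() -> {
      try {
        body.run();
      } catch (Throwable t) {
        firstError.compareAndSet(null, t); // remember only the first failure
      }
    }));
  }

  /** Starts all threads, waits for them, and rethrows the first failure. */
  void runAll() throws Exception {
    for (Thread t : threads) {
      t.start();
    }
    for (Thread t : threads) {
      t.join();
    }
    Throwable t = firstError.get();
    if (t != null) {
      throw new Exception("A test thread failed", t);
    }
  }
}
```

A test would add its worker bodies with addThread and then call runAll, so that an assertion failure in any worker fails the test instead of being lost.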
- HADOOP-7292.
Minor bug reported by vicaya and fixed by vicaya (metrics)
Metrics 2 TestSinkQueue is racy
The TestSinkQueue is racy (Thread.yield is not enough to guarantee that the other intended thread gets run), though this is the first time (from HADOOP-7289) I have seen it manifest here.
- HADOOP-7289.
Major improvement reported by szetszwo and fixed by eyang (build)
ivy: test conf should not extend common conf
Otherwise, the same jars will appear in both {{build/ivy/lib/Hadoop-Common/common/}} and {{build/ivy/lib/Hadoop-Common/test/}}.
- HADOOP-7287.
Blocker bug reported by tlipcon and fixed by atm (conf)
Configuration deprecation mechanism doesn't work properly for GenericOptionsParser/Tools
For example, you can't use -D options on the "hadoop fs" command line in order to specify the deprecated names of configuration options. The issue is that the ordering is:
- JVM starts
- GenericOptionsParser creates a Configuration object and calls set() for each of the options specified on command line
- DistributedFileSystem or other class eventually instantiates HdfsConfiguration which adds the deprecations
- Some class calls conf.get("new key") and sees the default instead of the version ...
- HADOOP-7286.
Major improvement reported by daryn and fixed by daryn (fs)
Refactor FsShell's du/dus/df
The "Found X items" header on the output of the "du" command has been removed to more closely match unix. The displayed paths now correspond to the command line arguments instead of always being a fully qualified URI. For example, the output will have relative paths if the command line arguments are relative paths.
- HADOOP-7285.
Major improvement reported by daryn and fixed by daryn (fs)
Refactor FsShell's test
Need to refactor to conform to FsCommand subclass.
- HADOOP-7284.
Major bug reported by sanjay.radia and fixed by sanjay.radia
Trash and shell's rm does not work for viewfs
- HADOOP-7282.
Major bug reported by johnvijoe and fixed by johnvijoe (ipc)
getRemoteIp could return null in cases where the call is ongoing but the ip went away.
getRemoteIp gets the ip from socket instead of the stored ip in Connection object. Thus calls to this function could return null when a client disconnected, but the rpc call is still ongoing...
- HADOOP-7276.
Major bug reported by scurrilous and fixed by scurrilous (native)
Hadoop native builds fail on ARM due to -m32
The native build fails on machine targets where gcc does not support -m32. This is any target other than x86, SPARC, RS/6000, or PowerPC, such as ARM.
$ ant -Dcompile.native=true
...
[exec] make all-am
[exec] make[1]: Entering directory
`/home/trobinson/dev/hadoop-common/build/native/Linux-arm-32'
[exec] /bin/bash ./libtool --tag=CC --mode=compile gcc
-DHAVE_CONFIG_H -I. -I/home/trobinson/dev/hadoop-common/src/native
-I/usr/lib/jvm/java-6-openjdk/include
-I/usr/lib/jvm/jav...
- HADOOP-7275.
Major improvement reported by daryn and fixed by daryn (fs)
Refactor FsShell's stat
Refactor to conform to the FsCommand class.
- HADOOP-7272.
Major improvement reported by sureshms and fixed by sureshms (ipc, security)
Remove unnecessary security related info logs
Two info logs are printed whenever a connection to the RPC server is established, which is unnecessary. On a production cluster, these log lines made up close to 50% of the lines in the namenode log. I propose changing them into debug logs.
- HADOOP-7271.
Major improvement reported by daryn and fixed by daryn (fs)
Standardize error messages
The FsShell commands have no standard format for the same error message. For instance, here is a snippet of the variations of just one of many error messages:
cmd: $path: No such file or directory
cmd: cannot stat `$path': No such file or directory
cmd: Can not find listing for $path
cmd: Cannot access $path: No such file or directory.
cmd: No such file or directory `$path'
cmd: File does not exist: $path
cmd: File $path does not exist
... etc ...
These need to be common.
- HADOOP-7268.
Major bug reported by devaraj and fixed by jnp (fs, security)
FileContext.getLocalFSFileContext() behavior needs to be fixed w.r.t tokens
FileContext.getLocalFSFileContext() instantiates a FileContext object upon the first call to it, and for all subsequent calls returns back that instance (a static localFsSingleton object). With security turned on, this causes some hard-to-debug situations when that fileContext is used for doing HDFS operations. This is because the UserGroupInformation is stored when a FileContext is instantiated. If the process in question wishes to use different UserGroupInformation objects for different fil...
- HADOOP-7267.
Major improvement reported by daryn and fixed by daryn (fs)
Refactor FsShell's rm/rmr/expunge
Refactor to conform to the FsCommand class.
- HADOOP-7265.
Major improvement reported by daryn and fixed by daryn (fs)
Keep track of relative paths
As part of the effort to standardize the display of paths, the PathData tracks the exact string used to create a path. When obtaining a directory's contents, the relative nature of the original path should be preserved.
- HADOOP-7264.
Major improvement reported by vicaya and fixed by vicaya (io)
Bump avro version to at least 1.4.1
Needed by mapreduce 2.0 avro support. Maybe we could jump to Avro 1.5. There are incompatible API changes from 1.3.x to 1.4.x (Utf8 to CharSequence in user-facing APIs); not sure about 1.5.x though.
- HADOOP-7261.
Major bug reported by sureshms and fixed by sureshms (test)
Disable IPV6 for junit tests
IPV6 addresses are not currently handled by the common library methods. IPV6 can return an address as "0:0:0:0:0:0:port". Some utility methods, such as NetUtils#createSocketAddress(), NetUtils#normalizeHostName(), and NetUtils#getHostNameOfIp(), to name a few, do not handle IPV6 addresses and expect the address to be in host:port format.
Until IPV6 is formally supported, I propose disabling IPV6 for junit tests to avoid problems seen in HDFS-1891.
- HADOOP-7259.
Major bug reported by owen.omalley and fixed by owen.omalley (build)
contrib modules should include build.properties from parent.
Current build.properties in the hadoop root directory is not included by the contrib modules.
- HADOOP-7258.
Major bug reported by owen.omalley and fixed by owen.omalley
Gzip codec should not return null decompressors
In HADOOP-6315, the gzip codec was changed to return a null codec with the intent to disallow pooling of the decompressors. Rather than break the interface, we can use an annotation to achieve the goal.
- HADOOP-7257.
Major new feature reported by sanjay.radia and fixed by sanjay.radia
A client side mount table to give per-application/per-job file system view
viewfs - client-side mount table.
- HADOOP-7251.
Major improvement reported by daryn and fixed by daryn (fs)
Refactor FsShell's getmerge
Need to refactor getmerge to conform to new FsCommand class.
- HADOOP-7250.
Major improvement reported by daryn and fixed by daryn (fs)
Refactor FsShell's setrep
Need to refactor setrep to conform to new FsCommand class.
- HADOOP-7249.
Major improvement reported by daryn and fixed by daryn (fs)
Refactor FsShell's chmod/chown/chgrp
Need to refactor permissions commands to conform to new FsCommand class.
- HADOOP-7241.
Minor improvement reported by weiyj and fixed by weiyj (fs, test)
fix typo of command 'hadoop fs -help tail'
Fix the typo of command 'hadoop fs -help tail'.
$ hadoop fs -help tail
-tail [-f] <file>: Show the last 1KB of the file.
The -f option shows apended data as the file grows.
The "apended data" should be "appended data".
- HADOOP-7238.
Major improvement reported by daryn and fixed by daryn (fs)
Refactor FsShell's cat & text
Need to refactor cat & text to conform to new FsCommand class.
- HADOOP-7237.
Major improvement reported by daryn and fixed by daryn (fs)
Refactor FsShell's touchz
Need to refactor touchz to conform to new FsCommand class.
- HADOOP-7236.
Major improvement reported by daryn and fixed by daryn (fs)
Refactor FsShell's mkdir
Need to refactor mkdir to conform to new FsCommand class.
- HADOOP-7235.
Major improvement reported by daryn and fixed by daryn
Refactor FsShell's tail
Need to refactor tail to conform to new FsCommand class.
- HADOOP-7233.
Major improvement reported by daryn and fixed by daryn (fs)
Refactor FsShell's ls
Need to refactor ls to conform to new FsCommand class.
- HADOOP-7231.
Major bug reported by daryn and fixed by daryn (util)
Fix synopsis for -count
The synopsis for the count command is wrong.
1) missing a space in "-count[-q]"
2) missing ellipsis for multiple path args
- HADOOP-7230.
Major test reported by daryn and fixed by daryn (test)
Move -fs usage tests from hdfs into common
The -fs usage tests are in hdfs which causes an unnecessary synchronization of a common & hdfs bug when changing the text. The usages have no ties to hdfs, so they should be moved into common.
- HADOOP-7227.
Major improvement reported by jnp and fixed by jnp (ipc)
Remove protocol version check at proxy creation in Hadoop RPC.
1. Protocol version check is removed from proxy creation, instead version check is performed at server in every rpc call.
2. This change is backward incompatible because format of the rpc messages is changed to include client version, client method hash and rpc version.
3. rpc version is introduced which should change when the format of rpc messages is changed.
- HADOOP-7223.
Major bug reported by sureshms and fixed by sureshms (fs)
FileContext createFlag combinations during create are not clearly defined
During file creation with FileContext, the expected behavior is not clearly defined for combination of createFlag EnumSet.
- HADOOP-7216.
Major bug reported by atm and fixed by daryn (test)
HADOOP-7202 broke TestDFSShell in HDFS
The commit of HADOOP-7202 now requires that classes that extend {{FsCommand}} implement the {{void run(PathData)}} method. The {{Count}} class was changed to extend {{FsCommand}}, but renamed the {{run}} method and did not provide a replacement.
- HADOOP-7215.
Blocker bug reported by sureshms and fixed by sureshms (security)
RPC clients must connect over a network interface corresponding to the host name in the client's kerberos principal key
HDFS-7104 introduced a change where RPC server matches client's hostname with the hostname specified in the client's Kerberos principal name. RPC client binds the socket to a random local address, which might not match the hostname specified in the principal name. This results in an authorization failure of the client at the server.
- HADOOP-7214.
Major new feature reported by atm and fixed by atm
Hadoop /usr/bin/groups equivalent
Since user -> groups resolution is done on the NN and JT machines, there should be a way for users to determine what groups they're a member of from the NN's and JT's perspective.
- HADOOP-7210.
Major bug reported by umamaheswararao and fixed by umamaheswararao (fs)
Chown command is not working from FSShell.
chown command is not invoking the setOwner on FileSystem.
- HADOOP-7208.
Major bug reported by umamaheswararao and fixed by umamaheswararao
equals() and hashCode() implementation need to change in StandardSocketFactory
In Hadoop IPC Client, we are using ClientCache which will maintain the HashMap to keep the Client references.
private Map<SocketFactory, Client> clients =
new HashMap<SocketFactory, Client>();
Now let us say we want to use two standard factories with Hadoop: MyStandardSocketFactory (which extends StandardSocketFactory) and StandardSocketFactory. In this case, because of the equals and hashCode implementation, the MyStandardSocketFactory client can be overridden by the StandardSocketFactory client.
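One way to avoid the collision is to compare exact runtime classes in equals and derive hashCode from the class. The sketch below uses hypothetical stand-in classes (BaseFactory, CustomFactory) rather than the real socket factories, and is an assumption about the shape of a fix, not the committed change:

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of class-based equality for stateless factories. */
class SocketFactoryEquality {

  /** Stand-in for StandardSocketFactory. */
  static class BaseFactory {
    @Override
    public boolean equals(Object obj) {
      if (this == obj) {
        return true;
      }
      // Exact class comparison: a subclass instance is never equal to a
      // parent instance, so each gets its own map entry.
      return obj != null && obj.getClass() == getClass();
    }

    @Override
    public int hashCode() {
      return getClass().hashCode();
    }
  }

  /** Stand-in for MyStandardSocketFactory. */
  static class CustomFactory extends BaseFactory { }

  public static void main(String[] args) {
    Map<BaseFactory, String> clients = new HashMap<>();
    clients.put(new BaseFactory(), "base client");
    clients.put(new CustomFactory(), "custom client");
    System.out.println(clients.size() + " distinct cache entries");
  }
}
```

Two instances of the same class still compare equal, so a cache keyed this way keeps sharing one client per factory type.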
- HADOOP-7205.
Trivial improvement reported by daryn and fixed by daryn
automatically determine JAVA_HOME on OS X
OS X provides a java_home command that will return the user's selected jvm. The hadoop-env.sh should use this command if JAVA_HOME is not set.
- HADOOP-7202.
Major improvement reported by daryn and fixed by daryn
Improve Command base class
Need to extend the Command base class to allow all commands to easily subclass from a core set of code that correctly handles globs and exit codes.
- HADOOP-7194.
Major bug reported by devaraj.k and fixed by devaraj.k (io)
Potential Resource leak in IOUtils.java
{code:title=IOUtils.java|borderStyle=solid}
try {
copyBytes(in, out, buffSize);
} finally {
if(close) {
out.close();
in.close();
}
}
{code}
In the above code, if out.close() throws an exception, the in.close() statement will not execute and the input stream will not be closed.
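One way to guarantee both streams are closed is to nest the second close() in a finally block. The sketch below is an illustration under that assumption, not the committed patch; TrackedStream is a hypothetical helper that records whether close() was called.

```java
import java.io.Closeable;
import java.io.IOException;

class CloseBothDemo {
    // Hypothetical stream that records close() calls and can fail on close.
    static class TrackedStream implements Closeable {
        final boolean failOnClose;
        boolean closed = false;

        TrackedStream(boolean failOnClose) {
            this.failOnClose = failOnClose;
        }

        @Override
        public void close() throws IOException {
            closed = true;
            if (failOnClose) {
                throw new IOException("close failed");
            }
        }
    }

    // Closes 'out' first, and still closes 'in' if out.close() throws.
    static void closeBoth(Closeable out, Closeable in) throws IOException {
        try {
            out.close();
        } finally {
            in.close();
        }
    }

    public static void main(String[] args) {
        TrackedStream out = new TrackedStream(true);
        TrackedStream in = new TrackedStream(false);
        try {
            closeBoth(out, in);
        } catch (IOException e) {
            System.out.println("caught: " + e.getMessage());
        }
        System.out.println("in closed: " + in.closed);
    }
}
```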
- HADOOP-7193.
Minor improvement reported by umamaheswararao and fixed by umamaheswararao (fs)
Help message is wrong for touchz command.
Updated the help for the touchz command.
- HADOOP-7187.
Major bug reported by umamaheswararao and fixed by umamaheswararao (metrics)
Socket Leak in org.apache.hadoop.metrics.ganglia.GangliaContext
The init method creates a DatagramSocket, but the socket is never closed anywhere.
- HADOOP-7180.
Minor improvement reported by daryn and fixed by daryn (fs)
Improve CommandFormat
CommandFormat currently takes an array and offset for parsing and returns a list of arguments. It'd be much more convenient to have it process a list too. It would also be nice to differentiate between too few and too many args instead of the generic "Illegal number of arguments". Finally, CommandFormat is completely devoid of tests.
- HADOOP-7178.
Major bug reported by umamaheswararao and fixed by umamaheswararao (fs)
FileSystem should have an option to control the .crc file creations at Local.
When we copy files from DFS to local, .crc files are created in the local filesystem for checksum verification. When the user does not want any checksum verification, these files are not useful.
The command line already has an option, ignoreCrc, to control this.
So we should have a similar option on FileSystem as well, like fs.ignoreCrc().
This should set setVerifyChecksum to false and should also select the non-ChecksumFileSystem as the local fs.
- HADOOP-7177.
Trivial improvement reported by aw and fixed by aw (native)
CodecPool should report which compressor it is using
Certain native compression libraries are overly verbose causing confusion while reading the task logs. Let's actually say which compressor we got when we report it in the task logs.
- HADOOP-7175.
Major bug reported by daryn and fixed by daryn (fs)
Add isEnabled() to Trash
The moveToTrash method returns false in a number of cases. It's not possible to discern if false means an error occurred. In particular, it's not possible to know if the trash is disabled vs. an error occurred.
- HADOOP-7174.
Minor bug reported by umamaheswararao and fixed by umamaheswararao (fs)
null is displayed in the console,if the src path is invalid while doing copyToLocal operation from commandLine
When we perform a copyToLocal operation from the command line and the src path is invalid,
srcFS.globStatus(srcpath) returns null. So, when we take the length of the result, it will *throw a NullPointerException*.
Since we handle the generic exception, null is displayed as the message.
- HADOOP-7172.
Critical bug reported by tlipcon and fixed by tlipcon (io, security)
SecureIO should not check owner on non-secure clusters that have no native support
The SecureIOUtils.openForRead function currently uses a racy stat/open combo if security is disabled and the native libraries are not available. This ends up shelling out to "ls -ld" which is very very slow. We've seen this cause significant performance regressions on clusters that match this profile.
Since the racy permissions check doesn't buy us any security anyway, we should just fall back to a normal "open" without any stat() at all, if we can't use the native support to do it efficiently.
- HADOOP-7171.
Major bug reported by owen.omalley and fixed by jnp (security)
Support UGI in FileContext API
The FileContext API needs to support UGI.
- HADOOP-7167.
Minor improvement reported by tlipcon and fixed by tlipcon
Allow using a file to exclude certain tests from build
It would be nice to be able to exclude certain tests when running builds. For example, when a test is "known flaky", you may want to exclude it from the main Hudson job, but not actually disable it in the codebase (so that it still runs as part of another Hudson job, for example).
- HADOOP-7162.
Minor bug reported by humanoid and fixed by humanoid (fs)
FsShell: call srcFs.listStatus(src) twice
in file ./src/java/org/apache/hadoop/fs/FsShell.java line 555
call method twice:
1. for init variable
2. for getting data
- HADOOP-7159.
Trivial improvement reported by schen and fixed by schen (ipc)
RPC server should log the client hostname when read exception happened
This makes finding mismatched clients easier.
- HADOOP-7153.
Minor improvement reported by nicktelford and fixed by nicktelford (io)
MapWritable violates contract of Map interface for equals() and hashCode()
MapWritable now implements equals() and hashCode() based on the map contents rather than object identity in order to correctly implement the Map interface.
- HADOOP-7151.
Minor bug reported by dvryaboy and fixed by dvryaboy
Document need for stable hashCode() in WritableComparable
When a Writable is used as a key, HashPartitioner implicitly assumes that hashCode() will return the same value across different instances of the JVM. This is not a guaranteed behavior in Java, and Object's default hashCode() does not in fact do this, which can lead to subtle bugs. This requirement should be explicitly called out.
In addition the sample MyWritable in the javadoc for WritableComparable does not implement hashCode() and thus has a bug. That should be fixed.
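A minimal sketch of a key whose hashCode() depends only on field values, so every JVM computes the same result. MyKey and its two fields are hypothetical, and the partition() helper mirrors the usual hash-partitioning formula (mask the sign bit, then take the modulus) as an assumption about how the hash is consumed.

```java
class StableKeyDemo {
    // Hypothetical key type; a real Hadoop key would also implement WritableComparable.
    static final class MyKey {
        private final int counter;
        private final long timestamp;

        MyKey(int counter, long timestamp) {
            this.counter = counter;
            this.timestamp = timestamp;
        }

        @Override
        public boolean equals(Object o) {
            if (this == o) {
                return true;
            }
            if (!(o instanceof MyKey)) {
                return false;
            }
            MyKey other = (MyKey) o;
            return counter == other.counter && timestamp == other.timestamp;
        }

        // Derived only from field values, never from object identity.
        @Override
        public int hashCode() {
            int result = 17;
            result = 31 * result + counter;
            result = 31 * result + (int) (timestamp ^ (timestamp >>> 32));
            return result;
        }
    }

    // Maps a key to a partition index in the range [0, numPartitions).
    static int partition(Object key, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        MyKey a = new MyKey(1, 2L);
        MyKey b = new MyKey(1, 2L);
        System.out.println("same partition: " + (partition(a, 4) == partition(b, 4)));
    }
}
```

A key that inherits Object's default hashCode() could send equal keys from different JVMs to different partitions, which is the subtle bug described above.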
- HADOOP-7144.
Major new feature reported by vicaya and fixed by revans2
Expose JMX with something like JMXProxyServlet
Much of the Hadoop metrics and status info is available via JMX, especially since 0.20.100, and 0.22+ (HDFS-1318, HADOOP-6728 etc.) For operations staff not familiar with JMX setup, especially JMX with SSL and firewall tunnelling, the usage can be daunting. Using a JMXProxyServlet (a la Tomcat) to translate JMX attributes into JSON output would make a lot of non-Java admins happy.
We could probably use Tomcat's JMXProxyServlet code directly, if it's already output some standard format (JSON or XM...
- HADOOP-7136.
Major task reported by nidaley and fixed by nidaley
Remove failmon contrib
Failmon removed from contrib codebase.
- HADOOP-7133.
Major improvement reported by mattf and fixed by mattf (util)
CLONE to COMMON - HDFS-1445 Batch the calls in DataStorage to FileUtil.createHardLink(), so we call it once per directory instead of once per file
This is the COMMON portion of a fix requiring coordinated change of COMMON and HDFS. Please see HDFS-1445 for the HDFS portion and release note.
- HADOOP-7131.
Minor improvement reported by umamaheswararao and fixed by umamaheswararao (io)
set() and toString Methods of the org.apache.hadoop.io.Text class does not include the root exception, in the wrapping RuntimeException.
In the code snippets below, we can pass e instead of e.toString(), so that the caller gets the complete trace.
1)
/** Set to contain the contents of a string.
*/
public void set(String string) {
try {
ByteBuffer bb = encode(string, true);
bytes = bb.array();
length = bb.limit();
}catch(CharacterCodingException e) {
throw new RuntimeException("Should not have happened " + e.toString());
}
}
2)
public String toString() {
try {
return decod...
- HADOOP-7120.
Major bug reported by szetszwo and fixed by szetszwo (test)
200 new Findbugs warnings
ant test-patch on an empty patch over hdfs trunk.
{noformat}
[exec] -1 overall.
[exec]
[exec] +1 @author. The patch does not contain any @author tags.
[exec]
[exec] -1 tests included. The patch doesn't appear to include any new or modified tests.
[exec] Please justify why no new tests are needed for this patch.
[exec] Also please list what manual steps were performed to verify this patch.
[ex...
- HADOOP-7119.
Major new feature reported by tucu00 and fixed by tucu00 (security)
add Kerberos HTTP SPNEGO authentication support to Hadoop JT/NN/DN/TT web-consoles
Adding support for Kerberos HTTP SPNEGO authentication to the Hadoop web-consoles
- HADOOP-7117.
Major improvement reported by patrickangeles and fixed by qwertymaniac (conf)
Move secondary namenode checkpoint configs from core-default.xml to hdfs-default.xml
Removed references to the older fs.checkpoint.* properties that resided in core-site.xml
- HADOOP-7114.
Minor improvement reported by tlipcon and fixed by tlipcon (fs)
FsShell should dump all exceptions at DEBUG level
Most of the FsShell commands catch exceptions and then just print out an error like "foo: " + e.getLocalizedMessage(). This is fine when the exception is "user-facing" (eg permissions errors) but in the case of a user hitting a bug you get a useless error message with no stack trace. For example, something like "chmod: null" in the case of a NullPointerException bug.
It would help debug these cases for users and developers if we also logged the exception with full trace at DEBUG level.
- HADOOP-7112.
Major improvement reported by tomwhite and fixed by tomwhite (conf, filecache)
Issue a warning when GenericOptionsParser libjars are not on local filesystem
In GenericOptionsParser#getLibJars() any jars that are not local filesystem paths are silently ignored. We should issue a warning for users.
- HADOOP-7111.
Critical bug reported by tlipcon and fixed by atm (io)
Several TFile tests failing when native libraries are present
When running tests with native libraries present, TestTFileByteArrays and TestTFileJClassComparatorByteArrays fail on trunk. They don't seem to fail in 0.20 with native libraries.
- HADOOP-7098.
Major bug reported by brainlounge and fixed by brainlounge (conf)
tasktracker property not set in conf/hadoop-env.sh
For all cluster components except TaskTracker, the OPTS environment variable is set like this in hadoop-env.sh:
export HADOOP_<COMPONENT>_OPTS="-Dcom.sun.management.jmxremote $HADOOP_<COMPONENT>_OPTS"
The provided patch fixes this.
- HADOOP-7096.
Major improvement reported by ahmed.radwan and fixed by ahmed.radwan
Allow setting of end-of-record delimiter for TextInputFormat
The patch for https://issues.apache.org/jira/browse/MAPREDUCE-2254 required minor changes to the LineReader class to allow extensions (see attached 2.patch). Description copied below:
It will be useful to allow setting the end-of-record delimiter for TextInputFormat. The current implementation hardcodes '\n', '\r' or '\r\n' as the only possible record delimiters. This is a problem if users have embedded newlines in their data fields (which is pretty common). This is also a problem for other ...
- HADOOP-7090.
Major bug reported by gokulm and fixed by umamaheswararao (fs/s3, io)
Possible resource leaks in hadoop core code
It is always a good practice to close the IO streams in a finally block.
For example, look at the following piece of code in the Writer class of BloomMapFile
{code:title=BloomMapFile.java|borderStyle=solid}
public synchronized void close() throws IOException {
super.close();
DataOutputStream out = fs.create(new Path(dir, BLOOM_FILE_NAME), true);
bloomFilter.write(out);
out.flush();
out.close();
}
{code}
If an exception occurs during fs.create or o...
- HADOOP-7089.
Minor bug reported by eli and fixed by eli (scripts)
Fix link resolution logic in hadoop-config.sh
Updates hadoop-config.sh to always resolve symlinks when determining HADOOP_HOME. Bash built-ins or POSIX:2001 compliant cmds are now required.
- HADOOP-7078.
Trivial improvement reported by tlipcon and fixed by qwertymaniac
Add better javadocs for RawComparator interface
The RawComparator interface is very important to understand for users implementing their own serialization classes. Right now the javadoc is woefully sparse. We should improve that.
- HADOOP-7071.
Minor bug reported by nidaley and fixed by nidaley (build)
test-patch.sh has bad ps arg
- HADOOP-7061.
Minor improvement reported by yaojingguo and fixed by yaojingguo (io)
unprecise javadoc for CompressionCodec
In CompressionCodec.java, there is the following code:
/**
* Create a stream decompressor that will read from the given input stream.
*
* @param in the stream to read compressed bytes from
* @return a stream to read uncompressed bytes from
* @throws IOException
*/
CompressionInputStream createInputStream(InputStream in) throws IOException;
"stream decompressor" should be "{@link CompressionInputStream}".
- HADOOP-7060.
Major improvement reported by hairong and fixed by pkling (fs)
A more elegant FileSystem#listCorruptFileBlocks API
I'd like to change the newly added listCorruptFileBlocks signature to be:
{code}
/**
* Get all files with corrupt blocks under the given path
*/
RemoteIterator<Path> listCorruptFileBlocks(Path src) throws IOException;
{code}
This new API does not expose "cookie" to user although underlying implementation may still need to invoke multiple RPCs to get the whole list.
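How an iterator can hide the cookie while still fetching results in batches can be sketched as follows. The RemoteIterator interface here is a local stand-in declared only for this example, and fetchBatch() is a hypothetical substitute for the underlying RPC; neither is taken from the Hadoop source.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.NoSuchElementException;

class CorruptFileIteratorDemo {
    // Local stand-in for an iterator whose calls may perform remote I/O.
    interface RemoteIterator<E> {
        boolean hasNext() throws IOException;
        E next() throws IOException;
    }

    // Hypothetical batch call: returns up to batchSize entries starting at the cookie.
    static List<String> fetchBatch(List<String> all, int cookie, int batchSize) {
        int end = Math.min(all.size(), cookie + batchSize);
        return new ArrayList<String>(all.subList(cookie, end));
    }

    // Wraps the batched calls; the cookie stays private to the iterator.
    static RemoteIterator<String> iterate(final List<String> all, final int batchSize) {
        return new RemoteIterator<String>() {
            private int cookie = 0;
            private List<String> batch = new ArrayList<String>();
            private int pos = 0;

            @Override
            public boolean hasNext() throws IOException {
                if (pos >= batch.size()) {
                    // Current batch is used up: fetch the next one and advance the cookie.
                    batch = fetchBatch(all, cookie, batchSize);
                    cookie += batch.size();
                    pos = 0;
                }
                return pos < batch.size();
            }

            @Override
            public String next() throws IOException {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                return batch.get(pos++);
            }
        };
    }

    // Collects everything the iterator yields, in order.
    static List<String> drain(RemoteIterator<String> it) throws IOException {
        List<String> result = new ArrayList<String>();
        while (it.hasNext()) {
            result.add(it.next());
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        List<String> files = Arrays.asList("/a", "/b", "/c", "/d", "/e");
        System.out.println(drain(iterate(files, 2)));
    }
}
```

The caller only sees hasNext() and next(); how many round trips happen underneath is an implementation detail.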
- HADOOP-7059.
Major improvement reported by nwatkins and fixed by nwatkins (native)
Remove "unused" warning in native code
Adds __attribute__ ((unused))
- HADOOP-7058.
Trivial improvement reported by tlipcon and fixed by tlipcon
Expose number of bytes in FSOutputSummer buffer to implementations
For HDFS-1497 it would be useful for an FSOutputSummer implementation to know how many bytes are in the FSOutputSummer buffer. This trivial patch adds a protected call to return this.
- HADOOP-7057.
Minor bug reported by cos and fixed by cos (util)
IOUtils.readFully and IOUtils.skipFully have typo in exception creation's message
{noformat}
throw new IOException( "Premeture EOF from inputStream");
{noformat}
- HADOOP-7055.
Major bug reported by yaojingguo and fixed by yaojingguo (metrics)
Update of commons logging libraries causes EventCounter to count logging events incorrectly
Hadoop 0.20.2 uses commons logging 1.0.4. EventCounter works correctly with this version of commons logging. Hadoop 0.21.0 uses commons logging 1.1.1 which causes EventCounter to count logging events incorrectly. I have verified it with Hadoop 0.21.0. After start-up of hadoop, I checked jvmmetrics.log after several minutes. In every metrics record, "logError=0, logFatal=0, logInfo=3, logWarn=0" was shown. The following text is an example.
jvm.metrics: hostName=jingguolin, processName=DataNod...
- HADOOP-7053.
Minor bug reported by yaojingguo and fixed by yaojingguo (conf)
wrong FSNamesystem Audit logging setting in conf/log4j.properties
"log4j.logger.org.apache.hadoop.fs.FSNamesystem.audit=WARN" should be "log4j.logger.org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit=WARN".
- HADOOP-7052.
Major bug reported by yaojingguo and fixed by yaojingguo (conf)
misspelling of threshold in conf/log4j.properties
In "log4j.threshhold=ALL", threshhold is a misspelling of threshold. So "log4j.threshhold=ALL" has no effect on the control of log4j logging.
- HADOOP-7049.
Trivial improvement reported by pkling and fixed by pkling (conf)
TestReconfiguration should be junit v4
TestReconfiguration should be a junit v4 unit test. I'll also add some messages to the assertions.
- HADOOP-7048.
Minor improvement reported by yaojingguo and fixed by yaojingguo (io)
Wrong description of Block-Compressed SequenceFile Format in SequenceFile's javadoc
Here is the following description for Block-Compressed SequenceFile Format in SequenceFile's javadoc:
* <li>
* Record <i>Block</i>
* <ul>
* <li>Compressed key-lengths block-size</li>
* <li>Compressed key-lengths block</li>
* <li>Compressed keys block-size</li>
* <li>Compressed keys block</li>
* <li>Compressed value-lengths block-size</li>
* <li>Compressed value-lengths block</li>
* <li>Compressed values block-size</li>
* <li>Compressed values bloc...
- HADOOP-7046.
Blocker bug reported by nidaley and fixed by pocheung (security)
1 Findbugs warning on trunk and branch-0.22
There is 1 findbugs warning on trunk. See attached html file. This must be fixed or filtered out to get back to 0 warnings. The OK_FINDBUGS_WARNINGS property in src/test/test-patch.properties should also be set to 0 in the patch that fixes this issue.
- HADOOP-7045.
Minor bug reported by eli and fixed by eli (fs)
TestDU fails on systems with local file systems with extended attributes
The test reports that the file takes an extra 4k on disk:
{noformat}
Testcase: testDU took 5.74 sec
FAILED
expected:<32768> but was:<36864>
junit.framework.AssertionFailedError: expected:<32768> but was:<36864>
at org.apache.hadoop.fs.TestDU.testDU(TestDU.java:79)
{noformat}
This is because du reports 32k for the file plus 4k because the file system it lives on uses extended attributes.
{noformat}
common-branch-0.20 $ dd if=/dev/zero of=data bs=4096 count=8
8+0 records in
8+...
- HADOOP-7042.
Minor improvement reported by nidaley and fixed by nidaley (test)
Update test-patch.sh to include failed test names and move test-patch.properties
As Jakob suggested, it would be helpful if the Jira messages left by Hudson included the list of failed tests.
Also, test-patch.properties must be moved out of the src/test/bin dir because it is project specific and the entire bin dir is svn included into other projects (hdfs and mapreduce)
- HADOOP-7023.
Major improvement reported by pkling and fixed by pkling
Add listCorruptFileBlocks to FileSystem
Add a new API listCorruptFileBlocks to FileContext that returns a list of files that have corrupt blocks.
- HADOOP-7015.
Minor bug reported by sanjay.radia and fixed by sanjay.radia
RawLocalFileSystem#listStatus does not deal with a directory whose entries are changing ( e.g. in a multi-thread or multi-process environment)
- HADOOP-7014.
Major improvement reported by cos and fixed by cos (test)
Generalize CLITest structure and interfaces to facilitate upstream adoption (e.g. for web testing)
There's at least one use case where the TestCLI infrastructure is helpful for testing projects outside of core Hadoop (e.g. Owl web testing). To make adoption easier for upstream projects, TestCLI needs to be refactored.
- HADOOP-7001.
Major task reported by pkling and fixed by pkling (conf)
Allow configuration changes without restarting configured nodes
Currently, changing the configuration on a node (e.g., the name node) requires that we restart the node. We propose a change that would allow us to make configuration changes without restarting. Nodes that support configuration changes at run time should implement the following interface:
interface ChangeableConfigured extends Configured {
void changeConfiguration(Configuration newConf) throws ConfigurationChangeException;
}
The contract of changeConfiguration is as follows:
The node wil...
- HADOOP-6995.
Minor improvement reported by tlipcon and fixed by tlipcon (security)
Allow wildcards to be used in ProxyUsers configurations
When configuring proxy users and hosts, the special wildcard value "*" may be specified to match any host or any user.
- HADOOP-6994.
Major improvement reported by jnp and fixed by jnp
Api to get delegation token in AbstractFileSystem
APIs to get delegation tokens are required in AbstractFileSystem. AbstractFileSystems are accessed via FileContext, therefore an API to get the list of AbstractFileSystems accessed in a path is also needed.
A path may refer to several file systems and delegation tokens could be needed for many of them for a client to be able to successfully access the path.
- HADOOP-6949.
Major improvement reported by navis and fixed by mattf (io)
Reduces RPC packet size for primitive arrays, especially long[], which is used at block reporting
Increments the RPC protocol version in org.apache.hadoop.ipc.Server from 4 to 5.
Introduces ArrayPrimitiveWritable for a much more efficient wire format to transmit arrays of primitives over RPC. ObjectWritable uses the new writable for array of primitives for RPC and continues to use existing format for on-disk data.
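Why writing the element type once per array saves space can be illustrated with a rough size comparison. The two encodings below are simplified assumptions made for this example; the note does not describe the exact wire format used by ObjectWritable or ArrayPrimitiveWritable.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

class PrimitiveArrayEncodingDemo {
    // Assumed "tagged" layout: the type name is repeated before every element.
    static byte[] encodeTagged(long[] values) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        out.writeInt(values.length);
        for (long v : values) {
            out.writeUTF("long");
            out.writeLong(v);
        }
        out.flush();
        return bytes.toByteArray();
    }

    // Assumed "compact" layout: type name and length once, then the raw values.
    static byte[] encodeCompact(long[] values) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        out.writeUTF("long");
        out.writeInt(values.length);
        for (long v : values) {
            out.writeLong(v);
        }
        out.flush();
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        long[] blocks = new long[1000];
        System.out.println("tagged bytes:  " + encodeTagged(blocks).length);
        System.out.println("compact bytes: " + encodeCompact(blocks).length);
    }
}
```

In this sketch each tagged element costs 14 bytes (a 6-byte type name plus an 8-byte long), while a compact element costs 8 bytes.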
- HADOOP-6939.
Minor bug reported by tlipcon and fixed by tlipcon
Inconsistent lock ordering in AbstractDelegationTokenSecretManager
AbstractDelegationTokenSecretManager.startThreads() is synchronized, which calls updateCurrentKey(), which calls logUpdateMasterKey. logUpdateMasterKey's implementation for HDFS's manager calls namesystem.logUpdateMasterKey() which is synchronized. Thus the lock order is ADTSM -> FSN. In FSN.saveNamespace, though, it calls DTSM.saveSecretManagerState(), so the lock order is FSN -> ADTSM.
I don't think this deadlock occurs in practice since saveNamespace won't occur until after the ADTSM has ...
- HADOOP-6929.
Major improvement reported by sharadag and fixed by sharadag (ipc, security)
RPC should have a way to pass Security information other than protocol annotations
Currently Hadoop RPC allows protocol annotations as the only way to pass security information. This becomes a problem if protocols are generated and not hand written. For example protocols generated via Avro and passed over Avro tunnel (AvroRpcEngine.java) can't pass the security information.
- HADOOP-6921.
Major sub-task reported by vicaya and fixed by vicaya
metrics2: metrics plugins
Metrics names are standardized to CapitalizedCamelCase. See release note of HADOOP-6918 and HADOOP-6920.
- HADOOP-6920.
Major sub-task reported by vicaya and fixed by vicaya
Metrics2: metrics instrumentation
Metrics names are standardized to use CapitalizedCamelCase. Some examples:
# Metrics names using "_" are changed to the new naming scheme. E.g., bytes_written changes to BytesWritten.
# All metrics names start with capitals. Example: threadsBlocked changes to ThreadsBlocked.
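The two renaming rules can be sketched as a small helper. toCapitalizedCamelCase() is a hypothetical name used only for illustration and is not part of the metrics2 API; it can be handy when mapping old dashboard queries to the new names.

```java
class MetricNameDemo {
    // Drops underscores and upper-cases the first letter and each letter after "_".
    static String toCapitalizedCamelCase(String name) {
        StringBuilder sb = new StringBuilder();
        boolean upperNext = true;
        for (int i = 0; i < name.length(); i++) {
            char c = name.charAt(i);
            if (c == '_') {
                upperNext = true;
                continue;
            }
            sb.append(upperNext ? Character.toUpperCase(c) : c);
            upperNext = false;
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toCapitalizedCamelCase("bytes_written"));
        System.out.println(toCapitalizedCamelCase("threadsBlocked"));
    }
}
```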
- HADOOP-6919.
Major sub-task reported by vicaya and fixed by vicaya (metrics)
Metrics2: metrics framework
New metrics2 framework for Hadoop.
- HADOOP-6912.
Major bug reported by kzhang and fixed by kzhang (security)
Guard against NPE when calling UGI.isLoginKeytabBased()
NPE can happen when isLoginKeytabBased() is called before a login is performed. See MAPREDUCE-1992 for an example.
- HADOOP-6904.
Major new feature reported by hairong and fixed by hairong (ipc)
A baby step towards inter-version RPC communications
Currently RPC communication in Hadoop is very strict. If a client has a different version from that of the server, a VersionMismatched exception is thrown and the client cannot connect to the server. This forces us to update both client and server all at once if an RPC protocol is changed. But sometimes different versions do not mean the client and server are incompatible. It would be nice if we could relax this restriction and support inter-version communication.
My idea is tha...
- HADOOP-6889.
Major new feature reported by hairong and fixed by johnvijoe (ipc)
Make RPC to have an option to timeout
Currently Hadoop RPC does not time out when the RPC server is alive. What it currently does is that an RPC client sends a ping to the server whenever a socket timeout happens. If the server is still alive, it continues to wait instead of throwing a SocketTimeoutException. This avoids having a client retry when a server is busy, which would make the server even busier. This works great if the RPC server is the NameNode.
But Hadoop RPC is also used for some of client to DataNode communications, for e...
- HADOOP-6887.
Major improvement reported by bharathm and fixed by vicaya (metrics)
Need a separate metrics per garbage collector
In addition to the current GC metrics, which are the sum over all collectors, we need separate per-collector metrics for monitoring young generation and old generation collections, covering collection count and collection time.
- HADOOP-6864.
Major improvement reported by zasran and fixed by boryas (security)
Provide a JNI-based implementation of ShellBasedUnixGroupsNetgroupMapping (implementation of GroupMappingServiceProvider)
The netgroups implementation of GroupMappingServiceProvider (see ShellBasedUnixGroupsNetgroupMapping.java) does a fork of a unix command to get the netgroups of a user. Since the group resolution happens in the servers, this might be costly. This jira aims at providing a JNI-based implementation for GroupMappingServiceProvider.
Note that this is similar to what https://issues.apache.org/jira/browse/HADOOP-6818 does for implementation of GroupMappingServiceProvider that supports only unix gr...
- HADOOP-6764.
Major improvement reported by dms and fixed by dms (ipc)
Add number of reader threads and queue length as configuration parameters in RPC.getServer
In HDFS-599 we are introducing multiple RPC servers running inside of the same process on different ports. Since one might want to configure these servers differently we need a good abstraction to pass configuration values to servers as parameters, not through Configuration.
- HADOOP-6754.
Major bug reported by kimballa and fixed by kimballa (io)
DefaultCodec.createOutputStream() leaks memory
DefaultCodec.createOutputStream() creates a new Compressor instance in each OutputStream. Even if the OutputStream is closed, this leaks memory.
- HADOOP-6683.
Minor sub-task reported by xiaokang and fixed by xiaokang (io)
the first optimization: ZlibCompressor does not fully utilize the buffer
Improve the buffer utilization of ZlibCompressor to avoid invoking a JNI call per write request.
- HADOOP-6671.
Major sub-task reported by gkesavan and fixed by tucu00 (build)
To use maven for hadoop common builds
We are now able to publish hadoop artifacts to the maven repo successfully [HADOOP-6382]
Drawbacks with the current approach:
* Use ivy for dependency management with ivy.xml
* Use maven-ant-task for artifact publishing to the maven repository
* pom files are not generated dynamically
To address this I propose we use maven to build hadoop-common, which would help us to manage dependencies, publish artifacts and have one single xml file(POM) for dependency management and artifact publishing...
- HADOOP-6622.
Major bug reported by jnp and fixed by eli (security)
Token should not print the password in toString.
The toString method in Token should not print out the password.
- HADOOP-6578.
Minor improvement reported by tlipcon and fixed by pirroh (conf)
Configuration should trim whitespace around a lot of value types
I've seen multiple users make an error where they've listed some whitespace around a class name (eg for configuring a scheduler). This results in a ClassNotFoundException which is very hard to debug, as you don't notice the whitespace in the exception! We should simply trim the whitespace in Configuration.getClass and Configuration.getClasses to avoid this class of user error.
Similarly, we should trim in getInt, getLong, etc - anywhere that whitespace doesn't have semantic meaning we should...
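The failure mode and the effect of trimming can be reproduced with the JDK alone. The helper names below are hypothetical and the sketch does not show Configuration's actual implementation; it only demonstrates that a class name with surrounding whitespace fails to load until it is trimmed.

```java
class TrimClassNameDemo {
    // Loads a class after trimming whitespace from the configured name.
    static Class<?> loadTrimmed(String configuredName) throws ClassNotFoundException {
        return Class.forName(configuredName.trim());
    }

    // Returns false when the raw, untrimmed name cannot be loaded.
    static boolean loadsUntrimmed(String configuredName) {
        try {
            Class.forName(configuredName);
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        }
    }

    public static void main(String[] args) throws ClassNotFoundException {
        String configured = " java.lang.String ";
        System.out.println("untrimmed loads: " + loadsUntrimmed(configured));
        System.out.println("trimmed loads:   " + loadTrimmed(configured).getName());
    }
}
```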
- HADOOP-6508.
Major bug reported by amareshwari and fixed by vicaya (metrics)
Incorrect values for metrics with CompositeContext
In our clusters, when we use CompositeContext with two contexts, the second context gets wrong values.
This problem is consistent on clusters of 500 or more nodes.
- HADOOP-6436.
Major improvement reported by eli and fixed by rvs
Remove auto-generated native build files
The native build, when run from trunk, now requires autotools, libtool and openssl dev libraries.
- HADOOP-6432.
Major new feature reported by jnp and fixed by jnp
Statistics support in FileContext
FileContext should have API to get statistics from underlying file systems.
- HADOOP-6385.
Minor new feature reported by sphillip and fixed by daryn (fs)
dfs does not support -rmdir (was HDFS-639)
The "rm" family of FsShell commands now supports -rmdir and -f options.
- HADOOP-6376.
Minor improvement reported by kaykay.unique and fixed by kaykay.unique (conf)
slaves file to have a header specifying the format of conf/slaves file
When we open the file conf/slaves, it is not immediately obvious what the format of the file is (a comma-separated list or one entry per line). The docs confirm it is one per line.
The format should be specified by means of a comment in the template, so that the file is easy to modify and self-explanatory.
- HADOOP-6255.
Major new feature reported by owen.omalley and fixed by eyang
Create an rpm integration project
Added RPM/DEB packages to build system.
- HADOOP-6158.
Minor task reported by owen.omalley and fixed by eli (util)
Move CyclicIteration to HDFS
I think we should move CyclicIteration from Common utils to HDFS.
- HADOOP-5647.
Major bug reported by ravidotg and fixed by ravidotg (test)
TestJobHistory fails if /tmp/_logs is not writable to. Testcase should not depend on /tmp
Removed the testcase's dependency on /tmp and made it use the test.build.data directory instead.
- HADOOP-2081.
Major bug reported by owen.omalley and fixed by qwertymaniac (conf)
Configuration getInt, getLong, and getFloat replace invalid numbers with the default value
Invalid configuration values now result in a number format exception rather than the default value being used.
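The behavior change can be sketched with java.util.Properties standing in for Configuration. The getInt() helper and the example.buffer.size key are hypothetical; the sketch assumes that the default is still used when a key is absent and that only malformed values raise the exception.

```java
import java.util.Properties;

class StrictGetIntDemo {
    // Returns the default only when the key is absent; a malformed value
    // propagates NumberFormatException instead of being silently replaced.
    static int getInt(Properties conf, String key, int defaultValue) {
        String value = conf.getProperty(key);
        if (value == null) {
            return defaultValue;
        }
        return Integer.parseInt(value.trim());
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        System.out.println(getInt(conf, "example.buffer.size", 4096));
        conf.setProperty("example.buffer.size", "4x96");
        try {
            getInt(conf, "example.buffer.size", 4096);
        } catch (NumberFormatException e) {
            System.out.println("invalid value rejected: " + e.getMessage());
        }
    }
}
```

Jobs that previously ran with a silently substituted default may now fail fast, so configuration files are worth checking for malformed numbers before upgrading.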
- HADOOP-1886.
Trivial improvement reported by shv and fixed by frankconrad (fs)
Undocumented parameters in FilesSystem
Multiple create methods in public FileSystem class lack documentation for the following 2 parameters.
- long blockSize,
- Progressable progress
- HDFS-2512.
Major improvement reported by tlipcon and fixed by tlipcon (data-node, hdfs client)
Add textual error message to data transfer protocol responses
Currently, the error response code from the DN has very little extra info. I had a situation the other day where the balancer was failing to move blocks, but just reported back "error moving block", which didn't help much. We can easily add a "message" field to OpBlockResponseProto to communicate back the underlying issue (in this case, it was thread quota exceeded)
- HDFS-2501.
Major sub-task reported by szetszwo and fixed by szetszwo
add version prefix and root methods to webhdfs
- HDFS-2500.
Major improvement reported by tlipcon and fixed by tlipcon (data-node)
Avoid file system operations in BPOfferService thread while processing deletes
While running a workload with concurrent writes and deletes, I saw a lot of NotReplicatedYetExceptions being thrown due to late arrivals of blockReceived reports from the DN. Looking at the DN logs, I found that the blockReceived message was being delayed as much as 15 seconds because the OfferService thread was blocked on file system operations processing deletes. We previously moved the deletions to another thread, but it still accesses the file system to determine the block length in the m...
- HDFS-2494.
Major sub-task reported by umamaheswararao and fixed by umamaheswararao (data-node)
[webhdfs] When Getting the file using OP=OPEN with DN http address, ESTABLISHED sockets are growing.
As part of the reliability test,
Scenario:
Initially check the socket count. --- There are around 42 sockets.
Open the file with the DataNode http address using the op=OPEN request parameter about 500 times in a loop.
Wait for some time and check the socket count. --- The number of ESTABLISHED sockets has grown into the thousands. ~2052
Here is the netstat result:
C:\Users\uma>netstat | grep 127.0.0.1 | grep ESTABLISHED |wc -l
2042
C:\Users\uma>netstat | grep 127.0.0.1 | grep ESTABLISHED |wc -l
2042
C:\...
- HDFS-2493.
Major sub-task reported by szetszwo and fixed by szetszwo (name-node)
Remove reference to FSNamesystem in blockmanagement classes
- HDFS-2485.
Trivial improvement reported by stevel@apache.org and fixed by stevel@apache.org (data-node)
Improve code layout and constants in UnderReplicatedBlocks
Before starting HDFS-2472 I want to clean up the code in UnderReplicatedBlocks slightly
# use constants for all the string levels
# change the {{getUnderReplicatedBlockCount()}} method so that it works even if the corrupted block list is not the last queue
# improve the javadocs
# add some more curly braces and spaces to follow the style guidelines better
This is a trivial change as behaviour will not change at all. If committed it will go into trunk and 0.23 so that patches between the two ...
- HDFS-2471.
Major new feature reported by sureshms and fixed by sureshms (documentation)
Add Federation feature, configuration and tools documentation
This jira intends to add Federation documentation.
- HDFS-2467.
Major bug reported by owen.omalley and fixed by owen.omalley
HftpFileSystem uses incorrect compare for finding delegation tokens
When looking for hdfs delegation tokens, Hftp converts the service to a string and compares it to a text.
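The kind of mismatch described above can be sketched without Hadoop on the classpath. In the example below, TextLike is a hypothetical stand-in for a Text-style wrapper (it is not org.apache.hadoop.io.Text), and the service string is made up. It only illustrates why an equals() call across two different types never matches, and what a like-for-like comparison looks like.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical stand-in for a Text-like wrapper around UTF-8 bytes.
// It is NOT org.apache.hadoop.io.Text; it only mimics the equality contract.
final class TextLike {
    private final byte[] bytes;

    TextLike(String s) {
        this.bytes = s.getBytes(StandardCharsets.UTF_8);
    }

    @Override
    public boolean equals(Object o) {
        // Equal only to another TextLike holding the same bytes.
        return o instanceof TextLike && Arrays.equals(bytes, ((TextLike) o).bytes);
    }

    @Override
    public int hashCode() {
        return Arrays.hashCode(bytes);
    }

    @Override
    public String toString() {
        return new String(bytes, StandardCharsets.UTF_8);
    }
}

final class ServiceCompareDemo {
    // Mismatched types: a String is never equal to a TextLike.
    static boolean mismatchedCompare(String wanted, TextLike service) {
        return wanted.equals(service);
    }

    // Like-for-like: wrap the String first, then compare two TextLike values.
    static boolean textCompare(String wanted, TextLike service) {
        return new TextLike(wanted).equals(service);
    }

    public static void main(String[] args) {
        TextLike service = new TextLike("127.0.0.1:8020"); // made-up service name
        System.out.println(mismatchedCompare("127.0.0.1:8020", service)); // false
        System.out.println(textCompare("127.0.0.1:8020", service));       // true
    }
}
```

The first comparison compiles cleanly, which is what makes this class of bug easy to miss.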
- HDFS-2465.
Major improvement reported by tlipcon and fixed by tlipcon (data-node, performance)
Add HDFS support for fadvise readahead and drop-behind
HDFS now has the ability to use posix_fadvise and sync_file_range syscalls to manage the OS buffer cache. This support is currently considered experimental, and may be enabled by configuring the following keys:
dfs.datanode.drop.cache.behind.writes - set to true to drop data out of the buffer cache after writing
dfs.datanode.drop.cache.behind.reads - set to true to drop data out of the buffer cache when performing sequential reads
dfs.datanode.sync.behind.writes - set to true to trigger dirty page writeback immediately after writing data
dfs.datanode.readahead.bytes - set to a non-zero value to trigger readahead for sequential reads
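As a rough illustration of how the four keys above might appear in a site configuration file, the following sketch renders them in the usual <property> layout. The key names come from this note; the values (in particular the 4 MB readahead size) and the class name CacheTuningSnippet are assumptions chosen for the example, not recommended settings.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class CacheTuningSnippet {
    // Renders key/value pairs in the <property> layout used by Hadoop XML config files.
    static String render(Map<String, String> settings) {
        StringBuilder xml = new StringBuilder();
        for (Map.Entry<String, String> e : settings.entrySet()) {
            xml.append("<property>\n")
               .append("  <name>").append(e.getKey()).append("</name>\n")
               .append("  <value>").append(e.getValue()).append("</value>\n")
               .append("</property>\n");
        }
        return xml.toString();
    }

    // Key names are taken from the release note; the values are illustrative only.
    static Map<String, String> exampleSettings() {
        Map<String, String> settings = new LinkedHashMap<>();
        settings.put("dfs.datanode.drop.cache.behind.writes", "true");
        settings.put("dfs.datanode.drop.cache.behind.reads", "true");
        settings.put("dfs.datanode.sync.behind.writes", "true");
        settings.put("dfs.datanode.readahead.bytes", "4194304"); // 4 MB, an assumed value
        return settings;
    }

    public static void main(String[] args) {
        System.out.print(render(exampleSettings()));
    }
}
```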
- HDFS-2453.
Major sub-task reported by arpitgupta and fixed by szetszwo
tail using a webhdfs uri throws an error
/usr//bin/hadoop --config /etc/hadoop dfs -tail webhdfs://NN:50070/file
tail: HTTP_PARTIAL expected, received 200
- HDFS-2452.
Major bug reported by shv and fixed by umamaheswararao (data-node)
OutOfMemoryError in DataXceiverServer takes down the DataNode
OutOfMemoryError brings down DataNode, when DataXceiverServer tries to spawn a new data transfer thread.
- HDFS-2445.
Major bug reported by jeagles and fixed by jeagles (test)
Incorrect exit code for hadoop-hdfs-test tests when exception thrown
Please see MAPREDUCE-3179 for a full description.
- HDFS-2441.
Major sub-task reported by arpitgupta and fixed by szetszwo
webhdfs returns two content-type headers
$ curl -i "http://localhost:50070/webhdfs/path?op=GETFILESTATUS"
HTTP/1.1 200 OK
Content-Type: text/html; charset=utf-8
Expires: Thu, 01-Jan-1970 00:00:00 GMT
........
Content-Type: application/json
Transfer-Encoding: chunked
Server: Jetty(6.1.26)
It should return only one Content-Type header: application/json
- HDFS-2439.
Major sub-task reported by arpitgupta and fixed by szetszwo
webhdfs open an invalid path leads to a 500 which states a npe, we should return a 404 with appropriate error message
- HDFS-2436.
Major bug reported by arpitgupta and fixed by umamaheswararao
FSNamesystem.setTimes(..) expects the path is a file.
FSNamesystem.setTimes(..) does not work if the path is a directory.
Arpit found this bug when testing webhdfs:
{quote}
settimes api is working when called on a file, but when called on a dir it returns a 404. I should be able to set time on both a file and a directory.
{quote}
- HDFS-2432.
Major sub-task reported by arpitgupta and fixed by szetszwo
webhdfs setreplication api should return a 403 when called on a directory
Currently the set replication api on a directory leads to a 200.
Request URI http://NN:50070/webhdfs/tmp/webhdfs_data/dir_replication_tests?op=SETREPLICATION&replication=5
Request Method: PUT
Status Line: HTTP/1.1 200 OK
Response Content: {"boolean":false}
Since we can determine that this call did not succeed (boolean=false) we should rather just return a 403
- HDFS-2428.
Major sub-task reported by arpitgupta and fixed by szetszwo
webhdfs api parameter validation should be better
PUT Request: http://localhost:50070/webhdfs/some_path?op=MKDIRS&permission=955
Exception returned
HTTP/1.1 500 Internal Server Error
{"RemoteException":{"className":"com.sun.jersey.api.ParamException$QueryParamException","message":"java.lang.NumberFormatException: For input string: \"955\""}}
We should return a 400 with appropriate error message
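A minimal sketch of the kind of up-front validation this implies: reject a permission string that is not octal before it reaches number parsing, and map the failure to a 400 rather than a 500. The class PermissionParam, the 1-to-4-digit rule and the status mapping are assumptions for illustration, not the webhdfs implementation.

```java
final class PermissionParam {
    // Parses an octal permission string such as "755".
    // Assumption of this sketch: 1 to 4 digits, each in the range 0-7.
    static short parseOctal(String value) {
        if (value == null || !value.matches("[0-7]{1,4}")) {
            throw new IllegalArgumentException(
                "Invalid permission \"" + value + "\": expected 1 to 4 octal digits");
        }
        return Short.parseShort(value, 8);
    }

    // Maps the validation outcome to an HTTP status: 400 for bad client input.
    static int statusFor(String value) {
        try {
            parseOctal(value);
            return 200;
        } catch (IllegalArgumentException e) {
            return 400;
        }
    }

    public static void main(String[] args) {
        System.out.println(statusFor("755")); // 200
        System.out.println(statusFor("955")); // 400, since 9 is not an octal digit
    }
}
```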
- HDFS-2427.
Major sub-task reported by arpitgupta and fixed by szetszwo
webhdfs mkdirs api call creates path with 777 permission, we should default it to 755
- HDFS-2424.
Major sub-task reported by arpitgupta and fixed by szetszwo
webhdfs liststatus json does not convert to a valid xml document
- HDFS-2422.
Major bug reported by jwfbean and fixed by atm (name-node)
The NN should tolerate the same number of low-resource volumes as failed volumes
We encountered a situation where the namenode dropped into safe mode after a temporary outage of an NFS mount.
At 12:10 the NFS server goes offline
Oct 8 12:10:05 <namenode> kernel: nfs: server <nfs host> not responding, timed out
This caused the namenode to conclude resource issues:
2011-10-08 12:10:34,848 WARN org.apache.hadoop.hdfs.server.namenode.NameNodeResourceChecker: Space available on volume '<nfs host>' is 0, which is below the configured reserved amount 104857600
Temporary lo...
- HDFS-2414.
Critical bug reported by revans2 and fixed by tlipcon (name-node, test)
TestDFSRollback fails intermittently
When running TestDFSRollback repeatedly in a loop I observed a failure rate of about 3%. Two separate stack traces are in the output and it appears to have something to do with not writing out a complete snapshot of the data for rollback.
{noformat}
-------------------------------------------------------------------------------
Test set: org.apache.hadoop.hdfs.TestDFSRollback
-------------------------------------------------------------------------------
Tests run: 1, Failures: 1, Errors: 0...
- HDFS-2412.
Blocker bug reported by tlipcon and fixed by tlipcon
Add backwards-compatibility layer for FSConstants
HDFS-1620 renamed FSConstants which we believed to be a private class. But currently the public APIs for safe-mode and datanode reports depend on constants in FSConstants. This is breaking HBase builds against 0.23. This JIRA is to provide a backward-compatibility route.
- HDFS-2411.
Major bug reported by arpitgupta and fixed by jnp
with webhdfs enabled in secure mode the auth to local mappings are not being respected.
- HDFS-2409.
Major bug reported by jnp and fixed by jnp
_HOST in dfs.web.authentication.kerberos.principal.
This is HDFS part of HADOOP-7721.
- HDFS-2404.
Major sub-task reported by arpitgupta and fixed by sureshms
webhdfs liststatus json response is not correct
- HDFS-2403.
Major sub-task reported by szetszwo and fixed by szetszwo
The renewer in NamenodeWebHdfsMethods.generateDelegationToken(..) is not used
Below are some suggestions from Suresh.
# renewer not used in #generateDelegationToken
# put() does not use InputStream in and should not throw URISyntaxException
# post() does not use InputStream in and should not throw URISyntaxException
# get() should not throw URISyntaxException
- HDFS-2401.
Major improvement reported by jeagles and fixed by jeagles (build)
Running a set of methods in a Single Test Class
Instead of running every test method in a class, limit to specific testing methods as described in the link below.
http://maven.apache.org/plugins/maven-surefire-plugin/examples/single-test.html
Upgrade to the latest version of maven-surefire-plugin that has this feature.
- HDFS-2395.
Critical sub-task reported by arpitgupta and fixed by szetszwo
webhdfs api's should return a root element in the json response
- HDFS-2385.
Major sub-task reported by szetszwo and fixed by szetszwo
Support delegation token renewal in webhdfs
- HDFS-2371.
Major improvement reported by sureshms and fixed by sureshms (data-node)
Refactor BlockSender.java for better readability
BlockSender.java is hard to read and understand. I propose refactoring it for better readability
- HDFS-2368.
Major bug reported by arpitgupta and fixed by szetszwo
defaults created for web keytab and principal, these properties should not have defaults
the following defaults are set in hdfs-defaults.xml
<property>
<name>dfs.web.authentication.kerberos.principal</name>
<value>HTTP/${dfs.web.hostname}@${kerberos.realm}</value>
<description>
The HTTP Kerberos principal used by Hadoop-Auth in the HTTP endpoint.
The HTTP Kerberos principal MUST start with 'HTTP/' per Kerberos
HTTP SPENGO specification.
</description>
</property>
<property>
<name>dfs.web.authentication.kerberos.keytab</name>
<value>${user.home}/dfs.web....
- HDFS-2366.
Major sub-task reported by arpitgupta and fixed by szetszwo
webhdfs throws a npe when ugi is null from getDelegationToken
- HDFS-2363.
Minor sub-task reported by umamaheswararao and fixed by umamaheswararao (name-node)
Move datanodes size printing to BlockManager from FSNameSystem's metasave API
{code}
final List<DatanodeDescriptor> live = new ArrayList<DatanodeDescriptor>();
final List<DatanodeDescriptor> dead = new ArrayList<DatanodeDescriptor>();
blockManager.getDatanodeManager().fetchDatanodes(live, dead, false);
out.println("Live Datanodes: "+live.size());
out.println("Dead Datanodes: "+dead.size());
blockManager.metaSave(out);
{code}
Logically all the dataNode related logic can be moved to BlockManager.
So, here metaSave API is getting the ...
- HDFS-2361.
Critical bug reported by rajsaha and fixed by jnp (name-node)
hftp is broken
Distcp with hftp is failing.
{noformat}
$hadoop distcp hftp://<NNhostname>:50070/user/hadoopqa/1316814737/newtemp 1316814737/as
11/09/23 21:52:33 INFO tools.DistCp: srcPaths=[hftp://<NNhostname>:50070/user/hadoopqa/1316814737/newtemp]
11/09/23 21:52:33 INFO tools.DistCp: destPath=1316814737/as
Retrieving token from: https://<NN IP>:50470/getDelegationToken
Retrieving token from: https://<NN IP>:50470/getDelegationToken?renewer=mapred
11/09/23 21:52:34 INFO security.TokenCache: Got dt for h...
- HDFS-2356.
Major sub-task reported by szetszwo and fixed by szetszwo
webhdfs: support case insensitive query parameter names
- HDFS-2355.
Major improvement reported by sureshms and fixed by sureshms (name-node)
Federation: enable using the same configuration file across all the nodes in the cluster.
This change allows the same configuration file to be shared across all the nodes in the cluster (DataNodes, NameNode, BackupNode, SecondaryNameNode) when running multiple namenodes on different hosts, without the need to define the dfs.federation.nameservice.id parameter.
- HDFS-2348.
Major sub-task reported by szetszwo and fixed by szetszwo
Support getContentSummary and getFileChecksum in webhdfs
- HDFS-2347.
Trivial bug reported by umamaheswararao and fixed by umamaheswararao (name-node)
checkpointTxnCount's comment still saying about editlog size
As per the latest changes, a checkpoint is triggered based on transaction count instead of editlog size. But the comment on checkpointTxnCount still refers to editlog size.
{code}
private long checkpointTxnCount; // size (in MB) of current Edit Log
{code}
- HDFS-2344.
Major bug reported by umamaheswararao and fixed by umamaheswararao (test)
Fix the TestOfflineEditsViewer test failure in 0.23 branch
TestOfflineEditsViewer test fails in 0.23 branch
- HDFS-2340.
Major sub-task reported by szetszwo and fixed by szetszwo
Support getFileBlockLocations and getDelegationToken in webhdfs
- HDFS-2338.
Major sub-task reported by jnp and fixed by jnp
Configuration option to enable/disable webhdfs.
Added a conf property dfs.webhdfs.enabled for enabling/disabling webhdfs.
- HDFS-2333.
Major bug reported by ikelly and fixed by szetszwo
HDFS-2284 introduced 2 findbugs warnings on trunk
When HDFS-2284 was submitted it made DFSOutputStream public, which triggered two SC_START_IN_CTOR FindBugs warnings.
- HDFS-2332.
Major test reported by tlipcon and fixed by tlipcon (test)
Add test for HADOOP-7629: using an immutable FsPermission as an IPC parameter
HADOOP-7629 fixes a bug where an immutable FsPermission would throw an error if used as the argument to fs.setPermission(). This JIRA is to add a test case for the common bugfix.
- HDFS-2331.
Major bug reported by abhijit.shingate and fixed by abhijit.shingate (hdfs client)
Hdfs compilation fails
I am trying to perform a complete build from the trunk folder, but the compilation fails.
*Commandline:*
mvn clean install
*Error Message:*
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:2.3.2:compile (default-compile) on project hadoop-hdfs: Compilation failure
[ERROR] \Hadoop\SVN\trunk\hadoop-hdfs-project\hadoop-hdfs\src\main\java\org\apache\hadoop\hdfs\web\WebHdfsFileSystem.java:[209,21] type parameters of <T>T cannot be determined; no unique maximal instance...
- HDFS-2323.
Major bug reported by tomwhite and fixed by tomwhite
start-dfs.sh script fails for tarball install
I built Common and HDFS tarballs from trunk, then tried to start a cluster with start-dfs.sh, but I got the following error:
{noformat}
Starting namenodes on [localhost ]
sbin/start-dfs.sh: line 55: /Users/tom/tmp/hadoop/libexec/../bin/hadoop-daemons.sh: No such file or directory
sbin/start-dfs.sh: line 68: /Users/tom/tmp/hadoop/libexec/../bin/hadoop-daemons.sh: No such file or directory
Starting secondary namenodes [0.0.0.0 ]
sbin/start-dfs.sh: line 88: /Users/tom/tmp/hadoop/libexec/../bin/h...
- HDFS-2322.
Major bug reported by tucu00 and fixed by tucu00 (build)
the build fails in Windows because commons-daemon TAR cannot be fetched
For Windows there is no commons-daemon TAR but a ZIP, and the name follows a different convention.
- HDFS-2318.
Major sub-task reported by szetszwo and fixed by szetszwo
Provide authentication to webhdfs using SPNEGO
Added two new conf properties dfs.web.authentication.kerberos.principal and dfs.web.authentication.kerberos.keytab for the SPNEGO servlet filter.
- HDFS-2317.
Major sub-task reported by szetszwo and fixed by szetszwo
Read access to HDFS using HTTP REST
- HDFS-2314.
Major bug reported by vinodkv and fixed by tlipcon (test)
MRV1 test compilation broken after HDFS-2197
Running the following:
At the trunk level: {{mvn clean install package -Dtar -Pdist -Dmaven.test.skip.exec=true}}
In hadoop-mapreduce-project: {{ant jar-test -Dresolvers=internal}}
yields the errors:
{code}
[javac] /home/vinodkv/Workspace/eclipse-workspace/apache-git/hadoop-common/hadoop-mapreduce-project/src/test/mapred/org/apache/hadoop/security/authorize/TestServiceLevelAuthorization.java:62: cannot find symbol
[javac] symbol : method getRpcServer(org.apache.hadoop.hdfs.server...
- HDFS-2294.
Major improvement reported by tucu00 and fixed by tucu00 (build)
Download of commons-daemon TAR should not be under target
The committed HDFS-2289 patch downloads the commons-daemon TAR into hadoop-hdfs/target/; earlier patches for HDFS-2289 used hadoop-hdfs/download/ as the download location.
The motivation not to use the 'target/' directory is that on every clean build the TAR will be downloaded from Apache archives. With a 'download' directory, this happens only once per workspace.
The patch was also adding the 'download/' directory to the .gitignore file (it should also be svn ignored).
Besides downloading it...
- HDFS-2290.
Major bug reported by shv and fixed by benoyantony (name-node)
Block with corrupt replica is not getting replicated
A block has one replica marked as corrupt and two good ones. countNodes() correctly detects that there are only 2 live replicas, and fsck reports the block as under-replicated. But ReplicationMonitor never schedules replication of good replicas.
- HDFS-2289.
Blocker bug reported by acmurthy and fixed by tucu00
jsvc isn't part of the artifact
Apparently we had something like this in build.xml:
<property name="jsvc.location" value="http://archive.apache.org/dist/commons/daemon/binaries/1.0.2/linux/commons-daemon-1.0.2-bin-linux-i386.tar.gz" />
Also, when I manually add in jsvc binary I get this error:
{noformat}
25/08/2011 23:47:18 29805 jsvc.exec error: Cannot find daemon loader org/apache/commons/daemon/support/DaemonLoader
25/08/2011 23:47:18 29778 jsvc.exec error: Service exit with a return value of 1
{noformat}
- HDFS-2286.
Trivial improvement reported by tlipcon and fixed by tlipcon (data-node)
DataXceiverServer logs AsynchronousCloseException at shutdown
During DN shutdown, the acceptor thread gets an AsynchronousCloseException, and logs it at WARN level. This exception is expected, since another thread is closing the listener socket, so we should just swallow it.
- HDFS-2284.
Major sub-task reported by sanjay.radia and fixed by szetszwo
Write Http access to HDFS
HFTP allows only read access to HDFS via HTTP. Add write HTTP access to HDFS.
- HDFS-2273.
Minor improvement reported by szetszwo and fixed by szetszwo (name-node)
Refactor BlockManager.recentInvalidateSets to a new class
recentInvalidateSets and the associated methods can be moved out from BlockManager.
- HDFS-2267.
Trivial bug reported by tlipcon and fixed by tlipcon (data-node)
DataXceiver thread name incorrect while waiting on op during keepalive
Since HDFS-941, the DataXceiver can spend time waiting for a second op to come from the client. Currently, its thread name indicates whatever the previous operation was, rather than something like "Waiting in keepalive for a new request".
- HDFS-2266.
Major sub-task reported by szetszwo and fixed by szetszwo (name-node)
Add a Namesystem interface to avoid directly referring to FSNamesystem
- HDFS-2265.
Major sub-task reported by szetszwo and fixed by szetszwo (name-node)
Remove unnecessary BlockTokenSecretManager fields/methods from BlockManager
- HDFS-2260.
Major improvement reported by tlipcon and fixed by tlipcon (hdfs client)
Refactor BlockReader into an interface and implementation
For the new block reader in HDFS-2129, or the local block reader in HDFS-347, we need to be able to support different implementations. This JIRA is to simply refactor the current BlockReader into an interface and an implementation.
- HDFS-2258.
Major bug reported by shv and fixed by shv (name-node, test)
TestLeaseRecovery2 fails as lease hard limit is not reset to default
TestLeaseRecovery2.testSoftLeaseRecovery() fails as lease hard limit remains set to 1 sec from the previous test case. If initial file creation in testSoftLeaseRecovery() takes longer than 1 sec, NN correctly reassigns the lease to itself and starts recovery. The test fails as the client cannot hflush() and close the file.
- HDFS-2245.
Major bug reported by szetszwo and fixed by szetszwo (name-node)
BlockManager.chooseTarget(..) throws NPE
{noformat}
2011-08-10 20:20:51,350 INFO org.apache.hadoop.ipc.Server: IPC Server handler 1 on 8020, call: addBlock(/user/hadoopqa/passwd.1108102020.<NN hostname>.txt, DFSClient_NONMAPREDUCE_1875954430_1, null, null), rpc version=1, client version=68, methodsFingerPrint=-1239577025 from <gateway>:38874, error:
java.io.IOException: java.lang.NullPointerException
at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget(BlockManager.java:1225)
at org.apache.had...
- HDFS-2241.
Major improvement reported by sureshms and fixed by sureshms
Remove implementing FSConstants interface just to access the constants defined in the interface
Currently many classes implement FSConstants.java interface just for the convenience of accessing the constants defined in it. This could be done by using static imports or in some cases using FSConstants.<CONSTANT_NAME>, with no need for implementing the interface.
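The difference between the two styles can be shown with a small sketch. StorageConstants and its values are hypothetical and stand in for an interface such as FSConstants; the sketch uses qualified access, one of the two alternatives the note mentions.

```java
// Hypothetical constants holder, standing in for an interface such as FSConstants.
interface StorageConstants {
    int MIN_BLOCKS_FOR_WRITE = 5;        // illustrative values only
    long LEASE_SOFT_LIMIT_MS = 60_000L;
}

// Style the note discourages: implementing the interface only to inherit its names.
final class InheritsConstants implements StorageConstants {
    static long softLimit() {
        return LEASE_SOFT_LIMIT_MS;
    }
}

// Style the note suggests: refer to the constant through its owning type.
final class QualifiedAccess {
    static long softLimit() {
        return StorageConstants.LEASE_SOFT_LIMIT_MS;
    }

    public static void main(String[] args) {
        // Both styles read the same value; only the type hierarchy differs.
        System.out.println(InheritsConstants.softLimit() == softLimit()); // true
    }
}
```

With qualified access the constants no longer leak into the public type hierarchy of every class that happens to use them.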
- HDFS-2240.
Critical bug reported by tlipcon and fixed by szetszwo (hdfs client)
Possible deadlock between LeaseRenewer and its factory
Lock cycle detected by jcarder
- HDFS-2239.
Major sub-task reported by szetszwo and fixed by szetszwo (name-node)
Reduce access levels of the fields and methods in FSNamesystem
- HDFS-2238.
Minor improvement reported by szetszwo and fixed by umamaheswararao (name-node)
NamenodeFsck.toString() uses StringBuilder with + operator
We should always use StringBuilder.append(..) but not + (string concatenation).
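A small sketch of the two styles, using a hypothetical FsckSummary class rather than the real NamenodeFsck code. Both methods produce the same text; the first builds intermediate strings with + before each append(..) call, while the second appends each piece directly.

```java
final class FsckSummary {
    // Mixed style: each '+' expression builds a temporary String before append() runs.
    static String withConcat(String path, long blocks) {
        StringBuilder sb = new StringBuilder();
        sb.append("Path: " + path + "\n");
        sb.append("Blocks: " + blocks + "\n");
        return sb.toString();
    }

    // Preferred style: every piece goes straight into the builder.
    static String withAppend(String path, long blocks) {
        StringBuilder sb = new StringBuilder();
        sb.append("Path: ").append(path).append('\n');
        sb.append("Blocks: ").append(blocks).append('\n');
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(withAppend("/user/example", 3L));
    }
}
```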
- HDFS-2237.
Minor sub-task reported by szetszwo and fixed by szetszwo (name-node)
Change UnderReplicatedBlocks from public to package private
- HDFS-2235.
Major bug reported by eli and fixed by eli (name-node)
Encode servlet paths
Hftp does not support paths which contain semicolons. The commented out test in HDFS-2234 illustrates this.
- HDFS-2233.
Major test reported by eli and fixed by eli (name-node)
Add WebUI tests with URI reserved chars in the path and filename
The web UI tests should cover paths where the path and filenames contain URI reserved characters, i.e., Web UI coverage for HDFS-2235.
- HDFS-2232.
Blocker bug reported by shv and fixed by zero45 (test)
TestHDFSCLI fails on 0.22 branch
Several HDFS CLI tests fail on 0.22 branch. I can see 2 reasons:
# Not generic enough regular expression for host names and paths. Similar to MAPREDUCE-2304.
# Some command outputs have new-line in the end.
# And some seem to produce [much] more output than expected.
- HDFS-2230.
Major improvement reported by gkesavan and fixed by gkesavan (build)
hdfs is not resolving the latest common test jars published post common mavenization
hdfs is not pulling the right common test jar.
The hadoop-common test jar dependency in ivy.xml has to be configured as type=tests and not as a separate module.
- HDFS-2229.
Blocker bug reported by vinodkv and fixed by szetszwo (name-node)
Deadlock in NameNode
Either I am doing something incredibly stupid, or something about my environment is completely weird, or maybe it really is a valid bug. I am running into a NameNode deadlock consistently with 0.23 HDFS. I could never start NN successfully.
- HDFS-2228.
Major sub-task reported by szetszwo and fixed by szetszwo (name-node)
Move block and datanode code from FSNamesystem to BlockManager and DatanodeManager
- HDFS-2227.
Major improvement reported by ikelly and fixed by ikelly
HDFS-2018 Part 2 : getRemoteEditLogManifest should pull its information from FileJournalManager
This is the second part of HDFS-2018. This patch moves the code that selects the available RemoteEditLogManifest out of the transactional inspector and into FileJournalManager.
- HDFS-2226.
Trivial improvement reported by tlipcon and fixed by tlipcon (name-node)
Clean up counting of operations in FSEditLogLoader
This is simple cleanup in FSEditLogLoader - rather than having a variable per operation type, we can just use an EnumMap to count how many instances of each opcode we've hit.
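The idea can be sketched as follows. The OpCode enum here is a hypothetical four-value subset and OpCounter is an illustrative class, not the FSEditLogLoader code; the point is that one EnumMap replaces a separate counter variable per operation type.

```java
import java.util.EnumMap;
import java.util.Map;

// Hypothetical subset of opcodes, for illustration only.
enum OpCode { OP_ADD, OP_CLOSE, OP_DELETE, OP_MKDIR }

final class OpCounter {
    private final EnumMap<OpCode, Integer> counts = new EnumMap<>(OpCode.class);

    // Increment the counter for this opcode, starting at 1 on first sight.
    void record(OpCode op) {
        counts.merge(op, 1, Integer::sum);
    }

    // Opcodes that were never recorded report zero.
    int count(OpCode op) {
        return counts.getOrDefault(op, 0);
    }

    public static void main(String[] args) {
        OpCounter counter = new OpCounter();
        counter.record(OpCode.OP_ADD);
        counter.record(OpCode.OP_ADD);
        counter.record(OpCode.OP_CLOSE);
        for (Map.Entry<OpCode, Integer> e : counter.counts.entrySet()) {
            System.out.println(e.getKey() + " = " + e.getValue());
        }
    }
}
```

Adding a new opcode then needs no new counter field, only a new enum constant.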
- HDFS-2225.
Major improvement reported by ikelly and fixed by ikelly
HDFS-2018 Part 1 : Refactor file management so it's not in classes which should be generic
This is the first part of HDFS-2018 changes, to refactor some of the file management classes so they're in file specific places rather than in the generic classes.
- HDFS-2212.
Major improvement reported by tlipcon and fixed by tlipcon (name-node)
Refactor double-buffering code out of EditLogOutputStreams
This is a small cleanup that makes EditLogFileOutputStream and EditLogBackupOutputStream more consistent with each other on how they buffer edits. It simply refactors the double-buffering behavior into a new class.
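For readers unfamiliar with the pattern, double-buffering can be sketched roughly as below: one buffer accepts new writes while the other holds data that is ready to be flushed, and the two are swapped at a sync point. The class DoubleBuffer and its method names are assumptions made for this sketch and are not taken from the patch.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

final class DoubleBuffer {
    private ByteArrayOutputStream current = new ByteArrayOutputStream(); // receives new writes
    private ByteArrayOutputStream ready = new ByteArrayOutputStream();   // waiting to be flushed

    synchronized void write(byte[] data) {
        current.write(data, 0, data.length);
    }

    // Swap the buffers so that writers can continue while 'ready' is flushed.
    synchronized void setReadyToFlush() {
        if (ready.size() != 0) {
            throw new IllegalStateException("previous buffer has not been flushed");
        }
        ByteArrayOutputStream tmp = ready;
        ready = current;
        current = tmp;
    }

    // Hand back the bytes to flush and clear the ready buffer for reuse.
    synchronized byte[] drainReady() {
        byte[] out = ready.toByteArray();
        ready.reset();
        return out;
    }

    public static void main(String[] args) {
        DoubleBuffer buf = new DoubleBuffer();
        buf.write("edit-1;".getBytes(StandardCharsets.UTF_8));
        buf.setReadyToFlush();
        buf.write("edit-2;".getBytes(StandardCharsets.UTF_8)); // lands in the other buffer
        System.out.println(new String(buf.drainReady(), StandardCharsets.UTF_8)); // edit-1;
    }
}
```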
- HDFS-2210.
Major task reported by eli and fixed by eli (contrib/hdfsproxy)
Remove hdfsproxy
The hdfsproxy contrib component is no longer supported.
- HDFS-2209.
Minor improvement reported by stevel@apache.org and fixed by stevel@apache.org (test)
Make MiniDFS easier to embed in other apps
I've been deploying MiniDFSCluster for some testing, and while using it/looking through the code I made some notes of where there are issues and improvement opportunities. This is mostly minor as it's a test tool, but a risk of synchronization problems is there and does need addressing; the rest are all feature creep.
Field {{nameNode}} should be marked as volatile as the shutdown operation can be in a different thread than startup. Best of all,
add synchronized methods to set and get the f...
- HDFS-2205.
Major improvement reported by raviprak and fixed by raviprak (hdfs client)
Log message for failed connection to datanode is not followed by a success message.
To avoid confusing users on whether their HDFS operation was successful or not, a success message should be printed.
- HDFS-2202.
Major new feature reported by eepayne and fixed by eepayne (balancer, data-node)
Changes to balancer bandwidth should not require datanode restart.
New dfsadmin command added: [-setBalancerBandwidth <bandwidth>] where bandwidth is the max network bandwidth in bytes per second that the balancer is allowed to use on each datanode during balancing.
This is an incompatible change in 0.23. The versions of ClientProtocol and DatanodeProtocol are changed.
- HDFS-2200.
Minor sub-task reported by szetszwo and fixed by szetszwo (name-node)
Set FSNamesystem.LOG to package private
- HDFS-2199.
Major sub-task reported by szetszwo and fixed by umamaheswararao (name-node)
Move blockTokenSecretManager from FSNamesystem to BlockManager
- HDFS-2198.
Minor improvement reported by sureshms and fixed by sureshms (data-node, hdfs client, name-node)
Remove hardcoded configuration keys
Remove hardcoded config keys in hdfs code. Will do it in a separate jira for test code.
- HDFS-2197.
Major improvement reported by tlipcon and fixed by tlipcon (name-node)
Refactor RPC call implementations out of NameNode class
For HA, the NameNode will gain a bit of a state machine, to be able to transition between standby and active states. This would be cleaner in the code if the {{NameNode}} class were just a container for various services, as discussed in HDFS-1974. It's also nice for testing, where it would become easier to construct just the RPC handlers around a mock NameSystem, with no HTTP server, for example.
This JIRA is to move all of the protocol implementations out of {{NameNode}} into a separate {{N...
- HDFS-2196.
Major task reported by tucu00 and fixed by tucu00 (build)
Make ant build system work with hadoop-common JAR generated by Maven
Some tweaks must be done in HDFS ivy configuration to work with HADOOP-6671.
This will be a temporary fix until HDFS is mavenized.
- HDFS-2191.
Major sub-task reported by szetszwo and fixed by szetszwo (name-node)
Move datanodeMap from FSNamesystem to DatanodeManager
- HDFS-2187.
Major improvement reported by ikelly and fixed by ikelly
HDFS-1580: Make EditLogInputStream act like an iterator over FSEditLogOps
This JIRA is for the input side changes moved out of HDFS-2149. EditLogInputStream has been changed to no longer be an InputStream implementation, but to return a stream of FSEditLogOp objects using readOp(). The upshot is that all that can ever be read from an EditLogInputStream is an op. No random hackery can be used to put other things in the stream. Version is now a property of the EditLogInputStream and retrieved using getVersion().
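The shape of such an API can be sketched with stand-in types. Only the method names readOp() and getVersion() come from this note; Op, OpStream, InMemoryOpStream, the version number used and the convention that readOp() returns null at the end of the stream are assumptions of this sketch.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;

// Hypothetical op type standing in for FSEditLogOp.
final class Op {
    final String name;
    final long txid;

    Op(String name, long txid) {
        this.name = name;
        this.txid = txid;
    }
}

// Hypothetical stream type: callers can only ever obtain ops, never raw bytes.
interface OpStream {
    Op readOp();      // next op, or null when exhausted (assumption of this sketch)
    int getVersion(); // version is a property of the stream itself
}

final class InMemoryOpStream implements OpStream {
    private final int version;
    private final Deque<Op> ops;

    InMemoryOpStream(int version, Op... ops) {
        this.version = version;
        this.ops = new ArrayDeque<>(Arrays.asList(ops));
    }

    @Override
    public Op readOp() {
        return ops.poll(); // poll() yields null once the queue is empty
    }

    @Override
    public int getVersion() {
        return version;
    }
}

final class OpStreamDemo {
    // Iterates the stream op by op until it is exhausted.
    static int countOps(OpStream in) {
        int n = 0;
        while (in.readOp() != null) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        // -38 is an arbitrary version number used only for this example.
        OpStream in = new InMemoryOpStream(-38, new Op("OP_ADD", 1), new Op("OP_CLOSE", 2));
        System.out.println("version " + in.getVersion() + ", ops " + countOps(in));
    }
}
```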
- HDFS-2186.
Major bug reported by eli and fixed by eli (data-node)
DN volume failures on startup are not counted
Volume failures detected on startup are not currently counted/reported as such. E.g., if you have configured 4 volumes and 2 tolerated failures, and you start a DN with two failed volumes, it will come up and report (to the NN) no failed volumes. The DN will still be able to tolerate 2 additional volume failures (i.e., it's OK with no valid volumes remaining). The intent of the volume failure toleration config value is that if more than this # of volumes of the total set of configured volumes have fail...
- HDFS-2180.
Major improvement reported by tlipcon and fixed by tlipcon
Refactor NameNode HTTP server into new class
As discussed in HDFS-1974, it would be nice to refactor some parts of NameNode.java out into their own classes. This JIRA is to move out the HTTP server.
- HDFS-2167.
Major sub-task reported by szetszwo and fixed by szetszwo (name-node)
Move dnsToSwitchMapping and hostsReader from FSNamesystem to DatanodeManager
- HDFS-2161.
Minor improvement reported by szetszwo and fixed by szetszwo (balancer, data-node, hdfs client, name-node, security)
Move utilities to DFSUtil
Utilities include
- {{createNamenode(..)}}, {{createClientDatanodeProtocolProxy(..)}};
- {{stringifyToken(..)}}; and
- {{Random}} object.
- HDFS-2159.
Major sub-task reported by szetszwo and fixed by szetszwo (hdfs client)
Deprecate DistributedFileSystem.getClient()
The DFSClient in DistributedFileSystem should not be accessed directly.
- HDFS-2157.
Major improvement reported by atm and fixed by atm (documentation, name-node)
Improve header comment in o.a.h.hdfs.server.namenode.NameNode
A developer new to HDFS pointed out to me that the header comment at the top of {{NameNode.java}} is a little out of date/inaccurate.
- HDFS-2156.
Major bug reported by owen.omalley and fixed by eyang
rpm should only require the same major version as common
The rpm for hdfs should only require the same major version (eg. 0.23) of common.
- HDFS-2154.
Minor test reported by szetszwo and fixed by szetszwo (test)
TestDFSShell should use test dir
The new test added by HDFS-2131 creates files/directories under the current directory.
- HDFS-2153.
Minor bug reported by szetszwo and fixed by szetszwo (test)
DFSClientAdapter should be put under test
{{DFSClientAdapter}} is a test utility but it is put in src/java.
- HDFS-2149.
Major sub-task reported by ikelly and fixed by ikelly (name-node)
Move EditLogOp serialization formats into FsEditLogOp implementations
On trunk serialisation of editlog ops is in FSEditLog#log* and deserialisation is in FSEditLogOp.*Op . This improvement is to move the serialisation code into one place, i.e under FSEditLogOp.*Op.
This is part of HDFS-1580.
- HDFS-2147.
Major sub-task reported by szetszwo and fixed by szetszwo (name-node)
Move cluster network topology to block management
- HDFS-2144.
Major improvement reported by raviprak and fixed by raviprak (name-node)
If SNN shuts down during initialization it does not log the cause
SNN should log messages when it shuts down because of authentication issues.
- HDFS-2143.
Major improvement reported by raviprak and fixed by raviprak
Federation: we should link to the live nodes and dead nodes to cluster web console
The dfsclusterhealth page shows the number of live and dead nodes. It would be nice to link those numbers to the page containing the list of those nodes
- HDFS-2141.
Major sub-task reported by sureshms and fixed by sureshms (name-node)
Remove NameNode roles Active and Standby (they become states)
In HDFS, the following roles are supported in NameNodeRole: ACTIVE, BACKUP, CHECKPOINT and STANDBY.
Active and Standby are states of the NameNode, while Backup and Checkpoint are the names/roles of the daemons that are started. This mixes up the run time state of the NameNode with the daemon role. I propose changing the NameNodeRole to: NAMENODE, BACKUP, CHECKPOINT. HDFS-1974 will introduce the states active and standby to the daemon that is running in the role NAMENODE.
- HDFS-2140.
Major sub-task reported by szetszwo and fixed by szetszwo (name-node)
Move Host2NodesMap to block management
- HDFS-2134.
Major sub-task reported by szetszwo and fixed by szetszwo (name-node)
Move DecommissionManager to block management
Datanode management including {{DecommissionManager}} should belong to block management.
- HDFS-2132.
Major bug reported by atm and fixed by atm
Potential resource leak in EditLogFileOutputStream.close
{{EditLogFileOutputStream.close(...)}} sequentially closes a series of underlying resources. If any of the calls to {{close()}} throw before the last one, the later resources will never be closed.
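One common way to avoid this kind of leak is to attempt every close and rethrow the first failure afterwards. The helper below is a generic sketch under that assumption; CloseAll is a hypothetical name and this is not the code from the patch.

```java
import java.io.Closeable;
import java.io.IOException;

final class CloseAll {
    // Closes every resource even if an earlier close() throws.
    // The first IOException is rethrown at the end; later ones are attached as suppressed.
    static void closeAll(Closeable... resources) throws IOException {
        IOException first = null;
        for (Closeable c : resources) {
            if (c == null) {
                continue;
            }
            try {
                c.close();
            } catch (IOException e) {
                if (first == null) {
                    first = e;
                } else {
                    first.addSuppressed(e);
                }
            }
        }
        if (first != null) {
            throw first;
        }
    }

    public static void main(String[] args) {
        Closeable failing = () -> { throw new IOException("first close failed"); };
        Closeable second = () -> System.out.println("second resource closed anyway");
        try {
            closeAll(failing, second);
        } catch (IOException e) {
            System.out.println("reported: " + e.getMessage());
        }
    }
}
```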
- HDFS-2131.
Major test reported by umamaheswararao and fixed by umamaheswararao (test)
Tests for HADOOP-7361
- HDFS-2118.
Minor improvement reported by eli and fixed by eli (data-node)
Couple dfs data dir improvements
Some small dfs data dir improvements:
* DataNode#getDataDirsFromURIs should indicate which directory failed.
* FSDataset#FSDataset should use getTrimmedStrings when reading dfs.data.dir config.
* Fixes a spelling mistake in DataXceiver and DataXceiverServer
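The trimming item above matters because a comma-separated directory list is easy to write with stray spaces or line breaks. The sketch below shows the general idea with a hypothetical DataDirs helper; it is a simplified stand-in, not Hadoop's getTrimmedStrings, and dropping empty entries is a choice made for this example.

```java
import java.util.ArrayList;
import java.util.List;

final class DataDirs {
    // Splits a comma-separated value and trims whitespace around each entry.
    // Empty entries are dropped (a simplification chosen for this sketch).
    static List<String> trimmed(String value) {
        List<String> dirs = new ArrayList<>();
        if (value == null) {
            return dirs;
        }
        for (String part : value.split(",")) {
            String dir = part.trim();
            if (!dir.isEmpty()) {
                dirs.add(dir);
            }
        }
        return dirs;
    }

    public static void main(String[] args) {
        // Without trimming, the second entry would start with a newline and spaces.
        System.out.println(trimmed("/data/1/dfs ,\n   /data/2/dfs"));
    }
}
```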
- HDFS-2116.
Minor improvement reported by eli and fixed by zero45 (test)
Cleanup TestStreamFile and TestByteRangeInputStream
TestStreamFile and TestByteRangeInputStream should use mockito. This would allow the private URLOpener class to be removed from ByteRangeInputStream.
- HDFS-2114.
Major bug reported by johnvijoe and fixed by johnvijoe
re-commission of a decommissioned node does not delete excess replica
If a decommissioned node is removed from the decommissioned list, namenode does not delete the excess replicas it created while the node was decommissioned.
- HDFS-2112.
Major sub-task reported by szetszwo and fixed by umamaheswararao (name-node)
Move ReplicationMonitor to block management
Replication should be handled by block manager instead of name system.
- HDFS-2111.
Major test reported by qwertymaniac and fixed by qwertymaniac (data-node, test)
Add tests for ensuring that the DN will start with a few bad data directories (Part 1 of testing DiskChecker)
Add tests to ensure that, given multiple data dirs, the DN still starts up if a single one is bad.
This checks the DiskChecker functionality used when instantiating DataNodes.
- HDFS-2110.
Minor improvement reported by eli and fixed by eli (name-node)
Some StreamFile and ByteRangeInputStream cleanup
StreamFile#sendPartialData can be cleaned up, has some System.out.printlns, no javadoc, and the byte copying method should be pulled out to IOUtils.
- HDFS-2109.
Major bug reported by bharathm and fixed by bharathm (hdfs client)
Store uMask as member variable to DFSClient.Conf
As a part of removing the reference to conf in DFSClient, I propose replacing FsPermission.getUMask(conf) everywhere in the DFSClient class with dfsClientConf.uMask, by storing uMask as a member variable of DFSClient.Conf.
- HDFS-2108.
Major sub-task reported by szetszwo and fixed by szetszwo (name-node)
Move datanode heartbeat handling to BlockManager
Logically, datanodes should heartbeat to block manager instead of name system. Therefore, we should move datanode heartbeat handling code to {{BlockManager}}.
- HDFS-2107.
Major sub-task reported by szetszwo and fixed by szetszwo (name-node)
Move block management code to a package
Moved block management code to a new package, org.apache.hadoop.hdfs.server.blockmanagement.
- HDFS-2100.
Minor test reported by atm and fixed by atm (test)
Improve TestStorageRestore
Though running multiple 2NNs isn't supported, accidentally doing so should not result in HDFS metadata corruptions. We should add a test case to exercise this possibility when name.dir.storage.restore is enabled, which is a particularly delicate code path.
- HDFS-2096.
Major task reported by tucu00 and fixed by tucu00 (build)
Mavenization of hadoop-hdfs
Same as HADOOP-6671 for hdfs
- HDFS-2092.
Major bug reported by bharathm and fixed by bharathm (hdfs client)
Create a light inner conf class in DFSClient
At present, DFSClient stores a reference to the configuration object. Since these configuration objects can be pretty big at times, they can bloat processes that have multiple DFSClient objects, such as the TaskTracker. This is an attempt to remove the reference to the conf object in DFSClient.
This patch creates a light inner conf class and copies the required keys from the Configuration object.
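The idea can be sketched in Java as follows. This is an illustration only: a plain Map stands in for the Hadoop Configuration object, and the class name, the chosen keys and the default values are assumptions rather than the contents of the real DFSClient.Conf.

```java
import java.util.HashMap;
import java.util.Map;

final class LightConfSketch {
  /** Small immutable holder: only the values the client needs are copied. */
  static final class Conf {
    final int replication;
    final long blockSize;

    Conf(Map<String, String> full) {
      // Parse once at construction; no reference to the full map is retained.
      // Keys and defaults here are illustrative assumptions.
      replication = Integer.parseInt(full.getOrDefault("dfs.replication", "3"));
      blockSize = Long.parseLong(full.getOrDefault("dfs.blocksize", "67108864"));
    }
  }

  public static void main(String[] args) {
    Map<String, String> full = new HashMap<>();
    full.put("dfs.replication", "2");
    Conf conf = new Conf(full);
    System.out.println(conf.replication + " " + conf.blockSize);
  }
}
```

Because the holder keeps only primitive fields, the large configuration object can be garbage collected once every client has been constructed.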
- HDFS-2087.
Major sub-task reported by szetszwo and fixed by szetszwo (data-node, hdfs client)
Add methods to DataTransferProtocol interface
Declare methods in DataTransferProtocol interface, and change Sender and Receiver to implement the interface.
- HDFS-2086.
Major bug reported by tanping and fixed by tanping (name-node)
If the include hosts list contains host name, after restarting namenode, datanodes registrant is denied
As the title describes: if the include hosts list contains host names, then after the namenode is restarted, datanode registration is denied. This is because, after the namenode is restarted, a still-alive datanode will try to register itself with the namenode and identifies itself by its *IP address*. However, the namenode only allows hosts in its hosts list to register, and all of them are listed by hostname. So the namenode denies the datanode registration.
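A hedged Java sketch of the kind of check that avoids the mismatch. The class, the method and the injected resolver are hypothetical and do not reflect the actual namenode code: an include entry matches either literally or after being resolved to an IP address.

```java
import java.util.Set;
import java.util.function.Function;

final class IncludeListSketch {
  /**
   * Returns true if the registering address matches an include entry either
   * literally (the entry is an IP) or after resolving the entry (the entry
   * is a hostname). The resolver is passed in so the logic stays testable.
   */
  static boolean isIncluded(Set<String> includes, String registeredIp,
                            Function<String, String> resolveToIp) {
    for (String entry : includes) {
      if (entry.equals(registeredIp)) {
        return true;
      }
      String resolved = resolveToIp.apply(entry); // may be null if unresolvable
      if (registeredIp.equals(resolved)) {
        return true;
      }
    }
    return false;
  }
}
```

In a real deployment the resolver could be backed by java.net.InetAddress.getByName(entry).getHostAddress(), with lookup failures mapped to null.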
- HDFS-2083.
Major new feature reported by tanping and fixed by tanping
Adopt JMXJsonServlet into HDFS in order to query statistics
HADOOP-7144 added JMXJsonServlet to Common. It makes statistics and metrics exposed via JMX queryable through HTTP. We adopt this in HDFS. This provides an alternative solution to HDFS-1874.
- HDFS-2082.
Major bug reported by atm and fixed by atm
SecondaryNameNode web interface doesn't show the right info
HADOOP-3741 introduced some useful info to the 2NN web UI. This broke when security was added.
- HDFS-2073.
Minor improvement reported by sureshms and fixed by sureshms (name-node)
Namenode is missing @Override annotations
NameNode implements several protocols. The methods that implement the interface do not have @Override. Also @inheritdoc is used, which is not needed with @Override.
- HDFS-2069.
Trivial sub-task reported by raviphulari and fixed by qwertymaniac (documentation)
Incorrect default trash interval value in the docs
Current HDFS architecture information about Trash is incorrectly documented as -
{color:red}
The current default policy is to delete files from /trash that are more than 6 hours old. In the future, this policy will be configurable through a well defined interface.
{color}
It should be something like -
The current default trash interval is 0 (files are deleted without being stored in trash). This value is a configurable parameter, fs.trash.interval, stored in core-site.xml.
- HDFS-2067.
Major bug reported by tlipcon and fixed by szetszwo (data-node, hdfs client)
Bump DATA_TRANSFER_VERSION in trunk for protobufs
Forgot to bump DATA_TRANSFER_VERSION in HDFS-2058. We need to do this since the protobufs are incompatible with the old writables.
- HDFS-2066.
Major sub-task reported by szetszwo and fixed by szetszwo (data-node, hdfs client, name-node)
Create a package and individual class files for DataTransferProtocol
{{DataTransferProtocol}} contains quite a few classes. It is better to create a package and put the classes into individual files.
- HDFS-2065.
Major bug reported by bharathm and fixed by umamaheswararao
Fix NPE in DFSClient.getFileChecksum
The following code can throw an NPE if the server returns null from callGetBlockLocations:
{code}
List<LocatedBlock> locatedblocks
= callGetBlockLocations(namenode, src, 0, Long.MAX_VALUE).getLocatedBlocks();
{code}
The right fix is for the server to throw an appropriate exception.
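Independently of the server-side fix, a client can guard the dereference. A minimal Java sketch, assuming a hypothetical helper requireBlocks (not part of DFSClient) that turns a null result into a descriptive exception:

```java
import java.io.FileNotFoundException;
import java.util.List;

final class NullCheckSketch {
  /**
   * Returns the located blocks unchanged, or throws a descriptive exception
   * instead of letting a later dereference fail with an NPE.
   */
  static <T> List<T> requireBlocks(List<T> located, String src)
      throws FileNotFoundException {
    if (located == null) {
      throw new FileNotFoundException("File does not exist: " + src);
    }
    return located;
  }
}
```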
- HDFS-2061.
Minor bug reported by mattf and fixed by mattf (name-node)
two minor bugs in BlockManager block report processing
In a recent review of the HDFS-1295 patches (speedup for block report processing), I found two very minor bugs in BlockManager, as documented in the following comments.
- HDFS-2058.
Major new feature reported by tlipcon and fixed by tlipcon
DataTransfer Protocol using protobufs
We've been talking about this for a long time... would be nice to use something like protobufs or Thrift for some of our wire protocols.
I knocked together a prototype of DataTransferProtocol on top of protobufs that seems to work.
- HDFS-2056.
Minor improvement reported by tanping and fixed by tanping (documentation, tools)
Update fetchdt usage
Update the usage of fetchdt.
- HDFS-2055.
Major new feature reported by traviscrawford and fixed by traviscrawford (libhdfs)
Add hflush support to libhdfs
Add hdfsHFlush to libhdfs.
- HDFS-2054.
Minor improvement reported by kihwal and fixed by kihwal (data-node)
BlockSender.sendChunk() prints ERROR for connection closures encountered during transferToFully()
The addition of ERROR was part of HDFS-1527. In environments where clients tear down FSInputStream/connection before reaching the end of stream, this error message often pops up. Since these are not really errors and especially not the fault of data node, the message should be toned down at least.
- HDFS-2053.
Minor bug reported by miguno and fixed by miguno (name-node)
Bug in INodeDirectory#computeContentSummary warning
*How to reproduce*
{code}
# create test directories
$ hadoop fs -mkdir /hdfs-1377/A
$ hadoop fs -mkdir /hdfs-1377/B
$ hadoop fs -mkdir /hdfs-1377/C
# ...add some test data (few kB or MB) to all three dirs...
# set space quota for subdir C only
$ hadoop dfsadmin -setSpaceQuota 1g /hdfs-1377/C
# the following two commands _on the parent dir_ trigger the warning
$ hadoop fs -dus /hdfs-1377
$ hadoop fs -count -q /hdfs-1377
{code}
Warning message in the namenode logs:
{code}
2011-06-09 09:42...
{code}
- HDFS-2046.
Major improvement reported by tlipcon and fixed by tlipcon (build, test)
Force entropy to come from non-true random for tests
Same as HADOOP-7335 but for HDFS
- HDFS-2041.
Major bug reported by tlipcon and fixed by tlipcon
Some mtimes and atimes are lost when edit logs are replayed
The refactoring in HDFS-2003 allowed findbugs to expose two potential bugs:
- the atime field logged with OP_MKDIR is unused
- the timestamp field logged with OP_CONCAT_DELETE is unused
The concat issue is definitely real. For MKDIR, the atime might always be identical to the mtime, in which case it could be ignored.
- HDFS-2040.
Minor improvement reported by eli and fixed by eli
Only build libhdfs if a flag is passed
In HDFS-2022 we made ant binary build libhdfs unconditionally. This is a pain for users, who now need to get the native toolchain working to create a tarball to test a change, and it is inconsistent with common and MR (see MAPREDUCE-2559), which only build native code if a flag is passed. Let's revert to the previous behavior of requiring that -Dlibhdfs be passed at build time. We could also create a new ant target that doesn't build the native code; however, restoring the old behavior seems simplest.
- HDFS-2034.
Minor bug reported by johnvijoe and fixed by johnvijoe (hdfs client)
length in getBlockRange becomes -ve when reading only from currently being written blk
This came up during HDFS-1907. Posting an example that Todd posted in HDFS-1907 that brought out this issue.
{quote}
Here's an example sequence to describe what I mean:
1. open file, write one and a half blocks
2. call hflush
3. another reader asks for the first byte of the second block
{quote}
In this case, since the offset is greater than the completed block length, the math in getBlockRange() of DFSInputStream.java will set "length" to a negative value.
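The arithmetic can be illustrated with a small Java sketch. The exact expression used in getBlockRange() is not shown in this note, so the sketch assumes the length for the completed range is derived as min(requested, completedLength - offset); the method names and the numbers are hypothetical.

```java
final class BlockRangeSketch {
  /** Naive computation: goes negative once offset passes the completed length. */
  static long naiveLength(long offset, long requested, long completedLength) {
    return Math.min(requested, completedLength - offset);
  }

  /** Clamped variant: never asks the completed range for fewer than 0 bytes. */
  static long clampedLength(long offset, long requested, long completedLength) {
    return Math.max(0L, naiveLength(offset, requested, completedLength));
  }

  public static void main(String[] args) {
    // Hypothetical numbers: 100 bytes in completed blocks, read starts at offset 120.
    System.out.println(naiveLength(120, 10, 100));   // -20
    System.out.println(clampedLength(120, 10, 100)); // 0
  }
}
```

Clamping at zero is only one possible remedy; the point is that a read starting inside the block being written must not be translated into a negative length for the completed range.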
- HDFS-2030.
Minor bug reported by bharathm and fixed by bharathm
Fix the usability of namenode upgrade command
Fixing the Namenode upgrade option along the same line as Namenode format option.
If clusterid is not given then clusterid will be automatically generated for the upgrade but if clusterid is given then it will be honored.
- HDFS-2029.
Trivial improvement reported by szetszwo and fixed by johnvijoe (test)
Improve TestWriteRead
Let's fix the code style and remove redundant code.
- HDFS-2024.
Trivial improvement reported by cwchung and fixed by cwchung (test)
Eclipse format HDFS Junit test hdfs/TestWriteRead.java
Eclipse format the file src/test/../hdfs/TestWriteRead.java. This is in preparation of HDFS-1968.
So the patch should have only formatting changes such as white space.
- HDFS-2022.
Major bug reported by eli and fixed by eyang (build)
ant binary should build libhdfs
Post HDFS-1963, ant binary fails with the following. The bin-package target is trying to copy from the c++ lib dir, which doesn't exist yet. The binary target should check for the existence of this dir, or it would also be reasonable to depend on compile-c++-libhdfs (since this is the binary target).
{noformat}
/home/eli/src/hdfs4/build.xml:1115: /home/eli/src/hdfs4/build/c++/Linux-amd64-64/lib not found.
{noformat}
- HDFS-2021.
Major bug reported by cwchung and fixed by johnvijoe (data-node)
TestWriteRead failed with inconsistent visible length of a file
The junit test fails when it iterates a number of times with a larger chunk size on Linux. Once in a while, the visible number of bytes seen by a reader is slightly less than what it was supposed to be.
When run with the following parameters, it failed more often on Linux (as reported by John George) than on my Mac:
private static final int WR_NTIMES = 300;
private static final int WR_CHUNK_SIZE = 10000;
Adding more debugging output to the source, this is a sample of the output:
Caused by: java.io....
- HDFS-2020.
Major bug reported by sureshms and fixed by sureshms (data-node, test)
TestDFSUpgradeFromImage fails
Datanode has a singleton datanodeObject. When running MiniDFSCluster with multiple datanodes, the singleton can point to only one of the datanodes. TestDFSUpgradeFromImage fails related to initialization of this singleton.
- HDFS-2019.
Minor bug reported by bharathm and fixed by bharathm (data-node)
Fix all the places where Java method File.list is used with FileUtil.list API
This new method FileUtil.list will throw an exception when disk is bad rather than returning null.
- HDFS-2014.
Critical bug reported by tlipcon and fixed by eyang (scripts)
bin/hdfs no longer works from a source checkout
bin/hdfs now appears to depend on ../libexec, which doesn't exist inside of a source checkout:
todd@todd-w510:~/git/hadoop-hdfs$ ./bin/hdfs namenode
./bin/hdfs: line 22: /home/todd/git/hadoop-hdfs/bin/../libexec/hdfs-config.sh: No such file or directory
./bin/hdfs: line 138: cygpath: command not found
./bin/hdfs: line 161: exec: : not found
- HDFS-2011.
Major bug reported by raviprak and fixed by raviprak (name-node)
Removal and restoration of storage directories on checkpointing failure doesn't work properly
Removal and restoration of storage directories on checkpointing failure doesn't work properly. Sometimes it throws a NullPointerException, and sometimes it doesn't remove a failed storage directory.
- HDFS-2003.
Major improvement reported by ikelly and fixed by ikelly
Separate FSEditLog reading logic from editLog memory state building logic
Currently FSEditLogLoader has code for reading from an InputStream interleaved with code which updates the FSNameSystem and FSDirectory. This makes it difficult to read an edit log without having a whole load of other objects initialised, which is problematic if you want to do things like count how many transactions are in a file.
This patch separates the reading of the stream and the building of the memory state.
- HDFS-1999.
Major bug reported by atm and fixed by atm (test)
Tests use deprecated configs
A few of the HDFS tests (not intended to test deprecation) use config keys which are deprecated.
- HDFS-1998.
Minor bug reported by tanping and fixed by tanping (scripts)
make refresh-namenodes.sh refreshing all namenodes
refresh-namenodes.sh is used to refresh the name nodes in the cluster so that they check for updates to the include/exclude list. It is used when decommissioning or adding a data node. Currently it only refreshes the name node that serves the defaultFs, if a defaultFs is defined. Fix it by refreshing all the name nodes in the cluster.
- HDFS-1996.
Major improvement reported by szetszwo and fixed by eyang (build)
ivy: hdfs test jar should be independent to common test jar
hdfs tests and common tests may require different libraries, e.g. common tests need ftpserver for testing {{FTPFileSystem}} but hdfs does not.
- HDFS-1995.
Minor improvement reported by tanping and fixed by tanping
Minor modification to both dfsclusterhealth and dfshealth pages for Web UI
Four small modifications/fixes:
on dfshealth page:
1) fix remaining% to be remaining / total (it was mistakenly computed as used / total)
on dfsclusterhealth page:
1) makes the table header 8em wide
2) fix the typo (inconsistency) Total Files and Blocks => Total Files and Directories
3) make DFS Used the sum of the block pool used space of every name space, and change the label names accordingly.
- HDFS-1990.
Minor bug reported by ram_krish and fixed by umamaheswararao (data-node)
Resource leaks in HDFS
Possible resource leakage in HDFS.
- HDFS-1986.
Minor bug reported by tanping and fixed by tanping (tools)
Add an option for user to return http or https ports regardless of security is on/off in DFSUtil.getInfoServer()
Currently DFSUtil.getInfoServer returns the http port with security off and the https port with security on. However, we want to return the http port regardless of whether security is on or off, for the Cluster UI to use. Add a third boolean parameter that lets the caller decide whether to check security or not.
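The selection logic can be sketched in Java as below. The class, the method signature and the address strings are hypothetical and do not reproduce the real DFSUtil.getInfoServer signature; the sketch only shows how an extra flag decouples the choice from the security setting.

```java
final class InfoServerSketch {
  /**
   * Picks the https address only when the caller asks for the security
   * setting to be honoured AND security is actually on; otherwise http.
   */
  static String infoServer(String httpAddr, String httpsAddr,
                           boolean securityOn, boolean checkSecurity) {
    return (checkSecurity && securityOn) ? httpsAddr : httpAddr;
  }
}
```

A caller such as the cluster UI would pass false for the last argument so that it always receives the http address.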
- HDFS-1983.
Major test reported by daryn and fixed by daryn (test)
Fix path display for copy & rm
This will also fix a few misc broken tests.
- HDFS-1968.
Minor test reported by cwchung and fixed by cwchung (test)
Enhance TestWriteRead to support File Append and Position Read
Desirable to enhance TestWriteRead to support command line options to do:
(1) File Append
(2) Position Read (currently supporting sequential read).
- HDFS-1966.
Major sub-task reported by szetszwo and fixed by szetszwo (data-node, hdfs client)
Encapsulate individual DataTransferProtocol op header
Added header classes for individual DataTransferProtocol op headers.
- HDFS-1964.
Major bug reported by atm and fixed by atm
Incorrect HTML unescaping in DatanodeJspHelper.java
HDFS-1575 introduced some HTML unescaping of parameters so that viewing a file would work for paths containing HTML-escaped characters, but in two of the places did the unescaping either too early or too late.
- HDFS-1963.
Major new feature reported by eyang and fixed by eyang (build)
HDFS rpm integration project
Create HDFS RPM package
- HDFS-1959.
Minor improvement reported by eli and fixed by eli
Better error message for missing namenode directory
Starting the namenode with a missing NN directory currently results in two stack traces, "Expecting a line not the end of stream" from DF and an NPE. Let's make this more user-friendly.
- HDFS-1958.
Major improvement reported by tlipcon and fixed by tlipcon (name-node)
Format confirmation prompt should be more lenient of its input
As reported on the mailing list, the namenode format prompt only accepts 'Y'. We should also accept 'y' and 'yes' (case-insensitive).
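A minimal Java sketch of a more lenient check. The helper name isYes is hypothetical, and trimming surrounding whitespace is an extra assumption not stated above.

```java
import java.util.Locale;

final class ConfirmSketch {
  /** Accepts "y" or "yes" in any letter case; everything else is a "no". */
  static boolean isYes(String input) {
    if (input == null) {
      return false; // e.g. end of input stream
    }
    String s = input.trim().toLowerCase(Locale.ROOT);
    return s.equals("y") || s.equals("yes");
  }
}
```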
- HDFS-1955.
Major bug reported by mattf and fixed by mattf (name-node)
FSImage.doUpgrade() was made too fault-tolerant by HDFS-1826
Prior to HDFS-1826, doUpgrade() would fail if any of the storage directories failed to successfully write the new fsimage or edits files.
Now it appears to "succeed" even if some or all of the individual directories fail.
There is some discussion about whether doUpgrade() should have some fault tolerance, but for now make it fail on any single storage directory failure, as before.
- HDFS-1953.
Minor bug reported by tanping and fixed by tanping
Change name node mxbean name in cluster web console
The name node mxbean name changed after the new metrics framework was checked in. ClusterJspHelper.java needs to be updated for the cluster web console to work again.
- HDFS-1952.
Major bug reported by mattf and fixed by azuriel
FSEditLog.open() appears to succeed even if all EDITS directories fail
FSEditLog.open() appears to "succeed" even if all of the individual directories failed to allow creation of an EditLogOutputStream. The problem and solution are essentially similar to that of HDFS-1505.
- HDFS-1945.
Major sub-task reported by szetszwo and fixed by szetszwo (data-node, hdfs client)
Removed deprecated fields in DataTransferProtocol
Removed the deprecated fields in DataTransferProtocol.
- HDFS-1943.
Blocker bug reported by weiyj and fixed by weiyj (scripts)
fail to start datanode while start-dfs.sh is executed by root user
When start-dfs.sh is run by root user, we got the following error message:
# start-dfs.sh
Starting namenodes on [localhost ]
localhost: namenode running as process 2556. Stop it first.
localhost: starting datanode, logging to /usr/hadoop/hadoop-common-0.23.0-SNAPSHOT/bin/../logs/hadoop-root-datanode-cspf01.out
localhost: Unrecognized option: -jvm
localhost: Could not create the Java virtual machine.
The -jvm options should be passed to jsvc when starting a secure
datanode, but it still pa...
- HDFS-1939.
Major improvement reported by szetszwo and fixed by eyang (build)
ivy: test conf should not extend common conf
* Removed duplicated jars in test class path.
- HDFS-1938.
Minor bug reported by szetszwo and fixed by eyang (build)
Reference ivy-hdfs.classpath not found.
{noformat}
$ant test-system
...
BUILD FAILED
/export/crawlspace/tsz/hdfs/h1/src/test/aop/build/aop.xml:129: The following error occurred while executing this line:
/export/crawlspace/tsz/hdfs/h1/src/test/aop/build/aop.xml:183: The following error occurred while executing this line:
/export/crawlspace/tsz/hdfs/h1/src/test/aop/build/aop.xml:193: The following error occurred while executing this line:
/export/crawlspace/tsz/hdfs/h1/build.xml:449: Reference ivy-hdfs.classpath not found.
{noformat}
- HDFS-1936.
Blocker bug reported by sureshms and fixed by sureshms (name-node)
Updating the layout version from HDFS-1822 causes upgrade problems.
In HDFS-1822 and HDFS-1842, the layout versions for 203, 204, 22 and trunk were changed. Some of the namenode logic that depends on layout version is broken because of this. Read the comment for more description.
- HDFS-1934.
Major bug reported by bharathm and fixed by bharathm
Fix NullPointerException when File.listFiles() API returns null
While testing Disk Fail Inplace, we encountered an NPE from this part of the code:
File[] files = dir.listFiles();
for (File f : files) {
...
}
This is kind of an API issue. When a disk is bad (or the name is not a directory), these APIs (listFiles, list) return null rather than throwing an exception, so this 'for' loop throws an NPE. The same applies to the dir.list() API.
Fix all the places where the null condition was not checked.
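One way to centralise the null check is a small wrapper, sketched below in Java. The class and method names are hypothetical and this is not the FileUtil API; it only shows the pattern of converting the null return into an IOException.

```java
import java.io.File;
import java.io.IOException;

final class SafeListSketch {
  /**
   * Like File.listFiles(), but throws instead of returning null when the
   * path is not a directory or an I/O error occurs.
   */
  static File[] listFilesOrThrow(File dir) throws IOException {
    File[] files = dir.listFiles();
    if (files == null) {
      throw new IOException("Invalid directory or I/O error for: " + dir);
    }
    return files;
  }
}
```

Callers can then iterate over the result directly, and a bad disk surfaces as a checked exception rather than an NPE inside the loop.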
- HDFS-1933.
Major test reported by daryn and fixed by daryn (test)
Update tests for FsShell's "test"
Fix tests broken by refactoring.
- HDFS-1931.
Major test reported by daryn and fixed by daryn
Update tests for du/dus/df
- HDFS-1928.
Major test reported by daryn and fixed by daryn (test)
Fix path display for touchz
Fix expected text in TestHDFSCLI
- HDFS-1927.
Major bug reported by johnvijoe and fixed by johnvijoe (name-node)
audit logs could ignore certain xsactions and also could contain "ip=null"
Namenode audit logs could be ignoring certain transactions that are successfully completed. This is because the namenode checks whether the RemoteIP is null to decide if a transaction is remote or not. In certain cases, RemoteIP could return null even though the transaction is still "remote". An example is a client that gets killed in the middle of the transaction.
- HDFS-1923.
Major sub-task reported by mattf and fixed by szetszwo (test)
Intermittent recurring failure in TestFiDataTransferProtocol2.pipeline_Fi_29
- HDFS-1922.
Major sub-task reported by mattf and fixed by vicaya (test)
Recurring failure in TestJMXGet.testNameNode since build 477 on May 11
- HDFS-1921.
Blocker bug reported by atm and fixed by mattf
Save namespace can cause NN to be unable to come up on restart
I discovered this in the course of trying to implement a fix for HDFS-1505.
Per the comment for {{FSImage.saveNamespace(...)}}, the algorithm for save namespace proceeds in the following order:
# rename current to lastcheckpoint.tmp for all of them,
# save image and recreate edits for all of them,
# rename lastcheckpoint.tmp to previous.checkpoint.
The problem is that step 3 occurs regardless of whether or not an error occurs for all storage directories in step 2. Upon restart, the NN will...
- HDFS-1920.
Major bug reported by scurrilous and fixed by scurrilous (libhdfs)
libhdfs does not build for ARM processors
$ ant compile -Dcompile.native=true -Dcompile.c++=1 -Dlibhdfs=1 -Dfusedfs=1
...
create-libhdfs-configure:
...
[exec] configure: error: Unsupported CPU architecture "armv7l"
Once the CPU arch check is fixed in src/c++/libhdfs/m4/apsupport.m4, then next issue is -m32:
$ ant compile -Dcompile.native=true -Dcompile.c++=1 -Dlibhdfs=1 -Dfusedfs=1
...
compile-c++-libhdfs:
[exec] /bin/bash ./libtool --tag=CC --mode=compile gcc -DPACKAGE_NAME=\"libhdfs\" -DPACKAGE_TARNAME=\"libhdfs\" -D...
- HDFS-1917.
Major bug reported by eyang and fixed by eyang (build)
Clean up duplication of dependent jar files
Remove packaging of duplicated third party jar files
- HDFS-1914.
Major bug reported by sureshms and fixed by sureshms (name-node)
Federation: namenode storage directory must be configurable specific to a namenode
Federation allows a common configuration in which namenode-specific settings live in the same configuration file, suffixed by nameservice ID. When namenodes use an external storage directory (NFS), the storage directory configuration must also allow namenode-specific settings, using the nameservice ID, so that namenodes can use different directories on the external server.
- HDFS-1912.
Major test reported by daryn and fixed by daryn (test)
Update tests for FsShell standardized error messages
Need to update the FsShell based tests for commonized error messages.
- HDFS-1911.
Major test reported by sanjay.radia and fixed by sanjay.radia
HDFS tests for viewfs
- HDFS-1908.
Minor bug reported by szetszwo and fixed by szetszwo (test)
DataTransferTestUtil$CountdownDoosAction.run(..) throws NullPointerException
In [build #426|https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/426//testReport/org.apache.hadoop.hdfs.server.datanode/TestFiDataTransferProtocol2/pipeline_Fi_17/],
{noformat}
2011-04-28 07:20:10,559 ERROR datanode.DataNode (DataXceiver.java:run(133)) - DatanodeRegistration(127.0.0.1:50589,
storageID=DS-499221794-127.0.1.1-50589-1303975177998, infoPort=58607, ipcPort=52786):DataXceiver
java.lang.NullPointerException
at org.apache.hadoop.fi.DataTransferTestUtil$CountdownDoosAction.r...
{noformat}
- HDFS-1907.
Major bug reported by cwchung and fixed by johnvijoe (hdfs client)
BlockMissingException upon concurrent read and write: reader was doing file position read while writer is doing write without hflush
BlockMissingException is thrown under this test scenario:
Two different processes do concurrent file r/w: one reads and the other writes on the same file
- the writer keeps writing to the file
- the reader does position file reads from the beginning of the file to the visible end of file, repeatedly
The reader is basically doing:
byteRead = in.read(currentPosition, buffer, 0, byteToReadThisRound);
where currentPosition = 0, buffer is a byte array buffer, and byteToReadThisRound = 1024*10000.
Usually it doe...
- HDFS-1906.
Minor improvement reported by sureshms and fixed by sureshms (hdfs client)
Remove logging exception stack trace when one of the datanode targets to read from is not reachable
When client fails to connect to one of the datanodes from the list of block locations returned, exception stack trace is printed in the client log. This is an expected failure scenario that is handled at the client, by going to the next location. Printing entire stack trace is unnecessary and just printing the exception message should be sufficient.
- HDFS-1905.
Minor bug reported by bharathm and fixed by bharathm (name-node)
Improve the usability of namenode -format
While setting up a 0.23-based cluster, I ran into this issue. When I issue the namenode format command, whose usage changed in 0.23, it should tell the user how to use the command when the required options are not all specified.
./hdfs namenode -format
I get the following error message, but it is still not clear how the user should use this command.
11/05/09 15:36:25 ERROR namenode.NameNode: java.lang.IllegalArgumentException: Format must be provided with clusterid
at org.apache.hado...
- HDFS-1903.
Major test reported by daryn and fixed by daryn (test)
Fix path display for rm/rmr
- HDFS-1902.
Major test reported by daryn and fixed by daryn (test)
Fix path display for setrep
See HDFS-1901.
- HDFS-1899.
Major improvement reported by tlipcon and fixed by yuzhihong@gmail.com
GenericTestUtils.formatNamenode is misplaced
This function belongs in DFSTestUtil, the standard place for putting cluster-related utils.
- HDFS-1898.
Critical bug reported by tlipcon and fixed by tlipcon
Tests failing on trunk due to use of NameNode.format
After federation merge, NameNode.format no longer works on trunk. Unclear why these tests aren't failing on Hudson, but some seem to fail for me in my checkout (including TestEditLogFileOutputStream for example)
- HDFS-1890.
Minor improvement reported by szetszwo and fixed by szetszwo (hdfs client)
A few improvements on the LeaseRenewer.pendingCreates map
- The class is better to be just a {{Map}} instead of a {{SortedMap}}.
- The value type is better to be {{DFSOutputStream}} instead of {{OutputStream}}.
- The variable name is better to be filesBeingWritten instead of pendingCreates since we have append.
- HDFS-1889.
Major bug reported by johnvijoe and fixed by johnvijoe
incorrect path in start/stop dfs script
HADOOP_HOME in start-dfs.sh and stop-dfs.sh should be changed to HADOOP_HDFS_HOME because the hdfs script is in the hdfs directory and not the common directory.
- HDFS-1888.
Major bug reported by sureshms and fixed by sureshms
MiniDFSCluster#corruptBlockOnDatanodes() access must be public for MapReduce contrib raid
During the HDFS-1052 code merge, the method was made package private. It needs to be public for access in MapReduce contrib raid.
- HDFS-1884.
Major sub-task reported by mattf and fixed by atm (test)
Improve TestDFSStorageStateRecovery
- HDFS-1883.
Major sub-task reported by mattf and fixed by (test)
Recurring failures in TestBackupNode since HDFS-1052
- HDFS-1881.
Major bug reported by tanping and fixed by tanping (data-node)
Federation: after taking snapshot the current directory of datanode is empty
After taking a snapshot in Federation (by starting up namenode with option -upgrade), it appears that the current directory of data node does not contain the block files. We have also verified that upgrading from 20 to Federation does not have this problem.
- HDFS-1877.
Minor test reported by cwchung and fixed by cwchung (test)
Create a functional test for file read/write
It would be great to have a tool, running on a real grid, to perform functional tests (and, to a certain extent, stress tests) of the file operations. The tool would be written in Java and make HDFS API calls to read, write, append and hflush hadoop files. The tool would be usable standalone, or as a building block for other regression or stress test suites (written in shell, perl, python, etc).
- HDFS-1876.
Blocker bug reported by tlipcon and fixed by tlipcon
One MiniDFSCluster ignores numDataNodes parameter
After the federation merge, one of the MiniDFSCluster constructors ignores its numDataNodes argument, thus causing TestFileInputFormat to fail (MAPREDUCE-2466)
- HDFS-1875.
Major bug reported by eepayne and fixed by eepayne (test)
MiniDFSCluster hard-codes dfs.datanode.address to localhost
When creating RPC addresses that represent the communication sockets for each simulated DataNode, the MiniDFSCluster class hard-codes the address of the dfs.datanode.address port to be "127.0.0.1:0"
The DataNodeCluster test tool uses the MiniDFSCluster class to create a selected number of simulated datanodes on a single host. In the DataNodeCluster setup, the NameNode is not simulated but is started as a separate daemon.
The problem is that if the write requests into the simulated datanode...
- HDFS-1873.
Major new feature reported by tanping and fixed by tanping
Federation Cluster Management Web Console
The Federation cluster management console provides
# Cluster summary information that shows overall cluster utilization, and a list of the name nodes that reports the used space, files and directories, blocks, and live and dead datanodes of each name space.
# Decommissioning status of all the data nodes that are being decommissioned or have been decommissioned.
- HDFS-1871.
Major bug reported by sureshms and fixed by sureshms (test)
Tests using MiniDFSCluster fail to compile due to HDFS-1052 changes
MiniDFSCluster public method signature changes from HDFS-1052 breaks build of mapreduce tests.
- HDFS-1870.
Minor improvement reported by szetszwo and fixed by szetszwo (hdfs client)
Refactor DFSClient.LeaseChecker
Two simple changes:
- move the inner class {{LeaseChecker}} from {{DFSClient}} to a separate class;
- rename {{LeaseChecker}} to {{LeaseRenewer}} since it actually renews leases rather than checking them.
- HDFS-1869.
Major bug reported by daryn and fixed by daryn (name-node)
mkdirs should use the supplied permission for all of the created directories
A multi-level mkdir is now POSIX compliant. Instead of creating intermediate directories with the permissions of the parent directory, intermediate directories are created with permission bits of rwxrwxrwx (0777) as modified by the current umask, plus write and search permission for the owner.
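As a rough illustration of the rule above, the permission of an intermediate directory can be computed from the umask alone. This is a minimal sketch and not the NameNode code; the class and method names are hypothetical.

```java
public class MkdirPermissionSketch {
    /**
     * Permission for an intermediate directory: rwxrwxrwx (0777) as modified
     * by the umask, plus write and search permission for the owner (0300).
     */
    static int intermediatePermission(int umask) {
        return (0777 & ~umask) | 0300;
    }

    public static void main(String[] args) {
        // A typical umask leaves the intermediate directories at rwxr-xr-x.
        System.out.println(Integer.toOctalString(intermediatePermission(0022))); // 755
        // Even a umask that strips owner write still lets the owner descend and create children.
        System.out.println(Integer.toOctalString(intermediatePermission(0277))); // 700
    }
}
```

The extra 0300 bits are what allow the remaining path components to be created inside each new intermediate directory.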
- HDFS-1865.
Major improvement reported by szetszwo and fixed by szetszwo (hdfs client)
Share LeaseChecker thread among DFSClients
Each {{DFSClient}} runs a {{LeaseChecker}} thread within a JVM. The number of threads could be reduced by sharing the threads.
- HDFS-1862.
Major test reported by atm and fixed by atm (test)
Improve test reliability of HDFS-1594
One of the tests I wrote in HDFS-1594 seems to be flaky in Hudson runs, despite passing reliably on my box. This JIRA is to track improving the reliability of this test.
- HDFS-1861.
Major improvement reported by eli and fixed by eli (data-node)
Rename dfs.datanode.max.xcievers and bump its default value
Reasonably sized jobs and HBase easily exhaust the current default for dfs.datanode.max.xcievers. 4096 works better in practice.
Let's also deprecate it in favor of a more intuitive name, eg dfs.datanode.max.receiver.threads.
- HDFS-1856.
Major sub-task reported by mattf and fixed by mattf (test)
TestDatanodeBlockScanner waits forever, errs without giving information
- HDFS-1855.
Major test reported by mattf and fixed by mattf (test)
TestDatanodeBlockScanner.testBlockCorruptionRecoveryPolicy() part 2 fails in two different ways
The second part of test case TestDatanodeBlockScanner.testBlockCorruptionRecoveryPolicy(), "corrupt replica recovery for two corrupt replicas", always fails, half the time with a checksum error upon block replication, and half the time by timing out upon failure to delete the second corrupt replica.
- HDFS-1854.
Major sub-task reported by mattf and fixed by mattf (test)
make failure message more useful in DFSTestUtil.waitReplication()
- HDFS-1846.
Major improvement reported by atm and fixed by atm (name-node)
Don't fill preallocated portion of edits log with 0x00
HADOOP-2330 added a feature to preallocate space in the local file system for the NN transaction log. That change seeks past the current end of the file and writes out some data, which on most systems results in the intervening data in the file being filled with zeros. Most underlying file systems have special handling for sparse files, and don't actually allocate blocks on disk for blocks of a file which consist completely of 0x00.
I've seen cases in the wild where the volume an edits dir i...
- HDFS-1845.
Major bug reported by johnvijoe and fixed by johnvijoe
symlink comes up as directory after namenode restart
When a symlink is first created, it gets added to EditLogs. When the namenode is restarted, it reads from this editlog, represents the symlink correctly, and saves this information to its image. If the namenode is restarted again, it reads from this FSImage, but thinks that the symlink is a directory. This is because it uses "Block[] blocks" to determine if an INode is a directory, a file, or a symlink. Since both a directory and a symlink have blocks as null, it thinks that a symlink is a directory.
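The ambiguity can be reproduced in a few lines. The sketch below is illustrative only: the enum, the method names, and the idea of consulting a stored symlink target are assumptions made for the example, not the actual FSImage loading code or the committed fix.

```java
public class INodeKindSketch {
    enum Kind { FILE, DIRECTORY, SYMLINK }

    /** The flawed test: a null block list is taken to mean "directory". */
    static Kind classifyByBlocksOnly(long[] blocks) {
        return blocks == null ? Kind.DIRECTORY : Kind.FILE;
    }

    /** One possible disambiguation (assumed): also look at the symlink target. */
    static Kind classify(long[] blocks, String symlinkTarget) {
        if (blocks != null) {
            return Kind.FILE;
        }
        boolean hasTarget = symlinkTarget != null && !symlinkTarget.isEmpty();
        return hasTarget ? Kind.SYMLINK : Kind.DIRECTORY;
    }

    public static void main(String[] args) {
        // A symlink has no blocks, so the blocks-only test misreports it.
        System.out.println(classifyByBlocksOnly(null));
        System.out.println(classify(null, "/user/username/filetest"));
    }
}
```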
- HDFS-1844.
Major test reported by daryn and fixed by daryn (test)
Move -fs usage tests from hdfs into common
The -fs usage tests are in hdfs which causes an unnecessary synchronization of a common & hdfs bug when changing the text. The usages have no ties to hdfs, so they should be moved into common.
- HDFS-1843.
Minor improvement reported by bharathm and fixed by bharathm
Discover file not found early for file append
I have committed this. Thanks to Bharath!
- HDFS-1840.
Major improvement reported by szetszwo and fixed by szetszwo (hdfs client)
Terminate LeaseChecker when all writing files are closed.
In {{DFSClient}}, when there are files opened for write, a {{LeaseChecker}} thread is started for updating the leases periodically. However, it never terminates when all writing files are closed.
- HDFS-1835.
Major bug reported by johnyoh and fixed by johnyoh (data-node)
DataNode.setNewStorageID pulls entropy from /dev/random
DataNode.setNewStorageID uses SecureRandom.getInstance("SHA1PRNG") which always pulls fresh entropy.
It wouldn't be so bad if this were only the 120 bits needed by sha1, but the default impl of SecureRandom actually uses a BufferedInputStream around /dev/random and pulls 1024 bits of entropy for this one call.
If you are on a system without much entropy coming in, this call can block and block others.
Can we just change this to use "new SecureRandom().nextInt(Integer.MAX_VALUE)" instead?
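The suggested replacement can be sketched as follows. The class and method names are hypothetical, and where the default {{SecureRandom}} obtains its seed is platform- and configuration-dependent, so this only shows the shape of the call, not a guarantee that it never blocks.

```java
import java.security.SecureRandom;

public class StorageIdSketch {
    /** Returns a non-negative random int in [0, Integer.MAX_VALUE). */
    static int newRandomId(SecureRandom rand) {
        return rand.nextInt(Integer.MAX_VALUE);
    }

    public static void main(String[] args) {
        // Default-constructed generator instead of SecureRandom.getInstance("SHA1PRNG").
        SecureRandom rand = new SecureRandom();
        System.out.println("random id component: " + newRandomId(rand));
    }
}
```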
- HDFS-1833.
Minor improvement reported by szetszwo and fixed by szetszwo (data-node)
Refactor BlockReceiver
There is repeated code for creating log/error messages in BlockReceiver. Also, some comments in the code are incorrect, e.g.
{code}
private int numTargets; // number of downstream datanodes including myself
{code}
but the count actually excludes the current datanode.
- HDFS-1831.
Major improvement reported by sureshms and fixed by sureshms (name-node)
HDFS equivalent of HADOOP-7223 changes to handle FileContext createFlag combinations
During file creation with FileContext, the expected behavior is not clearly defined for combinations of createFlag values in the EnumSet. This is the HDFS equivalent of HADOOP-7223.
- HDFS-1829.
Major bug reported by mattf and fixed by mattf (name-node)
TestNodeCount waits forever, errs without giving information
In three locations in the code, TestNodeCount waits forever on a condition. Failures result in Hudson/Jenkins "Timeout occurred" error message with no information about where or why. Need to replace with TimeoutExceptions that throw a stack trace and useful info about the failure mode.
Also investigate possible cause of failure.
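A bounded wait of the kind described above might look like the following. This is a minimal sketch with hypothetical names, written against a newer Java version than this release targeted (it uses a lambda-friendly {{BooleanSupplier}}); it is not the helper that was committed.

```java
import java.util.concurrent.TimeoutException;
import java.util.function.BooleanSupplier;

public class BoundedWaitSketch {
    /**
     * Polls the condition until it holds, or throws a TimeoutException that
     * names what was being waited for instead of hanging forever.
     */
    static void waitFor(BooleanSupplier condition, long timeoutMs, long pollMs, String what)
            throws TimeoutException, InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (!condition.getAsBoolean()) {
            if (System.currentTimeMillis() >= deadline) {
                throw new TimeoutException(
                    "Timed out after " + timeoutMs + " ms waiting for " + what);
            }
            Thread.sleep(pollMs);
        }
    }

    public static void main(String[] args) throws Exception {
        long start = System.currentTimeMillis();
        // Becomes true after roughly 50 ms, well inside the 1 s budget.
        waitFor(() -> System.currentTimeMillis() - start > 50, 1000, 10, "50 ms to elapse");
        System.out.println("condition met");
    }
}
```

Because the exception is thrown from the waiting thread, the test report carries a stack trace and a message that says which condition never became true.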
- HDFS-1828.
Major sub-task reported by mattf and fixed by mattf (name-node)
TestBlocksWithNotEnoughRacks intermittently fails assert
In server.namenode.TestBlocksWithNotEnoughRacks.testSufficientlyReplicatedBlocksWithNotEnoughRacks, the assert curReplicas == REPLICATION_FACTOR fails. It seems that the replica count should go higher initially, and if the test doesn't wait for it to go back down, the test fails with a false positive.
- HDFS-1827.
Major bug reported by mattf and fixed by mattf (name-node)
TestBlockReplacement waits forever, errs without giving information
In method checkBlocks(), TestBlockReplacement waits forever on a condition. Failures result in Hudson/Jenkins "Timeout occurred" error message with no information about where or why. Need to replace with TimeoutException that throws a stack trace and useful info about the failure mode.
Also investigate possible cause of failure.
- HDFS-1826.
Major sub-task reported by hairong and fixed by mattf (name-node)
NameNode should save image to name directories in parallel during upgrade
I've committed this. Thanks, Matt!
- HDFS-1823.
Blocker bug reported by tomwhite and fixed by tomwhite (scripts)
start-dfs.sh script fails if HADOOP_HOME is not set
HDFS portion of HADOOP-6953
- HDFS-1822.
Blocker bug reported by sureshms and fixed by sureshms (name-node)
Editlog opcodes overlap between 20 security and later releases
The same opcodes are used for different operations between 0.20.security, 0.22 and 0.23. This results in failure to load editlogs on later releases, especially during upgrades.
- HDFS-1821.
Major bug reported by johnvijoe and fixed by johnvijoe
FileContext.createSymlink with kerberos enabled sets wrong owner
TEST SETUP
Using the attached sample HDFS Java program that illustrates the issue.
Using a cluster with Kerberos enabled.
# Compile instructions
$ javac Symlink.java -cp `hadoop classpath`
$ jar -cfe Symlink.jar Symlink Symlink.class
# create test file for symlink to use
1. hadoop fs -touchz /user/username/filetest
# Create symlink using file context
2. hadoop jar Symlink.jar ln /user/username/filetest /user/username/testsymlink
# Verify owner of test file
3. hadoop jar Symlink.jar ls...
- HDFS-1818.
Major bug reported by atm and fixed by atm (test)
TestHDFSCLI is failing on trunk
The commit of HADOOP-7202 changed the output of a few FsShell commands. Since HDFS tests rely on the precise format of this output, TestHDFSCLI is now failing.
- HDFS-1817.
Trivial improvement reported by szetszwo and fixed by szetszwo (test)
Split TestFiDataTransferProtocol.java into two files
{{TestFiDataTransferProtocol}} has tests from pipeline_Fi_01 to _16 and pipeline_Fi_39 to _51. It is natural to split them into two files.
- HDFS-1814.
Major new feature reported by atm and fixed by atm (hdfs client, name-node)
HDFS portion of HADOOP-7214 - Hadoop /usr/bin/groups equivalent
Introduces a new command, "hdfs groups", which displays what groups are associated with a user as seen by the NameNode.
- HDFS-1812.
Minor bug reported by umamaheswararao and fixed by umamaheswararao (test)
Address the cleanup issues in TestHDFSCLI.java
- HDFS-1808.
Major bug reported by mattf and fixed by mattf (data-node, name-node)
TestBalancer waits forever, errs without giving information
In three locations in the code, TestBalancer waits forever on a condition. Failures result in Hudson/Jenkins "Timeout occurred" error message with no information about where or why. Need to replace with TimeoutExceptions that throw a stack trace and useful info about the failure mode.
In waitForHeartBeat(), it is waiting on an exact match for a value that may be coarsely quantized -- i.e., significant deviation from the exact "expected" result may occur. Replace with an allowed range of r...
- HDFS-1806.
Major bug reported by mattf and fixed by mattf (data-node, name-node)
TestBlockReport.blockReport_08() and _09() are timing-dependent and likely to fail on fast servers
Method waitForTempReplica() polls every 100ms during block replication, attempting to "catch" a datanode in the state of having a TEMPORARY replica. But examination of a current Hudson test failure log shows that the replica goes from "start" to "TEMPORARY" to "FINALIZED" in only 50ms, so of course the poll usually misses it.
- HDFS-1797.
Major bug reported by tlipcon and fixed by tlipcon
New findbugs warning introduced by HDFS-1120
HDFS-1120 introduced a new findbugs warning:
Unread field: org.apache.hadoop.hdfs.server.datanode.FSDataset$FSVolumeSet.curVolume
This JIRA is to fix the simple error.
- HDFS-1789.
Minor improvement reported by szetszwo and fixed by szetszwo (data-node, hdfs client)
Refactor frequently used codes from DFSOutputStream, BlockReceiver and DataXceiver
Below is frequently used code:
- Check block tokens in {{DataXceiver}}
- Log/error messages in {{BlockReceiver}}
In addition, I will refactor {{DFSOutputStream}} for testing:
- create socket for pipeline
- HDFS-1786.
Minor bug reported by szetszwo and fixed by umamaheswararao (test)
Some cli test cases expect a "null" message
There are a few test cases specified in {{TestHDFSCLI}} and {{TestDFSShell}} expecting "null" messages.
e.g. in {{testHDFSConf.xml}},
{code}
<expected-output>get: null</expected-output>
{code}
- HDFS-1785.
Major improvement reported by szetszwo and fixed by szetszwo (data-node)
Cleanup BlockReceiver and DataXceiver
{{clientName.length()}} is used multiple times for determining whether the source is a client or a datanode.
{code}
if (clientName.length() == 0) {
//it is a datanode
}
{code}
- HDFS-1782.
Major bug reported by johnvijoe and fixed by johnvijoe (name-node)
FSNamesystem.startFileInternal(..) throws NullPointerException
I'm observing that when one balancer is running, trying to run another one results in a "Java.lang.NullPointerException" error. I was hoping to see the message "Another balancer is running. Exiting.... Exiting ...". This is a reproducible issue.
Details
========
1) Cluster ->elrond
[hdfs@]$ hadoop version
2) Run first balancer
[hdfs]$ hdfs balancer
1
through XX.XX.XX.XX:1004 is succeeded.
[hdfs@]$ hdfs balancer
11/03/09 16:34:32 INFO balancer.Balancer: namenodes =
java.io.IOException: java.l...
- HDFS-1781.
Major bug reported by johnvijoe and fixed by johnvijoe (scripts)
jsvc executable delivered into wrong package...
The jsvc executable is delivered in the 0.22 hdfs package, but the script that uses it (bin/hdfs) refers to $HADOOP_HOME/bin/jsvc to find it.
- HDFS-1776.
Major bug reported by dms and fixed by bharathm
Bug in Concat code
There is a bug in the concat code. Specifically: in INodeFile.appendBlocks() we need to first reassign the blocks list and then go through it and update the INode pointer. Otherwise we are not updating the inode pointer on all of the new blocks in the file.
- HDFS-1774.
Minor improvement reported by umamaheswararao and fixed by umamaheswararao (data-node)
Small optimization to FSDataset
The constructor of the inner class FSDir iterates twice over the files listed in the passed directory. We can reduce this to a single loop and also avoid the isDirectory check, which performs some native invocations.
Consider a case: one directory has only one child directory and 10000 files.
1) The first loop gets the number of child directories.
2) The condition if (numChildren > 0) is then satisfied, so the constructor iterates 10001 times again and also checks isDirectory on each entry.
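The single-pass idea can be sketched as below. This is an illustration under stated assumptions, not the FSDataset code: it works on a plain array of entry names, and the "subdir" name prefix used to recognize child directories without calling isDirectory is an assumed naming convention.

```java
import java.util.ArrayList;
import java.util.List;

public class SinglePassListingSketch {
    // Assumed naming convention for child directories (hypothetical here).
    static final String SUBDIR_PREFIX = "subdir";

    /** Collects child directory names in one pass, by name only (no isDirectory call). */
    static List<String> childDirs(String[] names) {
        List<String> dirs = new ArrayList<>();
        for (String name : names) {
            if (name.startsWith(SUBDIR_PREFIX)) {
                dirs.add(name);
            }
        }
        return dirs;
    }

    public static void main(String[] args) {
        String[] listing = { "subdir0", "blk_1", "blk_1_1001.meta", "subdir12" };
        System.out.println(childDirs(listing)); // [subdir0, subdir12]
    }
}
```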
- HDFS-1773.
Minor improvement reported by tanping and fixed by tanping (name-node)
Remove a datanode from cluster if include list is not empty and this datanode is removed from both include and exclude lists
Our service engineering team, which operates the clusters on a daily basis, finds it confusing that after a data node is decommissioned, there is no way to make the cluster forget about this data node and it always remains in the dead node list.
- HDFS-1770.
Minor test reported by eli and fixed by eli
TestFiRename fails due to invalid block size
HDFS-1763 exposed a bug in TestFiRename or HDFS (see HADOOP-70800) which fails due to the following:
{quote}
Internal error: default blockSize is not a multiple of default bytesPerChecksum
java.io.IOException: Internal error: default blockSize is not a multiple of default bytesPerChecksum
{quote}
Previously this test passed because it used dfs.block.size (instead of dfs.blocksize), though the behavior should be equivalent since one deprecates the other.
- HDFS-1767.
Major sub-task reported by mattf and fixed by mattf (data-node)
Namenode should ignore non-initial block reports from datanodes when in safemode during startup
Consider a large cluster that takes 40 minutes to start up. The datanodes compete to register and send their Initial Block Reports (IBRs) as fast as they can after startup (subject to a small sub-two-minute random delay, which isn't relevant to this discussion).
As each datanode succeeds in sending its IBR, it schedules the starting time for its regular cycle of reports, every hour (or other configured value of dfs.blockreport.intervalMsec). In order to spread the reports evenly across th...
- HDFS-1763.
Minor improvement reported by eli and fixed by eli
Replace hard-coded option strings with variables from DFSConfigKeys
There are some places in the code where we use hard-coded strings instead of the equivalent DFSConfigKeys define, and a couple places where the default is defined multiple places (once in DFSConfigKeys and once elsewhere, though both have the same value). This is error-prone, and also a pain in that it prevents eclipse from easily showing you all the places where a particular config option is used. Let's replace all the uses of the hard-coded option strings with uses of the corresponding vari...
- HDFS-1761.
Major sub-task reported by szetszwo and fixed by szetszwo (data-node)
Add a new DataTransferProtocol operation, Op.TRANSFER_BLOCK, instead of using RPC
Add a new DataTransferProtocol operation, Op.TRANSFER_BLOCK, for transferring RBW/Finalized with acknowledgement and without using RPC.
- HDFS-1760.
Major bug reported by daryn and fixed by daryn (name-node)
problems with getFullPathName
FSDirectory's getFullPathName method is flawed. Given a list of inodes, it starts at index 1 instead of 0 (based on the assumption that inode[0] is always the root inode) and then builds the string with "/"+inode[i]. This means the empty string is returned for the root, or when requesting the full path of the parent dir for top level items.
In addition, it's not guaranteed that the list of inodes starts with the root inode. The inode lookup routine will only fill the inode array with the ...
- HDFS-1757.
Major improvement reported by eli and fixed by eli (contrib/fuse-dfs)
Don't compile fuse-dfs by default
The infra machines don't have fuse headers, therefore we shouldn't compile fuse-dfs by default.
- HDFS-1751.
Major new feature reported by daryn and fixed by daryn (data-node)
Intrinsic limits for HDFS files, directories
Enforce a configurable limit on:
- the length of a path component
- the number of names in a directory
The intention is to prevent a too-long name or a too-full directory. This is not about RPC buffers, the length of command lines, etc. There may be good reasons for those kinds of limits, but that is not the intended scope of this feature. Consequently, a reasonable implementation might be to extend the existing quota checker so that it faults the creation of a name that violates the limits....
- HDFS-1750.
Major bug reported by szetszwo and fixed by szetszwo
fs -ls hftp://file not working
{noformat}
hadoop dfs -touchz /tmp/file1 # create file. OK
hadoop dfs -ls /tmp/file1 # OK
hadoop dfs -ls hftp://namenode:50070/tmp/file1 # FAILED: not seeing the file
{noformat}
- HDFS-1748.
Major bug reported by szetszwo and fixed by szetszwo (balancer)
Balancer utilization classification is incomplete
{code}
//Balancer.java
/* Return true if the given datanode is overUtilized */
private boolean isOverUtilized(BalancerDatanode datanode) {
return datanode.utilization > (avgUtilization+threshold);
}
/* Return true if the given datanode is above average utilized
* but not overUtilized */
private boolean isAboveAvgUtilized(BalancerDatanode datanode) {
return (datanode.utilization <= (avgUtilization+threshold))
&& (datanode.utilization > avgUtilization);
}
...
- HDFS-1741.
Major improvement reported by cos and fixed by cos (build)
Provide a minimal pom file to allow integration of HDFS into Sonar analysis
In order to use the Sonar facility, a project has to either be built by Maven or have a special pom 'wrapper'. Let's provide a minimal one to allow just that.
- HDFS-1739.
Minor improvement reported by umamaheswararao and fixed by umamaheswararao (data-node)
When DataNode throws DiskOutOfSpaceException, it will be helpfull to the user if we log the available volume size and configured block size.
DataNode will throw DiskOutOfSpaceException for a new block write if the available volume size is less than the configured block size.
So, it will be helpful to the user if we log these details.
- HDFS-1734.
Major bug reported by umamaheswararao and fixed by umamaheswararao (name-node)
'Chunk size to view' option is not working in Name Node UI.
1. Write a file to DFS
2. Browse the file using Name Node UI.
3. Give the chunk size to view as 100 and click refresh.
It will say Invalid input ( getnstamp absent )
- HDFS-1731.
Minor improvement reported by tlipcon and fixed by tlipcon
Allow using a file to exclude certain tests from build
It would be nice to be able to exclude certain tests when running builds. For example, when a test is "known flaky", you may want to exclude it from the main Hudson job, but not actually disable it in the codebase (so that it still runs as part of another Hudson job, for example).
- HDFS-1728.
Minor bug reported by szetszwo and fixed by szetszwo (name-node)
SecondaryNameNode.checkpointSize is in byte but not MB.
The unit of SecondaryNameNode.checkpointSize is bytes, not MB as stated in the following comment.
{code}
//SecondaryNameNode.java
private long checkpointSize; // size (in MB) of current Edit Log
{code}
- HDFS-1727.
Minor bug reported by umamaheswararao and fixed by sravankorumilli
fsck command can display command usage if user passes any illegal argument
If the user passes arguments to the fsck command such as
./hadoop fsck -test -files -blocks -racks
then fsck takes / as the path and displays the files, blocks, and racks information for the whole DFS.
This hides the user's mistake. Instead, we can display the command usage if the user passes an invalid argument like the one above.
If user passes illegal optional arguments like
./hadoop fsck /test -listcorruptfileblocks instead of
./hadoop fsck /test -list-corruptfileblocks also we can displa...
- HDFS-1723.
Minor improvement reported by aw and fixed by jimplush
quota errors messages should use the same scale
Updated the Quota exceptions to now use human readable output.
- HDFS-1703.
Minor sub-task reported by tanping and fixed by tanping (scripts)
HDFS federation: Improve start/stop scripts and add script to decommission datanodes
This Jira covers two issues:
# Startup scripts should start namenodes, secondary namenodes and datanodes on hosts returned by getConfig (new feature). This patch is spread out to both common (HADOOP-7179) and hdfs (this Jira).
# Decommission script to decommission datanodes
- HDFS-1692.
Major bug reported by bharathm and fixed by bharathm (data-node)
In secure mode, Datanode process doesn't exit when disks fail.
In secure mode, when more disks fail than the number of volumes tolerated, the datanode process doesn't exit properly; it just hangs even though the shutdown method is called.
- HDFS-1691.
Minor bug reported by humanoid and fixed by humanoid (tools)
double static declaration in Configuration.addDefaultResource("hdfs-default.xml");
In /src/java/org/apache/hadoop/hdfs/tools/DFSck.java, the following static block is declared twice:
static{
Configuration.addDefaultResource("hdfs-default.xml");
Configuration.addDefaultResource("hdfs-site.xml");
}
1. at the head of the class
2. before main
- HDFS-1675.
Major sub-task reported by szetszwo and fixed by szetszwo (data-node)
Transfer RBW between datanodes
Added a new stage TRANSFER_RBW to DataTransferProtocol
- HDFS-1665.
Minor bug reported by szetszwo and fixed by szetszwo (balancer)
Balancer sleeps inadequately
The value of {{dfs.heartbeat.interval}} is in seconds. Balancer seems to have misused it.
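A unit mix-up of this kind is easy to avoid by converting explicitly before sleeping. The snippet below is only a sketch with hypothetical names; it does not reproduce how Balancer computes its sleep time.

```java
import java.util.concurrent.TimeUnit;

public class HeartbeatSleepSketch {
    /** Converts a heartbeat interval configured in seconds to milliseconds for Thread.sleep. */
    static long heartbeatIntervalMillis(long configuredSeconds) {
        return TimeUnit.SECONDS.toMillis(configuredSeconds);
    }

    public static void main(String[] args) {
        // Passing the raw value 3 to Thread.sleep would pause for 3 ms, not 3 s.
        System.out.println(heartbeatIntervalMillis(3)); // 3000
    }
}
```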
- HDFS-1656.
Major bug reported by jnp and fixed by jnp
getDelegationToken in HftpFileSystem should renew TGT if needed.
Fetching of delegation tokens in HftpFileSystem will fail if TGT has expired. The TGT should be renewed first if needed.
- HDFS-1636.
Minor improvement reported by tlipcon and fixed by qwertymaniac (name-node)
If dfs.name.dir points to an empty dir, namenode format shouldn't require confirmation
If dfs.name.dir points to an empty dir, namenode -format no longer requires confirmation.
- HDFS-1630.
Major improvement reported by hairong and fixed by hairong (name-node)
Checksum fsedits
HDFS-903 calculates an MD5 checksum for a saved image, so that we can verify the integrity of the image at load time.
The other half of the story is how to verify fsedits. Similarly, we could use the checksum approach. But since an fsedit file is growing constantly, a checksum per file does not work. I am thinking of adding a checksum per transaction. Is it doable or too expensive?
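A per-transaction checksum can be sketched as follows. This is an assumption-laden illustration: the framing (payload followed by a 4-byte CRC32) and all names are hypothetical, and CRC32 is chosen here only because it is cheap and in the standard library, not because the source says the edit log uses it.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

public class EditChecksumSketch {
    /** Appends a CRC32 of the serialized transaction to the record. */
    static byte[] frame(byte[] txn) {
        CRC32 crc = new CRC32();
        crc.update(txn, 0, txn.length);
        return ByteBuffer.allocate(txn.length + 4)
                .put(txn)
                .putInt((int) crc.getValue())
                .array();
    }

    /** Recomputes the CRC32 over the payload and compares it with the stored value. */
    static boolean verify(byte[] framed) {
        if (framed.length < 4) {
            return false;
        }
        int payloadLen = framed.length - 4;
        CRC32 crc = new CRC32();
        crc.update(framed, 0, payloadLen);
        int stored = ByteBuffer.wrap(framed, payloadLen, 4).getInt();
        return stored == (int) crc.getValue();
    }

    public static void main(String[] args) {
        byte[] record = frame("OP_MKDIR /tmp/x".getBytes(StandardCharsets.UTF_8));
        System.out.println(verify(record)); // true
        record[0] ^= 0x01;                  // corrupt one bit of the payload
        System.out.println(verify(record)); // false
    }
}
```

Because each record carries its own checksum, a growing file can be verified record by record while it is being read, which a single per-file checksum cannot offer.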
- HDFS-1629.
Major sub-task reported by szetszwo and fixed by szetszwo (name-node)
Add a method to BlockPlacementPolicy for not removing the chosen nodes
{{BlockPlacementPolicy}} supports chosen nodes in some of the {{chooseTarget(..)}} methods. The chosen nodes will be removed from the output array. For adding new datanodes to an existing pipeline, it is useful to keep the chosen nodes in the output array.
- HDFS-1628.
Minor improvement reported by rramya and fixed by johnvijoe (name-node)
AccessControlException should display the full path
org.apache.hadoop.security.AccessControlException should display the full path for which the access is denied.
- HDFS-1627.
Major bug reported by hairong and fixed by hairong (name-node)
Fix NullPointerException in Secondary NameNode
Secondary NameNode should not reset namespace if no new image is downloaded from the primary NameNode.
- HDFS-1626.
Minor improvement reported by acmurthy and fixed by szetszwo (name-node)
Make BLOCK_INVALIDATE_LIMIT configurable
Added a new configuration property dfs.block.invalidate.limit for FSNamesystem.blockInvalidateLimit.
- HDFS-1625.
Minor bug reported by tlipcon and fixed by szetszwo (test)
TestDataNodeMXBean fails if disk space usage changes during test run
I've seen this on our internal hudson - we get failures like:
null expected:<...:{"freeSpace":857683[43552],"usedSpace":28672,"...> but was:<...:{"freeSpace":857683[59936],"usedSpace":28672,"...>
because some other build on the same build slave used up some disk space during the middle of the test.
- HDFS-1620.
Minor improvement reported by szetszwo and fixed by qwertymaniac
Rename HdfsConstants -> HdfsServerConstants, FSConstants -> HdfsConstants
Rename HdfsConstants interface to HdfsServerConstants, FSConstants interface to HdfsConstants
- HDFS-1612.
Minor bug reported by joecrobak and fixed by joecrobak (documentation)
HDFS Design Documentation is outdated
I was trying to discover details about the Secondary NameNode, and came across the description below in the HDFS design doc.
{quote}
The NameNode keeps an image of the entire file system namespace and file Blockmap in memory. This key metadata item is designed to be compact, such that a NameNode with 4 GB of RAM is plenty to support a huge number of files and directories. When the NameNode starts up, it reads the FsImage and EditLog from disk, applies all the transactions from the EditLog to...
- HDFS-1611.
Minor bug reported by umamaheswararao and fixed by umamaheswararao (hdfs client, name-node)
Some logical issues need to address.
Title: Some code level logical issues.
Description:
1. DFSClient:
Consider the case below: if only info-level logging is enabled, the message below will never be logged.
if (ClientDatanodeProtocol.LOG.isDebugEnabled()) {
ClientDatanodeProtocol.LOG.info("ClientDatanodeProtocol addr=" + addr);
}
2. org.apache.hadoop.hdfs.server.namenode.FSNamesystem.registerMBean()
catch (NotCompliantMBeanException e) {
e.printStackTrace();
}
We can avoid using printStackTrace(). Better to add log me...
- HDFS-1606.
Major new feature reported by szetszwo and fixed by szetszwo (data-node, hdfs client, name-node)
Provide a stronger data guarantee in the write pipeline
Added two configuration properties, dfs.client.block.write.replace-datanode-on-failure.enable and dfs.client.block.write.replace-datanode-on-failure.policy. Added a new feature to replace datanode on failure in DataTransferProtocol. Added getAdditionalDatanode(..) in ClientProtocol.
- HDFS-1602.
Major bug reported by cos and fixed by boryas (name-node)
NameNode storage failed replica restoration is broken
NameNode storage restore functionality doesn't work (as HDFS-903 demonstrated). This needs to be either disabled, or removed, or fixed. This feature also fails HDFS-1496
- HDFS-1601.
Major improvement reported by tlipcon and fixed by tlipcon (data-node)
Pipeline ACKs are sent as lots of tiny TCP packets
I noticed in an hbase benchmark that the packet counts in my network monitoring seemed high, so took a short pcap trace and found that each pipeline ACK was being sent as five packets, the first four of which only contain one byte. We should buffer these bytes and send the PipelineAck as one TCP packet.
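The effect of buffering can be shown with a stream that counts how many writes reach it. This is a minimal sketch: the ACK layout (a sequence number followed by one status per datanode) and all names are hypothetical, and a counting stream stands in for the socket, so it illustrates the write pattern rather than actual TCP packet boundaries.

```java
import java.io.BufferedOutputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.OutputStream;

public class BufferedAckSketch {
    /** Stand-in for a socket stream; counts how many write calls reach it. */
    static class CountingStream extends OutputStream {
        int writeCalls = 0;
        final ByteArrayOutputStream sink = new ByteArrayOutputStream();
        @Override public void write(int b) { writeCalls++; sink.write(b); }
        @Override public void write(byte[] b, int off, int len) { writeCalls++; sink.write(b, off, len); }
    }

    /** Hypothetical ACK layout: sequence number, then one status per datanode. */
    static void writeAck(OutputStream out, long seqno, short[] replies) throws IOException {
        DataOutputStream data = new DataOutputStream(out);
        data.writeLong(seqno);
        for (short reply : replies) {
            data.writeShort(reply);
        }
        data.flush();
    }

    /** Same ACK, but collected in a buffer and handed to the raw stream in one write. */
    static void writeAckBuffered(OutputStream raw, long seqno, short[] replies) throws IOException {
        writeAck(new BufferedOutputStream(raw, 512), seqno, replies);
    }

    public static void main(String[] args) throws IOException {
        short[] replies = { 0, 0, 0 };
        CountingStream direct = new CountingStream();
        writeAck(direct, 42L, replies);
        CountingStream buffered = new CountingStream();
        writeAckBuffered(buffered, 42L, replies);
        System.out.println("unbuffered writes: " + direct.writeCalls);
        System.out.println("buffered writes:   " + buffered.writeCalls);
    }
}
```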
- HDFS-1600.
Major bug reported by szetszwo and fixed by tlipcon (build, test)
editsStored.xml cause release audit warning
The file {{src/test/hdfs/org/apache/hadoop/hdfs/tools/offlineEditsViewer/editsStored.xml}} causes a release audit warning for any new patch.
- HDFS-1598.
Major bug reported by szetszwo and fixed by szetszwo (name-node)
ListPathsServlet excludes .*.crc files
The {{.*.crc}} files are excluded by default.
- HDFS-1596.
Major improvement reported by patrickangeles and fixed by qwertymaniac (documentation, name-node)
Move secondary namenode checkpoint configs from core-default.xml to hdfs-default.xml
Removed references to the older fs.checkpoint.* properties that resided in core-site.xml
- HDFS-1594.
Major bug reported by devaraj.k and fixed by atm (name-node)
When the disk becomes full Namenode is getting shutdown and not able to recover
Implemented a daemon thread to monitor the disk usage periodically. If the disk usage reaches the threshold value, the name node is put into safe mode so that no modification to the file system will occur. Once the disk usage drops below the threshold, the name node is taken out of safe mode. The threshold value and the interval at which disk usage is checked are configurable.
- HDFS-1592.
Major bug reported by bharathm and fixed by bharathm
Datanode startup doesn't honor volumes.tolerated
Datanode startup doesn't honor volumes.tolerated for hadoop 20 version.
- HDFS-1588.
Major improvement reported by zasran and fixed by zasran
Add dfs.hosts.exclude to DFSConfigKeys and use constant in stead of hardcoded string
- HDFS-1585.
Blocker bug reported by tlipcon and fixed by tlipcon (test)
HDFS-1547 broke MR build
Added a parameter to startDatanodes without maintaining old API
- HDFS-1583.
Major improvement reported by liangly and fixed by liangly (name-node)
Improve backup-node sync performance by wrapping RPC parameters
The journal edit records are sent by the active name-node to the backup-node with RPC:
{code}
public void journal(NamenodeRegistration registration,
int jAction,
int length,
byte[] records) throws IOException;
{code}
During the name-node throughput benchmark, the size of byte array _records_ is around *8000*. Then the serialization and deserialization is time-consuming. I wrote a simple application to test RPC with byte arr...
- HDFS-1582.
Major improvement reported by rvs and fixed by rvs (libhdfs)
Remove auto-generated native build files
The native build, when run from trunk, now requires autotools, libtool and openssl dev libraries.
- HDFS-1573.
Trivial improvement reported by tlipcon and fixed by tlipcon (hdfs client)
LeaseChecker thread name trace not that useful
The LeaseChecker thread in DFSClient will put a stack trace in its thread name, theoretically to help debug cases where these threads get leaked. However it just shows the stack trace of whoever is asking for the thread's name, not the stack trace of when the thread was allocated. I'd like to fix this so that you can see where the thread got started, which was presumably its original intent.
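The difference comes down to when the stack is captured. A minimal sketch, with hypothetical class and method names, of recording the trace at construction time so that it points at the creator rather than at whoever later asks for it:

```java
import java.io.PrintWriter;
import java.io.StringWriter;

public class CreationTraceSketch {
    static class TracedTask implements Runnable {
        // Captured while the constructor runs, so it records the creator's stack.
        private final Throwable createdAt = new Throwable("task created here");

        @Override
        public void run() {
            // The periodic work would go here.
        }

        /** Renders the stack captured at creation, whoever calls this later. */
        String creationTrace() {
            StringWriter sw = new StringWriter();
            PrintWriter pw = new PrintWriter(sw);
            createdAt.printStackTrace(pw);
            pw.flush();
            return sw.toString();
        }
    }

    /** Hypothetical call site whose name should show up in the trace. */
    static TracedTask startedFromHere() {
        return new TracedTask();
    }

    public static void main(String[] args) {
        TracedTask task = startedFromHere();
        System.out.println(task.creationTrace());
    }
}
```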
- HDFS-1568.
Minor improvement reported by tlipcon and fixed by fwiffo (data-node)
Improve DataXceiver error logging
In supporting customers we often see things like SocketTimeoutExceptions or EOFExceptions coming from DataXceiver, but the logging isn't very good. For example, if we get an IOE while setting up a connection to the downstream mirror in writeBlock, the IP of the downstream mirror isn't logged on the DN side.
- HDFS-1560.
Minor bug reported by tlipcon and fixed by tlipcon (data-node)
dfs.data.dir permissions should default to 700
The permissions on datanode data directories (configured by dfs.datanode.data.dir.perm) now default to 0700. Upon startup, the datanode will automatically change the permissions to match the configured value.
- HDFS-1557.
Major sub-task reported by ikelly and fixed by ikelly (name-node)
Separate Storage from FSImage
FSImage currently derives from Storage and FSEditLog has to call methods directly on FSImage to access the filesystem. This JIRA is to separate the Storage class out into NNStorage so that FSEditLog is less dependent on FSImage. From this point, the other parts of the circular dependency should be easy to fix.
- HDFS-1551.
Major bug reported by gkesavan and fixed by gkesavan (build)
fix the pom template's version
The pom templates in the ivy folder should be updated to the latest version of the hadoop-common dependencies.
- HDFS-1547.
Major improvement reported by sureshms and fixed by sureshms (name-node)
Improve decommission mechanism
Summary of changes to the decommissioning process:
# After nodes are decommissioned, they are not shut down. The decommissioned nodes are not used for writes. For reads, the decommissioned nodes are given as the last location to read from.
# The numbers of live and dead decommissioned nodes are displayed in the namenode web UI.
# The free capacity of decommissioned nodes is not counted towards the cluster free capacity.
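The read-ordering rule in the first item can be illustrated with a small sketch. ReadOrder and Location are hypothetical names, and this is not the namenode's actual sorting code; it only shows decommissioned nodes being placed last.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ReadOrder {
    // Hypothetical replica location: a node name plus its decommission state.
    static final class Location {
        final String node;
        final boolean decommissioned;

        Location(String node, boolean decommissioned) {
            this.node = node;
            this.decommissioned = decommissioned;
        }
    }

    // Returns a copy with decommissioned nodes moved to the end.
    // List.sort is stable, so the relative order of the other nodes is kept.
    static List<Location> orderForRead(List<Location> replicas) {
        List<Location> ordered = new ArrayList<>(replicas);
        ordered.sort((a, b) -> Boolean.compare(a.decommissioned, b.decommissioned));
        return ordered;
    }

    public static void main(String[] args) {
        List<Location> replicas = Arrays.asList(
            new Location("dn1", true), new Location("dn2", false), new Location("dn3", false));
        for (Location l : orderForRead(replicas)) {
            System.out.println(l.node);
        }
    }
}
```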
- HDFS-1541.
Major sub-task reported by hairong and fixed by hairong (name-node)
Not marking datanodes dead When namenode in safemode
In a big cluster, when the namenode starts up, it takes a long time for it to process block reports from all datanodes. Because heartbeat processing gets delayed, some datanodes are erroneously marked as dead and later have to register again, which wastes time.
Startup would be faster if the check for dead nodes were disabled while the namenode is in safemode.
- HDFS-1540.
Major bug reported by dhruba and fixed by dhruba (data-node)
Make Datanode handle errors to namenode.register call more elegantly
When a datanode receives a "Connection reset by peer" from the namenode.register(), it exits. This causes many datanodes to die.
- HDFS-1539.
Major improvement reported by dhruba and fixed by dhruba (data-node, hdfs client, name-node)
prevent data loss when a cluster suffers a power loss
We have seen an instance where an external outage caused many datanodes to reboot at around the same time. This resulted in many corrupted blocks. These were recently written blocks; the current implementation of HDFS Datanodes does not sync the data of a block file when the block is closed.
1. Have a cluster-wide config setting that causes the datanode to sync a block file when a block is finalized.
2. Introduce a new parameter to the FileSystem.create() to trigger the new behaviour, i.e. cau...
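The first option can be sketched with plain java.io, assuming a boolean flag that stands in for the proposed setting. SyncOnClose, writeBlock and syncOnFinalize are hypothetical names, not datanode code or real configuration keys.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;

class SyncOnClose {
    // Writes the data and, if requested, forces it to the storage device
    // before the file is closed. syncOnFinalize is a hypothetical flag.
    static void writeBlock(File blockFile, byte[] data, boolean syncOnFinalize) throws IOException {
        try (FileOutputStream out = new FileOutputStream(blockFile)) {
            out.write(data);
            if (syncOnFinalize) {
                // force(true) also flushes file metadata such as the length.
                out.getChannel().force(true);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        File f = File.createTempFile("blk_", ".tmp");
        f.deleteOnExit();
        writeBlock(f, new byte[] {1, 2, 3}, true);
        System.out.println(f.length() + " bytes written and synced");
    }
}
```

Without the force call the bytes may still sit in the operating system's cache when the machine loses power, which is the failure described above.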
- HDFS-1536.
Major improvement reported by hairong and fixed by hairong
Improve HDFS WebUI
On the web UI, the missing block count is now accurate, and under-replicated blocks no longer include missing blocks.
- HDFS-1534.
Minor improvement reported by eli and fixed by eli (name-node)
Fix some incorrect logs in FSDirectory
FSDirectory#removeBlock has the wrong debug log; it was copied from the add block log.
- HDFS-1533.
Major bug reported by pkling and fixed by pkling (hdfs client)
A more elegant FileSystem#listCorruptFileBlocks API (HDFS portion)
This is the HDFS portion of HADOOP-7060.
- HDFS-1526.
Major bug reported by hairong and fixed by hairong (hdfs client)
Dfs client name for a map/reduce task should have some randomness
Make the client name have this format: DFSClient_applicationid_randomint_threadid, where applicationid = mapred.task.id or else = "NONMAPREDUCE".
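A minimal sketch of how a name in this format could be assembled. ClientNames and clientName are hypothetical, and the task id is passed in as a plain string rather than read from the mapred.task.id configuration value.

```java
import java.util.Random;

class ClientNames {
    // taskId stands in for the value of mapred.task.id; null means it is not set.
    static String clientName(String taskId, Random random, long threadId) {
        String applicationId = (taskId != null) ? taskId : "NONMAPREDUCE";
        return "DFSClient_" + applicationId + "_" + random.nextInt() + "_" + threadId;
    }

    public static void main(String[] args) {
        System.out.println(clientName(null, new Random(), Thread.currentThread().getId()));
    }
}
```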
- HDFS-1524.
Blocker bug reported by hairong and fixed by hairong (name-node)
Image loader should make sure to read every byte in image file
While working on HDFS-1070, I came across a very strange bug. Occasionally when loading a compressed image, NameNode throws an exception complaining that the image file is corrupt. However, the result returned by the md5sum utility matches the checksum value stored in the VERSION file.
It turns out the image loader leaves 4 bytes unread after loading all the real data of an image. Those unread bytes may be some compression-related meta-info. The image loader should make sure to read to the en...
- HDFS-1523.
Major bug reported by cos and fixed by cos (test)
TestLargeBlock is failing on trunk
TestLargeBlock has been failing for more than a week now on 0.22 and trunk with
{noformat}
java.io.IOException: Premeture EOF from inputStream
at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:118)
at org.apache.hadoop.hdfs.BlockReader.readChunk(BlockReader.java:275)
{noformat}
- HDFS-1518.
Minor improvement reported by yaojingguo and fixed by yaojingguo (name-node)
Wrong description in FSNamesystem's javadoc
"4) machine --> blocklist (inverted #2)" should be "4) machine --> blocklist (inverted #3)"
- HDFS-1516.
Major bug reported by cos and fixed by cos (build)
mvn-install is broken after 0.22 branch creation
Version HAS to be bumped for system testing framework artifacts (as mentioned in the build.xml file)
- HDFS-1513.
Minor improvement reported by eli and fixed by eli
Fix a number of warnings
There are a bunch of warnings besides DeprecatedUTF8, HDFS-1512 and two warnings caused by a Java bug (http://bugs.sun.com/view_bug.do?bug_id=646014) that we can fix.
- HDFS-1511.
Blocker bug reported by nidaley and fixed by jghoman
98 Release Audit warnings on trunk and branch-0.22
There are 98 release audit warnings on trunk. See attached txt file. These must be fixed or filtered out to get back to a reasonably small number of warnings. The OK_RELEASEAUDIT_WARNINGS property in src/test/test-patch.properties should also be set appropriately in the patch that fixes this issue.
- HDFS-1510.
Minor improvement reported by nidaley and fixed by nidaley
Add test-patch.properties required by test-patch.sh
Related to HADOOP-7042.
- HDFS-1509.
Major improvement reported by dhruba and fixed by dhruba (name-node)
Resync discarded directories in fs.name.dir during saveNamespace command
In the current implementation, if the Namenode encounters an error while writing to a fs.name.dir directory it stops writing new edits to that directory. My proposal is to make the namenode write the fsimage to all configured directories in fs.name.dir, and from then on, continue writing fsedits to all configured directories.
- HDFS-1506.
Major improvement reported by hairong and fixed by hairong (name-node)
Refactor fsimage loading code
I plan to do some code refactoring to make HDFS-1070 simpler.
- HDFS-1505.
Blocker bug reported by tlipcon and fixed by atm
saveNamespace appears to succeed even if all directories fail to save
After HDFS-1071, saveNamespace now appears to "succeed" even if all of the individual directories failed to save.
- HDFS-1503.
Minor bug reported by eli and fixed by tlipcon (test)
TestSaveNamespace fails
Will attach the full log. Here's the relevant snippet:
{noformat}
Exception in thread "FSImageSaver for /home/eli/src/hdfs4/build/test/data/dfs/na
me1 of type IMAGE_AND_EDITS" java.lang.RuntimeException: Injected fault: saveFSI
mage second time
....
at java.lang.Thread.run(Thread.java:619)
Exception in thread "FSImageSaver for /home/eli/src/hdfs4/build/test/data/dfs/na
me2 of type IMAGE_AND_EDITS" java.lang.StackOverflowError
at java.util.Arrays.copyOf(Arrays.java:2882)
{nofo...
- HDFS-1502.
Minor bug reported by eli and fixed by hairong
TestBlockRecovery triggers NPE in assert
{noformat}
Testcase: testRBW_RWRReplicas took 10.333 sec
Caused an ERROR
null
java.lang.NullPointerException
at org.apache.hadoop.hdfs.server.datanode.DataNode.syncBlock(DataNode.java:1881)
at org.apache.hadoop.hdfs.server.datanode.TestBlockRecovery.testSyncReplicas(TestBlockRecovery.java:144)
at org.apache.hadoop.hdfs.server.datanode.TestBlockRecovery.testRBW_RWRReplicas(TestBlockRecovery.java:305)
{noformat}
{noformat}
Block reply = r.datanode.update...
- HDFS-1486.
Major improvement reported by cos and fixed by cos (test)
Generalize CLITest structure and interfaces to facilitate upstream adoption (e.g. for web testing)
HDFS part of HADOOP-7014. HDFS side of TestCLI doesn't require any special changes but needs to be aligned with Common
- HDFS-1481.
Major improvement reported by hairong and fixed by hairong (name-node)
NameNode should validate fsimage before rolling
We had an incident in which the fsimage at the secondary NameNode was truncated but still got uploaded to the primary NameNode. The primary NameNode simply rolled the image without checking its integrity, which left the fsimage corrupt. The primary NameNode should check the new image's integrity before rolling the fsimage.
- HDFS-1480.
Major bug reported by mary and fixed by tlipcon (name-node)
All replicas of a block can end up on the same rack when some datanodes are decommissioning.
It appears that all replicas of a block can end up in the same rack. The likelihood of such replicas seems to be directly related to decommissioning of nodes.
Post rolling OS upgrade (decommission 3-10% of nodes, re-install etc, add them back) of a running cluster, all replicas of about 0.16% of blocks ended up in the same rack.
Hadoop Namenode UI etc doesn't seem to know about such incorrectly replicated blocks. "hadoop fsck .." does report that the blocks must be replicated on additional...
- HDFS-1476.
Major improvement reported by pkling and fixed by pkling (name-node)
listCorruptFileBlocks should be functional while the name node is still in safe mode
This would allow us to detect whether missing blocks can be fixed using Raid and if that is the case exit safe mode earlier.
One way to make listCorruptFileBlocks available before the name node has exited from safe mode would be to perform a scan of the blocks map on each call to listCorruptFileBlocks to determine if there are any blocks with no replicas. This scan could be parallelized by dividing the space of block IDs into multiple intervals that can be scanned independently.
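The interval-based scan can be sketched as follows. This is only an illustration: IntervalScan is a hypothetical class, and a sorted map from block ID to replica count stands in for the real blocks map. It assumes at least one interval and block IDs in a range small enough that the difference between the largest and smallest ID does not overflow.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class IntervalScan {
    // replicaCounts is a hypothetical stand-in for the blocks map: block ID -> replica count.
    static List<Long> blocksWithoutReplicas(final NavigableMap<Long, Integer> replicaCounts, int intervals) {
        List<Long> missing = new ArrayList<>();
        if (replicaCounts.isEmpty()) {
            return missing;
        }
        long min = replicaCounts.firstKey();
        long width = (replicaCounts.lastKey() - min) / intervals + 1;
        ExecutorService pool = Executors.newFixedThreadPool(intervals);
        try {
            List<Future<List<Long>>> parts = new ArrayList<>();
            for (int i = 0; i < intervals; i++) {
                final long from = min + i * width; // inclusive
                final long to = from + width;      // exclusive
                Callable<List<Long>> scan = () -> {
                    List<Long> found = new ArrayList<>();
                    for (Map.Entry<Long, Integer> entry : replicaCounts.subMap(from, true, to, false).entrySet()) {
                        if (entry.getValue() == 0) {
                            found.add(entry.getKey());
                        }
                    }
                    return found;
                };
                parts.add(pool.submit(scan));
            }
            // Results are collected in interval order, so the output is sorted by block ID.
            for (Future<List<Long>> part : parts) {
                missing.addAll(part.get());
            }
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("scan failed", e);
        } finally {
            pool.shutdown();
        }
        return missing;
    }

    public static void main(String[] args) {
        NavigableMap<Long, Integer> counts = new TreeMap<>();
        counts.put(1L, 3);
        counts.put(2L, 0);
        counts.put(10L, 0);
        System.out.println(blocksWithoutReplicas(counts, 2));
    }
}
```

Each interval only reads its own slice of the map, so the tasks do not need to coordinate as long as the map is not modified during the scan.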
- HDFS-1473.
Major sub-task reported by tlipcon and fixed by tlipcon (name-node)
Refactor storage management into separate classes than fsimage file reading/writing
Currently the FSImage class is responsible both for storage management (eg moving around files, tracking file names, the VERSION file, etc) as well as for the actual serialization and deserialization of the "fsimage" file within the storage directory.
I'd like to refactor the loading and saving code into new classes. This will make testing easier and also make the major changes in HDFS-1073 easier to understand.
- HDFS-1467.
Blocker bug reported by tlipcon and fixed by tlipcon (data-node)
Append pipeline never succeeds with more than one replica
TestPipelines appears to be failing on trunk:
Should be RBW replica after sequence of calls append()/write()/hflush() expected:<RBW> but was:<FINALIZED>
junit.framework.AssertionFailedError: Should be RBW replica after sequence of calls append()/write()/hflush() expected:<RBW> but was:<FINALIZED>
at org.apache.hadoop.hdfs.TestPipelines.pipeline_01(TestPipelines.java:109)
- HDFS-1463.
Major bug reported by dhruba and fixed by dhruba (name-node)
accessTime updates should not occur in safeMode
FSNamesystem.getBlockLocations sometimes needs to update the accessTime of files. If the namenode is in safemode, this call should fail.
- HDFS-1458.
Major improvement reported by hairong and fixed by hairong (name-node)
Improve checkpoint performance by avoiding unnecessary image downloads
If secondary namenode could verify that the image it has on its disk is the same as the one in the primary NameNode, it could skip downloading the image from the primary NN, thus completely eliminating the image download overhead.
- HDFS-1448.
Major new feature reported by zasran and fixed by zasran (tools)
Create multi-format parser for edits logs file, support binary and XML formats initially
The offline edits viewer feature adds the oev tool to the hdfs script. Oev makes it possible to convert edits logs to/from native binary and XML formats. It uses the same framework as the offline image viewer.
Example usage:
$HADOOP_HOME/bin/hdfs oev -i edits -o output.xml
- HDFS-1445.
Major sub-task reported by mattf and fixed by mattf (data-node)
Batch the calls in DataStorage to FileUtil.createHardLink(), so we call it once per directory instead of once per file
Batch hardlinking during "upgrade" snapshots, cutting time from approximately 8 minutes per volume to approximately 8 seconds. Validated on both Linux and Windows. Depends on prior integration of the patch for HADOOP-7133.
- HDFS-1442.
Major improvement reported by jnp and fixed by jnp
Api to get delegation token in Hdfs
FileContext uses Hdfs instead of DistributedFileSystem. We need to add delegation token APIs in Hdfs class as well.
- HDFS-1398.
Major sub-task reported by tanping and fixed by
HDFS federation: Upgrade and rolling back of Federation
- HDFS-1381.
Major bug reported by jghoman and fixed by jimplush (test)
HDFS javadocs hard-code references to dfs.namenode.name.dir and dfs.datanode.data.dir parameters
Updated the JavaDocs to appropriately represent the new Configuration Keys that are used in the code. The docs did not match the code.
- HDFS-1378.
Major improvement reported by tlipcon and fixed by atm (name-node)
Edit log replay should track and report file offsets in case of errors
Occasionally there are bugs or operational mistakes that result in corrupt edit logs which I end up having to repair by hand. In these cases it would be very handy to have the error message also print out the file offsets of the last several edit log opcodes so it's easier to find the right place to edit in the OP_INVALID marker. We could also use this facility to provide a rough estimate of how far along edit log replay the NN is during startup (handy when a 2NN has died and replay takes a w...
- HDFS-1377.
Blocker bug reported by eli and fixed by eli (name-node)
Quota bug for partial blocks allows quotas to be violated
There's a bug in the quota code that causes them not to be respected when a file is not an exact multiple of the block size. Here's an example:
{code}
$ hadoop fs -mkdir /test
$ hadoop dfsadmin -setSpaceQuota 384M /test
$ ls dir/ | wc -l # dir contains 101 files
101
$ du -ms dir # each is 3mb
304 dir
$ hadoop fs -put dir /test
$ hadoop fs -count -q /test
none inf 402653184 -550502400 2 101 317718528 hdfs://haus01.sf.clouder...
- HDFS-1371.
Major bug reported by knoguchi and fixed by tanping (hdfs client, name-node)
One bad node can incorrectly flag many files as corrupt
On our cluster, 12 files were reported as corrupt by fsck even though the replicas on the datanodes were healthy.
Turns out that all the replicas (12 files x 3 replicas per file) were reported corrupt from one node.
Surprisingly, these files were still readable/accessible from dfsclient (-get/-cat) without any problems.
- HDFS-1360.
Minor bug reported by tlipcon and fixed by tlipcon (test)
TestBlockRecovery should bind ephemeral ports
TestBlockRecovery starts up a DN, but doesn't configure the various ports to be ephemeral, so the test fails if run on a machine where another DN is already running.
- HDFS-1335.
Major improvement reported by hairong and fixed by hairong (hdfs client, name-node)
HDFS side of HADOOP-6904: first step towards inter-version communications between dfs client and NameNode
The idea is that for getProtocolVersion, if the server version is greater than the client version, the NameNode checks whether the client and server versions are compatible. If not, it throws a VersionIncompatible exception; otherwise, it returns the server version.
On the dfs client side, when creating a NameNode proxy, the client catches the VersionMismatch exception and then, if the client version is greater than the server version, checks whether the client version and the server version are compatible. If not compatible,...
- HDFS-1332.
Minor improvement reported by tlipcon and fixed by yuzhihong@gmail.com (name-node)
When unable to place replicas, BlockPlacementPolicy should log reasons nodes were excluded
Whenever the block placement policy determines that a node is not a "good target" it could add the reason for exclusion to a list, and then when we log "Not able to place enough replicas" we could say why each node was refused. This would help new users who are having issues on pseudo-distributed (eg because their data dir is on /tmp and /tmp is full). Right now it's very difficult to figure out the issue.
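A rough sketch of the proposed bookkeeping, assuming two illustrative exclusion rules (decommissioning and insufficient free space). PlacementDiagnostics, Candidate and the reason strings are hypothetical and do not reflect the actual BlockPlacementPolicy checks.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class PlacementDiagnostics {
    // Hypothetical candidate node; the fields are illustrative only.
    static final class Candidate {
        final String name;
        final long freeBytes;
        final boolean decommissioning;

        Candidate(String name, long freeBytes, boolean decommissioning) {
            this.name = name;
            this.freeBytes = freeBytes;
            this.decommissioning = decommissioning;
        }
    }

    // Records, per refused node, why it was not a good target.
    static Map<String, String> exclusionReasons(List<Candidate> nodes, long blockSize) {
        Map<String, String> reasons = new LinkedHashMap<>();
        for (Candidate n : nodes) {
            if (n.decommissioning) {
                reasons.put(n.name, "node is being decommissioned");
            } else if (n.freeBytes < blockSize) {
                reasons.put(n.name, "not enough free space");
            }
        }
        return reasons;
    }

    // Builds the log line, appending the recorded reason for each refused node.
    static String logMessage(Map<String, String> reasons) {
        StringBuilder sb = new StringBuilder("Not able to place enough replicas.");
        for (Map.Entry<String, String> e : reasons.entrySet()) {
            sb.append(' ').append(e.getKey()).append(": ").append(e.getValue()).append(';');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Candidate> nodes = Arrays.asList(new Candidate("dn1", 10, false));
        System.out.println(logMessage(exclusionReasons(nodes, 64)));
    }
}
```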
- HDFS-1330.
Major new feature reported by hairong and fixed by johnvijoe (data-node)
Make RPCs to DataNodes timeout
This jira aims to make client/datanode and datanode/datanode RPCs time out after DataNode#socketTimeout.
- HDFS-1321.
Minor bug reported by garymurry and fixed by jimplush (name-node)
If service port and main port are the same, there is no clear log message explaining the issue.
Added a check to make sure the RPC and HTTP ports on the NameNode are not set to the same value; otherwise an IOException is thrown with an appropriate message.
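The kind of check described here might look like the sketch below. PortCheck and validate are hypothetical names, the message text is made up for illustration, and the port numbers in main are only example values.

```java
import java.io.IOException;

class PortCheck {
    // Fails fast with a clear message when both services would bind the same port.
    static void validate(int rpcPort, int httpPort) throws IOException {
        if (rpcPort == httpPort) {
            throw new IOException(
                "RPC port and HTTP port must differ, but both are set to " + rpcPort);
        }
    }

    public static void main(String[] args) throws IOException {
        validate(8020, 50070); // example values only
        System.out.println("ports ok");
    }
}
```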
- HDFS-1295.
Major sub-task reported by dhruba and fixed by mattf (name-node)
Improve namenode restart times by short-circuiting the first block reports from datanodes
The namenode restart is dominated by the performance of processing block reports. On a 2000 node cluster with 90 million blocks, block report processing takes 30 to 40 minutes. The namenode "diffs" the contents of the incoming block report with the contents of the blocks map, and then applies these diffs to the blocksMap, but in reality there is no need to compute the "diff" because this is the first block report from the datanode.
This code change improves block report processing time by 3...
- HDFS-1257.
Major bug reported by rvadali and fixed by eepayne (name-node)
Race condition on FSNamesystem#recentInvalidateSets introduced by HADOOP-5124
HADOOP-5124 provided some improvements to FSNamesystem#recentInvalidateSets. But it introduced unprotected access to the data structure recentInvalidateSets. Specifically, FSNamesystem.computeInvalidateWork accesses recentInvalidateSets without read-lock protection. If there is concurrent activity (like reducing replication on a file) that adds to recentInvalidateSets, the name-node crashes with a ConcurrentModificationException.
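A minimal sketch of guarding such a structure with a read-write lock, so that iteration and modification cannot overlap. InvalidateTracker is a hypothetical class and is not the FSNamesystem code or the actual fix.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.concurrent.locks.ReentrantReadWriteLock;

class InvalidateTracker {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    // Datanode name -> IDs of blocks waiting to be invalidated (illustrative layout).
    private final Map<String, Set<Long>> recentInvalidateSets = new TreeMap<>();

    // Writers take the write lock before changing the map.
    void add(String datanode, long blockId) {
        lock.writeLock().lock();
        try {
            recentInvalidateSets.computeIfAbsent(datanode, k -> new HashSet<>()).add(blockId);
        } finally {
            lock.writeLock().unlock();
        }
    }

    // Readers hold the read lock for the whole iteration, so no writer can
    // modify the map mid-loop and trigger a ConcurrentModificationException.
    int pendingCount() {
        lock.readLock().lock();
        try {
            int total = 0;
            for (Set<Long> blocks : recentInvalidateSets.values()) {
                total += blocks.size();
            }
            return total;
        } finally {
            lock.readLock().unlock();
        }
    }

    public static void main(String[] args) {
        InvalidateTracker tracker = new InvalidateTracker();
        tracker.add("dn1", 1L);
        System.out.println(tracker.pendingCount());
    }
}
```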
- HDFS-1217.
Major improvement reported by szetszwo and fixed by lakshman (name-node)
Some methods in the NameNode should not be public
There are quite a few NameNode methods which are not required to be public.
- HDFS-1206.
Major bug reported by szetszwo and fixed by cos (test)
TestFiHFlush fails intermittently
When I was testing HDFS-1114, the patch passed all tests except TestFiHFlush. Then I tried to print out some debug messages; however, TestFiHFlush succeeded after adding the messages.
TestFiHFlush probably depends on the speed of BlocksMap. If BlocksMap is slow enough, then it will pass.
- HDFS-1189.
Major bug reported by xiaokang and fixed by johnvijoe (name-node)
Quota counts missed between clear quota and set quota
HDFS Quota counts will be missed between a clear quota operation and a set quota.
When setting quota for a dir, the INodeDirectory will be replaced by INodeDirectoryWithQuota and dir.isQuotaSet() becomes true. When INodeDirectoryWithQuota is newly created, quota counting will be performed. However, when clearing quota, the quota conf is set to -1 and dir.isQuotaSet() becomes false while INodeDirectoryWithQuota will NOT be replaced back to INodeDirectory.
FSDirectory.updateCount just update...
- HDFS-1149.
Major bug reported by tlipcon and fixed by atm (name-node)
Lease reassignment is not persisted to edit log
During lease recovery, the lease gets reassigned to a special NN holder. This is not currently persisted to the edit log, which means that after an NN restart, the original leaseholder could end up allocating more blocks or completing a file that has already started recovery.
- HDFS-1120.
Major improvement reported by hammer and fixed by qwertymaniac (data-node)
Make DataNode's block-to-device placement policy pluggable
Make the DataNode's block-volume choosing policy pluggable.
- HDFS-1117.
Major improvement reported by vicaya and fixed by vicaya
HDFS portion of HADOOP-6728 (overhaul metrics framework)
Metrics names are standardized to use CapitalizedCamelCase. Some examples:
# Metrics names using "_" are changed to the new naming scheme. Eg: bytes_written changes to BytesWritten.
# All metrics names start with capitals. Example: threadsBlocked changes to ThreadsBlocked.
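The two renaming rules above can be expressed as one small conversion. MetricNames and toCapitalizedCamelCase are hypothetical helpers written for illustration; they are not part of the metrics framework.

```java
class MetricNames {
    // Drops each "_" and capitalizes the first letter and every letter that followed a "_".
    static String toCapitalizedCamelCase(String name) {
        StringBuilder sb = new StringBuilder();
        boolean upperNext = true;
        for (char c : name.toCharArray()) {
            if (c == '_') {
                upperNext = true;
                continue;
            }
            sb.append(upperNext ? Character.toUpperCase(c) : c);
            upperNext = false;
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toCapitalizedCamelCase("bytes_written"));  // BytesWritten
        System.out.println(toCapitalizedCamelCase("threadsBlocked")); // ThreadsBlocked
    }
}
```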
- HDFS-1073.
Major improvement reported by sanjay.radia and fixed by tlipcon
Simpler model for Namenode's fs Image and edit Logs
The NameNode's storage layout for its name directories has been reorganized to be more robust. Each edit now has a unique transaction ID, and each file is associated with a transaction ID (for checkpoints) or a range of transaction IDs (for edit logs).
- HDFS-1070.
Major sub-task reported by hairong and fixed by hairong (name-node)
Speedup NameNode image loading and saving by storing local file names
This changes the fsimage format to be
root directory-1 directory-2 ... directory-n.
Each directory stores all its children in the following format:
Directory_full_path_name num_of_children child-1 ... child-n.
Each inode stores only the last component of its path name into fsimage.
This change requires an upgrade at deployment.
- HDFS-1052.
Major new feature reported by sureshms and fixed by sureshms (name-node)
HDFS scalability with multiple namenodes
HDFS currently uses a single namenode that limits scalability of the cluster. This jira proposes an architecture to scale the nameservice horizontally using multiple namenodes.
- HDFS-1001.
Minor bug reported by bcwalrus and fixed by bcwalrus (data-node)
DataXceiver and BlockReader disagree on when to send/recv CHECKSUM_OK
Running the TestPread with additional debug statements reveals that the BlockReader sends CHECKSUM_OK when the DataXceiver doesn't expect it. Currently it doesn't matter since DataXceiver closes the connection after each op, and CHECKSUM_OK is the last thing on the wire. But if we want to cache connections, they need to agree on the exchange of CHECKSUM_OK.
- HDFS-863.
Major bug reported by tlipcon and fixed by kengoodhope (test)
Potential deadlock in TestOverReplicatedBlocks
TestOverReplicatedBlocks.testProcesOverReplicateBlock synchronizes on namesystem.heartbeats without synchronizing on namesystem first. Other places in the code synchronize namesystem, then heartbeats. It's probably unlikely to occur in this test case, but it's a simple fix.
- HDFS-780.
Major test reported by eli and fixed by eli (contrib/fuse-dfs)
Revive TestFuseDFS
Looks like TestFuseDFS has bit rot. Let's revive it.
- HDFS-560.
Minor improvement reported by stevel@apache.org and fixed by stevel@apache.org (build)
Proposed enhancements/tuning to hadoop-hdfs/build.xml
sibling list of HADOOP-6206, enhancements to the hdfs build for easier single-system build/test
- HDFS-420.
Major improvement reported by dbrodsky and fixed by bockelman (contrib/fuse-dfs)
Fuse-dfs should cache fs handles
Fuse-dfs should cache fs handles on a per-user basis. This significantly increases performance, and has the side effect of fixing the current code which leaks fs handles.
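The per-user caching idea can be sketched as below. Fuse-dfs itself is not written in Java, so this is only an illustration of the pattern: HandleCache is a hypothetical class and the connector function stands in for whatever opens a filesystem handle.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Function;

class HandleCache<H> {
    private final ConcurrentMap<String, H> handles = new ConcurrentHashMap<>();
    // Hypothetical factory that opens a new handle for a user.
    private final Function<String, H> connector;

    HandleCache(Function<String, H> connector) {
        this.connector = connector;
    }

    // Returns the cached handle for the user, connecting only on first use.
    H handleFor(String user) {
        return handles.computeIfAbsent(user, connector);
    }

    public static void main(String[] args) {
        HandleCache<String> cache = new HandleCache<>(user -> "handle-for-" + user);
        System.out.println(cache.handleFor("alice"));
        System.out.println(cache.handleFor("alice")); // served from the cache
    }
}
```

Reusing one handle per user avoids opening a new handle for every operation, which is where both the speedup and the end of the leak would come from.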
The original bug description follows:
I run the following test:
1. Run hadoop DFS in single node mode
2. start up fuse_dfs
3. copy my source tree, about 250 megs, into the DFS
cp -av * /mnt/hdfs/
in /var/log/messages I keep seeing:
Dec 22 09:02:08 bodum fuse_dfs: ERROR: hdfs trying to utime /bar/backend-trunk2/s...
- HDFS-73.
Blocker bug reported by rangadi and fixed by umamaheswararao (hdfs client)
DFSOutputStream does not close all the sockets
When DFSOutputStream writes to multiple blocks, it closes only the socket opened for the last block. When it is done with writing to one block it should close the socket.
I noticed this when I was fixing HADOOP-3067. After fixing HADOOP-3067, there were still a lot of sockets open (but not enough to fail the tests). These sockets were used to write to blocks.
- MAPREDUCE-3322.
Major improvement reported by acmurthy and fixed by acmurthy (documentation, mrv2)
Create a better index.html for maven docs
Create a better index.html for maven docs.
- MAPREDUCE-3321.
Minor bug reported by hitesh and fixed by hitesh (mrv2)
Disable some failing legacy tests for MRv2 builds to go through
By-product of MR-3214. Disable tests for the short term until fixes are available for all tests.
- MAPREDUCE-3313.
Blocker bug reported by ravidotg and fixed by hitesh (mrv2, test)
TestResourceTrackerService failing in trunk sometimes
TestResourceTrackerService is failing in trunk sometimes with the following error:
testDecommissionWithIncludeHosts(org.apache.hadoop.yarn.server.resourcemanager.TestResourceTrackerService) Time elapsed: 0.876 sec <<< ERROR!
java.lang.NullPointerException
at org.apache.hadoop.yarn.server.resourcemanager.ClusterMetrics.getNumDecommisionedNMs(ClusterMetrics.java:78)
at org.apache.hadoop.yarn.server.resourcemanager.TestResourceTrackerService.testDecommissionWithIncludeHosts(TestResourceTr...
- MAPREDUCE-3306.
Blocker bug reported by vinodkv and fixed by vinodkv (mrv2, nodemanager)
Cannot run apps after MAPREDUCE-2989
Seeing this in NM logs when trying to run jobs.
{code}
2011-10-28 21:40:21,263 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application: Processing application_1319818154209_0001 of type APPLICATION_INITED
2011-10-28 21:40:21,264 FATAL org.apache.hadoop.yarn.event.AsyncDispatcher: Error in dispatcher thread. Exiting..
java.util.NoSuchElementException
at java.util.HashMap$HashIterator.nextEntry(HashMap.java:796)
at java.util.HashMap$ValueIterator....
- MAPREDUCE-3304.
Major bug reported by raviprak and fixed by raviprak (mrv2, test)
TestRMContainerAllocator#testBlackListedNodes fails intermittently
Thanks to Hitesh for verifying!
bq. The heartbeat event should be drained before the schedule call.
bq. -- Hitesh
I can see this test fail intermittently on my Mac OSX 10.5 and Fedora 14 machines.
- MAPREDUCE-3296.
Major bug reported by vinodkv and fixed by vinodkv (build)
Pending(9) findBugs warnings
- MAPREDUCE-3295.
Critical bug reported by mahadev and fixed by
TestAMAuthorization failing on branch 0.23.
The test seems to fail both on Mac and linux. Trace in the next comment.
- MAPREDUCE-3292.
Critical bug reported by mahadev and fixed by mahadev (mrv2)
In secure mode job submission fails with Provider org.apache.hadoop.mapreduce.security.token.JobTokenIndentifier$Renewer not found.
This happens when you submit a job to a secure cluster. Also, it's only the first time that the error shows up. On the next submission of the job, the job passes.
- MAPREDUCE-3290.
Major bug reported by rramya and fixed by acmurthy (mrv2)
list-active-trackers throws NPE
bin/mapred -list-active-trackers throws NPE in mrV2. Trace in the next comment.
- MAPREDUCE-3288.
Blocker bug reported by rramya and fixed by mahadev (mrv2)
Mapreduce 23 builds failing
Hadoop mapreduce 0.23 builds are failing.
- MAPREDUCE-3285.
Blocker bug reported by acmurthy and fixed by sseth (mrv2)
Tests on branch-0.23 failing
Most are failing with some kerberos login exception:
Running org.apache.hadoop.yarn.server.nodemanager.TestLinuxContainerExecutorWithMocks
Tests run: 3, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.548 sec <<< FAILURE!
--
Running org.apache.hadoop.yarn.server.resourcemanager.TestAppManager
Tests run: 8, Failures: 0, Errors: 6, Skipped: 0, Time elapsed: 0.125 sec <<< FAILURE!
Running org.apache.hadoop.yarn.server.resourcemanager.TestRMAuditLogger
Tests run: 3, Failures: 0, Errors: 1, S...
- MAPREDUCE-3284.
Major bug reported by rramya and fixed by acmurthy (mrv2)
bin/mapred queue fails with JobQueueClient ClassNotFoundException
bin/mapred queue fails with the following exception:
{code}
-bash$ bin/mapred queue
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/mapred/JobQueueClient
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.mapred.JobQueueClient
at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang....
- MAPREDUCE-3282.
Critical bug reported by rramya and fixed by acmurthy (mrv2)
bin/mapred job -list throws exception
bin/mapred job -list throws exception when mapreduce.framework.name is set to "yarn"
- MAPREDUCE-3281.
Blocker bug reported by vinodkv and fixed by vinodkv (test)
TestLinuxContainerExecutorWithMocks failing on trunk.
- MAPREDUCE-3279.
Major bug reported by sseth and fixed by sseth (mrv2)
TestJobHistoryParsing broken
Broken after 3264, the test was verifying against the default user.
- MAPREDUCE-3275.
Critical improvement reported by revans2 and fixed by revans2 (documentation, mrv2)
Add docs for WebAppProxy
In my haste to get the WebAppProxy code in, the documentation for it was neglected. This is to fix that. Docs need to be added to ClusterSetup.html about how to configure and use the WebAppProxy.
- MAPREDUCE-3274.
Blocker bug reported by revans2 and fixed by revans2 (applicationmaster, mrv2)
Race condition in MR App Master Preemption can cause a deadlock
There appears to be a race condition in the MR App Master in relation to preempting reducers to let a mapper run. In the particular case that I have been debugging a reducer was selected for preemption that did not have a container assigned to it yet. When the container became available that reduce started running and the previous TA_KILL event appears to have been ignored.
- MAPREDUCE-3269.
Blocker bug reported by rramya and fixed by mahadev (mrv2)
Jobsummary logs not being moved to a separate file
The jobsummary logs are not being moved to a separate file. Below is the configuration in log4j.properties:
{noformat}
mapred.jobsummary.logger=INFO,console
log4j.logger.org.apache.hadoop.mapreduce.jobhistory.JobSummary=${mapred.jobsummary.logger}
log4j.additivity.org.apache.hadoop.mapreduce.jobhistory.JobSummary=false
log4j.appender.JSA=org.apache.log4j.DailyRollingFileAppender
log4j.appender.JSA.File=${hadoop.log.dir}/mapred-jobsummary.log
log4j.appender.JSA.layout=org.apache.log4j.Pattern...
- MAPREDUCE-3264.
Blocker bug reported by tlipcon and fixed by acmurthy (mrv2)
mapreduce.job.user.name needs to be set automatically
Currently in MR2 I have to manually specify mapreduce.job.user.name for each job. It's not picking it up from the security infrastructure, at least when running with DefaultContainerExecutor. This is obviously incorrect.
- MAPREDUCE-3263.
Blocker bug reported by rramya and fixed by hitesh (build, mrv2)
compile-mapred-test target fails
The compile-mapred-test target is broken, so the builds are not archiving the test jars.
- MAPREDUCE-3262.
Critical bug reported by hitesh and fixed by hitesh (mrv2, nodemanager)
A few events are not handled by the NodeManager in failure scenarios
Need to handle kill container event in localization failed state.
Need to handle resource localized in localization failed state.
- MAPREDUCE-3261.
Major bug reported by criccomini and fixed by (applicationmaster)
AM unable to release containers
I'm probably doing something wrong here, but I can't figure it out.
My ApplicationMaster is sending an AllocateRequest with ContainerIds to release. My ResourceManager logs say:
2011-10-25 10:02:52,236 WARN resourcemanager.RMAuditLogger (RMAuditLogger.java:logFailure(207)) - USER=criccomi IP=127.0.0.1 OPERATION=AM Released Container TARGET=FifoScheduler RESULT=FAILURE DESCRIPTION=Trying to release container not owned by app or with invalid id PERMISSIONS=Unauthorized access or invalid cont...
- MAPREDUCE-3259.
Blocker bug reported by kihwal and fixed by kihwal (mrv2, nodemanager)
ContainerLocalizer should get the proper java.library.path from LinuxContainerExecutor
As seen in MAPREDUCE-2915, java.library.path is not being passed when the LCE spawns a JVM for ContainerLocalizer.
However, unlike branch-0.20-security, the task runtime in 0.23 is unaffected by this. This is because the tasks' run-time environment is specified in the launch script by the client. Setting LD_LIBRARY_PATH is the primary way of specifying the locations of required native libraries in this case. The config property, mapreduce.admin.user.env is always set in the job environment and the de...
- MAPREDUCE-3258.
Blocker bug reported by sseth and fixed by sseth (mrv2)
Job counters missing from AM and history UI
- MAPREDUCE-3257.
Blocker sub-task reported by vinodkv and fixed by vinodkv (applicationmaster, mrv2, resourcemanager, security)
Authorization checks needed for AM->RM protocol
This is like MAPREDUCE-3256, but for AM->RM protocol.
- MAPREDUCE-3256.
Blocker sub-task reported by vinodkv and fixed by vinodkv (applicationmaster, mrv2, nodemanager, security)
Authorization checks needed for AM->NM protocol
We already authenticate requests to NM from any AM. We also need to authorize the requests, otherwise a rogue AM, *but with proper tokens and thus authenticated to talk to NM*, could either launch or kill a container with different ContainerID. We have two options:
- Remove the explicit passing of the ContainerId as part of the API and instead get it from the RPC layer. In this case, we will need a ContainerToken for each container.
- Do explicit authorization checks without relying on gett...
- MAPREDUCE-3254.
Blocker bug reported by rramya and fixed by acmurthy (contrib/streaming, mrv2)
Streaming jobs failing with PipeMapRunner ClassNotFoundException
ClassNotFoundException: org.apache.hadoop.streaming.PipeMapRunner encountered while running streaming jobs. Stack trace in the next comment.
- MAPREDUCE-3253.
Blocker bug reported by daijy and fixed by acmurthy (mrv2)
ContextFactory throw NoSuchFieldException
I see exceptions from ContextFactory when running the Pig unit tests:
Caused by: java.lang.IllegalArgumentException: Can't find field
at org.apache.hadoop.mapreduce.ContextFactory.<clinit>(ContextFactory.java:139)
Caused by: java.lang.NoSuchFieldException: reporter
at java.lang.Class.getDeclaredField(Class.java:1882)
at org.apache.hadoop.mapreduce.ContextFactory.<clinit>(ContextFactory.java:126)
- MAPREDUCE-3252.
Critical bug reported by tlipcon and fixed by tlipcon (mrv2, task)
MR2: Map tasks rewrite data once even if output fits in sort buffer
I found that, even if the output of a map task fits entirely in its sort buffer, it was rewriting the output entirely rather than just renaming the first spill into place. This is due to RawLocalFileSystem.rename() falling back to a copy if renameTo() fails. The first rename attempt was failing because no one has called mkdir for the output directory yet.
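The failure mode described above can be sketched in plain Java, outside Hadoop. The class and method names below are hypothetical and this is not the actual patch; it only illustrates the idea that creating the destination's parent directory first lets the cheap rename succeed, so no copy fallback is needed.

```java
import java.io.File;

// Hypothetical helper for illustration only; not part of Hadoop.
final class SpillRenameSketch {
    /**
     * Tries to move src to dst with a plain rename.
     * The parent directory of dst is created first, because a rename
     * into a directory that does not exist yet would fail.
     * Returns false if the directory could not be created or the rename failed.
     */
    static boolean renameIntoPlace(File src, File dst) {
        File parent = dst.getParentFile();
        if (parent != null && !parent.isDirectory() && !parent.mkdirs()) {
            return false;
        }
        return src.renameTo(dst);
    }
}
```

A caller that still wants a copy fallback can keep it for the case where this method returns false, for example when source and destination are on different file systems.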
- MAPREDUCE-3250.
Blocker sub-task reported by vinodkv and fixed by vinodkv (applicationmaster, mrv2)
When AM restarts, client keeps reconnecting to the new AM and prints a lot of logs.
- MAPREDUCE-3249.
Blocker sub-task reported by vinodkv and fixed by vinodkv (applicationmaster, mrv2)
Recovery of MR AMs with reduces fails the subsequent generation of the job
- MAPREDUCE-3242.
Major bug reported by mahadev and fixed by mahadev (mrv2)
Trunk compilation broken with bad interaction from MAPREDUCE-3070 and MAPREDUCE-3239.
Looks like patch command threw away some of the changes when I committed MAPREDUCE-3239 after MAPREDUCE-3070.
- MAPREDUCE-3240.
Blocker bug reported by vinodkv and fixed by hitesh (mrv2, nodemanager)
NM should send a SIGKILL for completed containers also
This addresses containers that exit properly after spawning sub-processes themselves. We don't want to leave these sub-process trees behind, or they can hog the NM's resources.
Today, we already have code to send SIGKILL to the whole process tree (possible because of the single session ID resulting from setsid) while the container is alive. We need to obtain the PID of each container when it starts and use that PID to send the signal for completed containers as well.
- MAPREDUCE-3239.
Minor improvement reported by tlipcon and fixed by tlipcon (mrv2)
Use new createSocketAddr API in MRv2 to give better error messages on misconfig
HADOOP-7749 added a NetUtils call which will include the configuration name as part of the exception message. This is handy if you accidentally specify some invalid string, or forget to specify a required parameter. This JIRA is to make MR2 use the new API.
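As a rough illustration of why naming the configuration key helps, here is a standalone sketch of a host:port parser that includes the key in its error message. It does not reproduce the NetUtils API; the class name, method name, and message wording are assumptions made for this example.

```java
import java.net.InetSocketAddress;

// Hypothetical stand-in for illustration; not the Hadoop NetUtils class.
final class SocketAddrSketch {
    /** Parses "host:port" and names the configuration property on failure. */
    static InetSocketAddress parse(String target, String configName) {
        int colon = (target == null) ? -1 : target.lastIndexOf(':');
        if (colon <= 0 || colon == target.length() - 1) {
            throw new IllegalArgumentException(describe(target, configName));
        }
        int port;
        try {
            port = Integer.parseInt(target.substring(colon + 1));
        } catch (NumberFormatException e) {
            throw new IllegalArgumentException(describe(target, configName), e);
        }
        if (port < 0 || port > 65535) {
            throw new IllegalArgumentException(describe(target, configName));
        }
        // Unresolved on purpose: the sketch performs no DNS lookup.
        return InetSocketAddress.createUnresolved(target.substring(0, colon), port);
    }

    private static String describe(String target, String configName) {
        return "Not a host:port pair: " + target
            + " (configuration property '" + configName + "')";
    }
}
```

With a message of this shape, a missing or mistyped value points directly at the property that needs fixing.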
- MAPREDUCE-3233.
Blocker sub-task reported by karams and fixed by mahadev (mrv2)
AM fails to restart when first AM is killed
Set yarn.resourcemanager.am.max-retries=5 in yarn-site.xml. Started yarn cluster.
Submitted a Sleep job of 100K map tasks as follows:
$HADOOP_COMMON_HOME/bin/hadoop jar $HADOOP_MAPRED_HOME/hadoop-test.jar sleep -m 100000 -r 0 -mt 1000 -rt 1000
When around 53K tasks had run, logged in to the node running the AppMaster and killed the AppMaster with kill -9.
The ResourceManager tried to restart the AM up to max-retries times but failed with the following:
{code}
11/10/19 15:29:09 INFO mapreduce.Job: Job job_1319036155027_0002 failed...
- MAPREDUCE-3228.
Blocker bug reported by vinodkv and fixed by vinodkv (applicationmaster, mrv2)
MR AM hangs when one node goes bad
Found this on one of the gridmix runs, again. One of the nodes went bad while the job had three containers running on it. Eventually, the AM marked the tasks as timed out and initiated cleanup of the failed containers via {{stopContainer()}}. The latter got stuck at the faulty node, so the tasks are stuck in the FAIL_CONTAINER_CLEANUP stage and the job waits there forever.
Thanks to [~Karams] for helping with this.
- MAPREDUCE-3226.
Blocker bug reported by vinodkv and fixed by vinodkv (mrv2, task)
Few reduce tasks hanging in a gridmix-run
In a gridmix run with ~1000 jobs, one job is getting stuck because of 2-3 hanging reducers. All of them are stuck after downloading all map outputs and have the following thread dump.
{code}
"EventFetcher for fetching Map Completion Events" daemon prio=10 tid=0xa325fc00 nid=0x1ca4 waiting on condition [0xa315c000]
java.lang.Thread.State: TIMED_WAITING (sleeping)
at java.lang.Thread.sleep(Native Method)
at org.apache.hadoop.mapreduce.task.reduce.EventFetcher.run(EventFe...
- MAPREDUCE-3212.
Minor bug reported by kam_iitkgp and fixed by kamesh (mrv2)
Message displays while executing yarn command should be proper
Execute the yarn command without any arguments. It displays
{noformat}Usage: hadoop [--config confdir] COMMAND {noformat}.
Rather the message should be
{noformat}Usage: yarn [--config confdir] COMMAND{noformat}
- MAPREDUCE-3209.
Major bug reported by vinodkv and fixed by vinodkv (build, mrv2)
Jenkins reports 160 FindBugs warnings
See
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1055//artifact/trunk/hadoop-mapreduce-project/patchprocess/newPatchFindbugsWarningshadoop-mapreduce-client-common.html
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1055//artifact/trunk/hadoop-mapreduce-project/patchprocess/newPatchFindbugsWarningshadoop-mapreduce-client-app.html
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/1055//artifact/trunk/hadoop-mapreduce-project/patchprocess/newPatchFindbugsWarningshadoop-...
- MAPREDUCE-3208.
Minor bug reported by liangzhwa and fixed by liangzhwa (mrv2)
NPE while flushing TaskLogAppender
An NPE is thrown when calling flush() on TaskLogAppender if the QuietWriter has not been initialized in advance.
- MAPREDUCE-3205.
Blocker improvement reported by tlipcon and fixed by tlipcon (mrv2, nodemanager)
MR2 memory limits should be pmem, not vmem
Resource limits are now expressed and enforced in terms of physical memory, rather than virtual memory. The virtual memory limit is set as a configurable multiple of the physical limit. The NodeManager's memory usage is now configured in units of MB rather than GB.
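The relationship between the two limits is simple arithmetic. The sketch below is illustrative only: the class, method, and parameter names are hypothetical, and the ratio value used in the usage note is an arbitrary example, not a documented default.

```java
// Hypothetical helper for illustration; names are not taken from Hadoop.
final class MemoryLimitSketch {
    /**
     * Derives the virtual memory limit in bytes from a physical limit
     * given in MB and a configurable vmem-to-pmem multiple.
     */
    static long virtualLimitBytes(long physicalLimitMb, double vmemToPmemRatio) {
        if (physicalLimitMb <= 0 || vmemToPmemRatio <= 0.0) {
            throw new IllegalArgumentException("limits must be positive");
        }
        long physicalLimitBytes = physicalLimitMb * 1024L * 1024L;
        return (long) (physicalLimitBytes * vmemToPmemRatio);
    }

    /** True when the observed usage exceeds the given limit. */
    static boolean isOverLimit(long usedBytes, long limitBytes) {
        return usedBytes > limitBytes;
    }
}
```

For example, a container with a 1024 MB physical limit and an assumed ratio of 2.0 would get a 2 GB virtual limit.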
- MAPREDUCE-3204.
Major bug reported by sureshms and fixed by tucu00 (build)
mvn site:site fails on MapReduce
This problem does not happen on 0.23. See details in the next comment.
- MAPREDUCE-3203.
Major bug reported by mahadev and fixed by mahadev (mrv2)
Fix some javac warnings in MRAppMaster.
MAPREDUCE-2762 accidentally introduced a couple of javac warnings. This JIRA is to fix some of them in MRAppMaster. We have plenty more to fix but I don't intend to fix them all here. This is just so that the Hudson bot does not -1 other patches with javac warnings.
- MAPREDUCE-3199.
Major bug reported by vinodkv and fixed by vinodkv (mrv2, test)
TestJobMonitorAndPrint is broken on trunk
I bisected this down to MAPREDUCE-3003 changes. The parent project for client-core changed to hadoop-project which doesn't have the log4j configuration unlike the previous parent hadoop-mapreduce-client.
- MAPREDUCE-3198.
Trivial bug reported by hitesh and fixed by acmurthy (mrv2)
Change mode for hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/resources/mock-container-executor to 755
The file is checked in with 644 permissions. TestLinuxContainerExecutorWithMocks changes the file mode to add executable permission if needed resulting in a modified file for 'git/svn status' when tests are run.
- MAPREDUCE-3197.
Major bug reported by anupamseth and fixed by mahadev (mrv2)
TestMRClientService failing on building clean checkout of branch 0.23
A clean checkout of the branch 0.23 source tree does not pass TestMRClientService#test(), which fails with the error message "Num diagnostics is not correct expected <2> but was:<1>" upon running "mvn clean install assembly:assembly" inside the MR directory.
- MAPREDUCE-3196.
Major bug reported by acmurthy and fixed by acmurthy (mrv2)
TestLinuxContainerExecutorWithMocks fails on Mac OSX
TestLinuxContainerExecutorWithMocks uses /bin/true, which isn't present on Mac OS X.
- MAPREDUCE-3192.
Major bug reported by jnp and fixed by jnp
Fix Javadoc warning in JobClient.java and Cluster.java
Javadoc warnings in JobClient.java and Cluster.java need to be fixed.
- MAPREDUCE-3190.
Major improvement reported by tlipcon and fixed by tlipcon (mrv2)
bin/yarn should barf early if HADOOP_COMMON_HOME or HADOOP_HDFS_HOME are not set
Currently, if these env vars are not set when you run bin/yarn, it will crash with various ClassNotFoundExceptions, having added {{/share/hadoop/hdfs}} to the classpath. Rather, we should check for these env vars in the wrapper script and display a reasonable error message.
- MAPREDUCE-3189.
Major improvement reported by tlipcon and fixed by tlipcon (mrv2)
Add link decoration back to MR2's CSS
I found the MRv2 web UI very difficult to use because it's not clear which items are links and which aren't. I'd like to change the CSS so that links are underlined, making it easier to see them (since they're also not in any different color)
- MAPREDUCE-3188.
Major bug reported by tlipcon and fixed by tlipcon (mrv2)
Lots of errors in logs when daemon startup fails
Since the MR2 daemons are made up of lots of component services, if one of those components fails to start, it will cause the others to shut down as well, even if they haven't fully finished starting up. Currently, this causes the error output to have a bunch of NullPointerExceptions, IllegalStateExceptions, etc, which mask the actual root cause error at the top.
- MAPREDUCE-3187.
Minor improvement reported by tlipcon and fixed by tlipcon (mrv2)
Add names for various unnamed threads in MR2
Simple patch to add thread names for all the places we use Executors, etc.
- MAPREDUCE-3186.
Blocker bug reported by ramgopalnaali and fixed by eepayne (mrv2)
User jobs are getting hanged if the Resource manager process goes down and comes up while job is getting executed.
New Yarn configuration property:
Name: yarn.app.mapreduce.am.scheduler.connection.retries
Description: Number of times AM should retry to contact RM if connection is lost.
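A bounded retry loop of the kind this property controls might look like the sketch below. It is a generic illustration rather than the AM's actual code: the names are hypothetical, and it assumes the configured value counts retries after the first attempt.

```java
import java.util.function.Supplier;

// Hypothetical helper for illustration; not the MR AM implementation.
final class RetrySketch {
    /**
     * Calls the supplier once and, on failure, up to maxRetries more times,
     * sleeping sleepMillis between attempts. Rethrows the last failure.
     */
    static <T> T callWithRetries(Supplier<T> call, int maxRetries, long sleepMillis) {
        if (maxRetries < 0) {
            throw new IllegalArgumentException("maxRetries must be >= 0");
        }
        RuntimeException last = null;
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            try {
                return call.get();
            } catch (RuntimeException e) {
                last = e; // remember the failure and try again if allowed
            }
            if (attempt < maxRetries) {
                try {
                    Thread.sleep(sleepMillis);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("Interrupted while waiting to retry", ie);
                }
            }
        }
        throw last;
    }
}
```

Giving up after a fixed number of attempts is what keeps a job from hanging indefinitely when the other side never comes back.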
- MAPREDUCE-3185.
Critical bug reported by mahadev and fixed by jeagles (mrv2)
RM Web UI does not sort the columns in some cases.
While running lots of jobs on an MRv2 cluster, the RM web UI shows this error on loading:
"DataTables warning (table id = 'apps'): Added data (size 8) does not match known number of columns (9)"
After ignoring the error, the column sorting on Web UI stops working.
- MAPREDUCE-3183.
Trivial bug reported by hitesh and fixed by hitesh (build)
hadoop-assemblies/src/main/resources/assemblies/hadoop-mapreduce-dist.xml missing license header
Re-assigning as this is part of the mavenization related changes and requires a delayed merge to the 23 branch.
- MAPREDUCE-3181.
Blocker bug reported by anupamseth and fixed by acmurthy (mrv2)
Terasort fails with Kerberos exception on secure cluster
We are seeing the following Kerberos exception upon trying to run terasort on secure single and multi-node clusters using the latest build from branch 0.23.
java.io.IOException: Can't get JobTracker Kerberos principal for use as renewer
at org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.java:106)
at org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.java:90)
at org.apache.hadoop.mapre...
- MAPREDUCE-3179.
Major bug reported by jeagles and fixed by jeagles (mrv2, test)
Incorrect exit code for hadoop-mapreduce-test tests when exception thrown
Exit code for test jar is 0 despite exception thrown
hadoop jar hadoop-mapreduce-test-0.23.0-SNAPSHOT.jar loadgen -Dmapreduce.job.acl-view -m 18 -r 0 -outKey org.apache.hadoop.io.Text -outValue org.apache.hadoop.io.Text -indir nonexistentdir
Loadgen output snippet
org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://machine.name.example.com:9000/user/exampleuser/nonexistentdir
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:23...
- MAPREDUCE-3176.
Blocker bug reported by raviprak and fixed by hitesh (mrv2, test)
ant mapreduce tests are timing out
Secondary YARN builds started taking inordinately long and lots of tests started failing. Usually the secondary build would take ~ 2 hours. But recently even after 7 hours it wasn't done.
- MAPREDUCE-3175.
Blocker sub-task reported by tgraves and fixed by jeagles (mrv2)
Yarn httpservers not created with access Control lists
RM, NM, job history, and application master httpservers are not created with access Control lists. I believe this means that anyone can access any of the standard servlets that check to see if the user has administrator access - like /jmx, /stacks, etc and ops has no way to restrict access to these things.
- MAPREDUCE-3171.
Major improvement reported by tucu00 and fixed by tucu00 (build)
normalize nodemanager native code compilation with common/hdfs native
Use same build pattern as used by common/hdfs native:
* rename src/c to src/native
* run autoreconf, configure and make under target not to pollute the src tree
* use maven-make-plugin in an identical way as in common/hdfs native
- MAPREDUCE-3170.
Critical bug reported by mahadev and fixed by hitesh (build, mrv1, mrv2)
Trunk nightly commit builds are failing.
Looks like the trunk commit builds are failing after MAPREDUCE-3148 and MAPREDUCE-3126 were committed. I suspect it's MAPREDUCE-3148.
- MAPREDUCE-3167.
Minor bug reported by mahadev and fixed by mahadev (mrv2)
container-executor is not being packaged with the assembly target.
Looks like MAPREDUCE-2988 broke this. This is a temporary fix until we get a full fledged maven dist tar working. Trivial fix.
- MAPREDUCE-3166.
Major bug reported by ravidotg and fixed by ravidotg (tools/rumen)
Make Rumen use job history api instead of relying on current history file name format
Makes Rumen use job history api instead of relying on current history file name format.
- MAPREDUCE-3165.
Blocker bug reported by acmurthy and fixed by tlipcon (applicationmaster, mrv2)
Ensure logging option is set on child command line
Currently the logging config is set in env in MapReduceChildJVM - we need to set it on command line.
- MAPREDUCE-3163.
Blocker bug reported by tlipcon and fixed by mahadev (job submission, mrv2)
JobClient spews errors when killing MR2 job
When I used the "hadoop job" command line to kill a running MR2 job, I got a bunch of error spew on the console, despite the kill actually taking effect.
- MAPREDUCE-3162.
Minor improvement reported by tlipcon and fixed by tlipcon (mrv2, nodemanager)
Separate application-init and container-init event types in NM's ApplicationImpl FSM
Currently, the ApplicationImpl receives an INIT_APPLICATION event on every container initialization. Only on the first one does it really mean to init the application, whereas all subsequent events are for specific containers. This JIRA is to separate the events into INIT_APPLICATION, sent once and only once per application, and INIT_CONTAINER, which is sent for every container. The first container sends INIT_APPLICATION followed by INIT_CONTAINER.
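The split between the two event types can be sketched as follows. This is a simplified model, not the NodeManager state machine: the class and method names are hypothetical, and only the two event names come from the description above.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical model for illustration; not the NM's ApplicationImpl.
final class AppEventSketch {
    enum EventType { INIT_APPLICATION, INIT_CONTAINER }

    // Applications that have already received INIT_APPLICATION.
    private final Set<String> initializedApps = new HashSet<>();

    /** Returns the events to send when a container of the given application starts. */
    List<EventType> eventsForContainerStart(String applicationId) {
        List<EventType> events = new ArrayList<>();
        // Set.add returns true only the first time an application is seen.
        if (initializedApps.add(applicationId)) {
            events.add(EventType.INIT_APPLICATION);
        }
        events.add(EventType.INIT_CONTAINER);
        return events;
    }
}
```

The first container of an application yields both events in order, and every later container yields only INIT_CONTAINER.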
- MAPREDUCE-3161.
Minor improvement reported by tlipcon and fixed by tlipcon (mrv2)
Improve javadoc and fix some typos in MR2 code
Just some simple cleanup, documentation, typos in variable names, etc. The only code change is to refactor ResourceLocalizationService so each event type is handled in its own method instead of a giant switch statement (just using eclipse's Extract Method - no semantic change)
- MAPREDUCE-3159.
Blocker bug reported by tlipcon and fixed by tlipcon (mrv2)
DefaultContainerExecutor removes appcache dir on every localization
The DefaultContainerExecutor currently has code that removes the application dir from appcache/ in the local directories on every task localization. This causes any concurrent executing tasks from the same job to fail.
- MAPREDUCE-3158.
Major bug reported by hitesh and fixed by hitesh (mrv2)
Fix trunk build failures
https://builds.apache.org/view/G-L/view/Hadoop/job/Hadoop-Mapreduce-trunk-Commit/1060/
- MAPREDUCE-3157.
Major bug reported by ravidotg and fixed by ravidotg (tools/rumen)
Rumen TraceBuilder is skipping analyzing 0.20 history files
Fixes TraceBuilder to handle 0.20 history file names also.
- MAPREDUCE-3154.
Major improvement reported by abhijit.shingate and fixed by abhijit.shingate (client, mrv2)
Validate the Jobs Output Specification as the first statement in JobSubmitter.submitJobInternal(Job, Cluster) method
Presently the output specification is validated after getting a new JobId from ClientRMService and copying the job jar, configuration file, archives, etc.
Instead, move the following job output specification validation call to the beginning of the JobSubmitter.submitJobInternal(Job, Cluster) method.
{code}
checkSpecs(job);
{code}
This will avoid unnecessary work in case of invalid output specs.
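The effect of the reordering can be shown with a small standalone model. Only the checkSpecs name comes from the description above; the class, the step names, and the boolean flag are hypothetical stand-ins for the real submission work.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model for illustration; not the real JobSubmitter.
final class SubmitOrderSketch {
    // Records which steps actually ran, in order.
    final List<String> steps = new ArrayList<>();

    void submit(boolean outputSpecValid) {
        checkSpecs(outputSpecValid);   // validate first, before any costly work
        steps.add("getNewJobId");      // stand-in for contacting the RM
        steps.add("copyJobResources"); // stand-in for copying jar, conf, archives
    }

    private void checkSpecs(boolean valid) {
        steps.add("checkSpecs");
        if (!valid) {
            throw new IllegalStateException("Invalid output specification");
        }
    }
}
```

With an invalid specification, the only recorded step is the check itself, so no job ID is requested and nothing is copied.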
- MAPREDUCE-3153.
Major bug reported by vinodkv and fixed by mahadev (mrv2, test)
TestFileOutputCommitter.testFailAbort() is failing on trunk on Jenkins
This is most likely caused by MAPREDUCE-2702.
- MAPREDUCE-3148.
Blocker sub-task reported by acmurthy and fixed by acmurthy (mrv2)
Port MAPREDUCE-2702 to old mapred api
Port MAPREDUCE-2702 to old mapred api
- MAPREDUCE-3146.
Critical sub-task reported by vinodkv and fixed by sseth (mrv2, nodemanager)
Add a MR specific command line to dump logs for a given TaskAttemptID
- MAPREDUCE-3144.
Critical sub-task reported by vinodkv and fixed by sseth (mrv2)
Augment JobHistory to include information needed for serving aggregated logs.
- MAPREDUCE-3141.
Blocker sub-task reported by vinodkv and fixed by vinodkv (applicationmaster, mrv2, security)
Yarn+MR secure mode is broken, uncovered after MAPREDUCE-3056
- MAPREDUCE-3140.
Major bug reported by kam_iitkgp and fixed by subrotosanyal (mrv2)
Invalid JobHistory URL for failed applications
After the application finished executing (it had failed, though), I clicked the JobHistory hyperlink displayed as part of the application details to check the job history. In this case, it displays [http://n/A].
- MAPREDUCE-3138.
Blocker bug reported by acmurthy and fixed by owen.omalley (client, mrv2)
Allow for applications to deal with MAPREDUCE-954
MAPREDUCE-954 changed the context-objs api to interfaces. This breaks Pig. We need a bridge for them to move to 0.23.
- MAPREDUCE-3137.
Trivial sub-task reported by hitesh and fixed by hitesh (mrv2)
Fix broken merge of MR-2719 to 0.23 branch for the distributed shell test case
- MAPREDUCE-3136.
Blocker sub-task reported by acmurthy and fixed by acmurthy (documentation, mrv2)
Add docs for setting up real-world MRv2 clusters
Add docs for setting up real-world MRv2 clusters - MR portion of http://hadoop.apache.org/common/docs/stable/cluster_setup.html
- MAPREDUCE-3134.
Blocker sub-task reported by acmurthy and fixed by acmurthy (documentation, mrv2, scheduler)
Add documentation for CapacityScheduler
Add documentation for CapacityScheduler in MRv2 similar to http://hadoop.apache.org/common/docs/stable/capacity_scheduler.html.
- MAPREDUCE-3133.
Major improvement reported by jeagles and fixed by jeagles (build)
Running a set of methods in a Single Test Class
Instead of running every test method in a class, limit the run to specific test methods as described in the link below.
http://maven.apache.org/plugins/maven-surefire-plugin/examples/single-test.html
Upgrade to the latest version of maven-surefire-plugin that has this feature.
- MAPREDUCE-3127.
Blocker sub-task reported by amolkekre and fixed by acmurthy (mrv2, resourcemanager)
Unable to restrict users based on resourcemanager.admin.acls value set
Setting the following property in yarn-site.xml with user ids to restrict the ability to run
'rmadmin -refreshQueues' is not honoured
<property>
<name>yarn.server.resourcemanager.admin.acls</name>
<value>hadoop1</value>
<description></description>
<final></final>
</property>
Should it be the same for rmadmin -refreshNodes?
- MAPREDUCE-3126.
Blocker bug reported by tgraves and fixed by acmurthy (mrv2)
mr job stuck because reducers using all slots and mapper isn't scheduled
The command in MAPREDUCE-3124 was run and the job hung with 1 map task waiting for resources and 7 reducers running (2 waiting). The mapper got scheduled, then the AM scheduled the reducers; the map task failed and tried to start a new attempt, but the reducers were using all the slots.
I will try to add some more info from the logs.
- MAPREDUCE-3125.
Critical bug reported by tgraves and fixed by hitesh (mrv2)
app master web UI shows reduce task progress 100% even though reducers not complete and state running/scheduled
Ran the same command as MAPREDUCE-3124. The app master web UI was displaying the reduce task progress as 100% even though the states were still running/scheduled. Each of those reduce tasks had attempts that had failed or been killed and another one that was unassigned. Attaching screenshots.
- MAPREDUCE-3124.
Blocker bug reported by tgraves and fixed by johnvijoe (mrv2)
mapper failed with failed to load native libs
hadoop jar hadoop-mapreduce-examples-*.jar sort -Dmapreduce.job.acl-view-job=* -Dmapreduce.map.output.compress=true -Dmapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.GzipCodec -Dmapreduce.output.fileoutputformat.compress=true -Dmapreduce.output.fileoutputformat.compression.type=NONE -Dmapreduce.output.fileoutputformat.compression.codec=org.apache.hadoop.io.compress.GzipCodec -outKey org.apache.hadoop.io.Text -outValue org.apache.hadoop.io.Text Compression/textinput Co...
- MAPREDUCE-3123.
Blocker bug reported by tgraves and fixed by hitesh (mrv2)
Symbolic links with special chars causing container/task.sh to fail
The following job throws an exception when the symbolic link name contains special characters.
hadoop jar hadoop-streaming.jar -Dmapreduce.job.acl-view-job=* -Dmapreduce.job.queuename=queue1 -files file:///homes/user/hadoop/Streaming/data/streaming-980//InputDir#testlink!@$&*()-_+= -input Streaming/streaming-980/input.txt -mapper 'xargs cat' -reducer cat -output Streaming/streaming-980/Output -jobconf mapred.job.name=streamingTest-980 -jobconf mapreduce.job.acl-view-job=*
Exception:
2011-09-27 20:58:48...
- MAPREDUCE-3114.
Major bug reported by subrotosanyal and fixed by subrotosanyal (mrv2)
Invalid ApplicationMaster URL in Applications Page
When the application is in the Accepted state and the user clicks the ApplicationMaster URL on the Applications page, it leads to an invalid HTTP URL.
The screenshot attached to this issue makes this clearer.
The HTTP url formed is: http://n/A
- MAPREDUCE-3113.
Minor improvement reported by xiexianshan and fixed by xiexianshan (mrv2)
the scripts yarn-daemon.sh and yarn are not working properly
When we execute them from any path other than $YARN_HOME with the bash -x option, they fail with the following error:
(Of course, the path of those scripts should be added to the PATH variable in .bashrc or profile in advance.)
{code}
/usr/share/hadoop/hadoop-mapreduce-0.24.0-SNAPSHOT/bin/yarn: line 55: /usr/share/hadoop/yarn-config.sh: No such file or directory
{code}
- MAPREDUCE-3112.
Major bug reported by eyang and fixed by eyang (contrib/streaming)
Calling hadoop cli inside mapreduce job leads to errors
Removed inheritance of certain server environment variables (HADOOP_OPTS and HADOOP_ROOT_LOGGER) in task attempt process.
- MAPREDUCE-3110.
Major bug reported by devaraj.k and fixed by vinodkv (mrv2, test)
TestRPC.testUnknownCall() is failing
{code:xml}
Failed tests:
testUnknownCall(org.apache.hadoop.yarn.TestRPC): null expected:<...icationId called on []org.apache.hadoop.ya...> but was:<...icationId called on [interface ]org.apache.hadoop.ya...>
Tests run: 65, Failures: 1, Errors: 0, Skipped: 0
{code}
- MAPREDUCE-3104.
Blocker sub-task reported by vinodkv and fixed by vinodkv (mrv2, resourcemanager, security)
Implement Application ACLs, Queue ACLs and their interaction
- MAPREDUCE-3103.
Blocker sub-task reported by vinodkv and fixed by mahadev (mrv2, security)
Implement Job ACLs for MRAppMaster
- MAPREDUCE-3099.
Major sub-task reported by mahadev and fixed by mahadev
Add docs for setting up a single node MRv2 cluster.
- MAPREDUCE-3098.
Blocker sub-task reported by hitesh and fixed by hitesh (mrv2)
Report Application status as well as ApplicationMaster status in GetApplicationReportResponse
Currently, an application report received by the client from the RM/ASM for a given application returns the status of the application master. It does not return the status of the application i.e. whether that particular job succeeded or failed.
The AM status would be one of FINISHED (SUCCEEDED should be renamed to FINISHED as AM state does not indicate overall success/failure), FAILED or KILLED.
The final state sent by the AM to the RM in the FinishApplicationMasterRequest should be expose...
- MAPREDUCE-3095.
Major bug reported by johnvijoe and fixed by johnvijoe (mrv2)
fairscheduler ivy including wrong version for hdfs
fairscheduler ivy.xml includes the common version for the hdfs dependency. This could break builds that have different common and hdfs version numbers. The reason we don't see it on the Jenkins build is that we use the same version number for common and hdfs.
- MAPREDUCE-3092.
Minor bug reported by devaraj.k and fixed by devaraj.k (mrv2)
Remove JOB_ID_COMPARATOR usage in JobHistory.java
As part of the defect MAPREDUCE-2965, JobId.compareTo() has been implemented. Usage of JOB_ID_COMPARATOR in JobHistory.java can be removed because the comparison is handled by JobId itself.
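The general point, that a key type with its own compareTo() needs no external comparator in sorted collections, can be sketched as follows. The class below is a hypothetical stand-in, and the choice of fields and their ordering is an assumption made for this example, not the real JobId definition.

```java
import java.util.TreeMap;

// Hypothetical stand-in for illustration; not the real JobId class.
final class JobIdSketch implements Comparable<JobIdSketch> {
    final long clusterTimestamp;
    final int id;

    JobIdSketch(long clusterTimestamp, int id) {
        this.clusterTimestamp = clusterTimestamp;
        this.id = id;
    }

    // Natural ordering (assumed): by cluster timestamp, then by sequence number.
    @Override
    public int compareTo(JobIdSketch other) {
        int byTimestamp = Long.compare(clusterTimestamp, other.clusterTimestamp);
        return byTimestamp != 0 ? byTimestamp : Integer.compare(id, other.id);
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof JobIdSketch && compareTo((JobIdSketch) o) == 0;
    }

    @Override
    public int hashCode() {
        return Long.hashCode(clusterTimestamp) * 31 + id;
    }

    /** A sorted map keyed by job ID that relies on the natural ordering. */
    static TreeMap<JobIdSketch, String> newSortedJobs() {
        return new TreeMap<>(); // no Comparator argument needed
    }
}
```

Because the map uses the key's natural ordering, a separately maintained comparator would only duplicate the same logic.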
- MAPREDUCE-3090.
Major improvement reported by acmurthy and fixed by acmurthy (applicationmaster, mrv2)
Change MR AM to use ApplicationAttemptId rather than <applicationId, startCount> everywhere
Change MR AM to use ApplicationAttemptId rather than <applicationId, startCount> everywhere, particularly after MAPREDUCE-3055
- MAPREDUCE-3087.
Critical bug reported by raviprak and fixed by raviprak (mrv2)
CLASSPATH not the same after MAPREDUCE-2880
After MAPREDUCE-2880, my classpath was missing key jar files.
- MAPREDUCE-3081.
Major bug reported by vitthal_gogate and fixed by (contrib/vaidya)
Change the name format for hadoop core and vaidya jar to be hadoop-{core/vaidya}-{version}.jar in vaidya.sh
contrib/vaidya/bin/vaidya.sh script fixed to use appropriate jars and classpath
- MAPREDUCE-3078.
Blocker bug reported by vinodkv and fixed by vinodkv (applicationmaster, mrv2, resourcemanager)
Application's progress isn't updated from AM to RM.
It helps to be able to monitor the application-progress from the RM UI itself.
Bits of it are already there, even the AM-RM API (in AllocateRequest). We just need to make sure the progress is produced and consumed properly.
- MAPREDUCE-3073.
Blocker bug reported by mahadev and fixed by mahadev
Build failure for MRv1 caused due to changes to MRConstants.
When running ant -Dresolvers=internal binary, the build seems to be failing with:
[javac] public class JobTracker implements MRConstants, InterTrackerProtocol,
[javac] ^
[javac] /home/y/var/builds/thread2/workspace/Cloud-Yarn-0.23-Secondary/hadoop-mapreduce-project/src/java/org/apache/hadoop/mapred/TaskTracker.java:131: interface expected here
[javac] implements MRConstants, TaskUmbilicalProtocol, Runnable, TTConfig {
[javac] ...
- MAPREDUCE-3071.
Major bug reported by tgraves and fixed by tgraves (mrv2)
app master configuration web UI link under the Job menu opens up application menu
If you go to the app master web UI for a particular job, the job menu on the left side displays links for overview, counters, configuration, etc.
If you click on the configuration link, it closes the job menu and opens the application menu on the left side. It shouldn't do this; it should leave the job menu open.
- MAPREDUCE-3070.
Blocker bug reported by raviteja and fixed by devaraj.k (mrv2, nodemanager)
NM not able to register with RM after NM restart
After stopping the NM gracefully and then starting it again, NM registration with the RM fails with a "Duplicate registration from the node!" error.
{noformat}
2011-09-23 01:50:46,705 FATAL nodemanager.NodeManager (NodeManager.java:main(204)) - Error starting NodeManager
org.apache.hadoop.yarn.YarnException: Failed to Start org.apache.hadoop.yarn.server.nodemanager.NodeManager
at org.apache.hadoop.yarn.service.CompositeService.start(CompositeService.java:78)
at org.apache.hadoop.yarn.server.nodemanager.NodeMa...
- MAPREDUCE-3068.
Blocker bug reported by vinodkv and fixed by criccomini (mrv2)
Should set MALLOC_ARENA_MAX for all YARN daemons and AMs/Containers
This is the same as HADOOP-7154 but for YARN. The RM, NM, AM and containers should all have this.
- MAPREDUCE-3067.
Blocker bug reported by hitesh and fixed by hitesh (mrv2)
Container exit status not set properly to launched process's exit code on successful completion of process
When testing the distributed shell sample app master, the container exit status was being returned incorrectly.
11/09/21 11:32:58 INFO DistributedShell.ApplicationMaster: Got container status for containerID= container_1316629955324_0001_01_000002, state=COMPLETE, exitStatus=-1000, diagnostics=
- MAPREDUCE-3066.
Major bug reported by criccomini and fixed by criccomini (mrv2, nodemanager)
YARN NM fails to start
Please check conf.get() calls. Every time I svn up, I get one of these.
2011-09-21 15:36:33,534 INFO service.AbstractService (AbstractService.java:stop(71)) - Service:org.apache.hadoop.yarn.server.nodemanager.DeletionService is stopped.
2011-09-21 15:36:33,534 FATAL nodemanager.NodeManager (NodeManager.java:main(204)) - Error starting NodeManager
org.apache.hadoop.yarn.YarnException: Failed to Start org.apache.hadoop.yarn.server.nodemanager.NodeManager
at org.apache.hadoop.yarn.service.Co...
- MAPREDUCE-3064.
Blocker bug reported by tgraves and fixed by venug
27 unit test failures with Invalid "mapreduce.jobtracker.address" configuration value for JobTracker: "local"
unit test failure here: https://builds.apache.org/view/G-L/view/Hadoop/job/Hadoop-Mapreduce-trunk-Commit/946/
Test Result (27 failures / +27)
org.apache.hadoop.mapred.TestCollect.testCollect
org.apache.hadoop.mapred.TestComparators.testDefaultMRComparator
org.apache.hadoop.mapred.TestComparators.testUserMRComparator
org.apache.hadoop.mapred.TestComparators.testUserValueGroupingComparator
org.apache.hadoop.mapred.TestComparators.testAllUserComparators
org.apache.hado...
- MAPREDUCE-3062.
Major bug reported by criccomini and fixed by criccomini (mrv2, nodemanager, resourcemanager)
YARN NM/RM fail to start
2011-09-21 10:21:41,932 FATAL resourcemanager.ResourceManager (ResourceManager.java:main(502)) - Error starting ResourceManager
java.lang.RuntimeException: Not a host:port pair: yarn.resourcemanager.admin.address
at org.apache.hadoop.net.NetUtils.createSocketAddr(NetUtils.java:148)
at org.apache.hadoop.net.NetUtils.createSocketAddr(NetUtils.java:132)
at org.apache.hadoop.yarn.server.resourcemanager.AdminService.init(AdminService.java:88)
at org.apache.hadoop.yarn.service.CompositeService....
- MAPREDUCE-3059.
Blocker bug reported by karams and fixed by devaraj.k (mrv2)
QueueMetrics do not have metrics for aggregate containers-allocated and aggregate containers-released
QueueMetrics for ResourceManager do not have any metrics for aggregate containers-allocated and containers-released.
We need the aggregates of containers-allocated and containers-released to figure out the rate at which the RM is dishing out containers. NodeManagers do have containers-launched and container-released metrics, but these are per node; to get the cluster-level aggregate, we would need to preprocess NM metrics from all nodes, which is troublesome.
Currently, we do have Alloca...
- MAPREDUCE-3058.
Critical bug reported by karams and fixed by vinodkv (contrib/gridmix, mrv2)
Sometimes task keeps on running while its Syslog says that it is shutdown
While running GridMixV3, one of the jobs got stuck for 15 hrs. After clicking on the Job-page, found one of its reduces to be stuck. Looking at syslog of the stuck reducer, found this:
Task-logs' head:
{code}
2011-09-19 17:57:22,002 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Scheduled snapshot period at 10 second(s).
2011-09-19 17:57:22,002 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: ReduceTask metrics system started
{code}
Task-logs' tail:
{code}
2011-09-19 18:06:4...
- MAPREDUCE-3057.
Blocker bug reported by karams and fixed by eepayne (jobhistoryserver, mrv2)
Job History Server goes OutOfMemory with 1200 Jobs and Heap Size set to 10 GB
History server was started with -Xmx10000m
Ran GridMix V3 with 1200 Jobs trace in STRESS mode on 350 nodes with each node 4 NMS.
All jobs finished as reported by RM Web UI and HADOOP_MAPRED_HOME/bin/mapred job -list all
But found that GridMix job client was stuck while trying to connect to HistoryServer
Then tried to do HADOOP_MAPRED_HOME/bin/mapred job -status jobid
JobClient also got stuck while looking for token to connect to History server
Then looked at History Server logs and found History...
- MAPREDUCE-3056.
Blocker bug reported by devaraj.k and fixed by devaraj.k (applicationmaster, mrv2)
Jobs are failing when those are submitted by other users
MR cluster is started by the user 'root'. If any user other than 'root' submits a job, it always fails.
Find the container logs in the comments section.
- MAPREDUCE-3055.
Minor bug reported by hitesh and fixed by vinodkv (mrv2)
Simplify parameter passing to Application Master from Client. Simplify approach to pass info such as appId, ClusterTimestamp and failcount required by App Master.
The Application master needs the application attempt id to register with the Applications Manager. To create an appAttemptId object, the appId object (which needs the cluster timestamp and app id) and failCount are needed.
Currently, all clients need to pass in the appId, cluster timestamp and fail count to the app master for the required objects to be constructed.
We could look at simplifying this by providing either placeholders that would have values replaced by the app master launcher or setting ...
- MAPREDUCE-3054.
Blocker bug reported by sseth and fixed by mahadev (mrv2)
Unable to kill submitted jobs
Found by Philip Su
The "mapred job -kill" command
appears to succeed, but listing the jobs again shows that the job supposedly killed is still there.
{code}
mapred job -list
Total jobs:2
JobId State StartTime UserName Queue Priority SchedulingInfo
job_1316203984216_0002 PREP 1316204924937 hadoopqa default NORMAL
job_1316203984216_0001 PREP 1316204031206 hadoopqa default NORMAL
mapred job -kill job_1316203984216_0002
Killed job job_131620...
- MAPREDUCE-3053.
Major bug reported by criccomini and fixed by vinodkv (mrv2, resourcemanager)
YARN Protobuf RPC Failures in RM
When I try to register my ApplicationMaster with YARN's RM, it fails.
In my ApplicationMaster's logs:
Exception in thread "main" java.lang.reflect.UndeclaredThrowableException
at org.apache.hadoop.yarn.api.impl.pb.client.AMRMProtocolPBClientImpl.registerApplicationMaster(AMRMProtocolPBClientImpl.java:108)
at kafka.yarn.util.ApplicationMasterHelper.registerWithResourceManager(YarnHelper.scala:48)
at kafka.yarn.ApplicationMaster$.main(ApplicationMaster.scala:32)
at kafka.yarn.ApplicationM...
- MAPREDUCE-3050.
Blocker bug reported by revans2 and fixed by revans2 (mrv2, resourcemanager)
YarnScheduler needs to expose Resource Usage Information
Before the recent refactor, the nodes held information about how many resources they were using. This information is now hidden inside SchedulerNode. Similarly, resource usage information about an application, or in aggregate, is only available through the Scheduler, and there is no interface to pull it out.
We need to expose APIs to get Resource and Container information from the scheduler, in aggregate across the entire cluster, per application, per node, and ideally also per queue i...
- MAPREDUCE-3048.
Major bug reported by vinodkv and fixed by vinodkv (build)
Fix test-patch to run tests via "mvn clean install test"
Some tests like the ones failing at MAPREDUCE-3040 depend on the generated jars. TestMRJobs for e.g. won't run if we simply run "mvn clean test".
I propose that we change test-patch to run tests using "mvn clean install test".
- MAPREDUCE-3044.
Blocker bug reported by rramya and fixed by mahadev (mrv2)
Pipes jobs stuck without making progress
A simple example pipes job gets stuck without making any progress. The AM is launched but the maps do not make any progress.
- MAPREDUCE-3042.
Major bug reported by criccomini and fixed by criccomini (mrv2, resourcemanager)
YARN RM fails to start
Simple typo fix to allow ResourceManager to start instead of fail
- MAPREDUCE-3041.
Blocker bug reported by hitesh and fixed by hitesh (mrv2)
Enhance YARN Client-RM protocol to provide access to information such as cluster's Min/Max Resource capabilities similar to that of AM-RM protocol
To request a container to launch an application master, the client needs to know the min/max resource capabilities so as to be able to make a proper resource request when submitting a new application.
- MAPREDUCE-3040.
Major bug reported by tgraves and fixed by acmurthy (mrv2)
TestMRJobs, TestMRJobsWithHistoryService, TestMROldApiJobs fail
Running org.apache.hadoop.mapreduce.v2.TestMRJobs
Tests run: 4, Failures: 0, Errors: 4, Skipped: 0, Time elapsed: 6.229 sec <<< FAILURE!
Running org.apache.hadoop.mapreduce.v2.TestMRJobsWithHistoryService
Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 5.887 sec <<< FAILURE!
Running org.apache.hadoop.mapreduce.v2.TestMROldApiJobs
Tests run: 2, Failures: 0, Errors: 2, Skipped: 0, Time elapsed: 6.067 sec <<< FAILURE!
All of them have the exception:
java.lang.NullPointerExcept...
- MAPREDUCE-3038.
Blocker bug reported by tgraves and fixed by naisbitt (mrv2)
job history server not starting because conf() missing HsController
Exception starting history server.
Sep 19, 2011 6:51:53 PM com.google.inject.MessageProcessor visit
INFO: An exception was caught and reported. Message: org.apache.hadoop.yarn.webapp.WebAppException: conf() not found in class org.apache.hadoop.mapreduce.v2.hs.webapp.HsController org.apache.hadoop.yarn.webapp.WebAppException: conf() not found in class org.apache.hadoop.mapreduce.v2.hs.webapp.HsController
at o...
- MAPREDUCE-3036.
Blocker bug reported by revans2 and fixed by revans2 (mrv2)
Some of the Resource Manager memory metrics go negative.
ReservedGB seems to always be decremented when a container is released, even though the container never reserved any memory.
AvailableGB also seems to be able to go negative in a few situations.
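The ReservedGB symptom can be illustrated with a small sketch in which a release only subtracts memory that was actually recorded as reserved. This is an assumption-laden illustration, not the fix that was committed: the class, its fields, and the string container IDs are all hypothetical and do not mirror the RM's metrics classes.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical metrics holder; names are illustrative, not the RM's own classes. */
final class ReservedMemorySketch {
    // Reservation size in GB, keyed by an illustrative container identifier.
    private final Map<String, Integer> reservedByContainer = new HashMap<String, Integer>();
    private int reservedGB = 0;

    /** Record (or replace) a reservation for a container. */
    synchronized void reserve(String containerId, int gb) {
        Integer previous = reservedByContainer.put(containerId, gb);
        reservedGB += gb - (previous == null ? 0 : previous);
    }

    /** Release a container; only a recorded reservation is subtracted. */
    synchronized void release(String containerId) {
        Integer reserved = reservedByContainer.remove(containerId);
        if (reserved != null) {
            reservedGB -= reserved;
        }
    }

    synchronized int getReservedGB() {
        return reservedGB;
    }
}
```

With this guard, releasing a container that never reserved memory, or releasing the same container twice, leaves the counter unchanged, so it cannot drift below zero.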
- MAPREDUCE-3035.
Critical bug reported by karams and fixed by chaku88 (mrv2)
MR V2 jobhistory does not contain rack information
When topology.node.switch.mapping.impl is set to enable rack-locality resolution via the topology script, from the RM web-UI, we can see the rack information for each node. Running a job also reveals the information about rack-local map tasks launched at end of job completion on the client side.
But the hostname field for attempts in the JobHistory does not contain this rack information.
In case of hadoop-0.20 security or MRV1, hostname field of job history does contain rackid/hostname where...
- MAPREDUCE-3033.
Blocker bug reported by karams and fixed by hitesh (job submission, mrv2)
JobClient requires mapreduce.jobtracker.address config even when mapreduce.framework.name is set to yarn
If mapreduce.jobtracker.address is not set in mapred-site.xml and mapreduce.framework.name is set to yarn, job submission fails:
Tried to submit a sleep job with 1 map task. Job submission failed with the following exception:
{code}
11/09/19 13:19:20 INFO ipc.YarnRPC: Creating YarnRPC for org.apache.hadoop.yarn.ipc.HadoopYarnProtoRPC
11/09/19 13:19:20 INFO mapred.ResourceMgrDelegate: Connecting to ResourceManager at <RMHost>:8040
11/09/19 13:19:20 INFO ipc.HadoopYarnRPC: Creating a HadoopYarnProt...
- MAPREDUCE-3032.
Blocker bug reported by vinodkv and fixed by devaraj.k (applicationmaster, mrv2)
JobHistory doesn't have error information from failed tasks
- MAPREDUCE-3031.
Blocker bug reported by karams and fixed by sseth (mrv2)
Job Client goes into infinite loop when we kill AM
Started a cluster. Submitted a sleep job with around 10000 maps and 1000 reduces.
Killed AM with kill -9, by which time around 7000 maps had already completed.
On the RM webUI, Application is stuck in Application.RUNNING state. And JobClient goes into an infinite loop as RM keeps telling the client that the application is running.
- MAPREDUCE-3030.
Blocker bug reported by devaraj.k and fixed by devaraj.k (mrv2, resourcemanager)
RM is not processing heartbeat and continuously giving the message 'Node not found rebooting'
{code:title=Node Manager Logs|borderStyle=solid}
2011-09-19 13:39:29,816 INFO webapp.WebApps (WebApps.java:start(162)) - Registered webapp guice modules
2011-09-19 13:39:29,817 INFO service.AbstractService (AbstractService.java:start(61)) - Service:org.apache.hadoop.yarn.server.nodemanager.webapp.WebServer is started.
2011-09-19 13:39:29,818 INFO service.AbstractService (AbstractService.java:start(61)) - Service:Dispatcher is started.
2011-09-19 13:39:29,819 INFO nodemanager.NodeStatusUpd...
- MAPREDUCE-3028.
Blocker bug reported by kamrul and fixed by raviprak (mrv2)
Support job end notification in .next /0.23
Oozie primarily depends on the job end notification to determine when the job finishes. In the current version, job end notification is implemented in job tracker. Since job tracker will be removed in the upcoming hadoop release (.next), we wonder where this support will move. I think this best effort notification could be implemented in the new Application Manager as one of the last steps of job completion.
Whatever implementation will it be, Oozie badly needs this feature to be continued ...
- MAPREDUCE-3023.
Major bug reported by raviprak and fixed by raviprak (mrv2)
Queue state is not being translated properly (is always assumed to be running)
During translation of QueueInfo,
bq. TypeConverter.java:435 : queueInfo.toString(), QueueState.RUNNING,
ought to be
bq. queueInfo.toString(), QueueState.getState(queueInfo.getQueueState().toString().toLowerCase()),
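The proposed change amounts to looking the state up by name instead of hard-coding RUNNING. A minimal, self-contained sketch of that lookup follows; the `SketchQueueState` enum and `QueueStateTranslationSketch` class are stand-ins written for this illustration and are not the Hadoop classes referenced above.

```java
import java.util.Locale;

/** Illustrative stand-in for a queue-state enum; not the Hadoop class. */
enum SketchQueueState {
    STOPPED("stopped"), RUNNING("running"), UNDEFINED("undefined");

    private final String stateName;

    SketchQueueState(String stateName) {
        this.stateName = stateName;
    }

    /** Look up a state by its lower-case name; unknown names map to UNDEFINED. */
    static SketchQueueState getState(String name) {
        for (SketchQueueState s : values()) {
            if (s.stateName.equals(name)) {
                return s;
            }
        }
        return UNDEFINED;
    }
}

final class QueueStateTranslationSketch {
    /** Translate the reported state by name instead of assuming RUNNING. */
    static SketchQueueState translate(Object reportedState) {
        String name = reportedState.toString().toLowerCase(Locale.ROOT);
        return SketchQueueState.getState(name);
    }
}
```

Lower-casing with an explicit locale avoids surprises in locales where case conversion of ASCII letters differs.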
- MAPREDUCE-3021.
Major bug reported by tgraves and fixed by tgraves (mrv2)
all yarn webapps use same base name of "yarn/"
All of the yarn webapps (resource manager, node manager, app master, job history) use the same base url of /yarn/. This makes it hard for filters to differentiate them, for example to allow some to be unauthenticated and others to be authenticated. Perhaps we should rename them based on component.
There are also things in the code that hardcode paths to "/yarn" that should be fixed up.
- MAPREDUCE-3020.
Major bug reported by chaku88 and fixed by chaku88 (jobhistoryserver)
Node link in reduce task attempt page is not working [Job History Page]
RM UI -> Applications -> Application(Job History) -> Reduce Tasks -> Task ID -> Node link is not working
The hostname for ReduceAttemptFinishedEvent is wrong when loaded from the history file.
- MAPREDUCE-3018.
Blocker bug reported by mahadev and fixed by mahadev (mrv2)
Streaming jobs with -file option fail to run.
Streaming jobs fail to run with the -file option.
hadoop jar streaming.jar -input input.txt -output Out -mapper "mapper.sh" -reducer NONE -file path_to_mapper.sh
fails to run.
- MAPREDUCE-3017.
Blocker bug reported by mahadev and fixed by mahadev (mrv2)
The Web UI shows FINISHED for killed/successful/failed jobs.
The RM web ui shows FINISHED status for all the jobs even if they failed/killed or were successful. This should be fixed. Only the jobs where the AM crashes are marked as Failed.
- MAPREDUCE-3014.
Major improvement reported by tucu00 and fixed by tucu00 (build)
Rename and invert logic of '-cbuild' profile to 'native' and off by default
This would align MR modules with common & hdfs modules.
- MAPREDUCE-3013.
Major sub-task reported by vinodkv and fixed by vinodkv (mrv2, security)
Remove YarnConfiguration.YARN_SECURITY_INFO
We don't need this anymore since RPC client uses SecurityUtil to pick it up via going through the providers for SecurityInfo interface.
- MAPREDUCE-3007.
Major sub-task reported by vinodkv and fixed by vinodkv (jobhistoryserver, mrv2)
JobClient cannot talk to JobHistory server in secure mode
In secure mode, Jobclient cannot connect to HistoryServer. Thanks to [~karams] for finding this out.
{code}
11/09/14 09:57:51 INFO mapred.ClientServiceDelegate: Application state is completed. Redirecting to job history server
11/09/14 09:57:51 INFO security.ApplicationTokenSelector: Looking for a token with service <history-server>:10020
11/09/14 09:57:51 INFO security.ApplicationTokenSelector: Token kind is YARN_APPLICATION_TOKEN and the token's service name is <Am-ip>:46257
11/09/14 09:57...
- MAPREDUCE-3006.
Major bug reported by vinodkv and fixed by vinodkv (applicationmaster, mrv2)
MapReduce AM exits prematurely before completely writing and closing the JobHistory file
[~Karams] was executing a sleep job with 100,000 tasks on a 350 node cluster to test MR AM's scalability and ran into this. The job ran successfully but the history was not available.
I debugged around and figured that the job is finishing prematurely before the JobHistory is written. In most of the cases, we don't see this bug as we have a 5 seconds sleep in AM towards the end.
- MAPREDUCE-3005.
Major bug reported by vinodkv and fixed by acmurthy (mrv2)
MR app hangs because of a NPE in ResourceManager
The app hangs and it turns out to be a NPE in ResourceManager. This happened two of five times on [~karams]'s sort runs on a big cluster.
{code}
2011-09-12 15:02:33,715 ERROR org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in handling event type NODE_UPDATE to the scheduler
java.lang.NullPointerException
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo.allocateNodeLocal(AppSchedulingInfo.java:244)
at org.apache.hadoop.yarn.serve...
- MAPREDUCE-3004.
Minor bug reported by hitesh and fixed by hitesh (mrv2)
sort example fails in shuffle/reduce stage as it assumes a local job by default
Log trace when running sort on a single node setup:
11/09/13 17:01:06 INFO mapreduce.Job: map 100% reduce 0%
11/09/13 17:01:10 INFO mapreduce.Job: Task Id : attempt_1315949787252_0009_r_000000_0, Status : FAILED
java.lang.UnsupportedOperationException: Incompatible with LocalRunner
at org.apache.hadoop.mapred.YarnOutputFiles.getInputFile(YarnOutputFiles.java:200)
at org.apache.hadoop.mapred.ReduceTask.getMapFiles(ReduceTask.java:183)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask....
- MAPREDUCE-3003.
Major bug reported by tomwhite and fixed by tucu00 (build)
Publish MR JARs to Maven snapshot repository
Currently this is failing since no distribution management section is defined in the POM.
https://builds.apache.org/view/G-L/view/Hadoop/job/Hadoop-Common-trunk-Commit/883/consoleFull
- MAPREDUCE-3001.
Blocker improvement reported by revans2 and fixed by revans2 (jobhistoryserver, mrv2)
Map Reduce JobHistory and AppMaster UI should have ability to display task specific counters.
Map Reduce JobHistory and AppMaster UI should have ability to display task specific counters. I think the best way to do this is to include in the Nav Block a task specific section with task links when a task is selected. Counters is already set up to deal with a task passed in.
- MAPREDUCE-2999.
Critical bug reported by tgraves and fixed by tgraves (mrv2)
hadoop.http.filter.initializers not working properly on yarn UI
Currently httpserver only has "*.html", "*.jsp" as user facing urls when you add a filter. For the new web framework in yarn, the pages no longer have the *.html or *.jsp suffix and thus they are not properly being filtered.
- MAPREDUCE-2998.
Critical bug reported by naisbitt and fixed by vinodkv (mrv2)
Failing to contact Am/History for jobs: java.io.EOFException in DataInputStream
I am getting an exception frequently when running my jobs on a single-node cluster. It happens with basically any job I run: sometimes the job will work, but most of the time I get this exception (in this case, I was running a simple wordcount from the examples jar - where I got the exception 4 times in a row, and then the job worked the fifth time I submitted it).
Sometimes restarting the namenode, resourcemanager, and historyserver helps - but not always. Several other developers have se...
- MAPREDUCE-2997.
Major bug reported by vinodkv and fixed by vinodkv (applicationmaster, mrv2)
MR task fails before launch itself with an NPE in ContainerLauncher
Exception found on the AM web UI while the application is running:
{code}
Container launch failed for container_1315908079531_0002_01_000387 : java.lang.NullPointerException
at org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl.getCMProxy(ContainerLauncherImpl.java:162)
at org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$EventProcessor.run(ContainerLauncherImpl.java:204)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886...
- MAPREDUCE-2996.
Blocker bug reported by vinodkv and fixed by jeagles (jobhistoryserver, mrv2)
Log uberized information into JobHistory and use the same via CompletedJob
We always print the uberized info on the UI to be false irrespective of whether it is uberized or not.
- MAPREDUCE-2995.
Major bug reported by vinodkv and fixed by vinodkv (mrv2)
MR AM crashes when a container-launch hangs on a faulty NM
AM tries to launch containers on a faulty node which blocks several/all of the {{StartContainer}} requests. Eventually, RM expires the container-allocations, informs the AM about container-expiry. But AM crashes with an INTERNAL_ERROR as the event is unexpected.
{code}
11/09/12 14:11:38 ERROR impl.TaskAttemptImpl: Can't handle this event at current state
org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: TA_CONTAINER_COMPLETED at ASSIGNED
at org.apache.hadoop....
- MAPREDUCE-2994.
Major bug reported by devaraj.k and fixed by devaraj.k (mrv2, resourcemanager)
Parse Error is coming for App ID when we click application link on the RM UI
{code:xml}
Caused by: org.apache.hadoop.yarn.YarnException: Error parsing app ID: application_1315895242400_1
at org.apache.hadoop.yarn.util.Apps.throwParseException(Apps.java:60)
at org.apache.hadoop.yarn.util.Apps.toAppID(Apps.java:43)
at org.apache.hadoop.yarn.util.Apps.toAppID(Apps.java:38)
at org.apache.hadoop.yarn.server.resourcemanager.webapp.RmController.app(RmController.java:74)
... 30 more
{code}
- MAPREDUCE-2991.
Major bug reported by priyomustafi and fixed by priyomustafi (scheduler)
queueinfo.jsp fails to show queue status if any Capacity scheduler queue name has dash/hyphen in it.
If any queue name has a dash/hyphen in it, the queueinfo.jsp doesn't show any queue information. This is happening because the queue name is used to create javascript variables and javascript doesn't allow dash in variable names.
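One way to avoid this class of problem is to map the queue name to a safe identifier before emitting it as a JavaScript variable. The sketch below is an assumption about a possible approach, not the change that was committed; the class name, method name, and `q_` prefix are hypothetical.

```java
/** Hypothetical helper; not part of queueinfo.jsp or the Hadoop code base. */
final class JsIdentifierSketch {
    /**
     * Map an arbitrary queue name to a string that is a valid JavaScript
     * identifier: a fixed prefix followed by ASCII letters, digits and '_'.
     */
    static String toJsIdentifier(String queueName) {
        StringBuilder sb = new StringBuilder("q_");
        for (int i = 0; i < queueName.length(); i++) {
            char c = queueName.charAt(i);
            boolean allowed = (c >= 'a' && c <= 'z')
                    || (c >= 'A' && c <= 'Z')
                    || (c >= '0' && c <= '9')
                    || c == '_';
            // Replace anything else (such as '-') with an underscore.
            sb.append(allowed ? c : '_');
        }
        return sb.toString();
    }
}
```

Note that this mapping is not collision-free: `a-b` and `a_b` produce the same identifier, so a real page would need to handle that case or key its data by the original name.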
- MAPREDUCE-2990.
Blocker improvement reported by mahadev and fixed by subrotosanyal (mrv2)
Health Report on Resource Manager UI is null if the NM's are all healthy.
The web UI on the RM for the link Nodes shows that Health-report as null when the NM is healthy.
This is a simple fix wherein we can check for null in NodesPage.java and put something meaningful instead of null.
NodesPage.java:
{code}
render(..)
td((health.getHealthReport() == null) ?"REPORT HEALTHY": health.getHealthReport());
{code}
Or something like that.
- MAPREDUCE-2989.
Critical sub-task reported by sseth and fixed by sseth (mrv2)
JobHistory should link to task logs
The log link on the task attempt page is currently broken - since it relies on a ContainerId. We should either pass the containerId via a history event - or some kind of field with information about the log location.
- MAPREDUCE-2988.
Critical sub-task reported by eepayne and fixed by revans2 (mrv2, security, test)
Reenable TestLinuxContainerExecutor reflecting the current NM code.
TestLinuxContainerExecutor is currently disabled completely.
- MAPREDUCE-2987.
Major bug reported by tgraves and fixed by tgraves (mrv2)
RM UI displays logged in user as null
All the pages of the UI currently show "Logged in as: null" instead of the correct username
- MAPREDUCE-2986.
Critical task reported by anupamseth and fixed by anupamseth (mrv2, test)
Multiple node managers support for the MiniYARNCluster
The current MiniYARNCluster can only support 1 node manager, which is not enough for full testing purposes.
Would like to have a simulator that can support multiple node managers, as in a real scenario. This might be beneficial for hadoop users, testers and developers.
- MAPREDUCE-2985.
Major bug reported by tgraves and fixed by tgraves (mrv2)
findbugs error in ResourceLocalizationService.handle(LocalizationEvent)
hudson mapreduce is reporting a findbugs error:
https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/707//artifact/trunk/hadoop-mapreduce-project/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-nodemanager.html
WMI Method org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.handle(LocalizationEvent) makes inefficient use of keySet iterator instead of entrySet iterator
Bug type WMI_WRONG_MAP_ITERATOR (click for details)
In class org.a...
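The WMI_WRONG_MAP_ITERATOR pattern can be illustrated in isolation. The sketch below is generic and is not the code from ResourceLocalizationService; the class and the sample data are made up for the example. Both methods compute the same result, but the first performs an extra map lookup for every key.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Generic illustration of keySet versus entrySet iteration. */
final class MapIterationSketch {
    /** Pattern flagged by FindBugs: iterate keys, then look each value up again. */
    static int sumViaKeySet(Map<String, Integer> m) {
        int total = 0;
        for (String key : m.keySet()) {
            total += m.get(key); // extra lookup per key
        }
        return total;
    }

    /** Preferred pattern: iterate entries and read the value directly. */
    static int sumViaEntrySet(Map<String, Integer> m) {
        int total = 0;
        for (Map.Entry<String, Integer> e : m.entrySet()) {
            total += e.getValue();
        }
        return total;
    }

    /** Small sample map used only for demonstration. */
    static Map<String, Integer> sample() {
        Map<String, Integer> m = new LinkedHashMap<String, Integer>();
        m.put("a", 1);
        m.put("b", 2);
        m.put("c", 3);
        return m;
    }
}
```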
- MAPREDUCE-2984.
Minor bug reported by devaraj.k and fixed by devaraj.k (mrv2, nodemanager)
Throwing NullPointerException when we open the container page
{code:xml}
Caused by: java.lang.NullPointerException
at org.apache.hadoop.yarn.api.records.ContainerId.compareTo(ContainerId.java:97)
at org.apache.hadoop.yarn.api.records.ContainerId.compareTo(ContainerId.java:23)
at java.util.concurrent.ConcurrentSkipListMap.doGet(ConcurrentSkipListMap.java:819)
at java.util.concurrent.ConcurrentSkipListMap.get(ConcurrentSkipListMap.java:1640)
at org.apache.hadoop.yarn.server.nodemanager.webapp.ContainerPage$ContainerBlock.render(ContainerPage.java:70)...
- MAPREDUCE-2979.
Major bug reported by sseth and fixed by sseth (mrv2)
Remove ClientProtocolProvider configuration under mapreduce-client-core
ClientProtocolProvider configuration exists under the job-client and core modules. It's really only required in job-client. The version in core points to JobTrackerClientProtocolProvider which causes
java.util.ServiceConfigurationError: org.apache.hadoop.mapreduce.protocol.ClientProtocolProvider: Provider org.apache.hadoop.mapred.JobTrackerClientProtocolProvider not found
at java.util.ServiceLoader.fail(ServiceLoader.java:214)
at java.util.ServiceLoader.access$400(ServiceLoad...
- MAPREDUCE-2977.
Blocker sub-task reported by owen.omalley and fixed by acmurthy (mrv2, resourcemanager, security)
ResourceManager needs to renew and cancel tokens associated with a job
The JobTracker currently manages tokens for the applications and the resource manager needs the same functionality.
- MAPREDUCE-2975.
Blocker bug reported by mahadev and fixed by mahadev
ResourceManager Delegate is not getting initialized with yarn-site.xml as default configuration.
MAPREDUCE-2937 accidentally changes ResourceMgrDelegate so that it does not pick up yarn-site.xml as a default resource. Will upload patch.
- MAPREDUCE-2971.
Blocker bug reported by tgraves and fixed by tgraves (mrv2)
ant build mapreduce fails protected access jc.displayJobList(jobs);
Running the ant target in the hadoop-mapreduce-project directory fails with:
[jsp-compile] log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
[javac] /home/tgraves/branch23/branch-0.23/hadoop-mapreduce-project/build.xml:398: warning: 'includeantruntime' was not set, defaulting to build.sysclasspath=last; set to false for repeatable builds
[javac] Compiling 50 source files to /home/tgraves/branch23/branch-0.23/hadoop-mapreduce-project/build/classes
...
- MAPREDUCE-2970.
Major bug reported by venug and fixed by venug (job submission, mrv2)
Null Pointer Exception while submitting a Job, If mapreduce.framework.name property is not set.
If mapreduce.framework.name property is not set in mapred-site.xml, Null pointer Exception is thrown.
java.lang.NullPointerException
at org.apache.hadoop.mapreduce.Cluster$1.run(Cluster.java:133)
at org.apache.hadoop.mapreduce.Cluster$1.run(Cluster.java:1)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1135)
at org.apache.hadoop.mapreduc...
- MAPREDUCE-2966.
Major improvement reported by abhijit.shingate and fixed by abhijit.shingate (applicationmaster, jobhistoryserver, nodemanager, resourcemanager)
Add ShutDown hooks for MRV2 processes
NodeManager registers a shutdown hook in case of JVM exit.
Similarly, all other processes (RM, HistoryServer, MRAppMaster) should also handle the shutdown gracefully in case of JVM exit.
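Registering a JVM shutdown hook uses the standard `Runtime.addShutdownHook` call. The sketch below shows the general shape only; the `Stoppable` interface and the class name are hypothetical and do not correspond to the Hadoop service classes.

```java
/** Hypothetical helper showing how a process might register a shutdown hook. */
final class ShutdownHookSketch {
    /** Minimal stand-in for a service that can be stopped. */
    interface Stoppable {
        void stop();
    }

    /** Register a hook that stops the service when the JVM exits; returns the hook thread. */
    static Thread registerShutdownHook(final Stoppable service) {
        Thread hook = new Thread(new Runnable() {
            @Override
            public void run() {
                service.stop();
            }
        }, "sketch-shutdown-hook");
        Runtime.getRuntime().addShutdownHook(hook);
        return hook;
    }
}
```

Returning the hook thread lets the caller remove it again with `Runtime.removeShutdownHook` if the service is stopped through the normal path first.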
- MAPREDUCE-2965.
Blocker bug reported by vinodkv and fixed by sseth (mrv2)
Streamline hashCode(), equals(), compareTo() and toString() for all IDs
MAPREDUCE-2954 moved these methods to the record interfaces from the PB impls for ContainerId, ApplicationId and ApplicationAttemptId. This is good as they don't need to be tied to the implementation.
We should do the same for all IDs. In fact some of these are missing for IDs like MR AM JobId, TaskId etc.
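As an illustration of what defining these four methods consistently on an ID type looks like, here is a minimal sketch. The `SketchTaskId` class, its two fields, and its string format are assumptions made for the example and are not the MR AM record types.

```java
/** Hypothetical ID type with consistent equals, hashCode, compareTo and toString. */
final class SketchTaskId implements Comparable<SketchTaskId> {
    private final long clusterTimestamp;
    private final int id;

    SketchTaskId(long clusterTimestamp, int id) {
        this.clusterTimestamp = clusterTimestamp;
        this.id = id;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) {
            return true;
        }
        if (!(o instanceof SketchTaskId)) {
            return false;
        }
        SketchTaskId other = (SketchTaskId) o;
        return clusterTimestamp == other.clusterTimestamp && id == other.id;
    }

    @Override
    public int hashCode() {
        // Uses the same fields as equals, so equal IDs hash alike.
        return 31 * (int) (clusterTimestamp ^ (clusterTimestamp >>> 32)) + id;
    }

    @Override
    public int compareTo(SketchTaskId other) {
        // Order by timestamp first, then by id; consistent with equals.
        if (clusterTimestamp != other.clusterTimestamp) {
            return clusterTimestamp < other.clusterTimestamp ? -1 : 1;
        }
        return id < other.id ? -1 : (id == other.id ? 0 : 1);
    }

    @Override
    public String toString() {
        return "task_" + clusterTimestamp + "_" + id;
    }
}
```

Because the fields are final and no method takes a lock, comparisons between two IDs cannot block on each other.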
- MAPREDUCE-2963.
Critical bug reported by mahadev and fixed by sseth
TestMRJobs hangs waiting to connect to history server.
TestMRJobs is hanging waiting to connect to history server. I will post the logs next.
- MAPREDUCE-2961.
Blocker improvement reported by mahadev and fixed by vinodkv (mrv2)
Increase the default threadpool size for container launching in the application master.
Currently the default threadpool size is 10 for launching containers in ContainerLauncherImpl. We should increase that to 100 for a reasonable default, so that container launching is not backed up by a small thread pool size.
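A configurable pool size with a larger default could look roughly like the sketch below. It uses `java.util.Properties` as a stand-in for the real configuration object, and the property key is hypothetical; only the default of 100 comes from the description above.

```java
import java.util.Properties;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Hypothetical sketch of sizing a launcher thread pool from configuration. */
final class LauncherPoolSketch {
    // Hypothetical key, invented for this example.
    static final String POOL_SIZE_KEY = "sketch.container-launcher.thread-count";
    // Default proposed in the description above.
    static final int DEFAULT_POOL_SIZE = 100;

    /** Read the pool size, falling back to the default for missing or invalid values. */
    static int poolSize(Properties conf) {
        String value = conf.getProperty(POOL_SIZE_KEY);
        if (value == null) {
            return DEFAULT_POOL_SIZE;
        }
        try {
            int n = Integer.parseInt(value.trim());
            return n > 0 ? n : DEFAULT_POOL_SIZE;
        } catch (NumberFormatException e) {
            return DEFAULT_POOL_SIZE;
        }
    }

    /** Create a fixed-size pool; the caller is responsible for shutting it down. */
    static ExecutorService newLauncherPool(Properties conf) {
        return Executors.newFixedThreadPool(poolSize(conf));
    }
}
```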
- MAPREDUCE-2958.
Critical bug reported by tgraves and fixed by acmurthy (mrv2)
mapred-default.xml not merged from mr279
I have been running wordcount out of the 23 examples jar. It says it succeeds but doesn't actually output a file.
hadoop jar examples/hadoop-mapreduce-0.23.0-SNAPSHOT/hadoop-mapreduce-examples-0.23.0-SNAPSHOT.jar wordcount input output2
input file is really basic:
fdksajl
dlkfsajlfljda;j
kldfsjallj
test
one
two
test
- MAPREDUCE-2954.
Critical bug reported by vinodkv and fixed by sseth (mrv2)
Deadlock in NM with threads racing for ApplicationAttemptId
Found this:
{code}
Java stack information for the threads listed above:
===================================================
"Thread-45":
at org.apache.hadoop.yarn.api.records.impl.pb.ApplicationAttemptIdPBImpl.getApplicationId(ApplicationAttemptIdPBImpl.java:101)
- waiting to lock <0xb6a43ba0> (a org.apache.hadoop.yarn.api.records.impl.pb.ApplicationAttemptIdPBImpl)
at org.apache.hadoop.yarn.api.records.impl.pb.ApplicationAttemptIdPBImpl.compareTo(ApplicationAttemptIdP...
- MAPREDUCE-2953.
Major bug reported by vinodkv and fixed by tgraves (mrv2, resourcemanager)
JobClient fails due to a race in RM, removes staged files and in turn crashes MR AM
[~Karams] ran into this multiple times. MR JobClient crashes immediately.
{code}
11/09/08 10:52:35 INFO mapreduce.JobSubmitter: number of splits:2094
11/09/08 10:52:36 INFO mapred.YARNRunner: AppMaster capability = memory: 2048,
11/09/08 10:52:36 INFO mapred.YARNRunner: Command to launch container for ApplicationMaster is : $JAVA_HOME/bin/java -Dhadoop.root.logger=INFO,console -Xmx1536m org.apache.hadoop.mapreduce.v2.app.MRAppMaster 1315478927026 1 <FAILCOUNT> 1><LOG_DIR>/stdout 2><LOG_DIR>/...
- MAPREDUCE-2952.
Blocker bug reported by vinodkv and fixed by acmurthy (mrv2, resourcemanager)
Application failure diagnostics are not consumed in a couple of cases
When Container crashes, the reason for failures isn't propagated because of a bug in _RMAppAttemptImpl.AMContainerCrashedTransition_ which simply discards the diagnostics of the container. Also RMAppAttemptImpl.diagnostics is never consumed.
- MAPREDUCE-2949.
Major bug reported by raviteja and fixed by raviteja (mrv2, nodemanager)
NodeManager in an inconsistent state if a service startup fails.
When a service startup fails at the Nodemanager, the Nodemanager JVM does not exit as the following threads are still running.
Daemon Thread [Timer for 'NodeManager' metrics system] (Running)
Thread [pool-1-thread-1] (Running)
Thread [Thread-11] (Running)
Thread [DestroyJavaVM] (Running).
As a result, the NodeManager keeps running even though no services are started.
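One general way to avoid a half-started process is to stop the services that did start when a later one fails, and then propagate the failure so the caller can exit. The sketch below illustrates that idea only; it is not the committed fix, and the `Service` interface and helper methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: roll back already-started services when a start fails. */
final class StartupRollbackSketch {
    /** Minimal stand-in for a startable service. */
    interface Service {
        void start();
        void stop();
    }

    /** Start services in order; on failure stop the started ones in reverse and rethrow. */
    static void startAll(List<Service> services) {
        List<Service> started = new ArrayList<Service>();
        try {
            for (Service s : services) {
                s.start();
                started.add(s);
            }
        } catch (RuntimeException e) {
            for (int i = started.size() - 1; i >= 0; i--) {
                try {
                    started.get(i).stop();
                } catch (RuntimeException ignored) {
                    // Keep stopping the remaining services.
                }
            }
            throw e;
        }
    }

    /** Service that records its calls in a log and can be told to fail on start. */
    static Service recording(final String name, final boolean failOnStart, final StringBuilder log) {
        return new Service() {
            public void start() {
                if (failOnStart) {
                    throw new IllegalStateException(name + " failed to start");
                }
                log.append("start:").append(name).append(';');
            }

            public void stop() {
                log.append("stop:").append(name).append(';');
            }
        };
    }

    /** Runs one healthy and one failing service and returns the recorded call log. */
    static String demo() {
        StringBuilder log = new StringBuilder();
        List<Service> services = new ArrayList<Service>();
        services.add(recording("a", false, log));
        services.add(recording("b", true, log));
        try {
            startAll(services);
        } catch (RuntimeException e) {
            log.append("failed");
        }
        return log.toString();
    }
}
```

Stopping the started services is what gives their threads a chance to end; whether that is enough for the JVM to exit depends on every non-daemon thread actually terminating in `stop()`.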
- MAPREDUCE-2948.
Major bug reported by milindb and fixed by mahadev (contrib/streaming)
Hadoop streaming test failure, post MR-2767
After removing LinuxTaskController in MAPREDUCE-2767, one of the tests in contrib/streaming, TestStreamingAsDifferentUser.java, is failing since it imports org.apache.hadoop.mapred.ClusterWithLinuxTaskController. Patch forthcoming.
- MAPREDUCE-2947.
Major bug reported by vinodkv and fixed by vinodkv (mrv2)
Sort fails on YARN+MR with lots of task failures
[~karams](the great man the world hardly knows about) found lots of failing tasks while running sort on a 350 node cluster. The failed tasks eventually failed the job, and this is happening consistently on the big cluster.
{quote}
Container launch failed for container_1315410418107_0002_01_002511 : RemoteTrace: java.lang.IllegalArgumentException at java.nio.Buffer.position(Buffer.java:218) at java.nio.HeapByteBuffer.get(HeapByteBuffer.java:129) at java.nio.ByteBuffer.get(ByteBuffer.java:675) at c...
- MAPREDUCE-2938.
Trivial bug reported by acmurthy and fixed by acmurthy (mrv2, scheduler)
Missing log stmt for app submission fail CS
Missing log stmt for app submission fail CS
- MAPREDUCE-2937.
Critical bug reported by mahadev and fixed by mahadev (mrv2)
Errors in Application failures are not shown in the client trace.
The client side does not show enough information on why the job failed. Here are the steps to reproduce it:
1) set the scheduler to be capacity scheduler with queues a, b
2) submit a job to a queue that is not a,b
The job just fails without saying why it failed. We should have enough trace log at the client side to let the user know why it failed.
- MAPREDUCE-2936.
Major bug reported by vinodkv and fixed by vinodkv
Contrib Raid compilation broken after HDFS-1620
After working around MAPREDUCE-2935 by removing TestServiceLevelAuthorization and running the following:
At the trunk level: mvn clean install package -Dtar -Pdist -Dmaven.test.skip.exec=true
In hadoop-mapreduce-project: ant compile-contrib -Dresolvers=internal
yields 14 errors.
- MAPREDUCE-2933.
Blocker sub-task reported by acmurthy and fixed by acmurthy (applicationmaster, mrv2, nodemanager, resourcemanager)
Change allocate call to return ContainerStatus for completed containers rather than Container
Change allocate call to return ContainerStatus for completed containers rather than Container, we should do this all the way from the NodeManager too.
- MAPREDUCE-2930.
Major improvement reported by sharadag and fixed by decster (mrv2)
Generate state graph from the State Machine Definition
Generate state graph from State Machine Definition
- MAPREDUCE-2925.
Major bug reported by devaraj.k and fixed by devaraj.k (mrv2)
job -status <JOB_ID> is giving continuously info message for completed jobs on the console
The message below is printed continuously on the console.
{code:xml}
11/09/02 16:00:00 INFO mapred.ClientServiceDelegate: Failed to contact AM for job job_1314955256658_0009 Will retry..
11/09/02 16:00:00 INFO mapred.ClientServiceDelegate: Application state is completed. Redirecting to job history server null
11/09/02 16:00:00 INFO mapred.ClientServiceDelegate: Failed to contact AM for job job_1314955256658_0009 Will retry..
11/09/02 16:00:00 INFO mapred.ClientServiceDelegate: Application ...
- MAPREDUCE-2917.
Major bug reported by acmurthy and fixed by acmurthy (mrv2, resourcemanager)
Corner case in container reservations
Saw a corner case in container reservations where the node on which the AM is running was reserved, and hence never fulfilled leaving the application hanging.
- MAPREDUCE-2916.
Major bug reported by mahadev and fixed by mahadev
Ivy build for MRv1 fails with bad organization for common daemon.
This jira is to ignore ivy resolve errors because of bad poms in common daemons.
- MAPREDUCE-2913.
Critical bug reported by revans2 and fixed by jeagles (mrv2, test)
TestMRJobs.testFailingMapper does not assert the correct thing.
{code}
Assert.assertEquals(TaskCompletionEvent.Status.FAILED,
events[0].getStatus().FAILED);
Assert.assertEquals(TaskCompletionEvent.Status.FAILED,
events[1].getStatus().FAILED);
{code}
when optimized would be
{code}
Assert.assertEquals(TaskCompletionEvent.Status.FAILED,
TaskCompletionEvent.Status.FAILED);
Assert.assertEquals(TaskCompletionEvent.Status.FAILED,
TaskCompletionEvent.Status.FAILED);
{code}
obviously these assertions will neve...
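The pitfall described above can be reproduced outside Hadoop. The sketch below uses hypothetical stand-in types (Status and Event are illustrative, not the real TaskCompletionEvent classes) to show that reading an enum constant through an instance always yields that constant, whatever the instance's value is.

```java
// Hypothetical stand-ins for illustration; not the real Hadoop classes.
class StaticAccessPitfall {
    enum Status { FAILED, SUCCEEDED }

    static class Event {
        private final Status status;
        Event(Status status) { this.status = status; }
        Status getStatus() { return status; }
    }

    // Mirrors the shape of the original assertion: FAILED is resolved as a
    // static constant of the enum type, so the instance value is ignored.
    @SuppressWarnings("static-access")
    static Status viaInstance(Event e) {
        return e.getStatus().FAILED;
    }

    // Compares against the value the event actually carries.
    static Status actual(Event e) {
        return e.getStatus();
    }

    public static void main(String[] args) {
        Event succeeded = new Event(Status.SUCCEEDED);
        System.out.println(viaInstance(succeeded)); // prints FAILED
        System.out.println(actual(succeeded));      // prints SUCCEEDED
    }
}
```

An assertion written against viaInstance can never fail, which is why comparing against getStatus() directly is the form that actually checks the event.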
- MAPREDUCE-2909.
Major sub-task reported by acmurthy and fixed by acmurthy (documentation, mrv2)
Docs for remaining records in yarn-api
MAPREDUCE-2891 , MAPREDUCE-2897 & MAPREDUCE-2898 added javadocs for core protocols (i.e. AMRMProtocol, ClientRMProtocol & ContainerManager). Most 'records' also have javadocs - this jira is to track the remaining ones.
- MAPREDUCE-2908.
Critical bug reported by mahadev and fixed by vinodkv (mrv2)
Fix findbugs warnings in Map Reduce.
In the current trunk/0.23 codebase there are 5 findbugs warnings which cause the precommit CI builds to -1 the patches.
- MAPREDUCE-2907.
Major bug reported by raviprak and fixed by raviprak (mrv2, resourcemanager)
ResourceManager logs filled with [INFO] debug messages from org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue
I see a lot of info messages (probably used for debugging during development)
- MAPREDUCE-2904.
Major bug reported by sharadag and fixed by sharadag
HDFS jars added incorrectly to yarn classpath
- MAPREDUCE-2899.
Major sub-task reported by acmurthy and fixed by acmurthy (mrv2, resourcemanager)
Replace major parts of ApplicationSubmissionContext with a ContainerLaunchContext
We can replace major parts of ApplicationSubmissionContext with a ContainerLaunchContext.
- MAPREDUCE-2898.
Major sub-task reported by acmurthy and fixed by acmurthy (documentation, mrv2)
Docs for core protocols in yarn-api - ContainerManager
Track docs for ContainerManager and related apis/records.
- MAPREDUCE-2897.
Major sub-task reported by acmurthy and fixed by acmurthy (documentation, mrv2)
Docs for core protocols in yarn-api - ClientRMProtocol
Track docs for ClientRMProtocol and related apis/records.
- MAPREDUCE-2896.
Major sub-task reported by acmurthy and fixed by acmurthy (mrv2)
Remove all apis other than getters and setters in all org/apache/hadoop/yarn/api/records/*
Remove all apis other than getters and setters in all org/apache/hadoop/yarn/api/records/*.
We initially added some list manipulation methods etc. which are ungainly and need to go.
- MAPREDUCE-2894.
Blocker improvement reported by acmurthy and fixed by (mrv2)
Improvements to YARN apis
Ticket to track improvements to YARN apis.
- MAPREDUCE-2893.
Trivial improvement reported by viirya and fixed by viirya (client)
Removing duplicate service provider in hadoop-mapreduce-client-jobclient
There is a duplicate provider class name in the ClientProtocolProvider configuration file under hadoop-mapreduce-client-jobclient, although the duplicate is ignored.
- MAPREDUCE-2891.
Major sub-task reported by acmurthy and fixed by acmurthy (documentation, mrv2)
Docs for core protocols in yarn-api - AMRMProtocol
We need to add docs for AMRMProtocol
- MAPREDUCE-2890.
Blocker improvement reported by acmurthy and fixed by (documentation, mrv2)
Documentation for MRv2
Let's use this jira to track docs for all of MRv2.
- MAPREDUCE-2889.
Critical sub-task reported by acmurthy and fixed by hitesh (documentation, mrv2)
Add docs for writing new application frameworks
We need to add docs for writing new application frameworks, including examples, javadocs and sample apps.
- MAPREDUCE-2887.
Major improvement reported by sanjay.radia and fixed by sanjay.radia
MR changes to match HADOOP-7524 (multiple RPC protocols)
- MAPREDUCE-2886.
Critical bug reported by mahadev and fixed by mahadev (mrv2)
Fix Javadoc warnings in MapReduce.
On the current trunk and 0.23, there are 73 javadoc warnings which are causing the buildbot to -1 every patch in MR. We need to fix this to stabilize the CI precommit builds.
- MAPREDUCE-2885.
Blocker bug reported by acmurthy and fixed by acmurthy
mapred-config.sh doesn't look for $HADOOP_COMMON_HOME/libexec/hadoop-config.sh
mapred-config.sh doesn't look for $HADOOP_COMMON_HOME/libexec/hadoop-config.sh and thus fails to find it and errors out.
- MAPREDUCE-2882.
Minor bug reported by tlipcon and fixed by tlipcon (test)
TestLineRecordReader depends on ant jars
This test is currently importing an ant utility class to read a file - this dependency doesn't work in mavenized land.
- MAPREDUCE-2881.
Major bug reported by gkesavan and fixed by gkesavan (build)
mapreduce ant compilation fails "java.lang.IllegalStateException: impossible to get artifacts"
[ivy:resolve] found com.cenqua.clover#clover;3.0.2 in fs
[ivy:resolve]
[ivy:resolve] :: problems summary ::
[ivy:resolve] :::: ERRORS
[ivy:resolve] impossible to get artifacts when data has not been loaded. IvyNode = log4j#log4j;1.2.16
[ivy:resolve]
[ivy:resolve] :: USE VERBOSE OR DEBUG MESSAGE LEVEL FOR MORE DETAILS
BUILD FAILED
/home/jenkins/jenkins-slave/workspace/Hadoop-Mapreduce-trunk-Commit/trunk/hadoop-mapreduce-project/build.xml:451: The following error occurred while executing t...
- MAPREDUCE-2880.
Blocker improvement reported by vicaya and fixed by acmurthy (mrv2)
Fix classpath construction for MRv2
MRConstants.java refers to a hard-coded version of the MR AM jar. The build config works around this with a symlink, and the deployment currently needs the symlink workaround as well. We need to fix this so that we can actually launch arbitrary versions of AMs.
- MAPREDUCE-2877.
Major bug reported by mahadev and fixed by mahadev
Add missing Apache license header in some files in MR and also add the rat plugin to the poms.
Some of the files in MR are missing the Apache license header. We also need to add the apache-rat plugin to be able to run rat automatically via the top level pom.
- MAPREDUCE-2876.
Critical bug reported by revans2 and fixed by anupamseth (mrv2)
ContainerAllocationExpirer appears to use the incorrect configs
ContainerAllocationExpirer sets the expiration interval to be RMConfig.CONTAINER_LIVELINESS_MONITORING_INTERVAL but uses AMLIVELINESS_MONITORING_INTERVAL as the interval. This is very different from what AMLivelinessMonitor does.
There should be two configs RMConfig.CONTAINER_LIVELINESS_MONITORING_INTERVAL for the monitoring interval and RMConfig.CONTAINER_EXPIRY_INTERVAL for the expiry.
- MAPREDUCE-2874.
Major bug reported by tgraves and fixed by eepayne (mrv2)
ApplicationId printed in 2 different formats and has 2 different toString routines that are used
Looks like the ApplicationId is now printed in 2 different formats. ApplicationIdPBImpl.java has a toString routine that prints it in the format: return "application_" + this.getClusterTimestamp() + "_" + this.getId();
While the webapps use ./hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/Apps.java toString that prints it like:
return _join("app", id.getClusterTimestamp(), id.getId());
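To make the mismatch concrete, the sketch below reproduces both formats side by side. It is an illustration only: the class and method names are hypothetical, and the join helper assumes that _join concatenates its arguments with underscores, which the report implies but does not state.

```java
class AppIdFormats {
    // Format quoted in the report for ApplicationIdPBImpl.toString.
    static String pbImplStyle(long clusterTimestamp, int id) {
        return "application_" + clusterTimestamp + "_" + id;
    }

    // Assumed behaviour of _join: arguments separated by "_".
    static String join(Object... parts) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < parts.length; i++) {
            if (i > 0) {
                sb.append('_');
            }
            sb.append(parts[i]);
        }
        return sb.toString();
    }

    // Format quoted in the report for the webapps (Apps.java).
    static String webappStyle(long clusterTimestamp, int id) {
        return join("app", clusterTimestamp, id);
    }

    public static void main(String[] args) {
        System.out.println(pbImplStyle(1315410418107L, 2));
        System.out.println(webappStyle(1315410418107L, 2));
    }
}
```

Under these assumptions the same ID is rendered with two different prefixes, which makes logs and web pages harder to correlate.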
- MAPREDUCE-2868.
Major bug reported by tgraves and fixed by mahadev (build)
ant build broken in hadoop-mapreduce dir
The ant build target doesn't work in the hadoop-mapreduce directory since the mavenization of hdfs changes were checked in.
The error it gives is:
[ivy:resolve] ::::::::::::::::::::::::::::::::::::::::::::::
[ivy:resolve] :: UNRESOLVED DEPENDENCIES ::
[ivy:resolve] ::::::::::::::::::::::::::::::::::::::::::::::
[ivy:resolve] :: org.apache.avro#avro-ipc;working@host: not found
[ivy:resolve] :: org.apache.hadoop#hadoop-alfredo;work...
- MAPREDUCE-2867.
Major bug reported by mahadev and fixed by mahadev
Remove Unused TestApplicationCleanup in resourcemanager/applicationsmanager.
TestApplicationCleanup in resourcemanager/applicationsmanager doesn't do anything. There is already a test in resourcemanager/TestApplicationCleanup which tests all the cleanup code for containers and applications. We should remove the unused one in the trunk.
- MAPREDUCE-2864.
Major improvement reported by revans2 and fixed by revans2 (jobhistoryserver, mrv2, nodemanager, resourcemanager)
Renaming of configuration property names in yarn
Now that YARN has been put into trunk, we should do something similar to MAPREDUCE-849. We should go back and look at all of the configurations that have been added and rename them as needed to be consistent and subdivided by component.
# We should use all lowercase in the config names. e.g., we should use appsmanager instead of appsManager etc.
# history server config names should be prefixed with mapreduce instead of yarn.
- MAPREDUCE-2860.
Major bug reported by mahadev and fixed by mahadev (mrv2)
Fix log4j logging in the maven test cases.
At present the logging in the new test cases is broken because surefire isn't able to find the log4j properties file.
- MAPREDUCE-2859.
Major bug reported by gkesavan and fixed by gkesavan
mapreduce trunk is broken with eclipse plugin contrib
ant compile with eclipse home fails mapreduce trunk builds.
$ANT_HOME/bin/ant -Dversion=${VERSION} -Declipse.home=$ECLIPSE_HOME compile
compile:
[echo] contrib: eclipse-plugin
[javac] Compiling 45 source files to /home/jenkins/jenkins-slave/workspace/Hadoop-Mapreduce-trunk/trunk/build/contrib/eclipse-plugin/classes
[javac] /home/jenkins/jenkins-slave/workspace/Hadoop-Mapreduce-trunk/trunk/src/contrib/eclipse-plugin/src/java/org/apache/hadoop/eclipse/server/HadoopServer.java:39...
- MAPREDUCE-2858.
Blocker sub-task reported by vicaya and fixed by revans2 (applicationmaster, mrv2, security)
MRv2 WebApp Security
A new server has been added to yarn. It is a web proxy that sits in front of the AM web UI. The server is controlled by the yarn.web-proxy.address config. If that config is set, and it points to an address that is different from the RM web interface, then a separate proxy server needs to be launched.
This can be done by running
yarn-daemon.sh start proxyserver
If a separate proxy server is needed and security is enabled, other configs may also need to be set:
yarn.web-proxy.principal
yarn.web-proxy.keytab
The proxy server is stateless and should be able to support a VIP or other load balancing sitting in front of multiple instances of this server.
- MAPREDUCE-2854.
Major bug reported by tgraves and fixed by tgraves
update INSTALL with config necessary run mapred on yarn
The following config is needed to run mapreduce on the yarn framework. Document it in the INSTALL doc.
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
The INSTALL doc also still references the old 22 mapred examples jar.
- MAPREDUCE-2848.
Major improvement reported by vicaya and fixed by vicaya
Upgrade avro to 1.5.2
Upgrading avro to the current version requires some code changes in mapreduce due to the avro package split. The mapreduce part of the change will be part of the atomic commit of HADOOP-7264 after MAPREDUCE-279 is merged to trunk. This jira is for the mapreduce change log.
- MAPREDUCE-2846.
Blocker bug reported by aw and fixed by owen.omalley (task, task-controller, tasktracker)
a small % of all tasks fail with DefaultTaskController
Fixed a race condition in writing the log index file that caused tasks to 'fail'.
- MAPREDUCE-2844.
Trivial bug reported by rramya and fixed by raviteja (mrv2)
[MR-279] Incorrect node ID info
The node ID info for the nodemanager entries on the RM UI incorrectly displays the value of $yarn.server.nodemanager.address instead of the ID.
- MAPREDUCE-2843.
Major bug reported by rramya and fixed by abhijit.shingate (mrv2)
[MR-279] Node entries on the RM UI are not sortable
The nodemanager entries on the RM UI are not sortable, unlike the other web pages.
- MAPREDUCE-2840.
Minor bug reported by tgraves and fixed by jeagles (mrv2)
mr279 TestUberAM.testSleepJob test fails
Currently the TestUberAM.testSleepJob is failing on the mr279 branch.
snippet of failure:
junit.framework.AssertionFailedError: null
at junit.framework.Assert.fail(Assert.java:47)
at junit.framework.Assert.assertTrue(Assert.java:20)
at junit.framework.Assert.assertTrue(Assert.java:27)
at org.apache.hadoop.mapreduce.v2.TestMRJobs.testSleepJob(TestMRJobs.java:150)
at org.apache.hadoop.mapreduce.v2.TestUberAM.testSleepJob(TestUberAM.java:58)
at sun.reflect.NativeMethodAccessorImpl.invok...
- MAPREDUCE-2839.
Major bug reported by sseth and fixed by sseth
MR Jobs fail on a secure cluster with viewfs
TokenCache needs to use the new FileSystem.getDelegationTokens api for it to work with viewfs.
- MAPREDUCE-2821.
Blocker bug reported by rramya and fixed by mahadev (mrv2)
[MR-279] Missing fields in job summary logs
The following fields are missing in the job summary logs in mrv2:
- numSlotsPerMap
- numSlotsPerReduce
- clusterCapacity (Earlier known as clusterMapCapacity and clusterReduceCapacity in 0.20.x)
The first two fields are important to know if the job was a High RAM job or not and the last field is important to know the total available resource in the cluster during job execution.
- MAPREDUCE-2808.
Minor bug reported by tgraves and fixed by tgraves (mrv2)
pull MAPREDUCE-2797 into mr279 branch
The ant tar command fails in the mapreduce directory on the mr279 branch. The issue was a change in hdfs and was fixed on trunk with jira MAPREDUCE-2797. Pull that change into mr279.
- MAPREDUCE-2807.
Major sub-task reported by sharadag and fixed by sharadag (applicationmaster, mrv2, resourcemanager)
MR-279: AM restart does not work after RM refactor
When the AM crashes, RM is not able to launch a new App attempt.
- MAPREDUCE-2805.
Minor improvement reported by szetszwo and fixed by szetszwo (contrib/raid)
Update RAID for HDFS-2241
{noformat}
src/contrib/raid/src/java/org/apache/hadoop/hdfs/server/datanode/RaidBlockSender.java:44: interface expected here
[javac] public class RaidBlockSender implements java.io.Closeable, FSConstants {
[javac] ^
{noformat}
- MAPREDUCE-2802.
Critical improvement reported by rramya and fixed by jeagles (mrv2)
[MR-279] Jobhistory filenames should have jobID to help in better parsing
For a jobID such as job_1312933838300_0007, jobhistory files are named job%5F1312933838300%5F0007_<submit_time>_ramya_<jobname>_<finish_time>_1_1_SUCCEEDED.jhist. Parsing would be easier if the jobIDs were part of the filenames.
- MAPREDUCE-2800.
Major bug reported by rramya and fixed by sseth (mrv2)
clockSplits, cpuUsages, vMemKbytes, physMemKbytes is set to -1 in jhist files
clockSplits, cpuUsages, vMemKbytes and physMemKbytes are set to -1 for all the map tasks for the last 4 progress intervals in the jobhistory files.
- MAPREDUCE-2797.
Major bug reported by szetszwo and fixed by szetszwo (contrib/raid, test)
Some java files cannot be compiled
Due to the changes in HDFS-2239, the following files cannot be compiled (Thanks Amar for pointing them out.)
1. src/test/mapred/org/apache/hadoop/mapreduce/security/TestTokenCache.java
2. src/test/mapred/org/apache/hadoop/mapreduce/security/TestBinaryTokenFile.java
3. src/test/mapred/org/apache/hadoop/mapreduce/security/TestTokenCacheOldApi.java
4. src/contrib/raid/src/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockPlacementPolicyRaid.java
- MAPREDUCE-2796.
Major bug reported by rramya and fixed by devaraj.k (mrv2)
[MR-279] Start time for all the apps is set to 0
The start time for all the apps in the output of "job -list" is set to 0
- MAPREDUCE-2794.
Blocker bug reported by rramya and fixed by johnvijoe (mrv2)
[MR-279] Incorrect metrics value for AvailableGB per queue per user
AvailableGB per queue is not the same as AvailableGB per queue per user when the user limit is set to 100%.
i.e. if the total available GB of the cluster is 60, and queue "default" has 92% capacity with 100% as the user limit, AvailableGB per queue default = 55 (i.e. 0.92*60) whereas AvailableGB per queue for user ramya is 56 (however it should be 55 = 0.92*60*1)
Also, unlike the AvailableGB/queue, AvailableGB/queue/user is not decremented when user ramya is running apps on the "default" qu...
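The expected figure in the example above can be checked with a short calculation. This is a sketch only: the class and method names are hypothetical, and truncating to whole GB is an assumption made for illustration, not a description of the scheduler's actual metrics code.

```java
class AvailableGbExample {
    // Expected available GB for a queue (or a user within it), assuming the
    // value is the cluster total scaled by queue capacity and user limit,
    // truncated to whole GB.
    static int availableGb(int clusterGb, double queueCapacity, double userLimit) {
        return (int) Math.floor(clusterGb * queueCapacity * userLimit);
    }

    public static void main(String[] args) {
        // Values from the report: 60 GB cluster, 92% queue capacity,
        // 100% user limit.
        System.out.println(availableGb(60, 0.92, 1.0));
    }
}
```

With a 100% user limit the per-user value should equal the per-queue value, which is the consistency the report expects.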
- MAPREDUCE-2792.
Blocker sub-task reported by rramya and fixed by vinodkv (mrv2, security)
[MR-279] Replace IP addresses with hostnames
Currently, all the logs, UI, and CLI show IP addresses of the NM/RM, which are difficult to manage. It would be useful to have hostnames, as in 0.20.x, for easier debugging and maintenance.
- MAPREDUCE-2791.
Blocker bug reported by rramya and fixed by devaraj.k (mrv2)
[MR-279] Missing/incorrect info on job -status CLI
There are a couple of details missing/incorrect on the job -status command line output for completed jobs:
1. Incorrect job file
2. map() completion is always 0
3. reduce() completion is always set to 0
4. history URL is empty
5. Missing launched map tasks
6. Missing launched reduce tasks
- MAPREDUCE-2789.
Major bug reported by rramya and fixed by eepayne (mrv2)
[MR:279] Update the scheduling info on CLI
"mapred/job -list" now contains map/reduce, container, and resource information.
- MAPREDUCE-2788.
Major bug reported by ahmed.radwan and fixed by ahmed.radwan (mrv2)
Normalize requests in FifoScheduler.allocate to prevent NPEs later
The assignContainer() method in org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue can cause the scheduler to crash if the ResourceRequest capability memory == 0 (divide by zero).
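One way to picture the normalization named in the title is to round every request up to a positive multiple of a minimum allocation before it reaches any division. The sketch below is a minimal illustration under that assumption; RequestNormalizer and normalizeMemory are hypothetical names, not the scheduler's API.

```java
class RequestNormalizer {
    // Rounds a requested memory size up to a positive multiple of minMb, so
    // that later code never sees a zero-sized capability.
    static int normalizeMemory(int requestedMb, int minMb) {
        if (minMb <= 0) {
            throw new IllegalArgumentException("minMb must be positive");
        }
        int atLeastMin = Math.max(requestedMb, minMb);
        // Integer ceiling division, then scale back up to a multiple of minMb.
        return ((atLeastMin + minMb - 1) / minMb) * minMb;
    }

    public static void main(String[] args) {
        System.out.println(normalizeMemory(0, 1024));    // prints 1024
        System.out.println(normalizeMemory(1025, 1024)); // prints 2048
    }
}
```

A request of 0 becomes the minimum allocation, so a later division by the capability cannot hit zero.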
- MAPREDUCE-2783.
Critical bug reported by tgraves and fixed by eepayne (mrv2)
mr279 job history handling after killing application
The job history/application tracking url handling during kill is not consistent. Currently, if you kill a job that was running, the tracking url points to job history, but the job history server doesn't have the job.
- MAPREDUCE-2782.
Major test reported by acmurthy and fixed by acmurthy (mrv2)
MR-279: Unit (mockito) tests for CS
Add (true) unit tests for CapacityScheduler
- MAPREDUCE-2781.
Minor bug reported by tgraves and fixed by tgraves (mrv2)
mr279 RM application finishtime not set
The RM Application finishTime isn't being set. Looks like it got lost in the RM refactor.
- MAPREDUCE-2779.
Major bug reported by mingma and fixed by mingma (job submission)
JobSplitWriter.java can't handle large job.split file
We use cascading MultiInputFormat. MultiInputFormat sometimes generates big job.split used internally by hadoop, sometimes it can go beyond 2GB.
In JobSplitWriter.java, the function that generates such file uses 32bit signed integer to compute offset into job.split.
writeNewSplits
...
int prevCount = out.size();
...
int currCount = out.size();
writeOldSplits
...
long offset = out.size();
...
int currLen = out.size();
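The underlying limitation is ordinary 32-bit arithmetic. The sketch below is illustrative only (the class and method names are hypothetical, and it does not touch the real output stream): it shows that an int offset wraps to a negative value once it passes 2GB, while a long offset keeps counting.

```java
class SplitOffsetOverflow {
    // Advances an offset held in a 32-bit signed int; wraps past 2^31 - 1.
    static int nextOffsetInt(int offset, int recordLen) {
        return offset + recordLen;
    }

    // Advances an offset held in a 64-bit signed long; no wrap at 2GB.
    static long nextOffsetLong(long offset, int recordLen) {
        return offset + recordLen;
    }

    public static void main(String[] args) {
        System.out.println(nextOffsetInt(Integer.MAX_VALUE, 1));  // prints -2147483648
        System.out.println(nextOffsetLong(Integer.MAX_VALUE, 1)); // prints 2147483648
    }
}
```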
- MAPREDUCE-2776.
Major bug reported by sseth and fixed by sseth (mrv2)
MR 279: Fix some of the yarn findbug warnings
Fix / ignore some of the findbug warnings in the yarn module.
- MAPREDUCE-2775.
Blocker bug reported by rramya and fixed by devaraj.k (mrv2)
[MR-279] Decommissioned node does not shutdown
A Nodemanager which is decommissioned by an admin via refreshnodes does not automatically shut down.
- MAPREDUCE-2774.
Minor bug reported by rramya and fixed by venug (mrv2)
[MR-279] Add a startup msg while starting RM/NM
Add a startup msg while starting NM/RM indicating the version, build details etc. This will help in easier parsing of logs and debugging.
- MAPREDUCE-2773.
Minor bug reported by tgraves and fixed by tgraves (mrv2)
[MR-279] server.api.records.NodeHealthStatus renamed but not updated in client NodeHealthStatus.java
On the mr279 branch, you can't successfully run the ant target from the mapreduce directory since the checkin of the RM refactor.
The issue is that NodeHealthStatus was renamed from org.apache.hadoop.yarn.server.api.records.NodeHealthStatus to org.apache.hadoop.yarn.api.records.NodeHealthStatus, but the client mapreduce/src/java/org/apache/hadoop/mapred/NodeHealthStatus.java wasn't updated with the change.
- MAPREDUCE-2772.
Major bug reported by revans2 and fixed by revans2 (mrv2)
MR-279: mrv2 no longer compiles against trunk after common mavenization.
mrv2 no longer compiles against trunk after common mavenization
- MAPREDUCE-2767.
Blocker bug reported by milindb and fixed by milindb (security)
Remove Linux task-controller from 0.22 branch
There's a potential security hole in the task-controller as it stands. Based on the discussion on general@, removing task-controller from the 0.22 branch will pave the way for the 0.22.0 release. (This was done for the 0.21.0 release as well: see MAPREDUCE-2014.) We can roll a 0.22.1 release with the task-controller when it is fixed.
- MAPREDUCE-2766.
Blocker sub-task reported by rramya and fixed by hitesh (mrv2)
[MR-279] Set correct permissions for files in dist cache
Currently, the files in both the public and private dist cache have 777 permissions. Also, the group ownership of files in the private cache has to be set to $TT_SPECIAL_GROUP.
- MAPREDUCE-2764.
Major bug reported by daryn and fixed by owen.omalley
Fix renewal of dfs delegation tokens
Generalizes token renewal and canceling to a common interface and provides a plugin interface for adding renewers for new kinds of tokens. Hftp changed to store the tokens as HFTP and renew them over http.
- MAPREDUCE-2763.
Major bug reported by rramya and fixed by (mrv2)
IllegalArgumentException while using the dist cache
IllegalArgumentException is seen while using distributed cache to cache some files and custom jars in classpath.
A simple way to reproduce this error is by using a streaming job:
hadoop jar hadoop-streaming.jar -libjars file://<path to custom jar> -input <path to input file> -output out -mapper "cat" -reducer NONE -cacheFile hdfs://<path to some file>#linkname
This is a regression; the same command works fine on 0.20.x.
- MAPREDUCE-2762.
Blocker bug reported by rramya and fixed by mahadev (mrv2)
[MR-279] - Cleanup staging dir after job completion
The files created under the staging dir have to be deleted after job completion. Currently, all job.* files remain forever in the ${yarn.apps.stagingDir}
- MAPREDUCE-2760.
Minor bug reported by tlipcon and fixed by tlipcon (documentation)
mapreduce.jobtracker.split.metainfo.maxsize typoed in mapred-default.xml
The configuration mapreduce.jobtracker.split.metainfo.maxsize is incorrectly included in mapred-default.xml as mapreduce.*job*.split.metainfo.maxsize. It seems that {{jobtracker}} is correct, since this is a JT-wide property rather than a job property.
- MAPREDUCE-2756.
Minor bug reported by revans2 and fixed by revans2 (client, mrv2)
JobControl can drop jobs if an error occurs
If you run a pig job with UDFs that have not been recompiled for MRv2, there are situations where pig will fail with an error message stating that Hadoop failed, without giving a reason. There is even the possibility of deadlock if an Error is thrown and the JobControl thread dies.
- MAPREDUCE-2754.
Blocker bug reported by rramya and fixed by raviteja (mrv2)
MR-279: AM logs are incorrectly going to stderr and error messages going incorrectly to stdout
The log messages for AM container are going into stderr instead of syslog. Also, stderr and stdout roles are reversed.
- MAPREDUCE-2751.
Blocker bug reported by vinodkv and fixed by sseth (mrv2)
[MR-279] Lot of local files left on NM after the app finish.
This ticket is about app-only files which should be cleaned after app-finish.
I see these undeleted after app-finish:
/tmp/nm-local-dir/0/nmPrivate/application_1305091029545_0001/*
/tmp/nm-local-dir/0/nmPrivate/container_1305019205843_0001_000002/*
/tmp/nm-local-dir/0/usercache/nobody/appcache/application_1305091029545_0001/*
We should check for other left-over files too, if any.
- MAPREDUCE-2749.
Major bug reported by vinodkv and fixed by tgraves (mrv2)
[MR-279] NM registers with RM even before it starts various servers
In case the NM eventually fails to start the ContainerManager server because of, say, a port clash, the RM will have to wait for expiry to detect the NM crash.
It is desirable to make NM register with RM only after it can start all of its components successfully.
- MAPREDUCE-2747.
Blocker sub-task reported by vinodkv and fixed by revans2 (mrv2, nodemanager, security)
[MR-279] [Security] Cleanup LinuxContainerExecutor binary sources
There are a lot of references to the old task-controller nomenclature still, job/task refs instead of app/container.
Also the configuration file is named as taskcontroller.cfg and the configured variables are also from the mapred world (mrv1). These SHOULD be fixed before we make a release. Marking this as blocker.
- MAPREDUCE-2746.
Blocker sub-task reported by vinodkv and fixed by acmurthy (mrv2, security)
[MR-279] [Security] Yarn servers can't communicate with each other with hadoop.security.authorization set to true
Because of this problem, till now, we've been testing YARN+MR with {{hadoop.security.authorization}} set to false. We need to register yarn communication protocols in the implementation of the authorization related PolicyProvider (MapReducePolicyProvider.java).
[~devaraj] also found this issue independently.
- MAPREDUCE-2741.
Major task reported by tucu00 and fixed by tucu00 (build)
Make ant build system work with hadoop-common JAR generated by Maven
Some tweaks must be done in MAPRED & its contribs ivy configuration to work with HADOOP-6671.
This will be a temporary fix until MAPRED is mavenized.
- MAPREDUCE-2740.
Major bug reported by tlipcon and fixed by tlipcon
MultipleOutputs in new API creates needless TaskAttemptContexts
MultipleOutputs.write creates a new TaskAttemptContext, which we've seen to take a significant amount of CPU. The TaskAttemptContext constructor creates a JobConf, gets current UGI, etc. I don't see any reason it needs to do this, instead of just creating a single TaskAttemptContext when the InputFormat is created (or lazily but cached as a member)
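The "lazily but cached as a member" idea can be sketched as follows. This is an illustration under stated assumptions: Context is a hypothetical stand-in for an expensive-to-build TaskAttemptContext, and CachedContexts is not the MultipleOutputs implementation.

```java
import java.util.HashMap;
import java.util.Map;

class CachedContexts {
    // Hypothetical stand-in for an object that is costly to construct.
    static class Context {
        final String namedOutput;
        Context(String namedOutput) { this.namedOutput = namedOutput; }
    }

    private final Map<String, Context> cache = new HashMap<>();
    private int constructions = 0;

    // Builds the context on first use and reuses it on every later call.
    Context getContext(String namedOutput) {
        Context ctx = cache.get(namedOutput);
        if (ctx == null) {
            ctx = new Context(namedOutput);
            constructions++;
            cache.put(namedOutput, ctx);
        }
        return ctx;
    }

    // Number of contexts actually constructed so far.
    int constructions() {
        return constructions;
    }
}
```

With this shape, repeated writes to the same named output pay the construction cost once rather than on every call.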
- MAPREDUCE-2738.
Blocker bug reported by rramya and fixed by revans2 (mrv2)
Missing cluster level stats on the RM UI
Cluster usage information such as the following is currently not available in the RM UI.
- Total number of apps submitted so far
- Total number of containers running/total memory usage
- Total capacity of the cluster (in terms of memory)
- Reserved memory
- Total number of NMs - sorting based on Node IDs is an option but when there are lost NMs or restarted NMs, the node ids do not correspond to the actual value
- Blacklisted NMs - sorting based on health-status and counting manually is...
- MAPREDUCE-2737.
Major bug reported by rramya and fixed by sseth (mrv2)
Update the progress of jobs on client side
The progress of jobs is not being correctly updated on the client side. The map progress halts at 66%, and neither map nor reduce progress displays 100% when the job completes.
- MAPREDUCE-2736.
Major task reported by eli and fixed by eli (jobtracker, tasktracker)
Remove unused contrib components dependent on MR1
The pre-MR2 MapReduce implementation (JobTracker, TaskTracker, etc) and contrib components are no longer supported. This implementation is currently supported in the 0.20.20x releases.
- MAPREDUCE-2735.
Major bug reported by tgraves and fixed by tgraves (mrv2)
MR279: finished applications should be added to an application summary log
When an application finishes it should be added to an application summary log for historical purposes. jira MAPREDUCE-2649 is going to start purging applications from RM when certain limits are hit which makes this more critical. We also need to save the information early enough after the app finishes so we don't lose the info if the RM does get restarted.
- MAPREDUCE-2732.
Major bug reported by szetszwo and fixed by szetszwo (test)
Some tests using FSNamesystem.LOG cannot be compiled
- MAPREDUCE-2727.
Major bug reported by naisbitt and fixed by naisbitt (mrv2)
MR-279: SleepJob throws divide by zero exception when count = 0
When the count is 0 for mappers or reducers, a divide-by-zero exception is thrown. There are existing checks to error out when count < 0, which obviously doesn't handle the 0 case. This is causing the MRReliabilityTest to fail.
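The gap between the "count < 0" check and the zero case can be illustrated with a small guard. The sketch below is hypothetical: SleepCounts and perTaskSleepMs are invented names, and the division shown is only an example of where a zero count would fail, not the actual SleepJob code.

```java
class SleepCounts {
    // Splits a total sleep time evenly across tasks, treating a zero count
    // as "nothing to do" instead of dividing by zero.
    static long perTaskSleepMs(long totalSleepMs, int taskCount) {
        if (taskCount < 0) {
            throw new IllegalArgumentException("taskCount must not be negative");
        }
        if (taskCount == 0) {
            return 0L; // without this branch the division below would throw
        }
        return totalSleepMs / taskCount;
    }

    public static void main(String[] args) {
        System.out.println(perTaskSleepMs(1000L, 0)); // prints 0
        System.out.println(perTaskSleepMs(1000L, 4)); // prints 250
    }
}
```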
- MAPREDUCE-2726.
Blocker improvement reported by naisbitt and fixed by naisbitt (mrv2)
MR-279: Add the jobFile to the web UI
MAPREDUCE-2716 adds the jobfile information to the ApplicationReport. With that information available, we should add the jobfile to the web UI as well.
- MAPREDUCE-2719.
Major new feature reported by sharadag and fixed by hitesh (mrv2)
MR-279: Write a shell command application
Adding a simple, DistributedShell application as an alternate framework to MapReduce and to act as an illustrative example for porting applications to YARN.
- MAPREDUCE-2716.
Major bug reported by naisbitt and fixed by naisbitt (mrv2)
MR279: MRReliabilityTest job fails because of missing job-file.
The ApplicationReport should have the jobFile (e.g. hdfs://localhost:9000/tmp/hadoop-<USER>/mapred/staging/<USER>/.staging/job_201107121640_0001/job.xml)
Without it, jobs such as MRReliabilityTest fail with the following error (caused by the fact that jobFile is hardcoded to "" in TypeConverter.java):
e.g. java.lang.IllegalArgumentException: Can not create a Path from an empty string
at org.apache.hadoop.fs.Path.checkPathArg(Path.java:88)
at org.apache.hadoop.fs.Path.<init>(...
- MAPREDUCE-2711.
Major bug reported by szetszwo and fixed by szetszwo (contrib/raid)
TestBlockPlacementPolicyRaid cannot be compiled
{{TestBlockPlacementPolicyRaid}} accesses the internal {{FSNamesystem}} directly. It cannot be compiled after HDFS-2147.
- MAPREDUCE-2710.
Major bug reported by szetszwo and fixed by szetszwo (client)
Update DFSClient.stringifyToken(..) in JobSubmitter.printTokens(..) for HDFS-2161
{{DFSClient.stringifyToken(..)}} was removed by HDFS-2161, so {{JobSubmitter.printTokens(..)}} no longer compiles.
- MAPREDUCE-2708.
Blocker sub-task reported by sharadag and fixed by sharadag (applicationmaster, mrv2)
[MR-279] Design and implement MR Application Master recovery
Design recovery of MR AM from crashes/node failures. The running job should recover from the state it left off.
- MAPREDUCE-2707.
Major improvement reported by jnp and fixed by jnp
ProtoOverHadoopRpcEngine without using TunnelProtocol over WritableRpc
ProtoOverHadoopRpcEngine is introduced in MR-279, which uses TunnelProtocol over WritableRpcEngine. This jira removes the tunnel protocol and lets ProtoOverHadoopRpcEngine directly interact with ipc.Client and ipc.Server.
- MAPREDUCE-2706.
Major bug reported by naisbitt and fixed by naisbitt (mrv2)
MR-279: Submit jobs beyond the max jobs per queue limit no longer gets logged
Submitting jobs over the queue limits used to print log messages such as these:
hadoop-mapred-jobtracker-HOSTNAME.log. ... INFO
org.apache.hadoop.mapred.CapacityTaskScheduler: default has 10 active tasks for user MYUSER, cannot initialize
job_XXX with 10 tasks since it will exceed limit of 15 active tasks per user for this queue
and
hadoop-mapred-jobtracker-HOSTNAME.log ... INFO org.apache.hadoop.mapred.CapacityTaskScheduler: default already has 2 running jobs and 0 initializing jobs; cannot ...
- MAPREDUCE-2705.
Major bug reported by tgraves and fixed by tgraves (tasktracker)
tasks localized and launched serially by TaskLauncher - causing other tasks to be delayed
The current TaskLauncher serially launches new tasks one at a time. During the launch it does the localization and then starts the map/reduce task. This can cause any other tasks to be blocked waiting for the current task to be localized and started. In some instances we have seen a task that has a large file to localize (1.2MB) block another task for about 40 minutes. This particular task being blocked was a cleanup task which caused the job to be delayed finishing for the 40 minutes.
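The blocking described above can be illustrated with a minimal sketch in which each task is localized and launched on its own pool thread. This is not the TaskLauncher code or the actual patch; the class name, task names and pool size are hypothetical, and the long localization is simulated by a task that cannot finish until the cleanup task has run.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: tasks are launched on separate threads, so a slow
// localization does not hold up the tasks queued behind it.
class ParallelLauncherSketch {
    /** Returns task names in the order in which they finished launching. */
    static List<String> launchAll() {
        List<String> finished = new CopyOnWriteArrayList<>();
        CountDownLatch cleanupDone = new CountDownLatch(1);
        ExecutorService pool = Executors.newFixedThreadPool(2);

        // Submitted first: stands in for a task with a large file to localize.
        pool.submit(() -> {
            try {
                cleanupDone.await(); // simulated long localization
                finished.add("big-localization-task");
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        // Submitted second: the cleanup task still gets to run immediately.
        pool.submit(() -> {
            finished.add("cleanup-task");
            cleanupDone.countDown();
        });

        pool.shutdown();
        try {
            pool.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return finished;
    }
}
```

With a single launcher thread, the same two submissions would leave the cleanup task waiting behind the slow one, which is the behavior reported here.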
- MAPREDUCE-2702.
Blocker sub-task reported by sharadag and fixed by sharadag (applicationmaster, mrv2)
[MR-279] OutputCommitter changes for MR Application Master recovery
Enhance OutputCommitter and FileOutputCommitter to allow for recovery of tasks across job restarts.
- MAPREDUCE-2701.
Major improvement reported by revans2 and fixed by revans2 (mrv2)
MR-279: app/Job.java needs UGI for the user that launched it
./mr-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/Job.java is missing some data that is needed by the Job History GUI. It needs the UGI for the user that launched it.
- MAPREDUCE-2697.
Major bug reported by acmurthy and fixed by acmurthy (mrv2)
Enhance CS to cap concurrently running jobs
Enhance the CapacityScheduler (CS) to cap concurrently running jobs, as in 0.20.203.
- MAPREDUCE-2696.
Major sub-task reported by acmurthy and fixed by sseth (mrv2, nodemanager)
Container logs aren't getting cleaned up when LogAggregation is disabled
Container logs aren't getting cleaned up when log-aggregation is disabled.
- MAPREDUCE-2693.
Critical bug reported by amolkekre and fixed by hitesh (mrv2)
NPE in AM causes it to lose containers which are never returned back to RM
The following exception in AM of an application at the top of queue causes this. Once this happens, AM keeps obtaining
containers from RM and simply loses them. Eventually on a cluster with multiple jobs, no more scheduling happens
because of these lost containers.
It happens when there are blacklisted nodes at the app level in AM. A bug in AM
(RMContainerRequestor.containerFailedOnHost(hostName)) is causing this - nodes are simply getting removed from the
request-table. We should make sure ...
- MAPREDUCE-2691.
Major improvement reported by amolkekre and fixed by sseth (mrv2)
Finish up the cleanup of distributed cache file resources and related tests.
Implement cleanup of distributed cache file resources
- MAPREDUCE-2690.
Major bug reported by rramya and fixed by eepayne (mrv2)
Construct the web page for default scheduler
Currently, the web page for default scheduler reads as "Under construction". This is a long known issue, but could not find a tracking ticket. Hence opening one.
- MAPREDUCE-2689.
Major bug reported by rramya and fixed by (mrv2)
InvalidStateTransisiton when AM is not assigned to a job
In cases where an AM is not assigned to a job, an invalid event (RELEASED at COMPLETED) is observed. This is easily reproducible in cases such as MAPREDUCE-2687.
- MAPREDUCE-2687.
Blocker bug reported by rramya and fixed by mahadev (mrv2)
Non superusers unable to launch apps in both secure and non-secure cluster
Apps submitted by non-superusers fail in both secure and non-secure environments. Only the superuser (i.e. the one who started/owns the mrv2 cluster) is able to launch apps successfully; when a normal user submits a job, the job fails.
- MAPREDUCE-2682.
Trivial improvement reported by acmurthy and fixed by vinodkv
Add a -classpath option to bin/mapred
We should have a bin/mapred classpath switch; MR-279 uses this in the branch.
- MAPREDUCE-2680.
Minor improvement reported by acmurthy and fixed by acmurthy
Enhance job-client cli to show queue information for running jobs
It'd be very useful to display queue information for running jobs along with jobid, user, start-time etc.
- MAPREDUCE-2679.
Trivial improvement reported by acmurthy and fixed by acmurthy
MR-279: Merge MR-279 related minor patches into trunk
Jira to track very minor and misc. changes to trunk for MR-279
- MAPREDUCE-2678.
Major bug reported by naisbitt and fixed by naisbitt (contrib/capacity-sched)
MR-279: minimum-user-limit-percent no longer honored
MR-279: In the capacity-scheduler.xml configuration, the 'minimum-user-limit-percent' property is no longer honored.
- MAPREDUCE-2677.
Major bug reported by rramya and fixed by revans2 (mrv2)
MR-279: 404 error while accessing pages from history server
Accessing the following pages from the history server causes an HTTP 404 error:
1. Cluster -> About
2. Cluster -> Applications
3. Cluster -> Scheduler
4. Application -> About
- MAPREDUCE-2676.
Major improvement reported by revans2 and fixed by revans2 (mrv2)
MR-279: JobHistory Job page needs reformatted
The Job page, the Maps page and the Reduces page for the job history server need to be reformatted.
The Job Overview needs to add in the User, a link to the Job Conf, and the Job ACLs
It also needs Submitted at, launched at, and finished at, depending on how they relate to Started and Elapsed.
In the attempts table we need to remove the new and the running columns
In the tasks table we need to remove progress, pending, and running columns and add in a failed count column
We also need to i...
- MAPREDUCE-2675.
Major improvement reported by revans2 and fixed by revans2 (mrv2)
MR-279: JobHistory Server main page needs to be reformatted
The main page of the Job History Server is based off of the Application Master code. It needs to be reformatted to be more useful and better match what was there before.
- The Active Jobs title needs to be replaced with something more appropriate (i.e. Retired Jobs)
- The table of jobs should have the following columns in it
- Submit time, Job Id, Job Name, User and, because I think they would be useful, state, maps completed, maps failed, reduces completed, reduces failed
- The table ne...
- MAPREDUCE-2672.
Major improvement reported by revans2 and fixed by revans2 (mrv2)
MR-279: JobHistory Server needs Analysis this job
The JobHistory Server needs to implement the "Analyse This Job" functionality from the previous server.
This should include the following info
Hadoop Job ID
User :
JobName :
JobConf :
Submitted At :
Launched At : (including duration)
Finished At : (including duration)
Status :
Time taken by best performing Map task <TASK_LINK>:
Average time taken by Map tasks:
Worse performing map tasks: (including task links and duration)
The last Map task <TASK_LINK> finished at (relative to the Job...
- MAPREDUCE-2670.
Trivial bug reported by eli and fixed by eli
Fixing spelling mistake in FairSchedulerServlet.java
"Admininstration" is misspelled.
- MAPREDUCE-2668.
Blocker bug reported by revans2 and fixed by tgraves (mrv2)
MR-279: APPLICATION_STOP is never sent to AuxServices
APPLICATION_INIT is sent to the AuxServices, but APPLICATION_STOP never is. This means that map intermediate data will never be deleted.
- MAPREDUCE-2667.
Major bug reported by tgraves and fixed by tgraves (mrv2)
MR279: mapred job -kill leaves application in RUNNING state
The mapred job -kill command doesn't seem to fully clean up the application.
If you kill a job and run mapred job -list again it still shows up as running:
mapred job -kill job_1310072430717_0003
Killed job job_1310072430717_0003
mapred job -list
Total jobs:1
JobId State StartTime UserName Queue Priority SchedulingInfo
job_1310072430717_0003 RUNNING 0 tgraves default NORMAL 98.139.92.22:19888/yarn/job/job_1310072430717_3_3
Running kill again will error o...
- MAPREDUCE-2666.
Blocker sub-task reported by revans2 and fixed by jeagles (mrv2)
MR-279: Need to retrieve shuffle port number on ApplicationMaster restart
MAPREDUCE-2652 allows ShuffleHandler to return the port it is operating on. If the ApplicationMaster crashes and needs to be restarted, that information is lost. We either need to re-query it from each of the NodeManagers or to persist it to the JobHistory logs and retrieve it again. Persisting it to the job history logs is probably the simpler solution.
- MAPREDUCE-2664.
Major improvement reported by sseth and fixed by sseth (mrv2)
MR 279: Implement JobCounters for MRv2 + Fix for Map Data Locality
MRv2 is currently not setting any Job Counters.
- MAPREDUCE-2663.
Minor bug reported by ahmed.radwan and fixed by ahmed.radwan (mrv2)
MR-279: Refactoring StateMachineFactory inner classes
The code for ApplicableSingleTransition and ApplicableMultipleTransition inner classes is almost identical. For maintainability, it is better to refactor them into a single inner class.
- MAPREDUCE-2661.
Minor bug reported by ahmed.radwan and fixed by ahmed.radwan (mrv2)
MR-279: Accessing MapTaskImpl from TaskImpl
We are directly accessing MapTaskImpl in TaskImpl.InitialScheduleTransition.transition(..). It'll be better to reorganize the code so each subclass can provide its own behavior instead of explicitly checking for the subclass type.
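The reorganization suggested above can be sketched as follows. The classes are hypothetical stand-ins, not the real TaskImpl hierarchy: the shared transition calls an abstract hook, and each subclass supplies its own behavior, so the base class never checks for a subclass type.

```java
// Hypothetical stand-ins for TaskImpl and its subclasses; the real classes differ.
abstract class SketchTask {
    // Shared transition logic calls a hook instead of testing "instanceof".
    final String schedule() {
        return "scheduled " + describeAttempt();
    }

    // Each subclass provides its own behavior here.
    protected abstract String describeAttempt();
}

class SketchMapTask extends SketchTask {
    @Override
    protected String describeAttempt() {
        return "map attempt";
    }
}

class SketchReduceTask extends SketchTask {
    @Override
    protected String describeAttempt() {
        return "reduce attempt";
    }
}
```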
- MAPREDUCE-2655.
Major bug reported by tgraves and fixed by tgraves (mrv2)
MR279: Audit logs for YARN
We need audit logs for YARN components:
ResourceManager:
- All the refresh* protocol access points - refreshQueues, refreshNodes, refreshProxyUsers,
refreshUserToGroupMappings.
- All app-submissions, app-kills to RM.
- Illegal and successful(?) AM registrations.
- Illegal container allocations/deallocations from AMs
- Successful container allocations/deallocations from AMs too?
NodeManager:
- Illegal container launches from AMs
- Successful container launches from AMs too?
Not sure ...
- MAPREDUCE-2652.
Major bug reported by revans2 and fixed by revans2 (mrv2)
MR-279: Cannot run multiple NMs on a single node
Currently in MR-279 the Auxiliary services, like ShuffleHandler, have no way to communicate information back to the applications. Because of this the Map Reduce Application Master has a hardcoded port of 8080 for shuffle. This prevents the configuration "mapreduce.shuffle.port" from ever being set to anything but 8080. The code should be updated to allow this information to be returned to the application master. Also the data needs to be persisted to the task log so that on restart the...
- MAPREDUCE-2649.
Major bug reported by tgraves and fixed by tgraves (mrv2)
MR279: Fate of finished Applications on RM
New config added:
// the maximum number of completed applications the RM keeps
<name>yarn.server.resourcemanager.expire.applications.completed.max</name>
- MAPREDUCE-2646.
Critical bug reported by sharadag and fixed by sharadag (applicationmaster, mrv2)
MR-279: AM with same sized maps and reduces hangs in presence of failing maps
Currently AM can assign a container given by RM to any map or reduce. However RM allocates for a particular priority. This leads to AM and RM data structures going out of sync.
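One way to keep the two sides in sync is to hand a container only to a task that was requested at the container's own priority. The sketch below illustrates that idea under stated assumptions; the class, method names and priority values are hypothetical and are not taken from the MR AM code or the committed fix.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; names are not from the MR Application Master.
class PriorityMatchSketch {
    private final Map<Integer, Deque<String>> pendingByPriority = new HashMap<>();

    /** Records that a task asked for a container at the given priority. */
    void request(int priority, String taskId) {
        pendingByPriority.computeIfAbsent(priority, p -> new ArrayDeque<>()).add(taskId);
    }

    /**
     * Hands a container only to a task requested at the container's own
     * priority. Returns null when no such task is pending.
     */
    String assign(int containerPriority) {
        Deque<String> queue = pendingByPriority.get(containerPriority);
        return queue == null ? null : queue.poll();
    }
}
```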
- MAPREDUCE-2644.
Major bug reported by jwills and fixed by jwills (mrv2)
NodeManager fails to create containers when NM_LOG_DIR is not explicitly set in the Configuration
If the yarn configuration does not explicitly specify a value for the yarn.server.nodemanager.log.dir property, container allocation will fail on the NodeManager w/an NPE when the LocalDirAllocator goes to create the temp directory. In most of the code, we handle this by defaulting to /tmp/logs, but we cannot do this in the LocalDirAllocator context, so we need to set the default value explicitly in the Configuration.
Marking this as major b/c it's annoying to bump into it when you're gettin...
- MAPREDUCE-2641.
Minor sub-task reported by jwills and fixed by jwills (mrv2)
Fix the ExponentiallySmoothedTaskRuntimeEstimator and its unit test
Fixed the ExponentiallySmoothedTaskRuntimeEstimator so that it can run and pass the test defined for it in TestRuntimeEstimators.
- MAPREDUCE-2630.
Minor bug reported by jwills and fixed by jwills (mrv2)
MR-279: refreshQueues leads to NPEs when used w/FifoScheduler
The RM's admin service exposes a method refreshQueues that is used to update the queue configuration when used with the CapacityScheduler, but if it is used with the FifoScheduler, it will set the containerTokenSecretManager/clusterTracker fields on the FifoScheduler to null, which eventually leads to NPE. Since the FifoScheduler only has one queue that cannot be refreshed, the correct behavior is for the refreshQueues call to be a no-op.
I will attach a patch that fixes this by splitting th...
- MAPREDUCE-2629.
Minor improvement reported by ecaspole and fixed by ecaspole (task)
Class loading quirk prevents inner class method compilation
While profiling jobs like terasort and gridmix, I noticed that a
method "org.apache.hadoop.mapreduce.task.ReduceContextImpl.access
$000" is near the top. It turns out that this is because the
ReduceContextImpl class has a member backupStore which is accessed
from an inner class ReduceContextImpl$ValueIterator. Due to the way
synthetic accessor methods work, every access of backupStore results
in a call to access$000 to the outer class. For some portion of the
run, backupStore is null and the ...
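The accessor overhead described above can be avoided in a couple of ways, sketched below with hypothetical class names. This is not necessarily the change made for this issue, and it assumes the outer field is final, which may not hold for the real ReduceContextImpl. Note that with nest-based access control (Java 11 and later) javac no longer generates such synthetic accessors for private members.

```java
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical sketch, not the ReduceContextImpl code.
class AccessorSketch {
    // Package-private instead of private: on older compilers a private field
    // read from an inner class went through a synthetic access$NNN method.
    final List<String> backupStore;

    AccessorSketch(List<String> backupStore) {
        this.backupStore = backupStore;
    }

    class ValueIterator implements Iterator<String> {
        // Read the outer field once; later calls use this local reference.
        private final List<String> store = backupStore;
        private int pos = 0;

        @Override
        public boolean hasNext() {
            return store != null && pos < store.size();
        }

        @Override
        public String next() {
            if (!hasNext()) {
                throw new NoSuchElementException();
            }
            return store.get(pos++);
        }
    }
}
```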
- MAPREDUCE-2628.
Minor bug reported by jeagles and fixed by jeagles (mrv2)
MR-279: Add compiled on date to NM and RM info/about page
Compiled on dates were present on the JobTracker UI. Bring compiled on dates to resource manager and node
manager UI.
NM and RM retrieve the build version for hadoop and yarn via the getBuildVersion util api. This function used to
contain the compiled on date, but the date has since been removed: the function is used to determine hadoop compatible
versions, and including the build date made that check too restrictive. Instead, a getDate call should be used to
retrieve the compiled on date.
- MAPREDUCE-2625.
Minor bug reported by jeagles and fixed by jeagles (mrv2)
MR-279: Add Node Manager Version to NM info page
Hadoop and YARN versions are missing from the NM info page
- MAPREDUCE-2624.
Major improvement reported by szetszwo and fixed by szetszwo (contrib/raid)
Update RAID for HDFS-2107
HDFS-2107 is going to move BlockPlacementPolicy to another package.
- MAPREDUCE-2623.
Minor improvement reported by jimplush and fixed by qwertymaniac (test)
Update ClusterMapReduceTestCase to use MiniDFSCluster.Builder
Looking at test class ClusterMapReduceTestCase it issues a warning that the dfsCluster = new MiniDFSCluster(conf, 2, reformatDFS, null); line of code is deprecated and MiniDFSCluster.Builder should be used instead. It notes that the current API will be phased out in version 24. I propose to update the test class to the most up to date code as it's referenced in several places on the internet as an example of how to write a Hadoop Unit Test.
- MAPREDUCE-2622.
Minor task reported by qwertymaniac and fixed by qwertymaniac (test)
Remove the last remaining reference to "io.sort.mb"
TestLocalRunner still carries "io.sort.mb", which must be updated to "mapreduce.task.io.sort.mb" (MRJobConfig.IO_SORT_MB).
- MAPREDUCE-2620.
Major bug reported by szetszwo and fixed by szetszwo (contrib/raid)
Update RAID for HDFS-2087
DataTransferProtocol was changed by HDFS-2087. Need to update RAID.
- MAPREDUCE-2618.
Major bug reported by naisbitt and fixed by naisbitt (mrv2)
MR-279: 0 map, 0 reduce job fails with Null Pointer Exception
A 0 map, 0 reduce job fails with an NPE. This case works fine on hadoop-0.20.x. The job should succeed and run setup/cleanup code - with no tasks. Below is the stacktrace:
11/06/05 19:35:37 WARN mapred.ClientServiceDelegate:
StackTrace: java.lang.NullPointerException
at org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl.getTaskAttemptCompletionEvents(JobImpl.java:498)
at
org.apache.hadoop.mapreduce.v2.app.client.MRClientService$MRClientProtocolHandler.getTaskAttemptComplet...
- MAPREDUCE-2615.
Major bug reported by sseth and fixed by sseth (mrv2)
MR 279: KillJob should go through AM whenever possible
KillJob currently goes directly to the RM - which effectively causes the AM and tasks to be killed via a signal. History information is not recorded in this case.
- MAPREDUCE-2611.
Major improvement reported by sseth and fixed by (mrv2)
MR 279: Metrics, finishTimes, etc in JobHistory
- MAPREDUCE-2606.
Major bug reported by tucu00 and fixed by tucu00
Remove IsolationRunner
IsolationRunner is no longer maintained. See MAPREDUCE-2637 (Providing options to debug the mapreduce user code (Mapper, Reducer, Combiner, Sort implementations)) for its replacement.
- MAPREDUCE-2603.
Major bug reported by vinaythota and fixed by vinaythota (contrib/gridmix)
Gridmix system tests are failing due to high ram emulation enabled by default for normal mr jobs in the trace, which exceeds the slot capacity.
High ram emulation is enabled by default in Gridmix. Because of this, some of the Gridmix system tests hang for some time and then fail after a timeout. The failure occurs whenever the reserved slot capacity exceeds the cluster slot capacity. The fix disables high ram emulation in the tests that use normal mr jobs in their traces.
- MAPREDUCE-2602.
Major improvement reported by ahmed.radwan and fixed by ahmed.radwan
Allow setting of end-of-record delimiter for TextInputFormat (for the old API)
Since there are users who are still using the old MR API, it will be useful to modify the org.apache.hadoop.mapred.LineRecordReader and org.apache.hadoop.mapred.TextInputFormat to be able to use custom (user-specified) end-of-record delimiters. This will make use of the LineReader improvement introduced in HADOOP-7096 that enables the LineReader to break lines at user-specified delimiters.
Note: MAPREDUCE-2254 already added this improvement to the new API (but not the old API).
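To illustrate what breaking records at a user-specified delimiter means, here is a minimal plain-Java sketch. It does not use LineReader, LineRecordReader or any Hadoop configuration key; the class and method names are hypothetical, and it works on an in-memory string rather than an input stream.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of delimiter-based record splitting.
class DelimiterSketch {
    /** Splits text into records at each occurrence of the given delimiter. */
    static List<String> records(String text, String delimiter) {
        if (delimiter.isEmpty()) {
            throw new IllegalArgumentException("delimiter must not be empty");
        }
        List<String> out = new ArrayList<>();
        int start = 0;
        int hit;
        while ((hit = text.indexOf(delimiter, start)) >= 0) {
            out.add(text.substring(start, hit));
            start = hit + delimiter.length();
        }
        if (start < text.length()) {
            out.add(text.substring(start)); // trailing record without a delimiter
        }
        return out;
    }
}
```

For example, `DelimiterSketch.records("a|~|b|~|c", "|~|")` yields the three records a, b and c, even though the text contains no newline.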
- MAPREDUCE-2598.
Minor bug reported by sseth and fixed by sseth (mrv2)
MR 279: miscellaneous UI, NPE fixes for JobHistory, UI
- MAPREDUCE-2596.
Major improvement reported by acmurthy and fixed by amar_kamat (benchmarks, contrib/gridmix)
Gridmix should notify job failures
Gridmix now prints summary information after every run. It summarizes the run w.r.t. input trace details, input data statistics, cli arguments, data-gen runtime, simulation runtimes etc. and also the cluster w.r.t. map slots, reduce slots, jobtracker-address, hdfs-address etc.
- MAPREDUCE-2595.
Minor bug reported by tgraves and fixed by tgraves
MR279: update yarn INSTALL doc
yarn install doc needs to be updated after unsplit: http://svn.apache.org/repos/asf/hadoop/common/branches/MR-279/mapreduce/INSTALL
- MAPREDUCE-2588.
Major bug reported by szetszwo and fixed by szetszwo (contrib/raid)
Raid does not compile after DataTransferProtocol refactoring
Raid is directly using {{DataTransferProtocol}}. It cannot be compiled after HDFS-2066.
- MAPREDUCE-2587.
Minor bug reported by tgraves and fixed by tgraves
MR279: Fix RM version in the cluster->about page
The Resource Manager version in the Cluster->About page always shows 1.0-SNAPSHOT.
- MAPREDUCE-2582.
Major bug reported by sseth and fixed by sseth (mrv2)
MR 279: Cleanup JobHistory event generation
Generate JobHistoryEvents for the correct transitions. Fix missing / incorrect values being set.
- MAPREDUCE-2581.
Trivial bug reported by david_syer and fixed by tim_s
Spelling errors in log messages (MapTask)
Spelling errors in log messages (MapTask) - e.g. search for "recieve" (should be "receive"). A decent IDE should detect these errors as well.
- MAPREDUCE-2580.
Minor improvement reported by sseth and fixed by sseth (mrv2)
MR 279: RM UI should redirect finished jobs to History UI
The RM UI currently has a link to the AM UI. After an application finishes (AM not available), the RM UI should link to the history UI.
- MAPREDUCE-2576.
Trivial bug reported by sherri_chen and fixed by tim_s
Typo in comment in SimulatorLaunchTaskAction.java
This JIRA is to track a fix to a super-trivial issue of a typo of "or" misspelled as "xor " in Line 24 of SimulatorLaunchTaskAction.java
- MAPREDUCE-2575.
Major bug reported by tgraves and fixed by tgraves (test)
TestMiniMRDFSCaching fails if test.build.dir is set to something other than build/test
TestMiniMRDFSCaching fails if test.build.dir is set to something other than build/test
- MAPREDUCE-2573.
Major bug reported by tlipcon and fixed by revans2
New findbugs warning after MAPREDUCE-2494
MAPREDUCE-2494 introduced the following findbugs warning in trunk:
TrackerDistributedCacheManager.java:739, SIC_INNER_SHOULD_BE_STATIC, Priority: Low
Should org.apache.hadoop.mapreduce.filecache.TrackerDistributedCacheManager$CacheDir be a _static_ inner class?
This class is an inner class, but does not use its embedded reference to the object which created it. This reference makes the instances of the class larger, and may keep the reference to the creator object alive longer than necessar...
- MAPREDUCE-2569.
Minor bug reported by jeagles and fixed by jeagles (mrv2)
MR-279: Restarting resource manager with root capacity not equal to 100 percent should result in error
root.capacity is set to 90% without failure
- MAPREDUCE-2566.
Major bug reported by sseth and fixed by sseth (mrv2)
MR 279: YarnConfiguration should reloadConfiguration if instantiated with a non YarnConfiguration object
YarnConfiguration(conf) uses the ctor Configuration(conf) which is effectively a clone. If the configuration object is created before YarnConfiguration has been loaded - yarn-site.xml will not be available to the configuration.
- MAPREDUCE-2563.
Major task reported by vinaythota and fixed by vinaythota (contrib/gridmix)
Gridmix high ram jobs emulation system tests.
Adds system tests to test the High-Ram feature in Gridmix.
- MAPREDUCE-2559.
Major bug reported by eyang and fixed by eyang (build)
ant binary fails due to missing c++ lib dir
Post MAPREDUCE-2521, ant binary fails without "-Dcompile.c++=true -Dcompile.native=true". The bin-package is trying to copy from the c++ lib dir, which doesn't exist yet. The binary target should check for the existence of this dir, or it would also be reasonable for it to depend on compile-c++ (since this is the binary target).
- MAPREDUCE-2556.
Major bug reported by sseth and fixed by sseth (mrv2)
MR 279: NodeStatus.getNodeHealthStatus().setBlah broken
- MAPREDUCE-2554.
Major task reported by vinaythota and fixed by vinaythota (contrib/gridmix)
Gridmix distributed cache emulation system tests.
Adds distributed cache related system tests to Gridmix.
- MAPREDUCE-2552.
Minor bug reported by sseth and fixed by sseth (mrv2)
MR 279: NPE when requesting attemptids for completed jobs
While constructing a CompletedJob instance on the JobHistory server, successfuleAttempt is not populated. This causes an NPE when listing completed attempts for a job via the CLI.
CLI: hadoop job -list-attempt-ids <job_id> MAP completed
- MAPREDUCE-2551.
Major improvement reported by sseth and fixed by sseth (mrv2)
MR 279: Implement JobSummaryLog
Implement JobSummary log for MR.Next
- MAPREDUCE-2550.
Blocker bug reported by eyang and fixed by eyang (build)
bin/mapred no longer works from a source checkout
A developer may want to run hadoop without extracting a tarball. It would be nice if the existing method of running mapred scripts from source code were preserved for developers.
- MAPREDUCE-2544.
Major task reported by vinaythota and fixed by vinaythota (contrib/gridmix)
Gridmix compression emulation system tests.
Adds system tests for testing the compression emulation feature of Gridmix.
- MAPREDUCE-2543.
Major new feature reported by amar_kamat and fixed by amar_kamat (contrib/gridmix)
[Gridmix] Add support for HighRam jobs
Adds High-Ram feature emulation in Gridmix.
- MAPREDUCE-2541.
Critical bug reported by decster and fixed by decster (tasktracker)
Race Condition in IndexCache(readIndexFileToCache,removeMap) corrupts the value of totalMemoryUsed, which may cause TaskTracker to continually throw Exceptions
The race condition goes like this:
Thread1: readIndexFileToCache() totalMemoryUsed.addAndGet(newInd.getSize())
Thread2: removeMap() totalMemoryUsed.addAndGet(-info.getSize());
If the client kills the job while SpillRecord is being read from the file system, info.getSize() equals 0 in removeMap(), so totalMemoryUsed is not actually reduced. After thread1 finishes reading SpillRecord, it adds the real index size to totalMemoryUsed, which leaves the value of totalMemoryUsed wrong (too large).
When this value(totalMemoryUse...
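The interleaving above can be replayed deterministically on a single thread to see how the counter drifts. This is a sketch only: the class and the Info type are hypothetical stand-ins for the IndexCache internals, and no real concurrency is involved.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical single-threaded replay of the race; not the IndexCache code.
class IndexCacheRaceSketch {
    static final class Info {
        volatile int size = 0; // stays 0 until the index file has been read
    }

    /** Replays the problematic order and returns the resulting counter. */
    static int replayBuggyOrder(int realIndexSize) {
        AtomicInteger totalMemoryUsed = new AtomicInteger();
        Info info = new Info();

        // Thread2 (removeMap) runs first: the job was killed mid-read, so the
        // size is still 0 and nothing is actually subtracted.
        totalMemoryUsed.addAndGet(-info.size);

        // Thread1 (readIndexFileToCache) finishes afterwards and adds the
        // real size for an entry that has already been removed.
        info.size = realIndexSize;
        totalMemoryUsed.addAndGet(info.size);

        return totalMemoryUsed.get();
    }
}
```

After the replay the entry is gone from the cache, yet its full size is still counted, which matches the inflated totalMemoryUsed described in this entry.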
- MAPREDUCE-2537.
Minor bug reported by revans2 and fixed by revans2 (mrv2)
MR-279: The RM writes its log to yarn-mapred-resourcemanager-<RM_Host>.out
- MAPREDUCE-2536.
Minor test reported by daryn and fixed by daryn (test)
TestMRCLI broke due to change in usage output
One of the tests broke because it checks the FsShell mv usage line that is emitted after an error. The usage was updated from "-mv <src> <dst>" to "-mv <src> ... <dst>", so the "..." broke the test.
- MAPREDUCE-2534.
Major bug reported by vicaya and fixed by vicaya (mrv2)
MR-279: Fix CI breaking hard coded version in jobclient pom
- MAPREDUCE-2533.
Major new feature reported by vicaya and fixed by vicaya (mrv2)
MR-279: Metrics for reserved resource in ResourceManager
Add metrics for reserved resources.
- MAPREDUCE-2532.
Major new feature reported by vicaya and fixed by vicaya (mrv2)
MR-279: Metrics for NodeManager
Metrics for node manager. Requires a recent (last night) update of hadoop common in the yahoo-merge branch.
- MAPREDUCE-2531.
Blocker bug reported by revans2 and fixed by revans2 (client)
org.apache.hadoop.mapred.jobcontrol.getAssignedJobID throw class cast exception
When using a combination of the mapred and mapreduce APIs (as Pig does), it is possible to get the following exception:
Caused by: java.lang.ClassCastException: org.apache.hadoop.mapreduce.JobID cannot be cast to
org.apache.hadoop.mapred.JobID
at org.apache.hadoop.mapred.jobcontrol.Job.getAssignedJobID(Job.java:71)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:239)
at org.apache.pig.PigServer.launchPlan(PigSe...
- MAPREDUCE-2529.
Major bug reported by tgraves and fixed by tgraves (tasktracker)
Recognize Jetty bug 1342 and handle it
Added 2 new config parameters:
mapreduce.reduce.shuffle.catch.exception.stack.regex
mapreduce.reduce.shuffle.catch.exception.message.regex
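As an illustration of how regex parameters like these could be applied, the sketch below checks an exception's message and stack trace against two patterns. How Hadoop actually consumes the two settings is not stated here; requiring both patterns to match, the class name and the method name are all assumptions.

```java
import java.io.PrintWriter;
import java.io.StringWriter;
import java.util.regex.Pattern;

// Hypothetical sketch; not the TaskTracker or shuffle code.
class ShuffleExceptionMatcherSketch {
    /** True when both the message and the stack trace match their patterns. */
    static boolean matches(Throwable t, String messageRegex, String stackRegex) {
        String message = String.valueOf(t.getMessage());

        // Render the stack trace to a string so it can be matched as text.
        StringWriter stack = new StringWriter();
        t.printStackTrace(new PrintWriter(stack, true));

        // DOTALL lets ".*" span the line breaks in a multi-line stack trace.
        return Pattern.compile(messageRegex, Pattern.DOTALL).matcher(message).matches()
            && Pattern.compile(stackRegex, Pattern.DOTALL).matcher(stack.toString()).matches();
    }
}
```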
- MAPREDUCE-2527.
Major new feature reported by vicaya and fixed by vicaya (mrv2)
MR-279: Metrics for MRAppMaster
- MAPREDUCE-2522.
Major sub-task reported by sseth and fixed by sseth (mrv2)
MR 279: Security for JobHistory service
- MAPREDUCE-2521.
Major new feature reported by eyang and fixed by eyang (build)
Mapreduce RPM integration project
Created rpm and debian packages for MapReduce.
- MAPREDUCE-2518.
Major bug reported by weiyj and fixed by weiyj (distcp)
missing t flag in distcp help message '-p[rbugp]'
The 't: modification and access times' flag is defined but
missing from the distcp help message '-p[rbugp]'. It should be
changed to -p[rbugpt].
- MAPREDUCE-2517.
Major task reported by vinaythota and fixed by vinaythota (contrib/gridmix)
Porting Gridmix v3 system tests into trunk branch.
Adds system tests to Gridmix. These system tests cover various features like job types (load and sleep), user resolvers (round-robin, submitter-user, echo) and submission modes (stress, replay and serial).
- MAPREDUCE-2514.
Trivial bug reported by jeagles and fixed by jeagles (tasktracker)
ReinitTrackerAction class name misspelled RenitTrackerAction in task tracker log
- MAPREDUCE-2509.
Major bug reported by vicaya and fixed by vicaya (mrv2)
MR-279: Fix NPE in UI for pending attempts
The task attempts page gets a 500 (and NPE in the AM logs) if the attempt is pending (not running yet).
- MAPREDUCE-2504.
Major bug reported by sseth and fixed by sseth (mrv2)
MR 279: race in JobHistoryEventHandler stop
The condition to stop the eventHandling thread currently requires it to be 'stopped' AND interrupted. If an interrupt arrives after a take, but before handleEvent is called - the interrupt status ends up being handled by hadoop.util.Shell.runCommand() - which ignores it (and in the process resets the flag).
The eventHandling thread subsequently hangs on eventQueue.take()
This currently randomly fails unit tests - and can hang MR AMs.
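One way to avoid this kind of hang is to make the loop's exit depend on a flag alone and to wait on the queue with a timeout, so a swallowed interrupt only delays shutdown. The sketch below shows that pattern under stated assumptions; it is not the JobHistoryEventHandler code or the committed fix, and all names and the 100 ms poll interval are hypothetical.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of an event loop whose exit does not rely on interrupts.
class EventLoopSketch {
    private final BlockingQueue<String> eventQueue = new LinkedBlockingQueue<>();
    private volatile boolean stopped = false;
    private final Thread worker = new Thread(this::run);

    private void run() {
        // The loop exits on the flag alone, so losing the interrupt status
        // somewhere in event handling cannot leave it blocked forever.
        while (!stopped) {
            try {
                String event = eventQueue.poll(100, TimeUnit.MILLISECONDS);
                if (event != null) {
                    // handle the event here
                }
            } catch (InterruptedException e) {
                // An interrupt only speeds up wake-up; the flag is re-checked.
            }
        }
    }

    void start() {
        worker.start();
    }

    /** Requests a stop and reports whether the worker exited within the timeout. */
    boolean stop(long timeoutMillis) {
        stopped = true;
        worker.interrupt();
        try {
            worker.join(timeoutMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return !worker.isAlive();
    }
}
```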
- MAPREDUCE-2501.
Major improvement reported by vicaya and fixed by vicaya (mrv2)
MR-279: Attach sources in builds
Attach sources to builds for various reasons, one of which is better debuggability on clusters.
- MAPREDUCE-2500.
Major bug reported by sseth and fixed by sseth (mrv2)
MR 279: PB factories are not thread safe
- MAPREDUCE-2497.
Trivial bug reported by rrh and fixed by eli
missing spaces in error messages
Error message(s) are missing spaces. Here's an example output:
11/05/15 09:44:10 WARN mapred.JobClient: Error reading task outputhttp://
Generated from this line of source.
./src/mapred/org/apache/hadoop/mapred/JobClient.java: LOG.warn("Error reading task output" + ioe.getMessage());
The 1st arg to LOG.warn should end with a ' '.
There may be other instances of this problem in the source base.
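The difference is easy to see in isolation. In this small sketch the class and method names are hypothetical; only the message text comes from the log line quoted above.

```java
// Hypothetical sketch contrasting the two message formats.
class LogMessageSketch {
    /** Message as currently built: the detail runs into the preceding word. */
    static String withoutSpace(String detail) {
        return "Error reading task output" + detail;
    }

    /** Message with the trailing space added to the first argument. */
    static String withSpace(String detail) {
        return "Error reading task output " + detail;
    }
}
```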
- MAPREDUCE-2495.
Minor improvement reported by revans2 and fixed by revans2 (distributed-cache)
The distributed cache cleanup thread has no monitoring to check to see if it has died for some reason
The cleanup thread in the distributed cache handles IOExceptions and the like correctly, but just to be a bit more defensive it would be good to monitor the thread, and check that it is still alive regularly, so that the distributed cache does not fill up the entire disk on the node.
- MAPREDUCE-2494.
Major improvement reported by revans2 and fixed by revans2 (distributed-cache)
Make the distributed cache delete entries using LRU priority
Added config option mapreduce.tasktracker.cache.local.keep.pct to the TaskTracker. It is the target percentage of the local distributed cache that should be kept in between garbage collection runs. In practice it will delete unused distributed cache entries in LRU order until the size of the cache is less than mapreduce.tasktracker.cache.local.keep.pct of the maximum cache size. This is a floating point value between 0.0 and 1.0. The default is 0.95.
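The cleanup rule described above can be sketched with an access-ordered LinkedHashMap. This is an illustration only, not the TrackerDistributedCacheManager code: the class and method names are hypothetical, and every entry is treated as unused, whereas the real cache only deletes entries that are no longer referenced.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of LRU cleanup down to a target fraction of the maximum.
class LruCacheCleanupSketch {
    // accessOrder=true keeps the least recently used entry first when iterating.
    private final LinkedHashMap<String, Long> sizes = new LinkedHashMap<>(16, 0.75f, true);
    private long total = 0;

    void put(String path, long bytes) {
        Long old = sizes.put(path, bytes);
        total += bytes - (old == null ? 0 : old);
    }

    /** Marks an entry as recently used. */
    void touch(String path) {
        sizes.get(path);
    }

    boolean contains(String path) {
        return sizes.containsKey(path);
    }

    /** Deletes in LRU order until total < keepPct * maxBytes; returns bytes left. */
    long cleanup(long maxBytes, double keepPct) {
        double target = keepPct * maxBytes;
        Iterator<Map.Entry<String, Long>> it = sizes.entrySet().iterator();
        while (total >= target && it.hasNext()) {
            total -= it.next().getValue();
            it.remove();
        }
        return total;
    }
}
```

With three 40-byte entries, a 100-byte maximum and the default 0.95, a single deletion of the least recently used entry brings the cache from 120 bytes to 80 bytes, below the 95-byte target.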
- MAPREDUCE-2492.
Major improvement reported by amar_kamat and fixed by amar_kamat (task)
[MAPREDUCE] The new MapReduce API should make available task's progress to the task
Map and Reduce task can access the attempt's overall progress via TaskAttemptContext.
- MAPREDUCE-2490.
Trivial improvement reported by jeagles and fixed by jeagles (jobtracker)
Log blacklist debug count
Gain some insight into blacklist increments/decrements by enhancing the debug logging
- MAPREDUCE-2489.
Major bug reported by naisbitt and fixed by naisbitt (jobtracker)
Jobsplits with random hostnames can make the queue unusable
We saw an issue where a custom InputSplit was returning invalid hostnames for the splits that were then causing the JobTracker to attempt to excessively resolve host names. This caused a major slowdown for the JobTracker. We should prevent invalid InputSplit hostnames from affecting everyone else.
I propose we implement some verification for the hostnames to try to ensure that we only do DNS lookups on valid hostnames (and fail otherwise). We could also fail the job after a certain number...
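One possible shape for such a check is a purely syntactic filter applied before any DNS lookup. The sketch below is an assumption about what "some verification" could look like, not the check the patch implements; the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

/** Hypothetical pre-DNS filter; not the verification used by the JobTracker. */
final class HostnameCheckSketch {
    // One RFC 1123 label: letters, digits, hyphens; no leading or trailing
    // hyphen; 1 to 63 characters.
    private static final Pattern LABEL =
        Pattern.compile("[A-Za-z0-9]([A-Za-z0-9-]{0,61}[A-Za-z0-9])?");

    /** Returns true only if the name is syntactically worth resolving. */
    static boolean isPlausibleHostname(String host) {
        if (host == null || host.isEmpty() || host.length() > 253) {
            return false;
        }
        // limit -1 keeps empty labels, so "a..b" and "host." are rejected
        for (String label : host.split("\\.", -1)) {
            if (!LABEL.matcher(label).matches()) {
                return false;
            }
        }
        return true;
    }
}
```

A split whose hostnames fail this test could be rejected without touching the resolver at all, which keeps one misbehaving InputSplit from slowing down everyone else.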
- MAPREDUCE-2483.
Major bug reported by eyang and fixed by eyang (build)
Clean up duplication of dependent jar files
Removed duplicated hadoop-common library dependencies.
- MAPREDUCE-2480.
Major bug reported by vicaya and fixed by vicaya (mrv2)
MR-279: mr app should not depend on hard-coded version of shuffle
The following commit introduced a dependency of shuffle with hard-coded version for mr app:
{noformat}
commit 6f69742140516be7493c9a9177b81d0516cc9539
Author: Vinod Kumar Vavilapalli <vinodkv@apache.org>
Date: Wed May 4 06:53:52 2011 +0000
Adding user log handling for YARN. Making NM put the user-logs on DFS and providing log-dump tools. Contributed by Vinod Kumar Vavilapalli.
{noformat}
- MAPREDUCE-2478.
Major improvement reported by sseth and fixed by sseth (mrv2)
MR 279: Improve history server
Looks great. I just committed this. Thanks Siddharth!
- MAPREDUCE-2475.
Major bug reported by sureshms and fixed by sureshms (test)
Disable IPV6 for junit tests
IPV6 addresses are not currently handled in the common library methods. IPV6 can return an address as "0:0:0:0:0:0:port". Some utility methods, such as NetUtils#createSocketAddress(), NetUtils#normalizeHostName() and NetUtils#getHostNameOfIp(), to name a few, do not handle IPV6 addresses and expect the address to be of the format host:port.
Until IPV6 is formally supported, I propose disabling IPV6 for junit tests to avoid problems seen in HDFS-1891.
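A rough illustration of why host:port parsing trips over IPv6 literals: a parser that expects exactly one colon cannot tell where the host ends. The class below is a hypothetical stand-in, not the NetUtils code. The standard JVM system property java.net.preferIPv4Stack=true is one common way to keep a test JVM on IPv4; whether the patch uses exactly that property is an assumption here.

```java
/** Hypothetical naive parser; illustrates the host:port assumption only. */
final class HostPortSketch {
    /** Splits "host:port" and rejects anything with zero or several colons. */
    static String[] parse(String target) {
        int first = target.indexOf(':');
        if (first < 0 || first != target.lastIndexOf(':')) {
            throw new IllegalArgumentException("Not in host:port form: " + target);
        }
        return new String[] {
            target.substring(0, first),   // host
            target.substring(first + 1)   // port
        };
    }
}
```

An input such as "localhost:9000" parses cleanly, while an IPv6-style string with several colons is rejected because the host and port can no longer be separated.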
- MAPREDUCE-2474.
Minor improvement reported by qwertymaniac and fixed by qwertymaniac (documentation)
Add docs to the new API Partitioner on how to access Job Configuration data
Improve the Partitioner interface's docs to help fetch Job Configuration objects.
- MAPREDUCE-2473.
Major new feature reported by atm and fixed by atm (jobtracker)
MR portion of HADOOP-7214 - Hadoop /usr/bin/groups equivalent
Introduces a new command, "mapred groups", which displays what groups are associated with a user as seen by the JobTracker.
- MAPREDUCE-2470.
Major bug reported by drizzt321 and fixed by revans2 (client)
Receiving NPE occasionally on RunningJob.getCounters() call
This is running in a Java daemon that is used as an interface (Thrift) to get information and data from MR Jobs. Using JobClient.getJob(JobID) I successfully get a RunningJob object (I'm checking for NULL), and then rarely I get an NPE when I do RunningJob.getCounters(). This seems to occur after the daemon has been up and running for a while, and in the event of an Exception, I close the JobClient, set it to NULL, and a new one should then be created on the next request for data. Yet, I stil...
- MAPREDUCE-2469.
Major improvement reported by amar_kamat and fixed by amar_kamat (task)
Task counters should also report the total heap usage of the task
Task attempt's total heap usage gets recorded and published via counters as COMMITTED_HEAP_BYTES.
- MAPREDUCE-2467.
Major bug reported by sureshms and fixed by sureshms (contrib/raid)
HDFS-1052 changes break the raid contrib module in MapReduce
Raid contrib module requires changes to work with the federation changes made in HDFS-1052.
- MAPREDUCE-2466.
Blocker bug reported by tlipcon and fixed by tlipcon
TestFileInputFormat.testLocality failing after federation merge
This test is failing, I believe due to federation merge. It's only finding one location for the test file instead of the expected two.
- MAPREDUCE-2463.
Major bug reported by devaraj.k and fixed by devaraj.k (jobtracker)
Job History files are not moving to done folder when job history location is hdfs location
If "mapreduce.jobtracker.jobhistory.location" is configured as an HDFS location, then either during initialization of the JobTracker (while moving old job history files) or after completion of the job, history files are not moved to the done folder and the following exception is thrown.
{code:xml}
2011-04-29 15:27:27,813 ERROR org.apache.hadoop.mapreduce.jobhistory.JobHistory: Unable to move history file to DONE folder.
java.lang.IllegalArgumentException: Wrong FS: hdfs://10.18.52.146:9000/history/job_201104291518...
- MAPREDUCE-2462.
Minor improvement reported by sseth and fixed by sseth (mrv2)
MR 279: Write job conf along with JobHistory, other minor improvements
Write the job xml along with the job history file. Split some common functionality into a helper class, etc.
- MAPREDUCE-2460.
Blocker bug reported by tlipcon and fixed by tlipcon
TestFairSchedulerSystem failing on Hudson
Seems to have been failing for a while. For example: https://hudson.apache.org/hudson/job/Hadoop-Mapreduce-trunk/655/testReport/junit/org.apache.hadoop.mapred/TestFairSchedulerSystem/testFairSchedulerSystem/
- MAPREDUCE-2459.
Major improvement reported by macyang and fixed by macyang (harchive)
Cache HAR filesystem metadata
Each HAR file system has two index files that contain information on how files are stored in the part files. During the block location calculation, these indexes are reread for every file in the archive. Caching the indexes and the status of the part files will greatly reduce the number of name node operations during the job setup time.
- MAPREDUCE-2458.
Major bug reported by vicaya and fixed by vicaya (mrv2)
MR-279: Rename sanitized pom.xml in build directory to work around IDE bug
The sanitized pom.xml in target directory apparently triggered a bug in NetBeans (http://netbeans.org/bugzilla/show_bug.cgi?id=198162) causing it to fail to recognize the generated sources. The work-around is to rename the generated pom.xml to saner-pom.xml
- MAPREDUCE-2456.
Trivial improvement reported by naisbitt and fixed by naisbitt (jobtracker)
Show the reducer taskid and map/reduce tasktrackers for "Failed fetch notification #_ for task attempt..." log messages
This jira is to provide more useful log information for debugging the "Too many fetch-failures" error.
Looking at the JobTracker node, we see messages like this:
"2010-12-14 00:00:06,911 INFO org.apache.hadoop.mapred.JobInProgress: Failed fetch notification #8 for task
attempt_201011300729_189729_m_007458_0".
It would be useful to see which reducer is reporting the error here.
So, I propose we add the following to these log messages:
1. reduce task ID
2. TaskTracker nodenames for both t...
- MAPREDUCE-2455.
Major sub-task reported by tomwhite and fixed by tomwhite (build, client)
Remove deprecated JobTracker.State in favour of JobTrackerStatus
MAPREDUCE-2337 deprecated getJobTrackerState() on ClusterStatus, this issue is to remove the getter (in favour of getJobTrackerStatus(), which will remain) so there is no longer a direct dependency of the public API on JobTracker. This is for MAPREDUCE-1638.
- MAPREDUCE-2452.
Major bug reported by devaraj and fixed by devaraj (jobtracker)
Delegation token cancellation shouldn't hold global JobTracker lock
Currently, when the JobTracker cancels a job's delegation token (at the end of the job), it holds the global lock. This is not desired.
- MAPREDUCE-2451.
Trivial bug reported by tgraves and fixed by tgraves (jobtracker)
Log the reason string of healthcheck script
The information on why a specific TaskTracker got blacklisted is not stored anywhere. The jobtracker web ui will show the detailed reason string until the TT gets unblacklisted. After that it is lost.
- MAPREDUCE-2449.
Minor improvement reported by jzemerick and fixed by jzemerick (contrib/eclipse-plugin)
Allow for command line arguments when performing "Run on Hadoop" action.
It is currently not possible to specify command line arguments when creating a run configuration for "Run on Hadoop." This patch adds a text box to the RunOnHadoopWizard dialog for providing command line arguments. The arguments are then stored as part of the run configuration. Additionally (as a result), this patch prevents the creation of duplicate run configurations by first checking whether the original configuration has been changed.
- MAPREDUCE-2440.
Major bug reported by vicaya and fixed by vicaya (mrv2)
MR-279: Name clashes in TypeConverter
public static TaskTrackerInfo[] fromYarn(List<NodeManagerInfo> nodes) has the same erasure as
public static JobStatus[] fromYarn(List<Application> applications)
Not detected by the current JDK 6 but still wrong according to the JLS 8.4.2.
See also: http://bugs.sun.com/view_bug.do?bug_id=6182950
The patch renames the former signature to fromYarnNodes and the latter to fromYarnApps.
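The clash and the rename can be sketched as below. The nested NodeManagerInfo and Application classes are hypothetical stand-ins for the real record types, and the methods return String[] instead of TaskTrackerInfo[] and JobStatus[] to keep the example self-contained.

```java
import java.util.List;

/** Hypothetical stand-in for TypeConverter; shows only the naming issue. */
final class TypeConverterSketch {
    static final class NodeManagerInfo {
        final String host;
        NodeManagerInfo(String host) { this.host = host; }
    }

    static final class Application {
        final String name;
        Application(String name) { this.name = name; }
    }

    // If both methods were named fromYarn, their parameter lists would erase
    // to the same raw type, fromYarn(List), which is the clash described
    // above. Distinct names make the two methods unambiguous after erasure.
    static String[] fromYarnNodes(List<NodeManagerInfo> nodes) {
        String[] out = new String[nodes.size()];
        for (int i = 0; i < out.length; i++) {
            out[i] = "tracker:" + nodes.get(i).host;
        }
        return out;
    }

    static String[] fromYarnApps(List<Application> applications) {
        String[] out = new String[applications.size()];
        for (int i = 0; i < out.length; i++) {
            out[i] = "job:" + applications.get(i).name;
        }
        return out;
    }
}
```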
- MAPREDUCE-2439.
Major bug reported by mahadev and fixed by sseth (mrv2)
MR-279: Fix YarnRemoteException to give more details.
Fix YarnRemoteException to add more details.
- MAPREDUCE-2438.
Major new feature reported by mahadev and fixed by ramach (mrv2)
MR-279: WebApp for Job History
Add webapp for job history server in MR-279 branch.
- MAPREDUCE-2434.
Major new feature reported by vicaya and fixed by vicaya (mrv2)
MR-279: ResourceManager metrics
I just committed this. Thanks Luke!
- MAPREDUCE-2433.
Blocker bug reported by vicaya and fixed by mahadev (mrv2)
MR-279: YARNApplicationConstants hard code app master jar version
YARNApplicationConstants hard code version string in HADOOP_MAPREDUCE_CLIENT_APP_JAR_NAME and consequently YARN_MAPREDUCE_APP_JAR_PATH
This is a blocker.
- MAPREDUCE-2432.
Major improvement reported by vicaya and fixed by vicaya (mrv2)
MR-279: Install sanitized poms for downstream sanity
Due to [MNG-4223|http://jira.codehaus.org/browse/MNG-4223], the installed POMs of MR-279 are downstream hostile. E.g., it's impossible to use versions of hadoop-mapreduce-client-core.version in ivy other than 1.0-SNAPSHOT without changing the multiple POMs, rendering the version properties (hadoop-mapreduce.version and yarn.version) practically useless.
This patch will install POMs with version (only) properties expanded. This patch also uses inheritance and dependencyManagement to make POMs D...
- MAPREDUCE-2430.
Major task reported by nidaley and fixed by nidaley
Remove mrunit contrib
MRUnit is now available as a separate Apache project.
- MAPREDUCE-2429.
Major bug reported by acmurthy and fixed by sseth (tasktracker)
Check jvmid during task status report
Currently the TT doesn't check to ensure that the jvmid is relevant during communication with the Child via TaskUmbilicalProtocol.
- MAPREDUCE-2428.
Blocker bug reported by tomwhite and fixed by tomwhite
start-mapred.sh script fails if HADOOP_HOME is not set
MapReduce portion of HADOOP-6953
- MAPREDUCE-2426.
Trivial test reported by tlipcon and fixed by tlipcon (contrib/fair-share)
Make TestFairSchedulerSystem fail with more verbose output
The TestFairSchedulerSystem test failed here: https://hudson.apache.org/hudson/job/Hadoop-Mapreduce-trunk/644/testReport/junit/org.apache.hadoop.mapred/TestFairSchedulerSystem/testFairSchedulerSystem/
with a failed assertion {{assertTrue(contents.contains("</svg>"));}}. We should make the assertion failure include the value of {{contents}}
- MAPREDUCE-2424.
Major improvement reported by roelofs and fixed by roelofs (mrv2)
MR-279: counters/UI/etc. for uber-AppMaster (in-cluster LocalJobRunner for MRv2)
Polish uber-AM (MAPREDUCE-2405). Specifically:
* uber-specific counters ("command-line UI")
* GUI indicators
** RM all-containers level
** multi-job app level [if exists]
** single-job level
* fix uber-decision ("is this a small job?"):
** memory criterion
** input-bytes criterion
* disable speculation
* isUber() method (somewhere) for unit tests to use
* delete (most of) old UberTask code (MAPREDUCE-1220; came in with initial MR-279 branch)
* implement non-RPC, local version of umbilical
* ...
- MAPREDUCE-2422.
Major sub-task reported by tomwhite and fixed by tomwhite (client)
Removed unused internal methods from DistributedCache
DistributedCache has a number of deprecated methods that are no longer used ever since TrackerDistributedCacheManager was introduced in MAPREDUCE-476. Removing these methods (which are not user-facing) will make it possible to complete MAPREDUCE-1638 by keeping DistributedCache in the API tree, and TrackerDistributedCacheManager, TaskDistributedCacheManager in the implementation tree.
- MAPREDUCE-2420.
Major bug reported by boryas and fixed by boryas
JobTracker should be able to renew delegation token over HTTP
In case the JobTracker has to talk to a NameNode running a different version (RPC version mismatch), the JobTracker should be able to fall back to HTTP renewal.
Example of the case: running distcp between different versions using hftp.
- MAPREDUCE-2417.
Major bug reported by ravidotg and fixed by ravidotg (contrib/gridmix)
In Gridmix, in RoundRobinUserResolver mode, the testing/proxy users are not associated with unique users in a trace
Fixes Gridmix in RoundRobinUserResolver mode to map testing/proxy users to unique users in a trace.
- MAPREDUCE-2416.
Major bug reported by ravidotg and fixed by ravidotg (contrib/gridmix)
In Gridmix, in RoundRobinUserResolver, the list of groups for a user obtained from users-list-file is incorrect
Removes the restriction of specifying group names in users-list file for Gridmix in RoundRobinUserResolver mode.
- MAPREDUCE-2414.
Major improvement reported by acmurthy and fixed by sseth (mrv2)
MR-279: Use generic interfaces for protocols
Use generic interfaces for protocols for MAPREDUCE-279.
- MAPREDUCE-2409.
Major bug reported by sseth and fixed by sseth (distributed-cache)
Distributed Cache does not differentiate between file /archive for files with the same path
If a 'global' file is specified as a 'file' by one job - subsequent jobs cannot override this source file to be an 'archive' (until the TT cleans up its cache or a TT restart).
The other way around as well -> 'archive' to 'file'
In case of an accidental submission using the wrong type - some of the tasks for the second job will end up seeing the source file as an archive, others as a file.
- MAPREDUCE-2408.
Major new feature reported by ravidotg and fixed by amar_kamat (contrib/gridmix)
Make Gridmix emulate usage of data compression
Emulates the MapReduce compression feature in Gridmix. By default, compression emulation is turned on. Compression emulation can be disabled by setting 'gridmix.compression-emulation.enable' to 'false'. Use 'gridmix.compression-emulation.map-input.decompression-ratio', 'gridmix.compression-emulation.map-output.compression-ratio' and 'gridmix.compression-emulation.reduce-output.compression-ratio' to configure the compression ratios at map input, map output and reduce output side respectively. Currently, compression ratios in the range [0.07, 0.68] are supported. Gridmix auto detects whether map-input, map output and reduce output should emulate compression based on original job's compression related configuration parameters.
- MAPREDUCE-2407.
Major new feature reported by ravidotg and fixed by ravidotg (contrib/gridmix)
Make Gridmix emulate usage of Distributed Cache files
Makes Gridmix emulate HDFS based distributed cache files and local file system based distributed cache files.
- MAPREDUCE-2405.
Major improvement reported by mahadev and fixed by roelofs (mrv2)
MR-279: Implement uber-AppMaster (in-cluster LocalJobRunner for MRv2)
An efficient implementation of small jobs that runs all tasks in the MR ApplicationMaster JVM, thereby achieving lower latency.
- MAPREDUCE-2403.
Major improvement reported by mahadev and fixed by ramach (mrv2)
MR-279: Improve job history event handling in AM to log to HDFS
Improve the job history event handling in the application master to log to HDFS in the staging directory for the job and also move it to the required location for the job history server to use.
- MAPREDUCE-2395.
Critical bug reported by tlipcon and fixed by rvadali (contrib/raid)
TestBlockFixer timing out on trunk
In recent Hudson builds, TestBlockFixer has been timing out. Not clear how long it has been broken since MAPREDUCE-2394 was hiding the RAID tests from Hudson's test result parsing.
- MAPREDUCE-2381.
Major improvement reported by philip and fixed by philip
JobTracker instrumentation not consistent about error handling
In the current code, if the class specified by the JobTracker instrumentation config property is not there, the JobTracker fails to start with a ClassNotFound. If it's there, but it can't load for whatever reason, the JobTracker continues with the default. Having two different error-handling routes is a bit confusing; I propose to move one line so that it's consistent. (On the TaskTracker instrumentation side, if any of the multiple instrumentations aren't available, the default is used.)
...
- MAPREDUCE-2379.
Major bug reported by tlipcon and fixed by tlipcon (distributed-cache, documentation)
Distributed cache sizing configurations are missing from mapred-default.xml
* MAPREDUCE-1538 added {{mapreduce.tasktracker.cache.local.numberdirectories}} which is not documented in mapred-default.xml
* When MAPREDUCE-711 moved DistributedCache into the mapred project, the {{local.cache.size}} parameter was left in core-default.xml instead of moved to mapred-default.xml. It has since been renamed to {{mapreduce.tasktracker.cache.local.size}}
- MAPREDUCE-2367.
Minor improvement reported by tlipcon and fixed by tlipcon
Allow using a file to exclude certain tests from build
It would be nice to be able to exclude certain tests when running builds. For example, when a test is "known flaky", you may want to exclude it from the main Hudson job, but not actually disable it in the codebase (so that it still runs as part of another Hudson job, for example).
- MAPREDUCE-2365.
Major bug reported by owen.omalley and fixed by sseth
Add counters for FileInputFormat (BYTES_READ) and FileOutputFormat (BYTES_WRITTEN)
MAP_INPUT_BYTES and MAP_OUTPUT_BYTES will be computed using the difference between FileSystem
counters before and after each next(K,V) and collect/write op.
In case compression is being used, these counters will represent the compressed data sizes. The uncompressed size will
not be available.
This is not a direct back-port of 5710. (Counters will be computed in MapTask instead of in individual RecordReaders).
0.20.100 ->
New API -> MAP_INPUT_BYTES will be computed using this method
O...
- MAPREDUCE-2351.
Major improvement reported by tomwhite and fixed by tomwhite
mapred.job.tracker.history.completed.location should support an arbitrary filesystem URI
Currently, mapred.job.tracker.history.completed.location is resolved relative to the default filesystem. If not set it defaults to history/done in the local log directory. There is no way to set it to another local filesystem location (with a file:// URI) or an arbitrary Hadoop filesystem.
- MAPREDUCE-2331.
Major test reported by tlipcon and fixed by tlipcon
Add coverage of task graph servlet to fair scheduler system test
Would be useful to hit the TaskGraph servlet in the fair scheduler system test. This way, when run under JCarder, it will check for any lock inversions in this code.
- MAPREDUCE-2326.
Major improvement reported by acmurthy and fixed by
Port gridmix changes from hadoop-0.20.100 to trunk
We have some changes to gridmix in hadoop-0.20.100. Uber jira to track merges to trunk.
- MAPREDUCE-2323.
Major new feature reported by tlipcon and fixed by tlipcon (contrib/fair-share)
Add metrics to the fair scheduler
It would be useful to be able to monitor various metrics in the fair scheduler, like demand, fair share, min share, and running task count.
- MAPREDUCE-2317.
Minor bug reported by devaraj.k and fixed by devaraj.k (harchive)
HadoopArchives throwing NullPointerException while creating hadoop archives (.har files)
While trying to run the hadoop archive tool on Windows in the following way, we get the exception below.
java org.apache.hadoop.tools.HadoopArchives -archiveName temp.har D:/test/in E:/temp
{code:xml}
java.lang.NullPointerException
at org.apache.hadoop.tools.HadoopArchives.writeTopLevelDirs(HadoopArchives.java:320)
at org.apache.hadoop.tools.HadoopArchives.archive(HadoopArchives.java:386)
at org.apache.hadoop.tools.HadoopArchives.run(HadoopArchives.java:725)
at org.apache.hadoop.uti...
- MAPREDUCE-2311.
Blocker bug reported by tlipcon and fixed by schen (contrib/fair-share)
TestFairScheduler failing on trunk
Most of the test cases in this test are failing on trunk. It is unclear for how long, since the contrib tests weren't running while the core tests were failing.
- MAPREDUCE-2307.
Minor bug reported by devaraj.k and fixed by devaraj.k (contrib/fair-share)
Exception thrown in Jobtracker logs, when the Scheduler configured is FairScheduler.
If we try to start the JobTracker with the fair scheduler using the default configuration, it gives the exception below.
{code:xml}
2010-07-03 10:18:27,142 INFO org.apache.hadoop.ipc.Server: IPC Server handler 2 on 9001: starting
2010-07-03 10:18:27,143 INFO org.apache.hadoop.ipc.Server: IPC Server handler 3 on 9001: starting
2010-07-03 10:18:27,143 INFO org.apache.hadoop.ipc.Server: IPC Server handler 4 on 9001: starting
2010-07-03 10:18:27,143 INFO org.apache.hadoop.ipc.Server: IPC Serv...
- MAPREDUCE-2302.
Major improvement reported by schen and fixed by schen (contrib/raid)
Add static factory methods in GaloisField
GaloisField is immutable and should be reused after creation to avoid redundant calculation of the multiplication and division tables.
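A caching static factory of this kind might look roughly like the sketch below. FieldTableSketch is a hypothetical stand-in, its table contents are placeholders rather than real Galois field arithmetic, and the actual GaloisField factory signature may differ.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical immutable object with expensive tables and a cached factory. */
final class FieldTableSketch {
    private static final Map<Integer, FieldTableSketch> INSTANCES =
        new HashMap<Integer, FieldTableSketch>();
    static int tablesBuilt = 0; // counts how often the expensive setup runs

    final int fieldSize;
    private final int[] table; // placeholder for multiplication/division tables

    private FieldTableSketch(int fieldSize) {
        this.fieldSize = fieldSize;
        this.table = new int[fieldSize];
        for (int i = 0; i < fieldSize; i++) {
            table[i] = i; // placeholder values only
        }
        tablesBuilt++;
    }

    /** Returns the shared instance for a field size, building it at most once. */
    static synchronized FieldTableSketch getInstance(int fieldSize) {
        FieldTableSketch field = INSTANCES.get(fieldSize);
        if (field == null) {
            field = new FieldTableSketch(fieldSize);
            INSTANCES.put(fieldSize, field);
        }
        return field;
    }
}
```

Because the object is immutable, handing the same instance to every caller is safe, and the private constructor ensures the tables are never rebuilt for a size that is already cached.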
- MAPREDUCE-2290.
Major bug reported by eli and fixed by eli (test)
TestTaskCommit missing getProtocolSignature override
Fixes an MR compilation error: HADOOP-6904 added a new implementation of getProtocolSignature, but TestTaskCommit doesn't override it.
- MAPREDUCE-2271.
Blocker bug reported by tlipcon and fixed by liangly (jobtracker)
TestSetupTaskScheduling failing in trunk
This test case is failing in trunk after the commit of MAPREDUCE-2207
- MAPREDUCE-2263.
Major improvement reported by hairong and fixed by hairong
MapReduce side of HADOOP-6904
Make changes in Map/Reduce to incorporate HADOOP-6904.
- MAPREDUCE-2260.
Major improvement reported by rvs and fixed by rvs (build)
Remove auto-generated native build files
The native build, when run from trunk, now requires autotools, libtool and the openssl dev libraries.
- MAPREDUCE-2258.
Major bug reported by tlipcon and fixed by tlipcon (task)
IFile reader closes stream and compressor in wrong order
In IFile.Reader.close(), we return the decompressor to the pool and then call close() on the input stream. This is backwards and causes a rare race in the case of LzopCodec, since LzopInputStream makes a few calls on the decompressor object inside close(). If another thread pulls the decompressor out of the pool and starts to use it in the meantime, the first thread's close() will cause the second thread to potentially miss pieces of data.
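The corrected ordering can be sketched as follows. All names are hypothetical stand-ins (this is not the IFile.Reader code); the event list exists only to make the order observable.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical reader showing stream-then-pool close ordering. */
final class ReaderCloseOrderSketch {
    final List<String> events = new ArrayList<String>();

    /** Stand-in for a codec stream that still uses its decompressor in close(). */
    private final class FakeCodecStream {
        void close() {
            events.add("stream.close");
        }
    }

    private final FakeCodecStream in = new FakeCodecStream();

    /** Stand-in for handing the decompressor back to a shared pool. */
    private void returnDecompressorToPool() {
        events.add("pool.return");
    }

    void close() {
        try {
            in.close();                 // 1. stream makes its last decompressor calls
        } finally {
            returnDecompressorToPool(); // 2. only now may another thread borrow it
        }
    }
}
```

Closing the stream first means no other thread can obtain the decompressor while the first thread's close() is still calling into it.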
- MAPREDUCE-2254.
Major improvement reported by ahmed.radwan and fixed by ahmed.radwan
Allow setting of end-of-record delimiter for TextInputFormat
TextInputFormat may now split lines with delimiters other than newline, by specifying a configuration parameter "textinputformat.record.delimiter"
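The effect of a custom delimiter can be illustrated with a small stand-alone splitter. This is a sketch of the record-splitting idea only, not the TextInputFormat implementation; the class and method names are hypothetical. In a real job the delimiter would be supplied through the "textinputformat.record.delimiter" parameter named above.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical splitter; shows records ending at a custom delimiter. */
final class RecordDelimiterSketch {
    /** Splits text into records on a literal delimiter instead of newline. */
    static List<String> records(String text, String delimiter) {
        if (delimiter.isEmpty()) {
            throw new IllegalArgumentException("delimiter must not be empty");
        }
        List<String> out = new ArrayList<String>();
        int start = 0;
        int hit;
        while ((hit = text.indexOf(delimiter, start)) >= 0) {
            out.add(text.substring(start, hit));
            start = hit + delimiter.length();
        }
        if (start < text.length()) {
            out.add(text.substring(start)); // trailing record without a delimiter
        }
        return out;
    }
}
```

With "|" as the delimiter, newlines inside the input stay part of the record, so a single record may span several lines.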
- MAPREDUCE-2250.
Trivial improvement reported by rvadali and fixed by rvadali (contrib/raid)
Fix logging in raid code.
There are quite a few error messages being logged with a log level of info. That should be fixed to help debugging.
- MAPREDUCE-2249.
Major improvement reported by kam_iitkgp and fixed by devaraj.k
Better to check the reflexive property of the object while overriding equals method of it
It is better to check the reflexive property of the object while overriding equals method of it.
It improves the performance when a heavy object is compared to itself.
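A minimal sketch of the pattern, using a hypothetical HeavyKeySketch class whose field-by-field comparison is expensive. The static counter is there only to show when the expensive path is skipped.

```java
import java.util.Arrays;

/** Hypothetical heavy object; equals() short-circuits on identity. */
final class HeavyKeySketch {
    static int deepComparisons = 0; // counts runs of the expensive comparison
    private final long[] payload;

    HeavyKeySketch(long[] payload) {
        this.payload = payload.clone();
    }

    @Override
    public boolean equals(Object obj) {
        if (this == obj) {
            return true; // reflexive shortcut: no need to compare the payload
        }
        if (!(obj instanceof HeavyKeySketch)) {
            return false;
        }
        deepComparisons++;
        return Arrays.equals(payload, ((HeavyKeySketch) obj).payload);
    }

    @Override
    public int hashCode() {
        return Arrays.hashCode(payload);
    }
}
```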
- MAPREDUCE-2248.
Major improvement reported by rvadali and fixed by rvadali
DistributedRaidFileSystem should unraid only the corrupt block
DistributedRaidFileSystem unraids the entire file if it hits a corrupt block. It is better to unraid just the corrupt block and use the rest of the file as normal. This becomes really important when we have terabyte-sized files.
- MAPREDUCE-2243.
Minor improvement reported by kam_iitkgp and fixed by devaraj.k (jobtracker, tasktracker)
Close all the file streams properly in a finally block to avoid their leakage.
In the following classes, streams should be closed in a finally block to avoid their leakage in exceptional cases.
CompletedJobStatusStore.java
------------------------------------------
dataOut.writeInt(events.length);
for (TaskCompletionEvent event : events) {
event.write(dataOut);
}
dataOut.close() ;
EventWriter.java
----------------------
encoder.flush();
out.close();
MapTask.java
-------------------
splitMetaInfo.write(out);
out....
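The general shape of the fix, sketched on a hypothetical helper rather than on the classes listed above: the close() call moves into a finally block so that it also runs when a write throws.

```java
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.OutputStream;

/** Hypothetical writer; shows the close-in-finally pattern only. */
final class CloseInFinallySketch {
    /** Writes a length-prefixed list of ints and always closes the stream. */
    static void writeEvents(OutputStream raw, int[] events) throws IOException {
        DataOutputStream dataOut = new DataOutputStream(raw);
        try {
            dataOut.writeInt(events.length);
            for (int event : events) {
                dataOut.writeInt(event);
            }
        } finally {
            dataOut.close(); // runs on both the normal and the exceptional path
        }
    }
}
```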
- MAPREDUCE-2239.
Major improvement reported by schen and fixed by schen (contrib/raid)
BlockPlacementPolicyRaid should call getBlockLocations only when necessary
Currently BlockPlacementPolicyRaid calls getBlockLocations for every chooseTarget().
This puts pressure on NameNode. We should avoid calling if this file is not raided or a parity file.
- MAPREDUCE-2225.
Blocker improvement reported by qwertymaniac and fixed by qwertymaniac (job submission)
MultipleOutputs should not require the use of 'Writable'
MultipleOutputs should not require the use/check of 'Writable' interfaces in key and value classes.
- MAPREDUCE-2215.
Major bug reported by pkling and fixed by pkling (contrib/raid)
A more elegant FileSystem#listCorruptFileBlocks API (RAID changes)
Map/reduce changes related to HADOOP-7060 and HDFS-1533.
- MAPREDUCE-2207.
Major improvement reported by schen and fixed by liangly (jobtracker)
Task-cleanup task should not be scheduled on the node that the task just failed
Task-cleanup task should not be scheduled on the node that the task just failed
- MAPREDUCE-2206.
Major improvement reported by schen and fixed by schen (jobtracker)
The task-cleanup tasks should be optional
For jobs that do not use OutputCommitter.abort(), it should be possible to turn this off.
This improves the latency of the job because failed tasks are often the bottleneck of the jobs.
- MAPREDUCE-2203.
Trivial improvement reported by yaojingguo and fixed by yaojingguo
Wrong javadoc for TaskRunner's appendJobJarClasspaths method
"{@link Configuration.getJar()})" should be "{@link JobConf.getJar()})"
- MAPREDUCE-2202.
Major improvement reported by cos and fixed by cos
Generalize CLITest structure and interfaces to facilitate upstream adoption (e.g. for web or system testing)
Counterpart of HADOOP-7014 and HDFS-1486
- MAPREDUCE-2199.
Major bug reported by cos and fixed by cos (build)
build is broken after 0.22 branch creation
hdfs and common dep versions weren't updated properly.
- MAPREDUCE-2185.
Major bug reported by hairong and fixed by rvadali (job submission)
Infinite loop at creating splits using CombineFileInputFormat
This is caused by a missing block in HDFS, so the block's locations are empty. The following code adds the block to the blockToNodes map but not to the rackToBlocks map. Later on, when generating splits, only blocks in rackToBlocks are removed from the blockToNodes map. So the blockToNodes map can never become empty, which causes an infinite loop.
{code}
// add this block to the block --> node locations map
blockToNodes.put(oneblock, oneblock.hosts);
// add this block to the ra...
- MAPREDUCE-2172.
Major bug reported by pkling and fixed by nidaley
test-patch.properties contains incorrect/version-dependent values of OK_FINDBUGS_WARNINGS and OK_RELEASEAUDIT_WARNINGS
Running ant test-patch with an empty patch yields 25 findbugs warnings and 3 release audit warnings (rather than the 0 findbugs warnings and 1 release audit warning specified in test-patch.properties):
{code}
[exec] -1 overall.
[exec]
[exec] +1 @author. The patch does not contain any @author tags.
[exec]
[exec] -1 tests included. The patch doesn't appear to include any new or modified tests.
[exec] Please justify why no new tests are needed for this patch...
- MAPREDUCE-2156.
Major improvement reported by pkling and fixed by pkling (contrib/raid)
Raid-aware FSCK
Currently, FSCK reports files as corrupt even if they can be fixed using parity blocks. We need a tool that only reports files that are irreparably corrupt (i.e., files for which too many data or parity blocks belonging to the same stripe have been lost or corrupted).
- MAPREDUCE-2155.
Major improvement reported by pkling and fixed by pkling (contrib/raid)
RaidNode should optionally dispatch map reduce jobs to fix corrupt blocks (instead of fixing locally)
Recomputing blocks based on parity information is expensive. Rather than doing this locally at the RaidNode, we should run map reduce jobs. This will allow us to quickly fix a large number of corrupt or missing blocks.
- MAPREDUCE-2153.
Major improvement reported by ravidotg and fixed by rajesh.balamohan (tools/rumen)
Bring in more job configuration properties in to the trace file
Adds job configuration parameters to the job trace. The configuration parameters are stored under the 'jobProperties' field as key-value pairs.
- MAPREDUCE-2137.
Major bug reported by ravidotg and fixed by ravidotg (contrib/gridmix)
Mapping between Gridmix jobs and the corresponding original MR jobs is needed
New configuration properties gridmix.job.original-job-id and gridmix.job.original-job-name in the configuration of a simulated job are exposed/documented to the Gridmix user for mapping between the original cluster's jobs and simulated jobs.
- MAPREDUCE-2127.
Major bug reported by gkesavan and fixed by bmahe (build, pipes)
mapreduce trunk builds are failing on hudson
https://hudson.apache.org/hudson/job/Hadoop-Mapreduce-trunk-Commit/507/console
[exec] checking for pthread.h... yes
[exec] checking for pthread_create in -lpthread... yes
[exec] checking for HMAC_Init in -lssl... no
[exec] configure: error: Cannot find libssl.so
[exec] /grid/0/hudson/hudson-slave/workspace/Hadoop-Mapreduce-trunk-Commit/trunk/src/c++/pipes/configure: line 4250: exit: please: numeric argument required
[exec] /grid/0/hudson/hudson-slave/workspace/Hadoop...
- MAPREDUCE-2107.
Major improvement reported by ranjit and fixed by amar_kamat (contrib/gridmix)
Emulate Memory Usage of Tasks in GridMix3
Adds total heap usage emulation to Gridmix. Also, Gridmix can configure the simulated task's JVM heap options with max heap options obtained from the original task (via Rumen). Use 'gridmix.task.jvm-options.enable' to disable the task max heap options configuration.
- MAPREDUCE-2106.
Major improvement reported by ranjit and fixed by amar_kamat (contrib/gridmix)
Emulate CPU Usage of Tasks in GridMix3
Adds cumulative cpu usage emulation to Gridmix
- MAPREDUCE-2105.
Major improvement reported by ranjit and fixed by amar_kamat (contrib/gridmix)
Simulate Load Incrementally and Adaptively in GridMix3
Tasks launched by GridMix3 should incrementally and adaptively simulate load (I/O, CPU, memory, etc.) rather than doing
everything upfront and then sleeping. This helps in evening out the load when fine-grained information from the original
Task is not available and greater accuracy when it is.
By "incremental" I mean having several iterations corresponding to appropriate phases/time-slices. By "adaptive" I mean
taking the existing load into account before inflicting additional load to meet ...
- MAPREDUCE-2104.
Major bug reported by ranjit and fixed by amar_kamat (tools/rumen)
Rumen TraceBuilder Does Not Emit CPU/Memory Usage Details in Traces
Adds cpu, physical memory, virtual memory and heap usages to TraceBuilder's output.
- MAPREDUCE-2081.
Major test reported by vinaythota and fixed by vinaythota (contrib/gridmix)
[GridMix3] Implement functionality for getting the list of job traces which have different intervals.
Gridmix system tests require different job traces with different time intervals for generating and submitting the Gridmix jobs. So, implement functionality for getting the job traces and arranging them in a hash table with the time interval as key. Also get the list of traces from the resource location irrespective of time. The following methods need to be implemented.
Method signature:
public static Map <String, String> getMRTraces(Configuration conf) throws IOException; - it get the traces with time...
- MAPREDUCE-2074.
Minor bug reported by knoguchi and fixed by priyomustafi (distributed-cache)
Task should fail when symlink creation fail
If I pass an invalid symlink such as -Dmapred.cache.files=/user/knoguchi/onerecord.txt#abc/abc,
the task only reports a WARN and goes on.
{noformat}
2010-09-16 21:38:49,782 INFO org.apache.hadoop.mapred.TaskRunner: Creating symlink: /0/tmp/mapred-local/taskTracker/knoguchi/distcache/-5031501808205559510_-128488332_1354038698/abc-nn1.def.com/user/knoguchi/onerecord.txt <- /0/tmp/mapred-local/taskTracker/knoguchi/jobcache/job_201008310107_15105/attempt_201008310107_15105_m_000000_0/work/./abc/abc
20...
- MAPREDUCE-2053.
Major task reported by vinaythota and fixed by vinaythota (contrib/gridmix)
[Herriot] Test Gridmix file pool for different input file sizes based on pool minimum size.
Scenario:
1. Generate 1.8G of data with the Gridmix data generator, such that files are created both under different folders inside the given input directory and directly in the given input directory, with the following sizes: {50 MB, 100 MB, 400 MB, 50 MB, 300 MB, 10 MB, 60 MB, 40 MB, 20 MB, 10 MB, 500 MB}.
2. Set the FilePool minimum size to 100 MB.
3. Verify the file count and sizes after excluding the files that are smaller than the file pool minimum size. Also make sure, whether files are col...
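The exclusion step in this scenario can be sketched as follows. This is only an illustration under stated assumptions: the class and method names are hypothetical and are not part of the Gridmix FilePool code, and a file whose size equals the minimum is assumed to be kept, since the scenario only excludes files that are less than the minimum.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not the Gridmix FilePool implementation.
class FilePoolFilterSketch {
    // Returns the sizes (in MB) that survive the minimum-size filter.
    // Assumption: a file exactly at the minimum size is kept.
    static List<Integer> eligibleSizesMb(int[] sizesMb, int minSizeMb) {
        List<Integer> kept = new ArrayList<>();
        for (int size : sizesMb) {
            if (size >= minSizeMb) {
                kept.add(size);
            }
        }
        return kept;
    }

    // Sums the surviving sizes so a test can compare the total against expectations.
    static long totalMb(List<Integer> sizesMb) {
        long total = 0;
        for (int size : sizesMb) {
            total += size;
        }
        return total;
    }
}
```

With the sizes listed in step 1 and a 100 MB minimum, the filter keeps the 100 MB, 400 MB, 300 MB and 500 MB files.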
- MAPREDUCE-2037.
Major new feature reported by dking and fixed by dking
Capturing interim progress times, CPU usage, and memory usage, when tasks reach certain progress thresholds
Capture intermediate task resource consumption information:
* Time taken so far
* CPU load [either at the time the data are taken, or exponentially smoothed]
* Memory load [also either at the time the data are taken, or exponentially smoothed]
This would be taken at intervals that depend on the task progress plateaus. For example, reducers have three progress ranges - [0-1/3], (1/3-2/3], and (2/3-3/3] - where fundamentally different activities happen. Mappers have different boundaries that are not symmetrically placed [0-9/10], (9/10-1]. Data capture boundaries should coincide with activity boundaries. For the state information capture [CPU and memory] we should average over the covered interval.
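A minimal sketch of how a progress value could be mapped onto the activity ranges described above, and how state samples could be averaged over one range. The class, constants and method names are hypothetical and are not taken from the MapReduce code; only the range boundaries come from the description.

```java
// Hypothetical illustration, not the MapReduce implementation.
class ProgressRangesSketch {
    // Upper bounds of the progress ranges quoted in the description above.
    static final double[] REDUCER_BOUNDS = {1.0 / 3, 2.0 / 3, 1.0};
    static final double[] MAPPER_BOUNDS = {0.9, 1.0};

    // Returns the index of the range that contains the given progress value.
    // The first range is closed, [0, b0]; later ranges are half-open, (lo, hi].
    static int rangeIndex(double progress, double[] upperBounds) {
        if (progress < 0.0 || progress > 1.0) {
            throw new IllegalArgumentException("progress must be in [0, 1]: " + progress);
        }
        for (int i = 0; i < upperBounds.length; i++) {
            if (progress <= upperBounds[i]) {
                return i;
            }
        }
        return upperBounds.length - 1; // not reached while the last bound is 1.0
    }

    // Averages the CPU or memory samples collected within one range.
    static double average(double[] samples) {
        if (samples.length == 0) {
            return 0.0;
        }
        double sum = 0.0;
        for (double s : samples) {
            sum += s;
        }
        return sum / samples.length;
    }
}
```

A capture point would then be triggered whenever `rangeIndex` changes between two progress reports, and the samples gathered since the previous boundary would be passed to `average`.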
- MAPREDUCE-2033.
Major task reported by vinaythota and fixed by vinaythota (contrib/gridmix)
[Herriot] Gridmix generate data tests with various submission policies and different user resolvers.
Tests for submitting Gridmix data-generation jobs and verifying the generated input data under different submission policies and various user resolver modes. They cover the following scenarios.
1. Generate the data in a STRESS submission policy with SubmitterUserResolver mode and verify whether the generated data matches the given input size.
2. Generate the data in a REPLAY submission policy with RoundRobinUserResolver mode and verify whether the generated data matches the given input size.
3....
- MAPREDUCE-2026.
Major improvement reported by schen and fixed by jsensarma
JobTracker.getJobCounters() should not hold JobTracker lock while calling JobInProgress.getCounters()
JobTracker.getJobCounters() locks the JobTracker and calls JobInProgress.getCounters().
JobInProgress.getCounters() can be very expensive because it aggregates all the task counters.
We found from the JobTracker jstacks that this method is one of the bottlenecks of JobTracker performance.
JobInProgress.getCounters() can safely be called outside the JobTracker lock because it already takes the JobInProgress lock.
For example, it is used by jobdetails.jsp without a JobTracker lock.
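The locking change can be sketched as below. This is a simplified model under stated assumptions: `Tracker` and `Job` are hypothetical stand-ins for JobTracker and JobInProgress, and the real classes carry far more state. The sketch only shows the idea of holding the outer lock for the lookup and running the expensive aggregation under the job's own lock.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-ins; not the actual JobTracker or JobInProgress classes.
class CounterLockSketch {
    // Stand-in for JobInProgress: guards its counters with its own lock.
    static class Job {
        private final long[] taskCounters;

        Job(long... taskCounters) {
            this.taskCounters = taskCounters.clone();
        }

        // Potentially expensive: aggregates every task counter under the job's lock.
        synchronized long getCounters() {
            long total = 0;
            for (long c : taskCounters) {
                total += c;
            }
            return total;
        }
    }

    // Stand-in for JobTracker.
    static class Tracker {
        private final Map<String, Job> jobs = new HashMap<>();

        synchronized void addJob(String id, Job job) {
            jobs.put(id, job);
        }

        // Returns the aggregated counter, or -1 if the job is unknown.
        long getJobCounters(String id) {
            Job job;
            synchronized (this) {
                job = jobs.get(id); // tracker lock is held only for the lookup
            }
            // The aggregation runs outside the tracker lock.
            return job == null ? -1 : job.getCounters();
        }
    }
}
```

Other threads that need the tracker lock are then blocked only for the duration of a map lookup rather than for the whole aggregation.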
- MAPREDUCE-1996.
Trivial bug reported by glynnbach and fixed by qwertymaniac (documentation)
API: Reducer.reduce() method detail misstatement
Fix a misleading documentation note about the usage of Reporter objects in Reducers.
- MAPREDUCE-1978.
Major improvement reported by amar_kamat and fixed by ravidotg (tools/rumen)
[Rumen] TraceBuilder should provide recursive input folder scanning
Adds -recursive option to TraceBuilder for scanning the input directories recursively.
- MAPREDUCE-1938.
Blocker new feature reported by devaraj and fixed by ramach (job submission, task, tasktracker)
Ability for having user's classes take precedence over the system classes for tasks' classpath
It would be nice to have the ability in MapReduce to allow users to specify for their jobs alternate implementations of classes that are already defined in the MapReduce libraries. For example, an alternate implementation for CombineFileInputFormat.
- MAPREDUCE-1927.
Minor test reported by roelofs and fixed by roelofs (test)
unit test for HADOOP-6835 (concatenated gzip support)
More extensive test of concatenated gzip (and bzip2) decoding support for HADOOP-6835 (and HADOOP-4012 and HADOOP-6852).
- MAPREDUCE-1906.
Major improvement reported by scott_carey and fixed by tlipcon (jobtracker, tasktracker)
Lower minimum heartbeat interval for tasktracker > Jobtracker
The minimum heartbeat interval has been dropped from 3 seconds to 300ms to increase scheduling throughput on small clusters. Users may tune mapreduce.jobtracker.heartbeats.in.second to adjust this value.
- MAPREDUCE-1831.
Major improvement reported by schen and fixed by schen (contrib/raid)
BlockPlacement policy for RAID
RAID introduces a new dependency between blocks within a file.
The blocks help decode each other, so we should avoid putting them on the same machine.
The proposed BlockPlacementPolicy does the following:
1. When writing parity blocks, it avoids placing parity blocks and source blocks together.
2. When reducing the replication number, it deletes the blocks that sit with other dependent blocks.
3. It does not change the way we write normal files. It only has different behavior when processing ...
- MAPREDUCE-1811.
Minor bug reported by amareshwari and fixed by qwertymaniac (client)
Job.monitorAndPrintJob() should print status of the job at completion
Print the resultant status of a Job on completion instead of simply saying 'Complete'.
- MAPREDUCE-1788.
Major bug reported by acmurthy and fixed by acmurthy (client)
o.a.h.mapreduce.Job shouldn't make a copy of the JobConf
Having o.a.h.mapreduce.Job make a copy of the passed-in JobConf has several issues: any modifications made by various pieces such as InputSplit etc. are not reflected back, which causes issues for frameworks built on top.
- MAPREDUCE-1783.
Major improvement reported by rvadali and fixed by rvadali (contrib/fair-share)
Task Initialization should be delayed till when a job can be run
The FairScheduler task scheduler uses PoolManager to impose limits on the number of jobs that can be running at a given time. However, jobs that are submitted are initialized immediately by EagerTaskInitializationListener by calling JobInProgress.initTasks. This causes the job split file to be read into memory. The split information is not needed until the number of running jobs is less than the maximum specified. If the amount of split information is large, this leads to unnecessary memory p...
- MAPREDUCE-1752.
Major improvement reported by dms and fixed by dms (harchive)
Implement getFileBlockLocations in HarFilesystem
To efficiently run MapReduce on data that has been HAR'ed, it would be great to actually implement getFileBlockLocations for a given filename.
This way the JobTracker will have information about data locality and will schedule tasks appropriately.
I believe the overhead introduced by doing lookups in the index files can be smaller than that of copying data over the wire.
Will upload the patch shortly, but would love to get some feedback on this. And any ideas on how to test it are very wel...
- MAPREDUCE-1738.
Major improvement reported by vicaya and fixed by vicaya
MapReduce portion of HADOOP-6728 (overhaul metrics framework)
- MAPREDUCE-1706.
Major improvement reported by rschmidt and fixed by schen (contrib/raid)
Log RAID recoveries on HDFS
It would be good to have a way to centralize all the recovery logs, since recovery can be executed by any hdfs client. The best place to store this information is HDFS itself.
- MAPREDUCE-1702.
Minor improvement reported by jaideep and fixed by (contrib/gridmix)
CPU/Memory emulation for GridMix3
Currently GridMix3 can successfully recreate I/O workload of jobs from job traces. The goal of this feature is to emulate CPU and memory usage of jobs as well. For this we need to record cpu/memory usage of tasks on the cluster, save them to JobHistory so that they can be read by Rumen, and replay the cpu and memory usage in gridmix3 jobs.
- MAPREDUCE-1624.
Major improvement reported by devaraj and fixed by devaraj (documentation)
Document the job credentials and associated details to do with delegation tokens (on the client side)
Document the job credentials and associated details to do with delegation tokens (on the client side)
- MAPREDUCE-1461.
Major improvement reported by rajesh.balamohan and fixed by rajesh.balamohan (tools/rumen)
Feature to instruct rumen-folder utility to skip jobs worth of specific duration
Added a '-starts-after' option to Rumen's Folder utility. The time duration specified after the '-starts-after' option is an offset with respect to the submit time of the first job in the input trace. Jobs in the input trace whose submit time (relative to the first job's submit time) is less than the specified offset will be ignored.
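The filtering rule can be illustrated with a small sketch. The class and method below are hypothetical and are not Rumen's Folder code; the sketch assumes submit times in milliseconds and a trace already ordered by submit time, so that the first element is the first job.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the offset rule, not Rumen's Folder implementation.
class StartsAfterSketch {
    // Keeps only the submit times whose offset from the first job's
    // submit time is at least startsAfterMs; earlier jobs are ignored.
    static List<Long> skipEarlyJobs(List<Long> submitTimesMs, long startsAfterMs) {
        List<Long> kept = new ArrayList<>();
        if (submitTimesMs.isEmpty()) {
            return kept;
        }
        long first = submitTimesMs.get(0); // assumes the trace is sorted by submit time
        for (long t : submitTimesMs) {
            if (t - first >= startsAfterMs) {
                kept.add(t);
            }
        }
        return kept;
    }
}
```

For submit times of 1000, 1500, 4000 and 9000 ms and an offset of 3000 ms, the relative times are 0, 500, 3000 and 8000 ms, so only the last two jobs are kept.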
- MAPREDUCE-1334.
Major bug reported by kaykay.unique and fixed by kaykay.unique (contrib/index)
contrib/index - test - TestIndexUpdater fails due to an additional presence of file _SUCCESS in hdfs
$ cd src/contrib/index
$ ant clean test
This fails the test TestIndexUpdater due to a mismatch in the doneFileNames data structure when it is run with different parameters.
(An ArrayIndexOutOfBoundsException is raised when inserting elements into the doneFileNames array.)
Debugging further - there seems to be an additional file called as - hdfs://localhost:36021/myoutput/_SUCCESS , taken into consideration in addition to those that begins with done* . The presence of the extra file ca...
- MAPREDUCE-1242.
Trivial bug reported by amogh and fixed by qwertymaniac
Chain APIs error misleading
Fix a misleading exception message in case the Chained Mappers have mismatch in input/output Key/Value pairs between them.
- MAPREDUCE-1207.
Blocker improvement reported by acmurthy and fixed by acmurthy (client, mrv2)
Allow admins to set java options for map/reduce tasks
It would be useful to allow cluster admins to set some Java options for child map/reduce tasks.
E.g. we've had to ask users to set -Djava.net.preferIPv4Stack=true in their jobs. It would be nice to do it for all users in such scenarios, even when people override mapred.child.{map|reduce}.java.opts but forget to add this.
- MAPREDUCE-1159.
Trivial improvement reported by zshao and fixed by qwertymaniac
Limit Job name on jobtracker.jsp to be 80 char long
Job names on jobtracker.jsp should be 80 characters long at most.
- MAPREDUCE-993.
Minor bug reported by iyappans and fixed by qwertymaniac (jobtracker)
bin/hadoop job -events <jobid> <from-event-#> <#-of-events> help message is confusing
Added a helpful description message to the `mapred job -events` command.
- MAPREDUCE-901.
Major improvement reported by owen.omalley and fixed by vicaya (task)
Move Framework Counters into a TaskMetric structure
Efficient implementation of MapReduce framework counters.
- MAPREDUCE-587.
Minor bug reported by stevel@apache.org and fixed by amar_kamat (contrib/streaming)
Stream test TestStreamingExitStatus fails with Out of Memory
Fixed the streaming test TestStreamingExitStatus's failure due to an OutOfMemory error by reducing the testcase's io.sort.mb.
- MAPREDUCE-517.
Critical bug reported by acmurthy and fixed by acmurthy
The capacity-scheduler should assign multiple tasks per heartbeat
HADOOP-3136 changed the default o.a.h.mapred.JobQueueTaskScheduler to assign multiple tasks per TaskTracker heartbeat; the capacity-scheduler should do the same.
- MAPREDUCE-461.
Minor new feature reported by fhedberg and fixed by fhedberg
Enable ServicePlugins for the JobTracker
Allow ServicePlugins (see HADOOP-5257) for the JobTracker.
- MAPREDUCE-279.
Major improvement reported by acmurthy and fixed by (mrv2)
Map-Reduce 2.0
MapReduce has undergone a complete overhaul in hadoop-0.23 and we now have what we call MapReduce 2.0 (MRv2).
The fundamental idea of MRv2 is to split up the two major functionalities of the JobTracker, resource management and job scheduling/monitoring, into separate daemons. The idea is to have a global ResourceManager (RM) and per-application ApplicationMaster (AM). An application is either a single job in the classical sense of Map-Reduce jobs or a DAG of jobs. The ResourceManager and per-node slave, the NodeManager (NM), form the data-computation framework. The ResourceManager is the ultimate authority that arbitrates resources among all the applications in the system. The per-application ApplicationMaster is, in effect, a framework specific library and is tasked with negotiating resources from the ResourceManager and working with the NodeManager(s) to execute and monitor the tasks.
The ResourceManager has two main components:
* Scheduler (S)
* ApplicationsManager (ASM)
The Scheduler is responsible for allocating resources to the various running applications subject to familiar constraints of capacities, queues etc. The Scheduler is a pure scheduler in the sense that it performs no monitoring or tracking of status for the application. Also, it offers no guarantees on restarting tasks that fail due to either application failure or hardware failure. The Scheduler performs its scheduling function based on the resource requirements of the applications; it does so based on the abstract notion of a Resource Container, which incorporates elements such as memory, cpu, disk, network etc.
The Scheduler has a pluggable policy plug-in, which is responsible for partitioning the cluster resources among the various queues, applications etc. The current Map-Reduce schedulers such as the CapacityScheduler and the FairScheduler would be some examples of the plug-in.
The CapacityScheduler supports hierarchical queues to allow for more predictable sharing of cluster resources.
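As a rough illustration of how hierarchical queues partition cluster resources, the sketch below walks a queue tree and computes each leaf queue's share of the cluster memory. Everything in it is an assumption made for illustration: the `Queue` class, the `shareOfParent` field and the dotted queue names are hypothetical and are not the CapacityScheduler's API or configuration format.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical model of a queue hierarchy; not the CapacityScheduler implementation.
class QueueTreeSketch {
    static class Queue {
        final String name;
        final double shareOfParent; // fraction (0..1) of the parent's capacity
        final List<Queue> children = new ArrayList<>();

        Queue(String name, double shareOfParent) {
            this.name = name;
            this.shareOfParent = shareOfParent;
        }

        Queue add(Queue child) {
            children.add(child);
            return this;
        }
    }

    // Records, for every leaf queue, the memory (in MB) it is entitled to.
    // A queue's capacity is its parent's capacity multiplied by its own share.
    static void leafCapacitiesMb(Queue q, String path, double parentMb, Map<String, Double> out) {
        double mb = parentMb * q.shareOfParent;
        String fullName = path.isEmpty() ? q.name : path + "." + q.name;
        if (q.children.isEmpty()) {
            out.put(fullName, mb);
        } else {
            for (Queue child : q.children) {
                leafCapacitiesMb(child, fullName, mb, out);
            }
        }
    }
}
```

For example, on a 1024 MB cluster, a root queue split 75/25 between `prod` and `dev`, with `prod` split evenly between two child queues, gives each `prod` child 384 MB and `dev` 256 MB.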
The ApplicationsManager is responsible for accepting job submissions, negotiating the first container for executing the application-specific ApplicationMaster, and providing the service for restarting the ApplicationMaster container on failure.
The NodeManager is the per-machine framework agent that is responsible for launching the applications' containers, monitoring their resource usage (cpu, memory, disk, network) and reporting the same to the Scheduler.
The per-application ApplicationMaster has the responsibility of negotiating appropriate resource containers from the Scheduler, tracking their status and monitoring for progress.