Revision Log
NUTCH-684 - Dedup support for Solr
NUTCH-669 - Consolidate code for Fetcher and Fetcher2
NUTCH-442 - Integrate Solr/Nutch
NUTCH-586 - Add option to run compiled classes without job file.
Fix breakage in "if" syntax.
revert to previous version as requested by ab
Fix "bin/nutch: line 152: cygpath: command not found" on Linux (FC5). Hope I am not breaking it for some other environment.
NUTCH-432 - JAVA_PLATFORM with spaces breaks bin/nutch. Also, apply the patch proposed in HADOOP-1080 to fix CLASSPATH problems under Cygwin.
Revert accidental change to bin/nutch. Fix Fetcher.java to correctly split input. Add Fetcher2 - a queue-based fetcher implementation.
NUTCH-68 - ported to use map-reduce.
Upgrade to Hadoop 0.10.1. HTTPClient is now a dependency - move it to lib/ and remove it as a plugin. Also add native Linux libraries for Hadoop compression, plus corresponding logic in bin/nutch. Hadoop uses larger buffers now - explicitly set a large heap size for JUnit tests. All tests should pass now.
NUTCH-383: upgrade to Hadoop 0.7.1 and Lucene 2.0.0. NUTCH-373: replace DeleteDuplicates with a version that implements both parts of the algorithm. Add JUnit test.
Change the name of SegmentReader alias to 'readseg' for consistency with other reading-related commands. Keep the old 'segread' for compatibility, and give a deprecation message.
NUTCH-327 fix log path under cygwin
NUTCH-307 : Nutch now uses Hadoop var names for the file name used by DRFA logging
NUTCH-303 : Make use of the Commons Logging API and use log4j as the default implementation
Add the following tools (see also NUTCH-264):
* CrawlDbMerger: merges one or more crawldb-s, with optional filtering
* LinkDbMerger: merges one or more linkdb-s, with optional filtering
* SegmentMerger: merges one or more segments, with optional filtering and slicing
Development of these tools has been sponsored by houxou.com - thank you!
Fix a bug when there are spaces in CWD, as is common on Windows.
Fix for NUTCH-209. Nutch now supplies all code to remote MapReduce daemons through a job jar file. So Hadoop daemons no longer need to be restarted when Nutch code changes.
NUTCH-193: MapReduce and NDFS code moved to new project, Hadoop. See bug report for details.
Add functionality to run individual classes in plugins. Add a user-friendly shortcut to bin/nutch (the "plugin" command). Fix missing setConf() in Http plugins.
Explicitly specify bash, since this script requires some bash-specific features.
A framework for using different page signature implementations. An ordinary MD5 hash of raw page content is very often unsuitable when many near-duplicate pages are crawled. Now users can select their own page signature implementation, possibly with better properties than the old one. Two implementations are provided:
* MD5Signature: backward-compatible with the old schema.
* TextProfileSignature: an example implementation of a signature, which gives the same values for near-duplicate pages. Please see Javadoc for more information.
This commit changes the CrawlDatum to store page signatures in CrawlDb. A last-modified time field was added, too. Both changes are in preparation for patches implementing self-adjustable fetch interval.
NutchConf was extended to also store and retrieve plain Object values. This is useful when caching per-job instances.
StringUtil: added methods to display / parse byte[] values.
Added SegmentReader (based on a contribution in NUTCH-121 by Rod Taylor).
Fixed Fetcher to actually use the command-line parameters.
Mega-cleanup patch:
* remove obsolete classes and packages
* move new classes to the more appropriate packages
* change the bin/nutch script appropriately
* change the Protocol API in preparation for patches implementing flexible re-fetch schedules.
Please report any errors (if any? :).
Merge mapred branch to trunk & remove it.
Fix for nutch shell script problem on Mac OS X. Submitted by Erik Hatcher
Moving Nutch from the Incubator to Lucene.
Closed Issue NUTCH-19: Space in Java.exe path chokes bin/nutch.
Add updatesegs command.
Initial import of Nutch to Apache.