Revision Log
NUTCH-758 Set subversion eol-style to "native".
NUTCH-442 - Integrate Solr/Nutch
NUTCH-640 - confusing description "set it to Integer.MAX_VALUE"
NUTCH-639 - Change LuceneDocumentWrapper visibility from private to protected
NUTCH-634 Upgrade Nutch to Hadoop 0.17.1.
NUTCH-598 - Remove deprecated use of ToolBase. Use generics in Hadoop API.
NUTCH-547 - Redirection handling: YahooSlurp's algorithm.
NUTCH-514 - Indexer should only index pages with fetch status SUCCESS.
NUTCH-474 - Replace usage of ObjectWritable with something based on GenericWritable.
NUTCH-504 - Parsing during fetching is broken.
- fix for NUTCH-443 (contributed by Dogacan)
NUTCH-392 - OutputFormat implementations should pass on Progressable.
NUTCH-61 - adaptive fetch interval patch.
NUTCH-393 - Indexer should handle null documents returned by filters.
NUTCH-433
When indexing redirected pages, drop intermediate pages and only index the final page. Avoid NPEs in Crawl tool, when no URLs are generated or fetched.
Fix two bugs reported by Dogacan Guney.
This patch addresses several issues:
* NUTCH-415 - Generator should mark selected records in CrawlDb. Due to increased resource consumption this step is optional. Application-level locking has been added to prevent concurrent modification of databases.
* NUTCH-416 - CrawlDatum status and CrawlDbReducer refactoring. It is now possible to correctly update CrawlDb from multiple segments. Introduce new status codes for temporary and permanent redirection.
* NUTCH-322 - Fix Fetcher to store redirected pages and to store protocol-level status. This also should fix NUTCH-273.
* Change default Fetcher behavior not to follow redirects immediately. Instead Fetcher will record redirects as new pages to be added to CrawlDb. This also partially addresses NUTCH-273.
* Detect and report when Generator creates 0-sized segments.
* Fix Injector to preserve already existing CrawlDatum if the seed list being injected also contains such URL.
This development was partially supported by SiteSell Inc.
Move some constants to Nutch.java, so that Metadata could use them properly.
NUTCH-400 update headers
NUTCH-383: upgrade to Hadoop 0.7.1 and Lucene 2.0.0. NUTCH-373: replace DeleteDuplicates with a version that implements both parts of the algorithm. Add JUnit test.
This patch addresses two issues:
* NUTCH-242: The code to activate URL normalization and filtering has been refactored and extracted into CrawlDbFilter and LinkDbFilter. These two concerns (normalization and filtering) have been made independent. Command line options have been modified to reflect these changes.
* NUTCH-143: All command-line tools have been modified to return meaningful OS exit codes. At the moment this uses a modified copy of Hadoop's ToolBase, which will be removed when HADOOP-488 is fixed and Nutch upgrades to Hadoop 0.6.0.
All JUnit tests pass.
NUTCH-312. Upgrade to Hadoop 0.4.0.
NUTCH-309 : Added logging code guards
NUTCH-303 : Make use of the Commons Logging API and use log4j as the default implementation
removed unused import, removed unused code
Scoring API (NUTCH-240). Development of this functionality was supported by Krugle.net. Thank you!
Change parameters passed to Hadoop's FileSystem from (now-deprecated) java.io.File to (new) org.apache.hadoop.fs.Path.
Upgrade to latest Hadoop jar. Add job names to Nutch mapred jobs. Update OutputFormat implementations to implement new checkOutputSpecs() method.
Reactivate usage of AnalyzerFactory
NUTCH-221, removed deprecated Lucene API usage
Undo unintentional changes made in r381751. Thanks, Jerome, for catching this!
Adding DOAP for Nutch. Contributed by Chris Mattmann.
Fix for NUTCH-209. Nutch now supplies all code to remote MapReduce daemons through a job jar file. So Hadoop daemons no longer need to be restarted when Nutch code changes.
Updating to latest Hadoop jar, adding now-required close() methods to mapper and reducer implementations.
NUTCH-139
* Add standard metadata names
* Syntax tolerant metadata names container
* Review usage of metadata among plugins
NUTCH-193: MapReduce and NDFS code moved to new project, Hadoop. See bug report for details.
removed unused imports
Apply patches from NUTCH-169 (remove static NutchConf). Submitted by: Marko Bauhardt, Stefan Groschupf, Jerome Charron.
A framework for using different page signature implementations. An ordinary MD5 hash of raw page content is often unsuitable when many near-duplicate pages are crawled. Now users can select their own page signature implementation, possibly with better properties than the old one. Two implementations are provided:
* MD5Signature: backward-compatible with the old schema.
* TextProfileSignature: an example implementation of a signature, which gives the same values for near-duplicate pages. Please see Javadoc for more information.
This commit changes the CrawlDatum to store page signatures in CrawlDb. A last modified time field was added, too. Both changes are in preparation for patches implementing self-adjustable fetch interval.
NutchConf was extended to also store and retrieve plain Object values. This is useful when caching per-job instances.
StringUtil: added methods to display / parse byte[] values.
Added SegmentReader (based on a contribution in NUTCH-121 by Rod Taylor).
Fixed Fetcher to actually use the command-line parameters.
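The difference between the two kinds of signature described in this commit can be illustrated with a minimal sketch. This is not the Nutch code: the `PageSignature` interface, the class name and the profile rule (keep tokens of at least three characters that occur at least twice, sorted, then hash) are assumptions made up for this example, and the real TextProfileSignature is documented in its Javadoc.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only; names and rules here are NOT the Nutch API.
final class SignatureSketch {

    /** Hypothetical interface: maps page text to a fixed-size signature. */
    interface PageSignature {
        byte[] calculate(String content);
    }

    /** MD5 digest of the UTF-8 bytes of a string. */
    static byte[] md5(String s) {
        try {
            return MessageDigest.getInstance("MD5")
                    .digest(s.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    /** Hash of the raw content: any one-character change alters the signature. */
    static final PageSignature RAW_MD5 = SignatureSketch::md5;

    /**
     * Simplified "text profile" (an assumption, not Nutch's algorithm):
     * hash only the sorted set of tokens that are at least 3 characters
     * long and occur at least twice, so rare tokens do not affect it.
     */
    static final PageSignature PROFILE = content -> {
        Map<String, Integer> counts = new TreeMap<>(); // sorted by token
        for (String tok : content.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (tok.length() < 3) {
                continue; // skips short tokens and the empty leading token
            }
            counts.merge(tok, 1, Integer::sum);
        }
        StringBuilder profile = new StringBuilder();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() >= 2) {
                profile.append(e.getKey()).append(' ');
            }
        }
        return md5(profile.toString());
    };

    public static void main(String[] args) {
        // Two pages that differ only in a visitor counter.
        String a = "Nutch crawls the web. Nutch indexes the web pages. Visitor 1041";
        String b = "Nutch crawls the web. Nutch indexes the web pages. Visitor 1042";
        System.out.println("raw MD5 equal: "
                + Arrays.equals(RAW_MD5.calculate(a), RAW_MD5.calculate(b)));
        System.out.println("profile equal: "
                + Arrays.equals(PROFILE.calculate(a), PROFILE.calculate(b)));
    }
}
```

In this sketch the two sample pages get different raw MD5 signatures but the same profile signature, because the changing counter occurs only once and never enters the profile. The thresholds are arbitrary and would need tuning for real content.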
Mega-cleanup patch:
* remove obsolete classes and packages
* move new classes to the more appropriate packages
* change the bin/nutch script appropriately
* change the Protocol API in preparation for patches implementing flexible re-fetch schedules.
Please report any errors (if any? :).