Log of /lucene/nutch/trunk/src/java/org/apache/nutch/indexer/Indexer.java

Revision 823614
Modified Fri Oct 9 17:02:32 2009 UTC (6 weeks, 5 days ago) by ab
File length: 3491 byte(s)
Diff to previous 733738 (colored)
NUTCH-758 Set subversion eol-style to "native".

Revision 733738
Modified Mon Jan 12 13:26:16 2009 UTC (10 months, 1 week ago) by dogacan
File length: 3491 byte(s)
Diff to previous 701052 (colored)
NUTCH-442 - Integrate Solr/Nutch

Revision 701052
Modified Thu Oct 2 09:17:23 2008 UTC (13 months, 3 weeks ago) by dogacan
File length: 12491 byte(s)
Diff to previous 697395 (colored)
NUTCH-640 - confusing description "set it to Integer.MAX_VALUE"

Revision 697395
Modified Sat Sep 20 17:05:03 2008 UTC (14 months ago) by dogacan
File length: 12401 byte(s)
Diff to previous 678533 (colored)
NUTCH-639 - Change LuceneDocumentWrapper visibility from private to protected

Revision 678533
Modified Mon Jul 21 19:20:21 2008 UTC (16 months ago) by ab
File length: 12402 byte(s)
Diff to previous 638779 (colored)
NUTCH-634 Upgrade Nutch to Hadoop 0.17.1.

Revision 638779
Modified Wed Mar 19 10:34:14 2008 UTC (20 months, 1 week ago) by ab
File length: 12260 byte(s)
Diff to previous 593151 (colored)
NUTCH-598 - Remove deprecated use of ToolBase. Use generics in Hadoop API.

Revision 593151
Modified Thu Nov 8 13:18:05 2007 UTC (2 years ago) by dogacan
File length: 12146 byte(s)
Diff to previous 561092 (colored)
NUTCH-547 - Redirection handling: YahooSlurp's algorithm.

Revision 561092
Modified Mon Jul 30 19:02:27 2007 UTC (2 years, 3 months ago) by dogacan
File length: 11848 byte(s)
Diff to previous 551081 (colored)
NUTCH-514 - Indexer should only index pages with fetch status SUCCESS.

Revision 551081
Modified Wed Jun 27 07:05:52 2007 UTC (2 years, 5 months ago) by dogacan
File length: 11778 byte(s)
Diff to previous 550196 (colored)
NUTCH-474 - Replace usage of ObjectWritable with something based on GenericWritable.

Revision 550196
Modified Sun Jun 24 10:04:30 2007 UTC (2 years, 5 months ago) by dogacan
File length: 11030 byte(s)
Diff to previous 548076 (colored)
NUTCH-504 - Parsing during fetching is broken.

Revision 548076
Modified Sun Jun 17 17:19:14 2007 UTC (2 years, 5 months ago) by mattmann
File length: 10959 byte(s)
Diff to previous 543264 (colored)
- fix for NUTCH-443 (contributed by Dogacan)

Revision 543264
Modified Thu May 31 21:23:45 2007 UTC (2 years, 5 months ago) by ab
File length: 10992 byte(s)
Diff to previous 542903 (colored)
NUTCH-392 - OutputFormat implementations should pass on Progressable.

Revision 542903
Modified Wed May 30 18:35:24 2007 UTC (2 years, 5 months ago) by ab
File length: 10953 byte(s)
Diff to previous 536629 (colored)
NUTCH-61 - adaptive fetch interval patch.

Revision 536629
Modified Wed May 9 19:36:54 2007 UTC (2 years, 6 months ago) by ab
File length: 10825 byte(s)
Diff to previous 499878 (colored)
NUTCH-393 - Indexer should handle null documents returned by filters.

Revision 499878
Modified Thu Jan 25 18:11:59 2007 UTC (2 years, 10 months ago) by siren
File length: 10739 byte(s)
Diff to previous 495214 (colored)
NUTCH-433

Revision 495214
Modified Thu Jan 11 13:25:43 2007 UTC (2 years, 10 months ago) by ab
File length: 11731 byte(s)
Diff to previous 491291 (colored)
When indexing redirected pages, drop intermediate pages and only index the
final page.

Avoid NPEs in Crawl tool, when no URLs are generated or fetched.

Revision 491291
Modified Sat Dec 30 19:13:06 2006 UTC (2 years, 10 months ago) by ab
File length: 11453 byte(s)
Diff to previous 490607 (colored)
Fix two bugs reported by Dogacan Guney.

Revision 490607
Modified Thu Dec 28 00:03:04 2006 UTC (2 years, 10 months ago) by ab
File length: 11705 byte(s)
Diff to previous 480188 (colored)
This patch addresses several issues:

* NUTCH-415 - Generator should mark selected records in CrawlDb.
  Due to increased resource consumption this step is optional.
  Application-level locking has been added to prevent concurrent
  modification of databases.

* NUTCH-416 - CrawlDatum status and CrawlDbReducer refactoring. It is
  now possible to correctly update CrawlDb from multiple segments.
  Introduce new status codes for temporary and permanent
  redirection.

* NUTCH-322 - Fix Fetcher to store redirected pages and to store
  protocol-level status. This also should fix NUTCH-273.

* Change default Fetcher behavior not to follow redirects immediately.
  Instead Fetcher will record redirects as new pages to be added to CrawlDb.
  This also partially addresses NUTCH-273.

* Detect and report when Generator creates 0-sized segments.

* Fix Injector to preserve already existing CrawlDatum if the seed list
  being injected also contains such URL.

This development was partially supported by SiteSell Inc.


Revision 480188
Modified Tue Nov 28 20:14:58 2006 UTC (2 years, 11 months ago) by ab
File length: 11668 byte(s)
Diff to previous 473936 (colored)
Move some constants to Nutch.java, so that Metadata could use them properly.

Revision 473936
Modified Sun Nov 12 11:37:02 2006 UTC (3 years ago) by siren
File length: 11632 byte(s)
Diff to previous 464654 (colored)
NUTCH-400 update headers

Revision 464654
Modified Mon Oct 16 20:38:57 2006 UTC (3 years, 1 month ago) by ab
File length: 11444 byte(s)
Diff to previous 438670 (colored)
NUTCH-383: upgrade to Hadoop 0.7.1 and Lucene 2.0.0.

NUTCH-373: replace DeleteDuplicates with a version that implements both
parts of the algorithm. Add JUnit test.

Revision 438670
Modified Wed Aug 30 22:12:53 2006 UTC (3 years, 2 months ago) by ab
File length: 11267 byte(s)
Diff to previous 417884 (colored)
This patch addresses two issues:

* NUTCH-242: The code to activate url normalization and filtering has been
  refactored and extracted into CrawlDbFilter and LinkDbFilter. These
  two concerns (normalization and filtering) have been made independent.
  Command line options have been modified to reflect these changes.

* NUTCH-143: all command-line tools have been modified to return
  meaningful OS exit codes. At the moment this uses a modified copy of
  Hadoop's ToolBase, which will be removed when HADOOP-488 is fixed and
  Nutch upgrades to Hadoop 0.6.0.

All JUnit tests pass.

Revision 417884
Modified Wed Jun 28 21:54:53 2006 UTC (3 years, 4 months ago) by cutting
File length: 10997 byte(s)
Diff to previous 416346 (colored)
NUTCH-312.  Upgrade to Hadoop 0.4.0.

Revision 416346
Modified Thu Jun 22 12:20:29 2006 UTC (3 years, 5 months ago) by jerome
File length: 10930 byte(s)
Diff to previous 413742 (colored)
NUTCH-309 : Added logging code guards

Revision 413742
Modified Mon Jun 12 20:51:40 2006 UTC (3 years, 5 months ago) by jerome
File length: 10570 byte(s)
Diff to previous 411249 (colored)
NUTCH-303 : Make use of the Commons Logging API and use log4j as the default implementation

Revision 411249
Modified Fri Jun 2 18:57:35 2006 UTC (3 years, 5 months ago) by siren
File length: 10574 byte(s)
Diff to previous 405967 (colored)
removed unused import, removed unused code

Revision 405967
Modified Sat May 13 00:52:33 2006 UTC (3 years, 6 months ago) by ab
File length: 10699 byte(s)
Diff to previous 405204 (colored)
Scoring API (NUTCH-240).

Development of this functionality was supported by Krugle.net. Thank you!

Revision 405204
Modified Mon May 8 22:34:29 2006 UTC (3 years, 6 months ago) by cutting
File length: 10343 byte(s)
Diff to previous 388310 (colored)
Change parameters passed to Hadoop's FileSystem from (now-deprecated) java.io.File to (new) org.apache.hadoop.fs.Path.

Revision 388310
Modified Fri Mar 24 00:57:56 2006 UTC (3 years, 8 months ago) by cutting
File length: 10327 byte(s)
Diff to previous 385322 (colored)
Upgrade to latest Hadoop jar.  Add job names to Nutch mapred jobs.  Update OutputFormat implementations to implement new checkOutputSpecs() method.

Revision 385322
Modified Sun Mar 12 17:46:07 2006 UTC (3 years, 8 months ago) by jerome
File length: 10285 byte(s)
Diff to previous 383304 (colored)
Reactivate usage of AnalyzerFactory

Revision 383304
Modified Sun Mar 5 10:55:17 2006 UTC (3 years, 8 months ago) by siren
File length: 9929 byte(s)
Diff to previous 382912 (colored)
NUTCH-221, removed deprecated Lucene API usage

Revision 382912
Modified Fri Mar 3 19:05:41 2006 UTC (3 years, 8 months ago) by cutting
File length: 9835 byte(s)
Diff to previous 381751 (colored)
Undo unintentional changes made in r381751.  Thanks, Jerome, for catching this!

Revision 381751
Modified Tue Feb 28 19:25:12 2006 UTC (3 years, 8 months ago) by cutting
File length: 10217 byte(s)
Diff to previous 376485 (colored)
Adding DOAP for Nutch.  Contributed by Chris Mattmann.

Revision 376485
Modified Thu Feb 9 23:20:28 2006 UTC (3 years, 9 months ago) by cutting
File length: 9835 byte(s)
Diff to previous 376435 (colored)
Fix for NUTCH-209.  Nutch now supplies all code to remote MapReduce daemons through a job jar file.  So Hadoop daemons no longer need to be restarted when Nutch code changes.

Revision 376435
Modified Thu Feb 9 20:57:44 2006 UTC (3 years, 9 months ago) by cutting
File length: 9795 byte(s)
Diff to previous 376089 (colored)
Updating to latest Hadoop jar, adding now-required close() methods to mapper and reducer implementations.

Revision 376089
Modified Wed Feb 8 21:48:52 2006 UTC (3 years, 9 months ago) by jerome
File length: 9769 byte(s)
Diff to previous 374796 (colored)
NUTCH-139
 * Add standard metadata names
 * Syntax tolerant metadata names container
 * Review usage of metadata among plugins

Revision 374796
Modified Sat Feb 4 00:38:32 2006 UTC (3 years, 9 months ago) by cutting
File length: 9736 byte(s)
Diff to previous 374741 (colored)
NUTCH-193: MapReduce and NDFS code moved to new project, Hadoop.  See bug report for details.

Revision 374741
Modified Fri Feb 3 20:56:28 2006 UTC (3 years, 9 months ago) by siren
File length: 9637 byte(s)
Diff to previous 373853 (colored)
removed unused imports

Revision 373853
Modified Tue Jan 31 16:08:58 2006 UTC (3 years, 9 months ago) by ab
File length: 9703 byte(s)
Diff to previous 359822 (colored)
Apply patches from NUTCH-169 (remove static NutchConf).

Submitted by: Marko Bauhardt, Stefan Groschupf, Jerome Charron.


Revision 359822
Modified Thu Dec 29 15:28:30 2005 UTC (3 years, 10 months ago) by ab
File length: 9599 byte(s)
Diff to previous 359668 (colored)
A framework for using different page signature implementations. Ordinary
MD5 hash of a raw page content is very often unsuitable, when many
near-duplicate pages are crawled.

Now users can select their own page signature implementation, possibly
with better properties than the old one.

Two implementations are provided:

* MD5Signature: backward-compatible with the old schema.

* TextProfileSignature: an example implementation of a signature, which
  gives the same values for near-duplicate pages. Please see Javadoc for
  more information.

This commit changes the CrawlDatum to store page signatures in CrawlDb.
Last modified time field was added, too. Both changes are in preparation
for patches implementing self-adjustable fetch interval.

NutchConf was extended to store and retrieve also plain Object values.
This is useful when caching per-job instances.

StringUtil: added methods to display / parse byte[] values.

Added SegmentReader (based on a contribution in NUTCH-121 by Rod Taylor).

Fixed Fetcher to actually use the command-line parameters.

Revision 359668
Added Thu Dec 29 00:37:13 2005 UTC (3 years, 10 months ago) by ab
File length: 9596 byte(s)
Mega-cleanup patch:

* remove obsolete classes and packages

* move new classes to the more appropriate packages

* change the bin/nutch script appropriately

* change the Protocol API in preparation for patches implementing
  flexible re-fetch schedules.

Please report any errors (if any? :).

