Log of /lucene/nutch/trunk/bin/nutch


Revision 751774
Modified Mon Mar 9 17:34:51 2009 UTC (8 months, 2 weeks ago) by dogacan
File length: 7998 byte(s)
Diff to previous 749289 (colored)
NUTCH-684 - Dedup support for Solr

Revision 749289
Modified Mon Mar 2 12:28:22 2009 UTC (8 months, 3 weeks ago) by siren
File length: 7841 byte(s)
Diff to previous 733738 (colored)
NUTCH-669 - Consolidate code for Fetcher and Fetcher2

Revision 733738
Modified Mon Jan 12 13:26:16 2009 UTC (10 months, 1 week ago) by dogacan
File length: 7924 byte(s)
Diff to previous 604956 (colored)
NUTCH-442 - Integrate Solr/Nutch

Revision 604956
Modified Mon Dec 17 18:22:17 2007 UTC (23 months, 1 week ago) by ab
File length: 7753 byte(s)
Diff to previous 520676 (colored)
NUTCH-586 - Add option to run compiled classes without job file.

Revision 520676
Modified Wed Mar 21 00:20:12 2007 UTC (2 years, 8 months ago) by ab
File length: 7301 byte(s)
Diff to previous 516908 (colored)
Fix breakage in "if" syntax.

Revision 516908
Modified Sun Mar 11 14:30:35 2007 UTC (2 years, 8 months ago) by siren
File length: 7294 byte(s)
Diff to previous 516888 (colored)
revert to previous version as requested by ab

Revision 516888
Modified Sun Mar 11 11:12:23 2007 UTC (2 years, 8 months ago) by siren
File length: 7290 byte(s)
Diff to previous 515698 (colored)
fix bin/nutch: line 152: cygpath: command not found on linux (FC5), hope i am not breaking it for some other env

Revision 515698
Modified Wed Mar 7 19:02:56 2007 UTC (2 years, 8 months ago) by ab
File length: 7294 byte(s)
Diff to previous 497172 (colored)
NUTCH-432 - JAVA_PLATFORM with spaces breaks bin/nutch.

Also, apply the patch proposed in HADOOP-1080 to fix CLASSPATH problems
under Cygwin.

Revision 497172
Modified Wed Jan 17 21:06:50 2007 UTC (2 years, 10 months ago) by ab
File length: 7160 byte(s)
Diff to previous 497141 (colored)
Revert accidental change to bin/nutch.

Fix Fetcher.java to correctly split input.

Add Fetcher2 - a queue-based fetcher implementation.

Revision 497141
Modified Wed Jan 17 19:55:07 2007 UTC (2 years, 10 months ago) by ab
File length: 7073 byte(s)
Diff to previous 495392 (colored)
NUTCH-68 - ported to use map-reduce.

Revision 495392
Modified Thu Jan 11 21:51:20 2007 UTC (2 years, 10 months ago) by ab
File length: 6899 byte(s)
Diff to previous 464654 (colored)
Upgrade to Hadoop 0.10.1. HTTPClient is now a dependency - move it
to lib/ and remove it as a plugin.

Add also native Linux libraries for Hadoop compression, plus corresponding
logic in bin/nutch.

Hadoop uses larger buffers now - explicitly set large heap size for
JUnit tests. All tests should pass now.

Revision 464654
Modified Mon Oct 16 20:38:57 2006 UTC (3 years, 1 month ago) by ab
File length: 6066 byte(s)
Diff to previous 425354 (colored)
NUTCH-383: upgrade to Hadoop 0.7.1 and Lucene 2.0.0.

NUTCH-373: replace DeleteDuplicates with a version that implements both
parts of the algorithm. Add JUnit test.

Revision 425354
Modified Tue Jul 25 09:54:58 2006 UTC (3 years, 4 months ago) by ab
File length: 5907 byte(s)
Diff to previous 424779 (colored)
Change the name of SegmentReader alias to 'readseg' for consistency with other
reading-related commands. Keep the old 'segread' for compatibility, and
give a deprecation message.
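
A minimal sketch of how such an alias could be kept alongside its replacement. The function name `resolve_class` is hypothetical and is not taken from bin/nutch; the class name and the wording of the message are assumptions made for illustration only.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of bin/nutch): map a command name to a class.
# The class name and the message text are assumptions for illustration.
resolve_class() {
  local cmd="$1"
  case "$cmd" in
    readseg)
      echo "org.apache.nutch.segment.SegmentReader"
      ;;
    segread)
      # Old alias: warn on stderr, then resolve to the same class.
      echo "[DEPRECATED] Command 'segread' is deprecated, use 'readseg' instead." >&2
      echo "org.apache.nutch.segment.SegmentReader"
      ;;
    *)
      # Anything else is treated as a class name and passed through.
      echo "$cmd"
      ;;
  esac
}
```

Writing the warning to stderr keeps the resolved class name on stdout clean, so callers that capture the output are not affected by the deprecation notice.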

Revision 424779
Modified Sun Jul 23 18:43:55 2006 UTC (3 years, 4 months ago) by siren
File length: 5743 byte(s)
Diff to previous 416100 (colored)
NUTCH-327 fix log path under cygwin

Revision 416100
Modified Wed Jun 21 20:22:33 2006 UTC (3 years, 5 months ago) by jerome
File length: 5615 byte(s)
Diff to previous 413742 (colored)
NUTCH-307 : Nutch now uses Hadoop var names for the file name used by DRFA logging

Revision 413742
Modified Mon Jun 12 20:51:40 2006 UTC (3 years, 5 months ago) by jerome
File length: 5612 byte(s)
Diff to previous 405183 (colored)
NUTCH-303 : Make use of the Commons Logging API and use log4j as the default implementation

Revision 405183
Modified Mon May 8 21:58:18 2006 UTC (3 years, 6 months ago) by ab
File length: 5328 byte(s)
Diff to previous 389634 (colored)
Add the following tools (see also NUTCH-264):

* CrawlDbMerger: merges one or more crawldb-s, with optional filtering

* LinkDbMerger: merges one or more linkdb-s, with optional filtering

* SegmentMerger: merges one or more segments, with optional filtering
  and slicing

Development of these tools has been sponsored by houxou.com - thank you! 

Revision 389634
Modified Wed Mar 29 00:04:51 2006 UTC (3 years, 7 months ago) by cutting
File length: 4841 byte(s)
Diff to previous 376485 (colored)
Fix a bug when there are spaces in CWD, as is common on Windows.
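
The log does not show the change itself. As a general illustration of this class of bug, unquoted variable expansions in a shell script split paths at spaces. The sketch below builds a colon-separated classpath with every expansion quoted; `build_classpath` is a hypothetical helper and is not taken from bin/nutch.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of bin/nutch): join all *.jar files in a
# directory into a colon-separated classpath, tolerating spaces in the path.
build_classpath() {
  local dir="$1" cp="" jar
  # Quoting "$dir" keeps a path such as "C:/My Documents/nutch" in one piece.
  for jar in "$dir"/*.jar; do
    [ -e "$jar" ] || continue   # skip the literal pattern when nothing matches
    cp="${cp:+$cp:}$jar"        # add ':' only between entries
  done
  printf '%s\n' "$cp"
}
```

The result should be quoted again wherever it is used, for example `java -classpath "$(build_classpath "$PWD/lib")" ...`, otherwise the shell splits it at the same spaces.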

Revision 376485
Modified Thu Feb 9 23:20:28 2006 UTC (3 years, 9 months ago) by cutting
File length: 4841 byte(s)
Diff to previous 374796 (colored)
Fix for NUTCH-209.  Nutch now supplies all code to remote MapReduce daemons through a job jar file.  So Hadoop daemons no longer need to be restarted when Nutch code changes.

Revision 374796
Modified Sat Feb 4 00:38:32 2006 UTC (3 years, 9 months ago) by cutting
File length: 4849 byte(s)
Diff to previous 374348 (colored)
NUTCH-193: MapReduce and NDFS code moved to new project, Hadoop.  See bug report for details.

Revision 374348
Modified Thu Feb 2 10:58:16 2006 UTC (3 years, 9 months ago) by ab
File length: 5674 byte(s)
Diff to previous 372810 (colored)
Add functionality to run individual classes in plugins.

Add a user-friendly shortcut to bin/nutch (the "plugin" command).

Fix missing setConf() in Http plugins.

Revision 372810
Modified Fri Jan 27 10:45:35 2006 UTC (3 years, 9 months ago) by cutting
File length: 5510 byte(s)
Diff to previous 359822 (colored)
Explicitly specify bash, since this script requires some bash-specific features.

Revision 359822
Modified Thu Dec 29 15:28:30 2005 UTC (3 years, 10 months ago) by ab
File length: 5508 byte(s)
Diff to previous 359668 (colored)
A framework for using different page signature implementations. Ordinary
MD5 hash of a raw page content is very often unsuitable, when many
near-duplicate pages are crawled.

Now users can select their own page signature implementation, possibly
with better properties than the old one.

Two implementations are provided:

* MD5Signature: backward-compatible with the old schema.

* TextProfileSignature: an example implementation of a signature, which
  gives the same values for near-duplicate pages. Please see Javadoc for
  more information.

This commit changes the CrawlDatum to store page signatures in CrawlDb.
Last modified time field was added, too. Both changes are in preparation
for patches implementing self-adjustable fetch interval.

NutchConf was extended to store and retrieve also plain Object values.
This is useful when caching per-job instances.

StringUtil: added methods to display / parse byte[] values.

Added SegmentReader (based on a contribution in NUTCH-121 by Rod Taylor).

Fixed Fetcher to actually use the command-line parameters.

Revision 359668
Modified Thu Dec 29 00:37:13 2005 UTC (3 years, 10 months ago) by ab
File length: 5368 byte(s)
Diff to previous 357197 (colored)
Mega-cleanup patch:

* remove obsolete classes and packages

* move new classes to the more appropriate packages

* change the bin/nutch script appropriately

* change the Protocol API in preparation for patches implementing
  flexible re-fetch schedules.

Please report any errors (if any? :).


Revision 357197
Modified Fri Dec 16 17:51:05 2005 UTC (3 years, 11 months ago) by cutting
File length: 5435 byte(s)
Diff to previous 219772 (colored)
Merge mapred branch to trunk & remove it.

Revision 219772
Modified Tue Jul 19 20:40:49 2005 UTC (4 years, 4 months ago) by pkosiorowski
File length: 6198 byte(s)
Diff to previous 179640 (colored)
Fix for nutch shell script problem on Mac OS X. Submitted by Erik Hatcher

Revision 179640
Modified Thu Jun 2 20:37:21 2005 UTC (4 years, 5 months ago) by cutting
File length: 6202 byte(s)
Diff to previous 161681 (colored)
Moving Nutch from the Incubator to Lucene.

Revision 161681
Modified Sun Apr 17 19:23:35 2005 UTC (4 years, 7 months ago) by johnx
Original Path: incubator/nutch/trunk/bin/nutch
File length: 6202 byte(s)
Diff to previous 158625 (colored)
Closed Issue NUTCH-19: Space in Java.exe path chokes bin/nutch.

Revision 158625
Modified Tue Mar 22 16:47:02 2005 UTC (4 years, 8 months ago) by cutting
Original Path: incubator/nutch/trunk/bin/nutch
File length: 6200 byte(s)
Diff to previous 155829 (colored)
Add updatesegs command.

Revision 155829
Added Tue Mar 1 22:04:46 2005 UTC (4 years, 8 months ago) by cutting
Original Path: incubator/nutch/trunk/bin/nutch
File length: 6043 byte(s)
Initial import of Nutch to Apache.
