Log of /lucene/nutch/trunk/conf/nutch-default.xml


Revision 823614 - (view) (annotate) - [select for diffs]
Modified Fri Oct 9 17:02:32 2009 UTC (6 weeks, 4 days ago) by ab
File length: 41654 byte(s)
Diff to previous 807485 (colored)
NUTCH-758 Set subversion eol-style to "native".

Revision 807485 - (view) (annotate) - [select for diffs]
Modified Tue Aug 25 05:45:53 2009 UTC (3 months ago) by dogacan
File length: 41654 byte(s)
Diff to previous 751471 (colored)
Fetcher2 slow. Patch contributed by Julien Nioche.

Revision 751471 - (view) (annotate) - [select for diffs]
Modified Sun Mar 8 17:20:59 2009 UTC (8 months, 2 weeks ago) by siren
File length: 41653 byte(s)
Diff to previous 745503 (colored)
preparing for release

Revision 745503 - (view) (annotate) - [select for diffs]
Modified Wed Feb 18 12:53:12 2009 UTC (9 months ago) by siren
File length: 41657 byte(s)
Diff to previous 730845 (colored)
NUTCH-563 Include custom fields in BasicQueryFilter, contributed by Julien Nioche

Revision 730845 - (view) (annotate) - [select for diffs]
Modified Fri Jan 2 21:38:58 2009 UTC (10 months, 3 weeks ago) by kubes
File length: 41432 byte(s)
Diff to previous 730405 (colored)
NUTCH-594: Serve Nutch search results in multiple formats including XML and JSON.

Revision 730405 - (view) (annotate) - [select for diffs]
Modified Wed Dec 31 14:54:16 2008 UTC (10 months, 3 weeks ago) by kubes
File length: 40007 byte(s)
Diff to previous 701052 (colored)
Missed default configuration variable for NUTCH-668.

Revision 701052 - (view) (annotate) - [select for diffs]
Modified Thu Oct 2 09:17:23 2008 UTC (13 months, 3 weeks ago) by dogacan
File length: 39751 byte(s)
Diff to previous 678533 (colored)
NUTCH-640 - confusing description "set it to Integer.MAX_VALUE"

Revision 678533 - (view) (annotate) - [select for diffs]
Modified Mon Jul 21 19:20:21 2008 UTC (16 months ago) by ab
File length: 39766 byte(s)
Diff to previous 628631 (colored)
NUTCH-634 Upgrade Nutch to Hadoop 0.17.1.

Revision 628631 - (view) (annotate) - [select for diffs]
Modified Mon Feb 18 06:38:46 2008 UTC (21 months ago) by kubes
File length: 39279 byte(s)
Diff to previous 619648 (colored)
NUTCH-44 - Too many search results.  Configurable limit on max number of search results returned.  Thanks Emilijan Mirceski and Susam Pal.

Revision 619648 - (view) (annotate) - [select for diffs]
Modified Thu Feb 7 21:32:06 2008 UTC (21 months, 2 weeks ago) by kubes
File length: 38878 byte(s)
Diff to previous 608972 (colored)
NUTCH-602 - Allow configurable number of handlers for search servers.  Thanks to Seth Hartbecke from Search Wikia for spotting this.

Revision 608972 - (view) (annotate) - [select for diffs]
Modified Fri Jan 4 19:48:32 2008 UTC (22 months, 2 weeks ago) by dogacan
File length: 38707 byte(s)
Diff to previous 594591 (colored)
NUTCH-559 - NTLM, Basic and Digest Authentication schemes for web/proxy. Contributed by Susam Pal.

Revision 594591 - (view) (annotate) - [select for diffs]
Modified Tue Nov 13 17:35:08 2007 UTC (2 years ago) by kubes
File length: 36983 byte(s)
Diff to previous 586032 (colored)
NUTCH-574 - Including inlink anchor text in index can create irrelevant search results.  Moved inbound anchor text indexing from index-basic to new index-anchor plugin.  For backwards compatibility index-anchor will need to be added to the nutch-site.xml plugin.includes configuration variable. 
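
As a rough illustration of the backwards-compatibility note above, a nutch-site.xml override might look like the fragment below. The plugin list shown is an assumed example rather than the project's actual default; only the addition of index-anchor alongside index-basic comes from the commit message.

```xml
<!-- nutch-site.xml: illustrative override only. The plugin list below is an
     assumption; adapt it to the plugin.includes value already in use. -->
<property>
  <name>plugin.includes</name>
  <!-- index-anchor is added next to index-basic to keep indexing inbound anchor text -->
  <value>protocol-http|urlfilter-regex|parse-(text|html)|index-(basic|anchor)|query-(basic|site|url)</value>
</property>
```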

Revision 586032 - (view) (annotate) - [select for diffs]
Modified Thu Oct 18 16:53:48 2007 UTC (2 years, 1 month ago) by kubes
File length: 36974 byte(s)
Diff to previous 583016 (colored)
NUTCH-488 - Avoid parsing unnecessary links and get a more relevant outlink list.  Thanks to Marcin Okraszewski and Emmanuel Joke.

Revision 583016 - (view) (annotate) - [select for diffs]
Modified Tue Oct 9 00:23:38 2007 UTC (2 years, 1 month ago) by mattmann
File length: 36546 byte(s)
Diff to previous 579656 (colored)
- fix for NUTCH-562

Revision 579656 - (view) (annotate) - [select for diffs]
Modified Wed Sep 26 14:02:48 2007 UTC (2 years, 1 month ago) by dogacan
File length: 36542 byte(s)
Diff to previous 575360 (colored)
NUTCH-25 - needs 'character encoding' detector. Mostly contributed by Doug Cook. Some parts are contributed by Marcin Okraszewski and Renaud Richardet. Also fixes NUTCH-369 and NUTCH-487.

Revision 575360 - (view) (annotate) - [select for diffs]
Modified Thu Sep 13 16:23:52 2007 UTC (2 years, 2 months ago) by ab
File length: 36277 byte(s)
Diff to previous 542903 (colored)
Document a property. Spotted by Emmanuel Joke.

Revision 542903 - (view) (annotate) - [select for diffs]
Modified Wed May 30 18:35:24 2007 UTC (2 years, 5 months ago) by ab
File length: 35939 byte(s)
Diff to previous 526035 (colored)
NUTCH-61 - adaptive fetch interval patch.

Revision 526035 - (view) (annotate) - [select for diffs]
Modified Fri Apr 6 02:36:56 2007 UTC (2 years, 7 months ago) by mattmann
File length: 33538 byte(s)
Diff to previous 522679 (colored)
- update for new development, Nutch 1.0-dev

Revision 522679 - (view) (annotate) - [select for diffs]
Modified Tue Mar 27 00:36:15 2007 UTC (2 years, 8 months ago) by mattmann
File length: 33534 byte(s)
Diff to previous 515844 (colored)
Release 0.9 steps 1-5

Revision 515844 - (view) (annotate) - [select for diffs]
Modified Wed Mar 7 23:37:21 2007 UTC (2 years, 8 months ago) by ab
File length: 33538 byte(s)
Diff to previous 500093 (colored)
NUTCH-167 - Observation of robots "noarchive" directive.

Revision 500093 - (view) (annotate) - [select for diffs]
Modified Fri Jan 26 02:03:41 2007 UTC (2 years, 9 months ago) by mattmann
File length: 33080 byte(s)
Diff to previous 500090 (colored)
- forgot period

Revision 500090 - (view) (annotate) - [select for diffs]
Modified Fri Jan 26 02:02:13 2007 UTC (2 years, 9 months ago) by mattmann
File length: 33079 byte(s)
Diff to previous 497867 (colored)
- add comment about enabling protocol-httpclient in order to support HTTPS

Revision 497867 - (view) (annotate) - [select for diffs]
Modified Fri Jan 19 16:37:35 2007 UTC (2 years, 10 months ago) by siren
File length: 32922 byte(s)
Diff to previous 493548 (colored)
NUTCH-400

Revision 493548 - (view) (annotate) - [select for diffs]
Modified Sat Jan 6 19:49:49 2007 UTC (2 years, 10 months ago) by siren
File length: 32147 byte(s)
Diff to previous 490607 (colored)
fix NUTCH-421

Revision 490607 - (view) (annotate) - [select for diffs]
Modified Thu Dec 28 00:03:04 2006 UTC (2 years, 10 months ago) by ab
File length: 31391 byte(s)
Diff to previous 476617 (colored)
This patch addresses several issues:

* NUTCH-415 - Generator should mark selected records in CrawlDb.
  Due to increased resource consumption this step is optional.
  Application-level locking has been added to prevent concurrent
  modification of databases.

* NUTCH-416 - CrawlDatum status and CrawlDbReducer refactoring. It is
  now possible to correctly update CrawlDb from multiple segments.
  Introduce new status codes for temporary and permanent
  redirection.

* NUTCH-322 - Fix Fetcher to store redirected pages and to store
  protocol-level status. This also should fix NUTCH-273.

* Change default Fetcher behavior not to follow redirects immediately.
  Instead Fetcher will record redirects as new pages to be added to CrawlDb.
  This also partially addresses NUTCH-273.

* Detect and report when Generator creates 0-sized segments.

* Fix Injector to preserve already existing CrawlDatum if the seed list
  being injected also contains such URL.

This development was partially supported by SiteSell Inc.


Revision 476617 - (view) (annotate) - [select for diffs]
Modified Sat Nov 18 21:55:44 2006 UTC (3 years ago) by siren
File length: 30777 byte(s)
Diff to previous 450799 (colored)
NUTCH-388 Fix description of urlfilter.order

Revision 450799 - (view) (annotate) - [select for diffs]
Modified Thu Sep 28 10:48:25 2006 UTC (3 years, 1 month ago) by ab
File length: 30752 byte(s)
Diff to previous 449088 (colored)
Bring back the '-noAdditions' option. This is useful for running
constrained crawls, where the complete list of URLs is known in
advance.

Revision 449088 - (view) (annotate) - [select for diffs]
Modified Fri Sep 22 21:05:33 2006 UTC (3 years, 2 months ago) by ab
File length: 30479 byte(s)
Diff to previous 431364 (colored)
Refactor URLNormalizers (NUTCH-365). Iterative normalization has been
implemented, but is not used by default.

Development of this functionality was supported by SiteSell Inc.

Revision 431364 - (view) (annotate) - [select for diffs]
Modified Mon Aug 14 14:56:54 2006 UTC (3 years, 3 months ago) by ab
File length: 29949 byte(s)
Diff to previous 425495 (colored)
Optionally skip pages with abnormally large Crawl-Delay values. Original
patch submitted by Dennis Kubes.

Revision 425495 - (view) (annotate) - [select for diffs]
Modified Tue Jul 25 19:38:54 2006 UTC (3 years, 4 months ago) by siren
File length: 29533 byte(s)
Diff to previous 425494 (colored)
prepare for 0.9-dev

Revision 425494 - (view) (annotate) - [select for diffs]
Modified Tue Jul 25 19:37:25 2006 UTC (3 years, 4 months ago) by siren
File length: 29533 byte(s)
Diff to previous 425321 (colored)
prepare for 0.9-dev

Revision 425321 - (view) (annotate) - [select for diffs]
Modified Tue Jul 25 07:57:11 2006 UTC (3 years, 4 months ago) by siren
File length: 29529 byte(s)
Diff to previous 425092 (colored)
preparing 0.8 release

Revision 425092 - (view) (annotate) - [select for diffs]
Modified Mon Jul 24 15:27:20 2006 UTC (3 years, 4 months ago) by ab
File length: 29533 byte(s)
Diff to previous 423670 (colored)
Apply NUTCH-324, and clarify documentation in nutch-default.xml.

Revision 423670 - (view) (annotate) - [select for diffs]
Modified Thu Jul 20 00:04:56 2006 UTC (3 years, 4 months ago) by ab
File length: 29371 byte(s)
Diff to previous 423539 (colored)
Set http.agent.name and related properties to empty values. This forces
people to put some sensible values there, and protects the Nutch project
from being blamed for someone else's misbehavior.

Revision 423539 - (view) (annotate) - [select for diffs]
Modified Wed Jul 19 17:35:08 2006 UTC (3 years, 4 months ago) by ab
File length: 28812 byte(s)
Diff to previous 423291 (colored)
Add ability to limit outlinks to only include initial hosts (NUTCH-173).

Revision 423291 - (view) (annotate) - [select for diffs]
Modified Tue Jul 18 23:27:49 2006 UTC (3 years, 4 months ago) by ab
File length: 28497 byte(s)
Diff to previous 417285 (colored)
Add "db.max.inlinks" with its default value, and document it.

Revision 417285 - (view) (annotate) - [select for diffs]
Modified Mon Jun 26 19:38:39 2006 UTC (3 years, 4 months ago) by ab
File length: 28201 byte(s)
Diff to previous 409275 (colored)
Add an optional mechanism to time limit long-running queries. This helps to
protect search servers from adverse effects of certain resource-intensive
queries.

Development of this functionality was supported by Krugle.net. Thank you!


Revision 409275 - (view) (annotate) - [select for diffs]
Modified Thu May 25 00:38:16 2006 UTC (3 years, 6 months ago) by ab
File length: 27449 byte(s)
Diff to previous 407567 (colored)
Fix for incorrect behavior (collecting action URLs from forms). This
is now optional, and turned off by default.

Update JUnit test to cover this option.

Revision 407567 - (view) (annotate) - [select for diffs]
Modified Thu May 18 15:26:06 2006 UTC (3 years, 6 months ago) by ab
File length: 27118 byte(s)
Diff to previous 406757 (colored)
Refactor HTTP plugins so that both support gzip encoding. Add
appropriate headers in protocol-httpclient so that it prefers this
encoding.

Add an option to use HTTP 1.1 (at the moment only protocol-httpclient
supports it).

Revision 406757 - (view) (annotate) - [select for diffs]
Modified Mon May 15 22:18:34 2006 UTC (3 years, 6 months ago) by ab
File length: 26899 byte(s)
Diff to previous 406625 (colored)
Fix NUTCH-268. Default settings are still different to avoid DOS-ing
remote DNS servers during fetchlist generation.

Revision 406625 - (view) (annotate) - [select for diffs]
Modified Mon May 15 12:14:36 2006 UTC (3 years, 6 months ago) by ab
File length: 25932 byte(s)
Diff to previous 405967 (colored)
Add a suffix-based URLFilter. Correct also extension IDs for other urlfilter
plugins, so that they can be active at the same time.

Revision 405967 - (view) (annotate) - [select for diffs]
Modified Sat May 13 00:52:33 2006 UTC (3 years, 6 months ago) by ab
File length: 25701 byte(s)
Diff to previous 405165 (colored)
Scoring API (NUTCH-240).

Development of this functionality was supported by Krugle.net. Thank you!

Revision 405165 - (view) (annotate) - [select for diffs]
Modified Mon May 8 21:04:01 2006 UTC (3 years, 6 months ago) by jerome
File length: 25286 byte(s)
Diff to previous 391958 (colored)
NUTCH-134: Added a summarizer extension point and two extensions:
* summary-basic is the current Nutch implementation moved into a plugin
* summary-lucene is a raw version of a summarizer plugin based on the Lucene highlighter

Revision 391958 - (view) (annotate) - [select for diffs]
Modified Thu Apr 6 10:49:40 2006 UTC (3 years, 7 months ago) by jerome
File length: 25272 byte(s)
Diff to previous 387655 (colored)
NUTCH-244, db.max.outlinks.per.page can now be negative for no limit of handled outlinks per page

Revision 387655 - (view) (annotate) - [select for diffs]
Modified Tue Mar 21 22:35:20 2006 UTC (3 years, 8 months ago) by jerome
File length: 25117 byte(s)
Diff to previous 386876 (colored)
Add lib-regex-filter and urlfilter-automaton to the list of javadoc packages.
Add lib-regex-filter and urlfilter-automaton to the list of deployed, tested and cleaned plugins.
Add the regular expression rule file property for urlfilter-automaton.

Revision 386876 - (view) (annotate) - [select for diffs]
Modified Sat Mar 18 19:26:49 2006 UTC (3 years, 8 months ago) by ab
File length: 24867 byte(s)
Diff to previous 384639 (colored)
Document new option "db.score.count.filtered".

Revision 384639 - (view) (annotate) - [select for diffs]
Modified Thu Mar 9 23:04:24 2006 UTC (3 years, 8 months ago) by jerome
File length: 24483 byte(s)
Diff to previous 382535 (colored)
Add boost configuration param for RawFieldQueryFilters

Revision 382535 - (view) (annotate) - [select for diffs]
Modified Thu Mar 2 22:38:40 2006 UTC (3 years, 8 months ago) by jerome
File length: 23529 byte(s)
Diff to previous 376072 (colored)
Fix content.limit inconsistency in http, ftp and file

Revision 376072 - (view) (annotate) - [select for diffs]
Modified Wed Feb 8 21:25:30 2006 UTC (3 years, 9 months ago) by cutting
File length: 23560 byte(s)
Diff to previous 374796 (colored)
Restore accidentally removed file defaults.

Revision 374796 - (view) (annotate) - [select for diffs]
Modified Sat Feb 4 00:38:32 2006 UTC (3 years, 9 months ago) by cutting
File length: 22836 byte(s)
Diff to previous 370638 (colored)
NUTCH-193: MapReduce and NDFS code moved to new project, Hadoop.  See bug report for details.

Revision 370638 - (view) (annotate) - [select for diffs]
Modified Thu Jan 19 21:24:58 2006 UTC (3 years, 10 months ago) by cutting
File length: 29644 byte(s)
Diff to previous 370632 (colored)
Document a few more properties.  Contributed by Dominik Friedrich.

Revision 370632 - (view) (annotate) - [select for diffs]
Modified Thu Jan 19 20:58:54 2006 UTC (3 years, 10 months ago) by cutting
File length: 28914 byte(s)
Diff to previous 366280 (colored)
Switch default to protocol-http, since it seems more reliable than protocol-httpclient.

Revision 366280 - (view) (annotate) - [select for diffs]
Modified Thu Jan 5 21:08:27 2006 UTC (3 years, 10 months ago) by cutting
File length: 28920 byte(s)
Diff to previous 366242 (colored)
Fix NUTCH-131: add mapred.child.heap.size.  From Marko Bauhardt.

Revision 366242 - (view) (annotate) - [select for diffs]
Modified Thu Jan 5 18:38:44 2006 UTC (3 years, 10 months ago) by cutting
File length: 28732 byte(s)
Diff to previous 365459 (colored)
Fix NegativeArraySizeException.

Revision 365459 - (view) (annotate) - [select for diffs]
Modified Mon Jan 2 23:27:50 2006 UTC (3 years, 10 months ago) by cutting
File length: 28704 byte(s)
Diff to previous 359822 (colored)
Add index sorter & ability to stop searching after N hits.

Revision 359822 - (view) (annotate) - [select for diffs]
Modified Thu Dec 29 15:28:30 2005 UTC (3 years, 10 months ago) by ab
File length: 28416 byte(s)
Diff to previous 358060 (colored)
A framework for using different page signature implementations. Ordinary
MD5 hash of a raw page content is very often unsuitable, when many
near-duplicate pages are crawled.

Now users can select their own page signature implementation, possibly
with better properties than the old one.

Two implementations are provided:

* MD5Signature: backward-compatible with the old schema.

* TextProfileSignature: an example implementation of a signature, which
  gives the same values for near-duplicate pages. Please see Javadoc for
  more information.

This commit changes the CrawlDatum to store page signatures in CrawlDb.
Last modified time field was added, too. Both changes are in preparation
for patches implementing self-adjustable fetch interval.

NutchConf was extended to store and retrieve also plain Object values.
This is useful when caching per-job instances.

StringUtil: added methods to display / parse byte[] values.

Added SegmentReader (based on a contribution in NUTCH-121 by Rod Taylor).

Fixed Fetcher to actually use the command-line parameters.

Revision 358060 - (view) (annotate) - [select for diffs]
Modified Tue Dec 20 18:16:57 2005 UTC (3 years, 11 months ago) by siren
File length: 27541 byte(s)
Diff to previous 357335 (colored)
NUTCH-146, removed duplicate configuration property

Revision 357335 - (view) (annotate) - [select for diffs]
Modified Sat Dec 17 10:10:32 2005 UTC (3 years, 11 months ago) by jerome
File length: 27741 byte(s)
Diff to previous 357334 (colored)
NUTCH-3, Rollback nutch-default.xml (commit error)

Revision 357334 - (view) (annotate) - [select for diffs]
Modified Sat Dec 17 10:06:31 2005 UTC (3 years, 11 months ago) by jerome
File length: 27770 byte(s)
Diff to previous 357197 (colored)
NUTCH-3, ContentProperties can handle multivalued properties (S. Groschupf)

Revision 357197 - (view) (annotate) - [select for diffs]
Modified Fri Dec 16 17:51:05 2005 UTC (3 years, 11 months ago) by cutting
File length: 27741 byte(s)
Diff to previous 292865 (colored)
Merge mapred branch to trunk & remove it.

Revision 292865 - (view) (annotate) - [select for diffs]
Modified Fri Sep 30 22:11:19 2005 UTC (4 years, 1 month ago) by jerome
File length: 24711 byte(s)
Diff to previous 280547 (colored)
NUTCH-88, Second step implementation:
* Add a configuration property for the parse-plugins.xml file location
* ParserFactory now returns an ordered list of Parsers
* Improve logging
* Improve Parser selection policy
* Unit Tests added

Revision 280547 - (view) (annotate) - [select for diffs]
Modified Tue Sep 13 12:35:55 2005 UTC (4 years, 2 months ago) by jerome
File length: 24474 byte(s)
Diff to previous 280176 (colored)
plugin.auto-activation set to true by default

Revision 280176 - (view) (annotate) - [select for diffs]
Modified Sun Sep 11 20:28:07 2005 UTC (4 years, 2 months ago) by jerome
File length: 24475 byte(s)
Diff to previous 279286 (colored)
Automatically load the dependencies of active plugins (adds a property; default is on)

Revision 279286 - (view) (annotate) - [select for diffs]
Modified Wed Sep 7 09:57:24 2005 UTC (4 years, 2 months ago) by jerome
File length: 24193 byte(s)
Diff to previous 233492 (colored)
Includes protocol-httpclient plugin (instead of protocol-http) and parse-js erroneously removed during commit of revision 233492

Revision 233492 - (view) (annotate) - [select for diffs]
Modified Fri Aug 19 15:55:46 2005 UTC (4 years, 3 months ago) by jerome
File length: 24184 byte(s)
Diff to previous 233161 (colored)
NUTCH-10, extension points defined only once (Stefan Groschupf)

Revision 233161 - (view) (annotate) - [select for diffs]
Modified Wed Aug 17 11:36:46 2005 UTC (4 years, 3 months ago) by pkosiorowski
File length: 24098 byte(s)
Diff to previous 233032 (colored)
0.8-dev version started.

Revision 233032 - (view) (annotate) - [select for diffs]
Modified Tue Aug 16 18:39:23 2005 UTC (4 years, 3 months ago) by pkosiorowski
File length: 24094 byte(s)
Diff to previous 231315 (colored)
Preparing 0.7 release

Revision 231315 - (view) (annotate) - [select for diffs]
Modified Wed Aug 10 20:30:54 2005 UTC (4 years, 3 months ago) by pkosiorowski
File length: 24098 byte(s)
Diff to previous 230887 (colored)
Fix version number format and add -dev suffix.

Revision 230887 - (view) (annotate) - [select for diffs]
Modified Mon Aug 8 20:44:23 2005 UTC (4 years, 3 months ago) by pkosiorowski
File length: 24095 byte(s)
Diff to previous 208872 (colored)
User agent string related properties updated.

Revision 208872 - (view) (annotate) - [select for diffs]
Modified Sat Jul 2 20:39:14 2005 UTC (4 years, 4 months ago) by ab
File length: 24101 byte(s)
Diff to previous 208869 (colored)
Applied patches in NUTCH-56, with minor changes. Submitted by Andy Liu.

Revision 208869 - (view) (annotate) - [select for diffs]
Modified Sat Jul 2 19:32:05 2005 UTC (4 years, 4 months ago) by ab
File length: 23757 byte(s)
Diff to previous 179640 (colored)
Improvements and fixes in NUTCH-60. Submitted by Jerome Charron.

Revision 179640 - (view) (annotate) - [select for diffs]
Modified Thu Jun 2 20:37:21 2005 UTC (4 years, 5 months ago) by cutting
File length: 22691 byte(s)
Diff to previous 179574 (colored)
Moving Nutch from the Incubator to Lucene.

Revision 179574 - (view) (annotate) - [select for diffs]
Modified Thu Jun 2 12:08:53 2005 UTC (4 years, 5 months ago) by ab
Original Path: incubator/nutch/trunk/conf/nutch-default.xml
File length: 22691 byte(s)
Diff to previous 179436 (colored)
Missing closing tag - added. Reported by Piotr Kosiorowski.

Revision 179436 - (view) (annotate) - [select for diffs]
Modified Wed Jun 1 22:20:01 2005 UTC (4 years, 5 months ago) by ab
Original Path: incubator/nutch/trunk/conf/nutch-default.xml
File length: 22679 byte(s)
Diff to previous 169406 (colored)
This patchset contains improvements to Fetcher, described in NUTCH-54,
specifically the following:

* protocol- and content-based redirection handling in Fetcher.

* parse-js: heuristic link extractor for JavaScript

* protocol-httpclient: HTTP and HTTPS protocol handler, based on
Jakarta Commons HttpClient library.

* alternative HTML parser based on TagSoup.

* improved status reporting for protocol and parse plugins. Status
information is persisted in segment data, so that other plugins can
use it.

* and other assorted fixes...

This work has been sponsored by EvaluMetrix LLC (http://www.evalumetrix.com).
Thank you!


Revision 169406 - (view) (annotate) - [select for diffs]
Modified Tue May 10 03:20:11 2005 UTC (4 years, 6 months ago) by cutting
Original Path: incubator/nutch/trunk/conf/nutch-default.xml
File length: 22447 byte(s)
Diff to previous 161984 (colored)
Add ability to set Lucene's term index interval from config.

Revision 161984 - (view) (annotate) - [select for diffs]
Modified Tue Apr 19 21:36:26 2005 UTC (4 years, 7 months ago) by cutting
Original Path: incubator/nutch/trunk/conf/nutch-default.xml
File length: 22085 byte(s)
Diff to previous 161952 (colored)
Make query boosts configurable.  Patch by Piotr Kosiorowski.

Revision 161952 - (view) (annotate) - [select for diffs]
Modified Tue Apr 19 18:58:12 2005 UTC (4 years, 7 months ago) by cutting
Original Path: incubator/nutch/trunk/conf/nutch-default.xml
File length: 21198 byte(s)
Diff to previous 161630 (colored)
Deprecate link analysis.  Remove it from the tutorial and change the default configuration so that link counts are used instead.

Revision 161630 - (view) (annotate) - [select for diffs]
Modified Sun Apr 17 06:51:28 2005 UTC (4 years, 7 months ago) by johnx
Original Path: incubator/nutch/trunk/conf/nutch-default.xml
File length: 21200 byte(s)
Diff to previous 160080 (colored)
Close Issue #33 - MIME content type detector (using magic char sequences).

Revision 160080 - (view) (annotate) - [select for diffs]
Modified Mon Apr 4 18:43:18 2005 UTC (4 years, 7 months ago) by siren
Original Path: incubator/nutch/trunk/conf/nutch-default.xml
File length: 21203 byte(s)
Diff to previous 157453 (colored)
Changed ipc timeout to be configurable (NUTCH-15)

Revision 157453 - (view) (annotate) - [select for diffs]
Modified Mon Mar 14 19:46:44 2005 UTC (4 years, 8 months ago) by cutting
Original Path: incubator/nutch/trunk/conf/nutch-default.xml
File length: 21016 byte(s)
Diff to previous 155829 (colored)
Fixed http://issues.apache.org/jira/browse/NUTCH-8.  Added a working parameter to control the number of fetcher retries and removed a broken one.

Revision 155829 - (view) (annotate) - [select for diffs]
Added Tue Mar 1 22:04:46 2005 UTC (4 years, 8 months ago) by cutting
Original Path: incubator/nutch/trunk/conf/nutch-default.xml
File length: 21029 byte(s)
Initial import of Nutch to Apache.
