Log of /lucene/nutch/trunk/CHANGES.txt
Revision 782412 - Modified Sun Jun 7 17:12:18 2009 UTC (5 months, 2 weeks ago) by dogacan - File length: 45038 byte(s)
NUTCH-735 - crawl-tool.xml must be read before nutch-site.xml when invoked using crawl command. Patch by Susam Pal.
Revision 752000 - Modified Tue Mar 10 07:07:22 2009 UTC (8 months, 2 weeks ago) by siren - File length: 44815 byte(s)
NUTCH-715 - Subcollection plugin doesn't work with default subcollections.xml file. Contributed by Dmitry Lihachev
Revision 750037 - Modified Wed Mar 4 15:02:29 2009 UTC (8 months, 3 weeks ago) by ab - File length: 44646 byte(s)
NUTCH-711 - Indexer failing after upgrade to Hadoop 0.19.1. This is a temporary
fix, to be revisited later.
Revision 748408 - Modified Fri Feb 27 06:21:37 2009 UTC (8 months, 4 weeks ago) by siren - File length: 44291 byte(s)
NUTCH-699 - Add an "official" solr schema for solr integration. Contributed by dogacan, Dmitry Lihachev
Revision 747312 - Modified Tue Feb 24 09:18:03 2009 UTC (9 months ago) by siren - File length: 44028 byte(s)
NUTCH-626 - Fetcher2 breaks out the domain with db.ignore.external.links set at cross domain redirects, contributed by Remco Verhoef, dogacan
Revision 745808 - Modified Thu Feb 19 10:25:47 2009 UTC (9 months, 1 week ago) by siren - File length: 43816 byte(s)
NUTCH-695 - incorrect mime type detection by MoreIndexingFilter plugin, contributed by Dmitry Lihachev
Revision 628631 - Modified Mon Feb 18 06:38:46 2008 UTC (21 months, 1 week ago) by kubes - File length: 39251 byte(s)
NUTCH-44 - Too many search results. Configurable limit on max number of search results returned. Thanks Emilijan Mirceski and Susam Pal.
Revision 627893 - Modified Thu Feb 14 22:21:50 2008 UTC (21 months, 1 week ago) by kubes - File length: 39111 byte(s)
NUTCH-611 - Upgrade Nutch to use Hadoop 0.16. This upgrade removes the deprecated addDefaultResource and addFinalResource methods; use addResource instead. Two scripts, start-balancer.sh and stop-balancer.sh, are added to the bin directory.
Revision 619648 - Modified Thu Feb 7 21:32:06 2008 UTC (21 months, 2 weeks ago) by kubes - File length: 38676 byte(s)
NUTCH-602 - Allow configurable number of handlers for search servers. Thanks to Seth Hartbecke from Search Wikia for spotting this.
Revision 616095 - Modified Mon Jan 28 22:40:29 2008 UTC (21 months, 4 weeks ago) by kubes - File length: 38530 byte(s)
NUTCH-587 - Upgrade Nutch to use Hadoop 0.15.3 release. Goof on changes.txt, didn't change the number. Changed it to 68.
Revision 601043 - Modified Tue Dec 4 19:13:28 2007 UTC (23 months, 3 weeks ago) by kubes - File length: 37913 byte(s)
NUTCH-581 - DistributedSearch does not update search servers added to search-servers.txt on the fly. This allows search servers to be added and removed on the fly. Thanks Rohan.
Revision 594591 - Modified Tue Nov 13 17:35:08 2007 UTC (2 years ago) by kubes - File length: 37777 byte(s)
NUTCH-574 - Including inlink anchor text in index can create irrelevant search results. Moved inbound anchor text indexing from index-basic to new index-anchor plugin. For backwards compatibility index-anchor will need to be added to the nutch-site.xml plugin.includes configuration variable.
Revision 591793 - Modified Sun Nov 4 16:01:53 2007 UTC (2 years ago) by kubes - File length: 37157 byte(s)
NUTCH-565 - Arc File to Nutch Segments Converter. This tool allows the conversion of multiple .arc files, a format used by the Internet Archive and Grub distributed crawler projects, into Nutch segments.
Revision 586032 - Modified Thu Oct 18 16:53:48 2007 UTC (2 years, 1 month ago) by kubes - File length: 36928 byte(s)
NUTCH-488 - Avoid parsing unnecessary links and get a more relevant outlink list. Thanks to Marcin Okraszewski and Emmanuel Joke.
Revision 582775 - Modified Mon Oct 8 10:58:11 2007 UTC (2 years, 1 month ago) by dogacan - File length: 36687 byte(s)
NUTCH-508 - ${hadoop.log.dir} and ${hadoop.log.file} are not propagated to the tasktracker. Contributed by Mathijs Homminga and Emmanuel Joke.
Revision 579656 - Modified Wed Sep 26 14:02:48 2007 UTC (2 years, 2 months ago) by dogacan - File length: 36540 byte(s)
NUTCH-25 - needs 'character encoding' detector. Mostly contributed by Doug Cook. Some parts are contributed by Marcin Okraszewski and Renaud Richardet. Also fixes NUTCH-369 and NUTCH-487.
Revision 578703 - Modified Mon Sep 24 08:27:34 2007 UTC (2 years, 2 months ago) by dogacan - File length: 36412 byte(s)
NUTCH-529 - NodeWalker.skipChildren doesn't work for more than 1 child. Contributed by Emmanuel Joke.
Revision 570331 - Modified Tue Aug 28 06:34:36 2007 UTC (2 years, 3 months ago) by dogacan - File length: 35967 byte(s)
NUTCH-545 - Configuration and OnlineClusterer get initialized in every request. Contributed by Dawid Weiss.
Revision 570327 - Modified Tue Aug 28 06:26:51 2007 UTC (2 years, 3 months ago) by dogacan - File length: 35852 byte(s)
NUTCH-544 - Upgrade Carrot2 clustering plugin to the newest stable release (2.1). Contributed by Dawid Weiss.
Revision 561306 - Modified Tue Jul 31 12:07:30 2007 UTC (2 years, 3 months ago) by dogacan - File length: 35366 byte(s)
NUTCH-533 - LinkDbMerger: url normalized is not updated in the key and inlinks list. Contributed by Emmanuel Joke.
Revision 559754 - Modified Thu Jul 26 08:44:33 2007 UTC (2 years, 4 months ago) by dogacan - File length: 34958 byte(s)
NUTCH-525 - DeleteDuplicates generates ArrayIndexOutOfBoundsException when trying to rerun dedup on a segment. Contributed by Vishal Shah.
Revision 559742 - Modified Thu Jul 26 08:10:38 2007 UTC (2 years, 4 months ago) by dogacan - File length: 34811 byte(s)
NUTCH-516 - Next fetch time is not set when it is a CrawlDatum.STATUS_FETCH_GONE. Contributed by Emmanuel Joke.
Revision 554530 - Modified Mon Jul 9 06:15:53 2007 UTC (2 years, 4 months ago) by dogacan - File length: 34061 byte(s)
NUTCH-507 - lib-lucene-analyzers jar definition is wrong in plugin.xml. Contributed by Emmanuel Joke.
Revision 551147 - Modified Wed Jun 27 12:46:05 2007 UTC (2 years, 5 months ago) by dogacan - File length: 33953 byte(s)
NUTCH-498 - Use Combiner in LinkDb to increase speed of linkdb generation. Contributed by Espen Amble Kolstad.
Revision 550683 - Modified Tue Jun 26 04:45:35 2007 UTC (2 years, 5 months ago) by kubes - File length: 33655 byte(s)
NUTCH-497: Fixes problems relating to StackOverflow errors
and extreme nested tags. Adds general framework for stack
based Node walking.
Revision 548429 - Modified Mon Jun 18 18:13:15 2007 UTC (2 years, 5 months ago) by dogacan - File length: 32730 byte(s)
NUTCH-489 - URLFilter-suffix management of the url path when the url contains some query parameters.
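The log entry above does not describe the fix itself. As a rough illustration only, and assuming the problem was that a suffix test on the whole URL string is thrown off by a query string, the sketch below applies the suffix test to the path component alone. It is written in Python rather than taken from Nutch, and the suffix list and the function name `is_blocked` are hypothetical.

```python
from urllib.parse import urlsplit

# Hypothetical suffix list for illustration; not Nutch's configuration.
BLOCKED_SUFFIXES = (".gif", ".jpg", ".zip")


def is_blocked(url, suffixes=BLOCKED_SUFFIXES):
    """Return True when the URL path, ignoring query and fragment,
    ends with one of the blocked suffixes (case-insensitive)."""
    path = urlsplit(url).path.lower()
    # str.endswith accepts a tuple of candidate suffixes.
    return path.endswith(suffixes)
```

With this approach a URL such as `http://example.com/a.gif?x=1` is matched on its `.gif` path, while `http://example.com/page.html?img=a.gif` is not, because the suffix appears only in the query.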
Revision 538273 - Modified Tue May 15 18:29:49 2007 UTC (2 years, 6 months ago) by siren - File length: 31741 byte(s)
NUTCH-161 Change Plain text parser to use parser.character.encoding.default property for fall back encoding
spotted by KuroSaka TeruHiko
Revision 536925 - Modified Thu May 10 16:29:51 2007 UTC (2 years, 6 months ago) by siren - File length: 31439 byte(s)
NUTCH-446 RobotRulesParser should ignore Crawl-delay values of other bots in robots.txt, contributed by Doğacan Güney
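To make the behaviour in the entry above concrete, here is a minimal Python sketch of reading Crawl-delay only from the robots.txt groups that apply to one's own agent. It is not Nutch's RobotRulesParser: the function name `crawl_delay_for`, the exact-name matching, and the preference for a named group over the `*` group are all assumptions made for illustration.

```python
def crawl_delay_for(robots_txt, agent):
    """Return the Crawl-delay in seconds that applies to `agent`,
    or None when no applicable group sets one."""
    agent = agent.lower()
    current_agents = []  # user-agent names of the group being read
    in_rules = False     # True once the current group has listed a rule
    specific = None      # delay from a group that names our agent
    wildcard = None      # delay from the "*" group
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:  # a User-agent line after rules starts a new group
                current_agents = []
                in_rules = False
            current_agents.append(value.lower())
            continue
        in_rules = True
        if field != "crawl-delay":
            continue
        try:
            delay = float(value)
        except ValueError:
            continue  # ignore malformed values
        # Delays in groups for other bots are deliberately ignored.
        if agent in current_agents:
            if specific is None:
                specific = delay
        elif "*" in current_agents:
            if wildcard is None:
                wildcard = delay
    return specific if specific is not None else wildcard
```

Given a file that sets `Crawl-delay: 100` for `otherbot` and `Crawl-delay: 2` for `*`, a crawler named `nutch` would pick up only the second value.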
Revision 521933 - Modified Fri Mar 23 22:59:01 2007 UTC (2 years, 8 months ago) by ab - File length: 30820 byte(s)
Upgrade to Hadoop 0.12.2 release.
Fix whitespace issues in platform name in bin/hadoop under Cygwin.
Replace deprecated method call.
Revision 515698 - Modified Wed Mar 7 19:02:56 2007 UTC (2 years, 8 months ago) by ab - File length: 30139 byte(s)
NUTCH-432 - JAVA_PLATFORM with spaces breaks bin/nutch.
Also, apply the patch proposed in HADOOP-1080 to fix CLASSPATH problems
under Cygwin.
Revision 495392 - Modified Thu Jan 11 21:51:20 2007 UTC (2 years, 10 months ago) by ab - File length: 29275 byte(s)
Upgrade to Hadoop 0.10.1. HTTPClient is now a dependency - move it
to lib/ and remove it as a plugin.
Also add native Linux libraries for Hadoop compression, plus corresponding
logic in bin/nutch.
Hadoop uses larger buffers now - explicitly set large heap size for
JUnit tests. All tests should pass now.
Revision 495214 - Modified Thu Jan 11 13:25:43 2007 UTC (2 years, 10 months ago) by ab - File length: 29239 byte(s)
When indexing redirected pages, drop intermediate pages and only index the
final page.
Avoid NPEs in Crawl tool, when no URLs are generated or fetched.
Revision 490607 - Modified Thu Dec 28 00:03:04 2006 UTC (2 years, 11 months ago) by ab - File length: 28767 byte(s)
This patch addresses several issues:
* NUTCH-415 - Generator should mark selected records in CrawlDb.
Due to increased resource consumption this step is optional.
Application-level locking has been added to prevent concurrent
modification of databases.
* NUTCH-416 - CrawlDatum status and CrawlDbReducer refactoring. It is
now possible to correctly update CrawlDb from multiple segments.
Introduce new status codes for temporary and permanent
redirection.
* NUTCH-322 - Fix Fetcher to store redirected pages and to store
protocol-level status. This also should fix NUTCH-273.
* Change default Fetcher behavior not to follow redirects immediately.
Instead Fetcher will record redirects as new pages to be added to CrawlDb.
This also partially addresses NUTCH-273.
* Detect and report when Generator creates 0-sized segments.
* Fix Injector to preserve already existing CrawlDatum if the seed list
being injected also contains such URL.
This development was partially supported by SiteSell Inc.
Revision 464654 - Modified Mon Oct 16 20:38:57 2006 UTC (3 years, 1 month ago) by ab - File length: 26666 byte(s)
NUTCH-383: upgrade to Hadoop 0.7.1 and Lucene 2.0.0.
NUTCH-373: replace DeleteDuplicates with a version that implements both
parts of the algorithm. Add JUnit test.
Revision 451649 - Modified Sat Sep 30 19:38:30 2006 UTC (3 years, 1 month ago) by pkosiorowski - File length: 26035 byte(s)
NUTCH-374: when http.content.limit is set to -1 and Response.CONTENT_ENCODING is gzip or x-gzip, nothing can be fetched. (King Kong)
Revision 449293 - Modified Sat Sep 23 19:36:47 2006 UTC (3 years, 2 months ago) by ab - File length: 25864 byte(s)
NUTCH-350: urls incorrectly marked as STATUS_FETCH_GONE when blocked by
http.max.delays. Instead the status is set to STATUS_FETCH_RETRY. Since this
is an intermittent problem related to the Fetcher implementation, we don't
increase the retry counter.
Revision 449102 - Modified Fri Sep 22 21:49:09 2006 UTC (3 years, 2 months ago) by ab - File length: 25423 byte(s)
NUTCH-332: fix the problem of doubling scores caused by links pointing
to the current page (e.g. anchors).
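One way to picture the problem in the entry above: a link such as `page#section` differs from the page URL only by its fragment, so counting it as an outlink lets the page pass score to itself. The Python sketch below shows one possible remedy, dropping such self-links before scores are distributed. It is an assumption about the approach, not the actual Nutch patch, and `outlink_targets` is a hypothetical name.

```python
from urllib.parse import urldefrag


def outlink_targets(page_url, outlinks):
    """Return outlink targets with fragments removed, skipping links
    that resolve to the page itself (e.g. in-page anchors)."""
    page, _ = urldefrag(page_url)
    targets = []
    for link in outlinks:
        target, _ = urldefrag(link)  # strip "#fragment"
        if target != page:
            targets.append(target)
    return targets
```

For a page at `http://example.com/a` with links to `http://example.com/a#top` and `http://example.com/b`, only the second link remains as a scoring target.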
Revision 432615 - Modified Fri Aug 18 15:12:12 2006 UTC (3 years, 3 months ago) by siren - File length: 25097 byte(s)
NUTCH-338 - Remove the text parser as an option for parsing PDF files in parse-plugins.xml (Chris A. Mattmann)
Revision 431364 - Modified Mon Aug 14 14:56:54 2006 UTC (3 years, 3 months ago) by ab - File length: 24688 byte(s)
Optionally skip pages with abnormally large Crawl-Delay values. Original
patch submitted by Dennis Kubes.
Revision 405204 - Modified Mon May 8 22:34:29 2006 UTC (3 years, 6 months ago) by cutting - File length: 17789 byte(s)
Change parameters passed to Hadoop's FileSystem from (now-deprecated) java.io.File to (new) org.apache.hadoop.fs.Path.
Revision 160446 - Modified Thu Apr 7 19:53:14 2005 UTC (4 years, 7 months ago) by siren - Original Path: incubator/nutch/trunk/CHANGES.txt - File length: 16476 byte(s)
Added some features to DistributedSearch: new segments can be added
to searchservers without restarting the frontend, defective search
servers are not queried until they come back online, watchdog keeps
an eye for your searchservers and writes simple statistics.