NUTCH-758 Set subversion eol-style to "native".
Fetcher2 slow. Patch contributed by Julien Nioche.
preparing for release
NUTCH-563 Include custom fields in BasicQueryFilter, contributed by Julien Nioche
NUTCH-594: Serve Nutch search results in multiple formats including XML and JSON.
Missed default configuration variable for NUTCH-668.
NUTCH-640 - confusing description "set it to Integer.MAX_VALUE"
NUTCH-634 Upgrade Nutch to Hadoop 0.17.1 .
NUTCH-44 - Too many search results. Configurable limit on max number of search results returned. Thanks Emilijan Mirceski and Susam Pal.
NUTCH-602 - Allow configurable number of handlers for search servers. Thanks to Seth Hartbecke from Search Wikia for spotting this.
NUTCH-559 - NTLM, Basic and Digest Authentication schemes for web/proxy. Contributed by Susam Pal.
NUTCH-574 - Including inlink anchor text in index can create irrelevant search results. Moved inbound anchor text indexing from index-basic to new index-anchor plugin. For backwards compatibility index-anchor will need to be added to the nutch-site.xml plugin.includes configuration variable.
NUTCH-488 - Avoid parsing unnecessary links and get a more relevant outlink list. Thanks to Marcin Okraszewski and Emmanuel Joke.
- fix for NUTCH-562
NUTCH-25 - needs 'character encoding' detector. Mostly contributed by Doug Cook. Some parts are contributed by Marcin Okraszewski and Renaud Richardet. Also fixes NUTCH-369 and NUTCH-487.
Document a property. Spotted by Emmanuel Joke.
NUTCH-61 - adaptive fetch interval patch.
- update for new development, Nutch 1.0-dev
Release 0.9 steps 1-5
NUTCH-167 - Observation of robots "noarchive" directive.
- forgot period
- add comment about enabling protocol-httpclient in order to support HTTPS
NUTCH-400
fix NUTCH-421
This patch addresses several issues:
  * NUTCH-415 - Generator should mark selected records in CrawlDb. Due to increased resource consumption this step is optional. Application-level locking has been added to prevent concurrent modification of databases.
  * NUTCH-416 - CrawlDatum status and CrawlDbReducer refactoring. It is now possible to correctly update CrawlDb from multiple segments. Introduce new status codes for temporary and permanent redirection.
  * NUTCH-322 - Fix Fetcher to store redirected pages and to store protocol-level status. This also should fix NUTCH-273.
  * Change default Fetcher behavior not to follow redirects immediately. Instead Fetcher will record redirects as new pages to be added to CrawlDb. This also partially addresses NUTCH-273.
  * Detect and report when Generator creates 0-sized segments.
  * Fix Injector to preserve already existing CrawlDatum if the seed list being injected also contains such URL.
  This development was partially supported by SiteSell Inc.
NUTCH-388 Fix description of urlfilter.order
Bring back the '-noAdditions' option. This is useful for running constrained crawls, where the complete list of URLs is known in advance.
Refactor URLNormalizers (NUTCH-365). Iterative normalization has been implemented, but is not used by default. Development of this functionality was supported by SiteSell Inc.
Optionally skip pages with abnormally large Crawl-Delay values. Original patch submitted by Dennis Kubes.
prepare for 0.9-dev
prepare for 0.9-dev
preparing 0.8 release
Apply NUTCH-324, and clarify documentation in nutch-default.xml .
Set http.agent.name and related properties to empty values. This forces people to put some sensible values there, and protects the Nutch project from being blamed for someone else's misbehavior.
Add ability to limit outlinks to only include initial hosts (NUTCH-173).
Add "db.max.inlinks" with its default value, and document it.
Add an optional mechanism to time limit long-running queries. This helps to protect search servers from adverse effects of certain resource-intensive queries. Development of this functionality was supported by Krugle.net. Thank you!
Fix for incorrect behavior (collecting action URLs from forms). This is now optional, and turned off by default. Update JUnit test to cover this option.
Refactor HTTP plugins so that both support gzip encoding. Add appropriate headers in protocol-httpclient so that it prefers this encoding. Add an option to use HTTP 1.1 (at the moment only protocol-httpclient supports it).
Fix NUTCH-268. Default settings are still different to avoid DOS-ing remote DNS servers during fetchlist generation.
Add a suffix-based URLFilter. Correct also extension IDs for other urlfilter plugins, so that they can be active at the same time.
Scoring API (NUTCH-240). Development of this functionality was supported by Krugle.net. Thank you!
NUTCH-134: Added a summarizer extension point and two extensions:
  * summary-basic is the current Nutch implementation moved into a plugin
  * summary-lucene is a raw version of a summarizer plugin based on the Lucene highlighter
NUTCH-244, db.max.outlinks.per.page can now be negative for no limit of handled outlinks per page
Add lib-regex-filter and urlfilter-automaton to the list of javadoc packages. Add lib-regex-filter and urlfilter-automaton to the list of deployed, tested and cleaned plugins. Add the regular expression rule file property for urlfilter-automaton.
Document new option "db.score.count.filtered".
Add boost configuration param for RawFieldQueryFilters
Fix content.limit inconsistency in http, ftp and file
Restore accidentally removed file defaults.
NUTCH-193: MapReduce and NDFS code moved to new project, Hadoop. See bug report for details.
Document a few more properties. Contributed by Dominik Friedrich.
Switch default to protocol-http, since it seems more reliable than protocol-httpclient.
Fix NUTCH-131: add mapred.child.heap.size. From Marko Bauhardt.
Fix NegativeArraySizeException.
Add index sorter & ability to stop searching after N hits.
A framework for using different page signature implementations. Ordinary MD5 hash of a raw page content is very often unsuitable, when many near-duplicate pages are crawled. Now users can select their own page signature implementation, possibly with better properties than the old one. Two implementations are provided:
  * MD5Signature: backward-compatible with the old schema.
  * TextProfileSignature: an example implementation of a signature, which gives the same values for near-duplicate pages. Please see Javadoc for more information.
  This commit changes the CrawlDatum to store page signatures in CrawlDb. Last modified time field was added, too. Both changes are in preparation for patches implementing self-adjustable fetch interval.
  NutchConf was extended to store and retrieve also plain Object values. This is useful when caching per-job instances.
  StringUtil: added methods to display / parse byte[] values.
  Added SegmentReader (based on a contribution in NUTCH-121 by Rod Taylor).
  Fixed Fetcher to actually use the command-line parameters.
NUTCH-146, removed duplicate configuration property
NUTCH-3, Rollback nutch-default.xml (commit error)
NUTCH-3, ContentProperties can handle multivalued properties (S. Groschupf)
Merge mapred branch to trunk & remove it.
NUTCH-88, Second step implementation:
  * Add a configuration property for the parse-plugins.xml file location
  * ParserFactory now returns an ordered list of Parsers
  * Improve logging
  * Improve Parser selection policy
  * Unit Tests added
plugin.auto-activation set to true by default
Automatically load the dependencies of active plugins (adds a property, default is on)
Includes protocol-httpclient plugin (instead of protocol-http) and parse-js erroneously removed during commit of revision 233492
NUTCH-10, extension points defined only once (Stefan Groschupf)
0.8-dev version started.
Preparing 0.7 release
Fix version number format and add -dev suffix.
User agent string related properties updated.
Applied patches in NUTCH-56, with minor changes. Submitted by Andy Liu.
Improvements and fixes in NUTCH-60. Submitted by Jerome Charron.
Moving Nutch from the Incubator to Lucene.
Missing closing tag - added. Reported by Piotr Kosiorowski.
This patchset contains improvements to Fetcher, described in NUTCH-54, specifically the following:
  * protocol- and content-based redirection handling in Fetcher.
  * parse-js: heuristic link extractor for JavaScript
  * protocol-httpclient: HTTP and HTTPS protocol handler, based on Jakarta Commons HttpClient library.
  * alternative HTML parser based on TagSoup.
  * improved status reporting for protocol and parse plugins. Status information is persisted in segment data, so that other plugins can use it.
  * and other assorted fixes...
  This work has been sponsored by EvaluMetrix LLC (http://www.evalumetrix.com). Thank you!
Add ability to set Lucene's term index interval from config.
Make query boosts configurable. Patch by Piotr Kosiorowski.
Deprecate link analysis. Remove it from the tutorial and change the default configuration so that link counts are used instead.
Close Issue #33 - MIME content type detector (using magic char sequences).
Changed ipc timeout to be configurable (NUTCH-15)
Fixed http://issues.apache.org/jira/browse/NUTCH-8. Added a working parameter to control the number of fetcher retries and removed a broken one.
Initial import of Nutch to Apache.