| 1 |
Nutch Change Log
|
| 2 |
|
| 3 |
Release 0.8 - 2006-07-25
|
| 4 |
|
| 5 |
0. Totally new architecture, based on hadoop
|
| 6 |
[http://lucene.apache.org/hadoop] (cutting)
|
| 7 |
|
| 8 |
1. NUTCH-107 - Typo in plugin/urlfilter-*/plugin.xml. (Stephen Cross).
|
| 9 |
|
| 10 |
2. NUTCH-108 - Log hosts that exceed generate.max.per.host.
|
| 11 |
(Rod Taylor via cutting)
|
| 12 |
|
| 13 |
3. NUTCH-88 - Enhance ParserFactory plugin selection policy
|
| 14 |
(jerome)
|
| 15 |
|
| 16 |
4. NUTCH-124 - Protocol-httpclient does not follow redirects when
|
| 17 |
fetching robots.txt (cutting)
|
| 18 |
|
| 19 |
5. NUTCH-130 - Be explicit about target JVM when building (1.4.x?)
|
| 20 |
(stack@archive.org, cutting)
|
| 21 |
|
| 22 |
6. NUTCH-114 - Getting number of urls and links from crawldb
|
| 23 |
(Stefan Groschupf via ab)
|
| 24 |
|
| 25 |
7. NUTCH-112 - Link in cached.jsp page to cached content is an
|
| 26 |
absolute link (Chris A. Mattmann via jerome)
|
| 27 |
|
| 28 |
8. NUTCH-135 - Http header meta data are case insensitive in the
|
| 29 |
real world (Stefan Groschupf via jerome)
|
| 30 |
|
| 31 |
9. NUTCH-145 - Build of war file fails on Chinese (zh) .xml files due
|
| 32 |
to UTF-8 BOM (KuroSaka TeruHiko via siren)
|
| 33 |
|
| 34 |
10. NUTCH-121 - SegmentReader for mapred (Rod Taylor via ab)
|
| 35 |
|
| 36 |
11. Added support for OpenSearch (cutting)
|
| 37 |
|
| 38 |
12. NUTCH-142 - NutchConf should use the thread context classloader
|
| 39 |
(Mike Cannon-Brookes via pkosiorowski)
|
| 40 |
|
| 41 |
13. NUTCH-160 - Use standard Java Regex library rather than
|
| 42 |
org.apache.oro.text.regex (Rod Taylor via cutting)
|
| 43 |
|
| 44 |
14. NUTCH-151 - CommandRunner can hang after the main thread exec is
|
| 45 |
finished and has inefficient busy loop (Paul Baclace via cutting)
|
| 46 |
|
| 47 |
15. NUTCH-174 - Problem encountered with ant during compilation
|
| 48 |
|
| 49 |
16. NUTCH-190 - ParseUtil drops reason for failed parse
|
| 50 |
(stack@archive.org via ab)
|
| 51 |
|
| 52 |
17. NUTCH-169 - Remove static NutchConf (Marko Bauhardt via ab)
|
| 53 |
|
| 54 |
18. NUTCH-194 - NUTCH-169 introduced two tiny bugs (Marko Bauhardt via ab)
|
| 55 |
|
| 56 |
19. NUTCH-178 - Session creation must be "false" in search.jsp
|
| 57 |
(YourSoft via siren)
|
| 58 |
|
| 59 |
20. NUTCH-200 - OpenSearch Servlet is broken
|
| 60 |
(Marko Bauhardt via siren)
|
| 61 |
|
| 62 |
21. NUTCH-81 - Webapp only works when deployed in root
|
| 63 |
(AJ Banck, Michael Nebel via siren)
|
| 64 |
|
| 65 |
22. NUTCH-139 - Standard metadata property names in the ParseData
|
| 66 |
metadata (Chris A. Mattmann, jerome)
|
| 67 |
|
| 68 |
23. NUTCH-192 - Meta data support for CrawlDatum
|
| 69 |
(Stefan Groschupf via ab)
|
| 70 |
|
| 71 |
24. NUTCH-52 - Parser plugin for MS Excel files
|
| 72 |
(Rohit Kulkarni via jerome)
|
| 73 |
|
| 74 |
25. NUTCH-53 - Parser plugin for Zip files
|
| 75 |
(Rohit Kulkarni via jerome)
|
| 76 |
|
| 77 |
26. NUTCH-137 - footer is not displayed in search result page
|
| 78 |
(KuroSaka TeruHiko via siren)
|
| 79 |
|
| 80 |
27. NUTCH-118 - FAQ link points to invalid URL
|
| 81 |
(Steve Betts via siren)
|
| 82 |
|
| 83 |
28. NUTCH-184 - Serbian (sr, Cyrillic) and Serbo-Croatian (sh, Latin)
|
| 84 |
translation (Ivan Sekulovic via siren)
|
| 85 |
|
| 86 |
29. NUTCH-211 - FetchedSegments leave readers open (Stefan Groschupf
|
| 87 |
via cutting)
|
| 88 |
|
| 89 |
30. NUTCH-140 - Add alias capability in parse-plugins.xml file that
|
| 90 |
allows mimeType->extensionId mapping (Chris A. Mattmann via jerome)
|
| 91 |
|
| 92 |
31. NUTCH-214 - Added Links to web site to search mailing list
|
| 93 |
(Jake Vanderdray via jerome)
|
| 94 |
|
| 95 |
32. NUTCH-204 - Multiple field values in HitDetails
|
| 96 |
(Stefan Groschupf via jerome)
|
| 97 |
|
| 98 |
33. NUTCH-219 - file.content.limit & ftp.content.limit should be changed
|
| 99 |
to -1 to be consistent with http (jerome)
|
| 100 |
|
| 101 |
34. NUTCH-221 - Prepare nutch for upcoming lucene 2.0 (siren)
|
| 102 |
|
| 103 |
35. NUTCH-91 - Empty encoding causes exception (Michael Nebel via
|
| 104 |
pkosiorowski)
|
| 105 |
|
| 106 |
36. NUTCH-228 - Clustering plugin descriptor broken (Dawid Weiss via
|
| 107 |
jerome)
|
| 108 |
|
| 109 |
37. NUTCH-229 - Improved handling of plugin folder configuration
|
| 110 |
(Stefan Groschupf via ab)
|
| 111 |
|
| 112 |
38. NUTCH-206 - Search server throws InstantiationException (ab)
|
| 113 |
|
| 114 |
39. NUTCH-203 - ParseSegment throws InstantiationException (Marko Bauhardt
|
| 115 |
via ab)
|
| 116 |
|
| 117 |
40. NUTCH-3 - Multi values of header discarded (Stefan Groschupf via ab)
|
| 118 |
|
| 119 |
41. Update to lucene 1.9.1 (cutting)
|
| 120 |
|
| 121 |
42. NUTCH-235 - Duplicate Inlink values (ab)
|
| 122 |
|
| 123 |
43. NUTCH-234 - Clustering extension code cleanups and a real
|
| 124 |
JUnit test case for the current implementation (Dawid Weiss via ab)
|
| 125 |
|
| 126 |
44. NUTCH-210 - Context.xml file for Nutch web application
|
| 127 |
(Chris A. Mattmann via jerome)
|
| 128 |
|
| 129 |
45. NUTCH-231 - Invalid CSS entries (AJ Banck via jerome)
|
| 130 |
|
| 131 |
46. NUTCH-232 - Search.jsp has multiple search forms creating
|
| 132 |
invalid html / incorrect focus function (jerome)
|
| 133 |
|
| 134 |
47. NUTCH-196 - lib-xml and lib-log4j plugins (ab, jerome)
|
| 135 |
|
| 136 |
48. NUTCH-244 - Inconsistent handling of property values
|
| 137 |
boundaries / unable to set db.max.outlinks.per.page to
|
| 138 |
infinite (jerome)
|
| 139 |
|
| 140 |
49. NUTCH-245 - DTD for plugin.xml configuration files
|
| 141 |
(Chris A. Mattmann via jerome)
|
| 142 |
|
| 143 |
50. NUTCH-250 - Generate to log truncation caused by
|
| 144 |
generate.max.per.host (Rod Taylor via cutting)
|
| 145 |
|
| 146 |
51. NUTCH-125 - OpenOffice Parser plugin (ab)
|
| 147 |
|
| 148 |
52. Switch from using java.io.File to org.apache.hadoop.fs.Path.
|
| 149 |
(cutting)
|
| 150 |
|
| 151 |
53. NUTCH-240 - Scoring API: extension point, scoring filters and
|
| 152 |
an OPIC plugin (ab)
|
| 153 |
|
| 154 |
54. NUTCH-134 - Summarizer doesn't select the best snippets (jerome)
|
| 155 |
|
| 156 |
55. NUTCH-268 - Generator and lib-http use different definitions of
|
| 157 |
"unique host" (ab)
|
| 158 |
|
| 159 |
56. NUTCH-280 - Url query causes NullPointerException (Grant Glouser
|
| 160 |
via siren)
|
| 161 |
|
| 162 |
57. NUTCH-285 - LinkDb Fails rename doesn't create parent directories
|
| 163 |
(Dennis Kubes via ab)
|
| 164 |
|
| 165 |
58. NUTCH-201 - Add support for subcollections
|
| 166 |
(siren)
|
| 167 |
|
| 168 |
59. NUTCH-298 - If a 404 for a robots.txt is returned a NPE is thrown
|
| 169 |
(Stefan Groschupf via jerome)
|
| 170 |
|
| 171 |
60. NUTCH-275 - Fetcher not parsing XHTML-pages at all (jerome)
|
| 172 |
|
| 173 |
61. NUTCH-301 - CommonGrams loads analysis.common.terms.file for each query
|
| 174 |
(Stefan Groschupf via jerome)
|
| 175 |
|
| 176 |
62. NUTCH-110 - OpenSearchServlet outputs illegal xml characters
|
| 177 |
(stack@archive.org via siren)
|
| 178 |
|
| 179 |
63. NUTCH-292 - OpenSearchServlet: OutOfMemoryError: Java heap space
|
| 180 |
(Stefan Neufeind via siren)
|
| 181 |
|
| 182 |
64. NUTCH-307 - Wrong configured log4j.properties (jerome)
|
| 183 |
|
| 184 |
65. NUTCH-303 - Logging improvements (jerome)
|
| 185 |
|
| 186 |
66. NUTCH-308 - Maximum search time limit (ab)
|
| 187 |
|
| 188 |
67. NUTCH-306 - DistributedSearch.Client liveAddresses concurrency
|
| 189 |
problem (Grant Glouser via siren)
|
| 190 |
|
| 191 |
68. Update to hadoop-0.4 (Milind Bhandarkar, cutting)
|
| 192 |
|
| 193 |
69. NUTCH-317 - Clarify what the queryLanguage argument of
|
| 194 |
Query.parse(...) means (jerome)
|
| 195 |
|
| 196 |
70. Added alternative experimental web gui in contrib containing
|
| 197 |
extensions like subcollection, keymatch, user preferences,
|
| 198 |
caching, implemented mainly using tiles and jstl (siren)
|
| 199 |
|
| 200 |
71. NUTCH-320 - DmozParser does not output list of urls to stdout
|
| 201 |
but to a log file instead. Original functionality restored.
|
| 202 |
|
| 203 |
72. NUTCH-271 - Add ability to limit crawling to the set of initially
|
| 204 |
injected hosts (db.ignore.external.links) (Philippe Eugene,
|
| 205 |
Stefan Neufeind via ab)
|
| 206 |
|
| 207 |
73. NUTCH-293 - Support for Crawl-Delay (Stefan Groschupf via ab)
|
| 208 |
|
| 209 |
74. NUTCH-327 - Fixed logging directory on cygwin (siren)
|
| 210 |
|
| 211 |
Release 0.7 - 2005-08-17
|
| 212 |
|
| 213 |
1. Added support for "type:" in queries. Search results are limited/qualified
|
| 214 |
by mimetype or its primary type or sub type. For example,
|
| 215 |
(1) searching with "type:application/pdf" restricts results
|
| 216 |
to pages which were identified to be of mimetype "application/pdf".
|
| 217 |
(2) with "type:application", nutch will return pages of
|
| 218 |
primary type "application".
|
| 219 |
(3) with "type:pdf", only pages of sub type "pdf" will be listed.
|
| 220 |
(John Xing, 20050120)
|
| 221 |
|
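The three matching modes of "type:" described above can be sketched as follows. This is a Python illustration of the rule as worded in this entry, not Nutch's actual code; the function name type_matches is hypothetical.

```python
def type_matches(term: str, mimetype: str) -> bool:
    """Return True if a "type:" query term matches a page's mimetype.

    Hypothetical helper: the term may be a full mimetype, a primary
    type, or a sub type, as described in the change log entry.
    """
    term = term.lower()
    mimetype = mimetype.lower()
    primary, _, sub = mimetype.partition("/")
    if "/" in term:
        # Full mimetype, e.g. "application/pdf".
        return term == mimetype
    # Primary type (e.g. "application") or sub type (e.g. "pdf").
    return term in (primary, sub)
```

For example, type_matches("pdf", "application/pdf") is true, while type_matches("pdf", "text/html") is false.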
| 222 |
2. Added support for "date:" in queries. Last-Modified is indexed.
|
| 223 |
Search results are restricted by lower and upper date (inclusive)
|
| 224 |
as date:yyyymmdd-yyyymmdd. For example, date:20040101-20041231
|
| 225 |
only returns pages with Last-Modified in year 2004.
|
| 226 |
(John Xing, 20050122)
|
| 227 |
|
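The inclusive date range in "date:yyyymmdd-yyyymmdd" can be illustrated with a short Python sketch. It only shows how the term format and the inclusive bounds fit together; the function names are hypothetical and this is not how Nutch evaluates the query internally.

```python
from datetime import date, datetime
from typing import Tuple


def parse_date_range(term: str) -> Tuple[date, date]:
    """Parse "yyyymmdd-yyyymmdd" into (lower, upper) dates."""
    lower_text, upper_text = term.split("-")
    lower = datetime.strptime(lower_text, "%Y%m%d").date()
    upper = datetime.strptime(upper_text, "%Y%m%d").date()
    return lower, upper


def in_date_range(term: str, last_modified: date) -> bool:
    """Check Last-Modified against the range; both bounds are inclusive."""
    lower, upper = parse_date_range(term)
    return lower <= last_modified <= upper
```

With the term "20040101-20041231", a page last modified on 2004-12-31 is inside the range and one modified on 2005-01-01 is not.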
| 228 |
3. Add URLFilter plugin interface and convert existing url filters into
|
| 229 |
plugins. (John Xing, 20050206)
|
| 230 |
|
| 231 |
4. Add UpdateSegmentsFromDb tool, which updates the scores and
|
| 232 |
anchors of existing segments with the current values in the web
|
| 233 |
db. This is used by CrawlTool, so that pages are now only fetched
|
| 234 |
once per crawl. (Doug Cutting, 20050221)
|
| 235 |
|
| 236 |
5. Moved code into org.apache.nutch sub-packages. Changed license to
|
| 237 |
Apache 2.0. Removed jar files whose licenses do not permit
|
| 238 |
redistribution by Apache. Disabled compilation of plugins which
|
| 239 |
require these libraries. (Doug Cutting 20050301)
|
| 240 |
|
| 241 |
6. Index host and title in separate fields. Host was indexed
|
| 242 |
previously only as a part of the URL. Title was indexed as an
|
| 243 |
anchor. Now boosts for matching these fields may be adjusted
|
| 244 |
separately from boosts for matching anchors and url. Also: move
|
| 245 |
site indexing to index-basic plugin to minimize the number of
|
| 246 |
times the URL needs to be parsed; and, stop using anchor analyzer
|
| 247 |
for anything but anchors. (Piotr Kosiorowski via Doug Cutting
|
| 248 |
20050323)
|
| 249 |
|
| 250 |
7. Add servlet Cached.java that serves cached Content of any mime type.
|
| 251 |
web.xml and cached.jsp are slightly modified.
|
| 252 |
(John Xing, 20050401)
|
| 253 |
|
| 254 |
8. Add skipCompressedByteArray() to WritableUtils.java.
|
| 255 |
(John Xing, 20050402)
|
| 256 |
|
| 257 |
9. Fixes to jsp and static web pages. These now use relative links,
|
| 258 |
so that the Nutch webapp file can be used in places other than at
|
| 259 |
the root. Also fixed links to the about and help pages. Bug #32.
|
| 260 |
(Jerome Charron via cutting, 20050404)
|
| 261 |
|
| 262 |
10. Added some features to DistributedSearch: new segments can be added
|
| 263 |
to searchservers without restarting the frontend, defective search
|
| 264 |
servers are not queried until they come back online, and a watchdog keeps
|
| 265 |
an eye on your searchservers and writes simple statistics.
|
| 266 |
(Sami Siren, 20050407)
|
| 267 |
|
| 268 |
11. Fix for bug #4 - Unbalanced quote in query eats all resources.
|
| 269 |
(Piotr Kosiorowski, Sami Siren, 20050407)
|
| 270 |
|
| 271 |
12. Close Issue #33 - MIME content type detector (using magic char sequences).
|
| 272 |
(Jerome Charron and Hari Kodungallur via John Xing, 20050416)
|
| 273 |
|
| 274 |
13. Add a servlet that implements A9's OpenSearch RSS web service.
|
| 275 |
(cutting, 20050418)
|
| 276 |
|
| 277 |
14. Remove references to link analysis from tutorial, and enable
|
| 278 |
scoring by link count when generating fetchlists and searching.
|
| 279 |
(cutting, 20040419)
|
| 280 |
|
| 281 |
15. Make query boosts for host, title, anchor and phrase matches
|
| 282 |
configurable. (Piotr Kosiorowski via cutting, 20050419)
|
| 283 |
|
| 284 |
16. Add support for sorting search results and search-time deduping by
|
| 285 |
fields other than site.
|
| 286 |
|
| 287 |
17. Automatically convert range queries into cached range filters.
|
| 288 |
This improves the performance and scalability of, e.g., date range
|
| 289 |
searching.
|
| 290 |
|
| 291 |
18. Several methods have been renamed due to misspellings. The old
|
| 292 |
methods have been deprecated and will be removed before the 1.0
|
| 293 |
release.
|
| 294 |
|
| 295 |
|
| 296 |
Release 0.6
|
| 297 |
|
| 298 |
1. Added clustering-carrot2 plugin, together with introduction of clustering
|
| 299 |
api and modification to search jsp. (Dawid Weiss via John Xing, 20040809)
|
| 300 |
|
| 301 |
2. Make a number of changes to NDFS (Nutch Distributed File System)
|
| 302 |
to fix bugs, add admin tools, etc.
|
| 303 |
|
| 304 |
Also, modify all command line tools so you can indicate whether to
|
| 305 |
use NDFS or the local filesystem. If you indicate nothing, then
|
| 306 |
it defaults to the local fs.
|
| 307 |
|
| 308 |
I've used this to do a 35m page crawl via NDFS, distributed over a
|
| 309 |
dozen machines. (Mike Cafarella)
|
| 310 |
|
| 311 |
3. Add support for BASE tags in HTML. Outlinks are now correctly
|
| 312 |
extracted when a BASE tag is present. (cutting)
|
| 313 |
|
| 314 |
4. Fix two bugs in result pagination. When the last hit on a page
|
| 315 |
was the last hit overall, the "next" button was sometimes shown
|
| 316 |
when the "show all" button should be shown instead. Also, in
|
| 317 |
certain cases, the "show all" button would be shown when the
|
| 318 |
"next" button should have been shown. (cutting)
|
| 319 |
|
| 320 |
5. Add config parameter "indexer.max.tokens" that determines the
|
| 321 |
maximum number of tokens indexed per field. (Andy Hedges via cutting)
|
| 322 |
|
| 323 |
6. Add parser for mp3 files. (Andy Hedges via cutting)
|
| 324 |
|
| 325 |
7. Add RegexUrlNormalizer. This is useful for things like stripping
|
| 326 |
out session IDs from URLs. To use it, add values for
|
| 327 |
urlnormalizer.class and urlnormalizer.regex.file to your
|
| 328 |
nutch-site.xml. The RegexUrlNormalizer class extends the
|
| 329 |
BasicUrlNormalizer, and does basic normalization as well.
|
| 330 |
(Luke Baker via cutting)
|
| 331 |
|
| 332 |
8. Added Swedish translation (Stefan Verzel via Sami Siren, 20040910)
|
| 333 |
|
| 334 |
9. Added Polish translation (Andrzej Bialecki, 20040911)
|
| 335 |
|
| 336 |
10. Added 3 more language profiles to language identifier (ru,hu,pl).
|
| 337 |
Other changes to language identifier: Profiles converted to utf8,
|
| 338 |
added some test cases, changed the similarity calculation.
|
| 339 |
(Sami Siren, 20040925)
|
| 340 |
|
| 341 |
11. Added plugin parse-rtf (Andy Hedges via John Xing, 20040929)
|
| 342 |
|
| 343 |
12. Added plugin index-more and more.jsp (John Xing, 20041003)
|
| 344 |
|
| 345 |
13. Added "View as Plain Text" feature. A new op OP_PARSETEXT is introduced
|
| 346 |
in DistributedSearch.java. text.jsp is added. (John Xing, 20041006)
|
| 347 |
|
| 348 |
14. Fixed a bug that made cached.jsp, explain.jsp, anchors.jsp and text.jsp
|
| 349 |
(but not search.jsp) fail with a NullPointerException in distributed search.
|
| 350 |
This bug seems to have appeared after the "hits per site" feature was added.
|
| 351 |
The fix is done in Hit.java, making sure String site is never null.
|
| 352 |
Hopefully this fix has no bad effect on the "hits per site" code.
|
| 353 |
(John Xing, 20041006)
|
| 354 |
|
| 355 |
15. Fixed a bug that made fullyDelete() in FileUtil.java fail for
|
| 356 |
LocalFileSystem.java. This bug also exposes possible incompleteness
|
| 357 |
of NDFSFile.java, where a few methods are not supported, including
|
| 358 |
delete(). Nothing changed in NDFSFile.java though. Leave it for future
|
| 359 |
improvement (John Xing, 20041022).
|
| 360 |
|
| 361 |
16. Introduced option -noParsing to Fetcher.java and added ParseSegment.java.
|
| 362 |
A new status code CANT_PARSE is added to FetcherOutput.java.
|
| 363 |
Without option -noParsing, fetcher behavior is unchanged. With
|
| 364 |
option -noParsing, the fetcher only crawls; no parsing is carried out.
|
| 365 |
ParseSegment.java should then be used to parse in a separate pass.
|
| 366 |
(John Xing, 20041025)
|
| 367 |
|
| 368 |
17. Added ontology plugin. Currently it is used for query refinement, as
|
| 369 |
exemplified in refine-query-init.jsp and refine-query.jsp. By default,
|
| 370 |
query refinement is disabled in search.jsp. Please check
|
| 371 |
./src/plugin/ontology/README.txt for further description.
|
| 372 |
The ontology plugin can certainly be used for many other things.
|
| 373 |
(Michael J. Pan via John Xing, 20041129)
|
| 374 |
|
| 375 |
18. Changed fetcher.server.delay to be a float, so that sub-second
|
| 376 |
delays can be specified. (cutting)
|
| 377 |
|
| 378 |
19. Added plugin.includes config parameter that determines which
|
| 379 |
plugins are included. By default now only http, html and basic
|
| 380 |
indexing and search plugins are enabled, rather than all plugins.
|
| 381 |
This should make default performance more predictable and reliable
|
| 382 |
going forward. (cutting)
|
| 383 |
|
| 384 |
20. Cleaned up some filesystem code, including:
|
| 385 |
|
| 386 |
- Replaced BufferedRandomAccessFile with two simpler utilities,
|
| 387 |
NFSDataInputStream and NFSDataOutputStream.
|
| 388 |
|
| 389 |
- Fixed the bug where SequenceFiles were no longer flushed when
|
| 390 |
created, so that, when fetches crashed, segments were
|
| 391 |
unreadable. Now segments are always readable after crashes.
|
| 392 |
Only the contents of the last buffer are lost.
|
| 393 |
|
| 394 |
- Simplified the FSOutputStream API to not include seek(). We
|
| 395 |
should never need that functionality.
|
| 396 |
|
| 397 |
- Simplified LocalFileSystem's implementations of FSInputStream
|
| 398 |
and FSOutputStream and optimized FSInputStream.seek().
|
| 399 |
|
| 400 |
(cutting)
|
| 401 |
|
| 402 |
21. Fixed BasicUrlNormalizer to better handle relative urls. The file
|
| 403 |
part of a URL is normalized in the following manner:
|
| 404 |
|
| 405 |
1. "/aa/../" will be replaced by "/". This is done step by step until
|
| 406 |
the url doesn't change anymore. This ensures that
|
| 407 |
"/aa/bb/../../" will be replaced by "/", too.
|
| 408 |
|
| 409 |
2. leading "/../" will be replaced by "/"
|
| 410 |
|
| 411 |
(Sven Wende via cutting)
|
| 412 |
|
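The two rules above can be sketched in Python as follows. This is an illustration of the described behavior under stated assumptions, not the BasicUrlNormalizer source: the function name normalize_path and the regular expression are hypothetical, and only the file (path) part of the URL is handled.

```python
import re

# One "/segment/../" pair, where the segment itself is not "..".
_PARENT_PAIR = re.compile(r"/(?!\.\./)[^/]+/\.\./")


def normalize_path(path: str) -> str:
    """Collapse parent-directory references in the file part of a URL."""
    # Rule 1: replace one "/aa/../" with "/" at a time until nothing changes.
    while True:
        path, replaced = _PARENT_PAIR.subn("/", path, count=1)
        if replaced == 0:
            break
    # Rule 2: drop any leading "/../" so the path starts at the root.
    while path.startswith("/../"):
        path = path[3:]
    return path
```

Replacing a single pair per pass mirrors the "step by step" wording: "/aa/bb/../../" first becomes "/aa/../" and then "/".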
| 413 |
22. Fix Page constructors so that next fetch date is less likely to be
|
| 414 |
misconstrued as a float. This patches a problem in WebDBInjector,
|
| 415 |
where new pages were added to the db with nextScore set to the
|
| 416 |
intended nextFetch date. This, in turn, confused link analysis.
|
| 417 |
|
| 418 |
23. In ndfs code, replace addLocalFile(), putToLocalFile() with
|
| 419 |
copyFromLocalFile(), moveFromLocalFile(), copyToLocalFile() and
|
| 420 |
moveToLocalFile(). (John Xing, 20041217)
|
| 421 |
|
| 422 |
24. Added new config parameter fetcher.threads.per.host. This is used
|
| 423 |
by the Http protocol. When this is one, behavior is as before.
|
| 424 |
When this is greater than one then multiple threads are permitted
|
| 425 |
to access a host at once. Note that fetcher.server.delay is no
|
| 426 |
longer consistently observed when this is greater than one.
|
| 427 |
(Luke Baker via Doug Cutting)
|
| 428 |
|
| 429 |
Release 0.5
|
| 430 |
|
| 431 |
1. Changed plugin directory to be a list of directories.
|
| 432 |
|
| 433 |
2. Permit Plugin to be the default plugin implementation.
|
| 434 |
|
| 435 |
3. Added pluggable interface for network protocols in new package
|
| 436 |
net.nutch.protocol. Moved http code from core into a plugin.
|
| 437 |
|
| 438 |
4. Added pluggable interface for content parsing in new package
|
| 439 |
net.nutch.parse. Moved html parsing code from core into a
|
| 440 |
plugin.
|
| 441 |
|
| 442 |
5. Fixed a bug in NutchAnalysis where 16-bit characters were not
|
| 443 |
processed correctly.
|
| 444 |
|
| 445 |
6. Fixed bug #971731: random summaries on result page.
|
| 446 |
(Daniel Naber via cutting)
|
| 447 |
|
| 448 |
7. Made Nutch logo transparent. (Daniel Naber via cutting)
|
| 449 |
|
| 450 |
8. Added file protocol plugin. (John Xing via cutting)
|
| 451 |
|
| 452 |
9. Added ftp protocol plugin. (John Xing via cutting)
|
| 453 |
|
| 454 |
10. Added pdf and msword parser plugins. (John Xing via cutting)
|
| 455 |
|
| 456 |
11. Added pluggable indexing interface. By default, url, content,
|
| 457 |
anchors and title are indexed, as before, but now one can easily
|
| 458 |
alter this to, e.g., index metadata. A demonstration is provided
|
| 459 |
which extracts and indexes Creative Commons license urls. (cutting)
|
| 460 |
|
| 461 |
12. Add language identification plugin.
|
| 462 |
|
| 463 |
The process of identification is as follows:
|
| 464 |
|
| 465 |
1. html (html only, HTML 4.0 "lang" attribute)
|
| 466 |
2. meta tags (html only, http-equiv, dc.language)
|
| 467 |
3. http header (Content-Language)
|
| 468 |
4. if all of the above fail, "statistical analysis"
|
| 469 |
|
| 470 |
1 & 2 are run during the fetching phase and 3 & 4 are run during the
|
| 471 |
indexing phase.
|
| 472 |
|
| 473 |
Currently supported languages (in "statistical analysis") are
|
| 474 |
da,de,el,en,es,fi,fr,it,nl,sv and pt. The corpus used was grabbed
|
| 475 |
from http://www.isi.edu/~koehn/europarl/ and the profiles were
|
| 476 |
built with the tool supplied in the patch.
|
| 477 |
|
| 478 |
After indexing, the language can be found in the field named "lang".
|
| 479 |
|
| 480 |
It's not 100% accurate but it's a start.
|
| 481 |
(Sami Siren)
|
| 482 |
|
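The fallback order listed in this entry can be sketched as a simple chain. The Python below is only an illustration: the function and parameter names are hypothetical, and it collapses the fetch-time and index-time steps into one call for brevity, which the plugin itself does not do.

```python
from typing import Callable, Optional


def identify_language(
    html_lang: Optional[str],
    meta_lang: Optional[str],
    content_language_header: Optional[str],
    statistical_guess: Callable[[], Optional[str]],
) -> Optional[str]:
    """Return the first language found, trying the sources in order."""
    # Steps 1-3: explicit declarations, in order of preference.
    for candidate in (html_lang, meta_lang, content_language_header):
        if candidate and candidate.strip():
            return candidate.strip().lower()
    # Step 4: only fall back to statistical analysis when all else fails.
    return statistical_guess()
```

Passing the statistical step as a callable means the comparatively expensive analysis runs only when no explicit declaration is available.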
| 483 |
13. Added SegmentMergeTool and "mergesegs" command, to remove
|
| 484 |
duplicated or otherwise unused content from several segments and
|
| 485 |
join them together into a single new segment. The tool also
|
| 486 |
optionally performs several other steps required for proper
|
| 487 |
operation of Nutch - such as indexing segments, deleting
|
| 488 |
duplicates, merging indices, and indexing the new single segment.
|
| 489 |
(Andrzej Bialecki)
|
| 490 |
|
| 491 |
14. Add the ability to retrieve ParseData of a search hit. ParseData
|
| 492 |
contains many valuable properties of a search hit.
|
| 493 |
|
| 494 |
This is required (among other things) to properly display the cached
|
| 495 |
content because it's not possible to determine the character
|
| 496 |
encoding from the output of the getContent() method (which returns
|
| 497 |
byte[]). The symptoms are that for HTML pages using non-latin1 or
|
| 498 |
non-UTF8 encodings the cached preview will almost certainly look
|
| 499 |
broken. Using the attached patch it is possible to determine the
|
| 500 |
character encoding from the ParseData (for HTTP: Content-Type
|
| 501 |
metadata), and encode the content accordingly. (Andrzej Bialecki)
|
| 502 |
|
| 503 |
15. Add a pluggable query interface. By default, the content, anchor
|
| 504 |
and url fields are searched as before. A sample plugin indexes
|
| 505 |
the host name and adds a "site:" keyword to query parsing.
|
| 506 |
|
| 507 |
16. Added support for "lang:" in queries. For example, searching with
|
| 508 |
"lang:en" restricts results to pages which were identified to
|
| 509 |
be in English.
|
| 510 |
|
| 511 |
17. Automatically optimize field queries to use cached Lucene filters.
|
| 512 |
This makes, for example, searches restricted by languages or sites
|
| 513 |
that are very common much faster.
|
| 514 |
|
| 515 |
18. Improved charset handling in jsp pages. (jshin by cutting)
|
| 516 |
|
| 517 |
19. Permit topic filtering when injecting DMOZ pages. (jshin by cutting)
|
| 518 |
|
| 519 |
20. When parsing crawled pages, interpret charset specifications in
|
| 520 |
html meta tags. (jshin by cutting)
|
| 521 |
|
| 522 |
21. Added support for "cc:licensed" in queries, which searches for documents
|
| 523 |
released under Creative Commons licenses. Attributes of the
|
| 524 |
license may also be queried, with, e.g., "cc:by" for
|
| 525 |
attribution-required licenses, "cc:nc" for non-commercial
|
| 526 |
licenses, etc.
|
| 527 |
|
| 528 |
22. Relative paths named in plugin.folders are now searched for on the
|
| 529 |
classpath. This makes, e.g., deployment in a war file much simpler.
|
| 530 |
|
| 531 |
23. Modifications to Fetcher.java.
|
| 532 |
|
| 533 |
1. Make sure it works properly with regard to creation and initialization
|
| 534 |
of plugin instances. The problem was that multiple threads race to
|
| 535 |
startUp() or shutDown() plugin instances. It was solved by synchronizing
|
| 536 |
certain code in PluginRepository.java and Extension.java.
|
| 537 |
(Stefan Groschupf via John Xing)
|
| 538 |
|
| 539 |
2. Added code to explicitly shutDown() plugins. Otherwise FetcherThreads
|
| 540 |
may never return (quit) if there are still data or other structures
|
| 541 |
(e.g., persistent socket connections) associated with plugins. (John Xing)
|
| 542 |
|
| 543 |
3. Fixed one type of Fetcher "hang" problem by monitoring named
|
| 544 |
FetcherThreads. If all FetcherThreads are gone (finished),
|
| 545 |
Fetcher.java is considered done. The problem was: there could be
|
| 546 |
runaway threads started by external libs via FetcherThreads.
|
| 547 |
Those threads never return, thus keeping Fetcher from exiting normally.
|
| 548 |
(John Xing)
|
| 549 |
|
| 550 |
24. Eliminate excessive hits from sites. This is done efficiently by
|
| 551 |
adding the site name to Hit instances, and, when needed,
|
| 552 |
re-querying with too-frequent sites prohibited in the query.
|
| 553 |
|
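One possible reading of the "re-query with too-frequent sites prohibited" idea is sketched below in Python. Everything here is an assumption made for illustration: the search callable, its (query, prohibited_sites, n) signature and the (url, site) hit tuples are hypothetical and do not reflect Nutch's API.

```python
from collections import Counter
from typing import Callable, FrozenSet, List, Set, Tuple

Hit = Tuple[str, str]  # (url, site); the site name travels with each hit
SearchFn = Callable[[str, FrozenSet[str], int], List[Hit]]


def search_with_site_limit(
    search: SearchFn, query: str, num_hits: int, max_per_site: int
) -> List[Hit]:
    """Collect up to num_hits hits, keeping at most max_per_site per site."""
    results: List[Hit] = []
    seen_urls: Set[str] = set()
    per_site: Counter = Counter()
    prohibited: Set[str] = set()
    while len(results) < num_hits:
        hits = search(query, frozenset(prohibited), num_hits)
        too_frequent: Set[str] = set()
        for url, site in hits:
            if url in seen_urls:
                continue
            if per_site[site] >= max_per_site:
                too_frequent.add(site)  # remember it for the re-query
                continue
            seen_urls.add(url)
            per_site[site] += 1
            results.append((url, site))
            if len(results) == num_hits:
                return results
        new_sites = too_frequent - prohibited
        if not new_sites:
            break  # re-querying cannot surface anything new
        prohibited |= new_sites
    return results
```

The re-query happens only when a site actually exceeded its limit, so the common case of well-distributed results costs a single search.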
| 554 |
|
| 555 |
Release 0.4
|
| 556 |
|
| 557 |
1. Http class refactored. (Kevin Smith via Tom Pierce)
|
| 558 |
|
| 559 |
2. Add Finnish translation. (Sampo Syreeni via Doug Cutting)
|
| 560 |
|
| 561 |
3. Added Japanese translation. (Yukio Andoh via Doug Cutting)
|
| 562 |
|
| 563 |
4. Updated Dutch translation. (Ype Kingma via Doug Cutting)
|
| 564 |
|
| 565 |
5. Initial version of Distributed DB code. (Mike Cafarella)
|
| 566 |
|
| 567 |
6. Make things more tolerant of crashed fetcher output files.
|
| 568 |
(Doug Cutting)
|
| 569 |
|
| 570 |
7. New skin for website. (Frank Henze via Doug Cutting)
|
| 571 |
|
| 572 |
8. Added Spanish translation. (Diego Basch via Doug Cutting)
|
| 573 |
|
| 574 |
9. Add FTP support to fetcher. (John Xing via Doug Cutting)
|
| 575 |
|
| 576 |
10. Added Thai translation. (Pichai Ongvasith via Doug Cutting)
|
| 577 |
|
| 578 |
11. Added Robots.txt & throttling support to Fetcher.java. (Mike
|
| 579 |
Cafarella)
|
| 580 |
|
| 581 |
12. Added nightly build. (Doug Cutting)
|
| 582 |
|
| 583 |
13. Default all link scores to 1.0. (Doug Cutting)
|
| 584 |
|
| 585 |
14. Permit one to keep internal links. (Doug Cutting)
|
| 586 |
|
| 587 |
15. Fixed dedup to select shortest URL. (Doug Cutting)
|
| 588 |
|
| 589 |
16. Changed index merger so that merged index is written to named
|
| 590 |
directory, rather than to a generated name in that directory.
|
| 591 |
(Doug Cutting)
|
| 592 |
|
| 593 |
17. Disable coordination weighting of query clauses and other minor
|
| 594 |
scoring improvements. (Doug Cutting)
|
| 595 |
|
| 596 |
18. Added a new command, crawl, that constructs a database, injects a
|
| 597 |
url file and performs a few rounds of generate/fetch/updatedb.
|
| 598 |
This simplifies use for intranet sites. Changed some defaults to
|
| 599 |
be more intranet friendly. (Doug Cutting)
|
| 600 |
|
| 601 |
19. Fixed a bug where Fetcher.java didn't construct correct relative
|
| 602 |
links when a page was redirected. (Doug Cutting)
|
| 603 |
|
| 604 |
20. Fixed a query parser problem with lookahead over plusses and minuses.
|
| 605 |
(Doug Cutting)
|
| 606 |
|
| 607 |
21. Add support for HTTP proxy servers. (Sami Siren via Doug Cutting)
|
| 608 |
|
| 609 |
22. Permit searching while fetching and/or indexing.
|
| 610 |
(Sami Siren via Doug Cutting)
|
| 611 |
|
| 612 |
23. Fix a bug when throttling is disabled. (Sami Siren via Doug Cutting)
|
| 613 |
|
| 614 |
24. Updated Bahasa Malaysia translation. (Michael Lim via Doug Cutting)
|
| 615 |
|
| 616 |
25. Added Catalan translation. (Xavier Guardiola via Doug Cutting)
|
| 617 |
|
| 618 |
26. Added Brazilian Portuguese translation.
|
| 619 |
(A. Moreir via Doug Cutting)
|
| 620 |
|
| 621 |
27. Added a French translation. (Julien Nioche via Doug Cutting)
|
| 622 |
|
| 623 |
28. Updated to Lucene 1.4RC3. (Doug Cutting)
|
| 624 |
|
| 625 |
29. Add capability to boost by link count & use it in crawl tool.
|
| 626 |
(Doug Cutting)
|
| 627 |
|
| 628 |
30. Added plugin system. (Stefan Groschupf via Doug Cutting)
|
| 629 |
|
| 630 |
31. Add this change log file, for recording significant changes to
|
| 631 |
Nutch. Populate it with changes from the last few months.
|