java.lang.Object
  org.apache.nutch.util.MimeUtil
public final class MimeUtil
This is a facade class that insulates Nutch from its underlying mime type library, Apache Tika. Any mime handling code should be placed in this utility class and hidden from the Nutch classes that rely on it.
Constructor Summary | |
---|---|
MimeUtil(org.apache.hadoop.conf.Configuration conf)
|
Method Summary | |
---|---|
String |
autoResolveContentType(String typeName,
String url,
byte[] data)
A facade interface that tries all the mime type resolution strategies available within Tika. |
static String |
cleanMimeType(String origType)
Cleans a MimeType name by extracting the actual MimeType from a string of the form <primary type>/<sub type> ; <optional params>. |
String |
forName(String name)
A facade interface to Tika's underlying MimeTypes.forName(String)
method. |
String |
getMimeType(File f)
Facade interface to Tika's underlying MimeTypes.getMimeType(File)
method. |
String |
getMimeType(String url)
Facade interface to Tika's underlying MimeTypes.getMimeType(String)
method. |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Constructor Detail |
---|
public MimeUtil(org.apache.hadoop.conf.Configuration conf)
Method Detail |
---|
public static String cleanMimeType(String origType)
Cleans a MimeType name by extracting the actual MimeType from a string of the form:
<primary type>/<sub type> ; <optional params>
Parameters:
origType - The original mime type string to be cleaned.
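As an illustration of the cleaning step, here is a minimal standalone sketch. It is not Nutch's actual implementation: it assumes that cleaning amounts to keeping everything before the first `;` and trimming surrounding whitespace. The class name `MimeCleanSketch` and its `clean` method are hypothetical.

```java
public class MimeCleanSketch {

    // Hypothetical helper: keeps the "<primary type>/<sub type>" part of a
    // content type string and drops any ";"-separated parameters.
    static String clean(String origType) {
        if (origType == null) {
            return null;
        }
        int semicolon = origType.indexOf(';');
        String base = (semicolon >= 0) ? origType.substring(0, semicolon) : origType;
        return base.trim();
    }

    public static void main(String[] args) {
        // A type with parameters and a type without any.
        System.out.println(clean("text/html; charset=UTF-8"));
        System.out.println(clean("application/pdf"));
    }
}
```

In this sketch a string without parameters passes through unchanged apart from trimming, and a null input yields null.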
public String autoResolveContentType(String typeName, String url, byte[] data)
A facade interface that tries all the mime type resolution strategies available within Tika. First, typeName is cleaned with cleanMimeType(String). The cleaned name is then looked up in the underlying Tika MimeTypes registry. If the MimeType is found, it is used; otherwise, URL resolution is used to try to determine the mime type. If that fails, and if mime.type.magic is enabled in NutchConfiguration, mime type magic resolution is used to try to obtain a better-than-the-default approximation of the MimeType.
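The resolution order described above can be sketched as a simple fallback chain. This is a standalone illustration under stated assumptions, not the Nutch or Tika code: the registry and the extension table are hypothetical in-memory maps, URL resolution is assumed to look only at the file extension, magic detection is reduced to a single `%PDF` check, and the default type `application/octet-stream` is an assumption. The `magicEnabled` flag stands in for the `mime.type.magic` setting.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

public class ResolveSketch {

    // Hypothetical stand-ins for the mime registry and URL-based lookup.
    static final Map<String, String> REGISTRY = new HashMap<>();
    static final Map<String, String> BY_EXTENSION = new HashMap<>();
    // Assumed default when every strategy fails.
    static final String DEFAULT_TYPE = "application/octet-stream";

    static {
        REGISTRY.put("text/html", "text/html");
        REGISTRY.put("application/pdf", "application/pdf");
        BY_EXTENSION.put("html", "text/html");
        BY_EXTENSION.put("pdf", "application/pdf");
    }

    // Drops ";"-separated parameters (assumed cleaning rule).
    static String clean(String typeName) {
        if (typeName == null) return null;
        int semicolon = typeName.indexOf(';');
        return (semicolon >= 0 ? typeName.substring(0, semicolon) : typeName).trim();
    }

    // Assumed URL strategy: map the text after the last '.' to a type.
    static String byUrl(String url) {
        if (url == null) return null;
        int dot = url.lastIndexOf('.');
        if (dot < 0 || dot == url.length() - 1) return null;
        return BY_EXTENSION.get(url.substring(dot + 1).toLowerCase());
    }

    // Assumed magic strategy: recognize only the "%PDF" signature.
    static String byMagic(byte[] data) {
        if (data != null && data.length >= 4
                && data[0] == '%' && data[1] == 'P' && data[2] == 'D' && data[3] == 'F') {
            return "application/pdf";
        }
        return null;
    }

    static String resolve(String typeName, String url, byte[] data, boolean magicEnabled) {
        // Step 1: clean the reported type and look it up in the registry.
        String cleaned = clean(typeName);
        String type = (cleaned == null) ? null : REGISTRY.get(cleaned);
        if (type != null) return type;
        // Step 2: fall back to URL resolution.
        type = byUrl(url);
        if (type != null) return type;
        // Step 3: fall back to magic detection, only if enabled.
        if (magicEnabled) {
            type = byMagic(data);
            if (type != null) return type;
        }
        return DEFAULT_TYPE;
    }

    public static void main(String[] args) {
        byte[] pdf = "%PDF-1.4".getBytes(StandardCharsets.US_ASCII);
        System.out.println(resolve("text/html; charset=UTF-8", "http://example.com/", null, false));
        System.out.println(resolve("x-unknown/thing", "http://example.com/report.pdf", null, false));
        System.out.println(resolve(null, "http://example.com/download", pdf, true));
    }
}
```

Each strategy is tried only when the previous one returns nothing, so the more expensive content inspection runs last and only when the flag allows it.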
Parameters:
typeName - The original mime type, returned from a ProtocolOutput.
url - The URL that Nutch was trying to crawl.
data - The byte data returned from the crawl, if any.
Returns:
The resolved MimeType name.
public String getMimeType(String url)
Facade interface to Tika's underlying MimeTypes.getMimeType(String) method.
Parameters:
url - A string representation of the document URL to sense the MimeType for.
Returns:
The MimeType identified from the given document URL in string form.
public String forName(String name)
A facade interface to Tika's underlying MimeTypes.forName(String) method.
Parameters:
name - The name of a valid MimeType in the Tika mime registry.
Returns:
The MimeType, if it exists, or null otherwise.
public String getMimeType(File f)
Facade interface to Tika's underlying MimeTypes.getMimeType(File) method.
Parameters:
f - The File to sense the MimeType for.
Returns:
The MimeType of the given File, or null if it cannot be determined.