public class FileRetrievalSystem extends Object
Crawls external directory structures and downloads the files within them. This class is configured via a Java .properties file, which is read and parsed by Config.java. The .properties file should set the following properties:

    # list of sites to crawl
    protocol.external.sources=<path-to-xml-file>

    # protocol types
    protocolfactory.types=<list-of-protocols-separated-by-commas>  (e.g. ftp,http,https,sftp)

    # Protocol factories per type (there must be one for each protocol mentioned in
    # protocolfactory.types -- the property must be named as such:
    # protocolfactory.<name-of-protocol-type>)
    protocolfactory.ftp=<path-to-java-protocolfactory-class>  (e.g. org.apache.oodt.cas.protocol.ftp.FtpClientFactory)
    protocolfactory.http=<path-to-java-protocolfactory-class>
    protocolfactory.https=<path-to-java-protocolfactory-class>
    protocolfactory.sftp=<path-to-java-protocolfactory-class>

    # configuration to make java.net.URL accept unsupported protocols -- must exist exactly as shown
    java.protocol.handler.pkgs=org.apache.oodt.cas.url.handlers

To specify which external sites to crawl, you must create an XML file that lists each site along with the information needed to crawl it, such as a username and password. protocol.external.sources must contain the path to this file so the crawler knows where to find it. You can also train this class on how to crawl each given site. This training is specified in a second XML file, whose path is given in the first XML file (the one containing the username and password). The schema for the external-sites XML file is as follows:

    <sources>
       <source url="url-of-server">
          <username>username</username>
          <password>password</password>
          <dirstruct>path-to-xml-file</dirstruct>
          <crawl>yes-or-no</crawl>
       </source>
       ...
    </sources>

You may specify as many sources as you would like by adding multiple <source> tags. In the <source> tag, the attribute 'url' must be specified. This is the URL of the server you want the crawler to connect to, in the format <protocol>://<host> (e.g. sftp://remote.computer.gov). The <username> and <password> elements are optional and may be omitted if the site requires no credentials. <crawl> takes yes or no; it lets you keep a record of a site and its information in this XML file even after you decide you no longer need to crawl it (put <crawl>no</crawl> and the crawler will skip that site). <dirstruct> contains the path to another XML file, which is documented in the DirStruct.java javadoc; this element is also optional. If no <dirstruct> is given, every directory on the site will be crawled and every encountered file will be downloaded.
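As a concrete illustration, a minimal external-sites file matching the schema above might look like the following (the hosts, credentials, and file paths are hypothetical placeholders, not values from the OODT distribution):

```xml
<sources>
   <!-- A site crawled with credentials and a DirStruct training file -->
   <source url="sftp://remote.computer.gov">
      <username>crawler</username>
      <password>secret</password>
      <dirstruct>/etc/pushpull/dirstruct.xml</dirstruct>
      <crawl>yes</crawl>
   </source>
   <!-- A site kept on record but currently skipped; no credentials,
        so every directory would be crawled if <crawl> were yes -->
   <source url="http://data.example.org">
      <crawl>no</crawl>
   </source>
</sources>
```

Because the second source omits <dirstruct>, enabling it would cause the crawler to visit every directory on that site and download every file it encounters.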
Constructors:

FileRetrievalSystem(Config config, SiteInfo siteInfo)
    Creates a Crawler based on the Config and SiteInfo objects passed in.
public FileRetrievalSystem(Config config, SiteInfo siteInfo) throws InstantiationException
    Parameters:
        config - the Configuration object that is passed to this object's ProtocolHandler (parsed from the .properties file described above)
        siteInfo - the SiteInfo describing the remote sites to crawl
    Throws:
        InstantiationException
        DatabaseException
public void registerDownloadListener(DownloadListener dListener)
public void initialize() throws IOException
    Throws:
        IOException
public void clearErrorFlag()
public boolean isAlreadyInDatabase(RemoteFile rf) throws CatalogException
    Throws:
        CatalogException
public List<RemoteSiteFile> getNextPage(RemoteSiteFile dir, ProtocolFileFilter filter) throws RemoteConnectionException
    Throws:
        RemoteConnectionException
public void changeToRoot(RemoteSite remoteSite) throws ProtocolException, MalformedURLException
public void changeToHOME(RemoteSite remoteSite) throws ProtocolException, MalformedURLException
public void changeToDir(String dir, RemoteSite remoteSite) throws MalformedURLException, ProtocolException
public void changeToDir(RemoteSiteFile pFile) throws ProtocolException, MalformedURLException
public ProtocolFile getHomeDir(RemoteSite remoteSite) throws ProtocolException
    Throws:
        ProtocolException
public ProtocolFile getProtocolFile(RemoteSite remoteSite, String file, boolean isDir) throws ProtocolException
    Throws:
        ProtocolException
public ProtocolFile getCurrentFile(RemoteSite remoteSite) throws ProtocolFileException, ProtocolException, MalformedURLException
public boolean addToDownloadQueue(RemoteSite remoteSite, String file, String renamingString, File downloadToDir, String uniqueMetadataElement, boolean deleteAfterDownload) throws ToManyFailedDownloadsException, RemoteConnectionException, ProtocolFileException, ProtocolException, AlreadyInDatabaseException, UndefinedTypeException, CatalogException, IOException
public boolean validate(RemoteSite remoteSite)
public void waitUntilAllCurrentDownloadsAreComplete() throws ProtocolException
    Throws:
        ProtocolException
public boolean addToDownloadQueue(RemoteSiteFile file, String renamingString, File downloadToDir, String uniqueMetadataElement, boolean deleteAfterDownload) throws ToManyFailedDownloadsException, RemoteConnectionException, AlreadyInDatabaseException, UndefinedTypeException, CatalogException, IOException
public boolean isDownloading(ProtocolFile pFile)
public LinkedList<ProtocolFile> getCurrentlyDownloadingFiles()
public LinkedList<ProtocolFile> getListOfFailedDownloads()
public void clearFailedDownloadsList()
public void shutdown()
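Taken together, a typical crawl-and-download lifecycle can be sketched using only the methods documented on this page. The sketch below assumes that config, siteInfo, and remoteSite have been built elsewhere (e.g. from the .properties and sources XML files described above); the remote file path, download directory, and argument choices are hypothetical placeholders, and exception handling is elided:

```java
// Sketch: one crawl/download cycle against a single remote site.
FileRetrievalSystem frs = new FileRetrievalSystem(config, siteInfo);
frs.initialize();                                     // may throw IOException

if (frs.validate(remoteSite)) {
    // Queue a remote file for download. The nulls mean: no renaming
    // string and no unique metadata element (placeholder choices).
    frs.addToDownloadQueue(remoteSite,
                           "/pub/data/granule.dat",   // remote file (hypothetical)
                           null,                      // renamingString
                           new File("/tmp/downloads"),
                           null,                      // uniqueMetadataElement
                           false);                    // don't delete after download

    // Block until the download threads drain the queue.
    frs.waitUntilAllCurrentDownloadsAreComplete();

    if (!frs.getListOfFailedDownloads().isEmpty()) {
        frs.clearFailedDownloadsList();               // or re-queue and retry
    }
}
frs.shutdown();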
public boolean closeSessions() throws RemoteConnectionException
    Throws:
        RemoteConnectionException
Copyright © 1999-2014 Apache OODT. All Rights Reserved.