Indexer for the DBpedia dataset (see http://dbpedia.org/)

This tool creates a full cache for DBpedia based on the RDF dump available
via the download section of the dbpedia.org web page.

Building:
========

If not yet built by the build process of the entityhub, call

    mvn install

in this directory.

To create the runnable jar that contains all the dependencies, call

    mvn assembly:assembly

If everything completes successfully, then there should be two jar files
within the target directory. The one called

    org.apache.stanbol.entityhub.indexing.dbpedia-0.1*-jar-with-dependencies.jar

is the one to be used for indexing.

Creating the index:
==================

(1) Download all the RDF files you need from the download section of the
    dbpedia.org web page. Make sure you download every file required by
    the configured mappings, so that all the data they use is available.
    All files need to be in the directory passed as the second parameter
    to the tool.

(2) To enable ranking for DBpedia resources, you also need to calculate
    the incoming links for Wikipedia pages using [1]. The generated file
    needs to be passed to the tool by using the -i parameter.
    If you use this feature, also note the -ri parameter, which defines
    the minimum number of incoming links an entity needs in order to be
    included in the index (setting it to 2 results in about 50% of all
    entities being indexed).

(3) The indexer needs a SolrServer, so you need to prepare the Solr index
    that will store the data. A default configuration is provided within
    the "/solrConf" directory. It can be used to configure a SolrServer or
    to add a new core to an existing SolrServer.
    You can also pass an absolute path. In that case an EmbeddedSolrServer
    will be used for indexing.
    NOTE that the "/solrConf" directory only represents a core and not a
    full SolrServer configuration. You need to have a valid "solr.xml" in
    the parent directory of dbpedia.
    See the Solr documentation for details on how to configure cores.

(4) Call the tool with the -h option to print the help screen:

    java -jar ./target/org.apache.stanbol.entityhub.indexing.dbpedia-*-jar-with-dependencies.jar -h

    The help screen should provide you with all the information needed
    for indexing.

Indexing will take a lot of time. Indexing time heavily depends on the
IO operations/sec of the hard disk used.

[1] https://gist.github.com/360315
    NOTE: The script contains two "head" commands. The first restricts
    the input to 10e6 lines and the second prints only the first ten
    lines. When calculating the page rank for all entities, one needs to
    change this and pipe the results into a file.
    Also NOTE that the link used to download the file with the incoming
    links should be adapted to the version of the dumps you use for
    indexing, e.g.
    http://downloads.dbpedia.org/3.6/en/page_links_en.nt.bz2
    for version 3.6 of the dump.
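
Sketch: counting incoming links

The incoming-link calculation described in step (2) and in [1] can be
illustrated with a small Python sketch. This is not the script from [1] and
not part of the indexer. It only shows the idea: every triple in the page
links dump contributes one incoming link to its object. The function names
below are hypothetical, and the file format expected by the -i parameter is
not described in this README, so check the output of [1] before writing the
counts to a file for the tool. The filter function mirrors what the README
says about -ri (a minimum number of incoming links).

```python
import bz2
import re
from collections import Counter

# Matches one N-Triples statement whose subject, predicate and object are IRIs.
TRIPLE = re.compile(r'^<([^>]*)>\s+<([^>]*)>\s+<([^>]*)>\s*\.\s*$')


def count_incoming_links(lines):
    """Count how often each IRI appears as the object of a triple."""
    counts = Counter()
    for line in lines:
        match = TRIPLE.match(line)
        if match:  # skips comments, blank lines and literal-valued triples
            counts[match.group(3)] += 1
    return counts


def filter_by_min_links(counts, minimum):
    """Keep only entities with at least `minimum` incoming links."""
    return {iri: n for iri, n in counts.items() if n >= minimum}


def count_from_dump(path):
    """Stream a bz2-compressed N-Triples dump, e.g. page_links_en.nt.bz2."""
    with bz2.open(path, "rt", encoding="utf-8") as handle:
        return count_incoming_links(handle)
```

Calling count_from_dump("page_links_en.nt.bz2") reads the dump line by
line without decompressing it to disk first. The counter still holds one
entry per linked resource in memory, which is an assumption to verify
against the size of the dump you use.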