IKS Autotagging =============== :author: ogrisel@nuxeo.com Text document classification / topic assignment service based on the text content of DBpedia. The implementation is based on lucene and the `MoreLikeThis` similarity query that leverages term frequencies of the index. Building ======== 1- Download maven_ and ensure that the `mvn` command is registered in your `PATH` environment variable. 2- From the top of the `iks-autoagging/` folder build using maven as usual (this will install the `iks-autotagging-X.X.X-SNAPSHOT.jar` jar in your local maven repository and make it available to dependent projects such as Stanbol Enhancer):: % mvn install 3- From the same folder, build a standalone jar suitable for commandline usage. The resulting jar will be named: `target/iks-autotagging-X.X.X-SNAPSHOT-jar-with-dependencies.jar`:: % mvn assembly:assembly 4- (Optional) To import the project in eclipse, first run:: % mvn eclipse:eclipse This will generate `.project` and `.classpath` files to that you can "Import > Import existing projects into workspace" from the Eclipse UI. Alternatively you can install the `m2eclipse` plugin to directly import maven projects into eclipse. .. _maven: http://maven.apache.org Command line usage ================== You can use the autotagger with a default embedded lucene index of the top 10000 entities of DBpedia: % java -jar target/iks-autotagging-*-SNAPSHOT-jar-with-dependencies.jar \ suggest -f samples/bob_marley.txt [...] Annotating 'samples/bob_marley.txt'... done in 739ms: Suggestion #1 (score: 4.648216): 'Bob Marley' URI: http://dbpedia.org/resource/Bob_Marley type: http://www.w3.org/2002/07/owl#Thing type: http://dbpedia.org/ontology/Person type: http://dbpedia.org/ontology/Artist type: http://dbpedia.org/ontology/MusicalArtist Suggestion #2 (score: 0.127039): 'Bunny Wailer' URI: http://dbpedia.org/resource/Bunny_Wailer type: http://www.w3.org/2002/07/owl#Thing type: http://dbpedia.org/ontology/Person type: http://dbpedia.org/ontology/Artist type: http://dbpedia.org/ontology/MusicalArtist Suggestion #3 (score: 0.121009): 'Desmond Dekker' URI: http://dbpedia.org/resource/Desmond_Dekker type: http://www.w3.org/2002/07/owl#Thing type: http://dbpedia.org/ontology/Person type: http://dbpedia.org/ontology/Artist type: http://dbpedia.org/ontology/MusicalArtist For better recall performance it is strongly recommended to use a more comprehensive index of DBpedia entities. To do so you first need to build or download a dedicated DBpedia lucene index in a folder named `/path/to/lucene-idx` (for instance) on the local filesystem (see later sections for instructions). You can download a prebuilt index from here: http://dl.dropbox.com/u/5743203/IKS/autotagging/iks-dbpedia-lucene-idx-20100331-0.tar.bz2 (A better index is currently under construction...) You can then add the "-i /path/to/lucene-idx" option to the previous command line to use your custom index. Instructions for building your own index from scratch are available in the following sections. Restful API =========== :TODO: implement me first! Launch a HTTP server to provide the service using a RESTful API thanks to jetty and Jersey:: % mvn jetty:run % curl -T file-to-anotate.txt http://localhost:8080/autotagging RDF/JSON annotations could be serialized using this convents: http://jdil.org/. Also the Stanbol Enhancer project features an OSGi embedding of this library combined with RESTful interface and persistent annotation and content stores: http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/ Building a lucene index from DBpedia dumps ========================================== 1- Download and uncompress (using `bzip2 -d `)the following datasets from DBpedia: - instancetype_en.nt.bz2_ - longabstract_en_nt.bz2_ - article_label_en_nt.bz2_ .. _instancetype_en_nt.bz2: http://downloads.dbpedia.org/3.4/en/instancetype_en.nt.bz2 .. _longabstract_en_nt.bz2: http://downloads.dbpedia.org/3.4/en/longabstract_en.nt.bz2 .. _article_label_en_nt.bz2: http://downloads.dbpedia.org/3.4/en/article_label.nt.bz2 2- Build a temporary Jena TDB store:: % java -Xmx2g -server -jar target/iks-autotagging-*-SNAPSHOT-jar-with-dependencies.jar \ model /path/to/dbpedia-tdb /path/to/instancetype_en.nt /path/to/longabstract.nt /path/to/articles_label_en.nt Alternatively you can download and use the `bin/tdbloader` tool from the TDB distribution. 3- Index the Jena TDB into a Lucene `FSDirectory`:: % java -Xmx2g -server -jar target/iks-autotagging-*-SNAPSHOT-jar-with-dependencies.jar \ index /path/to/dbpedia-tdb /path/to/lucene-directory You can then use luke_ to check the content of the resulting index:: % java -jar /path/to/lukeall-1.0.0.jar -index /path/to/lucene-idx .. _luke: http://www.getopt.org/luke/ Recently implemented ==================== 0- Finish implementing the `JenaIndexer#main` method to be able to create a Jena TDB store out of DBpedia dumps from the command line. 1- Use the lucene `ShingleFilter` to generate bi-grams (or tri-grams) of token and improve the accuracy of the results at the expense of the size of the index. 2- Extend the `Autotagger` API to allow the requester to ask for a specific entity type (useful to combine with the output of a Named Entity detection module). 4- Extend the `TagInfo` class to feedback the caller with confidence levels for each suggestions (from lucene scores). 5- Improve the `JenaIndexer` to index other text literal sources such as labels, comments, ... 6- Build a small index of the most popular people / place / organization from DBpedia to be packaged easily as the default IKS model Roadmap ======= 3- Implement standalone jersey-based JAX-RS components that takes the text content as an input and output suggested annotations as RDF/XML of RDF/JSON synchronously. 7- Index the DBpedia categories dumps and by aggregating literal text from directly related entities and propose suggestions with type topic to complement entities typed tags. 8- Use a complete wikimedia markup dump as a fulltext source for the index instead of just DBpedia 9- Index the textual context (enclosing paragraph) of all incoming links to an entity coming from other wikipedia articles.