Apache OpenNLP ${pom.version}
===============================


Building from the Source Distribution
-------------------------------------

At least Maven 3.0.0 is required for building.

To build everything go into the opennlp directory and run the following command:
    mvn clean install
   
The results of the build will be placed in:
    opennlp-distr/target/apache-opennlp-[version]-bin.tar.gz (or .zip)

What is in the Similarity component in Apache OpenNLP ${pom.version}
--------------------------------------------------------------------
SIMILARITY COMPONENT of OpenNLP

1. Introduction
This component assesses text relevance. It takes two portions of text (phrases, sentences, or paragraphs) and returns a similarity score.
The Similarity component can be used on top of search to improve relevance by computing a similarity score between a question and each search result (snippet).
The component is also useful for web mining of images, videos, forums, blogs, and other media with textual descriptions. The sample applications of this component
include content generation and filtering of meaningless speech recognition results.
   Relevance assessment is based on machine learning of syntactic parse trees (constituency trees, http://en.wikipedia.org/wiki/Parse_tree). 
The similarity score is calculated as the size of all maximal common sub-trees for sentences from a pair of texts (
www.aaai.org/ocs/index.php/WS/AAAIW11/paper/download/3971/4187, www.aaai.org/ocs/index.php/FLAIRS/FLAIRS11/paper/download/2573/3018,
www.aaai.org/ocs/index.php/SSS/SSS10/paper/download/1146/1448).
   The objective of the Similarity component is to give an application engineer a tool for text relevance that can be used as a black box, with no need to understand
 computational linguistics or machine learning.
 
 2. Installation
 Please refer to the OpenNLP installation instructions.
 
 3. First use case of Similarity component: search
 
 To get started with this component, see SearchResultsProcessorTest.java in package opennlp.tools.similarity.apps.
   public void testSearchOrder() runs a web search using the Bing API and improves search relevance.
   Look at the code of
      public List<HitBase> runSearch(String query)
   and then at
      private BingResponse calculateMatchScoreResortHits(BingResponse resp, String searchQuery)
   which gets search results from Bing and re-ranks them based on the computed similarity score.
 
   The main entry point to the Similarity component is
    SentencePairMatchResult matchRes = sm.assessRelevance(snapshot, searchQuery);
    where we pass the search query and the snapshot and obtain the similarity assessment structure, which includes the similarity score.
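
   The re-ranking idea can be sketched without the Bing API or the parser models. The class below is a minimal,
   self-contained illustration, not the component's code: RerankSketch, wordOverlap and rerank are hypothetical names,
   and the word-overlap count is only a crude stand-in for the parse-tree similarity score returned by assessRelevance.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative sketch only: these names are hypothetical and not part of OpenNLP.
class RerankSketch {

    // Crude stand-in for a similarity score: the number of distinct lower-cased
    // words the two texts share. The real component compares parse trees instead.
    static int wordOverlap(String a, String b) {
        Set<String> shared = new HashSet<>(Arrays.asList(a.toLowerCase().split("\\W+")));
        shared.retainAll(new HashSet<>(Arrays.asList(b.toLowerCase().split("\\W+"))));
        shared.remove(""); // split() yields an empty token when the text starts with punctuation
        return shared.size();
    }

    // Returns a copy of the snippets ordered by decreasing score against the query.
    // List.sort is stable, so snippets with equal scores keep their original order.
    static List<String> rerank(String query, List<String> snippets) {
        List<String> sorted = new ArrayList<>(snippets);
        sorted.sort((x, y) -> Integer.compare(wordOverlap(y, query), wordOverlap(x, query)));
        return sorted;
    }

    public static void main(String[] args) {
        List<String> snippets = Arrays.asList(
                "Buy a new carpet today",
                "Remove stains from carpet with vinegar",
                "Weather forecast");
        for (String snippet : rerank("how to remove stains from carpet", snippets)) {
            System.out.println(snippet);
        }
    }
}
```

   Swapping wordOverlap for a call into the Similarity component would keep the same sort-by-score structure.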
   
   To run this test, you need to obtain a search API key from Bing at www.bing.com/developers/s/APIBasics.html and specify it in public class BingQueryRunner, in
  protected static final String APP_ID.
  
  4. Solving a unique problem: content generation
  To demonstrate how the Similarity component can tackle a problem that is hard to solve without linguistics-based technology,
  we introduce a content generation component:
   RelatedSentenceFinder.java
   
   The entry point here is the function call
   hits = f.generateContentAbout("Albert Einstein");
   which writes a biography of Albert Einstein by finding sentences on the web about various kinds of his activities (such as 'born', 'graduate', 'invented', etc.).
   The key here is to compute the similarity between a seed expression such as "Albert Einstein invented relativity theory" and a search result such as
   "Albert Einstein College of Medicine | Medical Education | Biomedical ...
    www.einstein.yu.edu/Albert Einstein College of Medicine is one of the nation's premier institutions for medical education, ..."
    and to filter out irrelevant search results.
   
   This is done in the function
   public HitBase augmentWithMinedSentencesAndVerifyRelevance(HitBase item, String originalSentence,
			List<String> sentsAll)
   which calls
      SentencePairMatchResult matchRes = sm.assessRelevance(pageSentence + " " + title, originalSentence);
   You can consult the results in gen.txt, which contains the generated essay on Einstein's biography.
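
   The filtering step can be pictured as a threshold applied to a pluggable scorer. The sketch below is a simplified
   assumption about the shape of that step, not the code of RelatedSentenceFinder: RelevanceFilterSketch, keepRelevant,
   sharedWords and the threshold value are all hypothetical, and sharedWords merely stands in for the parse-tree score.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.ToDoubleBiFunction;

// Illustrative sketch only: these names are hypothetical and not part of OpenNLP.
class RelevanceFilterSketch {

    // Keeps the candidates whose score against the seed reaches minScore.
    // The scorer is a parameter so a real similarity measure can be plugged in.
    static List<String> keepRelevant(String seed, List<String> candidates,
            ToDoubleBiFunction<String, String> scorer, double minScore) {
        List<String> kept = new ArrayList<>();
        for (String candidate : candidates) {
            if (scorer.applyAsDouble(candidate, seed) >= minScore) {
                kept.add(candidate);
            }
        }
        return kept;
    }

    // Crude stand-in scorer: number of distinct lower-cased words shared by both texts.
    static double sharedWords(String a, String b) {
        Set<String> shared = new HashSet<>(Arrays.asList(a.toLowerCase().split("\\W+")));
        shared.retainAll(new HashSet<>(Arrays.asList(b.toLowerCase().split("\\W+"))));
        shared.remove("");
        return shared.size();
    }

    public static void main(String[] args) {
        List<String> candidates = Arrays.asList(
                "Albert Einstein College of Medicine is one of the nation's premier institutions for medical education",
                "Einstein invented the theory of relativity in 1905");
        // The threshold 3.0 is an arbitrary value chosen for this illustration.
        System.out.println(keepRelevant("Albert Einstein invented relativity theory",
                candidates, RelevanceFilterSketch::sharedWords, 3.0));
    }
}
```

   With this stand-in scorer the college snippet shares only the two name words with the seed and is dropped,
   while the sentence about relativity is kept.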
   
   These are examples of articles generated from a given article title:
     http://www.allvoices.com/contributed-news/9423860/content/81937916-ichie-sings-jazz-blues-contemporary-tunes
     http://www.allvoices.com/contributed-news/9415063-britney-spears-femme-fatale-in-north-sf-bay-area
     
  5. Solving a high-importance problem: filtering out meaningless speech recognition results
  Speech recognition SDKs usually produce a number of candidate phrases as results, such as
  			 "remember to buy milk tomorrow from trader joes",
			 "remember to buy milk tomorrow from 3 to jones"
  One can see that the former is meaningful and the latter is meaningless (although the two sound similar when pronounced).
  We use web mining and the Similarity component to detect the meaningful option (a mistake caused by a query understanding system,
  such as Siri for iPhone, trying to interpret a meaningless request can be costly).
 
  SpeechRecognitionResultsProcessor.java does the job:
  public List<SentenceMeaningfullnessScore> runSearchAndScoreMeaningfulness(List<String> sents)
  re-ranks the phrases in order of decreasing meaningfulness.
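
  One way to picture this re-ranking: score each candidate phrase by how well it is supported by text found on the web,
  then sort. The sketch below makes strong simplifying assumptions. The web snippets are passed in as a plain map
  instead of being fetched, the score is the best fraction of phrase words found in any one snippet rather than a
  parse-tree similarity, and MeaningfulnessSketch and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative sketch only: these names are hypothetical and not part of OpenNLP.
class MeaningfulnessSketch {

    // Distinct lower-cased words of a text.
    static Set<String> words(String text) {
        Set<String> result = new HashSet<>(Arrays.asList(text.toLowerCase().split("\\W+")));
        result.remove("");
        return result;
    }

    // Stand-in score: the largest fraction of the phrase's words found in a single snippet.
    static double meaningfulness(String phrase, List<String> snippets) {
        Set<String> phraseWords = words(phrase);
        if (phraseWords.isEmpty()) {
            return 0.0;
        }
        double best = 0.0;
        for (String snippet : snippets) {
            Set<String> shared = words(snippet);
            shared.retainAll(phraseWords);
            best = Math.max(best, (double) shared.size() / phraseWords.size());
        }
        return best;
    }

    // Orders the candidate phrases by decreasing score, given the snippets found for each.
    static List<String> rankByMeaningfulness(Map<String, List<String>> snippetsByPhrase) {
        List<String> ranked = new ArrayList<>(snippetsByPhrase.keySet());
        ranked.sort((a, b) -> Double.compare(
                meaningfulness(b, snippetsByPhrase.get(b)),
                meaningfulness(a, snippetsByPhrase.get(a))));
        return ranked;
    }

    public static void main(String[] args) {
        // Made-up snippets standing in for web search results.
        Map<String, List<String>> found = new LinkedHashMap<>();
        found.put("remember to buy milk tomorrow from 3 to jones",
                Arrays.asList("Jones family blog"));
        found.put("remember to buy milk tomorrow from trader joes",
                Arrays.asList("Trader Joes sells milk", "Remember to buy milk"));
        System.out.println(rankByMeaningfulness(found));
    }
}
```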
  
  6. Similarity component internals
  In the package opennlp.tools.textsimilarity.chunker2matcher,
  ParserChunker2MatcherProcessor.java parses two portions of text and matches the resulting parse trees to assess the similarity between
  these portions of text.
  To run ParserChunker2MatcherProcessor, the model directory
     private static String MODEL_DIR = "resources/models";
  needs to be specified.
  
  The key function
  public SentencePairMatchResult assessRelevance(String para1, String para2)
  takes two portions of text and assesses their similarity by finding the set of all maximal common subtrees
  of the sets of parse trees for each portion of text.
  
  It splits paragraphs into sentences, parses them, obtains chunking information, and produces grouped phrases (noun, verb, prepositional, etc.):
  public synchronized List<List<ParseTreeChunk>> formGroupedPhrasesFromChunksForPara(String para)
  
  and then attempts to find common subtrees in ParseTreeMatcherDeterministic.java:
      List<List<ParseTreeChunk>> res = md.matchTwoSentencesGroupedChunksDeterministic(sent1GrpLst, sent2GrpLst)
  
  Phrase matching functionality is in package opennlp.tools.textsimilarity,
  in ParseTreeMatcherDeterministic.java.
  The key matching function takes two phrases, aligns them, and finds the set of maximal common sub-phrases:
  public List<ParseTreeChunk> generalizeTwoGroupedPhrasesDeterministic
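
  To make the notion of a common sub-phrase concrete, here is a deliberately simplified sketch. It is not the algorithm
  in ParseTreeMatcherDeterministic: it aligns two phrases with a longest-common-subsequence over part-of-speech tags,
  keeps a lemma where both phrases agree, and writes "*" where they differ. GeneralizeSketch, generalize and the
  {POS, lemma} token layout are hypothetical choices made for this illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: these names are hypothetical and not part of OpenNLP.
class GeneralizeSketch {

    // Each token is a two-element array {POS tag, lemma}.
    // Returns the aligned common sub-phrase as "POS-lemma" or "POS-*" entries.
    static List<String> generalize(String[][] p1, String[][] p2) {
        int n = p1.length;
        int m = p2.length;
        // len[i][j] = length of the longest common POS subsequence of p1[i..] and p2[j..]
        int[][] len = new int[n + 1][m + 1];
        for (int i = n - 1; i >= 0; i--) {
            for (int j = m - 1; j >= 0; j--) {
                len[i][j] = p1[i][0].equals(p2[j][0])
                        ? len[i + 1][j + 1] + 1
                        : Math.max(len[i + 1][j], len[i][j + 1]);
            }
        }
        // Walk the table from the start to recover one longest alignment.
        List<String> common = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < n && j < m) {
            if (p1[i][0].equals(p2[j][0])) {
                String lemma = p1[i][1].equals(p2[j][1]) ? p1[i][1] : "*";
                common.add(p1[i][0] + "-" + lemma);
                i++;
                j++;
            } else if (len[i + 1][j] >= len[i][j + 1]) {
                i++;
            } else {
                j++;
            }
        }
        return common;
    }

    public static void main(String[] args) {
        String[][] first = {{"DT", "a"}, {"JJ", "digital"}, {"NN", "camera"}, {"IN", "with"}, {"NN", "zoom"}};
        String[][] second = {{"DT", "the"}, {"NN", "camera"}, {"IN", "with"}, {"DT", "a"}, {"NN", "lens"}};
        System.out.println(generalize(first, second)); // [DT-*, NN-camera, IN-with, NN-*]
    }
}
```

  The number of entries in the result (here 4) plays the role of a similarity score in this toy version. Aligning on
  POS tags alone can miss an alignment that preserves more lemmas, which is one reason this is only a sketch.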
  
  7. Package structure
    opennlp.tools.similarity.apps: 3 main applications
    opennlp.tools.similarity.apps.utils: utilities for the above applications

    opennlp.tools.textsimilarity.chunker2matcher: parser which converts text into a form for matching parse trees
    opennlp.tools.textsimilarity: parse tree matching functionality
	



Requirements
------------
Java 1.5 is required to run OpenNLP
Maven 3.0.0 is required for building it

Known OSGi Issues
-----------------
In an OSGi environment the following things are not supported:
- The coreference resolution component
- The ability to load a user provided feature generator class

Note
----
The current API still contains many deprecated methods. These
will be removed in one of our next releases; please
migrate to our new API.