This is a simple example of the use of maximum entropy and the OpenNLP
Maxent toolkit. (It was designed to work with Maxent v2.5.0.) Two
example data sets are provided: one for whether a game should be played
indoors or outdoors, and another for whether Arsenal or Manchester
United (two English football clubs) will win when they play each other.
Each decision is based on a few potentially salient features. The Java
classes should help you get up and running with your own maxent
implementation, though the context generator is about as simple as it
gets. For more complex examples, look at the classes in the
opennlp.tools packages, available at http://opennlp.sourceforge.net.

To play with this sample application, do the following:

Be sure that maxent-2.5.0.jar and trove.jar (found in the lib directory)
are in your classpath.

Compile the java files:

> javac *.java

Note: the following avoids the need to set up your classpath in your
environment (be sure to adjust the maxent jar name to the correct
version number):

> javac -classpath .:../../lib/trove.jar:../../output/maxent-2.5.0.jar *.java

Now, build the models:

> java CreateModel gameLocation.dat
> java CreateModel football.dat

This will produce the two models "gameLocationModel.txt" and
"footballModel.txt" in this directory.

Again, to fix classpath issues on the command line, do the following
instead:

> java -cp .:../../lib/trove.jar:../../output/classes CreateModel football.dat

You can then test the models on the data itself to see what sort of
results they get on the data they were trained on:

> java Predict gameLocation.dat
> java Predict football.dat

or, with command line classpath:

> java -cp .:../../lib/trove.jar:../../output/classes Predict gameLocation.test

You'll get output such as the following:

--------------------------------------------------
For context: Cloudy Happy Humid
Outdoor[0.9255]  Indoor[0.0745]

For context: Rainy Dry
Outdoor[0.0133]  Indoor[0.9867]
--------------------------------------------------

For the first, the model has assigned a normalized probability of about
93% to the Outdoor outcome, so given the context "Cloudy,Happy,Humid" it
would choose to have the game outdoors. For the second, the model
appears to be almost entirely sure that the game should be indoors.

The Arsenal vs. Manchester United decision is a bit more interesting
because there are three possible outcomes: Arsenal wins, ManU wins, or
they tie. Here is some example output:

--------------------------------------------------
For context: home=arsenal Beckham=true Henry=false
arsenal[0.3201]  man_united[0.6343]  tie[0.0456]

For context: home=man_united Beckham=true Henry=true
arsenal[0.1499]  man_united[0.2060]  tie[0.6441]
--------------------------------------------------

In the first case, ManU looks like the clear winner. In the second, a
tie looks most likely, though ManU has more of a chance of winning than
Arsenal. (For those who don't know, Beckham, Scholes, and Neville
are/were ManU players and Ferguson is the coach, while Henry, Kanu, and
Parlour are Arsenal players with Wenger as their coach. By
"Beckham=false" I mean that Beckham won't play this game.)

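To make "normalized probability" concrete, here is a minimal sketch of
how a set of per-outcome scores can be turned into probabilities that
sum to 1, and how the most likely outcome is then chosen. It uses the
general maxent form (exponentiate each score, divide by the total) and
is not a copy of the toolkit's internals. The class name OutcomeSketch,
its methods, and the scores in main are all made up for illustration;
the scores do not come from a trained model.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class OutcomeSketch {

    /** Normalizes exp(score) over all outcomes so the results sum to 1. */
    public static Map<String, Double> normalize(Map<String, Double> scores) {
        double z = 0.0;
        for (double s : scores.values()) {
            z += Math.exp(s);
        }
        Map<String, Double> probs = new LinkedHashMap<>();
        for (Map.Entry<String, Double> e : scores.entrySet()) {
            probs.put(e.getKey(), Math.exp(e.getValue()) / z);
        }
        return probs;
    }

    /** Returns the outcome with the highest probability. */
    public static String best(Map<String, Double> probs) {
        String bestOutcome = null;
        double bestProb = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Double> e : probs.entrySet()) {
            if (e.getValue() > bestProb) {
                bestProb = e.getValue();
                bestOutcome = e.getKey();
            }
        }
        return bestOutcome;
    }

    public static void main(String[] args) {
        // Hypothetical summed feature weights for one context.
        Map<String, Double> scores = new LinkedHashMap<>();
        scores.put("arsenal", 0.2);
        scores.put("man_united", 0.9);
        scores.put("tie", -1.5);

        Map<String, Double> probs = normalize(scores);
        // Print in the same outcome[probability] style as Predict.
        for (Map.Entry<String, Double> e : probs.entrySet()) {
            System.out.printf("%s[%.4f]  ", e.getKey(), e.getValue());
        }
        System.out.println();
        System.out.println("best: " + best(probs));
    }
}
```

Because each score is just a sum over the active features, the order in
which the features are listed does not change the result.
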
Also, try this on the test files:

> java Predict gameLocation.test
> java Predict football.test

Go ahead and modify the data to experiment with how the results can vary
depending on the input to training. There isn't much data, so it's not a
full-fledged example of maxent, but it should still give the general
idea. Also, add more contexts to the test files to see what the model
will produce with different features active.

In all the previous examples, the features were binary-valued, meaning
that a feature was either on or off. You can also use features which
have real values (like 0.07). Such features are written with the value
after an equals sign, as with the "pdiff" and "ptwins" features below.

away pdiff=0.9375 ptwins=0.25 tie
away pdiff=0.6875 ptwins=0.6666 lose
home pdiff=1.0625 ptwins=0.3333 win

Features which are not in this format are considered to have a value of
1. Note that feature values MUST BE POSITIVE.

Using real-valued features has some additional overhead, so you'll need
to let the model know that it should look for these features. For these
examples, you can use the "-real" option.

> java CreateModel -real realTeam.dat

You can then test the models on the test data:

> java Predict -real realTeam.test

You'll see output like:

--------------------------------------------------
For context: home pdiff=0.6875 ptwins=0.5
lose[0.3279]  win[0.4311]  tie[0.2410]

For context: home pdiff=1.0625 ptwins=0.5
lose[0.3414]  win[0.4301]  tie[0.2284]

For context: away pdiff=0.8125 ptwins=0.5
lose[0.5590]  win[0.1864]  tie[0.2546]

For context: away pdiff=0.6875 ptwins=0.6
lose[0.5578]  win[0.1866]  tie[0.2556]
--------------------------------------------------

You can see that the values of the features, as well as their presence
or absence (such as the home or away features), affect the probabilities
assigned to each outcome. The "-real" option is what tells CreateModel
and Predict that the data is real-valued.

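As an illustration of the real-valued format described above, here is a
small sketch that splits a context into feature names and values. It is
not the toolkit's own parser. The class name RealFeatureSketch and the
method parseContext are made up for this example, and letting a value of
exactly zero through is an assumption on my part. A feature with no
numeric value after an equals sign gets the value 1, and negative
values are rejected.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class RealFeatureSketch {

    /**
     * Splits a context such as "home pdiff=1.0625 ptwins=0.3333" into
     * feature names and values. A feature without a numeric "=value"
     * suffix is kept whole and given the value 1.
     */
    public static Map<String, Float> parseContext(String context) {
        Map<String, Float> features = new LinkedHashMap<>();
        for (String token : context.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue;
            }
            String name = token;
            float value = 1.0f;
            int eq = token.lastIndexOf('=');
            if (eq > 0 && eq + 1 < token.length()) {
                try {
                    value = Float.parseFloat(token.substring(eq + 1));
                    name = token.substring(0, eq);
                } catch (NumberFormatException e) {
                    // Not a number after '=' (e.g. "home=arsenal"):
                    // keep the whole token as the name, value stays 1.
                }
            }
            if (value < 0) {
                throw new IllegalArgumentException(
                        "Feature values must be positive: " + token);
            }
            features.put(name, value);
        }
        return features;
    }

    public static void main(String[] args) {
        // A training line: the context followed by the outcome.
        String line = "home pdiff=1.0625 ptwins=0.3333 win";
        int lastSpace = line.lastIndexOf(' ');
        String context = line.substring(0, lastSpace);
        String outcome = line.substring(lastSpace + 1);

        System.out.println("outcome:  " + outcome);
        System.out.println("features: " + parseContext(context));
    }
}
```
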
In general, to work with real-valued data you'll need to use the
classes: RealBasicEventStream, RealValueFileEventStream,
OnePassRealValueDataIndexer, and TwoPassRealValueDataIndexer.

In all of the data files the features appear in almost the same order,
but this is not important. You can list them in whatever order you like.

If you have any suggestions, interesting modifications, or data sets for
other examples to add to this sample maxent application, please post
them to the maxent open discussion forum:

http://sourceforge.net/forum/forum.php?forum_id=18384