Recommender First-Timer Dos and Don’ts
Many people with an interest in recommenders arrive at Mahout because they
are building their first recommender system. Some starting questions have
been asked often enough to warrant a FAQ collecting advice and rules of
thumb for newcomers.
For the interested, these topics are treated in detail in the book Mahout in Action.
Don’t start with a distributed, Hadoop-based recommender; take on that
complexity only if necessary. Start with non-distributed recommenders. They
are simpler, have fewer requirements, and are more flexible.
As a crude rule of thumb, a system with up to 100M user-item associations
(ratings, preferences) should “fit” onto one modern server machine with 4GB
of heap available and run acceptably as a real-time recommender. The system
is invariably memory-bound since keeping data in memory is essential to
performance.
Beyond this point it gets expensive to deploy a machine with enough RAM,
so designing for a distributed system makes sense when nearing this scale.
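To make the rule of thumb above concrete, the arithmetic below works out the
memory budget it implies per association. This is only a back-of-the-envelope
sketch: the class and method names are hypothetical, 4GB is treated as
4 x 2^30 bytes, and the resulting figure is simply what the rule implies, not
a measured property of Mahout.

```java
/** Back-of-the-envelope heap budgeting; names are illustrative, not Mahout API. */
final class HeapBudget {
    static final long GIB = 1024L * 1024L * 1024L;

    /** Average number of heap bytes available per association. */
    static double bytesPerAssociation(long heapBytes, long associations) {
        return (double) heapBytes / associations;
    }

    /** Number of associations that fit in a heap at a given per-association cost. */
    static long maxAssociations(long heapBytes, double bytesPerAssociation) {
        return (long) (heapBytes / bytesPerAssociation);
    }

    public static void main(String[] args) {
        // The rule of thumb: 100M associations in 4GB of heap.
        double budget = bytesPerAssociation(4 * GIB, 100_000_000L);
        System.out.printf("Implied budget: %.1f bytes per association%n", budget);
        // Assumption: the same per-association cost holds on a larger heap.
        System.out.println("Associations in 16 GiB at that cost: "
                + maxAssociations(16 * GIB, budget));
    }
}
```

The implied budget is a little over 40 bytes per association, which is why
the data size, rather than CPU, tends to decide when one machine stops being
enough.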
However, most applications don’t really have 100M associations to process.
Data can be sampled; noisy and old data can often be aggressively pruned
without significant impact on the result.
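As one possible illustration of pruning and sampling, the sketch below drops
associations older than a cutoff and then keeps a random fraction of the
rest. The Pref class, its fields, and the method name are assumptions made
for this example and are not part of Mahout; a noise filter could be added
as one more condition in the same loop.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Illustrative pre-processing step; not part of Mahout. */
final class PreferencePruning {

    /** One user-item association; this field layout is an assumption. */
    static final class Pref {
        final long userId;
        final long itemId;
        final float value;
        final long timestampMillis;

        Pref(long userId, long itemId, float value, long timestampMillis) {
            this.userId = userId;
            this.itemId = itemId;
            this.value = value;
            this.timestampMillis = timestampMillis;
        }
    }

    /**
     * Drops associations older than cutoffMillis, then keeps each remaining
     * one with probability sampleRate (0.0 keeps none, 1.0 keeps all).
     */
    static List<Pref> pruneAndSample(List<Pref> prefs, long cutoffMillis,
                                     double sampleRate, long seed) {
        Random random = new Random(seed); // fixed seed gives repeatable samples
        List<Pref> kept = new ArrayList<>();
        for (Pref p : prefs) {
            if (p.timestampMillis < cutoffMillis) {
                continue; // prune old data
            }
            if (random.nextDouble() < sampleRate) {
                kept.add(p); // random down-sampling
            }
        }
        return kept;
    }
}
```

Comparing recommendation quality before and after such a step is a sensible
way to decide how aggressive the cutoff and sample rate can be.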
The next question is whether or not your system has preference values, or
ratings. Do users and items merely have an association or not, such as the
existence or lack of a click? Or is behavior translated into some scalar
value representing the user’s degree of preference for the item?
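The difference between the two kinds of data can be sketched as follows. The
event types and their weights are entirely hypothetical; how behavior should
map to a scalar value depends on the application.

```java
import java.util.Collection;
import java.util.EnumMap;
import java.util.Map;

/** Illustrative encoding of user behavior; not part of Mahout. */
final class PreferenceEncoding {

    /** Hypothetical kinds of user behavior toward an item. */
    enum Event { VIEW, ADD_TO_CART, PURCHASE }

    // Hypothetical weights; real values would be tuned per application.
    private static final Map<Event, Float> WEIGHTS = new EnumMap<>(Event.class);
    static {
        WEIGHTS.put(Event.VIEW, 1.0f);
        WEIGHTS.put(Event.ADD_TO_CART, 3.0f);
        WEIGHTS.put(Event.PURCHASE, 5.0f);
    }

    /** Boolean-style data: any observed event means the pair is associated. */
    static boolean hasAssociation(Collection<Event> events) {
        return !events.isEmpty();
    }

    /** Rating-style data: the strongest observed event becomes the value. */
    static float preferenceValue(Collection<Event> events) {
        float best = 0.0f;
        for (Event e : events) {
            float weight = WEIGHTS.get(e);
            best = Math.max(best, weight);
        }
        return best;
    }
}
```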
If you have ratings, then a good place to start is a
GenericItemBasedRecommender, plus a PearsonCorrelationSimilarity similarity
metric. If you don’t have ratings, then a good place to start is
GenericBooleanPrefItemBasedRecommender and LogLikelihoodSimilarity.
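For intuition about what the two similarity metrics measure, here is a
standalone sketch of the underlying statistics: Pearson correlation over the
users who rated both items, and the log-likelihood ratio of a 2x2
co-occurrence table. It uses only the Java standard library and is not
Mahout's implementation; the class and method names are made up for this
example, and Mahout's own classes may differ in details such as weighting or
how the raw statistic is scaled into a similarity.

```java
import java.util.Map;

/** Illustrative item-item statistics; not Mahout's implementation. */
final class SimilaritySketch {

    /**
     * Pearson correlation between two items, each given as userID -> rating.
     * Only users who rated both items count. Returns NaN when undefined.
     */
    static double pearson(Map<Long, Double> ratingsA, Map<Long, Double> ratingsB) {
        int n = 0;
        double sumX = 0, sumY = 0, sumXX = 0, sumYY = 0, sumXY = 0;
        for (Map.Entry<Long, Double> e : ratingsA.entrySet()) {
            Double other = ratingsB.get(e.getKey());
            if (other == null) {
                continue; // this user did not rate both items
            }
            double x = e.getValue();
            double y = other;
            n++;
            sumX += x;
            sumY += y;
            sumXX += x * x;
            sumYY += y * y;
            sumXY += x * y;
        }
        if (n < 2) {
            return Double.NaN; // not enough overlap
        }
        double varX = n * sumXX - sumX * sumX;
        double varY = n * sumYY - sumY * sumY;
        if (varX <= 0 || varY <= 0) {
            return Double.NaN; // constant ratings: correlation undefined
        }
        return (n * sumXY - sumX * sumY) / Math.sqrt(varX * varY);
    }

    private static double xLogX(long x) {
        return x == 0 ? 0.0 : x * Math.log(x);
    }

    /** Unnormalized entropy of a set of counts. */
    private static double entropy(long... counts) {
        long sum = 0;
        double result = 0.0;
        for (long c : counts) {
            result += xLogX(c);
            sum += c;
        }
        return xLogX(sum) - result;
    }

    /**
     * Log-likelihood ratio for boolean data. k11 = users with both items,
     * k12 = only item A, k21 = only item B, k22 = users with neither.
     * Values near 0 mean the items co-occur about as often as chance predicts.
     */
    static double logLikelihoodRatio(long k11, long k12, long k21, long k22) {
        double rowEntropy = entropy(k11 + k12, k21 + k22);
        double columnEntropy = entropy(k11 + k21, k12 + k22);
        double matrixEntropy = entropy(k11, k12, k21, k22);
        // Clamp tiny negative values caused by floating-point rounding.
        return Math.max(0.0, 2.0 * (rowEntropy + columnEntropy - matrixEntropy));
    }
}
```

Pearson needs the rating values themselves, whereas the log-likelihood ratio
needs only counts of who has which item, which is why it suits data without
ratings.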
If you want to do content-based item-item similarity, you need to implement
your own ItemSimilarity.
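As a rough idea of the computation such an implementation might contain,
the sketch below scores two items by the overlap of their tag sets (Jaccard
similarity). The tag metadata, the class name, and the choice of Jaccard are
all assumptions for illustration; a real implementation would also need to
satisfy the full Mahout ItemSimilarity interface, which this sketch does not
attempt.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Illustrative content-based similarity core; not a Mahout class. */
final class TagOverlapSimilarity {

    // Hypothetical item metadata: itemID -> set of descriptive tags.
    private final Map<Long, Set<String>> tagsByItem;

    TagOverlapSimilarity(Map<Long, Set<String>> tagsByItem) {
        this.tagsByItem = tagsByItem;
    }

    /** Jaccard overlap of the two items' tag sets, in the range [0, 1]. */
    double itemSimilarity(long itemA, long itemB) {
        Set<String> a = tagsByItem.getOrDefault(itemA, Collections.emptySet());
        Set<String> b = tagsByItem.getOrDefault(itemB, Collections.emptySet());
        if (a.isEmpty() || b.isEmpty()) {
            return 0.0; // nothing known about at least one item
        }
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        int unionSize = a.size() + b.size() - intersection.size();
        return (double) intersection.size() / unionSize;
    }
}
```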
If your data can be simply exported to a CSV file, use FileDataModel and
push new files periodically.
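A minimal sketch of the export side, assuming a comma-separated
userID,itemID,value layout: lines are written to a temporary file and then
moved into place, so that a reader is less likely to see a half-written
file. The class name, method names, and the ".tmp" suffix are hypothetical,
and only the Java standard library is used.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.AtomicMoveNotSupportedException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.List;

/** Illustrative CSV export helper; not part of Mahout. */
final class PreferenceCsvExport {

    /** Formats one association as "userID,itemID,value". */
    static String toLine(long userId, long itemId, float value) {
        return userId + "," + itemId + "," + value;
    }

    /** Writes the lines beside the target, then swaps the file into place. */
    static void publish(List<String> lines, Path target) {
        Path temp = target.resolveSibling(target.getFileName() + ".tmp");
        try {
            Files.write(temp, lines, StandardCharsets.UTF_8);
            try {
                // Whether an existing target is replaced by an atomic move
                // is implementation specific; check this on your platform.
                Files.move(temp, target, StandardCopyOption.ATOMIC_MOVE);
            } catch (AtomicMoveNotSupportedException e) {
                // Fallback: a plain replace, which is not atomic.
                Files.move(temp, target, StandardCopyOption.REPLACE_EXISTING);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```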
If your data is in a database, use MySQLJDBCDataModel (or its “BooleanPref”
counterpart if appropriate, or its PostgreSQL counterpart, etc.) and wrap
it in a ReloadFromJDBCDataModel.
This should give a reasonable starter system that responds quickly. The
nature of the system is that new data comes in from the file or database
only periodically, perhaps on the order of minutes.
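The general pattern behind this, serving requests from an in-memory snapshot
and replacing it on a timer, can be sketched as below. This is a generic
illustration using only the Java standard library, not Mahout code; the
class name and the loader callback are hypothetical stand-ins for whatever
reloads your file or database data.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

/** Holds an in-memory snapshot that is reloaded periodically. */
final class PeriodicSnapshot<T> {

    private final Supplier<T> loader; // hypothetical: reads the file or database
    private final AtomicReference<T> current;

    PeriodicSnapshot(Supplier<T> loader) {
        this.loader = loader;
        this.current = new AtomicReference<>(loader.get()); // initial load
    }

    /** Fast path: requests read whatever snapshot is current. */
    T get() {
        return current.get();
    }

    /** Loads fresh data and swaps it in with a single reference update. */
    void refresh() {
        current.set(loader.get());
    }

    /** Schedules refresh() to run repeatedly with the given delay. */
    ScheduledExecutorService startRefreshing(long period, TimeUnit unit) {
        ScheduledExecutorService scheduler =
                Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleWithFixedDelay(() -> {
            try {
                refresh();
            } catch (RuntimeException e) {
                // Keep serving the previous snapshot if a reload fails.
            }
        }, period, period, unit);
        return scheduler; // the caller should shut this down on exit
    }
}
```

With a design like this, requests never wait on the reload, at the cost of
recommendations lagging new data by up to one refresh period.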