Using A SQL Database for Bayesian Storage Module ------------------------------------------------------- SpamAssassin can now store users' bayesian filter data in a SQL database. The most common use for a system like this would be for users to be able to have per user bayesian filter data on systems where users may not have a home directory to store the data. In order to activate the SQL based bayesian storage you have to configure spamassassin and spamd to use a different bayes storage module. This can be done via a setting in the global configuration file. The directives required to turn on the SQL based bayesian storage are: bayes_store_module Mail::SpamAssassin::BayesStore::SQL This directive is used by the Bayes module to determine which storage module should be used. If not set it will default to: Mail::SpamAssassin::BayesStore::DBM PostgreSQL users will want to use the PostgreSQL specific storage module: bayes_store_module Mail::SpamAssassin::BayesStore::PgSQL This module provides a slightly different interface to makes better use of the resources that PostgreSQL offers. In addition, please make sure that you follow the instructions below for loading the proper procedural language and installing the tables and stored procedure. There is also a MySQL specific storage driver available to provides a small boost in performance. It requires version 4.1 or above of the MySQL database software to work properly. In addition, it provides rollback on error functionality if you create your bayes database table using the InnoDB storage engine (ie s/MyISAM/InnoDB/ on the bayes_mysql.sql file). WARNING: Using this module with a version of MySQL < 4.1 could have unexpected results. To use the MySQL 4.1+ specific module set your bayes_store_module directive accordingly: bayes_store_module Mail::SpamAssassin::BayesStore::MySQL Additional configuration directives provided by BayesSQL: bayes_sql_dsn DBI:driver:database:hostname[:port] bayes_sql_username dbusername bayes_sql_password dbpassword The bayes_sql_dsn directive describes the data source name that will be used to create the connection to your SQL server. It MUST be in the format as listed above. should be the DBD driver that you have installed to access your database (initially tested with MySQL, PostgreSQL, and SQLite). must be the name of the database that you created to store the bayes data tables. is the name of the host that contains the SQL database server. is the optional port number where your database server is listening. For an example of connection to PostgreSQL, see the main README file. In addition to the global configuration directives there is a user preference: bayes_sql_override_username someusername This directive, if used, will override the username used for storing data in the database. This could be used to group users together to share bayesian filter data. You can also use this config option to trick sa-learn to learn data as a specific user. Requirements ------------ In order for SpamAssassin to work with your SQL database, you must have the perl DBI module installed, AS WELL AS the DBD driver/module for your specific database. For example, if using MySQL as your RDBMS, you must have the DBD::mysql module installed. Check CPAN for the latest versions of DBI and your database driver/module. The BayesStore::SQL module was tested with: DBI-1.38 DBD-mysql-2.9002 perl v5.8.0 But older versions should work fine as the SQL code in SpamAssassin is as simple as could be. NOTE: Some users have reported problems using DBD::Pg v1.22. There appears to be a bug in how parameters are quoted in that version, we recommend you upgrade to at least v1.31. In addition, there appears to be a quote bug in some versions of PostgreSQL. It's unclear when the bug was fixed. 7.4.2 seems to work just fine, however some have reported problems with 7.3.6. Your milage may vary with versions less than 7.4.2. If you happen to know when the specific bug was fixed please feel free to notify the dev team so they can update this documentation. Database Schema --------------- The database schema for storage of the bayesian filter data contains several different tables. Several sample SQL schemas have been included in to help in setting up your database. The schemas contain the minimum tables and columns necessary to work with the code as written. You are free to add other columns as needed for your local implementation. Presently there is no way to override the table and column names used by the BayesStore::SQL code, this feature may be added in the future. Example setup of bayes tables for MySQL: This assumes that you have already created a database for use with spamassassin and setup a username/password that can access that database. (See "Creating A Database", in "sql/README", if you don't have a suitable database ready.) To install the tables using the included example, use the following command: mysql -h -u -p databasename < bayes_mysql.sql Enter password: To install the tables for PostgreSQL, use: You need to install the plpgsql procedural language (assuming it hasn't already been installed in template1). You can do this by running the following command: createlang plpgsql and then: psql -U -f bayes_pg.sql (If you are using older versions of PostgreSQL, such as version 7.4.x, you may need to patch this. See http://issues.apache.org/SpamAssassin/show_bug.cgi?id=4653 ) From a security viewpoint, it is wise to grant the minimum privileges needed to the user on the bayes tables. These grants largely depend on your system policies but here are some example grants to get you started, please review carefully before putting into production: GRANT SELECT, UPDATE, DELETE, INSERT ON TABLE bayes_token TO ; GRANT SELECT, UPDATE, DELETE, INSERT ON TABLE bayes_vars TO ; GRANT SELECT, DELETE, INSERT ON TABLE bayes_seen TO ; GRANT SELECT, DELETE, INSERT ON TABLE bayes_expire TO ; GRANT SELECT ON TABLE bayes_global_vars TO ; GRANT UPDATE, SELECT, INSERT ON bayes_vars_id_seq TO ; Once you have created the database and added the tables, just add the required lines to your global configuration file (local.cf). Testing SpamAssassin/SQL ------------------------ To test your SQL setup, and debug any possible problems, you should start spamd with the -D option, which will keep spamd in the foreground, and will output debug message to the terminal. You should then test spamd with a message by calling spamc. You can use the sample-spam.txt file with the following command: cat sample-spam.txt | spamc Watch the debug output from spamd and look for the following debug line: debug: bayes: Database connection established debug: bayes: Using username: If you do not see the above text, then the SQL query was not successful, and you should see any error messages reported. This code has been tested using MySQL as the RDMS, with basic tests against PostgreSQL and SQLite. It does require a database that allows you to refer to a column on the right hand side of an expression (ie update foo set bar = bar + 1). Any database driver that allows for that usage should work with the BayesStore::SQL code. NOTE: You may find that some implementations do not provide a significant advantage over using the default DBM implementation. If you find a driver that should work and has issues, please report them to the SADev list. ****** NB: This should be considered BETA, and the interface, schema, or overall operation of SQL support may change at any time with future releases of SA. ****** Converting Bayes Data From a DBM Database ----------------------------------------- Converting your bayes database data from Berkeley (DBM) based storage to SQL based storage is as simple as a backup and then restore. If you are upgrading from a previous version of SpamAssassin you should first follow any recommended upgrade instructions for that release, in most cases this will be as simple as running an sa-learn --sync Once you have performed this upgrade, for each bayes database follow this procedure: o Run 'sa-learn --backup > backup.txt' which will backup your bayes data into a text file. o Optionally you can run 'sa-learn --clear' to remove the DBM based bayes files. o Modify your local.cf file according to the directions above. o Run 'sa-learn --restore backup.txt' to restore your bayes data to the SQL database. NOTE: sa-learn must be run as the user who's data you are loading, or you must make use of the bayes_sql_override_username config option.