Prerequisites_ are discussed with the rest of the system. One should note, in particular, the requirement for postgres or another SQL server as well as the SQLObject library.
All the supplied scripts take certain options from a config file. Several config files are read in a fixed order, where later files override earlier ones.
A typical file would look like this:
[DEFAULT]
verbosity = 1
[sql]
url = postgres://user:password@localhost/pysteg
imageroot = /work/images
modeldir = /work/svmodel
The three options in the sql section are critical, defining the URL for the database, the root directory for all the image sets, and the directory to store SVM model files, respectively. There are no sensible defaults.
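As an illustration only (not pysteg's actual loader), the options can be read with the standard configparser module; the file name below is hypothetical:

from configparser import ConfigParser

cfg = ConfigParser()
cfg.read("pysteg.cfg")                    # hypothetical config file name
url = cfg.get("sql", "url")               # database URL
imageroot = cfg.get("sql", "imageroot")   # root directory for the image sets
modeldir = cfg.get("sql", "modeldir")     # directory for SVM model files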
The module requires a database system (e.g. PostgreSQL) to be installed, with a database named pysteg. There is a good PostgreSQL tutorial on how to do this under Debian and Ubuntu.
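The database URL can be checked quickly from Python using SQLObject; this is only a sketch, reusing the URL from the config example above:

from sqlobject import connectionForURI, sqlhub

sqlhub.processConnection = connectionForURI(
    "postgres://user:password@localhost/pysteg")
# A trivial query; this raises an exception if the database is unreachable.
print(sqlhub.processConnection.queryAll("SELECT 1"))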
The database tables are created with the following command:
sqladm.py --create --load-feature-sets config/features.cfg
where the paths are relative to the root of the pysteg distribution.
If one needs to recreate the tables in an existing database, one can use the --drop-and-create option instead of --create, to drop all the tables before they are created.
The easiest way to define image sets is to write a config file in the appropriate format, and run
sqladm.py --load-images config/images.cfg
A commented config file can be found in the config subdirectory.
New TestSet objects are created using the following command, where the -T option gives the key of the new object, and the -i and -g options give the ImageSet objects from which to draw clean images and steganograms respectively.
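A representative invocation, with placeholder keys, is:
sqltset.py -T key -i CleanImageSet -g StegoImageSet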
Two TestSet objects are created, one for training and one for testing. The latter has “_test” appended to the key.
Often additional TestSet objects are needed, using the same clean images but taking steganograms from a different set. This is achieved by specifying the source test set with the -S option:
sqltset.py -S source -T key -g StegoImageSet
Note that this command must be run separately for the training set and the test set.
Feature extraction is most easily done using the queue system. The database includes a queue table where each record identifies an image and a number of feature extraction functions to run. Clients can connect to the DB, retrieve a queue item, and execute the job, adding the newly calculated features to the DB.
The assigned and assignee fields give the date of assignment and the client ID. Available jobs have null in both fields. If a client dies for whatever reason, it leaves behind a queue item with assigned non-null. One can easily identify such old, assigned jobs and relaunch them by resetting assigned and assignee to null.
TO DO:
More easy-to-use tools/scripts to requeue jobs should be added.
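In the meantime, stale jobs can be reset by hand. The sketch below assumes the table is called queue with columns assigned and assignee, as described above; check the actual schema before use:

from sqlobject import connectionForURI

conn = connectionForURI("postgres://user:password@localhost/pysteg")
# Requeue items that have been assigned for more than a day.
conn.query("UPDATE queue SET assigned = NULL, assignee = NULL "
           "WHERE assigned IS NOT NULL AND assigned < now() - interval '1 day'")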
Jobs can be added from the command line using the sqlq.py script. For instance
sqlq.py -s URL -T ImageSet Key1 Key2 ...
will add a queue item for every image in the given image set, extracting all the given feature sets Key1, Key2, etc. Note that one queue item is created per image, with the same item listing multiple feature sets.
It is possible to inspect the queue:
sqlq.py -s URL --list
sqlq.py -s URL --count
The first line lists all the tasks, whereas the second only counts them.
The queues can be processed using the sqlworker.py script, which loops as long as new jobs are received. It can be called as
sqlworker.py -s URL [-i idstring]
where URL is the URL for the database and the optional idstring identifies the client process. The script calls the pysteg.svm.extract.worker() function which can obviously be called separately.
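For instance, a worker could in principle be started from a Python session as sketched below; the exact signature of worker() is not documented here, so the argument-free call is an assumption to be checked against pysteg.svm.extract:

from pysteg.svm.extract import worker

# Assumed to pick up the database URL from the config file; the real
# function may require explicit arguments.
worker()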
The sql module supports training and testing of SVM classifiers. The basis for SVM training is one TestSet record serving as the training set and one FeatureVector record defining the features to use.
An SVM Model corresponds to a feature, namely the classification score. Once the model has been trained, this feature can be calculated from other images. This is useful for two reasons:
1. The classification score can be used alongside other features in fusion schemes.
2. The same functions can be used to study statistical properties of, and relationships between, classification scores and other features alike.
We will discuss three ways of using SVM models in pysteg.sql. The simplest and recommended approach is the queue system, which we discuss first. Then we will discuss the older sqlsvm.py script, which is arguably up for deprecation in favour of the queueing system; on the other hand, sqlsvm.py can also be used to list existing models. Finally, we will discuss how to use SVM models directly in Python.
The command lines to enqueue tasks to train and to test an SVM model, respectively, are as follows.
sqlq.py -T trainingSet -f featureVector -F featureSet [--feature-set-description "FS description"] -M modelKey -d "Description" [-g fold]
sqlq.py -M modelKey S1 S2 ...
The arguments are as follows.
trainingSet: The key for a TestSet object to be used for training.
featureVector: The key for a FeatureVector object for the classifier.
featureSet: The key for a FeatureSet object to contain the new feature (classification score). This is created if it does not exist.
modelKey: A key to identify the SVM model and corresponding feature.
“Description”: Any descriptive string to be stored with the new feature.
“FS description”: A descriptive string to be stored with the new feature set.
fold: The value n for n-fold cross-validation in the grid search (default n=5). If the training set is very large, n=2 is recommended.
S1, S2, ...: TestSet objects on which to run and test the classifier. One task is queued per object.
The sqlworker.py script processes these queue items like any others. The training task creates an SVModel record and corresponding Feature record, which may belong to an existing or new FeatureSet as specified. The testing task calculates the classification score feature for all the images, and creates an SVMPerformance record to record the accuracy as well as FP/FN rates.
Note that feature values are not recorded for the training set, but it is straightforward to run a testing task on the training set to do so.
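For example, following the testing syntax above:
sqlq.py -M modelKey trainingSet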
A number of options are provided to enter training and testing tasks in bulk. For training, the following command creates one model for the given training set testset and each of the feature vectors fv1, fv2, ...; a training task is queued for each new model.
sqlq.py --new-models -T testset [-g k] [--feature-set-description text] fv1 fv2 ...
The optional arguments are used to specify k for k-fold cross-validation when SVM with grid search is used, and the text for the FeatureSet description if it has to be created. All the new models belong to a feature set named after the training set, with “SVM” appended.
It is straightforward to test SVM models on the canonical test set, which is obtained by appending “_test” to the name of the training set. Running the command
sqlq.py --svm-performance
will queue tasks for each SVM model to test it on its canonical test set. Similarly, we can bulk test on the training sets (to get training errors) with the command:
sqlq.py --svmtraining-performance
Note that the two options above can be combined in a single invocation.
It is also possible to test every model on one specific test set.
sqlq.py --svm-test T
The above command will queue a test for each SVM model, using the test set T in each case.
The sqlsvm.py script provides a few core features to train and test SVM models. There are three major modes of operation:
list: listing performance data recorded in the database.
training: training a new SVM model from a TestSet and a FeatureVector.
testing: testing an existing model on a TestSet.
Performance data are listed with:
sqlsvm.py --list
For training, the command call is as follows.
sqlsvm.py -T trainingSet -f featureVector -F featureSet -M modelKey -D "Description" [-g fold]
The arguments are as follows.
trainingSet: The key for a TestSet object to be used for training.
featureVector: The key for a FeatureVector object for the classifier.
featureSet: The key for a FeatureSet object to contain the new feature (classification score). This is created if it does not exist.
modelKey: A key to identify the SVM model and corresponding feature.
“Description”: Any descriptive string to be stored with the new feature.
fold: The value n for n-fold cross-validation in the grid search (default n=5). If the training set is very large, n=2 is recommended.
For testing, the command call is as follows.
sqlsvm.py -S testSet -M modelKey
Note that the -S and -T options can be combined to perform both training and testing in a single run.
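For example, with the same placeholder keys as above:
sqlsvm.py -T trainingSet -S testSet -f featureVector -F featureSet -M modelKey -D "Description"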
The SVM model, as created by libSVM, is not stored in the database, but rather in a file, with the file name stored in the database. The reason is that libSVM uses ctypes, with the type or class for the SVM model implemented in C. It therefore cannot be pickled, and tailor-made methods would be necessary to store the model in the database.
The model filename is configurable with a full path name as a parameter to the saveModel() method. The default filename is derived from the feature vector key and the name of the training set, and the file goes in the current directory. There is room for improvement here.
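As a sketch only, a model could be saved under modeldir with an explicit path; the import path, the class name SVModel and the key column below are assumptions based on the description above and should be checked against the source:

from pysteg.sql import SVModel        # assumed import path

model = SVModel.selectBy(key="modelKey").getOne()   # hypothetical lookup by key
model.saveModel("/work/svmodel/modelKey.model")     # explicit full path name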
Features may be used in many ways, for various kinds of data mining and statistical analysis. The sqlstats.py script can be used to compare features in various ways.
sqlstats.py -S TestSet [flags] [-p plotfile] -F FeatureSet
sqlstats.py -S TestSet [flags] [-p plotfile] f1 f2 ...
The features to consider are specified either as a feature set using the -F option, or as a list of individual features f1, f2, etc. The statistics to report are determined by the flags:
-m: Calculate statistical moments and the median of each feature.
-c: Calculate the correlation coefficient matrix for the features.
-d: Calculate the difference between steganograms and corresponding cover images, reporting the mean and the variance of these differences.
-C: Treat the features as classification scores and compare the error rates.
The -p option requires exactly two features to be specified, and gives a scatter plot in the given file.
The easiest approach is to use the --load-feature-vectors option to the sqladm.py script, using the existing config files as an example. See Creating the database above.
The sqlget.py script allows export of complete feature vectors in libSVM's sparse format.
sqlget.py -T testset -f fv -o outfile
The keys should refer to a TestSet object (testset) and a FeatureVector object (fv).
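Each line of the output holds a class label followed by index:value pairs in libSVM's sparse format; the values below are purely illustrative:
+1 1:0.2173 2:-0.0045 5:1.0000
-1 1:0.1802 3:0.3321 5:0.2500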
This can easily be generalised to output CSV, and to accept ImageSet objects or FeatureSet objects as source.
The sqltset.py script gives a convenient way to do cover selection based on some feature. The syntax is
sqltset.py -s URL -f feature [-m min] [-M max] [-T dest|-c] -S source
Images from the source TestSet are selected if the given feature is at least equal to min and at most equal to max. If either bound is not given, it poses no constraint. If -c is given, the number of eligible images is printed on stdout. If -T is given, a new TestSet dest is created from the eligible images.
For instance, highly textured images can be selected using
sqltset.py -s URL -f "TXB(0)" -m 0.6 -T imageset-textured -S imageset
Note that 0.6 was used with the BOSS set to give a sufficient number of images. If really highly textured images are sought, one may want to consider a higher threshold.
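To merely count the eligible images, the same selection can be run with -c instead of -T:
sqltset.py -s URL -f "TXB(0)" -m 0.6 -c -S imageset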