API tutorial for pysteg.sql

Where examples of Python code are given, they assume that the sql module has been loaded and a connection to the database has been established. This is done with the following commands:

from pysteg.sql import *
sqlConnect()

Creating data sets

Defining images

To get started, the first step is to define the image objects from which we can extract features. The images themselves must be stored in the file system, but the file path and some image characteristics are stored in the database.

The images are organised in image sets, representing different image sources. Some image sets, such as a set of steganograms, would be derived from others, and this relationship is recorded. Note that the concept of image sets is independent of test and training sets to be introduced below.

For instance, two new image sets can be defined with the code below. The name parameter has to be a unique identifier of the image set.

C = ImageSet(
    path="BossBase-1.0-cover",
    fileformat="PBM",
    imgformat="pixmap",
    extension="pgm",
    description="Bossbase version 1.0 (clean images)",
    name="BOSSv1.0" )
S = ImageSet(
    path="BossBase-1.0-hugo-alpha=0.4",
    source=C,
    fileformat="PBM",
    imgformat="pixmap",
    extension="pgm",
    description="Bossbase version 1.0 with hugo embedding as published (alpha=0.4)",
    name="BOSSv1.0hugo0.4" )
submitImages(C)
submitImages(S)

The first assignment defines a set of cover images, recording a number of facts about the images. The second assignment defines a set of steganograms which are linked to the cover images via the source parameter. Each steganogram should have the same filename as the corresponding cover image (in a different directory), to make sure that the cover image can always be retrieved.
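
The naming convention can be illustrated with a small standalone sketch. The helper cover_path_for() and the filename 1.pgm are hypothetical and not part of pysteg; the sketch only shows why a shared filename makes the cover image recoverable from the path of a steganogram.

```python
import os.path

def cover_path_for(stego_path, cover_dir):
    """Return the path of the cover image matching a steganogram.

    Hypothetical helper, not part of pysteg: it relies only on the
    convention that both files share the same filename.
    """
    return os.path.join(cover_dir, os.path.basename(stego_path))

# Directory names taken from the example above; the filename is made up.
print(cover_path_for("BossBase-1.0-hugo-alpha=0.4/1.pgm", "BossBase-1.0-cover"))
```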

The submitImages() function reads the directory listing from the file system and enters all the images found. The extension argument is used to filter the files, so that only files with the given extension are included. This permits additional files such as README or licence texts in the same directory as the images. The extension argument is not mandatory; without it every file in the directory is assumed to be a valid image.
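
The filtering rule can be sketched without the database. The function list_images() below is a hypothetical stand-in, not the code used by submitImages(); it assumes the filter is a plain suffix match on the filename.

```python
import os

def list_images(directory, extension=None):
    """List the files in a directory, optionally keeping one extension only.

    Hypothetical helper (not pysteg's implementation) mirroring the
    documented behaviour: with no extension, every file is accepted.
    """
    names = sorted(os.listdir(directory))
    files = [n for n in names if os.path.isfile(os.path.join(directory, n))]
    if extension is None:
        return files
    suffix = "." + extension.lstrip(".")
    return [n for n in files if n.endswith(suffix)]
```

With extension="pgm", a README in the same directory is skipped; without the argument it would be returned as if it were an image.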

Defining test (and training) sets

Defining test and training sets is easy. Take the following lines as an example:

from pysteg.sql.imageset import makeTestSets

C = ImageSet.byPath( "/work/images/BossBase-1.0-cover" )
S = ImageSet.byPath( "/work/images/BossBase-1.0-hugo-alpha=0.4" )

(T0,T1) = makeTestSets( C, S, "testtest", testsize=10, trainsize=10 )

The image sets to be used as sources can be identified in any way. The two ImageSet.byPath() calls show how to recover them from the database based on a path name. Note that the path name must match exactly, including any trailing slashes.

The last line makes a test set and a training set of ten images each. Half of the images for each set are drawn from each of the source sets C and S. There is an optional parameter skew to change the balance between the classes. If the parameters testsize and trainsize are less than 1, they are interpreted as fractions of the total number of available images.

Each cover image will only be used once within the test and training set, assuming that stego images have the same base filename as the corresponding cover image.
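
The selection rules can be sketched in plain Python. This is not the code of makeTestSets(); the functions resolve_size() and draw_sets() are hypothetical, and the sketch assumes that fractions are rounded to the nearest whole number of images and that the classes are balanced (no skew).

```python
import random

def resolve_size(size, available):
    """Treat a size below 1 as a fraction of the available images."""
    if size < 1:
        return round(size * available)
    return int(size)

def draw_sets(basenames, testsize, trainsize, seed=None):
    """Draw disjoint test and training sets with balanced classes."""
    rng = random.Random(seed)
    n_test = resolve_size(testsize, len(basenames))
    n_train = resolve_size(trainsize, len(basenames))
    # Sampling without replacement guarantees each cover is used once.
    chosen = rng.sample(sorted(basenames), n_test + n_train)

    def label(names):
        # First half is taken as cover images, the rest as steganograms.
        half = len(names) // 2
        return ([(n, "cover") for n in names[:half]] +
                [(n, "stego") for n in names[half:]])

    return label(chosen[:n_test]), label(chosen[n_test:])

# Made-up filenames, for illustration only.
names = ["%d.pgm" % i for i in range(100)]
test, train = draw_sets(names, testsize=10, trainsize=0.1, seed=0)
print(len(test), len(train))
```

Because base filenames are drawn first and the class is assigned afterwards, no image appears as both cover and steganogram, or in both sets.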

Extracting features

The queue system

Feature extraction is most easily done using the queue system. The database includes a queue table where each record identifies an image and a number of feature extraction functions to run. Clients can connect to the DB, retrieve a queue item, and execute the job, which includes adding the newly calculated features to the DB.

The assigned and assignee fields give the date and client ID of assignment. Available jobs have null in those fields. If a client dies for whatever reason, it will leave a queue item with assigned non-null. One can easily identify old, assigned jobs and relaunch them by resetting assigned and assignee to null.
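
Resetting stale jobs could look roughly like the sketch below. It runs against an in-memory SQLite database rather than the pysteg database; the table layout (a table named queue with id and image columns, and assigned stored as an ISO timestamp string) and the function reset_stale_jobs() are assumptions made for the example. Only the field names assigned and assignee come from the description above.

```python
import sqlite3
from datetime import datetime, timedelta

def reset_stale_jobs(conn, max_age):
    """Make jobs assigned longer than max_age ago available again."""
    cutoff = (datetime.now() - max_age).isoformat(timespec="seconds")
    with conn:
        cur = conn.execute(
            "UPDATE queue SET assigned = NULL, assignee = NULL "
            "WHERE assigned IS NOT NULL AND assigned < ?", (cutoff,))
    return cur.rowcount

# Assumed schema and made-up rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE queue (id INTEGER PRIMARY KEY, image TEXT, "
             "assigned TEXT, assignee TEXT)")
old = (datetime.now() - timedelta(days=2)).isoformat(timespec="seconds")
conn.execute("INSERT INTO queue (image, assigned, assignee) "
             "VALUES ('1.pgm', ?, 'worker-a')", (old,))
conn.execute("INSERT INTO queue (image, assigned, assignee) "
             "VALUES ('2.pgm', NULL, NULL)")
print(reset_stale_jobs(conn, timedelta(days=1)))
```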

Adding jobs to the queue

Jobs are queued using the pysteg.svm.extract.queueSet() function; e.g.

queueSet(T,F)

where T is a TestSet object and F is a list of feature sets, given either as FeatureSet objects or as keys. Every image in the set will then be queued with the specified features.

Running clients to process the queue

The queues can be processed using the sqlworker.py script, which loops as long as new jobs are received. It can be called as

sqlworker.py -s URL [-i idstring]

where URL is the URL for the database and the optional idstring identifies the client process. The script calls the pysteg.svm.extract.worker() function, which can also be called directly.
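
The loop performed by a client can be sketched as follows. This is not the code of sqlworker.py or worker(); it uses an in-memory SQLite database, an assumed queue table, and the hypothetical functions claim_job() and run_worker(). It also assumes that a finished job is deleted from the queue, which the description above does not state.

```python
import sqlite3
from datetime import datetime

def claim_job(conn, client_id):
    """Claim one available job; return its id, or None if none was claimed."""
    with conn:
        row = conn.execute(
            "SELECT id FROM queue WHERE assigned IS NULL ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        cur = conn.execute(
            "UPDATE queue SET assigned = ?, assignee = ? "
            "WHERE id = ? AND assigned IS NULL",
            (datetime.now().isoformat(timespec="seconds"), client_id, row[0]))
        # rowcount is 0 if another client claimed the job in the meantime.
        return row[0] if cur.rowcount == 1 else None

def run_worker(conn, client_id, process):
    """Process jobs until none can be claimed; return the number done."""
    done = 0
    while True:
        job_id = claim_job(conn, client_id)
        if job_id is None:
            break
        process(job_id)
        with conn:
            # Assumption: completed jobs are removed from the queue.
            conn.execute("DELETE FROM queue WHERE id = ?", (job_id,))
        done += 1
    return done

# Assumed schema and made-up rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE queue (id INTEGER PRIMARY KEY, image TEXT, "
             "assigned TEXT, assignee TEXT)")
conn.executemany("INSERT INTO queue (image) VALUES (?)",
                 [("1.pgm",), ("2.pgm",), ("3.pgm",)])
processed = []
done = run_worker(conn, "worker-a", processed.append)
print(done, processed)
```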

Classification with SVM

Training

from pysteg.sql.svmodel import *


fs  = FeatureSet( key="SVMTEST", description="Just for testing", matrix=False )
f   = Feature( cat=fs, key="SVMTEST01", description="Just for testing" )
TS  = TestSet.byName( "testtest" )
fv  = FeatureVector.byKey( "SPAM-848" )
mod = SVModel( testset=TS, feature=f, fvector=fv )
mod.train()
mod.saveModel()

The first two lines after the import create a feature set and the feature corresponding to the SVM model. The feature set is necessary because every feature must belong to a FeatureSet object.

Lines 3-4 retrieve the training set and the feature vector to be used, and line 5 instantiates an SVModel object to hold the model. Line 6 trains the model, including scaling of features and a grid search for optimal parameters. Line 7 saves the model to file, using a default file name.

Classification and testing

Classification is straightforward, with a single line to test the classifier on one test set:

predict( mod, "testtest (Test)" )

The second argument could be a TestSet object, a single Image object, or any iterator of Image objects.

The return value is a list of tuples (image, classification score, predicted label, true label), where the true label is None if it is unknown.
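
A list of this shape is easy to post-process. The sketch below computes the accuracy over the images whose true label is known. The function accuracy() is hypothetical, and the sample tuples are made-up placeholders: the real list holds Image objects, and the encoding of labels as 0 and 1 is an assumption.

```python
def accuracy(results):
    """Fraction of correct predictions among images with a known label."""
    known = [(pred, true) for _img, _score, pred, true in results
             if true is not None]
    if not known:
        return None
    return sum(1 for pred, true in known if pred == true) / len(known)

# Made-up placeholder tuples, for illustration only.
sample = [("1.pgm", 0.8, 1, 1), ("2.pgm", -0.3, 0, 1), ("3.pgm", 0.1, 1, None)]
print(accuracy(sample))
```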

The model file

The SVM model, as created by libSVM, is not stored in the database, but in a file whose name is stored in the database. The reason for this is that libSVM uses ctypes, with the type or class for the SVM model implemented in C. It therefore cannot be pickled, and tailor-made methods would be necessary to store the model in the database.
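
The pickling problem can be reproduced with ctypes alone. The structure FakeModel below is a made-up stand-in, not libSVM's model type; the sketch assumes only that the real model, like this one, contains pointer fields.

```python
import ctypes
import pickle

class FakeModel(ctypes.Structure):
    """Stand-in for a C model struct; it is not libSVM's actual type."""
    _fields_ = [("n_sv", ctypes.c_int),
                ("coef", ctypes.POINTER(ctypes.c_double))]

def can_pickle(obj):
    """Return True if pickle can serialise obj, False otherwise."""
    try:
        pickle.dumps(obj)
        return True
    except Exception:
        return False

print(can_pickle({"n_sv": 3}))   # an ordinary Python object
print(can_pickle(FakeModel()))   # a ctypes structure holding a pointer
```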

The model filename is configurable with a full path name as a parameter to the saveModel() method. The default filename is derived from the feature vector key and the name of the training set, and the file goes in the current directory. There is room for improvement here.