PERMM: Image Analysis
Project Goals
The primary goal of the image analysis within the PERMM project is to make still images retrievable by queries specifying visual content. To fulfil this goal the image analysis must perform the following subtasks:
- Image segmentation: break an image into consistent regions, with the region boundaries broadly similar to where a person might believe a boundary to be. The region boundaries will necessarily reflect changes in image properties rather than corresponding to real-world object boundaries.
- Region description: the machine must be provided with a rich set of descriptors covering each region's colour, texture, shading, shape and context.
- Semantic classification of region content: the region descriptors above are passed to a series of classifiers which are able to recognise categories of stuff, e.g. skin, sky or grass.
The full details are given in the technical report Voronoi seeded colour image segmentation (tr.1999.3), but the broad sweep of the algorithms is outlined below.
Image Segmentation
The criterion for region boundary selection in an image retrieval project must be intuitively obvious to a user, otherwise the machine's perception of an image will be fundamentally at odds with what the user expects. The segmentation scheme developed here satisfies this criterion and functions in the following way (the stages of an example image segmentation are given for clarity).
Example image
- Colour edge detection. A full 3-colour Canny-type edge detector is used here so that boundaries with a colour change but no brightness change are not missed.
Colour derivative
Resulting edge map
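The project's detector is a full Canny-type operator with the usual non-maximum suppression and hysteresis; the sketch below is not that detector, only a minimal illustration of the underlying idea of summing gradient energy across all three colour channels so that a purely chromatic boundary still produces an edge response. The use of Sobel derivatives and the threshold value are illustrative assumptions.

```python
import numpy as np
from scipy import ndimage

def colour_edge_map(rgb, threshold=0.1):
    """Combine per-channel gradients so that boundaries with a colour change
    but no brightness change still produce edges. rgb is an HxWx3 array."""
    rgb = rgb.astype(float) / 255.0
    grad_sq = np.zeros(rgb.shape[:2])
    for c in range(3):
        gx = ndimage.sobel(rgb[:, :, c], axis=1)   # horizontal derivative
        gy = ndimage.sobel(rgb[:, :, c], axis=0)   # vertical derivative
        grad_sq += gx ** 2 + gy ** 2
    magnitude = np.sqrt(grad_sq)
    return magnitude > threshold                   # boolean edge map
```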
- Distance transform of the edge image. The peaks in the distance transform are used as seed points for region growing. This transformation is also referred to as a Voronoi transform.
This means of generating seed points for region growing ensures that the density of seed points reflects the local density of image edges.
Distance transform or Voronoi image; the darker a pixel, the further it is from its closest image edge
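A minimal sketch of the seed-point step, assuming the edge map is a boolean array: the distance transform is taken over the non-edge pixels, and its local maxima (pixels furthest from any edge in their neighbourhood) become seeds. The min_distance parameter is an illustrative assumption, not a value from the project.

```python
import numpy as np
from scipy import ndimage

def voronoi_seed_points(edge_map, min_distance=5):
    """Seed points are peaks of the distance transform of the edge map,
    so seed density follows the local density of image edges."""
    # Distance from each pixel to the nearest edge pixel.
    dist = ndimage.distance_transform_edt(~edge_map)
    # A pixel is a seed if it equals the maximum of its neighbourhood
    # and is reasonably far from any edge.
    local_max = ndimage.maximum_filter(dist, size=2 * min_distance + 1)
    seeds = (dist == local_max) & (dist > min_distance)
    return np.argwhere(seeds)   # (row, col) seed coordinates
```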
- Region growing. Regions are grown out from seed points. If a pixel touching the boundary of a region is close in colour both to the adjacent boundary pixel and to the mean colour of the region, it is accepted as belonging to that region.
An initial set of regions is grown from Voronoi seed points.
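The exact acceptance test and colour tolerance used in the project are not given here, so the following region-growing sketch should be read as one plausible realisation of the rule stated above: a neighbouring pixel joins the region if it is close in colour both to the pixel it touches and to the running mean colour of the region.

```python
import numpy as np
from collections import deque

def grow_region(image, seed, labels, label, colour_tol=20.0):
    """Grow one region from a seed point. labels is an int array of zeros
    (unlabelled) into which the region's label is written."""
    h, w, _ = image.shape
    queue = deque([tuple(seed)])
    labels[tuple(seed)] = label
    mean = image[tuple(seed)].astype(float)
    count = 1
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < h and 0 <= nc < w and labels[nr, nc] == 0:
                pixel = image[nr, nc].astype(float)
                close_to_neighbour = np.linalg.norm(pixel - image[r, c]) < colour_tol
                close_to_mean = np.linalg.norm(pixel - mean) < colour_tol
                if close_to_neighbour and close_to_mean:
                    labels[nr, nc] = label
                    mean = (mean * count + pixel) / (count + 1)  # update running mean
                    count += 1
                    queue.append((nr, nc))
    return labels
```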
- Initial regions are morphed out to image edges. This helps region boundaries to accurately follow the image edge map.
The initial set of regions are grown out to the image edges
- Touching regions of similar colour are merged.
Initial colour image segmentation
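A hedged sketch of the merging step, using a union-find structure over adjacencies built from horizontally and vertically touching pixel pairs; the colour tolerance and the use of per-region mean colours are assumptions for illustration.

```python
import numpy as np

def merge_similar_regions(labels, image, colour_tol=15.0):
    """Merge touching regions whose mean colours are close."""
    n = int(labels.max()) + 1
    parent = list(range(n))

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    # Mean colour of each region (label 0 is treated as unlabelled).
    means = np.zeros((n, 3))
    for lbl in range(1, n):
        mask = labels == lbl
        if mask.any():
            means[lbl] = image[mask].mean(axis=0)

    # Examine horizontally and vertically adjacent pixel pairs.
    for a, b in [(labels[:, :-1], labels[:, 1:]), (labels[:-1, :], labels[1:, :])]:
        touching = np.unique(np.stack([a.ravel(), b.ravel()], axis=1), axis=0)
        for la, lb in touching:
            if la != lb and la > 0 and lb > 0:
                ra, rb = find(la), find(lb)
                if ra != rb and np.linalg.norm(means[ra] - means[rb]) < colour_tol:
                    parent[rb] = ra
    # Relabel every pixel with the root of its region.
    return np.array([find(l) for l in range(n)])[labels]
```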
- Texture model. The texture model is based on a local ridge operator.
The texture features correspond to small, oriented, coloured, bar-like features. Each bar-like feature contains roughly 20 pixels from the ridge-map produced by the ridge operator.
Image containing textured regions (the cats).
Oriented bar like texture features.
- Texture features are clustered according to colour and orientation.
Clusters of texture features.
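The page does not say which clustering algorithm is used, so the sketch below stands in with k-means; the one detail it is careful about is that bar orientation is axial (a bar at θ and at θ+180° is the same bar), so orientation enters the feature vector as (cos 2θ, sin 2θ). The number of clusters and the orientation weighting are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_texture_features(colours, orientations, n_clusters=8, orientation_weight=50.0):
    """Cluster bar-like texture features on colour plus orientation.
    colours: (N, 3) array; orientations: (N,) array of angles in degrees."""
    theta = np.radians(orientations)
    # Doubling the angle makes theta and theta + 180 degrees map to the same point.
    orient_xy = np.stack([np.cos(2 * theta), np.sin(2 * theta)], axis=1)
    features = np.hstack([colours, orientation_weight * orient_xy])
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(features)
```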
- Textured regions are found by unifying Voronoi regions which share the same dominant texture feature cluster.
Textured regions.
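A short sketch of the unification step, under the assumption that each texture feature carries an image position and a cluster label: each Voronoi region is assigned the cluster that occurs most often inside it, and regions sharing a dominant cluster can then be merged into a single textured region.

```python
import numpy as np

def dominant_cluster_per_region(labels, feature_positions, feature_clusters):
    """Return a mapping from region label to its most frequent texture cluster;
    regions that share a dominant cluster are candidates for unification."""
    per_region = {}
    for (r, c), cluster in zip(feature_positions, feature_clusters):
        lbl = int(labels[r, c])
        per_region.setdefault(lbl, []).append(cluster)
    return {lbl: max(set(cs), key=cs.count) for lbl, cs in per_region.items()}
```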
Region Content Description
It remains an open problem how to describe the visual content of a region. In this project we have adopted the stance that large banks of filters at various scales and orientations are not appropriate for our needs. Such filters perform well in modelling the power spectrum of large uniformly textured areas; they fail, however, when their footprint lies largely outside the region they are meant to be describing. They also fail to give a straightforward mapping from intuitively salient region properties to an associated parametric description. The image region property descriptors adopted in this project are:
- Colour is represented by a 2600-element colour histogram encoding hue more finely than saturation or brightness values.
- Texture is represented by two histograms: one of the orientation of the local texture features, and the other (for recognising very specific textures) a pairwise geometric histogram of the relative orientation of pairs of texture features against the separation of the features.
- Shading is encoded in a series of 4 histograms recording the width of the isobrightness contours versus the brightness transition type at each side of the isobrightness bands within a region. Four orientations of transition are considered: vertical, horizontal, top-left to bottom-right, and bottom-left to top-right. This implicitly encodes shading gradient, and consequently light source orientation, as well as aspects of surface shape. It should be noted that if one wanted to explicitly solve the shape-from-shading problem, the starting point would be the isobrightness constraint on surface normal direction relative to illumination direction.
- Shape is encoded as a histogram of boundary straight-line length versus orientation. This is not a particularly good shape encoding, but qualitative description of shape remains an open problem.
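The page gives the size of the colour descriptor (2600 bins, with hue quantised more finely than saturation or brightness) but not its exact layout; the sketch below assumes an illustrative 26 x 10 x 10 joint HSV histogram, which matches both the bin count and the finer hue quantisation.

```python
import numpy as np
import colorsys

def colour_histogram(rgb_pixels, hue_bins=26, sat_bins=10, val_bins=10):
    """Illustrative 2600-element colour descriptor for a region.
    rgb_pixels is an iterable of (r, g, b) values in 0..255."""
    hist = np.zeros((hue_bins, sat_bins, val_bins))
    for r, g, b in rgb_pixels:
        h, s, v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
        hi = min(int(h * hue_bins), hue_bins - 1)
        si = min(int(s * sat_bins), sat_bins - 1)
        vi = min(int(v * val_bins), val_bins - 1)
        hist[hi, si, vi] += 1
    total = hist.sum()
    return (hist / total).ravel() if total else hist.ravel()
```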
Semantic Classification of Region Content
The region content descriptors outlined above are passed to a series of trained classifiers, each of which is capable of recognising a single class of stuff. To date classifiers have been built and trained to recognise the following classes of stuff: skin, sky, grass, trees, water, sand, brick, cloth, interior wall, cloud, snow, tarmac and wood. The classifiers are all Multi-Layer Perceptrons and were trained using the Stuttgart Neural Network Simulator package. The classifiers take as input varying subsets of the image description parameters outlined above; for example, the classifiers for recognising interior wall and cloth have a reduced colour model which encodes only brightness and saturation, not hue. In general the classifiers have proven to be between 85 and 95 percent accurate on ordinary home photographs.
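The project's classifiers were built with the Stuttgart Neural Network Simulator; the sketch below is not that toolkit but a stand-in using scikit-learn's MLPClassifier, included only to show the one-binary-classifier-per-stuff-class arrangement. The file names and network size are hypothetical.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Hypothetical training data: each row of X is a concatenated region descriptor
# (colour, texture, shading and shape histograms) for a labelled example region;
# y is 1 where the region really is, say, "sky" and 0 otherwise.
X = np.load("region_descriptors.npy")   # assumed file, shape (n_regions, n_features)
y = np.load("region_is_sky.npy")        # assumed file, shape (n_regions,)

# One binary MLP per class of stuff, mirroring the one-classifier-per-class design.
sky_classifier = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500)
sky_classifier.fit(X, y)

# At retrieval time every segmented region is scored by every stuff classifier.
probabilities = sky_classifier.predict_proba(X[:5])[:, 1]
```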
Active Research
The area of image analysis currently being actively pursued is the classification of spatial patterns with a view to face spotting. Existing successful face spotting algorithms rely on multi-scale, multi-orientation correlation template matching, although the template may be implicitly encoded in an MLP-type neural net or in the bowels of a Support Vector Machine. The alternative approach being pursued here relies on identifying small oriented sub-features and grouping these into larger features before reaching for the MLP, as sketched below. This 'pre-filtering' to provide candidate regions of interest greatly reduces the computational load. Work is also in progress to improve the accuracy of the stuff classifiers.
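The grouping of small oriented sub-features into candidate regions is described only at this level of detail, so the following is a rough sketch of one way such pre-filtering could look: sub-feature positions are grouped by spatial proximity, and only groups with enough members are handed on to the MLP. The distance threshold and minimum group size are assumptions.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

def candidate_regions(feature_points, max_gap=15.0, min_features=5):
    """Group nearby oriented sub-features; only groups large enough to plausibly
    form a face-like pattern are passed to the expensive MLP stage."""
    feature_points = np.asarray(feature_points, dtype=float)   # (N, 2) positions
    if len(feature_points) < 2:
        return []
    # Single-linkage clustering joins features closer than max_gap pixels.
    groups = fcluster(linkage(feature_points, method="single"),
                      t=max_gap, criterion="distance")
    boxes = []
    for g in np.unique(groups):
        pts = feature_points[groups == g]
        if len(pts) >= min_features:
            boxes.append((pts.min(axis=0), pts.max(axis=0)))   # bounding box corners
    return boxes
```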
Copyright © 2002 AT&T Laboratories Cambridge