Active Appearance Models (AAM)

Active Appearance Models (AAM) are a statistical template matching method in which the variability of shape and texture is captured from a representative training set. Principal Components Analysis (PCA) on the shape and texture data allows building a parametrized face model that describes, with photorealistic quality, both the trained faces and unseen ones.

Shape Model
The shape is defined as the quality of a configuration of points that is invariant under Euclidean similarity transformations. These landmark points are selected to match borders, vertices, profile points, corners or other features that describe the shape. A single $n$-point shape is represented by the $2n$ vector $ \b {x}=\left(x_1,y_1,x_2,y_2, \ldots, x_{n-1},y_{n-1},x_n,y_n \right)^T$. Given $N$ shape annotations, a statistical analysis follows, in which the shapes are first aligned to a common mean shape using a Generalised Procrustes Analysis (GPA) that removes location, scale and rotation effects. Optionally, the shape distribution can be projected into the tangent plane, but omitting this projection leads to very small changes. Applying a Principal Components Analysis (PCA), we can model the statistical variation with

$\displaystyle \b x = \overline{\b x} + \Phi_s \b b_s$
where new shapes, $ \b x$, are synthesised by deforming the mean shape, $ \overline{\b x}$, using a weighted linear combination of eigenvectors of the covariance matrix, $\Phi_s$. $ \b b_s$ is a vector of shape parameters that holds the weights. $\Phi_s$ holds the $t_s$ most important eigenvectors, chosen to explain a user-defined fraction of the variance.
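
The shape model above can be sketched in a few lines of NumPy. This is only a minimal illustration, assuming the shapes have already been aligned (for instance by GPA) and stacked as rows of an $N \times 2n$ array; the names `build_shape_model`, `synthesize_shape` and `variance_kept` are hypothetical and do not come from the original work.

```python
import numpy as np

def build_shape_model(shapes, variance_kept=0.95):
    """shapes: (N, 2n) array of already aligned shape vectors (assumption)."""
    mean = shapes.mean(axis=0)
    centered = shapes - mean
    # The right singular vectors of the centered data are the eigenvectors
    # of the covariance matrix; squared singular values give the eigenvalues.
    _, sing, vt = np.linalg.svd(centered, full_matrices=False)
    eigvals = sing**2 / (shapes.shape[0] - 1)
    # Smallest t_s whose cumulative variance reaches the requested fraction.
    ratio = np.cumsum(eigvals) / eigvals.sum()
    t_s = min(int(np.searchsorted(ratio, variance_kept)) + 1, eigvals.size)
    return mean, vt[:t_s].T, eigvals[:t_s]

def synthesize_shape(mean, phi_s, b_s):
    """x = mean + Phi_s b_s."""
    return mean + phi_s @ b_s

def project_shape(mean, phi_s, x):
    """b_s = Phi_s^T (x - mean), valid because Phi_s has orthonormal columns."""
    return phi_s.T @ (x - mean)
```

Setting a single entry of `b_s` to a few multiples of the square root of its eigenvalue and calling `synthesize_shape` is a common way to visualise one mode of variation.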

Texture Model
For $ m$ sampled pixels, the texture is represented by the vector $ \b g = [g_1,g_2, \ldots, g_{m-1}, g_m]^T$. Building a statistical texture model requires warping each training image so that its control points match those of the mean shape. The texture mapping uses a piece-wise affine warp, i.e. the convex hull of the mean shape is partitioned into a set of triangles using the Delaunay triangulation. Each pixel inside a triangle is mapped into the corresponding triangle in the mean shape using barycentric coordinates, see Figure 1. To prevent holes, the mapping is performed in reverse (from destination to source) with bilinear interpolation.
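
As a rough illustration of the reverse mapping for a single triangle, the sketch below computes the barycentric coordinates of a pixel in the mean-shape triangle, maps it to the matching triangle in the training image and samples that image bilinearly. The function names are hypothetical, the image is assumed to be a single-channel array indexed as `image[y, x]`, and the Delaunay triangulation itself is assumed to be available already.

```python
import numpy as np

def barycentric_coords(p, tri):
    """Barycentric coordinates of the 2D point p in the triangle tri (3x2)."""
    a, b, c = tri
    # Solve p = a + beta * (b - a) + gamma * (c - a) for beta and gamma.
    m = np.column_stack((b - a, c - a))
    beta, gamma = np.linalg.solve(m, p - a)
    return np.array([1.0 - beta - gamma, beta, gamma])

def is_inside(weights, eps=1e-12):
    """A point lies inside the triangle when no coordinate is negative."""
    return bool(np.all(weights >= -eps))

def map_point(p, tri_from, tri_to):
    """Carry p from tri_from to tri_to, keeping its barycentric coordinates."""
    return barycentric_coords(p, tri_from) @ tri_to

def bilinear_sample(image, x, y):
    """Sample image[y, x] at a non-integer position (x, y)."""
    h, w = image.shape[:2]
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    fx, fy = x - x0, y - y0
    top = (1 - fx) * image[y0, x0] + fx * image[y0, x1]
    bottom = (1 - fx) * image[y1, x0] + fx * image[y1, x1]
    return (1 - fy) * top + fy * bottom
```

For each pixel of the mean-shape frame, one would call `map_point` with the mean-shape triangle as `tri_from` and the annotated triangle as `tri_to`, then `bilinear_sample` at the resulting position; iterating over destination pixels is what avoids holes.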

Figure 1: Texture mapping example (panels: warping source, warping destination).
This procedure removes texture differences due to shape changes, establishing a common texture reference frame. The effects of illumination differences are reduced by performing a histogram equalization independently on each of the three color channels. A texture model can then be obtained by applying a low-memory PCA to the normalized textures,
$\displaystyle \b g = \overline{\b g} + \Phi_g \b b_g$
where $ \b g$ is the synthesized texture, $ \overline{\b g}$ is the mean texture, $ \Phi_g$ contains the $ t_g$ texture eigenvectors with the largest eigenvalues and $ \b b_g$ is a vector of texture parameters.
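
The text does not spell out what the low-memory PCA involves. A common reading, assumed here, is the trick of decomposing the small $N \times N$ Gram matrix of the centered textures instead of the $m \times m$ covariance matrix, which pays off when the number of training images $N$ is much smaller than the number of pixels $m$. The sketch below follows that assumption; `low_memory_pca` is a hypothetical name and `t_g` must not exceed $N-1$.

```python
import numpy as np

def low_memory_pca(textures, t_g):
    """textures: (N, m) normalized texture vectors with N << m (assumption)."""
    n_samples = textures.shape[0]
    mean = textures.mean(axis=0)
    d = textures - mean                        # (N, m) centered data
    gram = d @ d.T / (n_samples - 1)           # (N, N) instead of (m, m)
    eigvals, eigvecs = np.linalg.eigh(gram)    # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:t_g]    # keep the t_g largest
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # If (D D^T) v = lambda v, then D^T v is an eigenvector of D^T D
    # with the same eigenvalue; normalise it to unit length.
    phi_g = d.T @ eigvecs
    phi_g /= np.linalg.norm(phi_g, axis=0)
    return mean, phi_g, eigvals
```

The non-zero eigenvalues returned this way coincide with those of the full covariance matrix, so $ \Phi_g$ and the texture parameters $ \b b_g = \Phi_g^T (\b g - \overline{\b g})$ are unchanged by the shortcut.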

Combined Model
The shape and texture of any training example are described by the parameters $ \b b_s$ and $ \b b_g$. To remove correlations between shape and texture model parameters, a third PCA is performed on the following data, $ \b b = \left( \begin{array}{c} \b W_s \b b_s\\ \b b_g \end{array} \right) = \left( \begin{array}{c} \b W_s \Phi_s^T (\b x - \overline{\b x})\\ \Phi_g^T (\b g - \overline{\b g})\\ \end{array} \right)$, where $ \b W_s$ is a diagonal matrix of weights that compensates for the difference in units between shape and texture parameters. A simple estimate for $ \b W_s$ is to weight uniformly with the ratio, $ r$, of the total variance in texture and shape, i.e. $ r = \sum_i \lambda_{gi} / \sum_i \lambda_{si}$, where $ \lambda_{si}$ and $ \lambda_{gi}$ are the shape and texture eigenvalues, respectively. Then $ \b W_s=r \b I$. Applying PCA again, $ \Phi_c$ holds the $ t_c$ eigenvectors with the largest eigenvalues, and we obtain the combined model, $ \b b = \Phi_c \b c$. Due to the linear nature of the model, it is possible to express shape, $ \b x$, and texture, $ \b g$, using the combined model by

$\displaystyle \b x = \overline{\b x} + \Phi_s \b W_s^{-1} \Phi_{c,s} \b c$
$\displaystyle \b g = \overline{\b g} + \Phi_g \Phi_{c,g} \b c$
where $ \Phi_c = \left(\begin{array}{c}\Phi_{c,s}\\\Phi_{c,g}\\\end{array} \right)$ and $ \b c$ is a vector of appearance parameters controlling both shape and texture. An AAM instance is built by generating the texture in the normalized frame using the texture equation above and warping it to the control points given by the shape equation above.
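
The construction of the combined model can be sketched as follows, assuming the shape and texture parameters of the $N$ training examples are stored as rows of `b_s` and `b_g`. All function names are hypothetical, and the ratio $ r$ is computed here from the retained eigenvalues only, which is a simplifying assumption.

```python
import numpy as np

def pca(data, t):
    """Plain PCA helper: returns mean, t leading eigenvectors, eigenvalues."""
    mean = data.mean(axis=0)
    _, sing, vt = np.linalg.svd(data - mean, full_matrices=False)
    return mean, vt[:t].T, sing[:t] ** 2 / (data.shape[0] - 1)

def build_combined_model(b_s, b_g, lam_s, lam_g, t_c):
    """b_s: (N, t_s) and b_g: (N, t_g) parameters of the training examples."""
    r = lam_g.sum() / lam_s.sum()        # W_s = r * I
    b = np.hstack((r * b_s, b_g))        # one row per example: (W_s b_s, b_g)
    # b has zero mean when b_s and b_g were computed on the training set.
    _, phi_c, _ = pca(b, t_c)
    t_s = b_s.shape[1]
    return r, phi_c[:t_s], phi_c[t_s:]   # r, Phi_{c,s}, Phi_{c,g}

def synthesize(c, x_mean, g_mean, phi_s, phi_g, r, phi_cs, phi_cg):
    """x = x_mean + Phi_s W_s^{-1} Phi_{c,s} c,  g = g_mean + Phi_g Phi_{c,g} c."""
    x = x_mean + phi_s @ (phi_cs @ c) / r
    g = g_mean + phi_g @ (phi_cg @ c)
    return x, g
```

Because $ \b W_s = r \b I$, applying $ \b W_s^{-1}$ reduces to a division by $ r$; a general diagonal $ \b W_s$ would need an element-wise division instead.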

Model Training
An AAM search seeks to minimize the texture difference between a model instance and the part of the target image that it covers. It can be treated as an optimization problem, $ \arg\min_{\b c} \vert\b I_{image}-\b I_{model}\vert^2$, solved by updating the appearance parameters $ \b c$ and the pose. This nonlinear problem can be solved by learning offline how the model behaves under parameter changes and how those changes relate to the texture residual. Additionally, similarity parameters are considered to represent the 2D pose, $ \b t = (s_x,s_y,t_x,t_y)^T$. To maintain linearity, and so that zero parameter values represent no change in pose, these parameters are redefined as $ s_x=(s\cos(\theta)-1)$, $ s_y=s\sin(\theta)$, which represent a combined scale, $ s$, and rotation, $ \theta$, while the remaining parameters $ t_x$ and $ t_y$ are translations. The complete model parameters also include pose, $ \b p = (\b c^T\vert \b t^T)^T$. The initial AAM formulation uses the Multivariate Linear Regression (MLR) approach over the set of training texture residuals, $ \delta \b g$, and the corresponding model perturbations, $ \delta \b p$. Assuming that the relation between texture difference and model parameter update is locally linear, the goal is to get the optimal prediction matrix, $ \b R$, in the least squares sense, satisfying the linear relation $ \delta \b p = \b R \delta \b g$. Solving it involves performing a set of experiments, building huge residual matrices and performing MLR on these. It was suggested that the appearance parameters, $ \b c_i$, should be perturbed by about $ \pm0.25\sigma_i$ and $ \pm0.5\sigma_i$, scale by around 90% and 110%, rotation by $ \pm 5^\circ$ and $ \pm 10^\circ$, and translations by $ \pm5\%$ and $ \pm10\%$, all with respect to the reference mean frame. The MLR was later replaced by a simpler approach, computing the gradient matrix, $ \dfrac{\partial \b r}{\partial \b p}$, which requires much less memory and computational effort.
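
The pose parametrization above can be illustrated with a small sketch that converts a scale and rotation into $ (s_x, s_y)$ and applies the resulting similarity transform to a shape vector. The function names are hypothetical; the shape is assumed to be stored as the interleaved vector $ (x_1,y_1,\ldots,x_n,y_n)^T$ used earlier.

```python
import numpy as np

def pose_from_scale_rotation(s, theta, tx=0.0, ty=0.0):
    """t = (s_x, s_y, t_x, t_y) with s_x = s cos(theta) - 1, s_y = s sin(theta)."""
    return np.array([s * np.cos(theta) - 1.0, s * np.sin(theta), tx, ty])

def apply_pose(shape, t):
    """Apply the similarity transform encoded by t to an interleaved shape."""
    sx, sy, tx, ty = t
    pts = shape.reshape(-1, 2)
    # With s_x = s_y = 0 this matrix is the identity, so t = 0 is "no change".
    a = np.array([[1.0 + sx, -sy],
                  [sy, 1.0 + sx]])
    return (pts @ a.T + np.array([tx, ty])).ravel()
```

For example, `pose_from_scale_rotation(1.0, 0.0)` gives the zero vector, which `apply_pose` treats as the identity transform.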
The texture residual vector is defined as $ \b r(\b p) = \b g_{image}(\b p) - \b g_{model}(\b p)$, and the goal is to find the model parameter update that minimizes $ \vert\b r(\b p)\vert^2=\b r^T \b r$. Expanding the texture residuals, $ \b r(\b p)$, in a Taylor series around $ \b p$ and keeping the first order terms gives $ \b r(\b p + \delta \b p) \approx \b r(\b p) + \b J \delta \b p$, where $ \b J=\dfrac{\partial \b r (\b p)} {\partial \b p}$ is the Jacobian matrix. Differentiating $ \vert\b r(\b p + \delta \b p)\vert^2$ with respect to $ \delta \b p$ and setting the result to zero leads to $ \delta \b p = - (\b J^T \b J)^{-1} \b J^T \b r$. Normally, descent approaches of this kind require the Jacobian to be evaluated at each iteration. Since the AAM framework works on a normalized reference frame, the Jacobian matrix can be considered fixed over the training set and can be estimated once, in the training phase.
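
One way to estimate the fixed Jacobian is by numeric differentiation, displacing each parameter in turn and measuring the change in the residual. The sketch below does this with central differences for a generic `residual(p)` callable, which stands in for the real texture sampling and is an assumption, as are the function names. In practice the estimate would be averaged over the training images and over several displacement sizes such as those listed above.

```python
import numpy as np

def estimate_jacobian(residual, p0, deltas):
    """Central-difference estimate of J = dr/dp around p0.

    residual: callable returning the residual vector r(p) (hypothetical).
    deltas:   displacement used for each parameter.
    """
    p0 = np.asarray(p0, dtype=float)
    r0 = residual(p0)
    jac = np.zeros((r0.size, p0.size))
    for j, h in enumerate(deltas):
        step = np.zeros(p0.size)
        step[j] = h
        jac[:, j] = (residual(p0 + step) - residual(p0 - step)) / (2.0 * h)
    return jac

def update_matrix(jac):
    """R = -(J^T J)^{-1} J^T, so that delta_p = R r.

    The pseudo-inverse equals (J^T J)^{-1} J^T when J has full column rank.
    """
    return -np.linalg.pinv(jac)
```

Since the matrix returned by `update_matrix` is computed once, each search iteration then costs only one matrix-vector product.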

Model Fitting
The model parameters are updated from the texture residuals by

$\displaystyle \b p_k = \b p_{k-1} - \alpha (\b J^T \b J)^{-1} \b J^T \delta \b g$
which is a damped Gauss-Newton modification of steepest descent methods, where $ \b J$ is the Jacobian matrix and $ \alpha$ is the damping factor. Starting with a given estimate for the model, $ \b p_0$, and a rough estimate of the location of the face, an AAM model can be fitted following Algorithm 1. The better the initial estimate, the lower the risk of being trapped in a local minimum; in this work, the AdaBoost method is used for this purpose. Figure 3 shows a successful AAM search.

Figure 3: Iterative model refinement (panels: iterations 1, 2, 3, 5, 8 and 15, final result, and original image).
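
The damped update of the fitting stage can be sketched as the loop below. It is not the Algorithm 1 referred to above, only a common variant assumed for illustration: the full step is tried first, then progressively smaller damping factors, and a step is accepted as soon as it lowers the residual energy. The `residual` callable and all names are hypothetical.

```python
import numpy as np

def fit(residual, p0, jac, alphas=(1.0, 0.5, 0.25, 0.125),
        max_iter=30, tol=1e-10):
    """Damped Gauss-Newton search with a fixed, precomputed Jacobian."""
    pinv = np.linalg.pinv(jac)           # (J^T J)^{-1} J^T, computed once
    p = np.asarray(p0, dtype=float)
    r = residual(p)
    err = r @ r
    for _ in range(max_iter):
        step = pinv @ r
        for alpha in alphas:             # try decreasing damping factors
            p_new = p - alpha * step
            r_new = residual(p_new)
            err_new = r_new @ r_new
            if err_new < err:
                break                    # accept the first improving step
        else:
            return p, err                # no factor improved the error
        converged = err - err_new < tol
        p, r, err = p_new, r_new, err_new
        if converged:
            break
    return p, err
```

On a real image, `residual` would warp the image into the normalized frame at the current parameters and subtract the model texture, so the quality of `p0` decides whether the loop reaches the right minimum.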