Vector quantization (VQ)#
Suppose you have recorded sounds at different locations and want to
categorize them into similar groups. In other words, you have a
stochastic vector $x$ and you want to assign it to one of $K$ categories, each represented by a codebook vector $c_k$. The natural choice is the category whose codebook vector is closest to $x$,

$$ k^* = \arg\min_k \|x - c_k\|^2. $$

The above expression thus calculates the squared error between the input vector $x$ and each of the codebook vectors $c_k$, and assigns $x$ to the codebook entry with the smallest error.
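As a minimal sketch of this quantization rule (assuming NumPy; the function name and the example data are illustrative, not from any particular library):

```python
import numpy as np

def quantize(x, codebook):
    """Return the index of the codebook vector nearest to x (in squared error)."""
    # Squared distance from x to every code vector c_k.
    errors = ((codebook - x) ** 2).sum(axis=1)
    return int(errors.argmin())

# Hypothetical usage: a 16-entry codebook for 2-dimensional data.
codebook = np.random.randn(16, 2)
x = np.random.randn(2)
k = quantize(x, codebook)   # index k* that would be stored or transmitted
x_hat = codebook[k]         # the quantized (reconstructed) vector
```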
An illustration of a simple vector codebook is shown on the right. The
input data follows a Gaussian distribution, shown with grey dots, and the
codebook vectors are marked on top of the distribution.
Example of a codebook for a 2D Gaussian with 16 code vectors.
Metric for codebook quality#
Suppose then that you have a large collection of
vectors $x_h$, with $h = 1, \dots, H$, and you want to find a codebook $\{c_1, \dots, c_K\}$ which describes the data well. The quantization error of a single vector is

$$ e(x_h) = \min_k \|x_h - c_k\|^2, $$

where the minimum is taken over all entries of the codebook.
To find the best set of codebook vectors $c_k$, we therefore minimize the expected quantization error,

$$ \min_{c_1,\dots,c_K}\, E\left[\min_k \|x - c_k\|^2\right], $$

or more specifically, for a dataset, as

$$ \min_{c_1,\dots,c_K}\, \sum_{h=1}^{H} \min_k \|x_h - c_k\|^2. $$
Unfortunately, we do not have an analytic solution for this optimization problem, but must resort to numerical, iterative methods.
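The objective itself is straightforward to evaluate for a given codebook. A minimal sketch, assuming NumPy and the notation above, with the data vectors $x_h$ stored as rows of an array (the function name is illustrative):

```python
import numpy as np

def codebook_error(data, codebook):
    """Mean squared quantization error of a dataset for a given codebook.

    data     : (H, N) array of data vectors x_h
    codebook : (K, N) array of code vectors c_k
    """
    # Pairwise squared distances between every x_h and every c_k.
    distances = ((data[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    # Each vector is represented by its nearest code vector.
    return distances.min(axis=1).mean()
```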
Codebook optimization#
Expectation maximization (EM)#
Classical methods for finding the best codebook are variants of expectation maximization (EM), which is based on two alternating steps:

Expectation Maximization (EM) algorithm:

1. For every vector $x_h$ in a large database, find the best codebook vector $c_k$.
2. For every codebook vector $c_k$:
   1. Find all vectors $x_h$ assigned to that codevector.
   2. Calculate the mean of those vectors.
   3. Assign the mean as a new value for the codevector.
3. If converged then stop, otherwise go to 1.
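A minimal sketch of this iteration in NumPy (the function name `train_codebook` and the random initialization are illustrative choices, not prescribed by the algorithm):

```python
import numpy as np

def train_codebook(data, codebook_size, iterations=100, rng=None):
    """Optimize a VQ codebook with the EM iteration listed above.

    data          : (H, N) array of training vectors x_h
    codebook_size : number of code vectors K
    """
    rng = np.random.default_rng(rng)
    # Initialize with randomly chosen training vectors.
    codebook = data[rng.choice(len(data), codebook_size, replace=False)].copy()

    for _ in range(iterations):
        # Step 1: assign every data vector to its nearest code vector.
        distances = ((data[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
        assignment = distances.argmin(axis=1)

        # Step 2: move each code vector to the mean of its assigned vectors.
        new_codebook = codebook.copy()
        for k in range(codebook_size):
            members = data[assignment == k]
            if len(members) > 0:
                new_codebook[k] = members.mean(axis=0)

        # Step 3: stop once the codebook no longer changes.
        if np.allclose(new_codebook, codebook):
            break
        codebook = new_codebook

    return codebook
```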
This algorithm is guaranteed to give, at every step, a codebook which is
not worse than the previous one. That is, each iteration improves the
codebook until it reaches a local minimum, where it stops changing. The
reason is that each step of the iteration finds a partial optimum: in the
first step, we find the best matching codebook vector for each data
vector $x_h$, which cannot increase the error, and in the second step, the
mean of the vectors assigned to a codevector is exactly the point which
minimizes their average squared error, so updating the codevector to that
mean cannot increase the error either.
As noted above, this algorithm is the basis of most vector quantization codebook optimization algorithms. There are multiple reasons why this simple algorithm alone is usually not sufficient. Most importantly, it is slow to converge to a stable solution and it often finds a local minimum instead of the global minimum.
To improve performance, we can apply several heuristic approaches. For
example, we can start with a small codebook, say a single codevector at
the mean of the data, and repeatedly split every codevector into two
slightly perturbed copies, re-optimizing with the EM iteration after each
split until the desired codebook size is reached (a code sketch of this
splitting approach is given below).
The advantage of this approach is that it focuses attention on the big
bulk of the data points, such that the coarse structure of the
distribution is modelled before its finer details.
Conversely, we can start with a large codebook, say, treat the whole
input database as an initial codebook, and then iteratively merge or
remove codevectors which lie close to each other until the desired
codebook size is reached.
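A sketch of the splitting heuristic (in the spirit of the Linde-Buzo-Gray algorithm, assuming NumPy; the refinement step repeats the EM iteration so that the snippet stands alone, and the codebook size doubles at each split):

```python
import numpy as np

def refine(data, codebook, iterations=20):
    """A few EM refinement passes, as in the algorithm above."""
    for _ in range(iterations):
        distances = ((data[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
        assignment = distances.argmin(axis=1)
        for k in range(len(codebook)):
            members = data[assignment == k]
            if len(members) > 0:
                codebook[k] = members.mean(axis=0)
    return codebook

def grow_codebook(data, target_size, epsilon=1e-3, rng=None):
    """Start from the data mean and repeatedly split every code vector into
    two slightly perturbed copies, re-optimizing after each split."""
    rng = np.random.default_rng(rng)
    codebook = data.mean(axis=0, keepdims=True)
    while len(codebook) < target_size:
        # Split: each code vector becomes two nearby vectors.
        perturbation = epsilon * rng.standard_normal(codebook.shape)
        codebook = np.vstack([codebook - perturbation, codebook + perturbation])
        # Re-optimize the enlarged codebook with the EM iteration.
        codebook = refine(data, codebook)
    return codebook
```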
In any case, optimization of vector codebooks is a difficult task and we have no practical algorithms which are guaranteed to find the global optimum. As in many other machine learning problems, optimizing the codebook is very much about getting to know your data. You should first use one algorithm and then analyse the output to find out what can be improved, and keep repeating this optimize-and-analyse process until the output is sufficiently good.
Optimization with machine learning platforms#
A modern approach to modelling is machine learning, where complex phenomena are modelled with neural networks. Typically these are trained with gradient-descent-type methods, where parameters are iteratively nudged towards the minimum by following the steepest gradient. Since such gradients can be derived automatically on machine learning platforms (using the chain rule), these methods can be applied to very complex models. Consequently, they have become very popular and successful.
The same type of training can be readily applied to vector quantizers as well. However, there is a practical problem with this approach. Estimation of the gradients of the parameters with the chain rule requires that all intermediate gradients are non-zero. Quantizers, however, are piecewise constant, such that their gradients are uniformly zero, which disables the chain rule and gradient descent for all parameters that lie behind the quantizer in the computational graph. A practical solution is known as pass-through (often called the straight-through estimator), where gradients are passed unchanged through the quantizer. This approximation is simple to implement and often provides adequate performance.
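A minimal sketch of the pass-through trick, here assuming PyTorch (the document does not prescribe a platform, and the function name is illustrative):

```python
import torch

def quantize_pass_through(x, codebook):
    """Quantize each row of x to its nearest code vector, with pass-through gradients.

    x        : (batch, dim) tensor produced by an upstream network
    codebook : (K, dim) tensor of code vectors (a learnable parameter)
    """
    # Nearest-neighbour search with squared Euclidean distances.
    distances = torch.cdist(x, codebook) ** 2
    indices = distances.argmin(dim=1)
    quantized = codebook[indices]

    # Pass-through trick: the forward value equals the quantized vector, but
    # the backward pass treats the quantizer as the identity, so gradients
    # flow unchanged to x and to the layers behind it.
    quantized = x + (quantized - x).detach()
    return quantized, indices
```

Note that with this approximation the codebook itself receives no gradient through the quantizer; in practice it is typically updated separately, for example with an additional codebook loss or with an EM-style moving average of the assigned inputs.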
While this approach converges more slowly than the EM algorithm, it is often beneficial since it allows optimization of an entire system towards a single goal. That is, if we have several modules in a system, it is beneficial if their joint interaction is taken into account in the optimization. If modules are optimized independently, it is difficult to anticipate all interactions and the system performance can remain far from optimal.
Algorithmic complexity#
Vector quantization is manageable for relatively small codebooks of,
say, some thousands of entries. However, suppose we need a high-accuracy
quantization with, for example, 30 bits; the codebook would then contain
$2^{30} \approx 10^9$ vectors, such that merely storing it, let alone
searching through it or training it, becomes prohibitively expensive.
A heuristic approach is to use successive codebooks, where at each
iteration, we quantize the error of the last iteration. That is, let’s
say that on the first iteration we have 8 bits, corresponding to a
codebook of $2^8 = 256$ entries. The quantization error left by that
first codebook is then quantized with a second codebook, whose error can
in turn be quantized with a third codebook, and so on, such that the
accuracy accumulates over the stages.
Where ordinary vector quantization can find the optimal solution, such
split or multi-stage vector quantization generally does not give the
global optimum. It does give good solutions, though, with an algorithmic
complexity which is very much lower than that of ordinary vector
quantization. For example, in the above example of 30 bits, we could
assign three consecutive layers of codebooks with 10 bits each, such that
we only need to store and search $3 \times 2^{10} = 3072$ codevectors
instead of $2^{30} \approx 10^9$.
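A minimal sketch of this successive (residual) quantization, assuming NumPy; in practice each stage's codebook would be trained on the residual errors left by the previous stage:

```python
import numpy as np

def multistage_quantize(x, codebooks):
    """Quantize x with a chain of codebooks, each stage coding the residual
    error of the previous stage.

    x         : (N,) input vector
    codebooks : list of (K_i, N) arrays, e.g. three 10-bit (1024-entry) codebooks
    """
    residual = x.copy()
    indices = []
    reconstruction = np.zeros_like(x)
    for codebook in codebooks:
        # Nearest code vector to the current residual.
        k = int(((codebook - residual) ** 2).sum(axis=1).argmin())
        indices.append(k)
        reconstruction = reconstruction + codebook[k]
        residual = residual - codebook[k]
    return indices, reconstruction
```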
Applications#
Probably the most important application of vector quantization in speech processing is speech coding with code-excited linear prediction (CELP), where

- linear predictive coefficients (LPC) are transformed to line spectral frequencies (LSFs), which are often encoded with multi-stage vector quantizers, and
- gains (signal energy) of the residual and long-term prediction are jointly encoded with a single-stage vector quantizer.
Other typical applications include the optimization of Gaussian mixture models (GMMs), where it is useful to use vector quantization to find a first guess of the means of each mixture component.
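As a hedged sketch of that use, one could, for example, obtain a codebook with SciPy's k-means (essentially the EM iteration above) and pass it as the initial means of scikit-learn's GMM; the data here is purely illustrative:

```python
import numpy as np
from scipy.cluster.vq import kmeans
from sklearn.mixture import GaussianMixture

# Illustrative data: 1000 two-dimensional vectors.
data = np.random.randn(1000, 2)

# Vector quantization gives a first guess of the mixture means.
codebook, _ = kmeans(data, 8)

# Initialize the GMM means with the codebook and refine with the GMM's own EM.
gmm = GaussianMixture(n_components=len(codebook), means_init=codebook).fit(data)
```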
Discussion#
The benefit of vector quantization is that it is a simple algorithm which gives high accuracy. In fact, for quantizing complicated data, vector quantization is (in theory) optimal in fixed-rate coding applications. It is simple in the sense that an experienced engineer can implement it in a matter of hours. Downsides with vector quantization include
- Complexity; for accurate quantization you need prohibitively large codebooks. The method therefore does not scale up nicely to big problems.
- Difficult optimization; as discussed above, codebook training is slow and typically requires heuristics and manual tuning.
- Training data; the amount of data needed to optimize a vector codebook is large. Each codebook vector must be assigned to a large number of data vectors, such that calculation of the mean (in the EM algorithm) is meaningful.
- Convergence; we have no assurance that optimization algorithms find the global optimum, and no assurance that local minima are “good enough”.
- Lack of flexibility; the codebook has a fixed size. If we want to use codebooks of different sizes, for example to transmit data with a variable bit-rate, then we have to optimize and store a separate codebook for every possible bit-rate.
- Blindness to inherent structures; this model describes data with a codebook, without any deeper understanding of what the data looks like within each category. For example, say we have two classes, speech and non-speech. Even though speech is very flexible, the non-speech class is much, much larger; speech is a very small subset of all possible sounds. Therefore, the within-class variance will be much larger in the non-speech class, and consequently the accuracy in the non-speech class would be much lower.

As a consequence, we would be tempted to increase the number of codevectors such that we get uniform accuracy in both classes. But then we lose the correspondence between codevectors and natural descriptions of the signal.