6.2. Objective quality evaluation#

6.2.1. Objective estimators for perceptual quality#

By “objective evaluation” we usually refer to estimators of perceptual quality, where the objective is to predict the mean outcome of a subjective listening test with an algorithm. That is, we want a computer to listen to a sound sample and “guess” what a human listener would, on average, say about its quality.

It follows that subjective evaluation is always the “true” measure of performance and objective evaluation is an approximation thereof. In this sense, subjective evaluation is “better”. In fact, there are plenty of examples where objective quality estimators give the opposite result of the subjective preference [Manocha et al., 2022]. Still, there are many good reasons to use objective instead of subjective evaluation:

  • Subjective evaluation is expensive; a test requires that a large number of people listen to sound samples, which is both time-consuming and requires infrastructure. Objective evaluation is performed on a computer, so a large number of sound samples can generally be tested in a short time.

  • Subjective evaluation is noisy; even with a large number of expert listeners, it is generally difficult to get exactly the same result in two consecutive tests. Objective evaluation always gives the same rating for the same input, such that testing is consistent and reliable. This is especially important for scientific reproducibility: an independent laboratory can verify and confirm your results, since the objective measure always gives the same output. With subjective evaluation, independent researchers can get different results, and you can never be 100% certain where the difference in results comes from. Did one of the researchers make an error, or do subjective listeners simply give slightly different results every time?

Some of the most frequently used objective measures include:

  • PESQ is probably the most frequently used objective evaluation method and is defined in ITU-T Recommendation P.862: Perceptual evaluation of speech quality (PESQ): An objective method for end-to-end speech quality assessment of narrow-band telephone networks and speech codecs (2001) [Rix et al., 2001]. It is thus an evaluation method designed explicitly for telecommunications applications. It estimates the mean score of a P.800 ACR test.
    PESQ accepts only narrow-band input and is not directly applicable to other bandwidths. The degradation types whose effects PESQ can reliably predict are

    • Speech input levels to a codec

    • Transmission channel errors

    • Packet loss and packet loss concealment with CELP codecs

    • Bit rates if a codec has more than one bit-rate mode

    • Transcodings

    • Environmental noise at the sending side

    • Effect of varying delay in listening only tests

    • Short-term time warping of audio signal

    • Long-term time warping of audio signal

    Observe that distortion types other than those listed above can yield unreliable results. An important omission is distortions caused by spectral processing, such as musical noise. For example, using PESQ to evaluate speech enhancement methods that operate in the STFT domain can give unreliable results. (A usage sketch with a freely available PESQ implementation follows this list.)

  • Perceptual Objective Listening Quality Assessment (POLQA) is the successor of PESQ and is defined in ITU-T Recommendation P.863: Perceptual objective listening quality assessment [Beerends et al., 2013]. For most practical purposes, POLQA is better than PESQ: it covers a wider range of applications and degradation types, and its output is more reliable. From a scientific perspective, however, it is highly regrettable that implementations of POLQA are commercial and expensive products, which renders POLQA infeasible for much ordinary scientific work. Even if an individual team could afford to purchase a POLQA licence, verification of its results by independent research labs is possible only if they also purchase a licence. Despite its limitations, PESQ has therefore remained the scientific standard in objective evaluation of speech.

  • Perceptual Evaluation of Audio Quality (PEAQ) evaluates not only speech but also other types of audio [Thiede et al., 2000]. It is therefore less accurate with respect to distortions specific to speech signals, but it generalizes better to other audio such as music and background noises. The measure is defined in ITU-R Recommendation BS.1387: Method for objective measurements of perceived audio quality (PEAQ).

  • The short-time objective intelligibility (STOI) measure focuses on how intelligible a speech sample is [Taal et al., 2011]. It is thus aimed at lower-quality scenarios, where speech is so badly corrupted that it is hard to understand what is said. Like all objective measures, it is not a completely reliable estimate of quality, but it can be useful in combination with other measures. A further advantage of STOI is that an open implementation is freely available (see the sketch below).
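Several of the measures above have freely available implementations. Below is a minimal sketch of scoring a degraded signal against its clean reference, assuming the third-party Python packages `pesq` and `pystoi` (`pip install pesq pystoi`). The synthetic tone is only a placeholder for the real, time-aligned speech recordings that PESQ expects, so the point of the example is the call signatures rather than the scores.

```python
# A minimal sketch: score a degraded signal against its clean reference
# with PESQ (narrow-band mode) and STOI. The sinusoid stands in for real
# speech; with non-speech input PESQ may fail to detect an utterance.
import numpy as np
from pesq import pesq     # third-party ITU-T P.862 implementation
from pystoi import stoi   # third-party STOI implementation

fs = 8000                                # PESQ narrow-band mode expects 8 kHz
t = np.arange(2 * fs) / fs               # two seconds of signal
clean = np.sin(2 * np.pi * 220 * t)      # placeholder for a clean recording
rng = np.random.default_rng(0)
degraded = clean + 0.05 * rng.standard_normal(clean.shape)  # additive noise

mos_lqo = pesq(fs, clean, degraded, 'nb')           # estimate of the P.800 MOS
d_stoi = stoi(clean, degraded, fs, extended=False)  # intelligibility, ~[0, 1]
print(f"PESQ (MOS-LQO): {mos_lqo:.2f}, STOI: {d_stoi:.2f}")
```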

6.2.2. Other objective performance criteria#

There are many cases where performance criteria other than predictors of subjective listening test results are warranted. Most typically, such criteria are applied when there is no human listener involved, as in speech recognition, or when we want a more detailed characterization of performance than predictors of subjective listening test results can give.

Some examples of such performance criteria include:

  • Word error rate (WER) is used in speech recognition to measure the proportion of word errors in the recognized transcript, that is, the number of substituted, deleted, and inserted words relative to the number of words in the reference transcript (a sketch at the end of this list illustrates the computation).

  • Signal to noise ratio (SNR) measures the ratio between the energy of the desired speech signal and that of undesired noise components (which include, for example, background noises, distortions caused by processing algorithms and transmission, as well as undesired competing speakers). With a clean input spectrum \(X_k\) and its distorted counterpart \(\hat X_k\), the SNR is defined as

    \[ D_{SNR} = \frac{ \sum_{k=0}^{N-1} |X_k|^2 }{ \sum_{k=0}^{N-1} |X_k - \hat X_k|^2 }. \]

    Typically, the SNR is presented in units of decibel, obtained as \(10\log_{10} D_{SNR}\). The motivation of the SNR is that it reflects the energy of the distortion in proportion to the energy of the signal. By using a ratio, we normalize the error to reflect accuracy rather than absolute error energy. (A numpy sketch of the SNR, pSNR, and LSD computations follows this list.)

  • Perceptual signal to noise ratio (pSNR) measures the SNR in a perceptually motivated domain. Essentially, distortions are weighted such that they approximately correspond to human perception. This is similar to the above predictors of subjective listening tests, but it also works on short segments of speech. It can therefore be used for detailed analysis of distortions, for example, to determine which parts of the signal contain undesirable distortions. With perceptual weighting coefficients \(w_k\), the pSNR is defined as

    \[ D_{pSNR} = \frac{ \sum_{k=0}^{N-1} w_k |X_k|^2 }{ \sum_{k=0}^{N-1} w_k |X_k - \hat X_k|^2 }. \]

  • The speech distortion index (SDI) measures the amount by which the desired speech signal is distorted. In speech enhancement, it is often used in combination with the noise attenuation factor (NAF), which measures the amount by which undesired noises are removed. Clearly, by doing nothing we obtain a perfect SDI, and by setting the output to zero we obtain a perfect NAF; neither outcome is usually satisfactory. It is therefore usually not clear what the right balance between the two measures is.

  • Unweighted and weighted average recall (UAR, WAR) are often used to measure performance in speech classification tasks, such as classifying a speech segment into one of a finite number of possible emotions. UAR is defined as the mean of the class-specific recalls (the proportion of samples of each class recognized correctly), while WAR is the overall proportion of samples recognized correctly across all classes (sometimes also referred to as accuracy). UAR is often preferred over WAR in experiments where there is notable class imbalance in the test data and where it is important that the system is also sensitive to the less frequent classes.

  • Receiver operating characteristic (ROC) curves and their derivatives, such as the area under the curve (AUC) or the equal error rate (EER), are often used to report the performance of systems that have some type of detection threshold that can be varied, where performance for each threshold value is measured in terms of true positive (recall) and false positive rates. For instance, the performance of speaker verification systems is often evaluated using such metrics.

  • The log-spectral distance or log-spectral distortion (LSD) measures the error of the log-spectrum \(10\log_{10} P(\omega)\), where \(P(\omega)=|X(\omega)|^2\) is the power spectrum of the clean signal spectrum \(X(\omega)\). The LSD is then defined using the corrupted power spectrum \(\hat P(\omega)\) as

    \[ D_{LS} =\sqrt {{\frac {1}{2\pi }}\int _{-\pi }^{\pi }\left[10\log _{10}{\frac {P(\omega )}{{\hat {P}}(\omega )}}\right]^{2}\,d\omega } =\sqrt {{\frac {1}{2\pi }}\int _{-\pi }^{\pi }\left[10\log _{10} {P(\omega )}-10\log _{10}{\hat {P}}(\omega )\right]^{2}\,d\omega }. \]

    In practical applications the integral needs to be replaced with a summation such as

    \[ D_{LS} =\sqrt {{\frac {1}{N }}\sum _{k=0}^{N-1 }\left[10\log _{10}{\frac {P_k}{{\hat {P}}_k}}\right]^{2} }, \]

    where \(N\) is the number of spectral components. Observe that the LSD thus corresponds to the root mean square error in the log-domain. The LSD is motivated by the fact that human perception of distortion is approximately logarithmic [Gray and Markel, 1976].
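The SNR, pSNR, and LSD are straightforward to compute from a clean spectrum and its distorted counterpart. The following is a minimal numpy sketch following the definitions given above; the random spectra and perceptual weights are illustrative placeholders, and the function names are our own.

```python
# Minimal numpy implementations of SNR, pSNR, and LSD as defined above.
import numpy as np

def snr_db(X, X_hat, w=None):
    """SNR in dB; with weights w this is the perceptual SNR (pSNR)."""
    w = np.ones(len(X)) if w is None else w
    num = np.sum(w * np.abs(X) ** 2)          # (weighted) signal energy
    den = np.sum(w * np.abs(X - X_hat) ** 2)  # (weighted) error energy
    return 10 * np.log10(num / den)

def lsd_db(P, P_hat):
    """Log-spectral distance: RMS difference of the log-power spectra."""
    return np.sqrt(np.mean((10 * np.log10(P / P_hat)) ** 2))

# Toy usage on a random complex "spectrum" with additive distortion.
rng = np.random.default_rng(0)
N = 256
X = rng.standard_normal(N) + 1j * rng.standard_normal(N)
X_hat = X + 0.1 * (rng.standard_normal(N) + 1j * rng.standard_normal(N))
w = rng.uniform(0.5, 1.5, N)   # stand-in for perceptual weights w_k

print(f"SNR:  {snr_db(X, X_hat):5.1f} dB")
print(f"pSNR: {snr_db(X, X_hat, w):5.1f} dB")
print(f"LSD:  {lsd_db(np.abs(X)**2, np.abs(X_hat)**2):5.2f} dB")
```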
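Similarly, WER, UAR/WAR, and the EER can be computed directly from their definitions. The sketch below is deliberately self-contained rather than relying on any particular toolkit; the toy data, the threshold sweep in the EER, and the function names are illustrative assumptions.

```python
# From-scratch sketches of WER, UAR/WAR, and an approximate EER.
import numpy as np

def wer(reference, hypothesis):
    """Word error rate: (substitutions + deletions + insertions) / reference length."""
    r, h = reference.split(), hypothesis.split()
    # Levenshtein distance over word sequences by dynamic programming.
    d = np.zeros((len(r) + 1, len(h) + 1), dtype=int)
    d[:, 0] = np.arange(len(r) + 1)
    d[0, :] = np.arange(len(h) + 1)
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            d[i, j] = min(d[i - 1, j - 1] + (r[i - 1] != h[j - 1]),  # substitution
                          d[i - 1, j] + 1,                           # deletion
                          d[i, j - 1] + 1)                           # insertion
    return d[-1, -1] / len(r)

def uar(y_true, y_pred):
    """Unweighted average recall: mean of the class-specific recalls."""
    classes = np.unique(y_true)
    return np.mean([np.mean(y_pred[y_true == c] == c) for c in classes])

def eer(scores, labels):
    """Approximate equal error rate by sweeping the detection threshold."""
    thresholds = np.sort(scores)
    fpr = np.array([(scores[labels == 0] >= t).mean() for t in thresholds])
    fnr = np.array([(scores[labels == 1] < t).mean() for t in thresholds])
    i = np.argmin(np.abs(fpr - fnr))
    return (fpr[i] + fnr[i]) / 2

# One substitution and one insertion over a four-word reference: WER = 0.5.
print(wer("the quick brown fox", "the quick brown box jumps"))

y_true = np.array([0, 0, 0, 0, 1, 1])   # imbalanced two-class toy labels
y_pred = np.array([0, 0, 0, 0, 1, 0])
print(uar(y_true, y_pred))               # UAR = (4/4 + 1/2) / 2 = 0.75
print(np.mean(y_true == y_pred))         # WAR (accuracy) = 5/6

rng = np.random.default_rng(0)
scores = np.concatenate([rng.normal(0, 1, 100), rng.normal(2, 1, 100)])
labels = np.concatenate([np.zeros(100, int), np.ones(100, int)])
print(f"EER: {eer(scores, labels):.2f}")  # EER of a toy detector
```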

6.2.3. References#

BSB+13

John G Beerends, Christian Schmidmer, Jens Berger, Matthias Obermann, Raphael Ullmann, Joachim Pomy, and Michael Keyhl. Perceptual objective listening quality assessment (POLQA), the third generation ITU-T standard for end-to-end speech quality measurement part I—temporal alignment. Journal of the Audio Engineering Society, 61(6):366–384, 2013. URL: http://www.aes.org/e-lib/browse.cfm?elib=16829.

GM76

Augustine Gray and John Markel. Distance measures for speech processing. IEEE Transactions on Acoustics, Speech, and Signal Processing, 24(5):380–391, 1976. URL: https://doi.org/10.1109/TASSP.1976.1162849.

MJF22

Pranay Manocha, Zeyu Jin, and Adam Finkelstein. Audio similarity is unreliable as a proxy for audio quality. arXiv preprint arXiv:2206.13411, 2022. URL: https://doi.org/10.48550/arXiv.2206.13411.

RBHH01

Antony W Rix, John G Beerends, Michael P Hollier, and Andries P Hekstra. Perceptual evaluation of speech quality (PESQ)-a new method for speech quality assessment of telephone networks and codecs. In 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No. 01CH37221), volume 2, 749–752. IEEE, 2001. URL: https://doi.org/10.1109/ICASSP.2001.941023.

THHJ11

Cees H Taal, Richard C Hendriks, Richard Heusdens, and Jesper Jensen. An algorithm for intelligibility prediction of time–frequency weighted noisy speech. IEEE Transactions on Audio, Speech, and Language Processing, 19(7):2125–2136, 2011. URL: https://doi.org/10.1109/TASL.2011.2114881.

TTB+00

Thilo Thiede, William C Treurniet, Roland Bitto, Christian Schmidmer, Thomas Sporer, John G Beerends, and Catherine Colomes. PEAQ - The ITU standard for objective measurement of perceived audio quality. Journal of the Audio Engineering Society, 48(1/2):3–29, 2000. URL: http://www.aes.org/e-lib/browse.cfm?elib=12078.