6.2. Objective quality evaluation

6.2.1. Objective estimators for perceptual quality

With “objective evaluation” we usually refer to estimators of perceptual quality, where the objective is to predict the mean output of a subjective listening test using an algorithm. That is, we want a computer to listen to a sound sample and try to “guess” what a human listener would say about its quality (on average).

It is then clear that subjective evaluation is always the “true” measure of performance, and objective evaluation is an approximation thereof. In this sense, subjective evaluation is “better”. There are plenty of examples where objective quality estimators give the opposite result of the subjective preference [Manocha et al., 2022]. However, there are many good reasons to use objective instead of subjective evaluation:

  • Subjective evaluation is expensive; a test requires a large number of people to listen to sound samples, which is time-consuming and requires infrastructure. Objective evaluation is performed on a computer, so you can generally test a large number of sound samples in a short time.

  • Subjective evaluation is noisy; even with a large number of expert listeners, it is generally difficult to get exactly the same result in two consecutive tests. Objective evaluation always gives the same rating for the same input, such that testing is consistent and reliable. This is especially important for scientific reproducibility; an independent laboratory can verify and confirm your results, and the objective measure always gives the same output. With subjective evaluation, independent researchers can get different results, and you can never be 100% certain where the difference in results comes from. Did one of the researchers make an error, or is it just that subjective listeners always give slightly different results?

Some of the most frequently used objective measures include:

  • PESQ is probably the most frequently used objective evaluation method and it is defined in ITU-T Recommendation P.862: Perceptual evaluation of speech quality (PESQ): An objective method for end-to-end speech quality assessment of narrow-band telephone networks and speech codecs (2001) [Rix et al., 2001]. It is thus an evaluation method designed explicitly for telecommunications applications. It estimates the mean score of a P.800 ACR test.
    PESQ accepts only narrow-band input and is not directly applicable to other bandwidths. The degradation types whose effect PESQ can reliably predict are

    • Speech input levels to a codec

    • Transmission channel errors

    • Packet loss and packet loss concealment with CELP codecs

    • Bit rates if a codec has more than one bit-rate mode

    • Transcodings

    • Environmental noise at the sending side

    • Effect of varying delay in listening-only tests

    • Short-term time warping of audio signal

    • Long-term time warping of audio signal

    Observe that PESQ can give unreliable results for distortions other than those listed above. An important omission is distortions caused by spectral processing, such as musical noise. For example, using PESQ to evaluate speech enhancement methods based on processing in the STFT domain can give unreliable results.

  • Perceptual Objective Listening Quality Assessment (POLQA) is the successor of PESQ and is defined in ITU-T Recommendation P.863: Perceptual objective listening quality assessment [Beerends et al., 2013]. It is important to notice that for most practical purposes, POLQA is better than PESQ. It has a wider range of applications and acceptable degradation types, and the output is more reliable. However, from a scientific perspective, it is extremely regrettable that implementations of POLQA are commercial and expensive products, rendering the application of POLQA infeasible in normal scientific work. Even if an individual team could afford to purchase a POLQA license, verification of POLQA results by independent research labs is possible only if they also purchase a POLQA license. Despite its limitations, PESQ has therefore remained the scientific standard in objective evaluation of speech.

  • Perceptual Evaluation of Audio Quality (PEAQ) evaluates not only speech but also other types of audio samples [Thiede et al., 2000]. It is, therefore, less accurate with respect to distortions specific to speech signals, but it generalizes better to other audio, such as music and background noises. The measure is defined in ITU-R Recommendation BS.1387: Method for objective measurements of perceived audio quality (PEAQ).

  • The short-term objective intelligibility (STOI) measure focuses on how intelligible a speech sample is [Taal et al., 2011]. It is thus clearly focused on lower-quality scenarios where speech is so badly corrupted that it is hard to understand what is said. Like all objective measures, it is not a completely reliable estimate of quality but can be useful in combination with other measures. A good feature of STOI is that an implementation is available.

6.2.2. Other objective performance criteria

There are many cases where performance criteria other than predictions of subjective listening test results are warranted. Most typically, such criteria are applied when no human listener is involved, as in speech recognition, or when we want a more detailed characterization of performance than predictors of subjective listening test results can give.

Some examples of such performance criteria include:

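  • Word error rate (WER) is used in speech recognition to measure the proportion of words incorrectly recognized from a test signal. It counts the substituted, deleted and inserted words relative to the number of words in the reference transcript.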

  • Signal-to-noise ratio (SNR) measures the ratio between the energy of the desired speech signal and that of the undesired noise components (which include, for example, background noises, distortions caused by processing algorithms and transmission, as well as undesirable competing speakers). With a clean input spectrum \(X_k\) and its distorted counterpart \(\hat X_k\), the SNR is defined as

    \[ D_{SNR} = \frac{ \sum_{k=0}^{N-1} |X_k|^2 }{ \sum_{k=0}^{N-1} |X_k - \hat X_k|^2 }. \]

    Typically, SNR is presented in decibels, obtained by \(10\log_{10} D_{SNR}\). The motivation for the SNR is that it relates the energy of the error to the energy of the signal. By using a ratio, we thus normalize the error so that it reflects relative accuracy rather than absolute error energy.

  • Perceptual signal-to-noise ratio (pSNR) measures SNR in a perceptually motivated domain. Essentially, distortions are weighted such that they approximately correspond to human perception. This is similar to the above predictors of subjective listening tests, but it also works on small segments of speech. It can therefore be used for detailed analysis of distortions, for example, to determine which parts of the signal contain undesirable distortions. With perceptual weighting coefficients \(w_k\), the pSNR is defined as

    \[ D_{pSNR} = \frac{ \sum_{k=0}^{N-1} w_k |X_k|^2 }{ \sum_{k=0}^{N-1} w_k |X_k - \hat X_k|^2 }. \]

  • The speech distortion index (SDI) measures the amount by which a desirable speech signal is distorted. In speech enhancement, it is often used in combination with the noise attenuation factor (NAF), which measures the amount by which undesirable noises are removed. It is clear that by doing nothing, we obtain a perfect SDI, and by setting the output to zero, we obtain a perfect NAF. Neither outcome is usually satisfactory. It is, therefore, usually not clear what the right balance between the two measures is.

  • Unweighted and weighted average recall (UAR, WAR) are often used to measure performance in speech classification tasks, such as classifying a speech segment into one of a finite number of possible emotions. UAR is defined as the mean of class-specific recalls (the proportion of class samples recognized correctly), while WAR is the overall proportion of samples recognized correctly across all classes (sometimes also referred to as accuracy). UAR is often preferred over WAR in experiments where there is a notable class imbalance in the test data, and where it is important to have systems that are also sensitive to the less-frequent classes.

  • Receiver operating characteristic (ROC) curves and their derivatives, such as area under the curve (AUC) or equal error rate (EER), are often used to report the performance of systems that have some type of detection threshold that can be varied, and when performance for each threshold value is measured in terms of true positive and false positive rates. For instance, the performance of speaker verification systems is often evaluated using such metrics.

  • The log-spectral distance or log-spectral distortion (LSD) measures spectral error of the log-magnitude spectrum \(10\log_{10} P(\omega)\), where \(P(\omega)=|X(\omega)|^2\) is the power (energy) of the clean signal spectrum \(X(\omega)\). The LSD is then defined using the corrupted spectrum \(\hat P(\omega)\) as

    \[ D_{LS} =\sqrt {{\frac {1}{2\pi }}\int _{-\pi }^{\pi }\left[10\log _{10}{\frac {P(\omega )}{{\hat {P}}(\omega )}}\right]^{2}\,d\omega } =\sqrt {{\frac {1}{2\pi }}\int _{-\pi }^{\pi }\left[10\log _{10} {P(\omega )}-10\log _{10}{\hat {P}}(\omega )\right]^{2}\,d\omega }. \]

    In practical applications, the integral needs to be replaced with a summation, such as

    \[ D_{LS} =\sqrt {{\frac {1}{N }}\sum _{k=0}^{N-1 }\left[10\log _{10}{\frac {P_k}{{\hat {P}}_k}}\right]^{2} }, \]

    where \(N\) is the number of spectral components. Observe that the LSD thus corresponds to the root-mean-square error in the log-domain, expressed in decibels. The LSD is motivated by the fact that human perception of distortion is approximately logarithmic [Gray and Markel, 1976].
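
Several of the criteria above reduce to a few lines of code. The following is a minimal sketch in plain Python, not a reference implementation. The function names (snr_db, psnr_db, lsd_db, wer, uar_war) and the toy inputs are illustrative assumptions that do not come from any standard. The sketch assumes that spectra are given as equal-length sequences, that the error energy and all power values are strictly positive, and that the reference transcript is non-empty. The WER here is computed as the word-level edit distance divided by the number of reference words.

```python
import math


def snr_db(X, X_hat):
    """SNR in decibels between a clean spectrum X and its distorted version X_hat."""
    signal = sum(abs(x) ** 2 for x in X)
    error = sum(abs(x - y) ** 2 for x, y in zip(X, X_hat))
    return 10 * math.log10(signal / error)


def psnr_db(X, X_hat, w):
    """Perceptual SNR in decibels, with one weight w[k] per spectral component."""
    signal = sum(wk * abs(x) ** 2 for wk, x in zip(w, X))
    error = sum(wk * abs(x - y) ** 2 for wk, x, y in zip(w, X, X_hat))
    return 10 * math.log10(signal / error)


def lsd_db(P, P_hat):
    """Log-spectral distance (discrete form) between power spectra P and P_hat."""
    squared = [(10 * math.log10(p / q)) ** 2 for p, q in zip(P, P_hat)]
    return math.sqrt(sum(squared) / len(squared))


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))  # distances for an empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,               # reference word deleted
                            curr[j - 1] + 1,           # hypothesis word inserted
                            prev[j - 1] + (r != h)))   # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def uar_war(y_true, y_pred):
    """Unweighted average recall and weighted average recall (accuracy)."""
    recalls = []
    for c in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == c]
        recalls.append(sum(y_pred[i] == c for i in idx) / len(idx))
    uar = sum(recalls) / len(recalls)
    war = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    return uar, war


# Toy usage with made-up values
X = [3 + 0j, 4j]
X_hat = [3 + 0j, 4.5j]
print(snr_db(X, X_hat), psnr_db(X, X_hat, [1.0, 0.5]))
print(lsd_db([1.0, 10.0, 100.0], [1.0, 1.0, 100.0]))
print(wer("the cat sat on the mat", "the cat sit on mat"))
print(uar_war(["neutral"] * 8 + ["angry"] * 2,
              ["neutral"] * 8 + ["neutral", "angry"]))
```

In the last call, the minority class is recognized only half of the time, so the UAR is clearly lower than the WAR. This illustrates why UAR is often preferred when the classes are imbalanced.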

6.2.3. References

[BSB+13]

John G Beerends, Christian Schmidmer, Jens Berger, Matthias Obermann, Raphael Ullmann, Joachim Pomy, and Michael Keyhl. Perceptual objective listening quality assessment (POLQA), the third generation ITU-T standard for end-to-end speech quality measurement part i—temporal alignment. Journal of the Audio Engineering Society, 61(6):366 – 384, 2013. URL: http://www.aes.org/e-lib/browse.cfm?elib=16829.

[GM76]

Augustine Gray and John Markel. Distance measures for speech processing. IEEE Transactions on Acoustics, Speech, and Signal Processing, 24(5):380 – 391, 1976. URL: https://doi.org/10.1109/TASSP.1976.1162849.

[MJF22]

Pranay Manocha, Zeyu Jin, and Adam Finkelstein. Audio similarity is unreliable as a proxy for audio quality. arXiv preprint arXiv:2206.13411, 2022. URL: https://doi.org/10.48550/arXiv.2206.13411.

[RBHH01]

Antony W Rix, John G Beerends, Michael P Hollier, and Andries P Hekstra. Perceptual evaluation of speech quality (PESQ)-a new method for speech quality assessment of telephone networks and codecs. In 2001 IEEE international conference on acoustics, speech, and signal processing. Proceedings (Cat. No. 01CH37221), volume 2, 749 – 752. IEEE, 2001. URL: https://doi.org/10.1109/ICASSP.2001.941023.

[THHJ11]

Cees H Taal, Richard C Hendriks, Richard Heusdens, and Jesper Jensen. An algorithm for intelligibility prediction of time-frequency weighted noisy speech. IEEE Transactions on Audio, Speech, and Language Processing, 19(7):2125 – 2136, 2011. URL: https://doi.org/10.1109/TASL.2011.2114881.

[TTB+00]

Thilo Thiede, William C Treurniet, Roland Bitto, Christian Schmidmer, Thomas Sporer, John G Beerends, and Catherine Colomes. PEAQ - The ITU standard for objective measurement of perceived audio quality. Journal of the Audio Engineering Society, 48(1/2):3 – 29, 2000. URL: http://www.aes.org/e-lib/browse.cfm?elib=12078.