Automatic speech recognition (ASR) technology has developed and entered our lives so naturally that now we don’t even notice its presence. Hasn’t it always been there?

The number of ASR-based apps is growing, and the quality is gradually improving, but it is still far from satisfying business needs.
Let’s consider what’s missing.
Speech-to-text challenges and possible solutions
Deep neural networks often fail to deliver the expected results because background noise makes it hard to recognize words correctly. Multiple speakers talking at the same time make the task even more complex.

And even in perfect conditions, the capabilities of neural networks are still limited, as model training cannot cover every possible accent and articulation peculiarity of speakers.
The speech-to-text problem is still not fully solved, so research and development departments in many companies are looking for new ways to overcome these challenges.
When it comes to video content, it's worth noting that we humans receive not only auditory but also visual information. Alongside recognizing the spoken words, we analyze lip movements and get additional cues from the speaker's articulation.
This inspired Meta AI to develop AV-HuBERT (Audio-Visual Hidden Unit BERT), a self-supervised framework that analyzes both the audio and visual aspects of speech.
But it's not always possible to see the speaker. So speech-to-text solutions that process the audio alone and provide accurate transcripts of what was said are very much in demand.
A way to estimate accuracy
Accuracy is one of the most critical characteristics that influence the choice of the ASR solution.
One of the most common metrics used to rate the accuracy of a speech-to-text solution is the word error rate (WER).
WER is the ratio of recognition mistakes (substitutions, deletions, and insertions) to the total number of words actually spoken: WER = (S + D + I) / N, where N is the number of words in the reference transcript.

The lower the WER is, the more accurate the transcription.
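To make the formula concrete, here is a minimal sketch of how WER can be computed with word-level edit distance. It's a simplified illustration rather than the exact scoring tool we used, and the function and variable names are ours.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Classic dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i          # deleting all reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j          # inserting all hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

# Example: one substitution in a five-word reference gives WER = 0.2
print(wer("the bomb was finally destroyed", "the bomb was finally removed"))
```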
Now that we've covered the theory, let's compare how different solutions perform in practice.
Comparing popular solutions with Cognitive Mill
Many companies use Amazon Transcribe or Google Cloud Speech-to-Text solutions.
We at AIHunters built our solution using NVIDIA’s speech-to-text model QuartzNet15x5Base-En with punctuation_en_bert for punctuation and capitalization.
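For context, this is roughly how those two models can be loaded and chained with the NVIDIA NeMo toolkit. It's a minimal sketch, assuming NeMo is installed and the audio is a mono WAV file; the exact transcribe() arguments can differ between NeMo versions, and this is not a copy of our production pipeline.

```python
# pip install nemo_toolkit[all]  (assumes the NVIDIA NeMo toolkit)
import nemo.collections.asr as nemo_asr
import nemo.collections.nlp as nemo_nlp

# Speech-to-text: QuartzNet15x5 trained on English
asr_model = nemo_asr.models.EncDecCTCModel.from_pretrained(
    model_name="QuartzNet15x5Base-En"
)
# Punctuation and capitalization restoration on top of the raw transcript
punct_model = nemo_nlp.models.PunctuationCapitalizationModel.from_pretrained(
    model_name="punctuation_en_bert"
)

# "news_report.wav" is a placeholder: a 16 kHz mono WAV file
raw_text = asr_model.transcribe(["news_report.wav"])[0]
final_text = punct_model.add_punctuation_capitalization([raw_text])[0]
print(final_text)
```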
We conducted a small experiment to estimate the accuracy of the solutions mentioned above, using three randomly selected video fragments of different genres: an action movie, an interview, and a news report.
As all three samples provided more or less similar results, let’s review one of these samples — a news report — in more detail.
Here is the input video.
Below is the reference transcript, which contains 250 words in total.
Heat waves sweeping across Europe this summer have brought not just record high temperatures and scorched fields but new discoveries as well. The drought-stricken waters of Italy's river Po are running so low that they have revealed a Second World War bomb, as Ikaba Koyi reports.
This is river Po, the longest river in Italy. But after weeks of scorching heat, its waters have been running so low they've revealed a 450-kilo World War II bomb. The bomb was found by fishermen on the bank of the Po river due to a decrease in water levels caused by drought. The army were called in, and put to work. About 3,000 people living nearby were ordered to evacuate before the operation.
At first, someone said they would not move. But in the last few days, we think we have convinced everyone. The most important thing is for everyone to evacuate the town because even if only one person is found, the operations would have to be suspended, with delays for everyone.
Traffic on a railway line and state road nearby were halted, and even the area's airspace was shut down. Bomb disposal engineers then removed the fuse from the bomb, which the army said contained 240 kilograms of explosive. The device was taken to a nearby quarry, where it was finally destroyed.
Ikaba Koyi, BBC News.
Though Google and Amazon offer various settings along with the possibility to train or adapt the model, we used the default models so that the solutions would compete on more or less equal terms.
We neither set the number of channels nor turned on speaker diarization or vocabulary filters.
The model we use can currently only recognize US English speech, so we set US English for all solutions.
We also didn't take punctuation and capitalization into account.
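For reference, here is roughly what "default settings" meant on the cloud side: just a language code, no diarization, no custom vocabulary. This is a simplified sketch using the standard boto3 and google-cloud-speech clients; the bucket names, file paths, and job name are placeholders.

```python
import boto3
from google.cloud import speech

# --- Amazon Transcribe: default model, US English, no extra features ---
transcribe = boto3.client("transcribe")
transcribe.start_transcription_job(
    TranscriptionJobName="news-report-test",                    # placeholder job name
    Media={"MediaFileUri": "s3://my-bucket/news_report.wav"},   # placeholder URI
    MediaFormat="wav",
    LanguageCode="en-US",
)

# --- Google Cloud Speech-to-Text: default model, US English ---
client = speech.SpeechClient()
config = speech.RecognitionConfig(language_code="en-US")
audio = speech.RecognitionAudio(uri="gs://my-bucket/news_report.wav")  # placeholder URI
# Note: synchronous recognize() only handles short clips;
# longer files would need long_running_recognize().
response = client.recognize(config=config, audio=audio)
for result in response.results:
    print(result.alternatives[0].transcript)
```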
The results
Now, let’s see which solution won the accuracy competition!
To calculate the accuracy, we used the WER formula. And here are the results of our small test.
The parts of the transcript that contain errors of any type (insertions, deletions, or substitutions) are highlighted.
Plain highlighting marks an erroneous deletion: the text wasn't recognized and is missing from the transcript.
Highlighted strikethrough text marks an erroneous insertion: words that are not in the original text.
Highlighted strikethrough text followed by highlighted text marks an erroneous substitution together with its correction.
Amazon Transcribe
Here is the result we got with Amazon Transcribe.

Amazon's solution made only four mistakes.
4 / 250 = 0.016
The word error rate (WER) for Amazon Transcribe is 1.6%, which means that the transcribed text is 98.4% accurate.
One of the four mistakes is the proper name of the journalist — Ikaba Koyi, which the system didn’t know in advance.
Google Speech-to-Text
Here's the transcription we got with the Google Cloud Speech-to-Text service.

The Google Speech-to-Text solution made 50 mistakes.
50 / 250 = 0.2
The word error rate (WER) for Google Speech-to-Text is 20%, so the transcribed text is 80% accurate.
In some cases, the system simply dropped the parts of sentences it couldn't recognize.
Cognitive Mill
Below is the transcript we got with our solution based on NVIDIA's speech-to-text and punctuation models.

NVIDIA's model misinterpreted 20 words.
20 / 250 = 0.08
The word error rate (WER) for the Cognitive Mill solution is 8%, with an accuracy of 92%.
The system made erroneous insertions in the parts of the video where the English interpretation was voiced over the original Italian speech.
This didn't affect the other two services, though.
According to the experiment results, Amazon Transcribe clearly provided the most accurate transcript, with an accuracy of almost 100%.
The models we use for Cognitive Mill pipelines showed 92% accuracy.
And the score for Google Speech-to-Text is 80%.
The results are rough and should not be considered definitive.
Everything depends on use cases and business needs. And as I’ve already mentioned above, the reviewed services can be customized for specific tasks.
We conducted this experiment to get an overall comparison of popular services and our solution using default settings.
Why do we use NVIDIA’s models for the Cognitive Mill pipelines?
Looking at the results of our small speech-to-text solutions competition, you may wonder why we chose NVIDIA’s models instead of integrating the ready-made Amazon service that showed the most accurate result.
These are our reasons.
Depending on business needs and specific tasks, it's not always necessary to strive for 100% accuracy.
For example, the accuracy we get with NVIDIA's models is more than enough to identify the topic and keywords or to generate a summary that conveys the gist of what was discussed in the video, which is exactly what we do with our News summary solution.
In one of our previous articles, you can find more information about the summarization pipeline.
As you know, AIHunters' mission is cognitive business automation. So for Cognitive Mill pipelines, the transcript we get with the help of NVIDIA's neural network is just the input for our cognitive post-processing.
To guarantee that our customers get the value, performance, and flexibility they need at a reasonable price, we cannot rely on ready-made third-party services, such as the ones provided by Google or Amazon.
We need independent customizable modules to integrate into our pipelines.
Our cost-effective processes bring business value and generate stable, highly predictable results thanks to the algorithms we run on top of the metadata collected by Deep Learning models. DL is our 'ears', while the cognitive modules that perform post-processing are the 'brains', able to make human-like decisions and automate routine tasks.
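To illustrate that "ears and brains" split, here is a purely illustrative sketch: the DL model produces a transcript, and a deterministic post-processing step then turns it into structured metadata, such as candidate keywords. The extract_keywords module below is hypothetical and is not an actual Cognitive Mill component.

```python
from collections import Counter
import re

# Hypothetical stand-in for a cognitive post-processing module:
# the DL "ears" produce the transcript, this step plays a tiny part of the "brains".
STOPWORDS = {"the", "a", "an", "of", "to", "in", "on", "and", "was", "were",
             "is", "are", "by", "for", "it", "its", "that", "this", "have", "has"}

def extract_keywords(transcript: str, top_n: int = 5) -> list[str]:
    """Return the most frequent non-stopword terms as rough topic keywords."""
    words = re.findall(r"[a-z']+", transcript.lower())
    counts = Counter(w for w in words if w not in STOPWORDS and len(w) > 2)
    return [word for word, _ in counts.most_common(top_n)]

transcript = "The bomb was found by fishermen on the bank of the Po river ..."
print(extract_keywords(transcript))  # e.g. ['bomb', 'found', 'fishermen', 'bank', 'river']
```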
Stay tuned for more stories about how our ‘robot’ works!