Stable real-time voice translation in Google Translate

The transcription feature in the Google Translate app can be used to create real-time translated transcripts for events like meetings and speeches, or simply for stories told around the dinner table in a language you don’t understand. In such settings, a timely display of the translated text helps keep readers engaged. However, with earlier versions of this feature, the translated text was subject to multiple real-time revisions, which could be distracting. This is due to the non-monotonic relationship between source and target text: a word at the end of the source sentence can affect a word at the beginning of the translation. Today, we’re pleased to introduce some of the technology behind a recently released update to the transcription feature in the Google Translate app that significantly reduces translation revisions and improves the user experience. Two papers present the research behind this goal. The first formalizes an evaluation framework for live translation and develops methods to reduce instability. The second shows that these methods compare very favorably with state-of-the-art alternatives while retaining the simplicity of the original approach. The resulting model is much more stable and provides a noticeably improved reading experience in Google Translate.

Evaluating real-time translation

Before attempting any improvements, it is important to first understand and quantify the different aspects of the user experience, with the goal of maximizing quality while minimizing latency and instability. In “Re-translation Strategies for Long Form, Simultaneous, Spoken Language Translation”, we developed an evaluation framework for real-time translation that has guided our research and engineering work since. It measures performance with the following metrics:

  • Erasure: Measures the extra reading burden placed on the user by instability, as the number of words erased and replaced per word in the final translation.
  • Lag: Measures the average time elapsed between when the user speaks a word and when the translation of that word displayed on the screen becomes stable. Requiring stability avoids rewarding systems that are fast only because they revise frequently.
  • BLEU score: Measures the quality of the final translation. Quality differences in intermediate translations are captured by the combination of all metrics.
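As an illustration, erasure can be computed from the sequence of translations shown to the user. A minimal sketch, under the simplifying assumption that a revision erases every token after the longest common prefix of consecutive outputs (the function name and interface are our own):

```python
def erasure(updates):
    """Normalized erasure: tokens erased across all revisions,
    divided by the length of the final translation.

    Each element of `updates` is one displayed translation, as a
    list of tokens, in the order shown to the user.
    """
    erased = 0
    for prev, curr in zip(updates, updates[1:]):
        # Length of the longest common prefix of prev and curr.
        common = 0
        for a, b in zip(prev, curr):
            if a != b:
                break
            common += 1
        # Tokens of prev after the common prefix were erased.
        erased += len(prev) - common
    return erased / len(updates[-1])
```

For example, the sequence `["the"]`, `["the", "cat"]`, `["a", "cat", "sat"]` erases two tokens in its second revision, giving an erasure of 2/3 of a word per output word.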

It is important to recognize the inherent trade-offs between these different aspects of quality. The transcription feature implements real-time translation by stacking machine translation on top of real-time automatic speech recognition: a new translation is generated for every update of the recognized transcript, and updates can occur several times per second. This approach places the feature at one extreme of the three-dimensional quality framework: it exhibits minimal lag and the best final quality, but also high erasure. Recognizing this allows us to look for a better balance.
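The basic loop is simply to re-translate the entire recognized transcript on every ASR update. A minimal sketch, where `translate` stands in for the machine translation model and `asr_updates` for the stream of ASR hypotheses (both names are our own):

```python
def retranslate(asr_updates, translate):
    """Naive retranslation: on every update of the recognized
    transcript (possibly several per second), re-translate the
    whole transcript and replace the displayed text."""
    displayed = []
    for transcript in asr_updates:
        displayed.append(translate(transcript))
    return displayed  # sequence of translations shown to the user
```

Because each update re-translates from scratch, the output can change arbitrarily between updates, which is exactly the source of the high erasure described above.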

Stabilized retranslation

A straightforward way to reduce erasure is to lower the frequency of translation updates. Along this line, “streaming translation” models (such as STACL and MILk) learn to recognize when enough source information has been received to safely extend the translation, so the translation never needs to be changed. In doing so, streaming translation models achieve zero erasure. The disadvantage of such models is that they again take an extreme position: zero erasure requires sacrificing BLEU and lag. Rather than eliminating erasure completely, a small budget of occasional instability may allow better BLEU and lag. More importantly, streaming translation requires training and maintaining specialized models just for real-time translation. This rules out streaming translation in some settings, because keeping a lean pipeline is an important consideration for a product like Google Translate that supports more than 100 languages. In our second paper, “Re-translation versus Streaming for Simultaneous Translation”, we show that our original “retranslation” approach to real-time translation can be fine-tuned to reduce erasure and achieve a more favorable erasure/lag/BLEU trade-off. Without training any specialized models, we apply a pair of inference-time heuristics to the original machine translation model: masking and biasing. The end of an in-progress translation tends to flicker because it is more likely to depend on source words that have not yet arrived. We reduce this by truncating the last few words of the translation until the end of the source sentence has been observed. This masking process thus trades lag for stability, without affecting quality. It is very similar to the delay-based strategies used in streaming approaches such as Wait-k, but is applied only during inference rather than during training.
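The masking heuristic can be sketched as follows: hide the last k tokens of the candidate translation while the source sentence is still in progress. The name `mask_tail` and the token-list interface are our own assumptions:

```python
def mask_tail(tokens, k, source_finished):
    """Mask-k heuristic: while the source sentence is still being
    spoken, hide the last k tokens of the candidate translation,
    since they are the most likely to change on the next update."""
    if source_finished or k == 0:
        return tokens
    # Once the source sentence is complete, the full translation
    # is shown; until then, the volatile tail is withheld.
    return tokens[:-k] if k < len(tokens) else []
```

Larger k means more stability (fewer revisions reach the screen) at the cost of a longer wait before each word appears, which is the lag/erasure trade made explicit.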
Neural machine translation often oscillates between equally good translations, causing unnecessary erasure. We improve stability by biasing the output toward what we have already shown the user. On top of reducing erasure, biasing also tends to reduce lag by stabilizing translations earlier. Biasing interacts well with masking, as masked words, which are the most likely to be unstable, are also prevented from being favored by the model. However, this process does require careful tuning, since too much bias, combined with insufficient masking, can harm quality. The combination of masking and biasing yields a retranslation system with high quality and low latency, while virtually eliminating erasure. The table below shows how the metrics respond to the heuristics we introduced and how they compare to the other systems discussed above. Even with a very small erasure budget, retranslation surpasses the zero-flicker streaming translation systems (MILk and Wait-k) that were trained specifically for real-time translation.

  System                 BLEU   Lag   Erasure
  Retranslation (old)    20.4   4.1   2.1
  + Stable (new)         20.2   4.1   0.1

Comparison of retranslation, stabilized retranslation, and dedicated streaming models (Wait-k and MILk) on WMT 14 English-German. The BLEU/lag trade-off curve for retranslation is obtained through different combinations of bias and masking while maintaining an erasure budget of less than 2 erased words per 10 generated words. Retranslation achieves BLEU/lag trade-offs as good as or better than those of the streaming models, which cannot revise their output and require specialized training for each trade-off point.
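A greedy, per-position sketch of the biasing heuristic. The real system biases beam search over model scores; the interpolation weight `beta`, the per-position dictionaries of probabilities, and the divergence cut-off below are simplifications of our own:

```python
def biased_decode(step_probs, prev_output, beta):
    """Greedy decoding biased toward the previously displayed
    translation: at each position i the probability mass is
    interpolated with an indicator of the token shown last time,
    p'(w) = (1 - beta) * p(w) + beta * [w == prev_output[i]].
    Biasing stops as soon as the new output diverges from the
    previous one."""
    out = []
    for i, probs in enumerate(step_probs):
        biased = dict(probs)
        if i < len(prev_output):
            for w in biased:
                biased[w] *= (1 - beta)
            tok = prev_output[i]
            biased[tok] = biased.get(tok, 0.0) + beta
        out.append(max(biased, key=biased.get))
        if i < len(prev_output) and out[-1] != prev_output[i]:
            prev_output = prev_output[:i]  # diverged: stop biasing
    return out
```

With `beta = 0` the decoder is free to flicker between near-equivalent outputs; raising `beta` makes it stick with what the user has already read unless the model strongly prefers a change.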

Conclusion

The solution outlined above returns a good translation very quickly, while allowing it to be revised as more of the source sentence is spoken. The simple structure of retranslation lets us apply our best speech and translation models with minimal effort. However, reducing erasure is only part of the story: we are also looking to improve the overall voice translation experience with new technology that can reduce lag in the translation, or enable better transcriptions when multiple people are speaking.
