“This is the first day of my participation in the November Gwen Challenge. See the event details: The Last Gwen Challenge of 2021.” At work I was given a voice-broadcast requirement: after receiving a push message, the client reads aloud the content carried in the push, much like the payment-received announcements in WeChat and Alipay. After some research, the main voice-broadcast options are as follows:

  1. A third-party TTS SDK, such as Baidu, AISpeech, iFlytek, etc.;
  2. A self-developed native TTS engine plus model;
  3. A cloud-based TTS service;
  4. The TTS engine built into the phone.

In essence, the choice boils down to self-development versus procurement, and on-device versus cloud.

Let’s talk a little bit about the development of TTS.

Status and development of TTS

Speech synthesis, also known as Text to Speech (TTS), is an important research direction in the field of speech processing, aiming to make machines generate natural and pleasant human speech.

TTS technologies fall into two main categories:

  • General-purpose TTS: suitable for navigation, voice broadcast, intelligent customer service, and most voice interaction scenarios;
  • Personalized TTS: mainly applied to scenarios with high demands on voice quality, such as education, long-form audio, livestreaming, and dubbing for film, TV, and games.

After a long period of development, speech synthesis models have evolved from the initial splicing-based (concatenative) synthesis, through parametric synthesis, to the current stage of emotion-aware end-to-end synthesis. The latest generation of end-to-end synthesis lowers the requirement for linguistic expertise, makes it possible to build multilingual synthesis systems in batches, and produces highly natural speech.

Speech synthesis technology is internally divided into front end and back end.

  • The front end is mainly responsible for linguistic analysis and processing of the text, including language identification, word segmentation, part-of-speech prediction, polyphonic-character disambiguation, prosody prediction, emotion, and so on. After the pronunciation of the text is predicted, the information is sent to the TTS back end, where the acoustic system combines it and converts the content into speech.
  • From the first generation of concatenative synthesis, through the second generation of parametric synthesis, to the third generation of end-to-end synthesis, the back-end acoustic system has become increasingly intelligent, while the granularity and difficulty of annotating training data have gradually decreased.
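As an illustration of the front-end steps above, here is a minimal Java sketch of polyphonic-character disambiguation, one of the tasks listed. The dictionary entries and the `<g2p:...>` fallback marker are hypothetical; real front ends use statistical or neural models for each stage.

```java
import java.util.*;

// A minimal sketch of one TTS front-end task: resolving polyphonic characters
// by looking up phrase-level readings. The entries below are illustrative.
class TtsFrontend {
    // Hypothetical polyphonic-character table: phrase -> pinyin reading.
    private static final Map<String, String> POLYPHONE = Map.of(
            "银行", "yin2 hang2",   // "bank": 行 reads hang2
            "行走", "xing2 zou3"    // "walk": 行 reads xing2
    );

    // Resolve a word's pronunciation, preferring phrase-level entries so
    // polyphonic characters get the right reading; unknown words fall through
    // to a placeholder that a real g2p (grapheme-to-phoneme) model would fill.
    static String toPinyin(String word) {
        return POLYPHONE.getOrDefault(word, "<g2p:" + word + ">");
    }

    // Front-end pipeline sketch: already-segmented text -> phoneme sequence
    // handed off to the back-end acoustic system.
    static List<String> analyze(List<String> segmentedText) {
        List<String> phonemes = new ArrayList<>();
        for (String word : segmentedText) {
            phonemes.add(toPinyin(word));
        }
        return phonemes;
    }

    public static void main(String[] args) {
        System.out.println(analyze(List.of("银行", "行走")));
        // prints [yin2 hang2, xing2 zou3]
    }
}
```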

Application of TTS in voice interaction scenarios:

Splicing-based synthesis

To better approximate a human voice, splicing synthesis requires a large-scale recorded voice library, annotated by phoneme and other features. During synthesis, the phoneme units corresponding to the linguistic features are looked up and spliced together to produce the output.

  • Advantages: the result is close to a real human voice, and the computation cost is low;
  • Disadvantages: splices can sound inconsistent; the approach depends heavily on the voice library and requires extensive manually crafted selection rules and parameters, so production cost is high.
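The lookup-and-splice process can be sketched as follows. The unit library and its byte contents are illustrative placeholders, not a real voice library, and a real system would also smooth the joins to reduce audible discontinuities:

```java
import java.io.ByteArrayOutputStream;
import java.util.*;

// A toy sketch of concatenative synthesis: look up pre-recorded PCM units in a
// voice library and splice them into one output buffer.
class ConcatSynth {
    private final Map<String, byte[]> unitLibrary;

    ConcatSynth(Map<String, byte[]> unitLibrary) {
        this.unitLibrary = unitLibrary;
    }

    // Concatenate the PCM unit for each phoneme in order.
    byte[] synthesize(List<String> phonemes) {
        ByteArrayOutputStream pcm = new ByteArrayOutputStream();
        for (String p : phonemes) {
            byte[] unit = unitLibrary.get(p);
            if (unit == null) {
                throw new IllegalArgumentException("unit not in library: " + p);
            }
            pcm.write(unit, 0, unit.length);
        }
        return pcm.toByteArray();
    }

    public static void main(String[] args) {
        ConcatSynth s = new ConcatSynth(
                Map.of("ni3", new byte[]{1, 2}, "hao3", new byte[]{3}));
        System.out.println(Arrays.toString(s.synthesize(List.of("ni3", "hao3"))));
        // prints [1, 2, 3]
    }
}
```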

Parameter-based synthesis

Deep learning is used to learn the mapping between text features and acoustic features, producing a parametric synthesis model. When a linguistic feature is input, a neural network predicts the corresponding audio features, and a vocoder then synthesizes the speech waveform.

  • Advantages: the voice library does not need to be large, the joins in the synthesized speech are stable, and quality is high;
  • Disadvantages: strong dependence on the vocoder; moreover, information loss in the modeling of traditional parametric systems limits further improvement of the expressiveness of the synthesized speech.
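The two-stage structure described above (acoustic model, then vocoder) can be sketched like this. Both stages here are toy stand-ins, a hash-based fake F0 predictor and a sine-wave "vocoder", not the neural models the text describes:

```java
// A toy sketch of the parametric pipeline: an acoustic model maps a linguistic
// feature to acoustic parameters (here just a pitch in Hz), and a "vocoder"
// turns those parameters into a waveform.
class ParametricSynth {
    // Stand-in acoustic model: linguistic feature -> fundamental frequency.
    static double acousticModel(String feature) {
        return 100.0 + (feature.hashCode() & 0xFF); // deterministic fake F0
    }

    // Stand-in vocoder: generate a sine wave at the predicted F0.
    static double[] vocoder(double f0, int samples, double sampleRate) {
        double[] wave = new double[samples];
        for (int i = 0; i < samples; i++) {
            wave[i] = Math.sin(2 * Math.PI * f0 * i / sampleRate);
        }
        return wave;
    }

    static double[] synthesize(String feature) {
        // 160 samples = 10 ms of audio at 16 kHz
        return vocoder(acousticModel(feature), 160, 16000.0);
    }

    public static void main(String[] args) {
        System.out.println(synthesize("ni3").length); // prints 160
    }
}
```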

End-to-end synthesis (Tacotron as an example)

End-to-end speech synthesis addresses, to some extent, the defects of both splicing and parametric synthesis. An end-to-end system takes text or phonetic characters as input and models the mapping from text (or text features) to speech directly, which reduces the dependence on the vocoder and weakens the notion of a separate front end.

  • Advantages: lowers the requirement for phonetic expertise, can easily be replicated across languages to build synthesis systems for dozens of languages in batches, and produces highly natural speech;
  • Disadvantages: computationally heavy, cannot be tuned manually, and poor real-time performance.

Scheme selection

When this feature was being built, the company already had a team working on a TTS engine; a cloud TTS service was already running, and some external cloud TTS services had also been purchased. However, given the volume of our push messages, both the purchased and the self-developed cloud engines would be very expensive, so for cost reasons we could only consider an on-device solution.

There are three options for the client-side implementation:

  1. Third-party TTS SDK: eliminated for cost reasons;
  2. Self-developed engine: the voice team had already built a parametric synthesis engine, but had no manpower to support further tuning. Since the broadcast content is fixed and the quality requirement for the synthesized voice is not especially high, a splicing-based scheme was chosen as the fallback: the fixed leading and trailing parts of the sentence use full pre-recorded audio, and the variable part in between is synthesized word by word;
  3. System TTS engine: Android ships with a TTS engine, but not every phone comes with a Chinese engine.
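The fallback splicing scheme in option 2 can be sketched as follows: fixed phrases come from full pre-recorded clips, and the variable amount in the middle is spliced digit by digit. The clip file names here are hypothetical:

```java
import java.util.*;

// A sketch of the fallback splicing scheme: fixed lead-in/lead-out phrases use
// full pre-recorded clips, and the variable amount is spliced per digit.
class PushAnnouncer {
    // Build the clip playlist for an announcement like "Payment received, 35 yuan".
    static List<String> buildPlaylist(String amountDigits) {
        List<String> clips = new ArrayList<>();
        clips.add("prefix_payment_received.pcm");   // fixed leading phrase
        for (char d : amountDigits.toCharArray()) {
            clips.add("digit_" + d + ".pcm");       // per-digit clip
        }
        clips.add("suffix_yuan.pcm");               // fixed trailing phrase
        return clips;
    }

    public static void main(String[] args) {
        System.out.println(buildPlaylist("35"));
        // prints [prefix_payment_received.pcm, digit_3.pcm, digit_5.pcm, suffix_yuan.pcm]
    }
}
```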

Android TTS engine

Initialize the TTS engine

    class TTSListener implements TextToSpeech.OnInitListener {
        @Override
        public void onInit(int status) {
            if (status == TextToSpeech.SUCCESS && mSpeech != null) {
                // Whether the engine supports Chinese
                int isSupportChinese = mSpeech.isLanguageAvailable(Locale.CHINESE);
                // Maximum input text length the engine accepts
                int maxLength = TextToSpeech.getMaxSpeechInputLength();
                if (isSupportChinese == TextToSpeech.LANG_AVAILABLE) {
                    int setLanRet = mSpeech.setLanguage(Locale.CHINESE); // set the language
                    int setSpeechRateRet = mSpeech.setSpeechRate(1.0f);  // set the speech rate
                    int setPitchRet = mSpeech.setPitch(1.0f);            // set the pitch
                    String defaultEngine = mSpeech.getDefaultEngine();
                    // The TextToSpeech engine initialized successfully
                }
            } else {
                // Failed to initialize the TextToSpeech engine
            }
        }
    }

    TextToSpeech mSpeech = new TextToSpeech(ContextHolder.appContext(), new TTSListener());

Set the broadcast state callback

    mSpeech.setOnUtteranceProgressListener(new UtteranceProgressListener() {
        @Override
        public void onStart(String utteranceId) {
            // playback started
        }

        @Override
        public void onDone(String utteranceId) {
            // playback finished
        }

        @Override
        public void onError(String utteranceId) {
            // playback error
        }
    });

Start the broadcast

    String utteranceId = String.valueOf(System.currentTimeMillis());
    HashMap<String, String> ttsOptions = new HashMap<>();
    // id for matching this request in the UtteranceProgressListener callbacks
    ttsOptions.put(TextToSpeech.Engine.KEY_PARAM_UTTERANCE_ID, utteranceId);
    // volume
    ttsOptions.put(TextToSpeech.Engine.KEY_PARAM_VOLUME, String.valueOf(1));
    // audio stream to play on
    ttsOptions.put(TextToSpeech.Engine.KEY_PARAM_STREAM, String.valueOf(AudioManager.STREAM_NOTIFICATION));
    int ret = mSpeech.speak(ttsText, TextToSpeech.QUEUE_FLUSH, ttsOptions);
    if (ret == TextToSpeech.SUCCESS) {
        // the speak request was queued successfully
    }

Convert text to voice files

    String utteranceId = String.valueOf(System.currentTimeMillis());
    File file = new File("/sdcard/audio_" + utteranceId + ".wav");
    int ret = mSpeech.synthesizeToFile("xxxxx", null, file, utteranceId);
    if (ret == TextToSpeech.SUCCESS) {
        // the file synthesis request was queued successfully
    }

Stop playing

Call mSpeech.stop(); this triggers the onDone callback of the UtteranceProgressListener.

Destroy the engine

    mSpeech.shutdown();

Android11 adaptation

To be compatible with Android 11 phones, we need to add the following declaration to the application's AndroidManifest.xml:

  <queries>
    <intent>
      <action android:name="android.intent.action.TTS_SERVICE"/>
    </intent>
  </queries>

Note: the Android Gradle Plugin must be version 3.5.4 or later to recognize the queries element.

TextToSpeech compatibility issues across phones

The system's built-in TextToSpeech API is very convenient to use, but phones without a Chinese engine are troublesome, and there was no way to know in advance which phones support it and which do not. After a round of online data collection, we obtained the Chinese-engine support status of TextToSpeech on phones currently in the market. The ninety unsupported models are as follows:

  • Xiaomi MI 9 Transparent Edition
  • Xiaomi MI 8
  • OnePlus GM1900
  • OnePlus GM1910
  • Redmi M2006C3LC
  • OnePlus IN2020
  • meizu 16T
  • Xiaomi MI MAX 3
  • OnePlus ONEPLUS A6010
  • OnePlus ONEPLUS A5010
  • meizu meizu 17
  • Redmi Redmi K30
  • OnePlus KB2000
  • OnePlus HD1900
  • OnePlus ONEPLUS A6000
  • meizu 16s
  • Xiaomi M2102K1C
  • Xiaomi M2002J9E
  • Redmi M2007J17C
  • OnePlus IN2010
  • meizu meizu 17 Pro
  • nubia NX659J
  • meizu 16s Pro
  • meizu meizu 16Xs
  • Xiaomi Mi 10
  • Lenovo Lenovo L78051
  • meizu MEIZU 18
  • OnePlus HD1910
  • Hisense HLTE226T
  • xiaomi Redmi Note 8
  • Redmi Redmi K30i 5G
  • Redmi M2007J3SC
  • Redmi M2004J19C
  • Redmi Redmi Note 8 Pro
  • Redmi M2104K10AC
  • xiaomi Redmi Note 7
  • Redmi M2003J15SC
  • Xiaomi MIX 2S
  • Redmi Redmi K30 Pro
  • nubia NX627J
  • Xiaomi MI CC9 Pro
  • Redmi Redmi K30 5G
  • meizu MEIZU 18 Pro
  • Xiaomi MI 9
  • Xiaomi M2102K1AC
  • Xiaomi MI 8 UD
  • blackshark AWM-A0
  • Xiaomi M2011K2C
  • Xiaomi MI 8 Lite
  • Sony XQ-AT72
  • Xiaomi Mi 10 Pro
  • Xiaomi M2102J2SC
  • OnePlus ONEPLUS A5000
  • Xiaomi M2101K9C
  • Redmi M2103K19C
  • xiaomi Redmi Note 7 Pro
  • nubia NX616J
  • Redmi M2012K10C
  • Xiaomi MIX 3
  • Redmi M2004J7AC
  • Xiaomi MI CC9 Pro Premium Edition
  • nubia NX619J
  • Xiaomi M2007J1SC
  • koobee X60 Pro
  • Xiaomi Redmi K20 Pro
  • SMARTISAN DT1901A
  • Redmi M2004J7BC
  • asus ASUS_I001DA
  • HONOR HLK-AL00a
  • Redmi M2006J10C
  • Redmi M2012K11AC
  • blackshark SHARK PRS-A0
  • HONOR BKL-AL20
  • SMARTISAN DT1902A
  • ZTE ZTE A2322
  • Redmi M2007J22C
  • blackshark SKW-A0
  • Nokia Nokia X7
  • Redmi M2012K11C
  • HUAWEI MAR-AL00
  • Redmi Redmi K30 Pro Zoom Edition
  • nubia NX669J
  • Meizu 16 X
  • Xiaomi Redmi K20 Pro Premium Edition
  • asus ASUS_I005DA
  • blackshark SHARK KLE-A0
  • Xiaomi MI 6
  • motorola XT2125-4
  • GIONEE 20190619G
  • HUAWEI ART-AL00m

Some vendors provide better support. On Huawei phones, for example, you can choose the voice in the system settings: under Settings > Accessibility > Text-to-speech, you can select the engine and adjust the speech rate and pitch:

Then tap the engine's settings option to install the voice packs provided by the engine and select a speaker:

Conclusion

This article summarized the current state and development of TTS, introduced several ways to implement TTS on mobile, and showed how to use the system TextToSpeech API, its limitations, and how to handle compatibility.