Not long ago, the monthly active users of OPPO's artificial intelligence assistant "Xiaobu Assistant" exceeded 100 million, making it the first mobile phone voice assistant in China to surpass 100 million monthly active users.

After more than two years of growth, Xiaobu Assistant's capabilities have been greatly upgraded, and convenient service functions are now close at hand for users. The Xiaobu team has overcome many technical difficulties along the way to bring users more intelligent services. To that end, the team has written a series of articles detailing the technology behind Xiaobu Assistant. This article is the first in the series and focuses on the design and evolution of the system architecture.

1. Industry value

1.1 Introduction

The dialogue system is a technology that has been researched for nearly 30 years and represents the future of human-computer interaction. In the past decade, with gradual breakthroughs in speech and NLP and increasingly mature industrial applications, its user value and industry scale have risen rapidly.

From the scenario perspective, dialogue systems can be roughly divided into three categories.

  • Task-based: precise answers, closed domain, with the goal of satisfying the user in as few interactions as possible, such as setting an alarm.

  • Question-and-answer type: broad answers, limited domain, with the goal of satisfying the user in the simplest interaction, such as encyclopedia questions.

  • Chitchat type: broad answers, open domain, with the number of conversation turns as the goal.

The intelligent assistant is a product form of the dialogue system that is primarily task-oriented while also integrating question answering and chitchat. It has great potential in the industry.

1.2 Intelligent Assistant

With the advent of AIoT and the interconnection of everything, clusters of smart devices increasingly rely on intelligent assistants for natural human-computer interaction. Smart assistants will cover massive numbers of devices, leaving great room for imagination.

UK market research firm Juniper Research predicts that the number of devices equipped with smart assistants will increase from 2.5 billion at the end of 2018 to 8 billion by 2023.

At the user level, although the intelligent assistant is still a niche feature, familiarity and acceptance are gradually increasing as smart devices become more widespread and early users are cultivated, leaving considerable room for growth.

The user value of smart assistants is threefold:

  1. Efficiency
  2. Personality
  3. Emotion

As the industry matures further, beyond small-screen, screenless, and large-screen devices, more smart devices for vertical scenarios and user groups are emerging, such as educational smart screens, story machines, and AI learning machines.

Xiaobu Assistant is OPPO's intelligent assistant. It covers all kinds of the company's terminal devices, keeps adding new entry points, and supports many task-oriented, question-answering, and chitchat vertical domains.

As the "brain" of the intelligent assistant, the dialogue system is one of its most core technologies. Only with a dialogue system can an intelligent assistant understand users' demands and meet their needs for efficiency, personality, and emotion through a conversational service.

2. Industry architecture

2.1 Overview

Let us first introduce the typical architectures of a dialogue system. In academia, dialogue systems mainly follow two architectures: Pipeline and end-to-end (E2E). Pipeline is widely used in industry, while E2E is still at the exploratory stage.

Pipeline Modular Architecture

ASR (Automatic Speech Recognition)

Receives audio input and outputs the transcribed sentence text. It generally consists of four parts: signal processing, acoustic model, decoder, and post-processing. The sound is first collected and the signal processed, converting the speech signal to the frequency domain and extracting a feature vector from every N milliseconds of speech, which is fed to the acoustic model. The acoustic model classifies the audio into different phonemes, the decoder then finds the word sequence with the highest probability, and the final post-processing step combines the words into readable text.

NLU (Natural Language Understanding)

Responsible for converting natural language into structured data that a computer can process. It receives text input and outputs a structured triple of Domain + Intent + Slot. Semantic parsing is mainly performed through word segmentation, part-of-speech tagging, named entity recognition, syntactic parsing, and anaphora resolution.
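
To make the triple concrete, here is a minimal sketch; the class and field names are illustrative assumptions, not Xiaobu's actual protocol:

```java
import java.util.Map;

// Illustrative NLU output: a Domain + Intent + Slot triple.
// Class and field names are assumptions for illustration, not Xiaobu's actual protocol.
public final class NluResult {
    private final String domain;             // e.g. "alarm"
    private final String intent;             // e.g. "alarm.create"
    private final Map<String, String> slots; // e.g. {"date": "tomorrow", "time": "07:30"}

    public NluResult(String domain, String intent, Map<String, String> slots) {
        this.domain = domain;
        this.intent = intent;
        this.slots = slots;
    }

    public String getDomain() { return domain; }
    public String getIntent() { return intent; }
    public Map<String, String> getSlots() { return slots; }
}
```

A query such as "set an alarm for 7:30 tomorrow" would then map to domain `alarm`, intent `alarm.create`, and slots `{date: tomorrow, time: 07:30}`.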

DM (Dialog Management)

Responsible for controlling the entire conversation process. It takes the output of NLU, maintains context state and dialogue policies, and outputs the action to perform, such as asking the user a follow-up question to obtain necessary information. DM is the core of the dialogue system and consists of two important modules: Dialog State Tracking (DST below) and Dialog Policy (DP below). DST records the state at time T-1 or even T-N together with the state at the current time T, and determines the current session state in combination with the context. DP decides which action to perform based on the session state and the specific task.
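
As a rough illustration of the DST/DP split, the sketch below (interface and class names are hypothetical, reusing the `NluResult` sketch above) keeps the last N turns as the tracked state and lets a policy choose the next action from it:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch of the DST / DP split inside DM, not Xiaobu's real implementation.
class DialogState {
    final Deque<NluResult> recentTurns = new ArrayDeque<>(); // states at T-N ... T

    void addTurn(NluResult turn, int maxTurns) {
        recentTurns.addLast(turn);
        if (recentTurns.size() > maxTurns) {
            recentTurns.removeFirst(); // keep only the last N turns
        }
    }
}

interface DialogStateTracker {
    // Combine the current NLU output with the context to produce the new session state.
    DialogState track(DialogState previous, NluResult current);
}

interface DialogPolicy {
    // Decide which system action to perform given the tracked session state,
    // e.g. "REQUEST_SLOT:time" or "EXECUTE:alarm.create".
    String decide(DialogState state);
}
```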

ASR and NLU determine the lower limit of speech interaction, while DM determines the upper limit of speech interaction.

NLG (Natural Language Generation)

Generates the reply content according to the system action output by DM. In general there are rule/template-based approaches and deep-learning-based approaches.
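
For the rule/template-based approach, a minimal sketch can be as simple as filling slot values into a reply template; the template strings and action names below are made up for illustration:

```java
import java.util.Map;

// Minimal template-based NLG sketch: fill slot values into a reply template.
public class TemplateNlg {
    private static final Map<String, String> TEMPLATES = Map.of(
            "alarm.create.success", "OK, I've set an alarm for {time}.",
            "alarm.create.request_time", "What time should I set the alarm for?");

    public static String generate(String action, Map<String, String> slots) {
        String template = TEMPLATES.getOrDefault(action, "Sorry, I didn't get that.");
        for (Map.Entry<String, String> e : slots.entrySet()) {
            template = template.replace("{" + e.getKey() + "}", e.getValue());
        }
        return template;
    }

    public static void main(String[] args) {
        System.out.println(generate("alarm.create.success", Map.of("time", "07:30")));
        // -> OK, I've set an alarm for 07:30.
    }
}
```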

TTS (Text To Speech)

Converts the reply text to speech. It needs to control the pronunciation of polyphonic characters and the prosody, such as where to pause and which words are stressed or unstressed.

Summary: the advantage of the modular Pipeline architecture is that it is highly interpretable and easy to implement; most task-oriented dialogue systems in the industry are based on it. The disadvantage is that the modules are relatively independent, joint tuning is difficult, and errors accumulate layer by layer across modules.

E2E end-to-end architecture

In recent years, with the development of end-to-end neural generation models, end-to-end trainable frameworks have been built for dialogue systems. This type of architecture trains an overall mapping from user-side natural language input to machine-side natural language output (i.e., combining NLU, DM, and NLG into one module). It generalizes and transfers well and breaks the isolation between the modules of the traditional Pipeline architecture. However, the end-to-end model has high requirements on data quantity and quality, its behavior is not controllable, and its modeling of processes such as slot filling and API calls is not clear enough, so its industrial application is still being explored.

Next, we look at typical industry implementations of different types of dialog systems.

2.2 Microsoft Xiaoice: chitchat-type dialogue system

Microsoft Xiaoice is an open-domain social chatbot, featuring "EQ" (emotional intelligence). The effectiveness of a chatbot is generally evaluated by CPS (the number of conversation turns per session): the larger the CPS, the more engaged the chatbot is in the conversation. Xiaoice's average CPS is 23 turns (April 2017 data).

The diagram below shows the overall structure of Xiaoice. It consists of three layers: the user experience layer, the dialogue engine layer and the data layer.

User Experience Layer

This layer connects Xiaoice to popular chat platforms (such as WeChat and QQ) and communicates with users in two modes: full-duplex mode and turn-by-turn dialogue mode. It also includes a set of components for handling user input and Xiaoice responses, such as speech recognition and synthesis, image understanding, and text normalization.

Dialog engine layer

Composed of the dialogue manager, the empathetic computing module, Core Chat, and dialogue skills. The dialogue manager consists of DST and DP. Empathetic computing takes user data, Xiaoice's persona settings, and other data as input, and the computed features are fed to DM and the skills. Chat and skills are fused using two different schemes: generative and retrieval-based.

The data layer

Stores the collected conversation data (text pairs or text-image pairs), the non-conversation data and knowledge graphs used by Core Chat and the skills, and the profiles of Xiaoice and all registered users.

For more details, see: https://arxiv.org/pdf/1812.08…

2.3 AliMe: question-and-answer dialogue system

AliMe follows a classic Pipeline architecture. Since customer-service bot scenarios are all text interactions on web pages, the ASR and TTS modules are not involved.

It has been domain-ized and platform-ized, supporting the Ali ecosystem, the merchant ecosystem, and the enterprise ecosystem through PaaS and SaaS output. The entire dialogue management process is modularized, building a parallel architecture with pluggable algorithm and business modules.

For more details, see: https://zhuanlan.zhihu.com/p/…

2.4 DuMi, Xiao Ai, Alexa and other intelligent assistants

They are mainly task-oriented, while also including chat and Q&A. DuMi and Xiao Ai are based on the classic Pipeline architecture. The following briefly introduces Xiao Ai as an example.

Xiao Ai:

  1. Multi-path recall dialogue management, with complete NLU and Action for each vertical domain
  2. Traffic is distributed in full to every vertical domain, and an intent prediction model is used to reduce traffic
  3. The Policy of the central-control DM module selects the intent and returns the result

2.5 Open source solution: RASA

RASA is based on the Pipeline architecture:

  1. The Interpreter is responsible for NLU, and Tracker + Policy + Action are responsible for DM
  2. Modular design; the Interpreter pipeline in particular is customizable
  3. Actions are isolated, the biggest change, and can be hosted on external action servers
  4. Extensive configuration-driven design; dialogue flows are developed through rule configuration
  5. Rasa X provides a conversation-driven development solution with evaluation, annotation, and testing platforms

3. Xiaobu Assistant engineering practice

3.1 Dialog system architecture design and evolution

The overall system of OPPO's Xiaobu Assistant is layered as follows:

The dialogue system, built with reference to the classic Pipeline architecture, consists of the user domain, dialogue domain, and semantic domain on the left.

In addition to the basic experience related to voice output, the evolution goal of the dialogue system can be roughly divided into two stages.

  1. Improve skill coverage and skill intent recognition
  2. Dig deeper into and improve skill satisfaction, and build highlight skills

Stage 1 centers on rapid iteration of vertical domains, and Stage 2 centers on common capability building and dialogue/semantic optimization within the vertical domains.

Stage 1: Rapid iteration of vertical domains

Skill coverage and single-turn intent recognition are the main goals; the dialogue system only needs to provide basic strong and weak multi-turn capabilities to meet the demands of this stage. Each vertical domain sets its own goals and iterates at a fast pace, with low coupling between vertical domains.

The design principles are:

  1. Conway's law: [vertical domain (algorithm + engineering)]. The business is divided by feature team, and each vertical domain's service is split into algorithm and engineering parts. Services are divided on this basis and are responsible for complete dialogue management and semantic understanding
  2. Low coupling: the engineering of different vertical domains is not coupled; apart from the algorithm's global ranking decision, the NLU of each vertical domain is not coupled either
  3. High cohesion: the framework abstracts common dialogue management functions, the central control is responsible for global scheduling, and each vertical domain service focuses on its business logic

Stage 2: Common capability building and vertical domain optimization

Once skill coverage and single-turn intent recognition recall have been optimized to a certain extent, improving skill satisfaction shifts toward the dialogue product experience and building highlight skills.

At this stage there are many demands for common dialogue and semantic capabilities. Building them as shared components helps reduce the cost of repeated development and maintenance across vertical domains, keeps the dialogue experience consistent, and guarantees quality and performance.

The dialogue management components are currently being built and gradually decoupled.

The design principles are:

  1. Inversion of control: the vertical domain's DM service no longer directly controls the dialogue but provides the necessary information through an abstract protocol (see the sketch after this list); the framework and the common dialogue management components control and decide the dialogue. The same applies to other dialogue management components.
  2. Single responsibility: the atomic capabilities of dialogue management are broken down into dialogue components, which are orchestrated by the central control service to reduce complexity and improve reusability.
  3. Backward compatibility: existing DM services already have full dialogue management functionality; protocol extensions guarantee backward compatibility, allowing dialogue management to be hosted either by the DM service itself or by the central control.
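
A minimal sketch of what such an abstract protocol could look like is shown below; the interface and field names are assumptions, not the actual Xiaobu protocol. The vertical-domain DM only reports candidate actions, missing information, and state changes, while the framework and common dialogue components make the actual dialogue decision:

```java
import java.util.List;
import java.util.Map;

// Hypothetical protocol between a vertical-domain DM component and the central control.
// The DM does not drive the conversation itself; it only declares what it could do and
// what it still needs, and the framework decides how the dialogue proceeds.
class DmContribution {
    List<String> candidateActions;     // e.g. ["EXECUTE:alarm.create"]
    List<String> missingSlots;         // slots the framework should ask the user for
    Map<String, String> stateDelta;    // state changes the framework should persist
}

interface DialogComponent {
    // Every atomic dialogue-management capability implements the same narrow interface,
    // so the central control can orchestrate components without knowing their internals.
    DmContribution handle(Map<String, String> slots, Map<String, String> sessionState);
}
```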

In addition to the strong/weak multi-turn capabilities and intent recognition already supported in Stage 1, the following dialogue capabilities will be gradually built as product features land, creating the dialogue product experience and highlight skills.



3.2 Dialogue framework

In the past, the business services that iterated most frequently were DM and NLU, which implement dialogue logic and semantic understanding respectively. To solve the common problems of DM service development and NLU service development, two frameworks were abstracted: the DM framework and the DAG framework.

DM framework

The DM service takes the domain, intent, slots, and dialog state as input, and outputs the skill's action and the new dialog state. Xiaobu Assistant's DM services have gone through two phases (a sketch of the contract follows the list below):

  1. In the multi-path dialogue management phase, DM services were responsible for the complete dialogue management capability
  2. In the central-control dialogue management phase, the DM service is responsible only for outputting the Action, and dialogue management is delegated to the upper-layer central control service
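
Expressed as an interface, the contract might look roughly like this (the class and field names are hypothetical): a phase 1 implementation fills in the new dialog state itself, while a phase 2 implementation only computes the Action and leaves the state decision to the central control.

```java
import java.util.Map;

// Hypothetical DM service contract: domain / intent / slots / dialog state in,
// action and new dialog state out.
class DmRequest {
    String domain;
    String intent;
    Map<String, String> slots;
    Map<String, String> dialogState;   // previous state, carried with the request
}

class DmResult {
    String action;                      // what the skill should do next
    Map<String, String> newDialogState;
}

interface DmService {
    DmResult handle(DmRequest request);
}
```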

Analyzing the common problems of business DM services across the two phases:

  1. Business processes are highly similar, which lays the foundation for unifying the business process
  2. Dialogue management capabilities are built repeatedly
  3. Code structure varies greatly between services, which is unfriendly to newcomers reading the code
  4. Each DM service provides its own SDK for upper-layer invocation, so interfaces and protocols cannot be managed uniformly and centrally

The DM Service Development Framework addresses these issues and is designed according to the following principles:

  1. A layered design is adopted to decouple business logic and reduce coupling and mutual influence between businesses
  2. Spring EL expressions + annotations are used to standardize code style and readability (see the sketch after this list)
  3. Dependency Inversion + the Liskov Substitution Principle + interface-oriented programming let each business implement its differentiated logic in the upper layer
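
As an illustration of the second principle, a handler method can be selected by an annotation whose condition is a Spring EL expression evaluated against the current turn. The annotation, handler names, and dispatcher below are a sketch under that assumption, not the framework's real API:

```java
import org.springframework.expression.Expression;
import org.springframework.expression.spel.standard.SpelExpressionParser;
import org.springframework.expression.spel.support.StandardEvaluationContext;

import java.lang.annotation.*;
import java.lang.reflect.Method;
import java.util.Map;

// Hypothetical sketch: route a turn to the first handler whose SpEL condition matches.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.METHOD)
@interface DialogHandler {
    String condition(); // Spring EL expression over #intent and #slots
}

class AlarmHandlers {
    @DialogHandler(condition = "#intent == 'alarm.create' && #slots['time'] != null")
    public String createAlarm(Map<String, String> slots) { return "EXECUTE:alarm.create"; }

    @DialogHandler(condition = "#intent == 'alarm.create' && #slots['time'] == null")
    public String askForTime(Map<String, String> slots) { return "REQUEST_SLOT:time"; }
}

class HandlerDispatcher {
    private static final SpelExpressionParser PARSER = new SpelExpressionParser();

    static String dispatch(Object handlers, String intent, Map<String, String> slots) throws Exception {
        StandardEvaluationContext ctx = new StandardEvaluationContext();
        ctx.setVariable("intent", intent);
        ctx.setVariable("slots", slots);
        for (Method m : handlers.getClass().getDeclaredMethods()) {
            DialogHandler ann = m.getAnnotation(DialogHandler.class);
            if (ann == null) continue;
            Expression condition = PARSER.parseExpression(ann.condition());
            if (Boolean.TRUE.equals(condition.getValue(ctx, Boolean.class))) {
                return (String) m.invoke(handlers, slots);
            }
        }
        return "FALLBACK"; // no handler matched
    }
}
```

Each business then only declares its conditions and handler bodies, while the shared dispatcher keeps the control flow uniform.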

DAG framework

NLU is built per vertical domain. In the initial stage, Python was used to build prototypes, with a Java sidecar proxy exposing them as services.

This gradually exposed several engineering problems:

  1. The operators of each algorithm group are similar, but their calling order differs greatly; the same operators are built repeatedly, operator maintenance cost is high, and operator capabilities are not shared across groups
  2. To iterate quickly, the algorithm teams implemented capabilities in Python, and service performance became problematic
  3. To achieve capability reuse, improve monitoring, improve performance and efficiency, and support the rapid launch of new skill NLU domains, operators are layered and consolidated, and a DAG framework is used to orchestrate them

Operator hierarchy design:

The basic class library layer is responsible for the lowest-level capabilities, and the operator layer above it depends on the basic class library layer's implementation. The business layer uses the DAG framework to combine operators into the process topology graph to be executed (as shown below), so that a domain's NLU can be built quickly.
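
The toy sketch below illustrates the idea of combining operators into an executable topology; the operator names and the naive topological executor are invented for illustration and are not the actual DAG framework:

```java
import java.util.*;
import java.util.function.UnaryOperator;

// Toy DAG of NLU operators: each node transforms a shared context map and declares
// which operators must run before it. Assumes an acyclic topology; this is an
// illustrative sketch, not the actual DAG framework.
class DagSketch {
    record Node(String name, List<String> dependsOn, UnaryOperator<Map<String, Object>> op) {}

    static Map<String, Object> run(List<Node> nodes, Map<String, Object> ctx) {
        Set<String> done = new LinkedHashSet<>();
        // Naive topological execution: repeatedly run any node whose dependencies are all done.
        while (done.size() < nodes.size()) {
            for (Node n : nodes) {
                if (!done.contains(n.name()) && done.containsAll(n.dependsOn())) {
                    ctx = n.op().apply(ctx);
                    done.add(n.name());
                }
            }
        }
        return ctx;
    }

    public static void main(String[] args) {
        List<Node> topology = List.of(
            new Node("segment", List.of(),
                c -> { c.put("tokens", List.of("set", "alarm", "7:30")); return c; }),
            new Node("ner", List.of("segment"),
                c -> { c.put("entities", Map.of("time", "7:30")); return c; }),
            new Node("intent", List.of("segment", "ner"),
                c -> { c.put("intent", "alarm.create"); return c; }));
        System.out.println(run(topology, new HashMap<>(Map.of("query", "set an alarm for 7:30"))));
    }
}
```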

Benefits observed in the pilot business:

  1. Average response time decreased by 71.8%
  2. Single-instance concurrency increased 50-fold
  3. Single-skill operator code reuse rate of 95.7%

3.3 Performance optimization practices

Xiaobu Assistant pursues the ultimate user experience, and fluency is one of its most important dimensions.

We used a high-speed camera to film Xiaobu Assistant and similar products initiating the same interaction at the same time, and compared when each finally displayed the skill result. The win rate, weighted by the actual proportions of online queries, serves as the core fluency metric.

The following will mainly describe the engineering practice of fluency optimization.

Problem analysis

  1. Third-party resource execution time accounts for the largest share of server-side time, over 80%
  2. Server-side speech recognition takes the second-largest share
  3. Client-side rendering and interaction can be more concise; some vertical skills' client interactions could be simpler and execute faster

Overall solution

  1. Parallelize: prediction, and changing serial calls into parallel (concurrent) calls
  2. Prune: fast/slow layering, multi-level caching
  3. Speed up: building third-party resources in-house, cloud-side VAD, interaction simplification, implementation optimization

General idea of prediction

Prediction is a feature with high architectural complexity, so we expand on Xiaobu Assistant's practice here.

During the user's voice interaction, intermediate results of the ASR stream are continuously pushed to the screen, and the complete user audio input is only available once tail-point detection (VAD) finishes.

By exploiting business characteristics, prediction achieves "listening while thinking": the recognition process and the execution process are parallelized, shortening the serial waiting time.

There are two strategies:

  1. Parallel execution during the VAD phase: high accuracy, low gain.
  2. Parallel execution during the recognition phase: low accuracy, high gain.

Xiaobu Assistant currently uses the first strategy, a trade-off between the back-end request amplification cost and the latency gains of the optimization (sketched below).
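
A rough sketch of the first strategy using CompletableFuture is shown below; the method names are hypothetical. When the tail point is detected, the skill pipeline is started speculatively on the current ASR hypothesis, in parallel with the final recognition pass, and the speculative result is used only if the final text matches:

```java
import java.util.concurrent.CompletableFuture;

// Illustrative sketch of "listening while thinking": when VAD detects the tail point,
// speculatively run the skill pipeline on the current ASR hypothesis while the final
// recognition result is still being produced.
class PredictionSketch {

    static String answer(AsrSession asr) throws Exception {
        String hypothesis = asr.currentHypothesis();             // intermediate ASR result at the tail point
        CompletableFuture<String> speculative =
                CompletableFuture.supplyAsync(() -> runSkillPipeline(hypothesis));
        CompletableFuture<String> finalText =
                CompletableFuture.supplyAsync(asr::finalResult); // final recognition pass

        String text = finalText.get();
        if (text.equals(hypothesis)) {
            return speculative.get();          // prediction hit: the result is already (being) computed
        }
        speculative.cancel(true);              // prediction miss: discard and rerun on the final text
        return runSkillPipeline(text);
    }

    static String runSkillPipeline(String query) { return "result for: " + query; }

    interface AsrSession {
        String currentHypothesis();
        String finalResult();
    }
}
```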

Prediction has a great impact on the architecture and is difficult to implement. One request is split into N-1 informal prediction requests and 1 formal request; downstream services cannot tell whether a given request is the formal one, so a stateful service would introduce side effects and produce incorrect results.

There are three ways to solve the problem:

  1. Roll back the state after each prediction request
  2. Commit the state only after the formal request completes
  3. Transform services to be stateless

Prediction scheme 1: roll back the state after each prediction request

The implementation difficulties are that ordering is hard to guarantee and that distributed transactions are required to ensure the following steps execute within one transaction:

  1. Roll back the dialog state
  2. Execute the dialog business logic
  3. Write the new dialog state

Prediction scheme 2: commit the state after the formal request completes

Difficulties in implementation are:

  1. It is intrusive to business logic: every service that maintains business state must be modified to implement Try, Confirm, and Cancel
  2. Request amplification: back-end write requests increase by 1/N, and the number of prediction requests N is usually small

Prediction scheme 3: transform to stateless

  1. State persistence is handled uniformly upstream, and state reads and writes are carried in the request protocol; the dialog state is under 1 KB
  2. Services that cannot be made stateless simply reject prediction requests
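
A minimal illustration of carrying the dialog state in the request protocol (the field names are hypothetical):

```java
import java.util.Map;

// Hypothetical request/response protocol for a stateless DM service: the upstream layer
// persists the dialog state and passes it along with every (prediction or formal) request,
// so the service itself never writes state and prediction requests have no side effects.
class StatelessDmRequest {
    String query;
    boolean prediction;                 // true for speculative prediction requests
    Map<String, String> dialogState;    // ~1 KB of state, read-only for the service
}

class StatelessDmResponse {
    String action;
    Map<String, String> newDialogState; // returned to the upstream; only persisted
                                        // when this was the formal request
}
```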

This scheme suits Xiaobu Assistant's data volume well overall; the architecture is simpler and more elegant, and friendlier to performance and availability.

Prediction benefits

For some skills with higher hit rates, the hit rate has been raised to 70%+ and latency has dropped by 60%+

Across capabilities overall, the hit rate is 42.3% with a 43% reduction in latency

4. Challenges and prospects

As the dialogue system's algorithm solutions and product scenarios continue to expand and its call chains grow more and more complex, the engineering architecture's scalability, performance, and availability will face great challenges.

  • Algorithm solutions: NLU optimization moving from single-turn to multi-turn, dialogue decisions from rules to models, and from standardized to personalized
  • Product scenarios: multi-device, multi-entry, multi-modal

In the future, Xiaobu will consider the following directions:

  • Dialogue system component decoupling: cloud-side scalability, an in-process microkernel, components responding to algorithm and product changes, and common component libraries governing performance and availability
  • Device-cloud interaction mechanism optimization: device-side scalability, with the dialogue system responding asynchronously to device-side change events and adapting to increasingly complex interactions across multiple devices, entry points, and modalities
  • Open protocols and SDKs: internally, provide business extension points and concentrate company-wide effort on building the Xiaobu Assistant technology brand; externally, integrate with the skill platform to expand the skill ecosystem