Written by | Vincent
Edited by | Vincent
AI Front Line introduction: Did humans lose again? At the Alibaba Cloud Computing Conference Wuhan Summit on May 23, Alibaba's AI showed off its talent once more, handling more than 30 coffee orders in just 49 seconds. Ever since the eerily lifelike Duplex was unveiled at Google I/O earlier this month, AI companies around the world, especially the tech giants, have been flexing their voice muscles.






Follow the WeChat public account "AI Front" (ID: AI-front)



AI takes orders quickly and accurately: did humans lose again?

Video: v.qq.com/x/page/s066…

“Five chocolates, two vanilla lattes, the chocolates with cream.” “Two medium caramel lattes, one hot and one cold.” “Forget the chocolate.” “Six more small, lightly iced mochas, three with caramel and three with vanilla.” “And a large cold latte, no ice, half sugar, skim milk, to go.”

This happened at the Alibaba Cloud Computing Conference Wuhan Summit on May 23. Yan Zhijie, chief scientist for voice interaction at Alibaba's Machine Intelligence Technology Lab, placed an order with a machine at a peak speed of five words per second. The machine responded to every turn of the conversation precisely, while the senior barista standing by gave up: "Too fast to write it all down."

The human barista took two minutes and 37 seconds to complete the order, and only after Yan recited it a second time, compared with 49 seconds for the machine.

With an AI cashier on duty, did humans lose again? Will yet another job be replaced by AI?

If you ask this question, many experts in the field of AI will answer: no. AI is a tool, and in many scenarios it exists to assist.

For many baristas, taking orders is a boring, repetitive chore. They would rather spend their time making good coffee and talking with customers to understand their needs and gather feedback. For customers, queuing at rush hour is a headache of its own: a lot of time is wasted standing in line just to place a coffee order.

Most ordering kiosks on the market are touch-controlled, which makes it hard for customers to find what they want quickly; for guests with more customized requests, tapping through the options is even more of a hassle.

Common voice interaction products on the market currently follow a "wake word + voice command" pattern, which struggles with complex requests and falls well short of the natural back-and-forth of human conversation.

Yan Zhijie said that this interaction mode breaks completely with the traditional command-style "voice wake-up + voice command" pattern: "The streaming multi-intent spoken language understanding engine we pioneered greatly improves comprehension of people's casual, natural spoken expressions, and achieves natural, wake-word-free, fully interactive human-machine speech."

In the demonstration above, the dialogue ran through multiple turns of adding, modifying and deleting items. Throughout the exchange, the customer never had to say a rigid wake-up phrase such as "Hi, take my order"; they could simply start ordering, much closer to a natural conversation between people.

Behind the showpiece: multimodal voice interaction

The voice ordering machine is a flagship product built on the multimodal human-machine voice interaction solution from the Machine Intelligence Technology Lab of Alibaba's DAMO Academy. On one hand, the solution makes human-computer interaction in public spaces practical by fusing modalities such as voice, computer vision and touch, and takes it into business scenarios to drive commercialization. On the other hand, the streaming multi-intent spoken language understanding engine Alibaba pioneered greatly improves comprehension of people's casual, natural spoken expressions, enabling fully interactive human-machine speech.

Architecture diagram of the streaming multi-turn, multi-intent spoken language understanding algorithm

Streaming multi-turn, multi-intent spoken language understanding involves many subtasks, including entity extraction (such as product names), long-sentence semantic segmentation (splitting the streaming spoken input into semantically complete sentences), intent recognition, relation extraction (such as the relationship between a product and its attributes), entity linking, entity coreference resolution, and more.
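To make those subtasks concrete, here is a minimal, hypothetical sketch of the kind of structured interpretation they would jointly produce for one stretch of the order above. The schema, field names and attribute vocabulary are our own illustrative assumptions, not Alibaba's actual format.

```python
from dataclasses import dataclass, field

# Hypothetical schema (not Alibaba's actual format) for the structured
# interpretation that the subtasks listed above would jointly produce.

@dataclass
class Entity:
    text: str          # surface form extracted from the transcript
    type: str          # e.g. "product", "size", "flavor", "temperature"

@dataclass
class Intent:
    action: str                     # e.g. "add_item", "remove_item"
    product: Entity                 # the product this intent refers to
    attributes: list = field(default_factory=list)  # related attribute entities

# "Two medium caramel lattes, one hot and one cold. Forget the chocolate."
# Semantic segmentation splits the stream into complete sentences, relation
# extraction ties attributes to products, and coreference resolution links
# "the chocolate" back to the earlier chocolate order.
utterance_interpretation = [
    Intent("add_item", Entity("latte", "product"),
           [Entity("medium", "size"), Entity("caramel", "flavor"),
            Entity("hot", "temperature")]),
    Intent("add_item", Entity("latte", "product"),
           [Entity("medium", "size"), Entity("caramel", "flavor"),
            Entity("iced", "temperature")]),
    Intent("remove_item", Entity("chocolate", "product")),  # coreference-resolved
]

for intent in utterance_interpretation:
    print(intent.action, intent.product.text,
          [a.text for a in intent.attributes])
```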

Alibaba's multimodal human-machine voice interaction solution uses an end-to-end model instead: it maps directly from the user's streaming spoken input to the final understanding of the user's multiple intents, rather than relying on a cascade of separate subtask models, which greatly reduces the accumulation and propagation of errors between subtasks.
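The following is a minimal sketch of what "end-to-end" means here: a single sequence-to-sequence network that maps transcript token IDs straight to a flat sequence of intent tokens, with no intermediate subtask models. Vocabulary sizes, dimensions and the tokenization scheme are illustrative assumptions, not a description of Alibaba's engine.

```python
import torch
import torch.nn as nn

class Seq2SeqSLU(nn.Module):
    """Toy end-to-end spoken language understanding model (illustrative only)."""

    def __init__(self, src_vocab, tgt_vocab, hidden=256):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, hidden)
        self.tgt_emb = nn.Embedding(tgt_vocab, hidden)
        self.encoder = nn.GRU(hidden, hidden, batch_first=True)
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, tgt_vocab)

    def forward(self, src_ids, tgt_ids):
        # Encode the spoken transcript into a context state.
        _, context = self.encoder(self.src_emb(src_ids))
        # Decode the structured intent sequence conditioned on that state.
        dec_out, _ = self.decoder(self.tgt_emb(tgt_ids), context)
        return self.out(dec_out)   # logits over the intent-token vocabulary

# Toy usage: a batch of 2 transcripts (length 12) and target intent
# sequences (length 8); in practice both would come from real tokenizers.
model = Seq2SeqSLU(src_vocab=5000, tgt_vocab=300)
src = torch.randint(0, 5000, (2, 12))
tgt = torch.randint(0, 300, (2, 8))
logits = model(src, tgt)
loss = nn.CrossEntropyLoss()(logits.reshape(-1, 300), tgt.reshape(-1))
loss.backward()
```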

Architecturally, the solution separates the algorithm from the business: a business knowledge graph expresses business-specific knowledge, while a sequence-to-sequence deep learning model automatically learns the mapping from the user's spoken input to a structured expression of intent, with the knowledge graph carrying the business logic. Reinforcement learning is applied to this mapping model to achieve weak supervision. As a result, the whole system needs only a small amount of end-to-end annotated data for training, which greatly reduces the annotation burden, and the loose coupling with the knowledge graph makes it easy to extend to new businesses.
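A toy illustration of that algorithm/business separation, under our own assumptions rather than Alibaba's implementation: the learned model only emits a generic structured intent, and a small business knowledge graph decides whether that intent makes sense for this particular shop's menu. Swapping in a new business then mostly means swapping the graph, not retraining or redesigning the model.

```python
# Hypothetical menu knowledge graph: product -> attribute slots -> allowed values.
MENU_GRAPH = {
    "latte": {"size": {"small", "medium", "large"},
              "flavor": {"caramel", "vanilla"},
              "temperature": {"hot", "iced"}},
    "mocha": {"size": {"small", "medium", "large"},
              "flavor": {"caramel", "vanilla"},
              "temperature": {"hot", "iced"}},
}

def validate_intent(product, attributes):
    """Check a decoded intent against the business knowledge graph."""
    slots = MENU_GRAPH.get(product)
    if slots is None:
        return False, f"unknown product: {product}"
    for slot, value in attributes.items():
        if slot not in slots or value not in slots[slot]:
            return False, f"'{value}' is not a valid {slot} for {product}"
    return True, "ok"

print(validate_intent("latte", {"size": "medium", "flavor": "caramel"}))
print(validate_intent("latte", {"flavor": "chocolate"}))   # rejected by the graph
```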

It is understood that Alibaba Cloud's solution can sell subway tickets as well as take coffee orders. The technology has already been deployed in the Shanghai Metro: passengers say their destination and the ticket machine picks the right station and route. This is especially helpful for first-time visitors to Shanghai, who are easily bewildered by more than ten lines and over 300 stations. According to test data, buying a ticket normally takes more than 30 seconds, while buying one by voice takes only about 10 seconds.
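A simplified sketch of that voice-ticketing idea, using toy data rather than anything from the actual Shanghai Metro deployment: fuzzy-match the recognized destination against station names, then pick a route with a breadth-first search over a tiny made-up network.

```python
import difflib
from collections import deque

# Toy network: station -> directly connected stations (illustrative only).
NETWORK = {
    "People's Square": ["Nanjing East Road", "Xintiandi"],
    "Nanjing East Road": ["People's Square", "Lujiazui"],
    "Lujiazui": ["Nanjing East Road"],
    "Xintiandi": ["People's Square"],
}

def match_station(spoken):
    """Map a (possibly imperfect) ASR transcript to the closest station name."""
    hits = difflib.get_close_matches(spoken, NETWORK.keys(), n=1, cutoff=0.4)
    return hits[0] if hits else None

def shortest_route(start, goal):
    """Breadth-first search for the fewest-stop route."""
    queue, seen = deque([[start]]), {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in NETWORK[path[-1]]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

dest = match_station("lujiazui")          # e.g. the recognized destination
print(shortest_route("People's Square", dest))
```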

Careful readers will have noticed that all of the scenarios above are noisy: a crowded, bustling cafe, or a subway station with heavy background noise. How is the noise problem solved?

AI Front Line has learned that, to cope with the strong noise of subways and cafes, DAMO Academy applied, for the first time, a machine-learning-based large microphone array technology, combined with a deeply optimized acoustic structure and multimodal voice extraction, which can automatically pick out the target speaker's voice from a strongly interfering background and achieve speech recognition in noisy environments. In addition, dynamic full-link model matching is carried out on-device and in the cloud simultaneously for sounds such as coffee grinders and cafe chatter, with end-to-end adaptive optimization to safeguard every voice interaction.
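For intuition only, here is a generic delay-and-sum beamforming sketch showing how a microphone array can favor a target speaker over diffuse background noise. This is a textbook technique under our own assumptions about array geometry and sample rate; Alibaba's actual algorithm is not public and is certainly more sophisticated.

```python
import numpy as np

SPEED_OF_SOUND = 343.0   # m/s
SAMPLE_RATE = 16000      # Hz

def delay_and_sum(signals, mic_positions, target_direction):
    """Align each microphone channel toward the target direction and average.

    signals:          (n_mics, n_samples) time-domain recordings
    mic_positions:    (n_mics, 3) microphone coordinates in metres
    target_direction: (3,) unit vector pointing at the target speaker
    """
    n_mics, n_samples = signals.shape
    # Time-of-arrival difference of each mic relative to the array origin.
    delays = mic_positions @ target_direction / SPEED_OF_SOUND
    shifts = np.round((delays - delays.min()) * SAMPLE_RATE).astype(int)
    aligned = np.zeros_like(signals)
    for m in range(n_mics):
        s = shifts[m]
        aligned[m, : n_samples - s] = signals[m, s:]   # advance by its delay
    # Coherent signals (the target) add up; incoherent noise averages out.
    return aligned.mean(axis=0)

# Toy usage with random data standing in for real recordings.
mics = np.array([[0.00, 0, 0], [0.05, 0, 0], [0.10, 0, 0], [0.15, 0, 0]])
recordings = np.random.randn(4, SAMPLE_RATE)        # 1 second of "audio"
enhanced = delay_and_sum(recordings, mics, np.array([1.0, 0.0, 0.0]))
print(enhanced.shape)
```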

For now, Alibaba is piloting the system in a cafe on its own campus, so we don't yet know how it performs in real-world deployments. If any readers get to try it, don't forget to come back and tell us what you think.

Showing off in the voice arena: what do the tech giants want?

Natural language processing has been called the brightest jewel in the crown of artificial intelligence, and it has also been said that once NLP is solved, 80% of the problems in AI will be solved with it. Perhaps precisely because speech and language matter so much to AI research, they are also harder to crack. In simplified scenarios, intelligent voice applications keep delivering pleasant surprises; in complex real-world settings, they don't seem to work nearly as well.

At the beginning of this article we mentioned Google Duplex, which sounds almost indistinguishable from a human; AI Front has previously covered this uncannily lifelike AI voice. Because Duplex was not demonstrated live, it has been dogged by repeated claims that the demo was faked, though Google has yet to respond.

At its Build 2018 developer conference, held around the same time as Google I/O, Microsoft also unveiled a major voice product: an intelligent meeting note-taking system that can replace stenography, simultaneous interpretation and secretarial work all at once. A 360-degree camera and microphone array not only accurately identifies every attendee, but also transcribes and translates what each person says in real time and helps extract the key points. Every time someone says "follow up", it is automatically recorded in Microsoft's meeting system.
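At its simplest, the "follow up" trigger could be little more than phrase spotting over the speaker-attributed transcript, as in the rough sketch below. This is our own illustration of the idea, not Microsoft's implementation.

```python
import re

# Scan each transcribed, speaker-attributed utterance for the trigger phrase
# and log an action item. Illustrative only, not Microsoft's code.
FOLLOW_UP = re.compile(r"\bfollow up\b", re.IGNORECASE)

def extract_action_items(transcript):
    """transcript: list of (speaker, utterance) pairs in meeting order."""
    items = []
    for speaker, utterance in transcript:
        if FOLLOW_UP.search(utterance):
            items.append({"owner": speaker, "note": utterance})
    return items

meeting = [
    ("Alice", "Latency is still too high on the edge devices."),
    ("Bob", "Let's follow up with the hardware team about the mic array."),
]
print(extract_action_items(meeting))
```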

At the Microsoft AI conference held in China on May 21, Microsoft showed off an upgraded Chinese version of the system, flexing its muscles in the voice field.

Impressive as the two demos were, both were carried out in closed, simulated office environments. The system could identify who was speaking, but the demonstrations did not include the conditions of a real meeting, such as several people talking over each other or background noise.

In addition to AI giants, a number of startups have also found a gold mine in voice.

Luo Yonghao demonstrated the voice control features of his own Smartisan TNT workstation at the company's launch event. There were "occasional" recognition failures during the demo; whether that was because Luo's Mandarin wasn't standard enough or because the wind was strong that night in the Bird's Nest, we can't say.

We’ll see what the future holds.