Background

COVID-19 has been a double-edged sword. It has restricted travel and badly hurt the offline economy, but it has also greatly accelerated the livestream commerce business model. As the group's main live-streaming product, Taobao Live needs the ability to understand live rooms in real time, both to grow the pool of merchants and expert hosts on the production side and to recommend the highest-quality live content to users on the distribution side. This capability gives the operations team an anchor rating and evaluation system, and gives the recommendation algorithm rich features for improving recommendation quality.

What is real-time content understanding

Content understanding algorithms generally refer to image and video classification and recognition, ASR, OCR, and similar techniques. Some of their workloads are offline batch tasks, such as analyzing a surveillance recording or processing a batch of crawled videos, which do not need to return results promptly. Others, such as face recognition for access control, must return results quickly. Real-time content understanding places high demands on the timeliness and stability of the backend services. A live broadcast runs for a long time, often several hours. To understand live content steadily and efficiently, the algorithm's inference must be fast enough, resource utilization must be as high as possible, and coordination with the server side must be smooth.

Use content understanding to optimize the efficiency of live broadcast recommendation

The quality of a live broadcast depends on three things: (1) whether the picture is high definition and low latency; (2) whether the live content is attractive and of interest to users; (3) for livestream commerce, whether the goods on sale are of good quality and whether the price is low. The first problem can be solved with the video codec technology that Taobao developed in house. The third requires comparison against the commodity library and an understanding of user feedback. The real-time content understanding algorithm mainly addresses the second problem: judging how attractive an anchor is by recognizing information along multiple dimensions.

Once the content is understood, we need to consider how to combine it with the recommendation algorithm. At present, the recommendation algorithm relies mainly on user behavior features. For content that has been fully exposed, user behavior is very convincing; popular content on Douyin, for example, really is of interest to most people. For content that has not been exposed enough, there is little user behavior to go on, so recommendations have to rely on the characteristics of the content itself. For example, users upload a large number of new videos to YouTube, and new videos have relatively low exposure. Google uses a video tagging algorithm to compute tags for each video and recommends videos to users who are interested in those tags, which handles the cold start of new content. In live streaming, each broadcast session can be treated as a new video, and recommendations are made while the video is still in progress, which demands even more content understanding.

The recommendation algorithm is divided into recall, ranking (sorting), re-ranking (rearrangement), and other stages. If content understanding features are fed into the recall and ranking models, we have to consider the coverage of those features and how they fuse with other features, which is more complicated. Sometimes content understanding features are accurate on their own but bring no gain once combined with user behavior features. The re-ranking stage is more flexible: content understanding results can be used as weights that adjust the ranked order, or as filters.
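The re-ranking idea can be sketched as follows. This is a minimal illustration only: the `Candidate` fields, the `empty_shot_weight` value, and the multiplicative adjustment are all assumptions made for the example, not the production logic.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    room_id: str
    rank_score: float            # score from the upstream ranking model
    is_empty_shot: bool = False  # hypothetical content understanding flag


def rerank(candidates, empty_shot_weight=0.5):
    """Down-weight flagged rooms, then sort by the adjusted score."""
    def adjusted(c):
        weight = empty_shot_weight if c.is_empty_shot else 1.0
        return c.rank_score * weight
    return sorted(candidates, key=adjusted, reverse=True)


rooms = [
    Candidate("room_a", 0.90, is_empty_shot=True),
    Candidate("room_b", 0.60),
    Candidate("room_c", 0.40),
]
reranked_ids = [c.room_id for c in rerank(rooms)]
print(reranked_ids)
```

The ranking model itself is untouched, so a new content feature can be tried without retraining; setting the weight to 0 and dropping such rooms would turn the adjustment into a filter.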

An algorithmic framework for real-time content understanding

Taobao Live hosts tens of thousands of live broadcasts every day, and a broadcast may last several hours or even more than ten. Exposure varies greatly between broadcasts, so we started with real-time analysis of the head (top) anchors. The two most critical elements of a live broadcast are the anchor and the goods; at present we focus only on understanding anchors. We started with live face detection, voice classification, and appearance level, the features that the business and operations teams flagged as most important.

First, we built a service that listens to live-streaming MetaQ messages. Live room state changes are complicated: the stream is sometimes interrupted, the host sometimes leaves temporarily, items are sometimes added to the product bag, and messages for the same live room may be received by different machines. We spent a great deal of time on live message handling. Because the stream address of a live room is fixed, we eventually simplified the logic and now handle only the on-air and off-air messages.
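A minimal sketch of that simplification, with the message bus left out entirely. The message fields (`type`, `room_id`, `stream_url`) and the type names `on_air` and `off_air` are hypothetical; the real MetaQ message schema is not described here.

```python
class LiveRoomTracker:
    """Tracks which rooms are live, reacting only to on-air/off-air messages."""

    HANDLED = {"on_air", "off_air"}  # hypothetical message type names

    def __init__(self):
        self.live_rooms = {}  # room_id -> stream URL

    def handle(self, message):
        """Apply one message; return True if it changed what we track."""
        msg_type = message.get("type")
        if msg_type not in self.HANDLED:
            return False  # item added, host away, etc.: deliberately ignored
        room_id = message["room_id"]
        if msg_type == "on_air":
            # Handling the same on-air message twice leaves the same state.
            self.live_rooms[room_id] = message["stream_url"]
        else:
            self.live_rooms.pop(room_id, None)
        return True


tracker = LiveRoomTracker()
tracker.handle({"type": "on_air", "room_id": "r1", "stream_url": "http://example.invalid/r1"})
tracker.handle({"type": "item_added", "room_id": "r1"})
live_after_start = sorted(tracker.live_rooms)
tracker.handle({"type": "off_air", "room_id": "r1"})
live_after_stop = sorted(tracker.live_rooms)
```

Because every other message type is dropped, the only state kept per room is whether it is live and where its stream is.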

When we measured the PVR (exposure coverage ratio) of head broadcasts, we found that a few thousand head broadcasts can account for a large share of exposure, so we use the exposure log from the live recommendation team to compute the head anchors of the previous day. The exposure log for live streaming as a whole contains both public domain and private domain traffic. We count only public domain traffic and write the result to an ODPS table.
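The head-anchor calculation can be illustrated in plain Python, assuming a log of one record per exposure. The record fields (`anchor_id`, `domain`) and the 80% coverage target are assumptions for the example; the real job runs against the exposure log and ODPS, and its threshold is not stated.

```python
from collections import Counter


def head_anchors(exposure_log, coverage=0.8):
    """Pick the most-exposed anchors until `coverage` of public exposure is reached."""
    counts = Counter(
        record["anchor_id"]
        for record in exposure_log
        if record["domain"] == "public"  # private domain traffic is excluded
    )
    total = sum(counts.values())
    selected, covered = [], 0
    for anchor_id, exposures in counts.most_common():
        if total == 0 or covered / total >= coverage:
            break
        selected.append(anchor_id)
        covered += exposures
    return selected


log = (
    [{"anchor_id": "a", "domain": "public"}] * 6
    + [{"anchor_id": "b", "domain": "public"}] * 3
    + [{"anchor_id": "c", "domain": "public"}] * 1
    + [{"anchor_id": "d", "domain": "private"}] * 5
)
top = head_anchors(log)
print(top)
```

In this toy log, anchor `d` never qualifies because all of its exposure is private domain, and `a` plus `b` already cover 90% of public exposure.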

Our first attempt was to pull the FLV-format live stream from the live URL, have the downstream decoding machines decode it continuously, and then sample frames as model input. In this setup the utilization of the decoding machines was low: we sample frames sparsely, one frame every 10 to 20 seconds, and a broadcast may pause or stop at any time. We therefore switched to the HLS format and feed TS segments to the downstream decoding service. Compared with the previous stream-pulling approach, concurrent processing capacity more than doubled.
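To make the sparse sampling concrete, the sketch below computes which TS segments contain a sample point when one frame is taken every `interval` seconds. The segment durations and the function name are illustrative assumptions, and whether the real decoding service skips segments this way is not stated.

```python
def sample_points(segment_durations, interval=10.0):
    """Return (segment_index, offset_seconds) pairs, one per `interval` seconds."""
    samples = []
    next_sample = 0.0    # absolute stream time of the next frame to extract
    segment_start = 0.0
    for index, duration in enumerate(segment_durations):
        segment_end = segment_start + duration
        while next_sample < segment_end:
            samples.append((index, next_sample - segment_start))
            next_sample += interval
        segment_start = segment_end
    return samples


# Six 4-second segments, one frame every 10 seconds.
points = sample_points([4.0] * 6, interval=10.0)
print(points)
```

With these numbers only segments 0, 2, and 5 contain a sample point, which suggests one reason segment-based input can need less decoding work than decoding a continuous stream.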

Anchor attribute characteristics

Many features are associated with anchors, including stable ones such as face and voice and real-time ones such as expressions and actions. To maximize ROI and make the features easy for the recommendation algorithm to use, we worked with colleagues from operations and recommendation to summarize the following character attributes, and we are gradually recognizing these attribute tags algorithmically. These anchor tags are valuable to operations as levers for campaigns and as analysis tools, and they help the recommendation algorithm train better models.

Commonly used character attributes fall into visual and audio dimensions. Together with the VIP team, we built a multimodal attribute recognition framework, as shown in the figure below. Video frames and audio data are sampled from the live stream and fed to the visual recognition module and the audio recognition module respectively. At present, the audio module mainly includes a male/female voice model, a voice classification model, and an ASR model, and the visual module includes face detection and tracking, face attribute recognition, and image quality analysis. Other features can be derived from the outputs of the individual models. For example, ASR results can be used to compute speech rate, and the combination of the visual gender and voice gender results can be used to screen for transgender anchors.
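The two derived features mentioned above can be sketched as small functions over model outputs. The function names, the character-based rate, and the string labels are assumptions for illustration; the actual feature definitions are not given in this article.

```python
def speech_rate(asr_text, duration_seconds):
    """Non-space characters recognized per second; 0.0 for an empty clip."""
    if duration_seconds <= 0:
        return 0.0
    chars = sum(1 for ch in asr_text if not ch.isspace())
    return chars / duration_seconds


def gender_mismatch(visual_gender, voice_gender):
    """True only when both models produced a label and the labels differ."""
    if visual_gender is None or voice_gender is None:
        return False  # one modality is missing, so there is nothing to compare
    return visual_gender != voice_gender


rate = speech_rate("hello world", 5.0)          # 10 characters over 5 seconds
mismatch = gender_mismatch("female", "male")
no_face = gender_mismatch(None, "male")
```

Counting characters rather than words is a choice made here so the same function works for Chinese transcripts, where whitespace does not separate words.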

Here are some of our achievements in attribute recognition:

Face attributes

Compared with face detection and recognition algorithms, attribute recognition algorithms such as gender, appearance level, age, and expression recognition do not reach very high accuracy in general scenes. There are several reasons: the data sets are small, the attributes are subjective, and makeup and beautification tools clearly interfere with them. Among these attributes, appearance level has the greatest influence on user experience, so our face attribute work focused on appearance level, where we made several attempts.

Because appearance level is subjective, we tried a crowdsourcing service to annotate the data, but the results from different annotators still differed widely, so in the end the algorithm colleagues screened the data again themselves. We first used a binary classification model to judge whether the appearance level is high or low. The high appearance level anchors selected by the algorithm were then handed to operations colleagues, who divided them into high, medium-high, medium-low, and low groups for experiments. On the algorithm side, the softmax output of the classifier is used as the appearance level.

Comparing two images and selecting the one with the higher appearance level might be a simpler task, so we tried a pairwise ranking scheme. On data with clearly high or clearly low appearance level, it performed the same as the classification model. On data in the middle range, the ranking task was not solved well either. Overall, the ranking model did not differ much from the classification model.
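For readers unfamiliar with pairwise ranking, the sketch below shows one common pairwise objective, the logistic loss on the score difference. The article does not say which loss was used, so this choice and the function name are assumptions.

```python
import math


def pairwise_logistic_loss(score_higher, score_lower):
    """Small when the image labeled higher also receives the larger score."""
    margin = score_higher - score_lower
    return math.log1p(math.exp(-margin))


correct_order = pairwise_logistic_loss(2.0, 0.0)   # model agrees with the label
wrong_order = pairwise_logistic_loss(0.0, 2.0)     # model disagrees
tie = pairwise_logistic_loss(1.0, 1.0)             # model cannot separate the pair
```

A pair the model cannot separate contributes a loss of log 2, which matches the observation above: when annotators themselves disagree on mid-range images, a pairwise objective has no cleaner signal to learn from than a classifier does.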

Here are examples rated as high appearance level:

Comparison with the results of other companies' appearance level APIs:

In our test set, the comparison results of PR curve and ROC curve are as follows:

Face recognition

There are usually two ways to do face recognition. One is a classification model in which each person is a category. This scheme works well, but the model must be retrained whenever new people are added, so it is not very flexible. The other is to use a single model to compute face features and then retrieve the target person by those features. Because anchors in the live business change frequently, we use the retrieval scheme. We compute features with the ArcFace model and then use clustering and rerank to recognize face identity.
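The retrieval scheme can be sketched as nearest-neighbor search over feature vectors. The tiny 3-dimensional vectors stand in for real ArcFace embeddings, the 0.5 threshold is an arbitrary assumption, and the clustering and rerank steps are omitted.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def identify(query, gallery, threshold=0.5):
    """Return the gallery identity most similar to `query`, or None if none is close."""
    best_id, best_sim = None, threshold
    for person_id, feature in gallery.items():
        sim = cosine(query, feature)
        if sim >= best_sim:
            best_id, best_sim = person_id, sim
    return best_id


gallery = {"anchor_1": [1.0, 0.0, 0.0], "anchor_2": [0.0, 1.0, 0.0]}
known = identify([0.9, 0.1, 0.0], gallery)
unknown = identify([0.0, 0.0, 1.0], gallery)

# Enrolling a new person only adds a vector; the feature model is unchanged.
gallery["anchor_3"] = [0.0, 0.0, 1.0]
enrolled = identify([0.0, 0.0, 1.0], gallery)
```

The last three lines show the flexibility argument: a new anchor becomes recognizable as soon as one feature vector is added to the gallery, with no retraining.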

Audio features

In many live rooms no anchor appears on screen and there is only voice-over commentary, for example in rooms selling jewelry. Audio features are therefore an indispensable part of real-time understanding of the live room. At present, we have integrated the EasyASR algorithm from the PAI platform and worked with the PAI team to build features for male/female voice, presence of a human voice, and background music. Accuracy exceeds 90%, which supports services such as empty-shot (empty lens) recognition and transgender anchor identification.

Business applications

Real-time content understanding of the live room mainly serves the recommendation algorithm. Compared with content production, which outputs material directly, or the audit business, which outputs classification labels, adding real-time content understanding results to the recommendation process is more complicated. Currently, there are two ways to use the features. One is to add them to model training; the real-time features must then also be fed into scoring at inference time, and this requires accumulating features over a long period. At present, training the ranking algorithm takes more than two months of data, so the cost is relatively high. The other is to fuse the features into weight scores and raise or lower weights on top of the ranking results. The following describes how the content understanding algorithms are used in different application scenarios.

Ranking algorithm

Tapping the Taobao Live card on the home page opens the live channel page. This page mainly shows live room covers and also plays the live picture of one room. In this scenario, the recommendation algorithm mainly focuses on CTR and live viewing duration. Users decide whether to click based on the cover images on this page, and on Wi-Fi some of the live pictures are also played. As mentioned above, adding features to model training requires accumulating logs over a long period, so we ran our experiments by raising and lowering weights instead.

Because the channel page mainly displays covers, we hypothesized that a cover showing a face with a high appearance level would be more attractive, so we tried boosting the weight of high appearance level anchors. However, the high appearance level anchors used in the experiment were anchor IDs screened by operations. In many broadcasts the person on air is not the person on the cover, and a live room may have several anchors whose appearance levels differ at different times. The experiment showed that the CTR of high appearance level anchors did not increase and that the overall bucket metrics did not improve. We will continue the experiment by computing the appearance level of the face in the cover image directly.

The second experiment on the channel page was empty-shot (empty lens) suppression. Live broadcasts usually run for a long time and often contain empty shots, that is, stretches with no anchor explaining anything. Operations colleagues analyzed the user experience and found that empty shots account for a large share of it. Moreover, empty-shot detection is a binary feature, which makes it well suited to weight adjustment.

An empty shot can be judged along two dimensions, picture and sound. Since there is no sound on the channel page, we judge whether a live room is in an empty shot by whether a face is present. Some industries do not need the anchor to appear on camera, so we apply this only to women's clothing, apparel, and other industries where the anchor does show their face. Many live broadcasts on the channel page show only the cover image; in the experiment, the CTR and user duration metrics of the broadcasts we analyzed increased by 1% to 1.2%.

Tapping a live room on the channel page leads to the swipe (slide-down) scene, where the user sees the live room content with sound. This scene is better suited to real-time content understanding algorithms such as empty-shot suppression. Combining the results of face detection and ASR, we judge whether a face appears in the live room within a period of time and whether someone is talking, and we combine the CV and audio features to output whether the live room is currently in an empty shot.
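One way to combine the two signals over a time window is sketched below. The rule used here (an empty shot means that no sample in the window contains a face or speech) and the window size are assumptions; the actual combination logic is not specified in this article.

```python
from collections import deque


class EmptyShotDetector:
    """Flags an empty shot when a full window shows neither a face nor speech."""

    def __init__(self, window=3):
        self.samples = deque(maxlen=window)  # recent (has_face, has_speech) pairs

    def update(self, has_face, has_speech):
        """Record one sampled moment and return the current empty-shot decision."""
        self.samples.append((has_face, has_speech))
        if len(self.samples) < self.samples.maxlen:
            return False  # not enough evidence yet
        return not any(face or speech for face, speech in self.samples)


detector = EmptyShotDetector(window=3)
decisions = [detector.update(False, False) for _ in range(3)]
after_face = detector.update(True, False)   # the anchor walks back into frame
```

Requiring the whole window to be empty keeps a brief absence, such as an anchor stepping away to fetch a product, from immediately lowering the room's weight.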

At the start of the experiment, the metrics of the experimental bucket were flat against the baseline bucket. Because we cover only head anchors, we had to compute the effect separately for the broadcasts that the analysis service covers. We found that on broadcasts with a short analyzed duration and a high on-camera rate, the average user duration of the experimental bucket had increased significantly. Further analysis showed that the stream-pulling service machines had been crashing with OOM errors, so the analyzed duration of each broadcast was too short. After we fixed these engineering bugs, the average user duration and PV of the broadcasts covered by the analysis increased significantly, and the overall bucket metrics also rose because of the improved experience. Average user duration, PV, and overall PV all increased by more than 2%.

Negative governance

Novel content governance

Transgender featured anchors

Live content is rich and varied, and hosts use many means to attract attention. Soft-porn content is the most common type, and in recent months transgender-featured anchors have also appeared. Here the term refers to male anchors who broadcast presenting as women, through transition or makeup, and whose broadcasts use this as the selling point to attract clicks, which carries a significant public opinion risk. To identify these anchors, we screen Taobao anchors for the combination of a female on-screen gender and a male voice gender. Accuracy of the related models:

Model                                 Labels        Accuracy
Image gender binary classification    Male/Female   0.97
Voice gender binary classification    Male/Female   0.88

A broadcast may have several anchors, for example a female anchor on camera with a male anchor speaking off screen, or a male and a female anchor on screen together, so identifying these anchors by a mismatch between voice gender and visual gender produces false recalls. The algorithm's results are therefore checked by operations colleagues before the transgender label is finally applied. Combined with A2A (Account 2 Account) similar-anchor expansion, we have accumulated dozens of such anchor accounts. In public domain scenarios, we reduce their exposure frequency so that users do not swipe through several such broadcasts in a row, which improves the user experience.

Celebrity live rooms hosted by assistants

Celebrity-hosted versus assistant-hosted broadcasts

With the rise of livestream commerce, some celebrities have joined Taobao Live to start a second career. Operations colleagues analyzed how celebrity live rooms have used public domain traffic since January 1, 2021, and found that their overall UV transaction value exceeded the average for MCN institutions, while the overall user stay duration was below the MCN average. Further statistics showed that during periods when a celebrity's assistant was hosting, all public domain conversion metrics were below the MCN overall level and ranked at the bottom. In other words, when a broadcast switches from the celebrity to an assistant, overall conversion is hit hardest. Many users enter a live room because they saw the celebrity on the cover image, and they feel cheated if the celebrity is not actually on air. Therefore, to encourage celebrities to host in person and to improve the conversion rate of public domain traffic, we identify celebrities in live rooms in real time to assist traffic regulation.

In cooperation with the Taobao Live recommendation team, we adjusted the weights of celebrity live rooms in the home page "Guess You Like" feed, the channel page, and the swipe (slide-down) scene, so that celebrity accounts would notice the difference in traffic between celebrity-hosted and assistant-hosted broadcasts and change their behavior. In the experiment, we also found that CTR and per capita viewing time increased to some extent after the PV exposure of these celebrity live rooms decreased.

Summary

At present, our services use EAS, TPP, VIP, IGraph, MetaQ, ODPS, and other platforms. The engineering pipeline is hard to debug and consumes a lot of machine resources. To increase the processing capacity of the models, new algorithms will be deployed on the RTP platform. That platform is still under construction, so we have to use it while it is being built. How to combine the content understanding algorithm with the recommendation algorithm also needs attention. Short videos and images can be processed offline and the results stored, but live content is processed in real time and has to be tightly coupled with the recommendation algorithm. Finally, it is important to understand the business scenarios thoroughly, because different scenarios call for different effective features.