❤️ [Column: Data collection] ❤️ [Effective Rejection of fake data]


👋 Follow physics AI 👋 to learn more fun AI, let's go 🚀 🚀

This blog still contains many mixed-in links to the C site (CSDN). For lack of time they have not been sorted out yet; thank you for your understanding.


🥇 Dataset introduction


🔴 Basic Information

The Audio, Speech and Language Processing Group (ASLP Lab) of Northwestern Polytechnical University, Mobvoi, and Hill Shell (AISHELL) jointly released WenetSpeech, a multi-domain Chinese speech recognition dataset of 10,000+ hours.

  • Corresponding paper: https://arxiv.org/pdf/2110.03370.pdf
  • The official home page: https://wenet-e2e.github.io/WenetSpeech/
  • This section mainly draws on the article: https://mp.weixin.qq.com/s/lR22WmI5G2mPSuloZUcWVA
  • Readers who want better typesetting can read the original article at the link above

🔵 WenetSpeech profile

WenetSpeech contains 10,000+ hours of high-quality annotated data and 2,400+ hours of weakly annotated data. The total audio amounts to 22,400+ hours; the remainder is unlabeled. The data covers a wide range of Internet audio and video, background noise conditions, and speaking styles. The source domains include audiobooks, commentary, documentaries, TV dramas, interviews, news, readings, speeches, variety shows, and other scenes. The detailed statistics of these domains are shown in the figure below.

🟣 WenetSpeech collection process

Several typical examples of the subtitle OCR system in different scenarios are shown below. In the figure, the green boxes are all detected text regions, the red boxes are the text regions judged to be subtitles, and the text above each red box is the OCR recognition result. The system correctly locates the subtitle area and accurately recognizes the subtitle text. Our tests also show that the system can accurately determine the start and end time of each subtitle.

🟡 Data verification

In WenetSpeech, data with confidence >= 0.95 were selected as high-quality annotated data, and data with confidence between 0.6 and 0.95 were selected as weakly supervised data.
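The thresholds above can be written down as a small helper. This is only a sketch: the function name `label_tier` and the tier names are hypothetical, and treating exactly 0.6 as "weak" is an assumption, since the text does not say whether the lower bound is inclusive.

```python
from typing import Optional

# Thresholds taken from the text; names below are hypothetical.
HIGH_QUALITY_MIN = 0.95
WEAK_MIN = 0.6  # assumed inclusive; the text does not specify


def label_tier(confidence: float) -> Optional[str]:
    """Map a segment confidence in [0, 1] to an annotation tier."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    if confidence >= HIGH_QUALITY_MIN:
        return "high_quality"
    if confidence >= WEAK_MIN:
        return "weak"
    # Below the weak threshold: not used as labeled data.
    return None
```

For example, `label_tier(0.97)` falls in the high-quality tier, while `label_tier(0.7)` falls in the weakly supervised tier.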

🔴 Comparison of classical algorithms


📘 The right way to download


Download method recorded on: 2021-10-22

🔴 Download home page

  • wenet-e2e.github.io/WenetSpeech…

🔴 Enter the email address

🔴 The following page is displayed

🔴 Download instructions arrive by email shortly

Prepare about 500 GB of free disk space.
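Before starting, the free space on the target disk can be checked with a few lines of Python. This is a minimal sketch: the helper name `has_enough_space` is hypothetical, the 500 GB figure comes from the text, and 1 GB is assumed to mean 10^9 bytes.

```python
import shutil

REQUIRED_GB = 500  # figure from the text above


def has_enough_space(path: str, required_gb: float = REQUIRED_GB) -> bool:
    """Return True if the filesystem holding `path` has enough free space."""
    free_bytes = shutil.disk_usage(path).free
    # Assumption: 1 GB = 10**9 bytes (decimal gigabytes).
    return free_bytes >= required_gb * 10**9


if __name__ == "__main__":
    # "." is a placeholder; point it at the intended download directory.
    print("Enough space:", has_enough_space("."))
```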

🔴 Start download

Depending on network speed, the download takes about half a day.

  • du -sh shows the size of the compressed dataset package: 309 GB
  • tree -L 3 shows the dataset structure, as below
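Where `du` or `tree` is not installed, roughly the same information can be gathered from Python. The sketch below is an approximation, not a replacement: the function names are hypothetical, and it sums apparent file sizes, whereas `du` reports disk blocks used, so the numbers can differ slightly.

```python
import os
from pathlib import Path
from typing import List


def total_size_bytes(root: str) -> int:
    """Sum the sizes of all regular files under `root` (symlinks skipped)."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            file_path = os.path.join(dirpath, name)
            if not os.path.islink(file_path):
                total += os.path.getsize(file_path)
    return total


def tree_lines(root: str, max_depth: int = 3) -> List[str]:
    """List entries under `root` down to `max_depth` levels, indented by depth."""
    lines: List[str] = []

    def walk(path: Path, depth: int) -> None:
        if depth > max_depth:
            return
        for entry in sorted(path.iterdir()):
            lines.append("    " * (depth - 1) + entry.name)
            if entry.is_dir() and not entry.is_symlink():
                walk(entry, depth + 1)

    walk(Path(root), 1)
    return lines
```

Calling `print("\n".join(tree_lines(dataset_dir)))` on the extracted dataset directory (the variable `dataset_dir` is a placeholder) prints a listing comparable to `tree -L 3`.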


📙 A salute to the experts


WenetSpeech is currently the largest open source Mandarin speech corpus suitable for industry-level speech recognition research


This is probably how artificial intelligence, as a cause for all of humanity, gets pushed forward: step by step.

Speech datasets are summarized in the following blog post:

  • 👋 Summary of speech dataset download links | free Chinese speaker recognition corpora | Common Voice dataset | download summary
