Disclaimer:
This article documents a technical learning process; there is no intent to harm the app used in the experiments. The tool must not be used for commercial or other improper purposes, and any misuse is the sole responsibility of the user.
0x1. A Little Chat First
Hey friend, remember that year-end review you just wrote? Have you started on any of the items on your list? By the way, 7.6% of 2021 is already gone!
Last year I kept meaning to write two tools: a Markdown conversion tool and a video subtitle extraction tool. The former I cranked out in late December:
A Tool for Typesetting Public Articles
The latter slipped into this year. A while back I got a sudden urge to start tinkering, but with too much going on it kept getting shelved. Scraping together bits of time, I finally got it done — still a simple version, of course ~
Here’s why I made this tool.
Main plot
I've always felt I can't control my emotions well — irritable, and oblivious to a lot of worldly wisdom — so much so that my wife often says: you're a strange one.
Having realized the problem, I had to try to fix it, so I started looking for material on the subject. By chance I found Zeng Lao's videos on Bilibili:
After one episode I fell in love with this funny little old man. From then on I'd put on headphones and listen to an episode with my eyes closed during the morning rush hour on Line 1.
Every episode is a delight, full of wonderful insights, but the half-life of the benefit is short: you soon forget them.
Life philosophy needs to be chewed over again and again — everyone knows that reviewing the old reveals the new — but replaying a 30-minute episode each time is a bit unrealistic in this era of information explosion.
Really, it's enough to remember the lessons and know the stories, so I started taking notes while listening, typing them into my phone and tidying them up at home after work.
But this clumsy note-taking takes the shine off an otherwise interesting talk:
Listen to two sentences → pause the video → type → switch back and keep listening, over and over. Sometimes I had to scrub back several times for parts I didn't catch. An episode that used to fit in one morning commute now takes 2-3.
Actually, all I need is the subtitles, because that's exactly what he says. Search for subtitles online?
The reality: the videos are so old that subtitle files are hard to find, or what exists is someone else's listening notes, incomplete.
When the usual ways fail, a "hands-on" developer's mind flashes an idea: why not write my own subtitle extraction tool?
Two general directions came to mind immediately: OCR text recognition and audio-to-text. Briefly, the ideas are as follows (a minimal skeleton sketch follows each list):
OCR text recognition
- 1. Extract frames by the second;
- 2. Crop a specific area of each frame (subtitles are generally displayed in a fixed area);
- 3. Pre-process the images before recognition (grayscale, binarization, etc.);
- 4. Run OCR using open source libraries or third-party APIs;
- 5. Splice the recognition results of the frames into a subtitle file;
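Before moving on, here is a minimal skeleton of this route wired together in code — a sketch only, with the OCR engine left as a pluggable placeholder (each step is fleshed out in section 0x3):

import cv2
from PIL import Image


# Sketch of the OCR route; ocr_fn is any engine taking a PIL image (Tesseract, Baidu OCR, ...)
def extract_subtitles(video_path, box, ocr_fn):
    cap = cv2.VideoCapture(video_path)
    fps = int(cap.get(cv2.CAP_PROP_FPS)) or 25  # fall back if fps can't be read
    lines, i = [], 0
    success, frame = cap.read()
    while success:
        if i % fps == 0:  # step 1: one frame per second
            img = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            img = img.crop(box).convert('L')  # steps 2-3: crop the subtitle area, grayscale
            text = ocr_fn(img)  # step 4: recognize
            if text and (not lines or text != lines[-1]):
                lines.append(text)  # step 5: splice, skipping consecutive repeats
        success, frame = cap.read()
        i += 1
    cap.release()
    return lines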
Audio to text
- 1. Convert the video to WAV format (most recognition libraries and third-party APIs support it);
- 2. Split the audio into small segments of a fixed duration (some SDKs only accept audio under 60s);
- 3. Recognize each segment with open source libraries or third-party APIs;
- 4. Splice the recognition results of the segments into a subtitle file;
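And the corresponding skeleton for this route — again just a sketch, with the recognizer left pluggable (fleshed out in section 0x4):

from pydub import AudioSegment  # requires ffmpeg on the PATH


# Sketch of the audio route; recognize_fn is any ASR engine taking a wav file path
def audio_to_text(video_path, recognize_fn, seg_ms=60 * 1000):
    audio = AudioSegment.from_file(video_path)  # step 1: demux the audio track
    texts = []
    for start in range(0, len(audio), seg_ms):  # step 2: cut into <=60s pieces
        audio[start: start + seg_ms].export("part.wav", format="wav")
        texts.append(recognize_fn("part.wav"))  # step 3: recognize each piece
    return "\n".join(texts)  # step 4: splice the results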
The logic is clear enough. Of course, saying is one thing; there are bound to be pits to step in.
Hidden plot
Strictly speaking, both methods conjure subtitles out of thin air, so this should be called subtitle generation rather than subtitle extraction. Which reminds me of something else:
I have a friend who, when watching Japanese "love action" blockbusters, is blocked by the language barrier: he can only infer the director's intent from the intense physical collisions, which sometimes leaves him lost about the plot.
With this tool there's no more problem of not knowing what's being said; while watching the confrontation, you can also speculate further on the actors' inner lives. Wonderful!
The rest of this article uses Bilibili videos as the example to develop a subtitle extraction tool.
0x2. Video download
Before extracting anything, first get the video onto the local disk. There are four general approaches:
- 1. Use various off-the-shelf downloaders, such as the Flash Bean video downloader, Bilibili video downloader, Haw Down, Tampermonkey scripts, etc.;
- 2. Use the Python download tool you-get;
- 3. Reverse Bilibili's video rules and obtain the source URL for download;
- 4. Piggyback on a third-party parsing site: simulate its requests and scrape the parsed result to get the source address;
Only methods 2 and 3 are covered here.
1. The you-get library
you-get supports video download from YouTube, Twitter, Tumblr, Bilibili and many other sites; for the full list of supported sites and other usage notes see:
Official Chinese Wiki
Install it directly with pip:
pip install you-get
If the download is slow, try adding a mirror source, for example:
pip install you-get -i https://pypi.tuna.tsinghua.edu.cn/simple
Once installed you can play with it: pick any Bilibili video link, open a terminal and type:
you-get -i <video link>
This lists all the available qualities and formats, with DEFAULT marking the default quality:
Besides the command line, you can also download from Python by calling you_get.common.any_download():
import os
from you_get import common as you_get
url = "https://www.bilibili.com/video/BV1eh41127Ma"
you_get.any_download(url=url, info_only=False, output_dir=os.getcwd(), merge=True)
After running, the console will output the video download progress:
Some videos (e.g. new bangumi episodes) require a Premium account to watch and download; downloading them directly throws an error:
Get cookies.txt is a Chrome extension for exporting cookies. Log in to your Bilibili account, then use it to export the cookies:
Then put the exported file in the same directory as the code file (here I renamed it to bilibili.txt), and load it before downloading:
import os
from you_get import common as you_get
url = "https://www.bilibili.com/bangumi/play/ss5978"
you_get.load_cookies("bilibili.txt")
you_get.any_download_playlist(url=url, cookies="bilibili.txt", info_only=False, output_dir=os.getcwd(), merge=True)
After running, Premium-only bangumi download normally too:
A lot of tutorials on the Internet just copy each other, passing cookies in without ever checking whether it actually works.
2. Obtain the video source URL + download with IDM
you-get is good enough, but a few days ago, for whatever reason, downloads crawled along at a few dozen KB/s — hence this backup plan.
IDM (Internet Download Manager) is the legendary Windows download tool and needs no introduction; if you've never heard of it, look it up. Mac users can use Aria2 instead.
The plan here is to get the source address of the Bilibili video, then download it with IDM.
① Obtain the video download source address
Press F12 to open developer tools, switch to the Network tab, filter for Doc-type requests, and refresh:
Simulate the request with the Requests library, setting only the User-Agent header:
import requests as r

url = "https://www.bilibili.com/video/BV1mE411R71f"
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/83.0.4103.97 Safari/537.36'
}

if __name__ == '__main__':
    resp = r.get(url=url, headers=headers)
    print(resp.text)
The following information is displayed on the console:
OK. Now locate the video, hit play, and the Network tab spits out a bunch of these:
View the URL for one of the requests:
https://xy113x100x172x188xy.mcdn.bilivideo.cn:4483/upgcxcode/93/92/120309293/120309293-1-30216.m4s?expires=1610969512&platform=pc&ssig=DPtz9VyD7TVRwOtle4ARXw&oi=2005480736&trid=d8bdd82a47d34a1e914b7de07cc20946u&nfc=1&nfb=maPYqpoel5MI3qOUX6YPRA==&mcdnid=1001203&mid=67402819&orderid=0,3&agrr=1&logo=A0000001
Note the .m4s extension: these are fragmented MP4 segments that make up the video stream for the HTML5 player.
Click on one of them to see the request details:
Hmm? The Range field in the request header specifies the requested byte range of the segment, and the Content-Range field in the response header gives the total length.
Could we simulate the request, get the total length first, and then request Range: 0-(total length) to fetch the complete video in one go?
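The idea in code — untested at this point, and the .m4s URL is a placeholder captured from the Network tab:

import requests

seg_url = "https://xxx.mcdn.bilivideo.cn/upgcxcode/93/92/120309293/120309293-1-30216.m4s?..."  # placeholder
headers = {
    'User-Agent': 'Mozilla/5.0',
    'Referer': 'https://www.bilibili.com',
    'Range': 'bytes=0-0'  # ask for a single byte; the response reveals the total size
}
resp = requests.get(seg_url, headers=headers)
# Content-Range looks like 'bytes 0-0/4683014'; the number after '/' is the total length
total = int(resp.headers['Content-Range'].split('/')[-1])
headers['Range'] = 'bytes=0-{}'.format(total - 1)  # then request the full range in one go
full = requests.get(seg_url, headers=headers)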
We'll verify that guess later. First, where can we get the video source URL? Request another address here (one covering as many quality levels as possible):
https://www.bilibili.com/video/BV1554y1s7qG
Simulate the request, copy the response HTML into PyCharm, format it, and search for .m4s:
Copy the JSON into a formatting tool, search for the video URL, and find the following two fields:
Two differently named fields hold the same value, presumably for compatibility between old and new interfaces. There are several such entries under video, one per quality level.
The quality descriptions in the outer support_formats field correspond to these entries' ids. Another field, audio, caught my attention:
I'd heard that Bilibili serves audio and video separately; is that really so? A quick check with IDM:
Sure enough, there really are two separate streams, and the video alone plays silent:
If so, small problem: just merge the two with FFmpeg. And if you don't want to merge them yourself, you can download the FLV format instead, by switching the player to Flash Player:
Verify with IDM again: the FLV downloaded this way plays with sound:
Then search the responses for .flv and find the following link:
https://api.bilibili.com/x/player/playurl?cid=143182635&fnver=0&otype=json&bvid=BV1RJ41177XR&player=1&fnval=0&qn=0&avid=83697364
The url of the resource was found:
bvid is the BV number and avid the AV number; as for cid, search the page HTML for its value:
Now qn — a guess: it's the quality. Change it to 80:
Change it back to 0 and see:
0 means default quality. Leave the other parameters at their defaults and try opening the URL in the browser:
The request failed, so I searched around: it's probably hotlink protection, and a Referer header pointing at the playback page must be set.
Try this code with requests:
import requests as r

if __name__ == '__main__':
    download_url = "https://upos-sz-mirrorkodo.bilivideo.com/upgcxcode/35/26/143182635/143182635_da2-1-64.flv?e" \
                   "=ig8euxZM2rNcNbKBhwdVhoMM7WdVhwdEto8g5X10ugNcXBlqNxHxNEVE5XREto8KqJZHUa6m5J0SqE85tZvEuENv" \
                   "No8g2ENvNo8i8o859r1qXg8xNEVE5XREto8GuFGv2U7SuxI72X6fTr859r1qXg8gNEVE5XREto8z5JZC2X2gkX5L5F" \
                   "1eTX1jkXlsTXHeux_f2o859IB_&uipk=5&nbs=1&deadline=1611047620&gen=playurl&os=kodobv&oi=20054" \
                   "80449&trid=95df7eeb4c384372bf7a106860d1aa3eu&platform=pc&upsig=4e2a26137f1fe3a58cc3255cf3e" \
                   "c25cc&uparams=e,uipk,nbs,deadline,gen,os,oi,trid,platform&mid=67402819&orderid=0,3&agrr=" \
                   "1&logo=80000000"
    referer_url = "https://www.bilibili.com/video/BV1RJ41177XR"
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
                      'Chrome/83.0.4103.97 Safari/537.36',
        'Referer': referer_url,
        'Origin': 'https://www.bilibili.com'
    }
    resp = r.get(url=download_url, headers=headers)
    with open("video.flv", "wb+") as f:
        f.write(resp.content)
    print("Video saved successfully")
After running and waiting a moment, the video has been downloaded locally:
Try the same trick on the two .m4s URLs:
Good news: there's no need to set a Range at all — the server returns the complete stream. What remains is extracting the URL details and merging audio and video with an ffmpeg command. It's all fairly simple, so instead of a step-by-step walkthrough, here's the complete code:
# -*- coding: utf-8 -*-
# !/usr/bin/env python
"""
-------------------------------------------------
File   : bilibli_video_download.py
Author : CoderPig
date   : 2021-01-18 23:58
Desc   : Bilibili video download
-------------------------------------------------
"""
import requests as r
import http.cookiejar
import cp_utils
import re
import json
import time
import subprocess

# Regexes for extracting the video info embedded in the page
play_info_pattern = re.compile(r'window\.__playinfo__=(\{.*?\})', re.MULTILINE | re.DOTALL)
initial_state_pattern = re.compile(r'window\.__INITIAL_STATE__=(\{.*?\});', re.MULTILINE | re.DOTALL)

# Request headers
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/83.0.4103.97 Safari/537.36',
    'Origin': 'https://www.bilibili.com'
}

# Cookies file
cookies_file = 'bilibili.txt'
flv_player_playurl = 'https://api.bilibili.com/x/player/playurl'


# Bilibili video class
class BVideo:
    def __init__(self, title=None, cid=None, bvid=None, avid=None, quality=0, flv_url=None, mp4_url=None,
                 wav_url=None, merge_video=None):
        self.title = title
        self.cid = cid
        self.bvid = bvid
        self.avid = avid
        self.quality = quality
        self.flv_url = flv_url
        self.mp4_url = mp4_url
        self.wav_url = wav_url
        self.merge_video = merge_video


# Fetch mp4 (DASH) resource data from the page
def fetch_mp4_data(url):
    if headers.get('Referer') is not None:
        headers.pop('Referer')
    b_video = BVideo()
    resp = r.get(url=url, headers=headers, cookies=cookies)
    print("Request:", resp.url)
    play_info_result = play_info_pattern.search(resp.text)
    if play_info_result is not None:
        data_json = json.loads(play_info_result.group(1))
        b_video.mp4_url = data_json['data']['dash']['video'][0]['baseUrl']
        b_video.wav_url = data_json['data']['dash']['audio'][0]['baseUrl']
    initial_result = initial_state_pattern.search(resp.text)
    if initial_result is not None:
        data_json = json.loads(initial_result.group(1))
        video_data = data_json['videoData']
        b_video.title = video_data['title']
        b_video.avid = video_data['aid']
        b_video.bvid = video_data['bvid']
        b_video.cid = video_data['cid']
    return b_video


# Fetch FLV resource data from the playurl API
def fetch_flv_data(b_video):
    params = {
        'qn': 0,
        'fnval': 0,
        'player': 1,
        'fnver': 0,
        'otype': 'json',
        'avid': b_video.avid,
        'bvid': b_video.bvid,
        'cid': b_video.cid
    }
    resp = r.get(flv_player_playurl, params=params, headers=headers, cookies=cookies)
    print("Request:", resp.url)
    if resp is not None:
        resp_json = resp.json()
        if 'data' in resp_json:
            b_video.flv_url = resp_json['data']['durl'][0]['url']
    return b_video


# Download a resource with plain requests
def download_normal(url, referer_url, file_type):
    headers['Referer'] = referer_url
    print("Download:", url)
    resp = r.get(url=url, headers=headers)
    file_name = '{}.{}'.format(str(int(round(time.time() * 1000))), file_type)
    with open(file_name, "wb+") as f:
        f.write(resp.content)
    print("Download completed:", resp.url)
    return file_name


# Merge audio and video
def merge_mp4_wav(video_path, audio_path, output_path):
    print("Merging audio and video ~")
    cmd = f'ffmpeg -i {video_path} -i {audio_path} -acodec copy -vcodec copy {output_path}'
    subprocess.call(cmd, shell=True)
    print("Merge completed ~")


if __name__ == '__main__':
    cookies = None
    if cp_utils.is_dir_existed(cookies_file, mkdir=False):
        cookies = http.cookiejar.MozillaCookieJar(cookies_file)
        cookies.load(ignore_discard=True, ignore_expires=True)  # Actually load the exported cookies
    video_url = input("Please enter the link of the video to download: \n")
    print("Extracting video information...")
    video = fetch_flv_data(fetch_mp4_data(video_url))
    user_choose = input("\nPlease enter the number of the resource to download: \n1. mp4 \n2. flv \n")
    if user_choose == '1':
        mp4_path = download_normal(video.mp4_url, video_url, 'mp4')
        wav_path = download_normal(video.wav_url, video_url, 'wav')
        print("Audio and video downloads complete, merging resources...")
        merge_mp4_wav(mp4_path, wav_path, "after_{}.mp4".format(str(int(round(time.time() * 1000)))))
    elif user_choose == '2':
        flv_path = download_normal(video.flv_url, video_url, 'flv')
    else:
        print("Invalid input")
        exit(0)
After running, just paste in a Bilibili video URL and download the video on demand.
② Call IDM to download the video
Downloading with requests is a bit slow and gives no progress feedback; long HD videos take a while, for example:
If, like me, you have IDM installed, you can pip install the idm module:
pip install idm
Then call it from code; see the module's docs: IDM
The IDM download function looks like this:
import time
from idm import IDMan


# Download via IDM
def download_idm(url, referer_url, file_type):
    print("Download:", url)
    file_name = '{}.{}'.format(str(int(round(time.time() * 1000))), file_type)
    downloader = IDMan()
    downloader.download(url, output=file_name, referrer=referer_url)
    print("Download completed:", url)
    return file_name
After running, enter the video URL and the download is handed straight to IDM. So cool ~
Tips: this code only works for ordinary videos, not Premium-only bangumi; if you're interested, you can crack those rules yourself ~
That's it for video download. Now on to the first subtitle extraction scheme: OCR text recognition ~
0x3. OCR character recognition
Let's walk the OCR route first, following the steps outlined at the beginning.
1. Extract video frames
OpenCV is used here. Installing it directly with pip is usually slow, so a mirror source is recommended:
pip install opencv-python
Fairly simple — straight to the code:
# -*- coding: utf-8 -*-
# !/usr/bin/env python
"""
-------------------------------------------------
File   : video_frame_extract.py
Author : CoderPig
date   : 2021-01-24 10:21
Desc   : Video frame extraction
-------------------------------------------------
"""
import cv2
import os
import cp_utils

video_dir = os.path.join(os.getcwd(), "video")
frame_extract_dir = os.path.join(os.getcwd(), "frame_extract")


def save_image(save_dir, image, num):
    save_path = os.path.join(save_dir, "{}.jpg".format(num))
    cv2.imwrite(save_path, image)


def extract_frame(file):
    save_dir = os.path.join(frame_extract_dir, file.split(os.path.sep)[-1][:-4])
    cp_utils.is_dir_existed(save_dir)
    video_capture = cv2.VideoCapture(file)
    success, frame = video_capture.read()
    i = 0
    j = 0
    while success:
        i += 1
        # Save one frame per second (every fps-th frame)
        if i % int(video_capture.get(cv2.CAP_PROP_FPS)) == 0:
            j += 1
            save_image(save_dir, frame, j)
            print("Saved frame {}".format(j))
        success, frame = video_capture.read()


if __name__ == '__main__':
    cp_utils.is_dir_existed(video_dir)
    cp_utils.is_dir_existed(frame_extract_dir)
    mp4_list = cp_utils.filter_file_type(video_dir, ".mp4")
    flv_list = cp_utils.filter_file_type(video_dir, ".flv")
    video_list = mp4_list + flv_list
    print("Please choose the index of the video to process:\n{}\n".format("=" * 64))
    for index, value in enumerate(video_list):
        print("{}. {}".format(index, value))
    file_choose = input("\n{}\nPlease enter the index:\n".format("=" * 64))
    file_choose_int = int(file_choose)
    if 0 <= file_choose_int < len(video_list):
        extract_frame(video_list[file_choose_int])
Run:
You can also open the file manager to view:
2. Image cropping & pre-processing before OCR
With the frames extracted, the next step is cropping. There's no need to feed the whole frame to OCR; just crop the subtitle area — the PIL module handles this. To find the crop box, simply measure with the Windows Paint tool:
The start coordinate is (12, 244) and the end coordinate is (12+247, 244+28) = (259, 272). Grab a random frame and try cropping it:
import os
from PIL import Image
frame_extract_dir = os.path.join(os.getcwd(), "frame_extract")
if __name__ == '__main__':
im = Image.open(frame_extract_dir + "/102.jpg")
box = (12, 244, 259, 272)
region = im.crop(box)
region.save("test.jpg")
Open the image to see what it looks like:
Nice. Now add grayscale and binarization; the code is as follows:
import os
from PIL import Image

frame_extract_dir = os.path.join(os.getcwd(), "frame_extract")

if __name__ == '__main__':
    im = Image.open(frame_extract_dir + "/102.jpg")
    box = (12, 244, 259, 272)
    im = im.convert('L')  # Grayscale
    # Binarization: pixels below the threshold become black, the rest white
    table = []
    for i in range(256):
        if i < 150:
            table.append(0)
        else:
            table.append(1)
    im = im.point(table, "1")
    region = im.crop(box)
    region.save("test.jpg")
Processed image:
3. Tesseract OCR recognition
Tesseract-ocr is a free OCR text recognition engine provided by Google:
Github repository address
If you don't train your own character data, the recognition rate is frighteningly low; let's first try the stock Chinese data.
① Install Tesseract
After installation, add the install path to the PATH environment variable, for example:
Then open a terminal and try recognizing the processed image with:
tesseract test.jpg result -l chi_sim
Open the generated result.txt:
What the hell? Is it a font problem? Try another image:
The identification result is normal:
The tesserocr library wouldn't install either, so I gave up on Tesseract and turned to a third-party OCR SDK — Baidu OCR here.
4. Baidu OCR recognition
# pip install baidu-aip
from aip import AipOcr

config = {
    'appId': 'XXX',
    'apiKey': 'YYY',
    'secretKey': 'ZZZ'
}
client = AipOcr(**config)


def bd_ocr(file):
    with open(file, 'rb') as f:
        image = f.read()
    result = client.basicGeneral(image)
    if 'words_result' in result:
        return '\n'.join([w['words'] for w in result['words_result']])
Identification results:
Then it hit me: given the ancient picture quality, recognition might work better without binarization and grayscale. Remove that part of the code and try again:
For higher accuracy, change basicGeneral to basicAccurate.
The results are about the same, but high-accuracy recognition only gets 500 free calls a day (the one-line swap is sketched below):
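For reference, the swap inside bd_ocr() above is a single line — basicAccurate is the high-accuracy counterpart in the baidu-aip SDK:

result = client.basicAccurate(image)  # high-accuracy version, 500 free calls per day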
0x4. Audio to text
Baidu OCR's high-accuracy results above are basically good enough, but it isn't free forever and has quota limits. Besides, OCR's biggest drawback is that it's never one-and-done:
for each video you may have to recalculate the subtitle crop area before recognizing text.
Speech recognition escapes this dilemma: as long as there's speech, you get complete subtitles.
1. Video to WAV audio clips
The Python audio library pydub is used here; it depends on ffmpeg, so download and install ffmpeg and configure the environment variables first:
GitHub repository
Get the video duration, then cut a segment every 60 seconds. Easy — straight to the code:
# -*- coding: utf-8 -*-
# !/usr/bin/env python
"""
-------------------------------------------------
File   : audio_text_extract.py
Author : CoderPig
date   : 2021-01-26 10:11
Desc   : Extract subtitles from video audio
-------------------------------------------------
"""
from pydub import AudioSegment
import os
import time
import cp_utils
import math

audio_after_dir = os.path.join(os.getcwd(), "audio_after_dir")


def video_to_wav(file_path):
    if file_path.endswith(".flv"):
        video = AudioSegment.from_flv(file_path)
    else:
        video = AudioSegment.from_file(file_path, format=file_path[-3:])
    wav_save_dir = os.path.join(audio_after_dir, str(int(round(time.time() * 1000))))
    cp_utils.is_dir_existed(wav_save_dir)
    video_duration = int(video.duration_seconds * 1000)  # Duration in ms
    part_duration = 60 * 1000  # Cut a segment every 60 seconds
    part_count = math.ceil(video_duration / part_duration)
    last_start = video_duration - video_duration % part_duration
    print("Video duration: {} ms, cutting into {} segments".format(video_duration, part_count))
    for part in range(0, part_count - 1):
        start = part * part_duration
        end = (part + 1) * part_duration - 1
        wav_part = video[start: end]
        print("Exporting segment: {} - {}".format(start, end))
        wav_part.export(os.path.join(wav_save_dir, "{}.wav".format(part)), format="wav")
    # The remaining segment
    wav_part = video[last_start: video_duration]
    print("Exporting segment: {} - {}".format(last_start, video_duration))
    wav_part.export(os.path.join(wav_save_dir, "{}.wav".format(part_count)), format="wav")
    return wav_save_dir


if __name__ == '__main__':
    cp_utils.is_dir_existed(audio_after_dir)
    wav_dir = video_to_wav(os.path.join(os.getcwd(), "1.flv"))
    print("FLV to WAV complete, output directory:", wav_dir)
After the operation:
You can also see the corresponding WAV fragment in the output directory:
So much for audio segmentation; next, speech-to-text with the SpeechRecognition library.
2. Recognizing audio with the SpeechRecognition library
For more information about the library, please go to:
- Github official repository
- Pypi
Install directly with pip:
pip install SpeechRecognition
The recognize_sphinx() recognizer is used here for offline recognition; it requires the PocketSphinx library:
pip install pocketsphinx
Error during installation:
error: command 'swig.exe' failed: No such file or directory...
Solution: Download SWIG from the official website, decompress it, and configure environment variables
Official SWIG download address
New environment variables:
Add this variable to the Path environment variable:
Reinstall pocketsphinx and another error appears: Microsoft Visual C++ 14.0 is required. Rather than set up the whole build environment, look for a prebuilt WHL instead.
A WHL file is essentially a zip archive containing the .py files and pre-compiled .pyd files.
Without a build environment, you just pick the WHL matching your Python version, skipping the hassle of making the current system satisfy the compile requirements.
So look for a cp38-win32 WHL, available on PyPI:
pocketsphinx Download files
No luck — no 3.8 build, not even 3.7. Seriously? Do I have to downgrade to 3.6? Then I googled:
pocketsphinx-0.1.15-cp38-win32
And, good guy, found one at the following link: pypi.bartbroe.re/pocketsphin…
It installed as desired:
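For the record, installing a downloaded WHL is just pip pointed at the local file (the exact filename depends on the build you grabbed):

pip install pocketsphinx-0.1.15-cp38-win32.whl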
To recognize Chinese (Mandarin), the Chinese language and acoustic models must be installed separately:
Download address
After the decompression:
Go to the Python installation directory, under Lib\site-packages\speech_recognition (the models live in its pocketsphinx-data subfolder).
Create a new zh-CN folder there, move everything from the download into it, and rename the files:
- zh_cn.cd_cont_5000 → acoustic-model
- zh_cn.lm.bin → language-model.lm.bin
- zh_cn.dic → pronounciation-dictionary.dict
Once done, you're ready to play. The recognition code is simple:
import time
import speech_recognition as sr

r = sr.Recognizer()
test = sr.AudioFile("1.wav")
start_timestamp = time.time()
with test as source:
    audio = r.record(source)
try:
    text = r.recognize_sphinx(audio, language='zh-CN')
    print(text)
except sr.UnknownValueError as e:
    print(e)
print("Time taken:", time.time() - start_timestamp)
Output results:
That's miles off, and it took nearly three minutes… No good. Without training models myself, let's try a third party.
3. Baidu speech recognition
Non-enterprise accounts get 50,000 free calls, valid for 180 days; no idea why mine shows 150,000:
Adapt the official demo directly:
# -*- coding: utf-8 -*-
# !/usr/bin/env python
"""
-------------------------------------------------
File   : bd_asr_test.py
Author : CoderPig
date   : 2021-01-26 15:04
Desc   : Baidu speech recognition test
-------------------------------------------------
"""
import requests as r

API_KEY = 'kVcnfD9iW2XVZSMaLMrtLYIz'
SECRET_KEY = 'O9o1O213UgG5LFn0bDGNtoRN3VWl2du6'
CUID = '123456python'
RATE = 16000  # Fixed value
DEV_PID = 1537  # Mandarin
ASR_URL = 'http://vop.baidu.com/server_api'
TOKEN_URL = 'http://openapi.baidu.com/oauth/2.0/token'


def fetch_token():
    data = {
        'grant_type': 'client_credentials',
        'client_id': API_KEY,
        'client_secret': SECRET_KEY
    }
    resp_json = r.post(TOKEN_URL, data=data).json()
    return resp_json['access_token']


def wav_conversion(token, file_path):
    with open(file_path, 'rb') as speech_file:
        speech_data = speech_file.read()
    length = len(speech_data)
    params = {'cuid': CUID, 'token': token, 'dev_pid': DEV_PID}
    headers = {
        'Content-Type': 'audio/wav; rate=' + str(RATE),
        'Content-Length': str(length)
    }
    resp = r.post(ASR_URL, headers=headers, params=params, data=speech_data)
    print(resp.text)


if __name__ == '__main__':
    bd_token = fetch_token()
    wav_conversion(bd_token, '13.wav')
After running and waiting for a moment, the interface returns the following message:
I see — the recording exceeds the 60-second limit. Re-cut the audio: 30s segments still failed, and finally 10s per segment worked:
But this result… to do better I'd have to upload audio and transcripts and train a model myself, and then pay when enabling the service:
Then I looked at the price and was promptly dissuaded.
4. Cracking a paid app's conversion API
Is there really something with a high recognition rate that needs no training and no big spend?
There really is: just ¥99 for 2 years. I searched Baidu for "audio to text" and found a well-ranked conversion app. There was no trial feature, so I manually pinged customer service to convert a test file for me:
Soon I received an email:
Looks good, and ¥99 for 2 years of unrestricted use is acceptable, so I bought VIP and threw a 60s recording at it:
Recognition accuracy and speed are fine, and large files are supported:
Most importantly, it supports Japanese:
Tsk tsk, perfect. Now capture packets to study the requests: upload a 30-minute recording and watch. The request flow is as follows:
Now let's work out the rules of these five interfaces.
From the request parameters and responses it's easy to see that interface ① is the audio verification interface — but the file suffix is not .wav but .mp3. Why?
There must have been a conversion, and the motivation is simple:
the converted MP3 is roughly a tenth the size of the WAV!
Don't believe it? Convert locally with FFmpeg:
ffmpeg -i 0.wav -f mp3 -acodec libmp3lame -y test.mp3
The result after conversion:
Still, this local result is larger than the audio the app actually uploaded. How do I know? Sum the Content-Length of all the requests in interface ②:
The sum comes to:
I could stop without figuring out exactly how the conversion is done, but someone as fond of getting to the bottom of things as me has to find out — so decompile the app and take a look, ahem ~
The app is packed with the non-enterprise 360 packer; just dump the dex without thinking. For how to dump it, see the tutorial I wrote earlier.
Android audio/video conversion mostly uses the FFmpeg command line, and mp3 conversion requires the libmp3lame argument, so do a global search for it:
Convert to Java and locate the following code:
Could this be it? Piece the arguments together into a command line:
ffmpeg -i 0.wav -vn -ar 16000 -acodec libmp3lame test2.mp3
Converted MP3 file:
Right-click to see details:
And so it was:
Now that the audio conversion is understood, take a look at interface ②:
Interface name: memprofile → member profile → member information
The token generally only refreshes when the account gets signed in somewhere else, so skip it. Interface ③ is the chunked file upload:
It reuses several response parameters from interface ①:
- tasktag: upload task tag
- tasktoken: upload task token
- timestamp: timestamp of when the upload task was created
- fileindex: file offset, fixed at 0
- chunks: total number of file chunks
- chunk: index of the current chunk
Then interface ④ :
It's easy to see this interface queries the transcription status, polled every 1s until the transcription is complete:
Then interface ⑤ is requested to fetch the transcription result:
The downurl field holds the URL of the result file; just download and parse it.
That's one complete audio upload. What's left is to crack the construction of one parameter → datasign. Basically every interface carries it, and it changes per request. Global search for datasign:
Following the call: the parameter HashMap is sorted by key and concatenated into a string, the salt hUuPd20171206LuOnD is appended, and the MD5 of the whole thing is the datasign.
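As a sanity check, here is a minimal sketch of that signing rule — it mirrors the dict_to_str/md5 helpers in the full code below, with placeholder parameter values:

import hashlib


def data_sign(params):
    # Sort keys ascending, join as &key=value, append the salt, then MD5 the whole string
    content = '&'.join('{}={}'.format(k, params[k]) for k in sorted(params))
    content += 'hUuPd20171206LuOnD'
    return hashlib.md5(content.encode('utf-8')).hexdigest()


print(data_sign({'account': 'demo', 'timestamp': 1611000000}))  # placeholder values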
With the rules known, the code writes itself. For space reasons I won't explain it line by line; here's the complete code:
# -*- coding: utf-8 -*-
# !/usr/bin/env python
"""
-------------------------------------------------
File   : extract_text_by_api.py
Author : CoderPig
date   : 2021-01-28 15:42
Desc   : Generate subtitles via the app's API
-------------------------------------------------
"""
import hashlib
import math
import os
import time

import requests as r
from pydub import AudioSegment

import cp_utils

host = 'app.xunjiepdf.com'
origin_video_dir = os.path.join(os.getcwd(), "origin_video")  # Directory of the original videos
video_to_wav_dir = os.path.join(os.getcwd(), "video_to_wav")  # Directory for video-to-audio output
wav_to_mp3_dir = os.path.join(os.getcwd(), "wav_to_mp3_dir")  # Directory for wav-to-mp3 output

# API endpoints
base_url = 'https://{}/api/v4/'.format(host)
member_profile_url = base_url + "memprofile"
upload_par_url = base_url + "uploadpar"
upload_file_url = base_url + "uploadfile"
task_state_url = base_url + "taskstate"
task_down_url = base_url + "taskdown"

# Constant fields
device_id = 'device id'
product_info = 'F5030BB972D508DCC0CA18BDF7AE48E26717591F38906C09587358DAAC0092F0'
account = 'account'
user_token = 'user token'
machine_id = 'device id'
software_name = 'software name'

# Ordinary request headers
okhttp_headers = {
    'Host': host,
    'User-Agent': 'okhttp/3.10.0'
}

# File upload request headers
upload_headers = {
    'Host': host,
    'User-Agent': 'Dalvik/2.1.0 (Linux; U; Android 8; Mi 20 Build/QQ3A.200805.001)',
    'Content-Type': 'application/octet-stream'
}


# Split a video into multiple wav files
def video_to_wav(file_path, seconds):
    part_duration = seconds * 1000
    print(file_path)
    if file_path.endswith(".flv"):
        video = AudioSegment.from_flv(file_path)
    else:
        print(file_path[-3:])
        video = AudioSegment.from_file(file_path, format=file_path[-3:])
    wav_save_dir = os.path.join(video_to_wav_dir, str(int(round(time.time() * 1000))))
    cp_utils.is_dir_existed(wav_save_dir)
    video_duration = int(video.duration_seconds * 1000)  # Video duration in ms
    part_count = math.ceil(video_duration / part_duration)  # Number of segments to cut
    last_start = video_duration - video_duration % part_duration
    print("Video duration: {} ms, cutting into {} segments".format(video_duration, part_count))
    for part in range(0, part_count - 1):
        start = part * part_duration
        end = (part + 1) * part_duration - 1
        wav_part = video[start: end]
        print("Exporting segment: {} - {}".format(start, end))
        wav_part.export(os.path.join(wav_save_dir, "{}.wav".format(part)), format="wav")
    # The remaining segment
    wav_part = video[last_start: video_duration]
    print("Exporting segment: {} - {}".format(last_start, video_duration))
    wav_part.export(os.path.join(wav_save_dir, "{}.wav".format(part_count)), format="wav")
    return wav_save_dir


# Fetch user information
def member_profile():
    data = {
        "deviceid": device_id,
        "timestamp": int(time.time()),
        "productinfo": product_info,
        "account": account,
        "usertoken": user_token
    }
    data_sign = md5(dict_to_str(data))
    data['datasign'] = data_sign
    resp = r.post(url=member_profile_url, headers=okhttp_headers, data=data)
    print(resp.json())


# File upload verification
def upload_par(file_path):
    file_name = file_path.split(os.path.sep)[-1]
    data = {
        "outputfileextension": "srt",
        "tasktype": "voice2text",
        "productid": "34",
        "isshare": 0,
        "softname": software_name,
        "usertoken": user_token,
        "filecount": 1,
        "filename": file_name,
        "machineid": machine_id,
        "fileversion": "defaultengine",
        "softversion": "4.3.2",
        "fanyi_from": "zh",
        "limitsize": "204800",
        "account": account,
        "timestamp": int(time.time())
    }
    data_sign = md5(dict_to_str(data))
    data['datasign'] = data_sign
    resp = r.post(url=upload_par_url, headers=okhttp_headers, data=data)
    print("Request:", resp.url)
    if resp is not None:
        resp_json = resp.json()
        print(resp_json)
        if resp_json['code'] == 10000:
            return TaskInfo(resp_json['tasktag'], resp_json['tasktoken'], resp_json['timestamp'])
        else:
            return resp_json


# Chunked file upload
def upload_file(upload_task, file_path):
    # File size in bytes
    file_size = os.path.getsize(file_path)
    # Number of 1 MB chunks
    chunks_count = math.ceil(file_size / 1048576)
    upload_params = {
        'tasktag': upload_task.task_tag,
        'timestamp': int(time.time()),
        'tasktoken': upload_task.task_token,
        'fileindex': 0,
        'chunks': chunks_count,
    }
    # Upload chunk by chunk
    for count in range(chunks_count):
        upload_params['chunk'] = count
        start_index = count * 1048576
        with open(file_path, 'rb') as f:
            f.seek(start_index)
            content = f.read(1048576)
        resp = r.post(url=upload_file_url, headers=upload_headers, params=upload_params, data=content)
        print("Request:", resp.url)
        if resp is not None:
            print(resp.json())


# Query the transcription status
def task_state(upload_task):
    data = {
        "ifshowtxt": "1",
        "productid": "34",
        "deviceos": "android10",
        "softversion": "4.3.2",
        "tasktag": upload_task.task_tag,
        "softname": software_name,
        "usertoken": user_token,
        "deviceid": device_id,
        "devicetype": "android",
        "account": account,
        "timestamp": int(time.time())
    }
    data_sign = md5(dict_to_str(data))
    data['datasign'] = data_sign
    while True:
        resp = r.post(url=task_state_url, headers=okhttp_headers, data=data)
        print("Request:", resp.url)
        if resp is not None:
            resp_json = resp.json()
            if resp_json['code'] == 10000:
                print(resp_json['message'])
                return resp_json['code']
            elif resp_json['code'] == 20000:
                print(resp_json['message'])
                time.sleep(1)
                continue
            else:
                return resp_json['code']


# Fetch the transcription result
def task_down(upload_task):
    data = {
        "downtype": 2,
        "tasktag": upload_task.task_tag,
        "productinfo": product_info,
        "usertoken": user_token,
        "deviceid": device_id,
        "account": account,
        "timestamp": int(time.time())
    }
    data_sign = md5(dict_to_str(data))
    data['datasign'] = data_sign
    resp = r.post(url=task_down_url, headers=okhttp_headers, data=data)
    resp_json = resp.json()
    download_url = resp_json.get('downurl')
    print(download_url)
    if download_url is not None:
        download_resp = r.get(download_url)
        if download_resp is not None:
            file_name = download_url.split('/')[-1]
            with open(file_name, 'wb') as f:
                f.write(download_resp.content)
            return file_name


# Parse the srt file into time and text lists
def analyse_srt(srt_file_path):
    time_list = []
    text_list = []
    time_start_pos = 1
    text_start_pos = 2
    with open(srt_file_path, 'rb') as f:
        for index, value in enumerate(f.readlines()):
            if index == time_start_pos:
                time_list.append(value.decode().strip()[0:8])
                time_start_pos += 4
            elif index == text_start_pos:
                text_list.append(value.decode().strip())
                text_start_pos += 4
    return time_list, text_list


# MD5 hash
def md5(content):
    md = hashlib.md5()
    md.update(content.encode('utf-8'))
    return md.hexdigest()


# Dict to string
def dict_to_str(data_dict):
    # Sort by key in ascending order
    sorted_tuple = sorted(data_dict.items(), key=lambda d: d[0], reverse=False)
    content = ''
    for t in sorted_tuple:
        content += '&{}={}'.format(t[0], t[1])
    content += 'hUuPd20171206LuOnD'
    if content.startswith("&"):
        content = content.replace("&", "", 1)
    return content


class TaskInfo:
    def __init__(self, task_tag, task_token, timestamp):
        self.task_tag = task_tag
        self.task_token = task_token
        self.timestamp = timestamp


if __name__ == '__main__':
    cp_utils.is_dir_existed(origin_video_dir)
    cp_utils.is_dir_existed(video_to_wav_dir)
    cp_utils.is_dir_existed(wav_to_mp3_dir)
    flv_file_list = cp_utils.filter_file_type(origin_video_dir, '.flv')
    mp4_file_list = cp_utils.filter_file_type(origin_video_dir, '.mp4')
    flv_file_list += mp4_file_list
    if len(flv_file_list) == 0:
        print("No videos to process")
        exit(0)
    print("\nPlease choose the index of the video to extract subtitles from:")
    for pos, video_path in enumerate(flv_file_list):
        print("{} → {}".format(pos, video_path))
    file_choose_index = int(input())
    file_choose_path = flv_file_list[file_choose_index]
    input_duration = int(input("\nPlease enter the split length in seconds; e.g. 60 cuts the audio into 60s segments: \n"))
    print("Cutting, please wait...")
    wav_output_dir = video_to_wav(file_choose_path, input_duration)
    print("\nPlease choose the index of the audio segment to process:")
    wav_file_list = cp_utils.filter_file_type(wav_output_dir, '.wav')
    for pos, wav_path in enumerate(wav_file_list):
        print("{} → {}".format(pos, wav_path))
    wav_choose_index = int(input())
    wav_choose_path = wav_file_list[wav_choose_index]
    output_mp3_path = os.path.join(wav_to_mp3_dir, '{}.mp3'.format(int(time.time())))
    # File conversion (wav → 16 kHz mp3, matching what the app uploads)
    os.system('ffmpeg -i {} -vn -ar 16000 -acodec libmp3lame {}'.format(wav_choose_path, output_mp3_path))
    # Upload verification
    task = upload_par(output_mp3_path)
    # File upload
    upload_file(task, output_mp3_path)
    # Query transcription status
    task_state(task)
    # Download the result file
    srt_file_name = task_down(task)
    if srt_file_name is not None:
        result_txt_file = '{}.txt'.format(int(time.time()))
        with open(result_txt_file, 'w+', encoding='utf-8') as f:
            for text in analyse_srt(srt_file_name)[1]:
                f.writelines(text + '\n')
        print("File written:", result_txt_file)
Next, drop a video file into the origin_video directory and run the script to complete the subtitle extraction:
Open the generated subtitle file:
Oh, by the way, for Japanese transcription: set the fanyi_from field in the verification interface to jp.
0x5. Summary
The article runs to over 20,000 words — okay, maybe not quite that much. The code is a little messy and I've already found some bugs, so I'll tidy and optimize it on GitHub when I'm free; if you're interested you can star it first, or write your own version following this article. I hope you old drivers use it in moderation. Here's the repo, thanks ~
VideoSubtitleExtractTool