I recently had a requirement: take a link shared from the Youku iOS client, extract the video source, and play it in the iOS system player. After running into a few pitfalls, I gained some experience with extracting video from video websites, and I summarize it here to give some direction to anyone who hits the same problem later.

1. Introduction

Extracting website video links can be divided into three situations:

  1. The page source shows that the video source is a Flash file, so no usable video address can be read from it. This case needs the third-party library you-get.
  2. Unfortunately, you-get does not support every video website. For unsupported sites, either set the User-Agent to imitate an iPhone user and extract the link, or crack the site's encryption scheme.
  3. The video source served to Safari on the iPhone, which is the focus of this article.

2. Use you-get to capture the video address

you-get is a video downloader written in Python 3 that runs from the command line rather than a GUI. It supports many video sites (see the GitHub project for details) and is still being updated.

Installing you-get

you-get runs on Python 3 and does not support Python 2.x, so make sure a Python 3.x version is installed on the system before you start.

The project description says you-get can also be installed directly from source, but unless you have special requirements, pip is the easiest way to install it. For installing pip itself, see the article PIP Installation.

Install you-get using pip

$ [sudo] pip3 install you-get	

Check that the installation succeeded

$ you-get -V

Use you-get to download the video

$ you-get http://youtu.be/sGwy8DsUJ4M

Display Video Information

$ you-get -u http://youtu.be/sGwy8DsUJ4M

Running you-get -u $link prints the video information together with the real video URLs. The video link can then be extracted from that output with a regular expression.
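The parsing step could look roughly like the sketch below. It assumes the output lists the direct links on their own lines after a `Real URLs:` marker; that layout is an assumption about the you-get version in use, so check it against real output first. The helper names `extract_real_urls` and `real_urls_for` and the sample text are made up for illustration.

```python
import re
import subprocess

# A line that contains nothing but an http(s) URL.
URL_LINE = re.compile(r"^[ \t]*(https?://\S+)[ \t]*$", re.MULTILINE)


def extract_real_urls(output):
    """Return the URLs listed after the 'Real URLs:' marker (assumed format)."""
    _, found, tail = output.partition("Real URLs:")
    if not found:
        return []
    return URL_LINE.findall(tail)


def real_urls_for(link):
    """Run `you-get -u LINK` and parse its output (needs you-get on PATH)."""
    result = subprocess.run(
        ["you-get", "-u", link], capture_output=True, text=True, check=True
    )
    return extract_real_urls(result.stdout)


# Hand-written sample text, not captured from a real run.
SAMPLE = """site:                Example
title:               Demo clip
Real URLs:
https://example.com/video.mp4
"""
```

Only `real_urls_for` touches the network. `extract_real_urls(SAMPLE)` can be used to check the regular expression offline.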

Some problems

  1. The iOS system player supports only a few video formats, including MP4 and MOV. Many of the downloaded videos are in FLV format and cannot be played in the system player.
  2. you-get cannot extract the video from the URL shared by the mobile client.
  3. you-get doesn't support all video sites.
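For the first problem, a cheap pre-check on the extracted link can tell whether the system player has a chance of playing it. This is a minimal sketch that only looks at the file extension in the URL path. The helper name `is_ios_playable` is hypothetical, and the extension set holds only the two formats named above.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Formats named in this article as playable by the iOS system player.
IOS_PLAYABLE = {".mp4", ".mov"}


def is_ios_playable(video_url):
    """Guess from the URL path's extension whether iOS can play the file."""
    suffix = PurePosixPath(urlparse(video_url).path).suffix.lower()
    return suffix in IOS_PLAYABLE
```

The extension is only a hint. A link without an extension, or one that redirects, would need a check of the response's Content-Type instead.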

3. Extract the video link from Safari on your phone

Although iOS doesn't support video formats like FLV, sites like Youku can still play video in Safari. So I guessed that on mobile the video is not served as a Flash file. Set the User-Agent to Mozilla/5.0 (iPhone; CPU iPhone OS 8_1 like Mac OS X) AppleWebKit/600.1.4 (KHTML, like Gecko) Version/8.0 Mobile/12B410 Safari/600.1.4 to imitate a mobile browser when visiting the video site, and the video link shows up in the page source.
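Sending that User-Agent from a script needs nothing beyond the standard library. The sketch below uses `urllib.request`; the names `IPHONE_UA`, `build_mobile_request` and `fetch_mobile_html` are illustrative, not part of any tool used in this article.

```python
import urllib.request

# The iPhone Safari user agent string used in this article.
IPHONE_UA = (
    "Mozilla/5.0 (iPhone; CPU iPhone OS 8_1 like Mac OS X) "
    "AppleWebKit/600.1.4 (KHTML, like Gecko) Version/8.0 "
    "Mobile/12B410 Safari/600.1.4"
)


def build_mobile_request(url):
    """Build a request that identifies itself as Safari on an iPhone."""
    return urllib.request.Request(url, headers={"User-Agent": IPHONE_UA})


def fetch_mobile_html(url, timeout=10):
    """Download the page as the server sends it to a mobile browser."""
    with urllib.request.urlopen(build_mobile_request(url), timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")
```

A plain HTTP fetch like this returns the HTML exactly as served. Anything the page adds later with JavaScript will not be in the result.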

Video link v.youku.com/v_show/id_X…

And the page source

The src attribute in the figure is what we need.

Knowing that the page source contains the video link makes things much easier: fetch the source, then extract the link with a regular expression. In practice, however, the video tag turned out to be loaded dynamically, and reverse-engineering the JavaScript is troublesome and does not generalize across sites. So a simple, brute-force solution is used here:

Run a virtual display on the server, load the page in a real browser, and extract the video address by running a regular expression over the final rendered source.

Runtime environment

CentOS, Python 3.4, 1M bandwidth, Firefox browser

Use Splinter to call the browser

Splinter is a Python automated testing tool that simulates browser behavior. It can run JavaScript, supports mouse operations, and so on.

Install the stable version

$ [sudo] pip install splinter

Or install from source

$ git clone https://github.com/cobrateam/splinter.git
$ cd splinter
$ [sudo] python setup.py install

Code sample

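from splinter import Browser

browser = Browser()      # uses Firefox by default
browser.visit(url)       # url: the page to load
html = browser.html      # page source after JavaScript has run

browser.quit()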

Running a virtual desktop

CentOS servers do not have a desktop. To let the browser render pages on the server, I run a virtual desktop from the CentOS command line. Xvfb creates a virtual X display, and Python's PyVirtualDisplay is used to control it.

Install Xvfb and PyVirtualDisplay

yum install xorg-x11-server-xvfb
pip install pyvirtualdisplay

Install Firefox and Selenium

yum install firefox
pip install selenium

Code sample

from pyvirtualdisplay import Display
from selenium import webdriver

display = Display(visible=0, size=(800, 600))
display.start()

browser = webdriver.Firefox()
browser.get('http://www.baidu.com')
print(browser.title)
browser.quit()

display.stop()

Talk is cheap, Show me the fucking code


from splinter import Browser
from pyvirtualdisplay import Display

import re
import json

# iPhone Safari user agent, so the site serves its mobile (non-Flash) page
IPHONE_UA = ("Mozilla/5.0 (iPhone; CPU iPhone OS 8_1 like Mac OS X) "
             "AppleWebKit/600.1.4 (KHTML, like Gecko) Version/8.0 "
             "Mobile/12B410 Safari/600.1.4")


def fetch_info(url):
    html = download_html(url)
    videoUrl = parse_html(url, html)

    resultDic = {'ret': 0, 'srcUrl': url, 'title': url, 'abstract': url, 'qsvideo': videoUrl}
    print(json.dumps(resultDic))


def download_html(url):
    # Start a virtual X display so Firefox can run on a server without a desktop
    display = Display(visible=0, size=(800, 600))
    display.start()
    try:
        browser = Browser(user_agent=IPHONE_UA)
        try:
            browser.visit(url)
            return browser.html  # page source after JavaScript has run
        finally:
            browser.quit()
    finally:
        display.stop()


def parse_html(url, html):
    # Non-greedy groups stop at the first "></video> instead of the last one
    matchObj = re.search(r'<div class="tvp_video"><video(.*?)src="(.*?)"></video>', html)
    if matchObj:
        return matchObj.group(2)
    return ""


if __name__ == '__main__':
    url = "http://v.youku.com/v_show/id_XMTMyOTU4NTc4OA==.html"
    fetch_info(url)
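The regular expression above depends on the exact markup around the video tag, so it breaks as soon as a site changes its wrapper div or attribute order. One alternative, not what the code above does, is to parse the rendered HTML with the standard library's `html.parser` and collect `src` attributes from `<video>` tags and from `<source>` tags nested inside them. The names `VideoSrcParser` and `find_video_sources` are hypothetical.

```python
from html.parser import HTMLParser


class VideoSrcParser(HTMLParser):
    """Collect src values from <video> tags and their nested <source> tags."""

    def __init__(self):
        super().__init__()
        self.sources = []
        self._in_video = False

    def handle_starttag(self, tag, attrs):
        if tag == "video":
            self._in_video = True
        elif not (tag == "source" and self._in_video):
            return
        src = dict(attrs).get("src")
        if src:
            self.sources.append(src)

    def handle_endtag(self, tag):
        if tag == "video":
            self._in_video = False


def find_video_sources(html):
    """Return every video source URL found in the given HTML text."""
    parser = VideoSrcParser()
    parser.feed(html)
    return parser.sources
```

It takes the same `html` string that `download_html` returns and gives back a list, which is empty when the page has no video tag.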

4. The second case

In the second case, where you-get does not support the video website, either set the User-Agent to pose as an iPhone user and extract the link, or crack the site's encryption scheme.

5. Summary

In general, this is a workable scraping approach. However, because it has to start a browser and load the whole page, it is slow under poor conditions. In my case, taking tens of seconds to grab one video link over 1M bandwidth is almost unbearable, so how to speed it up on low bandwidth still needs exploring. My current guess is that the time goes to downloading images while the page loads; the actual cause still needs to be investigated and verified.
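One way to test the image hypothesis is to start Firefox with image loading switched off and compare timings. The sketch below sets the Firefox preference `permissions.default.image` to 2, which blocks images, through Selenium's Firefox options. The helper names `apply_prefs` and `make_browser` are made up for illustration, and whether this actually shortens the load time here is untested.

```python
# Firefox preference controlling image loading; 2 means "block all images".
NO_IMAGE_PREFS = {"permissions.default.image": 2}


def apply_prefs(options, prefs=NO_IMAGE_PREFS):
    """Copy each preference onto a Selenium Firefox Options-like object."""
    for name, value in prefs.items():
        options.set_preference(name, value)
    return options


def make_browser():
    """Start Firefox with images disabled (needs selenium and Firefox)."""
    from selenium import webdriver
    from selenium.webdriver.firefox.options import Options

    return webdriver.Firefox(options=apply_prefs(Options()))
```

Timing one page load with `make_browser()` against a default `webdriver.Firefox()` on the same link would show whether images are the bottleneck.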

References

  1. you-get on GitHub
  2. PIP
  3. Splinter
  4. Linux Headless: use Selenium to grab data