Tencent Video crawler

To be honest, last week I interviewed for a crawler job, and the conversation went like this.

Interviewer: You're self-taught, so let me ask you a few questions. What are shallow and deep copies in Python?
Me: Hehe.
Interviewer: Uh… Can you tell me how you understand object-oriented programming in Python?
Me: Hehe.
Interviewer: Uh… Okay, I'll skip the basics. Have you ever crawled a dynamic web page?
Me: Yes, I've crawled Douban.
Interviewer: What would you do if I asked you to crawl a video site? The video address doesn't show up in the F12 viewer!
Me: It doesn't?
Interviewer: Mm-hm!
Me: Then I don't know…
Interviewer: Uh… You're putting me in a tough spot. How about you try crawling Tencent Video?
Me: Sure, sure!
Interviewer: And no rendering the JS with a simulated browser! Capture the packets!

Then I got home and took a look!! What the hell!!!!


There really is no address anywhere!! What do I do? Capture packets? What even is capturing packets!! All I know is how to use Selenium and Firefox to simulate a browser for dynamic pages… Is this job a dead end? I'm not a quitter!! So I went to Baidu: "how to capture packets with Python", "how to crawl JS with Python". Eventually I understood that capturing packets means going through the requests in the F12 Network tab… Then I searched again: "how to crawl AJAX with Python", and still couldn't solve the problem! However, while searching for Tencent Video on Jianshu, I learned that if you open the mobile version of the video site with Chrome's mobile emulation, you can see the video's address directly. But… how do I do that in code?

Okay, enough rambling; nobody wants to hear any more of it. Let me get straight to the point. After three days of study, I finally figured out how to crawl this kind of dynamic site. First, use the Network tab to find the real address (the back-door address, so to speak) that receives the requests.


There it is: as long as you put the parameters in the URL, it returns data. Scrolling down, we can see the parameters.


Many of these parameters are not required. Let me say up front that in the end we have to work out the pattern by observation and rebuild this URL ourselves, because a crawler can't open F12 on the site to look things up. Let's see what the URL returns when we open it:

    <root>
      <exem>0</exem> <hs>0</hs> <ls>0</ls> <tm>1479138997</tm> <dltype>1</dltype> <preview>135</preview>
      <sfl> <cnt>0</cnt> </sfl>
      <fl>
        <cnt>3</cnt>
        <fi> <sl>0</sl> <br>64</br> <id>10703</id> <name>sd</name> <lmt>0</lmt> <sb>1</sb> <cname>SD;(270P)</cname> <fs>4795182</fs> </fi>
        <fi> <fs>8435750</fs> <sl>1</sl> <br>235</br> <id>10712</id> <name>hd</name> <lmt>0</lmt> <sb>1</sb> <cname>HD;(480P)</cname> </fi>
        <fi> <lmt>0</lmt> <sb>1</sb> <cname>Super HD;(720P)</cname> <fs>17133689</fs> <sl>0</sl> <br>650</br> <id>10701</id> <name>shd</name> </fi>
      </fl>
      <s>o</s>
      <vl>
        <vi>
          <br>60</br> <fst>5</fst> <vh>480</vh> <dm>0</dm> <targetid>1619695007</targetid> <fclip>1</fclip>
          <ul>
            <ui> <vt>200</vt> <dt>2</dt> <dtc>10</dtc> <url>http://222.84.158.26/vhot2.qqvideo.tc.qq.com/</url> </ui>
            <ui> <dtc>10</dtc> <url>http://222.84.158.27/vhot2.qqvideo.tc.qq.com/</url> <vt>200</vt> <dt>2</dt> </ui>
            <ui> <url>http://222.84.158.28/vhot2.qqvideo.tc.qq.com/</url> <vt>200</vt> <dt>2</dt> <dtc>10</dtc> </ui>
            <ui> <vt>0</vt> <dt>2</dt> <dtc>10</dtc> <url>http://video.dispatch.tc.qq.com/96405217/</url> </ui>
          </ul>
          <st>2</st> <videotype>23</videotype> <hevc>0</hevc> <dsb>0</dsb> <ch>0</ch> <drm>0</drm> <enc>0</enc>
          <vid>q0345aui4cn</vid> <lnk>q0345aui4cn</lnk> <fn>q0345aui4cn.p712.mp4</fn> <fs>8435750</fs>
          <fvkey>79487AB1BD24866699D881C26FC8E62D2131609CE05CC1F18B76EC6481957C2416C28CAE6B21FEE56DCB157706205BC52F3928F51CC08EE79AF4E3157C4940870340CFB9B9407A55FF02A8EA1B560B00A3ADC310709553ED9AC0E8E67705AE5D4097A5370B4BF1AB75B444A3A2EB7F51</fvkey>
          <level>0</level> <td>135.722</td> <fmd5>9cac5ea26dbf008bce792704b6d67c97</fmd5> <ct>21600</ct> <sp>0</sp>
          <cl>
            <fc>1</fc>
            <ci> <idx>1</idx> <cs>8435748</cs> <cd>135.722</cd> <cmd5>40a954202d9f663ac2443372afde8d4b</cmd5> <keyid>q0345aui4cn.10712.1</keyid> </ci>
          </cl>
          <share>1</share> <type>9</type> <vst>2</vst> <iflag>0</iflag>
          <ti>Amnesiac old man asks the same question 17 times! Patient doctor goes viral</ti>
          <vw>848</vw> <logo>1</logo>
          <pl>
            <cnt>2</cnt>
            <pd> <url>http://video.qpic.cn/video_caps/0/</url> <cd>2</cd> <c>10</c> <h>45</h> <w>80</w> <r>10</r> <fn>q1</fn> <fmt>40001</fmt> </pd>
            <pd> <fmt>40002</fmt> <url>http://video.qpic.cn/video_caps/0/</url> <cd>2</cd> <c>5</c> <h>90</h> <w>160</w> <r>5</r> <fn>q2</fn> </pd>
          </pl>
        </vi>
        <cnt>1</cnt>
      </vl>
    </root>
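Only a few fields in that response matter to the crawler: the format ids under fl/fi, the server addresses under vl/vi/ul/ui, and the vid. Here is a minimal sketch of pulling them out. The sample string is a trimmed copy of the response above, the helper name parse_getinfo is my own invention, and using xml.etree.ElementTree is just a choice for this sketch, not necessarily the best tool for the live response.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the getinfo response shown above (most fields removed).
SAMPLE_GETINFO = """
<root>
  <fl>
    <cnt>3</cnt>
    <fi><id>10703</id><name>sd</name></fi>
    <fi><id>10712</id><name>hd</name></fi>
    <fi><id>10701</id><name>shd</name></fi>
  </fl>
  <vl>
    <vi>
      <ul>
        <ui><url>http://222.84.158.26/vhot2.qqvideo.tc.qq.com/</url></ui>
        <ui><url>http://video.dispatch.tc.qq.com/96405217/</url></ui>
      </ul>
      <vid>q0345aui4cn</vid>
    </vi>
  </vl>
</root>
"""


def parse_getinfo(xml_text):
    """Hypothetical helper: return (formats, urls, vid) from a getinfo response."""
    root = ET.fromstring(xml_text.strip())
    # Map each format name (sd, hd, shd) to its five-digit id.
    formats = {fi.findtext("name"): fi.findtext("id") for fi in root.iter("fi")}
    # Collect every server address listed under <ui>.
    urls = [ui.findtext("url") for ui in root.iter("ui")]
    vid = root.findtext("vl/vi/vid")
    return formats, urls, vid


formats, urls, vid = parse_getinfo(SAMPLE_GETINFO)
print(formats)
print(urls)
print(vid)
```

If the live response turns out not to be well-formed XML, a forgiving parser such as BeautifulSoup would be the safer pick.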

Next, let's open another URL called getkey and see what's inside:

    <root>
      <ct>21600</ct>
      <keyid>q0345aui4cn.10712.1</keyid>
      <key>50C099826174CDA1036AC3F8BB82BEC4A6E91005384153F81E3EF477C719CBF0B442C8D533F36402878D2B7C639318CD3078DEFC435C8C865813DBEB689F66D789F410D5115A745C3019F6ABAEBFF53258F4F524A9FE978D26595B6B1732F25992A33CB6F3F2C792FA22CAC171C7087C</key>
      <br>62154.61</br>
      <sp>0</sp>
      <sr>0</sr>
      <level>0</level>
      <s>o</s>
      <filename>q0345aui4cn.p712.mp4</filename>
      <levelvalid>1</levelvalid>
    </root>
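The only thing the crawler needs from this page is the text of the key element. A small sketch of extracting it follows. The sample below is a shortened stand-in for the response above (the key is cut down to its first 16 characters), and parse_getkey is a made-up name.

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for a getkey response; the real <key> is a long hex string.
SAMPLE_GETKEY = (
    "<root><ct>21600</ct><keyid>q0345aui4cn.10712.1</keyid>"
    "<key>50C099826174CDA1</key>"
    "<filename>q0345aui4cn.p712.mp4</filename></root>"
)


def parse_getkey(xml_text):
    """Hypothetical helper: return (key, filename) from a getkey response."""
    root = ET.fromstring(xml_text.strip())
    key = root.findtext("key")
    if key is None:
        # Fail loudly instead of building a URL with a missing vkey.
        raise ValueError("no <key> element in getkey response")
    return key.strip(), root.findtext("filename")


key, filename = parse_getkey(SAMPLE_GETKEY)
print(key, filename)
```

Raising on a missing key is a design choice for this sketch: a response without a key gives the crawler nothing to work with, so stopping early seems better than carrying on.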

So how do we know what the crawler needs? You can find the video request directly in the Network tab. Let's look at the video's real address:

http://183.60.171.154/vhot2.qqvideo.tc.qq.com/q0345aui4cn.p712.1.mp4?sdtfrom=v1090&type=mp4&vkey=E1B062FA0354A56228FFE25100D1B41F83FEBC012471446999AA386A8FC8A8273AE07BFF6092B4F4AABCE8D98E7F622D71AF44C88B5D3F63A6ED4BFE45457ED0D95E67CA930861F396967521A97ED4154449A61DD95F24A7945262EEB481C3BAC8FE53A2B186F1FC111CE8D32FFBB8C1&level=0&platform=11&br=60&fmt=hd&sp=0&guid=ED20EE2DB93337F04708204D247CB34A&locid=f9aa455a-6a68-4fd9-a4dd-0dcf179bd6b3&size=8435748&ocid=2535331244

Let's simplify it and see which parameters can be dropped. Here is the stripped-down address my crawler ended up with:

http://222.84.158.29/vhot2.qqvideo.tc.qq.com/q0345aui4cn.p701.1.mp4?vkey=46994EBB71F530AC11AB11EFA265508534036712DA0187B3939F9FDE9062FBA7B5C37F0C80D72626AEEB3CD87A15B19CB064817D98C4DD0D43D113804FE8F50F67DE15DD2FC48CC06FA9AE67678FD2CE9363358A1AC9BBCDB831B1AF79E7B105

The first thing we notice is that the IP and the vkey differ between the two addresses. The IP is easy to explain: different video sources have different addresses. The vkey changes all the time; it is different every time you open the getkey page. So here is what the stripped-down video address needs:

ip = http://222.84.158.29/vhot2.qqvideo.tc.qq.com/ (from the getinfo page)
filename = vid + .p701 + .1.mp4 (p701 comes from the format id 10701 on the getinfo page; .1.mp4 is the suffix, and .2 is fine too as long as that video source exists)
vkey = 46994EBB71F530AC11AB11EFA265508534036712DA0187B3939F9FDE9062FBA7B5C37F0C80D72626AEEB3CD87A15B19CB064817D98C4DD0D43D113804FE8F50F67DE15DD2FC48CC06FA9AE67678FD2CE9363358A1AC9BBCDB831B1AF79E7B105 (from the getkey page)
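The filename rule and the final address can be sketched as two small functions. The function names are made up for illustration. Dropping the first two digits of the five-digit format id (10701 becomes p701) follows what the crawler code in this post does with id[2:], and the part number 1 is assumed as the default.

```python
def build_filename(vid, format_id, part=1):
    """Hypothetical helper: vid + '.p' + last three digits of the format id + '.<part>.mp4'."""
    # "10701" -> "701": drop the first two digits, as the crawler does with id[2:].
    return "{}.p{}.{}.mp4".format(vid, format_id[2:], part)


def build_video_url(base, vid, format_id, vkey, part=1):
    """Hypothetical helper: join server address, filename, and vkey into the video URL."""
    if not base.endswith("/"):
        base += "/"
    return base + build_filename(vid, format_id, part) + "?vkey=" + vkey


# "ABC" is a placeholder; a real vkey comes from the getkey page.
print(build_video_url("http://222.84.158.29/vhot2.qqvideo.tc.qq.com/",
                      "q0345aui4cn", "10701", "ABC"))
```

Since the vkey changes on every getkey request, the assembled URL should be treated as short-lived.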

All right, let's put it together! Get the vid from the original video page (I won't explain this one; open the F12 viewer yourself), then get the id from the getinfo page and splice together the filename, then get the vkey from the getkey page, and finally splice everything into the video URL. Here are my simplified getinfo and getkey URLs; just replace vid, id, and filename and they will open:

getinfo = 'http://vv.video.qq.com/getinfo?vids={vid}&otype=xml&defaultfmt=fhd'
getkey = 'http://vv.video.qq.com/getkey?format={id}&otype=xml&vt=150&vid={vid}&ran=0%2E9477521511726081&charge=0&filename={filename}&platform=11'

That's it, let's start writing the crawler!!

    from urllib.request import Request, urlopen
    from urllib.error import URLError, HTTPError
    from bs4 import BeautifulSoup
    import re

    # Example page URLs in the three supported shapes
    print('https://v.qq.com/x/page/h03425k44l2.html\n'
          'https://v.qq.com/x/cover/dn7fdvf2q62wfka/m0345brcwdk.html\n'
          'http://v.qq.com/cover/2/2iqrhqekbtgwp1s.html?vid=c01350046ds')
    web = input("Please enter the URL: ")

    # Find the vid from the video page URL
    if re.search(r'vid=', web):
        vid = re.findall(r'vid=(.*)', web)[0]
    else:
        vid = web.split("/")[-1].replace('.html', '')
    vid = vid.strip()

    getinfo = 'http://vv.video.qq.com/getinfo?vids={vid}&otype=xml&defaultfmt=fhd'.format(vid=vid)


    def getpage(url):
        """Open a URL with a browser-like User-Agent and return the decoded body."""
        req = Request(url)
        req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit')
        try:
            # Pass req, not url, so the User-Agent header is actually sent
            response = urlopen(req)
        except HTTPError as e:
            print("The server couldn't fulfill the request.")
            print('Error code:', e.code)
            raise
        except URLError as e:
            print('We failed to reach a server.')
            print('Reason:', e.reason)
            raise
        return response.read().decode('utf-8')


    # Parse getinfo: keep the last <url> that contains an IP address
    # and the last five-digit <id>
    soup = BeautifulSoup(getpage(getinfo), "html.parser")
    ip_pattern = re.compile(
        r"((?:(2[0-4]\d)|(25[0-5])|([01]?\d\d?))\.){3}(?:(2[0-4]\d)|(25[0-5])|([01]?\d\d?))")
    for e1 in soup.find_all('url'):
        if re.search(ip_pattern, e1.get_text()):
            ip = e1.get_text()
    id_pattern = re.compile(r"\d{5}")
    for e2 in soup.find_all('id'):
        if re.search(id_pattern, e2.get_text()):
            fmt_id = e2.get_text()

    # Splice the filename from the vid and the id, e.g. 10701 -> p701
    filename = vid + '.p' + fmt_id[2:] + '.1.mp4'

    # Build the getkey URL from the getinfo data
    getkey = ('http://vv.video.qq.com/getkey?format={id}&otype=xml&vt=150&vid={vid}'
              '&ran=0%2E9477521511726081&charge=0&filename={filename}&platform=11'
              ).format(id=fmt_id, vid=vid, filename=filename)
    key = re.findall(r'<key>(.*)</key>', getpage(getkey))

    videourl = ip + filename + '?vkey=' + key[0]
    print('Video playback address: ' + videourl)
    # Done
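As a side note, the vid extraction at the top of the crawler can also be done with urllib.parse instead of regular expressions. The following is a sketch of that alternative, not the code I actually submitted; extract_vid is a made-up name, and the three test URLs are the examples the crawler prints.

```python
from urllib.parse import urlparse, parse_qs


def extract_vid(page_url):
    """Hypothetical helper: get the vid from a Tencent Video page URL."""
    parsed = urlparse(page_url.strip())
    query = parse_qs(parsed.query)
    # Case 1: the vid is given explicitly as a query parameter.
    if "vid" in query:
        return query["vid"][0]
    # Case 2: the vid is the last path segment without the .html extension.
    last = parsed.path.rstrip("/").split("/")[-1]
    if last.endswith(".html"):
        last = last[:-len(".html")]
    return last


for example in (
    "https://v.qq.com/x/page/h03425k44l2.html",
    "https://v.qq.com/x/cover/dn7fdvf2q62wfka/m0345brcwdk.html",
    "http://v.qq.com/cover/2/2iqrhqekbtgwp1s.html?vid=c01350046ds",
):
    print(extract_vid(example))
```

Parsing the query string this way also copes with extra parameters after vid, which the greedy vid=(.*) pattern would swallow into the vid.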


Then I uploaded the crawler to GitHub and sent it to the interviewer, happily waiting to start work and stop begging for food! But!!!!!

Interviewer: Mm, good! It actually works! What are your salary expectations?
Me: Hey, I have no experience with this. Six or seven thousand would be fine!
Interviewer: Well, you have no experience, so we can only pay you at the fresh-graduate rate.
Me: No problem, no problem, the salary isn't important. What matters most is getting into the industry and building up experience!
(A while later)
Interviewer: Actually… our boss says your fundamentals are weak, even though you're good at teaching yourself. Do you know data structures and algorithms?
Me: No…
Interviewer: Actually, we've already built the crawler. The next step is developing log analysis and a recommendation system.
Me: …

So I didn't get the offer and went back to begging… I don't even watch Tencent Video and have no use for crawling it, so I'm putting the source code up for everyone to learn from. Whoever wants to crawl it can crawl it themselves, and skip the ads too. o(╥﹏╥)o When will I find a job?? Forget it, tomorrow I'll go learn front-end and try to become a page-slicing front-end grunt before the end of the year!!
