
urllib2

urllib2 is a Python library for fetching web pages; it ships with Python 2.7 as part of the standard library. In Python 3.x, urllib and urllib2 were merged into a single urllib package. urllib3 is a separate third-party extension library.

Urllib2 official documentation: https://docs.python.org/2/library/urllib2.html

Urllib2 source: https://hg.python.org/cpython/file/2.7/Lib/urllib2.py

In Python 3.x, urllib2's functionality lives in urllib.request.
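If you need code that runs under both versions, a minimal compatibility sketch looks like this (assuming only urlopen is needed):

```python
try:
    # Python 3.x: urllib2's functionality lives in urllib.request
    from urllib.request import urlopen
except ImportError:
    # Python 2.x: fall back to the old urllib2 module
    from urllib2 import urlopen

response = urlopen("http://www.baidu.com/")
print(response.read()[:100])
```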

Let’s start with a simple demo that requests the Baidu homepage:

```python
import urllib.request

# Send a request to the specified URL
response = urllib.request.urlopen("http://www.baidu.com/")

# The object returned by the server supports Python file-object operations
html = response.read()

# Print the response
print(html)
```

We have fetched the Baidu homepage, but there is a first problem: when you access it this way, the request’s User-Agent is Python-urllib/3.6 (the User-Agent header identifies the client’s browser).

We need a little camouflage, or this very first step will be detected by anti-crawler measures.

```python
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import urllib.request

# Spoof the User-Agent so the request looks like it comes from a real browser
ua_headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko)"
}

# Build a Request object with urllib.request.Request()
request = urllib.request.Request("http://www.baidu.com/", headers=ua_headers)

# urlopen() accepts either a URL string or a Request object
response = urllib.request.urlopen(request)

# Read the response body
html = response.read()

# HTTP status code, final URL, and response headers
print(response.getcode())
print(response.geturl())
print(response.info())
# print(html)
```
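The same idea can be pushed a little further: repeated requests that all carry an identical User-Agent still look uniform, so one common trick is to rotate through several strings. A minimal sketch (the list below is illustrative, not from the original article):

```python
import random
import urllib.request

# Illustrative pool of browser User-Agent strings
ua_list = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_0) AppleWebKit/537.36 (KHTML, like Gecko)",
]

# Pick a random User-Agent for each request
req = urllib.request.Request("http://www.baidu.com/",
                             headers={"User-Agent": random.choice(ua_list)})
print(urllib.request.urlopen(req).getcode())
```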

Crawling Baidu Tieba (Post Bar)

When query parameters are concatenated directly onto the URL, the request is called a GET request:

```python
#!/usr/bin/python
# coding: utf-8
from urllib import request, parse


def loadPage(fullUrl, filename):
    """Download one page.
    fullUrl: the URL to crawl
    filename: name used for progress messages
    """
    print('Downloading ' + filename)
    headers = {
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_0) AppleWebKit/537.36 (KHTML, like Gecko)"
    }
    request1 = request.Request(fullUrl, headers=headers)
    return request.urlopen(request1).read()


def writePage(html, filename):
    """Save the downloaded HTML to a local file."""
    print('Saving ' + filename)
    # read() returns bytes, so open the file in binary mode
    with open(filename, 'wb') as f:
        f.write(html)
    print('-' * 30)


def tiebaSpider(url, beginPage, endPage):
    """Crawl every page from beginPage to endPage."""
    for page in range(beginPage, endPage + 1):
        pn = (page - 1) * 50            # Tieba paginates in steps of 50 posts
        filename = 'page ' + str(page) + '.html'
        fullUrl = url + '&pn=' + str(pn)
        html = loadPage(fullUrl, filename)
        print(html)
        # writePage(html, filename)


if __name__ == '__main__':
    kw = input('Please enter the name of the tieba to crawl: ')
    beginPage = int(input('Please enter the start page: '))
    endPage = int(input('Please enter the end page: '))

    url = 'http://tieba.baidu.com/f?'
    key = parse.urlencode({"kw": kw})
    fullUrl = url + key
    tiebaSpider(fullUrl, beginPage, endPage)
```
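The spider above relies on parse.urlencode to turn the keyword into a query string. A quick illustrative sketch of what it produces (the values here are made up):

```python
from urllib import parse

# Plain ASCII values pass through as key=value pairs
print(parse.urlencode({"kw": "python"}))   # kw=python

# Non-ASCII input is percent-encoded as UTF-8 automatically
print(parse.urlencode({"kw": "编程"}))      # kw=%E7%BC%96%E7%A8%8B
```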

POST request to Youdao Translation

Some websites do not concatenate their query parameters onto the URL; instead they submit POST form data. Here we simulate such a POST request:

```python
from urllib import request, parse

# The actual request URL (captured from the network panel),
# not the address shown in the browser
url = "http://fanyi.youdao.com/translate?smartresult=dict&smartresult=rule"

# Complete headers copied from the browser
headers = {
    "Accept": "application/json, text/javascript, */*; q=0.01",
    "X-Requested-With": "XMLHttpRequest",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36",
    "Content-Type": "application/x-www-form-urlencoded; charset=UTF-8",
}

# The word to translate
key = input('Please enter the text to translate: ')

# The POST form data
formdata = {
    "from": "AUTO",
    "to": "AUTO",
    "smartresult": "dict",
    "client": "fanyideskweb",
    "type": "AUTO",
    "i": key,
    "doctype": "json",
    "keyfrom": "fanyi.web",
    "ue": "utf-8",
    "version": "2.1",
    "action": "FY_BY_CLICKBUTTON",
    "typoResult": "false",
}

data = parse.urlencode(formdata).encode('utf-8')
print(data)

# Passing data makes this a POST request; without it, urlopen sends a GET
request1 = request.Request(url, data=data, headers=headers)
print(request.urlopen(request1).read().decode('utf-8'))
```
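The response body is JSON, so instead of just printing it we can decode it. A minimal self-contained sketch, assuming Youdao’s historical response layout with a translateResult field (the exact shape may have changed since):

```python
import json
from urllib import request, parse

url = "http://fanyi.youdao.com/translate?smartresult=dict&smartresult=rule"
data = parse.urlencode({"i": "hello", "from": "AUTO", "to": "AUTO",
                        "doctype": "json"}).encode('utf-8')
body = request.urlopen(request.Request(url, data=data)).read().decode('utf-8')

result = json.loads(body)
# Assumed historical shape: {"translateResult": [[{"src": ..., "tgt": ...}]]}
print(result["translateResult"][0][0]["tgt"])
```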

Grabbing Ajax-loaded Douban movies

Sometimes a page looks empty at first because its content is loaded through Ajax. In that case we should focus on the data source: a page loaded via Ajax typically fetches its data as JSON, so once we find the JSON endpoint we have the data.

```python
from urllib import request, parse

# Ajax endpoint that returns the chart data as JSON
url = "https://movie.douban.com/j/chart/top_list?type=11&interval_id=100%3A90&action"

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36"
}

# Paging parameters: offset of the first item and number of items to return
formdata = {
    "start": "0",
    "limit": "20",
}

data = parse.urlencode(formdata).encode('utf-8')
request1 = request.Request(url, data=data, headers=headers)
print(request.urlopen(request1).read().decode('utf-8'))
```
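Since the endpoint returns JSON, the text can be decoded straight into Python objects. A minimal sketch, assuming each item in the returned array carries "title" and "score" fields (field names observed from this endpoint and subject to change):

```python
import json
from urllib import request, parse

url = "https://movie.douban.com/j/chart/top_list?type=11&interval_id=100%3A90&action"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36"
}
data = parse.urlencode({"start": "0", "limit": "20"}).encode('utf-8')

body = request.urlopen(request.Request(url, data=data, headers=headers)).read().decode('utf-8')

# Assumed shape: a JSON array of movie objects; .get() avoids KeyError
# if the field names differ
for movie in json.loads(body):
    print(movie.get("title"), movie.get("score"))
```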