Author: Liu Sai, Pythonista && Otaku, a surveying and mapping worker in the middle of a career change

Homepage: zhihu.com/people/ban-zai-liu-shang

A crawler boils down to three main steps: make a request, parse the data, and store the data. That alone is enough to write a basic crawler. Frameworks such as Scrapy bundle in everything a crawler could ever need, but they can be hard for beginners to navigate and a bit heavyweight. So I decided to write a lightweight crawler framework, Looter, which integrates two core features: an interactive debugging shell and crawler templates. With looter you can quickly write an efficient crawler. The project's documentation is also fairly complete; if anything is unclear, you can read the source code.
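
To make those three steps concrete, here is a minimal, framework-free sketch using the requests and lxml libraries; the URL and the selector are placeholders of my own, not taken from any real site:

  import json

  import requests
  from lxml import html

  # 1. make a request
  res = requests.get('https://example.com/posts')  # placeholder URL
  # 2. parse the data
  tree = html.fromstring(res.text)
  titles = [a.text_content() for a in tree.cssselect('ul li a')]  # placeholder selector
  # 3. store the data
  with open('titles.json', 'w') as f:
      json.dump(titles, f)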

Installation

  $ pip install looter

Only Python 3.6 or later is supported.

Quick start

Let's start with a very simple image crawler. First, fetch the target page with the looter shell:

  $ looter shell konachan.com/post

You can then grab the images and save them locally with just two lines of code:

  >>> imgs = tree.cssselect('a.directlink')
  >>> save_imgs(imgs)

Or with just one line :D

  >>> save_imgs(links(res, search='jpg'))

Workflow

If you want to whip up a crawler quickly, you can use looter’s templates to generate one automatically

  $ looter genspider <name> <tmpl> [--async]

In this command, tmpl specifies the template, which comes in two kinds: a data template and an image template.

--async is an optional flag that makes the core of the generated crawler use asyncio instead of a thread pool.
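
For context, a thread-pool crawler runs many blocking fetches in parallel threads, while an asyncio crawler schedules them as coroutines on a single event loop. The sketch below shows the asyncio style in general terms using aiohttp; it is not the exact code looter generates:

  import asyncio

  import aiohttp

  async def async_crawl(session, url):
      # fetch a single page without blocking the event loop
      async with session.get(url) as res:
          return await res.text()

  async def main(tasklist):
      async with aiohttp.ClientSession() as session:
          # schedule every page concurrently instead of handing them to a thread pool
          return await asyncio.gather(*(async_crawl(session, url) for url in tasklist))

  # pages = asyncio.run(main(tasklist))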

In the generated template, you can customize the domain and tasklist variables.

What is a tasklist? It is simply the list of URLs of all the pages you want to crawl.

Using http://konachan.com as an example, you can create your own taskList using list comprehensions:

  domain = 'https://konachan.com'
  tasklist = [f'{domain}/post?page={i}' for i in range(1, 9777)]

Then you have to customize your crawl function, which is the core of the crawler.

  import looter as lt
  from pprint import pprint

  def crawl(url):
      tree = lt.fetch(url)
      items = tree.cssselect('ul li')
      for item in items:
          data = dict()
          # data[...] = item.cssselect(...)
          pprint(data)

In most cases, what you want to grab is a list (that is, a ul or ol tag in the HTML), which can be captured as the items variable with a CSS selector.

You then simply iterate over the items with a for loop, extract the data you want, and store it in a dict.
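
To make that concrete, here is a sketch of how the crawl function might be filled in for the konachan example; the 'a.directlink' selector and the img_url field are my own assumptions about the page markup, not part of the generated template:

  import looter as lt
  from pprint import pprint

  def crawl(url):
      tree = lt.fetch(url)
      items = tree.cssselect('ul li')
      for item in items:
          data = dict()
          # assumed markup: each list item contains a direct image link
          links = item.cssselect('a.directlink')
          if links:
              data['img_url'] = links[0].get('href')
              pprint(data)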

However, before you finish writing the crawler, it's a good idea to debug your cssselect code using the shell provided by looter:

  >>> items = tree.cssselect('ul li')
  >>> item = items[0]
  >>> item.cssselect('anything you want to crawl')

Make sure the output of that code is the data you actually want.

Once debugging is done, your crawler is finished. See, isn't that easy? 🙂

Of course, I also prepared several crawler examples for reference.

Functions

Looter provides many useful functions for users.

view

Before you crawl, make sure the page renders the way you expect it to:

  >>> view(url)

save_imgs

When you have a bunch of image links, you can use this to save them locally:

  >>> img_urls = [...]
  >>> save_imgs(img_urls)

alexa_rank

Get the site's reach and popularity rank on Alexa. This function returns a tuple of (url, reach rank, popularity rank).

  >>> alexa_rank(url)

links

Get all the links on the page:

  >>> links(res)                        # get all links
  >>> links(res, absolute=True)         # get absolute links
  >>> links(res, search='text')         # find links matching the given text

Similarly, you can use regular expressions to retrieve matching links

  >>> re_links(res, r'regex_pattern')

save_as_json

Save the results as a JSON file, with support for sorting by key:

  >>> total = [...]
  >>> save_as_json(total, name='text', sort_by='key')
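
For example, if the crawl function from the template is changed to return a list of dicts instead of printing them, the results for the whole tasklist could be collected and written out as below. This is only a sketch: the file name and sort key are placeholders, and in a script the helper is assumed to be importable from the looter module (in the looter shell it is available directly).

  import looter as lt

  total = []
  for url in tasklist:
      total.extend(crawl(url))  # assumes crawl() returns a list of dicts
  lt.save_as_json(total, name='konachan', sort_by='img_url')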

parse_robots

Used to grab all the links listed in a site's robots.txt. This comes in handy when writing whole-site crawlers or recursive URL crawlers:

  >>> parse_robots(url)
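
As a sketch of how this can drive a whole-site crawl, the links pulled from robots.txt can be fed to the crawl function one by one (assuming, as above, that parse_robots is importable from looter and returns an iterable of URLs):

  import looter as lt

  # crawl every link that the site lists in its robots.txt
  for link in lt.parse_robots('https://konachan.com'):
      crawl(link)  # reuse the crawl function defined earlier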

login

Some websites require you to log in before they can be accessed, hence the login function, which essentially establishes a session and sends a POST request with data to the server. However, login rules vary from site to site, and finding the right postdata can be quite a struggle; you may even have to construct extra param or header arguments. Fortunately, someone on GitHub has already collected mock logins for many sites in the fuck-login project, which I admire. In short, it is a test of your packet-capturing skills. The following is a simulated login to the NetEase 126 mailbox (required parameters: postdata and params).

  >>> params = {'df': 'mail126_letter', 'from': 'web', 'funcid': 'loginone', 'iframe': '1', 'language': '-1', 'passtype': '1', 'product': 'mail126',
  ...           'verifycookie': '-1', 'net': 'failed', 'style': '-1', 'race': '-2_-2_-2_db', 'uid': '[email protected]', 'hid': '10010102'}
  >>> postdata = {'username': your username, 'savelogin': '1', 'url2': 'http://mail.126.com/errorpage/error126.htm', 'password': your password}
  >>> url = "https://mail.126.com/entry/cgi/ntesdoor?"
  >>> res, ses = login(url, postdata, params=params)  # res is the page returned after the POST request, ses is the request session
  >>> index_url = re.findall(r'href = "(.*?)"', res.text)[0]  # get the home page redirect link from res
  >>> index = ses.get(index_url)  # use the session to fetch the home page


The Python web crawler learning series comes with courseware and source code for every lecture. The course is taught by Pan Luo, author of "Learn Python Web Crawler from Scratch", a well-known blogger and Python web crawler expert.

Lecture 1: Python syntax basics for complete beginners

  1. Environment setup
  2. Variables and strings
  3. Flow control
  4. Data structures
  5. File operations

Lecture 2: Regular expression crawlers

  1. Network connections
  2. How crawlers work
  3. Installing and using the Chrome browser
  4. Using the Requests library
  5. Regular expressions
  6. CSV file storage

Lecture 3: The lxml library and XPath syntax

  1. Excel storage
  2. The lxml library
  3. XPath syntax

Lecture 4: API crawlers

  1. API concepts
  2. Calling the Baidu Map API
  3. JSON data parsing
  4. Image crawlers

Lecture 5: Asynchronous loading

  1. MySQL database installation
  2. Basic MySQL usage
  3. Operating the database from Python
  4. Asynchronous loading
  5. Reverse engineering
  6. Comprehensive case study

Lecture 6: Form interaction and simulated login

  1. POST requests
  2. Reverse engineering
  3. Submitting cookies
  4. Comprehensive case study

Lecture 7: Selenium simulates the browser

  1. Selenium
  2. PhantomJS
  3. Handling asynchronous loading
  4. Web page operations
  5. Comprehensive case study

Lecture 8: Introduction to Scrapy

  1. Installing Scrapy
  2. Creating a project
  3. Introduction to the components
  4. Comprehensive case study

Lecture 9: Scrapy

  1. Cross-page crawlers
  2. Storing to a database
