Python: Have You Ever Seen a Three-Line Crawler?

With the Lassie library, Python can crawl the images and videos on a static page in just three lines of code. Most Python tutorials start a crawler with "send a request," and many readers get stuck at the page-parsing step, which requires some prior knowledge of XPath or CSS selectors. So is there a way to read the information on a page without such a complicated step?

The answer is: yes.

Lassie is a very simple tool for retrieving page information. In a few lines of code it can get static information from a page, such as the page description, video links, page title, page keywords, and image links.

Why is it so easy? See for yourself:

import lassie
data = lassie.fetch('https://www.zhihu.com')
print(data)

Running this fetches the page and prints the following result (a dictionary):

(base) F:\push\20191112>python test.py
{'images': [{'src': 'https://static.zhihu.com/static/favicon.ico', 'type': 'favicon'}], 'videos': [], 'description': 'If you have a problem, go on Zhihu. Zhihu, a reliable Q&A community, is committed to providing reliable answers to everyone efficiently. With a serious, professional and friendly community atmosphere, structured, easy to get high quality content, based on the content of the question and answer mode of production and the unique community mechanism, Zhihu has attracted and brought together domain experts from all walks of life, a large number of participants, and amateurs in professional fields, with quality content produced and shared at scale through these nodes. Users use Q&A and other communication methods to build trust and connection, build and enhance personal influence, and discover and gain new opportunities.', 'locale': 'zh_CN', 'url': 'https://www.zhihu.com', 'title': 'Zhihu - Have a problem, on Zhihu', 'status_code': 200}
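Because the result is an ordinary dictionary, individual fields can be read with normal dictionary access. The sketch below works on a hand-copied subset of the output above rather than a live request, so it runs without network access. The helper name `summarize` is hypothetical and not part of Lassie, and the assumption that some pages may lack a field is ours.

```python
# A hand-copied subset of the dictionary shown above (description omitted).
sample = {
    "images": [
        {"src": "https://static.zhihu.com/static/favicon.ico", "type": "favicon"}
    ],
    "videos": [],
    "locale": "zh_CN",
    "url": "https://www.zhihu.com",
    "title": "Zhihu - Have a problem, on Zhihu",
    "status_code": 200,
}


def summarize(data):
    """Hypothetical helper: pick out a few fields from a result dictionary.

    .get() supplies a default, so a missing key does not raise KeyError.
    """
    return {
        "title": data.get("title", ""),
        "url": data.get("url", ""),
        "image_count": len(data.get("images", [])),
        "video_count": len(data.get("videos", [])),
    }


print(summarize(sample))
```

With a live page, the same helper could be applied to the dictionary returned by `lassie.fetch`.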

1. Install

If you haven’t already installed Python, we recommend reading this article: Python Installation.

Once Python is installed, open a command prompt (CMD) or terminal and enter the following command:

pip install lassie

Lassie should now be installed.
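One way to confirm the install is to ask Python whether it can locate the module. This is a minimal sketch using only the standard library. The helper name `is_installed` is our own, and `lassie` is the import name used in this article.

```python
import importlib.util


def is_installed(module_name):
    """Return True if Python can locate a top-level module with this name."""
    return importlib.util.find_spec(module_name) is not None


# Prints True or False depending on the current environment.
print("lassie installed:", is_installed("lassie"))
```

If this prints `False`, the `pip` command most likely installed into a different Python environment than the one running the script.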

2. Use

Now let's use this tool to crawl the image links from our previous post!

import lassie
data = lassie.fetch('https://pythondict.com/ai/python-suicide-detect-svm/')
print(data['images'])

Results:

[{'src': 'https://pythondict.com/wp-content/uploads/2019/11/2019111013222864.png', 'secure_src': 'https://pythondict.com/wp-content/uploads/2019/11/2019111013222864.png', 'type': 'og:image'},
 {'src': 'https://pythondict.com/wp-content/uploads/2019/11/2019111013222864.png', 'type': 'twitter:image'},
 {'src': 'https://pythondict.com/wp-content/uploads/2019/07/2019073115192114.jpg', 'type': 'favicon'}]

Of course, we can also use a list comprehension to put all the links in a list:

print([i['src'] for i in data['images']])

Results:

['pythondict.com/wp-content/…', 'pythondict.com/wp-content/…', 'pythondict.com/wp-content/…']
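The results earlier in this section list the same image twice, once as `og:image` and once as `twitter:image`, so a plain list of links contains a duplicate. The sketch below shows one way to drop duplicates and to filter by type. It runs on data hand-copied from those results instead of a live request, and the helper names `unique_sources` and `sources_of_type` are hypothetical, not part of Lassie.

```python
# Hand-copied from the image results shown earlier in this section.
images = [
    {
        "src": "https://pythondict.com/wp-content/uploads/2019/11/2019111013222864.png",
        "secure_src": "https://pythondict.com/wp-content/uploads/2019/11/2019111013222864.png",
        "type": "og:image",
    },
    {
        "src": "https://pythondict.com/wp-content/uploads/2019/11/2019111013222864.png",
        "type": "twitter:image",
    },
    {
        "src": "https://pythondict.com/wp-content/uploads/2019/07/2019073115192114.jpg",
        "type": "favicon",
    },
]


def unique_sources(items):
    """Collect 'src' values, dropping duplicates but keeping first-seen order."""
    # dict.fromkeys keeps insertion order, so it works as an ordered set.
    return list(dict.fromkeys(item["src"] for item in items if "src" in item))


def sources_of_type(items, wanted_type):
    """Return the 'src' of every entry whose 'type' equals wanted_type."""
    return [item["src"] for item in items if item.get("type") == wanted_type]


print(unique_sources(images))
print(sources_of_type(images, "favicon"))
```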

Isn't this tool convenient for crawling static pages? Its only drawback is that it cannot crawl the detailed text content of a page. It can only extract images, videos, and page-level information. If your crawler only needs the images and videos from static pages, this library is an excellent fit.
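Lassie returns links, not files, so saving the images is a separate step. Here is a minimal sketch of that step using only the standard library. The function names `filename_from_url` and `download_all` are hypothetical, and deriving the file name from the last URL path segment is an assumption that suits the URLs in this article but not every site.

```python
import posixpath
import urllib.request
from urllib.parse import urlparse


def filename_from_url(url, default="download"):
    """Use the last path segment of the URL as the local file name."""
    name = posixpath.basename(urlparse(url).path)
    # Fall back to a default when the URL has no file-like last segment.
    return name or default


def download_all(urls):
    """Save each URL into the current directory (makes real network requests)."""
    saved = []
    for url in urls:
        target = filename_from_url(url)
        urllib.request.urlretrieve(url, target)
        saved.append(target)
    return saved


# download_all is deliberately not called here, so the sketch runs offline.
print(filename_from_url(
    "https://pythondict.com/wp-content/uploads/2019/11/2019111013222864.png"
))
```

Calling `download_all` with the list of `src` values from the earlier example would write the files to the current directory. Two URLs that end in the same file name would overwrite each other, which a real crawler would need to handle.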

If you enjoyed today's Python tutorial, stay tuned to the Python Utility Guide, and give it a thumbs up below if it was helpful.

