First, prepare knowledge

This case to achieve the function: the use of network crawler, climb a place of weather, and print and voice broadcast. The requests library, the LXML library, and the PyTTSX3 library can be installed using PIP:

pip install requests
pip install lxml
pip install pyttsx3
Copy the code

The Requests library is a powerful web request library that allows you to send HTTP Requests just like a browser to retrieve data from a web site.

Lxml library is the most versatile and easy to use library for processing XML and HTML. Etree in Lxml library is usually used to convert HTML into documents.

The Pyttsx3 library is a very simple library to play speech, what you give it, it read, don’t mind the stiff tone of course. The basic usage is as follows:

import pyttsx3

word = pyttsx3.init()

word.say('hello')
# Key sentence, without this line of code, will not play the voice
word.runAndWait()
Copy the code

Code word is not easy nonsense two sentences: there is a need to learn information or technical problems exchange “click”

Crawler is to crawl the relevant content of the web page, understanding HTML can help you better understand the structure of the web page, content and so on. TCP/IP protocol, HTTP protocol these knowledge can be understood, can let you understand the basic principle of network request and network transmission, this small case is not used.

Two, elaborate

2.1. get request target URL

We first imported the Requests library and then used it to get the target web page. We asked for Beijing weather from the Weather site.

import requests
Send a request to the destination URL and return a response object
req = requests.get('https://www.tianqi.com/beijing/')
#.text is the HTML of the web page for the Response object
print(req.text)
Copy the code

The printed result is the content displayed on the site, which is “parsed” by the browser. We see the structure as follows:

The data we obtained after our request

Notice, there’s a pretty good chance you’re not going to get the code, you’re going to get 403, what does that mean?

403 Error 403 is a common error message during website access, indicating that resources are unavailable. The server understands the client’s request, but refuses to process it.

The default crawler will tell the server to send a Python crawler request, and many websites will set up anti-crawler mechanism, do not allow crawlers to access.

So, if we want the target server to respond, let’s disguise our crawler. This little example is disguised with the usual change user-Agent field.

Change the previous code to disguise the crawler as a browser request so that it can be accessed normally.

import requests

headers = {'content-type':'application/json'.'User-Agent':'the Mozilla / 5.0 (Xll. Ubuntu; Linux x86_64; The rv: 22.0) Gecko / 20100101 Firefox 22.0 / '}

Send a request to the destination URL and return a response object
req = requests.get('https://www.tianqi.com/beijing/',headers=headers)
#.text is the HTML of the web page for the Response object
print(req.text)
Copy the code

Where does the user-agent field come from? Take Chrome as an example. Open a web page at random, press F12 on the keyboard or right-click in the blank area and choose “Check”. Click “Network”, click “Doc”, click “Headers”, check the user-agent field of Request Headers in the information bar, and copy it directly.

2.2. LXML etree appearance

The data we get from the webpage request is complicated, and only part of it is the data we really want. For example, we check the weather of Beijing from the website of weather, but only what we want in the picture below, how can we extract it? This is where lxml.etree comes in.

There is only a small part of the code that we want, and we find that the weather and temperature we want are all under the class=’weather_info’ level, so that’s fine. After the requested code we add:

html_obj = etree.HTML(html)
html_data = html_obj.xpath("//d1[@class='weather_info']//text()")
Copy the code

Let’s print(html_data) to see if the extraction is what we want.

Even the newline character of the web page is extracted, and don’t forget, extracted is the list. We’re going to do a little bit more processing.

word = "Welcome to the weather assistant."

for data in html_data:
    word += data
Copy the code

Let’s print it out and see, well, we have everything we want. However, there is one more [switching city], we will refine, finally cut this last.

2.3. Speak up

All the data we want is in the word variable, so let him read it now, using pyTTsx3,

ptt = pyttsx3.init()
ptt.say(word)
ptt.runAndWait()
Copy the code

Ok, that’s all done now. We groped step by step, now integrated together, the final broadcast effect is still good, this is a very beautiful crawler trip, looking forward to the next climb!

As a Python developer, I spent three days to compile a set of Python learning tutorials, from the most basic Python scripts to Web development, crawlers, data analysis, data visualization, machine learning, etc. These materials can be “clicked” by the friends who want them