In Python, HTML and XML can be parsed with lxml, so if you want to extract data from a web page, you can parse the site's markup through lxml and the crawler's data-collection work becomes much easier. lxml is also very fast.

The process of extracting web page data using lxml

Using lxml, you can extract a site's data in just two steps:

1. Use lxml to parse the web page into a document tree. For this step we usually choose lxml.html.

2. Run XPath queries against the tree to collect the data you need, as shown in the sketch after this list.
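A minimal sketch of those two steps. The URL, the use of the requests library, and the XPath expression are illustrative assumptions, not part of the original example:

from lxml import html
import requests  # any HTTP client will do; requests is assumed here for brevity

# Step 1: fetch the page and parse it into an element tree with lxml.html.
page = requests.get("http://httpbin.org/html")  # placeholder URL
tree = html.fromstring(page.content)

# Step 2: run an XPath query against the tree to collect the data you need.
# This grabs the text of every <h1>; adjust the expression for your target site.
titles = tree.xpath("//h1/text()")
print(titles)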

When extracting the data needed for an entire site, the key constraint is network performance, not application performance, so asynchronous requests are the natural way to speed up the crawler. The trade-off is that asynchronous fetching raises the request frequency, which makes it easier for the target site to detect and restrict you; this is why the example below routes its traffic through a proxy.

The following example shows the asynchronous fetch step, routed through a proxy server, that produces the HTML to be parsed:

# -*- coding: utf-8 -*-
import aiohttp
import asyncio

targetUrl = "http://httpbin.org/ip"

# proxy server (product website www.16yun.cn)
proxyHost = "t.16yun.cn"
proxyPort = "31111"

# proxy credentials (placeholders; substitute your own)
proxyUser = "username"
proxyPass = "password"

proxyServer = "http://%(user)s:%(pass)s@%(host)s:%(port)s" % {
    "host": proxyHost,
    "port": proxyPort,
    "user": proxyUser,
    "pass": proxyPass,
}

userAgent = "Chrome/83.0.4103.61"

async def entry():
    # Skip certificate verification so the proxy can handle HTTPS traffic.
    conn = aiohttp.TCPConnector(verify_ssl=False)
    async with aiohttp.ClientSession(headers={"User-Agent": userAgent}, connector=conn) as session:
        async with session.get(targetUrl, proxy=proxyServer) as resp:
            body = await resp.read()
            print(resp.status)
            print(body)

loop = asyncio.get_event_loop()
loop.run_until_complete(entry())
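To complete the two-step process, the fetched body can be handed straight to lxml. Here is a self-contained sketch under assumed inputs: the proxy is omitted, and the URL and XPath expression are placeholders rather than values from the original article:

import asyncio
import aiohttp
from lxml import html

async def fetch_and_parse(url):
    # Fetch the raw HTML asynchronously...
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as resp:
            body = await resp.read()
    # ...then parse it with lxml and run an XPath query on the tree.
    tree = html.fromstring(body)
    return tree.xpath("//h1/text()")  # placeholder expression; adjust for your site

print(asyncio.run(fetch_and_parse("http://httpbin.org/html")))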