In Python, HTML and XML can be parsed with lxml, so if you want to extract data from a web page, you can parse the site's markup through lxml and the crawler's data-collection work becomes much easier. lxml is also very fast.

The process of extracting web page data using lxml

Using lxml, you can extract a site's data in just two steps:

1. Use lxml to parse the web page into a document tree. For this step we usually choose lxml.html.

2. Run XPath queries against the tree to collect the data you need, as shown in the sketch after this list.
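A minimal sketch of those two steps. The URL, the use of the requests library, and the XPath expression are illustrative assumptions, not part of the original example:

from lxml import html
import requests  # any HTTP client will do; requests is assumed here for brevity

# Step 1: fetch the page and parse it into an element tree with lxml.html.
page = requests.get("http://httpbin.org/html")  # placeholder URL
tree = html.fromstring(page.content)

# Step 2: run an XPath query against the tree to collect the data you need.
# This grabs the text of every <h1>; adjust the expression for your target site.
titles = tree.xpath("//h1/text()")
print(titles)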

When extracting the data needed for an entire site, the key constraint is network performance, not application performance, so asynchronous requests are the natural way to speed up the crawler. The trade-off is that asynchronous fetching raises the request frequency, which makes it easier for the target site to detect and restrict you; this is why the example below routes its traffic through a proxy.

The following example shows the asynchronous fetch step, routed through a proxy server, that produces the HTML to be parsed:

# -*- coding: utf-8 -*-
import aiohttp
import asyncio

targetUrl = "http://httpbin.org/ip"

# proxy server (product website www.16yun.cn)
proxyHost = "t.16yun.cn"
proxyPort = "31111"

# proxy credentials (placeholders; substitute your own)
proxyUser = "username"
proxyPass = "password"

proxyServer = "http://%(user)s:%(pass)s@%(host)s:%(port)s" % {
    "host": proxyHost,
    "port": proxyPort,
    "user": proxyUser,
    "pass": proxyPass,
}

userAgent = "Chrome/83.0.4103.61"

async def entry():
    # Skip certificate verification so the proxy can handle HTTPS traffic.
    conn = aiohttp.TCPConnector(verify_ssl=False)
    async with aiohttp.ClientSession(headers={"User-Agent": userAgent}, connector=conn) as session:
        async with session.get(targetUrl, proxy=proxyServer) as resp:
            body = await resp.read()
            print(resp.status)
            print(body)

loop = asyncio.get_event_loop()
loop.run_until_complete(entry())
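To complete the two-step process, the fetched body can be handed straight to lxml. Here is a self-contained sketch under assumed inputs: the proxy is omitted, and the URL and XPath expression are placeholders rather than values from the original article:

import asyncio
import aiohttp
from lxml import html

async def fetch_and_parse(url):
    # Fetch the raw HTML asynchronously...
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as resp:
            body = await resp.read()
    # ...then parse it with lxml and run an XPath query on the tree.
    tree = html.fromstring(body)
    return tree.xpath("//h1/text()")  # placeholder expression; adjust for your site

print(asyncio.run(fetch_and_parse("http://httpbin.org/html")))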