Python crawler in practice: collecting app data from Yingyongbao (Tencent's app store)

Collecting app data from Yingyongbao

- Tools to prepare
- Project idea analysis
- Simple source code analysis

Tools to prepare

Data source: Yingyongbao (sj.qq.com)

Development environment: Win10, Python 3.7

Development tools: PyCharm, Chrome

Project idea analysis

Define the data to be collected:

- Download address of the app
- Number of downloads of the app
- Name of the app
- Company that developed the app

Extract the category tags from the page.

Get the href attribute of each a tag.

These values are used later to build the dynamic request addresses.
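The extraction step can be sketched as follows. This is a minimal illustration, not the article's own code: it uses the standard-library html.parser instead of lxml so that it runs without third-party packages, and the sample markup and the names CategoryLinkParser and extract_category_hrefs are assumptions made up for the example. The real page structure may differ.

```python
from html.parser import HTMLParser

# Hypothetical markup, loosely modelled on a category list.
# The real page may be structured differently.
SAMPLE_HTML = """
<ul data-modname="cates">
  <li><a href="?orgame=1">All</a></li>
  <li><a href="?orgame=1&amp;categoryId=103">Video</a></li>
  <li><a href="?orgame=1&amp;categoryId=101">Music</a></li>
</ul>
<ul data-modname="other">
  <li><a href="/ignored">Ignored</a></li>
</ul>
"""


class CategoryLinkParser(HTMLParser):
    """Collect href values of <a> tags inside <ul data-modname="cates">."""

    def __init__(self):
        super().__init__()
        self.ul_stack = []  # one bool per open <ul>: is it the category list?
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "ul":
            self.ul_stack.append(attrs.get("data-modname") == "cates")
        elif tag == "a" and any(self.ul_stack) and "href" in attrs:
            self.hrefs.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "ul" and self.ul_stack:
            self.ul_stack.pop()


def extract_category_hrefs(html_text):
    """Return the href of every category link found in html_text."""
    parser = CategoryLinkParser()
    parser.feed(html_text)
    return parser.hrefs


category_hrefs = extract_category_hrefs(SAMPLE_HTML)
print(category_hrefs)
```

Each returned value starts with "?", so it can be appended directly to a base address when the request URLs are built.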

Find the address from which the app data is loaded dynamically.

The query string of that URL is the href value of each category tag.

sj.qq.com/myapp/cate/… Append each href value to this address to form the new URL, then send the request.
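The URL concatenation can be sketched on its own, without sending any request. The base address and the pageSize and pageContext parameters are taken from the script in this article; the function name build_page_urls and the category value categoryId=103 are hypothetical and used only for illustration.

```python
# Base address as it appears in the script in this article.
BASE_URL = "https://sj.qq.com/myapp/cate/appList.htm"


def build_page_urls(category_href, pages=10, page_size=20):
    """Build the request URLs for one category, one URL per page.

    category_href is the href value of a category tag, e.g. "?orgame=1&...".
    pageContext is the offset of the first item on the page.
    """
    return [
        "{}{}&pageSize={}&pageContext={}".format(
            BASE_URL, category_href, page_size, page * page_size
        )
        for page in range(pages)
    ]


# Hypothetical category href, for illustration only.
urls = build_page_urls("?orgame=1&categoryId=103", pages=3)
for u in urls:
    print(u)
```

Keeping URL construction in a separate function makes it easy to check the generated addresses before any network traffic is involved.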

Simple source code analysis

    import requests  # send network requests
    from lxml import etree  # parse HTML into an element tree
    import csv  # write tabular data

    url = "https://sj.qq.com/myapp/category.htm?orgame=1"
    response = requests.get(url)
    html_data = etree.HTML(response.text)

    # href values of the category links; the last entry is discarded
    li_list = html_data.xpath('//ul[@data-modname="cates"][position()>1]/a/@href')
    del li_list[-1]

    for url1 in li_list:
        for i in range(10):  # at most 10 pages of 20 apps per category
            new_url = "https://sj.qq.com/myapp/cate/appList.htm" + url1 + "&pageSize=20&pageContext={}".format(i * 20)
            res = requests.get(new_url).json()
            if res["count"] == 0:  # no more data for this category
                break
            # append the apps on this page to app.csv
            with open("app.csv", "a", newline="", encoding="utf-8") as f:
                csv_data = csv.DictWriter(f, fieldnames=["appName", "authorName", "apkUrl"])
                for info in res["obj"]:
                    appName = info["appName"]
                    authorName = info["authorName"]
                    apkUrl = info["apkUrl"]
                    print({"appName": appName, "authorName": authorName, "apkUrl": apkUrl})
                    csv_data.writerow({"appName": appName, "authorName": authorName, "apkUrl": apkUrl})
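The script above reopens app.csv for every page and never writes a header row. One possible variation, shown here only as a sketch, separates the writing step into a function that writes the header once and stops at the first empty page. The function name write_apps and the sample records are hypothetical; the keys count, obj, appName, authorName and apkUrl are the ones the script reads. An in-memory buffer stands in for the file so the example runs offline.

```python
import csv
import io

# Field names used by the script above.
FIELDNAMES = ["appName", "authorName", "apkUrl"]


def write_apps(pages, out_file):
    """Write app records from decoded JSON pages to out_file as CSV.

    Stops at the first page whose "count" is 0 and returns the number
    of rows written. Keys not listed in FIELDNAMES are ignored.
    """
    writer = csv.DictWriter(out_file, fieldnames=FIELDNAMES, extrasaction="ignore")
    writer.writeheader()
    written = 0
    for page in pages:
        if page["count"] == 0:
            break
        for info in page["obj"]:
            writer.writerow(info)
            written += 1
    return written


# Made-up records that imitate the shape the script expects.
sample_pages = [
    {
        "count": 2,
        "obj": [
            {"appName": "Demo A", "authorName": "Example Co",
             "apkUrl": "https://example.com/a.apk", "extraField": "ignored"},
            {"appName": "Demo B", "authorName": "Example Ltd",
             "apkUrl": "https://example.com/b.apk"},
        ],
    },
    {"count": 0, "obj": []},
]

buffer = io.StringIO()
rows_written = write_apps(sample_pages, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To write to disk instead, pass a file opened with open("app.csv", "w", newline="", encoding="utf-8") in place of the buffer.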
