Hands-on | Crawling LOL HD wallpapers with a multi-threaded Python crawler

Source: official account [Jie Ge’s IT Journey]

Author: Alaska

ID: Jake_Internet

I. Background

With the spread of mobile devices, mobile apps have multiplied and become part of everyday life. I recently saw that the League of Legends mobile game had launched, and it seems decent. League of Legends is a very popular game on PC, though how the mobile version will fare remains to be seen. So today we will use multithreading to crawl the HD hero wallpapers from the official LOL website.

II. Page analysis

Target website: lol.qq.com/data/info-h…

As the official website shows, each small image represents one hero. Our goal is to crawl every skin image of each hero, download them all, and save them locally.

Secondary pages

The page above is the main page; a secondary page is the page belonging to an individual hero. Taking the Dark Child (Annie) as an example, her secondary page looks like this:

It contains many thumbnails, each corresponding to one skin. Inspect the skin data interface in the Network tab of the browser's developer tools, as shown below:

The skin information is transmitted as a JSON string, so we only need to find each hero's id, locate the corresponding JSON file, and extract the required fields to get the HD skin wallpapers.

Here is the location of the Dark Child's JSON file:

hero_one = 'https://game.gtimg.cn/images/lol/act/img/js/hero/1.js'

The pattern is simple. Every hero's skin data lives at an address of this form:

url = 'https://game.gtimg.cn/images/lol/act/img/js/hero/{}.js'.format(id)

So what pattern do the ids follow? The hero IDs have to be looked up on the main page, as follows:

The hero list is shown as two index ranges, [0,99] and [100,156], so 156 heroes, yet heroId goes all the way up to 240… . The ids therefore do not simply increase by one each time, and to crawl every hero's skin images we first need to collect all the heroId values.

III. Crawling approach

Why use multithreading? When we crawl images, videos, and similar resources, we have to save them locally, which means a lot of file reading and writing, that is, IO. Now imagine doing this with synchronous requests:

The second request cannot start until the first one has completed and its file has been saved locally, which is very inefficient. If several threads work concurrently, the waiting time overlaps and efficiency improves greatly.

So we use multiple threads or multiple processes, and hand the whole queue of work items to a thread pool or a process pool;
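
To make the reasoning above concrete, here is a minimal sketch that compares sequential and pooled execution. The download is only simulated with time.sleep, and the names fake_download, run_sequential and run_threaded as well as the 0.1 s delay are assumptions for illustration, not part of the original crawler.

```python
import time
from multiprocessing.dummy import Pool as ThreadPool


def fake_download(task_id):
    """Hypothetical stand-in for one request plus one file write."""
    time.sleep(0.1)  # simulated network / disk wait
    return task_id


def run_sequential(tasks):
    """Run the tasks one after another and return (results, seconds)."""
    start = time.time()
    results = [fake_download(t) for t in tasks]
    return results, time.time() - start


def run_threaded(tasks, workers=6):
    """Run the tasks in a thread pool and return (results, seconds)."""
    start = time.time()
    with ThreadPool(workers) as pool:
        results = pool.map(fake_download, tasks)
    return results, time.time() - start


if __name__ == '__main__':
    tasks = list(range(6))
    _, seq_time = run_sequential(tasks)
    _, thr_time = run_threaded(tasks)
    print('sequential: {:.2f}s, threaded: {:.2f}s'.format(seq_time, thr_time))
```

Because the threads spend their time waiting rather than computing, the pooled run should finish in roughly the time of the slowest single task instead of the sum of all of them.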

In Python, multiprocessing.Pool is a process pool, and multiprocessing.dummy is a very handy thread-based counterpart.

multiprocessing.dummy module: the dummy module is multi-threaded;
multiprocessing module: multiprocessing is multi-process;

The multiprocessing.dummy module and the multiprocessing module share the same API, so code can switch between the two quite flexibly;
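
As a rough sketch of what the shared API means in practice, the helper below accepts either pool class and runs the same code with it. The names square and run are hypothetical and only serve the illustration.

```python
from multiprocessing import Pool as ProcessPool        # process-based pool
from multiprocessing.dummy import Pool as ThreadPool   # thread-based pool, same API


def square(n):
    """Toy task; defined at module level so a process pool can pickle it."""
    return n * n


def run(pool_cls, numbers, workers=4):
    """Map square over numbers using whichever pool class is passed in."""
    pool = pool_cls(workers)
    try:
        return pool.map(square, numbers)
    finally:
        pool.close()
        pool.join()


if __name__ == '__main__':
    print(run(ThreadPool, [1, 2, 3]))
    print(run(ProcessPool, [1, 2, 3]))
```

Switching from threads to processes is then a one-word change at the call site, which is the flexibility described above.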

We first grab the hero IDs in a test file, demo.py. I have already written the code; it produces a list of hero ids that is then stored in the main file.

demo.py

import json
import requests

# Assumed request headers; the original does not show how headers is defined
headers = {'User-Agent': 'Mozilla/5.0'}

url = 'https://game.gtimg.cn/images/lol/act/img/js/heroList/hero_list.js'
res = requests.get(url, headers=headers)
res = res.content.decode('utf-8')
res_dict = json.loads(res)
heros = res_dict["hero"]  # info for 158 heroes
idList = []
for hero in heros:
    hero_id = hero["heroId"]
    idList.append(hero_id)
print(idList)

The idList is as follows:

idList = [1, 2, 3, …, 875, 876, 877]

Build the url: page = 'https://game.gtimg.cn/images/lol/act/img/js/hero/{}.js'.format(i)

Here i stands for the hero ID and is used to build the url dynamically.
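
The URL construction can be sketched as a small helper that turns a list of hero ids into the URL queue. HERO_URL reuses the address pattern given earlier in the article; the helper name build_urls is hypothetical.

```python
# Address pattern for one hero's skin data, as given earlier in the article
HERO_URL = 'https://game.gtimg.cn/images/lol/act/img/js/hero/{}.js'


def build_urls(id_list):
    """Return one skin-data URL per hero id (ints and strings both work)."""
    return [HERO_URL.format(hero_id) for hero_id in id_list]


if __name__ == '__main__':
    for page_url in build_urls([1, 2, 3]):
        print(page_url)
```

Passing the full idList instead of a short sample list yields the URL queue for every hero.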

So we define two functions, one that crawls and parses a page and one that downloads the data. We open a thread pool, use a for loop to build the URLs of the hero skin JSON data, and store them in a list that serves as the URL queue. pool.map() then runs the spider function over that queue;

# Signature: Pool.map(func, iterable, chunksize=None)
# Our usage here: pool.map(spider, page)
# spider: the crawler function; page: the URL queue

Function: map() takes each element of the list, passes it as the argument to the function, and hands the call to a worker in the pool (a thread here, a process if a process pool is used);

Parameter 1: the function to execute;

Parameter 2: an iterable, whose elements are passed one by one as the argument to the function;
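
One detail worth seeing in code is that pool.map() returns its results in the order of the input, even when later items finish first. The sketch below uses hypothetical names (slow_echo, mapped) and an artificial delay to show this.

```python
import time
from multiprocessing.dummy import Pool as ThreadPool


def slow_echo(n):
    """Toy task: smaller n sleeps longer, so earlier items finish last."""
    time.sleep(max(0.0, 0.05 * (3 - n)))
    return n * 10


def mapped(values, workers=3):
    """Apply slow_echo to every value through a thread pool."""
    pool = ThreadPool(workers)
    try:
        return pool.map(slow_echo, values)
    finally:
        pool.close()
        pool.join()


if __name__ == '__main__':
    print(mapped([0, 1, 2]))
```

The call also blocks until every task is done, so the line after pool.map() can safely assume that all pages have been processed.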

JSON data parsing

We need three fields from the JSON: 1. heroName, 2. name (stored as skin_name), 3. mainImg. Because heroName is the same for every skin of a hero, we use it as the name of that hero's skin folder, which makes the images easy to save and browse.

item = {}
item['name'] = hero["heroName"]
item['skin_name'] = hero["name"]
if hero["mainImg"] == '':
   continue
item['imgLink'] = hero["mainImg"]

One caveat:

Some mainImg fields are empty, so we need to skip them; otherwise requesting an empty link raises an error;
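
The field extraction and the empty-link check can be tried out without any network access. The sketch below runs the same logic on a small hand-made dictionary; the helper name extract_items and the sample data (including the example.com link) are made up for illustration and only mirror the field names used above.

```python
def extract_items(res_dict):
    """Collect name, skin_name and imgLink for every skin that has an image."""
    items = []
    for skin in res_dict.get('skins', []):
        if not skin.get('mainImg'):
            continue  # skip entries whose mainImg is empty or missing
        items.append({
            'name': skin['heroName'],
            'skin_name': skin['name'],
            'imgLink': skin['mainImg'],
        })
    return items


# Hypothetical sample shaped like the skin data described in the article
sample = {
    'skins': [
        {'heroName': 'HeroA', 'name': 'Skin One', 'mainImg': 'https://example.com/1.jpg'},
        {'heroName': 'HeroA', 'name': 'Chroma', 'mainImg': ''},
    ]
}

if __name__ == '__main__':
    print(extract_items(sample))
```

Returning a list instead of downloading inside the loop keeps the parsing step easy to test on its own.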

IV. Data collection

Import related third-party libraries

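import requests  # HTTP requests
from multiprocessing.dummy import Pool as ThreadPool  # concurrency
import time  # timing
import os  # file operations
import json  # parsing

# Assumed request headers; the original does not show how headers is defined
headers = {'User-Agent': 'Mozilla/5.0'}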

Page data parsing

def spider(url):
    res = requests.get(url, headers=headers)
    result = res.content.decode('utf-8')
    res_dict = json.loads(result)
    skins = res_dict["skins"]  # list of skins for this hero
    print(len(skins))
    for index, hero in enumerate(skins):  # enumerate gives each skin an index
        item = {}
        item['name'] = hero["heroName"]
        item['skin_name'] = hero["name"]
        if hero["mainImg"] == '':
            continue  # skip skins without an image link
        item['imgLink'] = hero["mainImg"]
        print(item)
        download(index + 1, item)

Downloading the images

def download(index, contdict):
    name = contdict['name']
    path = "skin/" + name
    # exist_ok avoids an error if the folder already exists
    os.makedirs(path, exist_ok=True)
    content = requests.get(contdict['imgLink'], headers=headers).content
    # A "/" inside a skin name would be read as a folder separator, so replace it
    skin_name = contdict['skin_name'].replace('/', '_')
    with open('./skin/' + name + '/' + skin_name + str(index) + '.jpg', 'wb') as f:
        f.write(content)

Here we use the os module to create the folders. As mentioned earlier, heroName is the same for every skin of a hero, so we create one folder per hero named after it, which keeps the skins sorted by hero. Be careful with the image file path: a missing slash causes an error.
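
Since the path is the part that is easy to get wrong, here is a minimal sketch that builds it with os.path.join instead of string concatenation. The helper names skin_path and save_image are hypothetical, and the demo writes dummy bytes into a temporary directory rather than a real image.

```python
import os
import tempfile


def skin_path(base_dir, hero_name, skin_name, index):
    """Build <base_dir>/<hero_name>/<skin_name><index>.jpg without manual slashes."""
    safe_skin = skin_name.replace('/', '_').replace('\\', '_')
    return os.path.join(base_dir, hero_name, '{}{}.jpg'.format(safe_skin, index))


def save_image(base_dir, hero_name, skin_name, index, content):
    """Create the hero folder if needed and write the image bytes."""
    path = skin_path(base_dir, hero_name, skin_name, index)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, 'wb') as f:
        f.write(content)
    return path


if __name__ == '__main__':
    demo_dir = tempfile.mkdtemp()
    print(save_image(demo_dir, 'HeroA', 'Skin/One', 1, b'not a real image'))
```

os.path.join inserts the separators itself, so a forgotten slash can no longer produce a wrong path.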

The main() function

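def main():
    start = time.time()  # start time, so the elapsed time can be reported
    pool = ThreadPool(6)  # thread pool with six worker threads
    page = []
    for i in range(1, 21):
        newpage = 'https://game.gtimg.cn/images/lol/act/img/js/hero/{}.js'.format(i)
        print(newpage)
        page.append(newpage)
    result = pool.map(spider, page)
    pool.close()
    pool.join()
    end = time.time()
    print('Elapsed: {:.2f}s'.format(end - start))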

Description:

In the main function we first create a thread pool with six worker threads;
A for loop dynamically builds 20 URLs. This is just a small trial run on the skins of 20 heroes; to crawl everything, iterate over the idList obtained earlier and build the URLs from it;
map() hands each URL in the list to spider() in the thread pool, which parses the data and saves the images;
pool.close() does not shut the pool down immediately; it only switches the pool to a state that accepts no new tasks, and pool.join() then waits for the workers to finish.
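
The behaviour of close() and join() can be checked with a small experiment. The sketch below (hypothetical names double and demo_close) submits work, closes the pool, and then tries to submit more; it relies on current Python 3 versions raising ValueError for a pool that is no longer running.

```python
from multiprocessing.dummy import Pool as ThreadPool


def double(n):
    """Toy task used only for this demonstration."""
    return n * 2


def demo_close():
    """Return the mapped results and whether a task was accepted after close()."""
    pool = ThreadPool(2)
    results = pool.map(double, [1, 2, 3])
    pool.close()  # from here on the pool accepts no new tasks
    try:
        pool.map(double, [4])
        accepted = True
    except ValueError:  # raised because the pool is no longer running
        accepted = False
    pool.join()  # wait for the worker threads to exit
    return results, accepted


if __name__ == '__main__':
    print(demo_close())
```

Note that join() must be called after close(); calling it on a pool that is still running raises an error as well.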

V. Running the program

if __name__ == '__main__':
    main()

The results are as follows:

Of course, this shows only part of what was captured. I crawled 200+ images in total, and everything worked fine.

VI. Summary

This time we used multithreading to crawl the HD hero skin wallpapers from the official League of Legends website. Because saving images involves IO, running the work concurrently greatly improves the program's efficiency.

The crawler is admittedly simple. This small trial only fetches the skin images of 20 heroes; interested readers can download all the skins by iterating over the idList obtained earlier instead.

That is all for this article.

Original writing takes effort. If you found this article useful, please like, comment on, or share it; that is what motivates me to keep producing quality articles. Thank you!

By the way, feel free to follow me so you can find me again next time.

See you next time!
