After more than a week of crawler lessons, I finally couldn't stand it any longer and decided to write a crawler program of my own. Just as LJ was encouraging students to share their work, with rewards for excellent pieces, I wrote up the process of crawling HD wallpapers from major games with Python, to contribute and share with everyone.
The targets are wallpapers from currently popular games: the MOBA "League of Legends", the mobile games "Honor of Kings" and "Onmyoji", and the FPS game "PUBG". Here I show the process of crawling all hero wallpapers from "League of Legends"; once you have learned this, crawling the other games yourself will not be a problem.
Take a look at the final result. The wallpapers of every hero have been crawled:
12 wallpapers of "Dark Child Annie":
Large HD image:
The formal tutorial begins below! Version: Python 3.5. Tool: Jupyter Notebook, in which all steps are implemented and finally consolidated into one file, lol_SSL.py.
1. Understand the crawl object and design the crawl process
Before writing a crawler, it is well worth spending some time understanding the object to be crawled. This helps us design the crawling process scientifically and reasonably, avoiding difficulties later and saving time.
1.1 Basic Hero Information
Open the official League of Legends website to see the information about all heroes:
To crawl all heroes, we first need their information. On the web page, right-click – Inspect – Elements to view the hero information, as shown below: it includes each hero's nickname, name, English name, and so on. Since this information is loaded dynamically with JavaScript and cannot be retrieved by ordinary crawling methods, we will use the virtual browser PhantomJS to retrieve it.
We click into the "Dark Child Annie" page; the page address is
“Annie” in the address is the English name of the hero. To access other heroes, you only need to change the English name.
On the hero page, click a thumbnail to switch to a different large image of the skin. On the large image, right-click – "Open image in new tab" to open the full-size image. This is the HD wallpaper we want:
Comparing the address information in the image above with several of Annie's other skin wallpapers, we can see that the only difference between wallpapers is the picture number:
Wallpaper address of the hero "Blind Monk Lee Sin":
And the wallpaper address of the hero "Card Master Twisted Fate":
We can summarize the rule: a wallpaper address consists of three parts, fixed address + hero ID + wallpaper number. Fixed address:
There are no more than 20 wallpapers for each hero; the exact number depends on the hero's skins.
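Putting the rule together in code, a small sketch: the fixed address below is the one that appears in the download code later in this article, and hero id 1 with wallpaper number 0 is used purely as an example.

```python
# fixed address taken from the wallpaper URLs discussed above
BASE = 'http://ossweb-img.qq.com/images/lol/web201310/skin/big'

def wallpaper_url(heroid, k):
    """Build a wallpaper address: fixed address + hero id + 3-digit wallpaper number."""
    return BASE + str(heroid) + str(k).zfill(3) + '.jpg'

print(wallpaper_url(1, 0))
# http://ossweb-img.qq.com/images/lol/web201310/skin/big1000.jpg
```

`str(k).zfill(3)` left-pads the wallpaper number with zeros to three digits, which is what the `'0'*(3-len(str(k)))+str(k)` expression in the download loop does by hand.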
1.2 Hero ID
In the process above we gained a basic understanding of the object to crawl, but we still do not know each hero's ID number. The hero-to-ID correspondence cannot be found in the page source or in the JavaScript-loaded content, so we guess it may live in a JS file. Let's look for it. From the all-heroes page, right-click – Inspect – Network and refresh the page to find a champion.js file:
Open the champion.js file and we find the information we need in it: each hero's English name corresponds to the hero's ID:
1.3 Crawler Flow Chart
By now we have a good understanding of the object to crawl and some ideas about the crawling method, so we can design the following crawler flow chart:
2. Design the overall code framework
According to the crawler flow chart, we can design the following code framework:
class LOL_scrawl:
    def __init__(self):  # constructor
        pass

    def get_info(self):  # read the hero name, nickname, or 'all' from keyboard input
        pass

    def create_lolfile(self):  # create the folder LOL in the current directory
        pass

    def get_heroframe(self):  # get all hero information from the official website
        pass

    def get_image(self, heroid, heroframe):  # crawl one hero's wallpapers
        pass

    def run(self):
        self.create_lolfile()  # create the LOL folder
        inputcontent = self.get_info()  # get keyboard input
        heroframe = self.get_heroframe()  # get all hero info
        print('Hero info saved in heroframe.csv, now start crawling wallpapers...\n')
        if inputcontent.lower() == 'all':  # input 'all': crawl all hero wallpapers
            pass
        else:  # otherwise crawl a single hero's wallpapers
            pass

if __name__ == '__main__':
    lolscr = LOL_scrawl()  # create the object
    lolscr.run()  # run the crawler
The run() function does the following: create the LOL folder, get input from the keyboard, then crawl all hero wallpapers if the input is "All", otherwise crawl a single hero's wallpapers. In both branches we wrap the work in try-except, because the crawl may fail due to network instability and other issues.
        if inputcontent.lower() == 'all':  # input 'all': crawl all hero wallpapers
            try:
                allline = len(heroframe)
                for i in range(allline):  # iterate over every hero row
                    heroid = heroframe['heroid'][[i]].values[:][0]
                    self.get_image(heroid, heroframe)
                print('Completed all crawl tasks!\n')
            except:
                print('Crawl failed or partially failed, please check for errors')
        else:  # otherwise crawl a single hero's wallpapers
            try:
                hero = inputcontent.strip()
                # find the row of the dataframe where the hero is located
                line = heroframe[(heroframe.heronickname == hero) | (heroframe.heroname == hero)].index.tolist()
                heroid = heroframe['heroid'][line].values[:][0]  # get the hero ID
                self.get_image(heroid, heroframe)
                print('Completed all crawl tasks!\n')
            except:
                print('Wrong input! Please enter the name as prompted!\n')
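The single-hero branch looks up the hero's row with pandas boolean indexing on the nickname and name columns. A standalone sketch of that lookup on a tiny heroframe (the rows here are made up for illustration, with the same columns the crawler builds):

```python
import pandas as pd

# a small made-up heroframe in the shape the crawler produces
heroframe = pd.DataFrame({
    'heronickname': ['Dark Child', 'Blind Monk'],
    'heroname': ['Annie', 'Lee Sin'],
    'heroid': ['1', '64'],
})

hero = 'Annie'
# keep rows where either the nickname or the name matches the input
line = heroframe[(heroframe.heronickname == hero) | (heroframe.heroname == hero)].index.tolist()
heroid = heroframe['heroid'][line].values[:][0]  # the string '1'
```

Matching on both columns lets the user type either "Dark Child" or "Annie" and still reach the same row.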
Now that the crawler framework is in place, the two core functions are explained below: get_heroframe() and get_image(heroid, heroframe).
3. Crawl all hero information
3.1 Parsing the JS File
First, we parse the champion.js file to get the one-to-one correspondence between each hero's English name and ID. urllib.request opens the file address and reads the contents as a string; we then parse the contents into a dictionary {key: value}, where key is the English name and value is the hero ID:
import urllib.request as urlrequest

# get hero English names and ids, generating the dictionary herodict {Englishname: id}
content = urlrequest.urlopen('http://lol.qq.com/biz/hero/champion.js').read()
str1 = r'champion={"keys":'
str2 = r',"data":{"Aatrox":'
champion = str(content).split(str1)[1].split(str2)[0]
herodict0 = eval(champion)
herodict = dict((k, v) for v, k in herodict0.items())
print(herodict)

herodict {Englishname: id} looks like this:

{'Soraka': '16', 'Akali': '84', 'Skarner': '72', 'Tristana': '18', 'Zilean': '26', 'JarvanIV': '59', 'Varus': '110', 'Talon': '91', 'Ashe': '22', 'Malphite': '54', 'Nocturne': '56', 'Khazix': '121'}
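A side note: the fragment cut out of champion.js is plain JSON, so `json.loads` is a safer alternative to `eval`, which will execute arbitrary code if the file ever changes. A sketch on a hard-coded sample in the same shape (the sample ids are illustrative; the live file may differ):

```python
import json

# a small sample shaped like the {"id": "EnglishName", ...} object inside champion.js
sample = '{"1": "Annie", "16": "Soraka", "84": "Akali"}'
id_to_name = json.loads(sample)

# invert it to get {Englishname: id}, like herodict above
herodict = {name: hid for hid, name in id_to_name.items()}
```

The dict comprehension replaces the `dict((k, v) for v, k in ...)` idiom with the same result.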
All the hero information on the official website is loaded with JavaScript, so it is not easy to crawl directly. We use Selenium + PhantomJS to load the hero information dynamically. Selenium is an automated testing tool that supports Chrome, Safari, Firefox and other browser drivers; you need to install the Selenium module before using it. PhantomJS is a virtual (headless) browser: it has no interface, but its DOM rendering, JS execution, network access, and canvas/SVG drawing are all complete, and it is widely used for page capture, page rendering, and automated testing. PhantomJS can be downloaded from its official website. We use Selenium + PhantomJS to load the hero information dynamically, and BeautifulSoup to parse the page content:
from selenium import webdriver
import time
from bs4 import BeautifulSoup

# the url of the all-heroes page on the official League of Legends website
url_Allhero = 'http://lol.qq.com/web201310/info-heros.shtml#Navi'

# open the url with the headless browser PhantomJS to handle JavaScript dynamic loading;
# executable_path is the PhantomJS installation location
driver = webdriver.PhantomJS(executable_path=r'D:\phantomjs-2.1.1-windows\bin\phantomjs')
driver.get(url_Allhero)
time.sleep(1)  # pause for 1 second to make sure the page finishes loading

# hand the page content to BeautifulSoup
pageSource = driver.page_source
driver.close()
bsObj = BeautifulSoup(pageSource, "lxml")
Once we have the page content, BeautifulSoup parses it, and we store each hero's nickname, name, ID and other information into heroframe:
import pandas as pd

# BeautifulSoup parses the page content to get the list of heroes
herolist = bsObj.findAll('ul', {'class': 'imgtextlist'})
for hero in herolist:
    n = len(hero)
    m = 0
    heroframe = pd.DataFrame(index=range(0, n),
                             columns=['herolink', 'heronickname', 'heroname', 'Englishname', 'heroid'])
    heroinflist = hero.findAll('a')  # extract the hyperlink part of the hero information
    for heroinf in heroinflist:
        herolink = heroinf['href']  # hyperlink of the hero info
        heronickname = heroinf['title'].split(' ')[0].strip()
        heroname = heroinf['title'].split(' ')[1].strip()
        heroframe['herolink'][m] = herolink
        heroframe['heronickname'][m] = heronickname
        heroframe['heroname'][m] = heroname
        heroframe['Englishname'][m] = heroframe['herolink'][m][21:]
        heroframe['heroid'][m] = herodict[heroframe['Englishname'][m]]
        m = m + 1
heroframe.to_csv('./LOL/heroframe.csv', encoding='gbk', index=False)
So far, the get_heroframe() function crawls all hero information and stores it in heroframe.csv, as shown below:
4. Crawl hero wallpapers
Define the get_image(heroid, heroframe) function to crawl all the wallpapers of a single hero. First, create a subfolder for the hero inside the LOL folder:
import os
from urllib.error import HTTPError

# create a folder to store the hero's wallpapers
line = heroframe[heroframe.heroid == heroid].index.tolist()  # row of the dataframe where the hero is located
nickname = heroframe['heronickname'][line].values
name = heroframe['heroname'][line].values
nickname_name = str((nickname + ' ' + name)[0][:])
filehero = '.\\LOL' + '\\' + nickname_name
if not os.path.exists(filehero):
    os.makedirs(filehero)

Then we can crawl the hero's wallpapers. Since there are no more than 20 wallpapers per hero, a loop over the wallpaper numbers gets them all:

for k in range(21):
    # generate a wallpaper address
    url = 'http://ossweb-img.qq.com/images/lol/web201310/skin/big' + str(heroid) + '0' * (3 - len(str(k))) + str(k) + '.jpg'
    # grab the wallpaper
    try:
        image = urlrequest.urlopen(url).read()
        imagename = filehero + '\\' + '0' * (3 - len(str(k))) + str(k) + '.jpg'
        with open(imagename, 'wb') as f:
            f.write(image)
    except HTTPError as e:
        continue
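Not every numbered URL exists, since heroes have different skin counts, which is why the HTTPError is swallowed above. As an extra safeguard (my own addition, not part of the original code), you can check that the downloaded bytes actually start with the JPEG magic number before writing the file, so that error pages are never saved as .jpg:

```python
def looks_like_jpeg(data: bytes) -> bool:
    # every JPEG file starts with the two bytes FF D8
    return data[:2] == b'\xff\xd8'

print(looks_like_jpeg(b'\xff\xd8\xff\xe0' + b'rest of image'))  # True
print(looks_like_jpeg(b'<html>error page</html>'))              # False
```

In the download loop, `if looks_like_jpeg(image):` before `open(imagename, 'wb')` would guard the write.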
Output a success message after the crawl is complete:
# report that the hero's crawl is complete
print('Hero ' + nickname_name + ' wallpapers have been crawled successfully\n')
And we're done! Just run this small program and every hero's skin wallpapers are in the bag. Of course, you can also crawl all the skins of a single hero: simply enter the hero's nickname or name when prompted. Crawling a single hero's skin wallpapers:
Crawling all hero skin wallpapers:
Keep the network connected while running the code; if the connection is too slow, the crawl may fail. On a 3 Mbps connection it takes about 3-4 minutes to crawl the HD wallpapers of all 139 heroes (about 1000 images). The same approach works for other games such as Honor of Kings, Onmyoji, and PUBG; League of Legends is the hardest of them to crawl, so writing your own code for the other games should be easy. Finally, a "Till Death Do Us Part" wallpaper, and congratulations to LPL!
Finally
If you still can't get this script working, follow me and forward this article, then send me a private message saying "skin" to get the complete code, or ask me for guidance on capturing the hero skins. Original writing is not easy!