This is the 14th day of my participation in the August More Text Challenge. For details, see: August More Text Challenge

Sometimes we need pictures, for example as illustrations or covers when writing blog posts or public-account articles. Of course, these should be copyright-free images. Commonly used sites include Pexels and Pixabay; today I'd like to introduce a new one: alana.io.

Since it is an overseas website, access is relatively slow, and searching page by page takes a lot of time. So I thought of using Python to crawl the images, download them locally, organize them into folders by keyword, and simply preview them whenever they are needed later.

First, we need to understand the basic process of crawling data:

Send a request: send an HTTP request to the server through the URL.

Get the response: the response content may be HTML, a JSON string, or binary data (videos, images), etc.

Parse the content: the response data can be parsed with regular expressions, BeautifulSoup, XPath, and so on.

Save the data: either to a local file or to a database (MySQL, Redis, MongoDB, etc.). A minimal sketch of the whole flow is shown below.
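As a minimal end-to-end sketch of these four steps (using example.com as a stand-in page and a regex for parsing, just to illustrate the flow):

```python
import re
import requests

# 1. Send a request
resp = requests.get('https://example.com')
# 2. Get the response content (text here; use resp.content for binary data such as images)
html = resp.text
# 3. Parse the content (a regex here; BeautifulSoup or XPath would also work)
titles = re.findall(r'<title>(.*?)</title>', html)
# 4. Save the data (to a local file here; a database would also work)
with open('titles.txt', 'w', encoding='utf-8') as f:
    f.write('\n'.join(titles))
```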

Preparation

Open alana.io in a browser, search for a keyword (e.g. computer), and check the request URL in the F12 developer tools.

Go to page 2 to see how the page number appears in the URL:

Click Response to see the pattern of the image URLs, most of which start with *****. The snippet below shows how the search URL is assembled from these observations.
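Putting the two observations together, the paginated search URL can be built from the page number and the keyword (the pattern is the one seen in F12 and used again in the main loop later; the keyword here is just the example from above):

```python
search_words = 'computer'  # example keyword
for page in range(1, 4):
    print('http://alana.io/page/{}/?s={}&post_type=download'.format(page, search_words))
# http://alana.io/page/1/?s=computer&post_type=download
# http://alana.io/page/2/?s=computer&post_type=download
# http://alana.io/page/3/?s=computer&post_type=download
```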

Below is how the code for crawling and downloading the images in bulk is put together.

1. Create a download directory

Create a directory named after the keyword so the images can be found easily later.

```python
import os

def create_dirs(search_words):
    # Create a directory named after the keyword if it does not exist yet
    if not os.path.exists('./{}'.format(search_words)):
        os.mkdir('./{}'.format(search_words))
```
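If the download path may ever be nested, os.makedirs('./{}'.format(search_words), exist_ok=True) does the same thing in one call and also creates any missing parent directories.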

2. Request and parse data

```python
import re
import requests

def save_urls(url):
    # Request the search-result page and pull out the image addresses with a regex
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36'}
    html = requests.get(url, headers=headers).text
    # Thumbnails on the result page are 548x365, so match on that size
    pattern = re.compile(r'<img width="548" height="365" src="(.*?)"')
    urls = pattern.findall(html)
    return urls
```
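The regex above only matches images of exactly 548×365 pixels; if the site changes its thumbnail size, it silently returns nothing. As a more tolerant alternative (a sketch, assuming the bs4 package is installed and the result page keeps plain &lt;img&gt; tags), BeautifulSoup can be used instead:

```python
from bs4 import BeautifulSoup
import requests

def save_urls_bs(url):
    # Collect the src of every <img> tag on the results page
    headers = {'User-Agent': 'Mozilla/5.0'}
    html = requests.get(url, headers=headers, timeout=10).text
    soup = BeautifulSoup(html, 'html.parser')
    return [img['src'] for img in soup.find_all('img') if img.get('src', '').startswith('http')]
```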

3. Save the image data to the local PC

```python
def save_pics(search_words, urls):
    # Download every image and save it under the keyword directory
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36'}
    for i in urls:
        # Use the last part of the URL as the file name
        filename = './{}/'.format(search_words) + i.split('/')[-1]
        try:
            with open(filename, 'wb+') as f:
                f.write(requests.get(i, headers=headers).content)
        except Exception as e:
            # Skip images that fail to download and keep going
            print(e)
```
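Here each image is loaded fully into memory before being written. For very large files, a streamed download is gentler on memory (a sketch, not part of the original script; the helper name is hypothetical):

```python
import requests

def save_pic_streamed(filename, url, headers):
    # Write the image to disk in 8 KB chunks instead of holding it all in memory
    with requests.get(url, headers=headers, stream=True, timeout=10) as r:
        r.raise_for_status()
        with open(filename, 'wb') as f:
            for chunk in r.iter_content(chunk_size=8192):
                f.write(chunk)
```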

4. Main function (download by page loop)

```python
import time

if __name__ == '__main__':
    search_words = 'computer'  # search keyword (example value)
    search_page = 5            # number of result pages to crawl (example value)
    # Loop over the result pages and download each one
    for page in range(1, search_page + 1):
        url = 'http://alana.io/page/{}/?s={}&post_type=download'.format(page, search_words)
        create_dirs(search_words)
        urls = save_urls(url)
        save_pics(search_words, urls)
        time.sleep(2)  # pause between pages to avoid hammering the server
```
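To avoid editing the script for every new search, the keyword and page count could also be taken from the command line (a hypothetical wrapper, not part of the original article):

```python
import argparse

parser = argparse.ArgumentParser(description='Download copyright-free images from alana.io by keyword')
parser.add_argument('keyword', help='search keyword, e.g. computer')
parser.add_argument('--pages', type=int, default=3, help='number of result pages to crawl')
args = parser.parse_args()
search_words, search_page = args.keyword, args.pages
```

Run it as, for example, python crawler.py computer --pages 5 (the file name is just an example).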

Done: the images are downloaded into a folder named after the keyword, ready to preview.

Note: this is for learning purposes only. Please do not crawl in large volumes or at high frequency, to avoid putting excessive pressure on the server.