This article is an entry in the 2021 year-end summary essay contest.

Preface

Use Python multithreading to crawl more than 5000 latest movie download links

Let's get started.

Development tools

Python version: 3.6.4

Related modules:

requests module;

bs4 (BeautifulSoup) module;

re module;

csv module;

And some modules that come with Python.

Environment setup

Install Python and add it to the PATH environment variable, then use pip to install the required third-party modules.

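from bs4 import BeautifulSoup

# Parse the list page (html holds its source) and collect the links to each movie's detail page
bs = BeautifulSoup(html, "html.parser")
b = bs.find_all(class_="co_content8")
b = b[0].find_all(class_="ulink")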
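As a rough illustration of what this link collection does, here is a minimal sketch that uses only the standard library's html.parser instead of BeautifulSoup. The sample HTML, the UlinkCollector class, and the collect_ulinks helper are all made up for this example, and unlike the snippet above it does not restrict the search to the co_content8 container.

```python
from html.parser import HTMLParser


class UlinkCollector(HTMLParser):
    """Collect (href, text) pairs from <a class="ulink"> tags."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the ulink anchor currently open, if any
        self._text = []     # text fragments seen inside that anchor

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "a" and "ulink" in classes:
            self._href = attrs.get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


# Invented sample markup, loosely shaped like a movie list page
SAMPLE_HTML = """
<div class="co_content8">
  <a class="ulink" href="/html/gndy/dyzz/1.html">Movie One</a>
  <a href="/other.html">not a movie link</a>
  <a class="ulink" href="/html/gndy/dyzz/2.html">Movie Two</a>
</div>
"""


def collect_ulinks(html):
    """Return every (href, text) pair for anchors carrying the ulink class."""
    parser = UlinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links
```

Calling collect_ulinks(SAMPLE_HTML) yields the two movie entries and skips the anchor without the ulink class.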

Once you have the links, the next step is to visit each of them and extract the movie's download link.

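# html1 is the source of one movie's detail page; the download link is the
# first <a> inside the first table cell after <tbody>
bs1 = BeautifulSoup(html1, "html.parser")
b1 = bs1.find("tbody").find_next("td").find_next("a")
download_url = b1.get("href")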

There are still a few details to take care of. We need to know how many list pages there are, and with that many pages a single thread would take a very long time to finish. So we first get the total number of pages, and then divide the pages among several threads.
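One way to divide the pages is to hand each thread a contiguous, inclusive range. The following is only a sketch under that assumption; split_pages is a hypothetical helper, not part of the original script. It spreads any remainder over the first few ranges so that no page is left out.

```python
def split_pages(total_page, workers):
    """Split pages 1..total_page into up to `workers` inclusive (start, end) ranges."""
    base, extra = divmod(total_page, workers)
    ranges = []
    start = 1
    for i in range(workers):
        # The first `extra` ranges take one additional page each
        size = base + (1 if i < extra else 0)
        if size == 0:
            continue  # fewer pages than workers: skip the empty ranges
        ranges.append((start, start + size - 1))
        start += size
    return ranges
```

For example, split_pages(10, 4) gives [(1, 3), (4, 6), (7, 8), (9, 10)], which covers every page exactly once.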

The total is extracted from the list page with a regular expression, using the re module

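import re
import requests

def get_total_page(url):
    # headers is the request-header dict defined elsewhere in the script
    r = requests.get(url=url, headers=headers)
    r.encoding = 'gb2312'
    # Lookbehind: take the digits right after "页/" ("page/") in the pager text
    pattern = re.compile(r'(?<=页/)\d+')
    t = pattern.findall(r.text)
    return int(t[0])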
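To see how the lookbehind works without touching the network, here is a small offline sketch. The pager text and the "page/" marker are stand-ins invented for this example, and the real page's wording may differ. It also raises a clear error when the marker is missing, rather than failing on an empty list.

```python
import re

# Hypothetical pager text; "page/" stands in for whatever marker the real page uses
SAMPLE_PAGER = "total 212 page/5283 records"

# (?<=page/) is a lookbehind: it requires "page/" right before the digits
# but leaves it out of the match
PATTERN = re.compile(r"(?<=page/)\d+")


def count_after_marker(text):
    """Return the integer that directly follows the marker in `text`."""
    match = PATTERN.search(text)
    if match is None:
        raise ValueError("pager marker not found")
    return int(match.group())
```

With the sample text, count_after_marker(SAMPLE_PAGER) returns 5283. The 212 is skipped because "page/" does not come directly before it.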

The crawled results are stored in a CSV file; a small function takes care of the writing

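import csv

def wirte_into_csv(name, down_url):
    # Append one row; newline='' stops the csv module from writing blank lines on Windows
    with open('Latest Movie.csv', 'a+', encoding='utf-8', newline='') as f:
        csv_writer = csv.writer(f)
        csv_writer.writerow([name, down_url])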
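The append-then-read round trip can be checked with a short sketch like the one below. The function names append_row and read_rows and the path argument are illustrative choices, not part of the original script. Passing the path in, instead of hard-coding the file name, makes the helper easy to try against a temporary file.

```python
import csv


def append_row(path, name, down_url):
    """Append a single (name, download URL) row to the CSV file at `path`."""
    # newline="" lets the csv module control line endings itself
    with open(path, "a", newline="", encoding="utf-8") as f:
        csv.writer(f).writerow([name, down_url])


def read_rows(path):
    """Read every row back as a list of lists of strings."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.reader(f))
```

Because the csv module quotes fields when needed, a movie name that contains a comma still comes back as a single field.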

Start four threads to crawl the download links

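if __name__ == "__main__":
    # get_total_page returns a raw count; dividing by 25 converts it to a number of list pages
    total_page = get_total_page("https://www.ygdy8.com/html/gndy/oumei/list_7_1.html")
    total_page = int(total_page / 25 + 1)
    end = int(total_page / 4)
    try:
        # Each thread crawls one quarter of the pages
        _thread.start_new_thread(run, (1, end))
        _thread.start_new_thread(run, (end + 1, end * 2))
        _thread.start_new_thread(run, (end * 2 + 1, end * 3))
        # The last thread runs to total_page so leftover pages are not skipped
        _thread.start_new_thread(run, (end * 3 + 1, total_page))
    except Exception:
        print("Error: unable to start thread")

    # _thread has no join(), so keep the main thread alive while the workers run
    while True:
        pass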
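The busy loop at the end keeps the program alive, but it never exits and it keeps a CPU core busy. As an alternative sketch, the higher-level threading module can start the workers and then wait for them with join(). The record_pages worker below is a stand-in that only records page numbers; in the real script the run function would take its place. crawl_in_threads and demo are hypothetical names.

```python
import threading


def crawl_in_threads(ranges, worker):
    """Start one thread per (start, end) range and wait until all have finished."""
    threads = [threading.Thread(target=worker, args=r) for r in ranges]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # blocks until this worker is done, with no busy waiting


def demo():
    """Run a stand-in worker over four ranges and return the pages it visited."""
    visited = []
    lock = threading.Lock()

    def record_pages(start, end):
        # Stand-in for the real crawl: just note which pages this thread owns
        for page in range(start, end + 1):
            with lock:
                visited.append(page)

    crawl_in_threads([(1, 3), (4, 6), (7, 8), (9, 10)], record_pages)
    return sorted(visited)
```

Here demo() returns the pages 1 through 10, and the program ends on its own once every thread has joined.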