This article is an entry in the 2021 year-end summary essay contest.

Preface

Use Python multithreading to crawl more than 5,000 of the latest movie download links.

Let's get started.

Development tools

Python version: 3.6.4

Related modules:

requests module;

bs4 (BeautifulSoup) module;

re module;

csv module;

and some other modules that ship with Python.

Environment setup

Install Python and add it to your PATH environment variable, then use pip to install the required third-party modules. The re and csv modules ship with Python and need no installation.

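from bs4 import BeautifulSoup
# html is the source of one list page; each movie's detail link carries the class "ulink"
bs = BeautifulSoup(html, "html.parser")
b = bs.find_all(class_="co_content8")
b = b[0].find_all(class_="ulink")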

Once you have the detail-page links, the next step is to visit each of them and extract the movie's download link.

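# html1 is the source of one movie's detail page
bs1 = BeautifulSoup(html1, "html.parser")
# the download link is the first <a> after the first <td> that follows <tbody>
b1 = bs1.find("tbody").find_next("td").find_next("a")
download_url = b1.get("href")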
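The two parsing steps above can be tried offline before pointing them at the real site. The following is a minimal sketch, not the original script: the HTML snippets are made-up samples that only mirror the class names and tag layout used above, the function names extract_detail_links and extract_download_url are hypothetical, and joining relative links with urljoin assumes the list page uses relative href values.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Made-up sample markup that mirrors the selectors used in the article.
LIST_HTML = """
<div class="co_content8">
  <table><tr><td><a class="ulink" href="/html/gndy/dyzz/1.html">Movie One</a></td></tr></table>
  <table><tr><td><a class="ulink" href="/html/gndy/dyzz/2.html">Movie Two</a></td></tr></table>
</div>
"""

DETAIL_HTML = """
<table><tbody><tr><td>
  <a href="ftp://example.invalid/movie-one.mkv">ftp://example.invalid/movie-one.mkv</a>
</td></tr></tbody></table>
"""


def extract_detail_links(list_html, base_url):
    """Return (title, absolute detail URL) pairs found on one list page."""
    soup = BeautifulSoup(list_html, "html.parser")
    container = soup.find(class_="co_content8")
    if container is None:
        return []
    return [
        (a.get_text(strip=True), urljoin(base_url, a.get("href", "")))
        for a in container.find_all(class_="ulink")
    ]


def extract_download_url(detail_html):
    """Return the href of the first <a> after the first <td> following <tbody>."""
    soup = BeautifulSoup(detail_html, "html.parser")
    tbody = soup.find("tbody")
    if tbody is None:
        return None
    td = tbody.find_next("td")
    link = td.find_next("a") if td is not None else None
    return link.get("href") if link is not None else None


links = extract_detail_links(
    LIST_HTML, "https://www.ygdy8.com/html/gndy/oumei/list_7_1.html"
)
download_url = extract_download_url(DETAIL_HTML)
print(links)
print(download_url)
```

Returning None or an empty list when an element is missing keeps one odd page from raising an AttributeError and killing a whole worker thread.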

There are still a few details to handle. The movie list spans many pages, and a single thread would take a long time to work through them all, so we first get the total number of pages and then divide the pages among several threads.

The total is extracted with a regular expression, using the re module.

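def get_total_page(url):
    # headers (the request-header dict) must be defined elsewhere in the script
    r = requests.get(url=url, headers=headers)
    r.encoding = 'gb2312'
    # look-behind: match the digits that immediately follow "page/" in the pager text
    pattern = re.compile(r'(?<=page/)\d+')
    t = pattern.findall(r.text)
    return int(t[0])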
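To see what the look-behind pattern does without hitting the network, here is a small sketch run against a made-up pager string. The sample text, the names extract_total and pages_needed, and the default of 25 entries per page are assumptions for illustration; the figure 25 is taken from the division used later in the article.

```python
import math
import re

# Hypothetical pager text; the wording on the real site may differ.
SAMPLE_PAGER = "Total 227 page/5668 records"

# (?<=page/) is a look-behind: it requires "page/" directly before the
# digits but does not include it in the match.
PAGER_PATTERN = re.compile(r"(?<=page/)\d+")


def extract_total(text):
    """Return the first number that follows 'page/', or None if absent."""
    match = PAGER_PATTERN.search(text)
    return int(match.group()) if match else None


def pages_needed(total_records, per_page=25):
    """Round up so that a partially filled last page is still counted."""
    return math.ceil(total_records / per_page)


total = extract_total(SAMPLE_PAGER)
print(total, pages_needed(total))
```

Using math.ceil instead of adding 1 after the division avoids requesting one page too many when the total is an exact multiple of the page size.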

The crawled content is saved to a CSV file; it is convenient to write a small function for this.

def wirte_into_csv(name, down_url):
    # append one row; newline='' keeps csv from writing blank lines on Windows
    with open('Latest Movie.csv', 'a+', newline='', encoding='utf-8') as f:
        csv_writer = csv.writer(f)
        csv_writer.writerow([name, down_url])
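Because several threads will append to the same file, it may be worth guarding the write with a lock. The sketch below is one way to do that, not the article's code: append_row, the module-level lock, the temporary file path, and the example row are all hypothetical.

```python
import csv
import os
import tempfile
import threading

# One lock shared by all worker threads (an addition for illustration).
_csv_lock = threading.Lock()


def append_row(path, name, down_url):
    """Append a single (name, download URL) row to the CSV file at path."""
    with _csv_lock:
        # newline="" lets the csv module control line endings itself.
        with open(path, "a", newline="", encoding="utf-8") as f:
            csv.writer(f).writerow([name, down_url])


# Demonstration against a throwaway file in a temporary directory.
demo_path = os.path.join(tempfile.mkdtemp(), "movies.csv")
append_row(demo_path, "Example Movie", "ftp://example.invalid/movie.mkv")
append_row(demo_path, "Another, Movie", "ftp://example.invalid/other.mkv")

with open(demo_path, newline="", encoding="utf-8") as f:
    rows = list(csv.reader(f))
print(rows)
```

The second row shows that a comma inside the title is quoted by csv.writer and read back as a single field.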

Start four threads to crawl the download links

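total_page = get_total_page("https://www.ygdy8.com/html/gndy/oumei/list_7_1.html")
# convert the extracted total into a page count (25 entries per list page)
total_page = int(total_page / 25 + 1)
end = int(total_page / 4)  # number of pages handled by each thread
try:
    _thread.start_new_thread(run, (1, end))
    _thread.start_new_thread(run, (end + 1, end * 2))
    _thread.start_new_thread(run, (end * 2 + 1, end * 3))
    # the last thread also takes the pages left over by the integer division
    _thread.start_new_thread(run, (end * 3 + 1, total_page))
except Exception:
    print("Error: unable to start thread")

# _thread offers no join(), so keep the main thread alive while the workers run
while True:
    pass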
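The endless loop at the end keeps the program alive but also means it never exits on its own. As an alternative, here is a sketch using the higher-level threading module, whose join() waits for the workers to finish. It is not the article's code: split_pages, fake_run, and crawl_all are hypothetical names, and fake_run only records page numbers in place of the real run(start, end), which is assumed to treat both bounds as inclusive.

```python
import threading


def split_pages(total_pages, workers):
    """Split pages 1..total_pages into contiguous inclusive (start, end) ranges."""
    base, extra = divmod(total_pages, workers)
    ranges = []
    start = 1
    for i in range(workers):
        # The first `extra` workers take one additional page each.
        size = base + (1 if i < extra else 0)
        if size == 0:
            continue  # more workers than pages
        ranges.append((start, start + size - 1))
        start += size
    return ranges


def fake_run(start, end, seen, lock):
    """Stand-in for the crawler's run(): just record which pages were visited."""
    for page in range(start, end + 1):
        with lock:
            seen.append(page)


def crawl_all(total_pages, workers=4):
    seen = []
    lock = threading.Lock()
    threads = [
        threading.Thread(target=fake_run, args=(start, end, seen, lock))
        for start, end in split_pages(total_pages, workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # block until every worker is done, then return normally
    return sorted(seen)


print(split_pages(10, 4))
print(crawl_all(10))
```

Spreading the remainder over the first few workers keeps the ranges balanced and guarantees that every page is assigned to exactly one thread.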


