Basic idea:
First, use the browser's developer tools to find the list of tags that hold the data to extract:
![](https://p6-tt-ipv6.byteimg.com/large/pgc-image/bd2bcc5f762547ce8b954deae1193f9b)
Use XPath to locate the list of elements to extract data from:
![](https://p26-tt.byteimg.com/large/pgc-image/0025bc4a2eeb4349a7dbd3098dea1c12)
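As a browser-free illustration of this step, the sketch below runs the same kind of XPath query against a small hand-written snippet with Python's standard-library `xml.etree.ElementTree`, which supports only a subset of XPath. The markup, the class names, and the streamer names are assumptions made up for the example; the real page may differ.

```python
import xml.etree.ElementTree as ET

# Hand-written stand-in for the page; the real Douyu markup may differ.
SAMPLE = """
<html><body>
  <ul class="layout-Cover-list">
    <li><div class="DyListCover-userName">streamer_a</div></li>
    <li><div class="DyListCover-userName">streamer_b</div></li>
  </ul>
  <ul class="other-list"><li>ignored</li></ul>
</body></html>
"""

root = ET.fromstring(SAMPLE)
# Select every <li> below the <ul> whose class attribute matches exactly.
li_list = root.findall(".//ul[@class='layout-Cover-list']//li")
print(len(li_list))
```

The predicate compares the whole `class` attribute, so an element carrying additional classes would not match; that holds for the Selenium query in this article as well.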
Then extract the corresponding fields from each element, one by one:
![](https://p9-tt-ipv6.byteimg.com/large/pgc-image/ee47b055daf141dc9e971c420d66c084)
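The per-item extraction can be sketched without a browser in the same way. The snippet below is a hand-written stand-in for a single list item, and `extract_item` is a hypothetical helper name; only the relative-XPath pattern (paths starting with `.//`) and the newline cleanup are meant to carry over.

```python
import re
import xml.etree.ElementTree as ET

# Hand-written stand-in for one list item; the real markup may differ.
SAMPLE_LI = """
<li>
  <div class="DyListCover-imgWrap"><img src="https://example.com/cover.jpg" /></div>
  <div class="DyListCover-userName">streamer_a</div>
  <span class="DyListCover-zone">Games</span>
  <span class="DyListCover-hot">12.3
k</span>
</li>
"""


def extract_item(li):
    """Read each field relative to one <li> element."""
    hot = li.find(".//span[@class='DyListCover-hot']").text
    return {
        "user-id": li.find(".//div[@class='DyListCover-userName']").text,
        "img": li.find(".//div[@class='DyListCover-imgWrap']//img").get("src"),
        "class-name": li.find(".//span[@class='DyListCover-zone']").text,
        # Remove embedded line breaks so the value fits on one CSV line.
        "click-hot": re.sub(r"\n", "", hot),
    }


item = extract_item(ET.fromstring(SAMPLE_LI))
print(item)
```

Starting each path with `.` keeps the search inside the current item; a path starting with `//` would search the whole document again.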
Save the data to a CSV file:
![](https://p1-tt-ipv6.byteimg.com/large/pgc-image/5fe2f65117d946909db951fe6a538436)
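One possible variant of this step uses the standard-library `csv` module, which quotes fields that contain commas (joining values with `','` by hand does not). This is a sketch, not the article's own code: the helper name `append_row`, the column order, and the temporary file path are assumptions for illustration.

```python
import csv
import os
import tempfile

FIELDS = ["user-id", "img", "class-name", "click-hot"]


def append_row(path, item):
    """Append one item to a CSV file, writing the header only for a new file."""
    is_new = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", encoding="utf-8", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()
        writer.writerow(item)


# Demo: write two rows to a temporary file and read them back.
demo_path = os.path.join(tempfile.mkdtemp(), "douyu.csv")
append_row(demo_path, {"user-id": "streamer_a", "img": "https://example.com/a.jpg",
                       "class-name": "Games", "click-hot": "12.3k"})
append_row(demo_path, {"user-id": "b, the second", "img": "https://example.com/b.jpg",
                       "class-name": "Music", "click-hot": "800"})
with open(demo_path, encoding="utf-8", newline="") as f:
    rows = list(csv.DictReader(f))
print(rows)
```

Checking the file size instead of keeping a flag on the object means the header is not repeated when the script is run again against an existing file.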
Use the developer tools to find the tag of the next-page button:
![](https://p6-tt-ipv6.byteimg.com/large/pgc-image/85e34c5409104fc599652944a5a1d633)
Locate this tag with XPath and return it:
![](https://p1-tt-ipv6.byteimg.com/large/pgc-image/b8c2e5240e2d4a198f0f459cb6cc04ce)
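The "find the label, then step up to its parent" pattern can be tried offline as well. The sketch below assumes the button is a `<li>` wrapping a `<span>` whose text is 下一页 ("next page") and that the `<li>` carries an `aria-disabled` attribute; both the markup and the label are assumptions. In `ElementTree`, the predicate `[.='...']` plays the role of `text()='...'` in full XPath.

```python
import xml.etree.ElementTree as ET

# Hand-written pager in its last-page state; the real markup may differ.
SAMPLE_PAGER = """
<ul class="pager">
  <li aria-disabled="false"><span>上一页</span></li>
  <li aria-disabled="true"><span>下一页</span></li>
</ul>
"""

root = ET.fromstring(SAMPLE_PAGER)
# Match the <span> by its text, then use '/..' to step up to the parent <li>.
next_page = root.find(".//span[.='下一页']/..")
# The attribute is a string, so compare against "true" rather than a boolean.
is_last_page = next_page.get("aria-disabled") == "true"
print(next_page.tag, is_last_page)
```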
Trigger its click event and repeat the process above for each page:
![](https://p6-tt-ipv6.byteimg.com/large/pgc-image/ed409f438f104b0ead1c8f4903cb9b14)
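The control flow of this loop can be sketched with a stand-in for the browser. `FakePager`, `crawl`, and `scrape_page` are hypothetical names invented for the example; the stand-in only mimics a next-page button whose `aria-disabled` value turns to `"true"` on the last page.

```python
class FakePager:
    """Hypothetical stand-in for the browser: tracks the current page number."""

    def __init__(self, total_pages):
        self.page = 1
        self.total_pages = total_pages

    def next_disabled(self):
        # Mirrors reading aria-disabled on the next-page button.
        return "true" if self.page >= self.total_pages else "false"

    def click_next(self):
        self.page += 1


def crawl(pager, scrape_page):
    """Scrape the current page, then click 'next' until it is disabled."""
    visited = []
    while True:
        visited.append(scrape_page(pager.page))
        if pager.next_disabled() != "false":
            break
        pager.click_next()
    return visited


pages = crawl(FakePager(total_pages=3), scrape_page=lambda n: f"page-{n}")
print(pages)
```

Scraping before checking the button ensures the last page is still processed before the loop exits.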
Final result:
![](https://p1-tt-ipv6.byteimg.com/large/pgc-image/24d7a011458d40eb857577df9a13aef3)
Code:

    import re
    import time

    from selenium import webdriver
    from selenium.webdriver.common.by import By


    class Douyu(object):
        def __init__(self):
            # Starting URL
            self.start_url = "https://www.douyu.com/directory/all"
            # Instantiate a Chrome driver
            self.driver = webdriver.Chrome()
            # True until the CSV header has been written
            self.start_csv = True

        def __del__(self):
            self.driver.quit()

        def get_content(self):
            # Give the page time to load
            time.sleep(2)
            item = {}
            # Get the next-page tag; the label text is assumed to be "下一页" (next page)
            next_page = self.driver.find_element(By.XPATH, "//span[text()='下一页']/..")
            # "false" means the button is enabled, i.e. there is a next page
            is_next_url = next_page.get_attribute("aria-disabled")
            # Get the list of room tags
            li_list = self.driver.find_elements(By.XPATH, "//ul[@class='layout-Cover-list']//li")
            for li in li_list:
                item["user-id"] = li.find_element(By.XPATH, ".//div[@class='DyListCover-userName']").text
                item["img"] = li.find_element(By.XPATH, ".//div[@class='DyListCover-imgWrap']//img").get_attribute("src")
                item["class-name"] = li.find_element(By.XPATH, ".//span[@class='DyListCover-zone']").text
                item["click-hot"] = li.find_element(By.XPATH, ".//span[@class='DyListCover-hot']").text
                item["click-hot"] = re.sub(r"\n", "", item["click-hot"])
                # Save data
                self.save_csv(item)
            # Return the next-page tag and its aria-disabled value
            return next_page, is_next_url

        def save_csv(self, item):
            line = ",".join(item.values())
            with open("./douyu.csv", "a", encoding="utf-8") as f:
                if self.start_csv:
                    # Write the header row only once
                    f.write("user-id,img,class-name,click-hot\n")
                    self.start_csv = False
                f.write(line + "\n")
            # print("save success")

        def run(self):
            self.driver.get(self.start_url)
            while True:
                next_page, is_next = self.get_content()
                # Stop once the next-page button is disabled
                if is_next != "false":
                    break
                next_page.click()


    if __name__ == "__main__":
        douyu_spider = Douyu()
        douyu_spider.run()
This article is a reprint; copyright belongs to the original author. In case of infringement, please contact the editor to have it removed.
Original address: blog.csdn.net/qq_46456049…