In order to prevent crawlers, many websites use JS to load loaded data. If you use Python’s Request library to crawl, you will not be able to crawl the data. Selenium library can simulate opening a browser. Selenium does more than that. It can locate a tag (based on classname, ID, HTML tag, etc.), click, scroll up, and manipulate JS statements.
Download chrome driver: npm.taobao.org/mirrors/chr…
If the chrome version and the driver version are significantly different, opening urls and retrieving HTML tags will not generate an error, but manipulating JS statements will generate an error
Version problem: open the installation location of Chrome browser, where the first folder name is her version, such as mine is 84.0.4147.105, and then in the above url is 84.0.4147.30 this version of Windows 32-bit driver (even if my computer is 64-bit).
Then unzip the downloaded driver into the python interpreter installation directory (the same directory as python.exe), and add the Python installation directory Path to the system variable Path so That Python can find the driver
PIP install Selenium
The link I use below is the ranking page of B station, and may change the link address at any time
From Selenium import webdriver browser = webdriver.chrome () 'https://www.bilibili.com/v/popular/rank/all?spm_id_from=333.851.b_7072696d61727950616765546162.3') # open a website Print (browser.page_source) # print the HTML browser.quit() # close the browser page, From Selenium import webdriver import time browser = webdriver.chrome () # open browser browser.get() 'https://www.bilibili.com/v/popular/rank/all?spm_id_from=333.851.b_7072696d61727950616765546162.3') # open a website Browser.execute_script (' window.scrollto (300,1000)') # move 300 pixels to the right, Slide down 1000 pixels time. Sleep (2) the execute_script (' window. ScrollTo (0, document. Body. ScrollHeight) ') # to the bottom of the page is the most time. Sleep (2) Element = browser.find_element_by_link_text(' ') # browser.execute_script("arguments[0].scrollintoView ();" Sleep (2) print(browser.page_source) # print the HTML browser.quit() # close the browser page and exit the processCopy the code
The browser.find_element_by_xxx(XXX) function inside can get a lookup to a location in a variety of ways, such as ID, class, etc. I used a link text here to find it.
The Selenium library also provides functions to find an HTML tag, but IN my opinion, the same HTML tag can appear more than once on a page, so I recommend using the re library to find a specific text
Import re from Selenium import webdriver browser = webdriver.chrome () 'https://www.bilibili.com/v/popular/rank/all?spm_id_from=333.851.b_7072696d61727950616765546162.3') # number = open a url re.findall(r'<div class="num">(.*?)</div>', browser.page_source, Title = re.findall(r'target="_blank" class="title">(.*?)</a>', browser. Page_source, Re. # S) title view = re. The.findall (r ' ' '< I class = "b - the icon play" > < / I > (. *?) < / span >' ' ', the page_source, Re. # S) watched danmu = view = re. The.findall (r ' ' '< I class = "b - the icon view" > < / I > (. *?) < / span >' ' ', the page_source, Re. # S) barrage user number = re. The.findall (r ' ' '< I class = "b - the icon author" > < / I > (. *?) < / span >' ' ', the page_source, Pingfen = re.findall(r'<div class=" PTS "><div>(.*?)</div>', browser.page_source, Re.s) # for I in zip(number, title, view, danmu, user, pingfen): print(I) browser.quit() # Close the browser page and exit the processCopy the code
We’re ready to crawl. Log him into the MySQL database. If you use my code to input, you will first create a bilibili database, and then create a table for 20201025day, the structure of the table is as follows:
from selenium import webdriver import re import pymysql class Write_sql(): def __init__(self, database): Self. database = database print(self.database, self.database, 'database ') def writeValue(self, insert_sql): The db = pymysql. Connect (' 127.0.0.1 ', 'root', 'XXXX', Self.database) cursor1 = db.cursor() # Retrieve cursor cursor1. Execute (insert_SQL) # execute SQL statement db.mit () # commit to database browser = webdriver.Chrome() browser.get( Number = 'https://www.bilibili.com/v/popular/rank/all?spm_id_from=333.851.b_7072696d61727950616765546162.3') re.findall(r'<div class="num">(.*?)</div>', browser.page_source, Title = re.findall(r'target="_blank" class="title">(.*?)</a>', browser. Page_source, Re. # S) title view = re. The.findall (r ' ' '< I class = "b - the icon play" > < / I > (. *?) < / span >' ' ', the page_source, Re. # S) watched danmu = view = re. The.findall (r ' ' '< I class = "b - the icon view" > < / I > (. *?) < / span >' ' ', the page_source, Re. # S) barrage user number = re. The.findall (r ' ' '< I class = "b - the icon author" > < / I > (. *?) < / span >' ' ', the page_source, Pingfen = re.findall(r'<div class=" PTS "><div>(.*?)</div>', browser.page_source, For I in zip(number, title, view, danmu, user, pingfen): Insertsql = 'insert into 20201025DAY values' + STR (I) To convert to a string type write_sql.writevalue (insertsQL) # Call the object's insertion method browser.quit()Copy the code
Please correct me if there are any mistakes
PS: If you can’t solve the problem, you can click the link below to get it by yourself
Free Python learning materials and group communication solutions click to join