Introduction

It’s been a month since my last crawler post, so today we’ll move on to using a crawler to grab table data and save it to Excel. This time I’m switching to an internal Sample and walking through the implementation details.

Implementation details

As before, we add pandas, the tool set for analyzing structured data, to the original login.py file, together with openpyxl, which we will need later to append to the workbook:

```python
import pandas as pd
from openpyxl import load_workbook
```
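
If you have not used these two libraries together before, here is a minimal standalone sketch of the idea behind this post: pandas turns a list of rows into a DataFrame and writes it to an .xlsx file, which openpyxl can later reopen so that more rows can be appended. The file name and sample data below are invented purely for illustration.

```python
import pandas as pd

# Invented sample data: each inner list is one table row
rows = [["Zhang San", "wx_001"], ["Li Si", "wx_002"]]

df = pd.DataFrame(rows, columns=["name", "wechat_id"])
df.to_excel("sketch.xlsx", sheet_name="sheet_1", index=False)  # writes sketch.xlsx
```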

With the imports in place, we log in just as described in the previous article, Implementation of Simulated Crawler Login using Python + Selenium (1):

```python
self.browser.find_element_by_name('commit').click()  # log in
time.sleep(1)  # wait a moment for the page to load
```

Once the login has actually succeeded, we parse the nodes of the landing page and simulate expanding the left sidebar hierarchy:

```python
span_tags = self.browser.find_elements_by_xpath('//span[text()=" user "]')
span_tags[0].click()  # open the WeChat user page
a_tags = self.browser.find_elements_by_xpath('//a[@href="/admin/wxusers"]')
a_tags[0].click()
```

With the code above we fully expand the sidebar and open the page we want. Next comes the most important part: since this internal Sample does not separate the front end from the back end, we have to read the total number of pages out of the page itself.

```python
b_tags = self.browser.find_element_by_class_name('pagination.page.width-auto').find_elements_by_tag_name('b')
pageSize = int(b_tags[1].text)
```
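
A small caveat from my side (not something the original code handles): if the list fits on a single page, the pagination widget may not be rendered at all, and b_tags[1] would raise an IndexError. A defensive sketch could fall back to one page:

```python
# Sketch: fall back to a single page when the pagination widget is missing or incomplete
pagination = self.browser.find_elements_by_class_name('pagination.page.width-auto')
if pagination:
    b_tags = pagination[0].find_elements_by_tag_name('b')
    pageSize = int(b_tags[1].text) if len(b_tags) > 1 else 1
else:
    pageSize = 1
```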

Once we have the number of pages, we loop over each page:

```python
for i in range(pageSize):
```

Inside the loop, we locate the table and pull out its contents:

```python
lst = []  # store the table contents as a flat list
element = self.browser.find_element_by_tag_name('tbody')  # locate the table body
tr_tags = element.find_elements_by_tag_name('tr')  # all table rows
for tr in tr_tags:
    td_tags = tr.find_elements_by_tag_name('td')
    for td in td_tags[:4]:  # extract the first 4 columns
        lst.append(td.text)
```
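
To make the next step easier to follow: after this loop, lst is a flat list of cell texts, four entries per table row. The values below are invented and only show the shape of the data:

```python
# Invented illustration of what lst might hold after scraping one page
lst_example = ['Zhang San', 'wx_001', 'Beijing', '2021-05-01',
               'Li Si', 'wx_002', 'Shanghai', '2021-05-02']
# Four cell texts per table row, so the next step slices it into rows of four
```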

After extracting a page's content, we split it into rows and keep appending it to the Excel file:

```python
col = 4  # four columns per record
lst = [lst[i:i + col] for i in range(0, len(lst), col)]  # split the flat list into rows of 4
df = pd.DataFrame(lst)  # list to DataFrame
df.to_excel('demo.xlsx', sheet_name='sheet_1', index=False)

book = load_workbook('demo.xlsx')
writer = pd.ExcelWriter('demo.xlsx', engine='openpyxl')
writer.book = book
writer.sheets = dict((ws.title, ws) for ws in book.worksheets)
df.to_excel(writer, sheet_name='sheet_1', index=False, startrow=row, header=False)
writer.save()
time.sleep(1)  # pause for a second so the local Sample is not hit with too many requests at once
row = row + 10  # record the number of rows already stored in Excel
```
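
Two things worth flagging that the snippets above do not show: the row counter has to be initialised before the pagination loop starts, and row = row + 10 assumes every page displays exactly ten rows. A sketch of an alternative (my own suggestion, not the author's code) is to start row at 0 and increment it by the number of rows actually written:

```python
row = 0  # assumed initial offset, set once before the for loop

# ...inside the loop, after writing this page's DataFrame:
row = row + len(lst)  # lst now holds this page's rows, so count them instead of assuming 10
```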

Once the content is saved, we click the next-page button, and so on until the loop finishes and all of our data has been captured:

```python
self.browser.find_element_by_class_name('next').click()
```

Verify and test

The above shows the captured and saved content. I have not yet recorded a video for this second article, so I hope you will verify it yourself; I can assure you that this is code I have tested personally.
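
If you want a quick way to verify the result yourself, one option (a sketch assuming the output file is demo.xlsx, as above) is to read the workbook back with pandas and inspect it:

```python
import pandas as pd

df = pd.read_excel('demo.xlsx', sheet_name='sheet_1', header=None)
print(df.shape)   # (total rows captured, 4)
print(df.head())  # first few captured rows
```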

Conclusion

  • Click here for the first article: Implementation of Simulated Crawler Login using Python + Selenium
  • The Excel-saving tool used this time is pandas
  • This is a very simple crawler and I think we’ll do more complex machine learning next time, so stay tuned
  • If you need complete code, please contact me