This is the 29th day of my participation in the First Challenge 2022
I. Foreword
Today I'm sharing a crawler exercise: crawling images from a medical product website, storing them locally by category, and saving the data to a MySQL database.
Reading this article takes about 10 to 20 minutes, and you will learn: Xpath syntax in practice, ideas for page-turning and multi-page crawling, and three ways to store data: downloading to local disk, saving to a MySQL database, and saving to a local CSV file.
II. Review of basic knowledge
1. Basic use of Xpath
Installation method
Recommended: install from the Douban mirror (other installation methods can be found via Baidu):
pip install -i https://pypi.douban.com/simple lxml
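Before the real pages, a minimal self-contained warm-up may help. The HTML snippet below is made up for illustration, but the selector pattern is exactly the one used later in this article:

from lxml import etree

# a made-up page fragment, shaped like the product lists crawled below
html = """
<div id="ddbd"><form>
  <dl><dd>x</dd><dd><a href="/product/p-1.html">A</a></dd></dl>
  <dl><dd>x</dd><dd><a href="/product/p-2.html">B</a></dd></dl>
</form></div>
"""
tree = etree.HTML(html)
# take the href of the <a> inside the 2nd <dd> of every <dl>
print(tree.xpath('//*[@id="ddbd"]/form/dl/dd[2]/a/@href'))
# -> ['/product/p-1.html', '/product/p-2.html']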
2. Pymysql for database operations
1) Installation method
Recommended: install from the Douban mirror (other installation methods can be found via Baidu):
pip install -i https://pypi.douban.com/simple/ pymysql
2) Basic usage introduction
import pymysql

# database connection
conn = pymysql.connect(host="localhost", port=3306, user="Your database login name",
                       password="Your database password", charset="utf8", database="Your database name")
# get a cursor with the cursor() method
cur = conn.cursor()
# execute a SQL statement (insert, delete, select, update)
cur.execute(sql)
# commit the session
conn.commit()
# close the database connection
conn.close()
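As a usage sketch (the table demo and its columns are hypothetical, just for illustration): passing values as the second argument of execute() lets pymysql do the escaping for you, which is safer than building SQL strings by hand:

import pymysql

conn = pymysql.connect(host="localhost", port=3306, user="Your database login name",
                       password="Your database password", charset="utf8", database="Your database name")
cur = conn.cursor()
# hypothetical table demo(name, url); the %s placeholders are filled in safely by pymysql
cur.execute("INSERT INTO demo (name, url) VALUES (%s, %s)", ("gauze", "http://example.com/1.jpg"))
conn.commit()
cur.close()
conn.close()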
III. Read the code, type along, and drill Xpath on a real project
1. A diagram of the site we are going to crawl
The main page we crawl is http://www.med361.com. Under it there are many categories of medical products (a 1:n relationship in the figure), and under each category there are many products (again 1:n). Of course, each product's detail page contains further urls; we'll cover those in detail when we get to crawling them.
2. Visit the main page and use Xpath to get all the product category urls
(1) Basic code
"""Author: ... Objective: crawl images from a medical website."""
import requests
from lxml import etree

def get_respones_data(branch_url):
    # send the request with requests
    get_response = requests.get(branch_url)
    # convert the returned response to text (the whole page)
    get_data = get_response.text
    # parse the page
    a = etree.HTML(get_data)
    return a

# main page
mian_url = "http://www.med361.com"
# send a request and get the parsed, xpath-ready page
response_01 = get_respones_data(mian_url)
# urls of the different medical product categories
# first block of categories
branch_url_1 = response_01.xpath("/html/body/div[2]/div/div/div/div[1]/ul/li/a/@href")
# second block of categories
branch_url_2 = response_01.xpath("/html/body/div[2]/div/div/div/div[2]/ul/li/a/@href")
print(branch_url_1)
print(branch_url_2)
(2) Xpath path selection analysis
- Picture analysis:
- Text explanation:
# back "/ HTML/body/div [2] / div/div/div/div [1] / ul/li [1] / a", "/ HTML/body/div [2] / div/div/div/div [1] / ul/li [4] / a" "/ HTML/body/div [2] / div/div/div/div [1] / ul/li [10] / a" # page "/ HTML/body/div [2] / div/div/div/div [2] / ul/li [1] / a"Copy the code
The above are the Xpath paths of a few category entries I inspected, and the pattern is easy to spot: only the index of the last li tag changes. What we need is the href attribute of the a tag (the category page URL lives there), and the only difference between the first and second block of categories is the index of the outer div, so the Xpath paths are:
"/html/body/div[2]/div/div/div/div[1]/ul/li/a/@href"
and
"/html/body/div[2]/div/div/div/div[2]/ul/li/a/@href"
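A tiny sketch of the rule at work (made-up HTML): keeping the li index selects one node, dropping it selects them all:

from lxml import etree

html = "<ul><li><a href='/a'>1</a></li><li><a href='/b'>2</a></li><li><a href='/c'>3</a></li></ul>"
tree = etree.HTML(html)
print(tree.xpath("//ul/li[1]/a/@href"))  # ['/a'] -- with an index: a single node
print(tree.xpath("//ul/li/a/@href"))     # ['/a', '/b', '/c'] -- without: all matches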
(3) Running result (screenshot)
(4) Data revision
From the output it's easy to see that the urls we got are not what we expected: there is no http or www part. From analyzing the page jumps, we know that what we actually got is the tail of each category URL; the front part is http://www.med361.com. So we patch them up:
# merge all categories
branch_url = branch_url_1 + branch_url_2
# turn each URL into a form we can access directly
for i in range(len(branch_url)):
    branch_url[i] = mian_url + branch_url[i].replace('/..', '')
print(branch_url)
Running results: (screenshot)
So far, we're done with this part.
3. Visit a category page and use Xpath to get all the product detail urls
(1) Basic code
# URL of one medical product category
branch_url_01 = "http://www.med361.com/category/c-456-b0.html"
# send the request
response_02 = get_respones_data(branch_url_01)
# Xpath: get every individual product URL
url_list = response_02.xpath('//*[@id="ddbd"]/form/dl/dd[2]/a/@href')
# turn each URL into a form we can access directly
for i in range(len(url_list)):
    url_list[i] = mian_url + url_list[i]
print(url_list)
(2) Xpath path selection analysis
- Picture analysis:
- Text explanation:
"//*[@id="ddbd"]/form/dl[1]/dd[2]/a"
"//*[@id="ddbd"]/form/dl[3]/dd[2]/a"
"//*[@id="ddbd"]/form/dl[9]/dd[2]/a"
Copy the code
The above are the Xpath paths of several different products I inspected, and again the pattern is clear: only the index of the dl tag changes, and what we need is the href attribute of the a tag (the product detail URL lives there). Dropping the index gives the Xpath path:
'//*[@id="ddbd"]/form/dl/dd[2]/a/@href'
(3) Running result (screenshot)
(4) Supplement: Turn the page
- Picture demonstration: (screenshot of the pager links)
- Code implementation
import time

# URL of one medical product category
branch_url_01 = "http://www.med361.com/category/c-456-b0.html"
# send the request
response_02 = get_respones_data(branch_url_01)
# Xpath: get every individual product URL
url_list = response_02.xpath('//*[@id="ddbd"]/form/dl/dd[2]/a/@href')
# Xpath: get all page-turning urls
url_paging = response_02.xpath('//*[@id="pager"]/a/@href')
# turn each page-turning URL into a form we can access directly
for i in range(len(url_paging)):
    url_paging[i] = mian_url + url_paging[i].replace('/..', '')
# turn each product URL into a form we can access directly
for i in range(len(url_list)):
    url_list[i] = mian_url + url_list[i]
print(url_list)
for i in range(len(url_paging)):
    time.sleep(1)
    # send the request (to the page-turning URL, not the first page again)
    response_03 = get_respones_data(url_paging[i])
    # Xpath: get every individual product URL
    url_list = response_03.xpath('//*[@id="ddbd"]/form/dl/dd[2]/a/@href')
    # turn each product URL into a form we can access directly
    for j in range(len(url_list)):
        url_list[j] = mian_url + url_list[j]
    print(url_list)
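One detail worth guarding against (a sketch, assuming the product urls were collected into a list url_list_all as in the integrated code below): pager blocks often repeat links, so it doesn't hurt to deduplicate while preserving order:

# deduplicate the collected product urls while keeping their order
seen = set()
url_list_all = [u for u in url_list_all if not (u in seen or seen.add(u))]
print(len(url_list_all), "unique product urls")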
So far, we’ve done that as well.
4. Visit a single product page and use Xpath to get the product name and presentation image URL
(1) Basic code
url_one = "http://www.med361.com/product/p-4357.html"
response_04 = get_respones_data(url_one)
# 1. medical equipment name
m_e_name = response_04.xpath('//*[@id="product-intro"]/ul/li[1]/h1/text()')[0].strip()
print("Name of medical device: " + m_e_name)
# 2. introduction pictures
picture_i = response_04.xpath('//*[@id="content"]/div/div[3]/div[2]/div/div/div[1]/div[2]/div/img/@src')
# fix incomplete image links
for i in range(len(picture_i)):
    if mian_url not in picture_i[i]:
        picture_i[i] = mian_url + picture_i[i]
print(picture_i)
(2) Xpath path selection analysis
- Picture analysis:
- Text explanation:
'//*[@id="content"]/div/div[3]/div[2]/div/div/div[1]/div[2]/div[4]/img'
'//*[@id="content"]/div/div[3]/div[2]/div/div/div[1]/div[2]/div[5]/img'
'//*[@id="content"]/div/div[3]/div[2]/div/div/div[1]/div[2]/div[6]/img'
The above are the Xpath paths of several introduction pictures I inspected. Again, only the index of the last div tag changes, and what we need is the src attribute of the img tag (the image URL lives there), so the Xpath path is:
'//*[@id="content"]/div/div[3]/div[2]/div/div/div[1]/div[2]/div/img/@src'
(3) Running result (screenshot)
So far, we’ve done that as well.
5. Integrating 2, 3 and 4 above: crawl the names and picture urls of all products in all categories
(1) Basic code
import requests
from lxml import etree
import time, random

# load the checked, usable proxy IPs
with open("new_http.txt", encoding="utf-8") as file:
    t0 = file.read()
    s0 = t0.split(",")

def get_respones_data(branch_url):
    # pick a random proxy IP
    i = random.randint(0, len(s0) - 2)
    proxies = {
        "http": s0[i]
    }
    # send the request through the proxy
    get_response = requests.get(branch_url, proxies=proxies)
    # convert the returned response to text (the whole page)
    get_data = get_response.text
    # parse the page
    a = etree.HTML(get_data)
    return a

# main page
mian_url = "http://www.med361.com"
# send the request
response_01 = get_respones_data(mian_url)
# urls of the different medical product categories
branch_url_1 = response_01.xpath("/html/body/div[2]/div/div/div/div[1]/ul/li/a/@href")
branch_url_2 = response_01.xpath("/html/body/div[2]/div/div/div/div[2]/ul/li/a/@href")
# merge all categories
branch_url = branch_url_1 + branch_url_2
# turn each URL into a form we can access directly
for i in range(len(branch_url)):
    branch_url[i] = mian_url + branch_url[i].replace('/..', '')
print(branch_url)  # all categories
# collected product names
commodity_name = []
# collected product introduction pictures
commodity_intr = []
for i in range(len(branch_url)):  # loop over categories
    time.sleep(random.randint(1, 3))
    response_02 = get_respones_data(branch_url[i])
    url_list_all = []
    # Xpath: get every individual product URL
    url_list = response_02.xpath('//*[@id="ddbd"]/form/dl/dd[2]/a/@href')
    # Xpath: get all page-turning urls
    url_paging = response_02.xpath('//*[@id="pager"]/a/@href')
    # turn each page-turning URL into a form we can access directly
    for j in range(len(url_paging)):
        url_paging[j] = mian_url + url_paging[j].replace('/..', '')
    # turn each product URL into a form we can access directly
    for j in range(len(url_list)):
        url_list[j] = mian_url + url_list[j]
    url_list_all = url_list
    # page-turning within a single category
    for n in range(len(url_paging)):
        time.sleep(1)
        # send the request
        response_03 = get_respones_data(url_paging[n])
        # Xpath: get every individual product URL
        url_list = response_03.xpath('//*[@id="ddbd"]/form/dl/dd[2]/a/@href')
        # turn each product URL into a form we can access directly
        for j in range(len(url_list)):
            url_list[j] = mian_url + url_list[j]
        url_list_all = url_list_all + url_list  # all product urls of this category
    for m in range(len(url_list_all)):
        time.sleep(1)
        response_03 = get_respones_data(url_list_all[m])
        # 1. medical equipment name
        m_e_name = response_03.xpath('//*[@id="product-intro"]/ul/li[1]/h1/text()')[0].strip()
        commodity_name.append(m_e_name)  # collect the product name
        # print("medical device name: " + m_e_name)
        # 2. introduction pictures
        picture_i = response_03.xpath('//*[@id="content"]/div/div[3]/div[2]/div/div/div[1]/div[2]/div/img/@src')
        # fix incomplete image links
        for k in range(len(picture_i)):
            if mian_url not in picture_i[k]:
                picture_i[k] = mian_url + picture_i[k]
        commodity_intr.append(picture_i)
        # print(picture_i)
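One assumption worth spelling out: new_http.txt is expected to hold already-checked proxies as a single comma-separated string (that is what t0.split(",") implies). A minimal sketch of producing such a file, with placeholder addresses:

# write a checked proxy list in the comma-separated format the crawler reads back
proxies_checked = ["118.24.0.1:8080", "47.100.0.2:3128", "59.38.0.3:9999"]  # placeholders, not real proxies
with open("new_http.txt", "w", encoding="utf-8") as f:
    f.write(",".join(proxies_checked))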
(2) Download the file and store it locally
import os  # requests and time are already imported in the main program

# image download function
def download_pictures(folder_name, picture_address):
    """folder_name: folder name (we use the product name); picture_address: a list of image links."""
    # a Medical folder must already exist on drive G
    file_path = r'G:\Medical\{0}'.format(folder_name)
    if not os.path.exists(file_path):
        # create a new folder for this product
        os.mkdir(os.path.join(r'G:\Medical', folder_name))
    # download the images and save them into the new folder
    for i in range(len(picture_address)):
        # download the file ('wb': write in binary mode)
        with open(r'G:\Medical\{0}\0{1}.jpg'.format(folder_name, i + 1), 'wb') as f:
            time.sleep(1)
            # request the download link to fetch the image
            response = requests.get(picture_address[i])
            f.write(response.content)
Add the code to the correct position and call the function.
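As a sketch of that "correct position": inside the product loop of step 5, right after picture_i has been repaired, something like this would do:

# inside the loop over url_list_all, after picture_i is built:
commodity_intr.append(picture_i)
# one folder per product, named after the product
download_pictures(m_e_name, picture_i)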
(3) Store it in the MySQL database
- First create the table in MySQL (here in a database named medical):
CREATE TABLE `medical`.`data_med` (
`id` INT NOT NULL AUTO_INCREMENT,
`med_name` VARCHAR(200) NULL,
`url_01` VARCHAR(200) NULL,
`url_02` VARCHAR(200) NULL,
`url_03` VARCHAR(200) NULL,
`url_04` VARCHAR(200) NULL,
`url_05` VARCHAR(200) NULL,
`url_06` VARCHAR(200) NULL,
`url_07` VARCHAR(200) NULL,
`url_08` VARCHAR(200) NULL,
`url_09` VARCHAR(200) NULL,
`url_10` VARCHAR(200) NULL,
PRIMARY KEY (`id`),
UNIQUE INDEX `id_UNIQUE` (`id` ASC))
ENGINE = InnoDB
DEFAULT CHARACTER SET = utf8;
- Save data to database:
# database connection
conn = pymysql.connect(host="localhost", port=3306, user="Your database login name",
                       password="Your database password", charset="utf8", database="Your database name")

def sql_insert(sql):
    cur = conn.cursor()
    cur.execute(sql)
    conn.commit()
Add the code to the correct position and call the function.
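And a sketch of calling it from the product loop (note: for anything beyond a toy crawl, the parameterized style shown in section II is safer than formatting values into the SQL string):

# inside the product loop: pad the image-url list to exactly 10 columns and build the INSERT
urls = (picture_i + [''] * 10)[:10]
sql = ("INSERT INTO data_med (med_name, url_01, url_02, url_03, url_04, url_05, "
       "url_06, url_07, url_08, url_09, url_10) VALUES ('{0}', '{1}', '{2}', '{3}', '{4}', "
       "'{5}', '{6}', '{7}', '{8}', '{9}', '{10}')").format(m_e_name, *urls)
sql_insert(sql)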
(4) Store the field contents in a csv file
import os
import csv
import pandas as pd

# save to a CSV file
def file_do(list_info):
    """list_info: a list of rows to store."""
    # write the header only on the first write (file missing or still empty)
    if not os.path.exists(r'G:\medical.csv') or os.path.getsize(r'G:\medical.csv') == 0:
        # header
        name = ['Name Introduction', 'url_01', 'url_02', 'url_03', 'url_04', 'url_05',
                'url_06', 'url_07', 'url_08', 'url_09', 'url_10']
        # build a DataFrame and write it out with the header
        file_test = pd.DataFrame(columns=name, data=list_info)
        file_test.to_csv(r'G:\medical.csv', encoding='utf-8', index=False)
    else:
        with open(r'G:\medical.csv', 'a+', newline='') as file_test:
            # append to the file
            writer = csv.writer(file_test)
            writer.writerows(list_info)
Add the code to the correct position and call the function.
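A matching usage sketch: each row is the product name plus up to 10 image urls, so in the product loop you could build and append one row per product:

# inside the product loop: one 11-column row per product (name + up to 10 image urls)
row = [m_e_name] + (picture_i + [''] * 10)[:10]
file_do([row])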
(5) A quick look at the running effect (screenshot)
IV. Afterword
The code above may need some debugging, but the general idea is as described. I'm sharing this article mainly to show you one way of thinking about how I crawl a page; you don't have to type along with me line by line.
Persistence plus hard work equals results.
Likes, comments, shares, and follows are the four-way support that keeps this going; original writing is not easy. Ok, see you next time. I love cats and love technology, and even more I love Sisi's older cousin Da Mian ଘ(˙꒳˙)ଓ Didi