The goal
- Get all article titles, links, view counts, and comment counts
- Store them in a format that pandas can read
Analysis of the page
Page navigation
Home page: http://blog.csdn.net/fontthrone?viewmode=list. Page 2: http://blog.csdn.net/FontThrone/article/list/2, and pages three, four, and so on follow the same pattern. Trying that pattern for page 1 also works: http://blog.csdn.net/FontThrone/article/list/1 shows the same content as http://blog.csdn.net/fontthrone?viewmode=list.
So to reach a different page, we only need to change the number at the end of the link. Really simple =- =
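For example, a throwaway sketch that prints the first few page URLs (the account name is the one used later in this post):

account = 'fontthrone'
for page in range(1, 4):
    # pages differ only in the trailing number
    print 'http://blog.csdn.net/' + account + '/article/list/' + str(page)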
Overall page composition
The page structure (by CSS class) is shown below:
- article_list
  - list_item article_item
    - article_title (title)
      - h1
        - link_title
          - a
    - article_description
    - article_manage
      - link_postdate (date)
      - link_view (number of views)
      - link_comments (number of comments)
      - link_edit (edit link)
    - clear
We first get article_list, and then loop through each list_item article_item to extract its information.
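In outline, that traversal looks like this (just a sketch; the concrete extraction steps follow below):

# `html` is the BeautifulSoup object built in step 1 below
blog_list = html.select('.list_item_new > #article_list > .article_item')
for house in blog_list:  # one list_item article_item per article
    # pull the title, link, view count, and comment count out of `house`
    pass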
Details
- Parsing web pages with BS4 reduces the workload
- The href of an a tag is obtained with ['href']
- For a span that contains both an a tag and text, the text is extracted directly with re (see the sketch after this list)
- Pay attention to encoding: 1. the encoding of the content fetched from the web page; 2. the default encoding of the .py file
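A minimal, self-contained sketch of the href and regex points above, using a hypothetical HTML fragment shaped like CSDN's markup (the href value here is made up):

from bs4 import BeautifulSoup
import re

fragment = '<span class="link_view"><a href="/fontthrone">views</a>(123)</span>'
soup = BeautifulSoup(fragment, 'html.parser')
print soup.select('a')[0]['href']  # ['href'] gets the link -> /fontthrone
link_view = str(soup.select('.link_view')[0])
# first isolate '(123)' with \(\d+\), then pull the bare digits with \d+
print re.search(r'\d+', re.search(r'\(\d+\)', link_view).group()).group()  # -> 123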
The code
How each piece of information is obtained
- 1. Obtain article_list
html = BeautifulSoup(response.text, 'html.parser')
blog_list = html.select('.list_item_new > #article_list > .article_item')
- 2. Get the article_title and link to the article
blog_title = house.select('.link_title > a')[0].string.encode('utf-8')
blog_title = str(blog_title.replace(' ', '').replace('\n', ''))
blog_url = urljoin(ADDR, house.select('.link_title > a')[0]['href'])
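urljoin simply resolves the article's root-relative href against the site root. For example (using one of the article paths listed later in this post):

from urlparse import urljoin
print urljoin('http://blog.csdn.net/', '/fontthrone/article/details/76675684')
# -> http://blog.csdn.net/fontthrone/article/details/76675684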
- 3. Obtain the view count (link_view) and the comment count (link_comments)
link_view = str(house.select('.article_manage > .link_view')[0])
blog_people = re.search(r'\d+', re.search(r'\(\d+\)', link_view).group()).group()
link_comment = str(house.select('.article_manage > .link_comments')[0])
blog_comment = re.search(r'\d+', re.search(r'\(\d+\)', link_comment).group()).group()
- 4. Write to a CSV file
import csv

with open('info.csv', 'wb') as f:
    csv_writer = csv.writer(f, delimiter=',')  # create the csv writer
    csv_writer.writerow(['blog_title', 'blog_url', 'blog_people', 'blog_comment'])  # write the header row
    # ... crawl ...
    csv_writer.writerow([blog_title, blog_url, blog_people, blog_comment])  # write one line per crawled article
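The first line of info.csv is then the header row, and each crawled article becomes one data row. For illustration (the title and counts here are made up; the URL is one from the list further down):

blog_title,blog_url,blog_people,blog_comment
SomeTitle,http://blog.csdn.net/fontthrone/article/details/76675684,123,4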
CODE
The real CODE
Original script reference: blog.csdn.net/fontthrone/…
# -*- coding: utf-8 -*-
# Created by FontTian
# http://blog.csdn.net/fontthrone
from bs4 import BeautifulSoup
from urlparse import urljoin
import requests
import csv
import re
import sys

reload(sys)
sys.setdefaultencoding('utf8')

# account = str(raw_input('Enter the CSDN account to crawl: '))
account = 'fontthrone'
URL = 'http://blog.csdn.net/' + account
ADDR = 'http://blog.csdn.net/'
start_page = 0

with open('info.csv', 'wb') as f:
    csv_writer = csv.writer(f, delimiter=',')
    csv_writer.writerow(['blog_title', 'blog_url', 'blog_people', 'blog_comment'])
    print 'starting'
    while True:
        start_page += 1
        URL2 = URL + '/article/list/' + str(start_page)
        print URL2
        response = requests.get(URL2)
        html = BeautifulSoup(response.text, 'html.parser')
        # print html
        blog_list = html.select('.list_item_new > #article_list > .article_item')
        # check blog_list: an empty page means we have passed the last one
        if not blog_list:
            print 'No blog_list'
            break
        for house in blog_list:
            blog_title = house.select('.link_title > a')[0].string.encode('utf-8')
            blog_title = str(blog_title.replace(' ', '').replace('\n', ''))
            link_view = str(house.select('.article_manage > .link_view')[0])
            blog_people = re.search(r'\d+', re.search(r'\(\d+\)', link_view).group()).group()
            link_comment = str(house.select('.article_manage > .link_comments')[0])
            blog_comment = re.search(r'\d+', re.search(r'\(\d+\)', link_comment).group()).group()
            blog_url = urljoin(ADDR, house.select('.link_title > a')[0]['href'])
            csv_writer.writerow([blog_title, blog_url, blog_people, blog_comment])

print 'ending'
Running result
Updating the original view-refreshing script with pandas
Instructions
This code is for reference only. You are not advised to use this script to inflate view counts.
CODE
# blog_url =[
# 'http://blog.csdn.net/fontthrone/article/details/76675684',
# 'http://blog.csdn.net/FontThrone/article/details/76652772',
# 'http://blog.csdn.net/FontThrone/article/details/76652762',
# 'http://blog.csdn.net/FontThrone/article/details/76652753',
# 'http://blog.csdn.net/FontThrone/article/details/76652257',
# 'http://blog.csdn.net/fontthrone/article/details/76735591',
# 'http://blog.csdn.net/FontThrone/article/details/76728083',
# 'http://blog.csdn.net/FontThrone/article/details/76727466',
# 'http://blog.csdn.net/FontThrone/article/details/76727412',
# 'http://blog.csdn.net/FontThrone/article/details/76695555',
# 'http://blog.csdn.net/fontthrone/article/details/75805923',
# ]
import pandas as pd
df1 = pd.DataFrame(pd.read_csv('info.csv'))
blog_url = list(df1['blog_url'])
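A quick, optional sanity check after loading (just a usage sketch):

print df1.head()     # first few rows of the crawled table
print len(blog_url)  # how many article URLs were read back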
Supplement
- Now that generation 1.5 is here, can generation 2 be far behind?
- Hey, guys, how about a 666 on the way out.