The goal

  • Get all article titles, links, view counts, and comment counts
  • Store them in a format that pandas can read

Analysis of the page

Page jump

Home page: http://blog.csdn.net/fontthrone?viewmode=list
Page 2: http://blog.csdn.net/FontThrone/article/list/2
Pages three, four, and so on follow the same format. To confirm the pattern for page 1, try http://blog.csdn.net/FontThrone/article/list/1 — it shows the same content as http://blog.csdn.net/fontthrone?viewmode=list, so page 1 is simply http://blog.csdn.net/FontThrone/article/list/1.

So to fetch a different page we just need to change the number at the end of the link. Really simple =- =
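Just to make the pattern concrete, here is a minimal sketch of building these page URLs (the helper name build_list_url is my own, not from the original):

# minimal sketch: build the URL of the n-th article-list page
# (build_list_url is a hypothetical helper, not part of the original script)
def build_list_url(account, page):
    return 'http://blog.csdn.net/' + account + '/article/list/' + str(page)

print build_list_url('fontthrone', 2)
# -> http://blog.csdn.net/fontthrone/article/list/2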

Overall page composition



The page structure (by CSS class) is shown below:

– article_list
– – list_item article_item
– – – article_title (Title)
– – – – h1
– – – – – link_title
– – – – – – a
– – – article_description
– – – article_manage
– – – – link_postdate (Date)
– – – – link_view (view count)
– – – – link_comments (comment count)
– – – – link_edit (Edit)
– – – clear

We first get the article_list, then loop over each list_item article_item to extract the information; a self-contained sketch follows.
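This sketch maps the class tree above to BS4 selectors; the inline HTML is a made-up miniature of the real page, for illustration only:

# sketch: the class tree above, expressed as BS4 selectors
# (the SAMPLE HTML is invented; the real page has the same class layout)
from bs4 import BeautifulSoup

SAMPLE = '''
<div class="list_item_new"><div id="article_list">
  <div class="list_item article_item">
    <div class="article_title"><h1><span class="link_title">
      <a href="/FontThrone/article/details/1">Demo title</a></span></h1></div>
    <div class="article_manage">
      <span class="link_view">Views(123)</span>
      <span class="link_comments">Comments(4)</span>
    </div>
  </div>
</div></div>'''

html = BeautifulSoup(SAMPLE, 'html.parser')
for house in html.select('.list_item_new > #article_list > .article_item'):
    print house.select('.link_title > a')[0]['href']   # /FontThrone/article/details/1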

Details

  • Parsing the pages with BS4 greatly reduces the workload
  • The href of an a tag is read with ['href']
  • For a span containing an a tag plus text, the number is pulled directly out of the text with re (see the sketch below)
  • Pay attention to encoding: 1. the encoding of the content fetched from the page; 2. the default encoding of the .py file
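A small sketch of the last three points (the sample markup is invented for illustration):

# -*- coding: utf-8 -*-
# sketch of the extraction tricks listed above; the sample strings are invented
import re
from bs4 import BeautifulSoup

a = BeautifulSoup('<a href="/FontThrone/article/list/2">next</a>', 'html.parser').a
print a['href']   # ['href'] reads the attribute: /FontThrone/article/list/2

link_view = '<span class="link_view">Views(123)</span>'
# first grab the "(123)" part, then match the digits inside it
print re.search(r'\d+', re.search(r'\(\d+\)', link_view).group()).group()   # 123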

Code walkthrough

How to obtain each piece of information

  • 1. Obtain article_list
        html = BeautifulSoup(response.text, 'html.parser')
        blog_list = html.select('.list_item_new > #article_list > .article_item')
  • 2. Get the article_title and link to the article
            blog_title = house.select('.link_title > a')[0].string.encode('utf-8')
            blog_title = str(blog_title.replace(' ', '').replace('\n', ''))
            blog_url = urljoin(ADDR, house.select('.link_title > a')[0]['href'])
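For reference, urljoin resolves the relative href from the page against the site root, e.g.:

# urljoin turns the relative href into an absolute article URL
from urlparse import urljoin   # Python 2; in Python 3 it lives in urllib.parse

print urljoin('http://blog.csdn.net/', '/FontThrone/article/details/76675684')
# -> http://blog.csdn.net/FontThrone/article/details/76675684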
  • 3. Obtain the link_view number and link_comments number
            link_view = str(house.select('.article_manage > .link_view')[0])
            # the raw span text looks like "Views(123)": grab "(123)", then the digits
            blog_people = re.search(r'\d+', re.search(r'\(\d+\)', link_view).group()).group()
            link_comment = str(house.select('.article_manage > .link_comments')[0])
            blog_comment = re.search(r'\d+', re.search(r'\(\d+\)', link_comment).group()).group()

Write to a CSV file

import csv

with open('info.csv', 'wb') as f:
    csv_writer = csv.writer(f, delimiter=',')
    # write the header row first
    csv_writer.writerow(['blog_title', 'blog_url', 'blog_people', 'blog_comment'])
    # then write the information we crawled, one row per article
    csv_writer.writerow([blog_title, blog_url, blog_people, blog_comment])
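Note that opening the file in 'wb' mode is Python 2 specific. If you port this to Python 3, a rough equivalent (my adaptation, not from the original post) would be:

# Python 3 adaptation (my own, not from the original):
# open in text mode with newline='' so the csv module controls line endings
import csv

with open('info.csv', 'w', newline='', encoding='utf-8') as f:
    csv_writer = csv.writer(f, delimiter=',')
    csv_writer.writerow(['blog_title', 'blog_url', 'blog_people', 'blog_comment'])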

CODE

The real CODE

Reference for the original bot: blog.csdn.net/fontthrone/…

# -*- coding: utf-8 -*-
# Created by FontTian
# http://blog.csdn.net/fontthrone
from bs4 import BeautifulSoup
from urlparse import urljoin
import requests
import csv
import re
import sys

reload(sys)
sys.setdefaultencoding('utf8')

# account = str(raw_input('Enter the CSDN account to crawl: '))
account = 'fontthrone'
URL = 'http://blog.csdn.net/' + account
ADDR = 'http://blog.csdn.net/'
start_page = 0

with open('info.csv', 'wb') as f:
    csv_writer = csv.writer(f, delimiter=',')
    csv_writer.writerow(['blog_title', 'blog_url', 'blog_people', 'blog_comment'])
    print 'starting'
    while True:
        start_page += 1
        URL2 = URL + '/article/list/' + str(start_page)
        print URL2
        response = requests.get(URL2)
        html = BeautifulSoup(response.text, 'html.parser')
        # print html
        blog_list = html.select('.list_item_new > #article_list > .article_item')
        # an empty page means we have run past the last list page
        if not blog_list:
            print 'No blog_list'
            break
        for house in blog_list:
            blog_title = house.select('.link_title > a')[0].string.encode('utf-8')
            blog_title = str(blog_title.replace(' ', '').replace('\n', ''))
            link_view = str(house.select('.article_manage > .link_view')[0])
            blog_people = re.search(r'\d+', re.search(r'\(\d+\)', link_view).group()).group()
            link_comment = str(house.select('.article_manage > .link_comments')[0])
            blog_comment = re.search(r'\d+', re.search(r'\(\d+\)', link_comment).group()).group()
            blog_url = urljoin(ADDR, house.select('.link_title > a')[0]['href'])
            csv_writer.writerow([blog_title, blog_url, blog_people, blog_comment])
    print 'ending'
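The script above is Python 2 throughout (print statements, urlparse, reload(sys)). If you want to try it on Python 3, the setup section would change roughly as follows (my adaptation, untested against today's site, which may have changed since the post was written):

# Python 3 setup sketch (my adaptation, not from the original)
from bs4 import BeautifulSoup
from urllib.parse import urljoin   # urlparse was renamed in Python 3
import requests
import csv
import re
# no reload(sys)/sys.setdefaultencoding needed: str is already unicode,
# and the .encode('utf-8') on blog_title can be dropped

account = 'fontthrone'
URL = 'http://blog.csdn.net/' + account
ADDR = 'http://blog.csdn.net/'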

Running result

Combining with pandas to update the original view-refreshing bot

Instructions

This code is for reference only; you are advised not to use it to inflate your view counts.

CODE

# blog_url =[
#     'http://blog.csdn.net/fontthrone/article/details/76675684',
#     'http://blog.csdn.net/FontThrone/article/details/76652772',
#     'http://blog.csdn.net/FontThrone/article/details/76652762',
#     'http://blog.csdn.net/FontThrone/article/details/76652753',
#     'http://blog.csdn.net/FontThrone/article/details/76652257',
#     'http://blog.csdn.net/fontthrone/article/details/76735591',
#     'http://blog.csdn.net/FontThrone/article/details/76728083',
#     'http://blog.csdn.net/FontThrone/article/details/76727466',
#     'http://blog.csdn.net/FontThrone/article/details/76727412',
#     'http://blog.csdn.net/FontThrone/article/details/76695555',
#     'http://blog.csdn.net/fontthrone/article/details/75805923',
# ]

import pandas as pd

# read_csv already returns a DataFrame, so the extra pd.DataFrame() wrapper
# in the original is redundant
df1 = pd.read_csv('info.csv')

# the list of article URLs collected by the scraper
blog_url = list(df1['blog_url'])
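Once the CSV is loaded, pandas makes quick checks easy; for example (a sketch, assuming the column names written by the scraper and a pandas version with sort_values):

import pandas as pd

df1 = pd.read_csv('info.csv')

# make sure view counts are numeric before sorting
# (read_csv usually infers this already)
df1['blog_people'] = df1['blog_people'].astype(int)
print(df1.sort_values('blog_people', ascending=False).head())   # top posts by views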

Supplement

  • With generation 1.5 here, can generation 2 be far behind?
  • Hey guys, give it a 666.