The goal
- Get all article titles, links, view counts, and comment counts
- Store them in a format that pandas can read
Analysis of the page
Page navigation
Home page: http://blog.csdn.net/fontthrone?viewmode=list. Page 2: http://blog.csdn.net/FontThrone/article/list/2, and pages three, four, and so on follow the same pattern. Trying that pattern for page 1 also works: http://blog.csdn.net/FontThrone/article/list/1 shows the same content as http://blog.csdn.net/fontthrone?viewmode=list.
So to reach a different page, we only need to change the number at the end of the link. Really simple =- =
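For example, a throwaway sketch that prints the first few page URLs (the account name is the one used later in this post):

account = 'fontthrone'
for page in range(1, 4):
    # pages differ only in the trailing number
    print 'http://blog.csdn.net/' + account + '/article/list/' + str(page)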
Overall page composition
The page structure (by CSS class) is shown below:
- article_list
  - list_item article_item
    - article_title (title)
      - h1
        - link_title
          - a
    - article_description
    - article_manage
      - link_postdate (date)
      - link_view (number of views)
      - link_comments (number of comments)
      - link_edit (edit link)
    - clear
We first get article_list, and then loop through each list_item article_item to extract its information.
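In outline, that traversal looks like this (just a sketch; the concrete extraction steps follow below):

# `html` is the BeautifulSoup object built in step 1 below
blog_list = html.select('.list_item_new > #article_list > .article_item')
for house in blog_list:  # one list_item article_item per article
    # pull the title, link, view count, and comment count out of `house`
    pass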
Details
- Parsing web pages with BS4 reduces the workload
- The href of an a tag is obtained with ['href']
- For a span that contains both an a tag and text, the text is extracted directly with re (see the sketch after this list)
- Pay attention to encoding: 1. the encoding of the content fetched from the web page; 2. the default encoding of the .py file
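A minimal, self-contained sketch of the href and regex points above, using a hypothetical HTML fragment shaped like CSDN's markup (the href value here is made up):

from bs4 import BeautifulSoup
import re

fragment = '<span class="link_view"><a href="/fontthrone">views</a>(123)</span>'
soup = BeautifulSoup(fragment, 'html.parser')
print soup.select('a')[0]['href']  # ['href'] gets the link -> /fontthrone
link_view = str(soup.select('.link_view')[0])
# first isolate '(123)' with \(\d+\), then pull the bare digits with \d+
print re.search(r'\d+', re.search(r'\(\d+\)', link_view).group()).group()  # -> 123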
The code
How each piece of information is obtained
- 1. Obtain article_list
html = BeautifulSoup(response.text, 'html.parser')
blog_list = html.select('.list_item_new > #article_list > .article_item')
- 2. Get the article_title and link to the article
blog_title = house.select('.link_title > a')[0].string.encode('utf-8')
blog_title = str(blog_title.replace(' ', '').replace('\n', ''))
blog_url = urljoin(ADDR, house.select('.link_title > a')[0]['href'])
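urljoin simply resolves the article's root-relative href against the site root. For example (using one of the article paths listed later in this post):

from urlparse import urljoin
print urljoin('http://blog.csdn.net/', '/fontthrone/article/details/76675684')
# -> http://blog.csdn.net/fontthrone/article/details/76675684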
- 3. Obtain the view count (link_view) and the comment count (link_comments)
link_view = str(house.select('.article_manage > .link_view')[0])
blog_people = re.search(r'\d+', re.search(r'\(\d+\)', link_view).group()).group()
link_comment = str(house.select('.article_manage > .link_comments')[0])
blog_comment = re.search(r'\d+', re.search(r'\(\d+\)', link_comment).group()).group()
- 4. Write to a CSV file
import csv

with open('info.csv', 'wb') as f:
    csv_writer = csv.writer(f, delimiter=',')  # create the csv writer
    csv_writer.writerow(['blog_title', 'blog_url', 'blog_people', 'blog_comment'])  # write the header row
    # ... crawl ...
    csv_writer.writerow([blog_title, blog_url, blog_people, blog_comment])  # write one line per crawled article
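The first line of info.csv is then the header row, and each crawled article becomes one data row. For illustration (the title and counts here are made up; the URL is one from the list further down):

blog_title,blog_url,blog_people,blog_comment
SomeTitle,http://blog.csdn.net/fontthrone/article/details/76675684,123,4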
CODE
The real CODE
Original script reference: blog.csdn.net/fontthrone/…
# -*- coding: utf-8 -*-
# Created by FontTian
# http://blog.csdn.net/fontthrone
from bs4 import BeautifulSoup
from urlparse import urljoin
import requests
import csv
import re
import sys

reload(sys)
sys.setdefaultencoding('utf8')

# account = str(raw_input('Enter the CSDN account to crawl: '))
account = 'fontthrone'
URL = 'http://blog.csdn.net/' + account
ADDR = 'http://blog.csdn.net/'
start_page = 0

with open('info.csv', 'wb') as f:
    csv_writer = csv.writer(f, delimiter=',')
    csv_writer.writerow(['blog_title', 'blog_url', 'blog_people', 'blog_comment'])
    print 'starting'
    while True:
        start_page += 1
        URL2 = URL + '/article/list/' + str(start_page)
        print URL2
        response = requests.get(URL2)
        html = BeautifulSoup(response.text, 'html.parser')
        # print html
        blog_list = html.select('.list_item_new > #article_list > .article_item')
        # check blog_list: an empty page means we have passed the last one
        if not blog_list:
            print 'No blog_list'
            break
        for house in blog_list:
            blog_title = house.select('.link_title > a')[0].string.encode('utf-8')
            blog_title = str(blog_title.replace(' ', '').replace('\n', ''))
            link_view = str(house.select('.article_manage > .link_view')[0])
            blog_people = re.search(r'\d+', re.search(r'\(\d+\)', link_view).group()).group()
            link_comment = str(house.select('.article_manage > .link_comments')[0])
            blog_comment = re.search(r'\d+', re.search(r'\(\d+\)', link_comment).group()).group()
            blog_url = urljoin(ADDR, house.select('.link_title > a')[0]['href'])
            csv_writer.writerow([blog_title, blog_url, blog_people, blog_comment])

print 'ending'
Running result
Updating the original view-refreshing script with pandas
Instructions
This code is for reference only. You are not advised to use this script to inflate view counts.
CODE
# blog_url =[
# 'http://blog.csdn.net/fontthrone/article/details/76675684',
# 'http://blog.csdn.net/FontThrone/article/details/76652772',
# 'http://blog.csdn.net/FontThrone/article/details/76652762',
# 'http://blog.csdn.net/FontThrone/article/details/76652753',
# 'http://blog.csdn.net/FontThrone/article/details/76652257',
# 'http://blog.csdn.net/fontthrone/article/details/76735591',
# 'http://blog.csdn.net/FontThrone/article/details/76728083',
# 'http://blog.csdn.net/FontThrone/article/details/76727466',
# 'http://blog.csdn.net/FontThrone/article/details/76727412',
# 'http://blog.csdn.net/FontThrone/article/details/76695555',
# 'http://blog.csdn.net/fontthrone/article/details/75805923',
# ]
import pandas as pd
df1 = pd.DataFrame(pd.read_csv('info.csv'))
blog_url = list(df1['blog_url'])
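A quick, optional sanity check after loading (just a usage sketch):

print df1.head()     # first few rows of the crawled table
print len(blog_url)  # how many article URLs were read back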
Supplement
- Now that generation 1.5 is here, can generation 2 be far behind?
- Hey, guys, how about a 666 on the way out.