Introduction

For work, I needed to build a small tool for the company's front-end team: a Python crawler that collects WeChat articles from Sogou WeChat.

![](https://p6-tt-ipv6.byteimg.com/large/pgc-image/68e3ecd73875408ebebb5df1533d0cbe)

The columns run from "Hot" to "Fashion Circle", and each column has an option to load more content.

![](https://p6-tt-ipv6.byteimg.com/large/pgc-image/2308593f405340ff89f597be44246b07)

That adds up to 500+ articles.

Requirements

Crawl these articles to get each article's title and the picture on its right. Save each picture under a specified naming scheme to a specified folder, and write the article titles together with their corresponding picture names to Excel and TXT files.

Result

![](https://p6-tt-ipv6.byteimg.com/large/pgc-image/a336caa6606548c4be1e0f97a0872489)
![](https://p3-tt-ipv6.byteimg.com/large/pgc-image/196520d22d0543299a3a01fb88b2340e)
![](https://p9-tt-ipv6.byteimg.com/large/pgc-image/3adde97478634df5b2bfb197704d3b59)
![](https://p1-tt-ipv6.byteimg.com/large/pgc-image/00f3810593fe41368bbafabf480328de)

The complete code is as follows. The environment used these package versions:

    Package                    Version
    -------------------------- ---------
    altgraph                   0.17
    certifi                    2020.6.20
    chardet                    3.0.4
    future                     0.18.2
    idna                       2.10
    lxml                       4.5.2
    pefile                     2019.4.18
    pip                        19.0.3
    pyinstaller                4.0
    pyinstaller-hooks-contrib  2020.8
    pywin32-ctypes             0.2.0
    requests                   2.24.0
    setuptools                 40.8.0
    urllib3                    1.25.10
    XlsxWriter                 1.3.3
    xlwt                       1.3.0

    #!/usr/bin/python
    # -*- coding: UTF-8 -*-
    import os

    import requests
    import xlsxwriter
    from lxml import etree

    # Request headers for the WeChat article pages
    headers = {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
        'Accept-Encoding': 'gzip, deflate, br',
        'Accept-Language': 'zh-CN,zh;q=0.9',
        'Host': 'weixin.sogou.com',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.102 Safari/537.36',
    }

    # Request headers for downloading the images
    headers_images = {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
        'Accept-Encoding': 'gzip, deflate',
        'Accept-Language': 'zh-CN,zh;q=0.9',
        'Host': 'img01.sogoucdn.com',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.102 Safari/537.36',
    }

    # Running article count, also used to name the image files
    a = 0
    # Collected records of the form {'title': ..., 'id': ...}
    all = []

    # Create the root directory
    save_path = './wechat article'
    folder = os.path.exists(save_path)
    if not folder:
        os.makedirs(save_path)

    # Create the images folder
    images_path = '%s/image' % save_path
    folder = os.path.exists(images_path)
    if not folder:
        os.makedirs(images_path)
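The two existence checks above can be collapsed: `os.makedirs` accepts `exist_ok=True`, which creates any missing parent directories and does nothing if the directory already exists. A minimal sketch, assuming a temporary directory stands in for the real output folder (the `ensure_dirs` helper is hypothetical and not part of the original script):

```python
import os
import tempfile


def ensure_dirs(save_path):
    """Create save_path and its image subfolder; return the image folder path."""
    images_path = os.path.join(save_path, 'image')
    # exist_ok=True makes the call idempotent, so no os.path.exists check is needed
    os.makedirs(images_path, exist_ok=True)
    return images_path


# Demo in a throwaway directory so nothing is written next to the script
demo_root = os.path.join(tempfile.mkdtemp(), 'wechat article')
demo_images = ensure_dirs(demo_root)
ensure_dirs(demo_root)  # calling it a second time is harmless
print(os.path.isdir(demo_images))
```

Creating the nested image folder also creates the root folder, so one call covers both steps.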

    # i and j fill the two placeholders of the URL pattern
    # (apparently the column and its "load more" page)
    for i in range(1, 9):
        for j in range(1, 5):
            # NOTE: the URL pattern is truncated in the source text
            url = "weixin.sogou.com/pcindex/pc/…" % (i, j)
            # Request the Sogou article list and repair the text encoding
            response = requests.get(url=url, headers=headers).text.encode('iso-8859-1').decode('utf-8')
            # Build a parse tree; lxml automatically repairs malformed HTML
            html = etree.HTML(response)
            # Select every article <li> node with an XPath expression
            xpath = html.xpath('/html/body/li')
            for content in xpath:
                # Count the article
                a = a + 1
                # Article title
                title = content.xpath('./div[@class="txt-box"]/h3//text()')[0]
                article = {}
                article['title'] = title
                article['id'] = '%d.jpg' % a
                all.append(article)
                # Image URL (the src attribute has no scheme, so prepend one)
                path = 'http:' + content.xpath('./div[@class="img-box"]//img/@src')[0]
                # Download the article image
                images = requests.get(url=path, headers=headers_images).content
                try:
                    with open('%s/%d.jpg' % (images_path, a), "wb") as f:
                        print('downloading article %d' % a)
                        f.write(images)
                except Exception as e:
                    print('failed to download article image %s' % e)
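The `.text.encode('iso-8859-1').decode('utf-8')` chain in the request above repairs mis-decoded text. When a `text/*` response declares no charset, `requests` falls back to ISO-8859-1, so UTF-8 bytes come out garbled. Because ISO-8859-1 maps every byte to exactly one character, re-encoding recovers the original bytes, which can then be decoded as UTF-8. A self-contained sketch of that round trip, using a made-up sample title and no network access:

```python
# Sample title standing in for the page content (hypothetical, not from the site)
original = '微信文章'

# What the server sends: UTF-8 bytes
raw_bytes = original.encode('utf-8')

# What a client sees if it wrongly decodes those bytes as ISO-8859-1
garbled = raw_bytes.decode('iso-8859-1')

# The repair used in the script: recover the bytes, then decode them as UTF-8
repaired = garbled.encode('iso-8859-1').decode('utf-8')

print(repaired == original)
```

With `requests`, setting `response.encoding = 'utf-8'` before reading `response.text` should give the same result without the round trip.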

    # Store the information in Excel
    # Create a workbook
    workbook = xlsxwriter.Workbook('%s/Excel format.xlsx' % save_path)
    # Create a worksheet
    worksheet = workbook.add_worksheet()
    print('Generating Excel...')
    try:
        for i in range(0, len(all) + 1):
            # The first row holds the header
            if i == 0:
                worksheet.write(i, 0, 'title')
                worksheet.write(i, 1, 'id')
                continue
            worksheet.write(i, 0, all[i - 1]['title'])
            worksheet.write(i, 1, all[i - 1]['id'])
        workbook.close()
    except Exception as e:
        print('Failed to generate Excel %s' % e)
    else:
        # Report success only when no exception was raised
        print('Excel generated successfully')

    print('Creating TXT...')
    try:
        with open('%s/array.txt' % save_path, "w") as f:
            f.write(str(all))
    except Exception as e:
        print('Failed to generate TXT %s' % e)
    else:
        # Report success only when no exception was raised
        print('TXT generated successfully')

    print('Crawled %d articles in total' % a)
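One caveat about the TXT step: `str(all)` writes the Python repr of the list, which uses single-quoted strings and is therefore not valid JSON. If the file is meant to be read by front-end code, serializing with the standard `json` module may be a better fit. A minimal sketch, assuming two made-up records and a temporary output path (the `dump_articles` helper is hypothetical):

```python
import json
import os
import tempfile


def dump_articles(articles, path):
    """Write the article list as UTF-8 JSON, keeping non-ASCII titles readable."""
    with open(path, 'w', encoding='utf-8') as f:
        json.dump(articles, f, ensure_ascii=False, indent=2)


# Made-up records in the same shape the crawler builds
sample = [
    {'title': '示例标题', 'id': '1.jpg'},
    {'title': 'Second title', 'id': '2.jpg'},
]

out_path = os.path.join(tempfile.mkdtemp(), 'array.txt')
dump_articles(sample, out_path)

# Read the file back to confirm it round-trips
with open(out_path, encoding='utf-8') as f:
    loaded = json.load(f)
print(loaded == sample)
```

Passing `encoding='utf-8'` explicitly also avoids depending on the platform's default encoding when titles contain Chinese characters.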

Finally, package the program into an EXE file so that it can be run directly on Windows.

![](https://p26-tt.byteimg.com/large/pgc-image/c005adcdeb944248ac2f438812be0b10)


This article is a reprint; copyright belongs to the original author. In case of infringement, please contact the editor for removal.

Original address: blog.csdn.net/y1534414425…