1. Implementation idea

This time we crawl the politics section of Sohu News.

![](https://p1-tt-ipv6.byteimg.com/origin/pgc-image/7df50830cad94be5aee009e5c547974a)

The workflow is: get the URL, crawl each news title and its hyperlink, judge how well the title fits the topic, and output the final result.

2. Finding the URL pattern

The Sohu News page loads its content dynamically, but nothing is listed under F12 > Network > XHR, so the file that carries the content has to be located under the All tab instead.

![](https://p1-tt-ipv6.byteimg.com/origin/pgc-image/8af86ec85e1f4f75b2034e062846995e)

The file turns out to be a JS file.

![](https://p6-tt-ipv6.byteimg.com/origin/pgc-image/2cf9f83075834cbaab2b1027ab0a8782)

Compare the URLs of the four files whose names begin with feed:

![](https://p1-tt-ipv6.byteimg.com/origin/pgc-image/974da01dca4a49dcb5e873da0efcd4fd)

From page to page, the page parameter changes, the callback value changes irregularly, and the last number increases by 8. Removing callback has no effect on the returned content, so the final page-fetching code drops it and builds each URL by string concatenation:

    for p in range(1, 10):
        p2 = 1603263206992 + p * 8  # the last number grows by 8 per page
        url = 'https://v2.sohu.com/public-api/feed?scene=CATEGORY&sceneId=1460&page=' + str(p) + '&size=20&_=' + str(p2)
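As an alternative to manual concatenation, the same URLs can be sketched with urlencode from the standard library. The helper name build_feed_url and the constant names are illustrative and not part of the original code; the base number and the query parameters are the ones used in the loop above.

```python
from urllib.parse import urlencode

BASE_URL = "https://v2.sohu.com/public-api/feed"
BASE_TIMESTAMP = 1603263206992  # value taken from the article's loop


def build_feed_url(page: int) -> str:
    """Build the feed URL for one page (hypothetical helper)."""
    params = {
        "scene": "CATEGORY",
        "sceneId": 1460,
        "page": page,
        "size": 20,
        "_": BASE_TIMESTAMP + page * 8,  # last number grows by 8 per page
    }
    return f"{BASE_URL}?{urlencode(params)}"


urls = [build_feed_url(p) for p in range(1, 10)]
print(urls[0])
```

Keeping the parameters in a dictionary makes it easier to see which value changes per page, and urlencode takes care of escaping if a parameter ever contains special characters.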

3. Crawling the news titles and their hyperlinks

Here the titles and links are extracted with regular expressions.

Implementation code:

    # This fragment runs inside the page loop, where url is built for each page.
    # news_dictall = {} must be created before the loop.
    # Requires: import re, import requests, from bs4 import BeautifulSoup
    headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.75 Safari/537.36',
        'cookie': (
            'itssohu=true; BAIDU_SSP_lcr=https://news.hao123.com/wangzhi; IPLOC=CN3300; '
            'SUV=201021142102FD7T; reqtype=pc; '
            'gidinf=x099980109ee124d51195e802000a3aab2e8ca7bf7da; t=1603261548713; '
            'jv=78160d8250d5ed3e3248758eeacbc62e-kuzhE2gk1603261903982; '
            'ppinf=2|1603261904|1604471504|'
            'bG9naW5pZDowOnx1c2VyaWQ6Mjg6MTMxODgwMjEyODc2ODQzODI3MkBzb2h1LmNvbXxzZXJ2aWNldXNlOjMwOjAwMD'
            'AwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMHxjcnQ6MTA6MjAyMC0xMC0yMXxlbXQ6MTowfGFwcGlkOjY6MTE2MDA1fHRydXN0OjE6MXxwYXJ0bmVyaWQ6MT'
            'owfHJlbGF0aW9uOjA6fHV1aWQ6MTY6czExZjVhZTI2NTJiNmM3Nnx1aWQ6MTY6czExZjVhZTI2NTJiNmM3Nnx1bmlxbmFtZTowOnw; '
            'pprdig=L2Psu-NwDR2a1BZITLwhlxdvI2OrHzl6jqQlF3zP4z70gqsyYxXmf5dCZGuhPFZ-XWWE5mflwnCHURGUQaB5cxxf8HKpzVIbqTJJ3_TNhPgpDMMQd'
            'Fo64Cqoay43UxanOZJc4-9dcAE6GU3PIufRjmHw_LApBXLN7sOMUodmfYE; '
            'ppmdig=1603261913000000cfdc2813caf37424544d67b1ffee4770'
        ),
    }
    res = requests.get(url, headers=headers)
    soup = BeautifulSoup(res.text, 'lxml')
    news = re.findall('"mobileTitle":"(.*?)",', str(soup))    # news titles
    herf = re.findall('"originalSource":"(.*?)"', str(soup))  # news links
    # Earlier attempts, kept for reference:
    # news = soup.find_all("div", attrs={'class': 'news-wrapper'})
    # html = etree.HTML(res.text)
    # news = html.xpath('/html/body/div[2]/div[1]/div[2]/div[2]/div/div[3]/div[3]/h4/a/text()')
    news_dic = dict(zip(news, herf))  # store titles and links in a dictionary
    for k, v in news_dic.items():
        news_dictall[k] = v  # merge each page's dictionary into the total
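If the feed response is plain JSON, the same two fields could also be read with the json module instead of regular expressions. The sketch below assumes the response body is a JSON array of objects carrying the mobileTitle and originalSource keys that the regexes target; that shape is an assumption, and the sample payload and the helper name extract_titles are made up for illustration.

```python
import json

# Hypothetical sample shaped like the assumed feed response; a real
# response would carry many more fields per item.
sample = """[
  {"mobileTitle": "Title A", "originalSource": "https://example.com/a"},
  {"mobileTitle": "Title B", "originalSource": "https://example.com/b"},
  {"mobileTitle": "Title without link"}
]"""


def extract_titles(payload: str) -> dict:
    """Map each title to its link, skipping items that lack either key."""
    result = {}
    for item in json.loads(payload):
        title = item.get("mobileTitle")
        link = item.get("originalSource")
        if title and link:
            result[title] = link
    return result


news_dic = extract_titles(sample)
print(news_dic)
```

Reading both fields from the same item keeps each title paired with its own link, whereas zipping two separate regex result lists can drift out of step when one field is missing from an item.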

4. Judging the fit with the topic

    # Requires: import jieba.analyse as ana
    #           from gensim.corpora.dictionary import Dictionary
    def ifsim(topicwords):
        news_dicfin = {}
        news_dic = getdata()
        # load the stop-word list (replace with the path to your own file)
        ana.set_stop_words('D: homework/python/text mining/dataset/news dataset/data/stopwords.txt')
        for k, v in news_dic.items():
            word_list = ana.extract_tags(k, topK=50, withWeight=False)  # keywords of the title
            # word_lil.append(word_list)
            word_lil = []
            for i in word_list:
                word_lil.append([i])  # list of one-word lists, the form Dictionary expects
            word_dic = Dictionary(word_lil)  # gensim Dictionary for analysis
            d = dict(word_dic.items())       # id -> word
            docwords = set(d.values())       # set of words in the title
            commwords = topicwords.intersection(docwords)  # words shared with the topic
            if len(commwords) > 0:
                news_dicfin[k] = v  # keep titles that share at least one topic word
        print(news_dicfin)

Printing word_dic directly gives:

![](https://p1-tt-ipv6.byteimg.com/origin/pgc-image/14dbb5fbb39f4cc8a3c3109922cc494d)

Output of docwords:

![](https://p26-tt.byteimg.com/origin/pgc-image/8ce9e59f8e3143db9281bc1a2b632a07)

Output of word_list:

![](https://p1.pstatp.com/origin/pgc-image/a16155315e0e456ca747bede710d11b2)

Output of word_lil:

![](https://p1.pstatp.com/origin/pgc-image/165d438aa3244e9c9e2773fe6b808a6a)

Output of d:

![](https://p1.pstatp.com/origin/pgc-image/009e2464b9da46f6bfc451dce40f19c5)
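To make the relationship between these intermediate variables easier to follow, here is a rough sketch that imitates the same steps with plain Python containers. It does not call gensim: the id-to-word mapping is a stand-in built with an ordinary dict, the ids gensim assigns may differ, and the keywords are hypothetical English placeholders for what ana.extract_tags would return.

```python
# Hypothetical keywords standing in for the output of ana.extract_tags
word_list = ["epidemic", "confirmed", "case"]

# One single-word "document" per keyword, as in the article's loop
word_lil = [[w] for w in word_list]

# Plain-dict stand-in for an id -> word mapping (not gensim itself)
token_to_id = {}
for doc in word_lil:
    for token in doc:
        token_to_id.setdefault(token, len(token_to_id))
d = {idx: token for token, idx in token_to_id.items()}

# The set of words that is later intersected with the topic words
docwords = set(d.values())

print(word_lil)
print(d)
print(docwords == set(word_list))
```

In this stand-in, docwords ends up equal to set(word_list), which suggests that the round trip through a dictionary mainly serves to deduplicate the keywords before the intersection is taken.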

5. Output results

A title is judged to belong to the topic when its keywords share at least one word with the given topic words, i.e. the intersection is non-empty, and it is then saved into the final dictionary. The output of news_dicfin is as follows:

![](https://p9-tt-ipv6.byteimg.com/origin/pgc-image/84ccd599c33d4497aeb83fbf59520aae)
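The intersection test can be tried in isolation, without jieba or any network access. The sketch below assumes the keywords of each title have already been extracted; the function name filter_by_topic and all titles, keywords, and links are hypothetical.

```python
def filter_by_topic(news_keywords: dict, news_links: dict, topicwords: set) -> dict:
    """Keep titles whose keywords share at least one word with topicwords."""
    kept = {}
    for title, keywords in news_keywords.items():
        commwords = topicwords.intersection(keywords)
        if len(commwords) > 0:
            kept[title] = news_links[title]
    return kept


# Hypothetical titles with already-extracted keywords and made-up links
news_keywords = {
    "New confirmed cases reported": ["confirmed", "case"],
    "Stock market closes higher": ["stock", "market"],
}
news_links = {
    "New confirmed cases reported": "https://example.com/1",
    "Stock market closes higher": "https://example.com/2",
}
topicwords = {"epidemic", "pneumonia", "confirmed", "case"}

result = filter_by_topic(news_keywords, news_links, topicwords)
print(result)
```

Requiring only one shared word is a loose criterion: it is easy to apply, but a title that merely mentions a topic word in passing also passes, so a higher threshold may be worth trying when the results look noisy.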

6. Complete code

    import re

    import jieba
    import jieba.analyse as ana
    import requests
    from bs4 import BeautifulSoup
    from gensim.corpora.dictionary import Dictionary


    def getdata():
        # news_all = []
        news_dictall = {}
        for p in range(1, 10):
            p2 = 1603263206992 + p * 8  # the last number grows by 8 per page
            url = 'https://v2.sohu.com/public-api/feed?scene=CATEGORY&sceneId=1460&page=' + str(p) + '&size=20&_=' + str(p2)
            headers = {
                'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.75 Safari/537.36',
                'cookie': (
                    'itssohu=true; BAIDU_SSP_lcr=https://news.hao123.com/wangzhi; IPLOC=CN3300; '
                    'SUV=201021142102FD7T; reqtype=pc; '
                    'gidinf=x099980109ee124d51195e802000a3aab2e8ca7bf7da; t=1603261548713; '
                    'jv=78160d8250d5ed3e3248758eeacbc62e-kuzhE2gk1603261903982; '
                    'ppinf=2|1603261904|1604471504|'
                    'bG9naW5pZDowOnx1c2VyaWQ6Mjg6MTMxODgwMjEyODc2ODQzODI3MkBzb2h1LmNvbXxzZXJ2aWNldXNlOjMwOjAwMD'
                    'AwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMHxjcnQ6MTA6MjAyMC0xMC0yMXxlbXQ6MTowfGFwcGlkOjY6MTE2MDA1fHRydXN0OjE6MXxwYXJ0bmVyaWQ6MT'
                    'owfHJlbGF0aW9uOjA6fHV1aWQ6MTY6czExZjVhZTI2NTJiNmM3Nnx1aWQ6MTY6czExZjVhZTI2NTJiNmM3Nnx1bmlxbmFtZTowOnw; '
                    'pprdig=L2Psu-NwDR2a1BZITLwhlxdvI2OrHzl6jqQlF3zP4z70gqsyYxXmf5dCZGuhPFZ-XWWE5mflwnCHURGUQaB5cxxf8HKpzVIbqTJJ3_TNhPgpDMMQd'
                    'Fo64Cqoay43UxanOZJc4-9dcAE6GU3PIufRjmHw_LApBXLN7sOMUodmfYE; '
                    'ppmdig=1603261913000000cfdc2813caf37424544d67b1ffee4770'
                ),
            }
            res = requests.get(url, headers=headers)
            soup = BeautifulSoup(res.text, 'lxml')
            news = re.findall('"mobileTitle":"(.*?)",', str(soup))    # news titles
            herf = re.findall('"originalSource":"(.*?)"', str(soup))  # news links
            # Earlier attempts, kept for reference:
            # news = soup.find_all("div", attrs={'class': 'news-wrapper'})
            # html = etree.HTML(res.text)
            # news = html.xpath('/html/body/div[2]/div[1]/div[2]/div[2]/div/div[3]/div[3]/h4/a/text()')
            news_dic = dict(zip(news, herf))  # store titles and links in a dictionary
            for k, v in news_dic.items():
                news_dictall[k] = v  # merge each page's dictionary into the total
        return news_dictall  # return the total dictionary


    def ifsim(topicwords):
        news_dicfin = {}
        news_dic = getdata()
        # load the stop-word list (replace with the path to your own file)
        ana.set_stop_words('D: homework/python/text mining/dataset/news dataset/data/stopwords.txt')
        for k, v in news_dic.items():
            word_list = ana.extract_tags(k, topK=50, withWeight=False)  # keywords of the title
            # word_lil.append(word_list)
            word_lil = []
            for i in word_list:
                word_lil.append([i])  # list of one-word lists, the form Dictionary expects
            word_dic = Dictionary(word_lil)  # gensim Dictionary for analysis
            d = dict(word_dic.items())       # id -> word
            docwords = set(d.values())       # set of words in the title
            commwords = topicwords.intersection(docwords)  # words shared with the topic
            if len(commwords) > 0:
                news_dicfin[k] = v  # keep titles that share at least one topic word
        print(news_dicfin)


    if __name__ == '__main__':
        # Topic words appear here in English translation; matching Chinese
        # titles requires the original Chinese keywords.
        topicwords = {"epidemic", "novel coronavirus", "pneumonia", "confirmed", "case"}
        ifsim(topicwords)

