1. Implementation idea

This time we crawl the politics category of Sohu News.

![](https://p1-tt-ipv6.byteimg.com/origin/pgc-image/7df50830cad94be5aee009e5c547974a)

Get the URLs – crawl each news title and its hyperlink – judge the fit with the topic – obtain the final result
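The pipeline above can be sketched end to end (a minimal sketch: the function names and sample data are placeholders, not the author's implementation, which follows below):

```python
def get_pages():
    # step 1: build the feed URLs (placeholder: a single page)
    return ["https://v2.sohu.com/public-api/feed?scene=CATEGORY&sceneId=1460&page=1&size=20"]

def crawl(urls):
    # step 2: fetch titles and hyperlinks (placeholder data instead of real requests)
    return {"headline about the epidemic": "https://www.sohu.com/a/example"}

def judge(news_dic, topicwords):
    # step 3: keep only titles sharing at least one word with the topic set
    return {k: v for k, v in news_dic.items() if topicwords & set(k.split())}

result = judge(crawl(get_pages()), {"epidemic"})
print(result)
```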

2. Obtaining the URL pattern

It is observed that the Sohu news page is loaded dynamically, but there are no files under F12 – Network – XHR, so we have to look for the content we want among the files under ALL.

![](https://p1-tt-ipv6.byteimg.com/origin/pgc-image/8af86ec85e1f4f75b2034e062846995e)

The file we need turns out to be a JS file:

![](https://p6-tt-ipv6.byteimg.com/origin/pgc-image/2cf9f83075834cbaab2b1027ab0a8782)

Observe the URL patterns of the feed files for the first four pages:

![](https://p1-tt-ipv6.byteimg.com/origin/pgc-image/974da01dca4a49dcb5e873da0efcd4fd)

From page to page the callback parameter changes irregularly, while the trailing number increases by 8 per page. Removing the callback parameter turns out to have no effect on the returned content, so the final page URLs are built by string concatenation:

```python
for p in range(1, 10):
    p2 = 1603263206992 + p * 8
    url = 'https://v2.sohu.com/public-api/feed?scene=CATEGORY&sceneId=1460&page=' + str(p) + '&size=20&_=' + str(p2)
```
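As a quick sanity check, the concatenation above produces URLs like the following (a minimal sketch; only the loop logic comes from the article):

```python
# generate and print the first two page URLs to verify the pattern
for p in range(1, 3):
    p2 = 1603263206992 + p * 8
    url = ('https://v2.sohu.com/public-api/feed?scene=CATEGORY&sceneId=1460'
           '&page=' + str(p) + '&size=20&_=' + str(p2))
    print(url)
```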

3. Crawling news titles and hyperlinks

This time, regular expressions are used to extract them.

Implementation code:

```python
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.75 Safari/537.36',
    'cookie': 'itssohu=true; BAIDU_SSP_lcr=https://news.hao123.com/wangzhi; IPLOC=CN3300; SUV=201021142102FD7T; reqtype=pc; gidinf=x099980109ee124d51195e802000a3aab2e8ca7bf7da; t=1603261548713; jv=78160d8250d5ed3e3248758eeacbc62e-kuzhE2gk1603261903982; ppinf=2|1603261904|1604471504|bG9naW5pZDowOnx1c2VyaWQ6Mjg6MTMxODgwMjEyODc2ODQzODI3MkBzb2h1LmNvbXxzZXJ2aWNldXNlOjMwOjAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMHxjcnQ6MTA6MjAyMC0xMC0yMXxlbXQ6MTowfGFwcGlkOjY6MTE2MDA1fHRydXN0OjE6MXxwYXJ0bmVyaWQ6MTowfHJlbGF0aW9uOjA6fHV1aWQ6MTY6czExZjVhZTI2NTJiNmM3Nnx1aWQ6MTY6czExZjVhZTI2NTJiNmM3Nnx1bmlxbmFtZTowOnw; pprdig=L2Psu-NwDR2a1BZITLwhlxdvI2OrHzl6jqQlF3zP4z70gqsyYxXmf5dCZGuhPFZ-XWWE5mflwnCHURGUQaB5cxxf8HKpzVIbqTJJ3_TNhPgpDMMQdFo64Cqoay43UxanOZJc4-9dcAE6GU3PIufRjmHw_LApBXLN7sOMUodmfYE; ppmdig=1603261913000000cfdc2813caf37424544d67b1ffee4770'
}
res = requests.get(url, headers=headers)
soup = BeautifulSoup(res.text, 'lxml')
news = re.findall('"mobileTitle":"(.*?)",', str(soup))
herf = re.findall('"originalSource":"(.*?)"', str(soup))
# news = soup.find_all("div", attrs={'class': 'news-wrapper'})
# html = etree.HTML(res.text)
# news = html.xpath('/html/body/div[2]/div[1]/div[2]/div[2]/div/div[3]/div[3]/h4/a/text()')
news_dic = dict(zip(news, herf))  # store titles and links in a dictionary
for k, v in news_dic.items():
    news_dictall[k] = v  # merge each page's dictionary into the overall one
```
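Since the feed endpoint returns JSON, an alternative to the regex approach is to parse the payload directly. This is a hedged sketch: it assumes the response is a JSON array whose items carry `mobileTitle` and `originalSource` fields (as the regexes above suggest), and uses a hard-coded sample payload rather than a live request:

```python
import json

# sample payload shaped like the fields the regexes above extract (made up for illustration)
sample = '[{"mobileTitle": "Some headline", "originalSource": "https://www.sohu.com/a/123"}]'

items = json.loads(sample)
news_dic = {item['mobileTitle']: item['originalSource'] for item in items}
print(news_dic)
```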

4. Judging the fit with the topic

```python
def ifsim(topicwords):
    news_dicfin = {}
    news_dic = getdata()
    ana.set_stop_words('D:/homework/python/text mining/dataset/news dataset/data/stopwords.txt')  # load stopwords
    for k, v in news_dic.items():
        word_list = ana.extract_tags(k, topK=50, withWeight=False)  # extract keywords from the title
        word_lil = []
        for i in word_list:
            word_lil.append([i])  # wrap each word in a list so it can be passed to Dictionary
        word_dic = Dictionary(word_lil)  # convert to a gensim Dictionary for analysis
        d = dict(word_dic.items())
        docwords = set(d.values())
        commwords = topicwords.intersection(docwords)  # overlap with the topic words
        if len(commwords) > 0:
            news_dicfin[k] = v
    print(news_dicfin)
```

If word_dic is output directly, the result is:

![](https://p1-tt-ipv6.byteimg.com/origin/pgc-image/14dbb5fbb39f4cc8a3c3109922cc494d)

The docwords output is:

![](https://p26-tt.byteimg.com/origin/pgc-image/8ce9e59f8e3143db9281bc1a2b632a07)

The word_list output is:

![](https://p1.pstatp.com/origin/pgc-image/a16155315e0e456ca747bede710d11b2)

The word_lil output is:

![](https://p1.pstatp.com/origin/pgc-image/165d438aa3244e9c9e2773fe6b808a6a)

The output of d is:

![](https://p1.pstatp.com/origin/pgc-image/009e2464b9da46f6bfc451dce40f19c5)
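As the outputs above show, the gensim Dictionary is only used here to obtain the set of unique keywords. A plain Python set over the keyword list gives the same docwords (a sketch with made-up words, not the author's code):

```python
# pretend these keywords were extracted from a title (duplicates included)
word_list = ["epidemic", "city", "epidemic", "response"]

docwords = set(word_list)  # same set of unique words, no gensim Dictionary needed
print(docwords)
```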

5. Output results

This time, a title is judged to belong to the topic if it shares at least one word with the given topic words, i.e. the intersection size is greater than 0; matching titles are saved into the final dictionary. The output of news_dicfin is as follows:

![](https://p9-tt-ipv6.byteimg.com/origin/pgc-image/84ccd599c33d4497aeb83fbf59520aae)
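The intersection test can be illustrated with plain sets (a minimal sketch with made-up titles, keyword sets, and links, independent of jieba and the live crawl):

```python
topicwords = {"epidemic", "pneumonia", "confirmed", "case"}

# pretend these keyword sets were extracted from two crawled titles
docs = {
    "Title A": {"epidemic", "city", "response"},
    "Title B": {"sports", "final", "score"},
}
news_dic = {"Title A": "url-a", "Title B": "url-b"}

# keep a title only when its keywords overlap with the topic words
news_dicfin = {k: v for k, v in news_dic.items()
               if len(topicwords & docs[k]) > 0}
print(news_dicfin)
```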

6. Complete code

```python
import requests
from bs4 import BeautifulSoup
import re
import jieba
import jieba.analyse as ana
from gensim.corpora.dictionary import Dictionary


def getdata():
    news_dictall = {}
    for p in range(1, 10):
        p2 = 1603263206992 + p * 8
        url = 'https://v2.sohu.com/public-api/feed?scene=CATEGORY&sceneId=1460&page=' + str(p) + '&size=20&_=' + str(p2)
        headers = {
            'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.75 Safari/537.36',
            'cookie': 'itssohu=true; BAIDU_SSP_lcr=https://news.hao123.com/wangzhi; IPLOC=CN3300; SUV=201021142102FD7T; reqtype=pc; gidinf=x099980109ee124d51195e802000a3aab2e8ca7bf7da; t=1603261548713; jv=78160d8250d5ed3e3248758eeacbc62e-kuzhE2gk1603261903982; ppinf=2|1603261904|1604471504|bG9naW5pZDowOnx1c2VyaWQ6Mjg6MTMxODgwMjEyODc2ODQzODI3MkBzb2h1LmNvbXxzZXJ2aWNldXNlOjMwOjAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMHxjcnQ6MTA6MjAyMC0xMC0yMXxlbXQ6MTowfGFwcGlkOjY6MTE2MDA1fHRydXN0OjE6MXxwYXJ0bmVyaWQ6MTowfHJlbGF0aW9uOjA6fHV1aWQ6MTY6czExZjVhZTI2NTJiNmM3Nnx1aWQ6MTY6czExZjVhZTI2NTJiNmM3Nnx1bmlxbmFtZTowOnw; pprdig=L2Psu-NwDR2a1BZITLwhlxdvI2OrHzl6jqQlF3zP4z70gqsyYxXmf5dCZGuhPFZ-XWWE5mflwnCHURGUQaB5cxxf8HKpzVIbqTJJ3_TNhPgpDMMQdFo64Cqoay43UxanOZJc4-9dcAE6GU3PIufRjmHw_LApBXLN7sOMUodmfYE; ppmdig=1603261913000000cfdc2813caf37424544d67b1ffee4770'
        }
        res = requests.get(url, headers=headers)
        soup = BeautifulSoup(res.text, 'lxml')
        news = re.findall('"mobileTitle":"(.*?)",', str(soup))
        herf = re.findall('"originalSource":"(.*?)"', str(soup))
        # news = soup.find_all("div", attrs={'class': 'news-wrapper'})
        # html = etree.HTML(res.text)
        # news = html.xpath('/html/body/div[2]/div[1]/div[2]/div[2]/div/div[3]/div[3]/h4/a/text()')
        news_dic = dict(zip(news, herf))  # store titles and links in a dictionary
        for k, v in news_dic.items():
            news_dictall[k] = v  # merge each page's dictionary
    return news_dictall  # return the combined dictionary


def ifsim(topicwords):
    news_dicfin = {}
    news_dic = getdata()
    ana.set_stop_words('D:/homework/python/text mining/dataset/news dataset/data/stopwords.txt')  # load stopwords
    for k, v in news_dic.items():
        word_list = ana.extract_tags(k, topK=50, withWeight=False)  # extract keywords from the title
        word_lil = []
        for i in word_list:
            word_lil.append([i])  # wrap each word in a list so it can be passed to Dictionary
        word_dic = Dictionary(word_lil)  # convert to a gensim Dictionary for analysis
        d = dict(word_dic.items())
        docwords = set(d.values())
        commwords = topicwords.intersection(docwords)  # overlap with the topic words
        if len(commwords) > 0:
            news_dicfin[k] = v
    print(news_dicfin)


if __name__ == '__main__':
    topicwords = {'疫情', '新冠', '肺炎', '确诊', '病例'}  # epidemic, COVID-19, pneumonia, confirmed, case
    ifsim(topicwords)
```
