Preface
Today we will use Scrapy to crawl Chinese patent data from CNKI and do a simple data visualization analysis. Let's get started.
**PS:** This project is for learning and exchange only. Please set a reasonable download delay and limit the volume of patent data crawled to avoid putting unnecessary pressure on the CNKI server.
Development tools
Python version: 3.6.4
Related modules:
scrapy module;
fake_useragent module;
pyecharts module;
wordcloud module;
jieba module;
And some modules that come with Python.
Environment setup
Install Python and add it to the environment variables, then use pip to install the required modules.
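For example, the third-party modules listed above can be installed in one go (assuming pip points at the Python 3 interpreter):

```shell
pip install scrapy fake_useragent pyecharts wordcloud jieba
```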
Data crawl
The data we need to crawl is the information shown on each patent's detail page, including the patent title, the applicant's address, the patent agency, and the abstract (the fields used in the visualizations below).
Crawling approach:
We can easily find that the detail-page URL for each patent looks something like this:

http://dbpub.cnki.net/grid2008/dbpub/Detail.aspx?DBName=SCPD<year>&FileName=<patent publication number>&QueryID=4&CurRec=1
Therefore, as long as we change the patent publication number, we can obtain the detail-page URL of the corresponding patent (testing shows it does not matter if the year does not match) and from it the patent's information. The code is as follows:
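The core of this step is generating a detail-page URL from each patent publication number. Below is a minimal sketch of that URL construction; the helper name `detail_url` and the sample publication number in the comments are my own illustration, not taken from the original code:

```python
from urllib.parse import urlencode

BASE = "http://dbpub.cnki.net/grid2008/dbpub/Detail.aspx"

def detail_url(pub_number, year=2014):
    """Build the detail-page URL for a given patent publication number.

    DBName is "SCPD" plus a year; as noted above, the year does not
    have to match the patent exactly.
    """
    params = {
        "DBName": f"SCPD{year}",
        "FileName": pub_number,
        "QueryID": 4,
        "CurRec": 1,
    }
    return f"{BASE}?{urlencode(params)}"

# In a Scrapy spider, start_requests() would yield one Request per
# publication number, e.g.:
#   yield scrapy.Request(detail_url(num), callback=self.parse)
```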
All done~ See the related files in my profile for the complete source code.
PS: Run the code via main.py.
Data visualization
In order to avoid putting unnecessary pressure on the CNKI server, we only crawled part of CNKI's Chinese patent data for 2014 (just over an hour of crawling). The results of the visual analysis of these data are as follows.
Let's first take a look at the distribution of patent applications across provinces:
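The province chart boils down to counting patents per applicant province. A sketch of that aggregation step follows; the addresses in the example are fabricated for illustration, and in the real run they would come from the crawled items:

```python
from collections import Counter

def count_by_province(addresses):
    """Count patents per province, taking the province as the text
    before the first '省' (province) or '市' (municipality) marker
    in each applicant address."""
    counts = Counter()
    for addr in addresses:
        for marker in ("省", "市"):
            idx = addr.find(marker)
            if idx != -1:
                counts[addr[:idx]] += 1
                break
    return counts

# Example with made-up addresses:
sample = ["广东省深圳市南山区", "北京市海淀区", "广东省广州市"]
print(count_by_province(sample))  # Counter({'广东': 2, '北京': 1})
```

The resulting counts can then be fed to a pyecharts map or bar chart for plotting.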
And then what about the patent agencies?
Finally, take a look at the word cloud of all the patent abstracts:
And a word cloud of all the patent titles:
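Both word clouds follow the same recipe: segment the text, count word frequencies, and hand the frequencies to the wordcloud module. A sketch of the frequency step is below, with the segmenter passed in as a parameter; in the real run it would be jieba's `lcut`, but here a plain `str.split` on English sample text stands in so the example is self-contained:

```python
from collections import Counter

def word_frequencies(texts, segment, stopwords=frozenset(), top_n=100):
    """Segment each text, drop stopwords and single-character tokens,
    and return the top_n (word, count) pairs."""
    counts = Counter()
    for text in texts:
        for word in segment(text):
            if len(word) > 1 and word not in stopwords:
                counts[word] += 1
    return counts.most_common(top_n)

# With jieba this would be word_frequencies(abstracts, jieba.lcut);
# here a whitespace split demonstrates the logic:
freqs = word_frequencies(["a device for water treatment",
                          "a water pump device"], str.split)
```

The resulting pairs map directly onto `WordCloud.generate_from_frequencies(dict(freqs))` in the wordcloud module.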