Introduction
This article walks through how I fell into one pit after another while writing a Python crawler, and how I slowly climbed back out. I describe every problem I hit in detail, so that when we run into them later we can quickly pin down the source. Only the core code is posted below; the more detailed code is left out for now.
Process overview
First of all, I wanted to crawl all the articles on a certain website, but I had never written a crawler before (and didn't even know which language is most convenient for it), so I decided to write one in Python (after all, the name practically has "crawler" written into it 😄). My plan is to crawl all the data from the site into ElasticSearch. I chose ElasticSearch because it is fast, ships with word-segmentation plugins and an inverted index, and keeps queries efficient when there is a lot of data (and there is a lot to crawl 😄). Once everything is in ElasticSearch, I use its companion Kibana to visualize the data and analyze the content of these articles. You can take a look at the expected visualization first (above); that screenshot is the sample dashboard that Kibana 6.4 ships with (in other words, this is what it can do, and this is what I want to do 😁). I will post a dockerfile later (not done yet 😳).
Environment requirements
- JDK (required by Elasticsearch)
- ElasticSearch (used to store data)
- Kibana (for visualizing the ElasticSearch data)
- Python (for writing the crawler)
- Redis (for URL deduplication)
You can find the appropriate tutorials to install all of these; I only have an installation tutorial for ElasticSearch on hand 😢.
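Before writing any crawler code it is worth making sure Elasticsearch and Redis are actually reachable. A minimal sketch, assuming both run locally on their default ports (9200 and 6379):

from elasticsearch import Elasticsearch
import redis

# assumes Elasticsearch on localhost:9200 and Redis on localhost:6379
es = Elasticsearch(hosts="localhost:9200")
r = redis.Redis(host="localhost", port=6379)

print("Elasticsearch reachable:", es.ping())  # True if the cluster answers
print("Redis reachable:", r.ping())           # True if Redis answers, raises ConnectionError otherwise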
Step 1: Use Python's pip to install the required packages (the first pitfall is here)
- tomd: converts HTML to Markdown
pip3 install tomd
- redis: the Redis client for Python
pip3 install redis
- Scrapy: the crawler framework (this is where the pitfall is)
- First of all, I tried installing it the same way as above:
pip3 install scrapy
- only to find that the gcc component was missing:
error: command 'gcc' failed with exit status 1
- Then I searched, searched and searched, and finally found the right solution (trying plenty of wrong answers along the way 😭). The fix is to use yum to install python34-devel. The python34-devel package depends on your Python version; it may be called python-devel instead, so change the 34 in the middle to match your version (mine is 3.4.6):
yum install python34-devel
- Once python34-devel is installed, install Scrapy again and it should go through (a quick sanity check is sketched below).
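A quick sanity check that the installation really worked: if this import runs without an error, Scrapy is in place.

# if this prints a version number, scrapy installed correctly
import scrapy
print(scrapy.__version__)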
Step 2: Use Scrapy to create your project
- Type the command scrapy startproject scrapyDemo to create a crawler project:
liaochengdeMacBook-Pro:scrapy liaocheng$ scrapy startproject scrapyDemo
New Scrapy project 'scrapyDemo', using template directory '/usr/local/lib/python3.7/site-packages/scrapy/templates/project', created in:
    /Users/liaocheng/script/scrapy/scrapyDemo

You can start your first spider with:
    cd scrapyDemo
    scrapy genspider example example.com
liaochengdeMacBook-Pro:scrapy liaocheng$
- Use genspider to generate a basic spider with the command scrapy genspider demo juejin.im; the domain at the end is the website you want to crawl. We'll crawl our own home, Juejin, first 😂
liaochengdeMacBook-Pro:scrapy liaocheng$ scrapy genspider demo juejin.im
Created spider 'demo' using template 'basic'
liaochengdeMacBook-Pro:scrapy liaocheng$
- View the generated directory structure (a typical layout is sketched below)
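The layout that scrapy startproject generates typically looks like this (demo.py only appears under spiders/ after the genspider step; the exact files can vary slightly between Scrapy versions):

scrapyDemo/
    scrapy.cfg            # deploy configuration
    scrapyDemo/
        __init__.py
        items.py          # item definitions
        middlewares.py
        pipelines.py      # pipeline classes
        settings.py       # project settings
        spiders/
            __init__.py
            demo.py       # the spider generated by scrapy genspider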
Step 3: Open the project and start coding
- View the contents of the generated demo.py
# -*- coding: utf-8 -*-
import scrapy

class DemoSpider(scrapy.Spider):
    name = 'demo'  # the name of the crawler
    allowed_domains = ['juejin.im']  # domain filter: only content under this domain is crawled
    start_urls = ['https://juejin.cn/post/6844903785584672776']  # initial URL

    def parse(self, response):  # every new spider must implement this method
        pass
- You can also use a second approach: drop start_urls and generate the requests in a start_requests() method instead
# -*- coding: utf-8 -*-
import scrapy

class DemoSpider(scrapy.Spider):
    name = 'demo'  # the name of the crawler
    allowed_domains = ['juejin.im']  # domain filter: only content under this domain is crawled

    def start_requests(self):
        start_urls = ['https://juejin.cn']  # initial URL
        for url in start_urls:
            # call parse for each response
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):  # every new spider must implement this method
        pass
- Write the articleItem.py file (an item file is like an entity class in Java):
import scrapy

class ArticleItem(scrapy.Item):  # must extend scrapy.Item
    # article id
    id = scrapy.Field()
    # article title
    title = scrapy.Field()
    # article content
    content = scrapy.Field()
    # author
    author = scrapy.Field()
    # publish time
    createTime = scrapy.Field()
    # read count
    readNum = scrapy.Field()
    # like count
    praise = scrapy.Field()
    # author avatar
    photo = scrapy.Field()
    # comment count
    commentNum = scrapy.Field()
    # article link
    link = scrapy.Field()
- Write the parse method:
def parse(self, response):
    # get all the urls on the page
    nextPage = response.css("a::attr(href)").extract()
    # traverse all url links on the page, time complexity is O(n)
    for i in nextPage:
        if i is not None:
            # join the link with the current page url
            url = response.urljoin(i)
            # only follow links that belong to Juejin
            if "juejin.im" in str(url):
                # save to redis; if it can be saved, it is a link we have not crawled yet
                if self.insertRedis(url) == True:
                    # dont_filter=False keeps scrapy's built-in duplicate filter enabled as well
                    yield scrapy.Request(url=url, callback=self.parse, headers=self.headers, dont_filter=False)

    # we only analyze articles, nothing else
    if "/post/" in response.url and "#comment" not in response.url:
        # create the ArticleItem we just defined
        article = ArticleItem()
        # use the article id as the id
        article['id'] = str(response.url).split("/")[-1]
        # title
        article['title'] = response.css("#juejin > div.view-container > main > div > div.main-area.article-area.shadow > article > h1::text").extract_first()
        # content
        parameter = response.css("#juejin > div.view-container > main > div > div.main-area.article-area.shadow > article > div.article-content").extract_first()
        article['content'] = self.parseToMarkdown(parameter)
        # author
        article['author'] = response.css("#juejin > div.view-container > main > div > div.main-area.article-area.shadow > article > div:nth-child(6) > meta:nth-child(1)::attr(content)").extract_first()
        # create time
        createTime = response.css("#juejin > div.view-container > main > div > div.main-area.article-area.shadow > article > div.author-info-block > div > div > time::text").extract_first()
        createTime = str(createTime).replace("年", "-").replace("月", "-").replace("日", "")
        article['createTime'] = createTime
        # read count (assumes the span text is something like "阅读 1234", so take the number after the space)
        article['readNum'] = int(str(response.css("#juejin > div.view-container > main > div > div.main-area.article-area.shadow > article > div.author-info-block > div > div > span::text").extract_first()).split(" ")[1])
        # like count
        article['praise'] = response.css("#juejin > div.view-container > main > div > div.article-suspended-panel.article-suspended-panel > div.like-btn.panel-btn.like-adjust.with-badge::attr(badge)").extract_first()
        # comment count
        article['commentNum'] = response.css("#juejin > div.view-container > main > div > div.article-suspended-panel.article-suspended-panel > div.comment-btn.panel-btn.comment-adjust.with-badge::attr(badge)").extract_first()
        # article link
        article['link'] = response.url
        # this yield is important: without it the pipeline never receives any data
        yield article

# convert content to markdown
def parseToMarkdown(self, param):
    return tomd.Tomd(str(param)).markdown

# store the url in redis: if sadd succeeds the link is new, otherwise it has been crawled before
def insertRedis(self, url):
    if self.redis is not None:
        return self.redis.sadd("articleUrlList", url) == 1
    else:
        self.redis = self.redisConnection.getClient()
        return self.insertRedis(url)
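The code above relies on self.headers, self.redis and self.redisConnection, which are never shown being set up, and on the tomd module being imported. A minimal sketch of what the top of demo.py could look like; the attribute names match the code above, but the User-Agent string and the Redis host/port are assumptions, and here the Redis client is created directly instead of through a separate redisConnection helper:

import tomd
import redis
import scrapy

class DemoSpider(scrapy.Spider):
    name = 'demo'
    allowed_domains = ['juejin.im']

    # headers attached to every scrapy.Request issued in parse()
    headers = {
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14) AppleWebKit/537.36"
    }

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # plain Redis client used for url deduplication, assuming a local Redis on the default port
        self.redis = redis.Redis(host="localhost", port=6379, db=0)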
- Write the pipeline class. Every item produced with the yield keyword is handed to this pipeline, but the pipeline has to be registered in the project settings (see the sketch after the code below).
from elasticsearch import Elasticsearch

class ArticlePipelines(object):
    # initialization
    def __init__(self):
        # the elasticsearch index
        self.index = "article"
        # the elasticsearch type
        self.type = "type"
        # elasticsearch ip + port
        self.es = Elasticsearch(hosts="localhost:9200")

    # the method that must be implemented; it processes every item returned by yield
    def process_item(self, item, spider):
        # only handle items coming from our own spider
        if spider.name != "demo":
            return item

        result = self.checkDocumentExists(item)
        if result == False:
            self.createDocument(item)
        else:
            self.updateDocument(item)
        return item

    # add a document
    def createDocument(self, item):
        body = {
            "title": item['title'],
            "content": item['content'],
            "author": item['author'],
            "createTime": item['createTime'],
            "readNum": item['readNum'],
            "praise": item['praise'],
            "link": item['link'],
            "commentNum": item['commentNum']
        }
        try:
            self.es.create(index=self.index, doc_type=self.type, id=item["id"], body=body)
        except:
            pass

    # update a document
    def updateDocument(self, item):
        parm = {
            "doc": {
                "readNum": item['readNum'],
                "praise": item['praise']
            }
        }
        try:
            self.es.update(index=self.index, doc_type=self.type, id=item["id"], body=parm)
        except:
            pass

    # check whether the document already exists
    def checkDocumentExists(self, item):
        try:
            self.es.get(index=self.index, doc_type=self.type, id=item["id"])
            return True
        except:
            return False
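As mentioned above, the pipeline only runs if it is registered in settings.py. A minimal sketch; the module path scrapyDemo.pipelines is an assumption based on the default project layout, so adjust it to wherever ArticlePipelines actually lives:

# settings.py (excerpt)
ITEM_PIPELINES = {
    # the number controls execution order: lower values run earlier
    'scrapyDemo.pipelines.ArticlePipelines': 300,
}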
Step 4: Run the code and check the result
- Use scrapy list to view all local crawlers:
liaochengdeMacBook-Pro:scrapyDemo liaocheng$ scrapy list
demo
liaochengdeMacBook-Pro:scrapyDemo liaocheng$
- Use scrapy crawl demo to run the crawler:
scrapy crawl demo
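If you would rather start the crawl from Python than from the command line, here is a minimal sketch using Scrapy's CrawlerProcess; it assumes you run it from the project root so that the project settings (and therefore the pipeline) are picked up:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# load the project settings (including ITEM_PIPELINES) and run the spider by its name
process = CrawlerProcess(get_project_settings())
process.crawl("demo")
process.start()  # blocks until the crawl is finished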
- Check the crawled data inside Kibana by executing the following query:
GET /article/_search
{
"query": {
"match_all": {}
}
}
The response looks something like this:
{
"took": 7,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 1,
"hits": [
{
"_index": "article2",
"_type": "type",
"_id": "5c790b4b51882545194f84f0",
"_score": 1,
"_source": {}
}
]
}
}
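The same check can be done from Python using the elasticsearch client that the pipeline already uses; a minimal sketch, assuming Elasticsearch is on localhost:9200 and the index is named article:

from elasticsearch import Elasticsearch

es = Elasticsearch(hosts="localhost:9200")
# pull a few documents from the article index to confirm the crawl actually stored data
result = es.search(index="article", body={"query": {"match_all": {}}, "size": 5})
print("total hits:", result["hits"]["total"])
for hit in result["hits"]["hits"]:
    print(hit["_id"], hit["_source"].get("title"))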