Introduction

This article walks through the pits I fell into, step by step, while writing a Python crawler, and how I slowly climbed back out. I will describe every problem I hit in detail, so that when we run into them later we can quickly identify the source. The code below only shows the core pieces; the full, detailed code is not posted for now.

Process overview

At first I wanted to crawl all the articles on a website, but I had never written a crawler before (and did not know which language would be most convenient for it), so I decided to write one in Python (after all, even its name hints at crawling 😄). My plan is to crawl all the data from the site into ElasticSearch. I chose ElasticSearch because it is fast and has word-segmentation plugins and an inverted index, so query efficiency will be very good once the data is needed (after all, there is a lot to crawl 😄). Then I will use Kibana, ElasticSearch's companion tool, to visualize the data and analyze the content of these articles. You can take a look at the expected visualization effect (above); that picture is the sample dashboard that Kibana 6.4 ships with (meaning it can do this, and I want to do this too 😁). I will post a dockerfile later (not yet 😳).

Environment requirements

  1. JDK (required by Elasticsearch)
  2. ElasticSearch (used to store the data)
  3. Kibana (data visualization for ElasticSearch)
  4. Python (to write the crawler)
  5. Redis (for URL deduplication)

You can find the appropriate tutorials to install these yourself; I only have an installation tutorial for ElasticSearch 😢.
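Once Elasticsearch and Redis are installed, a quick connectivity check can save debugging time later. This is a sketch of my own, not part of the original article, and it assumes both services run locally on their default ports (9200 and 6379):

# check_env.py - verify Elasticsearch and Redis are reachable on default local ports
from elasticsearch import Elasticsearch
import redis

es = Elasticsearch(hosts="localhost:9200")
print("Elasticsearch reachable:", es.ping())   # True if the cluster answers

r = redis.Redis(host="localhost", port=6379)
print("Redis reachable:", r.ping())            # True if Redis answers PING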

Step 1: Use Python's pip to install the required packages (the first pit is here)

  1. tomd: converts HTML to Markdown
pip3 install tomd
  2. redis: the Redis client library for Python
pip3 install redis
  3. scrapy: the crawler framework (here is the pit)
    1. First I installed it the same way as the packages above
    pip3 install scrapy
    2. It failed because the gcc component was missing: error: command 'gcc' failed with exit status 1
    3. Then I searched, searched and searched, and finally found the right solution (trying plenty of wrong answers along the way 😭). The ultimate fix is to use yum to install python34-devel. The python34-devel package depends on your Python version; it may be called python-devel or another python3x-devel, so change the 34 to match your version (mine is 3.4.6).
    yum install python34-devel
    4. Once the installation completes, try scrapy again; a quick import check is sketched after this list.
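To confirm that all three packages installed correctly, a small check like the following can be run (my own sketch, not from the original write-up):

# verify_install.py - confirm the required packages can be imported
import scrapy
import redis
import tomd

print("scrapy", scrapy.__version__)
print("redis", redis.__version__)
print("tomd imported OK")  # tomd may not expose a version attribute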

Step 2: Use scrapy to create your project

  1. Type the command scrapy startproject scrapyDemo to create a crawler project
liaochengdeMacBook-Pro:scrapy liaocheng$ scrapy startproject scrapyDemo
New Scrapy project 'scrapyDemo', using template directory '/usr/local/lib/python3.7/site-packages/scrapy/templates/project', created in:
    /Users/liaocheng/script/scrapy/scrapyDemo

You can start your first spider with:
    cd scrapyDemo
    scrapy genspider example example.com
liaochengdeMacBook-Pro:scrapy liaocheng$
  2. Use genspider to generate a basic spider with the command scrapy genspider demo juejin.im. The last argument is the website you want to crawl; let's start by crawling our own home, juejin 😂
liaochengdeMacBook-Pro:scrapy liaocheng$ scrapy genspider demo juejin.im
Created spider 'demo' using template 'basic'
liaochengdeMacBook-Pro:scrapy liaocheng$ 
  3. View the generated directory structure
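The listing itself is not shown in the article; for reference, a project freshly generated by scrapy startproject plus genspider typically looks like this:

scrapyDemo/
├── scrapy.cfg                # deploy configuration
└── scrapyDemo/
    ├── __init__.py
    ├── items.py              # item definitions (the ArticleItem below goes here)
    ├── middlewares.py
    ├── pipelines.py          # pipelines (the ArticlePipelines below goes here)
    ├── settings.py           # project settings (pipeline registration, etc.)
    └── spiders/
        ├── __init__.py
        └── demo.py           # the spider generated by genspider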

Step 3: Open the project and start coding

  1. View the contents of the generated demo.py
# -*- coding: utf-8 -*-
import scrapy


class DemoSpider(scrapy.Spider):
    name = 'demo'  # the name of the crawler
    allowed_domains = ['juejin.im']  # domain filter: only crawl content under this domain
    start_urls = ['https://juejin.cn/post/6844903785584672776']  # initial URL

    def parse(self, response):  # every new spider must implement this method
        pass
  2. Alternatively, you can use a second approach: drop start_urls and override start_requests to yield the initial URLs yourself
# -*- coding: utf-8 -*-
import scrapy


class DemoSpider(scrapy.Spider):
    name = 'demo'  # the name of the crawler
    allowed_domains = ['juejin.im']  # domain filter: only crawl content under this domain

    def start_requests(self):
        start_urls = ['https://juejin.cn']  # initial URLs
        for url in start_urls:
            # hand each URL to parse
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):  # every new spider must implement this method
        pass
  3. Write the articleItem.py file (an item file is just like an entity class in Java); a short usage example follows the code
import scrapy

class ArticleItem(scrapy.Item):  # must inherit from scrapy.Item
    # article id
    id = scrapy.Field()

    # article title
    title = scrapy.Field()

    # article content
    content = scrapy.Field()

    # author
    author = scrapy.Field()

    # publish time
    createTime = scrapy.Field()

    # read count
    readNum = scrapy.Field()

    # likes
    praise = scrapy.Field()

    # avatar
    photo = scrapy.Field()

    # comment count
    commentNum = scrapy.Field()

    # article link
    link = scrapy.Field()
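As a quick illustration of the "entity class" comparison (my own snippet, not from the original project; the import path scrapyDemo.items is an assumption), an item is filled and read just like a dict:

from scrapyDemo.items import ArticleItem  # hypothetical path, adjust to your project

article = ArticleItem()
article['title'] = 'hello scrapy'
article['readNum'] = 100
print(article['title'], article['readNum'])
print(dict(article))  # scrapy items convert cleanly to plain dicts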
  4. Write the parse method
    def parse(self, response):
        # get all the URLs on the page
        nextPage = response.css("a::attr(href)").extract()
        # traverse all URL links on the page, time complexity is O(n)
        for i in nextPage:
            if i is not None:
                # join the link with the current page URL
                url = response.urljoin(i)
                # only follow juejin links
                if "juejin.im" in str(url):
                    # save to redis; if it can be saved, it is a link we have not crawled yet
                    if self.insertRedis(url) == True:
                        # dont_filter=False keeps scrapy's built-in duplicate filter enabled
                        yield scrapy.Request(url=url, callback=self.parse, headers=self.headers, dont_filter=False)

        # we only analyze articles, nothing else
        if "/post/" in response.url and "#comment" not in response.url:
            # create the ArticleItem we just wrote
            article = ArticleItem()

            # use the article id as id
            article['id'] = str(response.url).split("/")[-1]

            # title
            article['title'] = response.css("#juejin > div.view-container > main > div > div.main-area.article-area.shadow > article > h1::text").extract_first()

            # content
            parameter = response.css("#juejin > div.view-container > main > div > div.main-area.article-area.shadow > article > div.article-content").extract_first()
            article['content'] = self.parseToMarkdown(parameter)

            # author
            article['author'] = response.css("#juejin > div.view-container > main > div > div.main-area.article-area.shadow > article > div:nth-child(6) > meta:nth-child(1)::attr(content)").extract_first()

            # publish time
            createTime = response.css("#juejin > div.view-container > main > div > div.main-area.article-area.shadow > article > div.author-info-block > div > div > time::text").extract_first()
            createTime = str(createTime).replace("年", "-").replace("月", "-").replace("日", "")
            article['createTime'] = createTime

            # read count
            article['readNum'] = int(str(response.css("#juejin > div.view-container > main > div > div.main-area.article-area.shadow > article > div.author-info-block > div > div > span::text").extract_first()).split(" ")[1])

            # likes
            article['praise'] = response.css("#juejin > div.view-container > main > div > div.article-suspended-panel.article-suspended-panel > div.like-btn.panel-btn.like-adjust.with-badge::attr(badge)").extract_first()

            # comment count
            article['commentNum'] = response.css("#juejin > div.view-container > main > div > div.article-suspended-panel.article-suspended-panel > div.comment-btn.panel-btn.comment-adjust.with-badge::attr(badge)").extract_first()

            # article link
            article['link'] = response.url

            # this yield is very important; without it the pipeline never receives the data
            yield article

    # convert the HTML content to markdown
    def parseToMarkdown(self, param):
        return tomd.Tomd(str(param)).markdown

    # store the url in redis; if it can be added, the link has not been crawled yet, otherwise it already exists
    def insertRedis(self, url):
        if self.redis != None:
            return self.redis.sadd("articleUrlList", url) == 1
        else:
            self.redis = self.redisConnection.getClient()
            return self.insertRedis(url)
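The spider above calls self.redisConnection.getClient(), which the article never shows, and parseToMarkdown() also needs import tomd at the top of the file. Below is a minimal sketch of what such a Redis helper might look like; it is my own assumption based only on the names used above (the class, file name, and defaults are hypothetical, with Redis assumed on localhost:6379):

# redisConnection.py - hypothetical helper behind self.redisConnection.getClient()
import redis

class RedisConnection(object):
    def __init__(self, host="localhost", port=6379, db=0):
        # share one connection pool across all requests
        self.pool = redis.ConnectionPool(host=host, port=port, db=db)

    def getClient(self):
        # returns a client whose sadd() call backs insertRedis() above
        return redis.Redis(connection_pool=self.pool)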
  5. Write the pipeline class. Everything the spider returns with yield is passed through the pipeline, but the pipeline has to be registered in the settings first (a sketch of that follows the code)
from elasticsearch import Elasticsearch

class ArticlePipelines(object):
    # initialization
    def __init__(self):
        # elasticsearch index
        self.index = "article"
        # elasticsearch type
        self.type = "type"
        # elasticsearch ip + port
        self.es = Elasticsearch(hosts="localhost:9200")

    # the method that must be implemented; it processes the data returned by yield
    def process_item(self, item, spider):

        # only handle data coming from our crawler
        if spider.name != "demo":
            return item

        result = self.checkDocumentExists(item)
        if result == False:
            self.createDocument(item)
        else:
            self.updateDocument(item)

    # add a document
    def createDocument(self, item):
        body = {
            "title": item['title'],
            "content": item['content'],
            "author": item['author'],
            "createTime": item['createTime'],
            "readNum": item['readNum'],
            "praise": item['praise'],
            "link": item['link'],
            "commentNum": item['commentNum']
        }
        try:
            self.es.create(index=self.index, doc_type=self.type, id=item["id"], body=body)
        except:
            pass

    # update a document
    def updateDocument(self, item):
        parm = {
            "doc": {
                "readNum": item['readNum'],
                "praise": item['praise']
            }
        }
        try:
            self.es.update(index=self.index, doc_type=self.type, id=item["id"], body=parm)
        except:
            pass

    # check whether the document already exists
    def checkDocumentExists(self, item):
        try:
            self.es.get(self.index, self.type, item["id"])
            return True
        except:
            return False
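The registration mentioned above lives in settings.py. A minimal sketch, assuming the project module is scrapyDemo and the class sits in pipelines.py:

# settings.py (excerpt) - register the pipeline so yielded items reach it
ITEM_PIPELINES = {
    # lower numbers run earlier; 300 is the conventional priority
    'scrapyDemo.pipelines.ArticlePipelines': 300,
}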

Step 4: Run the code to see the effect

  1. Use scrapy list to view all crawlers in the local project
liaochengdeMacBook-Pro:scrapyDemo liaocheng$ scrapy list
demo
liaochengdeMacBook-Pro:scrapyDemo liaocheng$ 
  2. Use scrapy crawl demo to run the crawler
 scrapy crawl demo
  3. Look at the crawled data in Kibana by executing the following query
GET /article/_search
{
  "query": {
    "match_all": {}
  }
}
The query returns something like this:
{
  "took": 7,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 1,
    "hits": [
      {
        "_index": "article2",
        "_type": "type",
        "_id": "5c790b4b51882545194f84f0",
        "_score": 1,
        "_source": {}
      }
    ]
 }
}