Preface

The text and images in this article come from the internet and are for learning and exchange purposes only; they have no commercial use. If you have any questions, please contact us promptly so they can be dealt with.

This article uses the Python crawler framework Scrapy to collect data from the Douban Top250 movie list.

Basic development environment

  • Python 3.6
  • PyCharm

How to install scrapy

Scrapy can be installed from the command line with pip install scrapy, but installs from the default PyPI index often time out.

It is recommended to switch to a commonly used domestic mirror: pip install -i <mirror address> <package name>

For example:

pip install -i  https://mirrors.aliyun.com/pypi/simple/ scrapy

Commonly used domestic mirror addresses:

  • Tsinghua University: https://pypi.tuna.tsinghua.edu.cn/simple
  • Alibaba Cloud: http://mirrors.aliyun.com/pypi/simple/
  • University of Science and Technology of China: https://pypi.mirrors.ustc.edu.cn/simple/
  • Huazhong University of Science and Technology: http://pypi.hustunique.com/
  • Shandong University of Technology: http://pypi.sdutlinux.org/
  • Douban: http://pypi.douban.com/simple/
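If you do not want to pass -i every time, a reasonably recent pip can also set a mirror as the default index. A minimal example, assuming the Aliyun mirror:

pip config set global.index-url https://mirrors.aliyun.com/pypi/simple/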

Errors you may encounter:

When installing Scrapy on Windows, a Visual C++ build error may appear because one of Scrapy's dependencies fails to compile. In that case you can install an offline, pre-built package (a wheel) for the module that fails, instead of building it locally.
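As a rough illustration (the exact wheel file name depends on your Python version and platform, so treat the name below as a placeholder), if Twisted is the dependency that fails to build, you could download its pre-built wheel in advance and install it directly:

pip install Twisted-20.3.0-cp36-cp36m-win_amd64.whl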

How does Scrapy crawl site data

This article uses the Douban Top250 data as an example to explain the basic process of crawling data with the Scrapy framework.

It is a static website whose page structure is very easy to parse, which is why many introductory crawler examples are based on Douban movie data or Maoyan movie data.

Create a crawler project for Scrapy

1. Create a crawler project

Open the Terminal tab in PyCharm (the Local shell) and run:

scrapy startproject douban

2. Use cd to switch into the newly created project directory (cd douban)

3. Create a crawler file

scrapy genspider douban_info douban.com (the format is scrapy genspider <spider name, which must be unique> <domain to crawl>)

This completes the creation of the Scrapy project and its crawler file.
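After these steps, the generated project should have the standard layout Scrapy produces, roughly as follows (the file names match the ones edited in the rest of this article):

douban/
    scrapy.cfg
    douban/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            douban_info.py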

Scrapy crawler code

1. In settings.py, disable obeying the robots protocol; ROBOTSTXT_OBEY defaults to True.
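The relevant line in settings.py (it also appears in the full settings file shown later in this article):

ROBOTSTXT_OBEY = False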

2. Modify the starting URL in the crawler file

start_urls = ['https://movie.douban.com/top250?filter=']

Change start_urls to the Douban Top250 list URL, i.e. the URL of the first page of data you want to crawl.

3. Write the business logic for crawling and parsing the data

The crawler code is as follows:

douban_info.py

import scrapy

from ..items import DoubanItem


class DoubanInfoSpider(scrapy.Spider):
    name = 'douban_info'
    allowed_domains = ['douban.com']
    start_urls = ['https://movie.douban.com/top250?start=0&filter=']

    def parse(self, response):
        # each li in the grid is one movie entry
        lis = response.css('.grid_view li')
        print(lis)
        for li in lis:
            title = li.css('.hd span:nth-child(1)::text').get()
            movie_info = li.css('.bd p::text').getall()
            info = ''.join(movie_info).strip()
            score = li.css('.rating_num::text').get()
            number = li.css('.star span:nth-child(4)::text').get()
            summary = li.css('.inq::text').get()
            print(title)
            yield DoubanItem(title=title, info=info, score=score, number=number, summary=summary)

        # follow the "next page" link until there is none
        href = response.css('#content .next a::attr(href)').get()
        if href:
            next_url = 'https://movie.douban.com/top250' + href
            yield scrapy.Request(url=next_url, callback=self.parse)

items.py

import scrapy




class DoubanItem(scrapy.Item):
    # define the fields for your item here like:
    title = scrapy.Field()
    info = scrapy.Field()
    score = scrapy.Field()
    number = scrapy.Field()
    summary = scrapy.Field()

middlewares.py

import requests
import faker


def get_cookies():
    """Request fresh cookies from Douban."""
    headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36'
    }
    response = requests.get(url='https://movie.douban.com/top250?start=0&filter=', headers=headers)
    return response.cookies.get_dict()


def get_proxies():
    """Request a proxy from the local proxy pool."""
    proxy_data = requests.get(url="http://127.0.0.1:5000/get/").json()
    return proxy_data['proxy']


class HeadersDownloaderMiddleware:
    """Headers middleware: set a random user-agent on every request."""

    def process_request(self, request, spider):
        fake = faker.Faker()
        # request.headers sets the request headers and behaves like a dictionary
        request.headers.update(
            {
                'user-agent': fake.user_agent(),
            }
        )
        return None


class CookieDownloaderMiddleware:
    """Cookie middleware: attach cookies to every request."""

    def process_request(self, request, spider):
        # request.cookies sets the cookies of the request and is a dictionary;
        # get_cookies() is called to obtain the cookie values
        request.cookies.update(get_cookies())
        return None


class ProxyDownloaderMiddleware:
    """Proxy middleware: route every request through a proxy IP."""

    def process_request(self, request, spider):
        # request.meta['proxy'] sets the proxy used for this request
        request.meta['proxy'] = get_proxies()
        return None
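Note that the settings file shown later only enables HeadersDownloaderMiddleware. If you also want the cookie and proxy middlewares to take effect, they would need their own entries in DOWNLOADER_MIDDLEWARES, for example (the priority numbers here are only illustrative):

DOWNLOADER_MIDDLEWARES = {
   'douban.middlewares.HeadersDownloaderMiddleware': 543,
   'douban.middlewares.CookieDownloaderMiddleware': 544,
   'douban.middlewares.ProxyDownloaderMiddleware': 545,
}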

pipelines.py

import csv




class DoubanPipeline:
    def __init__(self):
        self.file = open('douban.csv', mode='a', encoding='utf-8', newline='')
        self.csv_file = csv.DictWriter(self.file, fieldnames=['title', 'info', 'score', 'number', 'summary'])
        self.csv_file.writeheader()


    def process_item(self, item, spider):
        dit = dict(item)
        dit['info'] = dit['info'].replace('\n', "").strip()
        self.csv_file.writerow(dit)
        return item




    def close_spider(self, spider) -> None:
        # called automatically by Scrapy when the spider finishes
        self.file.close()

settings.py

# Scrapy settings for douban project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'douban'

SPIDER_MODULES = ['douban.spiders']
NEWSPIDER_MODULE = 'douban.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'douban (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'douban.middlewares.DoubanSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
   'douban.middlewares.HeadersDownloaderMiddleware': 543,
}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
   'douban.pipelines.DoubanPipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

4. Run the crawler

Enter the command scrapy crawl followed by the spider name (the name defined in the crawler file).
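For the spider created above, that is:

scrapy crawl douban_info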