This is the 18th day of my participation in the November Gengwen Challenge. Check out the details: The Last Gengwen Challenge of 2021
Crawling the Xici proxy site (xicidaili) with Scrapy
1. Create the project
- scrapy startproject XcSpider
2. Create a spider
- scrapy genspider xcdl xicidaili.com
Mark the project folder as Sources Root (in PyCharm) so that imports of your own modules do not raise errors
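For reference, the skeleton that `scrapy genspider` generates looks roughly like the sketch below (the exact template varies slightly between Scrapy versions); it gets filled in with the real parsing logic in step 7.

```python
# Rough sketch of the spider skeleton produced by `scrapy genspider xcdl xicidaili.com`
import scrapy


class XcdlSpider(scrapy.Spider):
    name = 'xcdl'
    allowed_domains = ['xicidaili.com']
    start_urls = ['http://xicidaili.com/']

    def parse(self, response):
        pass
```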
3. Create a startup file main.py
from scrapy import cmdline
cmdline.execute('scrapy crawl xcdl'.split())
4. Overall tree structure of the project
tree /F (the /F switch lists the files in every directory, not just the directories)
XcSpider
│  items.py
│  main.py
│  middlewares.py
│  pipelines.py
│  settings.py
│  __init__.py
│
├───mysqlpipelines
│   │  pipelines.py
│   │  sql.py
│   │  __init__.py
│   │
│   └───__pycache__
│           pipelines.cpython-36.pyc
│           sql.cpython-36.pyc
│           __init__.cpython-36.pyc
│
├───spiders
│   │  xcdl.py
│   │  __init__.py
│   │
│   └───__pycache__
│           xcdl.cpython-36.pyc
│           __init__.cpython-36.pyc
│
└───__pycache__
        items.cpython-36.pyc
        pipelines.cpython-36.pyc
        settings.cpython-36.pyc
        __init__.cpython-36.pyc
5. Configure the settings.py file
- Configure the MySQL and MongoDB connection settings
- To avoid being blocked by anti-crawling measures, set DEFAULT_REQUEST_HEADERS and add a User-Agent
# -*- coding: utf-8 -*-
# Scrapy settings for XcSpider project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://docs.scrapy.org/en/latest/topics/settings.html
# https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'XcSpider'
SPIDER_MODULES = ['XcSpider.spiders']
NEWSPIDER_MODULE = 'XcSpider.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'XcSpider (+http://www.yourdomain.com)'
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32
# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (enabled by default)
#COOKIES_ENABLED = False
# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False
# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
  'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
  'Accept-Language': 'zh-CN,zh;q=0.9',
  'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
                'Chrome/80.0.3987.149 Safari/537.36',
}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
# 'XcSpider.middlewares.XcspiderSpiderMiddleware': 543,
#}
# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
# 'XcSpider.middlewares.XcspiderDownloaderMiddleware': 543,
#}
# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
#}
# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
# 'XcSpider.pipelines.XcspiderPipeline': 300,
    'XcSpider.mysqlpipelines.pipelines.XicidailiPipeline': 300,
    'XcSpider.pipelines.XcPipeline': 200,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
# AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
# enable log
LOG_FILE = 'xcdl.log'
LOG_LEVEL = 'ERROR'
LOG_ENABLED = True
# MySQL database configuration
MYSQL_HOST = '127.0.0.1'
MYSQL_USER = 'root'
MYSQL_PASSWORD = 'root'
MYSQL_PORT = 3306
MYSQL_DB = 'db_xici'
# MongoDB configuration
# MONGODB hostname
MONGODB_HOST = '127.0.0.1'
# MONGODB port number
MONGODB_PORT = 27017
# database name
MONGODB_DBNAME = 'XCDL'
# Collection (table) name where the data is stored
MONGODB_SHEETNAME = 'xicidaili'
6. The items.py file
- Define a field for each piece of data you want to crawl
# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
import scrapy
class XcspiderItem(scrapy.Item):
# define the fields for your item here like:
# name = scrapy.Field()
pass
class XiciDailiItem(scrapy.Item):
country = scrapy.Field()
ipaddress = scrapy.Field()
port = scrapy.Field()
serveraddr = scrapy.Field()
isanonymous = scrapy.Field()
type = scrapy.Field()
alivetime = scrapy.Field()
verificationtime = scrapy.Field()
7. xcdl.py
- Parses the page and extracts the required data
# -*- coding: utf-8 -*-
import scrapy
from XcSpider.items import XiciDailiItem
class XcdlSpider(scrapy.Spider):
    name = 'xcdl'
    allowed_domains = ['xicidaili.com']
    start_urls = ['https://www.xicidaili.com/']

    def parse(self, response):
        # print(response.body.decode('utf-8'))
        items_1 = response.xpath('//tr[@class="odd"]')
        items_2 = response.xpath('//tr[@class=""]')
        items = items_1 + items_2
        infos = XiciDailiItem()
        for item in items:
            # get the country image link
            countries = item.xpath('./td[@class="country"]/img/@src').extract()
            try:
                country = countries[0]
            except:
                country = 'None'
            # get ipaddress
            ipaddress = item.xpath('./td[2]/text()').extract()
            try:
                ipaddress = ipaddress[0]
            except:
                ipaddress = 'None'
            # get port
            port = item.xpath('./td[3]/text()').extract()
            try:
                port = port[0]
            except:
                port = 'None'
            # get serveraddr
            serveraddr = item.xpath('./td[4]/text()').extract()
            try:
                serveraddr = serveraddr[0]
            except:
                serveraddr = 'None'
            # get isanonymous
            isanonymous = item.xpath('./td[5]/text()').extract()
            try:
                isanonymous = isanonymous[0]
            except:
                isanonymous = 'None'
            # get type
            type = item.xpath('./td[6]/text()').extract()
            try:
                type = type[0]
            except:
                type = 'None'
            # get alive time
            alivetime = item.xpath('./td[7]/text()').extract()
            try:
                alivetime = alivetime[0]
            except:
                alivetime = 'None'
            # get verification time
            verificationtime = item.xpath('./td[8]/text()').extract()
            try:
                verificationtime = verificationtime[0]
            except:
                verificationtime = 'None'

            print(country, ipaddress, port, serveraddr, isanonymous, type, alivetime, verificationtime)

            infos['country'] = country
            infos['ipaddress'] = ipaddress
            infos['port'] = port
            infos['serveraddr'] = serveraddr
            infos['isanonymous'] = isanonymous
            infos['type'] = type
            infos['alivetime'] = alivetime
            infos['verificationtime'] = verificationtime

            yield infos
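The eight try/except blocks above all follow the same "take the first match or fall back to 'None'" pattern. As a minimal sketch (the helper name `first_or_none` is my own and not part of the original spider), the repetition could be factored out like this:

```python
def first_or_none(selector, xpath):
    """Return the first XPath match as text, or the string 'None' if nothing matched."""
    matches = selector.xpath(xpath).extract()
    return matches[0] if matches else 'None'

# Inside parse(), each field then becomes a one-liner, for example:
# ipaddress = first_or_none(item, './td[2]/text()')
# port = first_or_none(item, './td[3]/text()')
```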
8. pipelines.py
i. Save the data to the MongoDB database
- Write the following pipeline in pipelines.py
# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
import pymongo
from XcSpider import settings
class XcspiderPipeline(object):
    def process_item(self, item, spider):
        return item


class XcPipeline(object):
    def __init__(self):
        host = settings.MONGODB_HOST
        port = settings.MONGODB_PORT
        dbname = settings.MONGODB_DBNAME
        sheetname = settings.MONGODB_SHEETNAME
        # Create the MongoDB client connection
        client = pymongo.MongoClient(host=host, port=port)
        # Select the database
        mydb = client[dbname]
        # Collection where the data is stored
        self.post = mydb[sheetname]

    def process_item(self, item, spider):
        data = dict(item)
        self.post.insert_one(data)  # insert_one() replaces the deprecated insert()
        return item
ii. Save the data to the MySQL database
- Create a mysqlpipelines folder (Python package) under the project
- First, write an SQL template -> sql.py
# -*- coding: UTF-8 -*-
'''
=========================================
@Project -> File : Project -> sql
@IDE    : PyCharm
@Author : Ruochen
@Date   : 2020/4/3 12:53
@Desc   :
=========================================
'''
import pymysql
from XcSpider import settings
MYSQL_HOST = settings.MYSQL_HOST
MYSQL_USER = settings.MYSQL_USER
MYSQL_PASSWORD = settings.MYSQL_PASSWORD
MYSQL_PORT = settings.MYSQL_PORT
MYSQL_DB = settings.MYSQL_DB
db = pymysql.connect(user=MYSQL_USER, password=MYSQL_PASSWORD, host=MYSQL_HOST, port=MYSQL_PORT, database=MYSQL_DB, charset="utf8")
cursor = db.cursor()
class Sql(object):
    @classmethod
    def insert_db_xici(cls, country, ipaddress, port, serveraddr, isanonymous, type, alivetime, verificationtime):
sql = 'insert into xicidaili(country, ipaddress, port, serveraddr, isanonymous, type, alivetime, verificationtime)' \
' values (%(country)s, %(ipaddress)s, %(port)s, %(serveraddr)s, %(isanonymous)s, %(type)s, %(alivetime)s, %(verificationtime)s) '
value = {
'country': country,
'ipaddress': ipaddress,
'port': port,
'serveraddr': serveraddr,
'isanonymous': isanonymous,
            'type': type,
            'alivetime': alivetime,
'verificationtime': verificationtime,
}
try:
cursor.execute(sql, value)
db.commit()
except Exception as e:
print('Failed to insert ----', e)
db.rollback()
    # Deduplication check
    @classmethod
    def select_name(cls, ipaddress):
sql = "select exists(select 1 from xicidaili where ipaddress=%(ipaddress)s)"
value = {
'ipaddress': ipaddress
}
cursor.execute(sql, value)
return cursor.fetchall()[0]
- Then write the pipeline -> pipelines.py
# -*- coding: UTF-8 -*-
'''
=========================================
@Project -> File : Project -> pipelines
@IDE    : PyCharm
@Author : Ruochen
@Date   : 2020/4/3 12:53
@Desc   :
=========================================
'''
from XcSpider.items import XiciDailiItem
from .sql import Sql
class XicidailiPipeline(object):
    def process_item(self, item, spider):
if isinstance(item, XiciDailiItem):
ipaddress = item['ipaddress']
ret = Sql.select_name(ipaddress)
            if ret[0] == 1:
                print('IP: {} already exists ----'.format(ipaddress))
else:
country = item['country']
ipaddress = item['ipaddress']
port = item['port']
serveraddr = item['serveraddr']
isanonymous = item['isanonymous']
type = item['type']
alivetime = item['alivetime']
verificationtime = item['verificationtime']
Sql.insert_db_xici(country, ipaddress, port, serveraddr, isanonymous, type, alivetime, verificationtime)
9. Set up the pipelines in settings.py
- This was already added to the settings.py file above, but it is worth pointing out again
- One pipeline is for MySQL and the other is for MongoDB
- The priority values can be chosen freely (0-1000; items pass through lower-valued pipelines first)
- Both pipelines can be enabled at the same time, or each can be enabled on its own
# from XcSpider.mysqlpipelines.pipelines import XicidailiPipeline
ITEM_PIPELINES = {
# 'XcSpider.pipelines.XcspiderPipeline': 300,
    'XcSpider.mysqlpipelines.pipelines.XicidailiPipeline': 300,
    'XcSpider.pipelines.XcPipeline': 200,
}
10. Run the program
- Now we can start our crawler by running the main.py file
- You can then see the crawled data in the databases (a quick verification sketch follows the table structure below)
Note: the xicidaili table must already exist in MySQL; its structure (the output of show create table xicidaili) is:
Create Table: CREATE TABLE `xicidaili` (
`id` int(255) unsigned NOT NULL AUTO_INCREMENT,
`country` varchar(1000) NOT NULL,
`ipaddress` varchar(1000) NOT NULL,
`port` int(255) NOT NULL,
`serveraddr` varchar(50) NOT NULL,
`isanonymous` varchar(30) NOT NULL,
`type` varchar(30) NOT NULL,
`alivetime` varchar(30) NOT NULL,
`verificationtime` varchar(30) NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=64 DEFAULT CHARSET=utf8;
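As a quick sanity check (a minimal sketch, assuming the MySQL and MongoDB settings from settings.py above), you can count what landed in each store after a run:

```python
# Minimal verification sketch, assuming the MySQL/MongoDB settings shown above
import pymongo
import pymysql

# MongoDB: count the stored documents
client = pymongo.MongoClient('127.0.0.1', 27017)
collection = client['XCDL']['xicidaili']
print('MongoDB documents:', collection.count_documents({}))

# MySQL: count the rows written by the pipeline
db = pymysql.connect(user='root', password='root', host='127.0.0.1',
                     port=3306, database='db_xici', charset='utf8')
with db.cursor() as cursor:
    cursor.execute('SELECT COUNT(*) FROM xicidaili')
    print('MySQL rows:', cursor.fetchone()[0])
db.close()
```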
Run results
The MySQL database:
The MongoDB database:
Finally, welcome to follow my personal WeChat official account, "Little Ape Ruochen", for more IT techniques, practical knowledge, and hot news.