Preface

Many students ask, “Why does my graduation project keep failing? Why is my GPA so low?” Usually it is because the project is too crude, or its features are so simple that it looks like you did not put much effort into it. Sound familiar?

In this case study, your senior shares some hard-won experience and walks you step by step through building a distributed crawler system. Through this project, you will learn the following:

  1. Architecture design: how to design a general-purpose crawler so that one system can crawl any website.
  2. Distributed development experience: how to make sure the code runs correctly when deployed across multiple nodes.
  3. Multithreaded development experience: the project makes extensive use of the java.util.concurrent package (threads, thread pools, locks), showing how multithreading is used in real business scenarios, which is quite different from the usual multithreading demos you write.

A solemn note: the purpose of this project is to learn and share technology. The crawler targets are publicly accessible websites, and the crawl speed and frequency are strictly limited (a single thread per crawl, a 1-second pause after each page, and at most 100 pages per task), so the normal operation of the target sites is not affected.

If you want the source code, documents and video tutorials, scan the QR code and add me on WeChat. A quick plug: your senior has many years of experience at a major (BAT) tech company and has run both campus and experienced-hire interviews; if you are interested, message me directly and I can provide resume guidance and internal referrals.

The project architecture

The crawler component

The above is the classic architecture diagram of a crawler system; the responsibilities of each component are briefly described below:

  • Spiders: each spider handles one particular website and is responsible for extracting data from it
  • Scheduler: accepts requests from the engine and hands them back out later when the engine asks for them
  • Downloader: performs the network download described by a Request object
  • Pipeline: stores the item data extracted by the spider; here we persist it to MySQL

Finally, let's walk through the crawl process together (a minimal sketch of this loop follows the list):

  1. Start by creating a spider crawler task, which has an entry URL from which the spider begins to crawl
  2. Call the Downloader component to perform the HTTP request to download the entire page
  3. The spider parses the contents of the page, putting the required content into items and placing the page's child URLs into the Scheduler component
  4. Pipeline is responsible for persisting the data in item
  5. The child URLs placed into the Scheduler component are deduplicated to avoid repeated crawls
  6. After finishing the current page, the spider fetches the next URL from the Scheduler. If there is one, it keeps crawling; if there is none, every page has been crawled and the spider task is finished
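
To make the flow above concrete, here is a minimal single-node sketch of that loop. The class and method names (SpiderSketch, Scheduler, Pipeline) are illustrative assumptions rather than the project's actual API; the sketch only shows how the components hand work to each other and how the crawl limits from the note above are applied.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Minimal sketch of the crawl loop described above (all names are illustrative).
public class SpiderSketch {

    interface Scheduler {                        // hands out URLs, deduplicates new ones
        String poll();
        void push(String url);
    }

    interface Pipeline {                         // persists extracted items (e.g. into MySQL)
        void save(String title, String url);
    }

    private static final int MAX_PAGES = 100;    // crawl at most 100 pages
    private static final long DELAY_MS = 1000;   // pause 1 second after each page

    public void run(Scheduler scheduler, Pipeline pipeline) throws Exception {
        int crawled = 0;
        String url;
        while (crawled < MAX_PAGES && (url = scheduler.poll()) != null) {
            // Downloader: perform the HTTP request and download the whole page
            Document page = Jsoup.connect(url).get();

            // Spider: extract the required content into an item and persist it
            pipeline.save(page.title(), url);

            // Put the page's child URLs back into the Scheduler for deduplication
            page.select("a[href]").forEach(a -> scheduler.push(a.absUrl("href")));

            crawled++;
            Thread.sleep(DELAY_MS);              // stay polite to the target site
        }
    }
}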

Distributed crawler

What is a distributed crawler? In plain English: multiple crawler modules are deployed and crawl together. Looking at the architecture diagram above, only the Scheduler module needs to be reimplemented on top of Redis: every spider takes URLs from Redis and pushes any new child URLs it finds back into Redis. With that change, the architecture already supports distributed crawling. (There are many details in the code; space is limited, so I will not elaborate on them here.)
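
As a rough illustration of that idea, here is a sketch of a Scheduler built on Redis with the Jedis client: a Redis set handles URL deduplication and a Redis list acts as the shared work queue, so any number of crawler nodes can push and poll concurrently. The key names and the Jedis usage are assumptions for illustration, not the project's actual code.

import redis.clients.jedis.Jedis;
import redis.clients.jedis.JedisPool;

// Sketch of a Redis-backed Scheduler shared by all crawler nodes (illustrative only).
public class RedisScheduler {

    private static final String QUEUE_KEY = "spider:queue"; // pending URLs (Redis list)
    private static final String SEEN_KEY  = "spider:seen";  // already-seen URLs (Redis set)

    private final JedisPool pool;

    public RedisScheduler(JedisPool pool) {
        this.pool = pool;
    }

    /** Add a URL only if no node has seen it before. */
    public void push(String url) {
        try (Jedis jedis = pool.getResource()) {
            // SADD returns 1 only for a brand-new member, which gives deduplication for free
            if (jedis.sadd(SEEN_KEY, url) == 1) {
                jedis.lpush(QUEUE_KEY, url);
            }
        }
    }

    /** Take the next URL to crawl, or null if the queue is currently empty. */
    public String poll() {
        try (Jedis jedis = pool.getResource()) {
            return jedis.rpop(QUEUE_KEY);
        }
    }
}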

Custom crawler

Websites whose static pages are built for SEO and can be indexed directly by search engines, and that have no anti-crawling measures, can be handled with a generic crawl strategy: issue a plain HTTP request, then parse out the text, images and other resources. Other sites use anti-crawling techniques such as paginated data, dynamically rendered pages, request-header checks, and blocking of high-frequency IPs. For these, the crawler needs some site-specific logic, so the architecture provides a custom template entry point: by writing a template for a specific site in code, most anti-crawling rules can be worked around, and a single crawler system can cover the vast majority of websites.
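
One way to picture that extension point is a small template interface like the sketch below (the real project code may look different): the generic crawler covers plain static pages, while a per-site template overrides whatever the site's anti-crawling measures require, such as extra request headers or pagination.

import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import java.util.List;

// Hypothetical hook for site-specific crawl strategies (names are illustrative).
public interface CrawlerTemplate {

    /** Build the request, e.g. add the headers or cookies a particular site expects. */
    default Connection buildRequest(String url) {
        return Jsoup.connect(url).userAgent("Mozilla/5.0").timeout(10000);
    }

    /** Turn one downloaded page into the next URLs to visit (pagination, detail links, ...). */
    List<String> extractNextUrls(Document page);
}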

The domain model

  • DO (Data Object): corresponds one-to-one with the database table structure and carries data upward from the DAO layer

  • BO (Business Object): an object produced by the Service layer that encapsulates business logic

  • VO (View Object): a display-layer object, usually passed by the web layer to the template rendering engine

    The BO and VO models are further split into BoRequest (input model), BoResponse (output model), VoRequest (input model) and VoResponse (output model).
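
As a toy example of that layering (class and field names are made up for illustration), a crawler task might travel through the layers like this:

// DO: maps 1:1 to the crawler task table and is returned by the DAO layer
class TaskDO {
    Long id;
    String entryUrl;
    Integer status;          // stored as a numeric code in MySQL
    // getters and setters omitted
}

// BoResponse: output of the Service layer after business logic has been applied
class TaskBoResponse {
    Long id;
    String entryUrl;
    String statusText;       // the numeric code already translated into readable text
    // getters and setters omitted
}

// VoRequest / VoResponse: what the web layer accepts from and returns to the front end
class TaskVoRequest {
    String entryUrl;         // the entry URL the user fills in when creating a task
    String templateName;     // optional custom template, e.g. "BingTemplate"
    // getters and setters omitted
}

class TaskVoResponse {
    String entryUrl;
    String statusText;       // only the fields the page actually displays
    // getters and setters omitted
}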

Technology stack

Front end: Vue + Element

Back end: JDK 1.8 + Spring Boot + Redis + MySQL + jsoup + HttpClient

Permissions: Spring Security + Spring Session

Interface design

The project's interfaces follow the RESTful style that is currently popular in industry, and every interface and parameter has detailed documentation. Enterprise development is always team work with the front end and back end split, so the interfaces must be defined up front for both sides to develop in parallel. The same applies to interfaces exposed to other teams: if the team next door wants to call your service and both schedules land in the same week, you agree on the interface definition first and then develop simultaneously.
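
For example, the task interfaces might be agreed on in a shape like the following sketch. The paths and the DTOs (the TaskVoRequest/TaskVoResponse classes sketched in the domain-model section) are illustrative assumptions, not the project's real contract, which lives in its interface documentation.

import org.springframework.web.bind.annotation.*;
import java.util.Collections;
import java.util.List;

// Illustrative RESTful task endpoints; the service wiring is omitted.
@RestController
@RequestMapping("/api/tasks")
class TaskController {

    @GetMapping                              // GET /api/tasks -> query crawler tasks
    public List<TaskVoResponse> list(@RequestParam(defaultValue = "1") int page) {
        return Collections.emptyList();      // would delegate to the task service
    }

    @PostMapping                             // POST /api/tasks -> create a crawler task
    public TaskVoResponse create(@RequestBody TaskVoRequest request) {
        return new TaskVoResponse();         // would delegate to the task service
    }

    @DeleteMapping("/{id}")                  // DELETE /api/tasks/{id} -> delete a task
    public void delete(@PathVariable Long id) {
        // would delegate to the task service
    }
}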

Running effect

System login

Dashboard

Real-time statistics of system data

Task management

The page menu and the “Query”, “Create”, “Edit” and “Delete” buttons all support separate permission assignment. The examples shown here are crawl tasks such as “Crawl Baidu news”, “Crawl Bing wallpapers”, “Crawl Dangdang book information” and “Crawl Sina news”.

Creating a crawler task

Crawling Bing wallpapers

A lot of people use Bing search just because they like its HD wallpapers (your senior included). Here is a crawler for Bing wallpapers. Because the wallpapers are paginated, this is where our custom template feature comes in: by writing a BingTemplate, we can crawl the paginated data easily.
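
A pagination template for a site like this might look roughly as follows. The paging URL pattern and the CSS selector are placeholders, since the real ones depend on the target page's structure; the crawl limits match the policy stated in the preface.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import java.util.ArrayList;
import java.util.List;

// Rough sketch of a paging template: walk the pages, collect image URLs, stay polite.
public class BingTemplateSketch {

    private static final String PAGE_URL = "https://example.com/wallpapers?page=%d"; // placeholder
    private static final int MAX_PAGES = 100;

    public List<String> crawl() throws Exception {
        List<String> imageUrls = new ArrayList<>();
        for (int page = 1; page <= MAX_PAGES; page++) {
            Document doc = Jsoup.connect(String.format(PAGE_URL, page)).get();
            // The selector depends on the real page structure; "img[src]" is a placeholder
            doc.select("img[src]").forEach(img -> imageUrls.add(img.absUrl("src")));
            Thread.sleep(1000);              // one page per second, per the crawl policy
        }
        return imageUrls;
    }
}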

Crawling politely, we fetched only 100 pages of data. The pictures from each page are shown in the resource details; they are beautiful, and you can click to enlarge and download them.

Crawling Dangdang

To make the content more realistic, I extracted some book information from Dangdang, such as title, author, publisher, price and introduction. (If you are interested, you can also review the column's design and implementation of the library management system.)

Create a DangDangTemplate crawler, give it a single URL, and our custom crawl journey begins.

Crawling Baidu and Sina news

Baidu news and Sina news are relatively easy to crawl and need no template: create a task directly, fill in a single URL, and the crawl starts immediately.

Resource management

All the crawled data can be queried in the resource management interface. Click “Resource Details” to see the specific text and picture content, which can be enlarged, downloaded, or viewed as a slide show

Template management

As mentioned above, sites with anti-crawling measures, as well as any site that needs custom logic, can be handled by developing a crawler template. This design keeps the system highly extensible: in effect, one crawler system can crawl almost any content.

Log management

Log management is enabled for the administrator by default. All operations in the system are recorded, facilitating troubleshooting when the system is abnormal.

User management

By default, only the administrator has the permission to manage the user menu, and can create/edit users, assign user roles, and disable/enable users

Edit user information

Users with account editing permission can edit accounts

Role management

By default, only administrators have role management. Permissions here are fine-grained down to the button level, and each button's permission can be managed individually. Suppose a user is granted only the “Query” permission for tasks, but that user is a programmer and calls the task-edit interface directly, bypassing the page. The back end performs its own permission check and returns an “unauthorized” error code, and the front end then redirects to a 403 page based on that code. This is why front-end-only validation is unsafe: the back end must check as well. Real enterprise development works the same way, so juniors who have not done real-world development yet should jot this down in their notebooks, haha.

Permission design

Permissions are implemented with Spring Security and Spring Session. Permissions break down into authentication and authorization. Authentication is essentially login: when a user logs in, the account and password are verified. Authorization is whether a user may access a given back-end resource. Every new user is assigned a role after creation; a role is simply a set of permissions, which can be understood as the right to access particular back-end interfaces (resources).

The permission design is very flexible and fine-grained down to the button level. For example, the course menu has add, delete, edit and query actions; a student might be allowed only to query courses, not to add or modify them. Even if the student bypasses the UI and calls the back-end edit or create interface directly, the back end still returns an authorization failure, because every protected back-end interface is marked with a permission tag and only users holding that resource permission can access it.
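
For instance, a button-level permission can be enforced on the back-end interface roughly like this with Spring Security's method security (assuming it is enabled); the permission codes and class name here are illustrative, not the project's actual values.

import org.springframework.security.access.prepost.PreAuthorize;
import org.springframework.web.bind.annotation.*;

// Illustrative button-level permission tags on back-end interfaces.
@RestController
@RequestMapping("/api/courses")
public class CourseController {

    @PreAuthorize("hasAuthority('course:query')")   // a student may hold only this permission
    @GetMapping
    public String query() {
        return "course list";
    }

    @PreAuthorize("hasAuthority('course:edit')")    // calling this without the permission fails authorization
    @PutMapping("/{id}")
    public String edit(@PathVariable Long id) {
        return "updated " + id;
    }
}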

Logging solution

Logging uses Lombok's annotation + SLF4J + Log4j2, with per-environment log configuration driven by the active profile, because each environment has a different logging policy. In development I want logs printed to the console at debug level for local debugging. In the test environment logs may go to a file. In production logs may go to a file and also be sent to Kafka and collected into Elasticsearch, so that when many machines are deployed we no longer have to log in to each one to check logs: everything is collected in ES, and a search in Kibana is enough, which is very convenient. Kafka + ES + Kibana is one of the most popular logging stacks at Internet companies. If you get the chance, set up Kafka, ES and Kibana yourself; adding a few lines to the configuration file turns this into an enterprise-grade logging solution (by default it outputs to a log file).
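
On the code side, the Lombok + SLF4J part is just an annotation; a minimal example (the class name is made up for illustration):

import lombok.extern.slf4j.Slf4j;

// Lombok's @Slf4j generates a static SLF4J "log" field; Log4j2 decides where the output goes.
@Slf4j
public class CrawlService {

    public void crawl(String url) {
        log.debug("fetching {}", url);                 // visible in dev, where the level is debug
        try {
            // ... download and parse the page ...
        } catch (Exception e) {
            log.error("crawl failed for {}", url, e);  // routed to file/Kafka depending on the profile
        }
    }
}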

Below are some of the key configuration items. If you want to send logs to Kafka, you only need to add the corresponding appender configuration inside the Appenders tag.

<Configuration status="WARN"  xmlns:xi="http://www.w3.org/2001/XInclude">
    <Properties>
        <Property name="LOG_FILE">system.log</Property>
        <Property name="LOG_PATH">./logs</Property>
        <Property name="PID">????</Property>
        <Property name="LOG_EXCEPTION_CONVERSION_WORD">%xwEx</Property>
        <Property name="LOG_LEVEL_PATTERN">%5p</Property>
        <Property name="LOG_DATE_FORMAT_PATTERN">yyyy-MM-dd HH:mm:ss.SSS</Property>
        <Property name="CONSOLE_LOG_PATTERN">%clr{%d{${LOG_DATE_FORMAT_PATTERN}}}{faint} %clr{${LOG_LEVEL_PATTERN}} %clr{${sys:PID}}{magenta} %clr{---}{faint} % the CLR {[% 15.15 t]} {abbreviation} % CLR {% - 40.40 - c {1}} {cyan} % CLR {that} {abbreviation} % m % n ${sys: LOG_EXCEPTION_CONVERSION_WORD}</Property>
        <Property name="FILE_LOG_PATTERN">% d {${LOG_DATE_FORMAT_PATTERN}} ${LOG_LEVEL_PATTERN} ${sys: PID} [t] % % - 40.40 c {1} : % L: %m%n${sys:LOG_EXCEPTION_CONVERSION_WORD}</Property>
    </Properties>
    <Appenders>
        <xi:include href="log4j2/file-appender.xml"/>
    </Appenders>
    <Loggers>
        <logger name="com.senior.book" level="info"/>
        <Root level="info">
            <AppenderRef ref="FileAppender"/>
        </Root>
    </Loggers>
</Configuration>

Service monitoring

Service monitoring is implemented with Actuator + Prometheus + Grafana, and the code intrusion is minimal: only the dependencies below need to be added to the POM. You can build the Grafana data dashboard yourself or download a ready-made template from the dashboard marketplace; in short, it all comes down to hands-on effort, so have fun playing with it.

        <!-- Service monitoring -->
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-actuator</artifactId>
        </dependency>
        <dependency>
            <groupId>io.micrometer</groupId>
            <artifactId>micrometer-registry-prometheus</artifactId>
        </dependency>
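
Beyond the dependencies, the Prometheus endpoint still has to be exposed. A minimal fragment for application.properties might look like this; the property names are standard Spring Boot Actuator/Micrometer settings, and the values are only an example.

# Expose the Prometheus scrape endpoint alongside health and info
management.endpoints.web.exposure.include=health,info,prometheus
# Tag all metrics with the application name so Grafana can filter by service
management.metrics.tags.application=crawler-system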