1. Server background
One 4-core server
2. Exploring the jieba word-segmentation problem
This case is mainly about tracking hot topics in players' daily chat records. Technically it is not difficult: just the usual pipeline of cleaning the players' messages, segmenting the words, and extracting topics. The real difficulty lies in storing the huge volume of player chat records and filtering them by conditions while keeping queries efficient.
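As a concrete illustration of the segmentation step, here is a minimal sketch using jieba (the chat message is made up for the example):

```python
import jieba

# A made-up chat message, already cleaned.
message = "今天公会战太刺激了，大家一起冲榜啊"

# jieba.lcut segments the text and returns the words as a list.
words = jieba.lcut(message)
print(words)
```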
3. Celery asynchronous tasks
Celery is a simple, flexible, and reliable distributed system for processing large numbers of messages. It focuses on real-time processing of asynchronous task queues and also supports task scheduling; each worker runs as a daemon. The Celery architecture consists of three parts: the message broker, the task execution unit (worker), and the task result store. The stack used here is Celery + Redis, with Celery running in workflow mode (group + chord) and Redis acting both as the message broker and as the store for task parameters and workflow data.
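A minimal sketch of that setup; the Redis address and database numbers are assumptions, not from the original post:

```python
# main.py -- minimal Celery app using Redis as both broker and result backend.
from celery import Celery

app = Celery(
    "main",
    broker="redis://127.0.0.1:6379/0",   # message broker
    backend="redis://127.0.0.1:6379/1",  # result / workflow data store
)
```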
4. Building the Celery tasks
1. Because the chat records are large, jieba.lcut takes a long time in a single-process task, so the task is split. There are many ways to split it; only the one used in this case is described here: first split by day, then split again by player level.
For example, suppose the overall task is to compute the hotspot ranking of players' chat records for 20200406 and 20200407. Splitting first by day and then by player level yields 4 sub-tasks:
worker_1: 20200406, chat logs of level 1-450 players
worker_2: 20200406, chat logs of level 451-900 players
worker_3: 20200407, chat logs of level 1-450 players
worker_4: 20200407, chat logs of level 451-900 players
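A sketch of what one such sub-task might look like; load_chat_logs is a hypothetical placeholder for however the chat records are actually fetched:

```python
# tasks.py -- sketch of one sub-task.
import jieba
from collections import Counter

from main import app  # the Celery app sketched above

@app.task
def segment_chat(day, min_level, max_level):
    """Segment one day's chat logs for one player-level range."""
    # load_chat_logs is a hypothetical helper, not from the original post.
    texts = load_chat_logs(day=day, min_level=min_level, max_level=max_level)
    counter = Counter()
    for text in texts:
        counter.update(jieba.lcut(text))
    # Return only the top words rather than the full counter, so the
    # payload passed along the workflow stays small (see the note below).
    return counter.most_common(100)
```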
2. celery.group runs all tasks in parallel
group([worker_1, worker_2, worker_3, worker_4])
3. celery.chord for workflow tasks
chord(group([worker_1, worker_2, worker_3, worker_4]))(callback)
The main purpose of this step is to aggregate the per-day, per-level segmentation results and finally compute the hotspot ranking.
Note: when using workflow mode, keep in mind that it passes the result of each step to the next step, so the amount of data returned by a step should be kept under control; returning too much data will hurt the execution efficiency of the overall task.
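Putting the pieces together, a dispatch sketch using the segment_chat sub-task from above; merge_rankings is a hypothetical callback name:

```python
# Fan out with group, then aggregate with a chord callback.
from collections import Counter

from celery import chord, group

from main import app
from tasks import segment_chat  # the sub-task sketched above

@app.task
def merge_rankings(results):
    """Chord callback (hypothetical): merge per-worker top-word lists."""
    total = Counter()
    for top_words in results:
        total.update(dict(top_words))
    return total.most_common(50)

header = group(
    segment_chat.s("20200406", 1, 450),
    segment_chat.s("20200406", 451, 900),
    segment_chat.s("20200407", 1, 450),
    segment_chat.s("20200407", 451, 900),
)
result = chord(header)(merge_rankings.s())  # returns an AsyncResult
```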
5. Task execution
1. Start the worker
Since the test machine has only 4 cores and jieba word segmentation is CPU-bound, the concurrency is pushed to the maximum: a worker with 4 processes, consuming the queue celery_test.
celery -A main worker -n celery_test -c 4 -l info -Q celery_test
2. Task execution results
During task execution, all 4 worker processes were busy and CPU usage was saturated.
At this point the task seems to be done, but what does any of this have to do with the title?
The section above is just the background of the discovery; what follows is the real point. With 4 worker processes, all of the CPU is occupied executing tasks, and the average time of a single task is 60s. 60s is barely acceptable, but it does not mean the workers are executing efficiently. During testing, only one day's tasks were run for process analysis, i.e. 2 sub-tasks:
worker_1: 20200406, chat records of level 1-450 players
worker_2: 20200406, chat records of level 451-900 players
The worker was started as before, still with 4 processes.
Since 2, 3, and 4 workers have all been tried, what is the efficiency with just 1 worker? The results are as follows.
6. Conclusion
With limited resources, how should the number of Celery worker processes be chosen? If you are pursuing the best per-task time, the number of worker processes should not exceed half the server's cores. Of course, for tasks that are not CPU-intensive, more worker processes mean more concurrent tasks. The worker count should be tuned to the actual workload, to save both resources and time.