Preface

AI technology is booming right now. Some people have high expectations for it, while others sneer at it. I am not a professional AI algorithm engineer, and I will not pass judgment on AI technology. As a systems engineer (or plain code monkey), my task in 2020 has been to combine the strengths of cloud native, such as dynamic scheduling, real-time scaling and resource pooling, with AI technology, in order to solve the problems currently met in AI training and inference: environment issues, poorly automated operations, and so on.

This article is a side note and will not reveal any product information.

The status quo

Status quo one

At present, apart from the big tech companies, most AI work is concentrated in universities and research institutions, and large numbers of master's and doctoral students are pouring into the AI field. But there is no complete set of operating conventions: everyone uses cluster resources blindly, which leads to operational difficulties and fights over resources. Worst of all, most work stays at the research stage and the results rarely make it into production. As students graduate, new cohorts repeat the same work and do the same things, and none of the results are retained, which wastes resources.

Among the students I have worked with, most operate in the following way. I wonder whether they represent the majority of AI students:

# docker run -itd -p 2222:22 AIImage bash

This approach has many disadvantages. To list just a few:

  • Resource usage is unrestricted, so anyone can take as much as they like. But CPU, memory and GPU cards are finite. It often happens that one student takes too many resources and other students' training containers get killed, sometimes even affecting the host itself.
  • Students can log in to the backend cluster with relatively high privileges, so strange problems occur easily (not every student is proficient with Linux).
  • It is insecure: nobody can guarantee the safety of their own data and trained models.
  • Work stops at the training stage. The trained model and inference code are never packaged and published as a service, so there is no loop in which the model keeps learning and the service actually goes into production.
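
The first disadvantage can be eased somewhat even before a scheduler is involved, because `docker run` accepts per-container limits. The following is a minimal sketch, not the setup described here: the helper name `build_limited_run` and all limit values are assumptions for illustration. `--cpus`, `--memory` and `--gpus` are standard `docker run` flags, and `--gpus` additionally requires the NVIDIA container toolkit on the host.

```python
import shlex


def build_limited_run(image, ssh_port, cpus, memory, gpu_ids):
    """Return a `docker run` command line with CPU, memory and GPU limits.

    Hypothetical helper: the limit values passed in are illustrative only.
    """
    cmd = [
        "docker", "run", "-itd",
        "-p", f"{ssh_port}:22",   # same SSH port mapping as the command above
        "--cpus", str(cpus),      # cap on the number of CPU cores
        "--memory", memory,       # hard memory limit, e.g. "16g"
    ]
    if gpu_ids:
        # Expose only the listed GPU devices. The inner double quotes are
        # needed by the docker CLI when more than one device is listed.
        devices = ",".join(str(i) for i in gpu_ids)
        cmd += ["--gpus", f'"device={devices}"']
    cmd += [image, "bash"]
    # shlex.join quotes each argument so the line can be pasted into a shell.
    return shlex.join(cmd)


if __name__ == "__main__":
    print(build_limited_run("AIImage", 2222, cpus=4, memory="16g", gpu_ids=[0]))
```

Limits like these only address resource contention on a single host. The login, security and publishing problems in the list still need a platform layer.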

Status quo two

There are already many products on the market that cover the basic AI workflow from training to release, such as Huawei's ModelArts, Alibaba's PAI and Baidu's AIStudio. They all share one problem: they live in the cloud. Data, code and models have to be placed on cloud servers. Some sensitive data may not be put on a remote server, because doing so is both prohibited and insecure. So a product that runs in a private cluster is needed to meet these AI needs.

Status quo three

This one is about our team's actual situation. We have neither the deep technical accumulation and delivery capacity of the big companies nor a self-developed underlying framework (Alibaba has Crewhead Paddle, Baidu has PaddlePaddle and Huawei has MindSpore), and we do not have the time or money to build that up from scratch. So we mainly study and trial the open source platforms and software available on the market.

There are many AI frameworks: TensorFlow, PyTorch, Caffe, MXNet and so on. For practical reasons we cannot support all of them the way AWS HyperDL does, which claims to support most current AI frameworks (we have not used it, but I believe the claim; they are the big players). We focus only on the two frameworks most used by research institutions, TensorFlow (TF) and PyTorch, which students can pick up quickly.

In addition, both TF and PyTorch support distributed training, which makes use of multiple GPU cards to speed up training. This can also be a highlight of our platform.

Implementation

I won't go into implementation details. I'll just briefly cover the technology stack and the components used.

  • Kubernetes: the underlying container scheduling platform.
  • Ceph: storage. It provides file storage, object storage (for small files, mostly image data) and block storage (backing K8S PVCs).
  • Kubeflow: the AI training platform, supporting notebooks, TF, PyTorch, Spark, etc.
  • Ambassador: the service gateway.
  • Keycloak: single sign-on, user authentication and authorization.
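
To make the Kubeflow entry more concrete, here is a rough sketch of how a distributed PyTorch training job could be described for Kubeflow's training operator. It is not the platform's actual configuration: the job name, namespace, image and replica counts are hypothetical, and it assumes the `kubeflow.org/v1` `PyTorchJob` resource and the NVIDIA device plugin (which exposes `nvidia.com/gpu`) are installed in the cluster.

```python
import json


def pytorch_job(name, namespace, image, workers, gpus_per_replica=1):
    """Build a Kubeflow PyTorchJob manifest as a plain dict (illustrative)."""

    def replica(count):
        # One replica group: `count` pods, each requesting the same GPUs.
        return {
            "replicas": count,
            "restartPolicy": "OnFailure",
            "template": {"spec": {"containers": [{
                "name": "pytorch",  # container name the operator looks for
                "image": image,     # hypothetical training image
                "resources": {"limits": {"nvidia.com/gpu": gpus_per_replica}},
            }]}},
        }

    return {
        "apiVersion": "kubeflow.org/v1",
        "kind": "PyTorchJob",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {"pytorchReplicaSpecs": {
            "Master": replica(1),
            "Worker": replica(workers),
        }},
    }


if __name__ == "__main__":
    job = pytorch_job("demo-train", "team-a",
                      "registry.example.com/train:latest", workers=2)
    # kubectl accepts JSON as well as YAML manifests.
    print(json.dumps(job, indent=2))
```

With one master and two workers at one GPU each, this job would ask the scheduler for three GPUs in total.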

As the underlying scheduling platform, K8S is also what we deal with most. We use K8S RBAC for permission management, ResourceQuota to limit training resources, HPA to auto-scale inference services, and our own CRDs. Along the way I also gained a new understanding of how shm and HugePages are used in K8S.
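
As a rough illustration of two of these pieces, the sketch below generates a per-namespace ResourceQuota and an HPA for an inference Deployment. All names and numbers are assumptions, not the real settings. It uses the `autoscaling/v2` API, whereas clusters from around 2020 would have used `autoscaling/v2beta2`, and the GPU quota key assumes the NVIDIA device plugin.

```python
import json


def resource_quota(namespace, cpu, memory, gpus):
    """ResourceQuota capping what one team's namespace may request."""
    return {
        "apiVersion": "v1",
        "kind": "ResourceQuota",
        "metadata": {"name": "training-quota", "namespace": namespace},
        "spec": {"hard": {
            "requests.cpu": str(cpu),
            "requests.memory": memory,
            "limits.cpu": str(cpu),
            "limits.memory": memory,
            # Extended resources such as GPUs are quota'd via `requests.`.
            "requests.nvidia.com/gpu": str(gpus),
        }},
    }


def inference_hpa(namespace, deployment, min_replicas, max_replicas, cpu_percent):
    """HPA scaling an inference Deployment on average CPU utilization."""
    return {
        "apiVersion": "autoscaling/v2",
        "kind": "HorizontalPodAutoscaler",
        "metadata": {"name": deployment + "-hpa", "namespace": namespace},
        "spec": {
            "scaleTargetRef": {
                "apiVersion": "apps/v1",
                "kind": "Deployment",
                "name": deployment,
            },
            "minReplicas": min_replicas,
            "maxReplicas": max_replicas,
            "metrics": [{
                "type": "Resource",
                "resource": {
                    "name": "cpu",
                    "target": {"type": "Utilization",
                               "averageUtilization": cpu_percent},
                },
            }],
        },
    }


if __name__ == "__main__":
    # Hypothetical values: 32 cores, 128Gi and 4 GPUs for namespace "team-a".
    for manifest in (resource_quota("team-a", 32, "128Gi", 4),
                     inference_hpa("team-a", "model-server", 1, 5, 70)):
        print(json.dumps(manifest, indent=2))
```

Each printed manifest can be saved to a file and applied with `kubectl apply -f`. The quota is what keeps one user from starving others in a shared cluster, which was the main complaint about the plain `docker run` workflow.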

Conclusion

A few more idle words

2020 is coming to an end. It has been an extraordinary year, and yet another year of my own ordinary struggle.

It has been exactly ten years since I first came into contact with programming.

In six years I will also face the 35-year-old barrier (personally, I don't buy the argument that 35 is a barrier for programmers).

I have always been passionate about programming and computers.

I can go a whole day without eating or drinking to solve a problem. I can open my laptop on the subway to verify an idea (others may think I am showing off). Before going to sleep I don't dare think about programming problems, because the more I think, the more awake I get, until I have to get up and open the computer myself.

I do not expect programming to make me rich or take me to the peak of life. I just hope I can keep loving it like this and enjoy the sense of achievement that comes from solving problems.

Finally, I hope more AI products make it into production and make our lives more convenient.

Juejin annual essay | My technical path in 2020. The essay campaign is under way…