Content source: On April 22, 2018, Yi-fan Li, founder and CEO of Pinlan, gave a talk on AI project practice and the outlook for Kubeflow at the “Global Start | Kubeflow Meetup 4.22 Hangzhou: Open Up a New Field of Vision” event. As the exclusive video partner, IT mogul Said (WeChat ID: Itdakashuo) is authorized to release the video, with the review and approval of the host and the speaker.

Word count: 2250 | Reading time: 6 minutes


Abstract

Kubeflow is an exciting project that combines machine learning and distributed systems. It interests systems programmers, and data scientists have high expectations for it as well. This talk focuses on the experience of data scientists as Kubeflow's target users, and on what to expect from the project in the future.

Introduction: Industrial AI

Scikit learn

Since 2003, statistical machine learning algorithms have been widely used in Internet advertising, and this is where machine learning formally took hold in industry. scikit-learn then extended machine learning to many other scenarios, making it easy for Python users to pick up and run statistical machine learning algorithms in just a few weeks.

Around 2010, a series of papers on deep learning was published, and deep learning began to be applied to image, speech, and text data, producing one theoretical breakthrough after another. For structured data, the advantages of machine learning may not be obvious, and simple rule-based logic can often replace it. For images, speech, and text the situation is completely different, and this is where the advantages of deep learning lie.

After 2010, deep learning continued to develop. From around 2014, frameworks such as Caffe and TensorFlow were released, bringing deep learning capabilities out of academia.

Startups applying deep learning to business problems are mushrooming, large Internet companies are joining the AI arms race, and "All in AI" has become an everyday strategy. Meanwhile, traditional enterprises have entered a period of intensive digital transformation: "Internet+" and "AI+" are being introduced into industries such as retail, energy, finance, and manufacturing. Once most companies had started to dabble in AI, the scarcity of AI talent became apparent.

In general, the emergence of tools has enabled more people to join the AI field, and the growth of the field has in turn spawned a new generation of AI tools. We already have a powerful foundational algorithm tool in TensorFlow, but business growth has brought AI workloads that run in parallel on tens of thousands of GPUs, along with AI algorithm teams numbering in the hundreds to thousands.

A Solo DS


A solo DS has limited personal funds, so the recommended computing platform is a gaming PC, whose GPU performance is barely adequate. You will not have any customer data, so you can only use open source data such as ImageNet. The IDE usually depends on the language; PyCharm works for Python. You also need a teacher, and Stack Overflow is a great way to fill that role, as well as a book for getting started. Finally, it is time to prove yourself: get ready to take part in all kinds of competitions.

A DS workflow


The entire workflow is shown above: find data, study the data, tune the model, deliver the tuned model, and finally generate value from the model and, in some cases, revenue.

For a solo DS doing statistical machine learning, such as recommendations and advertising, 90% of the time goes into feature engineering. In the era of deep learning, 90% of the time is spent waiting for model training to complete.

From a user's perspective, working solo has several pain points. The first is slow training, which multi-GPU training can address. The second is the lack of data: everyone uses the same open source data, which limits innovation, so tools for data collection and annotation can help users build custom data sets. The third is that environment configuration is difficult, because users face a wide range of different frameworks and libraries in the community.

A DS team


Compared with a solo DS, a DS team first upgrades its computing platform, which may be resources in a machine room or in the cloud. It also has its own data sets and a shared code base, with Git introduced to manage it. Beyond that, there is not much improvement in other areas, such as how model libraries are selected, how release pipelines are built, and how resources are shared and coordinated.

The most common problem DS teams encounter is difficulty managing GPU resources, such as queuing for limited resources, or shutting down other people's tasks without their knowledge. To solve this problem, we adopted a relatively primitive method. Suppose there are three GPUs: whenever someone needs one, they record the usage in Excel along with the estimated training completion time, as shown in the figure above.
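The spreadsheet approach can be sketched in a few lines of Python. This is only an illustration of the bookkeeping, not the team's actual tooling: the `Booking` and `GpuSheet` names, the in-memory list standing in for the Excel file, and the first-free-GPU policy are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class Booking:
    gpu_id: int
    user: str
    start: datetime
    expected_end: datetime  # estimated training completion time


class GpuSheet:
    """Hypothetical in-memory stand-in for the shared spreadsheet."""

    def __init__(self, num_gpus: int) -> None:
        self.num_gpus = num_gpus
        self.bookings: List[Booking] = []

    def is_free(self, gpu_id: int, start: datetime, end: datetime) -> bool:
        # A GPU is free if no booking on it overlaps the requested window.
        return all(
            b.gpu_id != gpu_id or b.expected_end <= start or end <= b.start
            for b in self.bookings
        )

    def book(self, user: str, start: datetime, hours: float) -> Optional[Booking]:
        # Assumed policy: take the lowest-numbered GPU that is free.
        end = start + timedelta(hours=hours)
        for gpu_id in range(self.num_gpus):
            if self.is_free(gpu_id, start, end):
                booking = Booking(gpu_id, user, start, end)
                self.bookings.append(booking)
                return booking
        return None  # every GPU is taken, so the user has to queue
```

Even this small sketch shows the weakness of the method: it only works if everyone records their jobs honestly and the estimated completion times are roughly right.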

Another problem is how to do AI DevOps. Model files that do not match the code, problems with prediction results, and a lost previous version of the model are situations I believe most people have encountered. Distributed training also matters a great deal for model training, and doing it well saves a lot of time. Training on a single machine with multiple GPUs is simpler than training across multiple machines with multiple GPUs.
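One simple way to catch a model file that no longer matches its code is to store a checksum of both at training time and compare them before serving. The sketch below is a minimal illustration under that assumption; the manifest layout and the `write_manifest` and `matches_manifest` helpers are hypothetical and are not part of Kubeflow or of the speaker's setup.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path: Path) -> str:
    # Hash the file contents so any change to the file changes the digest.
    return hashlib.sha256(path.read_bytes()).hexdigest()


def write_manifest(model_path: Path, code_path: Path, manifest_path: Path) -> dict:
    # Record which code produced which model, at the moment training finishes.
    manifest = {
        "model_file": model_path.name,
        "model_sha256": sha256_of(model_path),
        "code_file": code_path.name,
        "code_sha256": sha256_of(code_path),
    }
    manifest_path.write_text(json.dumps(manifest, indent=2))
    return manifest


def matches_manifest(model_path: Path, code_path: Path, manifest_path: Path) -> bool:
    # True only if both files are byte-identical to what was recorded.
    manifest = json.loads(manifest_path.read_text())
    return (
        manifest["model_sha256"] == sha256_of(model_path)
        and manifest["code_sha256"] == sha256_of(code_path)
    )
```

In practice a team would more likely record the Git commit of the code base instead of hashing a single file, but the idea is the same: the model and the code that produced it travel together.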

Overall, one of the biggest challenges for a DS team is that its members come together with a clear goal and must complete an AI task with quality, at speed, and without missteps. All the development challenges we face are really technical debt caused by the rapid development of AI.

Kubeflow outlook

Kubeflow offers solutions to the problems described above, both for solo work and for teams. Examples include configuration and version management, process management tools for distributed training, feature extraction for complex data exploration, and serving infrastructure for large-scale deployment.