Every year, InfoQ editors discuss the current state of AI, ML, and data engineering to identify the trends that should be on your radar as a software engineer, architect, or data scientist. We’ve compiled the results into our Technology Adoption Curve, with supporting commentary, to help you understand how these trends are evolving. We also explored what we think you should consider as part of your roadmap and skills development.
For the first time, we documented these discussions as a special episode of The InfoQ Podcast. Bitcraze robotics engineer Kimberly McGuire, who works with autonomous drones every day, joined the editors to share her experiences and opinions.
Deep learning moves to Early Adopters
While deep learning only started to get our attention in 2016, we’re now moving it from the Innovators category to the Early Adopters category. There are two main frameworks for deep learning: TensorFlow and PyTorch. Both are widely used throughout the industry. We see PyTorch as the dominant player in academic research and TensorFlow as the leader in business and enterprise settings. The two frameworks offer largely similar functionality, so the choice between them often comes down to your production and performance requirements.
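As an illustration of how similar the two high-level APIs have become, here is a minimal sketch of the same small classifier defined in each framework (the layer sizes are arbitrary):

```python
# The same two-layer network in PyTorch and in TensorFlow's Keras API.
import torch.nn as nn
import tensorflow as tf

# PyTorch
torch_model = nn.Sequential(
    nn.Linear(784, 128),
    nn.ReLU(),
    nn.Linear(128, 10),
)

# TensorFlow (Keras)
tf_model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(784,)),
    tf.keras.layers.Dense(10),
])
```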
We’ve noticed that more and more developers and organizations are collecting and storing their data in a form that deep learning algorithms can easily process, so that models can “learn” things relevant to business goals. Many teams now build their machine learning projects specifically around deep learning. Both TensorFlow and PyTorch provide abstraction layers for various types of data and ship with a large number of public datasets.
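For example, both frameworks can download and prepare a well-known public dataset in a couple of lines. A minimal sketch (the local `./data` directory is an arbitrary choice):

```python
# Loading a built-in public dataset in each framework.
import torchvision
import tensorflow as tf

# PyTorch: downloads MNIST and converts images to tensors on the fly.
mnist_torch = torchvision.datasets.MNIST(
    root="./data", train=True, download=True,
    transform=torchvision.transforms.ToTensor(),
)

# TensorFlow: returns the same dataset as NumPy arrays.
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
```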
We’ve also seen a huge increase in the number of datasets available for deep learning. The next challenge we see is large-scale distributed training: distributing the data and parallelizing training across many machines. Extension libraries that address this include FairScale, DeepSpeed, and Horovod. That’s why we’ve included “large-scale distributed deep learning” in our list of topics for the Innovators category.
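To give a flavor of what these libraries build on, here is a minimal sketch of data-parallel training with PyTorch’s built-in DistributedDataParallel, the pattern that FairScale, DeepSpeed, and Horovod extend and optimize. It assumes the script is launched with `torchrun`, which sets the required environment variables:

```python
# Data-parallel training with PyTorch DistributedDataParallel (DDP).
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets RANK / LOCAL_RANK / WORLD_SIZE for each worker.
    dist.init_process_group(backend="gloo")  # use "nccl" on GPU clusters

    model = nn.Linear(10, 1)
    ddp_model = DDP(model)  # gradients are averaged across all workers

    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)
    for _ in range(100):
        optimizer.zero_grad()
        loss = ddp_model(torch.randn(32, 10)).sum()  # toy batch
        loss.backward()   # the gradient all-reduce happens here
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```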
Another challenge we’re seeing in the industry right now has to do with the training data itself. Some companies do not have large datasets, which means they benefit greatly from using pre-trained models for their specific domains. Because creating datasets can be an expensive endeavor, choosing the right data for the model is a new challenge that engineering teams must overcome.
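A common way to benefit from pre-trained models is transfer learning: take a network trained on a large generic dataset, freeze its feature extractor, and retrain only a small task-specific head. A minimal sketch with torchvision (the 5-class output is an arbitrary example):

```python
# Fine-tuning a pre-trained model on a small, domain-specific dataset.
import torch.nn as nn
import torchvision.models as models

# Load a ResNet-18 pre-trained on ImageNet.
model = models.resnet18(pretrained=True)

# Freeze the pre-trained feature extractor...
for param in model.parameters():
    param.requires_grad = False

# ...and replace the classification head for, say, a 5-class problem.
model.fc = nn.Linear(model.fc.in_features, 5)
# Only model.fc's parameters will now be updated during training.
```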
Edge deployment of deep learning applications is a challenge
Currently, running AI on edge devices such as mobile phones, the Raspberry Pi, and even smaller microcontrollers remains a challenge: the model is trained on a large cluster but must be deployed on a small piece of hardware. The techniques used to achieve this are quantization of network weights (representing the weights with fewer bits), network pruning (removing weights that contribute little), and network distillation (training a smaller neural network to predict the same outputs). This can be done with tools such as Google’s TensorFlow Lite and NVIDIA’s TensorRT. We do sometimes see performance degradation when a model is shrunk this way, but how large the degradation is, and whether it matters, depends on the application.
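As an example of the first technique, TensorFlow Lite supports post-training quantization when converting a trained model for edge deployment. A minimal sketch, assuming a Keras model saved under a hypothetical `my_model` path:

```python
# Post-training quantization with the TensorFlow Lite converter.
import tensorflow as tf

model = tf.keras.models.load_model("my_model")  # hypothetical saved model

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # enable weight quantization
tflite_model = converter.convert()

# Write the compact model, ready for deployment on a phone or Raspberry Pi.
with open("my_model.tflite", "wb") as f:
    f.write(tflite_model)
```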
Interestingly, we are seeing companies adapt their hardware to better support neural networks. We’ve seen this on Apple devices and on NVIDIA graphics cards with Tensor Cores. Google’s new Pixel phone also has a Tensor chip that runs neural networks locally. We see this as a positive trend that will make machine learning feasible in many more cases than it is today.
Commercial robotics platforms for limited applications are growing in popularity
In the home, robot vacuum cleaners have become common. One newer robot platform that is gaining popularity is Spot, Boston Dynamics’ walking robot, which is used by police forces and the military for surveillance purposes. Despite the success of such platforms, their use cases are still very narrow. However, we expect to see more use cases emerge as AI capabilities increase.
One robot that is finding success is the self-driving car. Waymo and other companies are testing cars without safety drivers on board, which means the companies are confident in the vehicles’ capabilities. We think the challenge for large-scale deployment is expanding the area in which these vehicles can operate, and proving they are safe before they are allowed on the road.
GPU and CUDA programming allows parallelization of your problems
GPU programming allows programs to perform massively parallel tasks. If a task can be broken up into many small subtasks that don’t depend on each other, it is a good fit for GPU programming. Unfortunately, programming in CUDA (NVIDIA’s GPU programming language) is still difficult for many developers. Frameworks such as PyTorch, Numba, and PyCUDA can help, and should make GPU programming accessible to a broader market. Currently, most developers use GPUs for deep learning applications, but we expect to see more kinds of applications in the future.
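To illustrate, Numba lets you write a CUDA kernel directly in Python. A minimal sketch of element-wise vector addition, where each GPU thread handles one independent element (the array size and block dimensions are arbitrary):

```python
# A CUDA kernel written in Python with Numba.
import numpy as np
from numba import cuda

@cuda.jit
def add_kernel(a, b, out):
    i = cuda.grid(1)      # this thread's global index
    if i < out.size:      # guard against out-of-range threads
        out[i] = a[i] + b[i]

n = 1_000_000
a = np.random.rand(n).astype(np.float32)
b = np.random.rand(n).astype(np.float32)
out = np.zeros_like(a)

threads_per_block = 256
blocks = (n + threads_per_block - 1) // threads_per_block
# Numba copies the arrays to the GPU, launches the kernel, and copies back.
add_kernel[blocks, threads_per_block](a, b, out)
```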
Semi-supervised natural language processing performs well on benchmarks
GPT-3 and other similar language models do an excellent job of serving as “common natural language APIs”. They can handle a wide variety of inputs and beat many existing benchmarks. The more data that is used in this semi-supervised manner, the better the end result. These models are not only good at individual benchmarks; they generalize across many benchmarks at once.
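Accessing such models has become straightforward. A minimal sketch using the Hugging Face transformers library; the open GPT-2 model stands in here because GPT-3 itself is only available through OpenAI’s hosted API:

```python
# Text generation with a pre-trained language model via transformers.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
result = generator("Machine learning at the edge is", max_length=30)
print(result[0]["generated_text"])
```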
With regard to the architecture of these neural networks, we are seeing a shift from recurrent neural networks such as LSTMs to Transformer architectures. The trained models are huge, use a lot of data, and are expensive to train, which has drawn criticism over the capital and energy costs of producing them. Another problem with large models is inference speed: they may not be fast enough for real-time applications.
MLOps and DataOps allow easy training and retraining of algorithms
We see all major cloud vendors supporting the common container orchestration frameworks, such as Kubernetes, which increasingly offer good support for ML-based use cases. This means that one can easily deploy a database as a container on a cloud platform and scale it up and down, with built-in monitoring as an added benefit. One tool to watch is Kubeflow, which orchestrates complex machine learning workflows on Kubernetes.
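To give an idea of what that looks like, here is a minimal sketch of a workflow defined with the Kubeflow Pipelines (KFP) Python SDK; the training step is a hypothetical placeholder, and the pipeline name and file path are arbitrary:

```python
# A toy Kubeflow Pipelines workflow: each step runs as a container on Kubernetes.
import kfp
from kfp.components import create_component_from_func

def train_model(learning_rate: float) -> str:
    # A real component would train and persist a model here.
    return f"trained with lr={learning_rate}"

train_op = create_component_from_func(train_model)

@kfp.dsl.pipeline(name="toy-training-pipeline")
def training_pipeline(learning_rate: float = 0.01):
    train_op(learning_rate)

# Compile to a spec that can be uploaded to a Kubeflow cluster.
kfp.compiler.Compiler().compile(training_pipeline, "pipeline.yaml")
```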
We have also seen improvements in tools for deploying algorithms at the edge. There is K3s, a lightweight Kubernetes distribution for the edge, and KubeEdge, which takes a different approach by extending a cloud-hosted Kubernetes control plane out to edge nodes. While both products are still in their infancy, they are expected to improve container-based AI deployment at the edge.
We’ve also seen products that support the full MLOps lifecycle. One such tool is AWS SageMaker, which helps you train and deploy models easily. We believe that ML will eventually be integrated into the full DevOps lifecycle, creating a feedback loop in which you deploy an application, monitor it, and retrain and redeploy it as needed.
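As an illustration, the SageMaker Python SDK can launch a managed training job in a few lines. A minimal sketch; the entry-point script, IAM role ARN, S3 bucket, and instance type below are all placeholder assumptions:

```python
# Launching a managed PyTorch training job with the SageMaker Python SDK.
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="train.py",  # hypothetical training script
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # hypothetical role
    instance_count=1,
    instance_type="ml.m5.xlarge",
    framework_version="1.9",
    py_version="py38",
)

# Starts the training job; SageMaker provisions the hardware and
# streams the data from S3 (hypothetical bucket below).
estimator.fit({"training": "s3://my-bucket/training-data"})
```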
AutoML allows automation of a portion of the ML lifecycle
We have seen a slight increase in the use of so-called AutoML, a technique that automates parts of the machine learning lifecycle. Programmers can focus on getting the right data and a rough idea of the model, while the computer figures out the best hyperparameters. At present, AutoML is mainly used to search for neural network architectures and to find the optimal hyperparameters for training models.
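Hyperparameter search of this kind is easy to try with libraries such as Optuna. A minimal sketch, where the objective function is a toy stand-in for a real training-and-validation loop:

```python
# Automated hyperparameter search with Optuna.
import optuna

def objective(trial):
    # Optuna proposes values; a real objective would train a model
    # with them and return the validation loss.
    lr = trial.suggest_float("lr", 1e-5, 1e-1, log=True)
    n_layers = trial.suggest_int("n_layers", 1, 4)
    return (lr - 0.01) ** 2 + n_layers * 0.001  # toy loss surface

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=50)
print(study.best_params)
```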
We believe this is a big step forward, because it means machine learning engineers and data scientists can play a bigger role in translating business problems into formats that machine learning can solve. We also believe this makes it even more important to keep track of ongoing experiments. Tools such as MLflow can help track these experiments.
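A minimal sketch of what such tracking looks like with MLflow, logging a run’s parameters and a metric so that experiments can be compared later in the MLflow UI (the values shown are arbitrary):

```python
# Experiment tracking with MLflow: one run, its parameters, and a metric.
import mlflow

with mlflow.start_run(run_name="baseline"):
    mlflow.log_param("learning_rate", 0.01)
    mlflow.log_param("n_layers", 2)
    # In a real run this metric would come from a validation set.
    mlflow.log_metric("val_accuracy", 0.93)
```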
Overall, we think the main problem is shifting from “finding the best model to capture your data” to “finding the best data to train your model.” Your data must be of high quality, your data set must be balanced, and it must encompass all possible edge cases for your application. Doing this is now mostly manual and requires a good understanding of the problem domain.
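Even simple automated checks can catch some of these issues early. A minimal sketch that flags under-represented classes, assuming the labels are held in a pandas Series (the toy data and the 10% threshold are arbitrary choices):

```python
# A basic data-quality check: inspect class balance before training.
import pandas as pd

labels = pd.Series(["cat", "dog", "cat", "cat", "bird", "cat"])  # toy labels
counts = labels.value_counts(normalize=True)
print(counts)

# Flag any class that makes up less than 10% of the dataset.
rare = counts[counts < 0.10]
if not rare.empty:
    print(f"Warning: under-represented classes: {list(rare.index)}")
```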
What do I need to learn to be a machine learning engineer?
We believe that machine learning education has also changed a lot in the past few years. Starting with the classical literature is probably no longer the best approach, because the field has advanced so much recently. We recommend going straight to a deep learning framework such as TensorFlow or PyTorch.
It’s a good idea to choose a subject to specialize in. At InfoQ, we distinguish between the following disciplines: data scientist, data engineer, data analyst, and data operations. Depending on your chosen specialization, you can learn more about programming, statistics, or neural networks and other algorithms.
One tip we as InfoQ editors would like to share is to enter a Kaggle competition. Pick a problem in a field you want to focus on, such as image recognition or semantic segmentation. By building a good algorithm and submitting your results, you’ll see how your solution compares with those of other Kaggle users in the same competition. You’ll be motivated to climb the Kaggle leaderboard, and winners often publish a write-up of the steps they took after the competition ends. In this way, you can keep learning techniques that apply directly to your problem domain.
Finally, InfoQ itself has a lot of resources: we regularly publish the latest news, articles, demos, and podcasts on machine learning. You can also check out our article “How to Get Hired as a Machine Learning Engineer.” Last but not least, make sure to attend the QCon Plus conference in November, including the ML Everywhere track.