You read that right: 300 million predictions. Every second.
Machine learning is changing many fields, and advertising is one of the biggest. While companies like Google and Facebook are the most notorious for using big data to target personalized ads, there are many other players in the space. That shouldn't come as a surprise: online advertising is said to be a $100 billion industry.
We see some big numbers here
From a technology perspective, the industry is an interesting convergence of networking and machine learning, and that convergence brings a demanding set of challenges: models need high precision, have to be updated constantly, and must respond with very low latency. This makes traditional methods and models hard to apply. In the paper "Scaling TensorFlow to 300 million predictions per second," the authors, who work at Zemanta, share their experience and the approaches they used to solve these problems.
Some background knowledge will help you understand the context of the work.
The passage above, taken from the paper, explains what Zemanta is, how the service works, and how advertising space is sold. The last point, using machine learning to maximize their KPIs, is the interesting part. Perhaps some readers of this article will go on to work in the field (remember me when you do, haha).
Some background on the design choices made.
In this article, I will break down how the Zemanta team achieved 300 million predictions per second using the TensorFlow framework. As usual, my annotated version of the paper is at the end (the arXiv link is shared there too), so be sure to read it for yourself. Which of these findings do you find most interesting? Let me know in the comments. And if you would like to discuss the paper in more detail, feel free to reach out to me on social media.
Lesson 1: Simple models are (still) king
This is something that people experienced in machine learning know very well. Flipping through AI news, you could be forgiven for equating machine learning with giant models and complex pipelines. It's not surprising that most beginners confuse machine learning with deep learning: they see the headlines about GPT-3 or ResNet and conclude that to build great models, you need to know how to build huge networks that take days to train.
Deep learning has gained a lot of attention in recent years.
This paper brings a dose of reality. As anyone who has actually worked in machine learning can attest, the meme below is accurate.
Simple models are easier to train, can be iterated on faster, don't require as many resources, and generally don't lag far behind in performance. Serving large models at scale can also add a lot to server and running costs. The authors express a similar view in the following passage from their paper.
Additionally, we do not use GPUs for inference in production. At our scale, outfitting each machine with one or more top-tier GPUs would be prohibitively expensive; on the other hand, having only a small cluster of GPU machines would force us to transition to a service-based architecture. Given that neither option is particularly desirable, and that our models are relatively small compared to the state-of-the-art models in other areas of deep learning (such as computer vision or natural language processing), we consider our approach much more economical. Our use case is also not a good fit for GPU workloads, since our models use sparse weights.
Most companies don't have large GPU clusters lying around that they can throw at training and inference. And in most cases, they don't need them: as the authors note, relatively small models are much more economical.
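To make "relatively small" concrete, here is a toy sketch of the kind of model this points to: a few embedded categorical features feeding a small dense network that runs comfortably on CPU. This is not the authors' architecture; the feature names, vocabulary sizes, and layer sizes are invented purely for illustration.

```python
import tensorflow as tf

# Hypothetical feature spaces, invented for illustration (not from the paper).
NUM_PUBLISHERS = 10_000
NUM_ADS = 100_000

publisher_id = tf.keras.Input(shape=(1,), dtype=tf.int32, name="publisher_id")
ad_id = tf.keras.Input(shape=(1,), dtype=tf.int32, name="ad_id")

# Small embeddings for categorical features keep the parameter count low.
pub_vec = tf.keras.layers.Flatten()(tf.keras.layers.Embedding(NUM_PUBLISHERS, 16)(publisher_id))
ad_vec = tf.keras.layers.Flatten()(tf.keras.layers.Embedding(NUM_ADS, 16)(ad_id))

x = tf.keras.layers.Concatenate()([pub_vec, ad_vec])
x = tf.keras.layers.Dense(64, activation="relu")(x)
click_prob = tf.keras.layers.Dense(1, activation="sigmoid", name="click_prob")(x)

model = tf.keras.Model(inputs=[publisher_id, ad_id], outputs=click_prob)
model.compile(optimizer="adam", loss="binary_crossentropy")
model.summary()  # well under 2 million parameters: tiny by modern deep learning standards
```

A model of this size typically predicts in a fraction of a millisecond on a single CPU core, which is exactly the operating point the quote above argues for.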
Lesson 2: Don’t ignore lazy optimizers
A sparse matrix is a matrix in which most of the values are zero. Such matrices are used to represent systems with limited interaction between pairs of components. For example, imagine a matrix whose rows and columns correspond to people on Earth: the value at a given index is 1 if the two people know each other, and 0 if they don't. This is a sparse matrix, because most people don't know most of the rest of the world.
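As a toy illustration of that example (mine, not the paper's), TensorFlow can store such a matrix by keeping only the nonzero entries:

```python
import tensorflow as tf

# "Who knows whom" among 5 people: entry (i, j) is 1 if person i knows person j.
# Only the handful of nonzero entries are stored; the zeros are implicit, which
# is the whole point of a sparse representation.
knows = tf.sparse.SparseTensor(
    indices=[[0, 1], [1, 0], [2, 4], [4, 2]],  # the few acquainted pairs
    values=[1.0, 1.0, 1.0, 1.0],
    dense_shape=[5, 5],
)

print(tf.sparse.to_dense(knows))  # materializes a 5x5 matrix that is mostly zeros
```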
The matrices Zemanta works with are sparse, mostly because the majority of their features are categorical. They found that the Adam optimizer adds a lot of running cost (about 50% more than Adagrad), while Adagrad, although cheap, performs terribly. Fortunately, there is an alternative that performs well without the high cost: LazyAdam.
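In TensorFlow this is close to a one-line change: LazyAdam ships with the TensorFlow Addons package as tfa.optimizers.LazyAdam, a drop-in replacement for Adam. The snippet below is a minimal sketch of how you would swap it in, not a reproduction of Zemanta's training code.

```python
import tensorflow as tf
import tensorflow_addons as tfa  # pip install tensorflow-addons

# Plain Adam touches the moving-average slots of every row of an embedding table
# on every step. LazyAdam only updates the slots for rows that actually received
# gradients in the current batch, which is far cheaper when most features are
# categorical and only a tiny fraction of rows appear in any one batch.
optimizer = tfa.optimizers.LazyAdam(learning_rate=1e-3)

# model.compile(optimizer=optimizer, loss="binary_crossentropy")
```

Note that LazyAdam's update rule differs slightly from Adam's (rows that receive no gradient also skip the decay step), so results should be validated, but as the lesson says, it performs well without Adam's cost.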
Lazy evaluation is a well-established practice in software engineering, and lazy loading is commonly used in GUI and interactive platforms such as websites and games. It is only a matter of time before lazy optimizers become more entrenched in machine learning, so keep an eye out for them. If you're looking for a research direction in machine learning, this could be an interesting one.
Lesson 3: Larger batches → Lower computational costs
I wouldn't have expected this one at all. In the authors' words: "By digging into TF, we realized that the computation is far more efficient (per example) if we increase the number of examples in a compute batch. This sublinear growth comes from TF code being highly vectorized. TF also has some overhead for every compute call, which is then amortized over larger batches. Given this, we figured that in order to decrease the number of compute calls, we needed to join many requests into a single computation."
Batching requests to lower the computational cost per example was new to me. To be honest, I still don't fully understand why the effect is this large, so if any of you have a deeper explanation, please share it with me. The scale is also surprising: this change cut their computing costs in half.
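If you want to see the effect for yourself, here is a small, self-contained benchmark sketch (my own hypothetical setup, not the paper's model or numbers). It times the same model at several batch sizes and reports the cost per example, which should fall as the batch grows, because each call's fixed overhead is amortized over more rows and the underlying ops are vectorized.

```python
import time
import numpy as np
import tensorflow as tf

# A stand-in model; the shape of the effect matters here, not the architecture.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(32,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
infer = tf.function(model)  # graph-compiled call, closer to a serving setup

for batch_size in (1, 8, 64, 512):
    x = np.random.rand(batch_size, 32).astype("float32")
    infer(x)  # warm-up so tracing/compilation is excluded from the timing
    start = time.perf_counter()
    for _ in range(200):
        infer(x)
    per_example_us = (time.perf_counter() - start) / (200 * batch_size) * 1e6
    print(f"batch size {batch_size:4d}: ~{per_example_us:7.1f} microseconds per example")
```

The exact numbers depend on your hardware, but the downward trend in per-example cost is the point.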
The full result of the optimization, in the authors' words:

This implementation is highly optimized, decreasing the number of compute calls by a factor of 5 and halving the CPU usage of TF compute. In rare cases where a batcher thread does not get CPU time, those requests will time out; however, this happens in fewer than 0.01% of requests. We observed a slight increase in average latency (about 5 milliseconds on average, possibly more during peak traffic). We put appropriate SLAs and monitoring in place to ensure stable latencies. Since the percentage of timeouts did not increase significantly, this was highly beneficial and remains at the core of our TF serving mechanics.
The slight increase in latency makes sense. To see exactly what they did, check out Section 3.2 of the paper. It is mostly about the serving and networking side of things, where I'm no expert, but the results speak for themselves.
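The paper doesn't include code for its batching layer, so the sketch below is only my rough approximation of the idea in Section 3.2: request-handling threads drop their features into a queue, and a background batcher thread joins whatever has accumulated into a single TF compute call, then hands each caller back its own row of the output. Timeouts, multiple batcher threads, and error handling are omitted.

```python
import queue
import threading
import time

import numpy as np


class RequestBatcher:
    """Joins individual prediction requests into one model call (illustrative only)."""

    def __init__(self, model, max_batch=512, max_wait_s=0.002):
        self._model = model          # any callable mapping a [batch, features] array to [batch, 1]
        self._queue = queue.Queue()  # pending requests from handler threads
        self._max_batch = max_batch
        self._max_wait_s = max_wait_s
        threading.Thread(target=self._run, daemon=True).start()

    def predict(self, features):
        """Called from a request-handling thread; blocks until its batch has run."""
        slot = {"features": features, "done": threading.Event(), "result": None}
        self._queue.put(slot)
        slot["done"].wait()
        return slot["result"]

    def _run(self):
        while True:
            # Block for the first request, then greedily gather more until the
            # batch is full or the small wait budget is spent.
            batch = [self._queue.get()]
            deadline = time.monotonic() + self._max_wait_s
            while len(batch) < self._max_batch:
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    break
                try:
                    batch.append(self._queue.get(timeout=remaining))
                except queue.Empty:
                    break

            # One vectorized compute call for the whole batch instead of one per request.
            inputs = np.stack([item["features"] for item in batch])
            outputs = np.asarray(self._model(inputs))

            # Hand each caller its own row of the result.
            for item, row in zip(batch, outputs):
                item["result"] = row
                item["done"].set()
```

A production version also needs the timeout behaviour mentioned in the quote above, so that a request whose batcher thread never gets CPU time fails fast instead of hanging.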
Conclusion
This paper is an interesting read that combines engineering, networking, and machine learning. It also provides insight into the use of machine learning at smaller companies, where huge models and 0.001% performance improvements are not the priority.
You can read my fully annotated paper here (available for free download).
Reach out to me
If this article got you interested in getting in touch with me, this section is for you. You can reach me on any of the platforms below, or check out my other content. If you'd like to discuss tutoring, message me on LinkedIn, IG, or Twitter. If you want to support my work, use my free Robinhood referral link: we both get a free stock, at no risk to you, so not using it is just passing up free money.
Check out my other articles on Medium: rb.gy/zn1aiu
My YouTube: https://rb.gy/88iwdd
My LinkedIn, let's connect: https://rb.gy/m5ok2y
My Instagram: rb.gy/gmvuy9
My Twitter: twitter.com/Machine0177…
My Substack: codinginterviewsmadesimple.substack.com/
Get free stock on Robinhood: https://join.robinhood.com/fnud75
Lessons from Scaling TensorFlow to 300 million predictions per second was originally published in Geek Culture on Medium.