Metzenkai, operations manager at eBay, is mainly responsible for maintaining eBay’s cloud platform, application platform, and network traffic.
This article shares our practice of automated, site-wide, cross-platform patch deployment.
Background
eBay started with an all-Windows server front end, and static content was served by IIS.
The most expensive item ever sold on eBay was a $168 million yacht. eBay did $85 billion in business in 2016.
Our internal ATB requirement is 99.947%. We do database upgrades and migrations once a week, and we count everything that affects users against that figure, whether the cause is internal or external.
As an e-commerce platform, what we do is tied to payment. If PayPal fails to collect money, the impact on us is that users cannot complete transactions when they buy things on our platform. To the user the platform has failed, so it counts as our fault and our ATB is affected.
That is the right view from the user’s side, but it feels a little unfair to us: a downstream product has the problem, the deal cannot go through, and we bear the ATB impact.
But we are an Internet business too. With single deals as large as $168 million, you do not want money to end up in someone else’s account because of a hole in your servers, or your company to be hacked one day because your systems were not patched. Most people, ourselves included, care a great deal about their own security. Here is some inside information.
The external situation has been severe since last year, including last year’s ransomware outbreak. The overall security situation is serious, and every one of our engineers is under great pressure. Every time a vulnerability appears, we have to fix it.
We require our patch cycle to be 45 days from release to full deployment.
eBay operates 24 hours a day. After 20 years of development we run both Windows and Linux, with hundreds of thousands of systems and thousands of applications. Worldwide, only 12 people operate the front-end application servers.
Those twelve people maintain both Windows and Linux. We have 20,000 Windows machines and 100,000 Linux machines, divided into thousands of application clusters. Our PD does not need to care about system-level vulnerabilities. Their code runs on an OS that has the vulnerabilities, so patching the OS is our job.
We also manage other things, and patching is just part of our daily routine.
Problem analysis
At that time we had 20,000 Windows systems. My manager told me to start patching the following week, and I began to think about how to do it.
We went through several stages. In short, there were three:
Supporting a single OS with scripts and repetitive manual work
Supporting a single OS with automated processes and less repetitive work
Supporting multiple operating systems with visual operation and platform-based management
Like many Internet companies, we started with Windows. When there was a single OS managed with Windows Active Directory, and no budget for better paid services, there was a free Microsoft product called WSUS, which provides update management for Microsoft products in an enterprise environment.
However, using WSUS with Active Directory as the solution has limitations and cannot fully meet the needs of Internet e-commerce.
It lets you choose a policy, but there are only three: download only, scheduled installation, or automatic installation.
Automatic installation requires a reboot, and that is a problem: at that point 20,000 machines were patched and rebooted, and the people in the monitoring center panicked, because nobody had filed a work order.
Here, any machine restart requires a work order, yet all 20,000 machines on the site had restarted without one.
They called in everyone associated with operations and asked why. We said we were patching, and they told us not to do it that way again: first, they had no idea when we were patching; second, the restarts were uncontrolled. Some services cannot be restarted at will and have defined maintenance windows.
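The rule the monitoring team asked for can be sketched as a simple gate: a reboot goes ahead only if a work order exists and the service is inside its maintenance window. This is a minimal illustration, not eBay's tooling; the `RebootRequest` type, its fields, and the work order id format are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, time
from typing import Optional


@dataclass
class RebootRequest:
    host: str
    work_order_id: Optional[str]  # hypothetical ticket id; None means no work order was filed
    window_start: time            # start of the service's maintenance window
    window_end: time              # end of the window (exclusive)


def in_window(now: time, start: time, end: time) -> bool:
    """True if `now` falls inside the window; handles windows that wrap past midnight."""
    if start <= end:
        return start <= now < end
    return now >= start or now < end


def may_reboot(req: RebootRequest, now: datetime) -> bool:
    """Allow a reboot only with a work order and inside the maintenance window."""
    return bool(req.work_order_id) and in_window(now.time(), req.window_start, req.window_end)
```

A patch job would call `may_reboot` before every restart and skip or defer the host when it returns `False`.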
So what to do? We chose to use scripts: we wrote a script, ran it on the machines every month, and then checked the results.
The first thing we have to do is a compatibility test. We do not want Windows to hang or blue-screen after a patch and need reinstalling. We had some clever tooling at the time that could reinstall a Windows machine in only 15 minutes, but at this scale recovery is hard. So we download all the patches to the machine and let a script choose which ones to install.
Each time, I pick a list of machines and file a work order so that the SEC can restart them. But the machines keep changing: new machines constantly come online and old ones are taken down. The script is static while the machine list changes, so every month the machine list has to be maintained again, which is very troublesome.
I will talk about how we solved this later; for now I will leave the problem here to get you thinking about it.
We went from scripts running on a single OS to an automated process, and then implemented automated gray-release (staged) distribution of patches.
In the end we want a visual, platform-based operation. The Hadoop operations engineers take charge of Hadoop, and I open an interface where other operations colleagues can plug in their own logic. If you need testing, I will test for you, but the business logic stays on your side and you maintain it.
Patching raises several questions. First, how do you learn that a vulnerability exists? Second, how do you locate it and work out its impact on applications?
Once a vulnerability is found, how do I assess the impact from the application’s perspective? Then how do we deploy the patch across platforms, and how do we keep the patching itself safe?
Most people probably learn about vulnerabilities by reading official announcements, such as Microsoft’s, and then patch the system. In fact there are other channels.
The first is active scanning: a regular health check, much like a physical examination. You can choose a commercial or an open source product. One we consider relatively good is Qualys, which provides active scanning based on a vulnerability database and helps you find vulnerabilities from both the application and the host perspective. It also helps you grade vulnerabilities.
You can also lean on existing vulnerability knowledge. Sometimes you have a vulnerability and do not know how dangerous it is. Take the recently reported Intel CPU vulnerability: when I first heard about it I was completely baffled, and probably many people were. We did not know what it meant. A scanner can tell you the cause, so we borrow other people’s knowledge to fill in our own.
Then there are vendor announcements, including official channels such as Microsoft’s for Windows.
In addition, there are industry notices.
A popular one in the United States is CVE. When a scan finds a vulnerability, it attaches two pieces of information: the official patch and the CVE entry. The CVE information tells you the severity rating the vulnerability has already been given, along with the corresponding risks and patches.
Users upload the vulnerabilities they find, which are recorded and published there, and the industry watches CVE as well. Since CVE and commercial scanning software have a lot in common, why pay for a commercial product instead of using open source tools and public security information?
One reason is efficiency and expertise: CVE alone holds so much information that you cannot compare all of it yourself.
Suppose 200,000 machines span several application clusters running different versions. More troublesome still, the same software package exists in different versions. In normal operations we have found this to be common. We call it package version drift, and we try to avoid introducing it during patching.
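Package version drift is easy to detect once the installed versions are collected per cluster. The sketch below assumes a plain mapping from host name to installed version of one package and treats the most common version as the baseline; both the data shape and the "majority is the baseline" rule are assumptions made for illustration.

```python
from collections import Counter
from typing import Dict


def find_drift(installed: Dict[str, str]) -> Dict[str, str]:
    """Return the hosts whose package version differs from the cluster's most common version.

    `installed` maps host name -> installed version of a single package.
    """
    if not installed:
        return {}
    baseline, _count = Counter(installed.values()).most_common(1)[0]
    return {host: version for host, version in installed.items() if version != baseline}
```

Running this per package and per cluster before a patch wave shows which hosts would need to be brought to the baseline first.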
Inside eBay, we use scanning software to scan all IP segments to make sure everything is covered, and then we bring in CMS, our configuration management system, which is a kind of CMDB. Combining the two shows which applications a vulnerability maps to, which makes the next step, evaluation and grading, easier.
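Joining scan results with the configuration database might look roughly like this. The input shapes are assumptions: scan findings as `(ip, vulnerability id)` pairs and the CMDB as an IP-to-application mapping. The real CMS schema is not described in the talk.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def affected_apps(
    findings: Iterable[Tuple[str, str]],
    cmdb: Dict[str, str],
) -> Tuple[Dict[str, Set[str]], List[str]]:
    """Group scanner findings by application.

    Returns (app -> set of vulnerability ids, list of IPs the CMDB does not know).
    Unknown IPs are returned rather than dropped, since they point at gaps in the CMDB.
    """
    by_app: Dict[str, Set[str]] = defaultdict(set)
    unmapped: List[str] = []
    for ip, vuln_id in findings:
        app = cmdb.get(ip)
        if app is None:
            unmapped.append(ip)
        else:
            by_app[app].add(vuln_id)
    return dict(by_app), unmapped
```

The per-application view is what feeds evaluation and grading; the unmapped list is a to-do list for cleaning up the configuration data.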
Not every vulnerability needs to be handled with a patch. Some can wait, and some you can block directly by other means. After grading, a zero-day vulnerability must be patched immediately, while a low-risk one leaves room to catch your breath. For a zero-day I need to pull in colleagues in the United States and get it done as soon as possible.
The grade determines your remediation strategy: patch the vulnerability, disable the service of the vulnerable package, or something else.
Once the strategy is set, the operations team handles system-level issues. If the issue is closely tied to an application, you have to inform the corresponding development team that the application needs patching, and agree whether to fix it as a standalone patch or to fold the fix into the application’s regular deployment. That is our vulnerability assessment process.
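The assessment flow can be condensed into a small decision function. This is only a sketch: the `Vulnerability` fields and the action names are hypothetical, and the only number taken from the talk is the 45-day cycle from release to full deployment.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Tuple

PATCH_CYCLE_DAYS = 45  # from the talk: release to full deployment


@dataclass
class Vulnerability:
    vuln_id: str
    zero_day: bool
    system_level: bool   # True: operations team owns it; False: tied to an application
    released: date       # date the patch or advisory was released


def plan(v: Vulnerability) -> Tuple[str, str, date]:
    """Return (action, owner, deadline) for one graded vulnerability."""
    owner = "operations" if v.system_level else "development"
    if v.zero_day:
        # Zero-day: patch immediately, no waiting for the regular cycle.
        return ("patch-now", owner, v.released)
    return ("patch-in-cycle", owner, v.released + timedelta(days=PATCH_CYCLE_DAYS))
```

Other strategies mentioned above, such as disabling the vulnerable service, would be further branches in the same function.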
Two things make cross-platform patching work under one system. First, the systems online run Red Hat, Ubuntu, and Windows.
There are several versions of Windows, but the good news is that they can share the same update source, which is updated roughly once a month. One difference you will notice is that Linux has two update sources while Windows has one; Windows is relatively stable, whereas Linux packages are updated continuously.
We are on Ubuntu. Not all machines have access to the Internet, so we use a reverse proxy. Because the Ubuntu machines cannot reach the Internet directly, we are also less exposed to attacks.
So we need to set up two types of repo, one of which lets online machines get the latest packages. The daily update packages are continuously synced from outside, and each sync automatically triggers the internal CI/CD process for deployment and testing.
With this in place from the moment a package arrives, I can pull the packages that are already running in production, test the updates against them, and then release.
Also, if you patch the machines that are online but do not update the repo, any new machine built from that repo still carries the vulnerability. So we update the repo at the same time to make sure new machines do not have it. That is what we do in terms of technical architecture.
With this preparation, the technical architecture lets me apply patches, lets me run my tests, and lets our development teams know when patches are deployed. How to keep patch deployment safe is discussed next.
Patch deployment is done directly by our operations team. First come patch testing and exception shielding. The biggest difference between development and operations is that operations is responsible for guarding against problems that were not foreseen, or that were foreseen but whose timing is unknown. So we build in protection against anomalies, and we check the package list before we patch.
We keep a blacklist for certain applications: packages on an application’s blacklist are not patched. For example, if a package such as PHP may affect the application, we add it to the blacklist, and neither system patches nor security patches are applied to it.
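The blacklist check before patching could be as small as the function below. The data shapes are assumptions for illustration: pending updates as a list of package names, and the blacklist as a mapping from application name to the packages it wants held back.

```python
from typing import Dict, Iterable, List, Set, Tuple


def packages_to_patch(
    pending: List[str],
    host_apps: Iterable[str],
    blacklist: Dict[str, Set[str]],
) -> Tuple[List[str], List[str]]:
    """Split pending package updates into (to_install, held_back) for one host.

    A package is held back if any application running on the host blacklists it.
    """
    held: Set[str] = set()
    for app in host_apps:
        held |= blacklist.get(app, set())
    to_install = [pkg for pkg in pending if pkg not in held]
    held_back = sorted(held & set(pending))
    return to_install, held_back
```

Reporting `held_back` alongside the install list keeps the skipped packages visible instead of silently ignored.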
Next comes testing on important samples, which brings us to canary testing within the gray release. A canary test is a sample test: we put the patch into production on one machine per OS type, watch it run, and then start the gray release. The gray release can have three stages, five stages, or whatever you define. If you are really unlucky, you still have the chance to roll back without much impact on the business.
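A staged (gray) rollout with a canary first can be sketched as two small functions: one that cuts the host list into waves, and one that patches wave by wave and stops at the first failure so the remaining hosts stay untouched. The stage percentages and the `patch_and_check` callback are hypothetical; the talk only says the number of stages is configurable.

```python
from typing import Callable, List, Optional, Sequence, Tuple


def plan_waves(hosts: Sequence[str], percents: Sequence[int] = (1, 10, 50, 100)) -> List[List[str]]:
    """Cut hosts into cumulative waves; the first wave acts as the canary."""
    waves: List[List[str]] = []
    total, done = len(hosts), 0
    for pct in percents:
        if done >= total:
            break
        # Always advance by at least one host so small pools still get staged.
        upto = min(total, max(done + 1, total * pct // 100))
        waves.append(list(hosts[done:upto]))
        done = upto
    return waves


def rollout(
    waves: List[List[str]],
    patch_and_check: Callable[[str], bool],
) -> Tuple[List[str], Optional[str]]:
    """Patch wave by wave; stop at the first host that fails its health check.

    Returns (hosts patched successfully, failing host or None).
    """
    patched: List[str] = []
    for wave in waves:
        for host in wave:
            if not patch_and_check(host):
                return patched, host
            patched.append(host)
    return patched, None
```

When `rollout` returns a failing host, only the hosts in `patched` plus that one need attention for rollback, which is what keeps the business impact small.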
Now let’s look at some problems we may run into when checking results, such as false positives and false negatives.
What is a false positive?
The external scanning software reports a Windows vulnerability on a host, and when you look again it is a Linux device. Or your configuration database contains bad data like that. The world is not perfect, and we need to know exactly where it is imperfect, because that is where problems come from. That is a false positive.
A false negative is when the version number suggests the patch has been applied, but that is sometimes wrong: a later scan tells you that the host you believed was clean is still vulnerable. This is a problem in the testing stage: how do you verify that a patch is really in place, and then check for system crashes, broken package dependencies, performance changes, and whether the fix has taken effect?
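Both checks can be automated in a basic form. The sketch below flags a finding for review when the scanner's OS disagrees with the configuration database, and verifies a patch by comparing version numbers numerically. It assumes purely numeric dotted versions, which real distribution package versions often are not, so treat the comparison as a simplification.

```python
from typing import Tuple


def suspicious_finding(scanner_os: str, cmdb_os: str) -> bool:
    """A finding whose OS disagrees with the CMDB is a likely false positive, or bad CMDB data."""
    return scanner_os.strip().lower() != cmdb_os.strip().lower()


def parse_version(version: str) -> Tuple[int, ...]:
    """Parse a purely numeric dotted version such as '1.10.0' into a comparable tuple."""
    return tuple(int(part) for part in version.split("."))


def is_patched(installed: str, fixed_in: str) -> bool:
    """True if the installed version is at least the version that contains the fix."""
    return parse_version(installed) >= parse_version(fixed_in)
```

Comparing versions as strings would report `1.10.0` as older than `1.9.3`, which is one way a host ends up wrongly marked. A version check is still only a first pass, and a rescan after patching is the stronger confirmation.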
Then there is the question of how we validate the deployment phase.
In fact there are more pitfalls in the deployment phase. One is patch coverage. On Windows, for example, patching means placing the package on the machine and telling it to install, which depends on an agent, and the agent may fail.
Another is that while your package is still rolling out, new machines are being built, and we have no visibility into them, so you have to find out how many machines were built during that period.
Then there is the incremental discovery of new systems, and finally reporting the results: how many patches were applied and how much work it took. You want to tell your boss: this is my work, this is how much I did this month with only 12 people, and this is the outcome. And your boss has to explain it to your boss’s boss.
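Coverage is only meaningful if machines built during the rollout are counted in the denominator. A minimal sketch, assuming three plain collections of host names (the inventory at the start, hosts discovered since, and hosts confirmed patched):

```python
from typing import Iterable, List, Tuple


def coverage(
    initial_hosts: Iterable[str],
    new_hosts: Iterable[str],
    patched_hosts: Iterable[str],
) -> Tuple[float, List[str]]:
    """Return (coverage rate, sorted list of in-scope hosts still unpatched).

    Hosts built during the rollout count toward the scope, so late arrivals
    cannot hide behind a stale machine list.
    """
    in_scope = set(initial_hosts) | set(new_hosts)
    patched = in_scope & set(patched_hosts)
    missing = sorted(in_scope - patched)
    rate = len(patched) / len(in_scope) if in_scope else 1.0
    return rate, missing
```

The rate is the number for the monthly report; the missing list is the work still to do.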
Beyond these two problems there are others. Patching involves restarts, and the OS may break. We need a locking mechanism: while I am deploying a patch, capacity must not drop too far, and a code deployment at the same time could interfere.
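One way to picture such a lock is a per-pool object that has a single owner at a time (patching or code deployment) and caps how many hosts may be out of service at once. This in-memory class is purely illustrative: the class name, the 10% default, and the owner labels are assumptions, and a real system would need a shared, persistent lock.

```python
from typing import Optional, Set


class PoolLock:
    """Illustrative per-pool lock: one owner at a time, bounded hosts out of service."""

    def __init__(self, pool_size: int, max_out_percent: int = 10) -> None:
        self.max_out = max(1, pool_size * max_out_percent // 100)
        self.owner: Optional[str] = None
        self.out: Set[str] = set()

    def acquire(self, owner: str) -> bool:
        """Claim the pool; fails if another activity already holds it."""
        if self.owner not in (None, owner):
            return False
        self.owner = owner
        return True

    def take_out(self, owner: str, host: str) -> bool:
        """Take a host out of service if the caller holds the lock and capacity allows."""
        if self.owner != owner or len(self.out) >= self.max_out:
            return False
        self.out.add(host)
        return True

    def put_back(self, host: str) -> None:
        """Return a host to service after it has been patched and checked."""
        self.out.discard(host)

    def release(self, owner: str) -> bool:
        """Release the pool once every host is back in service."""
        if self.owner == owner and not self.out:
            self.owner = None
            return True
        return False
```

With this shape, a code deployment that tries to `acquire` the pool during patching is simply refused until the patch job releases it.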
Another is monitoring. If a machine goes offline or restarts, the monitoring staff should know why; they may even say that patching is a normal change and they do not want to see that noise. I can also hand responsibility to other departments through our AD-based or LDAP-based management: you have your permissions, I provide the tools, and if anything goes wrong in your part, it is yours to handle.
System architecture
Here is the overall system architecture, roughly as in this figure: from discovery to presentation of the data, then verification of the vulnerability and your packages, then testing, planning, and deployment. Some parts cannot be brought into automation because they are special applications. Databases, for example: I dare not include them, because I cannot bear that responsibility.
Since there are only two DBs, why should I take on deploying to them? At the end of the process there is a data presentation, where you tell your boss what you have done. Usually the final figures that matter are not the ones operations cares about: what your boss cares about is that he has to report the overall security health to the outside world.
Inside eBay, front-end operations manages all the web servers, other teams manage other things, and we let them manage their own parts based on LDAP or AD.
Then there are the deployment details: how you deploy, how you connect, what your work order is, whether a change is needed, whether you send a work order or a script or a protective script, where your script lives, and who gets notified. When the work order is finished, you know clearly how much is complete, and the system tells you how much is left undone.
Finally there is a chart showing which parts are my team’s job and which belong to colleagues in other departments.
Future
Having covered the current situation, and looking to the future, there is still a lot of room for imagination. One item is kernel hot patching.
We have been talking about the system level, so how do you patch the kernel? The other item is that we want to borrow container technology, so that we no longer need to patch machines in place or reinstall them.
For kernel hot patching we looked at several technologies and chose kpatch; other options are available.
Every machine is like a ship. The operating system is just a carrier that provides computing resources. If all of your code and all of its dependencies live inside a container, I only need to replace the container; or, while the ship underneath is being repaired, the container can be lifted off and placed on another ship.
That is much faster and much easier. It also covers vulnerabilities in the application: as discussed earlier, developers fix those and ship the fix in their deployment, and that becomes much faster too.
In the future, we hope to make patching easy, without pain and without flying blind. That is all for my talk. Thank you.