What is DevOps?
DevOps (a compound of “development” and “operations”) is a software engineering culture and practice that aims to unify software development (Dev) and software operations (Ops). Its main characteristic is strong advocacy of automation and monitoring at every step of software construction, from integration, testing, and release to deployment and infrastructure management. DevOps aims for shorter development cycles, higher deployment frequency, and more reliable releases, closely aligned with business goals.
DevOps is the practice of developers being responsible for operating their services in production, 24/7. This includes development, testing, on-call duty, reliability engineering, disaster recovery, SLO definition, monitoring setup and alerting, debugging and performance analysis, incident root cause analysis, and provisioning and deployment using shared infrastructure primitives.
A brief history of operating and maintaining Internet applications
- In the first phase, Internet applications were built and deployed in much the same way that “shrink-wrapped” software was shipped. Three distinct job roles (development, QA, and operations) worked together to move applications from development to production over long engineering cycles. In this phase, each application was deployed in a dedicated data center or hosting facility, which further required operators familiar with the site-specific network, hardware, and system administration.
- In the second phase, primarily spearheaded by Amazon and Google in the late 1990s and early 2000s, Internet applications at hyper-growth companies began to adopt practices similar to the modern DevOps movement (loosely coupled services, agile development and deployment, automation, etc.). These companies still ran their own (large) data centers, but because of the scale involved, they started to build centralized infrastructure teams to solve the problems common to all services (networking, monitoring, deployment, configuration, data storage, caching, physical infrastructure, etc.). Even so, Amazon and Google never fully did away with distinct job roles (Amazon calls them systems engineers, Google calls them site reliability engineers), recognizing the different skills and interests that each involves.
- In the third phase, the cloud native era, Internet applications are built from scratch on hosted elastic architectures, typically provided by one of the “Big Three” public cloud providers. Getting a product to market as quickly as possible has always been the priority, given the high probability of failure, but in the cloud native era the “out of the box” infrastructure allows for a remarkably fast pace of iteration. Another defining characteristic of companies founded in this era is that they avoid hiring for non-software-engineering roles. The available infrastructure base is so strong that they believe venture capital is better spent on software developers who can do both engineering and operations, that is, DevOps.
Why is DevOps good for modern Internet startups?
- In my experience, a successful early-stage startup engineer is a special kind of engineer. They are risk tolerant, learn extremely fast, can get things done as quickly as possible regardless of technical debt, can usually work in multiple systems and languages, and often have prior experience in operations and systems administration or are willing to learn what they need to know. In short, the typical startup engineer is a great fit to be a DevOps engineer, whether they want to call themselves that or not.
- As mentioned above, modern public clouds provide incredible infrastructure to build on. Most of the basic operational tasks of the past have been automated away, and what remains is manageable enough to release a minimum viable product and find out whether there is product-market fit.
- I firmly believe that the quality of systems improves when developers are required to be on call and take responsibility for the code they write. No one likes to be paged regularly. This feedback loop builds a better product, and as I described in the first point above, engineers attracted to early-stage startup work are very willing to learn and do operational work. This is especially true given that poor reliability often has few consequences for an early-stage startup. Reliability only needs to be good enough for the product to find its niche and move into hyper-growth.
What happens when a modern Internet startup experiences hyper-growth?
- Rapid headcount growth puts severe strain on communication and engineering efficiency. I highly recommend reading The Mythical Man-Month (still relevant nearly 50 years after its initial publication) for more on this topic.
- The above almost always leads to a shift from monolithic to microservice architectures, which is a way to decouple development teams and improve communication and engineering efficiency.
- The shift from a monolithic to a microservice architecture increases the complexity of the system infrastructure by several orders of magnitude. Networking, observability, deployment, library management, security, and hundreds of other concerns that were previously easy to handle become major problems that need to be addressed.
- At the same time, hyper-growth means increased traffic and the resulting technical scaling problems, as well as greater impact from both total outages and minor user experience issues.
Core infrastructure team
The fallacy of substitutability
- As I described above, modern cloud native technologies and numerous abstractions allow for building very rich applications with increasingly complex infrastructures. Naturally, most companies will no longer need specialized skills such as data center design and operations.
- For the past 15 years, the industry has focused on software engineering as the foundation of all disciplines. For example, Microsoft eliminated traditional QA engineers and replaced them with software test engineers, on the idea that manual QA was inefficient and that all testing should be automated. Similarly, traditional operations roles have been replaced by site reliability engineers or the like, on the idea that manual operations are inefficient and that the only way to scale is through software automation. To be clear, I agree with these trends. Automation is a more efficient way to scale.
- Generalists versus specialists. More complex applications and architectures require more specialized expertise to succeed, be it front-end, infrastructure, client, operations, testing, data science, etc. This doesn’t mean that generalists are no longer useful, or that generalists can’t learn and become specialists; it just means that larger engineering organizations need different types of engineers to succeed.
- Not all engineers like to do everything. Some engineers like to work full stack. Some like to specialize. Some like to write code. Some like to debug. Some like the UI. Some like systems. An engineering organization that supports the continued growth of engineers of all types must accept that employee happiness sometimes means working on certain types of problems and not others.
- Not all engineers are good at everything. I’ve met a lot of great people throughout my career. Some can start with an empty file in an IDE and create an incredible system from scratch. At the same time, these same people often have little intuition for how to run reliable systems, how to debug them, how to monitor them, and so on. Conversely, I have been through many interview loops in which truly incredible operations engineers, who could bring huge benefits to an entire organization purely through their debugging expertise and natural instincts for running reliable systems, were rejected just because they didn’t show enough coding skill.
Breakdown
- Migrate to microservices. By the time an engineering organization reaches about 75 people, it is almost certain that a core infrastructure team has started developing the basic common functionality that product teams need to build microservices.
- Pure DevOps. At the same time, product teams are told to do DevOps.
- Reliability consultant. At this organizational scale, engineers who tend to work in infrastructure are likely to be those with prior operations experience or good intuition in the field. These engineers inevitably become de facto SRE/production engineers and serve as consultants to help other members of the organization while continuing to develop the infrastructure needed to run the business.
- Lack of education. As an industry, we now expect to hire people who can both develop and operate Internet services. However, we generally do a poor job of training new employees and of providing the continuing education needed to carry out the task. How can we expect engineers to have operational intuition when we never teach the skills?
- Support breakdown. As hiring accelerates, the core infrastructure team can no longer keep building and operating the infrastructure critical to business success while also supporting product teams with their operations. Infrastructure engineers end up with a second job as organization-wide SRE consultants on top of their existing workload. Everyone understands that training and documentation are vital, but few prioritize the time needed to get them done.
- Burnout. Worse, the situation described above causes attrition and lowers morale throughout the organization. Product engineers feel they are being asked to do things they don’t want to do or haven’t been taught to do. Infrastructure engineers wear down under the weight of providing support, knowing that training and documentation are urgently needed but unable to prioritize them while keeping existing systems running with high reliability across the company.
Is there a middle ground between the “old way” and the DevOps way?
- Are only SREs on call, or do software engineers share on-call responsibility?
- Do SREs do actual engineering and automation, or are they only expected to perform manual and repetitive tasks, such as deployment, resolving recurring pages, and so on?
- Are SREs part of a central consulting organization or embedded in the product teams?
What is the correct SRE model?
- Recognize operations and reliability engineering as a separate and highly valuable skill set. Our rush to automate everything and the idea that software engineers are interchangeable has marginalized a subset of engineers who are as valuable as software engineers. Operations engineers and software engineers are partners, not interchangeable cogs.
- SREs are not on-call, dashboard-watching, or deployment monkeys. They are software engineers who focus on reliability tasks rather than product tasks. The ideal structure requires all engineers to perform basic operations tasks, including on-call, deployment, and monitoring. I think this is very important because it helps avoid class/job stratification between reliability engineers and software engineers, and it makes software engineers more directly responsible for product quality.
- SREs should be embedded in the product team but should not report to the product team’s engineering manager. This allows SREs to blend in with their team and gain mutual trust while keeping checks and balances in place, so that a real dialogue can take place when weighing reliability against features.
- The goal of embedded SREs is to improve product reliability by implementing reliability-oriented features and automation, guiding and educating the rest of the team on operational best practices, and acting as a liaison between the product team and the infrastructure team (feedback on documentation, pain points, required features, etc.).
Conclusion
- Any new technology company that wants to compete needs DevOps-style agile development and automation.
- Public cloud native primitives and a small core infrastructure team allow engineering organizations to scale to hundreds of engineers before they start to suffer operationally from a lack of education and role specialization.
- Getting ahead of these human scaling issues requires a real investment in new-hire and continuing education, documentation, and the development of embedded SRE teams that bridge the gap between the product and infrastructure teams.
Original link: The Human Scalability of DevOps (Translation: Fengxsong)