Preface

Recently, I spent some time reading the SRE book, Site Reliability Engineering: How Google Runs Production Systems. For a summary of its contents, see the book's introduction on Douban. Broadly, it is the first systematic account of the guiding ideas, practices, and related problems of SRE work inside Google, and it is a useful reference for operations engineers and developers alike.

Several ideas in the book impressed me, such as ensuring that SREs spend 50% of their time on project work, error budgets, the Wheel of Misfortune exercise, and postmortems, all of which are inspiring for practitioners. The book mentions many ideas and tools. Organizations differ in culture and structure, so its guiding principles may not be adoptable everywhere, but the tools it mentions can still be put to use. I therefore compiled the list below of the tools mentioned in the book, together with corresponding open source projects, for your reference.

If you find something incomplete, or if you want to discuss a tool in depth, feel free to leave a comment.
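
One of the ideas listed above, the error budget, reduces to simple arithmetic: whatever the availability target does not promise is the room left for failures. The sketch below is only an illustration under assumptions of my own: the function names `error_budget` and `budget_remaining` are hypothetical, and the budget is counted in failed requests over some measurement window, which is one of several possible ways to measure it.

```python
def error_budget(slo: float, total_requests: int) -> int:
    """Failed requests tolerated in the window for a given availability SLO.

    slo is a fraction such as 0.999 (hypothetical example target).
    """
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be in (0, 1]")
    # Everything the SLO does not promise is the budget.
    return round((1.0 - slo) * total_requests)


def budget_remaining(slo: float, total_requests: int, failed_requests: int) -> int:
    """Budget left after the failures seen so far (negative means overspent)."""
    return error_budget(slo, total_requests) - failed_requests


if __name__ == "__main__":
    # A 99.9% target over one million requests tolerates 1000 failures.
    print(error_budget(0.999, 1_000_000))
    print(budget_remaining(0.999, 1_000_000, 250))
```

When the remaining budget reaches zero or goes negative, the team has spent its allowance for the window, which is the signal to slow down risky releases.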

Google technology stack

Function | Product | Open source counterpart | Notes
Distributed consensus system, distributed lock service | Chubby | ZooKeeper, Consul | The book describes it as a strongly consistent storage system
Monitoring service | Borgmon | Prometheus, Riemann, Heka, Bosun | -
- | Photon | - | -
Distributed periodic task system | Cron | - | -
Task distribution system, cluster management system | Borg | - | -
Distributed file system | GFS | - | -
- | - | Mesos | -
Manages alert response and escalation rules | Escalator | - | -
Outage tracking tool (passively collects all alerts issued by the monitoring system and provides annotation, grouping, and data analysis) | Outalator | - | -
Data pipelines | Graphs, Flume | - | -
Large-scale data processing | Workflow | Spanner ? | -
- | Incident Command System | - | -
Build system | Bazel | - | -

Borg: scheduling service (2003), Kubernetes
BNS: Borg Name Service
Bigtable
Blaze/Bazel: build
Rapid: release
MPM (Midas Package Manager): packaging
Sisyphus: release automation framework
Chubby: strongly consistent storage system
Prober: end-to-end monitoring
Protocol Buffer (Protobuf): data exchange format
Alert Manager: alert management service
Dapper: distributed tracing tool
Incident Command System: emergency management
IRC bot
Dagger: dependency injection tool
Auxon: automated capacity planning
gRPC: Google RPC framework
Doorman: cooperative distributed client-side rate limiting system
Zipkin: business flow tracing
Stackdriver

Two gripes

P158: A test system can detect a bug where MTTR is 0.
P253: This type of design is sharded in the workload of the service leader.
P327: Google has little experience dealing with large-scale consumer products running client code that cannot be directly controlled.

Second, a powerful client

Chapters and reviews

Chapter and title | Notes
1 Introduction
2 The Production Environment at Google, from the Viewpoint of an SRE
3 Embracing Risk
4 Service Level Objectives
5 Eliminating Toil
6 Monitoring Distributed Systems
7 The Evolution of Automation at Google | The value of automation; the levels of automation
8 Release Engineering
9 Simplicity
10 Practical Alerting from Time-Series Data
11 Being On-Call
12 Effective Troubleshooting
13 Emergency Response
14 Managing Incidents
15 Postmortem Culture: Learning from Failure
16 Tracking Outages
17 Testing for Reliability
18 Software Engineering in SRE
19 Load Balancing at the Frontend | Best practices for load balancing across data centers. Basic approaches include DNS and VIPs (network load balancers such as F5).
20 Load Balancing in the Datacenter | Discusses load balancing at the application layer: how to even out utilization across servers so that some are not overloaded while others sit idle, and how to identify a backend's real state more accurately, including the lame-duck state.
21 Handling Overload
22 Addressing Cascading Failures
23 Managing Critical State: Distributed Consensus for Reliability
24 Distributed Periodic Scheduling with Cron
25 Data Processing Pipelines
26 Data Integrity: What You Read Is What You Wrote
27 Reliable Product Launches at Scale
28 Accelerating SREs to On-Call and Beyond
29 Dealing with Interrupts
30 Embedding an SRE to Recover from Operational Overload
31 Communication and Collaboration in SRE
32 The Evolving SRE Engagement Model
33 Lessons Learned from Other Industries
34 Conclusion
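
The chapter 20 notes above mention the lame-duck state, in which a backend is still up and finishing its in-flight work but asks clients to stop sending it new requests. The sketch below shows how a client-side picker might take that state into account. It is a minimal illustration, not the book's implementation: the names `State`, `Backend`, and `pick_backend` are hypothetical, and the least-loaded policy is just one simple choice made for the example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class State(Enum):
    HEALTHY = "healthy"          # accepting new requests
    LAME_DUCK = "lame_duck"      # still serving in-flight work, wants no new requests
    UNREACHABLE = "unreachable"  # not responding at all


@dataclass
class Backend:
    name: str
    state: State = State.HEALTHY
    active_requests: int = 0     # requests this client currently has in flight


def pick_backend(backends: List[Backend]) -> Optional[Backend]:
    """Return the healthy backend with the fewest active requests, or None."""
    candidates = [b for b in backends if b.state is State.HEALTHY]
    if not candidates:
        return None
    # Prefer the least busy backend to keep utilization even.
    return min(candidates, key=lambda b: b.active_requests)


if __name__ == "__main__":
    pool = [
        Backend("a", State.HEALTHY, active_requests=5),
        Backend("b", State.LAME_DUCK, active_requests=0),
        Backend("c", State.HEALTHY, active_requests=2),
    ]
    chosen = pick_backend(pool)
    print(chosen.name if chosen else "no backend available")
```

Backend "b" is idle but is skipped because it is in the lame-duck state, so the picker falls back to the least busy healthy backend.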
