Preface
Recently I spent some time reading the book Site Reliability Engineering: How Google Runs Production Systems (published in Chinese as SRE: Google 运维解密). For a summary of its contents, see the introduction on Douban. The book is the first systematic public account of the guiding ideas, practices, and related problems of SRE operations inside Google, and it is a useful reference for operations engineers and developers alike.
Several ideas in the book impressed me, such as ensuring that SREs spend 50% of their time on project work, error budgets, the Wheel of Misfortune, and postmortems, all of which are inspiring for practitioners. The book mentions many ideas and tools. Organizations differ in culture and institutional background, so the guiding principles may not be directly adoptable everywhere, but the tools can still be useful to others. I therefore compiled the following list of tools mentioned in the book and looked up corresponding open source projects for reference.
If you find something incomplete, or if you want to discuss a tool in depth, feel free to leave a comment.
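As a quick illustration of the error budget idea mentioned above, here is a minimal Python sketch of the arithmetic. It assumes a request-based availability SLO measured over a fixed window; the function names `error_budget` and `budget_remaining` are hypothetical and are not taken from the book or from any Google tool.

```python
def error_budget(slo: float, total_requests: int) -> int:
    """Number of requests that may fail in the window without breaching the SLO."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")
    return round((1.0 - slo) * total_requests)


def budget_remaining(slo: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget still unspent (negative once it is overspent)."""
    budget = error_budget(slo, total_requests)
    if budget == 0:
        raise ValueError("window too small to yield a non-zero error budget")
    return 1.0 - failed_requests / budget


# Assumed example numbers: a 99.9% SLO over one million requests.
print(error_budget(0.999, 1_000_000))
print(budget_remaining(0.999, 1_000_000, 250))
```

With a 99.9% target, one request in a thousand may fail, so the budget for one million requests is 1,000 failures. Once the remaining fraction reaches zero, the team would hold back risky releases until the window rolls over.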
Google technology stack
Function | Google product | Open source counterparts | Notes |
---|---|---|---|
Distributed consensus system, distributed lock service | Chubby (described in the book as a strongly consistent storage system) | ZooKeeper, Consul | |
Monitoring service | Borgmon | Prometheus, Riemann, Heka, Bosun | |
Photon | | | |
Distributed periodic task system | Cron | | |
Task distribution system, cluster management system | Borg | | |
Distributed file system | GFS | | |
Mesos | | | |
Management of alert response and escalation rules | Escalator | | |
Outage tracking tool (passively collects all alerts issued by the monitoring system and provides annotation, grouping, and data analysis) | Outalator | | |
Data pipelines | Graphs, Flume | | |
Large-scale data processing | Workflow | Spanner? | |
Incident Command System | | | |
Build system | Bazel | | |
Name | Description |
---|---|
Borg | Scheduling service (2003); open source counterpart: Kubernetes |
Borg Name Service (BNS) | |
Bigtable | |
Blaze/Bazel | Build |
Rapid | Release |
Midas Package Manager (MPM) | Package management |
Sisyphus | Release automation framework |
Chubby | Strongly consistent storage system |
Prober | End-to-end monitoring |
Protocol Buffer (Protobuf) | Data exchange format |
Alert Manager | Alert management service |
Dapper | Distributed tracing tool |
Incident Command System | Emergency management |
IRC Robot | |
Dagger | Dependency injection tool |
Auxon | Automated capacity planning |
gRPC | Google RPC framework |
Doorman | Cooperative distributed client-side throttling system |
Zipkin | Request flow tracing |
Stackdriver | |
Two gripes
P158: A test system can detect a bug where MTTR is 0.
P253: This type of design is sharded in the workload of the service leader.
P327: Google has little experience dealing with large scale consumer products running client code that cannot be directly controlled.
Second, a powerful client
Chapters and impressions
Chapter and title | Impressions |
---|---|
1 Introduction | |
2 The Production Environment at Google, from the Viewpoint of an SRE | |
3 Embracing Risk | |
4 Service Level Objectives | |
5 Eliminating Toil | |
6 Monitoring Distributed Systems | |
7 The Evolution of Automation at Google | The value of automation; levels of automation |
8 Release Engineering | |
9 Simplicity | |
10 Practical Alerting from Time-Series Data | |
11 Being On-Call | |
12 Effective Troubleshooting | |
13 Emergency Response | |
14 Managing Incidents | |
15 Postmortem Culture: Learning from Failure | |
16 Tracking Outages | |
17 Testing for Reliability | |
18 Software Engineering in SRE | |
19 Load Balancing at the Frontend | Best practices for load balancing policies across data centers. Basic approaches include DNS and VIPs (network load balancers such as F5). |
20 Load Balancing in the Datacenter | Discusses load balancing at the application layer: how to even out utilization across servers so that some are not overloaded while others sit idle, and how to identify the real state of a backend more accurately, for example the lame-duck state. |
21 Handling Overload | |
22 Addressing Cascading Failures | |
23 Managing Critical State: Distributed Consensus for Reliability | |
24 Distributed Periodic Scheduling with Cron | |
25 Data Processing Pipelines | |
26 Data Integrity: What You Read Is What You Wrote | |
27 Reliable Product Launches at Scale | |
28 Accelerating SREs to On-Call and Beyond | |
29 Dealing with Interrupts | |
30 Embedding an SRE to Recover from Operational Overload | |
31 Communication and Collaboration in SRE | |
32 The Evolving SRE Engagement Model | |
33 Lessons Learned from Other Industries | |
34 Conclusion | |
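To make the note on chapter 20 more concrete, here is a minimal Python sketch of a backend picker that takes the lame-duck state into account. It is only an illustration under stated assumptions, not the algorithm from the book: the names `State`, `Backend`, and `pick_backend` are hypothetical, and the "fewest active requests" policy is a simple stand-in for whatever balancing policy a real system uses.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterable, Optional


class State(Enum):
    HEALTHY = "healthy"      # accepting new requests
    LAME_DUCK = "lame_duck"  # still finishing in-flight work, asks for no new requests
    UNHEALTHY = "unhealthy"  # not responding or failing health checks


@dataclass
class Backend:
    name: str
    state: State
    active_requests: int = 0


def pick_backend(backends: Iterable[Backend]) -> Optional[Backend]:
    """Return the healthy backend with the fewest active requests, or None."""
    # Lame-duck and unhealthy backends are skipped for new requests.
    candidates = [b for b in backends if b.state is State.HEALTHY]
    if not candidates:
        return None
    return min(candidates, key=lambda b: b.active_requests)


# Assumed example pool: "b" is the least busy but is in lame-duck state.
pool = [
    Backend("a", State.HEALTHY, active_requests=7),
    Backend("b", State.LAME_DUCK, active_requests=1),
    Backend("c", State.HEALTHY, active_requests=3),
]
chosen = pick_backend(pool)
print(chosen.name if chosen else "no backend available")
```

The point of a separate lame-duck state is that a backend that is shutting down can stop receiving new traffic without being treated as failed, so requests already in flight can finish cleanly.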