How can we improve R&D efficiency? Should we rely on separate local development and test environments, or on full end-to-end testing? This series of articles describes the history and evolution of Lyft's development environments and helps us think about how to build an efficient development environment for large-scale microservices. This is the third article in a four-part series. Original article: Scaling productivity on microservices at Lyft (Part 3): Extending our Envoy mesh with staging overrides [1]
This is the third article in a series on how Lyft scales its development practices as the number of developers and services grows.
- Part 1: Our history of development and test environments
- Part 2: Optimizing for fast local development
- Part 3: Extending our Envoy mesh with staging overrides (this article)
- Part 4: Gating deploys with automated acceptance tests
In the previous article, we described a laptop development workflow designed for fast iteration on local services. In this article, we detail our solution for safe, isolated end-to-end (E2E) testing: a shared staging environment. Before diving into the implementation details, we briefly review the problems that led us to build this system.
Previous integration environment
In Part 1 of this series, we introduced Onebox, the tool we previously used for multi-service end-to-end testing. Onebox users leased a large AWS EC2 VM and started more than 100 services on it to verify that their changes worked across service boundaries. This gave each developer a sandbox running their own copy of Lyft, with control over each service's version, database contents, and runtime configuration.
Each developer ran and managed their own separate Onebox
Unfortunately, Onebox ran into scaling problems as the number of engineers and services at Lyft grew (see the first article for details), and we needed to find a sustainable alternative for end-to-end testing.
We saw the shared staging environment as a viable alternative. Staging's similarity to production gave us confidence, but we needed to add the one missing piece for a safe development environment: isolation.
Staging Environment
The staging environment runs the same technology stack as production, but with elastic resources, simulated user data, and synthetic web traffic generators. Staging is a first-class environment at Lyft: if it becomes unstable and SLOs [2] are affected, on-call engineers and developers raise a SEV [3]. While staging's availability and realistic traffic add confidence to end-to-end tests, a few problems would arise if we encouraged widespread use of staging for testing:
- Staging is a fully shared environment. Just as in production, if someone deploys a broken instance to the staging cluster, it can affect everyone else who depends on that service, directly or transitively.
- New code is delivered by merging a PR into the mainline, which triggers a new deployment pipeline. Testing how an experimental change behaves end to end therefore carries a lot of process overhead: writing tests, getting code review, merging, and progressing through CI/CD.
- This onerous process can lead users to reach for an escape hatch: deploying a PR branch directly to staging. Running unreviewed commits in staging further amplifies the defects that reduce the environment's stability.
Our goal was to overcome these challenges and make staging better suited to manually validating end-to-end workflows. We wanted users to test their code in staging without getting bogged down in process, and to minimize the blast radius if their changes turned out to be broken. To do this, we created staging overrides.
Staging Overrides
Staging overrides are a set of tools for safely and quickly validating user changes in the staging environment. **We fundamentally changed our isolation model: instead of providing fully isolated environments, we isolate requests within a shared environment.** At its core, we let users override how their requests flow through staging and conditionally exercise experimental code. The general workflow is as follows:
- Create a new deployment in staging that is not registered with service discovery, known as an offloaded deployment. This ensures that other users making requests to the service are not routed to the (potentially broken) instance.
- Teach the infrastructure to interpret override information embedded in request headers, and ensure that the override metadata is propagated throughout the request call graph.
- Modify the routing rules of each service so that the override information in the request headers is used to route to the corresponding offloaded deployment, as specified by the override metadata.
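The three steps above can be pictured with a toy model. The sketch below is not Lyft's implementation: the service names, the addresses, and the names `CALL_GRAPH`, `DISCOVERY`, `resolve`, and `trace_request` are all hypothetical. It only illustrates that an override map carried along with a request changes the destination of the hops it names, while every other hop, and every other request, keeps using service discovery.

```python
from typing import Dict, List, Optional

# Hypothetical staging topology: which service each service calls next.
CALL_GRAPH: Dict[str, Optional[str]] = {
    "api": "onboarding",
    "onboarding": "users",
    "users": None,
}

# Hypothetical service discovery: only registered (non-offloaded) instances.
DISCOVERY: Dict[str, str] = {
    "api": "10.0.0.1:80",
    "onboarding": "10.0.0.2:80",
    "users": "10.0.0.3:80",
}


def resolve(service: str, overrides: Dict[str, str]) -> str:
    """Pick the instance for one hop: an override wins, else use discovery."""
    return overrides.get(service, DISCOVERY[service])


def trace_request(entry: str, overrides: Dict[str, str]) -> List[str]:
    """Follow a request through the call graph, carrying the overrides along."""
    path = []
    service: Optional[str] = entry
    while service is not None:
        path.append(f"{service}@{resolve(service, overrides)}")
        service = CALL_GRAPH[service]
    return path


normal_path = trace_request("api", {})
override_path = trace_request("api", {"onboarding": "10.0.0.42:80"})
print(normal_path)
print(override_path)
```

Only the `onboarding` hop differs between the two paths, which is the request-level isolation described above.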
Example scenario
Suppose a user wants to test a new version of the onboarding service in an end-to-end scenario. Previously, with Onebox, the user would launch an entire copy of the Lyft stack and modify the onboarding service to verify that it worked as expected.
Today, in staging, users share the environment but can bring up offloaded instances that do not affect normal staging traffic.
Requests made to staging by a typical user do not pass through any offloaded instance
By attaching a specific header to the request ("request baggage"), the user can opt in to routing the request to the new instance:
Header metadata allows users to modify the call flow on a per-request basis
In the remainder of this article, we'll delve into how we built these components to provide an integrated debugging experience.
Offloaded deployments
Lyft uses Envoy as its service mesh proxy to handle communication between our many services
At Lyft, every instance of every service is deployed alongside an Envoy [4] sidecar that acts as the service's sole ingress and egress. By ensuring that all network traffic goes through Envoy, we give developers a simplified view of traffic, with service abstraction, observability, and extensibility provided in a language-independent way.
A service calls an upstream [5] service by sending a request to its Envoy sidecar, which forwards the request to a healthy instance of the upstream. Envoy's configuration is kept up to date by our control plane, which pushes updates via the xDS API [6] based on Kubernetes events.
Avoiding service discovery
If we want to create an instance that does not receive regular traffic from the mesh, we need to instruct the control plane to exclude it from service discovery. To achieve this, we embed additional information in the Kubernetes pod labels to indicate that the pod is offloaded:
...
app=foo
environment=staging
offloaded-deploy=true
...
We then modified the control plane to filter out these instances, ensuring that they do not receive standard staging traffic.
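A minimal sketch of that filtering step follows. Only the label names and values (`app`, `environment`, `offloaded-deploy=true`) come from the example above; the `Pod` class and the `discoverable_endpoints` function are hypothetical stand-ins, and the real control plane reacts to Kubernetes events and serves the result over xDS rather than filtering a static list.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Pod:
    """Hypothetical stand-in for the pod data a control plane would watch."""
    ip: str
    labels: Dict[str, str] = field(default_factory=dict)


def is_offloaded(pod: Pod) -> bool:
    # Label name and value are taken from the example above.
    return pod.labels.get("offloaded-deploy") == "true"


def discoverable_endpoints(pods: List[Pod], app: str) -> List[str]:
    """Return the IPs that should be advertised to the mesh for `app`."""
    return [
        pod.ip
        for pod in pods
        if pod.labels.get("app") == app and not is_offloaded(pod)
    ]


pods = [
    Pod("10.0.0.2", {"app": "foo", "environment": "staging"}),
    Pod("10.0.0.42", {"app": "foo", "environment": "staging", "offloaded-deploy": "true"}),
    Pod("10.0.0.7", {"app": "bar", "environment": "staging"}),
]
print(discoverable_endpoints(pods, "foo"))
```

The offloaded pod still exists and is reachable by IP; it is simply never handed out as an endpoint for `foo`.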
When a user is ready to create an offloaded deployment in staging (after iterating locally), they first create a pull request on GitHub. Our continuous integration automatically starts building the container images required for a deployment. The user can then use our GitHub bot to explicitly deploy their service to staging as an offloaded deployment:
Our GitHub bot makes it simple to create an offloaded deployment from a PR
In this way, users can create a standalone deployment of a service that shares exactly the same environment as a normal staging deployment: it talks to the standard databases, makes egress calls to other services, and is observable through the standard metrics, logging, and analytics systems. This alone is very useful for developers who just want to SSH into an instance and test a script or run a debugger without worrying about affecting the rest of staging. However, offloaded deployments really come into their own when a developer can open the Lyft app on their phone and make sure that a request is served by the PR code running in their offloaded deployment.
Override headers and context propagation
To route a request to an offloaded deployment, we need to embed metadata in the request that tells the infrastructure when to modify the call flow. The metadata contains routing rules: which services to override, and which offloaded deployment to direct traffic to for each. We decided to carry this metadata in a request header so that it is transparent to services and service owners.
However, we needed to ensure that the header is propagated across the mesh by services written in different languages. We already used the OpenTracing header (x-ot-span-context) to propagate trace information from one request to the next. OpenTracing has a concept called "baggage" [7]: a key/value structure embedded in the header that persists across service boundaries. Encoding our metadata into the baggage, and letting our request and tracing libraries propagate it from one request to the next, allowed us to get up and running quickly.
Constructing and attaching baggage
The actual HTTP header is a base64-encoded trace protobuf [8]. We created our own protobuf, named Overrides, and inject it into the trace's baggage, as shown in the following code:
syntax = "proto3";

/* container for override metadata */
message Overrides {
  // maps cluster_name -> ip_port
  map<string, string> routing_overrides = 1;
}
from base64 import standard_b64decode, standard_b64encode

from flask import Flask, request
from lightstep.lightstep_carrier_pb2 import BinaryCarrier

import overrides_pb2  # generated from the Overrides proto above


def header_from_overrides(overrides: overrides_pb2.Overrides) -> bytes:
    """Attach the `overrides` to the trace's baggage and return the new `x-ot-span-context` header."""
    # decode the trace from the current request context
    header = request.headers.get('x-ot-span-context', '')
    trace_proto = BinaryCarrier()
    trace_proto.ParseFromString(standard_b64decode(header))

    # b64encode the provided custom `overrides` and place in the baggage
    b64_overrides = standard_b64encode(overrides.SerializeToString())
    trace_proto.basic_ctx.baggage_items['overrides'] = b64_overrides

    # re-encode the modified trace for use as an outgoing HTTP header
    return standard_b64encode(trace_proto.SerializeToString())


# create a sample `Overrides` proto that overrides routing for `users` service
overrides_proto = overrides_pb2.Overrides()
overrides_proto.routing_overrides["users"] = "10.0.0.42:80"

with Flask(__name__).test_request_context('/add-baggage'):
    new_header_with_baggage = header_from_overrides(overrides_proto)
    print({"x-ot-span-context": new_header_with_baggage})
    # {'x-ot-span-context': b'Ei8iLQoJb3ZlcnJpZGVzEiBDaFVLQlhWelpYSnpFZ3d4TUM0d0xqQXVOREk2T0RBPQ=='}
Example of extracting the current trace and attaching overrides to it
To abstract this data serialization away from developers, we added header-creation tooling to our existing proxy application (read more about the proxy). Developers point their client at the proxy, which intercepts request/response data with user-defined TypeScript code. We created a helper function, setEnvoyOverride(service: string, sha: string), which looks up the IP address from the SHA, creates the Overrides protobuf, encodes the header, and ensures that it is attached to every request passing through the proxy.
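The helper itself is TypeScript and its source is not shown in this article, so the following is only a rough Python analogue of the steps it is described as performing: look up the address, build the override map, encode it, and attach it to the outgoing headers. Everything here is an assumption made for illustration: the `OFFLOADED_DEPLOYS` lookup table, the `x-demo-overrides` header name, and the use of JSON in place of the Overrides protobuf so that the sketch runs with the standard library alone. The real system places the encoded protobuf in the baggage of `x-ot-span-context`.

```python
import base64
import json
from typing import Dict, Tuple

# Hypothetical header name, used for this sketch only.
DEMO_HEADER = "x-demo-overrides"

# Hypothetical lookup table: (service, commit SHA) -> ip:port of the offloaded deployment.
OFFLOADED_DEPLOYS: Dict[Tuple[str, str], str] = {
    ("users", "abc123"): "10.0.0.42:80",
}


def encode_overrides(routing_overrides: Dict[str, str]) -> str:
    """Serialize the override map. JSON stands in for the Overrides protobuf."""
    raw = json.dumps({"routing_overrides": routing_overrides}, sort_keys=True)
    return base64.standard_b64encode(raw.encode("utf-8")).decode("ascii")


def decode_overrides(header_value: str) -> Dict[str, str]:
    """Inverse of `encode_overrides`."""
    raw = base64.standard_b64decode(header_value).decode("utf-8")
    return json.loads(raw)["routing_overrides"]


def set_envoy_override(headers: Dict[str, str], service: str, sha: str) -> Dict[str, str]:
    """Return a copy of `headers` with an override for `service` attached."""
    ip_port = OFFLOADED_DEPLOYS[(service, sha)]  # KeyError if no such deployment
    overrides = decode_overrides(headers[DEMO_HEADER]) if DEMO_HEADER in headers else {}
    overrides[service] = ip_port
    return {**headers, DEMO_HEADER: encode_overrides(overrides)}


request_headers = set_envoy_override({"accept": "*/*"}, "users", "abc123")
print(decode_overrides(request_headers[DEMO_HEADER]))
```

Merging into any overrides already present lets a user stack several overrides on one request, one call per service.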
Context Propagation
Context propagation is important in any distributed tracing system. We need the metadata to be available throughout the life of a request, so that services many calls deep still have access to the user-specified overrides. We want every service to forward the metadata correctly to subsequent services in the request flow, even if the service itself does not care about its contents.
Each service in the call graph must propagate the metadata to achieve full trace coverage
Lyft's infrastructure team maintains standard request libraries in our most common languages (Python, Go, TypeScript) that handle context propagation for developers. As long as service owners use these libraries to call other services, context propagation is transparent to them.
Unfortunately, during the rollout of this project we found that context propagation was not as widespread as we had hoped. Early on, users often came to us saying that their requests were not being overridden, and the culprit was usually a dropped trace. We invested heavily in making context propagation work across language features (such as Python gevent/greenlets), multiple request protocols (HTTP/gRPC), and various asynchronous jobs and queues (SQS [9]). We also added observability and tooling to diagnose trace loss, such as dashboards that identify service egress calls that are missing the header.
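To make the propagation requirement concrete, here is a minimal sketch of what a request library has to do on behalf of a service. It is not Lyft's library: `handle_incoming`, `outgoing_headers`, and `serve` are hypothetical names, and Python's standard `contextvars` module is used here simply as one way to keep per-request state separate. The point is that the service forwards `x-ot-span-context` without ever inspecting it.

```python
import contextvars
from typing import Dict, Optional

TRACE_HEADER = "x-ot-span-context"

# Holds the trace header of the request currently being handled.
_current_trace = contextvars.ContextVar("current_trace", default=None)


def handle_incoming(headers: Dict[str, str]) -> None:
    """Server side of the library: remember the incoming trace header."""
    _current_trace.set(headers.get(TRACE_HEADER))


def outgoing_headers(extra: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Client side of the library: build headers for a call to another service."""
    headers = dict(extra or {})
    trace = _current_trace.get()
    if trace is not None:
        headers[TRACE_HEADER] = trace  # forwarded untouched, contents not inspected
    return headers


def serve(incoming: Dict[str, str]) -> Dict[str, str]:
    """A service that never looks at the trace but still forwards it."""
    handle_incoming(incoming)
    return outgoing_headers({"content-type": "application/json"})


# Run each request in its own context so that state does not leak between them.
with_trace = contextvars.copy_context().run(serve, {TRACE_HEADER: "Ei8iLQ=="})
without_trace = contextvars.copy_context().run(serve, {})
print(with_trace)
print(without_trace)
```

If any service on the path builds its outgoing headers by hand and skips this step, every override for services further down the call graph is silently lost.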
Extending Envoy
Now that the override metadata is propagated with the request, we need to modify the network layer to read the metadata and redirect the request to the desired offloaded instance.
Because all of our services make requests through their Envoy sidecars, we can embed middleware in these proxies to read the overrides and modify routing accordingly. We use Envoy's HTTP filter system [10] to process requests, implementing two steps in an HTTP filter: reading the override information from the request header, and modifying the routing rules to redirect the request to the offloaded deployment.
Tracing in an Envoy HTTP filter
We decided to create a decoder filter [10], which allows us to parse and react to overrides before the request is sent to the upstream cluster. The HTTP filter system provides a simple API for getting the current destination route as well as all headers of the request being processed. Although the filter is implemented in C++, the following pseudocode illustrates the basics:
def routing_overrides_filter(route, headers):
    routing_overrides = headers.trace().baggage()['overrides']  # {'users': '10.0.0.42:80'}
    next_cluster = route.cluster()  # 'users'

    # modify the route if there's an override for the cluster we are going to
    if next_cluster in routing_overrides:
        # the user provided the ip/port of their offloaded deploy in the header baggage
        offloaded_instance_ip_port = routing_overrides[next_cluster]  # '10.0.0.42:80'

        # redirect the request to the ORIGINAL_DST cluster with the new ip/port header
        headers.set('x-envoy-original-dst-host', offloaded_instance_ip_port)
        route.set_cluster('original_dst_cluster')
The filter uses Envoy's tracing utilities to extract the overrides contained in the baggage. While filters could always access trace information such as the trace ID and whether the trace is sampled, we first had to modify Envoy so that it could extract the baggage [11]. With this change merged, the filter can use the new API to extract the baggage from the underlying trace: routing_overrides = headers.trace().baggage()['overrides']
Original Destination Cluster
Assuming an override applies to the current destination cluster, the request must be redirected to the offloaded deployment. We use Envoy's original destination [12] (ORIGINAL_DST) cluster to send the request to the override destination provided in the baggage.
For an ORIGINAL_DST cluster configured the way ours is, the final destination is determined by a special x-envoy-original-dst-host [13] header that contains an ip/port, such as 10.0.0.2:80, and that an HTTP filter can change to redirect the request.
For example, if a request was originally destined for the users cluster but the user has overridden its ip/port, we set x-envoy-original-dst-host to the supplied ip/port.
Once x-envoy-original-dst-host is set, the filter needs to send the request to the ORIGINAL_DST cluster to ensure that it is delivered to the new destination. This requirement prompted our second change to Envoy: support for route mutability [14]. With this change merged, the filter can change the destination cluster: route.set_cluster('original_dst_cluster').
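The two halves of this mechanism, the filter rewriting the route and the ORIGINAL_DST cluster honoring the header, can be simulated in a few lines. This is a runnable Python model, not Envoy code: `Route`, `apply_overrides`, `pick_upstream`, `send`, and the `CLUSTER_ENDPOINTS` table are hypothetical stand-ins, and only the header name and the cluster name are taken from the text above.

```python
from dataclasses import dataclass
from typing import Dict

ORIGINAL_DST_CLUSTER = "original_dst_cluster"
ORIGINAL_DST_HEADER = "x-envoy-original-dst-host"

# Hypothetical discovery data for regular clusters.
CLUSTER_ENDPOINTS: Dict[str, str] = {"users": "10.0.0.3:80"}


@dataclass
class Route:
    cluster: str


def apply_overrides(route: Route, headers: Dict[str, str], overrides: Dict[str, str]) -> None:
    """Filter step: rewrite the route when the target cluster is overridden."""
    if route.cluster in overrides:
        headers[ORIGINAL_DST_HEADER] = overrides[route.cluster]
        route.cluster = ORIGINAL_DST_CLUSTER


def pick_upstream(route: Route, headers: Dict[str, str]) -> str:
    """Routing step: ORIGINAL_DST reads the header, other clusters use discovery."""
    if route.cluster == ORIGINAL_DST_CLUSTER:
        return headers[ORIGINAL_DST_HEADER]
    return CLUSTER_ENDPOINTS[route.cluster]


def send(cluster: str, overrides: Dict[str, str]) -> str:
    """Run one request through both steps and return the chosen address."""
    route = Route(cluster)
    headers: Dict[str, str] = {}
    apply_overrides(route, headers, overrides)
    return pick_upstream(route, headers)


print(send("users", {}))
print(send("users", {"users": "10.0.0.42:80"}))
```

An override for a different cluster leaves the request on its normal route, which is why a single baggage entry can safely travel through the entire call graph.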
Results
With offloaded deployments, baggage propagation, and the Envoy filter, we have now shown all the major components of staging overrides 🎉.
This workflow greatly reduces the overhead of end-to-end testing. We now see 100 individual service deployments per month, and staging overrides have the following advantages over the previous Onebox solution:
- Environment setup: Onebox required users to launch hundreds of containers and run custom seed scripts, so developers spent at least an hour preparing an environment. With staging overrides, users can deploy a change into an end-to-end environment in 10 minutes.
- Lower infrastructure cost: Onebox ran a technology stack entirely separate from staging and production, so underlying infrastructure components (such as networking and observability) typically had to be implemented separately. Moving end-to-end testing to staging reduces infrastructure support costs, because improvements are made and maintained in one place.
- Lower-cost functional verification: Because of the differences between Onebox and production, users often (reasonably) doubted the correctness of their code even after end-to-end testing on Onebox. Staging is much closer to production in terms of data and traffic patterns, giving users more confidence that a change that works in staging will work in production.
Additional work
Enabling staging overrides was a cross-organizational effort involving networking, deployment, observability, security, and developer tooling. Here are some additional workstreams not covered above:
- Configuration overrides: In addition to specifying routing overrides in the baggage, we also allow users to modify configuration variables on a per-request basis. By modifying our configuration library to give priority to the baggage, users can turn on a feature flag for their own requests before enabling it globally.
- Security implications: Because overrides can specify routing rules, the filter's functionality must be locked down to ensure that bad actors cannot route requests arbitrarily.
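The article does not describe how the filter was locked down, so the following is only one conceivable check, shown to make the concern concrete: accept an override only if its target is a literal IP address inside an allowed network and on an allowed port. The network range, the port set, and the function names are all hypothetical.

```python
import ipaddress
from typing import Dict

# Hypothetical policy values, chosen only for illustration.
ALLOWED_NETWORK = ipaddress.ip_network("10.0.0.0/24")
ALLOWED_PORTS = {80}


def is_allowed_target(ip_port: str) -> bool:
    """Accept only `ip:port` targets inside the allowed network and port set."""
    host, sep, port = ip_port.rpartition(":")
    if not sep or not port.isdigit():
        return False
    try:
        address = ipaddress.ip_address(host)
    except ValueError:
        return False  # hostnames and malformed addresses are rejected
    return address in ALLOWED_NETWORK and int(port) in ALLOWED_PORTS


def sanitize_overrides(overrides: Dict[str, str]) -> Dict[str, str]:
    """Drop every override whose target fails the policy check."""
    return {svc: target for svc, target in overrides.items() if is_allowed_target(target)}


requested = {
    "users": "10.0.0.42:80",
    "rides": "8.8.8.8:80",
    "pay": "evil.example.com:80",
    "admin": "10.0.0.5:22",
}
print(sanitize_overrides(requested))
```

A check of this kind belongs in the network layer rather than in client tooling, because the header can be set by anyone able to send a request.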
Future work
Going forward, there is a lot more we can do with staging overrides to let users recreate the end-to-end scenarios they want to verify:
- Shared baggage: Provide users with a centrally managed baggage store that allows persisting a unique set of overrides (service foo is X, service bar is Y, flag baz is Z), to improve collaboration by sharing exact scenarios with team members.
- More override use cases: Make our infrastructure aware of other kinds of overrides so that users can control the behavior of their requests. For example, we could use Envoy fault injection [15] to inject artificial delays into a request, temporarily enable debug logging, or redirect to a different database.
- Integration with local development: We could allow overrides to reroute requests directly to a user's laptop, rather than requiring the user to launch their PR instance in staging.
Stay tuned for the next article in the series, which will show how we use automated acceptance tests to gate production deployments during the delivery phase!
References:
[1] Scaling productivity on microservices at Lyft (Part 3): Extending our Envoy mesh with staging overrides: eng.lyft.com/scaling-pro…
[2] Service Level Objective: en.wikipedia.org/wiki/Servic…
[3] Severity Levels: response.pagerduty.com/before/seve…
[4] Envoy Proxy: www.envoyproxy.io/
[5] Terminology: www.envoyproxy.io/docs/envoy/…
[6] xDS protocol: www.envoyproxy.io/docs/envoy/…
[7] Tags, logs and baggage: opentracing.io/docs/overvi…
[8] lightstep_carrier.proto: github.com/lightstep/l…
[9] Amazon Simple Queue Service: aws.amazon.com/sqs/
[10] HTTP filters: www.envoyproxy.io/docs/envoy/…
[11] Baggage methods to Tracing::Span: github.com/envoyproxy/
[12] Original destination: www.envoyproxy.io/docs/envoy/…
[13] Original destination host request header: www.envoyproxy.io/docs/envoy/…
[14] HTTP: Support route mutability: github.com/envoyproxy/…
[15] Fault Injection: www.envoyproxy.io/docs/envoy/…
Hello, my name is Yu Fan. I used to do R&D at Motorola, and I now work at Mavenir in a technical role. I have long been interested in communications, networking, back-end architecture, cloud native, DevOps, CI/CD, blockchain, AI, and other technologies. My official WeChat account is DeepNoMind.