Original author: Nick Craver. Translators: Luo Sheng & DiJingChao (hj engineer). Original article: nickcraver.com/blog/2017/0… This is an authorized translation; please credit the author and source when reproducing it.

Today, HTTPS is deployed by default on Stack Overflow. All traffic is now redirected to https://, and Google links will switch over in the coming weeks. Flipping the switch itself was a piece of cake; getting to the point where we could flip it took years. As of now, HTTPS is the default on all of our Q&A sites.

For the past two months, we’ve been rolling out HTTPS across the Stack Exchange network. Stack Overflow is the last, and by far the biggest, site. This is a huge milestone for us, but it is by no means the end. We still have plenty of work to do, but the end is finally in sight. Yay!

Note: this is the story of a long journey. Very long. As you may have noticed from your suddenly tiny scrollbar. The problems we ran into were not unique to Stack Exchange/Overflow, but the combination of them was fairly rare. I’ll cover some of our trials and tribulations, twists and turns, mistakes and wins, as well as some of the open source projects that came out of it; hopefully those details are useful. Because everything is so intertwined, it’s hard to organize this article chronologically, so I’ll break it down by topic: architecture, application layer, errors, and so on.

First, let’s mention why our situation is relatively unique:

  • We have hundreds of domain names (a large number of sites and services)
    • Lots of second-level domains (stackoverflow.com, stackexchange.com, askubuntu.com, etc.)
    • A large number of fourth-level domains (e.g., meta.gaming.stackexchange.com)
  • We allow users to submit and embed content (such as images in posts and YouTube videos)
  • We only have one data center (a single source of latency)
  • We use WebSockets, with no fewer than half a million active at any one time (proxy concerns)
  • We have advertising (and advertising networks)
  • We get DDoSed (proxy concerns)
  • We have a lot of sites and applications that still communicate over an HTTP API (proxy concerns)
  • We care a lot about performance (maybe a little too much)

Since this article is so long, I’ll list the links here:

  • The opening
  • The TL;DR
  • Infrastructure
    • Certificates
      • Meta subdomains (meta.*.stackexchange.com)
    • Performance: HTTP/2
    • HAProxy: supporting HTTPS
    • CDN/proxy layer: optimizing latency with Cloudflare and Fastly
      • Preparation for optimizing the proxy layer: client performance testing
      • Cloudflare
        • Railgun
      • Fastly
    • The global DNS
    • Testing
  • Application layer/code
    • Application Layer Preparation
    • The global login
    • Local HTTPS development
    • Mixed content
      • From you
      • From us
    • Redirects (301)
    • Websockets
  • The unknown
  • Errors
    • Relative protocol URLs
    • APIs and .internal
    • 301 caching
    • The help center episode
  • Open source
  • The next step
    • HSTS preload
    • Chat
    • Today

The opening

We started thinking about deploying HTTPS on Stack Overflow back in 2013. Yes, it’s now 2017. So what held us up for four years? The same things that hold up any IT project: dependencies and priorities. To be honest, the data on Stack Overflow isn’t as sensitive as most. We’re not a bank, we’re not a hospital, we don’t handle credit card payments, and we even publish most of our database quarterly over HTTP and BitTorrent. That means that, purely from a security standpoint, this wasn’t as urgent for us as it is elsewhere. In terms of dependencies, though, we’re more complicated than most: there are several big areas where an HTTPS deployment gets stuck, and the combination of them is fairly unique. As we’ll see later, some of the domain name problems are also permanent.

Some of the areas that caused us trouble include:

  • User content (users can upload images or specify URLs)
  • Advertising Network (Contracts and Support)
  • Single data center hosting (latency)
  • Hundreds of domains at different levels (certificates)

So why do we need HTTPS at all? Because data isn’t the only thing that deserves protection. Our users include operators, developers, and employees at companies of all technical levels. We want their communication with our sites to be secure. We want every user’s browsing history to be private. Some people secretly love monads and are afraid of being found out. At the same time, Google gives HTTPS sites a boost in search rankings (though nobody knows how much).

Oh, and performance. We love performance. I love performance. You love performance. My dog loves performance. Let me give you a performance hug. Wasn’t that nice? Thank you. You smell great.

The TL;DR

A lot of people just want the short version, so let’s start with a quick Q&A (we love Q&A!):

  • Q: What protocols do you support?
    • A: TLS 1.0, 1.1, and 1.2 (note: Fastly is preparing to drop TLS 1.0 and 1.1). We will support TLS 1.3 soon as well.
  • Q: Do you support SSL v2 or v3?
    • A: No. These protocols are not secure; everyone should disable them as soon as possible.
  • Q: Which cipher suites do you support?
    • A: On the CDN, we use Fastly’s default suites.
    • A: On our own load balancers, we use Mozilla’s modern compatibility suites.
  • Q: Does Fastly connect back to the origin over HTTPS?
    • A: Yes. If the request to the CDN is HTTPS, the request back to the origin is HTTPS too.
  • Q: Do you support forward secrecy?
    • A: Yes.
  • Q: Do you support HSTS?
    • A: Yes. We are rolling it out gradually across the Q&A sites. Once that is done we will move it out to the edge.
  • Q: Do you support HPKP?
    • A: No, and we probably won’t.
  • Q: Do you support SNI?
    • A: No. We use a combined wildcard certificate for HTTP/2 performance reasons (more on that later).
  • Q: Where do you get your certificates?
    • A: We use DigiCert, and they’re great.
  • Q: Do you support IE 6?
    • A: Not anymore, finally. IE 6 does not enable TLS by default (though you can turn on TLS 1.0), and we don’t support SSL. Most IE 6 users will lose access to Stack Overflow once our 301 redirects are in place; when we drop TLS 1.0, all of them will.
  • Q: What do you use as a load balancer?
    • A: HAProxy (which uses OpenSSL internally).
  • Q: What was the motivation for HTTPS?
    • A: People kept attacking routes on our controllers, like stackoverflow.com/admin.php.

Certificates

Let’s talk about certificates first, because this is the most misunderstood part. More than a few people have told me they’ve installed an HTTPS certificate, so they’re all set for HTTPS. Well, take another look at that tiny scrollbar on the right; this article has barely started. Do you really think it’s that simple? A little life lesson: it’s not.

One of the most common questions is, “Why not just use Let’s Encrypt?”

The answer: it doesn’t work for us. Let’s Encrypt is a great product and I hope they’re around for a long time; it’s an excellent choice when you have one or a few domains. Unfortunately, Stack Exchange has hundreds of sites, and Let’s Encrypt does not support wildcard certificates. That ruled it out for us: we’d have to deploy a certificate (or two) for every new Q&A site, which would complicate our deployment, and we’d either have to drop clients without SNI support (about 2% of traffic) or provide far more IPs than we currently have.

Another reason we want control of our certificates is that we want identical certificates on both our local load balancers and our CDN/proxy provider. Without that, we couldn’t fail away from the proxy cleanly: any client enforcing HTTP Public Key Pinning (HPKP) would fail validation. We’re still evaluating whether to use HPKP, but we need to be ready in case we ever do.

Many people are surprised when they see our main certificate, which contains our main domains plus the wildcard subdomains. It looks something like this:

Main Certificate

Why do it this way? Honestly, because DigiCert handles it for us; they take on the hassle of re-issuing the merged certificate every time we need a change. Beyond that, we want as many people as possible to be able to use our sites, which includes clients that don’t support SNI (for example, Android 2.3 was still a thing when we started this project). It’s also about HTTP/2 and some real-world constraints; we’ll get to those later.

Meta subdomains (meta.*.stackexchange.com)

One of the ideas behind Stack Exchange is that every Q&A site has a companion site for discussing the site itself. We call it the “second place”. For example, meta.gaming.stackexchange.com belongs to gaming.stackexchange.com. There’s nothing special about it, except for the domain name: it’s a fourth-level domain.

I’ve talked about this before, but what’s the problem? Specifically, *.stackexchange.com covers gaming.stackexchange.com (and hundreds of other sites), but it does not cover meta.gaming.stackexchange.com. RFC 6125 (Section 6.4.3) says:

Clients should not attempt to match a domain name with a wildcard in the middle (e.g., do not match bar.*.example.net)

This means we cannot use a meta.*.stackexchange.com wildcard. So what are the options?

  • Solution 1: Deploy SAN certificates (multi-domain certificates)
    • We would need 3 certificates and 3 IP addresses (each certificate supports at most 100 domains), and it would complicate launching new sites (although that mechanism has since changed)
    • We would have to deploy 3 custom certificates at the CDN/proxy layer
    • We would need additional DNS entries for every meta.* domain
      • Per DNS rules, we would have to add a DNS entry for each such site individually rather than in bulk, raising the cost of launching new sites and maintaining the proxy
  • Solution 2: Move all of these domains to *.meta.stackexchange.com
    • A painful one-time migration, but low certificate maintenance cost afterwards
    • We would need to deploy a global login system (more on that below)
    • This still does not solve the includeSubDomains problem with HSTS preloading (discussed below)
  • Solution 3: Do nothing and give up
    • The easiest option, but not really an option at all

We deployed the global login system, then 301-redirected the child meta domains to their new addresses, like gaming.meta.stackexchange.com. After doing that, we realized that because these domains *used to* exist, they are a big problem for HSTS preloading. That is still a work in progress, and I’ll discuss it at the end of the article. The same problem exists for sites like meta.pt.stackoverflow.com, but fortunately we only have four non-English versions of Stack Overflow, so it didn’t balloon.

Oh, and there was another problem with the plan itself. Because we move cookies up to the parent domain and rely on subdomains inheriting them, we had to sort out some other domains. For example, we use SendGrid to send email in our new system (rolling out now). Mail is sent from the stackoverflow.email domain, and the links inside it point to sg-links.stackoverflow.email (managed via CNAME), so your browser never sends sensitive cookies along. If the links were under links.stackoverflow.com, your browser would send the cookies for that domain. We have a number of services that use our domain names but are not run by us; those need to move off the domains that carry our cookies, or we would be sending your cookies to servers we don’t control. It would be a shame to leak cookie data over something like that.

We tried proxying our HubSpot CRM site and stripping the cookies in transit. Unfortunately, HubSpot uses Akamai, which decided our HAProxy instances were bots and blocked them. It was fun the first three times… after that, it clearly wasn’t going to work. We never tried it again.

Are you wondering why the Stack Overflow blog lives at stackoverflow.blog/? Yes, that’s also for security. We moved the blog to an external service that marketing and other teams can use more easily, and precisely because of that, we can’t keep it on a domain that carries cookies.

The above scenarios involve subdomains, includeSubDomains and HSTS preloading, which we will cover in a moment.

Performance: HTTP/2

A long time ago, everyone thought HTTPS was slower. And back then it was. But times change: when we say HTTPS now, we really mean HTTP/2 over HTTPS. HTTP/2 doesn’t require encryption in the spec, but in practice it does: the major browsers only enable it (and most of its features) over an encrypted connection. You can argue about what the spec says, but browsers are the reality you ship to. I sincerely wish the protocol had just been called HTTPS/2 to save everyone some time. Browser vendors, are you listening?

HTTP/2 has a number of enhancements, particularly the ability to proactively push resources before the user requests them. I won’t expand on it here; Ilya Grigorik has written an excellent piece on it. Briefly, the main advantages are:

  • Request/response multiplexing
  • Server push
  • Header compression
  • Prioritization of traffic
  • Fewer connections

Hmm, why haven’t I mentioned certificates yet?

A little-known feature is that you can push content to different domains as long as the following conditions are met:

  1. The two domains must resolve to the same IP address
  2. Both domains must use the same TLS certificate (see where this is going?)

Let’s take a look at our current DNS configuration:

λ dig stackoverflow.com +noall +answer
; <<>> DiG 9.10.2-p3 <<>> stackoverflow.com +noall +answer
;; global options: +cmd
stackoverflow.com.      201     IN      A       151.101.1.69
stackoverflow.com.      201     IN      A       151.101.65.69
stackoverflow.com.      201     IN      A       151.101.129.69
stackoverflow.com.      201     IN      A       151.101.193.69

λ dig cdn.sstatic.net +noall +answer
; <<>> DiG 9.10.2-p3 <<>> cdn.sstatic.net +noall +answer
;; global options: +cmd
cdn.sstatic.net.        724     IN      A       151.101.193.69
cdn.sstatic.net.        724     IN      A       151.101.1.69
cdn.sstatic.net.        724     IN      A       151.101.65.69
cdn.sstatic.net.        724     IN      A       151.101.129.69

Hey, those IPs are the same, and they share the same certificate! That means we can use HTTP/2 server push without hurting HTTP/1.1 users: HTTP/2 gets push, while HTTP/1.1 gets domain sharding (via sstatic.net). We haven’t deployed server push yet, but everything is in place for it.

HTTPS is a means to a performance end for us. To be blunt, our primary goal is performance, not site security. We want security, but security alone wasn’t enough to justify the effort of deploying HTTPS across the whole network. Taken together, though, all of these factors justify the considerable time and effort the move required. In 2013, HTTP/2 wasn’t a big factor yet; now the tide has turned, support has grown, and ultimately that was the catalyst for investing the time in HTTPS.

It’s worth noting that the HTTP/2 standard kept changing while our project was underway: it evolved from SPDY to HTTP/2, and from NPN to ALPN. We won’t go into much detail because we contributed little to that work; we watched and benefited while the Internet at large pushed it forward. If you’re interested, Cloudflare has a good write-up on its evolution.

HAProxy: supporting HTTPS

We first stood up HTTPS in HAProxy back in 2013. Why HAProxy? Historical reasons: we were already using it, and it added HTTPS support in the 1.5 development builds in 2013, which went stable in 2014. For a while we had Nginx in front of HAProxy (more details here). But simpler is better, and we always want to avoid extra complexity around connections, deployment, and other issues.

I won’t go into too much detail, because there’s nothing to talk about. HAProxy supports HTTPS using OpenSSL after 1.5, and the configuration files are clear and easy to understand. Our configuration is as follows:

  • Run 4 processes
    • One dedicated to HTTP front-end handling
    • Processes 2 through 4 handle HTTPS traffic
  • The HTTPS front ends connect to the HTTP back ends over abstract namespace sockets, which greatly reduces overhead
  • Each front end (or "tier") listens on :443 (we have primary, secondary, WebSockets, and dev tiers)
  • When a request comes in, we add some data to the request headers (and strip some you may have sent) and forward it to the web tier
  • We use the cipher suites recommended by Mozilla; note these are not the same suites as on our CDN

HAProxy was the relatively simple part, and it was the first step: supporting :443 with an SSL certificate. In hindsight, it was only a small piece of the whole.

Here’s what the architecture described above looks like; we’ll get to that cloud layer sitting in front in just a moment:

Logical Architecture

CDN/proxy layer: optimizing latency with Cloudflare and Fastly

I’ve always taken pride in how efficient Stack Overflow’s architecture is. Aren’t we something? One data center, a handful of servers, and a massive website. But this time that efficiency works against us. Efficiency is great, but latency is a problem. We don’t need lots of servers, and we don’t need to spread out for scale (though we do have a DR node). This time, that is the problem. Thanks to the speed of light, we can’t fix the fundamental latency issue. We hear someone is working on that, but the time machine they built seems to have a bug.

Let’s put some numbers on latency. The equator is about 40,000 km around (the worst case for light circling the earth). The speed of light in a vacuum is 299,792,458 m/s. Lots of people stop at that number, but fiber isn’t a vacuum: light in fiber travels roughly 30-31% slower. So our number is (40,075,000 m) / (299,792,458 m/s * 0.70) = 0.191 s, or 191 ms for a worst-case lap around the earth, right? Not quite. That assumes an ideal path, and there is almost never a straight line between two points on the Internet. Routers, switches, caches, processor queues, and all kinds of other delays sit in between. The cumulative latency is considerable.
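If you want to double-check that arithmetic, here is the same back-of-the-envelope calculation, using only the rough figures quoted above (a sketch, not a precise model of real paths):

using System;

class LatencyEnvelope
{
    static void Main()
    {
        const double equatorMeters = 40075000;              // rough circumference of the earth
        const double cVacuumMetersPerSecond = 299792458;    // speed of light in a vacuum
        const double fiberFactor = 0.70;                    // light in fiber is roughly 30% slower

        double lapSeconds = equatorMeters / (cVacuumMetersPerSecond * fiberFactor);
        Console.WriteLine($"Ideal-path lap around the earth: {lapSeconds * 1000:F0} ms");

        // Each extra round trip (such as an additional TLS negotiation) pays that
        // distance again, which is why terminating HTTPS close to the user matters.
    }
}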

What does this have to do with Stack Overflow? This is where cloud hosting shines. If you use a cloud provider, you’re talking to a relatively nearby server. We aren’t: the further you are from New York or Denver (our active/standby data centers), the higher your latency. And with HTTPS, negotiating the connection requires an extra round trip, even in the best case (TLS 1.3’s 0-RTT reduces this). Ilya Grigorik has a great summary of the costs.

Enter Cloudflare and Fastly. HTTPS wasn’t a project pursued in isolation; as you can see, several other efforts were in flight at the same time. In setting up an HTTPS termination point close to the user (to reduce round-trip time), our main requirements were:

  • TLS/HTTPS termination at the edge
  • DDoS protection
  • CDN function
  • Performance equal to or better than direct connection

Preparation for optimizing the proxy layer: client performance testing

Before we could officially turn on accelerated edge termination, we needed performance data. We built a full set of client-side timings in the browser covering the performance of the whole pipeline. Browsers expose performance timings to JavaScript via window.performance; open your browser’s inspector and give it a try. We want to be transparent about this, which is why the details have been up at teststackoverflow.com from day one. There’s no sensitive data there, just the URIs and resources loaded directly by the page and how long they took. Each recorded page looks something like this:

teststackoverflow.com

We currently record these timings for 5% of traffic. The process isn’t complicated, but every piece had to be built:

  1. Convert the timings to JSON
  2. Upload the timings after the page finishes loading
  3. Forward them to our backend servers
  4. Store the data in SQL Server using a clustered columnstore
  5. Aggregate the data with Bosun (specifically BosunReporter.NET)
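To make the shape of that pipeline a little more concrete, here is a minimal sketch of the kind of payload and filtering involved (the type and field names are invented for illustration; the real pipeline hands off to SQL Server and Bosun rather than an in-memory queue):

using System;
using System.Collections.Concurrent;

// Hypothetical shape of one page's timings, derived from window.performance on the client.
public class ClientTiming
{
    public string Path { get; set; }
    public int DnsMs { get; set; }
    public int ConnectMs { get; set; }
    public int TlsMs { get; set; }
    public int TtfbMs { get; set; }
    public int LoadMs { get; set; }
}

public static class ClientTimingPipeline
{
    // Stand-in for the hand-off to storage (columnstore) and aggregation (Bosun).
    private static readonly ConcurrentQueue<ClientTiming> Pending = new ConcurrentQueue<ClientTiming>();

    // Accepts one timing uploaded after page load (arriving as JSON in reality) and filters junk values.
    public static bool Ingest(ClientTiming t)
    {
        if (t == null || t.LoadMs <= 0 || t.LoadMs > 120000) return false; // drop nonsense values
        Pending.Enqueue(t);
        return true;
    }

    public static int PendingCount => Pending.Count;
}

class Demo
{
    static void Main()
    {
        ClientTimingPipeline.Ingest(new ClientTiming { Path = "/questions", DnsMs = 18, ConnectMs = 40, TlsMs = 55, TtfbMs = 120, LoadMs = 900 });
        Console.WriteLine($"Pending timings: {ClientTimingPipeline.PendingCount}");
    }
}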

The end result is a good real-time summary of real users from around the world. This data can be analyzed, monitored, alerted, and used to assess change. It would look something like this:

Client Timings Dashboard

Fortunately, we have a constant flow of data to use for our decisions, currently at the order of 5 billion and growing. An overview of the data is as follows:

Client Timings Database

OK, now that we’ve laid the groundwork, it’s time to test the CDN/ proxy layer vendor.

Cloudflare

We evaluated several CDN/DDoS-protection vendors. We chose Cloudflare because of their infrastructure, their fast responses, and the promise of Railgun. So how could we test what life behind Cloudflare would really be like? Did we need to deploy a service just to gather user data? The answer is no!

Stack Overflow operates at a huge scale: over a billion page views a month. Remember the client timings we just talked about? We have millions of visitors measuring things for us every day, so why not simply ask them? We can, by embedding an <iframe> in the page for a sample of users.

To test DNS performance properly, though, we had to give them a second-level domain rather than something.stackoverflow.com, so that glue records stay consistent and we don’t trigger extra lookups. To be clear: top-level domains (TLDs) are things like .com, .net, .org, .dance, .duck, .fail, .gripe, .here, .horse, .ing, .kim, .lol, .ninja, .pink, .red, .vodka, and .wtf (yes, those all exist; I’m not kidding). Second-level domains (SLDs) are things like stackoverflow.com, superuser.com, and so on, and it’s their behavior and performance we needed to measure. That’s how teststackoverflow.com came to be: with the new domain we could test DNS performance around the globe, and for a subset of users, embedding an <iframe> pointed at it gives us data on their DNS lookups.

Note that a test run takes at least 24 hours. The Internet behaves differently across time zones as people’s schedules (and Netflix habits) shift, so testing a single country really requires a full day of data, ideally a weekday (not half of a Saturday). We also knew there would be surprises; Internet performance is not stable, and we needed data to prove it.

Our initial assumption was that the extra hop would add latency and cost us some page-load performance, but the DNS gains actually made up for it. Cloudflare’s DNS servers sit much closer to users than a single data center ever could, and that alone was worth a good chunk of performance. I wish we had time to publish that data, but it needs a lot of processing (and hosting), and I don’t have enough time right now.

Next, we put teststackoverflow.com behind Cloudflare’s proxy to test accelerated termination, again via an <iframe>.

Once those tests were done, we did some additional work for DDoS protection: we connected extra ISPs at the CDN/proxy layer. After all, if we can route around an attack, we don’t have to absorb it at the proxy. Each facility now has four ISPs and two sets of routers running BGP between them, and we added two extra sets of load balancers dedicated to handling CDN/proxy traffic.

Cloudflare: Railgun

Alongside this, we stood up two Railgun instances. Railgun works by caching the last response for each URL (above a size threshold) in memcached, both on Cloudflare’s side and on ours. With Railgun enabled, the next time a URL is requested we still ask our web servers for the latest page, but instead of sending the whole thing back to Cloudflare, we send only the difference against the previous response. They apply that diff to their cached copy and send the result to the client. As part of this, gzip compression also moved from Stack Overflow’s nine web servers onto the single Railgun service, which is CPU-intensive; I point this out because it was a service that had to be evaluated, purchased, and deployed on our side.

As an example, imagine two users loading the same question. Their pages are nearly identical, differing only in small, per-user ways. If most of what we transfer is just a diff, that’s a huge performance win.

In short, Railgun improves performance by reducing the amount of data transferred, and when it works, it works well. There’s a side benefit, too: requests don’t start from a cold connection. TCP slow start can throttle traffic on long, complex paths, but Railgun maintains a fixed pool of established connections to Cloudflare’s edge and multiplexes user requests over them, so they don’t pay the slow-start cost again; the smaller diffs reduce that cost further.

Unfortunately, we kept running into problems with Railgun for a variety of reasons. As far as I know we had the largest Railgun deployment at the time, and it pushed Railgun to its limits. After a year of chasing issues, we finally gave up: the setup was costing us more effort than it saved. That was years ago, though. If you’re evaluating Railgun, look at the current version, which keeps improving, and make your own decision.

Fastly

We moved to Fastly fairly recently, but since we’re on the topic of the CDN/proxy layer, I’ll cover it here. Because so much of the engineering had already been done for Cloudflare, the migration itself isn’t all that interesting. The more interesting question is: why switch? After all, Cloudflare is appealing in many ways: plenty of data centers, stable bandwidth pricing, DNS included. The answer is that it was no longer the best option for us. Fastly offered things we wanted more: flexible control at the edge, rapid configuration pushes, and automated configuration deployment. It’s not that Cloudflare doesn’t work; it just no longer fit Stack Overflow.

Actions speak louder than words: if I didn’t think highly of Cloudflare, my personal blog wouldn’t be behind it. Hey, that’s the blog you’re reading right now.

The main feature that attracted us to Fastly is that it exposes Varnish and VCL, which makes the edge highly customizable. Features Cloudflare can’t offer quickly (because they would affect all customers) we can implement ourselves on Fastly. That’s an architectural difference between the two, and this “highly configurable, code-level” approach works well for us. We also like their openness about communication and infrastructure.

Let me show you an example of where VCL comes in handy. We recently hit a nasty bug in .NET 4.6.2 that caused max-age to be emitted with cache lifetimes of over 2,000 years. The quickest fix was to override the header at the edge where needed. As I write this, the VCL looks like this:

sub vcl_fetch {
  if (beresp.http.Cache-Control) {
      if (req.url.path ~ "^/users/flair/") {
          set beresp.http.Cache-Control = "public, max-age=180";
      } else {
          set beresp.http.Cache-Control = "private";
      }
  }
}

This gives the user-flair pages a 3-minute cache (taking a fair amount of load off) and leaves everything else untouched. It’s an easily deployable, global fix for emergencies, and we’re very happy to have the ability to act at the edge now. Our Jason Harvey owns the VCL configuration and wrote automated pushes for it, using fastlyctl, an open source Go project.

Another thing Fastly offers is the ability to use our own certificate; Cloudflare has that option too, but at a steep price. As I mentioned above, that’s what gives us the HTTP/2 push capability. One thing Fastly doesn’t do, however, is DNS, which Cloudflare does, so now we needed to solve DNS ourselves. Funny how these things go back and forth, isn’t it?

The global DNS

When we moved from Cloudflare to Fastly, we had to evaluate and deploy a new DNS provider. Mark Henderson has written an article about that. With all of this in play, we found ourselves managing:

  • Our own DNS server (standby)
  • Name.com servers (for redirects that don’t need HTTPS)
  • Cloudflare DNS
  • Route 53 DNS
  • Google DNS
  • Azure DNS
  • Others (for testing)

That’s practically a project in itself. For efficient management we built DNSControl, now an open source project hosted on GitHub and written in Go. In short, whenever we push a JavaScript configuration file to Git, the DNS configuration is deployed worldwide right away. Here’s a simple example, using askubuntu.com:

D('askubuntu.com', REG_NAMECOM,
    DnsProvider(R53, 2),
    DnsProvider(GOOGLECLOUD, 2),
    SPF,
    TXT('@', 'google-site-verification=PgJFv7ljJQmUa7wupnJgoim3Lx22fbQzyhES7-Q9cv8'), // webmasters
    A('@', ADDRESS24, FASTLY_ON),
    CNAME('www', '@'),
    CNAME('chat', 'chat.stackexchange.com.'),
    A('meta', ADDRESS24, FASTLY_ON),
END)

Great, now we can watch the results through our client timings! The tooling above tells us about the real deployment in real time, not simulated data. But we still need to test that everything actually works.

Testing

Client timings are useful for tracking performance, but not for validating configuration. They’re great for showing results, but misconfiguration often doesn’t show up in any interface at all, so we built httpUnit (which, we later learned, shares its name with another project). It’s also open source and written in Go. For teststackoverflow.com, the configuration looks like this:

[[plan]]
    label = "teststackoverflow_com"
    url = "http://teststackoverflow.com"
    ips = ["28i"]
    text = "<title>Test Stack Overflow Domain</title>"
    tags = ["so"]
[[plan]]
    label = "tls_teststackoverflow_com"
    url = "https://teststackoverflow.com"
    ips = ["28"]
    text = "<title>Test Stack Overflow Domain</title>"
    tags = ["so"]

We run these tests every time we change firewalls, certificates, bindings, or redirects. We have to make sure changes don’t break user access (by testing against a pre-production deployment first). httpUnit is what we use for that integration testing.

We also have an internal tool (written by the dear Tom Limoncelli) for managing the VIP addresses on our load balancers. We test on the standby load balancer, then move all the traffic over to it, leaving the previous primary in a known-good state. If anything goes wrong, we can roll back easily; if everything goes well, we apply the change to that load balancer as well. The tool is called keepctl (short for keepalived control) and will be open sourced as time permits.

Application Layer Preparation

And that’s just the infrastructure work, handled by a handful of Stack Overflow’s site reliability engineers. The application layer needed a great deal of work too. It’s a long list; let me grab some coffee and snacks before we start.

It’s important to know that the Q&A architecture of Stack Overflow and Stack Exchange is multi-tenant. That means if you visit stackoverflow.com, superuser.com, or bicycles.stackexchange.com, you’re actually hitting the exact same w3wp.exe process on the same servers. We switch the context of the request based on the Host header the browser sends. To make several concepts below easier to follow: Current.Site in our code refers to the site of the current request, and things like Current.Site.Url() and Current.Site.Paths.FaviconUrl build on that same concept.

In other words: all of our Q&A sites run inside the same process on the same servers, and users never notice. We run one process on each of nine servers purely for releases and redundancy.
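As a rough illustration of that host-based context switching, here is a minimal sketch (the types and the registry are invented for the example; this is not our actual code):

using System;
using System.Collections.Generic;

// A stand-in for the per-site context the request pipeline hangs off the Host header.
public class Site
{
    public string HostName { get; set; }
    public string Name { get; set; }
    public bool HttpsByDefault { get; set; }

    // A Current.Site.Url("/path") style helper: the scheme depends on the site's settings.
    public string Url(string path) =>
        (HttpsByDefault ? "https://" : "http://") + HostName + path;
}

public static class SiteRegistry
{
    private static readonly Dictionary<string, Site> ByHost =
        new Dictionary<string, Site>(StringComparer.OrdinalIgnoreCase)
        {
            ["stackoverflow.com"] = new Site { HostName = "stackoverflow.com", Name = "Stack Overflow", HttpsByDefault = true },
            ["superuser.com"]     = new Site { HostName = "superuser.com",     Name = "Super User",     HttpsByDefault = true },
        };

    // One process serves every site; the Host header decides which context we load.
    public static Site Resolve(string hostHeader) =>
        ByHost.TryGetValue(hostHeader, out var site) ? site : null;
}

class Demo
{
    static void Main()
    {
        var site = SiteRegistry.Resolve("superuser.com");
        Console.WriteLine(site.Url("/questions"));  // https://superuser.com/questions
    }
}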

The global login

Some of these projects look like they could stand alone (and they could), but they were also part of the larger HTTPS push. Login is one of them. I’m covering it first because it rolled out far earlier than everything else.

For the first five or six years of Stack Overflow (and Stack Exchange), you logged in to each site separately. For example, stackoverflow.com, stackexchange.com, and gaming.stackexchange.com each had their own cookies. Notably, meta.gaming.stackexchange.com borrowed its login cookie from gaming.stackexchange.com; those are the meta sites we discussed in the certificates section. Their logins were tied together, and you could only log in through the parent site. Technically there’s nothing special about that, but from a user-experience standpoint it was terrible: you had to log in site by site. We “fixed” that with “global authentication”, which worked by placing an <iframe> on the page to log you into the other sites in the background.

Then came universal login. Why “universal”? Because we’d already used “global”. We’re simple like that. Fortunately, cookies are simple too: a cookie set on a parent domain (e.g., stackexchange.com) is sent by your browser to all of its subdomains (e.g., gaming.stackexchange.com). Looking only at second-level domains, we don’t have many:

  • askubuntu.com
  • mathoverflow.net
  • serverfault.com
  • stackapps.com
  • stackexchange.com
  • stackoverflow.com
  • superuser.com

Yes, we have other domains that redirect to the ones above, like askdifferent.com, but they’re just redirects; they have no cookies and no logged-in users.

There’s a lot of backend detail I’m leaving out (thanks to Geoff Dalgas and Adam Lear), but the general idea is: when you log in, we set these cookies on all of those domains, using third-party cookies and nonces. When you log in to any one of the sites, six <img> tags are rendered on the page to set cookies on the other domains, effectively logging you in everywhere. It doesn’t work in every situation (mobile Safari in particular is a killer), but it’s a big improvement over what came before.

The client code is not complex and basically looks like this:

$.post('/users/login/universal/request', function (data, text, req) {
    $.each(data, function (arrayId, group) {
        var url = '//' + group.Host + '/users/login/universal.gif?authToken=' +
            encodeURIComponent(group.Token) + '&nonce=' + encodeURIComponent(group.Nonce);
        $(function () { $('#footer').append('<img style="display:none" src="' + url + '"></img>'); });
    });
}, 'json');

Getting there, though, meant moving authentication up to the account level (it was previously per user, per site), changing how cookies are read, changing how login works on those meta sites, and wiring the new scheme into other applications. For example, Careers (since split into Talent and Jobs) is a separate code base. We needed those applications to read the corresponding cookies and call the Q&A application through an API to get the account, so we ship a NuGet library to avoid duplicating the code. Bottom line: you log in once, and you’re logged in on every domain. No popups, no page reloads.

Technically, we no longer care what *.*.stackexchange.com is, as long as everything lives under stackexchange.com. That may not look HTTPS-related, but it’s what allowed us to move meta.gaming.stackexchange.com to gaming.meta.stackexchange.com without disrupting users.

Local HTTPS development

To do this properly, local environments should match dev and production as closely as possible. Fortunately we’re on IIS, which makes that straightforward. We use a tool to set up developer environments, called “Local Development Settings” (straightforward, eh?). It installs tools (Visual Studio, Git, SSMS, etc.), services (SQL Server, Redis, Elasticsearch), repositories, databases, websites, and a few other things. With the basic tooling in place, all we needed to add was SSL/TLS certificates. The core of it looks like this:

Websites = @(
    @{
        Directory = "StackOverflow";
        Site = "local.mse.com";
        Aliases = "discuss.local.area51.lse.com", "local.sstatic.net";
        Databases = "Sites.Database", "Local.StackExchange.Meta", "Local.Area51", "Local.Area51.Meta";
        Certificate = $true;
    },
    @{
        Directory = "StackExchange.Website";
        Site = "local.lse.com";
        Databases = "Sites.Database", "Local.StackExchange", "Local.StackExchange.Meta", "Local.Area51.Meta";
        Certificate = $true;
    }
)

I’ve put the code we use in a gist: Register-Websites.psm1. We set up sites via host headers (adding the aliases), give them a certificate if asked (hmm, that should just default to $true now), and grant the AppPool accounts access to the databases, so we use https:// locally too. Yes, I know we should open source this whole setup, but we’d need to strip out some business-specific pieces first. Someday.

Why does this matter? Before this, we loaded static content from /content on the same domain, not from a separate domain. That was convenient, but it also hid problems like cross-origin requests (CORS). Something that loads fine on the same domain over the same protocol can easily break in dev or production. “But it works on my machine.”

With the CDN and app domains set up locally with the same protocols and layout as production, we find and fix many more of these problems before they leave a developer’s machine. For example, did you know that browsers don’t send the referer header when you navigate from an https:// page to an http:// one? That’s a security measure: the referer may carry sensitive information, and it would be transmitted in plain text.

“Nick, don’t be ridiculous, we still get referers from Google!” Indeed, but that’s because they choose to send them. If you look at Google’s search page, you’ll find this <meta> referrer directive:

<meta content="origin" id="mref" name="referrer">

That’s why you can get referer.

Ok, so we’ve set it up, so what do we do now?

Mixed content: From you

Mixed content is a catch-all that covers a lot of things. What mixed content had we accumulated over the years? Unfortunately, plenty. Here’s the list of user-submitted content we had to deal with:

  • http:// images in questions, answers, tag wikis, etc.
  • http:// avatars
  • http:// avatars in chat (site sidebars)
  • http:// images in the “about me” section of profiles
  • http:// images in Help Center articles
  • http:// YouTube videos (enabled on some sites, e.g., gaming.stackexchange.com)
  • http:// images in privilege descriptions
  • http:// images in Developer Stories
  • http:// images in job descriptions
  • http:// images in company pages
  • http:// sources in JavaScript snippets

Each of these has its own problems; I’ll only cover the ones worth mentioning. Keep in mind that every solution here has to scale across hundreds of sites and databases in our architecture.

In all of the cases above (except snippets), the first step to eliminating mixed content is the same: stop creating new mixed content, otherwise the cleanup never ends. So the first thing we did was enforce https:// for all newly embedded images across the network. Once that was in place, we could start cleaning up what was already there.
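As an illustration of that first step, here is a minimal sketch of the kind of validation that can reject new plain-http embeds at save time (the regex and helper are invented for the example; this is not our actual editor or validation code):

using System;
using System.Text.RegularExpressions;

public static class ImageEmbedPolicy
{
    // Matches Markdown image syntax with a plain-http source: ![alt](http://example.com/img.png)
    private static readonly Regex MarkdownImage =
        new Regex(@"!\[[^\]]*\]\((?<url>http://[^)\s]+)\)", RegexOptions.IgnoreCase);

    // Returns an error message for new posts that try to embed plain-http images, or null if clean.
    public static string Validate(string markdown)
    {
        var match = MarkdownImage.Match(markdown);
        return match.Success
            ? $"Image {match.Groups["url"].Value} must be embedded over https://"
            : null;
    }
}

class Demo
{
    static void Main()
    {
        Console.WriteLine(ImageEmbedPolicy.Validate("![logo](http://example.com/logo.png)"));
        Console.WriteLine(ImageEmbedPolicy.Validate("![logo](https://example.com/logo.png)") ?? "OK");
    }
}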

For questions, answers, and other post types, we had to dig into things case by case. First, the 90%-plus case: stack.imgur.com. Stack Overflow has had its own hosted Imgur instance since before my time; images you upload through the editor go there. The vast majority of embedded images are hosted this way, and Imgur added HTTPS support for us years ago, so that part was a straightforward search-and-replace (or, as we call it, a post markdown re-processing).

Then we found everything that remained by indexing all of the content in Elasticsearch. And when I say “we”, I mean Samo; he did a huge amount of the mixed-content work here. Once we saw that a large share of the remaining domains already supported HTTPS, we decided to:

  1. Try switching each <img> source to https://; if it works, replace the link in the post
  2. If the source doesn’t support https://, convert the image into a plain link

Of course, it didn’t go quite that smoothly. We discovered that the regular expression we use to match URLs had been broken for years and nobody had noticed… so we fixed the regex and re-indexed everything.
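The cleanup pass itself boils down to something like this sketch (simplified and hypothetical; the real job ran over the Elasticsearch index in batches and handled many more edge cases):

using System;
using System.Net.Http;
using System.Threading.Tasks;

public static class MixedContentFixer
{
    private static readonly HttpClient Http = new HttpClient { Timeout = TimeSpan.FromSeconds(10) };

    // Decide what to do with one http:// image URL found in a post.
    public static async Task<string> RewriteImageAsync(string httpUrl)
    {
        var httpsUrl = "https://" + httpUrl.Substring("http://".Length);
        try
        {
            using (var request = new HttpRequestMessage(HttpMethod.Head, httpsUrl))
            using (var response = await Http.SendAsync(request))
            {
                if (response.IsSuccessStatusCode)
                    return $"![image]({httpsUrl})";   // source supports HTTPS: swap the scheme
            }
        }
        catch (Exception) { /* treat any failure as "no HTTPS available" */ }

        return $"[image]({httpUrl})";                 // otherwise demote the embed to a plain link
    }
}

class Demo
{
    static void Main()
    {
        var rewritten = MixedContentFixer.RewriteImageAsync("http://i.stack.imgur.com/example.png")
                                         .GetAwaiter().GetResult();
        Console.WriteLine(rewritten);
    }
}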

We were asked, “Why not just proxy the images?” Well, legally and ethically, proxying other people’s content is a gray area for us. For example, photographers on photo.stackexchange.com explicitly avoid Imgur so that they retain their rights, and we fully respect that. If we started proxying and caching full images, things get legally murky. We also found that out of the millions of embedded images, only a few thousand either lacked https:// support or were already dead (404); that share (well under 1%) didn’t justify building a proxy.

We did look into what running a proxy would involve. How much would it cost? How much storage would we need? Do we have enough bandwidth? We made rough estimates, though some answers stayed fuzzy. For example: should we run it through Fastly, or pay for transit directly? Which is faster? Which is cheaper? Which scales? That’s enough material for another blog post; if you have specific questions, ask in the comments and I’ll try to answer.

Fortunately, along the way, Balpha reworked YouTube embedding around the HTML5 player, fixing several issues at once, and we enforced https:// embeds for YouTube as part of it.

The remaining content areas follow the same pattern: block new mixed content from coming in, then replace the old. That meant changes in the following areas:

  • Posts
  • Profiles
  • Developer Stories
  • The Help Center
  • Jobs/Careers
  • Company pages

Disclaimer: the JavaScript snippets issue remains unsolved. It’s tricky because:

  1. The resource may not be available over https:// (a library, for example)
  2. Since it’s JavaScript, you can construct any URL you want; there’s no practical way for us to check this
    • If you have a good way to handle this, please tell us; usability and security collide head-on here

Mixed content: From us

Dealing with user submissions doesn’t finish the job; we had plenty of our own http:// to clean up. The changes themselves aren’t especially interesting, but they help answer the question “what took so long?”:

  • Ad serving (Calculon)
  • Ad serving (Adzerk)
  • Tag sponsorships
  • JavaScript assumptions
  • Area 51 (a very old code base)
  • Analytics trackers (Quantcast, GA)
  • Per-site community JavaScript (community plugins)
  • /jobs (which is actually a proxy)
  • User flair
  • …and everywhere else http:// appears in the code

JavaScript and links can be a pain, so I’ll mention them here.

JavaScript is a corner that tends to be forgotten, but it can’t be ignored. We had a lot of places where we passed the host name to JavaScript and assumed http://, and a lot of places that assumed a meta. prefix for the meta sites. Lots. And lots. Send help. Thankfully that’s no longer the case; the server now renders the site and places the right options at the top of the page:

StackExchange.init({
  "locale": "en",
  "stackAuthUrl": "https://stackauth.com",
  "site": {
    "name": "Stack Overflow",
    "childUrl": "https://meta.stackoverflow.com",
    "protocol": "http"
  },
  "user": {
    "gravatar": "<div class=\"gravatar-wrapper-32\"><img src=\"https://i.stack.imgur.com/nGCYr.jpg\"></div>",
    "profileUrl": "https://stackoverflow.com/users/13249/nick-craver"
  }
});

We’ve also accumulated a lot of statically rendered links in the code over the years: at the bottom of pages, in the footer, in help sections, everywhere. For each of them the fix isn’t complicated: change them to .Url("/path") calls. The fun part is finding them all, because you can’t just search for “http://”. Thank you so much, W3C, for gems like:

<svg xmlns="http://www.w3.org/2000/svg" ...

Yes, those are just identifiers and can’t be changed, which is why I want Visual Studio to add an “exclude file types” option to its Find in Files box. Visual Studio, are you listening? VS Code added this a while ago. I’m not asking for much.

Finding a thousand-plus links in the code and changing them (including comments, license links, and so on) is tedious, but that’s life; it had to be done. Converting them to .Url() calls means the links become dynamic and switch over as soon as a site supports HTTPS. For example, we had to wait for the meta.*.stackexchange.com move to finish before switching those sites. (Our data center password is “jianbing guozi” in pinyin; nobody reads this far, so it’s perfectly safe to keep it here.) Once a site migrates, .Url() keeps working, and rendering the site with .Url() as https-by-default keeps working too. Static links become dynamic.

Another important point: this makes dev and local environments work properly too, instead of always linking to production. Boring, but worth it. Oh, and since our canonical URLs also go through .Url(), Google sees the switch as soon as users start getting HTTPS.

Once a site moves to HTTPS, we have the crawlers update the links to it across the network. We call this fixing the “Google juice”, and it also stops users from constantly hitting 301s.

Redirects (301)

When you move a site to HTTPS, there are two important things to do for Google’s sake:

  • Update the canonical URL, for example <link rel="canonical" href="https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454" />
  • 301-redirect http:// links to https://

It’s not complicated and it isn’t a huge amount of work, but it is very, very important. Most of Stack Overflow’s traffic comes from Google search results, so we have to make sure this doesn’t have a negative impact. That traffic is our livelihood; if we lost it, I’d be out of a job. Remember those .internal API calls? Right, so we can’t just redirect everything, either. There’s some logic around what gets redirected (for example, we also can’t redirect POST requests during the transition, because browsers don’t handle that well), but it’s fairly straightforward. Here’s the actual code:

public static void PerformHttpsRedirects()
{
    var https = Settings.HTTPS;
    // If we're on HTTPS, never redirect back
    if (Request.IsSecureConnection) return;

    // Not HTTPS-by-default? Abort.
    if (!https.IsDefault) return;
    // Not supposed to redirect anyone yet? Abort.
    if (https.RedirectFor == SiteSettings.RedirectAudience.NoOne) return;
    // Don't redirect .internal or any other direct connection
    // ...as this would break direct HOSTS to webserver as well
    if (RequestIPIsInternal()) return;

    // Only redirect GET/HEAD during the transition - we'll 301 and HSTS everything in Fastly later
    if (string.Equals(Request.HttpMethod, "GET", StringComparison.InvariantCultureIgnoreCase)
        || string.Equals(Request.HttpMethod, "HEAD", StringComparison.InvariantCultureIgnoreCase))
    {
        // Only redirect if we're redirecting everyone, or a crawler (if we're a crawler)
        if (https.RedirectFor == SiteSettings.RedirectAudience.Everyone
            || (https.RedirectFor == SiteSettings.RedirectAudience.Crawlers && Current.IsSearchEngine))
        {
            var resp = Context.InnerHttpContext.Response;
            // 301 when we're really sure (302 is the default)
            if (https.RedirectVia301)
            {
                resp.RedirectPermanent(Site.Url(Request.Url.PathAndQuery), false);
            }
            else
            {
                resp.Redirect(Site.Url(Request.Url.PathAndQuery), false);
            }
            Context.InnerHttpContext.ApplicationInstance.CompleteRequest();
        }
    }
}

Note that we don’t 301 by default (there’s a .RedirectVia301 setting), because we wanted careful testing before doing anything with permanent consequences. We’ll talk about HSTS and the other irreversible steps later.

Websockets

This one goes quickly. WebSockets weren’t hard; in some ways they were the easiest thing we did. We use WebSockets for real-time updates to user reputation, inbox notifications, newly asked questions, newly added answers, and more. That means for essentially every page open on Stack Overflow, there’s a corresponding WebSocket connection to our load balancer.

So how did we change it? It’s actually very simple: install a certificate, listen on :443, and use wss://qa.sockets.stackexchange.com instead of ws://. The hard part was already done (we used a dedicated certificate here, but that’s not important); going from ws:// to wss:// was just a matter of configuration. At first we kept ws:// as a fallback for wss://, but it has since become wss:// only, for two reasons:

  1. Without it, there would be mixed-content warnings on https:// pages
  2. It supports more users. Many old proxies don’t handle WebSockets well, but with encrypted traffic most of them pass the connection straight through instead of mangling it. That’s especially true for mobile users.

The big question was: can we handle the load? We handle a lot of concurrent WebSockets across the network; as I write this, we have over 600,000 concurrent connections open. Here’s our HAProxy dashboard in Opserver:

HAProxy Websockets

That’s a lot of connections on the terminators, on the abstract namespace sockets, and on the front ends. With TLS session resumption enabled, HAProxy itself also carries a significant load. To make reconnects faster, a client receives a token after the first negotiation and sends it back the next time; if we have enough memory and it hasn’t timed out, we resume the previous session instead of negotiating a new one. That saves CPU and improves performance for users, at a memory cost that varies with key size (2048-bit, 4096-bit, or more?). We currently use 4096-bit keys. With around 600,000 WebSockets open, we use only 19GB of the load balancer’s 64GB of RAM, of which about 12GB is HAProxy, most of that the TLS session cache. So it works out fine, and if we did have to buy memory, it would still be one of the cheapest parts of the whole HTTPS move.

HAProxy Websocket Memory
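As a rough back-of-the-envelope from the numbers above (this just divides the reported totals; it is not a real measurement of per-session cost):

using System;

class SessionCacheEnvelope
{
    static void Main()
    {
        const double haproxyGb = 12;          // memory attributed to HAProxy, mostly TLS session cache
        const double connections = 600000;    // concurrent WebSockets at the time of writing

        double bytesPerConnection = haproxyGb * 1024 * 1024 * 1024 / connections;
        Console.WriteLine($"Roughly {bytesPerConnection / 1024:F0} KB per open connection");
        // About 21 KB per connection: cheap insurance for faster TLS reconnects.
    }
}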

The unknown

I suppose this is a good time to talk about the unknowns. Some things we couldn’t really know until we tried:

  • How would traffic look in Google Analytics? (Would we lose referers?)
  • Would the Google Webmasters transition go smoothly? (Would the 301s take effect? What about the canonicals? How long would it take?)
  • How would Google search analytics behave (would https:// start showing up in search analytics)?
  • Would our search rankings drop? (The scariest one.)

A lot of people have written about their moves to https://, but our situation is a little different: we aren’t one site, we’re many sites across multiple domains. We had no idea how Google would treat our network. Would it know that stackoverflow.com and superuser.com are related? No idea, and we couldn’t count on Google to tell us.

So we tested. Ahead of the network-wide rollout, we switched a few domains first:

  • meta.stackexchange.com
  • security.stackexchange.com
  • superuser.com

Yes, that’s roughly the extent of the conversation Samo and I had about it; it took about three minutes. Meta, because it’s our most important feedback site. Security, because it has experts who would notice problems, especially anything related to HTTPS. And Super User, because we needed to see the search impact on content: its traffic is much larger than meta’s or Security’s, and, most importantly, more of it comes organically from Google.

We kept watching and evaluating the search impact, which is why the rest of the network trailed well behind Super User. So far the verdict is: basically no impact. Week-to-week changes in searches, results, clicks, and rankings all stayed within normal variation. Our company depends on that traffic, so this really mattered; fortunately, there was nothing to worry about, and we could keep rolling out.

Errors

This article wouldn’t be complete without the parts we screwed up. Mistakes were made along the way. Here are some of the regrets we collected:

Error: Relative protocol URLs

If you have a URL for a resource, you typically see something like http://example.com or https://example.com, including in the paths to our images and other assets. Another option is //example.com, known as a protocol-relative URL. We used these for years for images, JavaScript, CSS, and so on (our own resources, not user-submitted ones). Years later, we realized it wasn’t a good idea, at least not for us: a protocol-relative link is relative to the page it appears on. When you’re on http://stackoverflow.com, //example.com means http://example.com; on https://stackoverflow.com, it means https://example.com. So what’s the problem?
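Here is a quick, self-contained illustration of that resolution behavior (just a demonstration using .NET’s Uri class, not code from our codebase):

using System;

class ProtocolRelativeDemo
{
    static void Main()
    {
        var asset = "//cdn.sstatic.net/img/logo.png";   // protocol-relative reference

        // The scheme is inherited from whatever page the reference appears on:
        Console.WriteLine(new Uri(new Uri("http://stackoverflow.com/"), asset));
        // -> http://cdn.sstatic.net/img/logo.png
        Console.WriteLine(new Uri(new Uri("https://stackoverflow.com/"), asset));
        // -> https://cdn.sstatic.net/img/logo.png

        // In an email there is no page and no scheme, which is why most mail
        // clients cannot resolve these references at all.
    }
}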

The problem is that image URLs aren’t only used in pages; they also show up in email, in the API, and in mobile apps. This bit us when we normalized our path structure and started using the same image paths everywhere: although the change greatly reduced code duplication and simplified a lot of things, the result was protocol-relative URLs in email. Most email clients can’t handle protocol-relative image URLs, because they don’t know which protocol to use; email is neither http:// nor https://. You only got the intended result if you happened to view the email in a web browser.

So what did we do? We switched everything to https://. I consolidated all of our path code around two variables: the CDN root, and the folder for the specific site. Stack Overflow’s stylesheet, for example, lives at https://cdn.sstatic.net/Sites/stackoverflow/all.css (with a cache breaker, of course), and locally it’s https://local.sstatic.net/Sites/stackoverflow/all.css. You can see the pattern. With paths built by concatenation, the logic is much simpler. Forcing https:// also let users benefit from HTTP/2 before the whole site switched, since all the static resources were already ready. Using https:// everywhere also means one property works for pages, email, mobile, and the API, and the unification gives us a single place where every path is built, cache breakers included.

Note: if you break caches the way we do, for example https://cdn.sstatic.net/Sites/stackoverflow/all.css?v=070eac3e8cf4, please don’t use a build number. Our cache breakers are a checksum of the file, which means users only download a new copy when the file actually changes. Using a build number may be slightly simpler, but it actively hurts your costs and your performance.
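Here is a sketch of what that kind of path building plus checksum-based cache breaker can look like (the property names and layout are invented for illustration; they are not our actual helpers):

using System;
using System.IO;
using System.Security.Cryptography;

public static class StaticPaths
{
    // The two pieces every static URL is built from.
    public static string CdnRoot { get; set; } = "https://cdn.sstatic.net";
    public static string SiteFolder { get; set; } = "/Sites/stackoverflow";

    // Cache breaker derived from the file contents, not a build number:
    // the query string only changes when the file itself changes.
    public static string VersionHash(string localFilePath)
    {
        using (var md5 = MD5.Create())
        using (var stream = File.OpenRead(localFilePath))
        {
            var hash = md5.ComputeHash(stream);
            return BitConverter.ToString(hash, 0, 6).Replace("-", "").ToLowerInvariant();
        }
    }

    public static string CdnUrl(string relativePath, string localFilePath) =>
        $"{CdnRoot}{SiteFolder}{relativePath}?v={VersionHash(localFilePath)}";
}

class Demo
{
    static void Main()
    {
        var tmp = Path.GetTempFileName();
        File.WriteAllText(tmp, "body { color: red; }");
        // e.g. https://cdn.sstatic.net/Sites/stackoverflow/all.css?v=<12 hex chars>
        Console.WriteLine(StaticPaths.CdnUrl("/all.css", tmp));
    }
}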

It would have been nice to do all this from the start, so why didn’t we? Because back then, HTTPS performance wasn’t there: users got slower responses over https:// than over http://. For a sense of scale: we served 4 billion requests to sstatic.net last month, totaling 94TB. Poor HTTPS performance across that volume would have added up to a lot of latency. Now that HTTP/2 is in place and the CDN/proxy layer is set up, the performance concern has largely gone away. Faster for users, simpler for us: why not?

Error: APIs and .internal

What did we find when we put the proxy in place and started testing? We had forgotten something very important; or rather, I had forgotten something very important: we use HTTP heavily for our internal APIs. Those calls still worked, of course, but they became slower, more complicated, and more fragile.

For example, say an internal API call needs to reach stackoverflow.com/some-internal-route. Previously, the hops were:

  • The originating app
  • Gateway/firewall (out on the public network)
  • Local load balancer
  • Target web server

That worked because stackoverflow.com resolved to us: the IP you got was our load balancer. Behind a proxy, though, users reach the nearest edge node, so they resolve to different IPs and different destinations: the IP their DNS returns belongs to the CDN/Fastly layer. Oops. That means our path now looked like this:

  • The originating app
  • Gateway/firewall (out on the public network)
  • Our external routers
  • ISP (multiple hops)
  • Proxy (Cloudflare/Fastly)
  • ISP (the route from the proxy back to us)
  • Our external routers
  • Local load balancer
  • Target web server

Well, that looks worse. To make a call from A to B, we've picked up a lot of unnecessary hops, and performance suffers. I'm not saying our proxies are slow, but compared to the less-than-a-millisecond hop inside our data center… okay, the proxies are slow.

We had a lot of internal discussion about the simplest way to solve this. We could change requests to something like internal.stackoverflow.com, but that would require considerable changes (and might create conflicts). We could create DNS specifically for resolving internal addresses (but that raises wildcard inheritance issues). We could also resolve stackoverflow.com to different addresses internally (known as split-horizon DNS), but that's harder to debug and harder to reason about in a multi-data-center scenario (which resolution wins?).

Eventually, we settled on appending a .internal suffix to every domain we expose in external DNS. Inside our network, for example, stackoverflow.com.internal resolves to an internal subnet behind our load balancer (in the DMZ). We did this for several reasons:

  • We can override and host a top-level .internal zone on our internal DNS servers (Active Directory)
  • We can strip the .internal from the Host header as the request passes from HAProxy to the web application (the application layer isn't aware of it)
  • If we ever need internal-to-DMZ SSL, we can do it with a very similar wildcard combination
  • The client API code stays simple (just append .internal if the domain is in the list)

That client API code lives in a StackExchange.Network NuGet library, written mostly by Marc Gravell. For every URL we're about to access, we call a static method (so there are only a few places where general-purpose fetching happens). It returns the "internalized" URL if one exists, and otherwise leaves the URL unchanged. This means a simple NuGet update can roll the logic change out to all applications. The call is pretty simple:

uri = SubstituteInternalUrl(uri);
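For illustration only (this isn't the real library code; the class and the domain list are made up), the substitution might look roughly like this:

using System;
using System.Collections.Generic;

// Rough sketch of the idea, not the actual StackExchange.Network implementation:
// if the host is one of our exposed domains, append .internal so the call
// resolves inside the network instead of looping out through the proxy.
static class InternalUrls
{
    private static readonly HashSet<string> InternalizableHosts =
        new(StringComparer.OrdinalIgnoreCase)
        {
            "stackoverflow.com", "stackexchange.com" // ...and the rest of the list
        };

    public static Uri SubstituteInternalUrl(Uri uri)
    {
        if (!InternalizableHosts.Contains(uri.Host)) return uri;
        return new UriBuilder(uri) { Host = uri.Host + ".internal" }.Uri;
    }
}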

Here's an example of how DNS behaves for stackoverflow.com (a small resolution sketch follows the list):

  • Fastly: 151.101.193.69, 151.101.129.69, 151.101.65.69, 151.101.1.69
  • Direct connection (external router): 198.252.206.16
  • Internal: 10.7.3.16
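From a machine inside the network, you could see that split directly; a quick sketch (the addresses in the comments mirror the examples above and depend entirely on where you resolve from):

using System;
using System.Linq;
using System.Net;

// Resolve the public name and the .internal name; inside the network,
// the .internal answer comes from our own (Active Directory) DNS servers.
foreach (var name in new[] { "stackoverflow.com", "stackoverflow.com.internal" })
{
    var addresses = Dns.GetHostAddresses(name).Select(a => a.ToString());
    Console.WriteLine($"{name}: {string.Join(", ", addresses)}");
}
// stackoverflow.com:          151.101.1.69 (and friends, via Fastly)
// stackoverflow.com.internal: 10.7.3.16 (internal subnet)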

Remember DNSControl from earlier? We use it to keep all of this in sync quickly. Thanks to the JavaScript configuration/definitions, we can easily share and simplify the code. We match the last octet of the IPs across all subnets and all data centers, so with just a few variables, all the AD and external DNS entries line up. It also means our HAProxy configuration is simpler; in essence:

stacklb::external::frontend_normal { 't1_http-in':
  section_name    => 'http-in',
  maxconn         => $t1_http_in_maxconn,
  inputs          => {
    "${external_ip_base}.16:80"= > ['name stackexchange']."${external_ip_base}.17:80"= > ['name careers']."${external_ip_base}.18:80"= > ['name openid']."${external_ip_base}.24:80"= > ['name misc'].Copy the code

In summary, API calls now take a faster, more reliable path:

  • Originating app
  • Local load balancer (DMZ)
  • Destination web server

We’ve solved a few problems, and we’ve got hundreds left.

Error: 301 cache

One thing we didn’t realize when jumping from http://301 to https:// was that Fastly caches our return value. In Fastly, the default cache key does not take into account protocols. I personally disagree with this behavior, as the default 301 redirect enabled on the source site leads to an infinite loop. The problem is caused by:

  1. A user visits a page over http://
  2. They get a 301 redirect to https://
  3. Fastly caches that redirect
  4. Any user (including the one from #1) visits the same page over https://
  5. Fastly serves the cached 301 to https://, even though they are already on the https:// page

That’s why we have infinite loops. To fix this, we need to turn off the 301, clear the Fastly cache, and start investigating. Fastly suggests adding Fastly-SSL to vary, like this:

sub vcl_fetch {
  set beresp.http.Vary = if(beresp.http.Vary, beresp.http.Vary ",", "") "Fastly-SSL";
}

In my opinion, this should be the default behavior.

Mistake: Help center episode

Remember those help posts I mentioned we had to fix? Help documents are mostly per-language, with only a few per-site, so they can be shared. To avoid a lot of duplicated code and storage structure, we handle them a little differently. We store the actual post object (just like a question or answer) on meta.stackexchange.com, or on whichever site the post is associated with. We store the resulting HelpPost, which is essentially the baked HTML, in our central Sites database. When we dealt with mixed content, we also fixed the posts on the individual sites. Simple!

Once the original posts were fixed, we just needed to regenerate the HTML for each site and backfill it. But this is where I made a mistake: the backfill grabbed the post from the current site (the one invoking the backfill), not the origin site. That caused, for example, the help post based on post #12345 on meta.stackexchange.com to be replaced by whatever post #12345 happened to be on stackoverflow.com. Sometimes that was an answer, sometimes a question, sometimes a tag wiki. The result was some very interesting help documentation. There were some consequences.

I can at least say the fix was simple:

Me being a dumbass

Another backfill put the data right. But still, it was a pretty public screw-up. Sorry about that.

Open source

Here are some of the projects we produced along the way that helped our HTTPS deployment and will hopefully save the world some time one day:

  • BlackBox (secure storage of private information in version control) by Tom Limoncelli
  • Capnproto-net (no longer supported; a .NET version of Cap'n Proto) by Marc Gravell
  • DNSControl (Controlling Multiple DNS Providers) by Craig Peterson and Tom Limoncelli
  • HttpUnit (Web Integration Testing) by Matt Jibson and Tom Limoncelli
  • Opserver (support Cloudflare DNS) by Nick Craver
  • Fastlyctl (Fastly API calls in Go) by Jason Harvey
  • Fastly-Ratelimit (a syslog-based traffic limiting solution for Fastly) by Jason Harvey

The next step

Our work is not done. Here’s what to do:

  • We’re going to fix the mixed content in our chat domain, like chat.stackoverflow.com, which has user-embedded images and so on
  • If possible, we add all applicable domain names to the Chrome HSTS preload list
  • We need to evaluate the HPKP and whether we want to deploy it (it’s dangerous, we’re leaning against it right now)
  • We need to move the chat tohttps://
  • We need to migrate all cookies to safe mode
  • We are waiting for HAProxy 1.8 with HTTP/2 support (due in September)
  • We need to take advantage of HTTP/2 push (I’ll talk to Fastly about this in June — they don’t support cross domain push yet)
  • We need to move 301 behavior out of CDN/ proxy for better performance (need to publish by site)

HSTS preload

HSTS stands for "HTTP Strict Transport Security". OWASP has a good summary of it here. The concept is actually quite simple:

  • When you visit an https:// page, we send you a header like this: Strict-Transport-Security: max-age=31536000
  • For that duration (in seconds), your browser will only talk to that domain over https://

Even if you click an http:// link, your browser goes straight to https://. Even if you have an http:// redirect set up, your browser never touches it; it talks SSL/TLS directly. This also prevents users from being hijacked over an insecure http:// request, for example to https://stack(something that looks like an "o" in Unicode but isn't)verflow.com, which might even present a valid SSL/TLS certificate. Not visiting that site at all is the only safe option.

But that only works if you've visited the site at least once to receive the header, right? Right. That's where HSTS preloading comes in: it's a list of domains shipped with all major browsers and preloaded by them. So the very first visit goes to https://, and there's never any http:// communication at all.

Nice! So how do you get on this list? Here are the requirements (a sketch of a qualifying header follows the list):

  1. Have a valid certificate
  2. If you listen on port 80, redirect HTTP to HTTPS on the same host
  3. All subdomains must support HTTPS
    • In particular, the www subdomain, if a DNS record for it exists
  4. The HSTS header on the base domain must meet these conditions:
    • max-age must be at least 18 weeks (10,886,400 seconds)
    • The includeSubDomains directive must be present
    • The preload directive must be specified
    • If you serve an additional redirect from your HTTPS site, that redirect must itself carry the HSTS header (not just the page it redirects to)
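For reference, here's a minimal ASP.NET Core-style sketch (illustrative only, not our actual configuration) of sending a header that satisfies those requirements on every https:// response:

// Minimal sketch, not our actual configuration: attach a preload-qualifying
// HSTS header to every response served over https://.
var builder = WebApplication.CreateBuilder(args);
var app = builder.Build();

app.Use(async (context, next) =>
{
    if (context.Request.IsHttps)
    {
        // Two years, covering subdomains, opted in to the preload list.
        context.Response.Headers["Strict-Transport-Security"] =
            "max-age=63072000; includeSubDomains; preload";
    }
    await next();
});

app.MapGet("/", () => "Hello over HTTPS");
app.Run();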

Sounds doable, right? All of our active domains are on HTTPS with valid certificates. Except, no, we have one more problem. Remember meta.gaming.stackexchange.com? It redirects to gaming.meta.stackexchange.com, but the redirect itself doesn't have a valid certificate.

Taking those metas as an example: if we include includeSubDomains in the HSTS header, every link out on the web pointing to the old domains breaks. Instead of landing on an http:// site (as they do today), they'd hit an invalid certificate error. Looking at yesterday's traffic logs, we still get 80,000 hits per day to those meta subdomain 301s. A lot of that is crawlers, but plenty is human traffic from blogs and bookmarks… and some crawlers are really dumb and never update their records based on a 301. Anyway, are you still awake? I fell asleep three times writing this.

So what do we do? Do we stand up SAN certificates covering a few hundred domains and adjust our infrastructure so the 301 redirects also happen over valid HTTPS? Doing this through Fastly would raise our costs (more IPs, certificates, and so on). Let's Encrypt does help a bit here: the certificate itself is cheap, if you ignore the labor of setting it up and maintaining it (which matters, since we can't use it for the reasons covered earlier).

Another problem is a legacy one: our internal domain is ds.stackexchange.com. Why ds.? I'm not sure; my guess is we didn't know how to spell "data center". That means includeSubDomains would automatically cover every internal endpoint. Most of what we run is already on https://, but forcing everything internal to HTTPS would cause its own problems and delays. It's not that we don't want https:// internally too, it's that it's a whole project of its own (mostly around certificate distribution and maintenance, across multiple tiers of certificates), and we don't want to couple the two efforts. Why not change the internal domain? Mostly time: the move would take a lot of time and coordination.

For now, we set the HSTS max-age to two years and leave out includeSubDomains, and I won't change that in the code unless I have to, because it's too risky. Once we have HSTS durations in place for all Q&A sites, we'll talk to Google about being added to the HSTS preload list without includeSubDomains, or at least try. You can see that, while rare, it does happen on the current list. Hopefully they'll agree, for the sake of making Stack Overflow more secure.

chat

To enable secure cookies (sent only over HTTPS) as soon as possible, we need to move chat (chat.stackoverflow.com, chat.stackexchange.com, and chat.meta.stackexchange.com) to https://. Like our universal login, chat relies on cookies on the second-level domains. If cookies are only sent over https://, you can only be logged in over https://.
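As a tiny illustration (ASP.NET Core style, not our actual code; the cookie name and domain are just examples), a secure-only cookie is just a flag on the cookie options:

using Microsoft.AspNetCore.Http;

// Illustrative sketch only: issue a login cookie that browsers will only send
// back over https://, shared across the second-level domain so chat and Q&A
// see the same session.
static class SecureCookies
{
    public static void IssueAccountCookie(HttpResponse response, string sessionToken)
    {
        response.Cookies.Append("acct", sessionToken, new CookieOptions
        {
            Secure = true,                  // never sent over plain http://
            HttpOnly = true,                // not readable by client-side script
            Domain = ".stackexchange.com"   // example second-level domain
        });
    }
}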

This is somewhat debatable, but moving chat to https:// even with mixed content still present is a net win: the network becomes more secure, and we can deal with the mixed content in live chat as we go. Hopefully that happens in the next week or two; it's on my schedule.

today

Anyway, that’s where we are today, and that’s what we’ve been doing for the last four years. There are a lot of higher-priority things standing in the way of HTTPS — and it’s far from the only thing we’re doing. But that’s life. The people who did this are working in a lot of places that you can’t see, and there are a lot more people involved than I mentioned. In this post I’ve only touched on some of the more complex topics that took a lot of our time (otherwise it would have been too long), but a lot of people both inside and outside Stack Overflow helped us along the way.

I know you’ll have a lot of questions, concerns, complaints, suggestions, etc. We welcome that. This week we’ll be following the comments below, our Meta site, Reddit, Hacker News, and Twitter to answer/help you as much as possible. Thanks for reading. It was great to read it in full. (heart)