RIP. I am in mourning; this is the first time a celebrity’s death has affected me like this.

Joe Armstrong needs no introduction beyond Erlang and OTP, but his thesis “Making reliable distributed systems in the presence of software errors” will also go down in history. It was decades ahead of its time, and it proposed a sound alternative to approaches such as OOP that are not inherently concurrent.

First, Joe Armstrong argues in his paper that almost all traditional programming languages lack strong support for true concurrency. They are sequential in nature, and whatever concurrency they offer is provided by the underlying operating system, not by the language itself.

In a Concurrency Oriented Language, that is, a language that supports concurrency well, writing a program is fairly straightforward:

  1. Identify all the truly concurrent activities in the real-world problem
  2. Identify all message channels between concurrent activities
  3. Write down all the messages that can flow through different message channels

The structure of the program should strictly follow the structure of the problem: every real-world concurrent activity maps to exactly one concurrent process in our programming language. If the mapping from problem to program is 1:1, the program and the problem are said to be isomorphic. The 1:1 mapping matters because it minimizes the conceptual gap between the problem and its solution. If the mapping is not 1:1, the program quickly degrades and becomes hard to understand. This degradation is very common when concurrency problems are tackled in non-concurrency-oriented programming languages, where several independent activities are often forced to be controlled by the same language-level thread or process. That inevitably costs clarity and can lead to complex, hard-to-reproduce errors.

When analyzing the problem, we must also choose the right granularity for our model. For example, if we were writing an instant messaging system, we would use one process per user, not one process for every atom in the user’s body.
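The three steps above can be sketched in a few lines. This is only an illustration in Python, not Erlang: threads stand in for processes and a `queue.Queue` stands in for each mailbox, so it does not give the isolation Erlang provides. The `UserProcess` class, the message shapes and the `demo` function are hypothetical names chosen for this example.

```python
import queue
import threading

# Step 3: the messages that can flow on a channel (hypothetical shapes).
#   ("chat", sender, text)  delivers one chat line
#   ("stop",)               asks the process to finish


class UserProcess(threading.Thread):
    """Step 1: one concurrent activity per real-world user."""

    def __init__(self, user):
        super().__init__()
        self.user = user
        self.mailbox = queue.Queue()  # step 2: this user's message channel
        self.received = []            # private state, only touched by run()

    def run(self):
        while True:
            message = self.mailbox.get()
            if message[0] == "stop":
                break
            _, sender, text = message
            self.received.append(f"{sender}: {text}")


def demo():
    alice, bob = UserProcess("alice"), UserProcess("bob")
    for user in (alice, bob):
        user.start()
    # The only interaction between users is sending messages.
    bob.mailbox.put(("chat", "alice", "hello"))
    alice.mailbox.put(("chat", "bob", "hi there"))
    for user in (alice, bob):
        user.mailbox.put(("stop",))
        user.join()
    return alice.received, bob.received
```

Note the granularity: there is one `UserProcess` per user, which mirrors the problem, and nothing finer.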

Second, he turned what he calls a Concurrency Oriented Language into reality through nine design principles. The result is Erlang and the OTP framework, in which writing distributed systems comes naturally.

  • Everything is a process
  • Strong process isolation
  • Process creation and destruction are lightweight operations
  • Messaging is the only way processes interact
  • Each process has a unique name
  • If you know the name of the process, you can send it a message
  • Processes do not share resources
  • Error handling is non-local
  • Processes either run normally or die immediately

Built on these nine principles, Erlang was used to implement one of the world’s most complex ATM switches, which is reported to have achieved 99.9999999% reliability.
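To get a feel for what nine nines means, here is a small back-of-the-envelope calculation. It assumes the figure is read as availability (the fraction of time the system is up), which is an interpretation made for this sketch; the function name is made up for the example.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000, ignoring leap years


def downtime_seconds_per_year(availability):
    """Expected downtime per year for a given availability fraction."""
    return (1.0 - availability) * SECONDS_PER_YEAR


# Nine nines: about 0.03 seconds of downtime per year.
nine_nines = downtime_seconds_per_year(0.999999999)

# For comparison, "three nines" allows close to nine hours per year.
three_nines_hours = downtime_seconds_per_year(0.999) / 3600
```

Under that reading, nine nines leaves room for only a few hundredths of a second of downtime in a whole year.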

Third, he proposed and implemented the “let it crash” philosophy.

A program can’t handle everything, so the programmer should handle the obvious cases as well as he can and let the hidden, non-intuitive ones (which are likely to be rare) simply crash the process. Right?

Isn’t this much like packaging a binary in golang, C++ or something similar and having it monitored by a tool such as a container? When it crashes and dies, it is restarted automatically. For more, see my previous article, “How do I understand Erlang’s ‘let it crash’ philosophy?”
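A minimal sketch of the idea, again in Python rather than Erlang: the worker handles only the obvious case and contains no defensive code, while a separate wrapper plays the role of the runtime and reports how the worker ended. `parse_amount`, `spawn_monitored` and `demo` are hypothetical names for this illustration.

```python
import queue
import threading


def parse_amount(text):
    # Handle the obvious case only. Unexpected input such as "abc"
    # raises ValueError, and the worker makes no attempt to recover.
    return int(text)


def spawn_monitored(func, arg, exits):
    """Run func(arg) in its own thread and report how it ended."""

    def body():
        try:
            exits.put(("normal", arg, func(arg)))
        except Exception as reason:
            # Stand-in for the runtime noticing the crash and
            # telling whoever is monitoring this worker.
            exits.put(("crashed", arg, type(reason).__name__))

    thread = threading.Thread(target=body)
    thread.start()
    return thread


def demo():
    exits = queue.Queue()
    threads = [spawn_monitored(parse_amount, job, exits) for job in ("42", "abc")]
    for thread in threads:
        thread.join()
    return sorted(exits.get() for _ in threads)
```

The error handling lives outside the worker, so the worker code stays short and describes only the normal path.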

Fourth, all processes are lightweight and can be monitored by a Supervisor.

You can easily use a supervisor to manage child processes, and the Supervisor will handle child processes that fail unexpectedly according to the policies you set. These policies include:

  • one_for_one: only the child process that died is restarted
  • one_for_all: if one child process dies, all child processes are restarted
  • rest_for_one: the child process that died and all child processes started after it are restarted
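The three strategies differ only in which children get restarted when one of them dies. The following sketch models that decision as a plain Python function over a list of children in start order. It is an illustration of the rule, not OTP's supervisor implementation, and `children_to_restart` is a made-up name.

```python
def children_to_restart(strategy, children, failed):
    """Return the children to restart, given the start order and the one that died."""
    index = children.index(failed)
    if strategy == "one_for_one":
        return [failed]
    if strategy == "one_for_all":
        return list(children)
    if strategy == "rest_for_one":
        # The failed child plus everything started after it.
        return children[index:]
    raise ValueError(f"unknown strategy: {strategy}")


# Hypothetical children, listed in the order they were started.
children = ["db", "cache", "web"]
```

With `cache` failing, `one_for_one` restarts only `cache`, `one_for_all` restarts all three, and `rest_for_one` restarts `cache` and `web` but leaves `db` alone.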

When this old hand writes code, he just goes all in! Yeah, all we have to do is restart.

In essence, the paper backs this up: in complex production systems almost all failures and errors are transient, so retrying an operation is a good way to solve the problem. Jim Gray’s paper showed that handling transient failures this way increased the system’s mean time between failures (MTBF) by a factor of 4.

Therefore, you can build a supervision tree whose root node does nothing but monitor processes. Everything else is its descendant, and short of a core dump (which almost never happens) the root node will not die, so failures in the child processes are always handled correctly.

This comes with a limit, of course: if a process is restarted more than 3 times within 5 seconds, the supervisor stops restarting it and lets it die. Why is that? A restart is meant to bring the process back to its known-stable initial state. If that state is itself not stable, restarting over and over is pointless, and someone urgently needs to step in.
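The restart limit can be sketched as a sliding-window counter. This is a simplified model in Python, assuming the 3-restarts-in-5-seconds figures from the text. The `RestartLimiter` class is hypothetical and is not how OTP implements it.

```python
from collections import deque


class RestartLimiter:
    """Allow at most max_restarts restarts within any window of period seconds."""

    def __init__(self, max_restarts=3, period=5.0):
        self.max_restarts = max_restarts
        self.period = period
        self.restarts = deque()  # timestamps of recent restarts

    def allow_restart(self, now):
        # Forget restarts that have fallen outside the sliding window.
        while self.restarts and now - self.restarts[0] > self.period:
            self.restarts.popleft()
        self.restarts.append(now)
        # False means: give up and escalate instead of restarting again.
        return len(self.restarts) <= self.max_restarts


# Four crashes in quick succession: the fourth restart is refused.
rapid = RestartLimiter()
rapid_results = [rapid.allow_restart(t) for t in (0.0, 1.0, 2.0, 3.0)]

# The same four crashes spread far apart are all restarted.
spaced = RestartLimiter()
spaced_results = [spaced.allow_restart(t) for t in (0.0, 10.0, 20.0, 30.0)]
```

Timestamps are passed in explicitly to keep the sketch deterministic; a real supervisor would read a monotonic clock.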


Joe Armstrong had a lot of insight. For example, he said that what matters is not having a complete and seemingly perfect type system, but how you approach the problem. If you think about it wrong from the start (such as using a non-concurrent language to deal with concurrency), then no matter how correct the code is, it is just doing things right while heading in the wrong direction.