In a recently published paper, Turing Award winner Yoshua Bengio and his coauthors detail their team's current research focus: causal representation learning, which combines machine learning with causal reasoning. The authors not only comprehensively review the basic concepts of causal reasoning, but also explain its integration with machine learning and its profound implications for the field. The paper has been accepted by the journal Proceedings of the IEEE.

Reported by Heart of the Machine. Editors: Devil King, Du Wei.

Machine learning and causal reasoning have always been two relatively independent research directions, each with its own advantages and disadvantages.

But in the past few years, the two fields have begun to learn from and advance each other. For example, the rapid development of machine learning has promoted the development of causal inference: powerful machine learning methods such as decision trees, ensemble methods, and deep neural networks can be used to estimate potential outcomes more accurately. In a 2018 article, Heart of the Machine also covered Turing Award winner Judea Pearl, one of the main advocates of causal models, exploring the limitations of current machine learning theory and seven insights from causal reasoning.

Therefore, causal representation learning, which combines the two, has attracted increasing attention in recent years and has become a potential path toward human-level AI.

At the beginning of 2020, Heart of the Machine selected several recent papers in the field of causal representation learning and carefully analyzed the basic architecture of the different approaches, to help interested readers understand the directions and possibilities of combining causal learning with machine learning. (See: What is the latest research on counterfactual reasoning, feature disentanglement, and causal representation learning?)

Today, we recommend another paper on causal representation learning: Towards Causal Representation Learning, published by Yoshua Bengio's team and accepted by Proceedings of the IEEE.

In a talk at the end of 2020, Bengio described this line of work as the core of their current research program.

Link to the paper: arxiv.org/pdf/2102.11…

In this paper, Yoshua Bengio et al. review the basic concepts of causal inference and relate them to key open problems in machine learning, such as transfer and generalization, and then analyze the possible contributions of causal reasoning to machine learning research. The reverse direction matters as well: most causal studies presuppose that the causal variables are given. Therefore, a core problem at the intersection of AI and causality is causal representation learning, i.e., discovering high-level causal variables from low-level observations. Finally, the paper describes the implications of causality for machine learning and proposes core research directions for this emerging field.

The main contributions of this paper are as follows:

  • Chapter 2 introduces the different levels of causal modeling in physical systems, and Chapter 3 presents the differences between causal models and statistical models, discussing not only modeling capabilities but also the assumptions and challenges involved.
  • Chapter 4 expands the principle of Independent Causal Mechanisms (ICM) into a core component of data-driven estimation of causal relations. The Sparse Mechanism Shift (SMS) hypothesis is derived as a consequence of the ICM principle, and its implications for learning causal models are discussed.
  • Chapter 5 reviews existing approaches to learning causal relations from appropriate descriptors (or features), covering both classical methods and modern methods based on deep neural networks, with a focus on the underlying principles of causal discovery.
  • Chapter 6 discusses how to learn causal representations of data that yield useful models, and how to view machine learning problems from a causal perspective.
  • Chapter 7 analyzes the implications of causality for practical machine learning. The researchers use causal language to reinterpret robustness and generalization, as well as common techniques such as semi-supervised learning, self-supervised learning, data augmentation, and pre-training. They also explore the intersection of causality and machine learning in scientific applications, and consider how to combine the strengths of both to create more general AI.

Hierarchy of causal modeling

Prediction in the independent and identically distributed (i.i.d.) setting

Statistical models are a superficial description of reality because they only capture associations. For a given input X and target label Y, we may want to approximate P(Y | X) to answer questions such as: "What is the probability that this image contains a dog?" or "Given diagnostic measurements, e.g., blood pressure, what is the probability of heart failure in this patient?" Under suitable assumptions, such questions can be answered by observing sufficiently many i.i.d. data points sampled from P(X, Y).
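As a minimal, illustrative sketch (our addition, not from the paper; it assumes NumPy and scikit-learn, and all variables and data are made up), the following fits a purely statistical model of P(Y | X) from simulated i.i.d. samples:

```python
# Minimal sketch: approximating P(Y | X) from i.i.d. samples with a
# statistical model. Dataset and variable names are illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Simulate i.i.d. draws from P(X, Y): X is a "diagnostic measure",
# Y a binary outcome whose probability depends on X.
X = rng.normal(size=(1000, 1))
p_y = 1.0 / (1.0 + np.exp(-2.0 * X[:, 0]))
Y = rng.binomial(1, p_y)

# Fit a model of the conditional P(Y | X) -- pure association;
# no causal claim is made or needed for this prediction task.
clf = LogisticRegression().fit(X, Y)
print(clf.predict_proba([[1.0]]))  # estimated P(Y | X = 1.0)
```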

Prediction under distribution shift

Interventional questions are more challenging than predictions because they involve actions that go beyond the i.i.d. setting of statistical learning. An intervention may affect the values and relations of a subset of the causal variables. For example: "Would increasing the stork population in a country increase the human birth rate?" or "Would fewer people smoke if there were more social stigma attached to tobacco?"

Answering counterfactual questions

Counterfactual questions require reasoning about why things happened, imagining the consequences of different actions in hindsight, and deciding which actions could have achieved a desired outcome. Counterfactual questions are harder to answer than interventional ones, yet they may be a key challenge for AI, since an intelligent agent benefits from imagining the consequences of its actions and understanding which actions would have led to a specific outcome.

The nature of data: observational, interventional, (un)structured

The format of the data plays a major role in determining which types of relations can be inferred. We can distinguish two axes of data modalities: observational versus interventional data, and hand-engineered versus raw (unstructured) perceptual input.

Observational versus interventional data: an extreme data format that is often assumed but rarely strictly satisfied is observational i.i.d. data, in which each data point is sampled independently from the same distribution.

Hand-engineered versus raw data: in classical AI, data is often assumed to be structured into high-level, semantically meaningful variables, which may in part correspond to causal variables of the underlying graph.

Causal models and reasoning

This part mainly introduces the differences between statistical and causal modeling, and uses formal language to describe interventions and distribution changes.

Methods driven by i.i.d. data

For i.i.d. data, strong universal consistency guarantees that a learning algorithm converges to the lowest achievable risk. Such algorithms do exist, for example nearest-neighbor classifiers, support vector machines, and neural networks. However, current machine learning methods often perform poorly on problems that violate the i.i.d. assumption, problems that humans handle with ease.

Reichenbach’s Principle: From statistics to causality

Reichenbach [198] articulated the link between causality and statistical dependence in his Common Cause Principle: if two observables X and Y are statistically dependent, then there exists a variable Z that causally influences both and explains all of their dependence, in the sense that X and Y become independent when conditioned on Z.

As a special case, Z may coincide with X or Y itself, i.e., one of the two variables directly causes the other. Without additional assumptions, we cannot distinguish these cases from observational data; in this sense, a causal model contains more information than a statistical one.

Finding the causal structure is hard when only two variables are observed, but it becomes easier with more observables, because the causal structure then implies multiple non-trivial conditional independences. These generalize Reichenbach's principle and can be described in the language of causal graphs or structural causal models, which fuse probabilistic graphical models with the notion of intervention.
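To make the role of conditional independence concrete, here is a small illustrative sketch (our addition, not from the paper; it assumes NumPy) of a chain X → Z → Y, where X and Y are dependent but become independent once Z is controlled for:

```python
# Illustrative sketch: with three variables in a chain X -> Z -> Y,
# X and Y are dependent, but become independent given Z.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
X = rng.normal(size=n)
Z = 2.0 * X + rng.normal(size=n)      # Z := f_Z(X, U_Z)
Y = -1.5 * Z + rng.normal(size=n)     # Y := f_Y(Z, U_Y)

print(np.corrcoef(X, Y)[0, 1])        # strongly nonzero: X, Y dependent

# Partial correlation of X and Y given Z: regress Z out, then correlate.
rx = X - np.polyval(np.polyfit(Z, X, 1), Z)
ry = Y - np.polyval(np.polyfit(Z, Y, 1), Z)
print(np.corrcoef(rx, ry)[0, 1])      # ~0: X independent of Y given Z
```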

Structural causal model (SCM)

An SCM considers a set of observables (random variables) X_1, ..., X_n associated with the vertices of a directed acyclic graph (DAG). The study assumes that each observable is generated by a structural assignment of the form

$X_i := f_i(\mathrm{PA}_i, U_i), \qquad i = 1, \dots, n,$

where $f_i$ is a deterministic function of $X_i$'s parents $\mathrm{PA}_i$ in the graph and of the noise variable $U_i$, and the noise variables $U_1, \dots, U_n$ are jointly independent.

Mathematically, the observables are then random variables as well. Intuitively, we can think of the independent noises as "information probes" that spread through the graph (like independent pieces of a rumor spreading through a social network). This involves more than just two observables, since any non-trivial conditional independence statement requires at least three variables.
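The following toy sketch (our addition; the functions and numbers are illustrative, assuming NumPy) implements such an SCM over a three-variable DAG and samples it by evaluating the assignments in topological order; it also shows the kind of interventional query, do(X2 := 1), that a purely statistical model of the joint distribution cannot answer:

```python
# Minimal sketch of an SCM over a DAG X1 -> X2 -> X3, sampled by
# evaluating each structural assignment X_i := f_i(PA_i, U_i) in
# topological order (ancestral sampling). Functions are illustrative.
import numpy as np

rng = np.random.default_rng(0)

def sample(n, do_x2=None):
    """Sample the SCM; optionally intervene with do(X2 := do_x2)."""
    u1, u2, u3 = (rng.normal(size=n) for _ in range(3))  # independent noises
    x1 = u1                                              # X1 := U1
    x2 = np.tanh(x1) + 0.1 * u2 if do_x2 is None else np.full(n, do_x2)
    x3 = x2 ** 2 + 0.1 * u3                              # X3 := f3(X2, U3)
    return x1, x2, x3

# Observational vs. interventional distribution of X3: a statistical
# model of P(X1, X2, X3) alone cannot answer the second query.
_, _, x3_obs = sample(10_000)
_, _, x3_do = sample(10_000, do_x2=1.0)
print(x3_obs.mean(), x3_do.mean())
```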

Differences between statistical models, causal graph models, and SCMs

Figure 1 illustrates the difference between a statistical model and a causal model.

A statistical model can be defined by a graphical model, i.e., a probability distribution together with a graph. A graphical model becomes a causal model when its edges are causal (the graph is then a "causal graph"). A structural causal model consists of a set of causal variables and a set of structural equations together with a distribution over the noise variables U_i.

Independent causal mechanisms

The concept of independence here has two aspects: one relating to influence and one relating to information. In the history of causality research, invariant, autonomous, and independent mechanisms have appeared in many guises. For example, Haavelmo's early work [99] assumed that changing one structural assignment leaves the others unchanged; Hoover [111] introduced the invariance criterion: the true causal order is the one that is invariant under appropriate interventions; Aldrich [4] discussed the historical development of these ideas in economics; and Pearl [183] discussed autonomy in detail, arguing that a causal mechanism can remain unchanged while other mechanisms are subject to external influence.

The study views any real-world distribution as a product of causal mechanisms. A change in such a distribution is usually caused by a change in at least one of these mechanisms. Following this, the researchers adopt the ICM principle: the causal generative process of a system's variables is composed of autonomous modules that do not inform or influence each other.

Applied to two mechanisms (formalized as conditional distributions), the ICM principle states that the two conditionals should neither inform nor influence each other. The latter can be understood as a requirement that the mechanisms be independently intervenable.
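Under the ICM principle, the joint distribution then admits the causal (disentangled) factorization into independent mechanisms; written out (a standard formula, reproduced here for reference):

```latex
% Causal (disentangled) factorization: each factor is an autonomous
% mechanism that can change without affecting the others.
P(X_1, \dots, X_n) \;=\; \prod_{i=1}^{n} P\!\left(X_i \mid \mathrm{PA}_i\right)
```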

Causal discovery and machine learning

According to the SMS hypothesis, much of the causal structure is assumed to remain invariant across environments. Thus, distribution shifts (such as observing a system in different "environments" or contexts) can greatly aid in identifying causal structure. These contexts can come from interventions, non-stationary time series, or multiple views. Likewise, such contexts can be interpreted as different tasks and thereby connected to meta-learning.
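As a toy illustration of this idea (our addition, assuming NumPy; numbers are arbitrary), consider two environments in which only the mechanism P(X) shifts while P(Y | X) stays invariant; the causal regression is stable across environments, while the anticausal one is not:

```python
# Illustrative sketch of the Sparse Mechanism Shift idea: across two
# environments, only the mechanism P(X) changes; the mechanism
# P(Y | X) stays invariant, which hints at the direction X -> Y.
import numpy as np

rng = np.random.default_rng(0)

def environment(n, x_std):
    x = x_std * rng.normal(size=n)       # P(X) differs per environment
    y = 1.5 * x + rng.normal(size=n)     # P(Y | X) is shared (invariant)
    return x, y

x1, y1 = environment(50_000, x_std=1.0)
x2, y2 = environment(50_000, x_std=2.0)

# Regression slopes of Y on X agree across environments ...
print(np.polyfit(x1, y1, 1)[0], np.polyfit(x2, y2, 1)[0])
# ... while the anticausal regression of X on Y does not.
print(np.polyfit(y1, x1, 1)[0], np.polyfit(y2, x2, 1)[0])
```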

Traditional causal discovery and reasoning assume that the units are random variables connected by a causal graph. However, real-world observations are usually not structured into such units to begin with, for example objects in images. Causal representation learning therefore emerges to learn these variables from data, just as machine learning went beyond symbolic AI in not requiring that the symbols manipulated by the algorithm be given in advance. Accordingly, the researchers aim to connect the causal variables $S_1, \dots, S_n$ to the observations via

$X = G(S_1, \dots, S_n),$

where $G$ is a nonlinear function. Figure 2 below shows an example in which the high-dimensional observations are a view of the state of a causal system, which is then processed by a neural network to extract high-level variables that are useful for a variety of tasks.

To combine structural causal modeling with representation learning, we should strive to embed an SCM into a larger machine learning model whose inputs and outputs may be high-dimensional and unstructured, but whose internal workings are at least partly governed by an SCM (which can be parameterized with neural networks).

The researchers show a visual example in Figure 3, where changes measured in terms of the appropriate causal variables are sparse (only the positions of the finger and the square change), but dense in other representations such as pixel space (many pixel values change as a result of the finger and square moving).
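A tiny sketch of this point (our addition; the rendering function is hypothetical) compares how many coordinates change in causal space versus pixel space when a single object moves:

```python
# Toy illustration of Figure 3's point: moving one object changes a
# single causal variable (its position) but many pixel values.
import numpy as np

def render(square_x):
    """Hypothetical rendering function G: causal state -> 32x32 image."""
    img = np.zeros((32, 32))
    img[12:20, square_x:square_x + 8] = 1.0   # draw an 8x8 square
    return img

before, after = render(4), render(12)
print("causal variables changed: 1 (the square's x position)")
print("pixels changed:", int((before != after).sum()))  # many pixels
```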

From the perspective of causal representation learning, the authors discuss three problems in modern machine learning: learning disentangled representations, learning transferable mechanisms, and learning interventional world models and reasoning.

Effects of causal reasoning on machine learning

All of the above discussion requires a learning paradigm that does not rely on the usual i.i.d. assumption. Instead, the researchers make a weaker assumption: the data to which the model is applied may come from a different distribution, but the causal mechanisms involved are (mostly) the same.

Semi-supervised Learning (SSL)

Assume the underlying causal graph is X → Y, and that we want to learn the mapping X → Y. The causal factorization in this case is

$P(X, Y) = P(X)\,P(Y \mid X).$

By the ICM principle, $P(X)$ then contains no information about $P(Y \mid X)$, so additional unlabeled samples of X should not help in estimating $P(Y \mid X)$; SSL is therefore expected to help only in the anticausal direction.

From an SSL perspective, subsequent developments include further theoretical analyses and conditional SSL. The view of SSL as exploiting dependencies between the marginal P(X) and the non-causal conditional P(Y | X) is consistent with the common assumptions used to justify SSL.
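As an illustrative sketch of SSL in the anticausal direction (our addition, assuming scikit-learn; the data and names are synthetic), the cluster structure of P(X) below carries information about P(Y | X), so a handful of labels plus unlabeled points suffice:

```python
# Minimal sketch of SSL in the anticausal direction: the label Y causes
# the features X, so the marginal P(X) (cluster structure) carries
# information about P(Y | X) that unlabeled points can expose.
import numpy as np
from sklearn.semi_supervised import LabelSpreading

rng = np.random.default_rng(0)

# Generative direction Y -> X: each class emits a Gaussian cluster.
n = 400
y_true = rng.integers(0, 2, size=n)
X = rng.normal(size=(n, 2)) + 3.0 * y_true[:, None]

# Keep only a handful of labels; mark the rest as unlabeled (-1).
y_semi = np.full(n, -1)
labeled = rng.choice(n, size=10, replace=False)
y_semi[labeled] = y_true[labeled]

model = LabelSpreading(kernel='knn', n_neighbors=7).fit(X, y_semi)
print("accuracy:", (model.transduction_ == y_true).mean())
```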

In addition, some theoretical results in the SSL field use assumptions that are well known from causal graphs (even if they do not mention causality): co-training theory addresses learnability from unlabeled data and relies on predictors that are conditionally independent given the label. We would typically expect such predictors to be caused (only) by the label, i.e., an anticausal setting.

Adversarial vulnerability

Now suppose we are in a causal setting, where the causal generative model factorizes into independent components, one of which (essentially) represents the classification function. We might then expect that if a predictor approximates a causal mechanism, which is intrinsically transferable and robust, adversarial examples should be harder to find.

Recent work supports the idea that a potential defense against adversarial attacks is to tackle the anticausal classification problem by modeling the causal generative direction, an approach known in vision as analysis by synthesis.

Robustness and strong generalization

In order to learn a robust predictor, we should possess a subset $\mathcal{E}$ of the set of possible environment distributions and solve

$\min_{f} \max_{e \in \mathcal{E}} \; \mathbb{E}_{(X, Y) \sim P^{e}}\!\left[\ell\big(f(X), Y\big)\right] \qquad (18)$

In practice, solving formula (18) requires specifying a causal model together with the set of interventions it relates to. If the observed set of environments $\mathcal{E}$ does not cover the set of possible environments, we incur an additional estimation error, which in the worst case can be arbitrarily large.
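A minimal numerical sketch of this min-max objective (our addition, assuming NumPy; the environments and step sizes are illustrative) performs subgradient descent on the risk of the currently worst environment:

```python
# Sketch of the min-max objective (18): minimize the worst-case risk
# over a set of observed environments, here for a 1-D linear predictor
# trained with subgradient descent on the max-risk environment.
import numpy as np

rng = np.random.default_rng(0)

def make_env(slope, n=2000):
    x = rng.normal(size=n)
    y = slope * x + 0.1 * rng.normal(size=n)
    return x, y

envs = [make_env(1.0), make_env(0.6), make_env(1.4)]  # illustrative shifts

w = 0.0
for _ in range(500):
    risks = [np.mean((y - w * x) ** 2) for x, y in envs]
    x, y = envs[int(np.argmax(risks))]        # pick the worst environment
    grad = np.mean(2 * (w * x - y) * x)       # d/dw of its squared error
    w -= 0.05 * grad
print("robust slope:", w)  # lands between the per-environment optima
```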

Pre-training, data augmentation, and self-supervision

Learning a predictor that solves the min-max optimization problem in (18) is difficult. This study interprets several common machine learning techniques as approximations of (18). The first method is to enrich the distribution of the training set; the second, often used in combination with the first, relies on data augmentation to increase data diversity; the third relies on self-supervised learning of P(X).

An interesting line of research is to combine all of these techniques: large-scale training with data augmentation, self-supervision, and robust fine-tuning on data from multiple simulated environments.

Reinforcement learning

Reinforcement learning (RL) is closer to causality research than mainstream machine learning, because it can sometimes effectively estimate do-probabilities directly. However, in off-policy learning settings, and especially in the batch (or observational) setting, questions of causality become subtle. The application of causal learning to reinforcement learning can be divided into two aspects: causal induction and causal inference.

Causal induction in the reinforcement learning setting differs substantially from the classical causal learning setting, where the causal variables are usually given. However, there is growing evidence for the effectiveness of appropriately structured representations of environments, for example in:

  • World models;
  • Generalization, robustness, and fast transfer;
  • Counterfactuals;
  • Offline reinforcement learning.

Scientific applications

When machine learning is applied to the natural sciences, a fundamental question is to what extent we can complement our understanding of physical systems with machine learning. One interesting direction is the use of neural networks for physics simulations, which can be far more efficient than hand-designed simulators. On the other hand, the lack of systematic experimental conditions can be a challenge in applications such as healthcare.

Causality has enormous potential to help understand medical phenomena. During the COVID-19 pandemic, for example, causal mediation analysis helped disentangle the effects of different factors on case fatality when a textbook example of Simpson's paradox was observed.

Another example of a scientific application is astronomy, where researchers have used causal models to identify exoplanets in the presence of instrument confounding.

Multi-task and continual learning

Multi-task learning refers to building a system that can solve multiple tasks across different environments. These tasks usually share some common traits: by learning the similarities across tasks, a system can exploit the knowledge acquired from earlier tasks more effectively when it encounters a new one.

In this regard, we have clearly come a long way without explicitly treating multi-task learning as a causal problem. Driven by massive data and computing power, artificial intelligence has made remarkable progress in a wide range of applications. This raises the question: "Why can't we simply train a big model that learns the dynamics of the environment (e.g., in a reinforcement learning setting), including all possible interventions?" After all, distributed representations can generalize to unseen samples, and if we train on a large number of interventions, we might obtain a large neural network that generalizes well across many interventions.

There are several caveats. First, if the data does not contain sufficient diversity, the worst-case error caused by unobserved distribution shifts can still be high. Furthermore, even if we had a model that successfully handled all interventions in a particular environment, we might want to use it in a different environment with similar, though not necessarily identical, dynamics.

i.i.d. pattern recognition is ultimately only a mathematical abstraction, and causality may be essential to most forms of animate learning. Until now, however, machine learning has neglected the full integration of causality, and this paper argues that machine learning would benefit from integrating causal concepts. The researchers believe that combining current deep learning methods with causal tools and ideas may be a necessary step toward general-purpose AI systems.