Lizi, reporting from Aofei Temple


Produced by Qubit | WeChat official account QbitAI

True · “depth” learning ↓↓↓


△ Jeff Dean

That’s because Google has injected the spirit of machine learning into the Pixel’s camera: on the background-blurring task, it has learned a bit of Depth.

The GIF compares before learning (Stereo) with after learning (Learned, on the right). With learning, the places that ought to be blurred are indeed blurred more:

▽ If you don’t look carefully, you won’t even notice it’s a GIF

This is what you call real “depth” learning.

But there’s more to neural networks than meets the eye.

Making up for stereo vision’s shortcomings

The previous portrait mode simply relied on the principle of stereo vision:

It takes two slightly different pictures of the same person in the same scene.

▷ The movement is almost invisible to the naked eye

Play the two photos in a loop, and you’ll see that the person stays still while the background shifts. This phenomenon is called parallax (Parallax).

△ Look closely at that blurry circle

Using parallax to estimate an object’s depth is how the phone exploits its phase-detection autofocus (PDAF) pixels.

However, PDAF has its limitations. First, the shift between the two views is usually tiny, which makes depth hard to estimate from it accurately.
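To see why such a tiny shift is a problem, here is a minimal sketch of the textbook parallax-to-depth relation (this is not code from the Google post; the focal length, baseline, and disparity numbers are made up purely for illustration):

```python
# Textbook stereo relation: depth Z = f * B / d, where f is the focal length in pixels,
# B is the baseline between the two views, and d is the disparity (parallax) in pixels.
def depth_from_disparity(focal_px: float, baseline_m: float, disparity_px: float) -> float:
    """Estimate depth (meters) from the pixel shift between two views."""
    return focal_px * baseline_m / disparity_px

focal_px = 3000.0          # hypothetical focal length in pixels
pdaf_baseline_m = 0.001    # the PDAF baseline is on the order of a millimeter

# With such a small baseline, a subject one meter away shifts by only a few pixels,
# so half a pixel of noise in the measured parallax already moves the estimate a lot.
for disparity in (3.0, 3.5):
    z = depth_from_disparity(focal_px, pdaf_baseline_m, disparity)
    print(f"disparity {disparity:.1f} px -> depth {z:.2f} m")
```

In this made-up example, half a pixel of disparity noise shifts the depth estimate by roughly fourteen centimeters at a one-meter distance.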

The movement direction of parallel lines is a mystery

The other is that stereo vision always runs into the aperture problem (Aperture Problem): for lines parallel to the direction of the shift, it may be impossible to judge how far, or in which direction, they have moved.

Take a closer look at an example (this time not a GIF):

△ Learned (right) vs. not learned (left); pay attention to the parallel slats

For example, depth predictions often go wrong when there are horizontal lines in the image. As shown on the left, several parallel slats should have similar depths, yet their degrees of blur differ considerably.

So, the Google AI team decided not to rely on PDAF alone, but to bring in other depth cues as well.

Multiple depth cues × high-quality data collection

The new method developed by the team brings in several other depth cues:

For example, points far from the in-focus plane (Plane) appear less sharp than points near it. This provides a defocus (Defocus) cue for judging depth.


Another example: for common everyday objects, we already have a rough idea of their real-world size. Using an object’s size in the image to judge its depth is a semantic (Semantic) cue.

A convolutional neural network (CNN) is used to combine these auxiliary cues with the original PDAF parallax.
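The post doesn’t spell out the network here, so the following is only a toy sketch of the idea: feed the raw PDAF pair into a small convolutional network and let it pick up the defocus and semantic cues on its own. The layer sizes and names are invented and are not the Pixel camera team’s architecture.

```python
import tensorflow as tf

# Hypothetical toy network: the input is the stacked PDAF pair (2 channels),
# the output is a per-pixel relative-depth map. Purely illustrative.
def build_toy_depth_net(height=256, width=256):
    pdaf_pair = tf.keras.Input(shape=(height, width, 2), name="pdaf_pair")
    x = tf.keras.layers.Conv2D(32, 3, padding="same", activation="relu")(pdaf_pair)
    x = tf.keras.layers.Conv2D(64, 3, padding="same", activation="relu")(x)
    # Deeper layers can, in principle, learn defocus blur and object-size
    # (semantic) cues from the same pixels, alongside the PDAF parallax.
    x = tf.keras.layers.Conv2D(64, 3, padding="same", activation="relu")(x)
    relative_depth = tf.keras.layers.Conv2D(1, 3, padding="same", name="relative_depth")(x)
    return tf.keras.Model(pdaf_pair, relative_depth)

model = build_toy_depth_net()
model.summary()
```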


A special data-collection setup

Training this CNN requires feeding it a large number of PDAF image pairs, that is, groups of images taken from slightly different viewpoints.

You also need high-quality depth maps (Depth Maps) corresponding to those images.



In addition, to improve the phone’s portrait mode, the training data needs to come from cameras similar to the phone’s.

▷ It looks makeshift, but it is actually quite impressive

So, the team DIYed a rather odd-looking rig: five Pixel 3 phones strapped together, all shooting at the same time (within 2 milliseconds of one another).

The placement of the five phones was chosen deliberately:

The five viewpoints ensure that parallax exists in multiple directions, avoiding the aperture problem;





It is almost guaranteed that a point in one photo will appear in at least one other photo, so points lacking a reference are rare;





The distance between the cameras is much larger than the PDAF baseline, so the depth estimates are more accurate (see the sketch after this section);





Synchronized shooting ensures that depth can also be computed in dynamic scenes.

▽ Dynamic scene = the baby keeps moving, but all five phones capture it at the same moment

(Plus, the rig is portable, so samples can also be collected outdoors.)
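Why does a baseline much wider than PDAF’s make the ground-truth depth more trustworthy? Propagating a fixed disparity error through the same Z = f·B/d relation as before shows the depth error shrinking as the baseline grows. The numbers below are, again, made up for illustration:

```python
# Roughly, dZ ≈ Z**2 * dd / (f * B): for the same disparity error dd,
# a larger baseline B yields a smaller depth error at a given depth Z.
def depth_error(depth_m: float, focal_px: float, baseline_m: float, disparity_err_px: float) -> float:
    return depth_m ** 2 * disparity_err_px / (focal_px * baseline_m)

focal_px, depth_m, disparity_err_px = 3000.0, 2.0, 0.25  # hypothetical values
for baseline_m in (0.001, 0.10):  # ~PDAF-sized baseline vs. a rig-sized one
    err = depth_error(depth_m, focal_px, baseline_m, disparity_err_px)
    print(f"baseline {baseline_m * 100:.1f} cm -> depth error ~{err:.3f} m")
```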


Eliminate other distractions

But even with good data, accurately predicting the depth of objects in an image is not easy.


The same pair of PDAF images can correspond to many different depth maps:

(Different lenses and different focal lengths both affect how parallax translates into depth.)

To take this into account, the network directly predicts each object’s relative depth, factoring out these lens-dependent effects.

This, the team says, produces satisfactory results.
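Since only relative depth is predicted, the training loss must not punish the network for getting the global scale wrong. The following is just an illustrative sketch of one common way to achieve that (normalizing both maps before comparing them), not necessarily the loss the team used:

```python
import tensorflow as tf

# Illustrative scale-and-shift-invariant loss: normalize the predicted and
# ground-truth depth maps before comparing them, so only relative depth matters.
def scale_shift_invariant_l1(pred, target):
    def normalize(x):
        mean = tf.reduce_mean(x, axis=[1, 2, 3], keepdims=True)
        std = tf.math.reduce_std(x, axis=[1, 2, 3], keepdims=True)
        return (x - mean) / (std + 1e-6)
    return tf.reduce_mean(tf.abs(normalize(pred) - normalize(target)))

# Usage sketch: the loss is ~0 whenever pred equals target up to a scale and shift.
target = tf.random.uniform([1, 64, 64, 1])
pred = 3.0 * target + 0.5
print(float(scale_shift_invariant_l1(pred, target)))  # close to 0
```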

Speed is of the essence

(Although there may not be many Pixel users in China……)

The team wrote in the blog post that the depth prediction has to run quickly when a photo is taken, so the person holding the phone isn’t kept waiting too long.


So, the CNN is deployed on the phone with TensorFlow Lite, and the Pixel 3’s GPU handles the computation quickly.
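As a rough sketch of what running such a model with TensorFlow Lite looks like (the model file name and tensor shapes below are hypothetical, and on the phone itself the Android interpreter with a GPU delegate would be used rather than this Python one):

```python
import numpy as np
import tensorflow as tf

# Hypothetical model file; the on-device model is not published under this name.
interpreter = tf.lite.Interpreter(model_path="depth_net.tflite")
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Fake PDAF input with whatever shape the converted model expects.
pdaf_pair = np.random.rand(*input_details[0]["shape"]).astype(np.float32)

interpreter.set_tensor(input_details[0]["index"], pdaf_pair)
interpreter.invoke()
depth_map = interpreter.get_tensor(output_details[0]["index"])
print(depth_map.shape)
```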

Update to version 6.1 of the Google Camera app and it just works.

In Google Photos, users can adjust the depth themselves, changing the amount of blur and the point of focus.

You can also use a third-party depth extractor to pull the depth map out of the JPG for your own appreciation.
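Those extractors work because the portrait-mode JPG carries the depth map embedded in the same file as an extra image. Below is a very rough, hacky sketch of the idea (splitting the file at JPEG start-of-image markers); real tools parse the Dynamic Depth / XMP metadata properly, and the file name is hypothetical.

```python
# Hacky illustration: many portrait-mode JPGs append auxiliary images (including
# the depth map) after the primary image, so scanning for extra JPEG
# start-of-image markers (FF D8 FF) often uncovers them.
def split_embedded_jpegs(path):
    data = open(path, "rb").read()
    marker = b"\xff\xd8\xff"
    starts = [i for i in range(len(data)) if data[i:i + 3] == marker]
    return [data[s:e] for s, e in zip(starts, starts[1:] + [len(data)])]

# Usage sketch (hypothetical file name): the later parts, if any,
# may include the embedded depth map.
# parts = split_embedded_jpegs("portrait.jpg")
```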


— End —


Qubit QbitAI · Contracted author on Toutiao

վ’ᴗ’ ի Tracking new developments in AI technology and products