A green screen is a powerful tool in film and television for cutting out a subject and replacing the background. But can the background still be replaced convincingly without shooting in front of a green screen? Researchers at the University of Washington recently posted a paper showing how to swap out video backgrounds almost perfectly with no green screen at all, effectively turning the whole world into your green screen.

Heart of Machine report; contributors: Racoon, Zhang Qian.

As the authors' demo shows, the method holds up impressively even when the person in the video shakes their hair wildly.

The subject can perform all kinds of movements without any visible compositing artifacts:

Matting also works well when the person is hard to tell apart from the background and the handheld camera wobbles slightly:

This paper has been accepted by CVPR 2020.
  • Paper: https://arxiv.org/pdf/2004.00626.pdf

  • Code: https://github.com/senguptaumd/Background-Matting

In the paper, the researchers propose a new matting method. Most existing matting methods require either a green-screen background or a manually created trimap. Automatic, trimap-free methods do exist, but their results are poor. The method proposed here also needs no trimap, yet it mattes and replaces backgrounds noticeably better.


Of course, such good results come with one requirement: in addition to the original image or video, the subject is asked to capture an extra shot of the background with no person in it, which takes far less time than creating a trimap. The researchers trained a deep network with an adversarial loss to predict the matte. They first trained a matting network with a supervised loss on synthetic data that has ground truth. To close the gap between synthetic composites and unlabeled real images, they then trained a second matting network under the guidance of the first, with a discriminator judging the quality of the composited images. The researchers tested the new method on a variety of images and videos and found it clearly superior to the previous SOTA methods.




In the discussion section of the paper, we can see many potential applications, such as vlogs (virtual "cloud travel") and video conferencing.
Next time your advisor asks you to send a video of yourself hard at work in the lab, you may find it useful too.




Method


The input to the system is an image or video of a person in front of a static natural background, together with an image of the background alone. Capturing the background image is easy: the person simply steps out of the frame while the camera keeps a fixed exposure and focus (e.g., a smartphone camera). For handheld cameras, the researchers assume the camera moves only slightly and use a homography to align the background with each input image. A soft segmentation of the subject is also extracted from the input. For video input, adjacent frames can be added to aid matte generation.
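A minimal OpenCV sketch of homography-based background alignment (ORB matching and RANSAC here are assumptions, not necessarily the authors' choices; the repository ships its own preprocessing script, described later):

# Align a clean background plate to an input frame, assuming small camera motion.
import cv2
import numpy as np

def align_background(background, frame):
    gray_bg = cv2.cvtColor(background, cv2.COLOR_BGR2GRAY)
    gray_fr = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

    # Detect and match keypoints between the background and the frame.
    orb = cv2.ORB_create(2000)
    kp1, des1 = orb.detectAndCompute(gray_bg, None)
    kp2, des2 = orb.detectAndCompute(gray_fr, None)
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)[:500]

    src = np.float32([kp1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([kp2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)

    # Estimate the homography and warp the background into the frame's view.
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
    h, w = frame.shape[:2]
    return cv2.warpPerspective(background, H, (w, h))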


Figure 2: Method overview.



Supervised training on the Adobe dataset


The researchers first trained a deep matting network on the Adobe Matting dataset, using only its images of opaque objects. The network takes as input the image I containing the person, the pure background image B', a soft segmentation S of the person, and an optional temporal stack M of adjacent frames; it outputs a foreground image F and a foreground alpha matte α. To produce S, the researchers ran person segmentation followed by erosion, dilation, and Gaussian blur. For video, M is built from nearby frames: with a frame interval of T, the selected neighbors are {I−2T, I−T, I+T, I+2T}. These frames are converted to grayscale so the network ignores color and focuses on motion. If the input is not a video, M is set to {I, I, I, I}, also converted to grayscale. The input set is denoted X = {I, B', S, M}, and the network with weights θ can be written as:
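Spelled out, this is simply a mapping from the four inputs to the predicted foreground and matte:

$$(F, \alpha) = G(X; \theta), \qquad X = \{I, B', S, M\}$$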
The researchers proposed a Context Switching block (CS block) so that the network can combine the features of all inputs more effectively depending on the input image (see Figure 2). For example, where part of the person looks similar to the background, the network should rely more on the segmentation cue in that region. The network has four encoders, one per input, each producing a 256-channel feature map. It then pairs the image features from I with those of B', S, and M in turn, applying a 1×1 convolution, BatchNorm, and ReLU to each pair to produce 64-channel features. Finally, these three 64-channel feature sets are combined with the original 256-channel image features through another 1×1 convolution, BatchNorm, and ReLU to obtain the encoded features, which are passed to the rest of the network (residual blocks and decoders). The researchers observed that this CS block architecture helps the network generalize from the Adobe dataset to real data.
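A minimal PyTorch sketch of a CS block as described above (the output channel count of the final combiner and other details not stated in the text are assumptions, not the authors' exact implementation):

# Context Switching block: pairs image features with each cue, then recombines.
import torch
import torch.nn as nn

def conv_bn_relu(in_ch, out_ch):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

class ContextSwitchingBlock(nn.Module):
    def __init__(self, feat_ch=256, sel_ch=64):
        super().__init__()
        # One selector per cue: image features paired with B', S, or M features.
        self.sel_bg = conv_bn_relu(2 * feat_ch, sel_ch)
        self.sel_seg = conv_bn_relu(2 * feat_ch, sel_ch)
        self.sel_motion = conv_bn_relu(2 * feat_ch, sel_ch)
        # Combine the three 64-channel cues with the original image features.
        self.combine = conv_bn_relu(feat_ch + 3 * sel_ch, feat_ch)

    def forward(self, f_img, f_bg, f_seg, f_motion):
        c_bg = self.sel_bg(torch.cat([f_img, f_bg], dim=1))
        c_seg = self.sel_seg(torch.cat([f_img, f_seg], dim=1))
        c_motion = self.sel_motion(torch.cat([f_img, f_motion], dim=1))
        return self.combine(torch.cat([f_img, c_bg, c_seg, c_motion], dim=1))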




The researchers trained the network on the Adobe dataset with a supervised loss to obtain the weights θ_Adobe:
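A sketch of the supervised loss, combining L1 terms on the alpha matte, its gradient, the foreground, and the re-composited image (the exact terms and weights in the paper may differ; λ_F and λ_c are assumed weighting factors, and α*, F* denote ground truth):

$$\mathcal{L}_{Adobe} = \|\alpha - \alpha^{*}\|_1 + \|\nabla\alpha - \nabla\alpha^{*}\|_1 + \lambda_F\,\|F - F^{*}\|_1 + \lambda_c\,\|I - \alpha F - (1-\alpha)B'\|_1$$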




where (F, α) = G(X; θ_Adobe), and the gradient term on α encourages the model to produce sharper results.


Adversarial training on unlabeled real data


Although the proposed CS block, combined with data augmentation, significantly narrows the gap between real images and composites built from the Adobe data, the method still struggles with real images in the following cases:
  • Background around fingers, arms, and hair is copied into the matte;

  • The segmentation fails;

  • Important parts of the foreground have colors similar to the background;

  • The image and the background are misaligned.



To solve these problems, the researchers propose a self-supervised learning method to train models from unlabeled real data.

The key idea is that major errors in the estimated matte lead to visible distortions when the result is composited onto a new background. For example, a poor matte may retain traces of the original background, and compositing then copies pieces of the old background into the new scene. The researchers therefore train an adversarial discriminator to distinguish composited images from real ones, which in turn improves the matting network.
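The compositing step that the discriminator judges is just standard alpha blending; a small sketch (tensor shapes and value ranges are assumptions):

# Alpha-composite the predicted foreground onto a new background.
# Assumes float tensors in [0, 1]; alpha has a single channel.
import torch

def composite(foreground, alpha, new_background):
    # Errors in alpha leak the old background (baked into `foreground`)
    # into the new scene, which the discriminator can detect.
    return alpha * foreground + (1.0 - alpha) * new_background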

The authors used the LS-GAN framework to train the generator G_{Real} and the discriminator D. The generator is trained by minimizing the following objective:
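As a rough sketch (the exact arrangement of terms in the paper may differ), the generator objective pairs the LS-GAN adversarial term on the composite with λ-weighted matting terms supervised by the outputs of the Adobe-trained network acting as a teacher:

$$\min_{\theta_{Real}} \; \mathbb{E}\Big[\big(D(\alpha F + (1-\alpha)\bar{B}) - 1\big)^2\Big] + \lambda\,\mathcal{L}_{sup}$$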






where (F, α) = G(X; θ_{Real}) and \bar{B} is the given background used to generate the composite image shown to the discriminator. The researchers set λ to 0.05 and halved it every two epochs during training so that the discriminator plays an increasingly important role. They gave a higher weight to the alpha loss term to encourage the model to produce sharper results.


The researchers trained the discriminator using the following objective:
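In the standard LS-GAN form (a sketch, not copied from the paper), the discriminator is pushed to score real images as 1 and composites as 0:

$$\min_{\theta_{Disc}} \; \mathbb{E}\big[(D(I) - 1)^2\big] + \mathbb{E}\Big[D\big(\alpha F + (1-\alpha)\bar{B}\big)^2\Big]$$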




where θ_{Disc} denotes the weights of the discriminator network and (F, α) = G(X; θ_{Real}).


Experimental results


The researchers compared their method with several deep matting algorithms that perform well on benchmarks, including Bayesian Matting, Context-Aware Matting, Index Matting, and Late Fusion Matting.


Results on the Adobe dataset


The researchers first trained G_Adobe on 26,900 samples, created by compositing 269 foreground objects onto 100 random backgrounds each, with a perturbed version of the background fed as the network input. G_Adobe was trained with the Adam optimizer, a batch size of 4, and a learning rate of 1e-4.


The methods are compared experimentally on 220 synthetic composites from the Adobe dataset, as shown in the table below:


Table 1: Alpha matte errors on the Adobe dataset; lower is better.



Results on real data


In addition, the researchers used an iPhone 8 to shoot videos indoors and outdoors, both handheld and with a fixed camera.


Figure 3: (a-e) show the alpha matte and foreground for videos shot with a handheld camera against natural backgrounds; (e) is a failure case with a dynamic background.



The researchers also ran a user study in which participants rated the test videos. The aggregated scores show that the method proposed in this paper outperforms the others, especially in fixed-camera scenes, although handheld videos still show some matting errors due to parallax from non-planar backgrounds.


Table 2: User study results on 10 real-world videos (fixed camera).



Table 3: User study on 10 real-world videos (hand-held cameras).



Using the open-source code


Environment configuration


Clone the project locally:
git clone https://github.com/senguptaumd/Background-Matting.git



The authors' code runs under Python 3 and was tested with PyTorch 1.1.0, TensorFlow 1.14, and CUDA 10.0. Next, we create a conda virtual environment and install the dependencies:
conda create --name back-matting python=3.6
conda activate back-matting



Make sure that CUDA 10.0 is the default CUDA version. If CUDA 10.0 is installed under /usr/local/cuda-10.0, run the following commands:
export LD_LIBRARY_PATH=/usr/local/cuda-10.0/lib64
export PATH=$PATH:/usr/local/cuda-10.0/bin



Install PyTorch, TensorFlow, and the remaining dependencies:
conda install pytorch=1.1.0 torchvision cudatoolkit=10.0 -c pytorch
pip install tensorflow-gpu==1.14.0
pip install -r requirements.txt



Run the inference program on the sample image


(1) Prepare data


To achieve the green-screen-style matting effect, we need the following data:


  • An image containing the person (filename ending in _img.png)

  • The same background without the person (filename ending in _back.png)

  • The target background image onto which the person will be inserted (stored in the data/background folder)



We can also use the sample_data/ folder for testing, and refer to it when preparing our own test data.


(2) Pre-trained models


Download the pre-trained models from the drive link provided by the authors and place them in the Models/ directory.


  • Pre-processing

  • Segmentation

  • Background Matting needs a segmentation mask for the subject; the TensorFlow version of DeepLabv3+ is used for this.



(3) Preprocessing


The authors use the TensorFlow version of DeepLabv3+ to generate the person segmentation masks for matting:
cd Background-Matting/
git clone https://github.com/tensorflow/models.git
cd models/research/
export PYTHONPATH=$PYTHONPATH:`pwd`:`pwd`/slim
cd ../..
python test_segmentation_deeplab.py -i sample_data/input

Of course, any other image segmentation network can be used instead of DeepLabv3+. Save the segmentation results to files ending in _masksDL.png.


After that, we need to pre-align the images, i.e., align the background with the input image. Note that auto-focus and auto-exposure should be turned off when capturing the images. Run python test_pre_process.py -i sample_data/input to preprocess the images; it automatically aligns the background image and adjusts its bias and gain to match the input image.


(4) Portrait matting


To perform the background replacement, run the following command. For images captured on a tripod, choose -m real-fixed-cam for the best results. Choosing -m syn-comp-adobe uses the model trained only on Adobe's synthetic data rather than real data (this gives the worst results).
python test_background-matting_image.py -m real-hand-held -i sample_data/input/ -o sample_data/output/ -tb sample_data/background/0001.png