The data set
Compared with MVSNet, a Depths_raw folder has been added:
scans/
: stores the original-resolution depth-map GT and masks (1200, 1600) used in the code
scanx_train/
: low-resolution depth maps and masks (128, 160)
The big difference between Cas/MVSNet and CVP, or rather between CVP and the other two, is that in the CVP dataset train is train and test is test, whereas the MVSNet dataset's train split comes with complete ground truth (GT) for its test views. Therefore a full test pass is run after each training epoch, and the current model's performance is tracked through metrics such as the 2mm error, instead of running the full DTU quantitative evaluation.
Parameter Settings
The basic parameters are similar to those of MVSNet and CVP-MVSNet.
- batch_size: 1; this is per GPU, i.e. if the number of GPUs is greater than 1, batch_size still refers to the batch size on each GPU
- numdepth = 192, interval_scale = 1.06; both follow the MVSNet settings
- eval_freq: how often testing is performed, usually set to 3, i.e. a full test pass runs after every three training epochs
- share_cr: whether to share the cost volume regularization network
  - True: every stage uses the same CostRegNet
  - False: each stage's cost volume regularization is its own entry in a ModuleList (the entries share the same architecture and differ only in the input channels of their first layer)
- ndepths: number of depth hypothesis layers per stage, 48, 32, 8
- depth_inter_r: depth hypothesis interval ratio per stage, 4, 2, 1
- dlossw: loss weight of each stage, 0.5, 1.0, 2.0
- cr_base_chs: base channel count of the cost volume regularization network
- using_apex, sync_bn: Apex-related configuration, mainly for enabling sync_BN
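A minimal argparse sketch of these options (flag names and defaults follow the list above; anything not mentioned there, such as the cr_base_chs default, is an assumption):

```python
import argparse

# Sketch of the training options described above; not the repo's exact CLI.
parser = argparse.ArgumentParser(description="CasMVSNet training options (sketch)")
parser.add_argument("--batch_size", type=int, default=1,
                    help="per-GPU batch size; total = batch_size * num_gpus")
parser.add_argument("--numdepth", type=int, default=192, help="MVSNet setting")
parser.add_argument("--interval_scale", type=float, default=1.06, help="MVSNet setting")
parser.add_argument("--eval_freq", type=int, default=3,
                    help="run a full test pass every N training epochs")
parser.add_argument("--share_cr", action="store_true",
                    help="share one CostRegNet across all stages")
parser.add_argument("--ndepths", type=str, default="48,32,8")
parser.add_argument("--depth_inter_r", type=str, default="4,2,1")
parser.add_argument("--dlossw", type=str, default="0.5,1.0,2.0")
parser.add_argument("--cr_base_chs", type=str, default="8,8,8")  # assumed default
parser.add_argument("--using_apex", action="store_true")
parser.add_argument("--sync_bn", action="store_true")

args = parser.parse_args()
ndepths = [int(d) for d in args.ndepths.split(",")]  # e.g. [48, 32, 8]
```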
Pyramid structure
Cost volume size is [B, C, D, H, W]
- stage1
  - Resolution: H/4, W/4 (128, 160)
  - Feature channels: C=32
  - Depth hypotheses: D=48, i.e. 48 layers spanning 425 to 935
- stage2
  - Resolution: H/2, W/2 (256, 320)
  - C=16
  - D=32, depth hypothesis interval = 2.5 * 1.06 * 2
- stage3
  - Resolution: H, W (512, 640)
  - C=8
  - D=8, depth hypothesis interval = 2.5 * 1.06 * 1
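The per-stage intervals follow from the MVSNet base interval and the stage ratios; a quick sanity check of the numbers above (pure arithmetic, no repo code):

```python
# Sanity check of the stage-wise depth hypothesis intervals quoted above.
base_interval = 2.5 * 1.06                     # DTU interval * interval_scale = 2.65
for stage, (d, ratio) in enumerate(zip([48, 32, 8], [4, 2, 1]), start=1):
    interval = base_interval * ratio
    print(f"stage{stage}: D={d}, interval={interval:.2f}, span={d * interval:.1f}")
# stage1: D=48, interval=10.60, span=508.8   (~ the full 425-935 range)
# stage2: D=32, interval=5.30,  span=169.6
# stage3: D=8,  interval=2.65,  span=21.2
```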
The Data module
train
The following methods are almost the same as in MVSNet and CVP-MVSNet:
build_list()
read_cam_file()
read_img()
read_depth()
Several methods have been added:
prepare_img()
: Downsamples the original full-size image and crops the center, 1600*1200 → 640*512; downsampling uses cv2's NEAREST (nearest-neighbor) interpolation
read_mask_hr()
: Reads the mask and builds the pyramid by downsampling. The final three-stage mask sizes are (160*128)/(320*256)/(640*512); when reading, only pixels with value > 10 are kept
read_depth_hr()
: Reads the depth ground truth and processes it the same way as above
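A sketch of what prepare_img() and read_mask_hr() plausibly do under the description above (the intermediate half-resolution step before the center crop is an assumption; only the 1600*1200 → 640*512 mapping, the NEAREST interpolation, and the > 10 filter are stated):

```python
import cv2
import numpy as np

def prepare_img(hr_img: np.ndarray) -> np.ndarray:
    """Sketch: 1600x1200 -> downsample -> center-crop to 640x512."""
    h, w = hr_img.shape[:2]                      # 1200, 1600
    img = cv2.resize(hr_img, (w // 2, h // 2),   # assumed half-res step
                     interpolation=cv2.INTER_NEAREST)
    h, w = img.shape[:2]                         # 600, 800
    sh, sw = (h - 512) // 2, (w - 640) // 2
    return img[sh:sh + 512, sw:sw + 640]

def read_mask_hr(filename: str) -> dict:
    """Sketch: read mask, keep pixels > 10, build the 3-stage pyramid."""
    mask = prepare_img(cv2.imread(filename, cv2.IMREAD_GRAYSCALE))  # (512, 640)
    mask = (mask > 10).astype(np.float32)
    return {
        "stage1": cv2.resize(mask, (160, 128), interpolation=cv2.INTER_NEAREST),
        "stage2": cv2.resize(mask, (320, 256), interpolation=cv2.INTER_NEAREST),
        "stage3": mask,
    }
```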
get_item()
imgs
: 1 ref + nview-1 src images
proj_matrices
: three-stage projection matrices. Each stage's projection matrix has shape (2, 4, 4): the intrinsics and extrinsics are not multiplied together in advance, they are essentially just stored side by side in one tensor. The camera parameters in the dataset are already divided by 4, i.e. they are the stage-1 parameters; the intrinsics of the two later stages are multiplied by 2 in turn
depth
: ground-truth depth maps for the three stages
depth_values
: the depth hypothesis layers, from min to max at the given interval
mask
: masks for the three stages
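A sketch of how the (2, 4, 4) per-stage projection matrices described above can be assembled (function and variable names are assumptions):

```python
import numpy as np

def make_proj_matrices(extrinsics: np.ndarray, intrinsics: np.ndarray) -> dict:
    """Sketch: build the (2, 4, 4) projection tensor per stage.
    `intrinsics` is the stage-1 (already /4) 3x3 matrix; extrinsics (4x4)
    never change, intrinsics double at each finer stage."""
    stages = {}
    for i, name in enumerate(["stage1", "stage2", "stage3"]):
        K = intrinsics.copy()
        K[:2, :] *= 2 ** i                    # *1, *2, *4 relative to stage 1
        proj = np.zeros((2, 4, 4), dtype=np.float32)
        proj[0] = extrinsics                  # slot 0: extrinsics, not pre-multiplied
        proj[1, :3, :3] = K                   # slot 1: this stage's intrinsics
        stages[name] = proj
    return stages
```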
Train
- Some multi-GPU handling: model parallelism, Apex, etc.
- Builds the model
- Builds the Dataset and DataLoader, with parallel versions available
train()
- LR warm-up strategy
- Train one epoch
- Save a checkpoint
- Test one epoch. The effect of this round of training shows up in the test error and other metrics; the DTU test split also has GT, which avoids the full quantitative evaluation
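A sketch of this outer loop (the per-epoch callables and the checkpoint path are assumptions, not the repo's API):

```python
import torch

def training_loop(model, optimizer, train_loader, test_loader, num_epochs,
                  eval_freq=3, logdir="./ckpts", start_epoch=0,
                  train_epoch=None, test_epoch=None, adjust_lr=None):
    """Sketch: warm up LR, train an epoch, checkpoint, and run a full test
    pass every `eval_freq` epochs."""
    for epoch in range(start_epoch, num_epochs):
        adjust_lr(optimizer, epoch)                  # LR warm-up / decay schedule
        train_epoch(model, train_loader, optimizer)  # train one epoch
        torch.save({"epoch": epoch,                  # save a checkpoint
                    "model": model.state_dict(),
                    "optimizer": optimizer.state_dict()},
                   f"{logdir}/model_{epoch:06d}.ckpt")
        if (epoch + 1) % eval_freq == 0:
            # the test split has GT, so 2mm-style error metrics stand in
            # for the full DTU quantitative evaluation
            test_epoch(model, test_loader)
```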
cas_mvsnet_loss()
Adds up the losses of the three pyramid levels; the default per-level weights are [0.5, 1.0, 2.0].
Final return values:
total_loss
: the accumulated loss after weighting, used for the network's gradient backpropagation
depth_loss
: the accumulated loss before weighting
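A minimal sketch consistent with that description (the masked smooth-L1 per stage is an assumption; only the weighting and the two return values are stated above):

```python
import torch
import torch.nn.functional as F

def cas_mvsnet_loss(outputs, depth_gt_ms, mask_ms, dlossw=(0.5, 1.0, 2.0)):
    """Sketch: sum a per-stage depth loss over the 3-level pyramid."""
    device = depth_gt_ms["stage1"].device
    total_loss = torch.tensor(0.0, device=device)
    depth_loss = torch.tensor(0.0, device=device)
    for i, stage in enumerate(["stage1", "stage2", "stage3"]):
        est = outputs[stage]["depth"]
        gt, mask = depth_gt_ms[stage], mask_ms[stage] > 0.5
        stage_loss = F.smooth_l1_loss(est[mask], gt[mask], reduction="mean")
        depth_loss = depth_loss + stage_loss              # unweighted, for logging
        total_loss = total_loss + dlossw[i] * stage_loss  # weighted, backprop target
    return total_loss, depth_loss
```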
FeatureNet
There are two architectures, differing in how the pyramid levels are handled. The default is 3 levels:
Output: (H/4, W/4, 32)/(H/2, W/2, 16)/(H, W, 8), with H=512, W=640
First, MVSNet's 8-layer CNN is used for 4x downsampling
Stage1: a 1*1 convolution → (H/4, W/4, 32)
- UNet:
  - Stage2: (deconvolve the 8-layer output + concatenate conv1 + convolve again) → (H/2, W/2, 16)
  - Stage3: same again → (H, W, 8)
- FPN:
  - Stage2: (2x upsample the 8-layer output + a 1*1 convolution of conv1), then convolve the whole thing again → (H/2, W/2, 16)
  - Stage3: (2x upsample again + a convolution of conv0, then convolve again) → (H, W, 8)
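A sketch of the FPN variant (assuming the backbone yields conv0/conv1/conv2 at full/half/quarter resolution with 8/16/32 channels, as in the MVSNet 8-layer CNN; the lateral 1*1 convs and smoothing convs are assumptions):

```python
import torch.nn as nn
import torch.nn.functional as F

class FeatureFPN(nn.Module):
    """Sketch of the FPN head over an assumed MVSNet-style backbone."""
    def __init__(self):
        super().__init__()
        self.out1 = nn.Conv2d(32, 32, 1)            # stage1 head on the 1/4-res map
        self.lat1 = nn.Conv2d(16, 32, 1)            # lateral 1x1 on conv1 (1/2 res)
        self.lat0 = nn.Conv2d(8, 32, 1)             # lateral 1x1 on conv0 (full res)
        self.out2 = nn.Conv2d(32, 16, 3, padding=1) # smooth conv -> stage2 features
        self.out3 = nn.Conv2d(32, 8, 3, padding=1)  # smooth conv -> stage3 features

    def forward(self, conv0, conv1, conv2):
        x = self.out1(conv2)                               # (B, 32, H/4, W/4)
        feat1 = x
        x = F.interpolate(x, scale_factor=2) + self.lat1(conv1)
        feat2 = self.out2(x)                               # (B, 16, H/2, W/2)
        x = F.interpolate(x, scale_factor=2) + self.lat0(conv0)
        feat3 = self.out3(x)                               # (B, 8, H, W)
        return {"stage1": feat1, "stage2": feat2, "stage3": feat3}
```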
CostRegNet
Stage1 regularizes [B, 32, 48, H/4, W/4], stage2 regularizes [B, 16, 32, H/2, W/2], stage3 regularizes [B, 8, 8, H, W]
A standard 3D U-Net; at the bottom of the network the resolution has been reduced by 8x, e.g. [1, 64, 6, 16, 20]
For example stage1: [32, 48, 128, 160] → [8, 48, 128, 160] → [16, 24, 64, 80] → [32, 12, 32, 40] → [64, 6, 16, 20]; then deconvolution adds the features back up the same path, and finally the C dimension is regressed away, giving [1, 48, 128, 160]
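A compact sketch matching that shape walk-through (the structure is inferred from the shapes above, not copied from the repo):

```python
import torch.nn as nn

def conv3d(cin, cout, stride=1):
    return nn.Sequential(
        nn.Conv3d(cin, cout, 3, stride=stride, padding=1, bias=False),
        nn.BatchNorm3d(cout), nn.ReLU(inplace=True))

def deconv3d(cin, cout):
    return nn.Sequential(
        nn.ConvTranspose3d(cin, cout, 3, stride=2, padding=1,
                           output_padding=1, bias=False),
        nn.BatchNorm3d(cout), nn.ReLU(inplace=True))

class CostRegNet(nn.Module):
    """Sketch of the 3-level 3D U-Net over [B, C, D, H, W]."""
    def __init__(self, in_channels=32, base=8):
        super().__init__()
        self.conv0 = conv3d(in_channels, base)
        self.down1 = conv3d(base, base * 2, stride=2)
        self.down2 = conv3d(base * 2, base * 4, stride=2)
        self.down3 = conv3d(base * 4, base * 8, stride=2)
        self.up3 = deconv3d(base * 8, base * 4)
        self.up2 = deconv3d(base * 4, base * 2)
        self.up1 = deconv3d(base * 2, base)
        self.prob = nn.Conv3d(base, 1, 3, padding=1)  # regress the C dim to 1

    def forward(self, x):                   # stage1: [B, 32, 48, 128, 160]
        c0 = self.conv0(x)                  # [B, 8, 48, 128, 160]
        c1 = self.down1(c0)                 # [B, 16, 24, 64, 80]
        c2 = self.down2(c1)                 # [B, 32, 12, 32, 40]
        c3 = self.down3(c2)                 # [B, 64, 6, 16, 20]
        x = self.up3(c3) + c2               # deconvolve back up with skip adds
        x = self.up2(x) + c1
        x = self.up1(x) + c0
        return self.prob(x)                 # [B, 1, 48, 128, 160]
```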
DepthNet
A complete MVSNet pipeline: feature extraction → cost volume construction → cost volume regularization → probability volume depth regression → depth map
- Input parameters:
  - The ref and src feature maps of the current stage
  - The projection matrices of the current stage
  - The depth hypotheses of the current stage
  - The number of depth hypothesis layers [48, 32, 8]
  - Whether to share the cost volume regularization network (default False, i.e. each stage's regularization network is an entry in a ModuleList; the entries differ in the in_channel of their first layer, which matches the output channels of FeatureNet's three stages)
- The depth map and confidence map of each stage are saved and returned
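A condensed sketch of that per-stage pass (homo_warping, the differentiable homography warp of a src feature map onto the ref depth hypotheses as in MVSNet-style repos, is assumed here; the confidence computation is simplified):

```python
import torch
import torch.nn.functional as F

def depth_net_forward(features, proj_matrices, depth_values, cost_regularization):
    """Sketch: features/proj_matrices are lists with the ref view first;
    depth_values is [B, D, H, W] as described above."""
    ref_feat, src_feats = features[0], features[1:]
    ref_proj, src_projs = proj_matrices[0], proj_matrices[1:]
    num_depth = depth_values.shape[1]

    # Variance-based multi-view cost volume: E[x^2] - E[x]^2 over the views
    vol_sum = ref_feat.unsqueeze(2).repeat(1, 1, num_depth, 1, 1)
    vol_sq_sum = vol_sum ** 2
    for src_feat, src_proj in zip(src_feats, src_projs):
        warped = homo_warping(src_feat, src_proj, ref_proj, depth_values)
        vol_sum, vol_sq_sum = vol_sum + warped, vol_sq_sum + warped ** 2
    n = len(src_feats) + 1
    cost = vol_sq_sum / n - (vol_sum / n) ** 2          # [B, C, D, H, W]

    # Regularize, softmax over D, then soft-argmin depth regression
    prob = F.softmax(cost_regularization(cost).squeeze(1), dim=1)  # [B, D, H, W]
    depth = torch.sum(prob * depth_values, dim=1)                  # [B, H, W]
    confidence = prob.max(dim=1)[0]   # simplification of the repo's windowed sum
    return {"depth": depth, "photometric_confidence": confidence}
```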
CasMVSNet
- Default parameters:
  - ndepths = [48, 32, 8]
  - depth_interval_ratio = [4, 2, 1]
  - The cost volume regularization network is not shared
  - Feature extraction uses fpn (Feature Pyramid Network)
  - No refinement
  - num_stage: defaults to 3 stages
  - stage_infos: scale 4/2/1 for each stage
- Feature extraction: the ref and src images go through the feature extraction pyramid, and each viewpoint gets a three-stage feature map
- For each level of the pyramid:
  - Take out this level's feature maps, projection matrices and scale
  - Depth hypotheses: initially 192 values from 425 to 935, the same as the MVSNet configuration; with 192 layers, the depth interval is scaled by interval_scale = 1.06 so that the hypotheses cover DTU's min-max depth range
  - Stage 1: [B, 48, 512, 640], handled separately. In essence, 48 depth layers are taken over the 425 ~ 935 range; to keep the data sizes uniform the values are copied across the whole image, which amounts to every pixel getting the same 48 hypotheses over 425 ~ 935
  - Stage 2: [B, 32, 512, 640]. Based on the stage-1 depth map, the hypothesized depth range is narrowed and the layer count drops to 32. Concretely, the stage-1 depth map is upsampled to the original image resolution, each pixel takes 16 intervals backward and forward as depth_min and depth_max, and then 32 layers are evenly spaced between depth_min and depth_max. Note that the interval used to get min/max shrinks at each stage, following the depth_interval_ratio parameter (see the sketch at the end of this section)
  - Stage 3: [B, 8, 512, 640], same as above, narrowing the range and the number of hypothesis layers further
  - Run a complete DepthNet (MVSNet) pass
  - Upsample to the original image resolution (512, 640), then move on to the next round of depth hypothesis, regularization, etc. The odd thing here is that instead of upsampling the depth map by 2x, it is upsampled directly to the original size
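The sketch referenced above: one plausible implementation of the per-stage depth hypothesis construction, including the direct upsample to full resolution (function and variable names are assumptions; only the ranges, layer counts, and the upsample-to-original-size behavior come from this section):

```python
import torch
import torch.nn.functional as F

def get_depth_range_samples(cur_depth, ndepth, depth_interval, H=512, W=640):
    """Sketch: cur_depth is None for stage 1, otherwise the previous stage's
    depth map [B, h, w]; depth_interval is 2.65 * ratio for this stage."""
    if cur_depth is None:
        # Stage 1: 48 uniform hypotheses over 425 ~ 935, copied to every pixel
        d = 425.0 + torch.arange(ndepth, dtype=torch.float32) * depth_interval
        return d.view(1, ndepth, 1, 1).expand(1, ndepth, H, W)
    # Later stages: upsample the previous depth map straight to the original
    # resolution (not just 2x), then take ndepth/2 intervals on each side
    cur_depth = F.interpolate(cur_depth.unsqueeze(1), size=(H, W),
                              mode="bilinear", align_corners=False).squeeze(1)
    depth_min = cur_depth - (ndepth / 2) * depth_interval   # per-pixel minimum
    steps = torch.arange(ndepth, dtype=torch.float32).view(1, ndepth, 1, 1)
    return depth_min.unsqueeze(1) + steps * depth_interval  # [B, ndepth, H, W]
```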