The data set
Compared with MVSNet, a Depths_raw folder has been added:
scans/
: stores the original-resolution depth-map GT and masks (1200, 1600) used in the code
scanx_train/
: low-resolution depth maps and masks (128, 160)
The big difference between Cas/MVSNet and CVP, or rather between CVP and the other two, is that in the CVP dataset train is train and test is test, whereas the MVSNet dataset's train split comes with complete ground truth (GT) for its test views. Therefore a full test pass is run after each training epoch, and the current model's performance is tracked through metrics such as the 2mm error, instead of running the full DTU quantitative evaluation.
Parameter Settings
The basic parameters are similar to those of MVSNet and CVP-MVSNet.
- batch_size: 1; this is per GPU, i.e. if the number of GPUs is greater than 1, batch_size still refers to the batch size on each GPU
- numdepth = 192, interval_scale = 1.06; both follow the MVSNet settings
- eval_freq: how often testing is performed, usually set to 3, i.e. a full test pass runs after every three training epochs
- share_cr: whether to share the cost volume regularization network
  - True: every stage uses the same CostRegNet
  - False: each stage's cost volume regularization is its own entry in a ModuleList (the entries share the same architecture and differ only in the input channels of their first layer)
- ndepths: number of depth hypothesis layers per stage, 48, 32, 8
- depth_inter_r: depth hypothesis interval ratio per stage, 4, 2, 1
- dlossw: loss weight of each stage, 0.5, 1.0, 2.0
- cr_base_chs: base channel count of the cost volume regularization network
- using_apex, sync_bn: Apex-related configuration, mainly for enabling sync_BN
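A minimal argparse sketch of these options (flag names and defaults follow the list above; anything not mentioned there, such as the cr_base_chs default, is an assumption):

```python
import argparse

# Sketch of the training options described above; not the repo's exact CLI.
parser = argparse.ArgumentParser(description="CasMVSNet training options (sketch)")
parser.add_argument("--batch_size", type=int, default=1,
                    help="per-GPU batch size; total = batch_size * num_gpus")
parser.add_argument("--numdepth", type=int, default=192, help="MVSNet setting")
parser.add_argument("--interval_scale", type=float, default=1.06, help="MVSNet setting")
parser.add_argument("--eval_freq", type=int, default=3,
                    help="run a full test pass every N training epochs")
parser.add_argument("--share_cr", action="store_true",
                    help="share one CostRegNet across all stages")
parser.add_argument("--ndepths", type=str, default="48,32,8")
parser.add_argument("--depth_inter_r", type=str, default="4,2,1")
parser.add_argument("--dlossw", type=str, default="0.5,1.0,2.0")
parser.add_argument("--cr_base_chs", type=str, default="8,8,8")  # assumed default
parser.add_argument("--using_apex", action="store_true")
parser.add_argument("--sync_bn", action="store_true")

args = parser.parse_args()
ndepths = [int(d) for d in args.ndepths.split(",")]  # e.g. [48, 32, 8]
```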
Pyramid structure
Cost volume size is [B, C, D, H, W]
- stage1
  - Resolution: H/4, W/4 (128, 160)
  - Feature channels: C=32
  - Depth hypotheses: D=48, i.e. 48 layers spanning 425 to 935
- stage2
  - Resolution: H/2, W/2 (256, 320)
  - C=16
  - D=32, depth hypothesis interval = 2.5 * 1.06 * 2
- stage3
  - Resolution: H, W (512, 640)
  - C=8
  - D=8, depth hypothesis interval = 2.5 * 1.06 * 1
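The per-stage intervals follow from the MVSNet base interval and the stage ratios; a quick sanity check of the numbers above (pure arithmetic, no repo code):

```python
# Sanity check of the stage-wise depth hypothesis intervals quoted above.
base_interval = 2.5 * 1.06                     # DTU interval * interval_scale = 2.65
for stage, (d, ratio) in enumerate(zip([48, 32, 8], [4, 2, 1]), start=1):
    interval = base_interval * ratio
    print(f"stage{stage}: D={d}, interval={interval:.2f}, span={d * interval:.1f}")
# stage1: D=48, interval=10.60, span=508.8   (~ the full 425-935 range)
# stage2: D=32, interval=5.30,  span=169.6
# stage3: D=8,  interval=2.65,  span=21.2
```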
The Data module
train
The following methods are almost the same as in MVSNet and CVP-MVSNet:
build_list()
read_cam_file()
read_img()
read_depth()
Several methods have been added:
prepare_img()
: Downsamples the original full-size image and crops the center, 1600*1200 → 640*512; downsampling uses cv2's NEAREST (nearest-neighbor) interpolation
read_mask_hr()
: Reads the mask and builds the pyramid by downsampling. The final three-stage mask sizes are (160*128)/(320*256)/(640*512); when reading, only pixels with value > 10 are kept
read_depth_hr()
: Reads the depth ground truth and processes it the same way as above
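A sketch of what prepare_img() and read_mask_hr() plausibly do under the description above (the intermediate half-resolution step before the center crop is an assumption; only the 1600*1200 → 640*512 mapping, the NEAREST interpolation, and the > 10 filter are stated):

```python
import cv2
import numpy as np

def prepare_img(hr_img: np.ndarray) -> np.ndarray:
    """Sketch: 1600x1200 -> downsample -> center-crop to 640x512."""
    h, w = hr_img.shape[:2]                      # 1200, 1600
    img = cv2.resize(hr_img, (w // 2, h // 2),   # assumed half-res step
                     interpolation=cv2.INTER_NEAREST)
    h, w = img.shape[:2]                         # 600, 800
    sh, sw = (h - 512) // 2, (w - 640) // 2
    return img[sh:sh + 512, sw:sw + 640]

def read_mask_hr(filename: str) -> dict:
    """Sketch: read mask, keep pixels > 10, build the 3-stage pyramid."""
    mask = prepare_img(cv2.imread(filename, cv2.IMREAD_GRAYSCALE))  # (512, 640)
    mask = (mask > 10).astype(np.float32)
    return {
        "stage1": cv2.resize(mask, (160, 128), interpolation=cv2.INTER_NEAREST),
        "stage2": cv2.resize(mask, (320, 256), interpolation=cv2.INTER_NEAREST),
        "stage3": mask,
    }
```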
get_item()
imgs
: 1 ref + nview-1 src images
proj_matrices
: three-stage projection matrices. Each stage's projection matrix has shape (2, 4, 4): the intrinsics and extrinsics are not multiplied together in advance, they are essentially just stored side by side in one tensor. The camera parameters in the dataset are already divided by 4, i.e. they are the stage-1 parameters; the intrinsics of the two later stages are multiplied by 2 in turn
depth
: ground-truth depth maps for the three stages
depth_values
: the depth hypothesis layers, from min to max at the given interval
mask
: masks for the three stages
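A sketch of how the (2, 4, 4) per-stage projection matrices described above can be assembled (function and variable names are assumptions):

```python
import numpy as np

def make_proj_matrices(extrinsics: np.ndarray, intrinsics: np.ndarray) -> dict:
    """Sketch: build the (2, 4, 4) projection tensor per stage.
    `intrinsics` is the stage-1 (already /4) 3x3 matrix; extrinsics (4x4)
    never change, intrinsics double at each finer stage."""
    stages = {}
    for i, name in enumerate(["stage1", "stage2", "stage3"]):
        K = intrinsics.copy()
        K[:2, :] *= 2 ** i                    # *1, *2, *4 relative to stage 1
        proj = np.zeros((2, 4, 4), dtype=np.float32)
        proj[0] = extrinsics                  # slot 0: extrinsics, not pre-multiplied
        proj[1, :3, :3] = K                   # slot 1: this stage's intrinsics
        stages[name] = proj
    return stages
```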
Train
- Some multi-GPU handling: model parallelism, Apex, etc.
- Builds the model
- Builds the Dataset and DataLoader, with parallel versions available
train()
- LR warm-up strategy
- Train one epoch
- Save a checkpoint
- Test one epoch. The effect of this round of training shows up in the test error and other metrics; the DTU test split also has GT, which avoids the full quantitative evaluation
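A sketch of this outer loop (the per-epoch callables and the checkpoint path are assumptions, not the repo's API):

```python
import torch

def training_loop(model, optimizer, train_loader, test_loader, num_epochs,
                  eval_freq=3, logdir="./ckpts", start_epoch=0,
                  train_epoch=None, test_epoch=None, adjust_lr=None):
    """Sketch: warm up LR, train an epoch, checkpoint, and run a full test
    pass every `eval_freq` epochs."""
    for epoch in range(start_epoch, num_epochs):
        adjust_lr(optimizer, epoch)                  # LR warm-up / decay schedule
        train_epoch(model, train_loader, optimizer)  # train one epoch
        torch.save({"epoch": epoch,                  # save a checkpoint
                    "model": model.state_dict(),
                    "optimizer": optimizer.state_dict()},
                   f"{logdir}/model_{epoch:06d}.ckpt")
        if (epoch + 1) % eval_freq == 0:
            # the test split has GT, so 2mm-style error metrics stand in
            # for the full DTU quantitative evaluation
            test_epoch(model, test_loader)
```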
cas_mvsnet_loss()
Adds up the losses of the three pyramid levels; the default per-level weights are [0.5, 1.0, 2.0].
Final return values:
total_loss
: the accumulated loss after weighting, used for the network's gradient backpropagation
depth_loss
: the accumulated loss before weighting
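A minimal sketch consistent with that description (the masked smooth-L1 per stage is an assumption; only the weighting and the two return values are stated above):

```python
import torch
import torch.nn.functional as F

def cas_mvsnet_loss(outputs, depth_gt_ms, mask_ms, dlossw=(0.5, 1.0, 2.0)):
    """Sketch: sum a per-stage depth loss over the 3-level pyramid."""
    device = depth_gt_ms["stage1"].device
    total_loss = torch.tensor(0.0, device=device)
    depth_loss = torch.tensor(0.0, device=device)
    for i, stage in enumerate(["stage1", "stage2", "stage3"]):
        est = outputs[stage]["depth"]
        gt, mask = depth_gt_ms[stage], mask_ms[stage] > 0.5
        stage_loss = F.smooth_l1_loss(est[mask], gt[mask], reduction="mean")
        depth_loss = depth_loss + stage_loss              # unweighted, for logging
        total_loss = total_loss + dlossw[i] * stage_loss  # weighted, backprop target
    return total_loss, depth_loss
```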
FeatureNet
There are two architectures, differing in how the pyramid levels are handled. The default is 3 levels:
Output: (H/4, W/4, 32)/(H/2, W/2, 16)/(H, W, 8), with H=512, W=640
First, MVSNet's 8-layer CNN is used for 4x downsampling
Stage1: a 1*1 convolution → (H/4, W/4, 32)
- UNet:
  - Stage2: (deconvolve the 8-layer output + concatenate conv1 + convolve again) → (H/2, W/2, 16)
  - Stage3: same again → (H, W, 8)
- FPN:
  - Stage2: (2x upsample the 8-layer output + a 1*1 convolution of conv1), then convolve the whole thing again → (H/2, W/2, 16)
  - Stage3: (2x upsample again + a convolution of conv0, then convolve again) → (H, W, 8)
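A sketch of the FPN variant (assuming the backbone yields conv0/conv1/conv2 at full/half/quarter resolution with 8/16/32 channels, as in the MVSNet 8-layer CNN; the lateral 1*1 convs and smoothing convs are assumptions):

```python
import torch.nn as nn
import torch.nn.functional as F

class FeatureFPN(nn.Module):
    """Sketch of the FPN head over an assumed MVSNet-style backbone."""
    def __init__(self):
        super().__init__()
        self.out1 = nn.Conv2d(32, 32, 1)            # stage1 head on the 1/4-res map
        self.lat1 = nn.Conv2d(16, 32, 1)            # lateral 1x1 on conv1 (1/2 res)
        self.lat0 = nn.Conv2d(8, 32, 1)             # lateral 1x1 on conv0 (full res)
        self.out2 = nn.Conv2d(32, 16, 3, padding=1) # smooth conv -> stage2 features
        self.out3 = nn.Conv2d(32, 8, 3, padding=1)  # smooth conv -> stage3 features

    def forward(self, conv0, conv1, conv2):
        x = self.out1(conv2)                               # (B, 32, H/4, W/4)
        feat1 = x
        x = F.interpolate(x, scale_factor=2) + self.lat1(conv1)
        feat2 = self.out2(x)                               # (B, 16, H/2, W/2)
        x = F.interpolate(x, scale_factor=2) + self.lat0(conv0)
        feat3 = self.out3(x)                               # (B, 8, H, W)
        return {"stage1": feat1, "stage2": feat2, "stage3": feat3}
```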
CostRegNet
Stage1 regularizes [B, 32, 48, H/4, W/4], stage2 regularizes [B, 16, 32, H/2, W/2], stage3 regularizes [B, 8, 8, H, W]
A standard 3D U-Net; at the bottom of the network the resolution has been reduced by 8x, e.g. [1, 64, 6, 16, 20]
For example stage1: [32, 48, 128, 160] → [8, 48, 128, 160] → [16, 24, 64, 80] → [32, 12, 32, 40] → [64, 6, 16, 20]; then deconvolution adds the features back up the same path, and finally the C dimension is regressed away, giving [1, 48, 128, 160]
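A compact sketch matching that shape walk-through (the structure is inferred from the shapes above, not copied from the repo):

```python
import torch.nn as nn

def conv3d(cin, cout, stride=1):
    return nn.Sequential(
        nn.Conv3d(cin, cout, 3, stride=stride, padding=1, bias=False),
        nn.BatchNorm3d(cout), nn.ReLU(inplace=True))

def deconv3d(cin, cout):
    return nn.Sequential(
        nn.ConvTranspose3d(cin, cout, 3, stride=2, padding=1,
                           output_padding=1, bias=False),
        nn.BatchNorm3d(cout), nn.ReLU(inplace=True))

class CostRegNet(nn.Module):
    """Sketch of the 3-level 3D U-Net over [B, C, D, H, W]."""
    def __init__(self, in_channels=32, base=8):
        super().__init__()
        self.conv0 = conv3d(in_channels, base)
        self.down1 = conv3d(base, base * 2, stride=2)
        self.down2 = conv3d(base * 2, base * 4, stride=2)
        self.down3 = conv3d(base * 4, base * 8, stride=2)
        self.up3 = deconv3d(base * 8, base * 4)
        self.up2 = deconv3d(base * 4, base * 2)
        self.up1 = deconv3d(base * 2, base)
        self.prob = nn.Conv3d(base, 1, 3, padding=1)  # regress the C dim to 1

    def forward(self, x):                   # stage1: [B, 32, 48, 128, 160]
        c0 = self.conv0(x)                  # [B, 8, 48, 128, 160]
        c1 = self.down1(c0)                 # [B, 16, 24, 64, 80]
        c2 = self.down2(c1)                 # [B, 32, 12, 32, 40]
        c3 = self.down3(c2)                 # [B, 64, 6, 16, 20]
        x = self.up3(c3) + c2               # deconvolve back up with skip adds
        x = self.up2(x) + c1
        x = self.up1(x) + c0
        return self.prob(x)                 # [B, 1, 48, 128, 160]
```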
DepthNet
A complete MVSNet pipeline: feature extraction → cost volume construction → cost volume regularization → probability volume depth regression → depth map
- Input parameters:
  - The ref and src feature maps of the current stage
  - The projection matrices of the current stage
  - The depth hypotheses of the current stage
  - The number of depth hypothesis layers [48, 32, 8]
  - Whether to share the cost volume regularization network (default False, i.e. each stage's regularization network is an entry in a ModuleList; the entries differ in the in_channel of their first layer, which matches the output channels of FeatureNet's three stages)
- The depth map and confidence map of each stage are saved and returned
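A condensed sketch of that per-stage pass (homo_warping, the differentiable homography warp of a src feature map onto the ref depth hypotheses as in MVSNet-style repos, is assumed here; the confidence computation is simplified):

```python
import torch
import torch.nn.functional as F

def depth_net_forward(features, proj_matrices, depth_values, cost_regularization):
    """Sketch: features/proj_matrices are lists with the ref view first;
    depth_values is [B, D, H, W] as described above."""
    ref_feat, src_feats = features[0], features[1:]
    ref_proj, src_projs = proj_matrices[0], proj_matrices[1:]
    num_depth = depth_values.shape[1]

    # Variance-based multi-view cost volume: E[x^2] - E[x]^2 over the views
    vol_sum = ref_feat.unsqueeze(2).repeat(1, 1, num_depth, 1, 1)
    vol_sq_sum = vol_sum ** 2
    for src_feat, src_proj in zip(src_feats, src_projs):
        warped = homo_warping(src_feat, src_proj, ref_proj, depth_values)
        vol_sum, vol_sq_sum = vol_sum + warped, vol_sq_sum + warped ** 2
    n = len(src_feats) + 1
    cost = vol_sq_sum / n - (vol_sum / n) ** 2          # [B, C, D, H, W]

    # Regularize, softmax over D, then soft-argmin depth regression
    prob = F.softmax(cost_regularization(cost).squeeze(1), dim=1)  # [B, D, H, W]
    depth = torch.sum(prob * depth_values, dim=1)                  # [B, H, W]
    confidence = prob.max(dim=1)[0]   # simplification of the repo's windowed sum
    return {"depth": depth, "photometric_confidence": confidence}
```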
CasMVSNet
- Default parameters:
  - ndepths = [48, 32, 8]
  - depth_interval_ratio = [4, 2, 1]
  - The cost volume regularization network is not shared
  - Feature extraction uses fpn (Feature Pyramid Network)
  - No refinement
  - num_stage: defaults to 3 stages
  - stage_infos: scale 4/2/1 for each stage
- Feature extraction: the ref and src images go through the feature extraction pyramid, and each viewpoint gets a three-stage feature map
- For each level of the pyramid:
  - Take out this level's feature maps, projection matrices and scale
  - Depth hypotheses: initially 192 values from 425 to 935, the same as the MVSNet configuration; with 192 layers, the depth interval is scaled by interval_scale = 1.06 so that the hypotheses cover DTU's min-max depth range
  - Stage 1: [B, 48, 512, 640], handled separately. In essence, 48 depth layers are taken over the 425 ~ 935 range; to keep the data sizes uniform the values are copied across the whole image, which amounts to every pixel getting the same 48 hypotheses over 425 ~ 935
  - Stage 2: [B, 32, 512, 640]. Based on the stage-1 depth map, the hypothesized depth range is narrowed and the layer count drops to 32. Concretely, the stage-1 depth map is upsampled to the original image resolution, each pixel takes 16 intervals backward and forward as depth_min and depth_max, and then 32 layers are evenly spaced between depth_min and depth_max. Note that the interval used to get min/max shrinks at each stage, following the depth_interval_ratio parameter (see the sketch at the end of this section)
  - Stage 3: [B, 8, 512, 640], same as above, narrowing the range and the number of hypothesis layers further
  - Run a complete DepthNet (MVSNet) pass
  - Upsample to the original image resolution (512, 640), then move on to the next round of depth hypothesis, regularization, etc. The odd thing here is that instead of upsampling the depth map by 2x, it is upsampled directly to the original size
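The sketch referenced above: one plausible implementation of the per-stage depth hypothesis construction, including the direct upsample to full resolution (function and variable names are assumptions; only the ranges, layer counts, and the upsample-to-original-size behavior come from this section):

```python
import torch
import torch.nn.functional as F

def get_depth_range_samples(cur_depth, ndepth, depth_interval, H=512, W=640):
    """Sketch: cur_depth is None for stage 1, otherwise the previous stage's
    depth map [B, h, w]; depth_interval is 2.65 * ratio for this stage."""
    if cur_depth is None:
        # Stage 1: 48 uniform hypotheses over 425 ~ 935, copied to every pixel
        d = 425.0 + torch.arange(ndepth, dtype=torch.float32) * depth_interval
        return d.view(1, ndepth, 1, 1).expand(1, ndepth, H, W)
    # Later stages: upsample the previous depth map straight to the original
    # resolution (not just 2x), then take ndepth/2 intervals on each side
    cur_depth = F.interpolate(cur_depth.unsqueeze(1), size=(H, W),
                              mode="bilinear", align_corners=False).squeeze(1)
    depth_min = cur_depth - (ndepth / 2) * depth_interval   # per-pixel minimum
    steps = torch.arange(ndepth, dtype=torch.float32).view(1, ndepth, 1, 1)
    return depth_min.unsqueeze(1) + steps * depth_interval  # [B, ndepth, H, W]
```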