Planar Reconstruction - Plane Reconstruction with Deep Learning

Posted on 2019-11-30 | Category: Algorithms and Hardware


Figure (left to right): input image, piece-wise planar segmentation, reconstructed depthmap, texture-mapped 3D model.

0x00 Datasets

ScanNet [1,3,4]
SYNTHIA [2,3]
Cityscapes [2]
NYU Depth Dataset [1,3,4]
Labeling method

ScanNet: Richly-annotated 3D Reconstructions of Indoor Scenes. Annotated with 3D camera poses, surface reconstructions, and instance-level semantic segmentations.

SYNTHIA: The SYNTHetic collection of Imagery and Annotations. Captured with 8 RGB cameras forming a binocular 360° camera, plus 8 depth sensors.

Cityscapes: Benchmark suite and evaluation server for pixel-level and instance-level semantic labeling.
Provides video frames, stereo, GPS, and vehicle odometry.

NYU Depth Dataset: Recorded with both the RGB and depth cameras of the Microsoft Kinect.

Dense multi-class labels with instance number (cup1, cup2, cup3, etc).
Raw: The raw rgb, depth and accelerometer data as provided by the Kinect.
Toolbox: Useful functions for manipulating the data and labels.

Obtaining ground truth plane annotations:

Detecting planes directly from the 3D point cloud with the J-Linkage method is difficult.


(c-d): Plane fitting results generated by J-Linkage with δ = 0.5 and δ = 2, respectively.

Labeling method:

ScanNet:
1. Fit planes to a consolidated mesh (merge two planes if normal difference < 20° and distance < 5 cm)
2. Project the planes back to individual frames
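
The merge rule in step 1 can be sketched as a simple predicate. This is a minimal illustration in plain Python, not the authors' code: the name `should_merge`, the `(normal, offset)` plane representation, and measuring "distance" as the difference between plane offsets are all assumptions made here. The source only gives the two thresholds.

```python
import math

def normalize_plane(normal, offset):
    """Rescale a plane normal . x = offset so that the normal has unit length."""
    scale = math.sqrt(sum(c * c for c in normal))
    return [c / scale for c in normal], offset / scale

def should_merge(plane_a, plane_b, max_angle_deg=20.0, max_offset_m=0.05):
    """Return True if two planes pass both merge thresholds.

    Each plane is a hypothetical (normal, offset) pair in metres.
    Assumption: 'distance' is approximated by the offset difference.
    """
    n_a, d_a = normalize_plane(*plane_a)
    n_b, d_b = normalize_plane(*plane_b)
    cos_angle = sum(a * b for a, b in zip(n_a, n_b))
    cos_angle = max(-1.0, min(1.0, cos_angle))  # guard against rounding
    angle_deg = math.degrees(math.acos(cos_angle))
    return angle_deg < max_angle_deg and abs(d_a - d_b) < max_offset_m
```

Two nearly parallel planes 3 cm apart would merge under this rule, while planes 10 cm apart or at right angles would not.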

SYNTHIA:
1. Manually draw a quadrilateral region
2. Obtain the plane parameters and variance of the distance distribution
3. Find all pixels that belong to the plane by using the plane parameters and the variance estimate
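
Steps 2 and 3 can be sketched as follows, assuming the plane has already been fitted to the 3D points inside the hand-drawn quadrilateral. The function names, the unit-normal plane representation, and the `k`-sigma inlier threshold are assumptions for illustration; the source does not state how the variance estimate is turned into a threshold.

```python
import math

def point_plane_distance(point, normal, offset):
    """Signed distance from a 3D point to the plane normal . x = offset (unit normal)."""
    return sum(n * p for n, p in zip(normal, point)) - offset

def distance_variance(seed_points, normal, offset):
    """Variance of the point-to-plane distances over the seed region."""
    dists = [point_plane_distance(p, normal, offset) for p in seed_points]
    mean = sum(dists) / len(dists)
    return sum((d - mean) ** 2 for d in dists) / len(dists)

def select_plane_points(points, normal, offset, variance, k=3.0):
    """Indices of points within k standard deviations of the plane (k is assumed)."""
    threshold = k * math.sqrt(variance)
    return [i for i, p in enumerate(points)
            if abs(point_plane_distance(p, normal, offset)) <= threshold]
```

The seed region gives both the plane and a noise estimate, so the inlier threshold adapts to how noisy the depth is on that particular surface.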

Cityscapes:
1. “planar” = {ground, road, sidewalk, parking, rail track, building, wall, fence, guard rail, bridge, terrain}
2. Manually label the boundary of each plane using polygons

0x01 PlaneNet

[CVPR 2018] Liu, Chen, et al. Washington University in St. Louis, Adobe.

The first deep neural architecture for piece-wise planar depthmap reconstruction from a single RGB image.

Pipeline

DRN: Dilated Residual Networks (2096 channels)

CRF: Conditional Random Field Algorithm

Step	Loss
Plane parameter	$L^{P} = \sum_{i = 1}^{K^{*}} \min_{j \in [1, K]} \| P_{i}^{*} - P_{j} \|_{2}^{2}$, with $K = 10$
Plane segmentation: softmax cross entropy	$L^{M} = \sum_{i = 1}^{K + 1} \sum_{p \in I} \left( 1 (M^{* (p)} = i) \log (1 - M_{i}^{(p)}) \right)$
Non-planar depth: ground-truth vs. predicted depthmap	$L^{D} = \sum_{i = 1}^{K + 1} \sum_{p \in I} \left( M_{i}^{(p)} (D_{i}^{(p)} - D^{* (p)})^{2} \right)$
Notation	$M_{i}^{(p)}$: probability of pixel $p$ belonging to the $i^{th}$ plane; $D_{i}^{(p)}$: depth value at pixel $p$; $*$: ground truth (GT).
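
To make the plane parameter and depth losses concrete, here is a minimal plain-Python sketch. The function names and the list-of-lists data layout are assumptions for illustration, not PlaneNet's implementation; the arithmetic follows the $L^{P}$ and $L^{D}$ formulas in the table.

```python
def plane_parameter_loss(gt_planes, pred_planes):
    """L^P: for each ground-truth plane, squared L2 distance to the nearest prediction."""
    total = 0.0
    for gt in gt_planes:
        total += min(
            sum((g - p) ** 2 for g, p in zip(gt, pred))
            for pred in pred_planes
        )
    return total

def depth_loss(seg_probs, layer_depths, gt_depth):
    """L^D: squared depth error of each layer, weighted by its segmentation probability.

    seg_probs[i][p] and layer_depths[i][p] are indexed by layer i and pixel p;
    gt_depth[p] is the ground-truth depth at pixel p.
    """
    total = 0.0
    for probs_i, depths_i in zip(seg_probs, layer_depths):
        for m, d, d_gt in zip(probs_i, depths_i, gt_depth):
            total += m * (d - d_gt) ** 2
    return total
```

Because $L^{P}$ takes a minimum over predictions, it does not depend on the order in which the network outputs its planes.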

0x02 Plane Recover

[ECCV 2018] Fengting Yang and Zihan Zhou. Pennsylvania State University.

Recovering 3D Planes from a Single Image. Proposes a novel plane structure-induced loss.

Step	Loss
Plane loss	$L_{reg} (S_{i}) = \sum_{q} - z (q) \cdot \log (p_{plane} (q)) - (1 - z (q)) \cdot \log (1 - p_{plane} (q))$
Loss	$L = \sum_{i = 1}^{n} \sum_{j = 1}^{m} \left( \sum_{q} S_{i}^{j} (q) \cdot \| (n_{i}^{j})^{T} Q - 1 \| \right) + α \sum_{i = 1}^{n} L_{reg} (S_{i})$
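
The total loss above can be sketched for a single image as follows. This is an illustration in plain Python under stated assumptions: the function names, the data layout, and the default `alpha` are hypothetical, and $p_{plane}(q)$ is taken here to be the sum of the per-plane probabilities at pixel $q$, which the table does not spell out.

```python
import math

def plane_structure_loss(seg_probs, normals, points):
    """Sum over planes j and pixels q of S^j(q) * |n_j^T Q - 1|."""
    loss = 0.0
    for probs_j, n_j in zip(seg_probs, normals):
        for s, point in zip(probs_j, points):
            residual = abs(sum(a * b for a, b in zip(n_j, point)) - 1.0)
            loss += s * residual
    return loss

def planar_regularizer(p_plane, z, eps=1e-12):
    """L_reg: binary cross entropy between planar probability and planar label z(q)."""
    loss = 0.0
    for pq, zq in zip(p_plane, z):
        pq = min(max(pq, eps), 1.0 - eps)  # keep log() finite
        loss += -zq * math.log(pq) - (1 - zq) * math.log(1.0 - pq)
    return loss

def total_loss(seg_probs, normals, points, z, alpha=0.1):
    """Structure-induced term plus alpha times the regularizer (alpha is assumed)."""
    p_plane = [sum(column) for column in zip(*seg_probs)]
    return (plane_structure_loss(seg_probs, normals, points)
            + alpha * planar_regularizer(p_plane, z))
```

The first term needs only depth supervision, since $Q$ comes from the ground-truth depth, while the second term discourages the network from labeling every pixel as non-planar.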

0x03 PlaneRCNN

[CVPR 2019] Liu, Chen, et al. NVIDIA, Washington University in St. Louis, SenseTime, Simon Fraser University

0x04 PlanarReconstruction

[CVPR 2019] Yu, Zehao, et al. ShanghaiTech University, The Pennsylvania State University

Single-Image Piece-wise Planar 3D Reconstruction via Associative Embedding

Step	Loss
Segmentation: balanced cross entropy	$L_{S} = - (1 - w) \sum_{i \in F} \log p_{i} - w \sum_{i \in B} \log (1 - p_{i})$
Embedding: discriminative loss	$L_{E} = L_{pull} + L_{push}$
Per-pixel plane: L1 loss	$L_{PP} = \frac{1}{N} \sum_{i = 1}^{N} \| n_{i} - n_{i}^{*} \|$
Instance parameter	$L_{IP} = \frac{1}{N \tilde{C}} \sum_{j = 1}^{\tilde{C}} \sum_{i = 1}^{N} S_{i j} \cdot \| n_{j}^{T} Q_{i} - 1 \|$
Loss	$L = L_{S} + L_{E} + L_{PP} + L_{IP} + \dots$
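
As an illustration of the segmentation term, the balanced cross entropy can be written in a few lines of plain Python. The function name and argument layout are assumptions; $F$ and $B$ are taken to be the foreground (planar) and background (non-planar) pixel sets, and the weight $w$ is passed in rather than derived, since the table does not define it.

```python
import math

def balanced_cross_entropy(probs, is_foreground, w, eps=1e-12):
    """L_S = -(1 - w) * sum_F log(p_i) - w * sum_B log(1 - p_i).

    probs[i] is the predicted foreground probability of pixel i.
    """
    loss = 0.0
    for p, fg in zip(probs, is_foreground):
        p = min(max(p, eps), 1.0 - eps)  # keep log() finite
        if fg:
            loss -= (1.0 - w) * math.log(p)
        else:
            loss -= w * math.log(1.0 - p)
    return loss
```

Weighting the two sums by $1 - w$ and $w$ keeps the more frequent class from dominating the gradient.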

Embedding: associative embedding (End-to-End Learning for Joint Detection and Grouping).

Discriminative loss function

An image can contain an arbitrary number of instances
The labeling is permutation-invariant: it does not matter which specific label an instance gets, as long as it differs from all other instance labels.

$L_{E} = L_{pull} + L_{push}$

where

$L_{pull} = \frac{1}{C} \sum_{c = 1}^{C} \frac{1}{N_{c}} \sum_{i = 1}^{N_{c}} \max (\| μ_{c} - x_{i} \| - δ_{v}, 0)$

$L_{push} = \frac{1}{C (C - 1)} \underset{c_{A} \neq c_{B}}{\sum_{c_{A} = 1}^{C} \sum_{c_{B} = 1}^{C}} \max (δ_{d} - \| μ_{c_{A}} - μ_{c_{B}} \|, 0)$

Here, $C$ is the number of clusters (planes) in the ground truth, $N_{c}$ is the number of elements in cluster $c$, $x_{i}$ is the pixel embedding, $μ_{c}$ is the mean embedding of cluster $c$, and $δ_{v}$ and $δ_{d}$ are the margins for the “pull” and “push” losses, respectively.
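
The pull and push terms can be sketched directly from their definitions. This is a plain-Python illustration, not the authors' implementation: the function names and the default margins are assumptions, and a real training loop would use a tensor library so the loss is differentiable.

```python
import math

def l2(a, b):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def discriminative_loss(embeddings, labels, delta_v=0.5, delta_d=1.5):
    """Return (L_pull, L_push) for per-pixel embeddings and ground-truth plane labels."""
    clusters = {}
    for x, c in zip(embeddings, labels):
        clusters.setdefault(c, []).append(x)
    # Mean embedding mu_c of every cluster.
    means = {c: [sum(col) / len(xs) for col in zip(*xs)]
             for c, xs in clusters.items()}
    num_clusters = len(clusters)

    # Pull: embeddings farther than delta_v from their cluster mean are penalized.
    pull = sum(
        sum(max(l2(means[c], x) - delta_v, 0.0) for x in xs) / len(xs)
        for c, xs in clusters.items()
    ) / num_clusters

    # Push: cluster means closer than delta_d to each other are penalized.
    push = 0.0
    if num_clusters > 1:
        push = sum(
            max(delta_d - l2(means[a], means[b]), 0.0)
            for a in means for b in means if a != b
        ) / (num_clusters * (num_clusters - 1))
    return pull, push
```

Only distances between embeddings enter the loss, never the label values themselves, which is what makes it permutation-invariant and independent of the number of instances.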

Instance Parameter Loss:

$L_{IP} = \frac{1}{N \tilde{C}} \sum_{j = 1}^{\tilde{C}} \sum_{i = 1}^{N} S_{i j} \cdot \| n_{j}^{T} Q_{i} - 1 \|$

$S$: instance segmentation map; $n_{j}$: predicted plane parameter; $Q_{i}$: the 3D point at pixel $i$.

$n ≐ \tilde{n} / d$, where $\tilde{n} \in S^{2}$ and $d$ denote the surface normal and the plane's distance to the origin, respectively.
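
A point $Q$ on the plane satisfies $\tilde{n}^{T} Q = d$, so dividing by $d$ gives $n^{T} Q = 1$, which is why the residual $n_{j}^{T} Q_{i} - 1$ measures how far a 3D point is from satisfying the plane equation. The sketch below illustrates this in plain Python; the function names and data layout are assumptions for illustration.

```python
def plane_param(unit_normal, distance):
    """n = n_tilde / d: unit normal divided by the plane's distance to the origin."""
    return [c / distance for c in unit_normal]

def instance_parameter_loss(seg, planes, points):
    """L_IP averaged over N pixels and C predicted planes.

    seg[i][j] is S_ij (weight of pixel i for plane j), planes[j] is n_j,
    points[i] is the 3D point Q_i.
    """
    num_points, num_planes = len(points), len(planes)
    total = 0.0
    for i, q in enumerate(points):
        for j, n in enumerate(planes):
            residual = abs(sum(a * b for a, b in zip(n, q)) - 1.0)
            total += seg[i][j] * residual
    return total / (num_points * num_planes)
```

A point that lies exactly on its assigned plane contributes zero, so the loss is driven by pixels whose 3D position disagrees with the plane they are assigned to.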

0xFF Results

PlaneNet

PlaneRecover

PlaneRCNN

PlanarReconstruction