Thesis Defense


"Semantic Three-Dimensional Understanding of Dynamic Scenes"

Zhile Ren

Tuesday, May 1, 2018 at 1:00 P.M.

Lubrano Conference Room (CIT 4th Floor)

We develop new representations and algorithms for three-dimensional (3D) scene understanding from images and videos. In cluttered indoor scenes, RGB-D images are typically described by local geometric features of the 3D point cloud. We introduce descriptors that account for 3D camera viewpoint, and use structured learning to perform 3D object detection and room layout prediction. We also extend this work by using latent support surfaces to capture style variations of 3D objects and help detect small objects. In outdoor autonomous driving applications, given two consecutive frames from a pair of stereo cameras, 3D scene flow methods simultaneously estimate the 3D geometry and motion of the observed scene. We incorporate semantic segmentation in a cascaded prediction framework to more accurately model moving objects.

We first propose a cloud of oriented gradient (COG) descriptor that links the 2D appearance and 3D pose of object categories, and thus accurately models how perspective projection affects perceived image boundaries. We also propose a “Manhattan voxel” representation which better captures the 3D room layout geometry of common indoor environments. Effective classification rules are learned via a structured prediction framework that accounts for the intersection-over-union overlap of hypothesized 3D cuboids with human annotations, as well as orientation estimation errors. Our model is learned solely from annotated RGB-D images, without the benefit of CAD models, but nevertheless its performance substantially exceeds the state-of-the-art on the SUN RGB-D database.

Existing 3D representations for RGB-D images have limited power to represent objects with different visual styles. The detection of small objects is also challenging because the search space is very large in 3D scenes. However, much of the shape variation within 3D object categories can be explained by the location of a latent support surface, and smaller objects are often supported by larger objects. We thus design algorithms that use latent support surfaces to better represent the 3D appearance of large objects, and provide contextual cues to improve the detection of small objects. Contextual relationships among categories and layout are captured via a cascade of classifiers, leading to holistic scene hypotheses with improved accuracy. The proposed system further improves 3D detection performances for 19 diverse object categories in the SUN RGB-D database.

We then focus on outdoor scene flow prediction tasks. Many existing approaches use superpixels for regularization, but may predict inconsistent shapes and motions inside rigidly moving objects. We instead assume that scenes consist of foreground objects rigidly moving in front of a static background, and use semantic cues to produce pixel-accurate scene flow estimates. Our cascaded classification framework accurately models 3D scenes by iteratively refining semantic segmentation masks, stereo correspondences, 3D rigid motion estimates, and optical flow fields. We evaluate our method on the challenging KITTI autonomous driving benchmark, and show that accounting for the motion of segmented vehicles leads to state-of-the-art performance.

This work motivates ongoing research to better exploit deep learning for 3D scene understanding. We report preliminary results on learning contextual relationships for object detection, exploiting temporal cues from multiple video frames to improve optical flow estimates, and progress in constructing large-scale databases of synthetic scenes with known 3D geometry.

Host: Professor Erik Sudderth