Data Collection

This work is part of my research assistant position at the Computer Vision and Machine Perception Lab. The task is to collect multiview images of 120 specular objects with a handheld mobile camera, in both good and bad lighting conditions, and to process them to obtain camera poses, 2D object masks, and the corresponding 3D bounding boxes.

Setup

The video showcases the configuration for the dataset capture. Larger objects were paired with a 60 mm AprilTag, and smaller objects with a 30 mm AprilTag. Additionally, a few stickers with random markings were affixed to make feature matching more consistent across frames.

Pose Estimation

We sampled 120 frames and used COLMAP to estimate the camera poses, providing it with the calibrated camera intrinsics. Because the input images are consecutive frames from a video, we use COLMAP's sequential matcher in the feature matching stage. The initial pose estimates were unsatisfactory because the AprilTags have very similar features. Adding colored stickers improved the accuracy of the pose estimation.
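The pipeline above can be sketched roughly as follows. This is a minimal illustration, not the exact scripts used for the dataset: the even spacing of the sampled frames, the workspace layout, the PINHOLE camera model, and the choice to keep the intrinsics fixed during bundle adjustment are all assumptions, and the helper names are hypothetical.

```python
import numpy as np


def sample_frame_indices(num_frames_in_video, num_samples=120):
    """Pick evenly spaced frame indices (assumed sampling strategy)."""
    num_samples = min(num_samples, num_frames_in_video)
    idx = np.linspace(0, num_frames_in_video - 1, num_samples)
    return np.round(idx).astype(int).tolist()


def colmap_commands(workspace, fx, fy, cx, cy):
    """Build COLMAP CLI calls for sparse reconstruction with known intrinsics.

    The directory layout under `workspace` is a hypothetical convention.
    """
    db = f"{workspace}/database.db"
    images = f"{workspace}/images"
    sparse = f"{workspace}/sparse"
    return [
        # Feature extraction with one shared, pre-calibrated camera.
        ["colmap", "feature_extractor",
         "--database_path", db, "--image_path", images,
         "--ImageReader.single_camera", "1",
         "--ImageReader.camera_model", "PINHOLE",
         "--ImageReader.camera_params", f"{fx},{fy},{cx},{cy}"],
        # Sequential matching suits frames that come from a video.
        ["colmap", "sequential_matcher", "--database_path", db],
        # Sparse mapping; intrinsics are kept fixed (assumed choice).
        ["colmap", "mapper",
         "--database_path", db, "--image_path", images,
         "--output_path", sparse,
         "--Mapper.ba_refine_focal_length", "0",
         "--Mapper.ba_refine_extra_params", "0"],
    ]
```

Each command list can be passed to `subprocess.run(cmd, check=True)` in order, once the sampled frames have been written to the images folder and the output folder exists.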

Object Masks

We used LangSAM (Language Segment-Anything), a language-prompted Segment Anything model, to obtain the object masks.

Credits: https://github.com/luca-medeiros/lang-segment-anything

Filtering Noisy Points

To derive the 3D bounding box, we first need a 3D representation of the object. We use the point cloud produced by COLMAP's dense reconstruction. However, this point cloud can be quite noisy, so we apply a three-step noise filtering approach:
1) The AprilTags positioned at the object's four corners are triangulated, which gives the object's center.
2) A cylinder with a specified radius is constructed around this center, and points that lie outside the cylinder are excluded.
3) The 3D points are projected back into image coordinates. We noticed that the projected points are displaced relative to the image masks, which can be attributed to errors in the estimated camera poses. To avoid inadvertently removing object points, the masks are dilated before a voting-based removal mechanism is applied.
With the noisy points filtered out, the subsequent 3D bounding box estimation becomes more accurate.
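The filtering steps can be sketched in NumPy as below. This is an illustration under stated assumptions rather than the code used for the dataset: the cylinder axis is taken to be the normal of the plane through the triangulated tags, cameras are given as hypothetical `(K, R, t)` tuples in the world-to-camera convention, the dilation uses a 3x3 square element, and the `dilate_px` and `min_ratio` defaults are placeholder values.

```python
import numpy as np


def cylinder_filter(points, tag_positions):
    """Return (center, radial distance of each point from the cylinder axis)."""
    tags = np.asarray(tag_positions, dtype=float)
    center = tags.mean(axis=0)
    # Assumed: the axis is the normal of the best-fit plane through the tags.
    _, _, vt = np.linalg.svd(tags - center)
    axis = vt[-1]
    rel = np.asarray(points, dtype=float) - center
    # Remove the component along the axis to get the radial offset.
    radial = rel - np.outer(rel @ axis, axis)
    return center, np.linalg.norm(radial, axis=1)


def dilate(mask, iterations):
    """Binary dilation with a 3x3 square element, NumPy only."""
    out = mask.astype(bool)
    h, w = out.shape
    for _ in range(iterations):
        padded = np.pad(out, 1, mode="constant")
        grown = np.zeros_like(out)
        for dy in range(3):
            for dx in range(3):
                grown |= padded[dy:dy + h, dx:dx + w]
        out = grown
    return out


def mask_vote_filter(points, cameras, masks, dilate_px=5, min_ratio=0.8):
    """Keep points that land inside the dilated mask in most views."""
    points = np.asarray(points, dtype=float)
    hits = np.zeros(len(points))
    seen = np.zeros(len(points))
    for (K, R, t), mask in zip(cameras, masks):
        grown = dilate(mask, dilate_px)
        h, w = mask.shape
        cam = points @ R.T + t                      # world -> camera
        in_front = cam[:, 2] > 1e-9
        z = np.where(in_front, cam[:, 2], 1.0)      # avoid division by zero
        uv = (cam @ K.T)[:, :2] / z[:, None]
        u = np.round(uv[:, 0]).astype(int)
        v = np.round(uv[:, 1]).astype(int)
        # Only views in which the point lands inside the image cast a vote.
        visible = in_front & (u >= 0) & (u < w) & (v >= 0) & (v < h)
        seen += visible
        hits[visible] += grown[v[visible], u[visible]]
    return hits >= min_ratio * np.maximum(seen, 1)
```

A point would then be kept when its radial distance is at most the chosen cylinder radius and `mask_vote_filter` returns True for it. Dilating the masks before voting trades a slightly looser silhouette for tolerance to small pose errors.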

3D Bounding Box

To obtain an oriented bounding box, we use Open3D's OrientedBoundingBox, which also provides the box's orientation and center. We then estimate the extrinsic matrix that relocates the world coordinate system to the object center.
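One way to express this change of coordinates is sketched below. It assumes that the extrinsics follow the world-to-camera convention (as COLMAP's do) and that `R_obj` and `center` are the box rotation and center, for example the `R` and `center` attributes of an Open3D OrientedBoundingBox obtained from the filtered point cloud. The function names are hypothetical.

```python
import numpy as np


def object_to_world(R_obj, center):
    """4x4 transform taking object-frame points to world-frame points."""
    T = np.eye(4)
    T[:3, :3] = R_obj
    T[:3, 3] = center
    return T


def recenter_extrinsic(E_world_to_cam, R_obj, center):
    """Extrinsic that maps object-frame points directly to camera coordinates.

    X_cam = E * X_world and X_world = T_obj_to_world * X_obj,
    so the re-centered extrinsic is E * T_obj_to_world.
    """
    return E_world_to_cam @ object_to_world(R_obj, center)
```

With the re-centered extrinsics, the object center sits at the origin of the new world frame and the box axes align with the coordinate axes, so the 3D bounding box is described by its extent alone.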