Building a 3D map from RGB-D sensors
Master’s Thesis in Computer Science
Supervisor: Alper Aydemir
Examiner: Stefan Carlsson
Computer Vision and Active Perception Laboratory
Royal Institute of Technology (KTH), Stockholm, Sweden
TRITA xxx yyyy-nn
Abstract

For a mobile robot exploring an unknown static environment, localizing itself and building a map at the same time is a chicken-and-egg problem, known as Simultaneous Localization And Mapping (SLAM). When a GPS receiver cannot be used, such as in indoor environments, the measurements are generally provided by laser rangefinders and stereo cameras; these are expensive, however, and standard laser rangefinders offer only 2D cross sections. Recently there has been great interest in processing data acquired using depth measuring sensors, due to the availability of cheap and capable RGB-D cameras. For instance, the Kinect, developed by PrimeSense and Microsoft, has considerably changed the situation, providing a 3D camera at a very affordable price.
In this study, we will see how a 3D map based on a graphical model can be built by tracking visual features such as SIFT/SURF, computing geometric transformations with RANSAC, and applying non-linear optimization techniques to estimate the trajectory. This can be done from a sequence of video frames combined with depth information, using exclusively the Kinect, so the field of applications can be wider than robotics.
Referat

Building a 3D map from an RGB-D camera

A robot exploring an unknown static environment, where it must not only localize itself but also build a map at the same time, has to deal with the problem known as Simultaneous Localization And Mapping (SLAM). When a GPS receiver cannot be used, typically in an indoor environment, laser rangefinders and cameras are usually used as sensors. A laser rangefinder is expensive and provides information only in 2D. Cameras require a great deal of data processing since they do not provide depth information. Recently, a great interest has emerged in processing data from RGB-D cameras, i.e. cameras that provide depth information in addition to the ordinary image. The Kinect, developed by PrimeSense and Microsoft, recently became available, offering RGB-D data at a very attractive price.

In this report we study how a 3D map based on a graphical model can be built by tracking visual landmarks such as SIFT/SURF, computing geometric transformations with RANSAC, and applying non-linear optimization techniques to estimate how the sensor has moved. We show how an ordinary Kinect can be used for this without any other sensors on the robot, as is otherwise often the case.

This means that the field of application can be wider than robotics.
Acknowledgments

“The most exciting phrase to hear in science, the one that heralds new discoveries, is not ‘Eureka!’ but ‘That’s funny...’” — Isaac Asimov

My first thanks go to my supervisor Alper Aydemir, for carefully following my progress and always giving me useful suggestions to solve the problems raised throughout this work. I am grateful for all the time he spent on discussions, experiments and practical issues. I also want to warmly thank Giorgio Grisetti for his most valuable advice, Patric Jensfelt and John Folkesson for their help, and my examiner Stefan Carlsson. My deepest thoughts go to my family and friends, especially to my father Hugues, for constantly showing interest in my activities and for his wise support.
This work was carried out at the Computer Vision and Active Perception Lab at the Royal Institute of Technology in Stockholm, a very pleasant environment to work in, both for the quality and atmosphere of the school itself and, above all, for the people I encountered, thanks to their availability, involvement, and willingness to share their knowledge. Special thanks to André Susano Pinto for his friendship, his smart ideas, and the nice moments spent together during his stay. Staff, students, and visitors, this time would not have been the same without Alessandro Pieropan, Ali Mosavian, Andrzej Pronobis, Cheng Zhang, Christian Smith, Gert Kootstra, Johan Ekekrantz, Josephine Sullivan, Magnus Burènius, Miroslav Kobetski, Niklas Bergström, Oscar Danielsson, Renaud Detry, Lazaros Nalpantidis, Victoria Matute Arribas, and many others.
Chapter 1 Introduction

To navigate in an unknown environment, a mobile robot needs to build a map of the environment and localize itself in that map at the same time. The process addressing this dual problem is called Simultaneous Localization And Mapping (SLAM). In an outdoor environment, this can generally be solved with GPS, which provides sufficient accuracy for the tasks the robot can take on. However, when moving indoors, or in places where GPS data is not available or not reliable enough, it can become difficult to estimate the robot's position precisely, and other solutions have to be found.
The main difficulty in SLAM comes from the uncertainty of the measurements, due to sensor noise and technical limitations. Probabilistic models are widely used to cope with these inherent errors and provide satisfactory estimates.
While this process is generally based on data provided by sensors such as laser scanners, combined with odometry, Visual Simultaneous Localization And Mapping (VSLAM) focuses on the use of cameras, as illustrated in figure 1.1.
Figure 1.1: Concept of Visual SLAM.
The poses of the camera (and hence of the robot, for a camera fixed on it) are determined from video data. The estimates generally drift with respect to the real trajectory, and the uncertainty grows over time.
1.1 Context

Currently, most robotic mapping is performed using sensors that offer only a 2D cross section of the environment around them. One reason is that acquiring high quality 3D data was either very expensive or imposed hard constraints on the robot's movements. Research has therefore mainly focused on laser scanners to solve the SLAM problem, although there are some methods making use of stereo and monocular cameras. However, recently there has been great interest in processing data acquired using depth measuring sensors, due to the availability of cheap and efficient RGB-D cameras. For instance, the Kinect camera developed by PrimeSense and Microsoft has considerably changed the situation, providing a 3D camera at a very affordable price. Primarily designed for entertainment, it has received a warm welcome in the research community, especially in robotics.
In the past, to solve the inherent problem of drift and provide a reliable estimate of the camera poses, most projects have used techniques such as Extended Kalman Filtering (EKF) or particle filters. More recently, several methods rely on pose graphs to model and improve the estimates.
Some projects making use of both RGB-D data and graph optimization illustrate well the interest in such an approach.
The work presented in this thesis is done at the Computer Vision and Active Perception Lab (CVAP) at KTH, the Royal Institute of Technology, in Stockholm.
Since 1982, the research at CVAP has focused on the topics of computer vision and robotics.
1.2 Goals

The main goal of this thesis is to build a 3D map from the RGB and depth information provided by a camera, considering a 6 Degrees-of-Freedom (DOF) motion system.
The hardware device used for the experiments is the Microsoft Kinect, but this work could be extended to any system providing video and depth data.
The VSLAM process can be described as estimating the poses of the camera from its data stream (video and depth), in order to reconstruct the entire environment while the camera is moving. As sensor noise leads to deviations of each estimated camera pose from the real motion, the goal is to build a 3D map that is as close as possible to the real environment.
One objective is to get an overview of the different methods and techniques, so that the problems can be better identified and then analyzed more deeply after some experiments. This work will, however, focus on the use of visual features.
Though a VSLAM system is intended to be used in real time, some of the processing may be done without treating performance as a main priority, and could therefore be deferred. The rendering of the scene is also out of the scope of this study.
Figure 1.2: Microsoft Kinect mounted on a mobile robot platform at CVAP (KTH)
1.3 Thesis outline
The rest of the document is structured as follows:
Chapter 2 presents the background and the underlying concepts of this area: features and methods commonly used in computer vision, and general notions about SLAM.
Chapter 3 presents feature matching between pairs of frames, and how a 3D transformation can be computed from these associations using the depth data.
Chapter 4 describes how a map can be built, by estimating the camera poses through the use of a pose graph, refining them through the detection of loop closures, and finally performing a 3D reconstruction.
Chapter 5 presents the experiments, the software, how the data was acquired, and finally the results, with examples of maps generated from different datasets.
Chapter 6 presents the conclusions and future work, with some suggestions for possible improvements.
Chapter 2 Background

In this chapter, we first present the Kinect camera, then describe some of the features and methods commonly used in computer vision, and finally give a short introduction to SLAM concepts.
2.1 Microsoft Kinect

As mentioned in the introduction, the hardware used in this work is the Kinect, a device developed by PrimeSense, initially for the Microsoft Xbox 360, and released in November 2010. It is composed of an RGB camera, a 3D depth sensing system, a multi-array microphone and a motorized tilt. In this work, only the RGB and depth sensors are used to provide the input data.
Figure 2.1: The Kinect sensor device (courtesy of Microsoft)
• The depth sensing system is composed of an IR emitter projecting structured light, which is captured by a CMOS image sensor and decoded to produce the depth image of the scene. Its range is specified to be between 0.7 and 6 meters, although the best results are obtained from 1.2 to 3.5 meters. Its data output has 12-bit depth. The depth sensor resolution is 320x240 pixels at a rate of 30 Hz (a sketch of back-projecting such a depth image into 3D points follows this list).
• The field of view is 57° horizontal and 43° vertical, with a tilt range of ±27°.
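Since only the RGB and depth streams are used, a depth image has to be converted into 3D points, which is done through the standard pinhole camera model. Below is a minimal sketch of this back-projection for a 320x240 depth image in meters; the intrinsic parameters FX, FY, CX, CY are hypothetical placeholders, and the real values would come from calibration.

```python
import numpy as np

# Hypothetical Kinect-like intrinsics for a 320x240 image (placeholders;
# actual values are obtained by calibrating the camera).
FX, FY = 285.0, 285.0   # focal lengths in pixels
CX, CY = 160.0, 120.0   # principal point in pixels

def back_project(depth_m):
    """Turn an (H, W) depth image in meters into an (N, 3) point cloud
    using the pinhole model: X = (u - cx) * Z / fx, Y = (v - cy) * Z / fy."""
    h, w = depth_m.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth_m
    x = (u - CX) * z / FX
    y = (v - CY) * z / FY
    points = np.dstack((x, y, z)).reshape(-1, 3)
    return points[points[:, 2] > 0]   # drop pixels with no depth reading

# Example with a synthetic depth image: a flat wall 2 m away.
cloud = back_project(np.full((240, 320), 2.0))
print(cloud.shape)   # (76800, 3)
```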
The drivers used are those developed by OpenNI (Open Natural Interface), an organization established in November 2010, launched by PrimeSense and joined by Willow Garage, among others. The OpenNI framework offers high-level functions mainly oriented toward gaming, such as gesture recognition and motion tracking. It aims to provide an abstraction layer with a generic interface to the hardware devices. In the case of the Kinect, one advantage is that the calibration of the RGB sensor with respect to the IR sensor is ensured, so the resulting RGB and depth data are correctly mapped with respect to a unique viewpoint.
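As a rough illustration, registered RGB and depth frames can also be grabbed through OpenCV's OpenNI capture backend instead of calling the OpenNI API (which is C/C++) directly. The sketch below assumes an OpenCV build compiled with OpenNI support and a connected Kinect; it is not the pipeline used in this thesis, just a minimal way to obtain the two streams.

```python
import cv2

# Open the device through OpenCV's OpenNI backend (requires an OpenCV
# build with OpenNI support; otherwise isOpened() returns False).
cap = cv2.VideoCapture(cv2.CAP_OPENNI)

while cap.isOpened() and cap.grab():
    # grab() fetches one synchronized frame; retrieve() decodes each stream.
    ok_d, depth = cap.retrieve(flag=cv2.CAP_OPENNI_DEPTH_MAP)  # uint16, millimeters
    ok_c, bgr = cap.retrieve(flag=cv2.CAP_OPENNI_BGR_IMAGE)    # uint8 color image
    if not (ok_d and ok_c):
        break
    cv2.imshow("rgb", bgr)
    cv2.imshow("depth", (depth / 32).astype("uint8"))  # crude scaling, display only
    if cv2.waitKey(30) == 27:   # Esc to quit
        break

cap.release()
```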
2.2 Features

In this work, we focus on tracking some parts of the scene observed by the camera.
In computer vision, and more specifically in object recognition, many techniques are based on the detection of points of interest on objects or surfaces. This is done through the extraction of features. In order to track these points of interest during a motion of the camera and/or the robot, a reliable feature has to be invariant to image location, scale and rotation. A few methods are briefly presented here:
Harris Corner: a corner detector, by Harris and Stephens
SIFT: Scale Invariant Feature Transform, by David Lowe
SURF: Speeded Up Robust Features
NARF: Normal Aligned Radial Feature
BRIEF: Binary Robust Independent Elementary Features

There are two aspects to a feature: the detection of a keypoint, which identifies an area of interest, and its descriptor, which characterizes its region. Typically, the detector identifies a region containing a strong variation of intensity, such as an edge or a corner, and its center is designated as a keypoint. The descriptor is generally computed by measuring the main orientations of the surrounding points, leading to a multidimensional feature vector that identifies the given keypoint.
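To make the two aspects concrete, the sketch below detects keypoints and then computes their descriptors separately, using OpenCV's SIFT implementation (available in mainline OpenCV since version 4.4); the image path is a placeholder.

```python
import cv2

# Load a grayscale frame ("frame.png" is a placeholder path).
img = cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()
keypoints = sift.detect(img)                    # detection: areas of interest
keypoints, desc = sift.compute(img, keypoints)  # description: one 128-D vector each

print(len(keypoints), desc.shape)   # N keypoints -> an (N, 128) descriptor array
```

In practice the two steps are usually fused into a single call (detectAndCompute), but keeping them separate mirrors the detector/descriptor distinction above.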
Given a set of features, matching can then be performed in order to associate pairs of keypoints between two frames.
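A minimal sketch of such a matching step, assuming two descriptor arrays desc1 and desc2 computed as above for a pair of frames, could use a brute-force matcher with Lowe's ratio test to reject ambiguous associations:

```python
import cv2

def match_features(desc1, desc2, ratio=0.75):
    """Associate keypoints between two frames via their SIFT descriptors."""
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    # For each descriptor of frame 1, find its two nearest neighbours in
    # frame 2 and keep the match only if it is clearly better than the
    # runner-up (Lowe's ratio test).
    good = []
    for pair in matcher.knnMatch(desc1, desc2, k=2):
        if len(pair) == 2 and pair[0].distance < ratio * pair[1].distance:
            good.append(pair[0])
    return good
```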
The features listed previously are summarized in table 2.1.
2.2.1 Harris Corner

Known as the Harris corner operator, this is one of the earliest detectors, proposed in 1988 by Harris and Stephens. The notion of corner should be taken in a wide sense, as it allows detecting not only corners but also edges and, more generally, keypoints. It works by computing the second moment matrix (or auto-correlation matrix) of the image intensities, describing their local variations. One of the main limitations of the Harris operator, at least in its original version, concerns scale invariance, as the matrix has to be recomputed for each different scale. Therefore, we will not give further details about this method.
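For illustration, the operator is available directly in OpenCV: cornerHarris computes, for each pixel, the response det(M) - k (trace M)^2 from the second moment matrix M estimated over a small neighbourhood. The sketch below thresholds this response; the image path and the threshold are placeholders.

```python
import cv2
import numpy as np

img = cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE)   # placeholder path

# blockSize is the neighbourhood over which M is accumulated, ksize the
# Sobel aperture used for the image gradients, k the Harris parameter.
response = cv2.cornerHarris(np.float32(img), blockSize=2, ksize=3, k=0.04)

# Keep only strong responses (the 1% threshold is an arbitrary choice).
corners = np.argwhere(response > 0.01 * response.max())
print(len(corners), "corner pixels")
```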
2.2.2 SIFT feature

The Scale Invariant Feature Transform (SIFT) is a method presented by David Lowe, now widely used in robotics and computer vision. It detects distinctive, invariant image feature points, which can easily be matched between images to perform tasks such as object detection and recognition, or to compute geometric transformations between images.
The main idea of the SIFT method is to define a cascade of operations of increasing complexity, so that the most expensive operations are applied only to the most probable candidates.
1. The first step relies on a pyramid of Difference-of-Gaussians (DoG) in order to be invariant to scale and orientation.