Kernel Methods in Computer Vision:
Object Localization, Clustering,
and Taxonomy Discovery
Matthew Brian Blaschko, M.S.
aus La Jolla
Von der Fakultät IV - Elektrotechnik und Informatik
der Technischen Universität Berlin
zur Erlangung des akademischen Grades
Doktor der Naturwissenschaften
Dr. rer. nat.
Vorsitzender: Prof. Dr. O. Hellwich
Berichter: Prof. Dr. T. Hofmann
Berichter: Prof. Dr. K.-R. Müller
Berichter: Prof. Dr. B. Schölkopf
Tag der wissenschaftlichen Aussprache: 23.03.2009
Berlin 2009
D83

Zusammenfassung

In dieser Arbeit studieren wir drei fundamentale Computer-Vision-Probleme mit Hilfe von Kernmethoden.
Zunächst untersuchen wir das Problem, Objekte in natürlichen Bildern zu lokalisieren, welches wir als die Aufgabe formalisieren, die Bounding Box eines zu detektierenden Objekts vorherzusagen. In Kapitel II entwickeln wir hierfür ein Branch-and-Bound-Framework, das es uns erlaubt, effizient und optimal diejenige Bounding Box zu finden, welche eine gegebene Qualitätsfunktion maximiert. Dabei kann es sich sowohl um die Entscheidungsfunktion eines kernbasierten Klassifikators als auch um ein Nearest-Neighbor-Abstandsmaß handeln. Wir zeigen, dass dieses Verfahren bereits hervorragende Lokalisierungsergebnisse erzielt, wenn es mit einer einfachen linearen Qualitätsfunktion verwendet wird, die durch Trainieren einer Support-Vektor-Maschine gefunden wurde.
In Kapitel III untersuchen wir, wie sich kernbasierte Qualitätsfunktionen lernen lassen, die optimal für die Aufgabe der Objektlokalisierung geeignet sind. Insbesondere zeigen wir, dass Structured Output Regression dies ermöglicht: im Gegensatz zu Support-Vektor-Maschinen kann Structured Output Regression nicht nur binäre Entscheidungen treffen, sondern beliebige Elemente eines Ausgaberaumes vorhersagen. Im Fall der Objektlokalisierung besteht der Ausgaberaum dabei aus allen möglichen Bounding Boxes innerhalb des Zielbildes. Structured Output Regression lernt eine Funktion, die die Kompatibilität zwischen Eingaben und Ausgaben misst, und prädiziert die beste Ausgabe durch Maximierung dieser Kompatibilität über den Ausgaberaum.
In this thesis we address three fundamental problems in computer vision using kernel methods. We first address the problem of object localization, which we frame as the problem of predicting a bounding box around an object of interest. We develop a framework in Chapter II for applying a branch and bound optimization strategy to efficiently and optimally detect a bounding box that maximizes objective functions including kernelized functions and proximity to a prototype. We demonstrate that this optimization can achieve state-of-the-art results when applied to a simple linear objective function trained by a support vector machine.

In Chapter III, we then examine how to train a kernelized objective function that is optimized for the task of object localization. In particular, this is achieved by the use of structured output regression. In contrast to a support vector machine, structured output regression does not simply predict binary outputs but rather predicts an element in some output space. In the case of object localization the output space is the space of all possible bounding boxes within an image. Structured output regression learns a function that measures the compatibility of inputs and outputs, and the best output is predicted by maximizing the compatibility over the space of outputs. This maximization turns out to be exactly the same branch and bound optimization as developed in Chapter II. Furthermore, a variant of this branch and bound optimization is also utilized during training as part of a constraint generation step.
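The branch and bound strategy referred to above can be illustrated as follows. This is a simplified sketch rather than the implementation from Chapter II, and it assumes a per-pixel score map defining a linear quality function (the quality of a box is the sum of the pixel scores it contains); the function names are ours.

```python
import heapq
import numpy as np

def best_box(score_map):
    """Best-first branch and bound over *sets* of boxes, in the spirit
    of efficient subwindow search.  A set is four intervals, one for
    each of (top, bottom, left, right); the search returns the single
    box maximizing the sum of its pixel scores, provably optimally."""
    h, w = score_map.shape
    pos = np.maximum(score_map, 0.0)  # positive part of the scores
    neg = np.minimum(score_map, 0.0)  # negative part of the scores

    def upper_bound(t, b, l, r):
        # Admissible bound: positive scores summed over the largest
        # box in the set plus negative scores over the smallest box.
        ub = pos[t[0]:b[1] + 1, l[0]:r[1] + 1].sum()
        if t[1] <= b[0] and l[1] <= r[0]:  # smallest box is non-empty
            ub += neg[t[1]:b[0] + 1, l[1]:r[0] + 1].sum()
        return ub

    root = ((0, h - 1), (0, h - 1), (0, w - 1), (0, w - 1))
    heap = [(-upper_bound(*root), root)]
    while heap:
        neg_ub, box_set = heapq.heappop(heap)
        if all(lo == hi for lo, hi in box_set):
            t, b, l, r = box_set
            return (t[0], b[0], l[0], r[0]), -neg_ub  # optimal box
        # branch: split the widest coordinate interval in half
        widths = [hi - lo for lo, hi in box_set]
        k = widths.index(max(widths))
        parts = list(box_set)
        lo, hi = parts[k]
        for half in ((lo, (lo + hi) // 2), ((lo + hi) // 2 + 1, hi)):
            parts[k] = half
            t, b, l, r = parts
            if t[0] <= b[1] and l[0] <= r[1]:  # set holds a valid box
                heapq.heappush(heap, (-upper_bound(*parts), tuple(parts)))
```

Because the bound is admissible and exact on singleton sets, the first singleton popped from the priority queue is a globally optimal box, typically after examining far fewer candidates than exhaustive sliding-window search.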
We then turn our focus to the problem of clustering images in Chapter IV. We ﬁrst report results from a large scale evaluation of clustering algorithms, for which we measure how well the partition predicted by the clustering algorithm matches a known semantically correct partition of the data. In this study, we see particularly strong results from spectral clustering algorithms, which use the eigenvectors of an appropriately normalized kernel matrix to cluster the data. Motivated by this success, we develop a generalization of spectral clustering to data that appear in more than one modality, the primary example being images with associated text.
As spectral clustering algorithms can be interpreted as the application of kernel principal components analysis followed by a reclustering step, we generalize them by substituting regularized kernel canonical correlation analysis, again followed by a reclustering step. The resulting algorithm, correlational spectral clustering, partitions the data significantly better than spectral clustering, and allows for the projection of unseen data that is only present in one modality (e.g. an image with no text caption).
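The single-modality building block described above, spectral clustering from a kernel matrix, can be sketched as follows; this is a minimal illustration (with our own function names and a deterministic k-means initialization), not the multimodal algorithm of Chapter IV.

```python
import numpy as np

def spectral_clustering(K, n_clusters, n_iter=50):
    """Cluster points given only a kernel (similarity) matrix K:
    embed each point with the leading eigenvectors of the
    symmetrically normalized matrix D^{-1/2} K D^{-1/2}, then
    recluster the embedded points with a plain k-means step."""
    d = K.sum(axis=1)
    s = 1.0 / np.sqrt(d)
    M = K * np.outer(s, s)                     # D^{-1/2} K D^{-1/2}
    _, vecs = np.linalg.eigh(M)                # ascending eigenvalues
    emb = vecs[:, -n_clusters:]                # leading eigenvectors
    emb /= np.maximum(np.linalg.norm(emb, axis=1, keepdims=True), 1e-12)
    # deterministic farthest-point initialization for k-means
    centers = [emb[0]]
    for _ in range(n_clusters - 1):
        dist = np.min([((emb - c) ** 2).sum(-1) for c in centers], axis=0)
        centers.append(emb[int(np.argmax(dist))])
    centers = np.array(centers)
    for _ in range(n_iter):
        labels = np.argmin(((emb[:, None] - centers[None]) ** 2).sum(-1),
                           axis=1)
        for k in range(n_clusters):
            if np.any(labels == k):
                centers[k] = emb[labels == k].mean(axis=0)
    return labels
```

The eigenvector embedding is exactly the "kernel PCA followed by reclustering" reading mentioned in the text; the multimodal variant replaces it with projections obtained from regularized kernel CCA.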
Finally, in Chapter V, we address the problem of discovering taxonomies in data.
Given a sample of data, we wish to partition the data into clusters, and to ﬁnd a taxonomy that relates the clusters. Our algorithm, numerical taxonomy clustering, works by maximizing a kernelized dependence measure between the data and an abstracted kernel matrix that is constructed from a partition matrix that deﬁnes the clusters and a positive deﬁnite matrix that represents the relationship between clusters. By appropriately constraining the latter matrix to be generated by an additive metric, we are able to interpret the result as a taxonomy. We make use of the well studied ﬁeld of numerical taxonomy to eﬃciently optimize this constrained problem, and show that we not only achieve an interpretable result, but that the quality of clustering is improved for datasets that have a taxonomic structure.
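One concrete choice for the kernelized dependence measure mentioned above is the Hilbert-Schmidt Independence Criterion (HSIC). The following is a sketch of the standard biased empirical estimator, shown for illustration; it is not the exact objective optimized in Chapter V.

```python
import numpy as np

def hsic(K, L):
    """Biased empirical HSIC between two kernel matrices K and L over
    the same n points: trace(K H L H) / (n - 1)^2, where
    H = I - (1/n) 11^T is the centering matrix.  Larger values
    indicate stronger dependence between the two representations."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2
```

In the taxonomy setting, L plays the role of the abstracted kernel matrix built from the partition matrix and the positive definite cluster-relationship matrix, so maximizing the dependence measure over these matrices fits the clusters and the taxonomy jointly.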
This thesis would not have been possible without the support and help I’ve received from many people. Christoph Lampert, Arthur Gretton, Thomas Hofmann, Tinne Tuytelaars, and Wray Buntine have been wonderful coauthors. It is a privilege to learn by working with such excellent scientists. I owe special mention to Christoph Lampert, Bernhard Schölkopf, and Thomas Hofmann for advising me throughout my PhD.
Christoph additionally translated my abstract into German.
I could not have asked for a better environment to do a PhD than the Max Planck Institute for Biological Cybernetics. The computer vision group consisted of several very strong researchers, and I enjoyed working with and learning from Guillaume Charpait, Matthias Franz, Peter Gehler, Wolf Kienzle, Kwang In Kim, and Sebastian Nowozin. I’d like to thank all of my colleagues, especially Sabrina Nielebock for all her help.
During my doctoral work, I was funded in part by a Marie Curie fellowship through the PerAct project (EST 504321), and by the EU funded CLASS project (IST 027978). Through the CLASS project, I was able to learn about the research being done at a consortium of five leading European research institutions, and to get feedback on my own work. Thanks are due to the participants for helping to provide insight into the big issues facing the field of computer vision, and the role that statistical learning can play in solving them.
Klaus-Robert Müller has given very valuable feedback, and is responsible for having suggested several experiments that have improved the scientific content of this work.
I especially thank him for reading my thesis, and for giving his comments during a marathon three-hour phone call between California and Berlin, all while recovering in bed from surgery for his broken leg. I’d also like to thank Gabriela Ernst at the Technische Universität Berlin for all her help throughout the process of arranging the defense.
A PhD isn’t all work, and I’d like to take the time to mention my friends who made my time in Tübingen so enjoyable, whether it was just a coffee break or a night out. Among others: Yasemin Altun, Andreas Bartels, Matthias Bethge, Olivier Chapelle, Guillaume Charpait, Jan Eichhorn, Ayse Naz Erkan, Jason Farquhar, Peter Gehler, Elisabeth Georgii, Arthur Gretton, Moritz Grosse-Wentrup, Jez Hill, Matthias Hofmann, Reshad Hosseini, Stefanie Jegelka, Wolf Kienzle, Kwang In Kim, Jens Kober, Lukasz Konieczny, Oliver Barnabas Kroemer, Shih-pi Ku, Christoph Lampert, Luise Liebig, Markus Maier, Suzanne Martens, Betty Mohler, Sebastian Nowozin, Jan Peters, Justus Piater, Hiroto Saigo, Gabriele Schweikert, Matthias Seeger, Jacquelyn Shelton, Suvrit Sra, Julia Veit, Lena Veit, Ulrike von Luxburg, Christian Walder, Anne Wessendorf, and Mingrui Wu. Extra special thanks goes to Elena Prokofyeva.
Finally, I’d like to thank my family for all their support.
Chapter I

Introduction

Computer vision is the process of automatically understanding visual information and abstracting meaningful representations that can be used in subsequent data processing and organization. It is a relatively immature field: the goal of enabling computers to interact with visual information with sophistication similar to a human’s is far from achieved. Furthermore, the tasks which have been approached by the research community are fragmented and not always well defined. Nevertheless, there has been significant progress in recent years, especially in the areas of object classification and localization (the more classical tasks of three-dimensional reconstruction and tracking have reached a relatively high level of sophistication, and are not addressed in this work). This work improves on the state of the art in several important computer vision tasks, and does so by leveraging the power of statistical learning theory and the flexibility of representing data with domain-specific kernels: positive definite functions that are equivalent to an inner product in some Hilbert space Aizerman et al. (1964); Schölkopf and Smola (2002). Statistical learning theory allows us to pose the problem of learning functions that map raw image data to meaningful representations as the problem of generalizing from observed examples. Rather than engineer the solution by hand tuning, we utilize observed data directly in order to learn the function more quickly, flexibly, and accurately. While the problem of supervised classification has been shown to be especially suited to the computer vision setting, we attempt to move beyond this relatively well studied area and propose additional solutions from statistical learning theory for problems in the computer vision domain.
Speciﬁcally, we have addressed three problems of interest to the computer vision community, each of which has been the subject of recent attention due to their importance in the automatic understanding of visual scenes on a semantic level: object localization, clustering, and taxonomy discovery.
In this chapter, we introduce several basic concepts from statistical learning theory and introduce the notation for kernels that we will use throughout this thesis (Section I.1). In particular, we will see that the representer theorem allows us to easily kernelize certain classes of optimization problems. We then review several recent advances in machine learning that will be applicable to problems in computer vision. Once we have ﬁnished our overview of machine learning, we will discuss in Section I.2 the basic concepts from computer vision used throughout the thesis. We will explore how the incorporation of invariances can be treated naturally within the framework of kernel methods, discuss methods for learning task speciﬁc image representations, and give an overview of the state of the art in the learning of
semantic information from image data. Finally, in Section I.3, we introduce the main contributions of this thesis, and place them within the context of the current state of the art.
I.1 Kernel Methods in Machine Learning

Kernel methods have increased in popularity in the past two decades due to their solid mathematical foundation, tendency toward easy geometric interpretation, and strong empirical performance in a wide variety of domains Vapnik (1995, 1998);
Burges (1998); Müller et al. (2001); Schölkopf and Smola (2002); Shawe-Taylor and Cristianini (2004); Hofmann et al. (2008). While traditional linear methods are well founded mathematically, and existing algorithms tend to be optimized for performance, real world data often have significant non-linearities. By adopting an appropriate non-linear kernel, the mathematical foundations and often significant portions of the algorithmic analysis of linear algorithms can be transferred to the non-linear case. Though the feature space implicit in the kernel function can be very high dimensional, by appropriately formulating the problem so that the (implicit) feature vectors are only accessed through kernel evaluations, we can avoid the computational cost imposed by the size of the space.
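To make this last point concrete: for a homogeneous polynomial kernel the implicit feature space can be written out explicitly, and one kernel evaluation in the input space agrees with an inner product in that space without ever constructing the feature vectors. A small illustrative sketch (the function names are ours):

```python
import numpy as np

def poly_kernel(x, z, degree=2):
    """k(x, z) = (x . z)^d: one inner product in the input space."""
    return (x @ z) ** degree

def explicit_features(x):
    """Explicit feature map for the homogeneous degree-2 polynomial
    kernel on R^2: phi(x) = (x1^2, x2^2, sqrt(2) x1 x2)."""
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

x = np.array([1.0, 2.0])
z = np.array([3.0, 0.5])
# identical values, but the kernel never builds the feature space
assert np.isclose(poly_kernel(x, z), explicit_features(x) @ explicit_features(z))
```

For degree d on n-dimensional inputs the explicit space has dimension growing combinatorially in d, and for kernels such as the Gaussian it is infinite-dimensional, which is precisely why access through kernel evaluations matters.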
Given a sample¹ of training points (x_1, y_1), …, (x_i, y_i), …, (x_n, y_n) ∈ X × Y, where X is some input space and Y is an output space, we wish to learn a function f : X → Y such that the expected loss, E_{p_{xy}}[l(x, f(x), y)], is minimized for some loss function, l : X × Y × Y → R. Ignoring the computational issues of finding the minimizing f ∈ F, for some class of functions, F, we are faced with the problem that we do not know the underlying data distribution, p_{xy}, of sample points in X × Y.
We can of course substitute the empirical loss on the training sample,

\[
\frac{1}{n} \sum_{i=1}^{n} l(x_i, f(x_i), y_i), \tag{I.1}
\]

but minimizing Equation (I.1) alone tends to overfit the training sample. We therefore minimize a regularized objective that trades the empirical loss off against the norm of f in its function class,

\[
\min_{f \in \mathcal{F}} \; \|f\|_{\mathcal{F}}^2 + C \sum_{i=1}^{n} l(x_i, f(x_i), y_i), \tag{I.2}
\]

where the parameter C controls the level of regularization.
Let F be a reproducing kernel Hilbert space (RKHS) of functions from X to R.
To each point x ∈ X there corresponds an element φ(x) ∈ F (we call φ : X → F the feature map) such that ⟨φ(x), φ(x′)⟩_F = k(x, x′), where k : X × X → R is a unique positive definite kernel. For optimization problems of the form given in Equation (I.2), the well known Representer Theorem (e.g. Schölkopf and Smola, 2002) guarantees that the minimizer can be written as a linear combination of kernel functions centered at the training points, f(·) = Σ_{i=1}^{n} α_i k(x_i, ·).

¹ Samples are usually assumed to be i.i.d.
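Kernel ridge regression is the classic instance of the representer theorem at work: for the squared loss with an RKHS-norm penalty, the solution is exactly an expansion f(·) = Σ_i α_i k(x_i, ·) with the coefficients available in closed form. A minimal sketch, illustrative rather than taken from this thesis:

```python
import numpy as np

def rbf_kernel(X, Z, gamma=1.0):
    """Gaussian (RBF) kernel matrix between row-stacked point sets."""
    sq = ((X[:, None] - Z[None]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def fit_kernel_ridge(X, y, lam=1e-4, gamma=1.0):
    """Minimize (1/n) sum_i (f(x_i) - y_i)^2 + lam ||f||^2 over the
    RKHS; by the representer theorem f(.) = sum_i alpha_i k(x_i, .),
    with alpha = (K + lam * n * I)^{-1} y in closed form."""
    K = rbf_kernel(X, X, gamma)
    return np.linalg.solve(K + lam * len(X) * np.eye(len(X)), y)

def predict(alpha, X_train, X_test, gamma=1.0):
    """Evaluate f at new points via kernel evaluations only."""
    return rbf_kernel(X_test, X_train, gamma) @ alpha
```

Note that both fitting and prediction touch the data only through kernel evaluations, never through explicit feature vectors, which is the pattern exploited throughout the kernel methods used in this thesis.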