Computer Vision
My research interests include broad topics in Computer Vision: Complex
Event Recognition, Image Registration and Photo Aesthetics. The following
sections provide an overview of my recent papers in these areas.
Complex Event Recognition
High-Level Event Recognition in Unconstrained Videos
Abstract: The goal of high-level event recognition is to automatically
detect complex high-level events in a given video sequence. This is a
difficult task especially when videos are captured under unconstrained
conditions by non-professionals. Such videos, depicting complex events, have
limited quality control and therefore may include severe camera motion, poor
lighting, heavy background clutter and occlusion. However, due to the
fast-growing popularity of such videos, especially on the Web, solutions to
this problem are in high demand and have attracted great interest from
researchers. In this paper, we review current technologies for complex event
recognition in unconstrained videos. While the existing solutions vary, we
identify common key modules and provide detailed descriptions along with
some insights for each of them, including extraction and representation of
low-level features across different modalities, classification strategies,
fusion techniques, etc. Publicly available benchmark datasets, performance
metrics, and related research forums are also described. Finally, we discuss
promising directions for future research.
This work has been accepted for publication in the International Journal on
Multimedia Information Retrieval, 2012, and a preprint is available here.
Covariance of Motion and Appearance Features for Human Action and
Gesture Recognition
Abstract: In this paper, we introduce a novel descriptor that employs the
covariance of motion and appearance features for human action and gesture
recognition. In our approach, we compute kinematic features from optical
flow, and first- and second-order derivatives of intensities, to represent
motion and appearance, respectively. These features are then used to
construct covariance matrices which capture joint statistics of both
low-level motion and appearance features extracted from a video. Using an
over-complete dictionary of covariance-based descriptors built from
labeled training samples, we formulate low-level event recognition as a
sparse linear approximation problem. Within this formulation, we pose the
sparse decomposition of a covariance matrix, which also conforms to the space
of positive semi-definite matrices, as a determinant maximization problem.
Also, since covariance matrices lie on non-linear Riemannian manifolds, we
compare this first approach with a sparse linear approximation alternative
that is suitable for equivalent vector spaces of covariance matrices. This is
done
by searching for the best projection of the query data on a dictionary using
an Orthogonal Matching Pursuit algorithm. We show the applicability of our
video descriptor in two different application domains, namely human action
recognition and one-shot learning of human gestures. Our experiments provide
promising insights into large-scale video analysis.
This work is under review at IEEE Transactions on Pattern Analysis and
Machine Intelligence.
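The two ingredients named in the abstract, a covariance descriptor and a sparse approximation over a labeled dictionary, can be sketched as follows. This is only an illustrative sketch, not the paper's implementation: the function names, the log-Euclidean vectorization used to obtain an "equivalent vector space", the ridge term, and the vote-by-coefficient classification rule are all assumptions made here. The determinant maximization variant is not shown.

```python
import numpy as np


def covariance_descriptor(features):
    """Covariance of per-pixel feature vectors; `features` is (n_samples, d)."""
    cov = np.cov(features, rowvar=False)
    # Assumption: a small ridge keeps the matrix strictly positive definite,
    # so that its matrix logarithm exists.
    return cov + 1e-6 * np.eye(cov.shape[0])


def log_euclidean_vector(cov):
    """Map a symmetric positive definite matrix to a flat vector."""
    eigvals, eigvecs = np.linalg.eigh(cov)
    log_cov = (eigvecs * np.log(eigvals)) @ eigvecs.T
    rows, cols = np.triu_indices(cov.shape[0])
    # Off-diagonal terms are scaled by sqrt(2) so the vector's Euclidean norm
    # equals the Frobenius norm of the matrix logarithm.
    scale = np.where(rows == cols, 1.0, np.sqrt(2.0))
    return scale * log_cov[rows, cols]


def orthogonal_matching_pursuit(dictionary, query, n_nonzero):
    """Greedy sparse code of `query` over the columns (atoms) of `dictionary`."""
    query = np.asarray(query, dtype=float)
    residual = query.copy()
    support = []
    coeffs = np.zeros(dictionary.shape[1])
    for _ in range(n_nonzero):
        correlations = np.abs(dictionary.T @ residual)
        if support:
            correlations[support] = -np.inf  # never pick the same atom twice
        support.append(int(np.argmax(correlations)))
        # Re-fit all selected atoms jointly (the "orthogonal" step).
        selected = dictionary[:, support]
        solution, *_ = np.linalg.lstsq(selected, query, rcond=None)
        residual = query - selected @ solution
        coeffs[:] = 0.0
        coeffs[support] = solution
    return coeffs


def classify(coeffs, atom_labels):
    """Hypothetical decision rule: the class whose atoms carry most weight."""
    atom_labels = np.asarray(atom_labels)
    classes = np.unique(atom_labels)
    energy = [np.abs(coeffs[atom_labels == c]).sum() for c in classes]
    return classes[int(np.argmax(energy))]
```

In use, each column of `dictionary` would be the vectorized descriptor of one labeled training clip, and a query clip would be assigned the label returned by `classify`.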
Cinematographic Shot Classification and its Application to Complex
Event Recognition
Abstract: In this paper, we propose a discriminative representation
of a video shot based on its camera motion and demonstrate how the
representation can be used for high-level multimedia tasks like complex
event recognition. In our technique, we assume that a homography exists
between each pair of consecutive frames in a given video shot. Using purely
image-based methods, we compute homography parameters that serve as coarse
indicators of the camera motion. Next, using Lie algebra, we map the
homography matrices to an intermediate vector space that preserves the
intrinsic geometric structure of the transformation. Multiple time series are
then constructed from these mappings. Features computed on these time series
are used for discriminative classification of video shots. In addition, we
provide an in-depth analysis of different features computed from time-series
and their impact on the classification of different shots. Our empirical
evaluations on eight cinematographic shot classes show that our technique
performs better than approaches that are based on image-based estimation of
camera trajectories. Finally, we show an application of our shot
representation to the detection of complex events in consumer videos.
This work is under review at IEEE Transactions on Multimedia.
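The mapping from homographies to a vector space described in the abstract can be illustrated with the matrix logarithm, which takes a determinant-one homography to a trace-free 3x3 matrix (an element of the Lie algebra). The sketch below is written under stated assumptions and is not the paper's code: it assumes normalized image coordinates and small inter-frame motion so that a power series for the logarithm converges (a general routine such as `scipy.linalg.logm` would lift that restriction), and the per-channel mean and standard deviation are placeholder time-series features chosen here for illustration.

```python
import numpy as np


def homography_log(H, n_terms=40):
    """Matrix logarithm of a homography that is close to the identity."""
    H = np.asarray(H, dtype=float)
    det = np.linalg.det(H)
    if det <= 0:
        raise ValueError("expected a homography with positive determinant")
    H = H / np.cbrt(det)  # rescale so that det(H) = 1
    A = H - np.eye(3)
    if np.linalg.norm(A, 2) >= 0.5:
        raise ValueError("series assumes small inter-frame motion")
    # log(I + A) = A - A^2/2 + A^3/3 - ...
    log_H = np.zeros((3, 3))
    power = np.eye(3)
    for k in range(1, n_terms + 1):
        power = power @ A
        log_H += ((-1) ** (k + 1) / k) * power
    return log_H


def lie_vector(H):
    """Eight coefficients; the ninth entry is fixed by the zero trace."""
    return homography_log(H).flatten()[:8]


def shot_time_series(homographies):
    """Stack per-frame-pair vectors into a (T, 8) array, one series per column."""
    return np.stack([lie_vector(H) for H in homographies])


def series_features(time_series):
    """Placeholder features: mean and standard deviation of each series."""
    return np.concatenate([time_series.mean(axis=0), time_series.std(axis=0)])
```

For a shot with T frame pairs, `series_features(shot_time_series(homographies))` yields a 16-dimensional vector that could be passed to any discriminative classifier.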
A probabilistic representation for efficient large scale visual
recognition tasks
Abstract: In this paper, we present an efficient alternative to the
traditional vocabulary-based bag-of-visual-words (BoV) representation used
for visual classification tasks. Our representation is both conceptually and
computationally superior to the bag-of-visual-words: (1) we iteratively
generate a Maximum Likelihood estimate of an image given a set of
characteristic features, in contrast to BoV methods, where an image is
represented as a histogram of visual words; (2) we randomly sample a set of
characteristic features instead of employing the computationally intensive
clustering algorithms used during the vocabulary generation step of BoV
methods. Performance comparable to the state of the art, in experiments
on a challenging scene categorization dataset and two equally challenging
human action datasets, demonstrates the universal applicability of our method.
The camera-ready version is here.
The code and data used in the experiments discussed in the paper will be
uploaded shortly.
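The abstract does not spell out the estimation procedure, so the following is only one plausible reading, offered as a sketch rather than the paper's method. It assumes that the "characteristic features" act as fixed centers of an isotropic Gaussian mixture, and that the image is represented by the mixture weights obtained through iterative (EM) Maximum Likelihood updates. The function names, the bandwidth `sigma` and the iteration count are all hypothetical.

```python
import numpy as np


def sample_characteristic_features(all_features, n_centers, seed=0):
    """Randomly pick descriptors to serve as centers (no clustering step)."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(all_features), size=n_centers, replace=False)
    return all_features[idx]


def ml_mixture_weights(image_features, centers, sigma=1.0, n_iters=50):
    """EM estimate of mixture weights with fixed centers and fixed bandwidth."""
    diffs = image_features[:, None, :] - centers[None, :, :]
    log_lik = -(diffs ** 2).sum(axis=2) / (2.0 * sigma ** 2)
    # Subtracting the row maximum rescales each row by a constant, which
    # leaves the responsibilities unchanged and avoids underflow.
    log_lik -= log_lik.max(axis=1, keepdims=True)
    lik = np.exp(log_lik)
    weights = np.full(len(centers), 1.0 / len(centers))
    for _ in range(n_iters):
        resp = lik * weights                                  # E-step
        resp /= resp.sum(axis=1, keepdims=True) + 1e-300
        weights = resp.mean(axis=0)                           # M-step
    return weights
```

Unlike a BoV histogram, which assigns every descriptor to exactly one visual word, the weights here come from soft assignments that are refined over the iterations.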
Columbia-UCF TRECVID2010 Multimedia Event Detection: Combining
Multiple Modalities, Contextual Concepts, and Temporal Matching
Abstract:
TRECVID Multimedia Event Detection offers an interesting but very challenging
task in detecting high-level complex events (batting baseball run, making
cake, assembling shelter) in user-generated videos. In this paper, we present
an overview and comparative analysis of our results, which achieved top
performance among all 45 submissions in TRECVID 2010.
Our aim is to answer the following questions. What kind of feature is more
effective for multimedia event detection? Are features from different feature
modalities (e.g., audio and visual) complementary for event detection? Can we
benefit from generic concept detection of background scenes, human actions,
and audio concepts? Are sequence matching and event-specific object detectors
critical?
Our findings indicate that spatial-temporal features are very effective for
event detection, and that they are highly complementary to other features
such as static SIFT and audio features. As a result, our baseline run
combining these
three features already achieves very impressive results, with a mean minimal
normalized cost (MNC) of 0.586. Incorporating the generic concept detectors
using a graph diffusion algorithm provides marginal gains (mean MNC 0.579).
Sequence matching with Earth Mover's Distance (EMD) further improves the
results (mean MNC 0.565). The event-specific detector ("batter"), however,
did not prove useful in our current re-ranking tests. We conclude that it is
important to combine strong complementary features from multiple modalities
for multimedia event detection, and cross-frame matching is helpful in coping
with temporal order variation. Leveraging contextual concept detectors and
foreground activities remains a very attractive direction requiring further
research.
This is a joint effort between Columbia University and UCF which culminated
in the best performance in the Multimedia Event Detection 2010 challenge.
A notebook paper is available here.
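The minimal normalized cost figures quoted above can be understood through a small sketch of the metric for a single event. The abstract does not define the metric, so the formula and the default cost parameters below (miss cost 80, false-alarm cost 1, target prior 0.001) are assumptions recalled from the MED evaluation setup and should be checked against the official evaluation plan; the function name is hypothetical.

```python
import numpy as np


def min_normalized_cost(scores, labels, cost_miss=80.0, cost_fa=1.0,
                        p_target=0.001):
    """Lowest normalized detection cost over all decision thresholds."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    n_target = labels.sum()
    n_nontarget = (~labels).sum()
    if n_target == 0 or n_nontarget == 0:
        raise ValueError("need at least one target and one non-target clip")
    # Normalizer: cost of the better trivial system (reject all / accept all).
    norm = min(cost_miss * p_target, cost_fa * (1.0 - p_target))
    # Candidate thresholds: every observed score, plus "reject everything".
    thresholds = np.append(np.unique(scores), np.inf)
    best = np.inf
    for threshold in thresholds:
        detected = scores >= threshold
        p_miss = np.sum(labels & ~detected) / n_target
        p_fa = np.sum(~labels & detected) / n_nontarget
        cost = (cost_miss * p_miss * p_target
                + cost_fa * p_fa * (1.0 - p_target)) / norm
        best = min(best, cost)
    return float(best)
```

Under these assumed parameters a system that rejects every clip scores exactly 1.0, so values below 1.0 indicate a detector that beats the trivial baseline. The mean MNC would then be the average of this quantity over the evaluated events.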
Photo Aesthetics
A Holistic Approach to Aesthetic Enhancement of Photographs
Abstract:
This article presents an interactive application that enables users to
improve the visual aesthetics of their digital photographs using several
novel spatial recompositing techniques. This work differs from earlier
efforts in two important aspects: (1) it focuses on both photo quality
assessment and improvement in an integrated fashion, (2) it enables the user
to make informed decisions about improving the composition of a photograph.
The tool facilitates interactive selection of one or more foreground
objects present in a given composition, and the system presents
recommendations for where they can be relocated in a manner that optimizes a
learned aesthetic metric while obeying semantic constraints. For photographic
compositions that lack a distinct foreground object, the tool provides the
user with crop or expansion recommendations that improve the aesthetic appeal
by equalizing the distribution of visual weights between semantically
different regions. The recomposition techniques presented in the article
emphasize learning support vector regression models that capture visual
aesthetics from user data and seek to optimize this metric iteratively to
increase the image appeal. The tool demonstrates promising aesthetic
assessment and enhancement results on a variety of images and provides
insightful directions towards future research.
This journal article is an extension of the paper published at the ACM
Multimedia 2010 International Conference in Florence.
A Coherent Framework for Photo-Quality Assessment and Enhancement based
on Visual Aesthetics
Abstract:
We present an interactive application that enables users to improve the
visual aesthetics of their digital photographs using spatial recomposition.
Unlike earlier work that focuses either on photo quality assessment or
interactive tools for photo editing, we enable the user to make informed
decisions about improving the composition of a photograph and to implement
them in a coherent framework. Specifically, the user can interactively select
a foreground object and the system will present recommendations for where it
can be moved in a manner that optimizes a learned aesthetic metric while
obeying semantic constraints. For photographic compositions that lack a
distinct foreground object, our tool provides the user with cropping or
expansion recommendations that improve their aesthetic quality. We learn a
support vector regression model for capturing image aesthetics from user
data and seek to optimize this metric during recomposition. Rather than
prescribing a fully-automated solution, we allow user-guided object
segmentation and inpainting to ensure that the final photograph matches the
user's criteria. Our approach achieves 86% accuracy in predicting the
attractiveness of unrated images, when compared to their respective human
rankings. Additionally, 73% of the images recomposited using our tool are
ranked more attractive than their original counterparts by human raters.
This work was accepted at the ACM Multimedia International Conference
(ACMMM 2010), held in Firenze, Italy, as a 10-page paper (17% acceptance
rate). Here is an accompanying talk.
A subset of the images from the dataset mentioned in the paper is available
here. We received some objections from Flickr users to making their images
publicly available for experiments, hence the full dataset was taken down.
The code provided in the archive is unsupported.
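The recommendation step described in the abstract, choosing where to move a foreground object so that an aesthetic score is maximized subject to semantic constraints, can be sketched as a constrained grid search. This is a simplified illustration, not the released code: `thirds_score` is a hypothetical stand-in for the learned support vector regression model, and the boolean `allowed_mask` is an assumed encoding of the semantic constraints.

```python
import numpy as np


def thirds_score(x, y):
    """Hypothetical stand-in metric: closeness to a rule-of-thirds point."""
    points = np.array([[1/3, 1/3], [1/3, 2/3], [2/3, 1/3], [2/3, 2/3]])
    dist = np.sqrt(((points - np.array([x, y])) ** 2).sum(axis=1)).min()
    return 1.0 - dist


def recommend_position(score_fn, allowed_mask):
    """Return the best allowed grid cell as (row, col) and its score.

    `allowed_mask` is a boolean grid; True marks cells where placing the
    object would satisfy the (assumed) semantic constraints.
    """
    n_rows, n_cols = allowed_mask.shape
    best_score, best_cell = -np.inf, None
    for row, col in zip(*np.nonzero(allowed_mask)):
        # Score the cell centre in normalized image coordinates.
        x = (col + 0.5) / n_cols
        y = (row + 0.5) / n_rows
        score = score_fn(x, y)
        if score > best_score:
            best_score, best_cell = score, (int(row), int(col))
    return best_cell, best_score
```

A call such as `recommend_position(thirds_score, mask)` returns one suggestion; in an interactive tool the top few cells could instead be shown to the user, who makes the final choice.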
Image Registration
Moving Object Detection and Tracking in Forward Looking Infra-Red
Aerial imagery
Abstract:
This chapter discusses the challenges of automating surveillance and
reconnaissance tasks for infra-red visual data obtained from aerial platforms.
These problems have gained significant importance over the years, especially
with the advent of lightweight and reliable imaging devices. Detection and
tracking of objects of interest have traditionally been an active area of
research in the computer vision literature. These tasks are rendered
especially challenging in aerial sequences of the infra-red modality. The
chapter gives an overview of
these problems, and the associated limitations of some of the conventional
techniques typically employed for these applications. We begin with a study
of various image registration techniques that are required to eliminate the
apparent motion induced by the movement of the aerial sensor. Next, we
present a technique for
detecting moving objects from the ego-motion compensated input sequence.
Finally, we describe a methodology for tracking already detected objects using
their motion history. We substantiate our claims with results on a wide range
of aerial video sequences.
This work is published as a chapter in the Springer book
Machine Vision beyond Visible Spectrum.
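The first two stages of the pipeline, registering consecutive frames to cancel sensor motion and then detecting moving pixels, can be illustrated with a deliberately simplified sketch. It swaps in phase correlation, which recovers only a global translation, in place of the registration techniques studied in the chapter, and it uses a circular shift for compensation. Function names and the threshold are assumptions made for this example.

```python
import numpy as np


def estimate_translation(reference, current):
    """Estimate the integer (dy, dx) shift of `current` relative to `reference`."""
    cross_power = np.fft.fft2(current) * np.conj(np.fft.fft2(reference))
    cross_power /= np.abs(cross_power) + 1e-12  # keep phase information only
    correlation = np.fft.ifft2(cross_power).real
    dy, dx = np.unravel_index(np.argmax(correlation), correlation.shape)
    height, width = reference.shape
    # Peaks in the upper half of the index range are negative shifts.
    if dy > height // 2:
        dy -= height
    if dx > width // 2:
        dx -= width
    return int(dy), int(dx)


def detect_moving_pixels(reference, current, threshold):
    """Compensate the estimated ego-motion, then threshold the difference."""
    dy, dx = estimate_translation(reference, current)
    # Circular shift: a simplification that ignores image borders.
    aligned = np.roll(reference, shift=(dy, dx), axis=(0, 1))
    return np.abs(current - aligned) > threshold
```

A fuller pipeline would estimate a homography or affine transform instead of a pure translation, mask out the border region uncovered by the warp, and pass the resulting detections to a tracker that uses their motion history.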