

Computer Vision

My research interests include broad topics in Computer Vision - complex event recognition, image registration, and photo aesthetics. The following sections provide an overview of my recent papers in these areas.

Complex Event Recognition

Recognition of Complex Events in Open-Source Web-Scale Videos: A Bottom up approach
Abstract: Recognition of complex events in unconstrained Internet videos is a challenging research problem. In this symposium proposal, we present a systematic decomposition of complex events into hierarchical components and make an in-depth analysis of how existing research is being used to cater to various levels of this hierarchy. We also identify three key stages where we make novel contributions that are necessary not only to improve the overall recognition performance, but also to develop a richer understanding of these events. At the lowest level, our contributions include (a) compact covariance descriptors of appearance and motion features used in a sparse coding framework to recognize realistic actions and gestures, and (b) a Lie-algebra-based representation of the dominant camera motion present in video shots, which can be used as a complementary feature for video analysis. At the next level, we propose (c) an efficient maximum-likelihood-estimate-based representation computed from low-level video features, which demonstrates state-of-the-art performance in large-scale visual concept detection, and finally, we propose to (d) model temporal interactions between concepts detected in video shots through two new discriminative feature spaces derived from linear dynamical systems, which eventually boosts event recognition performance. In all cases, we conduct thorough experiments to demonstrate promising performance gains over some of the prominent approaches.
This work was accepted for presentation at the Doctoral Symposium of ACM Multimedia 2013, held in Barcelona, Spain. A preprint is available here. This document is a concise version of my PhD dissertation, which is available here.

High-Level Event Recognition in Unconstrained Videos
Abstract: The goal of high-level event recognition is to automatically detect complex high-level events in a given video sequence. This is a difficult task, especially when videos are captured under unconstrained conditions by non-professionals. Such videos, depicting complex events, have limited quality control and may therefore include severe camera motion, poor lighting, heavy background clutter, and occlusion. However, due to the fast-growing popularity of such videos, especially on the Web, solutions to this problem are in high demand and have attracted great interest from researchers. In this paper, we review current technologies for complex event recognition in unconstrained videos. While the existing solutions vary, we identify common key modules and provide detailed descriptions along with some insights for each of them, including extraction and representation of low-level features across different modalities, classification strategies, fusion techniques, etc. Publicly available benchmark datasets, performance metrics, and related research forums are also described. Finally, we discuss promising directions for future research.
This work was accepted for publication in the International Journal of Multimedia Information Retrieval, 2012; a preprint is available here.

Covariance of Motion and Appearance Features for Human Action and Gesture Recognition
Abstract: In this paper, we introduce a novel descriptor employing the covariance of motion and appearance features for human action and gesture recognition. In our approach, we compute kinematic features from optical flow and first- and second-order derivatives of intensities to represent motion and appearance, respectively. These features are then used to construct covariance matrices which capture the joint statistics of both low-level motion and appearance features extracted from a video. Using an over-complete dictionary of the covariance-based descriptors built from labeled training samples, we formulate low-level event recognition as a sparse linear approximation problem. Within this, we pose the sparse decomposition of a covariance matrix, which also conforms to the space of positive semi-definite matrices, as a determinant maximization problem. Also, since covariance matrices lie on non-linear Riemannian manifolds, we compare our former approach with a sparse linear approximation alternative that is suitable for equivalent vector spaces of covariance matrices. This is done by searching for the best projection of the query data on a dictionary using an Orthogonal Matching Pursuit algorithm. We show the applicability of our video descriptor in two different application domains - namely human action recognition and one-shot learning of human gestures. Our experiments provide promising insights into large-scale video analysis.
This work is under review in IEEE Transactions on Pattern Analysis and Machine Intelligence.
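As an illustration of the descriptor idea, the sketch below builds a covariance matrix from per-pixel motion/appearance features and flattens it with a log-Euclidean embedding, one common way to handle the Riemannian geometry of covariance matrices before sparse coding. The 6-dimensional features and the embedding choice are illustrative assumptions, not the paper's exact pipeline.

```python
import numpy as np
from scipy.linalg import logm

def covariance_descriptor(features):
    """features: (N, d) array of per-pixel feature vectors (e.g.
    optical-flow components and intensity derivatives). Returns the
    d x d covariance matrix, regularized to stay positive definite."""
    return np.cov(features, rowvar=False) + 1e-6 * np.eye(features.shape[1])

def log_euclidean_vector(cov):
    """Map an SPD matrix into a Euclidean vector via the matrix log,
    keeping the diagonal plus the upper triangle (off-diagonals scaled
    by sqrt(2) to preserve the Frobenius norm)."""
    L = logm(cov).real
    iu = np.triu_indices(L.shape[0], k=1)
    return np.concatenate([np.diag(L), np.sqrt(2.0) * L[iu]])

rng = np.random.default_rng(0)
feats = rng.standard_normal((5000, 6))   # hypothetical 6-dim features
cov = covariance_descriptor(feats)
vec = log_euclidean_vector(cov)          # vector input to a dictionary
```

In the vectorized form, ordinary sparse approximation tools (such as Orthogonal Matching Pursuit) apply directly, which is the vector-space alternative the abstract compares against.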

Cinematographic Shot Classification and its Application to Complex Event Recognition
Abstract: In this paper, we propose a discriminative representation of a video shot based on its camera motion and demonstrate how the representation can be used for high-level multimedia tasks like complex event recognition. In our technique, we assume that a homography exists between subsequent pairs of frames in a given video shot. Using purely image-based methods, we compute homography parameters that serve as coarse indicators of the camera motion. Next, using Lie algebra, we map the homography matrices to an intermediate vector space that preserves the intrinsic geometric structure of the transformation. Multiple time series are then constructed from these mappings. Features computed on these time series are used for discriminative classification of video shots. In addition, we provide an in-depth analysis of the different features computed from the time series and their impact on the classification of different shots. Our empirical evaluations on eight cinematographic shot classes show that our technique performs better than approaches based on image-based estimation of camera trajectories. Finally, we show an application of our shot representation to the detection of complex events in consumer videos.
This work is under review in IEEE Transactions on Multimedia.
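A minimal sketch of the Lie-algebra mapping, assuming the frame-to-frame homographies are normalized into SL(3): the matrix logarithm sends each homography to its traceless generator in sl(3), whose eight free parameters supply the intermediate vector space from which the time series can be built. The example homography is hypothetical.

```python
import numpy as np
from scipy.linalg import logm

def homography_to_lie(H):
    """Map a 3x3 homography to sl(3) coordinates: normalize the
    determinant to 1 (so the matrix lies in SL(3)) and take the
    matrix logarithm; the generator is traceless, leaving 8 free
    parameters."""
    Hn = H / np.cbrt(np.linalg.det(H))
    A = logm(Hn).real
    return A.ravel()[:8]     # A[2, 2] is fixed by tracelessness

# hypothetical shot motion: a slight rotation plus translation
theta = 0.05
H = np.array([[np.cos(theta), -np.sin(theta), 0.3],
              [np.sin(theta),  np.cos(theta), -0.1],
              [0.0,            0.0,            1.0]])
v = homography_to_lie(H)     # one sample of the multiple time series
```

Stacking these 8-vectors over a shot gives the multiple time series the abstract describes; any time-series features can then be computed on them.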

A probabilistic representation for efficient large scale visual recognition tasks
Abstract: In this paper, we present an efficient alternative to the traditional bag-of-visual-words (BoV) vocabulary used for visual classification tasks. Our representation is both conceptually and computationally superior to the bag of visual words: (1) we iteratively generate a maximum likelihood estimate of an image given a set of characteristic features, in contrast to BoV methods where an image is represented as a histogram of visual words; (2) we randomly sample a set of characteristic features instead of employing the computation-intensive clustering algorithms used during the vocabulary generation step of BoV methods. Our performance, comparable to the state of the art in experiments over a challenging scene categorization dataset and two equally challenging human action datasets, demonstrates the universal applicability of our method.
The camera-ready version is here. The code and data used in the experiments discussed in the paper will be uploaded shortly.
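Since the paper's exact estimator is not reproduced here, the sketch below shows one plausible reading of the idea: exemplar features are drawn by plain random sampling (no clustering), and an image is represented by the maximum-likelihood mixture weights over fixed isotropic Gaussians centered at those exemplars, estimated with a few EM iterations. All dimensions and the kernel width are illustrative assumptions.

```python
import numpy as np

def ml_representation(descriptors, exemplars, sigma=4.0, iters=20):
    """Maximum-likelihood mixture weights of an image's descriptors
    over fixed isotropic Gaussians centred at exemplar features,
    estimated with a few EM iterations over the weights only."""
    # squared distances between every descriptor and every exemplar
    d2 = ((descriptors[:, None, :] - exemplars[None, :, :]) ** 2).sum(-1)
    lik = np.exp(-d2 / (2.0 * sigma ** 2))       # per-pair likelihoods
    w = np.full(exemplars.shape[0], 1.0 / exemplars.shape[0])
    for _ in range(iters):
        resp = lik * w                           # E-step: responsibilities
        resp /= resp.sum(axis=1, keepdims=True) + 1e-300
        w = resp.mean(axis=0)                    # M-step: mixture weights
    return w

rng = np.random.default_rng(0)
pool = rng.standard_normal((1000, 16))           # training descriptors
exemplars = pool[rng.choice(1000, 64, replace=False)]  # sampled, not clustered
image_desc = rng.standard_normal((200, 16))      # one image's descriptors
rep = ml_representation(image_desc, exemplars)   # 64-dim representation
```

The resulting weight vector plays the role the visual-word histogram plays in BoV, but no clustering step is ever run.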

Columbia-UCF TRECVID2010 Multimedia Event Detection: Combining Multiple Modalities, Contextual Concepts, and Temporal Matching
Abstract: TRECVID Multimedia Event Detection offers an interesting but very challenging task in detecting high-level complex events (batting baseball run, making cake, assembling shelter) in user-generated videos. In this paper, we present an overview and comparative analysis of our results, which achieved top performance among all 45 submissions in TRECVID 2010. Our aim is to answer the following questions. What kind of feature is more effective for multimedia event detection? Are features from different modalities (e.g., audio and visual) complementary for event detection? Can we benefit from generic concept detection of background scenes, human actions, and audio concepts? Are sequence matching and event-specific object detectors critical? Our findings indicate that spatio-temporal features are very effective for event detection, and also very complementary to other features such as static SIFT and audio features. As a result, our baseline run combining these three features already achieves very impressive results, with a mean minimal normalized cost (MNC) of 0.586. Incorporating the generic concept detectors using a graph diffusion algorithm provides marginal gains (mean MNC 0.579). Sequence matching with Earth Mover's Distance (EMD) further improves the results (mean MNC 0.565). The event-specific detector ("batter"), however, did not prove useful in our current re-ranking tests. We conclude that it is important to combine strong complementary features from multiple modalities for multimedia event detection, and that cross-frame matching is helpful in coping with temporal order variation. Leveraging contextual concept detectors and foreground activities remains a very attractive direction requiring further research.

This is a joint effort between Columbia University and UCF, which culminated in the best performance in the Multimedia Event Detection 2010 challenge. A notebook paper is available here.
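The EMD-based sequence matching mentioned above can be sketched as a transportation linear program between two videos' shot-level signatures. The toy shot histograms below are hypothetical, and this is the textbook EMD formulation rather than the exact setup used in the submission.

```python
import numpy as np
from scipy.optimize import linprog

def emd(w1, w2, cost):
    """Earth Mover's Distance between two signatures of equal total
    mass, solved as the classic transportation linear program over
    the flow variables f[i, j] >= 0."""
    n, m = len(w1), len(w2)
    A_eq = np.zeros((n + m, n * m))
    for i in range(n):                 # flow out of source i equals w1[i]
        A_eq[i, i * m:(i + 1) * m] = 1.0
    for j in range(m):                 # flow into sink j equals w2[j]
        A_eq[n + j, j::m] = 1.0
    res = linprog(cost.ravel(), A_eq=A_eq,
                  b_eq=np.concatenate([w1, w2]),
                  bounds=(0, None), method="highs")
    return res.fun

# hypothetical shot-level concept histograms for two videos (3 vs 4 shots)
v1 = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([0.5, 0.5])]
v2 = [np.array([0.9, 0.1]), np.array([0.0, 1.0]),
      np.array([0.5, 0.5]), np.array([0.2, 0.8])]
cost = np.array([[np.abs(a - b).sum() for b in v2] for a in v1])
w1 = np.full(len(v1), 1.0 / len(v1))   # uniform shot weights
w2 = np.full(len(v2), 1.0 / len(v2))
distance = emd(w1, w2, cost)
```

Because the flow can reorder mass across shots, the distance tolerates temporal order variation between two instances of the same event, which is exactly why cross-frame matching helps here.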

Photo and Video Aesthetics

Towards a Comprehensive Computational Model for Aesthetic Assessment of Videos
Abstract: In this paper we propose a novel aesthetic model emphasizing psycho-visual statistics extracted at multiple levels, in contrast to earlier approaches that rely only on descriptors suited for image recognition or based on photographic principles. At the lowest level, we determine dark-channel, sharpness, and eye-sensitivity statistics over rectangular cells within a frame. At the next level, we extract Sentibank features (1,200 pre-trained visual classifiers) on a given frame that invoke specific sentiments such as "colorful clouds", "smiling face", etc., and collect the classifier responses as frame-level statistics. At the topmost level, we extract trajectories from video shots. Using viewers' fixation priors, the trajectories are labeled as foreground or background/camera, and statistics are computed on each. Additionally, spatio-temporal local binary patterns are computed to capture texture variations in a given shot. Classifiers are trained on the individual feature representations independently. After a thorough evaluation of 9 different types of features, we select the best features from each level - dark channel, affect, and camera motion statistics. Next, the corresponding classifier scores are integrated in a sophisticated low-rank fusion framework to improve the final prediction scores. Our approach demonstrates strong correlation with human prediction on 1,000 broadcast-quality videos released by NHK as an aesthetic evaluation dataset.

This paper was published as a Grand Challenge submission at the ACM Multimedia 2013 International Conference in Barcelona.
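As a toy version of the lowest-level cue, the sketch below computes the per-cell dark-channel statistic: for each cell of a rectangular grid, the mean of the per-pixel minimum over colour channels. The 4x4 grid and the use of a plain mean are illustrative assumptions.

```python
import numpy as np

def dark_channel_stats(frame, grid=(4, 4)):
    """frame: (H, W, 3) array with values in [0, 1]. Returns one
    dark-channel statistic per cell of a rectangular grid: the mean
    of the per-pixel minimum over colour channels."""
    dark = frame.min(axis=2)             # per-pixel min over R, G, B
    H, W = dark.shape
    gh, gw = grid
    stats = np.empty(gh * gw)
    for i in range(gh):
        for j in range(gw):
            cell = dark[i * H // gh:(i + 1) * H // gh,
                        j * W // gw:(j + 1) * W // gw]
            stats[i * gw + j] = cell.mean()
    return stats

frame = np.full((64, 64, 3), 0.5)        # synthetic flat grey frame
stats = dark_channel_stats(frame)        # 16 per-cell statistics
```

Per-frame vectors like this, one per low-level cue, are what the level-wise classifiers would be trained on.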

A Holistic Approach to Aesthetic Enhancement of Photographs
Abstract: This article presents an interactive application that enables users to improve the visual aesthetics of their digital photographs using several novel spatial recompositing techniques. This work differs from earlier efforts in two important respects: (1) it addresses both photo quality assessment and improvement in an integrated fashion, and (2) it enables the user to make informed decisions about improving the composition of a photograph. The tool facilitates interactive selection of one or more foreground objects present in a given composition, and the system presents recommendations for where they can be relocated in a manner that optimizes a learned aesthetic metric while obeying semantic constraints. For photographic compositions that lack a distinct foreground object, the tool provides the user with crop or expansion recommendations that improve the aesthetic appeal by equalizing the distribution of visual weights between semantically different regions. The recomposition techniques presented in the article center on learning support vector regression models that capture visual aesthetics from user data and seek to optimize this metric iteratively to increase the image appeal. The tool demonstrates promising aesthetic assessment and enhancement results on a variety of images and provides insightful directions for future research.

This journal article is an extension of the paper published at the ACM Multimedia 2010 International Conference in Florence.

A Coherent Framework for Photo-Quality Assessment and Enhancement based on Visual Aesthetics
Abstract: We present an interactive application that enables users to improve the visual aesthetics of their digital photographs using spatial recomposition. Unlike earlier work that focuses either on photo quality assessment or interactive tools for photo editing, we enable the user to make informed decisions about improving the composition of a photograph and to implement them in a coherent framework. Specifically, the user can interactively select a foreground object and the system will present recommendations for where it can be moved in a manner that optimizes a learned aesthetic metric while obeying semantic constraints. For photographic compositions that lack a distinct foreground object, our tool provides the user with cropping or expanding recommendations that improve its aesthetic quality. We learn a support vector regression model for capturing image aesthetics from user data and seek to optimize this metric during recomposition. Rather than prescribing a fully-automated solution, we allow user-guided object segmentation and inpainting to ensure that the final photograph matches the user's criteria. Our approach achieves 86% accuracy in predicting the attractiveness of unrated images, when compared to their respective human rankings. Additionally, 73% of the images recomposited using our tool are ranked more attractive than their original counterparts by human raters.

This work was accepted at the ACM Multimedia International Conference (ACMMM 2010) as a 10-page paper (17% acceptance rate), held in Firenze, Italy. Here is an accompanying talk. A subset of the images from the dataset mentioned in the paper is available here. We received some objections from Flickr users about making their images publicly available for experiments, hence the full dataset was taken down. The code provided in the archive is unsupported.
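The learned aesthetic metric can be sketched with an off-the-shelf support vector regressor: train on per-photo feature vectors against human ratings, then score candidate recompositions with the fitted model. The features, targets, and hyperparameters below are synthetic placeholders, not the paper's actual data or settings.

```python
import numpy as np
from sklearn.svm import SVR

# synthetic stand-ins: rows are per-photo feature vectors, targets
# mimic mean human attractiveness ratings
rng = np.random.default_rng(0)
X = rng.random((200, 10))
y = X @ rng.random(10) + 0.1 * rng.standard_normal(200)

# support vector regression capturing the aesthetic metric
model = SVR(kernel="rbf", C=10.0, epsilon=0.05).fit(X, y)

def aesthetic_score(features):
    """Predicted rating for one candidate recomposition; the tool
    would search object placements that maximize this score subject
    to semantic constraints."""
    return float(model.predict(features.reshape(1, -1))[0])

score = aesthetic_score(X[0])
```

Recomposition then reduces to evaluating `aesthetic_score` on the features of each candidate placement and recommending the best-scoring ones.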

Image Registration

Moving Object Detection and Tracking in Forward Looking Infra-Red Aerial imagery
Abstract: This chapter discusses the challenges of automating surveillance and reconnaissance tasks for infrared visual data obtained from aerial platforms. These problems have gained significant importance over the years, especially with the advent of lightweight and reliable imaging devices. Detection and tracking of objects of interest has traditionally been an area of interest in the computer vision literature. These tasks are rendered especially challenging in aerial sequences of the infrared modality. The chapter gives an overview of these problems and the associated limitations of some of the conventional techniques typically employed for these applications. We begin with a study of the various image registration techniques required to eliminate the apparent motion induced by the motion of the aerial sensor. Next, we present a technique for detecting moving objects in the ego-motion-compensated input sequence. Finally, we describe a methodology for tracking the detected objects using their motion history. We substantiate our claims with results on a wide range of aerial video sequences.

This work is published as a chapter in the Springer book Machine Vision Beyond Visible Spectrum.
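A numpy-only toy of the registration-then-detection pipeline, assuming the ego-motion is a pure global translation (the chapter's registration techniques handle more general transformations): estimate the shift by phase correlation, compensate it, and threshold the frame difference to flag moving pixels.

```python
import numpy as np

def register_translation(prev, curr):
    """Estimate a global integer translation (coarse ego-motion)
    between two frames by phase correlation."""
    F = np.fft.fft2(curr) * np.conj(np.fft.fft2(prev))
    corr = np.fft.ifft2(F / (np.abs(F) + 1e-12)).real
    dy, dx = np.unravel_index(corr.argmax(), corr.shape)
    H, W = prev.shape
    # wrap the circular shift indices into a signed range
    return (dy - H if dy > H // 2 else dy,
            dx - W if dx > W // 2 else dx)

def moving_object_mask(prev, curr, thresh=0.2):
    """Compensate the estimated ego-motion (here: a circular shift),
    then frame-difference and threshold to flag moving pixels."""
    dy, dx = register_translation(prev, curr)
    compensated = np.roll(prev, (dy, dx), axis=(0, 1))
    return np.abs(curr - compensated) > thresh

rng = np.random.default_rng(0)
prev = rng.random((32, 32))
curr = np.roll(prev, (4, -3), axis=(0, 1))   # simulated camera motion
curr[10:14, 10:14] += 1.0                    # a small moving object
mask = moving_object_mask(prev, curr)        # True at the moving blob
```

Connected regions in the mask would then seed the tracker, which follows them across frames using their motion history.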

Video on Demand (Before UCF)

A Case for Grid based Video on Demand System

Abstract: Video on Demand (VoD) services incorporate streaming of video over a network and allow subscribers to select videos and play them with near real-time playback quality, including interactive functions like fast forward, rewind, random seek, etc. VoD systems place a huge overhead on the processing capabilities of the video server and need an equally huge amount of storage. These systems also demand a highly optimized network backbone for data transfer. Though a lot of research has been carried out in the areas of VoD distribution and network optimization, the problems mentioned above have received only cursory attention. Recently, research in the high-performance computing community has led to the development of Grid Computing technologies for precisely the problems stated above. In this paper, we propose and develop the idea of integrating VoD servers with Grid computing, and describe the result as a Grid-based Video on Demand (GDVoD) system. A prototype of the GDVoD system has been developed and experiments have been carried out. The experiments highlight the fact that the GDVoD system has low computational overhead without compromising the quality of the streamed video.

The paper was submitted to HPDC 2006.

Systems Virtualization

Nova: An Approach to On-Demand Virtual Execution Environments for Grids

Abstract: This paper attempts to reduce the overhead of dynamically creating and destroying virtual environments for secure job execution. It introduces a grid architecture, which we call Nova, consisting of extremely minuscule, pre-created virtual machines whose configurations can be altered with respect to the application executed within them. The benefits of the architecture are supported by experimental results.

This work was accepted as a short paper at CCGrid 2006.

Grid/High Performance Computing

Scalable and Distributed Mechanisms for Integrated Scheduling and Replication in Data Grids

Abstract: Data Grids seek to harness geographically distributed resources for large-scale data-intensive problems. Such problems involve loosely coupled jobs and large data sets distributed remotely. Data Grids have found applications in scientific research fields such as high-energy physics and the life sciences, as well as in enterprises. The issues that need to be considered in Data Grid research include resource management for both computation and data. Computation management comprises scheduling of jobs, scalability, and response time, while data management includes replication and movement of data at selected sites. As jobs are data-intensive, data management issues often become integral to the problems of scheduling and effective resource management in Data Grids, making the integration of data replication and scheduling strategies important. Solutions that integrate the two are either lacking or work in a centralized manner, which is not scalable. This paper addresses the problem of integrating scheduling and replication strategies in a distributed manner. As part of the solution, we propose a Distributed Replication and Scheduling Strategy (DistReSS), which aims at iterative improvement of performance based on the coupling between scheduling and replication, achieved in a distributed and hierarchical fashion. Results suggest that, in the context of our experiments, DistReSS performs comparably to the centralized approach when its parameters are tuned properly.

This work was accepted as a poster at CCGrid 2005.