WP5: MI prototypes for video and robotics

The objective of this WP is to specify, develop and evaluate prototypes that use the proposed framework in different applications. These prototypes will validate and demonstrate the viability of the proposed approach, and will support the dissemination activities of WP6 aimed at the international community, both scientific and non-scientific audiences.


Selected publications

Abstract: State-of-the-art systems on cognitive surveillance identify and describe complex events in selected domains, thus providing end-users with tools to easily access the contents of massive video footage. Nevertheless, as the complexity of events increases in semantics and the types of indoor/outdoor scenarios diversify, it becomes difficult to assess which events best describe the scene, and how to model them at a pixel level to fulfill natural language requests. We present an ontology-based methodology that guides the identification, step-by-step modeling, and generalization of the events most relevant to a specific domain. Our approach considers three steps: (1) end-users provide textual evidence from surveilled video sequences; (2) transcriptions are analyzed top-down to build the knowledge bases for event description; and (3) the obtained models are used to generalize event detection to different image sequences from the surveillance domain. This framework produces user-oriented knowledge that improves on existing advanced interfaces for video indexing and retrieval, by determining the best-suited events for video understanding according to end-users. We have conducted experiments with outdoor and indoor scenes showing thefts, chases, and vandalism, demonstrating the feasibility and generalization of this proposal.
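The three-step workflow above (collect textual evidence, build the knowledge bases top-down, keep the most relevant event concepts) can be sketched in miniature. This is only an illustration; the concept names, the keyword matching and the `min_support` threshold are our own simplifications, not the paper's actual ontology machinery.

```python
from collections import Counter

def build_event_kb(transcriptions, min_support=2):
    """Count event concepts attested in end-users' textual evidence and
    keep those mentioned at least `min_support` times. The candidate
    concept list and threshold are illustrative, not from the paper."""
    counts = Counter()
    for text in transcriptions:
        for concept in ("theft", "chase", "vandalism", "walking"):
            if concept in text.lower():
                counts[concept] += 1
    return {c: n for c, n in counts.items() if n >= min_support}

# Step 1: textual evidence from surveilled sequences (toy examples).
evidence = [
    "A man commits a theft and starts a chase",
    "Theft near the entrance",
    "Two people walking, then a chase begins",
]
# Steps 2-3: the retained concepts seed the event-description knowledge base.
kb = build_event_kb(evidence)
```

In the real methodology the transcriptions are analysed linguistically and mapped into an ontology rather than matched by keyword, but the selection principle, keeping the events end-users actually describe, is the same.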

MIPRCV Relevance: State-of-the-art surveillance video analysis is moving towards the automatic exploitation of semantic context, in order to extract event patterns that allow a better comprehension of image sequences. In addition, automatically generated semantic events should also be suitable for end-user interfacing with video contents, something difficult to achieve using bottom-up procedures. The methodology implemented within the MIPRCV project, in particular in the Vid-hum prototype, has helped to cope with the ambiguous and sometimes incorrect interpretations made by experts while building conceptual models. An ontology and the remaining knowledge bases have been modeled in a top-down manner from users' textual evidence. This strategy for feedback and interaction has allowed our prototype to choose the best-suited event concepts, and has enabled its generalization to different scenarios in the surveillance domain.

Abstract: A realistic appearance-based representation of sideview gait sequences is here introduced. It is based on a prior method where a set of appearance-based features of a gait sample is used for gender recognition. These features are computed from parameter values of ellipses that fit body parts enclosed by regions previously defined while ignoring well-known facts of the human body structure. This work presents an improved regionalization method supported by some adaptive heuristic rules to better adjust regions to body parts. As a result, more realistic ellipses and a more meaningful feature space are obtained. Gender recognition experiments conducted on the CASIA Gait Database show better classification results when using the new features.
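The ellipse-fitting step underlying these appearance-based gait features can be sketched from image moments: a binary region's centroid, orientation and axis lengths follow from its first- and second-order central moments. This is a generic moment-based sketch, not the paper's improved regionalization method, and the toy rectangular "body part" is our own example.

```python
import numpy as np

def fit_ellipse_moments(mask):
    """Fit an ellipse to a binary region via its image moments.
    Returns centroid (cx, cy), orientation in radians, and the
    major/minor axis lengths derived from the second central moments."""
    ys, xs = np.nonzero(mask)
    cx, cy = xs.mean(), ys.mean()
    mu20 = ((xs - cx) ** 2).mean()          # variance along x
    mu02 = ((ys - cy) ** 2).mean()          # variance along y
    mu11 = ((xs - cx) * (ys - cy)).mean()   # covariance
    theta = 0.5 * np.arctan2(2 * mu11, mu20 - mu02)
    common = np.sqrt(4 * mu11 ** 2 + (mu20 - mu02) ** 2)
    major = np.sqrt(2 * (mu20 + mu02 + common))
    minor = np.sqrt(2 * (mu20 + mu02 - common))
    return (cx, cy), theta, (major, minor)

# Toy example: a vertical rectangular "limb", 20 px tall and 6 px wide.
mask = np.zeros((40, 30), dtype=bool)
mask[10:30, 12:18] = True
(cx, cy), theta, (major, minor) = fit_ellipse_moments(mask)
# The fitted ellipse is centred on the rectangle and oriented vertically.
```

The paper's contribution is in *where* these regions are placed: its adaptive heuristic rules adjust region boundaries to actual body parts before any such ellipse is fitted.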

MIPRCV Relevance: This work proposes an adaptive regionalization method to better fit and describe body parts in human silhouettes. It provides a source of knowledge for gait analysis, which might lead to gait-based video surveillance functions. Therefore, the results of this paper could contribute to one of the main research lines of the "MI prototype for video surveillance and biometry", which consists in the multimodal fusion of multiple views of gait and face.

Abstract: The CASIA Gait Database is one of the most used benchmarks for gait analysis among the few non-small-size datasets available. It is composed of gait sequences of 124 subjects, which are unequally distributed, comprising 31 women and 93 men. This imbalanced situation could correspond to some real contexts where men are in the majority, for example, a sports stadium or a factory. Learning from imbalanced scenarios usually requires suitable methodologies and performance metrics capable of managing and explaining biased results. Nevertheless, most of the reported experiments using the CASIA Gait Database in gender recognition tasks limit their analysis to global results obtained from reduced subsets, thus avoiding having to deal with the original setting. This paper uses a methodology to gain an insight into the discriminative capacity of the whole CASIA Gait Database for gender recognition under its imbalanced condition. The classification results are expected to be more reliable than those reported in previous papers.

MIPRCV Relevance: Usually, video surveillance systems are located in scenarios where one gender is under-represented with respect to the other. In those cases, the learning and classification methods should be chosen so that they can properly handle class imbalance. This paper studies the problem of gender recognition in a well-known benchmark dataset for gait analysis, where men outnumber women by three to one. The results show the importance of taking this data complexity into consideration, when present, while a recognition system is being designed.
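Why global accuracy can mislead under such imbalance is easy to show numerically. The sketch below uses toy labels with the 3:1 ratio mentioned above; the metric choice (per-class recall and balanced accuracy) is a standard remedy, not taken from the paper.

```python
def per_class_recall(y_true, y_pred):
    """Recall computed separately for each class label."""
    recall = {}
    for c in sorted(set(y_true)):
        idx = [i for i, y in enumerate(y_true) if y == c]
        recall[c] = sum(y_pred[i] == c for i in idx) / len(idx)
    return recall

# Toy labels: 9 men ("M"), 3 women ("F"), mirroring the 3:1 imbalance.
# A degenerate classifier that always predicts the majority class:
y_true = ["M"] * 9 + ["F"] * 3
y_pred = ["M"] * 12

overall = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
recall = per_class_recall(y_true, y_pred)
balanced = sum(recall.values()) / len(recall)
# overall accuracy is 0.75, yet balanced accuracy is only 0.5:
# the minority class ("F") is never recognised.
```

This is precisely why reporting only global results on reduced, rebalanced subsets of CASIA can hide the discriminative limits that this paper sets out to measure.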
  • [Vid-veh] D. Vázquez, A.M. López, D. Ponsa and J. Marín. Virtual Worlds and Active Learning for Human Detection. In Proc. IEEE Int. Conf. on Multimodal Interaction (ICMI), Alicante, Spain, 2011.

    Abstract: Image-based human detection is of paramount interest due to its potential applications in fields such as advanced driving assistance, surveillance and media analysis. However, even detecting non-occluded standing humans remains a challenge of intensive research. The most promising human detectors rely on classifiers developed in the discriminative paradigm, i.e., trained with labelled samples. However, labelling is an intensive manual step, especially in cases like human detection where it is necessary to provide at least bounding boxes framing the humans for training. To overcome this problem, some authors have proposed the use of a virtual world where the labels of the different objects are obtained automatically. This means that the human models (classifiers) are learnt using the appearance of rendered images, i.e., using realistic computer graphics. Later, these models are used for human detection in images of the real world. The results of this technique are surprisingly good. However, they are not always as good as the classical approach of training and testing with data coming from the same camera, or similar ones. Accordingly, in this paper we address the challenge of using a virtual world for gathering (while playing a videogame) a large amount of automatically labelled samples (virtual humans and background) and then training a classifier that performs equally well, on real-world images, as one trained on manually labelled real-world samples. To do that, we cast the problem as one of domain adaptation. In doing so, we assume that a small amount of manually labelled samples from real-world images is required. To collect these labelled samples we propose a non-standard active learning technique. Therefore, ultimately our human model is learnt from the combination of virtual- and real-world labelled samples, which has not been done before. We present quantitative results showing that this approach is valid.
Abstract: Detecting pedestrians in images is a key functionality to avoid vehicle-to-pedestrian collisions. The most promising detectors rely on appearance-based pedestrian classifiers trained with labelled samples. This paper addresses the following question: can a pedestrian appearance model learnt in virtual scenarios work successfully for pedestrian detection in real images? Our experiments suggest a positive answer, which is a new and relevant conclusion for research in pedestrian detection. More specifically, we record training sequences in virtual scenarios and then learn appearance-based pedestrian classifiers using HOG and linear SVM. We test such classifiers on a publicly available dataset provided by Daimler AG for pedestrian detection benchmarking. This dataset contains real-world images acquired from a moving car. The obtained result is compared with the one given by a classifier learnt using samples coming from real images. The comparison reveals that, although virtual samples were not specially selected, both virtual- and real-based training give rise to classifiers of similar performance.
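The HOG + linear SVM pipeline used in these papers can be illustrated in a heavily simplified form: a single magnitude-weighted orientation histogram per image (real HOG uses many cells with block normalisation) and a perceptron standing in for the SVM. The synthetic "virtual" stripe images and every parameter below are our own toy assumptions, not the papers' setup.

```python
import numpy as np

def orientation_histogram(img, bins=9):
    """One magnitude-weighted histogram of unsigned gradient orientations,
    a drastic simplification of a full HOG descriptor."""
    gy, gx = np.gradient(img.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), np.pi)          # unsigned, in [0, pi)
    hist, _ = np.histogram(ang, bins=bins, range=(0, np.pi), weights=mag)
    return hist / (hist.sum() + 1e-9)

def train_perceptron(X, y, epochs=20, lr=0.1):
    """Tiny linear classifier (perceptron) in place of a linear SVM; y in {-1, +1}."""
    Xb = np.hstack([X, np.ones((len(X), 1))])        # append bias term
    w = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        for xi, yi in zip(Xb, y):
            if yi * (w @ xi) <= 0:                   # misclassified -> update
                w += lr * yi * xi
    return w

rng = np.random.default_rng(0)
# Synthetic rendered data: vertical-stripe vs horizontal-stripe 16x16 images,
# standing in for "pedestrian" vs "background" appearance patterns.
vert = [np.tile([0, 1] * 8, (16, 1)) + 0.05 * rng.standard_normal((16, 16))
        for _ in range(20)]
horz = [np.tile([0, 1] * 8, (16, 1)).T + 0.05 * rng.standard_normal((16, 16))
        for _ in range(20)]
X = np.array([orientation_histogram(im) for im in vert + horz])
y = np.array([1] * 20 + [-1] * 20)

w = train_perceptron(X, y)
Xb = np.hstack([X, np.ones((len(X), 1))])
acc = np.mean(np.sign(Xb @ w) == y)
```

The two classes land in different orientation bins (near 0 for vertical edges, near pi/2 for horizontal), so a linear model separates them; the papers' point is that the same kind of gradient statistics transfer from rendered to real imagery.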

MIPRCV Relevance: The prototype to be developed involves an environment through which active and online learning of a pedestrian classifier is performed. Basically, the system starts with a pedestrian classifier learnt in batch mode from some manually annotated samples. Using this classifier, the system processes a new incoming image, after which false positives and false negatives will appear. A user corrects these errors through a GUI, and the classifier is then updated by an online learning method. The process is repeated for new images until convergence. During this process, for false negatives (pedestrians classified as background), the user also provides a high-level description (e.g., looking left, frontal view). Then, not only is the specific false negative used to update the classifier, but also positive examples that agree with the high-level description. These positive examples would be generated just in time from a virtual world (video game). Thus, the first point was to demonstrate that it is possible to learn pedestrian classifiers using virtual examples and obtain state-of-the-art results when the classifier is applied to real-world images. This is the issue addressed in the two papers mentioned above.
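The interactive loop described above can be sketched as a skeleton. All helper names here (`detect`, `ask_user_corrections`, `render_virtual_positives`) are hypothetical stand-ins for the GUI and virtual-world components, and the perceptron-style update is a generic online-learning step, not the prototype's actual method.

```python
def online_update(w, x, y, lr=0.1):
    """Generic perceptron-style correction for one mislabelled sample;
    x is a feature vector, y is -1 (background) or +1 (pedestrian)."""
    return [wi + lr * y * xi for wi, xi in zip(w, x)]

def interactive_learning(w, image_stream, detect, ask_user_corrections,
                         render_virtual_positives, lr=0.1):
    """Skeleton of the loop: detect, let the user correct errors via a GUI,
    update online, and augment false negatives with just-in-time virtual
    positives matching the user's high-level description."""
    for image in image_stream:
        detections = detect(w, image)
        false_pos, false_neg = ask_user_corrections(image, detections)
        for x in false_pos:                      # background flagged as pedestrian
            w = online_update(w, x, -1, lr)
        for x, description in false_neg:         # missed pedestrian + description
            w = online_update(w, x, +1, lr)
            for xv in render_virtual_positives(description):
                w = online_update(w, xv, +1, lr) # virtual examples, e.g. "frontal view"
    return w
```

A single pass with toy callables shows the mechanics: a false positive pushes the weights away from its features, while a false negative and its rendered virtual look-alikes push them towards theirs.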
Last Updated ( Wednesday, 30 May 2012 )