3. Selected application examples of interest to the consortium

A number of applications to which the proposed MI framework applies are discussed below.

3.1. Interactive Transcription of 1-D Signals

Only the case of speech signals is examined in some detail. Other similar applications will be briefly discussed below.

State-of-the-art automatic speech transcription yields many errors with large vocabularies, spontaneous speech, etc. For high-quality transcriptions, a human operator is required to review the system's transcriptions and amend the errors. Typically, the review is performed off-line, after the system has returned a full transcription of the whole audio stream. This is both inefficient and uncomfortable for the user.

The interactive framework allows for a more effective, on-line approach: the system provides its best transcription hypothesis for a short segment of the audio stream and the user amends the first error, if any. The validated part of the transcription then becomes feedback which the system can use to improve its hypothesis in the next step. Given an audio stream `x` and a transcription prefix `p`, validated by the user in the previous step, we look for the completion, or suffix, `s` of `p` that is most probable given `x`. Formally,

`hat(s)=argmax_s Pr(s|x,p,M)`


After natural assumptions and simple manipulations [9], the posterior probability of the suffix can be decomposed into the product of two conditional probability distributions: one modelling `x` by means of standard acoustic models and the other modelling `s` by means of a prefix-conditioned language model. Formally,

`hat(s)=argmax_s Pr(x|p,s,M_(A)) Pr(s|p,M_(L))`


where `M_(A)` and `M_(L)` are, respectively, the acoustic-phonetic and the language models included in `M`. The resulting maximisation problem can be solved using techniques similar to those commonly used for standard speech recognition [9].
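As a minimal sketch of this maximisation, assume a tiny hand-made set of candidate transcriptions and made-up log-scores. The names `best_suffix`, `acoustic_logp` and `lm_logp` are hypothetical and are not taken from [9]; a real system would search a word graph rather than enumerate candidates. The search is simply restricted to hypotheses that are compatible with the validated prefix:

```python
import math

def best_suffix(candidates, prefix, acoustic_logp, lm_logp):
    """Return the suffix s of `prefix` with the highest combined log-score.

    candidates    -- full word sequences (tuples) the toy system can propose
    prefix        -- tuple of words already validated by the user
    acoustic_logp -- hypothetical stand-in for log Pr(x | p, s, M_A)
    lm_logp       -- hypothetical stand-in for log Pr(s | p, M_L)
    """
    best, best_score = None, -math.inf
    n = len(prefix)
    for words in candidates:
        if words[:n] != prefix:
            continue  # incompatible with what the user has validated
        suffix = words[n:]
        score = acoustic_logp(words) + lm_logp(prefix, suffix)
        if score > best_score:
            best, best_score = suffix, score
    return best  # None if no candidate extends the prefix

# Made-up acoustic log-scores for three competing hypotheses.
ACOUSTIC = {
    ("the", "cat", "sat"): -2.0,
    ("the", "cap", "sat"): -1.5,
    ("a", "cat", "sat"): -3.0,
}

def acoustic_logp(words):
    return ACOUSTIC[words]

def lm_logp(prefix, suffix):
    # Toy language model: every suffix word costs the same.
    return -0.1 * len(suffix)

candidates = list(ACOUSTIC)
# No feedback yet: the acoustically best (but wrong) hypothesis wins.
first = best_suffix(candidates, (), acoustic_logp, lm_logp)
# The user corrects the second word, so the validated prefix is ("the", "cat").
second = best_suffix(candidates, ("the", "cat"), acoustic_logp, lm_logp)
```

In this toy setting the first call proposes "the cap sat"; once the user has validated "the cat", only the compatible hypothesis survives and the system proposes "sat" as the suffix.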

At any given point of this process, a perfectly correct transcription is available for the part of the audio stream already processed. For large streams, this audio chunk and the corresponding validated transcription constitute very valuable data for adaptive learning. This way, `M` can be progressively tuned to the specific language and acoustic environment of the task considered.

In traditional speech recognition, performance is commonly estimated in terms of word error rate (WER). The WER is the minimum number of word edits (substitutions, insertions and deletions) needed to turn the system hypotheses into the corresponding reference transcriptions of a given labelled test corpus, divided by the number of reference words. Using the same corpus, the human interaction effort can be estimated in terms of Word Stroke Ratio (WSR), defined as the number of times a (hypothetical) user would have to interact with the system to correct one word, divided by the number of reference words, where user decisions are simulated using the reference transcriptions [9]. The comparison between WSR and WER provides an estimate of the human effort that the interactive approach would save with respect to classical automatic speech transcription followed by off-line human review. Promising estimates of effort saving are reported in our preliminary work [9].
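The two measures can be illustrated with a small simulation. This is only a sketch under stated assumptions: the toy system, its scores and the helper names (`edit_distance`, `wer`, `wsr`, `make_completer`) are hypothetical, and the simulated user always corrects the first wrong word, as described above.

```python
def edit_distance(ref, hyp):
    """Minimum number of word substitutions, insertions and deletions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,              # delete a reference word
                           cur[j - 1] + 1,           # insert a hypothesis word
                           prev[j - 1] + (r != h)))  # match or substitution
        prev = cur
    return prev[-1]

def wer(ref, hyp):
    return edit_distance(ref, hyp) / len(ref)

def wsr(ref, complete):
    """Simulate the user with the reference transcription.

    complete(prefix) is a hypothetical system call that returns a suffix
    (list of words) for the validated prefix.
    """
    prefix, corrections = [], 0
    while True:
        hyp = prefix + list(complete(prefix))
        k = len(prefix)
        # Advance to the first position where hypothesis and reference differ.
        while k < min(len(ref), len(hyp)) and hyp[k] == ref[k]:
            k += 1
        if k == len(ref) and len(hyp) == len(ref):
            break                  # hypothesis accepted as it is
        corrections += 1           # one user interaction
        if k == len(ref):
            break                  # the user only had to cut trailing words
        prefix = ref[:k + 1]       # validated prefix, corrected word included
    return corrections / len(ref)

def make_completer(scored_candidates):
    """Toy system: best-scoring candidate that extends the prefix."""
    def complete(prefix):
        n = len(prefix)
        compatible = [(s, w) for w, s in scored_candidates if w[:n] == prefix]
        if not compatible:
            return []
        return max(compatible, key=lambda t: t[0])[1][n:]
    return complete

ref = "the cat sat on the mat".split()
complete = make_completer([
    ("the cap sat on a mat".split(), 3.0),    # made-up scores
    ("the cat sat on the mat".split(), 2.0),
])
offline = complete([])                        # hypothesis without feedback
offline_wer = wer(ref, offline)               # two substitutions in six words
interactive_wsr = wsr(ref, complete)          # one correction in six words
```

Here a single correction ("cap" to "cat") lets the toy system recover the rest of the sentence, so the WSR is half the WER. The saving depends entirely on the made-up candidates and says nothing about real systems.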

Interactive Translation and Transcription of Music and Text Images

Similar ideas can be used for Computer Assisted operation in Transcription of Music [10] and Text Images [11], as well as in Machine Translation [12]. In all these cases, the simplest way for the system to get user feedback is through keyboard and mouse, but much better interaction synergy can be achieved through other modalities, such as pointer-device gestures, gaze tracking, or Speech Recognition. We have explored the latter in [7], with promising results.

3.2. Interactive Retrieval of Multimedia Content

The general problem is examined first, followed by a brief discussion of some specific applications. In classical Content Retrieval, a document collection `C` and a query document `x` are given and the system searches for a selection `hat(X)` of `N` documents from `C` which "most closely resemble" `x`; formally,

`hat(X)=argmax_(X in C^N)Pr(X|C,x)`

where documents in `C` and `x` are assumed to be represented by adequate features [17].

In many applications, a single query document is rarely informative enough and the documents retrieved following the above formulation seldom fully match the (often rather vague) user expectations. Under the interactive paradigm, the user can provide feedback to validate or otherwise express disappointment about some of the retrieved documents. This idea, sometimes called "relevance based retrieval", has recently been gaining popularity for image retrieval [14]. Following our MI framework, we are given a query document `x`, a set `X'` of documents retrieved in a previous step and some "relevance" feedback `F` about the adequacy of the documents in `X'`, and we look for the `N` documents from `C` which "are most adequate" according to the query, `x`, and the feedback, `(X',F)`. Formally,

`hat(X)=argmax_(X in C^N)Pr(X|C,x,X',F)`

As is common in the MI framework, user feedback hints are of a modality different from that of `x` or `C` (documents). A pointer device is perhaps the simplest feedback modality here, with `F={F_1,F_2,...}` and each `F_i` being a positive, neutral or negative "vote" for the `i`-th document in `X'`. Of course, these votes could be better expressed through other modalities.
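To make the role of the votes concrete, the following sketch re-ranks a toy collection after one round of feedback. It deliberately swaps the probabilistic model `Pr(X|C,x,X',F)` for a simple Rocchio-style heuristic (query similarity plus vote-weighted similarity to the documents already shown). The function names, the feature vectors and the `weight` parameter are all assumptions made for illustration only.

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def retrieve(collection, query, n, shown=(), votes=(), weight=0.5):
    """Return the names of the n best-scoring documents.

    collection -- dict mapping document name to feature vector (the set C)
    query      -- feature vector of the query document x
    shown      -- names of the documents in X' from the previous step
    votes      -- F_i in {+1, 0, -1}, one vote per document in `shown`
    weight     -- hypothetical trade-off between query and feedback
    """
    def score(vec):
        s = cosine(vec, query)
        for name, vote in zip(shown, votes):
            s += weight * vote * cosine(vec, collection[name])
        return s

    ranked = sorted(collection, key=lambda name: score(collection[name]),
                    reverse=True)
    return ranked[:n]

# Made-up two-dimensional features for four documents.
collection = {
    "a": [1.0, 0.0],
    "b": [0.9, 0.1],
    "c": [0.6, 0.8],
    "d": [0.0, 1.0],
}
query = [1.0, 0.2]

first = retrieve(collection, query, n=2)
# The user votes against both retrieved documents.
second = retrieve(collection, query, n=2, shown=first, votes=[-1, -1])
```

With these toy vectors, the first step returns the two documents closest to the query, and the two negative votes push the next selection towards the other region of the feature space.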

In our MI framework, once the feedback data have served their basic error-correcting purpose, they also become available for Adaptive Learning of the retrieval model. A trivial adaptation may consist of enlarging the original collection `C` with each user-supplied query document. More sophisticated tuning techniques are possible, however, by using the feedback data to adapt the features and/or the underlying probabilistic model.

Interactive Face Recognition and other Biometry applications

There are many uses of these technologies. Here we focus on the use of Face Recognition for computer (assisted) surveillance. In this case, a collection of face images is given and the system observes people through an adequately positioned camera. An important step is to accurately detect the faces that appear in the camera field. For each detected face, the system triggers an alarm if the face is sufficiently similar to one or more faces in the given collection [15]. The detected face image, along with those it most resembles, is then shown to a human operator.

Under the MI framework, the operator is allowed to make judgements about the adequacy of each retrieved image, thereby starting an interaction cycle which ends only when she is sure enough of the correctness of the results. Clearly, this application is formally very similar to the one discussed above and, consequently, the same MI paradigm can be applied here.

Interactive Retrieval of Images, Text and Music

Once the objects of interest are represented by adequate features [13,16], these applications formally become particular cases of Multimedia Retrieval, so the same MI ideas apply.

For all these applications, the MI framework proposed here provides an adequate and comprehensive formalisation which embodies feedback, adaptive learning and multimodal processing. This is expected to lead to a new generation of content-based retrieval and computer-assisted biometric systems with significantly improved accuracy and usability.

3.3. Interactive Image Fusion

In image restoration or reconstruction applications, the goal is to obtain an estimate of a real image from one (or more) observed image(s). In super-resolution imaging, which is a challenging research area with many real applications [18], several low-resolution images are combined to obtain a high-resolution one. Information on the original images can be provided in several forms, which leads to interesting MI problems.

In these applications, different (types of) observed images are used to improve the quality of a reconstructed image. They have to be jointly modeled through the conditional probability of the images given the hypothesis, `h`, which denotes the target image or region of interest. Modelling this conditional distribution and the corresponding prior, as well as deciding how inference should be performed, lead to interesting multimodal image fusion techniques.

In the MI framework, the user is allowed to provide additional information on, for instance, the image or region of interest (with very interesting applications to mobile devices), the relevance of the used images, and/or the model parameters themselves. This information should guide the current and future reconstruction processes in order to obtain improved target images.
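As a deliberately simple illustration of how user-supplied relevance could enter a fusion step, the sketch below combines already registered observations of the same size by a weighted average. Under the assumption of independent Gaussian noise and a flat prior, the precision-weighted mean is the maximum a posteriori estimate of `h`; scaling each precision by a user relevance value is our own illustrative assumption, and the name `fuse` and its parameters are hypothetical. It is not a super-resolution method, which would also require registration and upsampling.

```python
def fuse(observations, variances, relevance=None):
    """Fuse registered images of equal size by a weighted average.

    observations -- list of images, each a list of rows of floats
    variances    -- assumed noise variance of each observation
    relevance    -- optional user feedback, one non-negative value per image
                    (0 discards an image, 1 keeps its full weight)
    """
    if relevance is None:
        relevance = [1.0] * len(observations)
    # Precision (1 / variance) scaled by the user-supplied relevance.
    weights = [r / v for r, v in zip(relevance, variances)]
    total = sum(weights)
    if total == 0:
        raise ValueError("at least one observation must have non-zero weight")
    rows, cols = len(observations[0]), len(observations[0][0])
    return [[sum(w * img[i][j] for w, img in zip(weights, observations)) / total
             for j in range(cols)]
            for i in range(rows)]

# Two made-up 1x2 observations of the same scene.
obs = [[[10.0, 20.0]], [[14.0, 24.0]]]
automatic = fuse(obs, variances=[1.0, 1.0])
# The user marks the second image as irrelevant.
guided = fuse(obs, variances=[1.0, 1.0], relevance=[1.0, 0.0])
```

Without feedback both observations contribute equally; once the user discards the second image, the estimate falls back to the first one.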

Note that in the above problems, the goal may be not only to estimate an image `h`, given the observations and feedback information, but also to approximate the whole posterior distribution of `h`, so as to provide a probability distribution over reconstructed/restored images. In [19] we explore the use of distribution approximation methods in blind deconvolution problems, with positive results.

3.4. Medical Imaging Assisted Diagnosis

Diagnosis by means of medical images has become commonplace nowadays. Automatic diagnosis has also been studied using CV and PR technology. However, as in many other PR/CV applications, full automation is not appropriate or convenient in this case. Therefore, practical developments typically aim at enriching the given images with information (e.g., false colour) that may be helpful for diagnosis purposes [20]. This information is generally computed without any human intervention and, therefore, the system cannot benefit from possible medical expert judgements about the results. As in the previously discussed applications, the MI framework can significantly improve this situation by letting medical expert feedback guide the system's decision-making process.

3.5. Computer Assisted Document Layout Analysis

Layout analysis is a needed first step in all applications of Computer Vision technology to images containing handwritten or printed text, as well as other elements such as drawings or pictures. The system has to determine which blocks contain the different semantically relevant parts of the image [21].

In difficult practical scenarios, the layout provided by the system has to be amended by a human operator. When this process is carried out off-line, the user needs to correct every wrong detail. In the MI framework, instead, each corner corrected by the user becomes a piece of feedback information which the system can use to produce an improved layout. Multimodality plays an important role for feedback here. While the input to the Document Analysis system is a page image, user feedback can be advantageously provided by means of pointer gestures which, for improved usability, can be adequately registered over the same input image being processed.

3.6. Multimodal Dialog Systems

This is one of the most straightforward applications of PR and CV in the MI framework. In particular, public kiosks (and other mobile information access devices like PDAs and future phones) allow the user to access information and services without the help of a keyboard or mouse. In this context, the multimodal input `x` typically consists of speech, handwriting and touch screen information. The main challenge in this case is to build a multimodal language understanding system that adequately deals with the underlying modality fusion problem, as discussed in section 2.1.

Context and user information, probably provided by a CV system, can also be used to delimit the search space. Multimodal dialogue history plays here the role of the human interaction feedback described in section 2.1. The challenge is to model the complete multimodal dialogue history under the MI framework.

3.7. Cooperative Tracking and Recognition of Human Actions

There are many applications of this technology. Here we focus on tracking human behaviour from videos, such as pedestrians crossing inner-city roads or humans in indoor worlds like an airport. These domains allow us to explore a coherent evaluation of human movements and facial expressions across a wide variation of scale [22]. The challenge here is to create an MI facility which starts with basic knowledge about agent (pedestrian) behaviour in the chosen domain and progressively clusters evaluation results into semantically meaningful subsets of behaviours.

Inference and causal reasoning will be based on finding sequences of agent states that are consistent with the observation data and the agent models. Moreover, the sensing process will be tightly integrated with the reasoning process as part of a perception-action cycle: the goal is to analyze how cooperating pan-tilt-zoom sensors can enhance the process of cognition via controlled responses to uncertain or ambiguous interpretations [23]. MI provides an adequate framework for these cooperating processes.

The main procedure will entail: a) Detection and tracking of agents while they are still some distance away from a particular location; b) When these agents come closer to the camera, or when the active camera zooms in on these agents, their body posture will be evaluated to check for compatibility with the behaviour hypotheses generated so far; c) If they are even closer and their face can be resolved sufficiently well, facial emotions will be checked; and d) In order to close the loop, human interaction will be taken into account to control the system answer.

3.8. Assisted Vehicle Driving

The development of effective Advanced Driver Assistance Systems (ADAS) has recently become a major social and commercial objective. Examples of ADAS are Adaptive Cruise Control, Lane Keeping Assistance and Pedestrian Protection Systems. Sensing the vehicle surroundings for ADAS can be done by using a variety of sensors, but CCD/CMOS cameras have proven to be among the most versatile ones. CV technology is used to process the raw camera signals and extract the relevant information, and PR algorithms are used to derive adequate decisions from this information [24].

The MI framework proposed in this project will help improve many aspects of ADAS. First, interactivity is intrinsic to this application: every ADAS output is expected to elicit an immediate user reaction by means of the wheel, the brakes or other dashboard actuators. These user actions can be used as feedback signals both to directly improve ADAS decisions and to adaptively train the systems to the user for better performance in the current environment. Furthermore, multimodal technology will allow effective fusion of camera video streams, not only with the interaction feedback signals, but also with signals coming from other sensors which have proven their effectiveness for specific monitoring or control tasks.

3.9. Ubiquitous Robotics

Robotics is perhaps one of the best examples where the MI framework directly applies. Many works in this area have focused on the multimodal nature of the signals involved and/or on the human-robot interfaces. Under the proposed MI framework, both aspects are tightly integrated, along with the on-line adaptation of the machine to the task and user preferences [25].

Recent trends in robotics aim at developing robot teams; that is, groups of homogeneous or heterogeneous specialised machines which cooperate with each other to achieve common goals [26]. In this case, machine-machine interaction is to be considered, along with the human-machine interaction of the standard MI framework. In these ubiquitous-robotics environments, interaction among machines entails using specific sensor modalities, such as GPS, WiFi, Bluetooth or RFID nets and even modalities based on mobile phone technology. This raises specific challenges that will need specialised approaches to achieve effective cooperation performance. On the other hand, human-robot interaction (which perfectly fits under the MI framework) needs technologies such as computer vision, tracking and recognition of human actions, and speech recognition, along with mobile phone or other wireless communications.

