Saturday, June 18, 2016

Multimedia analysis out of the box: new applications and domains

Flickr: Tom Magliery
This blog post summarizes the panel on the third day of the 14th International Workshop on Content-Based Multimedia Indexing (CBMI 2016). It includes both statements by the panelists and comments from the audience. I was the panel moderator, and was also taking notes as people were speaking (any error in reproducing what people said here is strictly my own).

The panel was structured into three rounds roughly related to the past, present, and future of multimedia analysis research. Each round had an “opener” that the panelists were asked to respond to, and then continued in free form, with the audience also contributing.

First round: The panelists were asked to discuss, “A past vision (that you have had during the last 20 years) for a multimedia analysis application that came to be.”

The early work of GroupLens started a user revolution. It was great to have recommender systems break onto the scene. Their introduction shifted the focus of the research community, including those studying information/multimedia access, from pure computation to involving users. This shift was possible because computers could collect user interactions, providing researchers with large sets of interactions to work with. Recommender systems introduced the key idea that users can benefit from other users, and this idea has come into its own.

Historically, multimedia indexing started with spoken content indexing. (This statement carried the “footnote” that the panelists and the panel moderator all have a speech background.) In the past years, we have seen the maturation of speech and language technology. Now we are on the brink of systems that index all spoken information in multimedia. (But let’s keep breathing in the meantime.)

The panel noticed that it is easier to name past visions that still have not completely come to be. Examples were:

First person video: In the late 1990s, video life logging started. The goal was to summarize daily life, and to aid memory and remembering. Privacy is a real stumbling block for this vision. However, now we are seeing first person cameras like GoPro: so perhaps video life logging is here, but it is not exactly what we thought it was going to be.

Users: Ten years ago we were developing algorithms for applications, but there was a sense that they would never be put to use. The field of multimedia analysis is now more user centered, although not yet completely so: we are on our way. Sometimes it’s not gaining 5% MAP that makes a product usable. Instead, we need to think along different lines.

Education: The panel was in agreement that we have yet to see multimedia reach its potential as a tool for education. This could and should be the century of education!

In the early 1990s, multimedia retrieval and spoken content retrieval were intended to support education. Today, we see that education is still mainly about books. MOOCs and online learning resources are growing in popularity, but we are still waiting for multimedia indexing to really contribute to education at large scale.

We used to have the vision that kids should be able to play with information and to communicate with each other as part of studying and learning: These types of applications were fun. What happened to this kind of work? It is a shame that this hasn’t really been put to mainstream use: Is this the responsibility of the multimedia people?

Well, yes. We are all teachers in a way: Why don’t we eat our own dogfood? Looking at this conference, our presentations are all text-heavy sets of PowerPoint slides!

Why is the willingness of teachers and journalists to use multimedia tools so low? Do we need to wait until everyone in the world becomes tech-friendly to have our research put to use?

Maybe we just don’t have the tools necessary to allow multimedia indexing to come into its own in support of education. We need the tools in order to engage teachers.

We don’t have the time to do education-related research. You can’t just do a 10-minute experiment with data from 30 people: people are complex, kids are complex! We haven’t been willing to take the time to work with teachers: we haven’t had funding for a 5-10 year sustained effort in this area. But it’s a worthwhile goal.

We need to understand the nature of education. There is a relationship between student and teacher: it is a human relationship. A machine might not be able to motivate the student.

This observation about student behavior stands in contrast to the success of video games in motivating kids. Games appear to motivate kids more than their parents are able to. However, today’s games are too simplistic to be an education tool. They don’t reflect real breadth.

Final note of the first round: It seems that multimedia analysis researchers don’t talk about “killer applications” anymore. The way we see our success is more diffuse, and maybe that is also OK.

Second round: Panel members were asked to discuss “A current (widely-held) vision for a multimedia analysis application that is doomed.”

Our panelists jumped on the opportunity to be controversial.

Is lifelogging doomed?
Multimedia researchers of course love the huge amounts of data that life logging delivers. But do people really want their lives to be logged? Why would I want all of those pictures? Are we just recording without a real application?

When we are healthy and in good shape we have perhaps no reason to record our lives. But when we become older or are in a situation where we need to manage an illness, things change. In this case, lifelogging applications are tremendously interesting. For elderly people living alone, they can be a real help, although they do not replace human company.

Why don’t we see this technology being widely used? The problem is not the market. The problem is that we are not marketing or business people: we need someone else to put this technology on the market. The process for doing so is a mess! We develop nice applications, but we need to move on, and the business development never gets done.

Is virtual reality doomed?
We are not in a virtual space having a virtual conference. We are here. Virtual meeting rooms have not come to be and video conferencing fatigue is real. Virtual reality works great in games. Perhaps also in demonstrating things. But in general, augmented reality appears to be the more promising path.

Is multimedia analysis of broadcast television doomed? 
Analysis of news, sports, movies, in fact, any produced content is over. If someone can produce the content, they can also dedicate the effort to annotate it.

A less extreme version of that position is probably, however, more appropriate. When we carry out multimedia research, often produced content is the only content we have. Not every content producer has the resources to create annotations. Finally (as noted by the moderator), some types of annotations are against the business interests of people producing multimedia content: Do film producers really want audiences to have a fine-grained breakdown of the violence in a film?

The panel agreed that analysis of produced content is very important for knowledge extraction and summaries of large, heterogeneous collections. You can extract knowledge and facts: for example, the present needs a 20-minute summary.

Professionals or specific applications often need detailed summaries: There would be value, for example, in summarizing the soccer moves of a certain player in order to study them for practice or strategy purposes.

Personal content often needs summarization: parents like highlights of school games or performances that feature their own children.

Are standards doomed? 
Standards make sense for compression and communication, but standards have been pushed too far. Many researchers identify with this situation: You barely know what you’re doing and you make a standard for it. However, the activity that takes place around the production of standards gives rise to new ideas. The fact that descriptors were encoded in MPEG-7 gave rise to a lot of further work on descriptors.

Perhaps a more direct way of achieving the same effects is via reference implementations and toolkits. OpenCV is effectively, although not formally, a standard. These kinds of efforts are very important.

Third round: Panel members were asked for “A future vision for a multimedia analysis application that we should strive for.”

The opening comment was interesting and unexpected: As an early-career researcher in multimedia, one is drawn to problems that one likes, and that attract and hold one’s attention. However, as a late-career researcher, one looks back and starts to regret not having considered the contribution that one’s career was making to society.

Multimedia for medicine: Young multimedia researchers should consider “joining the doctors”: the field of medicine needs us.

Human rights: Another area with enormous potential social impact is multimedia for human rights. We need algorithms that will allow us to find evidence of violations: examples are the analysis of aerial photos to search for hidden destruction and the reconstruction of events using social media.

We need (footnote by moderator) technology that is able to verify the extent to which multimedia reflects the reality that it claims to capture and, in particular, to identify multimedia created with the intent to deceive.

Low quality content is key: Interestingly, some of the most highly socially relevant applications for multimedia involve processing some of the worst images. Multimedia researchers need to be brave enough to venture into areas where content is poor quality, difficult to obtain, and (footnote by moderator) where evaluation of success is highly challenging.

User intent: Multimedia information retrieval has recently experienced the “intent revolution”: the change from focusing on the nature of the items that users are trying to find, to the tasks that users are trying to achieve. Supporting people in their daily lives is not as obviously socially relevant as education, medical or human rights applications. However, it has an important contribution to make.

Affective computing: We look forward to multimedia systems that support us in the emotional aspects of communicating with multimedia: sharing and mutual remembering. Humans are social creatures (isolation causes us to suffer). Shared experiences allow us to build relationships, share values, and keep the connections needed for social and psychological well-being. Regretfully, current research on affect and sentiment simplifies the emotional aspects of multimedia to the extent that it may be “trivial”. We need to work towards understanding both multimedia and the mind. A key question is: What pieces need to come together in order for someone to experience the reproduction of a memory or an experience?

Hardware and energy consumption: We should not forget that multimedia analysis is possible because of the devices that capture, store and process multimedia. We are ever dependent on hardware. Processing of multimedia costs energy, and future work should also keep energy efficiency in mind.

Closing comments:
When we study multimedia, we study communicating with multimedia. Moving forward it is important to keep the human in human communication.

Is there an end to multimedia? Can we foresee that it might be replaced by something completely different?

We see multimedia as an “everlasting field” encompassing applications that have not yet been invented. However, we should continue to call it “multimedia”, because continuity of what we call it will allow us to build on the past.

Currently, we see more and more other communities doing multimedia: examples are the computer vision community and the speech and language processing community. Having a distinct identity will allow the other fields to avoid reinventing the wheel.

We saw during the first round of the panel that, looking back over the past 20 years, we did not do so well in formulating predictions that came true: the technologies that we anticipated have not achieved mainstream uptake (with a few notable exceptions). It’s not dramatic to be wrong in our predictions. However, it is important that we learn from our mistakes.

In general, we do not expect all early-career multimedia researchers to connect to socially relevant applications by “joining the doctors”. But it is good to have a larger vision. When you are writing a paper, embed your ideas within an overall picture of their potential. Embrace the larger meaning of your work and imbue multimedia research with a sense of mission.

A big thank you to our panelists and to the members of the audience who contributed to the discussion.

Panelists:
Guillaume Gravier, IRISA, France
Alexander Hauptmann, Carnegie Mellon University, USA
Bernard Merialdo, EURECOM, France

Audience contributors:
Jenny Benois Pineau, University of Bordeaux, France
Bogdan Ionescu, University Politehnica of Bucharest, Romania
Georges Quénot, LIG, France
Stéphane Marchand Maillet, University of Geneva, Switzerland
Mathias Lux, Klagenfurt University, Austria