A post that mixes the magic and delight of the holiday season with multimedia information retrieval? Let's try it and see what happens.
The past couple of weeks holiday cards have been dropping through the mail slot in the front door -- but also emails have been entering my inbox: greetings, photos, and yes, also videos. This morning it was an email with a greeting and a link to a music video "Peaceable Kingdom".
I watched the video for a while and pondered its relationship with Christmas: The music is melodious, soothing and the lyrics take the listener to the manger to make the connection with the adoring state of mind of those who gathered there the first Christmas Eve. Unexpected minor cadences highlight that this is no usual Christmas carol and invite consideration of the multiplicity of the Christmas experience -- how the holiday itself integrates traditions preceding Christianity and how, as each new group and generation reinvents it for their own spirit and needs, it will continue to develop into some future Christmas. From the perspective of the here and now, that future Christmas could seem full of sweetness, hope and light, but also distorted and distinctly pagan.
Of course, the strongest signal I get from the video is that of Margaret Atwood's dystopic visions. I haven't read The Year of the Flood, but what has been written and said about the book has fascinated and disturbed me so much that the existence of the book itself as a text seems somehow less important to me -- the setting is already so palpable that what it tells is, in a way, no longer left to be said.
In the end, maybe my personal Christmas feeling associated with the video is that it gives me a chance to spend some time feeling close to the person who sent it to me. The strength of this feeling of connection goes beyond -- indeed exists in a completely different life dimension from -- my reflections on meta-text usurping text or on the length of time that has transpired since I have sat down and read a worthwhile book not related to work.
Where is the multimedia information retrieval tie-in? Well, first, as a result of this video it has occurred to me for the umpteenth time that we need a verb other than "watch" to describe this kind of interaction with this video. It's a music video, so I am mainly listening to it and then looking at the visual stimuli. There could potentially be rather large changes in the visuals -- different pictures, different editing -- and these changes could possibly leave my watching experience largely untouched. I would argue, if I were only "watching", these elements would necessarily have a major defining impact on my experience. They don't. Here, I am rather "watch/listening", which I suppose could give us the new concept of "wistening".
There's a second tie-in as well: There is a little snowflake in the player bar, which I discovered after "wistening" for a while. I usually find snowflake icons ambiguous: especially on climate control units in strange hotel rooms -- do I turn the setting to "snowflake" if it's cold outside or is the "snowflake" setting going to cause the system to start producing cool? I've encountered both. So I've learned just to click on the snowflake and see what happens...
And lo and behold it started snowing. Right into the Peaceable Kingdom -- flakes floating down slowly -- different sorts of flakes at different speeds -- and accumulating at the bottom of the frame. I felt the smile spread on my face -- and grow wider as I realized that I was witnessing one little bit of a sort of world-wide holiday miracle as people in front of screens around the planet discover that you can make it snow on YouTube. I thought about people watching this on their laptops and tablets, using the mouse to play a bit in the snow and then gathering their friends, colleagues, family around their screens in one big Christmas "You gotta check this out!"
Apparently, you can't do this to every video: and this is where it really starts getting interesting to me. How did YouTube decide which videos to add this feature to? There must have been some multimedia classification algorithm that maybe looked for keywords in the title and description and something like music in the audio channel or colors in the visual channel and combined this with the upload date -- and then enabled "snow" for this video.
I want to make these kinds of algorithms! How do we put everything that we know how to do in terms of multimodal video processing and machine learning and figure out for which videos it needs to be able to snow?
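The kind of classifier speculated about above could be sketched roughly as follows. This is only an illustration of combining text, audio, visual and upload-date signals: the keyword list, the weights, the threshold and the two classifier scores are all assumptions made up for the example, and nothing here reflects how YouTube actually decided which videos get snow.

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical keyword list; chosen only for illustration.
HOLIDAY_KEYWORDS = {"christmas", "holiday", "snow", "winter", "carol"}


@dataclass
class Video:
    title: str
    description: str
    upload_date: date
    music_score: float         # assumed output of an audio classifier, in [0, 1]
    wintry_color_score: float  # assumed output of a visual classifier, in [0, 1]


def keyword_score(video: Video) -> float:
    """Fraction of the holiday keywords found in the title and description."""
    words = set((video.title + " " + video.description).lower().split())
    return len(words & HOLIDAY_KEYWORDS) / len(HOLIDAY_KEYWORDS)


def seasonal_score(upload_date: date) -> float:
    """1.0 for uploads in November or December, otherwise 0.0."""
    return 1.0 if upload_date.month in (11, 12) else 0.0


def snow_score(video: Video) -> float:
    """Weighted combination of the four signals (weights are illustrative)."""
    return (0.4 * keyword_score(video)
            + 0.2 * video.music_score
            + 0.2 * video.wintry_color_score
            + 0.2 * seasonal_score(video.upload_date))


def enable_snow(video: Video, threshold: float = 0.5) -> bool:
    """Decide whether to offer the snowflake button for this video."""
    return snow_score(video) >= threshold


festive = Video("Peaceable Kingdom Christmas carol", "a winter holiday song",
                date(2011, 12, 1), music_score=0.9, wintry_color_score=0.6)
summery = Video("Cat on skateboard", "summer fun",
                date(2011, 7, 4), music_score=0.1, wintry_color_score=0.1)
print(enable_snow(festive), enable_snow(summery))
```

The interesting research question is of course not the weighted sum, but where the individual scores come from and how we would know whether the decision was right.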
And it's not just snow. There are other ways in which this could go -- and should go -- it has potential to cause so much joy. I am sitting here "wistening" and thinking about friends and family and playing in the snow, but it's clear that we need to go beyond "wistening" and we need a verb for watching+listening+reflecting+playing. It's also clear that we need the technologies that support these activities. Imagine a search engine that can find videos that are appropriate for 'snow': that goes so far beyond user information needs as they are currently conceptualized for multimedia that it sort of takes your breath away.
How to enable the multimedia community to work at these new (from the perspective of this moment, utterly fantastic) frontiers?
The key to doing work in this direction is evaluating it. How do we know if we were right in presenting the snow option for a given video? YouTube is probably analyzing its interaction logs at this very moment. But I hate to think that I need to go to work for YouTube in order to ever be able to do the evaluation necessary to write a paper on this topic. Everyone loves the snow, so everyone should be able to work in order to make it better.
Note to self qua New Year's resolution: Keep up commitment to evaluation -- we need it to push ourselves forward into the unknown in a meaningful way. Maybe it's what actually makes the difference between what we call computer science and what we call art. But I'll leave that thought to another day.
In the meantime, the overall conclusion is that holidays and multimedia information retrieval do indeed mix well in a blog post. So happy holidays (and enjoy the video):
Currently, I'm working to put together the survey for MediaEval 2012. This survey will be used to decide on the tasks that will run in 2012 and also will help to gather information that we will use to further refine the MediaEval model: the set of values and organizational principles that we use to run the benchmark.
At the workshop, someone came up to me and mentioned that he had made use of the model in a different setting, 'I hope you don't mind', he said. Mind? No. Quite to the contrary. We hope that other benchmarking initiatives pick up the MediaEval model and elements of it and put them to use.
I have resolved to be more articulate about exactly what the MediaEval model is. There's no secret sauce or anything -- it's just a set of points that are important to us and that we try to observe as a community.
The MediaEval Model
The MediaEval model for benchmarking evaluation is an evolution of the classic model for an information retrieval benchmarking initiative (used by TREC, CLEF, TRECVid). It runs on a yearly cycle and culminates with a workshop where participants gather to discuss their results and plan future work.
The MediaEval model attempts to push beyond existing practice by maximizing community involvement in the benchmark. Practically, we do this by emphasizing the following points:
Tasks are chosen using a survey, which gathers the opinion of the largest possible number of community members and potential MediaEval participants for the upcoming year.
Tasks follow the same overall schedule, but individual tasks are otherwise very autonomous and are managed by the Task Organizers.
The Task Organizers are encouraged to submit runs to their own tasks, but these runs do not count in the official ranking.
The Task Organizers are supported by a group of five core participants who pledge to complete the task "come hell or high water".
Each task has an official quantitative evaluation metric, which is used to rank the algorithms of the participating teams. The task also, however, promotes qualitative measures of algorithm goodness: i.e., the extent to which an algorithm embodies a creative and promising extension of the state of the art. These qualitative measures are recognized informally by awarding a set of prizes.
In interview footage from the MediaEval 2011 workshop, I discuss the challenge of forging consensus within the community.
One of the important parts of consensus building is collecting detailed and high-coverage information at the beginning of the year about what everyone in the community (and also potential members of the community) thinks. And so I am working here, going through not only the task proposals, but also other forms of feedback we've gotten from last year (questionnaires, emails) in order to make sure that we get the appropriate questions on the survey.
It always takes so much longer than I predict -- but it's definitely worth the effort.
Tonight I have something like meta-purchase fatigue. My train back from Brussels was canceled and I went to the news stand and bought an International Herald Tribune as a consolation prize: it cost me three Euros. It contained a very interesting article entitled "Disruptions: Privacy Fades in Facebook Era". When I finally came home, I decided to re-read this online and send it to a friend.
But aaargh. The IHT site informs me that I have hit my 20 article limit for the month.
Hey! What's this 20 article limit thing about anyway? I just laid down my 3 Euros -- why can't I see the digital copy of this article?
OK. That's a bad attitude -- that's purchase fatigue, I can overcome that. I care about the content in the IHT -- it's worth something to me so maybe it's time to get an actual subscription. A few fantasies of having a paper delivered to my door in the morning (...and having the time to read it). Really, yes, let's do it. I need to support the news -- creating good news takes money.
Alas, the website is not going to let me do that: My attempts to make an impulse buy of home delivery are met with an error message "Unknown SOA error". There's the meta-purchase fatigue. You try to do the right thing -- spend your money to get something you value -- and somehow that doesn't work either.
The purchase fatigue that faces us in the future will be caused by Facebook. My prediction: In about 10 years, Facebook will start selling us back our historical posts.
Remember those pictures from that college party? Weren't they all gone? Now for a mere $29.99 Facebook will dig them out of its archive and present them to you, labeled with the names of your friends that you have forgotten and festooned with their comments.
Maybe that is the rant of a tired blogger, but otherwise it's also a darn good long term business strategy for Facebook -- if they can somehow fight the "purchase fatigue" that will arise trying to sell people back their own stuff.
At least they should circumvent meta-purchase fatigue and get the subscription service right: when I decide to shell out the cash and sign up for a subscription to get my own past delivered back to me, it would be nice if I didn't get an "Unknown SOA error".
And the whole thing distracted me from actually blogging about social multimedia sharing and privacy...
....or about the fact that Google doesn't love me and doesn't return anything useful for the query "purchase fatigue". "Purchase" is a modifier and not part of my search intent in this context, Google.
I know that not interpreting my query as an intent to purchase something is less likely to lead to ad clicks -- but please, really I'm tired of paying for stuff, humor me, really...
A strong system metaphor helps to align the needs and expectations with which a user approaches a multimedia search engine and the functionality and types of results that that search engine provides. My conviction on this point is so firm that I found myself dressed up as Alice from Alice's Adventures in Wonderland and competing as a finalist at the ACM Multimedia 2011 Grand Challenge in the Yahoo! Image Challenge.
Essentially, the story in the book runs that Alice enters, after a long fall, through a door into another world. Here, she encounters the fantastic and the unexpected, but her views are basically determined by two perspectives: one that she has when she grows to be very big and one that she has when she shrinks to be very small. The book plays with language and with logic and for this reason has a strong intellectual appeal to adults as well as holding the fascination of children.
We built a system based on this narrative, which offers users (in response to an arbitrary Flickr query) sets of unexpected yet fascinating images, created either from a "big" perspective or from a "small" perspective. The "Alice" metaphor tells the user to: (1) Expect the "big" and "small" perspectives (2) Expect a system that can be understood at two levels: as both engaging childlike curiosity and also meriting serious intellectual attention due to the way in which it uses language and statistics (3) Expect a system that will need a little bit of patience since the images appear a bit slowly (we're processing a flood of live Flickr results in the background), like the fading in of the Cheshire Cat.
The Grand Challenge requires participants to present their idea in exactly three minutes in a presentation that addresses the following points:
What is your solution? Which challenge does it address? How does it address the challenge?
Does your solution work? Is there evidence that it works?
Can you demo the system?
Is the solution generalizable to other problems? What are the limits of your approach?
Can other people reproduce your results? How?
Did the audience and the jury understand and ENJOY your presentation?
We used the three minutes to cover these points in a dialogue between Alice and Christoph Kofler (CK), first author on the Grand Challenge paper:
Kofler, C., Larson, M., Hanjalic, A. Alice's Worlds of Wonder: Exploiting Tags to Understand Images in terms of Size and Scale. ACM Multimedia 2011, Grand Challenge paper.
During the dialogue we demonstrated the system running live (we knew it was a risk to run a live demo, but luck was with us and the wireless network held up).
Alice's Worlds of Wonder: Three Minute Dialogue
(showing a rather standard opening slide) CK: Alice, look at them out there, their image search experience is dry and boring.
Alice: We should show them our answer to the Yahoo! Image Challenge on Novel Image Understanding.
(showing system interface) CK: The Wonderlands system runs on top of Flickr and sorts search results for the user at search time.
(dialogue during live demo)
Alice: Let’s show them how it works. Do we trust the wireless network?
CK: Yes. We need a Flickr query.
Alice: Let’s do “car”.
CK: The Wonderlands system presents the user with the choice to enter “Alice’s Small World” or “Alice’s Big World”.
Alice: Let’s choose Small World.
Alice (to audience): If you know me in "Alice in Wonderland", you know that in the story I shrink to become very small. This is the metaphor underlying the Small World of the Wonderlands system. It shrinks you, too, as a Flickr user, by putting you eye-to-eye with small objects pictured in small environments with limited range. You get the impression you have the perspective of a small being viewing the world from down low.
Still Alice: (to CK) Let’s choose Big World now. In the book, I also grow to be very big. The Big World makes you grow again. Objects are large and the perspective is broad.
You can imagine cases in which you were looking for person-sized cars --- here, the Big World would help you focus your search on the images that you really want.
CK: Should we explain how it works?
CK: (Displays "Implicit Physics of Language" slide) We exploit a combination of user tags and the implicit physics of language.
Alice: Basically, your search engine knows something about the physics of the real world because it indexes large amounts of human language.
Certain queries give you the real-world size of objects: “the flower in her hand” returns a large number of results, so you can infer that a flower is small.
CK: Oh yes! And “the factory in her hand” returns no results so you know a factory is large.
Alice: Basically, the search engine is telling us that a girl holding a flower in her hand is a common situation, but that her holding a factory is not. We get this effect because physics dictates that something commonly held in a human hand must be small.
CK: (Displays the entry window with the two doors) The sorting algorithm is straightforward. Alice’s Small World contains images whose tags tend to designate smaller objects and Alice’s Big World contains images whose tags tend to designate larger objects.
CK: So Alice, the system takes a fanciful and engaging perspective. But in order to carry out quantitative evaluation we can look at it in terms of scale. We achieve a weighted precision nearly three times random chance. (Flash up under the two doors "Evaluation on 1,633 Flickr images from MIRFLICKR data set. 0.773 weighted precision")
Alice: So the scale numbers point to the conclusion that we are creating a genuine two-worlds experience for users.
CK: Right. But, Alice, do we need to stop at two worlds: big and small? Are there other worlds out there?
Alice: Well, Christoph, effectively the only limit is the speed at which we can query Flickr and Yahoo!. You know that the implicit physics of language works because of general physical principles. So, in theory, there are as many different worlds as there are interesting physical properties.
CK: But being Alice, you like the small and the big worlds, right?
Alice: Yes, I do. Shall we try another query?
CK: (Display final slide) Or we can just tell them where to download the system. You know, the code's online.
Alice: Yes, let them try it out! No more dry and boring image search for this group...(TIME UP!!)
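For readers who prefer code to dialogue, the "implicit physics of language" idea can be sketched in a few lines. This is a toy illustration and not the system from the paper: the hit counts are invented, `fake_hit_count` is a hypothetical stand-in for querying a real search engine, and the threshold and majority-vote rule are assumptions made for the example.

```python
from typing import Callable, Dict, List

# Invented hit counts standing in for real search engine results.
FAKE_HIT_COUNTS: Dict[str, int] = {
    "the flower in her hand": 120000,
    "the key in her hand": 95000,
    "the factory in her hand": 0,
    "the mountain in her hand": 3,
}


def fake_hit_count(query: str) -> int:
    """Hypothetical stand-in for asking a search engine how many results a query has."""
    return FAKE_HIT_COUNTS.get(query, 0)


def is_small(tag: str,
             hit_count: Callable[[str], int] = fake_hit_count,
             min_hits: int = 1000) -> bool:
    """A tag names a small object if 'the <tag> in her hand' is a common phrase."""
    return hit_count(f"the {tag} in her hand") >= min_hits


def small_fraction(tags: List[str]) -> float:
    """Fraction of an image's tags that designate small objects."""
    if not tags:
        return 0.5  # no evidence either way
    return sum(is_small(tag) for tag in tags) / len(tags)


def assign_world(tags: List[str]) -> str:
    """Send an image to the Small World or the Big World based on its tags."""
    return "small" if small_fraction(tags) >= 0.5 else "big"


print(assign_world(["flower", "key"]))        # tags of small things
print(assign_world(["factory", "mountain"]))  # tags of large things
```

Swapping "in her hand" for another pattern is what would open the door to other worlds: each physical property needs its own telltale phrase.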
This Halloween I just kept on noticing what I am calling "affect pumpkins". These are jack-o-lantern faces labeled with emotion words. Jack-o-lanterns and decorations (such as the ones in this image) that depict jack-o-lanterns are typical for celebrations of Halloween.
I don't remember having my jack-o-lanterns labeled with adjectives when I was a child, so I am rather curious about this phenomenon and have been observing it a bit. Apparently, the activity of giving jack-o-lanterns emotion words is quite fun and is, all in all, a harmonious process, characterized by a lack of disagreement or other inter-personal strife. If you have a happy jack-o-lantern, there appears to be a high degree of consensus about the applicability of the label 'happy'.
I contrast this smooth and fun pumpkin labeling procedure with the disagreement in the multimedia community that has apparently developed into full-fledged distaste for what are referred to as "subjective user tags", tags that express feelings or personal perspectives. Such tags have been referred to as "imprecise and meaningless" in Liu et al. 2009 published at WWW (page 351) and my impression is that many, many researchers agree with this point of view. In the authors' defense, had they used what I feel is the more appropriate formulation of "imprecise and meaningless with respect to a certain subset of multimedia retrieval tasks", the community would still probably be on a rampage against personal and affective tags.
Sometimes it seems everyone has simply made this spontaneous decision to take up arms against the insight of Rosalind Picard, who in 1995 wrote, "Although affective annotations, like content annotations, will not be universal, they will still help reduce time searching for the 'right scene.' Both types of annotation are potentially powerful; we should be exploring them in digital audio and visual libraries." (from "TR 321" p. 11). Do we have a huge case of sour grapes? Have we decided that we have irreversibly failed over the past 15+ years to exploit affective image labels and are therefore now deciding that we should never have considered them potentially interesting in the first place?
Oh, I hope not. Just look at this wall and think about all the walls like this, all the jack-o-lantern pictures that were created this Halloween and posted to the Internet. There are too many pictures of Halloween pumpkins out there for us to be able to afford to overlook the chance to organize them by affect. Of course, some people might hold that this silly pumpkin should actually also be considered a happy pumpkin: We can anticipate some disagreement. However, it is important to keep two points in mind: (1) Labels that are ostensibly 'objective' and have nothing to do with affect are also subject to lack of consensus on their applicability, e.g., the ambiguity on whether a depicted object is a 'pumpkin' or a 'jack-o-lantern' discussed in my previous post. (2) Even if we do not agree on the exact affective label, we do have intuitions about the fact that we do not agree and about other possible interpretations. For example, someone who insists on 'silly' will also admit that someone else might consider this pumpkin 'happy', but that it would be less likely for anyone to find 'sad' the most appropriate label.
Interestingly, in my observations, I have seen that the emotion words used to describe a jack-o-lantern seem to be chosen from one of two perspectives: Depicted in the image above are "pumpkin perspective" emotion words ('happy', 'silly', 'sad' and 'mad') which designate the emotion being experienced by the jack-o-lantern that explains the jack-o-lantern's expression. In the picture book page in the image from my previous post there is a mixture of this "pumpkin perspective" with a "people perspective". The book reads, "We'll make our jack-o-lanterns--it might be messy, but it's fun!" and then asks "Will yours be scary?" A jack-o-lantern is scary if it causes fear from the perspective of people looking at it. And then it goes on to ask "Happy? Sad?" which are "pumpkin perspective" words. And finally "A sweet or silly one?". Other perspectives are also possible: the affect label could reflect what the carver of the jack-o-lantern intended to achieve by making the pumpkin.
In my own work, I tend to insist on the importance of distinguishing these different perspectives, with the idea that if the underlying model of affect is complete and sound, it will provide a more stable foundation for building a system of annotation. However, in practical use, the affect labels don't need to distinguish the experiencer or understand the principle of empathic sympathy: we simply know a happy pumpkin when we see one and that of course makes us a little happy ourselves.
Dong Liu, Xian-Sheng Hua, Linjun Yang, Meng Wang, and Hong-Jiang Zhang. 2009. Tag ranking. In Proceedings of the 18th international conference on World wide web (WWW '09). ACM, New York, NY, USA, 351-360.
Rosalind W. Picard, Affective computing, MIT, Media Laboratory Perceptual Computing Section Technical Report 321, November 1995.
Wittgenstein conceives of human language as an activity consisting of language games, that are related, but different. One of these games is the game that we play when we read picture books to kids. We point at images and name them. The kids are then supposed to gradually acquire this pointing and naming behavior. We generally happily consider the children to be acquiring human language during these sessions. However, if we apply our Wittgenstein, what we are doing is teaching kids how to play the "naming game". We notice this because two minutes later the young child is furiously indicating that it doesn't want to do something, whereby the concept "no" is being actively used. The concept of "no" or "no, I don't want" (we recognize while delicately shoving small, flailing hands into sweater arms) is not depictable as a nameable entity in a picture book. We're still using language of some sort, but we've switched to another, possibly more important game.
As multimedia retrieval researchers we generally fall into the same trap when developing multimedia retrieval indexing systems. We get the systems to annotate depictable visual concepts and somehow forget that this is only one "language game" in the whole gamut of different games that humans use when they use language. The point is an important one. Visual content based retrieval systems are in their infancy. We, as, well, a species, are currently negotiating a system of conventions, of game moves as it were, that determine how we interact with these systems.
The danger is: if we start out by making very narrow assumptions about what people could possibly be looking for when they look for images and video, the conventions of interacting with video search engines will become calcified into a very simplistic game. We'll be stuck in the picture book phase of multimedia retrieval childhood forever.
Actually, this Halloween I encountered a picture book that suggests that even picture books are trying to pop out of the "naming game". This one has a page with a picture of kids making jack-o-lanterns and an orange box asking the questions: "How many orange pumpkins can you count?" and "How many are jack-o-lanterns?"
Well, ahem. When does something stop being a pumpkin and become a jack-o-lantern? When you cut off the top? When you've fully emptied the inside? When you cut the first eye or when you have popped out the final piece around the teeth to complete the grin?
How about those jack-o-lanterns that have been drawn on the chalk board? Are those jack-o-lanterns or are they pictures of jack-o-lanterns? And maybe a jack-o-lantern actually still counts as a pumpkin if it was made from a pumpkin in the first place?
In short, it is impossible to give a unique answer to the questions that this book is asking. We can either think that the people at Fisher-Price are corrupting our youth, or we can realize: kids don't need to have books that depict things that are uniquely identifiable. There is simply a huge ambiguity as to what exactly is a pumpkin and what is a jack-o-lantern. We can extend the 'naming-game' with this ambiguity and it is still truly a part of our human language. We don't need to (and generally do not) resolve ambiguity in order to use language effectively. The page of this book is not some sort of obscure philosophical exception: this is a situation that is frequent and highly characteristic of the situations we deal with on a daily basis.
Fisher-Price apparently now thinks that kids' books should no longer protect them against ambiguity in language. We shouldn't "baby" our multimedia systems either: Rather we should let them play as large and complex a language game as they can possibly handle: as large as technically possible and as users find helpful and interesting.
The next post makes another related point about this picture book...
LSCOM stands for "Large Scale Concept Ontology for Multimedia" and it is a list of concepts associated with multimedia, including images and videos. If you were to ask me where I stand with the LSCOM concept list, I am a 2753-Solid_Tangible_Thing kind of a multimedia researcher and not a 125-Airplane_Flying kind of a multimedia researcher.
Basically, what I mean is that I adhere to the perspective that in order to solve the general problem of multimedia information retrieval on the Web, we should make use of basic properties of objects depicted in images and video, rather than their specific identities. I have discussed the issue previously in a post on proto-semantics, dimensions of meaning that arise from human perceptions and interactions with the world. Proto-semantic dimensions are more fundamental than the words that we usually use to describe the world around us, and for that reason, they can be considered to be sub-lexical. For example, I am drinking coffee from a mug, but more fundamentally this is a small, corporeal object, or, if we pick something from LSCOM, 1425-Concave_Tangible_Object. I return to the issue here, since I've been pondering it again on the occasion of Halloween.
It seems that the way that scientists approach the problem of visual indexing, i.e., automatically describing the visual content of images and videos, is always inextricably related to their backgrounds. I've worked in the area of multimedia retrieval for going on 12 years now, and in my experience two main backgrounds dominate the field: surveillance and cultural heritage. Let me say a few words about both.
Surveillance: The analysis of surveillance footage or images captured by security cameras is aimed at the task of automatically identifying threat levels. For surveillance tasks, one defines a closed set of objects and behaviors that constitute "business as usual" and anything outside of that range can be considered a threat and triggers an alarm calling for the intervention of human intelligence. Surveillance is a high recall task -- meaning that it is more important not to miss any events than to reduce the rate of false alarms. This background doesn't quite transfer to the general problem of multimedia retrieval on the Web.
We can't assume that Web multimedia will depict a closed class of objects. The cases that cannot be covered by a closed class are not infrequently occurring "threats", but rather entities drawn from the long tail: which, if we can indeed assume a finite inventory, will contain approximately half of the encountered entities. Further, Web multimedia retrieval is typically a precision oriented problem, which means that reducing false alarms is relatively more important than exhaustive detection.
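To make the recall versus precision contrast concrete, here is a small worked sketch. The detector scores and labels are invented for illustration; the only point is that the same detector can be tuned toward recall (the surveillance setting) or toward precision (the Web retrieval setting) by moving its decision threshold.

```python
from typing import List, Tuple

# Invented (score, is_relevant) pairs from some hypothetical detector.
SCORED_ITEMS: List[Tuple[float, bool]] = [
    (0.9, True), (0.8, True), (0.7, False), (0.6, True),
    (0.4, False), (0.3, True), (0.2, False), (0.1, False),
]


def precision_recall(items: List[Tuple[float, bool]],
                     threshold: float) -> Tuple[float, float]:
    """Precision and recall when everything scoring >= threshold is returned."""
    tp = sum(1 for score, relevant in items if score >= threshold and relevant)
    fp = sum(1 for score, relevant in items if score >= threshold and not relevant)
    fn = sum(1 for score, relevant in items if score < threshold and relevant)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Low threshold: nothing is missed, but there are false alarms.
print(precision_recall(SCORED_ITEMS, threshold=0.25))
# High threshold: no false alarms, but half of the relevant items are missed.
print(precision_recall(SCORED_ITEMS, threshold=0.75))
```

With the low threshold, recall is perfect at the cost of two false alarms out of six returned items; with the high threshold, everything returned is relevant but two of the four relevant items are never shown.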
Cultural heritage: Iconographic classification of visual art involves a classification system such as Iconclass. The stated purpose of Iconclass is the description and retrieval of subjects represented in images. I rather suspect that before the very first paint had dried on the very first canvas, next to the artist was standing an art historian who started to create a classification system to categorize the painting. In other words, using classification systems for visual art is an old idea that has well-established conventions and has been honed over generations of use. Such a classification system necessarily views works of art as physical objects, and would have as its goal the task of organizing the storage facility of a museum or of helping to choose which works to hang together in an exhibition. The people who created it assumed that the number of dimensions of similarity between works of art was necessarily finite. Such an assumption makes sense, in light of a relatively small number of art historians working on a relatively small number of questions concerning art history and the iconography of art.
Enter, however, the Web. Images and video are not physical objects and we do not have to be able to list them all in a well ordered list or even ever make the decision of "Do we hang this in the East Wing gallery or the West Wing gallery?" There are many more users than art historians, and suddenly it might actually be useful to admit the possibility that the number of ways to compare two images might in fact be infinite, rather than finite.
As for myself, I fall into neither the surveillance nor the cultural heritage category. I attribute this to what's probably a naive equation of surveillance with totalitarian states and also to having the yearly experience in grade school of being packed on a bus and shipped off for a day at the Art Institute of Chicago.
I guess the Art Institute of Chicago was supposed to have broadened the horizons of our young minds, but instead it sort of warped me in a way that makes it difficult to talk to me, if you are a cultural heritage person or an art historian. I was young enough that everything I drew sort of came out flattish, whether I intended it to look two-dimensional or not, when I was suddenly confronted with the likes of Mark Rothko. I think what happened is that someone in Chicago told me that Mark Rothko described his work as an “elimination of all obstacles between the painter and the idea, between the idea and the observer” (as quoted on this AIC webpage describing the Rothko painting above). At the time, I didn't particularly like Rothko, but the experience permanently hardened my mind to the idea that it made any sense whatsoever to describe visual art in terms of its depicted subject.
I think that Mark Rothko must fit into Iconclass categorization "0 Abstract, Non-representational Art: 22C4 colours, pigments, and paint", which is unsatisfactory to me because it makes him seem like an afterthought. In Chicago, they apparently forgot to mention that he was reacting to what came before him. For me, I was already broken. A system that put Rothko on the outside rather than at its core could never be acceptable to me. From then until always: the main point of art is what we do with it: how we talk about it, how we stand before it and mull in the museum, which prints we buy in the shop and go home and hang on our walls and (as little as we like to admit it) how much we pay for it. A priori we don't know what draws us to art, so why should we make little lists of entities corresponding to its subjects?
The perspective I take may not ultimately prove more productive than either the surveillance perspective or the cultural heritage perspective. It is the linguistics perspective. My view is the following: the elements of meaning arising from human perception and interaction with the world that have been encoded into human language semantics, these are the elements that we should try to dig out of videos and images. They are the lowest common denominator of meaning that we can be sure will give us the ability to cover all human queries: the ones that we can anticipate and the ones that we cannot.
So should the image above be given the LSCOM category 2753-Solid_Tangible_Thing ? Sure. It's an image of a painting. That's a tangible object. But let's also let the image be found by shape and color. And be found how I found it on the Internet: with the query "Rothko". And let it also be found when we search for formative experiences. And for Chicago...
Being educated in the US and being a scientist in Europe is sometimes quite tough. I need to continuously use a sort of filter that tells me that although I am hearing X, I need to pause and carefully consider and realize that the person is really saying Y. One particularly painful example was unfortunately provided by our rector magnificus, the president of our university, in a recent interview. In promoting a new program to attract female scientists to the TU Delft, he said '...vrouwelijke wetenschappers zijn minstens zo talentvol als mannelijke wetenschappers.' which translates into English as 'female scientists are at least as talented as their male counterparts'. Ouch.
This statement does not work in the US academic context, because it fails gender symmetry. Gender symmetry can be diagnosed with the following test: flip the polarity of gender terms (e.g., 'woman', 'man', 'male', 'female') in a statement, and determine whether the resulting statement retains meaning within the context.
Let's try it. Flipping polarity of gender terms in his sentence yields, '...male scientists are at least as talented as their female counterparts'. This sentence is clearly interpretable, but no longer has a meaning that fits the context.
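The flipping step itself is mechanical, so it can be sketched in a few lines of Python. This is only an illustration: the swap table and the function name `flip_gender_terms` are assumptions made for this example, and the second half of the test, judging whether the flipped sentence still fits the context, remains a job for a human reader.

```python
import re

# Hypothetical swap table for illustration; a real one would need many more terms.
SWAPS = {
    "male": "female", "female": "male",
    "man": "woman", "woman": "man",
    "men": "women", "women": "men",
}

# Match whole words only, so "male" inside "female" is never matched on its own.
_PATTERN = re.compile(r"\b(" + "|".join(SWAPS) + r")\b", re.IGNORECASE)


def flip_gender_terms(sentence):
    """Swap each gender term for its counterpart, keeping an initial capital."""
    def swap(match):
        word = match.group(0)
        flipped = SWAPS[word.lower()]
        return flipped.capitalize() if word[0].isupper() else flipped
    return _PATTERN.sub(swap, sentence)


original = "female scientists are at least as talented as their male counterparts"
print(flip_gender_terms(original))
```

Applying the function twice gives back the original sentence, which is a quick sanity check that the swap table is itself symmetric.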
Contrast that with an alternate sentence such as: 'There is no discrepancy in talent between male and female scientists'. This sentence has the same declarative content, but it passes the gender symmetry test because you can substitute it with 'There is no discrepancy in talent between female and male scientists'.
Of course, in this case, a further problem arises. This sentence has the implicature that there is some reason for which this fact needs to be asserted in the first place. The act of pronouncing this sentence communicates that the speaker does not consider the point to be completely obvious, but rather feels that it needs to be explicitly asserted. One might choose against even this alternative sentence in order to avoid sending the message that one feels that there is someone out there that still needs to be convinced on the point of talent equivalence between male and female scientists. But on the whole, this alternative could be considered the 'best practices' formulation, should one indeed find oneself in a situation where it was necessary to make a statement comparing the relative scientific talent of men and women.
What my filter tells me is that although X was said in this case, what was meant is Y. And concerning Y, I rather suspect that our rector magnificus harbors the personal opinion that women have perhaps even a teensy bit more science talent than men and that in fact he is saying, "at least as (if not more) qualified". Whether or not that is true, it's safe to say that he is of the opinion that our university would, at this point in time, benefit from hiring additional women.
One of the research topics that I am interested in as a multimedia retrieval scientist is developing algorithms for the retrieval of jump-in points (JIPs) in video. JIPs allow the viewer to click directly to a certain relevant point in a video. On YouTube, they are called deep links. JIPs make it possible to share or to comment about particular points of a video, just as I am currently doing with this post. The deep link to the relevant section of the interview under discussion is the following:
The current status of technology on the Web is that it is possible to comment on JIPs or share them, but search engines don't return them as results. Together with colleagues within the Netherlands and across Europe I am developing and helping to promote the development of JIP retrieval in the MediaEval Rich Speech Retrieval task (see the feature on MediaEval 2011 in MMRecords for a brief description.) Such technology would allow search engines to return pointers to specific time points within video that are relevant to user queries.
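To give a feel for what returning a jump-in point means, here is a deliberately naive Python sketch. It assumes the video has already been cut into segments with a start time and a speech transcript, and it ranks segments by simple word overlap with the query. The `Segment` class, the function name and the example transcripts are all invented for illustration; this is not the method used in the Rich Speech Retrieval task.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start_seconds: int  # where the segment begins in the video
    transcript: str     # speech transcript of the segment (assumed to be given)


def rank_jump_in_points(query, segments, top_k=3):
    """Return start times of the segments that share the most words with the query."""
    query_terms = set(query.lower().split())
    scored = []
    for seg in segments:
        overlap = len(query_terms & set(seg.transcript.lower().split()))
        if overlap > 0:
            scored.append((overlap, seg.start_seconds))
    # Highest overlap first; ties are broken by the earlier start time.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [start for _, start in scored[:top_k]]


segments = [
    Segment(0, "welcome to the interview with the rector"),
    Segment(95, "a new program to attract female scientists"),
    Segment(210, "female scientists and talent at the university"),
]
print(rank_jump_in_points("female scientists talent", segments))
```

A search engine built this way would hand back time offsets rather than whole videos, and each offset could then be turned into a deep link by whatever mechanism the video platform offers.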
At the end of the day, I am more interested in the scientific questions raised by the task of JIP multimedia retrieval than I am in the gender issue. Since grade school, I have frequently been the "only girl" involved in whatever activity fascinated me. You don't know it any other way, so you don't really notice. I contribute what I can to the discourse on promoting gender balance, not so much because of myself, but because I find it wasteful if I feel that women who I am mentoring are somehow holding themselves back.
When I first came to Delft, I contributed the following comment on improving the working climate at the Faculty of Electrical Engineering, Mathematics and Computer Science (EEMCS). This is the point of view that I still stand by so I include it here to complete my comment on the deep link.
Response on the 2009 Challenging Gender survey:
The way to improve the working climate at EEMCS would be to address the gender imbalance within a larger program of promoting diversity in the Faculty of EEMCS. A faculty that includes international scientists addressing multi- and trans-disciplinary questions is automatically going to be more comfortable for women, since gender differences become just one of many differences of background and perspective that make the faculty richer and more productive.
Any effort invested in promoting inclusion of scientists/researchers that have pursued non-traditional career tracks (e.g., completing their PhD at an older or younger age, taking time off, switching disciplines mid-career) will automatically make women feel more welcome. When women feel welcome, they will also feel confident that the effort that they invest will be rewarded by a long and productive career in the EEMCS, establishing a virtuous cycle.
Everyone benefits from the promotion of diversity. For example, in this kind of climate, a researcher who has worked in the faculty for years will feel more comfortable about taking the risk of investigating a new class of algorithms or applying expertise accumulated in one domain to solving a problem in a radically different domain.
Positive side-effect: If everyone benefits, then women will not be burdened by the (perceived) need to fight the prejudice that they have been hired due to their gender and not due to their competence.
By promoting diversity, both in terms of scientific expertise and also in terms of other characteristics (cultural, religious, linguistic, socio-economic, sexual orientation as well as gender), the faculty will draw on a larger pool of talent and increase its productivity and capacity for creation and invention.
Working at TU-Delft, you see "Challenge the future" written everywhere...sometimes in unexpected places. As a woman this speaks to me in a special way: it says that the future at the TU-Delft is not set up to be a carbon copy of the past. Because of the "challenge the future" attitude, I have confidence that the demographics of my department will shift naturally as the Faculty of EEMCS continues to mature, extend and innovate scientifically.
Today, in Torino, Italy, was the day of the Search Computing and Social Media Workshop organized by Chorus+, Glocal and PetaMedia. Being the PetaMedia organizer, I had the honor of opening the workshop with a few words. I tried to set the tone by making the point that information is inherently social, being created by people, for people. Digital media simply extends the reach of information, letting us exchange with others and with ourselves over the constraints of time and space.
The panel at the end of the day looped back around to this idea to discuss the human factor in search computing. We collected points from the workshop participants on pieces of paper to provide the basis for group discussion. I made some notes about how this discussion unrolled. I'm recording them here while they are still fresh in my head.
We started by tackling a big, unsolved issue: Privacy. The point was made that the very reason why social media even exists is that people seem driven in some way to give up their privacy, share things about themselves that no one would know unless they were revealed. Whether or not users do or should compromise their own privacy by sharing personal media was noted to depend on the situation. For some people it's simply, obviously the right thing to do. Concerns were raised about people not knowing the consequences: maybe effectively I am a totally different person five years from now than I am now. But I am still followed by the consequences of today's sharing habits. In the end, the point was made that if the willingness among users to share stops, we as social media researchers have not much else left to examine.
Next we moved to the question of events in social media: Humans don't agree about what constitutes an event. Wouldn't it be easier to just adopt as our idea of an event whatever our automatic methods tell us is an event? Effectively we do this anyway. We have no universal definition of an event. There may be some common understanding or conventions within a community that define what an event is. However, these do not necessarily involve widespread consensus: they may be personal and they may evolve with time. For example, the event of "freedom"? Most people agreed that freedom was not an event.
An event is a context. That's it. At the root of things, there are no events. Instead, we use concepts to build from meaning to situational meaning -- to the interpretation of the meaning of the context. Via this interpretation, the impression of event emerges. In the end, meaning is negotiated.
If we say events are nothing, we wouldn't be able to recognize them. Or does the computer simply play a role in the negotiation game? The systems we build "teach" us their language and we adapt ourselves to their limitations and to the interpretative opportunities that they offer.
Then the question came up about the problems that we choose to tackle as researchers. "Are we hunting turtles because we can't catch hares?" This bothered me a bit, because assuming you can easily catch a turtle, they are quite difficult to kill because of the shell. The hare would be easier. Do our data sets really allow us to tackle "the problem"? The question presupposes that we know what "the problem" is, which may be the same as solving the problem in the first place. Maybe if we can offer the user in a given context enough results that are good enough, they will be able to pick the one that solves "the problem". Perhaps that's all there is to it. Under such an interpretation, the human factor becomes an integral part of the search problem.
In the end, a clear voice with a succinct take home message: How can we efficiently combine both the human factor and technology approaches? "The machine can propose and the user can decide."
The discussion ended naturally with a Tim Berners-Lee quote, reminding us of the original social intent underlying the Web. We adjourned for some more social networking among ourselves, reassuring ourselves that as long as we were still asking the question we shouldn't expect to find ourselves completely off track.
The 2011 season of the MediaEval benchmark culminated with the MediaEval 2011 workshop that was held 1-2 September in Pisa, Italy at Santa Croce in Fossabanda. The workshop was an official satellite event of Interspeech 2011.
For me, it was an amazing experience. So many people worked so hard to organize the tasks, to develop algorithms and also to write their working notes papers and prepare their workshop presentations. I ran around like crazy worrying about logistics details, but every time I stopped for a moment I was immediately caught up in amazement of learning something new. Or of realizing that someone had pushed a step further on an issue where I had been blocked in my own thinking. There's a real sense of traction -- the wheels are connected with the road and we are moving forward.
I make lists of points that are designed to fit on a PowerPoint slide and to succinctly convey what MediaEval actually is. My most recent version of this slide states that MediaEval is:
...a multimedia benchmarking initiative.
...evaluates new algorithms for multimedia access and retrieval.
...emphasizes the "multi" in multimedia: speech, audio, visual content, tags, users, context.
...innovates new tasks and techniques focusing on the human and social aspects of multimedia content.
...is open for participation from the research community.
I make these lists and they capture the external reality of what we do, but actually I have no real understanding of how MediaEval works -- of how exactly the traction arises.
At the workshop I attempted to explain it with a bunch of circles drawn on a flip chart (image above). The circles represent people and/or teams in the community. A year of MediaEval consists of a set of relatively autonomous tasks, each with their own organizers. Starting in 2011, we also required that each task have five core participants who commit to crossing the finishing line on the tasks. Effectively, the core participants started playing the role of "sub-organizers", supporting the organizers by doing things like beta testing evaluation scripts.
This set up served to distribute the work and the responsibility over an even wider base of the MediaEval community. Although I do not know exactly how MediaEval works, I have the impression that this distribution is a key factor. I am interested to see how this configuration develops further next year.
MediaEval has the ambitious aim of quantitatively evaluating algorithms that have been developed at different research sites. We would like to determine the most effective methods for approaching multimedia access and retrieval tasks. At the same time, we would like to retain other information about our experience. It is critical that we do not reduce a year of a MediaEval task to a pair (winner, score). Rather, we would like to know which new approaches show promise. We would like to know this independently of whether they are already far enough along in order to show improvement in a quantitative evaluation score. In this way, we hope that our benchmark will encourage and not repress innovation.
I turned from trying to understand MediaEval as a whole to trying to understand what I do. Among all the circles on this flip chart, I am one of the circles. I am a task organizer, a participant (time permitting) and also play a global glue function: coordinating the logistics.
The MediaEval 2012 season kicks off with one of the largest logistics tasks: collecting people's proposals for new MediaEval tasks, making sure that they include all the necessary information and a good set of sub-questions, and getting them packed into the MediaEval survey. It is on the basis of this survey that we decide the tasks that will run in the next year. We use the experience, knowledge and preferences of the community in order to select the most interesting, most viable tasks to run in the next year and also to decide on some of the details of their design.
Five years ago, if someone told me I would be editing surveys for the sake of advancing science, I would have said they were crazy. Oh, I guess I also ordered the "mediaeval multimedia benchmark" T-Shirts. That's just what my little circle in the network does.
Let's keep moving forward and find out where our traction lets us go.
I got an email this morning from someone close to me, S., whose colleague, C., had sent them a message with a Dutch translation of these directions on how to turn the social advertising off. S. declared happily, "The community is really strong". LinkedIn pulled a now-classic social network move and the community moves to push back against it. If there wasn't a name for it already, we can now conveniently refer to it as a SlippedIn.
SlippedIn or slipped up? The fact that this changed behind my back really makes me angry at LinkedIn: Are they going to lose their community?
Well, no. Because actually in sending this mail C. is engaging, probably without her conscious knowledge, in the ultimate form of social advertising. By alerting us to the problem and letting us know how to fix it, C. is mediating between LinkedIn and the community that uses the LinkedIn platform. She is making it possible for all of us to be really p.o.ed at LinkedIn, but still not leave the LinkedIn network because we have the feeling that our community itself has created the solution that keeps us in control of our personal information.
S.'s attitude "The community is really strong" is natural. Because C. caught this feature being slipped in and let us know how to fight it, we now have the impression that we somehow have the power to band together and resist the erosion of the functionality that we signed up for when we joined LinkedIn. C.'s actions give us the impression that, although what LinkedIn did is not ok, LinkedIn is still a tolerable place to social network because we have friends there, and that we are in control and can work it out together.
C. has really been used. She is unwittingly broadcasting in her social circle a sense of security that everything will be all right. We completely overlook the point that we have no idea of what goes on behind the scenes that might go unnoticed by C. or the other C.-like people in the network. We are given the false impression that, whatever LinkedIn does that we find intolerable, we will be able to notice it and work together to fix it.
We cannot forget that LinkedIn is a monolithic entity: they write the software, they control the servers. Whatever feeling we have that we can influence what is going on is supported only by our own human nature to simply trust that our friends will take care of us. LinkedIn is exploiting that trust to create a force of advocacy for their platform as they pursue a policy aimed at eroding our individual privacy.
Last week I spent a great deal of time writing on a proposal called "XNets". Basically, we're looking for a million Euros to help develop robust and productive networking technology that will help ensure that social networking unfolds to meet its full potential. Our vision is distributed social networking: let users build a social network platform where there is no central entity calling the shots.
However, it's not just the distributed system that we need; it is the consciousness. I turned the social advertising functionality off and have for the moment the feeling that it is "fixed". But getting this fixed was not C.'s job. C. is not all-seeing nor can she help her friends protect themselves against all possible future SlippedIns. C. should not be doing damage control for LinkedIn. We the community are strong, but we are not omnipotent. The ultimate responsibility for safe-guarding our personal data lies with LinkedIn itself.
What I termed "Human computational relevance" in my previous blog post is probably more appropriately termed "Human computational semantics". The model in the figure in that post can be extended in a straightforward manner to accommodate "Human computational semantics". The model involves comparing multimedia items (again within a specific functional context and a specific demographic) and assigning them a pair-wise similarity value according to the proportion of human subjects that agree that they are similar.
Fig. 1: The similarity between two multimedia items is measured in terms of the proportion of human subjects within a real-world functional context and drawn from a well-defined demographic that agree that they are similar. I claim that this is the only notion of semantic similarity that we need.
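As a concrete reading of the figure, the following minimal Python sketch computes such pairwise similarity values from a list of yes/no judgments. The data layout, a list of (item, item, judgment) tuples, and the function name are assumptions made for this example; the sketch also takes for granted that all judgments come from one demographic and one functional context.

```python
from collections import defaultdict


def pairwise_similarity(judgments):
    """Map each unordered item pair to the fraction of subjects who called it similar.

    `judgments` is an iterable of (item_a, item_b, is_similar) tuples, one per
    subject response.
    """
    yes = defaultdict(int)
    total = defaultdict(int)
    for item_a, item_b, is_similar in judgments:
        pair = frozenset((item_a, item_b))  # the order of the two items is irrelevant
        total[pair] += 1
        yes[pair] += int(is_similar)
    return {pair: yes[pair] / total[pair] for pair in total}


judgments = [
    ("video_1", "video_2", True),
    ("video_2", "video_1", True),
    ("video_1", "video_2", False),
    ("video_1", "video_3", False),
]
similarity = pairwise_similarity(judgments)
print(similarity[frozenset(("video_1", "video_2"))])
```

Note that nothing here enforces any relationship between different pairs: each value is estimated locally, from the judgments collected for that pair alone.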
I hit the ceiling when I hear people describe multimedia items as "obviously related" or "clearly semantically similar". The notion of "obvious" is necessarily defined with respect to a perceiver. If you want to say "obvious", you must necessarily specify the assumption you make about "obvious to whom". Likewise, there is no ultimate notion of "similarity" that is floating around out there for everyone to access. If you want to say "similar", you must specify the assumption that you make about "similar in what context."
If you don't make these specifications, then you are sweeping an implicit assumption you are making right under the rug and it's sure to give you trouble later. It's dangerous to let ourselves lose sight of our unconscious assumptions of who our users are and what the functional context actually is in which we expect our algorithms to operate. Even if it is difficult to come up with a formal definition, at least we can remind ourselves how slippery these notions can be. It seems that we naturally as humans like to emphasize universality and our own commonality, and that in most situations it's difficult to really convince people that "obvious to everyone" and "always similar" are not sufficiently formalized characterizations to be useful in multimedia research. However, in the case of multimedia content analysis the risks are too great and I feel obliged to at least try.
A common objection to the proposed model runs as follows: "So then you have a semantic system that consists of pairwise comparisons between elements, what about the global system?" My answer is: The model gives you local, example-based semantics. The global properties emerge from local interactions in the system. We do not require the system to be globally consistent; instead, we gather pairwise comparisons until a useful level of consistency emerges.
Our insistence on a global semantics, I maintain, is a throwback to the days that we only had conventional books to store knowledge. Paper books are necessarily linear, necessarily of a restricted length and have no random access function. So, we began abstracting and organizing and ordering to pack human understanding of the world into an encyclopedic or dictionary form. It's a fun and rewarding activity to construct compendiums of what we know. However, there is no a priori reason why a semantic system based on a global semantic model must necessarily be chosen for use by a search engine.
Language itself is quite naturally defined as a set of conventions that arise and are maintained via highly local acts of communication within a human population. Under this view, we can ask about Fig. 1, why I didn't draw in connections between the human subjects in order to indicate that the basis of their judgements rests in a common understanding -- a language pact as it were. This understanding is negotiated over years of interaction in a world that exists beyond the immediate moment at which they are asked to answer the question. Our impression that we need an a priori global semantics arises from the fact that there is no practical way to integrate models of language evolution or personal language variation into our system. Again, it's sort of comforting to see that when people think about these issues their first response is to emphasize universality and our human commonality.
It's going to hurt us a little inside to work with systems that represent meaning in a distributed, pairwise fashion. It goes against our feeling, perhaps, that everyone should listen to and understand everything we say. We might not want to think too hard about how our web search engines have actually already been using a form of ad hoc distributed semantics for years.
In closing: The model is there. The wider implications of its existence are that we should direct our efforts to solving the engineering and design problems necessary to be able to efficiently and economically generate estimations of human computational relevance and also of the reliability of these estimates. If we accomplish this task, we are in a position to be able to create better algorithms for our systems. Because we are using crowdsourcing -- computation carried out by individual humans -- we also need to address the ethics question: Can we generate such models without tipping the equilibrium of the crowdsourcing universe so that it disadvantages (or fails to advantage) already fragile human populations?
This post is dedicated to my colleague David Tax: One of the perks of my job is an office on the floor with the guys from the Pattern Recognition Lab -- and one of the downsides is a low-level, but nagging sense of regret that we don't meet at the coffee machine and talk more often. This post articulates the larger story that I'd like to tell you.
In the field of multimedia, we spend so much time in discussions about semantic annotations (such as tags, or concept labels used for automatic concept detection) and whether they are objective or subjective. Usually the discourse runs along the lines of "Objective metadata is worth our effort, subjective metadata is too personal to either predict or be useful." Somehow the underlying assumption in these discussions is that we all have access to an a priori understanding of the distinction between "subjective" and "objective" and that this distinction is of some specific relevance to our field of research.
My position is that, as engineers building multimedia search engines, if we want to distinguish between subjective and objective we should do so using a model. We should avoid listening to our individual gut feelings on the issue (or wasting time talking about them). Instead, we should adopt the more modern notion of "human computational relevance" which, since the rise of crowdsourcing, has entered into conceivable reach.
The underlying model is simple: Given a definition of a demographic that can be used to select a set of human subjects and a definition of a functional context in the real world inhabited by those subjects, the level of subjectivity or objectivity of an individual label is defined as the percentage of human subjects who would say "yes, that label belongs with that multimedia item". The model can be visualized as follows:
Fig. 1: The relevance of a tag to an object is defined as the proportion of human subjects (pictured as circles) within a real-world functional context and drawn from a well-defined demographic that agree on a tag. I claim that this is the only notion of the objective/subjective distinction relevant for our work in developing multimedia search engines.
Under this view of the world, the distinction between subjective and objective reduces to the inter-annotator agreement under controlled conditions. I maintain that the level of inter-annotator agreement will also reflect the usefulness that the tag will have when deployed within a multimedia search engine designed for use within the domain defined by the functional context by the people in the demographic. If we want to assimilate personalized multimedia search into this picture we can define it within a functional context for a demographic consisting only of one person.
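A minimal Python sketch of this definition, under the assumption that we already hold one yes/no vote per subject for a given tag and item. The vote dictionary, the subject names and the demographic predicate are invented for illustration. Personalized search then falls out as the special case where the predicate selects a single person.

```python
def tag_relevance(votes, in_demographic):
    """Fraction of subjects in the demographic who say the tag belongs with the item.

    `votes` maps a subject id to True/False; `in_demographic` is a predicate on
    subject ids. Returns None when nobody in the demographic has voted.
    """
    selected = [vote for subject, vote in votes.items() if in_demographic(subject)]
    if not selected:
        return None
    return sum(selected) / len(selected)


votes = {"ann": True, "bo": True, "cy": False, "di": True}

# The whole pool as the demographic.
everyone = tag_relevance(votes, lambda subject: True)

# Personalization: a demographic consisting of one person.
only_cy = tag_relevance(votes, lambda subject: subject == "cy")

print(everyone, only_cy)
```

The functional context does not appear in the code because, in this sketch, it is fixed by how the votes were collected in the first place.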
This model reduces the subjective/objective difference to an estimation of the utility of a particular annotation within the system. The discussions we should be spending our time on are the ones about how to tackle the daunting task of implementing this model so as to generate reliable estimates of human computational relevance.
As mentioned above, the model is intended to be implemented on a crowdsourcing platform that will produce an estimate of the relevance of each label for each multimedia item. I am as deeply involved as I am with crowdsourcing HIT design because I am trying to find a principled manner to constrain worker pools with regard to demographic specifications and with regard to the specifications of a real-world function for multimedia objects. At the same time, we need useful estimators of the extent to which the worker pool deviates from the idealized conditions.
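One way to put a number on how reliable such an estimate is: treat each subject's answer as a yes/no trial and report a confidence interval around the observed proportion. The sketch below uses the Wilson score interval, which is a standard textbook choice that I am swapping in for illustration, not something the model itself prescribes. It also assumes the subjects answer independently, which a real worker pool may well violate.

```python
import math


def wilson_interval(yes_count, n, z=1.96):
    """Wilson score interval for a proportion; z=1.96 gives roughly 95% coverage."""
    if n == 0:
        raise ValueError("need at least one judgment")
    p = yes_count / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


# The same observed proportion (0.8) with ten times as many subjects.
few = wilson_interval(8, 10)
many = wilson_interval(80, 100)
print(few, many)
```

With the same observed agreement, the interval narrows as more subjects are asked, which gives a handle on the trade-off between the cost of collecting judgments and the reliability of the resulting relevance estimate.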
These are daunting tasks and will, without doubt, require well-motivated simplifications of the model. It should be clear that I don't claim that the model makes things suddenly 'easy'. However, it is clearly a more principled manner of moving forward than debate on the subjectivity vs. objectivity difference.
I was just amazed at the people involved in this contest: in their ability to develop their own idea and distinguish themselves, but at the same time support each other and collaborate as a community. It's nice to talk about crowdsourced innovation, but it's breathtaking to experience it in action.
The results are reflected in how far LikeLines has come since I first posted on it at the beginning of June. Raynor looked at me one day and said, "It's an API"...and we realized that this is not just an intelligent video player, it is a whole new paradigm for collecting user feedback that can be applied in an entire range of use cases.
From one day to the next we started talking about time-code specific video popularity, which we quickly shortened to "heatmap metadata".
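To make "time-code specific video popularity" tangible, here is a toy Python sketch that bins the time codes at which viewers signalled interest into a per-interval count. The function name, the ten-second bin width and the example time codes are assumptions for illustration only; this is not the LikeLines implementation.

```python
def heatmap(like_times, duration, bin_seconds=10):
    """Count interest signals per fixed-width time bin of a video.

    `like_times` are time codes in seconds; `duration` is the video length in
    whole seconds. Signals outside the video are ignored.
    """
    n_bins = -(-duration // bin_seconds)  # ceiling division
    counts = [0] * n_bins
    for t in like_times:
        if 0 <= t < duration:
            counts[int(t // bin_seconds)] += 1
    return counts


print(heatmap([3, 7, 12, 25, 27, 29], duration=30))
```

The resulting list of counts is the kind of metadata a player could render as a heatmap along the timeline.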
Whatever happens next, whether Raynor proceeds to the next round, I already have an overpowering sense of having "won" at MoJo. It really solidified my belief in the power of collaborative competition as a source of innovation -- and a force for good.
I am an organizer in the MediaEval benchmark and this is the sort of effect that we aspire to: bringing people together to pull towards a common goal simultaneously as individuals and as a community.
There needs to be a multiplicity of such efforts: they should support and learn from each other. I can only encourage the students in our lab to get out there and get involved, both as participants and as organizers.
One day last week we were in the elevator heading down to lunch and Yue Shi turned to me and said, "Do you realize that of the people standing in the elevator, there are five PhD students submitting entries in five different competitions?"
True to usual style, my first reaction was, "Hey people, what happened to TRECVID?" We were also making an honest effort to submit to TRECVID this year. I watched that happen...and then not happen.
But then I gave myself permission, there in the elevator, to turn off the bookkeeping/managing mechanism in my head -- and just go with my underlying feeling of what we were doing as a lab. It's the feeling of wow. Everybody doing their own thing, but at the same time being part of this amazing collaborative competitive community.
The elevator doors opened and as we passed through I thought, it seems like the normal daily ride that we're taking, but when you look a bit deeper you can see the world changing and how the people in my lab pool efforts to change it.
I just logged into the manuscript management system of a journal that will remain unnamed and was greeted by this information icon plus message.
Is this system really user friendly or is it exactly the opposite? The page goes on to state, "So please do look at the .... hints and warnings that we’ve put in the green box at the top of each page. They will save you time and ensure that your work is processed correctly."
I identify with this sentence. I am always telling people that they are probably not going to understand everything I am saying. It sounds like standard US-English, but I speak fast, with a lot of unexpected vocabulary, large dynamic range and obscure cultural references and I have just enough of a regional accent. People who easily follow fast paced Hollywood movies think they should understand me. But really, I tell them, the skills just might not transfer and it might not be your fault.
I started warning people after years of trying to slow down and choose simple vocabulary. I still try to do this in formal, large group situations with people I don't know very well. However, there are just too many people that insist on speaking English with me: the other languages I speak are drying up and dying and my English will go that way as well if I don't occasionally take it for a walk and deploy it in its full erratic richness. Somehow, I do really identify with this system: "Hey, World! I'm complicated. Deal with it."
And, like this system, I ensnarl myself in some sort of paradox of self-reference. If I don't speak in a way that will allow people to understand me -- how can I expect them to grasp that I want to communicate to them that I might be difficult to grasp? Will I ever succeed in motivating anyone to bring up the extra dose of patience and attention that it takes to follow me? Won't they be suspicious of me if they know I know that I am difficult to understand and also that I am not doing anything about it? Can I ever convince anyone that the effort is worth the payoff?
Likewise: Can we accept that the system is indeed being self-explanatory when it claims of itself that it is not self-explanatory? Is it worth the effort and patience it takes to cut through that knot? In the end it's a matter of trust. I smile at the message and the information icon and consider how to move forward.
In the end, I decide to whip off a blog post, sighing when the word "ensnarl" can't be typed without a red misspelling line under it, and return to my attempt to extract the manuscript I'm supposed to be reviewing from the system.
I now sympathize with the interface. The message has worked: I've agreed that it's ok to be complicated.
As an IR researcher, I tend to obsess about why Google can't always deal with my queries. I fall into this bad habit when I have a lot of better things to do with my time and even when I know it is not getting me anywhere.
Today I needed to recall the details of the relationship between Wikipedia and Wikimedia Commons so that I could get it just right for a text I was writing. I typed "relationship between wikipedia and wikimedia commons" into the Google search box and was rewarded, as my top hit, with the link in the image pictured at the right. The rest of the list was of the same ilk.
Oh, my gosh, why is Google reacting this way? This is not how I expected my evening to play out! Actually, "relationship" was the sort of information that I wanted and "Wikipedia" and "Wikimedia Commons" were the two entities whose relationship I wanted to understand. Wasn't I making myself clear?
Is Google interpreting my named entities as being the source of the information? Is Google trying to tell me something? Such as, I should be reading Wikipedia instead of writing about Wikipedia?
Is Google trying to gently point out to me that I should be doing more image search? Or maybe giving me subtle support for my opinion that there is a very fine line between navigational and transactional queries?
Is Google relating my query to religion in order to express support for my blogpost last month on Search and Spirituality, which was written when I was in rather a strange mood? Does Google want to evoke in me again that vein of reflection?
But there is an alternative to this vein of inquiry: a simple, very plausible explanation. It runs closely along the lines of the now infamous: "He's just not that into you". Google doesn't do what I would anticipate or what would make me satisfied and happy because Google simply doesn't love me. The evidence is there: my information needs remain unmet and my search goals unreached.
Google and I obviously have a relationship problem. It goes beyond my searches on "relationships". And yet, I keep on returning to that tempting search box again and again. Do I have some sort of a genetic predilection for maintaining a dysfunctional relationship with my search engine? At least until I find an alternate outlet that does happen to be "into" me and my information needs.
In the meantime, I guess I can go and find myself a copy of the Gospel of Mark: I never realized that there was so much overlap.
Your very first glance at worker responses on the very first task you crowdsource tells you that there are very different kinds of workers out there in the crowdsourcing-sphere, for example, on Mechanical Turk. Some of the responses are impressive in the level of dedication and insight that they reflect, others appear to flatly fail the Turing test.
It is also quite striking that there are different kinds of requesters. Turker Nation gives us insight into the differences between one requester and the next, some better behaved than others.
What is particularly interesting are the differences among requesters who are working in the area of crowdsourcing for information retrieval and related applications. One would maybe expect there to be some homogeneity or consensus here. At the moment, however, I am reviewing some papers involving crowdsourcing, and no one seems to be asking themselves the same questions that I ask myself when I design a crowdsourcing task.
It seems worthwhile to get my list of questions out of my head and into a form where other people can have a look at it. These questions are not going to make HIT (Human Intelligence Task) design any easier, but I do strongly feel that asking them should belong to crowdsourcing best practices. And if you do take time to reflect on these aspects, your HIT will in the end be better designed and more effective.
How much agreement do I expect between workers? Is my HIT "mechanical" or is it possible that even co-operative workers will differ in opinion on the correct answer? Do I reassure my workers that I am setting them up for success by signaling to them that I am aware of the subjective component of my HIT and don't have unrealistic expectations that all workers will agree completely?
Is the agreement between workers going to depend on workers' background experience (familiarity with certain topics, regions of the world, modes of thought)? Have I considered setting up a qualification HIT to do recruitment? Or have I signaled to workers what kind of background they need to be successful on the HIT?
Have other people run similar HITs and have I read their papers to avoid making the same mistakes again?
Is the layout of my HIT 'polite'? Consider concrete details: Is it obvious that I did my best to minimize non-essential scrolling? But all in all: Does it look like I have ever actually spent time myself as a worker on the crowdsourcing platform that I am designing tasks for?
Is the design of my HIT respectful? Experienced workers know that it is necessary for requesters to build in validation mechanisms to filter spurious responses. However, these should be well designed so that they are not tedious or insulting for conscientious workers who are highly engaged in the HIT: a clumsy check is annoying and breaks the flow of work.
Is it obvious to workers why I am running the HIT? Do the answers appear to have a serious, practical application?
Is the title of my HIT interesting, informative and attractive?
Did I consider how fast I need the HIT to run through when making decisions about award levels and also about when I will be running the HIT (on the weekend)?
Did I consider what my award level says about my HIT? High award levels can attract treasure seekers. However, award levels that are too low are bad for my reputation as a requester.
Can I make workers feel invested in the larger goal? Have I informed workers that I am a non-profit research institution or otherwise explained (to the extent possible) what I am trying to achieve?
Do I have time to respond to individual worker mails about my HIT? If no, then I should wait until I have time to monitor the HIT before starting it.
Did I consider how the volume of HIT assignments that I am offering will impact the variety of workers that I attract? (Low-volume HITs attract workers that are less interested in rote tasks.)
Did I give examples that illustrate what kind of answers I expect workers to return for the HIT? Good examples will let workers concerned about their reputations judge in advance if you are likely to reject their work.
Did I inform workers of the conditions under which they could be expected to earn a bonus for the HIT?
Did I make an effort to make the HIT intellectually engaging in order to make it inherently as rewarding as possible to work on?
Did I run a pilot task, especially one that asks workers for their opinions on how well my task is designed?
Did I take a step back and look at my HIT with an eye to how it will enhance my reputation as a requester on the platform? Will it bring back repeat customers (i.e., people who have worked on my HITs before)?
Did I consider the impact of my task on the overall ecosystem of the crowdsourcing platform? If I indiscriminately accept HITs without a responsible validation mechanism, I encourage workers to give spurious responses since they have been reinforced in the strategy of attempting to earn awards while investing a minimum of effort.
Did I consider the implications of my HIT for the overall development of crowdsourcing as an economic activity? Does my HIT support my own ethical position on the role of crowdsourcing (that we as requesters should work towards fair work conditions for workers and that they should ultimately be paid US minimum hourly wage for their work)? It's a complicated issue: http://behind-the-enemy-lines.blogspot.com/2011/05/pay-enough-or-dont-pay-at-all.html
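The first question on the list, how much agreement to expect between workers, is one that can be checked with numbers once a pilot has run. Below is a minimal sketch using Fleiss' kappa, one common chance-corrected agreement statistic; it is offered as an illustration, not as the measure this list prescribes. The function name, the input format and the example labels are hypothetical, and the sketch assumes every item was judged by the same number of workers.

```python
from collections import Counter


def fleiss_kappa(labels_per_item):
    """Fleiss' kappa for categorical labels.

    labels_per_item: list of lists; each inner list holds the labels that the
    workers assigned to one item. Every item must have the same number of
    labels (at least two). This input format is an assumption of the sketch.
    """
    n_items = len(labels_per_item)
    if n_items == 0:
        raise ValueError("need at least one item")
    n_raters = len(labels_per_item[0])
    if n_raters < 2:
        raise ValueError("need at least two labels per item")
    if any(len(labels) != n_raters for labels in labels_per_item):
        raise ValueError("every item needs the same number of labels")

    category_totals = Counter()
    observed_sum = 0.0
    for labels in labels_per_item:
        counts = Counter(labels)
        category_totals.update(counts)
        # Fraction of worker pairs that agree on this item.
        agreeing_pairs = sum(c * (c - 1) for c in counts.values())
        observed_sum += agreeing_pairs / (n_raters * (n_raters - 1))

    observed = observed_sum / n_items
    total_labels = n_items * n_raters
    # Agreement expected by chance, from the overall label distribution.
    expected = sum((c / total_labels) ** 2 for c in category_totals.values())
    if expected == 1.0:
        return 1.0  # only one category was ever used; agreement is trivially perfect
    return (observed - expected) / (1.0 - expected)


# Made-up pilot data: three workers label four videos.
pilot = [
    ["news", "news", "news"],
    ["comedy", "comedy", "news"],
    ["comedy", "comedy", "comedy"],
    ["news", "comedy", "news"],
]
print(round(fleiss_kappa(pilot), 3))
```

A low value on a pilot does not by itself mean the workers were careless; for a HIT with a subjective component it may simply confirm that reasonable, co-operative workers differ.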
The workers on Mechanical Turk refer to themselves as "turkers". This act of self-naming signals a sense of community, of a common understanding of what they are doing, the commonality of the activity that they are all engaged in.
What do we as requesters call ourselves? Do we have a sense of community, too? Do we enjoy the strength that derives from a shared sense of purpose?
The classical image of Wolfgang von Kempelen's automaton, the original Mechanical Turk, is included above since I think it sheds some light on this issue. Looking at the image we ask ourselves who should be most appropriately designated "turker"? Well, it's not the worker, who is the human in the machine. Rather it is the figure who is dressed as an Ottoman and is operating the machine: If workers consider themselves turkers, then we the requesters must be turkers, too.
The more that we can foster the development of a common understanding of our mission, the more that we can pool our experience to design better HITs, the more effectively we can hope to improve information retrieval by using crowdsourcing.
Today, I discovered an interesting segment of a video clip illustrating someone connecting search and spirituality. Search in a broader sense (beyond "information retrieval") does seem to have a lot to do with our belief systems and our relationship to a sense of higher purpose in life. Coming across a tangible example of the connection between finding information and someone's inner spiritual world made me stop and reflect. I was struck by the implications for the design of user experience with search engines. What responsibilities do we have as scientists in designing our algorithms and our applications if these then get incorporated into the personal, internal process of individual human beings to find meaning in their own lives by connecting with universal truth?
At the moment I am doing the final spot check on the development set for the MediaEval 2011 Genre Tagging release. I was checking out a video with the genre label personal_or_auto-biographical, one of the 26 categories that we are using this year.
I started playing this video to get an idea of what it exactly was about and I was amazed to listen to this guy and watch him speaking. Perhaps the reaction dates me. There is just a striking immediacy to it that I was not expecting. Apparently, he's alone in his car, and talking only for himself and for the camera.
To really not know who this is, or what happened to him later in 2009 when he stopped publishing episodes is a bit of a science-fiction feeling for me. Watching his video, I am caught up in the present moment of someone who I don't know, over two years after that moment actually occurred. This effect is quite contrary to what he himself is describing. He talks about remaining with himself (someone he nearly by definition must know well) in the present moment.
Or is my witnessing of this nameless present-moment occurring in the past actually simply a new kind of being present?
It certainly seems like it exists on some other plane. Although, I jump immediately to considering what it would take to track the guy down. Gerald Friedland gave a talk at our lab last week about Cybercasing, using geo-tagged information available online to mount real-world attacks. It's fresh in my mind, the array of possibilities for finding someone by following the trail they leave uploading multimedia to the Internet. One video doesn't seem to hurt, but we quickly lose the intuitions for how our uploading behavior might scale -- allowing people to find us on the basis of who we are and in terms of how we are vulnerable.
On the other hand, this yearning to be present in the moment is so universal, so common to so many, that it really doesn't make this guy so special. He's special, perhaps, in that he can operate a camera and get his video online. Also, clearly he has the gift to generate a speech stream that other people then identify as reflecting their own inner processes. But his specialness ends in a certain way right there. What he is saying is in a way so intensely personal that it once again becomes universal -- it's simply what we look like on the inside -- like the pictures that they show us in grade school of the chambers of our hearts and the insides of our large intestines. This video was in that sense made to be lost in the multimedia avalanche of the Internet.
The guy mentions a name in his metadata, Eckhart Tolle, and I followed the trail and very quickly realized, by clicking into an Eckhart Tolle YouTube video, that Eckhart Tolle is who my mystery guy is talking about rather than who he is himself. That brought a smile, since this distinction is one that we've previously observed as important for speech media.
I listened to Eckhart Tolle for a bit, pondering the metaphor involving the universal similarity of people's large intestines. All of a sudden Eckhart Tolle is saying, "The mind even started to look at ads for flying back to England, fares, and then the impulse came..." He's sort of hesitating, so you wonder if he's also finding this a little strange, but for me it just seemed like a moment in which search for information is clearly playing a central role in what we would otherwise call our own internal states that make up part of our spirituality. It's the kind of search that we would do nowadays with a search engine.
Eckhart Tolle goes on to talk about "obedience to what came out of the present moment"...it guided his decision making process on where to be when. He goes on to say, "...don't do it on an impulse that is a restless impulse or comes out of any kind of negative emotion". If people listen to what he is saying, and a lot do, and if they combine their search for information with interaction with search engines, I land at the following conclusion: our individual spiritual development is not disconnected from our search engines and especially not from our experience of interacting with them.
In the end, the reason I blog about this might just be that I want to use the YouTube link to that Eckhart Tolle video that will take you right to the jump-in point that I am writing about: http://youtu.be/K1_R3uKJOB4?t=4m18s Goodness knows how much time I've spent discussing video fragment linking and trying to get research money to work on it as a searching speech problem -- I really get a kick out of being able to link into the stream.
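For the curious, the jump-in point in a link like this can be read back out as a number of seconds. The sketch below is a minimal illustration using only the Python standard library; the function name is hypothetical. It handles the `4m18s` form seen in the link above, and the additional acceptance of an hours part or of a bare number of seconds is an assumption of the sketch, not something the link itself shows.

```python
import re
from urllib.parse import parse_qs, urlparse


def jump_in_seconds(url):
    """Return the offset encoded in a link's `t` query parameter, in seconds.

    Returns None if the link has no usable `t` parameter.
    """
    values = parse_qs(urlparse(url).query).get("t")
    if not values:
        return None
    value = values[0]
    if value.isdigit():
        # Assumed variant: a bare number of seconds, e.g. t=258.
        return int(value)
    # Form seen in the post: minutes and seconds, e.g. 4m18s.
    # The optional hours part is an assumption of this sketch.
    match = re.fullmatch(r"(?:(\d+)h)?(?:(\d+)m)?(?:(\d+)s)?", value)
    if not match or not any(match.groups()):
        return None
    hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return hours * 3600 + minutes * 60 + seconds


print(jump_in_seconds("http://youtu.be/K1_R3uKJOB4?t=4m18s"))  # 258
```

Four minutes and eighteen seconds is 4 * 60 + 18 = 258 seconds into the stream.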
We are late releasing the data for the MediaEval 2011 Genre Tagging task. The initial delay was small, but then other things just got in the way compounding the situation. Today, I am trying to be very present in the moment, in order to ignore the stress that I feel about being so late and be very careful about getting the release right the first time around.
And today's experience reminds me of how careful we need to be in all our research. If our search engines are part of our spiritual worlds, we need to design our algorithms and applications with awareness of their potential impact on the trails that we are following in our paths of personal development and on the collective, common digestive system of humanity.
I divide my time between Radboud University Nijmegen and Delft University of Technology in the Netherlands. My research focuses on multimedia retrieval techniques that exploit speech and language and focus on human interpretations of meaning. I am particularly interested in internet video, in networked communities, and crowdsourcing techniques. Lately, I've been noticing how difficult it is to imagine life without search.