The Social Nature of Image Interpretation
Current online image search technologies use keyword queries to fetch results from a large corpus of crawled images that are primarily indexed by textual data. Text surrounding images on webpages, tags or annotations, and captions are used to index the content of the images. This severely limits the scope of what can be searched: images that have little associated text, or that are not accurately described by that text, are not indexed appropriately. This limitation, in turn, affects the quality and relevance of image search results. Try searching for a ‘striped Brooks Brothers shirt’ on Google or Bing image search, or even worse, a ‘green Squirtle’, which happens to be a specific Pokémon character. In fact, one can argue that the more specific the text description used to query for images, the worse the results.
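The text-based indexing described above can be sketched as a simple inverted index mapping tokens from surrounding text to image URLs. This is a toy illustration, not how any production engine actually works (real systems add ranking, stemming, synonym expansion, and much more); the corpus data here is entirely hypothetical. Note how the ‘green squirtle’ query fails exactly as described, because no nearby text ever mentions the color:

```python
from collections import defaultdict

# Hypothetical corpus: image URL -> text found near the image on the page
corpus = {
    "img/shirt1.jpg": "striped brooks brothers shirt oxford",
    "img/shirt2.jpg": "plain white dress shirt",
    "img/squirtle.png": "squirtle pokemon character",
}

# Inverted index: token -> set of image URLs whose surrounding text contains it
index = defaultdict(set)
for url, text in corpus.items():
    for token in text.lower().split():
        index[token].add(url)

def search(query):
    """Return images whose surrounding text contains ALL query tokens."""
    tokens = query.lower().split()
    results = [index.get(t, set()) for t in tokens]
    return set.intersection(*results) if results else set()

print(search("striped shirt"))    # matches img/shirt1.jpg
print(search("green squirtle"))   # empty: 'green' never appears near the image
```

The point of the sketch: the index can only ever surface what the surrounding text happens to say, regardless of what is actually in the image.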
The problem I am describing here is more fundamental in nature: Google, Bing, and Yahoo! search for visual data such as images using a text description or a set of keywords. The underlying assumption is that the mapping from text or keywords to images is both known and accurate. This assumption, albeit a statistically reasonable one (text surrounding an image usually bears some relationship to the image's content), does not always hold, as observed above. Tags and annotations, primarily from user-generated or annotated data, can alleviate this problem to a great extent; however, this approach does not scale well and is often subjective, because user-generated tags and annotations vary so widely.
Speaking of the subjectivity of image content: I was recently referred to a few social bookmarking services for images such as Image Spark, Visualize Us, FFFFound!, and Picture For Me. The theme of all these websites (as eloquently stated by Image Spark) is “Discover, share, tag, and converge images that inspire you and your work.” If you browse through the collections you will find two distinct categories of popular tags: the first is fairly objective and content specific, while the second is subjective and abstract, depending more on the eye of the beholder. Tags such as ‘female’, ‘painting’, ‘nature’, ‘black and white’, ‘portrait’, and ‘illustration’ belong to the first category. The second category, comparable in size to the first, has tags like ‘beautiful’, ‘beauty’, ‘love’, ‘funny’, ‘cute’, ‘romantic’, and ‘inspiration’, all of which are very subjective. If you are anything like me, most of these tags will depend on your mood and, to a lesser degree, on the content of an image. While the first category of tags can be ‘learned’ by machine-learning algorithms, the second category has to account for the user's ‘mood’, or the ‘beholder's eye’, in addition to the content of the image. This brings a social or community facet to the interpretation and querying of images. We can build learning algorithms that drive how we search for images based on how and what we see in them, and this, of course, can be learned from the likes and dislikes of the ‘birds of a feather’ around us: the networks we belong to.
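One minimal way to sketch the ‘birds of a feather’ idea (all names and data below are hypothetical, and this is just one possible formulation, not a proposed system): score a subjective tag for an image by how many people in the user's network applied that same tag to it. A high score suggests the tag fits the tastes of that community, even if a content-only classifier could never learn it:

```python
# Hypothetical data: each member of a user's network, mapped to the
# tags they applied to each image they bookmarked
network_tags = {
    "alice": {"img1": {"beautiful", "nature"}, "img2": {"funny"}},
    "bob":   {"img1": {"beautiful"},           "img2": {"cute"}},
    "carol": {"img1": {"portrait"},            "img2": {"funny"}},
}

def tag_score(image, tag, network):
    """Fraction of the user's network that applied `tag` to `image`."""
    votes = sum(1 for tags_by_image in network.values()
                if tag in tags_by_image.get(image, set()))
    return votes / len(network)

print(tag_score("img1", "beautiful", network_tags))  # 2 of 3 peers agree
```

A real system would of course weight network members by similarity of taste rather than counting everyone equally, which is essentially collaborative filtering applied to tags instead of ratings.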
How do we make accessing image content simpler and yet more precise? How can we annotate the content of an image precisely for what it will be searched for (or for what it is worth)?
To answer these questions we must first answer this —
What are our peeps looking at?