I am a professor of Computer Engineering at Koç University in Istanbul, working at the Artificial Intelligence Laboratory. Previously I was at the MIT AI Lab and later co-founded Inquira, Inc. My research is in natural language processing and machine learning. For prospective students, here are some research topics, papers, classes, blog posts and past students.

June 16, 2019

My artificial intelligence interview in the book "Tasarım Ne Bekler"

WILL THERE BE WRITING IN THE FUTURE? I AM NOT SURE…
Deniz Yüret
Prepared by: Meriç Tuncez
Tasarım Ne Bekler, © 2019 KUAR Yayınları

While researching our topic, I came across a 2015 news story about Google labeling a photograph of a Black couple as gorillas. Similarly, there is a report that, when recommending jobs, Google showed high-paying positions to men about six times more often than to women. Starting from this, is it possible for artificial intelligence to move away from algorithmic bias, that is, from the biases of the person who produced it? Or how could that become possible?

First, let me briefly explain why the program has these biases. None of the technologies you mention are developed in the old-fashioned way we might call "Software 1.0", where someone sits down and programs something into the computer.

If I had been asked this question 20 years ago, I would have said, "The programmer who wrote this is racist or sexist, so fire this guy." But these new technologies are no longer developed that way. Instead, they are developed from statistics, by looking at examples.

That is, for finding jobs or recognizing things in pictures, you prepare a large amount of labeled data. You give this labeled data to the computer. The computer learns certain things from millions of examples and then starts answering your questions.

Now, if the data you provide contains a bias, if there are sexist or racist elements in it, it is quite normal for the program to absorb that bias into these algorithms. This is not the algorithm's fault; it is the fault of the data we provided. So if we really want to be sensitive about bias, we need to prepare the data accordingly.

In other words, we need to balance the data before using it as training data on the computer. A similar incident happened at Microsoft last year. They built a chatbot and released it on Twitter. They had to shut it down 24 hours later because it had picked up a lot of bad, racist and sexist language from people and started imitating it.

So we can think of these learning algorithms as innocent babies. Whatever we teach them, they learn to repeat in the same way. So this may well be the teacher's fault.

For example, when we take a problem like AlphaGo (a program developed by Google DeepMind that plays the game of Go), the point at which we win is clear. That is, how we can win in that game is clear, and we have a score. But in a design problem, for instance, it does not work the same way; we see that there are many different paths to different outcomes. For example, when designing for climate change, how can we use "artificial intelligence"? What will our output be here? Is our output just "lowering the temperature of the climate"? Or is it something else? In other words, in such a complicated problem, where we cannot know the outcomes or what is right, how can we use artificial intelligence or machine learning in our designs?

I think this is still an unsolved question. That is because, when training artificial intelligence models, in addition to the data we provide we also need to specify what is called an "objective function" or "error function".

That is, we generally give these learning programs training data of the form "When I give you inputs like these, I want outputs like these." But in addition, the designer has to decide something like "If you produce not the output I want but something slightly different, this is how I will measure your error; this will be your objective function." So we are the ones who have to decide what to optimize. As I said, this is not an easy problem. It gets even harder for complex issues such as climate change.
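To make the designer's choice concrete, here is a minimal Julia sketch. It is not from the interview: `squared_error`, `absolute_error` and the toy numbers are illustrative assumptions. It scores the same two sets of outputs with two different error functions and shows that the functions can disagree about which set is better.

```julia
# Two hypothetical objective (error) functions. Neither comes from the
# interview; they only illustrate that the designer chooses how deviations
# from the desired output are scored.
squared_error(pred, target)  = sum(abs2, pred .- target) / length(target)
absolute_error(pred, target) = sum(abs,  pred .- target) / length(target)

target           = [1.0, 2.0, 3.0]
close_everywhere = [1.5, 2.5, 3.5]   # every output is off by 0.5
one_big_miss     = [1.0, 2.0, 4.5]   # two exact outputs, one off by 1.5

for (name, pred) in [("close_everywhere", close_everywhere),
                     ("one_big_miss", one_big_miss)]
    println(name, ": squared = ", squared_error(pred, target),
            ", absolute = ", absolute_error(pred, target))
end
```

Under the absolute error the two candidates tie, while the squared error penalizes the single large miss more heavily. Which behavior is preferable is exactly the decision left to the designer.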

Work is being done today that could shed light on Elon Musk's "Robots will conquer the world" scenario, and this is the issue researchers worry about most. That is, if we are not very careful when setting a goal for artificial intelligence, these systems may go in directions we did not expect at all.

Say we give it the goal of lowering the world's temperature; the solutions it gives us could cause a new ice age. On the other hand, this is not a new problem, nor is it specific to artificial intelligence. I recently read an article comparing this objective function problem of artificial intelligence with financial markets or political systems. People have long had difficulty designing complex systems.

So think of complex social systems designed with perfectly good intentions, such as the European Union or the stock exchange. In these systems too, even though the designers had no bad intentions, the system can, through its own dynamics, lead to outcomes we did not expect at all and go in directions that harm us.

So I think this is a problem we need to work on. Incidentally, the technical name for it is "alignment": "How can the values of the system or program you develop be made parallel to your own values?" This is still an open problem being worked on.


June 06, 2019

Learning from Implicit Information in Natural Language Instructions for Robotic Manipulations

Ozan Arkan Can, Pedro Zuidberg Dos Martires, Andreas Persson, Julian Gaal, Amy Loutfi, Luc De Raedt, Deniz Yuret and Alessandro Saffiotti. 2019. In Proceedings of the Combined Workshop on Spatial Language Understanding (SpLU) & Grounded Communication for Robotics (RoboNLP) at NAACL-HLT-2019. (abstract, paper, poster, proceedings)

Abstract: Human-robot interaction often occurs in the form of instructions given from a human to a robot. For a robot to successfully follow instructions, a common representation of the world and objects in it should be shared between humans and the robot so that the instructions can be grounded. Achieving this representation can be done via learning, where both the world representation and the language grounding are learned simultaneously. However, in robotics this can be a difficult task due to the cost and scarcity of data. In this paper, we tackle the problem by separately learning the world representation of the robot and the language grounding. While this approach can address the challenges in getting sufficient data, it may give rise to inconsistencies between both learned components. Therefore, we further propose Bayesian learning to resolve such inconsistencies between the natural language grounding and a robot’s world representation by exploiting spatio-relational information that is implicitly present in instructions given by a human. Moreover, we demonstrate the feasibility of our approach on a scenario involving a robotic arm in the physical world.



January 22, 2019

Knet v1.2.0: iterators, iterators, iterators...

The new Knet release is all about iterators: iterators for minibatching, iterators for training, iterators for monitoring, convergence etc. Why am I so excited about iterators all of a sudden? Allow me to explain:

Knet has used iterators for data generation since 2015. That was about it until recently when I was looking for a way to improve the training interface. See, at the core of every deep learning project there is a training loop that looks like this:

    function train(model,data)
      for (x,y) in data
        # improve model parameters so model(x) approaches y
      end
    end

And these things can run for hours or days. You want the user to have full control of this loop: how many iterations to go, how to detect convergence and quit, how to monitor progress, how to take model snapshots or measure dev accuracy every n iterations etc.

My original (non)solution was to write a new `train` function for every experiment. Why restrict the user with a bad interface when they can write their own 5 line loop? (of course then why write any package at all but that's another discussion).

My next (pseudo)solution was to provide a `train` function with lots of keyword arguments. I soon gave up on that idea when it became clear that I was on my way to implementing a Turing complete programming language using keyword arguments.

Then I thought I had a brilliant flash of insight based on callback functions. See, if `train` just accepts a callback function that gets called inside the for loop, the user can implement any behavior:

    function train(model,data,callback)
      for (x,y) in data
        callback() || break
        # improve model parameters so model(x) approaches y
      end
    end

You want to display a progress bar, do something every n iterations, or quit after N iterations? Just implement some callback function with state and you are all set! Brilliant? Everybody hated it. Including me. It turns out callback functions are awkward to write and do not lead to very readable code.

Then finally I rediscovered iterators and iterators that wrap other iterators (inspired by Tqdm.jl). I knew iterators could be these lazy collections that produce their next element only when asked. (Here is a summary with doc links to refresh your memory). See, once you implement the training loop as an iterator you can pause, restart and terminate it whenever you want:

    # pseudocode: "update model and return loss" stands for one training step
    train(model,data) = ((update model and return loss) for (x,y) in data)

What I realized iterators also do is turn the for loop inside out! Make its guts visible so one has explicit control: You can monitor and display its progress, take snapshots or whatever all with very explicit and readable code. Here are some actual examples from Knet v1.2.0. (`sgd` is a train iterator, f is the model, d is the data):

* To display a progress bar use progress(sgd(f,d)).
* To run until convergence use converge(sgd(f,cycle(d))).
* To run multiple epochs use sgd(f,repeat(d,n)).
* To run a given number of iterations use sgd(f,take(cycle(d),n)).
* To do a task every n iterations use:
(task(x) for x in every(n, sgd(f,cycle(d)))).
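
The idea of a training loop as a lazy iterator can be tried in plain Julia without Knet. The following is only a sketch under stated assumptions: `sgdstep!` and `sgdsketch` are hypothetical names, the one-parameter model is a toy, and none of this is Knet's actual implementation. It shows that building the iterator does no work, and that consuming it is what drives the updates.

```julia
# Plain-Julia sketch, no Knet required. `sgdstep!` and `sgdsketch` are
# hypothetical names used only for this illustration.

# Toy model y ≈ w[] * x with a single parameter stored in a Ref.
# One SGD step on squared error; returns the loss measured before the update.
function sgdstep!(w, x, y; lr=0.1)
    err = w[] * x - y
    w[] -= lr * 2 * err * x
    return err^2
end

# The training loop as a lazy generator: one update per element requested.
sgdsketch(w, data) = (sgdstep!(w, x, y) for (x, y) in data)

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]   # samples of y = 2x
w = Ref(0.0)

# Set up 30 iterations, cycling through the data; nothing executes yet.
itr = sgdsketch(w, Iterators.take(Iterators.cycle(data), 30))
@assert w[] == 0.0

losses = collect(itr)   # consuming the iterator drives the training
println("updates: ", length(losses), ", w ≈ ", round(w[], digits=3))
```

Because the loop body lives inside the generator, the caller decides how many elements to pull and what to do between them, which is the control the callback version tried to provide.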

Each of the functions like `progress`, `converge`, `sgd` etc. takes and returns iterators, so they can be composed like crazy. Here is an example from the Knet tutorial that (1) trains a model on dtrn, (2) measures loss on dtst every 100 iterations, (3) quits when dtst performance converges, and (4) displays a progress bar:

    a = adam(model,cycle(dtrn))
    b = (model(dtst) for _ in every(100,a))
    c = converge(b, alpha=0.1)
    progress!(c, alpha=1)

The code reads like the English description! Imagine trying to implement this using keyword arguments or callback functions... and that is why I am excited about iterators.

Notes:
* The more nitpicky reader will probably point out that I should have called these things generators or coroutines or streams or something rather than iterators, but you get the idea.
* every(n,itr) = (x for (i,x) in enumerate(itr) if i%n == 0) should be a Julia primitive! (Thank you @CarloLucibello for pointing out that `IterTools.takenth` does the same thing.)
* @lostella has a wonderful post on iterators.
* Here are the relevant links in Julia docs: Interfaces, Collections, Iteration Utilities and Generator expressions.
* Here is a link to the discussion on Julia discourse.
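
The one-line `every` from the notes can be checked on its own in plain Julia. This is a small sketch: the `squares` stream is an illustrative assumption, and only `enumerate`, `Iterators.countfrom` and `Iterators.take` from Base are used.

```julia
# `every` as written in the notes: keep the elements whose 1-based
# position is a multiple of n.
every(n, itr) = (x for (i, x) in enumerate(itr) if i % n == 0)

# It composes with other lazy iterators, including infinite ones.
squares  = (k^2 for k in Iterators.countfrom(1))            # 1, 4, 9, 16, ...
firstfew = collect(Iterators.take(every(3, squares), 4))    # squares of 3, 6, 9, 12
println(firstfew)
```

Since every stage is lazy, only the first twelve squares are ever computed, even though `squares` itself is unbounded.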


December 14, 2018

Deep Learning in Julia: MIT IAP Seminar

Alan Edelman, Deniz Yuret
Jan 7-11, 2019. 11:00am-12:30pm. Room: 2-135.

Description: The course will consist of five hands-on tutorials giving the students practical experience in programming, training, evaluating and benchmarking deep learning models in Julia. While other machine learning libraries can meet many needs, for innovators who want to go beyond the ordinary models, the expressivity of Julia has no equal. After a brief introduction to the Julia programming language we will cover linear models, multi-layer perceptrons, convolutional and recurrent neural networks. Through these examples the students will be exposed to the concepts of optimization with stochastic gradient descent (backpropagation); data normalization and minibatching; overfitting and regularization; model architectures and sample efficiency.

Prerequisites: Familiarity with programming, probability, calculus and linear algebra.



December 07, 2018

Grounded language learning datasets

GQA 201901 (home, Stanford, vqa)

We design a new dataset, GQA, to address these shortcomings, featuring compositional questions over real-world images. We leverage semantic representations of both the scenes and questions to mitigate language priors and conditional biases and enable fine-grained diagnosis for different question types. The dataset consists of 22M questions about various day-to-day images. Each image is associated with a scene graph of the image's objects, attributes and relations, a new cleaner version based on Visual Genome. Each question is associated with a structured representation of its semantics, a functional program that specifies the reasoning steps that have to be taken to answer it.

Touchdown 201811 (arXiv, github, streetview, Cornell, navi)

An agent must first follow navigation instructions in a real-life visual urban environment to a goal position, and then identify in the observed image a location described in natural language to find a hidden object. The data contains 9,326 examples of English instructions and spatial descriptions paired with demonstrations.

VCR 201811 (home, arXiv, UW/AI2)

Visual commonsense reasoning dataset. Visual Commonsense Reasoning (VCR) is a new task and large-scale dataset for cognition-level visual understanding. With one glance at an image, we can effortlessly imagine the world beyond the pixels (e.g. that [person1] ordered pancakes). While this task is easy for humans, it is tremendously difficult for today's vision systems, requiring higher-order cognition and commonsense reasoning about the world. We formalize this task as Visual Commonsense Reasoning. In addition to answering challenging visual questions expressed in natural language, a model must provide a rationale explaining why its answer is true.

 • 290k multiple choice questions
 • 290k correct answers and rationales: one per question
 • 110k images
 • Counterfactual choices obtained with minimal bias, via our new Adversarial Matching approach
 • Answers are 7.5 words on average; rationales are 16 words.
 • High human agreement (>90%)
 • Scaffolded on top of 80 object categories from COCO
 • Is now (as of Dec 3, 2018) available for download!

NLVR2 201811 (home, arXiv, github, Cornell, qa)

The data contains 107,296 examples of English sentences paired with web photographs. The task is to determine whether a natural language caption is true about a photograph. The data was collected through crowdsourcing, and solving the task requires reasoning about sets of objects, comparisons, and spatial relations. There are two related corpora: NLVR, with synthetically generated images, and NLVR2, which includes natural photographs.

HOW2 201811 (arXiv (2 cit), github, CMU)

In this paper, we introduce How2, a multimodal collection of instructional videos with English subtitles and crowdsourced Portuguese translations. We also present integrated sequence-to-sequence baselines for machine translation, automatic speech recognition, spoken language translation, and multimodal summarization. By making available data and code for several multimodal natural language tasks, we hope to stimulate more research on these and similar challenges, to obtain a deeper understanding of multimodality in language processing. The corpus consists of around 80,000 instructional videos (about 2,000 hours) with associated English sub-titles and summaries. About 300 hours have also been translated into Portuguese using crowd-sourcing, and used during the JSALT 2018 Workshop.

TVQA 201809 (home, arXiv (2 cit), videoqa, UNC)

TVQA: Localized, Compositional Video Question Answering. TVQA is a large-scale video QA dataset based on 6 popular TV shows (Friends, The Big Bang Theory, How I Met Your Mother, House M.D., Grey's Anatomy, Castle). It consists of 152.5K QA pairs from 21.8K video clips, spanning over 460 hours of video. The questions are designed to be compositional, requiring systems to jointly localize relevant moments within a clip, comprehend subtitles-based dialogue, and recognize relevant visual concepts.

TEMPO 201809 (home, arXiv (1 cit), github, data, video, Berkeley)

TEMPOral reasoning in video and language (TEMPO) dataset. Localizing moments in a longer video via natural language queries is a new, challenging task at the intersection of language and video understanding. Though moment localization with natural language is similar to other language and vision tasks like natural language object retrieval in images, moment localization offers an interesting opportunity to model temporal dependencies and reasoning in text. Our dataset consists of two parts: a dataset with real videos and template sentences (TEMPO - Template Language) which allows for controlled studies on temporal language, and a human language dataset which consists of temporal sentences annotated by humans (TEMPO - Human Language).

RecipeQA 201809 (home, arXiv, slides, Hacettepe, qa)

RecipeQA is a dataset for multimodal comprehension of cooking recipes. It consists of over 36K question-answer pairs automatically generated from approximately 20K unique recipes with step-by-step instructions and images. Each question in RecipeQA involves multiple modalities such as titles, descriptions or images, and working towards an answer requires (i) joint understanding of images and text, (ii) capturing the temporal flow of events, and (iii) making sense of procedural knowledge.

CIFF (LANI and CHAI) 201809 (arXiv (3 cit), github, data and simulators, Cornell, navi+exec)

We propose to decompose instruction execution to goal prediction and action generation. We design a model that maps raw visual observations to goals using LINGUNET, a language-conditioned image generation network, and then generates the actions required to complete them. Our model is trained from demonstration only without external resources. To evaluate our approach, we introduce two benchmarks for instruction following: LANI, a navigation task; and CHAI, where an agent executes household instructions.

Talk the Walk 201808 (arXiv (3 cit), github, FAIR, navi+dial)

Two agents, a "tourist" and a "guide", interact with each other via natural language in order to have the tourist navigate towards the correct location. The guide has access to a map and knows the target location but not the tourist location, while the tourist does not know the way but can navigate in a 360-degree street view environment. The task involves "perception" for the tourist observing the world, "action" for the tourist to navigate through the environment, and "interactive dialogue" for the tourist and guide to work towards their common goal.

Conceptual Captions 201807 (home, paper (2 cit), github, Google)

We make available Conceptual Captions, a new dataset consisting of ~3.3M images annotated with captions. In contrast with the curated style of other image caption annotations, Conceptual Caption images and their raw descriptions are harvested from the web, and therefore represent a wider variety of styles. More precisely, the raw descriptions are harvested from the Alt-text HTML attribute associated with web images. To arrive at the current version of the captions, we have developed an automatic pipeline that extracts, filters, and transforms candidate image/caption pairs, with the goal of achieving a balance of cleanliness, informativeness, fluency, and learnability of the resulting captions.

DRIF 201806 (github, paper1 (3 cit), paper2 (3 cit), Cornell, navi)

Following natural language navigation instructions on a realistic simulated quadcopter.

ADS 201806 (home, paper (16 cit), data, challenge2018, Pittsburgh)

Automatic Understanding of Visual Advertisements: A large annotated dataset of image and video ads. In this dataset, we provide over 64,000 ad images annotated with the topic of the ad (e.g. the product or topic, in case of public service announcements), the sentiment that the ad provokes, any symbolic references that the ad makes (e.g. an owl symbolizes wakefulness, ice symbolizes freshness, etc.), including bounding boxes containing the physical content that alludes symbolically to concepts outside of the ad, and questions and answers about the meaning of the ad ("What should I do according to the ad? Why should I do it, according to the ad?")

VizWiz 201802 (home, arXiv (17 cit), challenge2018, UTAustin, qa)

VizWiz is proposed to empower a blind person to directly request in a natural manner what (s)he would like to know about the surrounding physical world.

CHALET 201801 (arXiv (10 cit), github, Cornell, navi+exec)

We present CHALET, a 3D house simulator with support for navigation and manipulation. CHALET includes 58 rooms and 10 house configurations, and allows users to easily create new house and room layouts. CHALET supports a range of common household activities, including moving objects, toggling appliances, and placing objects inside closeable containers. The environment and actions available are designed to create a challenging domain to train and evaluate autonomous agents, including for tasks that combine language, vision, and planning in a dynamic environment.

House3D 201801 (arXiv (44 cit), github)

Building Generalizable Agents with a Realistic and Rich 3D Environment: Teaching an agent to navigate in an unseen 3D environment is a challenging task, even in the event of simulated environments. To generalize to unseen environments, an agent needs to be robust to low-level variations (e.g. color, texture, object changes), and also high-level variations (e.g. layout changes of the environment). To improve overall generalization, all types of variations in the environment have to be taken under consideration via different levels of data augmentation steps. To this end, we propose House3D, a rich, extensible and efficient environment that contains 45,622 human-designed 3D scenes of visually realistic houses, ranging from single-room studios to multi-storied houses, equipped with a diverse set of fully labeled 3D objects, textures and scene layouts, based on the SUNCG dataset (Song et al.). The diversity in House3D opens the door towards scene-level augmentation, while the label-rich nature of House3D enables us to inject pixel- & task-level augmentations such as domain randomization (Tobin et al.) and multi-task training. Using a subset of houses in House3D, we show that reinforcement learning agents trained with an enhancement of different levels of augmentations perform much better in unseen environments than our baselines with raw RGB input by over 8% in terms of navigation success rate.

CoDraw 201712 (arXiv (6 cit), github, data, FAIR, draw)

CoDraw: Visual dialog for collaborative drawing. In this work, we propose a goal-driven collaborative task that contains vision, language, and action in a virtual environment as its core components. Specifically, we develop a collaborative `Image Drawing' game between two agents, called CoDraw. Our game is grounded in a virtual world that contains movable clip art objects. Two players, Teller and Drawer, are involved. The Teller sees an abstract scene containing multiple clip arts in a semantically meaningful configuration, while the Drawer tries to reconstruct the scene on an empty canvas using available clip arts. The two players communicate via two-way communication using natural language. We collect the CoDraw dataset of ~10K dialogs consisting of 138K messages exchanged between a Teller and a Drawer from Amazon Mechanical Turk (AMT). We analyze our dataset and present three models to model the players' behaviors, including an attention model to describe and draw multiple clip arts at each round. The attention models are quantitatively compared to the other models to show how the conventional approaches work for this new task. We also present qualitative visualizations.

IQA 201712 (arXiv (26 cit), github, youtube, UW/AI2, qa+navi)

We introduce Interactive Question Answering (IQA), the task of answering questions that require an autonomous agent to interact with a dynamic visual environment. IQA presents the agent with a scene and a question, like: "Are there any apples in the fridge?" The agent must navigate around the scene, acquire visual understanding of scene elements, interact with objects (e.g. open refrigerators) and plan for a series of actions conditioned on the question.

EmbodiedQA 201711 (home, arXiv (35 cit), FAIR/Gatech, qa+navi)

We present a new AI task -- Embodied Question Answering (EmbodiedQA) -- where an agent is spawned at a random location in a 3D environment and asked a question ("What color is the car?"). In order to answer, the agent must first intelligently navigate to explore the environment, gather information through first-person (egocentric) vision, and then answer the question ("orange"). This challenging task requires a range of AI skills -- active perception, language understanding, goal-driven navigation, commonsense reasoning, and grounding of language into actions. In this work, we develop the environments, end-to-end-trained reinforcement learning agents, and evaluation protocols for EmbodiedQA.

R2R 201711 (home, arXiv (38 cit), challenge2018, speaker-follower model: paper, code; Matterport3D: home, simulator, arXiv (56 cit), Australia, navi)

R2R is the first benchmark dataset for visually-grounded natural language navigation in real buildings. The dataset requires autonomous agents to follow human-generated navigation instructions in previously unseen buildings. For training, each instruction is associated with a Matterport3D Simulator trajectory. 22k instructions are available, with an average length of 29 words. There is a test evaluation server for this dataset available at EvalAI.

FigureQA 201710 (home, arXiv (8 cit), Microsoft, qa)

We introduce FigureQA, a visual reasoning corpus of over one million question-answer pairs grounded in over 100,000 images. The images are synthetic, scientific-style figures from five classes: line plots, dot-line plots, vertical and horizontal bar graphs, and pie charts. We formulate our reasoning task by generating questions from 15 templates; questions concern various relationships between plot elements and examine characteristics like the maximum, the minimum, area-under-the-curve, smoothness, and intersection. To resolve, such questions often require reference to multiple plot elements and synthesis of information distributed spatially throughout a figure. To facilitate the training of machine learning systems, the corpus also includes side data that can be used to formulate auxiliary objectives. In particular, we provide the numerical data used to generate each figure as well as bounding-box annotations for all plot elements. We study the proposed visual reasoning task by training several models, including the recently proposed Relation Network as a strong baseline. Preliminary results indicate that the task poses a significant machine learning challenge. We envision FigureQA as a first step towards developing models that can intuitively recognize patterns from visual representations of data.

GuessWhich 201708 (arXiv (10 cit), github, Gatech)

In this work, we design a cooperative game - GuessWhich - to measure human-AI team performance in the specific context of the AI being a visual conversational agent. GuessWhich involves live interaction between the human and the AI. The AI, which we call ALICE, is provided an image which is unseen by the human. Following a brief description of the image, the human questions ALICE about this secret image to identify it from a fixed pool of images. We measure performance of the human-ALICE team by the number of guesses it takes the human to correctly identify the secret image after a fixed number of dialog rounds with ALICE.

NLVR 201707 (home, paper (20 cit), github, Cornell, qa)

Visual reasoning language dataset, containing 92,244 pairs of examples of natural statements grounded in synthetic images with 3,962 unique sentences.

GuessWhat 201707 (home, arXiv (38 cit), data, github, Google/Montreal)

GuessWhat?! is a cooperative two-player guessing game. The goal of the game is to locate an unknown object in a rich image scene by asking a sequence of questions. The aim of this project is to facilitate research in combining visual understanding, natural language processing and cooperative agent interaction.

 • 155,280 played games
 • 821,889 questions+answers
 • 66,537 images
 • 134,073 objects

TQA 201707 (home, paper (19 cit), ai2)

TQA: Textbook Question Answering. The TQA dataset encourages work on the task of Multi-Modal Machine Comprehension (M3C). The M3C task builds on the popular Visual Question Answering (VQA) and Machine Comprehension (MC) paradigms by framing question answering as a machine comprehension task, where the context needed to answer questions is provided and composed of both text and images. The dataset constructed to showcase this task has been built from a middle school science curriculum that pairs a given question to a limited span of knowledge needed to answer it.

Visual Genome 201705 (home, paper (444 cit), Stanford)

Visual Genome is a dataset, a knowledge base, an ongoing effort to connect structured image concepts to language.

 • 108,077 Images
 • 5.4 Million Region Descriptions
 • 1.7 Million Visual Question Answers
 • 3.8 Million Object Instances
 • 2.8 Million Attributes
 • 2.3 Million Relationships
 • Everything Mapped to Wordnet Synsets

CLEVR 201612 (home, arXiv (184 cit), Stanford/FAIR, qa)

An artificially generated visual question answering dataset:

 • A training set of 70,000 images and 699,989 questions
 • A validation set of 15,000 images and 149,991 questions
 • A test set of 15,000 images and 149,988 questions
 • Answers for all train and val questions
 • Scene graph annotations for train and val images giving ground-truth locations, attributes, and relationships for objects
 • Functional program representations for all training and validation images

MarioQA (home, arXiv (11 cit), github, videoqa, postech/korea)

MarioQA: Answering Questions by Watching Gameplay Videos. From a total of 13 hours of gameplay, we collect 187,757 examples with automatically generated QA pairs. There are 92,874 unique QA pairs and each video clip contains 11.3 events on average. There are 78,297, 64,619 and 44,841 examples in NT, ET and HT, respectively. Note that there are 3.5K examples that can be answered using a single frame of video; such examples make up less than 2% of the total. The other examples are event-centric; 98K examples require focusing on a single event out of multiple ones, while 86K require recognizing multiple events for counting (55K) or identifying their temporal relationships (44K). Note that there are instances that belong to both cases.

VisDial 201611 (home, paper (119 cit), challenge2018, Gatech, qa)

Visual Dialog is a novel task that requires an AI agent to hold a meaningful dialog with humans in natural, conversational language about visual content. Specifically, given an image, a dialog history, and a follow-up question about the image, the agent has to answer the question.

 • 120k images from COCO
 • 1 dialog / image
 • 10 rounds of question-answers / dialog
 • Total 1.2M dialog question-answers
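The total in the list above follows directly from the other three figures. A quick check, with variable names chosen here for illustration:

```python
# Figures taken from the list above.
images = 120_000
dialogs_per_image = 1
rounds_per_dialog = 10

# Each round contributes one question-answer pair.
total_qa = images * dialogs_per_image * rounds_per_dialog
print(f"{total_qa:,}")
```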

Comics 201611 (arXiv (19 cit), github, umd)

We construct a dataset, COMICS, that consists of over 1.2 million panels (120 GB) paired with automatic textbox transcriptions. An in-depth analysis of COMICS demonstrates that neither text nor image alone can tell a comic book story, so a computer must understand both modalities to keep up with the plot. We introduce three cloze-style tasks that ask models to predict narrative and character-centric aspects of a panel given n preceding panels as context.

SCONE 201606 (home, paper1 (16 cit), paper2 (37 cit), CodaLab, github/clic-lab, Stanford, exec)

Sequential CONtext-dependent Execution dataset: The task in the SCONE dataset is to execute a sequence of actions according to the instructions. Each scenario contains a world with several objects (e.g., beakers), each with different properties (e.g., chemical colors and amounts). Given 5 sequential instructions in human language (e.g., "Pour from the first beaker into the yellow beaker" or "Mix it"), the system has to predict the final world state.
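As a rough sketch of what "predict the final world state" means, the snippet below applies a sequence of already-parsed actions to a toy beaker world. The state encoding (a list of color units per beaker), the action tuples and the rule that mixing turns the contents brown are all assumptions made for illustration; the hard part of the task, mapping the instructions to actions, is not shown.

```python
def execute(state, actions):
    """Apply already-parsed actions in order and return the final world state."""
    beakers = [list(b) for b in state]  # copy so the input is not mutated
    for action in actions:
        op = action[0]
        if op == "pour":
            # Move the entire contents of one beaker into another.
            _, src, dst = action
            beakers[dst].extend(beakers[src])
            beakers[src] = []
        elif op == "drain":
            # Remove up to `amount` units from the top of a beaker.
            _, idx, amount = action
            keep = max(0, len(beakers[idx]) - amount)
            del beakers[idx][keep:]
        elif op == "mix":
            # Assumed rule: mixing turns every unit in the beaker brown.
            _, idx = action
            beakers[idx] = ["brown"] * len(beakers[idx])
        else:
            raise ValueError(f"unknown action: {op}")
    return beakers


initial = [["green", "green"], ["yellow"], []]
# Hypothetical parse of: "Pour the first beaker into the yellow beaker",
# "Mix it", "Drain one unit from it".
actions = [("pour", 0, 1), ("mix", 1), ("drain", 1, 1)]
final = execute(initial, actions)
print(final)
```

Note that "it" in the second and third instructions refers back to the beaker from the first one, which is why the instructions have to be interpreted in sequence rather than one at a time.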

Blocks 201606 (home, paper1 (28 cit), paper2 (41 cit), github, ciff-models, ISI, exec)

Dataset where humans give instructions to robots using unrestricted natural language commands to build complex goal configurations in a blocks world. Example instruction from a sequence: "move the nvidia block to the right of the hp block".

VizDoom 201605 (environment (192 cit), questions (21 cit), multitask)

The recent advances in deep neural networks have led to effective vision-based reinforcement learning methods that have been employed to obtain human-level controllers in Atari 2600 games from pixel data. Atari 2600 games, however, do not resemble real-world tasks since they involve non-realistic 2D environments and the third-person perspective. Here, we propose a novel test-bed platform for reinforcement learning research from raw visual information which employs the first-person perspective in a semi-realistic 3D world. The software, called ViZDoom, is based on the classical first-person shooter video game, Doom. It allows developing bots that play the game using the screen buffer. ViZDoom is lightweight, fast, and highly customizable via a convenient mechanism of user scenarios. (The second paper introduces a language grounding task; the third paper tries multi-task navigation and VQA in this domain.)

VIST 201604 (home, arXiv (64 cit), github, challenge2017, challenge2018, Microsoft, desc)

Visual Storytelling Challenge (NAACL 2018). We introduce the first dataset for sequential vision-to-language, and explore how this data may be used for the task of visual storytelling. The dataset includes 81,743 unique photos in 20,211 sequences, aligned to descriptive and story language. VIST was previously known as "SIND", the Sequential Image Narrative Dataset.

AI2D 201603 (home, arXiv (30 cit), data, ai2, qa)

AI2D is a dataset of illustrative diagrams for research on diagram understanding and associated question answering. Each diagram has been densely annotated with object segmentations, diagrammatic and text elements. Each diagram has a corresponding set of questions and answers.

MovieQA 201512 (home, arXiv (129 cit), examples, videoqa, toronto)

We introduce the MovieQA dataset which aims to evaluate automatic story comprehension from both video and text. The data set consists of almost 15,000 multiple choice question answers obtained from over 400 movies and features high semantic diversity. Each question comes with a set of five highly plausible answers, only one of which is correct. The questions can be answered using multiple sources of information: movie clips, plots, subtitles, and, for a subset, scripts and DVS.

VQA 201505 (home, challenge2017, challenge2018, paper1 (878 cit), paper2 (67 cit), paper3 (160 cit), Gatech/Vtech, qa)

VQA is a new dataset containing open-ended questions about images. These questions require an understanding of vision, language and commonsense knowledge to answer.

 • 265,016 images (COCO and abstract scenes)
 • At least 3 questions (5.4 questions on average) per image
 • 10 ground truth answers per question
 • 3 plausible (but likely incorrect) answers per question
 • Automatic evaluation metric
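The list does not spell out the metric. A commonly cited form, stated here from memory of the VQA paper and so best treated as an assumption, counts an answer as fully correct when at least 3 of the 10 human answers match it. The sketch below omits the answer normalization and the averaging over annotator subsets that an official evaluation script may apply; `vqa_accuracy` is a name chosen for illustration.

```python
def vqa_accuracy(predicted, human_answers):
    """Score one prediction against the human answers for a question.

    Assumed form of the metric: min(number of matching humans / 3, 1).
    """
    matches = sum(1 for answer in human_answers if answer == predicted)
    return min(matches / 3.0, 1.0)


# Ten ground-truth answers for one question, as in the list above.
humans = ["yes"] * 2 + ["no"] * 8
print(vqa_accuracy("no", humans))   # at least 3 humans agree
print(vqa_accuracy("yes", humans))  # only 2 humans agree, so partial credit
```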

Refer 201410 (paper1 (176 cit), paper2 (70 cit), github, UNC)

ReferItGame: Referring to Objects in Photographs of Natural Scenes. In this paper we introduce a new game to crowd-source natural language referring expressions. By designing a two player game, we can both collect and verify referring expressions directly within the game. To date, the game has produced a dataset containing 130,525 expressions, referring to 96,654 distinct objects, in 19,894 photographs of natural scenes.

DAQUAR 201410 (home, arXiv (264 cit), qa)

DAQUAR: Towards a Visual Turing Challenge:

 • NYU-Depth V2 dataset with textual question-answer pairs
 • 1449 RGBD indoor images
 • 12.5k question-answer pairs
 • Answers: colors, numbers, objects and sets of these
 • Subjectivity is prominent in the dataset
 • About 9 question-answer pairs per image
 • Object’s category occurs 4 times in training set

COCO 201405 (home, arXiv (3687 cit))

Common Objects in Context: COCO is a large-scale object detection, segmentation, and captioning dataset. COCO has several features:

 • Object segmentation
 • Recognition in context
 • Superpixel stuff segmentation
 • 330K images (>200K labeled)
 • 1.5 million object instances
 • 80 object categories
 • 91 stuff categories
 • 5 captions per image
 • 250,000 people with keypoints

Flickr30K 201402 (home, paper (490 cit), Illinois, captioning)

We have created an image caption corpus consisting of 158,915 crowd-sourced captions describing 31,783 images. This is an extension of our previous Flickr 8k Dataset. The new images and captions focus on people involved in everyday activities and events.

Flickr8K 201308 (home, data, paper (475 cit), Illinois, captioning)

We introduce a new benchmark collection for sentence-based image description and search, consisting of 8,000 images that are each paired with five different captions which provide clear descriptions of the salient entities and events.

PASCAL 201006 (home, paper (353 cit), Illinois, captioning)

We describe our experience in creating corpora of images annotated with multiple one-sentence descriptions on MTurk and explore the effectiveness of different quality control strategies for collecting linguistic data using Mechanical Turk. We find that the use of a qualification test provides the highest improvement of quality, whereas refining the annotations through follow-up tasks works rather poorly. Using our best setup, we construct two image corpora, totaling more than 40,000 descriptive captions for 9,000 images.

SAIL 200607 (home, data, data2, data3, paper1 (226 cit), paper2 (280 cit), UTAustin, navi)

A corpus of 786 route instructions gathered from six people in three large-scale virtual indoor environments. Thirty-six other people followed these instructions and rated them for quality. These human participants finished at the intended destination on 69% of the trials. State of the art is 35.4%. The instructions were later split into sentences and corresponding segments. State of the art at sentence level is 73%. Sample sentences: "With the wall on your left, walk forward", "Go two intersections down the pink hallway".
