İlker Kesen, Ozan Arkan Can, Erkut Erdem, Aykut Erdem, Deniz Yuret. June 20, 2022. Best paper at the 5th Multimodal Learning and Applications Workshop (MULA 2022) in conjunction with CVPR 2022. (PDF, arXiv:2003.12739, presentation video).
Abstract: How to best integrate linguistic and perceptual processing in multi-modal tasks that involve language and vision is an important open problem. In this work, we argue that the common practice of using language in a top-down manner, to direct visual attention over high-level visual features, may not be optimal. We hypothesize that the use of language to also condition the bottom-up processing from pixels to high-level features can provide benefits to the overall performance. To support our claim, we propose a model for language-vision problems involving dense prediction, and perform experiments on two different multi-modal tasks: image segmentation from referring expressions and language-guided image colorization. We compare results where either one or both of the top-down and bottom-up visual branches are conditioned on language. Our experiments reveal that using language to control the filters for bottom-up visual processing in addition to top-down attention leads to better results on both tasks and achieves state-of-the-art performance. Our analysis of different word types in input expressions suggest that the bottom-up conditioning is especially helpful in the presence of low level visual concepts like color.
Abstract: Language models demonstrate both quantitative improvement and new qualitative capabilities with increasing scale. Despite their potentially transformative impact, these new capabilities are as yet poorly characterized. In order to inform future research, prepare for disruptive new model capabilities, and ameliorate socially harmful effects, it is vital that we understand the present and near-future capabilities and limitations of language models. To address this challenge, we introduce the Beyond the Imitation Game benchmark (BIG-bench). BIG-bench currently consists of 204 tasks, contributed by 442 authors across 132 institutions. Task topics are diverse, drawing problems from linguistics, childhood development, math, common-sense reasoning, biology, physics, social bias, software development, and beyond. BIG-bench focuses on tasks that are believed to be beyond the capabilities of current language models. We evaluate the behavior of OpenAI's GPT models, Google-internal dense transformer architectures, and Switch-style sparse transformers on BIG-bench, across model sizes spanning millions to hundreds of billions of parameters. In addition, a team of human expert raters performed all tasks in order to provide a strong baseline. Findings include: model performance and calibration both improve with scale, but are poor in absolute terms (and when compared with rater performance); performance is remarkably similar across model classes, though with benefits from sparsity; tasks that improve gradually and predictably commonly involve a large knowledge or memorization component, whereas tasks that exhibit "breakthrough" behavior at a critical scale often involve multiple steps or components, or brittle metrics; social bias typically increases with scale in settings with ambiguous context, but this can be improved with prompting.
Tayfun Ates, M. Ateşoğlu, Çağatay Yiğit, Ilker Kesen, Mert Kobas, Erkut Erdem, Aykut Erdem, Tilbe Goksun, Deniz Yuret. May 2022. In Findings of the Association for Computational Linguistics: ACL 2022, pages 2602–2627, Dublin, Ireland. Association for Computational Linguistics.
(PDF,
openreview,
arXiv:2012.04293,
poster).
Abstract: Humans are able to perceive, understand and reason about causal events. Developing models with similar physical and causal understanding capabilities is a long-standing goal of artificial intelligence. As a step towards this direction, we introduce CRAFT, a new video question answering dataset that requires causal reasoning about physical forces and object interactions. It contains 58K video and question pairs that are generated from 10K videos from 20 different virtual environments, containing various objects in motion that interact with each other and the scene. Two question categories in CRAFT include previously studied descriptive and counterfactual questions. Additionally, inspired by the Force Dynamics Theory in cognitive linguistics, we introduce a new causal question category that involves understanding the causal interactions between objects through notions like cause, enable, and prevent. Our results show that even though the questions in CRAFT are easy for humans, the tested baseline models, including existing state-of-the-art methods, do not yet deal with the challenges posed in our benchmark.
Ali Safaya, Emirhan Kurtuluş, Arda Göktoğan, Deniz Yuret. May 2022. In Findings of the Association for Computational Linguistics: ACL 2022, pages 846–863, Dublin, Ireland. Association for Computational Linguistics.
(PDF,
openreview,
arXiv:2203.01215,
poster).
Abstract: Having sufficient resources for language X lifts it from the under-resourced languages class, but not necessarily from the under-researched class. In this paper, we address the problem of the absence of organized benchmarks in the Turkish language. We demonstrate that languages such as Turkish are left behind the state-of-the-art in NLP applications. As a solution, we present Mukayese, a set of NLP benchmarks for the Turkish language that contains several NLP tasks. We work on one or more datasets for each benchmark and present two or more baselines. Moreover, we present four new benchmarking datasets in Turkish for language modeling, sentence segmentation, and spell checking. All datasets and baselines are available under: https://github.com/alisafaya/mukayese.
Talk by Ali Safaya and Taner Sezer on the Turkish Data Depository (TDD) project, which aims to collect Turkish NLP resources such as corpora, labeled data, model weights under https://tdd.ai. You can register on the website and download resources, sign up for the mailing list to get updates.
Full post...
Tayfun Ates, Muhammed Samil Atesoglu, Cagatay Yigit, Ilker Kesen, Mert Kobas, Erkut Erdem, Aykut Erdem, Tilbe Goksun, Deniz Yuret. Shared Visual Representations in Human and Machine Intelligence (SVRHM 2020). NeurIPS Workshop. (PDF, Presentation)
Abstract: Recent advances in Artificial Intelligence and deep learning have revived the interest in studying the gap between the reasoning capabilities of humans and machines. In this ongoing work, we introduce CRAFT, a new visual question answering dataset that requires causal reasoning about physical forces and object interactions. It contains 38K video and question pairs that are generated from 3K videos from 10 different virtual environments, containing different number of objects in motion that interact with each other. Two question categories from CRAFT include previously studied descriptive and counterfactual questions. Besides, inspired by the theory of force dynamics from the field of human cognitive psychology, we introduce new question categories that involve understanding the intentions of objects through the notions of cause, enable, and prevent. Our preliminary results demonstrate that even though these tasks are very intuitive for humans, the implemented baselines could not cope with the underlying challenges.
Ekin Akyürek, Erenay Dayanık, Deniz Yuret (2019). Transactions Of The Association For Computational Linguistics, 7, 567-579. (PDF, arXiv)
Abstract: We introduce Morse, a recurrent encoder-decoder model that produces morphological analyses of each word in a sentence. The encoder turns the relevant information about the word and its context into a fixed size vector representation and the decoder generates the sequence of characters for the lemma followed by a sequence of individual morphological features. We show that generating morphological features individually rather than as a combined tag allows the model to handle rare or unseen tags and outperform whole-tag models. In addition, generating morphological features as a sequence rather than e.g. an unordered set allows our model to produce an arbitrary number of features that represent multiple inflectional groups in morphologically complex languages. We obtain state-of-the art results in nine languages of different morphological complexity under low-resource, high-resource and transfer learning settings. We also introduce TrMor2018, a new high accuracy Turkish morphology dataset. Our Morse implementation and the TrMor2018 dataset are available online to support future research.
Hürriyetoğlu, Ali and Yörük, Erdem and Yuret, Deniz and Yoltar, C. and Gürel, B. and Duruşan, F. and Mutlu, O. and Akdemir, A. In CLEF 2019 Working Notes. September, 2019. (PDF, Proceedings).
Abstract: We present an overview of the CLEF-2019 Lab ProtestNews
on Extracting Protests from News in the context of generalizable natural language processing. The lab consists of document, sentence, and
token level information classification and extraction tasks that were referred as task 1, task 2, and task 3 respectively in the scope of this lab.
The tasks required the participants to identify protest relevant information from English local news at one or more aforementioned levels in
a cross-context setting, which is cross-country in the scope of this lab.
The training and development data were collected from India and test
data was collected from India and China. The lab attracted 58 teams
to participate in the lab. 12 and 9 of these teams submitted results and
working notes respectively. We have observed neural networks yield the
best results and the performance drops significantly for majority of the
submissions in the cross-country setting, which is China.
Cengiz, Cemil and Sert, Ulaş and Yuret, Deniz. In Proceedings of the 18th BioNLP Workshop and Shared Task. August, 2019. Florence, Italy. (PDF)
Abstract: In this paper, we describe our system and results submitted for the Natural Language Inference (NLI) track of the MEDIQA 2019 Shared Task. As KU_AI team, we used BERT as our baseline model and pre-processed the MedNLI dataset to mitigate the negative impact of de-identification artifacts. Moreover, we investigated different pre-training and transfer learning approaches to improve the performance. We show that pre-training the language model on rich biomedical corpora has a significant effect in teaching the model domain-specific language. In addition, training the model on large NLI datasets such as MultiNLI and SNLI helps in learning task-specific reasoning. Finally, we ensembled our highest-performing models, and achieved 84.7% accuracy on the unseen test dataset and ranked 10th out of 17 teams in the official results.
Osman Mutlu, Ozan Arkan Can and Erenay Dayanık. 2019. In International Workshop on Semantic Evaluation (SemEval-2019 at NAACL-HLT-2019). (paper, proceedings)
Abstract: This paper describes our system for SemEval-2019 Task 4: Hyperpartisan News Detection (Kiesel et al., 2019). We use pretrained BERT (Devlin et al., 2018) architecture and investigate the effect of different fine tuning regimes on the final classification task. We show that additional pretraining on news domain improves the performance on the Hyperpartisan News Detection task. Our system ranked 8th out of 42 teams with 78.3% accuracy on the held-out test dataset.
Ozan Arkan Can, Pedro Zuidberg Dos Martires , Andreas Persson , Julian Gaal , Amy Loutfi , Luc De Raedt , Deniz Yuret and Alessandro Saffiotti. 2019. In Proceedings of the Combined Workshop on Spatial Language Understanding (SpLU) & Grounded Communication for Robotics (RoboNLP) at NAACL-HLT-2019. (abstract, paper, poster, proceedings)
Abstract: Human-robot interaction often occurs in the form of instructions given from a human to a robot. For a robot to successfully follow instructions, a common representation of the world and objects in it should be shared between humans and the robot so that the instructions can be grounded. Achieving this representation can be done via learning, where both the world representation and the language grounding are learned simultaneously. However, in robotics this can be a difficult task due to the cost and scarcity of data. In this paper, we tackle the problem by separately learning the world representation of the robot and the language grounding. While this approach can address the challenges in getting sufficient data, it may give rise to inconsistencies between both learned components. Therefore, we further propose Bayesian learning to resolve such inconsistencies between the natural language grounding and a robot’s world representation by exploiting spatio-relational information that is implicitly present in instructions given by a human. Moreover, we demonstrate the feasibility of our approach on a scenario involving a robotic arm in the physical world.
Ali Hürriyetoğlu, Erdem Yörük, Deniz Yuret, Çağrı Yoltar, Burak Güler, Fırat Duruşan and Osman Mutlu. 2019. In ECIR (CLEF Organizers Lab Track), Germany. (paper, proceedings)
Abstract: We propose a coherent set of tasks for protest information collection in the context of generalizable natural language processing. The tasks are news article classification, event sentence detection, and event extraction. Having tools for collecting event information from data produced in multiple countries enables comparative sociology and politics studies. We have annotated news articles in English from a source and a target country in order to be able to measure the performance of the tools developed using data from one country on data from a different country. Our preliminary experiments have shown that the performance of the tools developed using English texts from India drops to a level that are not usable when they are applied on English texts from China. We think our setting addresses the challenge of building generalizable NLP tools that perform well independent of the source of the text and will accelerate progress in line of developing generalizable NLP systems.
We design a new dataset, GQA, to address these shortcomings, featuring compositional questions over real-world images. We leverage semantic representations of both the scenes and questions to mitigate language priors and conditional biases and enable fine-grained diagnosis for different question types. The dataset consists of 22M questions about various day-to-day images. Each image is associated with a scene graph of the image's object, attributes and relations, a new cleaner version based on Visual Genome. Each question is associated with a structured representation of its semantics, a functional program that specifies the reasoning steps have to be taken to answer it.
An agent must first follow navigation instructions in a real-life visual urban environment to a goal position, and then identify in the observed image a location described in natural language to find a hidden object. The data contains 9,326 examples of English instructions and spatial descriptions paired with demonstrations.
Visual commonsense reasoning dataset. Visual Commonsense Reasoning (VCR) is a new task and large-scale dataset for cognition-level visual understanding. With one glance at an image, we can effortlessly imagine the world beyond the pixels (e.g. that [person1] ordered pancakes). While this task is easy for humans, it is tremendously difficult for today's vision systems, requiring higher-order cognition and commonsense reasoning about the world. We formalize this task as Visual Commonsense Reasoning. In addition to answering challenging visual questions expressed in natural language, a model must provide a rationale explaining why its answer is true.
290k multiple choice questions
290k correct answers and rationales: one per question
110k images
Counterfactual choices obtained with minimal bias, via our new Adversarial Matching approach
Answers are 7.5 words on average; rationales are 16 words.
High human agreement (>90%)
Scaffolded on top of 80 object categories from COCO
Is now (as of Dec 3, 2018) available for download!
The data contains 107,296 examples of English sentences paired with web photographs. The task is to determine whether a natural language caption is true about a photograph. The data was collected through crowdsourcings, and solving the task requires reasoning about sets of objects, comparisons, and spatial relations. There are two related corpora: NLVR, with synthetically generated images, and NLVR2, which includes natural photographs.
In this paper, we introduce How2, a multimodal collection of instructional videos with English subtitles and crowdsourced Portuguese translations. We also present integrated sequence-to-sequence baselines for machine translation, automatic speech recognition, spoken language translation, and multimodal summarization. By making available data and code for several multimodal natural language tasks, we hope to stimulate more research on these and similar challenges, to obtain a deeper understanding of multimodality in language processing. The corpus consists of around 80,000 instructional videos (about 2,000 hours) with associated English sub-titles and summaries. About 300 hours have also been translated into Portuguese using crowd-sourcing, and used during the JSALT 2018 Workshop.
TVQA: Localized, Compositional Video Question Answering. TVQA is a large-scale video QA dataset based on 6 popular TV shows (Friends, The Big Bang Theory, How I Met Your Mother, House M.D., Grey's Anatomy, Castle). It consists of 152.5K QA pairs from 21.8K video clips, spanning over 460 hours of video. The questions are designed to be compositional, requiring systems to jointly localize relevant moments within a clip, comprehend subtitles-based dialogue, and recognize relevant visual concepts.
TEMPOral reasoning in video and language (TEMPO) dataset. Localizing moments in a longer video via natural language queries is a new, challenging task at the intersection of language and video understanding. Though moment localization with natural language is similar to other language and vision tasks like natural language object retrieval in images, moment localization offers an interesting opportunity to model temporal dependencies and reasoning in text. Our dataset consists of two parts: a dataset with real videos and template sentences (TEMPO - Template Language) which allows for controlled studies on temporal language, and a human language dataset which consists of temporal sentences annotated by humans (TEMPO - Human Language).
RecipeQA is a dataset for multimodal comprehension of cooking recipes. It consists of over 36K question-answer pairs automatically generated from approximately 20K unique recipes with step-by-step instructions and images. Each question in RecipeQA involves multiple modalities such as titles, descriptions or images, and working towards an answer requires (i) joint understanding of images and text, (ii) capturing the temporal flow of events, and (iii) making sense of procedural knowledge.
We propose to decompose instruction execution to goal prediction and action generation. We design a model that maps raw visual observations to goals using LINGUNET, a language-conditioned image generation network, and then generates the actions required to complete them. Our model is trained from demonstration only without external resources. To evaluate our approach, we introduce two benchmarks for instruction following: LANI, a navigation task; and CHAI, where an agent executes household instructions.
Two agents, a "tourist" and a "guide", interact with each other via natural language in order to have the tourist navigate towards the correct location. The guide has access to a map and knows the target location but not the tourist location, while the tourist does not know the way but can navigate in a 360-degree street view environment. The task involves "perception" for the tourist observing the world, "action" for the tourist to navigate through the environment, and "interactive dialogue" for the tourist and guide to work towards their common goal.
We make available Conceptual Captions, a new dataset consisting of ~3.3M images annotated with captions. In contrast with the curated style of other image caption annotations, Conceptual Caption images and their raw descriptions are harvested from the web, and therefore represent a wider variety of styles. More precisely, the raw descriptions are harvested from the Alt-text HTML attribute associated with web images. To arrive at the current version of the captions, we have developed an automatic pipeline that extracts, filters, and transforms candidate image/caption pairs, with the goal of achieving a balance of cleanliness, informativeness, fluency, and learnability of the resulting captions.
Automatic Understanding of Visual Advertisements: A large annotated dataset of image and video ads. In this dataset, we provide over 64,000 ad images annotated with the topic of the ad (e.g. the product or topic, in case of public service announcements), the sentiment that the ad provokes, any symbolic references that the ad makes (e.g. an owl symbolizes wakefulness, ice symbolizes freshness, etc.), including bounding boxes containing the physical content that alludes symbolically to concepts outside of the ad, and questions and answers about the meaning of the ad ("What should I do according to the ad? Why should I do it, according to the ad?")
VizWiz is proposed to empower a blind person to directly request in a natural manner what (s)he would like to know about the surrounding physical world.
We present CHALET, a 3D house simulator with support for navigation and manipulation. CHALET includes 58 rooms and 10 house configuration, and allows to easily create new house and room layouts. CHALET supports a range of common household activities, including moving objects, toggling appliances, and placing objects inside closeable containers. The environment and actions available are designed to create a challenging domain to train and evaluate autonomous agents, including for tasks that combine language, vision, and planning in a dynamic environment.
Building Generalizable Agents with a Realistic and Rich 3D Environment:
Teaching an agent to navigate in an unseen 3D environment is a challenging task, even in the event of simulated environments. To generalize to unseen environments, an agent needs to be robust to low-level variations (e.g. color, texture, object changes), and also high-level variations (e.g. layout changes of the environment). To improve overall generalization, all types of variations in the environment have to be taken under consideration via different level of data augmentation steps. To this end, we propose House3D, a rich, extensible and efficient environment that contains 45,622 human-designed 3D scenes of visually realistic houses, ranging from single-room studios to multi-storied houses, equipped with a diverse set of fully labeled 3D objects, textures and scene layouts, based on the SUNCG dataset (Song et.al.). The diversity in House3D opens the door towards scene-level augmentation, while the label-rich nature of House3D enables us to inject pixel- & task-level augmentations such as domain randomization (Toubin et. al.) and multi-task training. Using a subset of houses in House3D, we show that reinforcement learning agents trained with an enhancement of different levels of augmentations perform much better in unseen environments than our baselines with raw RGB input by over 8% in terms of navigation success rate.
CoDraw: Visual dialog for collaborative drawing. In this work, we propose a goal-driven collaborative task that contains vision, language, and action in a virtual environment as its core components. Specifically, we develop a collaborative `Image Drawing' game between two agents, called CoDraw. Our game is grounded in a virtual world that contains movable clip art objects. Two players, Teller and Drawer, are involved. The Teller sees an abstract scene containing multiple clip arts in a semantically meaningful configuration, while the Drawer tries to reconstruct the scene on an empty canvas using available clip arts. The two players communicate via two-way communication using natural language. We collect the CoDraw dataset of ~10K dialogs consisting of 138K messages exchanged between a Teller and a Drawer from Amazon Mechanical Turk (AMT). We analyze our dataset and present three models to model the players' behaviors, including an attention model to describe and draw multiple clip arts at each round. The attention models are quantitatively compared to the other models to show how the conventional approaches work for this new task. We also present qualitative visualizations.
We introduce Interactive Question Answering (IQA), the task of answering questions that require an autonomous agent to interact with a dynamic visual environment. IQA presents the agent with a scene and a question, like: "Are there any apples in the fridge?" The agent must navigate around the scene, acquire visual understanding of scene elements, interact with objects (e.g. open refrigerators) and plan for a series of actions conditioned on the question.
We present a new AI task -- Embodied Question Answering (EmbodiedQA) -- where an agent is spawned at a random location in a 3D environment and asked a question ("What color is the car?"). In order to answer, the agent must first intelligently navigate to explore the environment, gather information through first-person (egocentric) vision, and then answer the question ("orange"). This challenging task requires a range of AI skills -- active perception, language understanding, goal-driven navigation, commonsense reasoning, and grounding of language into actions. In this work, we develop the environments, end-to-end-trained reinforcement learning agents, and evaluation protocols for EmbodiedQA.
R2R is the first benchmark dataset for visually-grounded natural language navigation in real buildings. The dataset requires autonomous agents to follow human-generated navigation instructions in previously unseen buildings, as illustrated in the demo above. For training, each instruction is associated with a Matterport3D Simulator trajectory. 22k instructions are available, with an average length of 29 words. There is a test evaluation server for this dataset available at EvalAI.
We introduce FigureQA, a visual reasoning corpus of over one million question-answer pairs grounded in over 100,000 images. The images are synthetic, scientific-style figures from five classes: line plots, dot-line plots, vertical and horizontal bar graphs, and pie charts. We formulate our reasoning task by generating questions from 15 templates; questions concern various relationships between plot elements and examine characteristics like the maximum, the minimum, area-under-the-curve, smoothness, and intersection. To resolve, such questions often require reference to multiple plot elements and synthesis of information distributed spatially throughout a figure. To facilitate the training of machine learning systems, the corpus also includes side data that can be used to formulate auxiliary objectives. In particular, we provide the numerical data used to generate each figure as well as bounding-box annotations for all plot elements. We study the proposed visual reasoning task by training several models, including the recently proposed Relation Network as a strong baseline. Preliminary results indicate that the task poses a significant machine learning challenge. We envision FigureQA as a first step towards developing models that can intuitively recognize patterns from visual representations of data.
In this work, we design a cooperative game - GuessWhich - to measure human-AI team performance in the specific context of the AI being a visual conversational agent. GuessWhich involves live interaction between the human and the AI. The AI, which we call ALICE, is provided an image which is unseen by the human. Following a brief description of the image, the human questions ALICE about this secret image to identify it from a fixed pool of images. We measure performance of the human-ALICE team by the number of guesses it takes the human to correctly identify the secret image after a fixed number of dialog rounds with ALICE.
GuessWhat?! is a cooperative two-player guessing game. The goal of the game is to locate an unknown object in a rich image scene by asking a sequence of questions. The aim of this project is to facilitate research in combining visual understanding, natural language processing and cooperative agent interaction.
TQA: Textbook Question Answering. The TQA dataset encourages work on the task of Multi-Modal Machine Comprehension (M3C) task. The M3C task builds on the popular Visual Question Answering (VQA) and Machine Comprehension (MC) paradigms by framing question answering as a machine comprehension task, where the context needed to answer questions is provided and composed of both text and images. The dataset constructed to showcase this task has been built from a middle school science curriculum that pairs a given question to a limited span of knowledge needed to answer it.
MarioQA: Answering Questions by Watching Gameplay Videos. From a total of 13 hours of gameplays, we collect 187,757 examples with automatically generated QA pairs. There are 92,874 unique QA pairs and each video clip contains 11.3 events in average. There are 78,297, 64,619 and 44,841 examples in NT, ET and HT, respectively. Note that there are 3.5K examples that can be answered using a single frame of video; the portion of such examples is only less than 2%. The other examples are event-centric; 98K examples require to focus on a single event out of multiple ones while 86K need to recognize multiple events for counting (55K) or identifying their temporal relationships (44K). Note that there are instances that belong to both cases.
Visual Dialog is a novel task that requires an AI agent to hold a meaningful dialog with humans in natural, conversational language about visual content. Specifically, given an image, a dialog history, and a follow-up question about the image, the agent has to answer the question.
We construct a dataset, COMICS, that consists of over 1.2 million panels (120 GB) paired with automatic textbox transcriptions. An in-depth analysis of COMICS demonstrates that neither text nor image alone can tell a comic book story, so a computer must understand both modalities to keep up with the plot. We introduce three cloze-style tasks that ask models to predict narrative and character-centric aspects of a panel given n preceding panels as context.
Sequential CONtext-dependent Execution dataset: The task in the SCONE dataset is to execute a sequence of actions according to the instructions. Each scenario contains a world with several objects (e.g., beakers), each with different properties (e.g., chemical colors and amounts). Given 5 sequential instructions in human language (e.g., "Pour from the first beaker into the yellow beaker" or "Mix it"), the system has to predict the final world state.
Dataset where humans give instructions to robots using unrestricted natural language commands to build complex goal configurations in a blocks world. Example instruction from a sequence: "move the nvidia block to the right of the hp block".
The recent advances in deep neural networks have led to effective vision-based reinforcement learning methods that have been employed to obtain human-level controllers in Atari 2600 games from pixel data. Atari 2600 games, however, do not resemble real-world tasks since they involve non-realistic 2D environments and the third-person perspective. Here, we propose a novel test-bed platform for reinforcement learning research from raw visual information which employs the first-person perspective in a semi-realistic 3D world. The software, called ViZDoom, is based on the classical first-person shooter video game, Doom. It allows developing bots that play the game using the screen buffer. ViZDoom is lightweight, fast, and highly customizable via a convenient mechanism of user scenarios. (The second paper introduces a language grounding task, the third paper tries multi-task navigation and vqa in this domain)
Visual Storytelling Challenge (NAACL 2018). We introduce the first dataset for sequential vision-to-language, and explore how this data may be used for the task of visual storytelling. The dataset includes 81,743 unique photos in 20,211 sequences, aligned to descriptive and story language. VIST is previously known as "SIND", the Sequential Image Narrative Dataset (SIND).
AI2D is a dataset of illustrative diagrams for research on diagram understanding and associated question answering. Each diagram has been densely annotated with object segmentations, diagrammatic and text elements. Each diagram has a corresponding set of questions and answers.
We introduce the MovieQA dataset which aims to evaluate automatic story comprehension from both video and text. The data set consists of almost 15,000 multiple choice question answers obtained from over 400 movies and features high semantic diversity. Each question comes with a set of five highly plausible answers; only one of which is correct. The questions can be answered using multiple sources of information: movie clips, plots, subtitles, and for a subset scripts and DVS.
VQA is a new dataset containing open-ended questions about images. These questions require an understanding of vision, language and commonsense knowledge to answer.
265,016 images (COCO and abstract scenes)
At least 3 questions (5.4 questions on average) per image
10 ground truth answers per question
3 plausible (but likely incorrect) answers per question
ReferItGame: Referring to Objects in Photographs of Natural Scenes. In this paper we introduce a new game to crowd-source natural language referring expressions. By designing a two player game, we can both collect and verify referring expressions directly within the game. To date, the game has produced a dataset containing 130,525 expressions, referring to 96,654 distinct objects, in 19,894 photographs of natural scenes.
We have created an image caption corpus consisting of 158,915 crowd-sourced captions describing 31,783 images. This is an extension of our previous Flickr 8k Dataset. The new images and captions focus on people involved in everyday activities and events
We introduce a new benchmark collection for sentence-based image description and search, consisting of 8,000 images that are each paired with five different captions which provide clear descriptions of the salient entities and events.
We describe our experience in creating corpora of images annotated with multiple one-sentence descriptions on MTurk and explore the effectiveness of different quality control strategies for collecting linguistic data using Mechanical MTurk. We find that the use of a qualification test provides the highest improvement of quality, whereas refining the annotations through follow-up tasks works rather poorly. Using our best setup, we construct two image corpora, totaling more than 40,000 descriptive captions for 9000 images.
A corpus of 786 route instructions gathered from six people in three large-scale virtual indoor environments. Thirty six other people followed these instructions and rated them for quality. These human participants finished at the intended destination on 69% of the trials. State of the art is 35.4%. The instructions were later split into sentences and corresponding segments. State of the art at sentence level is 73%. Sample sentences: "With the wall on your left, walk forward", "Go two intersections down the pink hallway".
Arda Akdemir, Ali Hürriyetoglu, Erdem Yörük, Burak Gürel, Çagri Yoltar and Deniz Yuret. 2018. In Proceedings of the 12th Workshop on Geographic Information Retrieval (GIR'18). ACM. (pdf, url, proceedings)
Abstract: Place name recognition is one of the key tasks in Information Extraction. In this paper, we tackle this task in English News from India. We first analyze the results obtained by using available tools and corpora and then train our own models to obtain better results. Most of the previous work done on entity recognition for English makes use of similar corpora for both training and testing. Yet we observe that the performance drops significantly when we test the models on different datasets. For this reason, we have trained various models using combinations of several corpora. Our results show that training models using combinations of several corpora improves the relative performance of these models but still more research on this area is necessary to obtain place name recognizers that generalize to any given dataset.
Ömer Kırnap, Erenay Dayanık and Deniz Yuret. 2018. In Proceedings of the {CoNLL} 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies. (paper, code, proceedings, 2017 version, Ömer's MS thesis)
Abstract:
We introduce tree-stack LSTM to model
state of a transition based parser with
recurrent neural networks. Tree-stack
LSTM does not use any parse tree based
or hand-crafted features, yet performs
better than models with these features.
We also develop new set of embeddings
from raw features to enhance the performance.
There are 4 main components of
this model: stack’s σ-LSTM, buffer’s βLSTM,
actions’ LSTM and tree-RNN. All
LSTMs use continuous dense feature vectors
(embeddings) as an input. Tree-RNN
updates these embeddings based on transitions.
We show that our model improves
performance with low resource languages
compared with its predecessors. We participate
in CoNLL 2018 UD Shared Task
as the ”KParse” team and ranked 16th in
LAS, 15th in BLAS and BLEX metrics, of
27 participants parsing 82 test sets from 57
languages.
Berkay Furkan Önder, Can Gümeli and Deniz Yuret. 2018. In Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies. (paper, code, proceedings)
Abstract: We present SParse, our Graph-Based Parsing
model submitted for the CoNLL 2018
Shared Task: Multilingual Parsing from
Raw Text to Universal Dependencies (Zeman
et al., 2018). Our model extends
the state-of-the-art biaffine parser
(Dozat and Manning, 2016) with a structural
meta-learning module, SMeta, that
combines local and global label predictions.
Our parser has been trained and
run on Universal Dependencies datasets
(Nivre et al., 2016, 2018) and has 87.48%
LAS, 78.63% MLAS, 78.69% BLEX and
81.76% CLAS (Nivre and Fang, 2017)
score on the Italian-ISDT dataset and has
72.78% LAS, 59.10% MLAS, 61.38%
BLEX and 61.72% CLAS score on the
Japanese-GSD dataset in our official submission.
All other corpora are evaluated
after the submission deadline, for whom
we present our unofficial test results.
Dilek Zeynep Hakkani-Tür, Murat Saraçlar, Gökhan Tür, Kemal Oflazer and Deniz Yuret. 2018. In Turkish Natural Language Processing, Kemal Oflazer and Murat Saraçlar (Eds.), Ch.3, pp.53-68. Springer. (URL)
Abstract: Morphological disambiguation is the task of determining the contextually
correct morphological parses of tokens in a sentence. A morphological disambiguator
takes in sets of morphological parses for each token, generated by a morphological
analyzer, and then selects a morphological parse for each, considering statistical
and/or linguistic contextual information. This task can be seen as a generalization
of the part-of-speech (POS) tagging problem for morphologically rich languages.
The disambiguated morphological analysis is usually crucial for further processing
steps such as dependency parsing. In this chapter, we review morphological disambiguation
problem for Turkish and discuss approaches for solving this problem as
they have evolved from manually crafted constraint-based rule systems to systems
employing machine learning.