April 29, 2021

Ozan Arkan Can, Ph.D. 2021


Current position: Applied Scientist - Amazon Search - Berlin (Homepage, LinkedIn, Email)
PhD Thesis: Cognitively-Inspired Deep Learning Approaches for Grounded Language Learning. April 2021. (PDF, Presentation, Publications, Code).

Thesis Abstract:

Designing machines that can perceive the surrounding world and interact with us using human language is one of the long-standing goals of artificial intelligence. Although tremendous progress has been made in modeling linguistic meaning computationally, how best to integrate linguistic and perceptual processing in multi-modal tasks remains a significant open problem. This thesis explores several cognitively-inspired neural architectures that consider different aspects of language's role in cognition, visual perception, and task execution. The proposed models incorporate design choices motivated by cognitive science studies and build on common patterns in vision-language tasks.

We begin by presenting an encoder-decoder network with a novel channel-based perceptual attention mechanism and its application to the navigational instruction-following task. The perceptual processing component of this architecture uses language priors to focus on individual objects and properties within the environment while preserving spatial relations. To take full advantage of this component, we also propose an improved agent-centric world representation that allows the model to reason spatially over its perception.
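To make the mechanism concrete, here is a minimal PyTorch sketch of channel-wise attention driven by a language encoding; the module and parameter names (ChannelAttention, lang_dim, scorer) are illustrative assumptions, not the thesis implementation:

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Scales each channel of a spatial feature map by a language-derived
    weight, leaving the H x W spatial layout untouched.
    Illustrative sketch only; not the thesis code."""

    def __init__(self, lang_dim: int, num_channels: int):
        super().__init__()
        self.scorer = nn.Linear(lang_dim, num_channels)

    def forward(self, features: torch.Tensor, lang: torch.Tensor) -> torch.Tensor:
        # features: (B, C, H, W) perceptual map; lang: (B, lang_dim) instruction encoding
        weights = torch.softmax(self.scorer(lang), dim=-1)  # (B, C), one weight per channel
        return features * weights[:, :, None, None]         # broadcast over H and W
```

Because the weights scale whole channels rather than spatial positions, the H x W layout of the feature map, and hence the spatial relations between objects, is left intact.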

Next, we explore the use of the Neural Module Networks approach in a real robotic system for the first time. Since collecting large-scale real-world data is labor-intensive and expensive, the system overcomes data scarcity by learning language grounding on simulated data and learning the perceptual representation separately. However, these separate learning processes introduce inconsistencies between the user's and the robot's world models. To resolve them, we propose a Bayesian learning approach that uses the implicit information in the instruction to update the robot's perceptual belief, aligning what the user sees with what the robot perceives.
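As a toy illustration of this kind of belief update, the sketch below applies Bayes' rule to a single detected object; the label set, prior, and likelihood are made up for the example, and the thesis models are more elaborate:

```python
import numpy as np

def update_belief(prior: np.ndarray, likelihood: np.ndarray) -> np.ndarray:
    # Bayes' rule: the posterior is proportional to likelihood * prior.
    posterior = likelihood * prior
    return posterior / posterior.sum()

# The robot's classifier is unsure whether a detected object is a mug or a
# bowl, but the instruction "pick up the mug" implies that a mug is present.
prior = np.array([0.4, 0.6])        # P(label) for [mug, bowl] from perception
likelihood = np.array([0.9, 0.1])   # P(instruction | label), evidence implied by the instruction
print(update_belief(prior, likelihood))  # belief shifts toward "mug": ~[0.857, 0.143]
```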

The systems in both of these parts exploit the high-level effect of language on visual processing, operating on high-level representations. In the last part, we additionally investigate the effect of language on low-level visual processing. To this end, we condition the low-level and/or high-level visual processing branches of a backbone architecture on language using language filters, and apply these models to the task of image segmentation from referring expressions. Experiments show that modulating both low-level and high-level visual processing with language significantly improves language grounding performance.
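As an illustrative sketch of a language filter, the module below generates a 1x1 convolution kernel from the language encoding and convolves it with a visual feature map; the names (LanguageFilter, kernel_gen) and the exact filter form are assumptions rather than the thesis architecture:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LanguageFilter(nn.Module):
    """Generates a per-example 1x1 convolution kernel from the language
    encoding and applies it to a visual feature map.
    Illustrative sketch only; not the thesis code."""

    def __init__(self, lang_dim: int, in_channels: int, out_channels: int):
        super().__init__()
        self.kernel_gen = nn.Linear(lang_dim, out_channels * in_channels)
        self.in_channels = in_channels
        self.out_channels = out_channels

    def forward(self, features: torch.Tensor, lang: torch.Tensor) -> torch.Tensor:
        # features: (B, C_in, H, W); lang: (B, lang_dim)
        out = []
        for f, l in zip(features, lang):  # dynamic filters, one per example
            k = self.kernel_gen(l).view(self.out_channels, self.in_channels, 1, 1)
            out.append(F.conv2d(f.unsqueeze(0), k))
        return torch.cat(out, dim=0)      # (B, C_out, H, W)
```

The same module can be attached to an early convolutional block (low-level) or a late one (high-level), which is what makes it possible to compare modulating one branch against modulating both.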

