June 20, 2022

Modulating Bottom-Up and Top-Down Visual Processing via Language-Conditional Filters

İlker Kesen, Ozan Arkan Can, Erkut Erdem, Aykut Erdem, Deniz Yuret. June 20, 2022. Best paper at the 5th Multimodal Learning and Applications Workshop (MULA 2022) in conjunction with CVPR 2022. (PDF, arXiv:2003.12739, presentation video).

Abstract: How to best integrate linguistic and perceptual processing in multi-modal tasks that involve language and vision is an important open problem. In this work, we argue that the common practice of using language in a top-down manner, to direct visual attention over high-level visual features, may not be optimal. We hypothesize that the use of language to also condition the bottom-up processing from pixels to high-level features can provide benefits to the overall performance. To support our claim, we propose a model for language-vision problems involving dense prediction, and perform experiments on two different multi-modal tasks: image segmentation from referring expressions and language-guided image colorization. We compare results where either one or both of the top-down and bottom-up visual branches are conditioned on language. Our experiments reveal that using language to control the filters for bottom-up visual processing in addition to top-down attention leads to better results on both tasks and achieves state-of-the-art performance. Our analysis of different word types in input expressions suggests that the bottom-up conditioning is especially helpful in the presence of low-level visual concepts like color.
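To make the idea of language-conditional filters concrete, here is a minimal toy sketch in plain Python. It is not the model from the paper. It assumes, purely for illustration, that a language vector is mapped by a single linear projection to a bank of 1x1 filters, which are then applied to a visual feature map. The names `make_conditional_filters`, `apply_1x1_filters`, and `proj`, and the toy "red"/"blue" vectors, are hypothetical.

```python
def make_conditional_filters(lang_vec, proj, out_ch, in_ch):
    """Project a language vector to an (out_ch x in_ch) bank of 1x1 filters.

    proj is a hypothetical weight matrix of shape (out_ch * in_ch) x len(lang_vec);
    in a real model it would be learned.
    """
    flat = [sum(w * x for w, x in zip(row, lang_vec)) for row in proj]
    return [flat[o * in_ch:(o + 1) * in_ch] for o in range(out_ch)]


def apply_1x1_filters(feature_map, filters):
    """Apply 1x1 filters to an H x W x in_ch feature map (nested lists)."""
    return [
        [[sum(k * v for k, v in zip(f, pixel)) for f in filters] for pixel in row]
        for row in feature_map
    ]


# Toy setup: 3-dim language vectors, 2 input channels, 2 output channels.
proj = [
    [1, 0, 0],  # filter 0, weight on input channel 0: driven by language dim 0
    [0, 0, 0],  # filter 0, weight on input channel 1
    [0, 0, 0],  # filter 1, weight on input channel 0
    [0, 1, 0],  # filter 1, weight on input channel 1: driven by language dim 1
]
features = [[[2.0, 3.0], [4.0, 5.0]]]  # a 1 x 2 feature map with 2 channels

lang_red = [1, 0, 0]   # stand-in for an expression mentioning "red"
lang_blue = [0, 1, 0]  # stand-in for an expression mentioning "blue"

out_red = apply_1x1_filters(features, make_conditional_filters(lang_red, proj, 2, 2))
out_blue = apply_1x1_filters(features, make_conditional_filters(lang_blue, proj, 2, 2))
# out_red  == [[[2.0, 0.0], [4.0, 0.0]]]
# out_blue == [[[0.0, 3.0], [0.0, 5.0]]]
```

The same feature map yields different responses depending on the language input, which is the sense in which the filters, rather than only a later attention step, are conditioned on the expression.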
