October 26, 2023

Batuhan Özyurt, M.S. 2023


Current position: AI Research Engineer, Codeway Studios (LinkedIn)
MS Thesis: Localizing Knowledge in Large Language Model Representations. October 2023. (PDF)
Thesis Abstract:

Large language models (LLMs) are highly proficient at NLP tasks. In the first part of this work, we evaluate the performance of LLMs on the task of finding the locations of characters inside a long narrative. The objective of the task is to generate the correct answer when the input is a piece of a narrative followed by a question asking for the location of a character. To evaluate the task, we generate two new datasets by annotating the characters and their locations in the narratives: Andersen and Persuasion. We show that LLM performance on these datasets is not satisfactory when compared to a simple baseline we designed that does not use machine learning. We also experiment with in-context learning to improve performance and report the results. Moreover, we address the problem that LLMs are limited by their bounded context length. We hypothesize that if we can localize the character-location relation information among the activations inside an LLM, we can store those activations and inject them into another run of the model with a different prompt, so that the LLM can answer questions about information carried over from the earlier prompt even though the character-location relation is not mentioned explicitly in the current prompt. We develop five different techniques to localize the character-location relation information in LLMs: moving and adding LLM activations to other prompts, adding noise to LLM activations, checking cosine similarity between LLM activations, editing LLM activations, and visualizing attention scores during answer generation. We report the observations we made using these techniques.
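To make the activation-injection idea concrete, here is a minimal sketch using a HuggingFace GPT-2 model: hidden states recorded from a source prompt are added into a forward pass over a different prompt via a hook. The model choice, layer index, prompts, and the mean-pooled injection are illustrative assumptions, not the thesis's exact procedure.

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

LAYER = 6  # hypothetical layer assumed to encode the character-location relation

source = "Alice walked into the kitchen."
target = "Question: Where is Alice? Answer:"

# 1) Record hidden states of the source prompt at the chosen layer.
with torch.no_grad():
    src_out = model(**tok(source, return_tensors="pt"), output_hidden_states=True)
src_acts = src_out.hidden_states[LAYER]          # (1, src_len, hidden)
injection = src_acts.mean(dim=1, keepdim=True)   # crude summary of the relation

# 2) Add the stored activations into the target run via a forward hook.
def add_injection(module, inputs, output):
    # "adding LLM activations to other prompts": shift the target-run hidden states
    return (output[0] + injection,) + output[1:]

handle = model.transformer.h[LAYER].register_forward_hook(add_injection)
with torch.no_grad():
    gen = model.generate(**tok(target, return_tensors="pt"), max_new_tokens=5)
handle.remove()
print(tok.decode(gen[0]))

The same recorded activations could also be compared against target-run activations with cosine similarity, another of the techniques listed above.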


Full post...

September 15, 2023

İlker Kesen, Ph.D. 2023


Current position: Postdoctoral Scientist at the Department of Computer Science, University of Copenhagen - DIKU (LinkedIn, Website, Scholar, Github, Twitter)
PhD Thesis: Advancing Toward Temporal and Commonsense Reasoning in Vision-Language Learning. September 2023. (PDF, Presentation)
Thesis Abstract:

Humans learn to ground language to the world through experience, primarily visual observations. Devising natural language processing (NLP) approaches that can reason similarly to humans is a long-standing objective of the artificial intelligence community. Recently, transformer models have exhibited remarkable performance on numerous NLP tasks. This was followed by breakthroughs in vision-language (V&L) tasks, like image captioning and visual question answering, which require connecting language to the visual world. These successes of transformer models encouraged the V&L community to pursue more challenging directions, most notably temporal and commonsense reasoning. This thesis focuses on V&L problems that require either temporal reasoning, commonsense reasoning, or both simultaneously. Temporal reasoning is the ability to reason over time. In the context of V&L, this means going beyond static images, i.e., processing videos. Commonsense reasoning requires capturing the implicit general knowledge about the world surrounding us and making an accurate judgment using this knowledge within a particular context. This thesis comprises four distinct studies that connect language and vision by exploring various aspects of temporal and commonsense reasoning. Before advancing to these challenging directions, (i) we first focus on the localization stage: We experiment with a model that enables systematic evaluation of how language-conditioning should affect the bottom-up and top-down visual processing branches. We show that conditioning the bottom-up branch on language is crucial to ground visual concepts like colors and object categories. (ii) Next, we investigate whether existing video-language models thrive in answering questions about complex dynamic scenes. We choose the CRAFT benchmark as our test bed and show that state-of-the-art video-language models fall behind human performance by a large margin, failing to process dynamic scenes proficiently. (iii) In the third study, we develop a zero-shot video-language evaluation benchmark to evaluate the language understanding abilities of pretrained video-language models. Our experiments reveal that current video-language models are no better at understanding everyday dynamic actions than vision-language models that process static images as input. (iv) In the last study, we work on a figurative language understanding problem called euphemism detection. Euphemisms tone down expressions about sensitive or unpleasant issues. The ambiguous nature of euphemistic terms makes it challenging to detect their actual meaning within a context where commonsense knowledge and reasoning are necessities. We show that incorporating additional textual and visual knowledge in low-resource settings is beneficial for detecting euphemistic terms. Nonetheless, our findings across these four studies demonstrate a substantial gap between current V&L models' abilities and human cognition.
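As a rough illustration of the zero-shot scoring setup in study (iii), the sketch below ranks a correct caption against a foil caption for a short clip by averaging frame-level CLIP similarities. The model, random stand-in frames, and captions are illustrative assumptions; they do not reproduce the benchmark's actual protocol or the video-language models evaluated in the thesis.

import numpy as np
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Stand-in for sampled video frames (replace with real decoded frames).
frames = [Image.fromarray(np.random.randint(0, 255, (224, 224, 3), dtype=np.uint8))
          for _ in range(4)]
captions = ["a person opens a door", "a person closes a door"]  # correct vs. foil

inputs = processor(text=captions, images=frames, return_tensors="pt", padding=True)
with torch.no_grad():
    out = model(**inputs)

# Average each caption's similarity over frames; the higher score is the prediction.
scores = out.logits_per_text.mean(dim=1)  # (num_captions,)
print(dict(zip(captions, scores.tolist())))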


Full post...

September 07, 2023

Gürkan Soykan, M.S. 2023


Current position: PhD Student, Wageningen University and Research (LinkedIn, Email, Github)
MS Thesis: ComicVerse: Expanding the Frontiers of AI in Comic Books with Holistic Understanding. September 2023. (PDF, Presentation)
Thesis Abstract:

Comics are a unique and multimodal medium that conveys stories and ideas through sequential imagery, often accompanied by text for dialogue and narration. Comics' elaborate visual language exhibits variation across different authors, cultures, periods, technologies, and artistic styles. Consequently, the computational analysis of comic books requires addressing fundamental challenges in computer vision and natural language processing. In this thesis, I aim to enhance neural comic book understanding by making use of comics' unique multimodal nature and processing comics in a character-centric approach. I chose to work on the massive, publicly accessible collection of Golden Age American comics; however, the availability of annotated data is limited. Thus, to achieve my goal, I have adopted a holistic approach composed of four main steps, ranging from curating datasets to proposing novel tasks and architectures for comics. The first step involves extracting high-quality text data from speech bubbles and narrative box images using OCR models. In the second step, I decompose comic pages into their constituent components through detection, segmentation, and association tasks with a refined Multi-Task Learning (MTL) model. Detection involves identifying panels, speech bubbles, narrative boxes, character faces, and bodies. Segmentation focuses on isolating speech bubbles and panels, while the association task involves linking speech bubbles with character faces and bodies. In the third step, I utilize the paired character faces and bodies obtained from the previous stage to create character instances and, subsequently, re-identify and track these instances across sequential panels. These three steps make it possible to locate comic book panels, identify their components, and transform character identities into a dialogue-like structure. In the final step of my thesis, I propose a multimodal framework by introducing the ComicBERT model, which exploits the structure described above. Cloze-style tasks were used to evaluate ComicBERT's contextual understanding capabilities. Furthermore, I propose a new task called Scene-Cloze. As a result, my approach achieves a new state-of-the-art performance in Text-Cloze and Visual-Cloze tasks with accuracies of 69.5% and 77.1%, respectively, thus getting closer to the human baseline. Overall, the highlights of my contributions are as follows:
1. I curated and shared the COMICS Text+ Dataset with over two million transcriptions of text boxes from the Golden Age of comics. In addition, I open-sourced the text detection and recognition models fine-tuned for the task, along with the datasets used in their training.
2. I refined an MTL framework for detection, segmentation, and association tasks and achieved SOTA results in comic character face and body-to-speech bubble association tasks.
3. I proposed a novel Identity-Aware Semi-Supervised Learning for Comic Character Re-Identification framework to generate unified and identity-aligned comic character embeddings and identity representations. Furthermore, I generated two new datasets: the Comic Character Instances Dataset, encompassing over a million character instances used in the self-supervision phase, and the Comic Sequence Identity Dataset, containing annotations of identities within sets of four consecutive comic panels used in the semi-supervision phase.
4. I introduced the multimodal Comicsformer, a transformer-encoder architecture capable of processing sequential panels and their constituents. It serves as the backbone for the Masked Comic Modeling (MCM) task, a novel self-supervised pre-training strategy for comics, resulting in ComicBERT, a potential foundation model for Golden Age comics. ComicBERT achieves SOTA performance in cloze-style tasks, particularly in text-cloze and visual-cloze tasks, approaching human-level comprehension (see the sketch after this list).
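As a toy illustration of the cloze-style evaluation mentioned in item 4, the sketch below scores candidate dialogues for a masked panel with a small transformer encoder over panel and text features. The feature extractors, dimensions, and scoring head are stand-ins and do not reproduce the Comicsformer/ComicBERT architecture.

import torch
import torch.nn as nn

D = 256  # shared feature dimension (assumption)

class ClozeScorer(nn.Module):
    def __init__(self, d=D, layers=2):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=layers)
        self.cls = nn.Parameter(torch.randn(1, 1, d))
        self.head = nn.Linear(d, 1)

    def forward(self, panel_feats, text_feats, candidate_feat):
        # Sequence: [CLS] + context panel features + context dialogue features + candidate
        b = panel_feats.size(0)
        seq = torch.cat([self.cls.expand(b, -1, -1), panel_feats,
                         text_feats, candidate_feat.unsqueeze(1)], dim=1)
        enc = self.encoder(seq)
        return self.head(enc[:, 0]).squeeze(-1)  # one score per candidate

scorer = ClozeScorer()
panels = torch.randn(3, 4, D)      # 3 candidates, each paired with 4 context panels
texts = torch.randn(3, 6, D)       # 6 context dialogue boxes
candidates = torch.randn(3, D)     # 3 candidate dialogues for the masked panel
scores = scorer(panels, texts, candidates)
print(scores.argmax().item())      # index of the predicted dialogue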


Full post...

August 29, 2023

CLIP-guided StyleGAN Inversion for Text-driven Real Image Editing

Ahmet Canberk Baykal, Abdul Basit Anees, Duygu Ceylan, Erkut Erdem, Aykut Erdem and Deniz Yuret. Aug 29, 2023. ACM Transactions on Graphics (TOG), vol. 42, issue 5, article no. 172, pp. 1-18. Presented at ACM SIGGRAPH Asia 2023 in Sydney, Australia, Dec 12-15, 2023. (PDF, arXiv:2307.08397, Demo video).

Abstract: Researchers have recently begun exploring the use of StyleGAN-based models for real image editing. One particularly interesting application is using natural language descriptions to guide the editing process. Existing approaches for editing images using language either resort to instance-level latent code optimization or map predefined text prompts to some editing directions in the latent space. However, these approaches have inherent limitations. The former is not very efficient, while the latter often struggles to effectively handle multi-attribute changes. To address these weaknesses, we present CLIPInverter, a new text-driven image editing approach that is able to efficiently and reliably perform multi-attribute changes. The core of our method is the use of novel, lightweight text-conditioned adapter layers integrated into pretrained GAN-inversion networks. We demonstrate that by conditioning the initial inversion step on the CLIP embedding of the target description, we are able to obtain more successful edit directions. Additionally, we use a CLIP-guided refinement step to make corrections in the resulting residual latent codes, which further improves the alignment with the text prompt. Our method outperforms competing approaches in terms of manipulation accuracy and photo-realism on various domains including human faces, cats, and birds, as shown by our qualitative and quantitative results.
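As a rough sketch of the text-conditioned adapter idea, the snippet below modulates intermediate inversion-encoder features with the CLIP embedding of the target description, using FiLM-style scale-and-shift conditioning chosen here for illustration; the shapes and the exact adapter design in CLIPInverter differ.

import torch
import torch.nn as nn

class TextConditionedAdapter(nn.Module):
    def __init__(self, feat_dim=512, clip_dim=512):
        super().__init__()
        self.to_scale = nn.Linear(clip_dim, feat_dim)
        self.to_shift = nn.Linear(clip_dim, feat_dim)

    def forward(self, feats, clip_text_emb):
        # feats: (B, C, H, W) encoder features; clip_text_emb: (B, clip_dim)
        scale = self.to_scale(clip_text_emb)[:, :, None, None]
        shift = self.to_shift(clip_text_emb)[:, :, None, None]
        return feats * (1 + scale) + shift  # text-driven modulation of inversion features

adapter = TextConditionedAdapter()
feats = torch.randn(2, 512, 16, 16)     # features from a pretrained inversion encoder (stand-in)
text_emb = torch.randn(2, 512)          # CLIP embedding of the target description (stand-in)
conditioned = adapter(feats, text_emb)  # fed onward to predict residual latent codes for the edit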


Full post...

August 15, 2023

Domain-Adaptive Self-Supervised Face & Body Detection in Drawings

Barış Batuhan Topal, Deniz Yuret, Tevfik Metin Sezgin. Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence (IJCAI 2023) Main Track. Pages 1432-1439. August 2023. (PDF, arXiv:2211.10641).

Abstract: Drawings are powerful means of pictorial abstraction and communication. Understanding diverse forms of drawings, including digital arts, cartoons, and comics, has been a major problem of interest for the computer vision and computer graphics communities. Although there are large amounts of digitized drawings from comic books and cartoons, they contain vast stylistic variations, which necessitate expensive manual labeling for training domain-specific recognizers. In this work, we show how self-supervised learning, based on a teacher-student network with a modified student network update design, can be used to build face and body detectors. Our setup allows exploiting large amounts of unlabeled data from the target domain when labels are provided for only a small subset of it. We further demonstrate that style transfer can be incorporated into our learning pipeline to bootstrap detectors using a vast amount of out-of-domain labeled images from natural images (i.e., images from the real world). Our combined architecture yields detectors with state-of-the-art (SOTA) and near-SOTA performance using minimal annotation effort. Our code can be accessed from https://github.com/barisbatuhan/DASS_Detector.
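To illustrate the teacher-student self-supervision described above, here is a minimal sketch assuming an EMA-style teacher update and confidence-thresholded pseudo-labels on unlabeled target-domain drawings; the paper's modified student update differs in detail, and the detector, class count, and threshold below are placeholders.

import copy
import torch
import torchvision

# Stand-in detector; background + face + body as an assumed class layout.
student = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights=None, num_classes=3)
teacher = copy.deepcopy(student)
for p in teacher.parameters():
    p.requires_grad_(False)

def update_teacher(teacher, student, momentum=0.999):
    # Teacher weights follow the student as an exponential moving average.
    with torch.no_grad():
        for pt, ps in zip(teacher.parameters(), student.parameters()):
            pt.mul_(momentum).add_(ps, alpha=1 - momentum)

def pseudo_labels(teacher, unlabeled_images, score_thr=0.7):
    # Teacher predictions on unlabeled drawings become training targets for the
    # student once low-confidence boxes are filtered out.
    teacher.eval()
    with torch.no_grad():
        preds = teacher(unlabeled_images)
    return [{"boxes": p["boxes"][p["scores"] > score_thr],
             "labels": p["labels"][p["scores"] > score_thr]} for p in preds]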


Full post...

July 18, 2023

Abdul Basit Anees, M.S. 2023


Current position: Research Engineer at aiXplain (San Jose, California) (LinkedIn, Email)
MS Thesis: HyperGAN-CLIP: A Versatile Framework for CLIP-Guided Image Synthesis and Editing using Hypernetworks. July 2023. (PDF, Presentation)
Thesis Abstract:

Generative Adversarial Networks, particularly StyleGAN and its variants, have shown exceptional capability in generating highly realistic images. However, training these models remains challenging in domains where data is scarce, as it typically requires large datasets. In this thesis, we introduce a versatile framework that enhances the capabilities of a pre-trained StyleGAN for various tasks, including domain adaptation, reference-guided image synthesis, and text-guided image manipulation, even when only a small number of training samples are available. We achieve this by integrating the CLIP space into the generator of StyleGAN using hypernetworks. These hypernetworks introduce dynamic adaptability, enabling the pre-trained StyleGAN to be effectively applied to specific domains described by either a reference image or a textual description. To further improve the alignment between the synthesized images and the target domain, we introduce a CLIP-guided discriminator, ensuring the generation of high-quality images. Notably, our approach shows remarkable flexibility and scalability, enabling text-guided image manipulation with text-free training and seamless style transfer between two images. Through extensive qualitative and quantitative experiments, we validate the robustness and effectiveness of our approach, surpassing existing methods in terms of performance.
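As a minimal sketch of the hypernetwork idea, the snippet below maps a CLIP embedding (of a target text or reference image) to a weight offset for a single generator layer; the layer shapes, offset scaling, and two-layer hypernetwork are illustrative assumptions, not the thesis's architecture.

import torch
import torch.nn as nn

class LayerHypernet(nn.Module):
    def __init__(self, clip_dim=512, out_ch=512, in_ch=512):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(clip_dim, 256), nn.ReLU(),
                                 nn.Linear(256, out_ch * in_ch))
        self.out_ch, self.in_ch = out_ch, in_ch

    def forward(self, clip_emb, base_weight, scale=0.1):
        # Predict a low-magnitude offset and add it to the frozen pretrained weight.
        delta = self.net(clip_emb).view(-1, self.out_ch, self.in_ch)
        return base_weight.unsqueeze(0) + scale * delta

hyper = LayerHypernet()
base_w = torch.randn(512, 512)       # a pretrained StyleGAN layer's weight (stand-in)
clip_emb = torch.randn(1, 512)       # CLIP embedding of a target domain, image, or text (stand-in)
adapted_w = hyper(clip_emb, base_w)  # per-sample weights used in the conditioned generator pass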


Full post...