Thesis Abstract:
Comics are a unique, multimodal medium that conveys stories and ideas through sequential imagery, often accompanied by text for dialogue and narration. Their elaborate visual language varies across authors, cultures, periods, technologies, and artistic styles. Consequently, the computational analysis of comic books requires addressing fundamental challenges in computer vision and natural language processing. In this thesis, I aim to enhance neural comic book understanding by exploiting the multimodal nature of comics and processing them with a character-centric approach. I chose to work on the massive, publicly accessible collection of Golden Age American comics, for which annotated data is limited. To achieve my goal, I therefore adopt a holistic approach composed of four main steps, ranging from curating datasets to proposing novel tasks and architectures for comics. In the first step, I extract high-quality text data from speech bubble and narrative box images using OCR models. In the second step, I decompose comic pages into their constituent components through detection, segmentation, and association tasks with a refined Multi-Task Learning (MTL) model. Detection identifies panels, speech bubbles, narrative boxes, character faces, and character bodies. Segmentation isolates speech bubbles and panels, while association links speech bubbles with character faces and bodies. In the third step, I use the paired character faces and bodies obtained in the second step to create character instances and, subsequently, to re-identify and track these instances across sequential panels. Together, these three steps make it possible to locate comic book panels, identify their components, and transform character identities into a dialogue-like structure.
In the final step of my thesis, I propose a multimodal framework by introducing the ComicBERT model, which exploits the above-mentioned structure. I use cloze-style tasks to evaluate ComicBERT's contextual understanding capabilities and propose a new task called Scene-Cloze. My approach achieves new state-of-the-art performance in the Text-Cloze and Visual-Cloze tasks, with accuracies of 69.5% and 77.1%, respectively, thus getting closer to the human baseline. Overall, the highlights of my contributions are as follows:
1. I curated and shared the COMICS Text+ Dataset, with over two million transcriptions of textboxes from Golden Age comics. In addition, I open-sourced the text detection and recognition models fine-tuned for this task, along with the datasets used to train them.
2. I refined an MTL framework for detection, segmentation, and association tasks and achieved state-of-the-art results in associating comic character faces and bodies with speech bubbles.
3. I proposed a novel framework, Identity-Aware Semi-Supervised Learning for Comic Character Re-Identification, that generates unified and identity-aligned comic character embeddings and identity representations. Furthermore, I created two new datasets: the Comic Character Instances Dataset, encompassing over a million character instances used in the self-supervision phase, and the Comic Sequence Identity Dataset, containing identity annotations within sets of four consecutive comic panels used in the semi-supervision phase.
4. I introduced the multimodal Comicsformer, a transformer-encoder architecture capable of processing sequential panels and their constituents. It serves as the backbone for the Masked Comic Modeling (MCM) task, a novel self-supervised pre-training strategy for comics, resulting in ComicBERT, a potential foundation model for Golden Age comics. ComicBERT achieves state-of-the-art performance in cloze-style tasks, particularly Text-Cloze and Visual-Cloze, approaching human-level comprehension.
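
As a rough illustration of how accuracy on a cloze-style task can be computed, the sketch below assumes that a model assigns one score to each candidate answer and that the index of the correct candidate is known for every example. The function name `cloze_accuracy` and the toy scores are hypothetical; they are not taken from ComicBERT's evaluation code or results.

```python
from typing import Sequence


def cloze_accuracy(scores: Sequence[Sequence[float]], gold: Sequence[int]) -> float:
    """Fraction of examples whose highest-scoring candidate is the correct one.

    scores[i][j] is the (hypothetical) model score for candidate j of example i.
    gold[i] is the index of the correct candidate for example i.
    """
    if len(scores) != len(gold):
        raise ValueError("scores and gold must have the same length")
    if not scores:
        raise ValueError("at least one example is required")

    correct = 0
    for candidate_scores, gold_index in zip(scores, gold):
        # Predict the candidate with the highest score (first one wins on ties).
        predicted = max(range(len(candidate_scores)), key=candidate_scores.__getitem__)
        correct += int(predicted == gold_index)
    return correct / len(scores)


# Toy example: four cloze questions with three candidates each (made-up scores).
toy_scores = [
    [0.1, 0.7, 0.2],
    [0.5, 0.3, 0.2],
    [0.2, 0.2, 0.6],
    [0.9, 0.05, 0.05],
]
toy_gold = [1, 0, 2, 1]
toy_accuracy = cloze_accuracy(toy_scores, toy_gold)
```

In this toy setting the first three predictions match the gold indices and the last does not, so three of the four questions are answered correctly.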