Multimodal DocVQA
Vision-Language Model for Document Question Answering
Document Question Answering (DocVQA) requires a system to understand a document’s textual content and its visual layout simultaneously. In this AI seminar project, I built a multimodal pipeline that reasons over complex document structures using both textual and visual signals.
The architecture fuses visual embeddings extracted with ResNet and CLIP with text recovered by OCR. The combined inputs are fed into a transformer-based Large Language Model (LLM) to align the two modalities and generate answers grounded in the document’s layout and content.
![Model architecture](/assets/img/docvqa_architecture.jpg "Model Architecture")
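To make the fusion step concrete, here is a minimal PyTorch sketch of the idea described above: linear adapters project the ResNet/CLIP visual features and the OCR token embeddings into a shared width, and a transformer processes the concatenated sequence. All module names and dimensions here are illustrative assumptions, not the project’s actual code.

```python
import torch
import torch.nn as nn

class FusionBackbone(nn.Module):
    """Sketch of the fusion stage: adapt each modality to a shared
    width, concatenate along the sequence axis, run a transformer."""

    def __init__(self, vis_dim=512, txt_dim=768, hidden_dim=768):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, hidden_dim)  # ResNet/CLIP features
        self.txt_proj = nn.Linear(txt_dim, hidden_dim)  # OCR token embeddings
        layer = nn.TransformerEncoderLayer(
            d_model=hidden_dim, nhead=8, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)

    def forward(self, vis_feats, ocr_embeds):
        # vis_feats:  (B, N_regions, vis_dim)
        # ocr_embeds: (B, N_tokens,  txt_dim)
        fused = torch.cat(
            [self.vis_proj(vis_feats), self.txt_proj(ocr_embeds)], dim=1
        )
        return self.encoder(fused)  # (B, N_regions + N_tokens, hidden_dim)
```

A decoder-style LLM head would then attend over this fused sequence to generate the answer tokens.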
The training workflow is implemented in PyTorch and focuses on tightening the alignment between the visual and textual representations, which improves the model’s spatial reasoning over document layouts.
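One common way to realize such an alignment objective is a symmetric InfoNCE-style contrastive loss that pulls matched visual and textual representations together across a batch; the sketch below assumes that formulation, and all names in it are placeholders rather than the project’s actual code.

```python
import torch
import torch.nn.functional as F

def alignment_loss(vis_pooled, txt_pooled, temperature=0.07):
    """Symmetric InfoNCE loss over a batch: matched visual/text pairs
    sit on the diagonal of the similarity matrix."""
    v = F.normalize(vis_pooled, dim=-1)
    t = F.normalize(txt_pooled, dim=-1)
    logits = v @ t.t() / temperature                # (B, B) cosine similarities
    targets = torch.arange(v.size(0), device=v.device)
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

# Hypothetical training step using the FusionBackbone sketch above:
# fused = model(vis_feats, ocr_embeds)
# n_vis = vis_feats.size(1)
# loss = alignment_loss(fused[:, :n_vis].mean(dim=1),   # pooled visual tokens
#                       fused[:, n_vis:].mean(dim=1))   # pooled text tokens
# loss.backward(); optimizer.step(); optimizer.zero_grad()
```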