Multimodal DocVQA

Vision-Language Model for Document Question Answering

Document Question Answering (DocVQA) requires a system to understand both the textual content and the visual layout of a document simultaneously. In this AI seminar project, I built a multimodal pipeline designed to reason over complex document structures using both text and vision.

The architecture fuses visual embeddings from ResNet and CLIP encoders with text extracted by OCR. The combined inputs are fed into a transformer-based Large Language Model (LLM), which aligns the two modalities and generates answers grounded in both the document’s layout and its content.
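As a minimal sketch of this fusion step (assuming illustrative feature dimensions — 2048-d ResNet patch features, a 512-d CLIP image vector, and 768-d OCR token embeddings; the class and parameter names here are hypothetical, not the project's actual code), each modality can be projected into the LLM's embedding space and concatenated into a single token stream:

```python
import torch
import torch.nn as nn

class MultimodalFusion(nn.Module):
    """Sketch: project ResNet/CLIP visual features and OCR text embeddings
    into a shared space, then concatenate them as one token sequence that a
    downstream transformer LLM can attend over. Dimensions are illustrative."""

    def __init__(self, vis_dim=2048, clip_dim=512, txt_dim=768, llm_dim=1024):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, llm_dim)   # ResNet patch features -> LLM space
        self.clip_proj = nn.Linear(clip_dim, llm_dim) # global CLIP vector -> LLM space
        self.txt_proj = nn.Linear(txt_dim, llm_dim)   # OCR token embeddings -> LLM space

    def forward(self, resnet_feats, clip_feats, ocr_embeds):
        # resnet_feats: (B, N_patches, vis_dim); clip_feats: (B, 1, clip_dim);
        # ocr_embeds: (B, N_tokens, txt_dim)
        v = self.vis_proj(resnet_feats)
        c = self.clip_proj(clip_feats)
        t = self.txt_proj(ocr_embeds)
        # Concatenate along the sequence axis -> one multimodal token stream
        return torch.cat([c, v, t], dim=1)

fusion = MultimodalFusion()
out = fusion(torch.randn(2, 49, 2048), torch.randn(2, 1, 512), torch.randn(2, 32, 768))
print(out.shape)  # torch.Size([2, 82, 1024])
```

The resulting sequence (1 CLIP token + 49 patch tokens + 32 OCR tokens per document) can then be consumed by the transformer like any other embedded input.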

<!--
  See https://www.debugbear.com/blog/responsive-images#w-descriptors-and-the-sizes-attribute and
  https://developer.mozilla.org/en-US/docs/Learn/HTML/Multimedia_and_embedding/Responsive_images for info on defining 'sizes' for responsive images
-->

<figure>
  <picture>
    <source
      class="responsive-img-srcset"
      srcset="/assets/img/docvqa_architecture-480.webp 480w, /assets/img/docvqa_architecture-800.webp 800w, /assets/img/docvqa_architecture-1400.webp 1400w"
      sizes="95vw"
      type="image/webp"
    >
    <img
      src="/assets/img/docvqa_architecture.jpg"
      class="img-fluid rounded z-depth-1"
      width="100%"
      height="auto"
      title="Model Architecture"
      loading="eager"
      onerror="this.onerror=null; $('.responsive-img-srcset').remove();"
    >
  </picture>
  <figcaption>
    An overview of the multimodal pipeline, showcasing the integration of OCR text extraction with CLIP/ResNet visual embeddings.
  </figcaption>
</figure>

The training workflow was implemented in PyTorch, with a focus on optimizing the alignment between visual and textual representations to improve the model’s spatial reasoning over document layouts.
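One common way to optimize such visual-textual alignment is a symmetric contrastive (InfoNCE) objective, as popularized by CLIP. The sketch below is a generic version of that idea, not necessarily the exact loss used in this project; the `temperature` value is an illustrative default:

```python
import torch
import torch.nn.functional as F

def alignment_loss(vis_emb: torch.Tensor, txt_emb: torch.Tensor,
                   temperature: float = 0.07) -> torch.Tensor:
    """Symmetric contrastive loss aligning visual and textual embeddings.

    vis_emb, txt_emb: (B, D) batches where row i of each tensor comes from
    the same document, so matching pairs lie on the diagonal of the
    similarity matrix.
    """
    vis = F.normalize(vis_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = vis @ txt.t() / temperature          # (B, B) cosine similarities
    targets = torch.arange(vis.size(0))           # diagonal = positive pairs
    # Average the image->text and text->image cross-entropy terms
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

loss = alignment_loss(torch.randn(8, 1024), torch.randn(8, 1024))
print(loss.item())  # a non-negative scalar
```

Minimizing this loss pulls each document's visual embedding toward its own OCR-text embedding while pushing it away from the other documents in the batch, which is what "optimizing the alignment" amounts to in practice.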