👁️

VisionRAG

GPT-4o Vision · pgvector · PostgreSQL · FastAPI · Streamlit · Docker · pdf2image · PyMuPDF

GitHub ↗

Overview

A multimodal document intelligence system that processes scanned PDFs, invoices, and screenshots using GPT-4o Vision: it extracts structured data from visual documents and lets you query them in natural language with source citations. What makes it non-trivial: most RAG systems only handle text. VisionRAG uses GPT-4o Vision to first understand the visual layout of each page, then stores the resulting page embeddings in PostgreSQL with pgvector for similarity search. Every answer includes document-level and page-level citations, so you can trace exactly which page of which document the answer came from. Fully containerised via Docker Compose: one command spins up both the app and the pgvector database.
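To make the retrieval-with-citations idea concrete, here is a minimal sketch, not the project's actual code. It stands in for the database with an in-memory list and ranks pages by cosine distance, the same metric pgvector exposes through its `<=>` operator. The names `PageChunk`, `search`, and `format_citation`, the three-dimensional embeddings, and the sample documents are all hypothetical and chosen only for illustration.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class PageChunk:
    """One embedded page. Field names are assumptions, not the real schema."""
    document: str
    page: int
    text: str
    embedding: tuple


def cosine_distance(a, b):
    """1 - cosine similarity, the metric behind pgvector's `<=>` operator."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def search(query_embedding, chunks, k=2):
    """Return the k chunks closest to the query, nearest first."""
    ranked = sorted(
        chunks, key=lambda c: cosine_distance(query_embedding, c.embedding)
    )
    return ranked[:k]


def format_citation(chunk):
    """Build a document-level and page-level citation for one hit."""
    return f"{chunk.document}, p. {chunk.page}"


# Made-up sample data; real embeddings would have far more dimensions.
CHUNKS = [
    PageChunk("invoice_0042.pdf", 1, "Total due: 1,250.00 EUR", (0.9, 0.1, 0.0)),
    PageChunk("invoice_0042.pdf", 2, "Payment terms: 30 days", (0.1, 0.9, 0.0)),
    PageChunk("contract.pdf", 7, "Termination clause", (0.0, 0.1, 0.9)),
]

hits = search((1.0, 0.0, 0.0), CHUNKS, k=2)
citations = [format_citation(c) for c in hits]
print(citations)
```

Because each stored row carries its document name and page number alongside the embedding, every retrieved hit can be turned into a citation without a second lookup. In a pgvector-backed version, the sort would typically be replaced by an `ORDER BY embedding <=> :query LIMIT :k` clause.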
