A multimodal document intelligence system that processes scanned PDFs, invoices, and screenshots using GPT-4o Vision: it extracts structured data from visual documents and lets you query them in natural language with source citations. What makes it non-trivial: most RAG systems only handle text. VisionRAG uses GPT-4o Vision to first understand the visual layout of each page, then stores the resulting embeddings in PostgreSQL with pgvector for similarity search. Every answer includes document-level and page-level citations, so you can trace exactly which page of which document the answer came from. Fully containerised via Docker Compose: one command spins up both the app and the pgvector database.
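The retrieval-with-citations idea can be illustrated with a minimal in-memory sketch. This is not VisionRAG's actual code: the `PageChunk` fields, the `search` function, and the three-dimensional toy embeddings are all hypothetical, and a plain Python sort stands in for the pgvector similarity query. The point is only that if each stored embedding keeps its document name and page number, every hit can be cited down to the page.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class PageChunk:
    """One embedded page. All field names are hypothetical."""
    document: str
    page: int
    text: str
    embedding: tuple[float, ...]


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(chunks, query_embedding, k=2):
    """Return the k most similar pages, each carrying its citation.

    In a pgvector-backed store this ranking would be done in SQL, for
    example by ordering on the cosine-distance operator `<=>`; here a
    plain sort over an in-memory list stands in for that query.
    """
    scored = [(cosine_similarity(c.embedding, query_embedding), c) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [
        {"document": c.document, "page": c.page, "text": c.text, "score": score}
        for score, c in scored[:k]
    ]


# Made-up sample pages with toy 3-dimensional embeddings (illustration only).
SAMPLE_PAGES = [
    PageChunk("invoice_0042.pdf", 1, "Total due: 1,250.00 EUR", (0.9, 0.1, 0.0)),
    PageChunk("invoice_0042.pdf", 2, "Payment terms: 30 days", (0.1, 0.9, 0.0)),
    PageChunk("dashboard.png", 1, "Monthly active users chart", (0.0, 0.1, 0.9)),
]

if __name__ == "__main__":
    for hit in search(SAMPLE_PAGES, (0.8, 0.2, 0.0)):
        print(f'{hit["document"]} (page {hit["page"]}): {hit["text"]}')
```

Carrying `document` and `page` alongside each embedding is what makes page-level citations possible: the ranking step never needs to look anything up afterwards to say where an answer came from.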