A technical deep-dive into the BM25 RAG pipeline, real-time SSE token streaming, and LangGraph / CrewAI agent orchestration powering this portfolio's AI features.
The chatbot uses a custom BM25-based Retrieval-Augmented Generation pipeline. Instead of
heavy vector embeddings, it uses lightweight keyword scoring (rank-bm25) over
structured JSON knowledge items, giving fast and accurate retrieval with zero GPU overhead.
Uses Okapi BM25 over tokenized knowledge items. No model download and no GPU are needed: scoring relies only on term frequency, inverse document frequency, and document-length normalization. Expanding the query improves recall for synonyms.
from rank_bm25 import BM25Okapi

bm25 = BM25Okapi(tokenized_corpus)
scores = bm25.get_scores(_tokenize(query))
top_idx = scores.argsort()[::-1][:8]
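To make the scoring concrete, here is a minimal pure-Python sketch of Okapi BM25. It is not the rank-bm25 implementation: the function name `bm25_scores`, the sample corpus, the `k1` and `b` values, and the smoothed IDF variant are all assumptions chosen for illustration.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, corpus, k1=1.5, b=0.75):
    """Score every document in `corpus` (a list of token lists) against the query."""
    n_docs = len(corpus)
    avg_len = sum(len(doc) for doc in corpus) / n_docs
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for doc in corpus for term in set(doc))

    scores = []
    for doc in corpus:
        term_freq = Counter(doc)
        # Length normalization: longer documents get a larger denominator.
        length_norm = k1 * (1 - b + b * len(doc) / avg_len)
        score = 0.0
        for term in query_tokens:
            tf = term_freq.get(term, 0)
            if tf == 0:
                continue
            df = doc_freq[term]
            # Smoothed IDF that never goes negative (an assumption; a library
            # such as rank-bm25 may treat IDF differently).
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            score += idf * tf * (k1 + 1) / (tf + length_norm)
        scores.append(score)
    return scores

# Hypothetical three-item corpus, already tokenized.
corpus = [
    "flask chatbot with bm25 retrieval".split(),
    "multi agent crew built with crewai".split(),
    "contact email and linkedin profile".split(),
]
scores = bm25_scores("bm25 retrieval".split(), corpus)
best = max(range(len(corpus)), key=scores.__getitem__)
```

Documents that share no term with the query score zero, which is why query expansion matters for recall.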
Before retrieval the query is expanded with domain synonyms. Contact queries expand with ['email','linkedin','github']; project queries expand with project titles and tag aliases.
# Tech alias expansion
tech_aliases = {
    'langchain': ['rag', 'chain', 'retrieval'],
    'crewai': ['multi-agent', 'crew'],
}
for key, aliases in tech_aliases.items():
    if key in q_lower:
        expansion += aliases
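The expansion step can be wrapped in a single function, sketched below. The name `expand_query`, the `CONTACT_TRIGGERS` tuple, and the de-duplication step are assumptions for illustration; only the alias values come from the snippet above.

```python
# Hypothetical alias tables for illustration.
TECH_ALIASES = {
    "langchain": ["rag", "chain", "retrieval"],
    "crewai": ["multi-agent", "crew"],
}
CONTACT_TERMS = ["email", "linkedin", "github"]
CONTACT_TRIGGERS = ("contact", "reach", "hire")  # assumed trigger words

def expand_query(query):
    """Return the query's tokens plus any alias terms it triggers, without duplicates."""
    q_lower = query.lower()
    tokens = q_lower.split()
    expansion = []
    for key, aliases in TECH_ALIASES.items():
        if key in q_lower:
            expansion += aliases
    if any(trigger in q_lower for trigger in CONTACT_TRIGGERS):
        expansion += CONTACT_TERMS
    # dict.fromkeys keeps first-seen order while dropping repeats.
    return list(dict.fromkeys(tokens + expansion))

expanded = expand_query("Tell me about CrewAI")
```

The expanded token list would then be passed to the BM25 scorer in place of the raw query tokens.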
A whitelist approach checks against PORTFOLIO_WORDS and CODING_WORDS lists. Short follow-ups (≤10 words) in ongoing conversations always pass through.
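def is_off_topic(query, has_history):
    q = query.lower()
    # On-topic if any portfolio or coding keyword appears in the query
    if any(w in q for w in PORTFOLIO_WORDS + CODING_WORDS):
        return False
    # Short follow-ups in an ongoing conversation always pass
    if has_history and len(q.split()) <= 10:
        return False
    return True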
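A short usage sketch shows how the gate behaves. The word lists here are hypothetical placeholders (the real lists are not shown), and the function checks both lists, as the description states.

```python
# Hypothetical word lists; the real PORTFOLIO_WORDS and CODING_WORDS are not shown.
PORTFOLIO_WORDS = ["project", "experience", "resume", "skills", "contact"]
CODING_WORDS = ["python", "flask", "bug", "api", "code"]

def is_off_topic(query, has_history):
    q = query.lower()
    if any(w in q for w in PORTFOLIO_WORDS + CODING_WORDS):
        return False
    if has_history and len(q.split()) <= 10:
        return False
    return True

cases = [
    ("What projects have you built?", False),  # keyword hit
    ("Best pizza in Naples?", False),          # no keyword, no history
    ("Best pizza in Naples?", True),           # short follow-up with history
]
results = [is_off_topic(query, has_history) for query, has_history in cases]
```

Because the check is substring-based, a keyword such as "api" also matches "rapid"; comparing against whole tokens would be stricter, at the cost of missing plurals and other inflected forms.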
Top-8 BM25 results are assembled as bullet points. The system prompt is built dynamically: base persona + projects section + timeline section + retrieved context + links.
system_prompt = (
SYSTEM_BASE
+ _build_projects_section(projects)
+ _build_timeline_section(timeline)
+ f"\n\nRetrieved:\n{context}"
+ links_note
)
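The context string that feeds the prompt could be assembled along these lines. This is a sketch under assumptions: the helper name `build_context`, the `{'text': ...}` item shape, and the decision to skip zero-score items are illustrative, not taken from the project.

```python
def build_context(items, scores, k=8):
    """Format the k highest-scoring knowledge items as a bullet list."""
    ranked = sorted(range(len(items)), key=lambda i: scores[i], reverse=True)
    # Skip zero-score items so unrelated text never reaches the prompt (an assumption).
    top = [i for i in ranked[:k] if scores[i] > 0]
    return "\n".join(f"- {items[i]['text']}" for i in top)

# Hypothetical knowledge items and scores.
items = [
    {"text": "Built a Flask chatbot"},
    {"text": "Uses BM25"},
    {"text": "Contact: email"},
]
context = build_context(items, [0.4, 1.2, 0.0], k=2)
system_prompt = "You are a portfolio assistant." + f"\n\nRetrieved:\n{context}"
```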
Chat responses stream token-by-token using HTTP Server-Sent Events. The browser sends one
POST request; Flask keeps the connection alive and pushes each LLM token as a
text/event-stream chunk via stream_with_context. Messages are
saved to SQLite only after the full response is complete.
Uses Response(stream_with_context(generate()), content_type='text/event-stream'). The inner generator yields JSON-encoded token chunks as SSE events.
def generate():
    stream = groq.create(..., stream=True)
    for chunk in stream:
        content = chunk.choices[0].delta.content
        if content:  # skip empty deltas so no null token is sent
            yield f"data: {json.dumps({'token': content})}\n\n"
    yield f"data: {json.dumps({'done': True})}\n\n"

return Response(
    stream_with_context(generate()),
    content_type='text/event-stream',
    headers={'Cache-Control': 'no-cache',
             'X-Accel-Buffering': 'no'})
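The event framing can be exercised without Flask or an LLM. The sketch below uses a plain list as a stand-in token source; `sse_event` and `stream_tokens` are hypothetical helper names.

```python
import json

def sse_event(payload):
    """Encode one payload as a Server-Sent Events message: a data line plus a blank line."""
    return f"data: {json.dumps(payload)}\n\n"

def stream_tokens(token_source):
    """Yield one SSE event per non-empty token, then a final done marker."""
    for token in token_source:
        if not token:  # providers may emit empty or None deltas
            continue
        yield sse_event({"token": token})
    yield sse_event({"done": True})

fake_llm = ["Hel", "lo", None, " world"]  # stand-in for a streaming completion
events = list(stream_tokens(fake_llm))
```

Each event ends with a blank line, which is how the receiver knows where one message stops and the next begins.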
The browser uses the Fetch API with a ReadableStream reader rather than EventSource, which allows POST requests with JSON bodies. Tokens are parsed and appended in real time.
const resp = await fetch('/api/chat/stream', {
  method: 'POST',
  body: JSON.stringify({message, model, session_id})
});
const reader = resp.body.getReader();
while (true) {
  const {done, value} = await reader.read();
  if (done) break;
  // parse SSE lines, extract .token, append
}
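The parsing step elided in the comment has one subtlety: a network chunk can end in the middle of an event. The following Python sketch of that buffering logic (the browser code would do the equivalent in JavaScript) uses a hypothetical `SSEParser` class and assumes single-line `data:` events like those the server emits.

```python
import json

class SSEParser:
    """Incrementally parse `data: {...}` events from arbitrarily split byte chunks."""

    def __init__(self):
        self.buffer = b""

    def feed(self, chunk):
        """Add raw bytes; return the JSON payloads of every event completed so far."""
        self.buffer += chunk
        payloads = []
        # A blank line ("\n\n") terminates each event.
        while b"\n\n" in self.buffer:
            raw_event, self.buffer = self.buffer.split(b"\n\n", 1)
            for line in raw_event.decode("utf-8").split("\n"):
                if line.startswith("data: "):
                    payloads.append(json.loads(line[len("data: "):]))
        return payloads

parser = SSEParser()
text = ""
finished = False
incoming = [
    b'data: {"tok',                                      # event split mid-JSON
    b'en": "Hi"}\n\ndata: {"token": " there"}\n',        # terminator split too
    b'\ndata: {"done": true}\n\n',
]
for chunk in incoming:
    for payload in parser.feed(chunk):
        if "token" in payload:
            text += payload["token"]
        if payload.get("done"):
            finished = True
```

Buffering raw bytes and decoding only complete events also avoids breaking a multi-byte character that straddles two chunks.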
Chunks are accumulated during streaming. After done, the full assembled text is saved to SQLite, token usage is logged (with estimated fallback if the API doesn't return usage), and the session ID is emitted.
# After streaming loop ends:
full_text = ''.join(chunks)
_add(session_id, 'user', user_msg)
_add(session_id, 'assistant', full_text)
log_usage(model, provider, tokens_in, tokens_out, latency_ms)
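A self-contained sketch of the save-after-stream step, using an in-memory SQLite database. The table schema, the `save_exchange` and `estimate_tokens` names, the `usage` dictionary keys, and the four-characters-per-token heuristic are all assumptions for illustration.

```python
import sqlite3

def estimate_tokens(text):
    """Rough fallback: about four characters per token (a heuristic, not a tokenizer)."""
    return max(1, len(text) // 4)

def save_exchange(conn, session_id, user_msg, chunks, usage=None):
    """Persist both messages once streaming is done; return (tokens_in, tokens_out)."""
    full_text = "".join(chunks)
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT INTO messages (session_id, role, content) VALUES (?, ?, ?)",
            [(session_id, "user", user_msg), (session_id, "assistant", full_text)],
        )
    if usage:  # the provider reported real counts
        return usage["prompt_tokens"], usage["completion_tokens"]
    return estimate_tokens(user_msg), estimate_tokens(full_text)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE messages (session_id TEXT, role TEXT, content TEXT)")
tokens_in, tokens_out = save_exchange(conn, "s1", "Hi there", ["Hel", "lo", " world"])
rows = conn.execute("SELECT role, content FROM messages ORDER BY rowid").fetchall()
```

Writing both rows in one transaction means a failed stream leaves no half-saved exchange behind.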
Even blocked responses stream word-by-word for consistent UX. Off-topic detection runs before the LLM call; the canned reply is split on spaces and yielded as token events.
if is_off_topic(message, has_history):
    def _off():
        for word in OFF_TOPIC_REPLY.split(' '):
            yield (
                "data: "
                + json.dumps({'token': word + ' '})
                + "\n\n"
            )
        yield f"data: {json.dumps({'done': True})}\n\n"
    return Response(stream_with_context(_off()), ...)
A real multi-step Research Agent runs live on this portfolio at /agent. It implements the ReAct pattern with a custom StateGraph (three typed nodes connected by a conditional loop edge) that mirrors the LangGraph architecture without depending on the library, which keeps deployment lightweight.
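A library-free state graph of this kind might look like the sketch below. It is an illustration only: the class `MiniStateGraph`, the node names `plan`, `search`, and `synthesize`, and the state fields are hypothetical and not taken from the portfolio's code.

```python
from typing import Callable, Dict, TypedDict

class AgentState(TypedDict):
    question: str
    notes: list
    steps: int
    answer: str

class MiniStateGraph:
    """Tiny stand-in for a state graph: named nodes plus routing functions."""
    END = "__end__"

    def __init__(self):
        self.nodes: Dict[str, Callable] = {}
        self.routers: Dict[str, Callable] = {}

    def add_node(self, name, fn):
        self.nodes[name] = fn

    def add_conditional_edge(self, source, router):
        # router(state) returns the next node's name, or END to stop.
        self.routers[source] = router

    def run(self, entry, state, max_steps=20):
        current = entry
        for _ in range(max_steps):  # hard cap guards against endless loops
            state = {**state, **self.nodes[current](state)}
            current = self.routers[current](state)
            if current == self.END:
                return state
        raise RuntimeError("step limit reached")

def plan(state: AgentState):
    return {"steps": state["steps"] + 1}

def search(state: AgentState):
    return {"notes": state["notes"] + [f"finding {state['steps']}"]}

def synthesize(state: AgentState):
    return {"answer": "; ".join(state["notes"])}

graph = MiniStateGraph()
graph.add_node("plan", plan)
graph.add_node("search", search)
graph.add_node("synthesize", synthesize)
graph.add_conditional_edge("plan", lambda s: "search")
# Loop edge: keep researching until two notes exist, then synthesize.
graph.add_conditional_edge(
    "search", lambda s: "plan" if len(s["notes"]) < 2 else "synthesize")
graph.add_conditional_edge("synthesize", lambda s: MiniStateGraph.END)

final = graph.run("plan", {"question": "q", "notes": [], "steps": 0, "answer": ""})
```

Each node returns only the keys it changes, and the runner merges them into the state, so nodes stay small and independent of one another.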
Sequential execution · Each agent sees prior agent output · Role + Goal + Backstory defined per agent
A TypedDict state flows through the nodes. Conditional edges route execution based on state values. Checkpointers enable pause-and-resume and human-in-the-loop workflows.
from langgraph.graph import StateGraph
from langgraph.prebuilt import ToolNode

graph = StateGraph(AgentState)
graph.add_node("research", research_node)
graph.add_node("tools", ToolNode(tools))
graph.add_conditional_edges(
    "router", route_fn,
    {"research": "research", "analysis": "analysis"})
Each node binds tools to the LLM with .bind_tools(). The model returns tool_calls in its message; the graph routes to a ToolNode, executes, and feeds the result back into state.
llm_with_tools = llm.bind_tools([
    web_search, code_executor, retriever,
])
# ToolNode handles execution automatically
tool_node = ToolNode(tools=[
    web_search, code_executor, retriever
])
Agents are defined with role, goal, and backstory. Tasks are assigned per-agent and executed sequentially. The crew orchestrator manages handoffs and shared context.
from crewai import Agent, Crew, Process

researcher = Agent(
    role="Research Analyst",
    goal="Find accurate information",
    backstory="Expert at gathering data",
    tools=[search_tool])

crew = Crew(
    agents=[researcher, analyst, writer],
    tasks=[task1, task2, task3],
    process=Process.sequential)
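The sequential handoff can be pictured in plain Python, with no CrewAI dependency. `SimpleAgent` and `run_sequential` are hypothetical names, and the lambdas stand in for LLM-backed agents; this sketches the idea and does not reflect how CrewAI is implemented.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class SimpleAgent:
    role: str
    work: Callable[[str, str], str]  # (task, prior_output) -> output

def run_sequential(agents, tasks):
    """Run tasks in order, handing each agent the previous agent's output."""
    context = ""
    transcript = []
    for agent, task in zip(agents, tasks):
        context = agent.work(task, context)
        transcript.append((agent.role, context))
    return transcript

agents = [
    SimpleAgent("researcher", lambda task, prior: f"facts about {task}"),
    SimpleAgent("analyst", lambda task, prior: f"analysis of [{prior}]"),
    SimpleAgent("writer", lambda task, prior: f"report from [{prior}]"),
]
transcript = run_sequential(agents, ["BM25", "analyze", "write"])
```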
All agents follow Reason → Act → Observe. The LLM reasons about the goal, picks a tool action, receives the observation, and loops until it decides the task is complete.
# ReAct cycle (conceptual)
while not done:
    thought = llm.think(state, tools)
    if thought.is_final_answer:
        break
    obs = thought.tool.invoke(thought.args)
    state.update(observation=obs)
    # loop: reason again with observation
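The conceptual loop above can be made runnable by swapping the model for a scripted stand-in. Everything here is hypothetical: the `Thought` dataclass, the `add` tool, the `scripted_llm` policy, and the turn limit are assumptions that only illustrate the Reason, Act, Observe cycle.

```python
from dataclasses import dataclass, field

@dataclass
class Thought:
    is_final_answer: bool
    text: str = ""
    tool: str = ""
    args: dict = field(default_factory=dict)

TOOLS = {"add": lambda a, b: a + b}  # hypothetical tool registry

def scripted_llm(state):
    """Stand-in for the model: call `add` once, then answer from the observation."""
    if not state["observations"]:
        return Thought(False, tool="add", args={"a": 2, "b": 3})
    return Thought(True, text=f"The sum is {state['observations'][-1]}")

def react(llm, tools, max_turns=5):
    state = {"observations": []}
    for _ in range(max_turns):  # bounded loop instead of `while not done`
        thought = llm(state)                       # Reason
        if thought.is_final_answer:
            return thought.text
        obs = tools[thought.tool](**thought.args)  # Act
        state["observations"].append(obs)          # Observe
    return "No answer within the turn limit"

answer = react(scripted_llm, TOOLS)
```

Bounding the number of turns is a common safeguard, since a real model may never decide that the task is complete.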
Architecture diagrams are auto-generated from each project's GitHub README. Add a GitHub URL to any project in the Admin Panel, then click the 🏗 Arch button to generate its diagram.
The Advance_Rag_Chatbot system is a production-ready, full-stack RAG chatbot built with Python, Flask, and ChromaDB. It uses a bi-encoder for retrieval, a cross-encoder for re-ranking, and a large language model for generation. The system has a sliding-window session memory for multi-turn coherence and supports evaluation metrics like faithfulness, relevancy, precision, and recall.
🤖 Auto-generated from GitHub README · 2026-03-19