← Back to Portfolio

How It All Works

A technical deep-dive into the BM25 RAG pipeline, real-time SSE token streaming, and LangGraph / CrewAI agent orchestration powering this portfolio's AI features.

RAG Pipeline

The chatbot uses a custom BM25-based Retrieval-Augmented Generation pipeline. Instead of heavy vector embeddings, it uses lightweight keyword scoring (rank-bm25) over structured JSON knowledge items, giving fast and accurate retrieval with zero GPU overhead.

💬 User Query
🔍 Off-Topic Detection
if off-topic
🚫 Blocked Reply (streamed)
✓ allowed
⚙️ Query Expansion
synonyms · tech aliases · tag names
Knowledge Sources
📁 projects.json 🕐 timeline.json 📝 notes.json 📚 KNOWLEDGE[ ]
📊 BM25 Retrieval rank-bm25 · top_k = 8
📎 Context Assembly + Relevant Links
📋 System Prompt + Conversation History + Retrieved Context
🚀 Groq LPU Llama 3.3-70B / 3.1-8B
or
🧠 OpenAI GPT-4o / GPT-4o-mini
✨ Streaming Response → Browser
📊 BM25 Retrieval

Uses Okapi BM25 over tokenized knowledge items. No model download, no GPU — pure term-frequency scoring. Expanded query improves recall for synonyms.

from rank_bm25 import BM25Okapi

bm25 = BM25Okapi(tokenized_corpus)
scores = bm25.get_scores(_tokenize(query))
top_idx = scores.argsort()[::-1][:8]
⚙️ Query Expansion

Before retrieval the query is expanded with domain synonyms. Contact queries expand with ['email','linkedin','github']; project queries expand with project titles and tag aliases.

# Tech alias expansion
tech_aliases = {
  'langchain': ['rag', 'chain', 'retrieval'],
  'crewai':    ['multi-agent', 'crew'],
}
for key, aliases in tech_aliases.items():
    if key in q_lower: expansion += aliases
🔍 Off-Topic Detection

A whitelist approach checks against PORTFOLIO_WORDS and CODING_WORDS lists. Short follow-ups (≤10 words) in ongoing conversations always pass through.

def is_off_topic(query, has_history):
    q = query.lower()
    if any(w in q for w in PORTFOLIO_WORDS):
        return False
    if has_history and len(q.split()) <= 10:
        return False
    return True
📋 Context Assembly

Top-8 BM25 results are assembled as bullet points. The system prompt is built dynamically: base persona + projects section + timeline section + retrieved context + links.

system_prompt = (
    SYSTEM_BASE
    + _build_projects_section(projects)
    + _build_timeline_section(timeline)
    + f"\n\nRetrieved:\n{context}"
    + links_note
)

Streaming SSE

Chat responses stream token-by-token using HTTP Server-Sent Events. The browser sends one POST request; Flask keeps the connection alive and pushes each LLM token as a text/event-stream chunk via stream_with_context. Messages are saved to SQLite only after the full response is complete.

🖥 Browser
POST /api/chat/stream
EventSource connected
Append token → DOM
Append token → DOM
Append token → DOM
done → stop animation
Store session_id
POST {message, model, session_id}
200 text/event-stream
data: {token: "Hello"}
data: {token: " world"}
data: {token: "!"}
data: {done: true, session_id}
⚙️ Flask Server
Rate limit check
Load history from SQLite
build_rag_messages()
LLM stream=True open
yield {token: chunk}
yield {token: chunk}
yield {token: chunk}
Save messages to SQLite
log_usage() · yield done
🌊 Flask SSE Route

Uses Response(stream_with_context(generate()), content_type='text/event-stream'). The inner generator yields JSON-encoded token chunks as SSE events.

def generate():
    stream = groq.create(..., stream=True)
    for chunk in stream:
        content = chunk.choices[0].delta.content
        yield f"data: {json.dumps({'token':content})}\n\n"
    yield f"data: {json.dumps({'done':True})}\n\n"

return Response(
    stream_with_context(generate()),
    content_type='text/event-stream',
    headers={'Cache-Control': 'no-cache',
             'X-Accel-Buffering': 'no'})
📡 Client-Side Fetch Stream

The browser uses the Fetch API with a ReadableStream reader rather than EventSource, allowing POST requests with JSON bodies. Tokens are parsed and appended in real-time.

const resp = await fetch('/api/chat/stream', {
  method: 'POST',
  body: JSON.stringify({message, model, session_id})
});
const reader = resp.body.getReader();
while (true) {
  const {done, value} = await reader.read();
  if (done) break;
  // parse SSE lines, extract .token, append
}
💾 Post-Stream Persistence

Chunks are accumulated during streaming. After done, the full assembled text is saved to SQLite, token usage is logged (with estimated fallback if the API doesn't return usage), and the session ID is emitted.

# After streaming loop ends:
full_text = ''.join(chunks)
_add(session_id, 'user',      user_msg)
_add(session_id, 'assistant', full_text)
log_usage(model, provider,
           tokens_in, tokens_out, latency_ms)
⚠️ Off-Topic Streaming

Even blocked responses stream word-by-word for consistent UX. Off-topic detection runs before the LLM call; the canned reply is split on spaces and yielded as token events.

if is_off_topic(message, has_history):
    def _off():
        for word in OFF_TOPIC_REPLY.split(' '):
            yield (
                f"data: "
                + json.dumps({'token': word+' '})
                + "\n\n"
            )
        yield f"data: {json.dumps({'done':True})}\n\n"
    return Response(stream_with_context(_off()), ...)

Agent Orchestration

A real multi-step Research Agent runs live on this portfolio at /agent ↗. It implements the ReAct pattern with a custom StateGraph — three typed nodes connected by a conditional loop edge — mirroring the LangGraph architecture exactly, with no library overhead for lightweight deployment.

LangGraph · StateGraph Pattern

📝 Task Input / User Prompt
🔷 StateGraph — AgentState (TypedDict)
⚡ Conditional Router / Edge Function
🔍 Research Node
Web Search / RAG
state.research ✓
🧮 Analysis Node
Code Executor
state.analysis ✓
✍️ Writer Node
Formatter / Template
state.draft ✓
🔗 State Aggregator + Checkpointer
✅ Final Output / END Node

CrewAI · Sequential Multi-Agent Pattern

👔 Researcher
gather information
📊 Analyst
process & synthesise
✍️ Writer
produce output
📄 Structured Report
final deliverable

Sequential execution · Each agent sees prior agent output · Role + Goal + Backstory defined per agent

🔷 LangGraph StateGraph

A typed TypedDict state flows through nodes. Conditional edges route execution based on state values. Checkpointers enable pause-and-resume and human-in-the-loop.

from langgraph.graph import StateGraph

graph = StateGraph(AgentState)
graph.add_node("research", research_node)
graph.add_node("tools",    ToolNode(tools))
graph.add_conditional_edges(
    "router", route_fn,
    {"research": "research",
     "analysis": "analysis"})
🔧 Tool Calling

Each node binds tools to the LLM with .bind_tools(). The model returns tool_calls in its message; the graph routes to a ToolNode, executes, and feeds the result back into state.

llm_with_tools = llm.bind_tools([
    web_search,
    code_executor,
    retriever,
])

# ToolNode handles execution automatically
tool_node = ToolNode(tools=[
    web_search, code_executor, retriever
])
🤝 CrewAI Agents

Agents are defined with role, goal, and backstory. Tasks are assigned per-agent and executed sequentially. The crew orchestrator manages handoffs and shared context.

researcher = Agent(
  role="Research Analyst",
  goal="Find accurate information",
  backstory="Expert at gathering data",
  tools=[search_tool])

crew = Crew(
  agents=[researcher, analyst, writer],
  tasks=[task1, task2, task3],
  process=Process.sequential)
🔄 ReAct Loop

All agents follow Reason → Act → Observe. The LLM reasons about the goal, picks a tool action, receives the observation, and loops until it decides the task is complete.

# ReAct cycle (conceptual)
while not done:
    thought = llm.think(state, tools)
    if thought.is_final_answer:
        break
    obs = thought.tool.invoke(thought.args)
    state.update(observation=obs)
    # loop: reason again with observation
This agent runs live on this portfolio
Try it — ask about projects, skills, or any AI/ML concept. Watch the Plan → Research → Synthesize steps execute in real-time.
▶ Try the Agent →

My Projects

Architecture diagrams auto-generated from each project's GitHub README. Add a GitHub URL to any project in the Admin Panel, then click the 🏗 Arch button to generate its diagram.

Project Architecture

🔗 Advanced RAG Chatbot

The Advance_Rag_Chatbot system is a production-ready, full-stack RAG chatbot built with Python, Flask, and ChromaDB. It uses a bi-encoder for retrieval, a cross-encoder for re-ranking, and a large language model for generation. The system has a sliding-window session memory for multi-turn coherence and supports evaluation metrics like faithfulness, relevancy, precision, and recall.

GitHub ↗
RAG Cross-encoder re-ranking Sentence-transformers ChromaDB Vector search Flask Session management RAGAS evaluation Faithfulness scoring OpenAI · Ollama
OpenAI GPT or Ollama
LLM
ChromaDB with cosine similarity
Retrieval
Streaming SSE
Response
ChromaDB
Storage
RAGAS-inspired metrics
Evaluation

Data / Request Flow

1
💬 User Query input
The user sends a query to the chatbot
2
🔍 Bi-Encoder Retrieval process
The bi-encoder retrieves relevant documents from the ChromaDB
3
⚙️ Cross-Encoder Re-Ranking process
The cross-encoder re-ranks the retrieved documents for precision
4
🧠 LLM Generation process
The large language model generates an answer based on the re-ranked context
5
Response Output output
The chatbot returns the generated answer to the user
6
💾 Session Memory storage
The chatbot stores the conversation history in a sliding-window session memory
7
📊 Evaluation Metrics process
The chatbot calculates evaluation metrics like faithfulness, relevancy, precision, and recall

Tech Stack Layers

Tech Stack by Layer
Interface
Flask API HTML/CSS/JS
Orchestration
Bi-Encoder Cross-Encoder LLM
Storage & Cache
ChromaDB Session Memory
External APIs
OpenAI Ollama

🤖 Auto-generated from GitHub README · 2026-03-19 · Regenerate ↗