Most multi-agent tutorials show agents cooperating: one researches, one writes, one reviews. The outputs are additive. Everyone agrees by the end.
That works fine for content generation. It fails badly for decisions.
When I built the finance-ai-agent, an autonomous investment committee that analyses stocks, I needed agents that would genuinely challenge each other, not just summarise the same data from different angles. The result was an adversarial architecture: sequential conflict with forced cross-examination, orchestrated by a LangGraph state machine with a hard human-in-the-loop gate before any report is produced.
Here is how it works, and why it surfaces insights that cooperative agents routinely miss.
The naive multi-agent approach for investment analysis runs a bull agent and a bear agent independently on the same data, then hands both outputs to a judge.
This is parallel summarisation with a merge step. It sounds adversarial but it is not. Each agent reasons from the same raw data in isolation. Agent A never has to defend its bull case against Agent B's strongest objection. Agent B never has to confront the specific evidence that undermines the bear thesis.
The result: both sides present their best case on paper, the judge picks the more convincing one, and the gap in the losing argument is never examined. In production, those unexamined gaps are exactly where the bad calls live.
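The parallel pattern described above can be sketched as follows. This is a minimal illustration, not code from the project: `bull_agent`, `bear_agent`, `merge`, and the input fields are hypothetical stand-ins for LLM calls and real market data.

```python
from concurrent.futures import ThreadPoolExecutor


def bull_agent(data: dict) -> str:
    # Hypothetical stand-in for an LLM call; it sees only the raw data.
    return f"BUY {data['ticker']}: revenue growth {data['revenue_growth']:.0%}"


def bear_agent(data: dict) -> str:
    # Also sees only the raw data, never the bull case.
    return f"SELL {data['ticker']}: P/E of {data['pe_ratio']} looks stretched"


def merge(bull_case: str, bear_case: str) -> dict:
    # The judge receives two finished memos that never addressed each other.
    return {"bull": bull_case, "bear": bear_case}


def run_parallel(data: dict) -> dict:
    # Both agents run concurrently, so neither can respond to the other.
    with ThreadPoolExecutor(max_workers=2) as pool:
        bull_future = pool.submit(bull_agent, data)
        bear_future = pool.submit(bear_agent, data)
        return merge(bull_future.result(), bear_future.result())
```

Neither function takes the other's output as an argument, which is the structural reason the two memos can talk past each other.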
The architecture runs sequentially, not in parallel. The order matters: Planner, Bull Analyst, Bear Analyst, Bull Rebuttal, Risk Auditor, CIO Judge, then the human approval gate.
The key constraint is the Bear Analyst's instruction: attack the Bull's strongest claim, not the weakest one. Cherry-picking weak arguments is easy. Dismantling the core thesis is where the model has to actually think.
In the parallel version, the Bear writes independently. It attacks whatever the data suggests is most vulnerable — which is often a side point, not the load-bearing assumption of the bull case.
In the sequential version, the Bear has read the Bull's actual argument. It knows which claim is central. It cannot ignore it. The forced cross-examination means the debate converges on the thing that actually matters for the investment decision.
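One way to enforce that obligation is to build the Bear's prompt from the Bull's finished thesis. The sketch below assumes the prompt is assembled as a plain string; `build_bear_prompt` and its wording are hypothetical, not the project's actual prompt.

```python
def build_bear_prompt(fact_base: str, bull_thesis: str) -> str:
    # Hypothetical prompt builder: the Bear cannot be prompted without the
    # Bull's thesis, so it always argues against the actual bull case.
    return (
        "You are the Bear Analyst on an investment committee.\n\n"
        f"Fact base:\n{fact_base}\n\n"
        f"Bull thesis (read it in full before answering):\n{bull_thesis}\n\n"
        "Identify the single claim the bull case depends on most, quote it, "
        "and attack that claim. Do not spend your answer on side points."
    )
```

Because `bull_thesis` is a required argument, the sequential ordering is encoded in the function signature rather than left to convention.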
When I backtested the verdicts against real 30-day price outcomes, the sequential adversarial model consistently surfaced valuation and liquidity risks that the parallel model summarised away. The difference was not in the raw data, since both versions saw the same numbers. It was in the obligation to respond.
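For readers who want to run a similar check, here is one way a verdict could be scored against a 30-day price move. Everything in it is an assumption for illustration: the BUY/SELL/HOLD labels, the `hold_band` threshold, and the function name are not taken from the project.

```python
def verdict_was_correct(
    verdict: str,
    price_at_verdict: float,
    price_after_30d: float,
    hold_band: float = 0.02,  # assumed: moves within +/-2% count as flat
) -> bool:
    # Fractional price change over the 30-day window.
    change = (price_after_30d - price_at_verdict) / price_at_verdict
    if verdict == "BUY":
        return change > hold_band
    if verdict == "SELL":
        return change < -hold_band
    if verdict == "HOLD":
        return abs(change) <= hold_band
    raise ValueError(f"Unknown verdict: {verdict!r}")
```

A hit rate like this only measures direction. It says nothing about whether the stated risks were the ones that materialised, which still needs a manual read of each verdict.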
LangGraph was the right tool here because the workflow is a conditional directed graph, not a chain. Each node is an agent. The edges carry the accumulated state — fact base, thesis, objection, rebuttal, risk scorecard — forward to the next node.
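from langgraph.graph import StateGraph, END

# Each node is one agent; AnalysisState is the shared state passed between them.
graph = StateGraph(AnalysisState)
graph.add_node("planner", planner_agent)
graph.add_node("bull", bull_analyst)
graph.add_node("bear", bear_analyst)
graph.add_node("rebuttal", bull_rebuttal)
graph.add_node("risk", risk_auditor)
graph.add_node("cio", cio_judge)
graph.add_node("human_gate", human_approval)
# The conditional edge below routes to "pdf_report", so that node must be registered.
# generate_pdf_report is an assumed name; the original snippet does not show it.
graph.add_node("pdf_report", generate_pdf_report)

# Fixed debate order: plan, bull thesis, bear objection, rebuttal, risk audit, verdict.
graph.set_entry_point("planner")
graph.add_edge("planner", "bull")
graph.add_edge("bull", "bear")
graph.add_edge("bear", "rebuttal")
graph.add_edge("rebuttal", "risk")
graph.add_edge("risk", "cio")
graph.add_edge("cio", "human_gate")

# route_approval returns "approved" or "rejected"; only approval reaches the report.
graph.add_conditional_edges(
    "human_gate",
    route_approval,
    {"approved": "pdf_report", "rejected": END},
)
graph.add_edge("pdf_report", END)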
The AnalysisState TypedDict accumulates every agent output. Each agent receives the full state and appends its contribution — no agent operates from a blank slate.
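A minimal sketch of what such a state and one node could look like. The field names are assumptions based on the outputs listed earlier (fact base, thesis, objection, rebuttal, risk scorecard), and the node body is a placeholder for the real LLM call.

```python
from typing import TypedDict


class AnalysisState(TypedDict, total=False):
    # Assumed field names; total=False lets keys be filled in as the run progresses.
    ticker: str
    fact_base: str
    bull_thesis: str
    bear_objection: str
    rebuttal: str
    risk_scorecard: dict
    verdict: str


def bear_analyst(state: AnalysisState) -> dict:
    # Reads what earlier agents wrote and returns only its own key.
    # LangGraph merges the returned partial update into the shared state.
    objection = f"Objection to: {state['bull_thesis']}"  # placeholder for an LLM call
    return {"bear_objection": objection}
```

Outside LangGraph the merge can be simulated with `state = {**state, **bear_analyst(state)}`, which is handy for unit-testing a single agent in isolation.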
After the CIO Judge delivers the verdict, the graph pauses at human_gate. No PDF report is generated without explicit user approval.
This is not just UX polish. It is an architectural commitment: autonomous agents should inform decisions, not make them unilaterally. The PDF report is the artefact that gets shared — potentially acted upon. Requiring human sign-off before it exists creates a clear accountability boundary between the AI recommendation and the human decision.
In the Streamlit interface, the gate surfaces as a dialogue: approve the verdict, reject it, or request a new analysis. The graph branches accordingly.
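The routing function behind that branch can be very small. The sketch below assumes the interface writes the user's choice into a `human_decision` key on the state; that key name and its values are hypothetical. Only the return labels "approved" and "rejected" come from the graph definition.

```python
def route_approval(state: dict) -> str:
    # Fail closed: anything other than an explicit approval ends the run
    # without a report, including a missing or unrecognised decision.
    decision = state.get("human_decision")  # assumed key set by the UI
    return "approved" if decision == "approve" else "rejected"
```

In this sketch a request for a new analysis is also routed to "rejected" at the graph level, on the assumption that the interface then starts a fresh run rather than looping inside the same graph.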
The system runs on Groq Llama 3.3 by default — fast and cheap for the debate steps. If Groq rate-limits or errors under load, the orchestrator automatically retries the same prompt against GPT-4o.
The fallback is transparent to the user. The CIO verdict tags which model produced it, so the output is always traceable — important when scoring verdicts against real market outcomes.
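The fallback-with-tagging idea can be sketched independently of any provider SDK. Here the two models are passed in as plain callables, and the model labels are illustrative strings; none of these names come from the project's code.

```python
from typing import Callable


def call_with_fallback(
    prompt: str,
    primary: Callable[[str], str],
    fallback: Callable[[str], str],
    primary_name: str = "groq-llama-3.3",  # illustrative label
    fallback_name: str = "gpt-4o",  # illustrative label
) -> dict:
    # Try the fast model first. A real implementation should catch the
    # client's specific rate-limit and API errors, not bare Exception.
    try:
        return {"text": primary(prompt), "model": primary_name}
    except Exception:
        # Same prompt, second provider; the tag records which one answered.
        return {"text": fallback(prompt), "model": fallback_name}
```

Returning the model name alongside the text is what keeps each verdict traceable when it is later compared against market outcomes.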
The adversarial sequential pattern is overkill for most tasks. Use it when:
For content generation, data extraction, or summarisation — use cooperative agents. Adversarial debate is slow, token-expensive, and unnecessary when there is no genuine tension in the problem.
If I rebuilt this today, I would add a fourth debate turn: the Bear gets a final response to the Bull's rebuttal. One rebuttal round surfaces the core tension; a second round reveals whether the Bull's defence holds under pressure or just restates the original thesis with more confidence.
I would also score the debate quality separately from the verdict, using a judge that rates how well each side engaged with the opposition's argument, not just which thesis was more convincing on its own terms.
Stack: LangGraph, Groq Llama 3.3, OpenAI GPT-4o (fallback), yfinance, SEC EDGAR, Tavily, Streamlit, ReportLab. Source: github.com/iampriyabrat14/finance-ai-agent