DeputyDev

AI-powered developer assistant

AUGUST 13, 2025

Breaking the Code Review Bottleneck: How We Built DeputyDev, Our AI-Powered Code Review Assistant

Posted by the DeputyDev Team

7 minute read

We're excited to share our research on DeputyDev, an AI-powered code review assistant that's transforming how we approach code reviews at TATA 1mg. Our paper, now published on arXiv, details how we reduced code review times by up to 47% while maintaining code quality.

Read the full paper on arXiv: Breaking the Code Review Bottleneck: How We Built DeputyDev, Our AI-Powered Code Review Assistant

The Problem: Code Reviews Were Killing Our Velocity

Like many engineering organizations, we faced a serious code review bottleneck. Our telemetry data revealed some eye-opening statistics:

But the real cost wasn't just time—it was the constant context switching. Research from UC Irvine shows that interruptions cause an average of 23 minutes of lost focus. For our developers, waiting days for feedback meant repeatedly losing and rebuilding context, impacting both productivity and wellbeing.


Our Hypothesis: AI as the First Line of Defense

We proposed a two-stage code review process where DeputyDev acts as an AI first-reviewer before human reviewers step in. The idea was simple but powerful:


The Technical Challenge: Context is Everything

Here's where it gets interesting. You can't just throw code at an LLM and expect meaningful reviews. The key insight was that context is everything.

Think about it: when a human reviews code, they don't just look at the changed lines. They understand:

We needed to give our AI the same contextual awareness.

Building the Optimized Context

DeputyDev creates what we call an "optimized context" by pulling together:

This is the most crucial piece. When you change a function like process_order, DeputyDev automatically identifies:

We use Abstract Syntax Trees (AST) and a combination of lexical and semantic search to find these relevant code chunks. The formula is elegantly simple:

Relevant_Chunks = Lexical_Search_Results ∪ Semantic_Search_Results
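The union above can be sketched in a few lines. Everything here is invented for illustration: the corpus, the chunk ids, and both "search engines" are toy stand-ins for DeputyDev's actual AST-derived chunking, keyword search, and embedding search.

```python
import re

# Toy illustration of Relevant_Chunks = Lexical_Search ∪ Semantic_Search.
CHUNKS = {
    "orders.py:process_order": "def process_order(order): validate(order); charge(order)",
    "orders.py:validate": "def validate(order): ...",
    "billing.py:charge": "def charge(order): ...",
    "utils.py:slugify": "def slugify(text): ...",
}

def tokens(text: str) -> set[str]:
    # Split identifiers on underscores so process_order ~ {process, order}.
    return set(re.findall(r"[a-z]+", text.lower().replace("_", " ")))

def lexical_search(query: str) -> set[str]:
    """Exact-substring match: a stand-in for keyword/BM25 search."""
    return {cid for cid, src in CHUNKS.items() if query in src}

def semantic_search(query: str) -> set[str]:
    """Token-overlap scoring: a stand-in for embedding similarity."""
    q = tokens(query)
    return {cid for cid, src in CHUNKS.items() if q & tokens(src)}

def relevant_chunks(query: str) -> set[str]:
    return lexical_search(query) | semantic_search(query)
```

Taking the union rather than the intersection means a chunk surfaced by either signal makes it into the context, which is the point: lexical search catches exact identifier references, semantic search catches conceptually related code.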

Why Not Just Send the Entire Codebase?

Great question! Three reasons:


The Architecture: Multi-Agent Workflow

Inspired by Andrew Ng's work on agentic AI, we built DeputyDev using a multi-agent architecture with reflection. Instead of one monolithic AI trying to review everything, we created specialized agents:

Our Six Specialized Agents
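As a rough illustration of how specialized agents fan out over the same diff, here is a minimal sketch. The agent names and prompts below are invented for the example; they are not DeputyDev's actual six agents, and `llm` stands in for any prompt-to-text completion callable.

```python
# Illustrative multi-agent fan-out: each agent reviews the same diff
# through its own narrowly scoped prompt.
AGENT_PROMPTS = {
    "security": "Review this diff for security issues only:\n{diff}",
    "performance": "Review this diff for performance issues only:\n{diff}",
    "style": "Review this diff for style issues only:\n{diff}",
}

def run_agents(llm, diff: str) -> dict[str, str]:
    """Return one review per specialized agent, keyed by agent name."""
    return {name: llm(tmpl.format(diff=diff))
            for name, tmpl in AGENT_PROMPTS.items()}
```

Scoping each prompt narrowly is what makes the approach work: an agent asked only about security misses fewer security issues than one asked about everything at once.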

The Reflection Pattern

Here's where it gets really clever. After each agent generates its initial review, we send the response back to the LLM asking it to reflect on its own output. This iterative refinement dramatically improves quality.

Andrew Ng's research showed that GPT-3.5 wrapped in an agentic loop reached 95.1% accuracy on the HumanEval coding benchmark, compared with just 48.1% in zero-shot mode. That's the power of reflection.
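The reflection step above can be sketched as a small loop. This is illustrative only: `llm` is a hypothetical prompt-to-text callable, and the prompts are invented for the example, not DeputyDev's actual ones.

```python
# Minimal reflection loop: draft -> self-critique -> revised draft.
def review_with_reflection(llm, diff: str, rounds: int = 1) -> str:
    review = llm(f"Review this diff and list concrete issues:\n{diff}")
    for _ in range(rounds):
        # Ask the model to find flaws in its own output...
        critique = llm(
            "Critique the review below: flag wrong, vague, or missing "
            f"comments.\n{review}"
        )
        # ...then rewrite the review with that critique in hand.
        review = llm(
            "Rewrite the review, addressing the critique.\n"
            f"Review:\n{review}\nCritique:\n{critique}"
        )
    return review
```

Each round costs two extra LLM calls, which is the latency/quality trade-off discussed below.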

The Blending Engine

All agent responses flow through our "blending engine" which:
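One operation a blending step plausibly needs is merging near-identical comments that several agents raise on the same line. A minimal sketch follows; the comment shape ({agent, file, line, text}) is invented for this example, not DeputyDev's actual data model.

```python
# De-duplicate overlapping agent comments, merging the agents that
# raised the same finding on the same file and line.
def blend(comments: list[dict]) -> list[dict]:
    merged: dict[tuple, dict] = {}
    for c in comments:
        # Normalize the text so trivially different copies collide.
        key = (c["file"], c["line"], c["text"].strip().lower())
        if key in merged:
            merged[key]["agents"].append(c["agent"])  # same finding, new agent
        else:
            merged[key] = {"file": c["file"], "line": c["line"],
                           "text": c["text"], "agents": [c["agent"]]}
    return list(merged.values())
```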


The Experiment: Rigorous A/B Testing

We weren't going to deploy this without solid data. We ran a 30-day double-controlled A/B experiment with:

The Results Speak for Themselves

| Metric | Control Set 1 | Control Set 2 | Test Set (DeputyDev) | Improvement |
|---|---|---|---|---|
| Avg Review Time | 239.57 hrs | 278.14 hrs | 197.97 hrs | -17% to -29% |
| Avg Time per LOC | 12.97 hrs | 12.29 hrs | 7.50 hrs | -38% to -42% |
| Median Review Time | 0.76 hrs | 0.78 hrs | 0.41 hrs | -46% to -48% |
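The improvement column is just the relative change of the test set against each control set; here it is checked against the Avg Review Time row:

```python
# Relative improvement of the test set vs. a control set, in percent.
def improvement_pct(control_hrs: float, test_hrs: float) -> float:
    return (test_hrs - control_hrs) / control_hrs * 100

print(round(improvement_pct(239.57, 197.97)))  # vs Control Set 1 → -17
print(round(improvement_pct(278.14, 197.97)))  # vs Control Set 2 → -29
```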

The statistically significant reductions across all metrics validated our hypothesis.

Where DeputyDev Shines Brightest

Interestingly, we found DeputyDev's impact varies by PR size:

The tool excels at smaller PRs because it eliminates the fixed overhead of context switching—the biggest time sink for small changes.


Beyond Code Review: Additional Features

Context-Aware Chat

Developers can ask DeputyDev questions by starting with #dd:

It's like having a senior developer available 24/7, with full context of your PR.

PR Summaries

DeputyDev automatically generates:

This helps reviewers quickly understand changes without diving into the diff immediately.


Technical Choices

LLM Selection:

Integration Points:


Key Learnings

1. Structured vs. Unstructured Output

We discovered that enforcing JSON schema restrictions during the initial LLM reasoning phase significantly reduces quality. Our solution: let the LLM reason freely, then structure the output in a separate step.
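The two-step pattern can be sketched as follows. Step 1 reasons with no schema pressure; step 2 only reshapes that answer into JSON. `llm` is a hypothetical prompt-to-text callable and the schema is invented for the example.

```python
import json

def review_then_structure(llm, diff: str) -> dict:
    # Step 1: unconstrained reasoning -- no schema in sight.
    free_review = llm(f"Review this diff; reason freely in prose:\n{diff}")
    # Step 2: a purely mechanical conversion into the schema we need.
    structured = llm(
        "Convert the review below into JSON with keys 'comments' "
        f"(list of strings) and 'verdict':\n{free_review}"
    )
    return json.loads(structured)
```

The second call is cheap and low-risk because it no longer has to think, only to transcribe.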

2. Agentic Design Trade-offs

Benefits:

Challenges:

The quality improvements justified the added complexity.

3. Weak Correlation Between LOC and Review Time

Surprisingly, we found very weak correlation (0.004-0.095) between lines of code changed and review time. This aligns with what experienced developers intuitively know: a 5-line change can be more complex than a 500-line one.
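For readers who want to see what such a weak correlation looks like, here is Pearson's r from first principles, run on invented (LOC, review-hours) pairs; the data is made up purely to mimic the finding that size barely predicts time.

```python
# Pearson correlation coefficient, computed from scratch.
def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

loc = [5, 40, 500, 120, 15]        # lines changed (made-up data)
hours = [3.0, 0.5, 0.8, 2.5, 0.4]  # review time (made-up data)
```

On this toy data the big 500-line change reviews quickly while a 5-line change drags, so |r| stays small, just as the telemetry showed.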


Real-World Impact

Since deployment:


The Future of Code Review

DeputyDev represents a fundamental shift in how we think about code reviews. Rather than replacing human reviewers, it augments them—handling the mechanical, time-consuming aspects while freeing humans to focus on architectural decisions, business logic, and nuanced judgments that require deep expertise.

The immediate feedback loop also fundamentally changes the developer experience. Instead of context-switching nightmares, developers get instant, actionable feedback, make corrections, and move forward—all while staying in flow.


Open Questions and Future Work

We're continuing to explore:


Try It Yourself

DeputyDev is available as a SaaS solution. If you're facing similar code review bottlenecks, we'd love to help your team achieve similar productivity gains.