Most AI code review tools show you a demo, claim they "understand your codebase," and ask you to trust them. We decided to do something different: test ourselves against a public benchmark and publish the results, whether they're flattering or not.
Greptile's 2025 AI Code Review Benchmark is the closest thing the industry has to a standardized test. 50 real-world pull requests, each containing an intentionally introduced bug, across 5 major open-source repositories. Every tool is scored the same way: did it leave an inline comment that identified the specific bug and explained its impact?
The Setup
We forked all 5 benchmark repos under our GitHub account and installed the Grapple PR GitHub App:
Sentry (Python), Cal.com (TypeScript), Grafana (Go), Keycloak (Java), and Discourse (Ruby), with 10 PRs opened against each.
Each PR was opened against our fork, triggering Grapple PR's full review pipeline: context assembly, 6-agent parallel execution, verification pass, confidence scoring, and inline comment posting. Default settings. No custom rules. No .grapple.yml configuration.
We then scored each PR using an LLM judge that applied Greptile's exact criteria: (1) inline comment on the specific buggy code, (2) identifies the actual planted bug (not a different issue), (3) explains the impact.
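In code terms, that rubric reduces to three booleans that must all hold. A minimal sketch, with our own illustrative types rather than Greptile's tooling:

```typescript
// Illustrative encoding of the judging rubric; the names are ours.
interface JudgeVerdict {
  inlineOnBuggyCode: boolean;    // (1) comment sits on the specific buggy lines
  identifiesPlantedBug: boolean; // (2) names the planted bug, not a different issue
  explainsImpact: boolean;       // (3) explains why the bug matters
}

// A PR only counts as caught when all three criteria hold.
const isCaught = (v: JudgeVerdict): boolean =>
  v.inlineOnBuggyCode && v.identifiesPlantedBug && v.explainsImpact;
```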
The Results
28 out of 50 bugs caught. 56% overall.
Not the top score. But look at the breakdown: we caught 86% of low-severity bugs and 73% of medium-severity bugs. Our weakness is critical bugs, where we caught only 20%.
What We Caught Well
Function returns the original config object instead of the modified copy. Our Logic Agent traced the data flow and flagged it.
Blacklisted emails are lowercased, but guest input isn't normalized. Security Agent caught the bypass (a minimal repro follows this list).
Missing method filter in a deleteMany query. Logic Agent identified the missing constraint.
Null check on grantType instead of rawTokenId. Multiple agents flagged this independently, and the reports clustered into one high-confidence finding (the clustering step is sketched after this list).
Non-atomic read-modify-write on @loaded_locales. Logic Agent identified the TOCTOU race condition.
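To make the normalization bypass concrete, here is a minimal repro; the names are illustrative, not the benchmark repo's actual code:

```typescript
// Illustrative repro of the normalization bypass (not the benchmark repo's code).
const blacklist = new Set(["spammer@example.com"]); // entries stored lowercased

function isBlocked(guestEmail: string): boolean {
  // BUG: guest input is compared as-is, so "Spammer@Example.com"
  // slips past the lowercased blacklist.
  return blacklist.has(guestEmail);
}

function isBlockedFixed(guestEmail: string): boolean {
  // FIX: normalize the untrusted side the same way the stored side was.
  return blacklist.has(guestEmail.trim().toLowerCase());
}
```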
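And here is the shape of the clustering step that merged those independent reports, simplified for illustration (not our production code):

```typescript
// Simplified sketch: findings from different agents that land on the same
// file and line merge into one finding with boosted confidence.
interface Finding {
  file: string;
  line: number;
  message: string;
  confidence: number; // 0..1
  agents: string[];
}

function clusterFindings(findings: Finding[]): Finding[] {
  const byLocation = new Map<string, Finding>();
  for (const f of findings) {
    const key = `${f.file}:${f.line}`;
    const existing = byLocation.get(key);
    if (!existing) {
      byLocation.set(key, { ...f, agents: [...f.agents] });
      continue;
    }
    // Independent agreement pushes confidence toward 1 without exceeding it.
    existing.confidence = 1 - (1 - existing.confidence) * (1 - f.confidence);
    existing.agents.push(...f.agents);
  }
  return [...byLocation.values()];
}
```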
What We Missed, and Why
The 22 misses cluster into clear patterns:
The Security Agent saw the open() call but didn't trace that the URL was user-controlled. Catching it requires taint tracking from source to sink (see the flow sketch under "Taint-tracking security prompts" below).
The webhook secret was compared with !== instead of crypto.timingSafeEqual(). Our static scanner didn't have a pattern for this at the time (a timing-safe version is sketched after this list).
The method calls this.getForLogin() instead of delegate.getForLogin(). Finding this requires seeing the method body of the import target, not just the diff.
The agent flagged a related concern (codes being nulled out) but didn't identify the specific bug: codes should be marked as used, not just checked.
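For reference, the timing-safe version of that webhook check uses standard Node crypto; variable names here are illustrative:

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Sketch of a timing-safe webhook signature check (names are illustrative).
function verifySignature(secret: string, payload: string, receivedHex: string): boolean {
  const expectedHex = createHmac("sha256", secret).update(payload).digest("hex");
  const expected = Buffer.from(expectedHex, "hex");
  const received = Buffer.from(receivedHex, "hex");
  // `!==` short-circuits on the first differing character, so response time
  // leaks how much of a guessed signature matched. timingSafeEqual compares
  // every byte; it throws on length mismatch, so check length first
  // (the length itself is not a secret).
  return received.length === expected.length && timingSafeEqual(expected, received);
}
```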
What We Built in Response
Every miss pointed to a concrete architectural gap. So we built three things:
Taint-tracking security prompts
We rewrote the Security Agent from a category checklist into source-to-sink data flow analysis. The prompt now walks the model through the chain: identify sources (user input), identify sinks (queries, HTTP calls, crypto comparisons), then trace the path between them. We also added 60+ AST-aware OpenGrep rules covering 16 languages.
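Here is the kind of source-to-sink flow that analysis is built to trace, in a hypothetical Express handler rather than benchmark code:

```typescript
import express from "express";

const app = express();

app.get("/preview", async (req, res) => {
  const url = String(req.query.url); // SOURCE: user-controlled input
  const upstream = await fetch(url); // SINK: outbound request; SSRF if the path above goes untraced
  res.send(await upstream.text());   // (fetch is global in Node 18+)
});

app.listen(3000);
```

A category checklist sees an outbound request and moves on; the taint prompt asks whether url can be traced back to req, which is exactly the open()-style miss above.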
Cross-file context injection
Agents now see code from files outside the diff. When the diff imports a function, we resolve the import through the code graph and include the function body in the agent's context. If the function doesn't exist, the agent catches it. If it recurses into itself instead of delegating, the agent catches that too.
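In sketch form, assuming a hypothetical codeGraph lookup (real resolution is per-language and more involved):

```typescript
// Simplified sketch; ContextChunk and codeGraph are illustrative names.
interface ContextChunk {
  path: string;
  source: string; // full body of the imported function
}

function injectCrossFileContext(
  diffImports: string[],
  codeGraph: Map<string, ContextChunk>,
): { chunks: ContextChunk[]; unresolved: string[] } {
  const chunks: ContextChunk[] = [];
  const unresolved: string[] = [];
  for (const imp of diffImports) {
    const target = codeGraph.get(imp); // resolve the import to its definition
    if (target) chunks.push(target);   // the agent now sees the callee's body
    else unresolved.push(imp);         // an import that resolves to nothing is itself a finding
  }
  return { chunks, unresolved };
}
```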
Multi-pass review for large PRs
PRs with more than 500 lines changed now get a second pass. The Security and Logic agents run again with a prompt: "here's what the first pass found. What did it miss?" This catches subtle bugs that get overlooked in large diffs.
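Schematically, the second pass looks like this; Finding and promptModel are stand-ins for internal types, not a public API:

```typescript
interface Finding { file: string; line: number; message: string; }

// Stand-in for the model call; the real wrapper carries agent config and context.
declare function promptModel(prompt: string): Promise<Finding[]>;

async function reviewWithSecondPass(
  linesChanged: number,
  diff: string,
  firstPass: Finding[],
): Promise<Finding[]> {
  if (linesChanged <= 500) return firstPass; // small PRs keep the single pass
  const missed = await promptModel(
    "Here's what the first pass found:\n" +
      JSON.stringify(firstPass, null, 2) +
      "\n\nReview the diff again. What did it miss?\n\n" +
      diff,
  );
  return [...firstPass, ...missed];
}
```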
Transparency Is the Point
We could have waited until our scores were higher. We could have run the benchmark privately, tweaked our prompts until we hit 80%, and then published.
But that defeats the purpose. If you're evaluating AI code review tools, you deserve real numbers — not marketing claims. Our 56% is real. Our misses are documented. Our improvements are in production. And we'll re-run this benchmark periodically and publish the delta.
Every PR from this benchmark is publicly visible on our GitHub, so you can check every review comment yourself.
Next up: we're re-running the benchmark with our taint-tracking, OpenGrep, and cross-file improvements. We'll publish the comparison — before vs after — so you can see exactly what changed.