2026-04-07 · 8 min read · Francis Watson

We ran the Greptile 50-PR benchmark. Here's what happened.

We tested Grapple PR against 50 real-world bug-introducing PRs across Sentry, Cal.com, Grafana, Keycloak, and Discourse. Full results, no spin.

Tags: benchmark · transparency · results

Most AI code review tools show you a demo, claim they "understand your codebase," and ask you to trust them. We decided to do something different: test ourselves against a public benchmark and publish the results, whether they're flattering or not.

Greptile's 2025 AI Code Review Benchmark is the closest thing the industry has to a standardized test. 50 real-world pull requests, each containing an intentionally introduced bug, across 5 major open-source repositories. Every tool is scored the same way: did it leave an inline comment that identified the specific bug and explained its impact?

The Setup

We forked all 5 benchmark repos under our GitHub account and installed the Grapple PR GitHub App:

Sentry · Python · 10 PRs
Cal.com · TypeScript · 10 PRs
Grafana · Go · 10 PRs
Keycloak · Java · 10 PRs
Discourse · Ruby · 10 PRs

Each PR was opened against our fork, triggering Grapple PR's full review pipeline: context assembly, 6-agent parallel execution, verification pass, confidence scoring, and inline comment posting. Default settings. No custom rules. No .grapple.yml configuration.

We then scored each PR using an LLM judge that applied Greptile's exact criteria: (1) inline comment on the specific buggy code, (2) identifies the actual planted bug (not a different issue), (3) explains the impact.

The Results

Tool         Overall  Critical  High  Medium  Low
Greptile     82%      58%       100%  89%     87%
Cursor       58%      58%       64%   56%     53%
Grapple PR   56%      20%       65%   73%     86%
Copilot      54%      50%       57%   78%     87%
CodeRabbit   44%      33%       36%   56%     53%
Graphite     6%       17%       0%    11%     0%

28 out of 50 bugs caught. 56% overall.

Not the top score. But look at the breakdown: we caught 86% of low-severity bugs and 73% of medium-severity bugs. Our weakness is critical bugs, where we caught only 20%.

What We Caught Well

Stale config variable (Sentry)

Function returns the original config object instead of the modified copy. Our Logic Agent traced the data flow and flagged it.
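As a sketch of this bug class (the names here are ours for illustration, not Sentry's actual code): the function builds a modified copy of the config but hands back the original.

```typescript
type Config = { sampleRate: number };

function withSampleRate(config: Config, rate: number): Config {
  const updated = { ...config, sampleRate: rate };
  // The planted bug was the equivalent of `return config;` here,
  // silently discarding the modification. The fix returns the copy:
  return updated;
}
```

A data-flow trace catches this because `updated` is computed but never escapes the function in the buggy version.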

Case sensitivity bypass in email blacklist (Cal.com)

Blacklist emails are lowercased but guest input isn't normalized. Security Agent caught the bypass.
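The fix pattern, in sketch form (function and variable names are illustrative, not Cal.com's): if stored entries are lowercased, the input must be normalized the same way before lookup, or mixed-case input walks straight past the check.

```typescript
// Blacklist entries are assumed to be stored lowercased.
const blacklist = new Set(["blocked@example.com"]);

function isBlacklisted(email: string): boolean {
  // Normalizing the guest input closes the bypass:
  // "Blocked@Example.COM" must hit the same entry.
  return blacklist.has(email.trim().toLowerCase());
}
```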

OR condition deleting all reminders (Cal.com)

Missing method filter in a deleteMany query. Logic Agent identified the missing constraint.
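An in-memory sketch of the bug class (not Cal.com's actual Prisma query; the `method === "EMAIL"` constraint and field names are illustrative): a delete whose filter omits one condition matches far more rows than intended.

```typescript
type Reminder = { id: number; method: string; bookingId: number };

// Deletes email reminders for the given bookings and returns what survives.
function deleteEmailReminders(rows: Reminder[], bookingIds: number[]): Reminder[] {
  // A row is deleted only when BOTH conditions hold. Dropping the
  // method check (the planted bug) would also wipe SMS reminders.
  return rows.filter(
    (r) => !(bookingIds.includes(r.bookingId) && r.method === "EMAIL")
  );
}
```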

Wrong parameter in null check (Keycloak)

Null check on grantType instead of rawTokenId. Multiple agents flagged this independently and it clustered into one high-confidence finding.

Thread-safety issue with lazy initialization (Discourse)

Non-atomic read-modify-write on @loaded_locales. Logic Agent identified the TOCTOU race condition.
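A TypeScript analogue of the same check-then-act race (the Ruby original races threads on `@loaded_locales`; here concurrent async callers race on a shared variable, and names are illustrative): caching the in-flight promise makes initialization happen once.

```typescript
let loading: Promise<string[]> | null = null;
let initCount = 0; // counts real initializations, for illustration

async function loadLocales(): Promise<string[]> {
  initCount++;
  return ["en", "fr"];
}

function getLocales(): Promise<string[]> {
  // Without storing the shared promise, two concurrent callers both
  // observe `null` and both initialize (the read-modify-write race).
  if (!loading) loading = loadLocales();
  return loading;
}
```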

What We Missed, and Why

The 22 misses cluster into clear patterns:

SSRF via open(url) without validation (Discourse)

The security agent saw the open() call but didn't trace that the URL was user-controlled. Requires taint tracking from source to sink.
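The kind of validation the fix calls for can be sketched like this (illustrative and deliberately non-exhaustive; real SSRF defenses also need DNS resolution checks and redirect handling):

```typescript
// Decide whether a user-supplied URL is safe to fetch server-side.
function isSafeTarget(raw: string): boolean {
  let url: URL;
  try {
    url = new URL(raw);
  } catch {
    return false; // not a parseable URL at all
  }
  // Reject non-HTTPS schemes (file:, gopher:, plain http:) and the
  // loopback / link-local hosts SSRF payloads typically target.
  if (url.protocol !== "https:") return false;
  const host = url.hostname;
  if (host === "localhost" || host === "127.0.0.1" || host.startsWith("169.254.")) {
    return false;
  }
  return true;
}
```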

Timing attack via direct string comparison (Cal.com)

The webhook secret was compared with !== instead of crypto.timingSafeEqual(). Our static scanner didn't have a pattern for this at the time.
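The fixed pattern looks like this (a sketch; the function name is ours): compare secrets in constant time rather than with `!==`, which short-circuits on the first differing byte.

```typescript
import { timingSafeEqual } from "node:crypto";

function signaturesMatch(expected: string, received: string): boolean {
  const a = Buffer.from(expected);
  const b = Buffer.from(received);
  // timingSafeEqual throws on length mismatch, so guard first.
  // (Length itself leaks, which is acceptable for fixed-size signatures.)
  if (a.length !== b.length) return false;
  return timingSafeEqual(a, b);
}
```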

Recursive caching call causing infinite loop (Keycloak)

The method calls this.getForLogin() instead of delegate.getForLogin(). Finding this requires seeing the method body of the import target, not just the diff.
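A minimal sketch of the bug class (names are illustrative, not Keycloak's): a caching wrapper that calls its own method on a cache miss instead of the wrapped delegate recurses forever.

```typescript
interface Provider {
  getForLogin(id: string): string;
}

class CachingProvider implements Provider {
  private cache = new Map<string, string>();
  constructor(private delegate: Provider) {}

  getForLogin(id: string): string {
    const hit = this.cache.get(id);
    if (hit !== undefined) return hit;
    // The planted bug was the equivalent of `this.getForLogin(id)`
    // here, which loops forever on a miss. Delegating is the fix:
    const value = this.delegate.getForLogin(id);
    this.cache.set(id, value);
    return value;
  }
}
```

Seeing this requires the body of the delegated method, which is exactly why diff-only review misses it.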

Backup codes not invalidated after use (Cal.com)

The agent flagged a related concern (codes being nulled out) but didn't identify the specific bug: codes should be marked as used, not just checked.
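The distinction in sketch form (an in-memory stand-in for Cal.com's actual storage; names are ours): a successful check must also mark the code as used, or the same code redeems indefinitely.

```typescript
type BackupCode = { code: string; used: boolean };

function redeemBackupCode(codes: BackupCode[], input: string): boolean {
  const match = codes.find((c) => c.code === input && !c.used);
  if (!match) return false;
  // The planted bug omitted this invalidation step: the code was
  // verified but never marked used, so it remained replayable.
  match.used = true;
  return true;
}
```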

What We Built in Response

Every miss pointed to a concrete architectural gap. So we built three things:

Taint-tracking security prompts

Rewrote the Security Agent from a category checklist to source-to-sink data flow analysis. The prompt now walks the model through: identify sources (user input), identify sinks (queries, HTTP calls, crypto comparisons), trace the path. Added 60+ OpenGrep AST-aware rules for 16 languages.
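The source-to-sink reasoning can be illustrated with a toy taint model (this is a teaching sketch, not our production prompt or analyzer): values from user input carry a taint flag, and a sink refuses them unless a sanitization step sits on the path.

```typescript
type Tainted<T> = { value: T; tainted: boolean };

// Source: anything from user input starts tainted.
const fromUser = (v: string): Tainted<string> => ({ value: v, tainted: true });

// Sanitizer: transforms the value and clears the taint.
const sanitize = (t: Tainted<string>): Tainted<string> => ({
  value: encodeURIComponent(t.value),
  tainted: false,
});

// Sink: a query, HTTP call, or crypto comparison. Tainted input is a finding.
function sink(t: Tainted<string>): string {
  if (t.tainted) throw new Error("tainted value reached sink");
  return t.value;
}
```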

Cross-file context injection

Agents now see code from files outside the diff. When the diff imports a function, we resolve the import to the code graph and include the function body. If the function doesn't exist, the agent catches it. If it recurses into itself instead of delegating, the agent catches that too.
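Conceptually, the injection step works like this sketch (a simplification of the real pipeline; the `module#name` index format is our illustration): resolve each import the diff references against a prebuilt symbol index and include the definition, flagging imports that resolve to nothing.

```typescript
type SymbolIndex = Map<string, string>; // "module#name" -> source body

function contextFor(
  imports: Array<{ module: string; name: string }>,
  index: SymbolIndex
): string[] {
  return imports.map(({ module, name }) => {
    const body = index.get(`${module}#${name}`);
    // A missing definition is itself a finding: the diff imports
    // a symbol that does not exist.
    return body ?? `/* unresolved import: ${name} from ${module} */`;
  });
}
```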

Multi-pass review for large PRs

PRs with more than 500 lines changed now get a second pass. The Security and Logic agents run again with a prompt: "here's what the first pass found. What did it miss?" This catches subtle bugs that get overlooked in large diffs.

Transparency Is the Point

We could have waited until our scores were higher. We could have run the benchmark privately, tweaked our prompts until we hit 80%, and then published.

But that defeats the purpose. If you're evaluating AI code review tools, you deserve real numbers — not marketing claims. Our 56% is real. Our misses are documented. Our improvements are in production. And we'll re-run this benchmark periodically and publish the delta.

Every PR from this benchmark is publicly visible on our GitHub, so you can check every review comment yourself.

Next up: we're re-running the benchmark with our taint-tracking, OpenGrep, and cross-file improvements. We'll publish the comparison — before vs after — so you can see exactly what changed.

Try Grapple PR on your next pull request

Free during beta. One-click GitHub App install. No credit card.
