Most AI code reviewers run a single LLM prompt: "here's the diff, find problems." That's like asking one person to simultaneously be a security auditor, a performance engineer, a logic debugger, an architecture reviewer, and a style checker. No one is good at all of those at once — not humans, not LLMs.
Grapple PR uses 6 specialized agents, each with its own model, its own system prompt, and its own area of expertise. Here's exactly how the pipeline works, from webhook to inline comment.
The Pipeline
Context Assembly
(~2s) Parallel fetch: code graph nodes + edges, blast radius, intent spec (from PR + linked issues + commits), team knowledge, codebase intelligence. Also parses .grapple.yml config and detects hotfix branches.
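Roughly, this stage is one concurrent fan-out over independent fetches. The sketch below assumes hypothetical helpers (fetchGraphSlice, buildIntentSpec, and friends are illustrative names, not Grapple's actual API):

```typescript
// Sketch of the context-assembly stage: independent fetches run concurrently.
// Every helper name here is illustrative, not Grapple's actual API.
type PR = { repo: string; number: number; headBranch: string };

declare function fetchGraphSlice(pr: PR): Promise<{ nodes: object[]; edges: object[] }>;
declare function computeBlastRadius(pr: PR): Promise<string[]>;
declare function buildIntentSpec(pr: PR): Promise<string>;         // PR body + linked issues + commits
declare function loadTeamKnowledge(repo: string): Promise<object>;
declare function loadGrappleConfig(repo: string): Promise<object>; // parsed .grapple.yml

async function assembleContext(pr: PR) {
  const [graph, blastRadius, intent, teamKnowledge, config] = await Promise.all([
    fetchGraphSlice(pr),
    computeBlastRadius(pr),
    buildIntentSpec(pr),
    loadTeamKnowledge(pr.repo),
    loadGrappleConfig(pr.repo),
  ]);
  const isHotfix = /^hotfix\//.test(pr.headBranch); // hotfix branch detection (pattern assumed)
  return { graph, blastRadius, intent, teamKnowledge, config, isHotfix };
}
```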
Parallel Agent Execution
(~30-90s) Security, Logic, and Style agents always run. Architecture skips for tiny single-file changes. Performance skips when no loops/queries/data structures detected. Each agent gets the full ReviewContext.
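Conceptually, agent selection and fan-out look like this sketch; the skip predicates paraphrase the rules above, and every type and field name is illustrative:

```typescript
// Sketch of agent selection and parallel execution.
interface ReviewContext {
  filesChanged: number;
  linesChanged: number;
  hasCrossModuleImports: boolean;
  patchHasLoops: boolean;
  patchHasQueries: boolean;
  patchHasDataStructures: boolean;
}

interface Finding { file: string; startLine: number; endLine: number; category: string; }

interface ReviewAgent {
  name: string;
  shouldRun(ctx: ReviewContext): boolean;
  review(ctx: ReviewContext): Promise<Finding[]>;
}

async function runAgents(ctx: ReviewContext, agents: ReviewAgent[]): Promise<Finding[]> {
  const active = agents.filter(a => a.shouldRun(ctx));
  // Every active agent receives the full ReviewContext and runs concurrently;
  // allSettled means one agent failing doesn't sink the whole review.
  const results = await Promise.allSettled(active.map(a => a.review(ctx)));
  return results.flatMap(r => (r.status === 'fulfilled' ? r.value : []));
}

// Example gates matching the skip rules above:
const architectureGate = (ctx: ReviewContext) =>
  !(ctx.filesChanged === 1 && ctx.linesChanged < 20 && !ctx.hasCrossModuleImports);
const performanceGate = (ctx: ReviewContext) =>
  ctx.patchHasLoops || ctx.patchHasQueries || ctx.patchHasDataStructures;
```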
Deep Review Pass (large PRs)
(~30-60s) If a PR has >500 lines changed, Security and Logic run a second pass with: 'here's what the first pass found — what did it miss?' Findings deduped by file + line range overlap.
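The overlap-based dedup can be as simple as this sketch (field names assumed):

```typescript
// A deep-pass finding is dropped when a first-pass finding already covers the
// same file and an overlapping line range.
type Span = { file: string; startLine: number; endLine: number };

function overlaps(a: Span, b: Span): boolean {
  return a.file === b.file && a.startLine <= b.endLine && b.startLine <= a.endLine;
}

function dedupeDeepPass<T extends Span>(firstPass: T[], deepPass: T[]): T[] {
  const fresh = deepPass.filter(d => !firstPass.some(f => overlaps(f, d)));
  return [...firstPass, ...fresh];
}
```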
Verification
(~30-60s) Deterministic line check: are the cited lines actually in the diff? -50 penalty for hallucinated line numbers. Then LLM verification: does the code at that line match the description?
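The deterministic half needs no API call: parse the patch's hunk headers, collect the new-file line numbers the diff actually touches, and penalize findings that cite anything else. A sketch, with the -50 value taken from the rule above and everything else assumed:

```typescript
// Collect the new-file line numbers that a unified diff actually adds.
function changedLines(patch: string): Set<number> {
  const lines = new Set<number>();
  let newLine = 0;
  for (const raw of patch.split('\n')) {
    const hunk = raw.match(/^@@ -\d+(?:,\d+)? \+(\d+)(?:,\d+)? @@/);
    if (hunk) { newLine = parseInt(hunk[1], 10); continue; }
    if (raw.startsWith('+++') || raw.startsWith('---')) continue; // file headers
    if (raw.startsWith('+')) { lines.add(newLine); newLine++; }   // added line
    else if (raw.startsWith('-')) { /* removed line: new-file numbering unchanged */ }
    else { newLine++; }                                           // context line
  }
  return lines;
}

// -50 when none of the cited lines are actually in the diff.
function lineCheckAdjustment(finding: { startLine: number; endLine: number },
                             patch: string): number {
  const touched = changedLines(patch);
  for (let l = finding.startLine; l <= finding.endLine; l++) {
    if (touched.has(l)) return 0;
  }
  return -50;
}
```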
Clustering
(<1s) When 2+ distinct agents flag the same file + line range + category family, they're merged into one finding with combined evidence. Suppressed duplicates never reach output. Multi-agent agreement gets a confidence boost.
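A minimal sketch of that merge rule, with field names assumed:

```typescript
interface Finding {
  agent: string; file: string; startLine: number; endLine: number;
  categoryFamily: string;   // e.g. "security", "logic", "style"
  evidence: string[];
}
interface Cluster extends Omit<Finding, 'agent'> { agents: Set<string>; }

function cluster(findings: Finding[]): Cluster[] {
  const merged: Cluster[] = [];
  for (const f of findings) {
    const hit = merged.find(m =>
      m.file === f.file && m.categoryFamily === f.categoryFamily &&
      m.startLine <= f.endLine && f.startLine <= m.endLine);
    if (!hit) {
      merged.push({ file: f.file, startLine: f.startLine, endLine: f.endLine,
                    categoryFamily: f.categoryFamily, evidence: [...f.evidence],
                    agents: new Set([f.agent]) });
      continue;
    }
    hit.agents.add(f.agent);                               // multi-agent agreement
    hit.evidence.push(...f.evidence);                      // combined evidence
    hit.startLine = Math.min(hit.startLine, f.startLine);  // widen the merged span
    hit.endLine = Math.max(hit.endLine, f.endLine);
  }
  return merged; // clusters with agents.size >= 2 get a confidence boost downstream
}
```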
Confidence Scoring
(<1s) Each finding scored 0-100 across 5 dimensions: evidence strength, agent agreement, severity alignment, verification pass, historical accuracy. Phase 1 (rules-only) → Phase 2 (60/40 rules/feedback) → Phase 3 (80% feedback-driven).
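In code, the score is a blend of those five dimensions, with the rules/feedback mix shifting by phase. The weights inside ruleScore are placeholders; only the dimensions and the phase splits come from the description above:

```typescript
interface ScoreInputs {
  evidenceStrength: number;    // 0-100
  agentAgreement: number;      // 0-100, boosted when 2+ agents agree
  severityAlignment: number;   // 0-100
  verificationPass: number;    // 0-100, after the +20/-50 adjustments
  historicalAccuracy: number;  // 0-100, from the feedback learner
}

function ruleScore(s: ScoreInputs): number {
  // Equal weights as a placeholder for the real rule weighting.
  return (s.evidenceStrength + s.agentAgreement + s.severityAlignment + s.verificationPass) / 4;
}

function confidence(s: ScoreInputs, reviewCount: number): number {
  const rules = ruleScore(s);
  const feedback = s.historicalAccuracy;
  if (reviewCount < 200) return rules;                         // Phase 1: cold start, rules only
  if (reviewCount < 1000) return 0.6 * rules + 0.4 * feedback; // Phase 2: 60/40 rules/feedback
  return 0.2 * rules + 0.8 * feedback;                         // Phase 3: feedback-driven, rules as guardrails
}
```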
Output Filtering
(<1s) Findings below the confidence threshold (default 70) are suppressed. Remaining findings sorted by severity, then confidence. Converted to PR review comments with suggested fixes.
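The final gate is a filter plus a two-key sort. The severity ladder below is an assumption (the post only names critical, minor, and info explicitly):

```typescript
const SEVERITY_ORDER = ['critical', 'major', 'minor', 'info'];

function selectForOutput<T extends { confidence: number; severity: string }>(
  findings: T[], threshold = 70): T[] {
  return findings
    .filter(f => f.confidence >= threshold)      // suppress low-confidence findings
    .sort((a, b) =>
      SEVERITY_ORDER.indexOf(a.severity) - SEVERITY_ORDER.indexOf(b.severity) ||
      b.confidence - a.confidence);              // severity first, then confidence
}
```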
The 6 Agents
Each agent is a class that extends BaseAgent. They share the same context-building methods (intent, intelligence, enhanced data, config, feedback, cross-file) but have different system prompts and different models.
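The rough shape, with callModel and parseFindings standing in for whatever the real implementation uses (all names here are assumptions):

```typescript
interface ReviewContext { diff: string; intent?: string; feedbackSummary?: string; crossFileContext?: string; }
interface Finding { agent: string; file: string; line: number; description: string; }
declare function callModel(model: string, prompt: string, diff: string): Promise<string>;
declare function parseFindings(raw: string, agent: string): Finding[];

abstract class BaseAgent {
  abstract readonly name: string;
  abstract readonly model: string;  // model id per agent, e.g. a Sonnet vs. Opus tier
  abstract systemPrompt(ctx: ReviewContext): string;

  // Shared context builders (intent, intelligence, enhanced data, config,
  // feedback, cross-file) live here so every agent formats them the same way.
  protected buildSharedContext(ctx: ReviewContext): string {
    return [ctx.intent, ctx.feedbackSummary, ctx.crossFileContext]
      .filter(Boolean).join('\n\n');
  }

  async review(ctx: ReviewContext): Promise<Finding[]> {
    const prompt = `${this.systemPrompt(ctx)}\n\n${this.buildSharedContext(ctx)}`;
    const raw = await callModel(this.model, prompt, ctx.diff);
    return parseFindings(raw, this.name);
  }
}
```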
Security Agent
Model: Claude Sonnet 4.6. Taint-tracking methodology: identify sources (user input), identify sinks (queries, HTTP calls, crypto comparisons), trace the path. 60+ OpenGrep AST-aware rules run BEFORE the LLM, and their hits are passed in as evidence. Language-specific sink tables for Python, TypeScript, Ruby, Java, Go, C#, PHP. Catches: injection, SSRF, timing attacks, ReDoS, hardcoded secrets, unsafe deserialization.
Logic Agent
Model: Claude Opus 4.6. The most critical agent. Evaluates code against developer intent (from PR title, description, linked issues, commit messages). Focus areas: intent alignment, edge cases (null/empty/zero), off-by-one errors, race conditions, error handling, state management. If the intent says 'cap discounts at 50%' and the code has no cap, that's a Critical finding.
Architecture Agent
Model: Claude Opus 4.6. Traces blast radius through the dependency graph. Cross-module impact, API contract violations, circular dependencies, pattern violations. Skips for single-file changes under 20 lines with no cross-module imports. Uses the code graph edges to trace import chains.
Performance Agent
Model: Claude Sonnet 4.6. N+1 query detection by tracing call chains through the code graph. Memory leaks, O(n^2) complexity, unbounded pagination, missing connection pool limits. Skips when no DB queries, loops, or data structure operations are detected in the patch.
Style Agent
Model: Claude Haiku 4.5. Ultra-conservative. Matches existing naming patterns from the code graph. Respects linter configuration. Only flags egregious style issues, never personal preferences. All findings are minor or info severity. If it can't find a clear team convention violation, it returns empty.
Verification Agent
Model: Claude Sonnet 4.6. Not a review agent but a fact-checker. Cross-checks every finding from the other 5 agents against codebase evidence. Deterministic line-presence check runs first (free, no API call): are the cited lines actually in the diff? Then LLM verification: does the code at that location match what the finding describes? Adjustments range from +20 (strong confirmation) to -50 (hallucination or demonstrably wrong).
Why specialization matters
The benchmark proved this. When the Security Agent and Logic Agent both independently flag the same issue — say, a wrong variable in a null check — the clustering system merges them into a single finding with combined evidence and a multi-agent confidence boost. Two agents agreeing is a stronger signal than one agent being very sure.
Conversely, when only the Style Agent flags a naming issue and no other agent sees a problem, it gets a -10 low-signal penalty. Solo findings in style/naming/documentation categories are more likely to be noise.
The feedback loop
Every time a developer marks a finding as "Helpful" or "Not Useful," that signal flows back into the system. A feedback learner job runs every 30 minutes, aggregating accept/dismiss rates per agent per category per repository.
When the Security Agent on your repo has a 90% accept rate for "sql_injection" findings but a 20% accept rate for "naming" findings, it sees that in its prompt:
```
## Your Historical Accuracy on This Repo

- **sql_injection**: 90% accept rate (45 reviews). HIGH TRUST: continue flagging these issues.
- **naming**: 20% accept rate (30 reviews). WARNING: mostly dismissed. Be extra conservative.
```
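Here's a sketch of how a feedback learner could roll accept/dismiss events up into that prompt section; the 80%/40% trust thresholds are assumptions:

```typescript
interface FeedbackEvent { repo: string; agent: string; category: string; helpful: boolean; }

function accuracySection(events: FeedbackEvent[], repo: string, agent: string): string {
  // Aggregate accept rates per category for this agent + repo.
  const byCategory = new Map<string, { helpful: number; total: number }>();
  for (const e of events) {
    if (e.repo !== repo || e.agent !== agent) continue;
    const s = byCategory.get(e.category) ?? { helpful: 0, total: 0 };
    s.total++;
    if (e.helpful) s.helpful++;
    byCategory.set(e.category, s);
  }
  // Render the prompt block the agent sees.
  const lines = ['## Your Historical Accuracy on This Repo'];
  for (const [category, s] of byCategory) {
    const rate = Math.round((100 * s.helpful) / s.total);
    const note = rate >= 80 ? 'HIGH TRUST: continue flagging these issues.'
               : rate <= 40 ? 'WARNING: mostly dismissed. Be extra conservative.'
               : 'Mixed signal: flag only with strong evidence.';
    lines.push(`- **${category}**: ${rate}% accept rate (${s.total} reviews). ${note}`);
  }
  return lines.join('\n');
}
```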
The confidence scorer also uses this data. Phase 1 (cold start): pure rules. Phase 2 (after ~200 reviews): 60% rules, 40% feedback. Phase 3 (after ~1000 reviews): 80% feedback, 20% rules as guardrails. The system gets more accurate the more your team uses it.
OpenGrep: AST-aware scanning before the LLM
Before any LLM agent runs, we reconstruct the post-change file from the diff patch and scan it with OpenGrep (the open-source Semgrep engine). 60+ curated rules across 16 languages check for injection, SSRF, timing attacks, ReDoS, hardcoded secrets, unsafe deserialization, and similar known-bad patterns.
OpenGrep findings are fed to the Security Agent as evidence. The agent sees: "OpenGrep flagged a timing attack at line 25 — a variable named webhook_secret is compared with !==." The LLM then decides if this is a real vulnerability or a false positive based on the surrounding code context.
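A sketch of that pre-LLM step: rebuild the post-change file from the unified diff, write it to a temp directory, and scan it. The opengrep invocation below is modeled on Semgrep's CLI and may not match the exact flags; treat the whole thing as illustrative:

```typescript
import { execFile } from 'node:child_process';
import { mkdtemp, writeFile } from 'node:fs/promises';
import { tmpdir } from 'node:os';
import { join } from 'node:path';
import { promisify } from 'node:util';

const run = promisify(execFile);

// Apply a single-file unified diff to the old content to get the new content.
function applyPatch(oldContent: string, patch: string): string {
  const oldLines = oldContent.split('\n');
  const out: string[] = [];
  let oldIdx = 0;
  for (const line of patch.split('\n')) {
    const hunk = line.match(/^@@ -(\d+)/);
    if (hunk) {
      const start = parseInt(hunk[1], 10) - 1;
      while (oldIdx < start) out.push(oldLines[oldIdx++]);     // copy untouched lines
      continue;
    }
    if (line.startsWith('+') && !line.startsWith('+++')) out.push(line.slice(1));   // added
    else if (line.startsWith('-') && !line.startsWith('---')) oldIdx++;             // removed
    else if (line.startsWith(' ')) { out.push(line.slice(1)); oldIdx++; }           // context
  }
  while (oldIdx < oldLines.length) out.push(oldLines[oldIdx++]); // trailing untouched lines
  return out.join('\n');
}

async function scanPatchedFile(oldContent: string, patch: string, fileName: string) {
  const dir = await mkdtemp(join(tmpdir(), 'grapple-'));
  const target = join(dir, fileName);
  await writeFile(target, applyPatch(oldContent, patch));
  // Assumed invocation; adjust flags to the real OpenGrep CLI.
  const { stdout } = await run('opengrep', ['scan', '--config', 'rules/', '--json', target]);
  return JSON.parse(stdout).results; // fed to the Security Agent as evidence
}
```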
This two-layer approach, deterministic pattern matching plus LLM reasoning, catches more bugs than either alone. The scanner never misses a known pattern; the LLM filters out false alarms on safe code. Together, they're better than the sum of their parts.
What's next
The pipeline is live and processing PRs for our early adopters. We're focused on three things:
Improving critical bug detection — our benchmark weakness. More cross-file context, deeper taint tracking, better security patterns.
Multi-platform support — GitLab, Azure DevOps, Bitbucket. The pipeline is platform-agnostic; the webhook integration isn't yet.
Agentic commits — instead of suggesting a fix, push the fix commit directly. Green-light fixes (tests pass, <10 lines, single file) can be auto-committed with approval.
Want to see the pipeline in action? Install the GitHub App — it's free during beta. Or check our benchmark results to see what it catches on real-world bugs.