March 1, 2026

GitLab MR Review Agent

Active

LLM-powered code review bot using FastAPI, Qdrant, Claude API, and RAG over Confluence. Reduced review turnaround by 40%.

Problem context

At AlfaStrakhovanie, our data engineering team processes 50+ merge requests per week across Spark jobs, dbt models, and Airflow DAGs. Senior engineers were spending 4–6 hours/week on reviews that were repetitive: checking naming conventions, validating SQL query patterns, verifying that new models referenced the correct data contracts.

The goal: automate the roughly 60% of review feedback that is repetitive, so senior engineers can focus on architecture and logic.

Architecture

GitLab Webhook (MR event)
        │
        ▼
FastAPI service (webhook handler)
        │
        ├── fetch MR diff via GitLab API
        ├── chunk diff by file/hunk
        │
        ▼
Qdrant retrieval (RAG)
        │
        ├── query: Confluence data contracts
        ├── query: internal coding standards docs
        ├── query: similar past MR reviews
        │
        ▼
Claude API (claude-3-5-sonnet)
        │
        ├── system prompt: role + company context
        ├── retrieved context (top-k chunks)
        ├── MR diff
        │
        ▼
Structured JSON output (per-file comments)
        │
        ▼
GitLab API → post inline comments on MR
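
To make the flow concrete, here is a minimal sketch of the orchestration as one coroutine. It is not the production handler: the `Comment` type, the function name, and the five injected callables are hypothetical stand-ins for the real GitLab, Qdrant, and Claude steps, passed in so the control flow can be read (and tested) without any network access.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable


@dataclass
class Comment:
    """One inline review comment (hypothetical shape)."""
    file: str
    line: int
    body: str


async def review_merge_request(
    mr_iid: int,
    fetch_diff: Callable[[int], Awaitable[str]],
    chunk_diff: Callable[[str], list[str]],
    retrieve: Callable[[str], Awaitable[list[str]]],
    generate: Callable[[str, list[str]], Awaitable[list[Comment]]],
    post_comment: Callable[[int, Comment], Awaitable[None]],
) -> int:
    """Run one MR through the pipeline; return the number of comments posted."""
    diff = await fetch_diff(mr_iid)              # GitLab API: fetch the MR diff
    posted = 0
    for chunk in chunk_diff(diff):               # split by file/hunk
        context = await retrieve(chunk)          # Qdrant: top-k related docs
        for comment in await generate(chunk, context):  # LLM review of this chunk
            await post_comment(mr_iid, comment)  # GitLab API: inline comment
            posted += 1
    return posted
```

Chunks are processed sequentially here for readability; whether to fan them out with `asyncio.gather` depends on API rate limits.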

Trade-offs explored

RAG vs. fine-tuning: Fine-tuning was ruled out because our codebase evolves faster than any fine-tuning cycle. RAG over live Confluence gives current context without retraining.

Claude vs. GPT-4o: I tested both. Claude produced more actionable feedback with fewer false positives on Spark job patterns, and it hallucinated less often on domain-specific SQL.

Chunk size for diffs: Hunks of ~150 lines performed better than full files. Large diffs (500+ lines) caused context dilution and worse recall.
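
A rough sketch of hunk-level chunking with a line cap, to show what "hunks of ~150 lines" means in practice. It assumes the input is the hunk text of a single file (GitLab's MR diff endpoints return one `diff` string per changed file); `DiffChunk` and `chunk_file_diff` are illustrative names, not the service's actual code.

```python
import re
from dataclasses import dataclass

# Matches a unified-diff hunk header and captures the new-side start line.
HUNK_HEADER = re.compile(r"^@@ -\d+(?:,\d+)? \+(\d+)(?:,\d+)? @@")


@dataclass
class DiffChunk:
    path: str        # file the hunk belongs to
    start_line: int  # first new-side line number covered by this chunk
    text: str        # hunk header plus the body lines in this chunk


def chunk_file_diff(path: str, diff: str, max_lines: int = 150) -> list[DiffChunk]:
    """Split one file's unified diff into hunks of at most max_lines body lines."""
    chunks: list[DiffChunk] = []
    header, start, body = "", 0, []

    def flush() -> None:
        # Emit the current hunk; oversized hunks are cut into fixed windows.
        new_line = start
        for i in range(0, len(body), max_lines):
            window = body[i : i + max_lines]
            # The original header is repeated on every window for orientation.
            chunks.append(DiffChunk(path, new_line, "\n".join([header, *window])))
            # Removed lines and "\ No newline" markers do not advance the new side.
            new_line += sum(1 for l in window if not l.startswith(("-", "\\")))
        body.clear()

    for line in diff.splitlines():
        match = HUNK_HEADER.match(line)
        if match:
            flush()
            header, start = line, int(match.group(1))
        elif header:
            body.append(line)
    flush()
    return chunks
```

Tracking `start_line` per chunk matters later: inline comments have to be anchored to a line number on the new side of the diff.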

Implementation highlights

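# Simplified retrieval step
from anthropic import AsyncAnthropic
from qdrant_client import AsyncQdrantClient

# Clients are assumed to be created once at startup. QDRANT_URL is a
# placeholder for the configured Qdrant address; AsyncAnthropic() reads
# ANTHROPIC_API_KEY from the environment.
qdrant = AsyncQdrantClient(url=QDRANT_URL)
claude = AsyncAnthropic()


async def retrieve_context(diff_chunk: str, k: int = 8) -> list[str]:
    # embed() is our async embedding helper (not shown).
    embedding = await embed(diff_chunk)

    results = await qdrant.search(
        collection_name="internal_docs",
        query_vector=embedding,
        limit=k,
        with_payload=True,
    )

    return [hit.payload["text"] for hit in results]


# Review generation
# REVIEW_SYSTEM_PROMPT, ReviewOutput and parse_review are defined elsewhere.
async def generate_review(diff: str, context: list[str]) -> ReviewOutput:
    context_block = "\n".join(context)
    response = await claude.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=2048,
        system=REVIEW_SYSTEM_PROMPT,
        messages=[
            {
                "role": "user",
                "content": f"<context>\n{context_block}\n</context>\n\n<diff>\n{diff}\n</diff>",
            }
        ],
    )
    return parse_review(response.content[0].text)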
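
`parse_review` and `ReviewOutput` are not shown above. One way they could look, as a sketch only: the JSON schema here (a top-level `comments` list with `file`, `line`, and `body` keys) is an assumption for illustration, not the schema the service actually uses.

```python
import json
from dataclasses import dataclass, field


@dataclass
class ReviewComment:
    file: str
    line: int
    body: str


@dataclass
class ReviewOutput:
    comments: list[ReviewComment] = field(default_factory=list)


def parse_review(raw: str) -> ReviewOutput:
    """Parse the model reply into ReviewOutput, tolerating prose around the JSON."""
    # Take the outermost {...} span in case the model wrapped the JSON in text.
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        return ReviewOutput()
    try:
        data = json.loads(raw[start : end + 1])
    except json.JSONDecodeError:
        return ReviewOutput()

    items = data.get("comments")
    if not isinstance(items, list):
        return ReviewOutput()

    comments = []
    for item in items:
        try:
            comments.append(
                ReviewComment(str(item["file"]), int(item["line"]), str(item["body"]))
            )
        except (KeyError, TypeError, ValueError):
            continue  # skip malformed entries instead of failing the whole review
    return ReviewOutput(comments)
```

Returning an empty review on unparseable output means a bad model reply results in no comments rather than a failed webhook.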

Results

  • Review turnaround: reduced from an average of 18h to 11h (−39%, roughly 40%)
  • False positive rate: ~8% (manually validated over 200 reviews)
  • Adoption: 3 teams actively using it; 2 more onboarding
  • Cost: ~$0.04 per MR review (Claude API), which is recouped by saving even a few minutes of senior engineer time.
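
For a sense of how a per-review cost in that range comes about, here is a back-of-the-envelope calculation. The token counts are hypothetical round numbers, not measurements from the service, and the default prices assume Claude 3.5 Sonnet list pricing of $3 per million input tokens and $15 per million output tokens; check current pricing before relying on them.

```python
def estimate_cost_usd(
    input_tokens: int,
    output_tokens: int,
    input_price_per_mtok: float = 3.00,    # assumed list price, USD per 1M input tokens
    output_price_per_mtok: float = 15.00,  # assumed list price, USD per 1M output tokens
) -> float:
    """Estimate the API cost of one review call in USD."""
    total = input_tokens * input_price_per_mtok + output_tokens * output_price_per_mtok
    return total / 1_000_000


# Hypothetical review: 8,000 tokens of prompt (system prompt + context + diff)
# and 1,000 tokens of generated comments.
cost = estimate_cost_usd(8_000, 1_000)
print(f"${cost:.3f}")  # prints "$0.039"
```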

What I'd do differently

  1. Chunk by semantic boundary, not line count. Splitting mid-function degraded context quality. A tree-sitter parser for Python/SQL would be better.
  2. Add confidence scores. Some comments are high-confidence (naming violations), others speculative (performance concerns). Surface that to reviewers.
  3. Human-in-the-loop for critical paths. Production pipeline MRs should require explicit human sign-off even if the agent approves.
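
To illustrate the first point, here is a small sketch of chunking on semantic boundaries. It swaps in Python's standard-library `ast` module as a stand-in for tree-sitter, so it only handles Python source (SQL would need a different parser), and `chunk_python_source` is a hypothetical name.

```python
import ast


def chunk_python_source(source: str) -> list[tuple[int, int, str]]:
    """Return (start_line, end_line, text) for each top-level statement."""
    lines = source.splitlines()
    chunks = []
    for node in ast.parse(source).body:
        # Decorators sit above the def/class line, so start from the earliest one.
        decorators = getattr(node, "decorator_list", [])
        start = min([node.lineno] + [d.lineno for d in decorators])
        end = node.end_lineno
        chunks.append((start, end, "\n".join(lines[start - 1 : end])))
    return chunks
```

Each function or class stays in one piece, so a chunk never ends mid-function. Comments and blank lines between top-level statements are dropped in this version, and a very large function would still need a secondary size cap.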
