LLM Judge
Overview
LLM Judge is a web-based evaluation platform for scoring question-answer pairs using LLMs as judges. It targets regulated domains (legal, medical, finance) where Q&A quality matters and automated evaluation needs to be both rigorous and transparent.
Users upload a file of Q&A pairs, configure weighted evaluation criteria with optional hard minimums, pick a judge model, and get back per-pair scores with reasoning, a pass/fail verdict, and any safety warnings.
The goal was to build a flexible evaluation tool that supports multiple LLM providers, domain-aware criteria, and a transparent scoring breakdown — so users can understand why something passed or failed, not just whether it did.
Solution
A two-part system: a FastAPI backend that handles file parsing, safety checks, and LLM-based evaluation, and a Next.js frontend that guides users through a step-by-step wizard.
The evaluation engine runs all criteria concurrently for each Q&A pair using asyncio, with retry logic and exponential backoff per LLM call. Safety checks (jailbreak detection, bias detection) run independently using pattern matching and are reported separately from quality scores.
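A minimal sketch of that per-pair loop, assuming a hypothetical `judge_criterion` coroutine that wraps a single LLM call (the `client.score` call and field names are illustrative, not the actual backend API):

```python
import asyncio
from tenacity import retry, stop_after_attempt, wait_exponential

# Hypothetical helper: one LLM call that scores a single criterion.
# Retries with exponential backoff on transient provider errors.
@retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=1, max=30))
async def judge_criterion(client, qa_pair: dict, criterion: dict) -> dict:
    # The real prompt template and response parsing live in the backend;
    # only the call shape is shown here.
    raw = await client.score(question=qa_pair["question"],
                             answer=qa_pair["answer"],
                             criterion=criterion["name"])
    return {"criterion": criterion["name"], "score": raw["score"],
            "reasoning": raw["reasoning"], "issues": raw.get("issues", [])}

async def evaluate_pair(client, qa_pair: dict, criteria: list[dict]) -> list[dict]:
    # All criteria for one Q&A pair are scored concurrently.
    tasks = [judge_criterion(client, qa_pair, c) for c in criteria]
    return await asyncio.gather(*tasks)
```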
The verdict logic combines weighted criterion scores with hard minimums: an answer fails if any single criterion falls below its hard minimum, even when its overall weighted score is above the pass threshold. This matters in domains like medicine, where safety can't be traded off against other criteria.
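The rule can be expressed as a small pure function; this is a sketch under assumed field names (`weight`, `hard_min`) and a default threshold of 70, not the exact backend code:

```python
def verdict(scores: dict[str, float], criteria: dict[str, dict],
            pass_threshold: float = 70.0) -> tuple[str, float]:
    """Combine weighted criterion scores (0-100) into a PASS/REJECT verdict.

    criteria maps name -> {"weight": float, "hard_min": float | None}.
    """
    total_weight = sum(c["weight"] for c in criteria.values())
    weighted = sum(scores[name] * c["weight"] for name, c in criteria.items()) / total_weight

    # Hard minimums override the aggregate: one violation fails the answer
    # even when the weighted score clears the threshold.
    for name, c in criteria.items():
        if c.get("hard_min") is not None and scores[name] < c["hard_min"]:
            return "REJECT", weighted

    return ("PASS" if weighted >= pass_threshold else "REJECT"), weighted


# Example: high overall score, but "safety" misses its hard minimum -> REJECT
criteria = {"accuracy": {"weight": 0.6, "hard_min": None},
            "safety":   {"weight": 0.4, "hard_min": 80}}
print(verdict({"accuracy": 95, "safety": 60}, criteria))  # ('REJECT', 81.0)
```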
Tech Stack
- Backend: FastAPI, Pydantic, Uvicorn
- Frontend: Next.js 15 (App Router), React 19, TypeScript, Tailwind CSS, DaisyUI
- LLM Providers: OpenAI SDK, Anthropic SDK, Google GenAI SDK, DeepSeek (via OpenAI-compatible client)
- File Processing: PyMuPDF4LLM (PDF to Markdown), regex-based Q&A extraction (see the sketch after this list)
- Resilience: Tenacity (exponential backoff retries)
- Deployment: Docker, Google Cloud Run, AWS Fargate
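A rough sketch of the file-processing step: PyMuPDF4LLM converts the PDF to Markdown, and a regex splits the text into Q&A pairs. The `Q:`/`A:` pattern below is an assumption about the upload format, not the only layout the real parser handles:

```python
import re
import pymupdf4llm

def extract_qa_pairs(pdf_path: str) -> list[dict]:
    # Convert the PDF to Markdown text.
    md_text = pymupdf4llm.to_markdown(pdf_path)

    # Assumed layout: "Q: ... A: ..." blocks; the real extractor likely
    # supports several numbering and labeling conventions.
    pattern = re.compile(r"Q:\s*(?P<q>.+?)\s*A:\s*(?P<a>.+?)(?=\nQ:|\Z)", re.S)
    return [{"question": m.group("q").strip(), "answer": m.group("a").strip()}
            for m in pattern.finditer(md_text)]
```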
Architecture
The frontend is a step-by-step wizard (domain → upload → criteria → model → evaluate → results) built with React 19 and DaisyUI components. It communicates with the FastAPI backend exclusively through a Next.js backend-for-frontend (BFF) proxy layer.
On the backend, the evaluation flow is:
- Parse the uploaded file into Q&A pairs
- Run safety checks (jailbreak + bias) concurrently on each pair (see the pattern-matching sketch after this list)
- For each pair, evaluate all criteria concurrently — each criterion triggers an LLM call that returns a score (0-100), reasoning, and issues
- Aggregate weighted scores, apply hard minimum logic, determine pass/fail verdict
- Return the full evaluation response with per-pair results, summary statistics, and safety warnings
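To illustrate the pattern-matching approach behind the safety checks, here is a minimal sketch; the pattern lists are placeholders, not the project's actual rule set:

```python
import re

# Placeholder patterns; the real detectors use a larger, curated rule set.
JAILBREAK_PATTERNS = [
    r"ignore (all|previous) instructions",
    r"pretend (you are|to be)",
    r"developer mode",
]
BIAS_PATTERNS = [
    r"\ball (women|men|people from)\b.*\b(are|never|can't)\b",
]

def safety_warnings(question: str, answer: str) -> list[str]:
    """Return safety warnings, reported separately from quality scores."""
    text = f"{question}\n{answer}".lower()
    warnings = []
    if any(re.search(p, text) for p in JAILBREAK_PATTERNS):
        warnings.append("possible jailbreak attempt")
    if any(re.search(p, text) for p in BIAS_PATTERNS):
        warnings.append("possible biased generalization")
    return warnings
```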
A provider abstraction layer maps each model ID to the correct SDK client and normalizes responses into a unified format across providers. Errors in individual evaluations are caught and reported as REJECT verdicts rather than failing the entire batch.
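A sketch of how such an abstraction can look; the model-ID prefixes, the `NormalizedResponse` shape, and the helper names are assumptions, with each provider branch expected to wrap the corresponding SDK:

```python
from dataclasses import dataclass, field

@dataclass
class NormalizedResponse:
    # Unified shape returned to the evaluation engine, whatever the provider.
    score: int
    reasoning: str
    issues: list[str] = field(default_factory=list)

# Assumed prefix -> provider routing; DeepSeek reuses the OpenAI-compatible client.
PROVIDER_PREFIXES = {
    "gpt-": "openai",
    "claude-": "anthropic",
    "gemini-": "google",
    "deepseek-": "deepseek",
}

def resolve_provider(model_id: str) -> str:
    for prefix, provider in PROVIDER_PREFIXES.items():
        if model_id.startswith(prefix):
            return provider
    raise ValueError(f"Unknown model id: {model_id}")

async def evaluate_or_reject(evaluate, qa_pair, criteria):
    # Per-item error isolation: a failed evaluation becomes a REJECT result
    # instead of aborting the whole batch.
    try:
        return await evaluate(qa_pair, criteria)
    except Exception as exc:
        return {"verdict": "REJECT", "error": str(exc), "results": []}
```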