Tag: #llm evals

10 articles tagged with "llm evals"

LLM Evaluation Metrics That Actually Matter
A concise framework for selecting evaluation metrics that map to business outcomes and reliability targets.
Build an LLM Eval Dataset from Production Traces
How to convert real user interactions into reusable test sets for regression testing and model comparison.
Offline vs Online LLM Evals
When to run synthetic benchmarks, when to measure in production, and how to combine both.
LLM-as-Judge Rubric Design
Rubric patterns that improve consistency and reduce evaluator drift in LLM-as-judge pipelines.
Pairwise vs Absolute LLM Scoring
Tradeoffs between pairwise comparisons and absolute scorecards for prompt and model selection.
Tool-Calling Evals: Schema and Retries
Evaluate function-calling reliability with schema compliance, retries, and side-effect safety checks.
Agent Evals: Trajectory Quality
How to score multi-step agent behavior, tool choice, and completion efficiency.
CI/CD Eval Gates for LLM Apps
A release pipeline pattern that blocks regressions with automated eval checks.
Hallucination Testing: Reference-Based and Reference-Free
Testing methods to catch factual drift, unsupported claims, and citation mismatches.
LLM Regression Dashboard: Alerts and Thresholds
Dashboard design for evaluation regressions, quality alerts, and deployment control loops.