Tag: #llm evals

10 articles tagged with "llm evals"

LLM Evaluation Metrics That Actually Matter
A concise framework for selecting evaluation metrics that map to business outcomes and reliability targets.
Build an LLM Eval Dataset from Production Traces
How to convert real user interactions into reusable test sets for regression testing and model comparison.
Offline vs Online LLM Evals
When to run synthetic benchmarks, when to measure in production, and how to combine both.
LLM-as-Judge Rubric Design
Rubric patterns that improve consistency and reduce evaluator drift in LLM-as-judge pipelines.
Pairwise vs Absolute LLM Scoring
Tradeoffs between pairwise comparisons and absolute scorecards for prompt and model selection.
Tool-Calling Evals: Schema and Retries
Evaluate function-calling reliability with schema compliance, retries, and side-effect safety checks.
Agent Evals: Trajectory Quality
How to score multi-step agent behavior, tool choice, and completion efficiency.
CI/CD Eval Gates for LLM Apps
A release pipeline pattern that blocks regressions with automated eval checks.
Hallucination Testing: Reference-Based and Reference-Free
Testing methods to catch factual drift, unsupported claims, and citation mismatches.
LLM Regression Dashboard: Alerts and Thresholds
Dashboard design for evaluation regressions, quality alerts, and deployment control loops.