llm-eval-harness

Agent Skills · Free · Open Source

Benchmark any OpenAI/Anthropic-compatible LLM on speed, concurrency, protocol compliance and quality.

🛡️ AgentReady threat assessment

MAESTRO 7-layer threat model + OWASP AIVSS risk score for llm-eval-harness, derived from its capabilities.

AIVSS 7.6 · High

Overview

Community Agent Skill that evaluates an LLM endpoint in four areas: speed (time to first token, or TTFT, and tokens/sec), concurrency and stability (success rate, p50/p90, breaking point), Anthropic protocol compliance (thinking-block trigger rate), and quality regression via blind-judge precision. It issues concurrent API calls and writes benchmark reports.
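As a rough illustration of the speed metrics named above, the sketch below derives TTFT and tokens/sec from the arrival timestamps of streamed tokens. It is not taken from the skill's code: the `SpeedMetrics` type, the `speed_metrics` function, and the choice to measure throughput over the decode phase only (after the first token) are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class SpeedMetrics:
    ttft_s: float          # seconds from request start to first token
    tokens_per_sec: float  # decode-phase throughput


def speed_metrics(request_start: float, token_times: Sequence[float]) -> SpeedMetrics:
    """Compute TTFT and tokens/sec from token arrival timestamps (hypothetical helper).

    `request_start` and `token_times` are monotonic-clock readings in seconds,
    e.g. values collected with time.perf_counter() while reading a stream.
    """
    if not token_times:
        raise ValueError("no tokens received")
    ttft = token_times[0] - request_start
    # Assumption: throughput excludes the wait for the first token, so it
    # counts the tokens after the first over the time they took to arrive.
    decode_window = token_times[-1] - token_times[0]
    rate = (len(token_times) - 1) / decode_window if decode_window > 0 else 0.0
    return SpeedMetrics(ttft_s=ttft, tokens_per_sec=rate)


# Example: request sent at t=10.0, five tokens arriving every 0.5 s.
example = speed_metrics(10.0, [10.5, 11.0, 11.5, 12.0, 12.5])
print(example)
```

Some tools instead divide total tokens by total wall-clock time, which folds TTFT into the rate; the two definitions give different numbers, so a comparison against a vendor claim should use whichever definition the vendor states.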

Key features

Speed and TTFT measurement
Concurrency/stability load testing
Blind-judge quality regression
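
The concurrency and stability figures can be sketched in a similar way. The example below is a minimal illustration, not the skill's implementation: it assumes that p50/p90 refer to request latency, uses the nearest-rank percentile method, and treats the breaking point as the lowest concurrency level whose success rate falls below a chosen threshold. The function names and the 0.95 default are hypothetical.

```python
import math
from typing import Dict, Optional, Sequence


def percentile(sorted_values: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile of an already sorted, non-empty sequence."""
    rank = max(1, math.ceil(pct * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def summarize(latencies: Sequence[Optional[float]]) -> dict:
    """Summarize one load level; a latency of None marks a failed request."""
    if not latencies:
        raise ValueError("no requests recorded")
    ok = sorted(x for x in latencies if x is not None)
    return {
        "success_rate": len(ok) / len(latencies),
        "p50": percentile(ok, 50) if ok else None,
        "p90": percentile(ok, 90) if ok else None,
    }


def breaking_point(success_by_level: Dict[int, float], threshold: float = 0.95) -> Optional[int]:
    """Lowest concurrency level whose success rate is below `threshold`, if any."""
    for level in sorted(success_by_level):
        if success_by_level[level] < threshold:
            return level
    return None


# Example: ten requests at one concurrency level, one of which failed.
report = summarize([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, None])
print(report)
print(breaking_point({1: 1.0, 8: 0.99, 32: 0.8}))
```

Nearest-rank is used here because it always returns an observed latency; interpolating methods would give slightly different p50/p90 values on small samples.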

Use cases

Verifying a vendor's tokens/sec claim
Head-to-head model benchmarking