llm-eval-harness
Benchmark any OpenAI/Anthropic-compatible LLM on speed, concurrency, protocol compliance and quality.
AgentReady threat assessment
MAESTRO 7-layer threat model + OWASP AIVSS risk score for llm-eval-harness, derived from its capabilities.
AIVSS 7.6 · High
Overview
A community Agent Skill that evaluates an LLM endpoint on four axes: speed (TTFT, tokens/sec), concurrency and stability (success rate, p50/p90, breaking point), Anthropic protocol compliance (thinking-block trigger rate), and quality regression (blind-judge precision). It issues concurrent API calls and writes benchmark reports.
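As a rough illustration of the speed metrics named above, the sketch below derives TTFT and tokens/sec from the arrival times of streamed tokens. It is not taken from the skill itself: `SpeedResult`, `speed_metrics`, and the choice to measure throughput over the decode window only (excluding the wait for the first token) are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class SpeedResult:
    ttft_s: float        # time to first token, in seconds
    tokens_per_s: float  # decode throughput measured after the first token


def speed_metrics(request_start: float, token_times: Sequence[float]) -> SpeedResult:
    """Compute TTFT and tokens/sec from token arrival timestamps (hypothetical helper).

    request_start and token_times are assumed to come from the same monotonic clock.
    """
    if not token_times:
        raise ValueError("no tokens received")
    ttft = token_times[0] - request_start
    decode_window = token_times[-1] - token_times[0]
    # A single token leaves no decode window, so report 0.0 instead of dividing by zero.
    tps = (len(token_times) - 1) / decode_window if decode_window > 0 else 0.0
    return SpeedResult(ttft_s=ttft, tokens_per_s=tps)


if __name__ == "__main__":
    # Five tokens arriving every 0.5 s, the first one 0.5 s after the request.
    print(speed_metrics(0.0, [0.5, 1.0, 1.5, 2.0, 2.5]))
```

Separating TTFT from decode throughput keeps a slow prompt-processing phase from being mistaken for slow generation, which matters when checking a quoted tokens/sec figure.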
Key features
- Speed and TTFT measurement
- Concurrency/stability load testing
- Blind-judge quality regression
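The concurrency and stability figures can be sketched in a similar way. The example below summarizes per-request outcomes into a success rate and p50/p90 values, then scans increasing concurrency levels for a breaking point. Everything here is an assumption for illustration: the function names, the nearest-rank percentile method, treating p50/p90 as latency over successful requests, and defining the breaking point as the lowest concurrency level whose success rate drops below a threshold. The skill may define these differently.

```python
import math
from typing import Dict, List, Optional, Tuple

# One request outcome: (succeeded, latency in seconds). Hypothetical record shape.
Outcome = Tuple[bool, float]


def percentile(values: List[float], pct: int) -> float:
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(results: List[Outcome]) -> Dict[str, float]:
    """Success rate plus p50/p90 latency over successful requests."""
    if not results:
        raise ValueError("no results to summarize")
    ok = [latency for success, latency in results if success]
    return {
        "success_rate": len(ok) / len(results),
        "p50": percentile(ok, 50) if ok else math.nan,
        "p90": percentile(ok, 90) if ok else math.nan,
    }


def breaking_point(by_level: Dict[int, List[Outcome]],
                   min_success: float = 0.95) -> Optional[int]:
    """Lowest concurrency level whose success rate falls below min_success.

    Returns None when every tested level stays at or above the threshold.
    """
    for level in sorted(by_level):
        if summarize(by_level[level])["success_rate"] < min_success:
            return level
    return None


if __name__ == "__main__":
    runs = {
        1: [(True, 1.0)] * 4,
        8: [(True, 1.0), (True, 2.0), (True, 3.0), (True, 4.0), (False, 9.0)],
    }
    print(summarize(runs[8]))
    print(breaking_point(runs))
```

Nearest-rank percentiles always return an observed latency, which is convenient for small samples; an interpolating method would give slightly different values.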
Use cases
- Verifying a vendor's tokens/sec claim
- Head-to-head model benchmarking