Methodology
How DX Audit runs agent-based audits of developer-facing products, and how to read the resulting reports.
This is the framework behind our published audits and client baseline audits. We use observed agent behavior as evidence — when an agent struggles to discover documentation, obtain credentials, or recover from an error, that friction is real and can affect human developers too. The published audits are an early evidence-backed set of case studies, not a benchmark leaderboard.
Launch Posture
- Real audits with real agents, not hypothetical scoring rubrics
- Visible disclosure in every published report
- Still refining the methodology through real audits
- Qualitative findings first; numerical scoring is deferred
Audit process
Standard Task Suite
What we test. The baseline audit for a new service follows a six-stage workflow. Each stage tests a distinct aspect of the developer experience. The suite is standardized enough to support pattern-finding across services, while leaving room for service-specific task design within each stage. In client engagements, we align on key outcomes and design task suites around the workflows that matter most — and can complement agent audits with moderated usability tests with human developers.
Discover
Can the agent find the relevant docs, references, and entry points?
Onboard
Can it obtain access, credentials, and the minimum setup needed to begin?
Core task
Can it complete the representative workflow the service is supposed to support?
Error handling
When the first attempt fails, can it recover using the available signals?
Cleanup / offboard
Can it identify and clean up the resources it created?
Reflection
What friction did the agent identify after completing the workflow?
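The six stages above can be sketched as a simple ordered structure. This is an illustrative representation only; the stage identifiers and data shape are assumptions, not the audit tooling's actual schema.

```python
# Hypothetical sketch of the six-stage baseline task suite.
# Stage questions follow the methodology text; identifiers are illustrative.
STAGES = [
    ("discover", "Can the agent find the relevant docs, references, and entry points?"),
    ("onboard", "Can it obtain access, credentials, and the minimum setup needed to begin?"),
    ("core_task", "Can it complete the representative workflow the service is supposed to support?"),
    ("error_handling", "When the first attempt fails, can it recover using the available signals?"),
    ("cleanup", "Can it identify and clean up the resources it created?"),
    ("reflection", "What friction did the agent identify after completing the workflow?"),
]

def stage_names():
    """Return the ordered stage identifiers."""
    return [name for name, _ in STAGES]
```

The ordering matters: each stage builds on the state left by the previous one, which is why cleanup and reflection come last.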
Note
The task suite provides a common structure across audits, not benchmark-grade comparability. Each service audit adapts the core task to the service's primary workflow. Differences in task scope are disclosed in the report's Run Conditions.
Agent and evidence
Model Runs and Evidence Policy
How evidence is gathered. Audits are run with real AI agents. Each launch report includes runs from two models to capture behavioral variation. Divergences between runs are recorded rather than flattened away.
Agent + Models
- Harness
- Claude Code
- Launch models
- Opus 4.6 and Sonnet 4.6
- Run strategy
- Each service is audited with both models. One published report per service combines both runs, with divergences noted inline.
Evidence Model
- Reader-first layer
- Report findings and recommendations are the primary reader-facing artifact.
- Supporting evidence
- Evidence remains visible through excerpts, session timelines, and transcript links.
- Model divergences
- Differences between model runs are preserved when they materially affect interpretation.
Launch status
Launch reports use Claude models only. Codex and potentially other agent harnesses will be added in future audit cycles. Running the same service audit across multiple agent families will help distinguish service-attributable friction from model-specific behavior.
Disclosure contract
Run Conditions
What we disclose. Every published launch report includes the same six-field disclosure block. These fields document the exact conditions under which the audit was run, so readers can assess what is service-attributable and what is an artifact of the test environment.
Run Conditions
Present in every published report, with field labels matching exactly.
- Starting state
- What existed before the run that materially affects interpretation.
- Fixture policy
- What was prepared in advance, if anything, and why.
- Credential timing
- When credentials became available and how they were provided.
- Allowed surfaces
- Which interfaces the agent was permitted to use, and what the harness constrained.
- Operator intervention policy
- What the operator could do and how interventions are recorded.
- Declared deviations from baseline
- Any service-, harness-, environment-, or operator-specific departures from the default procedure.
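The six-field disclosure block can be sketched as a record type. Field names mirror the labels above, but the class itself is hypothetical; published reports present these fields as prose, not as structured data.

```python
from dataclasses import dataclass, fields

@dataclass
class RunConditions:
    """Hypothetical record mirroring the six disclosure fields above."""
    starting_state: str                # What existed before the run
    fixture_policy: str                # What was prepared in advance, and why
    credential_timing: str             # When and how credentials were provided
    allowed_surfaces: str              # Interfaces the agent was permitted to use
    operator_intervention_policy: str  # What the operator could do; how it is recorded
    declared_deviations: str           # Departures from the default procedure

def field_count() -> int:
    """Number of disclosure fields in the block."""
    return len(fields(RunConditions))
```

Treating the block as a fixed-shape record is the point: every report carries the same six fields, so a missing field is itself a signal.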
Finding attribution
Findings distinguish between root causes (for example, service-attributable friction versus harness constraints, environment artifacts, or model-specific behavior) so that not every observed problem is attributed to the service under test.
Analytic framework
Analysis Dimensions
How we organize findings. Seven dimensions structure how we analyze developer-facing products. Each finding in a published audit maps to one or more of these dimensions, making it easier to see where friction concentrates and where a service performs well.
01
Discoverability
How easily an agent can find the right starting points, references, and workflow hints.
02
Documentation quality
Whether the docs are accurate, current, machine-accessible, and sufficient to proceed.
03
Onboarding friction
How hard it is to obtain credentials, configure access, and reach a usable starting state.
04
API coherence
Whether the service's mental model, naming, and workflow structure are internally consistent.
05
Failure transparency
Whether errors are clear enough to support recovery.
06
Output reliability
Whether the service returns stable, interpretable outputs the agent can act on confidently.
07
Agent-native interface availability
Whether the service offers machine-oriented surfaces such as llms.txt, MCP, or other agent-specific affordances.
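Since each finding maps to one or more of the seven dimensions, tagging can be sketched as a validated lookup. The identifiers and the `tag_finding` helper are illustrative assumptions, not the published report format.

```python
# Hypothetical tagging of findings against the seven analysis dimensions.
DIMENSIONS = {
    "discoverability",
    "documentation_quality",
    "onboarding_friction",
    "api_coherence",
    "failure_transparency",
    "output_reliability",
    "agent_native_interfaces",
}

def tag_finding(summary: str, dimensions: set) -> dict:
    """Attach validated dimension tags to a finding summary."""
    unknown = dimensions - DIMENSIONS
    if unknown:
        raise ValueError(f"unknown dimensions: {sorted(unknown)}")
    return {"summary": summary, "dimensions": sorted(dimensions)}
```

Allowing multiple tags per finding is what makes concentration visible: if most findings for a service carry the same tag, that dimension is where friction clusters.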
Reading the results
Interpretation and Validity
How to read the results. Agent behavior changes as models and services evolve. These reports are useful as evidence-backed case studies within their stated scope, not as permanent verdicts.
Temporal validity
Every report is published with the test date and model version. Results become less representative as services update their APIs, docs, and onboarding — and as models themselves evolve.
Re-audits are expected over time. Updated reports will be published alongside the originals, not as silent replacements.
What launch results do and do not claim
They are: evidence-backed case studies of real agent interactions with real developer-facing products, structured enough to reveal patterns across services.
They are not: a benchmark leaderboard, a permanent verdict, or a claim of perfectly identical test conditions across all services.