Methodology
How DX Audit runs agent-based audits of developer-facing products, and how to read the resulting reports.
This is the framework behind our published audits and client baseline audits. We use observed agent behavior as evidence — when an agent struggles to discover documentation, obtain credentials, or recover from an error, that friction is real and can affect human developers too. The published audits are an early evidence-backed set of case studies, not a benchmark leaderboard.
Launch Posture
- Real audits with real agents, not hypothetical scoring rubrics
- Visible disclosure in every published report
- Still refining the methodology through real audits
- Qualitative findings first; numerical scoring is deferred
Audit process
Standard Task Suite
What we test. The baseline audit for a new service follows a six-stage workflow. Each stage tests a distinct aspect of the developer experience. The suite is standardized enough to support pattern-finding across services, while leaving room for service-specific task design within each stage. In client engagements, we align on key outcomes and design task suites around the workflows that matter most — and can complement agent audits with moderated usability tests with human developers.
Discover
Can the agent find the relevant docs, references, and entry points?
Onboard
Can it obtain access, credentials, and the minimum setup needed to begin?
Core task
Can it complete the representative workflow the service is supposed to support?
Error handling
When the first attempt fails, can it recover using the available signals?
Cleanup / offboard
Can it identify and clean up the resources it created?
Reflection
What friction did the agent identify after completing the workflow?
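The six stages above can be sketched as a simple ordered structure. This is an illustrative representation only; the stage identifiers and data shape are assumptions, not the audit tooling's actual schema.

```python
# Hypothetical sketch of the six-stage baseline task suite.
# Stage questions follow the methodology text; identifiers are illustrative.
STAGES = [
    ("discover", "Can the agent find the relevant docs, references, and entry points?"),
    ("onboard", "Can it obtain access, credentials, and the minimum setup needed to begin?"),
    ("core_task", "Can it complete the representative workflow the service is supposed to support?"),
    ("error_handling", "When the first attempt fails, can it recover using the available signals?"),
    ("cleanup", "Can it identify and clean up the resources it created?"),
    ("reflection", "What friction did the agent identify after completing the workflow?"),
]

def stage_names():
    """Return the ordered stage identifiers."""
    return [name for name, _ in STAGES]
```

The ordering matters: each stage builds on the state left by the previous one, which is why cleanup and reflection come last.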
Note
The task suite provides a common structure across audits, not benchmark-grade comparability. Each service audit adapts the core task to the service's primary workflow. Differences in task scope are disclosed in the report's Run Conditions.
Agent and evidence
Model Runs and Evidence Policy
How evidence is gathered. Audits are run with real AI agents. Each launch report includes runs from two models to capture behavioral variation. Divergences between runs are recorded rather than flattened away.
Agent + Models
- Harness
- Claude Code
- Launch models
- Opus 4.6 and Sonnet 4.6
- Run strategy
- Each service is audited with both models. One published report per service combines both runs, with divergences noted inline.
Evidence Model
- Reader-first layer
- Report findings and recommendations are the primary reader-facing artifact.
- Supporting evidence
- Evidence remains visible through excerpts, session timelines, and transcript links.
- Model divergences
- Differences between model runs are preserved when they materially affect interpretation.
Launch status
Launch reports use Claude models only. Codex and potentially other agent harnesses will be added in future audit cycles. Running the same service audit across multiple agent families will help distinguish service-attributable friction from model-specific behavior.
Disclosure contract
Run Conditions
What we disclose. Every published launch report includes the same six-field disclosure block. These fields document the exact conditions under which the audit was run, so readers can assess what is service-attributable and what is an artifact of the test environment.
Run Conditions
Present in every published report, with field labels matching exactly.
- Starting state
- What existed before the run that materially affects interpretation.
- Fixture policy
- What was prepared in advance, if anything, and why.
- Credential timing
- When credentials became available and how they were provided.
- Allowed surfaces
- Which interfaces the agent was permitted to use, and what the harness constrained.
- Operator intervention policy
- What the operator could do and how interventions are recorded.
- Declared deviations from baseline
- Any service-, harness-, environment-, or operator-specific departures from the default procedure.
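The six-field disclosure block can be sketched as a record type. Field names mirror the labels above, but the class itself is hypothetical; published reports present these fields as prose, not as structured data.

```python
from dataclasses import dataclass, fields

@dataclass
class RunConditions:
    """Hypothetical record mirroring the six disclosure fields above."""
    starting_state: str                # What existed before the run
    fixture_policy: str                # What was prepared in advance, and why
    credential_timing: str             # When and how credentials were provided
    allowed_surfaces: str              # Interfaces the agent was permitted to use
    operator_intervention_policy: str  # What the operator could do; how it is recorded
    declared_deviations: str           # Departures from the default procedure

def field_count() -> int:
    """Number of disclosure fields in the block."""
    return len(fields(RunConditions))
```

Treating the block as a fixed-shape record is the point: every report carries the same six fields, so a missing field is itself a signal.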
Finding attribution
Findings distinguish between root causes (for example, service-attributable friction versus harness constraints, environment artifacts, or model-specific behavior) so that not every observed problem is attributed to the service under test.
Analytic framework
Analysis Dimensions
How we organize findings. Seven dimensions structure how we analyze developer-facing products. Each finding in a published audit maps to one or more of these dimensions, making it easier to see where friction concentrates and where a service performs well.
01
Discoverability
How easily an agent can find the right starting points, references, and workflow hints.
02
Documentation quality
Whether the docs are accurate, current, machine-accessible, and sufficient to proceed.
03
Onboarding friction
How hard it is to obtain credentials, configure access, and reach a usable starting state.
04
API coherence
Whether the service's mental model, naming, and workflow structure are internally consistent.
05
Failure transparency
Whether errors are clear enough to support recovery.
06
Output reliability
Whether the service returns stable, interpretable outputs the agent can act on confidently.
07
Agent-native interface availability
Whether the service offers machine-oriented surfaces such as llms.txt, MCP, or other agent-specific affordances.
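Since each finding maps to one or more of the seven dimensions, tagging can be sketched as a validated lookup. The identifiers and the `tag_finding` helper are illustrative assumptions, not the published report format.

```python
# Hypothetical tagging of findings against the seven analysis dimensions.
DIMENSIONS = {
    "discoverability",
    "documentation_quality",
    "onboarding_friction",
    "api_coherence",
    "failure_transparency",
    "output_reliability",
    "agent_native_interfaces",
}

def tag_finding(summary: str, dimensions: set) -> dict:
    """Attach validated dimension tags to a finding summary."""
    unknown = dimensions - DIMENSIONS
    if unknown:
        raise ValueError(f"unknown dimensions: {sorted(unknown)}")
    return {"summary": summary, "dimensions": sorted(dimensions)}
```

Allowing multiple tags per finding is what makes concentration visible: if most findings for a service carry the same tag, that dimension is where friction clusters.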
Reading the results
Interpretation and Validity
How to read the results. Agent behavior changes as models and services evolve. These reports are useful as evidence-backed case studies within their stated scope, not as permanent verdicts.
Temporal validity
Every report is published with the test date and model version. Results become less representative as services update their APIs, docs, and onboarding — and as models themselves evolve.
Re-audits are expected over time. Updated reports will be published alongside the originals, not as silent replacements.
What launch results do and do not claim
They are: evidence-backed case studies of real agent interactions with real developer-facing products, structured enough to reveal patterns across services.
They are not: a benchmark leaderboard, a permanent verdict, or a claim of perfectly identical test conditions across all services.