Agent Usability Test

GitHub — Universal Baseline

Service: GitHub
Suite: Universal Baseline
Date: 2026-03-11
Agent CLI: Claude Code v2.1.72
Models: Opus 4.6, Sonnet 4.6
Status: Pending review

Executive Summary

GitHub's developer platform was tested against the universal baseline task suite: discover the service, authenticate with a scoped token, complete a branch-commit-PR workflow, handle errors, and clean up test data. The audit was run twice — once with Opus 4.6 and once with Sonnet 4.6 — using the same task prompts against a private test repository with a fine-grained personal access token.

Top Strengths

  • The full issue-branch-commit-PR-merge workflow completed in six steps (seven API calls) with zero retries (both models; F-001)
  • Error responses were consistently structured with message, documentation_url, and errors array across all four error scenarios (both models; F-002)
  • llms.txt provided an effective entry point for discovering GitHub's integration surfaces (both models; F-009)

Top Issues

  • Branch creation required two API calls — a SHA lookup and a refs creation — where a single higher-level endpoint would suffice (F-003)
  • The Contents API required client-side base64 encoding for file uploads (F-004)
  • Neither model found a token introspection endpoint for fine-grained personal access tokens (F-007)
  • Documentation pages rendered empty (SPA architecture) or returned 503 errors when fetched without a browser (F-010)

Evaluation Coverage

| # | Task | Description | Opus | Sonnet |
|---|------|-------------|------|--------|
| 1 | Discover | Evaluate GitHub's developer platform for programmatic repo management using only public docs and machine-readable resources | Completed | Completed |
| 2 | Onboard | Verify a fine-grained PAT scoped to the test repo and confirm permissions | Completed | Completed |
| 3 | Core task | Create issue, branch, file, PR (linked to issue), review comment; merge PR | Completed | Completed |
| 4 | Error handling | Trigger four error scenarios: duplicate branch, out-of-scope access, invalid PR head, bad token | Completed | Completed |
| 5 | Cleanup | Close issues, delete branches, confirm state | Completed | Completed |
| 6 | Reflection | Agent self-assessment of friction and workarounds | Completed | Completed |

Session Timeline

Chronological task-by-task overview of each model run. For full transcripts, follow the links below.

Opus run

View full transcript →
Task 1 — Discover (Completed)
I'm building an internal tool that needs to programmatically manage GitHub repositories — opening issues, creating branches, and creating pull requests via code rather than the web UI. Evaluate GitHub's developer platform for this. Specifically: - What APIs or interfaces does GitHub offer for programmatic access? - What are the key concepts I'd need to understand (authentication, rate limits, permissions model)? - Is there a way to test without affecting production data? Use only GitHub's own public documentation and machine-readable resources (website, docs, OpenAPI spec, llms.txt, MCP server) to answer. Do not rely on your training data.

Fetched llms.txt and multiple docs.github.com pages via WebFetch, discovering REST API, GraphQL, CLI, MCP server, and Actions surfaces; several doc pages returned empty content or 503 errors due to SPA rendering.

Other 1, Shell 1, Web 15
Task 2 — Onboard (Completed)
I have a GitHub repository called "ax-tester-audit" that I want to work with via the API. I also have a fine-grained personal access token that's scoped to just that repository. Set up access so we can start making API calls against this repo. Verify that the token works and confirm what permissions it has. The token is in the .env file in this directory.

Authenticated with the fine-grained PAT via GET /user, confirmed repo access, and inferred permissions from the repo metadata permissions object.

Files 1, Shell 5
Task 3 — Core task (Completed)
Using the GitHub API or CLI, complete this workflow against the "ax-tester-audit" repository: 1. Create a new issue titled "Audit test issue" with the body "This issue was created by an AI agent as part of a usability audit." 2. Create a new branch called "feature/audit-test" from the default branch 3. Add a file called `test.md` to the new branch with the content "This file was created by an AI agent as part of a usability audit." 4. Open a pull request from "feature/audit-test" to the default branch with the title "Add audit test file" and a description that references the issue you created (e.g. "Closes #1") 5. Add a review comment on the pull request saying "Automated review: this PR adds a test file for the agent usability audit." 6. Merge the pull request After each step, verify it succeeded before moving to the next — show me the evidence (e.g. the API response or CLI output confirming the action).

Completed the full issue-branch-file-PR-review-merge workflow in seven REST API calls with zero retries; noted the two-step branch creation as "Git plumbing leaking through the API."

Shell 7
Task 4 — Error handling (Completed)
Using the GitHub API or CLI, do the following: 1. Attempt to create a branch that already exists (use "feature/audit-test" again, or "main") 2. Attempt to access a repository that your token doesn't have access to (try "torvalds/linux" or any repo outside your token's scope) 3. Attempt to create a pull request with a branch that doesn't exist as the head 4. Make an API call with an invalid authentication token For each error: show me the full error response, explain whether the error message gave you enough information to understand what went wrong, and describe what you would do to recover.

Tested all four error scenarios (duplicate branch, out-of-scope access, nonexistent PR head, invalid token); discovered that scoped tokens can still read public repos outside their scope.

Shell 5
Task 5 — Cleanup (Completed)
Clean up everything we created during this test in the "ax-tester-audit" repository: 1. Close any open issues that were created during the audit 2. Delete the "feature/audit-test" branch if it still exists 3. List what's left in the repo to confirm cleanup is complete Don't delete the repository itself — just clean up the test data within it. If any cleanup can't be done via the API or CLI, tell me what manual steps would be needed.

Deleted the branch via DELETE /git/refs; confirmed that issues and PRs cannot be deleted via the API, only closed.

Shell 3
Task 6 — Reflection (Completed)
Looking back at everything you just did — discovering GitHub's developer platform, authenticating with a scoped token, completing the issue/branch/PR workflow, handling errors, and cleaning up: 1. Which steps were straightforward and which felt unnecessarily complex? 2. Where did you have to work around something rather than use what felt like the intended path? 3. If you used multiple interfaces (REST API, GraphQL, CLI), which worked best and why? 4. What would have made this easier? Be specific — refer to actual moments from the tasks above.

Reflected on the session, highlighting the mature REST API design and identifying branch creation complexity, base64 encoding friction, and missing token introspection as primary pain points.

Sonnet run

View full transcript →
Task 1 — Discover (Completed)
I'm building an internal tool that needs to programmatically manage GitHub repositories — opening issues, creating branches, and creating pull requests via code rather than the web UI. Evaluate GitHub's developer platform for this. Specifically: - What APIs or interfaces does GitHub offer for programmatic access? - What are the key concepts I'd need to understand (authentication, rate limits, permissions model)? - Is there a way to test without affecting production data? Use only GitHub's own public documentation and machine-readable resources (website, docs, OpenAPI spec, llms.txt, MCP server) to answer. Do not rely on your training data.

Spawned three parallel sub-agents for discovery research, covering APIs/docs, OpenAPI/MCP integration, and testing environments; found llms.txt and the programmatic docs APIs, and noted the absence of a sandbox environment.

Other 3, Shell 3, Subagent 3, Web 41
Task 2 — Onboard (Completed)
I have a GitHub repository called "ax-tester-audit" that I want to work with via the API. I also have a fine-grained personal access token that's scoped to just that repository. Set up access so we can start making API calls against this repo. Verify that the token works and confirm what permissions it has. The token is in the .env file in this directory.

Authenticated with the fine-grained PAT and probed five individual endpoints to discover permissions, surfacing the actions (403) permission gap that Opus missed.

Files 1, Shell 5
Task 3 — Core task (Completed)
Using the GitHub API or CLI, complete this workflow against the "ax-tester-audit" repository: 1. Create a new issue titled "Audit test issue" with the body "This issue was created by an AI agent as part of a usability audit." 2. Create a new branch called "feature/audit-test" from the default branch 3. Add a file called `test.md` to the new branch with the content "This file was created by an AI agent as part of a usability audit." 4. Open a pull request from "feature/audit-test" to the default branch with the title "Add audit test file" and a description that references the issue you created (e.g. "Closes #1") 5. Add a review comment on the pull request saying "Automated review: this PR adds a test file for the agent usability audit." 6. Merge the pull request After each step, verify it succeeded before moving to the next — show me the evidence (e.g. the API response or CLI output confirming the action).

Completed the full issue-branch-file-PR-review-merge workflow in seven REST API calls with zero retries, using identical two-step branch creation and base64 file encoding patterns as Opus.

Shell 7
Task 4 — Error handling (Completed)
Using the GitHub API or CLI, do the following: 1. Attempt to create a branch that already exists (use "feature/audit-test" again, or "main") 2. Attempt to access a repository that your token doesn't have access to (try "torvalds/linux" or any repo outside your token's scope) 3. Attempt to create a pull request with a branch that doesn't exist as the head 4. Make an API call with an invalid authentication token For each error: show me the full error response, explain whether the error message gave you enough information to understand what went wrong, and describe what you would do to recover.

Tested all four error scenarios; independently discovered the scoped-token public-repo read behavior and pivoted to testing writes to confirm scope enforcement.

Shell 5
Task 5 — Cleanup (Completed)
Clean up everything we created during this test in the "ax-tester-audit" repository: 1. Close any open issues that were created during the audit 2. Delete the "feature/audit-test" branch if it still exists 3. List what's left in the repo to confirm cleanup is complete Don't delete the repository itself — just clean up the test data within it. If any cleanup can't be done via the API or CLI, tell me what manual steps would be needed.

Deleted the branch and confirmed that issues and PRs are permanent API artifacts that can only be closed, not deleted.

Shell 3
Task 6 — Reflection (Completed)
Looking back at everything you just did — discovering GitHub's developer platform, authenticating with a scoped token, completing the issue/branch/PR workflow, handling errors, and cleaning up: 1. Which steps were straightforward and which felt unnecessarily complex? 2. Where did you have to work around something rather than use what felt like the intended path? 3. If you used multiple interfaces (REST API, GraphQL, CLI), which worked best and why? 4. What would have made this easier? Be specific — refer to actual moments from the tasks above.

Reflected on the session, highlighting the GitHub MCP server as the most natural agent interface and noting the same branch creation and token introspection friction as Opus.

Findings

API Workflow and Responses

F-001 Positive

Core workflow completed in six sequential steps (seven API calls) with zero retries

Both models created an issue, branch, file, pull request, review comment, and merge in six workflow steps (seven REST API calls — branch creation required two). Each response included the identifiers needed for the next step — the issue number was referenced in the PR body ("Closes #1"), the main branch SHA was passed to the refs endpoint, and the PR number was used for the review and merge calls. The merge auto-closed the linked issue.

Neither model made a wrong API call or needed to retry any step. The URL structure (/repos/{owner}/{repo}/{resource}) was predictable, and the JSON payloads required no undocumented fields.
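That predictability is visible when every call in the workflow is composed from a single base path. A sketch with placeholders (OWNER, REPO, and the PR number N are not real values; nothing is sent over the network):

```shell
# All seven calls share the /repos/{owner}/{repo}/{resource} shape.
# OWNER, REPO, and N are placeholders, not values from the audit.
API="https://api.github.com/repos/OWNER/REPO"
echo "POST $API/issues"              # 1.  create issue
echo "GET  $API/git/ref/heads/main"  # 2a. look up HEAD SHA
echo "POST $API/git/refs"            # 2b. create branch
echo "PUT  $API/contents/test.md"    # 3.  add file
echo "POST $API/pulls"               # 4.  open pull request
echo "POST $API/pulls/N/reviews"     # 5.  review comment
echo "PUT  $API/pulls/N/merge"       # 6.  merge
```

An agent that has seen one endpoint can guess the rest, which is part of why neither run needed a retry.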

Evidence Summary

Core workflow — seven REST API calls with id chaining

Both models completed the issue → branch → file → PR → review → merge sequence via REST API calls. Each step's response provided the identifiers needed for the next — the issue number for the PR body link, the main branch SHA for the ref creation, the PR number for review and merge calls. No wrong API calls or retries occurred in either run.

F-002 Positive

Error responses were consistently structured across all four scenarios

Four error scenarios were tested with both models: duplicate branch (422), out-of-scope write (403), PR with nonexistent head branch (422), and invalid token (401). Every response included a message field, a documentation_url linking to the relevant API docs, and a status field. Validation errors included an errors array with resource, field, and code.

The duplicate branch error returned "Reference already exists" — immediately actionable. The out-of-scope write returned "Resource not accessible by personal access token" — clear about the cause. The invalid token returned "Bad credentials" — unambiguous.

Both models recovered from or analysed each error without human intervention.
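This structure makes error handling mechanical for an agent. A local sketch with a reconstructed duplicate-branch body (field values as reported above; the documentation_url is shortened, not the exact link the API returns), assuming jq is available:

```shell
# Reconstructed 422 body for the duplicate-branch scenario; the shape
# (message, documentation_url, status) is as described in this finding.
ERR='{"message":"Reference already exists","documentation_url":"https://docs.github.com/rest","status":"422"}'
printf '%s' "$ERR" | jq -r '.message'            # -> Reference already exists
printf '%s' "$ERR" | jq -r '.documentation_url'  # -> https://docs.github.com/rest
```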

Model divergence: Both models noted that the PR validation error said "code": "invalid" for the head field without specifying why. "Branch not found" would be more actionable than "invalid", since the same code could apply to other head-related problems.
Evidence Summary

Error response structure across four scenarios

Four error scenarios were tested. The duplicate branch returned 422 with "Reference already exists." The out-of-scope write returned 403 with "Resource not accessible by personal access token." The nonexistent branch PR returned 422 with a structured errors array identifying the head field as invalid but without explaining the specific reason. The invalid token returned 401 with "Bad credentials." Both models analysed and recovered from all four errors without human intervention.

F-003 Minor

Branch creation required two API calls where one would suffice

Creating a branch via the REST API required: (1) GET /repos/{owner}/{repo}/git/ref/heads/main to retrieve the HEAD commit SHA, then (2) POST /repos/{owner}/{repo}/git/refs with the full refs/heads/ prefix and the SHA.
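The two-step pattern can be sketched offline. The ref object below is a stand-in for what the live GET returns (the SHA is an example value, not one from the audit), and the final jq call builds the body a live POST /git/refs would need; jq is assumed available:

```shell
# Step 1 (simulated): GET /repos/{owner}/{repo}/git/ref/heads/main returns
# a ref object whose commit SHA sits at .object.sha.
REF='{"ref":"refs/heads/main","object":{"type":"commit","sha":"6dcb09b5b57875f334f61aebed695e2e4193db5e"}}'
SHA=$(printf '%s' "$REF" | jq -r '.object.sha')

# Step 2: the POST /git/refs body must carry the full refs/heads/ prefix.
jq -nc --arg sha "$SHA" '{ref:"refs/heads/feature/audit-test", sha:$sha}'
```

The SHA lookup and the refs/heads/ prefix are exactly the pieces a higher-level endpoint would hide.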

Both models executed this correctly on the first attempt and both flagged it independently in their reflections. Opus described it as "the Git plumbing leaking through the API" and suggested a POST /repos/{owner}/{repo}/branches endpoint accepting {"name": "feature/audit-test", "from": "main"}. Sonnet made the same suggestion.

The friction is conceptual rather than functional — understanding Git refs and SHAs is required for the most common branching operation.

Evidence Summary

Branch creation — two-step SHA lookup and ref creation

Both models executed the same two-step pattern: GET /repos/.../git/ref/heads/main to retrieve the HEAD commit SHA, then POST /repos/.../git/refs with the full refs/heads/ prefix. Both noted this as "Git plumbing leaking through the API" and suggested a single POST /repos/.../branches endpoint in their reflections.

Recommendation

Add a higher-level branch creation endpoint. POST /repos/{owner}/{repo}/branches accepting {"name": "...", "from": "main"} would eliminate the SHA lookup and refs/heads/ prefix requirements. The current Git refs API would remain available for advanced use cases.

F-004 Minor

Contents API required client-side base64 encoding

The file creation endpoint (PUT /repos/{owner}/{repo}/contents/{path}) required the content to be base64-encoded in the JSON payload. Both models handled this with echo -n "..." | base64. Opus noted that omitting the -n flag would add a trailing newline that corrupts the encoding.
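The pitfall Opus noted is easy to reproduce locally:

```shell
# printf '%s' (or echo -n) yields the intended payload; plain echo appends
# a newline that silently changes the encoded content.
printf '%s' "hello" | base64   # -> aGVsbG8=
echo "hello" | base64          # -> aGVsbG8K  (trailing \n encoded)
```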

Both models suggested a raw content upload option as an improvement. The current approach requires every API consumer to implement base64 encoding correctly — a mechanical step that adds no value for simple file writes.

The Contents API also supports only one file per commit. Multi-file commits require the lower-level Git Trees/Commits/Refs pipeline (three additional calls).

Evidence Summary

File creation — base64 encoding via Contents API

Both models used echo -n "..." | base64 to encode the file content, then passed it to PUT /repos/.../contents/{path}. Opus noted that omitting the -n flag would add a trailing newline that corrupts the encoding. Both produced identical file content SHAs, confirming correct encoding.

Recommendation

Support raw content in the Contents API. A Content-Type: text/plain option or multipart upload for PUT /repos/{owner}/{repo}/contents/{path} would eliminate the client-side base64 encoding requirement for simple file writes.

F-005 Minor

Three overlapping PR comment mechanisms

Opus noted during Task 3 that GitHub has three different comment endpoints for pull requests: issue comments (/issues/{n}/comments), line-level review comments (/pulls/{n}/comments), and PR reviews (/pulls/{n}/reviews). The task asked for "a review comment" — Opus had to reason about which of the three was contextually correct before choosing the reviews endpoint with "event": "COMMENT".

Sonnet used the same endpoint without flagging the ambiguity. The naming overlap between these three mechanisms is a source of confusion for automated consumers that cannot ask a human which one to use.

Model divergence: Opus flagged the ambiguity and reasoned about which endpoint to use. Sonnet used the same endpoint without noting the issue.

Authentication and Token Scope

F-006 Positive

Authentication was immediate and verification was straightforward

Both models read the token from .env, passed it as a Bearer header, and confirmed identity with GET /user on the first attempt. The repo endpoint returned a permissions object showing role-level access (admin, push, pull, etc.), which both models used to confirm the token worked.
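A minimal sketch of that verification step, assuming the .env file stores the token under a GITHUB_TOKEN key (the variable name and token value are illustrative assumptions; the report does not state them):

```shell
# Stand-in for the line read from .env (the audit's real token is not shown).
ENV_LINE='GITHUB_TOKEN=github_pat_EXAMPLE'
GITHUB_TOKEN="${ENV_LINE#GITHUB_TOKEN=}"

# Build the header; the live call would then be:
#   curl -H "$AUTH" https://api.github.com/user
AUTH="Authorization: Bearer $GITHUB_TOKEN"
echo "$AUTH"   # -> Authorization: Bearer github_pat_EXAMPLE
```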

Sonnet additionally checked the github-authentication-token-expiration response header, which showed the token's expiry date.

Model divergence: Sonnet additionally checked the github-authentication-token-expiration response header.
Evidence Summary

Token verification and identity confirmation

Both models read the token from .env, passed it as a Bearer header, and confirmed identity via GET /user on the first attempt. The repo endpoint returned a permissions object showing role-level access. Sonnet additionally checked the github-authentication-token-expiration response header for token expiry.

F-007 Minor

No introspection endpoint for fine-grained PAT permissions

Neither model found an API endpoint that returns the specific permissions granted to a fine-grained personal access token. The permissions object on the repo response shows role-level access (admin: true, push: true) but not the fine-grained grants (issues: write, contents: write).

Opus inferred permissions from the repo response. Sonnet probed five endpoints individually by checking HTTP status codes — discovering that actions returned 403 while issues, pulls, branches, and contents returned 200. Both noted that classic tokens have x-oauth-scopes in response headers, but fine-grained PATs leave that header empty.
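Sonnet's workaround amounts to issuing one probe per endpoint and classifying the status code. A toy sketch with the statuses reported above hardcoded in place of live GETs (the function and endpoint names are illustrative):

```shell
# Stand-in for per-endpoint HTTP statuses; a live version would issue a
# real GET per endpoint and capture the status with curl -o /dev/null -w '%{http_code}'.
probe() {
  case "$1" in
    issues|pulls|branches|contents) echo 200 ;;
    actions) echo 403 ;;
    *) echo 404 ;;
  esac
}
for ep in issues pulls branches contents actions; do
  if [ "$(probe "$ep")" = 200 ]; then
    echo "$ep: granted"
  else
    echo "$ep: not granted ($(probe "$ep"))"
  fi
done
```

Five round trips to learn what one introspection endpoint could state directly.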

Both models suggested a GET /user/token or similar introspection endpoint in their reflections.

Model divergence: Opus inferred permissions from the repo response. Sonnet probed five endpoints individually by checking HTTP status codes.
Evidence Summary

Permission discovery — two contrasting approaches

Opus inferred permissions from the role-level permissions object on the repo response. Sonnet probed five individual endpoints by checking HTTP status codes — issues, pulls, branches, and contents returned 200, while actions/runs returned 403. Both noted that classic tokens expose x-oauth-scopes in response headers, but fine-grained PATs leave that header empty.

Recommendation

Add a token introspection endpoint for fine-grained PATs. Something like GET /user/token returning the granted permissions, scoped repositories, and expiry. This would replace the current workaround of probing endpoints individually.

F-008 Minor

Scoped token could still read all public repos

Both models were surprised when a fine-grained PAT scoped to neetso/ax-tester-audit returned 200 OK for GET /repos/torvalds/linux. Both independently pivoted to testing a write operation (correctly returned 403) and a read against a nonexistent private repo (returned 404 — the expected ambiguous response for security).

Opus noted: "Reads on public repos succeed even with a fine-grained PAT scoped to a different repo — the token scope only blocks writes." Sonnet added: "Either the scoping should apply to reads too, or the token configuration UI should make this behavior explicit."

The behaviour is by design, but it is not surfaced in the token configuration UI or error responses.
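The boundary both models mapped out can be summarised as a small decision table (a descriptive sketch of what this audit observed, not GitHub's actual authorisation logic):

```shell
# expected_status VISIBILITY OPERATION prints the status this audit observed
# for a fine-grained PAT scoped to a single, different repository.
expected_status() {
  case "$1/$2" in
    public/read)  echo 200 ;;  # public reads succeed regardless of scope
    public/write) echo 403 ;;  # scope blocks the write
    private/read) echo 404 ;;  # deliberately ambiguous (no existence leak)
  esac
}
expected_status public read    # -> 200
expected_status public write   # -> 403
```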

Evidence Summary

Scoped token reading a public repo outside its scope

Both models tested GET /repos/torvalds/linux with their scoped token and received 200 OK with full repo metadata. Both independently pivoted to testing a write operation (correctly returned 403) and noted the surprise: "scoped to one repo" does not prevent reads on public repositories.

Recommendation

Surface scoped token behaviour in the configuration UI. The fine-grained PAT configuration UI could note that repo-scoped tokens can still read all public repositories. Alternatively, document this behaviour in the fine-grained PAT docs.

Documentation and Discovery

F-009 Positive

llms.txt provided effective entry-point discovery

Both models found and used github.com/llms.txt. It provided a structured overview of GitHub's five integration surfaces (REST API, GraphQL API, CLI, MCP server, GitHub Actions) with categorised links to documentation. Sonnet described it as "clearly designed for exactly this kind of machine consumption."

The file also listed programmatic documentation APIs (/api/pagelist, /api/article, /api/search) that Sonnet's sub-agents used as supplementary sources. Neither model found llms-full.txt (Sonnet checked; 404).

The OpenAPI spec exists in a separate repository (github/rest-api-description) but the JSON file exceeded 10MB and could not be fetched directly by the agent's tools.

Evidence Summary

Fetching llms.txt — discovery entry point

Both models fetched github.com/llms.txt and found a structured overview of GitHub's five integration surfaces. The file listed programmatic docs APIs (/api/pagelist, /api/article, /api/search) that Sonnet's sub-agents used as supplementary sources. Neither model found llms-full.txt (404).

F-010 Minor

Documentation pages rendered empty or returned 503

Multiple docs.github.com pages returned empty content or 503 errors when fetched. The pages are JavaScript-rendered SPAs — the documentation content does not appear in the initial HTML response.

Opus hit 503 errors on the branches endpoint page, the OpenAPI description page, and the scripting guide. Sonnet's sub-agents identified the same pattern and attributed it to JavaScript rendering.

Both models fell back on llms.txt and training data to fill the gaps. The llms.txt file and GitHub's programmatic docs APIs partially compensated, but the endpoint-level documentation that agents need for implementation was often inaccessible via direct page fetch.

Recommendation

Ensure docs pages are accessible without JavaScript rendering. The programmatic docs APIs (/api/pagelist, /api/article) and llms.txt partially compensate, but endpoint-level documentation often requires the full SPA to render. Pre-rendered or static alternatives would make the docs accessible to agents that fetch pages without a browser.

Test Data Lifecycle

F-011 Minor

Issues and PRs cannot be deleted via the API

After the PR was merged and the issue auto-closed, both models correctly identified that these cannot be deleted via the API — only closed. The merged file (test.md) is permanent in the commit history on main. The feature/audit-test branch was the only artifact that could be fully removed (DELETE /git/refs/heads/... returned 204).

Both models explained these constraints without attempting futile delete operations.

Observer note: GitHub does not offer a sandbox or test environment for API testing (unlike Stripe's test mode). All audit operations ran against production infrastructure. The test repo was created manually as part of pre-audit setup — this is the expected workflow, but it means cleanup is limited by the API's lifecycle capabilities.

Recommendations

  1.

    Add a higher-level branch creation endpoint. POST /repos/{owner}/{repo}/branches accepting {"name": "...", "from": "main"} would eliminate the SHA lookup and refs/heads/ prefix requirements. The current Git refs API would remain available for advanced use cases.

    Addresses: F-003
  2.

    Add a token introspection endpoint for fine-grained PATs. Something like GET /user/token returning the granted permissions, scoped repositories, and expiry. This would replace the current workaround of probing endpoints individually.

    Addresses: F-007
  3.

    Support raw content in the Contents API. A Content-Type: text/plain option or multipart upload for PUT /repos/{owner}/{repo}/contents/{path} would eliminate the client-side base64 encoding requirement for simple file writes.

    Addresses: F-004
  4.

    Add specificity to validation error messages. The PR creation error {"field": "head", "code": "invalid"} would be more actionable as {"field": "head", "code": "invalid", "message": "Branch 'feature/does-not-exist' not found"}. The "Bad credentials" error could distinguish between expired, revoked, and malformed tokens.

    Addresses: F-002
  5.

    Surface scoped token behaviour in the configuration UI. The fine-grained PAT configuration UI could note that repo-scoped tokens can still read all public repositories. Alternatively, document this behaviour in the fine-grained PAT docs.

    Addresses: F-008
  6.

    Ensure docs pages are accessible without JavaScript rendering. The programmatic docs APIs (/api/pagelist, /api/article) and llms.txt partially compensate, but endpoint-level documentation often requires the full SPA to render. Pre-rendered or static alternatives would make the docs accessible to agents that fetch pages without a browser.

    Addresses: F-010

Agent Self-Assessments

Each model was asked to reflect on friction points after completing the workflow (Task 6). These are reproduced verbatim. They are inputs to the analysis — findings above were verified independently by the observer.

Opus 4.6 — Verbatim reflection

What was straightforward

Authentication was trivially easy. Authorization: Bearer in a header, and it just worked on the first try. The /user endpoint immediately confirmed identity, and the permissions object on the repo response told me exactly what the token could do.

Creating an issue was the simplest call in the entire workflow -- POST /repos/{owner}/{repo}/issues with a title and body. Clean request, clean response, the returned JSON included the issue number I needed for the next step.

Merging the PR was a single PUT with {"merge_method":"merge"}. The response was three fields: sha, merged: true, message. Perfect.

Deleting the branch -- DELETE /repos/{owner}/{repo}/git/refs/heads/feature/audit-test returned 204 No Content. No fuss.

Error responses were consistently structured (message, documentation_url, status, sometimes errors[]). When I created a duplicate branch, "Reference already exists" told me exactly what happened. When I used a bad token, "Bad credentials" was unambiguous.

Where I worked around things

1. Creating a branch required a two-step dance: first GET /repos/{owner}/{repo}/git/ref/heads/main to retrieve the SHA, then POST /repos/{owner}/{repo}/git/refs with the full refs/heads/ prefix. This is the Git plumbing leaking through the API. A higher-level POST /repos/{owner}/{repo}/branches endpoint that accepts {"name": "feature/audit-test", "from": "main"} would be far more natural.

2. Adding a file to a branch via PUT /repos/{owner}/{repo}/contents/{path} required base64-encoding the content manually. That CONTENT=$(echo -n "..." | base64) step is friction -- every caller has to do it, and it's easy to get wrong (trailing newlines from echo without -n, for instance). The REST API has no way to create a commit touching multiple files in one call either -- for that, you'd need the lower-level Git Trees/Commits/Refs trio, which is even more complex.

3. Token permission discovery: There is no API endpoint that says "here are the permissions this fine-grained PAT was granted." I inferred permissions indirectly by reading the permissions object on the repo response. For classic tokens, x-oauth-scopes appears in response headers, but for fine-grained PATs that header is empty. I had to just try things and see if they worked. A GET /user/token-permissions or similar introspection endpoint would eliminate this guesswork.

4. Testing token scope boundaries: When I tried accessing torvalds/linux to test scope limits, it returned 200 OK with full repo metadata -- because it's a public repo and fine-grained PATs still allow public reads. I had to pivot to testing a write against the out-of-scope repo (which correctly returned 403) and a read against a nonexistent private repo (which returned 404). The behavior is correct but surprised me -- "scoped to one repo" doesn't mean "can only see one repo."

5. PR review comment: GitHub has three different comment mechanisms on PRs: issue comments (POST /issues/{number}/comments), review comments on specific lines (POST /pulls/{number}/comments), and PR reviews (POST /pulls/{number}/reviews). The naming overlap between these is confusing. I used the reviews endpoint with "event": "COMMENT" which was the right call, but I had to reason about which of the three was contextually correct rather than having an obvious single path.

Interface comparison

I used the REST API exclusively (via curl). GraphQL would have been better for the verification step — a single query could have retrieved branches, issues, files, and commits instead of four separate calls. The gh CLI would have been more concise for scripting (gh issue create, gh pr create, gh pr merge), but the TLS certificate verification was failing in the environment. I didn't use GraphQL because every mutation mapped 1:1 to a REST endpoint, and the workflow was write-heavy.

What would have made this easier

1. A higher-level branch creation endpoint — no SHA lookup, no refs/heads/ prefix. 2. Multi-file commit support in the Contents API. 3. Token introspection endpoint for fine-grained PATs. 4. Clearer PR comment taxonomy in endpoint naming and docs.

Overall: GitHub's REST API is mature, consistent, and well-structured for automation. The main friction points are around Git-level operations (branches, multi-file commits) where the API exposes plumbing rather than porcelain. For the specific workflow tested -- issues, branches, files, PRs, reviews, merges -- everything worked on the first attempt with no retries, which is the most important measure of developer experience.

Sonnet 4.6 — Verbatim reflection

What was straightforward

Authentication setup and verification. Sourcing the token from .env, passing it as a Bearer header, and confirming it worked with GET /user was simple. The response immediately showed the authenticated identity, and the permissions block on the repo endpoint gave a clear picture of what the token could do. The fine-grained PAT prefix (github_pat_) made it obvious what kind of credential I was working with.
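The verification step described above is a single authenticated GET. A minimal sketch (request built, not sent; the token value is a placeholder):

```python
import urllib.request

def whoami_request(token: str) -> urllib.request.Request:
    """GET /user with a Bearer header -- confirms the token and shows its identity."""
    return urllib.request.Request(
        "https://api.github.com/user",
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
        },
    )
```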

Creating the issue. A single POST /repos/{owner}/{repo}/issues with a JSON body containing title and body. The response included everything I needed -- issue number, URL, state. No surprises.

Creating the pull request. Same pattern -- POST /repos/{owner}/{repo}/pulls with title, body, head, base. The "Closes #1" keyword in the body automatically linked and later closed the issue on merge. This worked exactly as documented.

Merging the PR. PUT /repos/{owner}/{repo}/pulls/{number}/merge with merge_method -- clean, one call, clear response ("merged": true).

Deleting the branch. DELETE /git/refs/heads/feature/audit-test returned 204. Done.
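The happy path above condenses to a short (method, path, body) sequence. This sketch mirrors what the audit executed with curl; the PR number and merge method are placeholders, not values taken from the run.

```python
def happy_path(owner: str, repo: str) -> list:
    """The issue -> PR -> merge -> cleanup sequence as (method, path, body) tuples."""
    r = f"/repos/{owner}/{repo}"
    return [
        ("POST", f"{r}/issues", {"title": "Audit test", "body": "Created during audit"}),
        ("POST", f"{r}/pulls", {"title": "Audit PR", "body": "Closes #1",
                                "head": "feature/audit-test", "base": "main"}),
        ("PUT", f"{r}/pulls/2/merge", {"merge_method": "squash"}),  # PR number is a placeholder
        ("DELETE", f"{r}/git/refs/heads/feature/audit-test", None),
    ]
```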

Where I worked around something

Token permission discovery. I resorted to probing multiple endpoints and checking HTTP status codes to determine what the token could access. The intended path would be an introspection endpoint like GET /user/installations or a token metadata endpoint that lists granted permissions. This doesn't exist for fine-grained PATs -- it does partially exist for GitHub App installation tokens, which is one reason Apps are recommended for tooling.
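The probing workaround reduces to mapping HTTP statuses back to permission guesses. A sketch of that inference (the probe labels and the mapping are illustrative, not an official API):

```python
def infer_permissions(probes: dict) -> dict:
    """Map probe names ('permission: METHOD path') to granted/denied guesses.

    2xx means the token could perform the probe; 403 means explicitly denied.
    404 is treated as denied here, though it can also mean 'not found'.
    """
    return {name: 200 <= status < 300 for name, status in probes.items()}
```

A real introspection endpoint would return this mapping directly instead of requiring one live request per permission.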

Public repo access with a scoped token. When testing out-of-scope access against torvalds/linux, the GET succeeded with 200 because fine-grained PATs still allow reads on all public repos regardless of scope. I had to then try a write operation to actually trigger the 403. This surprised me -- I expected a repo-scoped token to only be able to access the repos it was scoped to. The scope restriction only applies to writes, which is a meaningful subtlety not surfaced anywhere in the error responses or token configuration UI.
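The read/write asymmetry described above can be captured in a small decision helper. This is a sketch of the reasoning, not GitHub behaviour documentation; the function name and messages are illustrative.

```python
def interpret_probe(method: str, status: int) -> str:
    """Interpret an HTTP status from a scope probe against a fine-grained PAT."""
    if method == "GET":
        if status == 200:
            # Public repos are always readable, so a 200 proves nothing about scope.
            return "readable (inconclusive for scope: public repos are always readable)"
        if status == 404:
            return "not visible (private and out of scope, or nonexistent)"
    else:  # write probes: POST / PUT / PATCH / DELETE
        if status == 403:
            return "out of scope for writes"
        if status in (200, 201, 204):
            return "in scope for writes"
    return f"unexpected status {status}"
```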

Fetching documentation during the research phase. Many docs.github.com pages returned empty or minimal content because they're JavaScript-rendered SPAs. The agents had to fall back to llms.txt, web search, and the GitHub MCP server repo README to get the information that should have been on the docs pages. GitHub's llms.txt and the programmatic docs APIs (/api/pagelist, /api/article) were a genuine help here -- they're clearly designed for exactly this kind of machine consumption.

Interface comparison

I used the REST API exclusively via curl. GraphQL would have been useful for fetching related data in a single call but added complexity without benefit for this CRUD-heavy workflow. The GitHub CLI (gh) would have been easiest for several steps but the goal was to evaluate the API directly. The GitHub MCP Server -- I discovered it exists (official, 76+ tools) but didn't use it for the operations. It would be the most natural interface for an AI agent doing this work, since tools like create_pull_request and merge_pull_request abstract away the HTTP details entirely.

What would have made this easier

1. A token introspection endpoint — replace five probing calls with one.
2. A higher-level "create branch" endpoint — no SHA resolution, no Git refs.
3. Accepting raw content in the Contents API — eliminate base64 encoding.
4. More specific validation error messages — "branch not found" not just "invalid."
5. Consistent scope enforcement — either apply scoping to reads too, or make the behaviour explicit in the token configuration UI.
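Item 3 refers to the encoding step every file upload needs today: the Contents API only accepts base64 in the "content" field. A minimal sketch of building that body (the commit message and branch values are placeholders):

```python
import base64
import json

def contents_body(text: str, branch: str, message: str) -> str:
    """Build the JSON body for PUT /repos/{owner}/{repo}/contents/{path}."""
    return json.dumps({
        "message": message,
        "branch": branch,
        # The API rejects raw text here; content must be base64-encoded.
        "content": base64.b64encode(text.encode("utf-8")).decode("ascii"),
    })
```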

Methodology

Run Conditions

Starting State
Private test repository (neetso/ax-tester-audit) pre-created with README only. Fine-grained PAT pre-scoped to Contents, Issues, Pull Requests (read/write), Metadata (read). Repository deleted and recreated between Opus and Sonnet runs for identical starting conditions. [Partially reconstructed]
Fixture Policy
prepared — Private test repository pre-created with README only. Fine-grained PAT pre-scoped to the test repo. Repo reset between model runs for identical starting conditions. PAT required re-association with the new repo after recreation.
Credential Timing
pre-supplied — GitHub PAT stored in .env before run start for both models.
Allowed Surfaces

  • REST API (curl)
  • Browser: not available in sandbox.
  • gh CLI: unavailable for Opus due to a TLS certificate error (x509: OSStatus -26276) — an environment issue, not a GitHub limitation. Sonnet did not attempt the CLI.

Operator Intervention Policy
standard
Declared Deviations
  • environment: gh CLI unavailable for Opus due to a local TLS certificate error, forcing exclusive use of curl for the REST API.
  • operator: Sonnet was run with the effort setting at medium (not the default), which may have affected verbosity or thoroughness in discovery and reflection tasks.

Notes

  • Two-model run. This report covers two runs of the same agent (Claude Code) with different models: Opus 4.6 and Sonnet 4.6, both on 2026-03-11. Findings consistent across both models are stated once. Divergences are noted inline.
  • Interactive mode. Both runs were conducted interactively — the operator pasted each task prompt into a live Claude Code session. No headless runner was used.
  • Pre-created test repo. The operator created a private repository (neetso/ax-tester-audit) and a fine-grained PAT scoped to it before the audit. This is a deliberate methodology choice: it tests GitHub's API within a realistic least-privilege setup rather than requiring account-wide token access.
  • Repo reset between runs. The test repository was deleted and recreated between the Opus and Sonnet runs to ensure identical starting conditions. The fine-grained PAT required re-association with the new repo after recreation.
  • Zero human interventions. Neither model required human help during any task. The token was provided via .env at setup time.
  • Sonnet effort setting. Sonnet was run with the effort setting at "medium" (not the default). This may have affected verbosity or thoroughness in discovery and reflection tasks.
  • gh CLI unavailable (Opus). Opus encountered a TLS certificate error (x509: OSStatus -26276) when attempting to use the gh CLI during Task 1. This forced exclusive use of curl for the REST API. Sonnet did not attempt the CLI. This is a local environment issue, not a GitHub finding.

Test Configuration

Agent configuration:

  • Models: Claude Opus 4.6, Claude Sonnet 4.6
  • Agent: Claude Code v2.1.72
  • Permission mode: Scoped allowlist (curl, node, python3, gh, WebFetch, WebSearch, Read, Write, Edit, Glob, Grep)
  • Token: Fine-grained PAT scoped to neetso/ax-tester-audit with Contents, Issues, Pull Requests (read/write), Metadata (read)

Environment:

  • Clean working directory with no CLAUDE.md or project files (beyond .env with token)
  • No access to the task prompts, observation notes, or project brief
  • Agent started each task with accumulated context from previous tasks (same session)
  • Test repository: private, initialised with README only

Token Usage

| Metric | Opus | Sonnet |
| --- | --- | --- |
| Input tokens | 3,086 | 2,847 |
| Output tokens | 11,637 | 12,818 |
| Cache creation | 138,118 | 145,522 |
| Cache read | 1,782,531 | 1,448,230 |

Session IDs

  • Opus: d5c81790-0002-4636-a0d1-2c9d3df699a8
  • Sonnet: c42a2457-30cb-45ff-b3d2-ae686bcd2829