Agent Usability Test

Notion — Universal Baseline

Service: Notion
Suite: Universal Baseline
Date: 2026-03-13
Agent CLI: Claude Code v2.1.74
Models: Opus 4.6, Sonnet 4.6
Status: Pending review

Executive Summary

Notion's API was tested against the universal baseline task suite: discover the service, set up an integration, create a database with entries and rich content pages, handle errors, and clean up. The audit was run twice on the same day — once with Opus 4.6 and once with Sonnet 4.6 — using the same task prompts.

Top Strengths

  • Authentication was clean — one token, one header, immediate confirmation (both models). (F-001)
  • Error responses for validation, auth, and not-found scenarios were structured and actionable (both models). (F-008)
  • Both models completed the full workflow (database, entries, query, summary page) despite significant API friction. (F-002, F-006)

Top Issues

  • SDK v5 breaking change: properties parameter silently ignored, initial_data_source not documented in public quickstarts — both models had to reverse-engineer the new API shape from TypeScript declarations F-003
  • Silent select auto-creation: setting a non-existent select option succeeded without error, silently extending the database schema F-004
  • Integration setup instructions from both models pointed to the wrong UI form (public OAuth instead of internal integration) F-007
  • Rich text 2,000-character limit required a chunking workaround for structured data embedding F-005

Evaluation Coverage

1. Discover — Opus: Completed · Sonnet: Completed
   Evaluate Notion's API for managing structured data, databases, and rich content using public documentation and machine-readable resources.
2. Onboard — Opus: Escalated · Sonnet: Escalated
   Set up an internal integration, share a test page, and verify API access.
   Note: Operator created the internal integration, shared the test page, and provided the integration token. The agent directed the operator to the wrong UI form (public OAuth); the operator navigated to the correct form without agent assistance.
3. Core task — Opus: Completed · Sonnet: Completed
   Create a database with typed properties, add entries, query with filters, and create a rich content summary page.
4. Error handling — Opus: Completed · Sonnet: Completed
   Trigger and recover from four error scenarios: invalid select value, unshared page, malformed block, invalid token.
5. Cleanup — Opus: Completed · Sonnet: Completed
   Archive/delete all test data and confirm no artifacts remain.
6. Reflection — Opus: Completed · Sonnet: Completed
   Agent self-assessment of friction and workarounds.

Session Timeline

Chronological task-by-task overview of each model run. For full transcripts, follow the links below.

Opus run

View full transcript →
Task 1 — Discover — Completed
I need a way to programmatically manage a structured knowledge base — creating databases with typed fields, adding entries, querying and filtering them, and building formatted pages with rich content (headings, lists, code blocks). Evaluate Notion as an option. Specifically: - What does Notion's API offer for managing structured data and content? - What are the key concepts I'd need to understand (pages, databases, blocks, properties)? - Is there anything the product can do that the API cannot? - Are there any agent-oriented interfaces (MCP server, CLI, SDK)? Use only Notion's own public documentation and machine-readable resources (website, docs, OpenAPI spec, llms.txt, MCP server) to answer. Do not rely on your training data.

Spawned a sub-agent to fetch Notion's developer docs, llms.txt, API reference, and help pages; identified the REST API, block-based content model, database/data-source split, and MCP server.

Files: 1 · Other: 1 · Shell: 2 · Subagent: 1 · Web: 16
Task 2 — Onboard — Escalated
Set up everything needed to make API calls against a Notion workspace. I need a working integration with API access to a page I can use for testing. Do as much as you can autonomously. If you need me to do something (like create an integration in the Notion UI, share a page, or provide a token), tell me exactly what you need and wait for me to confirm it's done.

Directed the operator to create an internal integration but provided the public OAuth URL; after operator correction and sandbox override, verified API access via users.me() and confirmed read/write on the shared test page.

Files: 7 · Shell: 6

Operator created internal integration, shared test page, and provided integration token after agent escalation; agent provided public OAuth URL instead of internal integration path.

Task 3 — Core task — Completed
Using the Notion API, complete this workflow: 1. Create a database called "Audit Tracker" inside the shared test page, with these properties: - Title (the default title property) - Status (select property with options: "Not started", "In progress", "Done") - Priority (select property with options: "High", "Medium", "Low") - Due date (date property) - Effort (number property) 2. Add three entries to the database: - "Write task prompts" — Status: Done, Priority: High, Due: 2026-03-12, Effort: 2 - "Run Opus audit" — Status: In progress, Priority: High, Due: 2026-03-14, Effort: 4 - "Draft report" — Status: Not started, Priority: Medium, Due: 2026-03-18, Effort: 6 3. Query the database to return only items where Status is not "Done", sorted by Due date ascending 4. Create a page called "Audit Summary" inside the shared test page with the following content: - A heading: "Notion Agent Experience Audit" - A paragraph of introductory text (one sentence is fine) - A bulleted list with three items summarising the database entries - A code block containing the JSON response from your database query in step 3 - A to-do item (checkbox): "Review audit results" After each step, verify it succeeded before moving to the next — show me the evidence (e.g. the API response confirming creation).

First workflow attempt failed due to SDK v5 breaking change (properties → initial_data_source); spawned a sub-agent to reverse-engineer the new API from TypeScript declarations, then completed all four steps including a chunked code block workaround for the 2,000-character rich text limit.

Files: 28 · Other: 1 · Shell: 4 · Subagent: 1 · Web: 5
Task 4 — Error handling — Completed
Using the Notion API, do the following: 1. Attempt to add a database entry with an invalid property value (e.g. set the "Status" select to an option that doesn't exist, like "Cancelled") 2. Attempt to access a page that hasn't been shared with the integration (try any Notion page ID outside the shared test page, or use a made-up page ID) 3. Attempt to create a block with an invalid type or malformed content structure 4. Attempt to make an API call with an invalid or expired integration token For each error: show me the full error response, explain whether the error message gave you enough information to understand what went wrong, and describe what you would do to recover.

Ran four deliberate error scenarios; discovered that invalid select values are silently accepted (auto-creating options), then re-ran the unshared page test with a valid v4 UUID to get the intended 404 response.

Files: 2 · Shell: 2
Task 5 — Cleanup — Completed
Clean up everything we created during this test: 1. Delete (or archive) all entries in the "Audit Tracker" database 2. Delete (or archive) the "Audit Tracker" database itself 3. Delete (or archive) the "Audit Summary" page 4. Confirm the shared test page is clean (no child content remaining) If any of these can't be done via the API, tell me what manual steps would be needed.

Archived entries, deleted the database, and archived the summary page, but three leftover blocks remained from Task 2 verification and a failed Task 3 attempt; a second cleanup pass deleted them.

Files: 2 · Shell: 2
Task 6 — Reflection — Completed
Looking back at everything you just did — discovering Notion's developer platform, setting up integration access, building the database and content, handling errors, and cleaning up: 1. Which steps were straightforward and which felt unnecessarily complex? 2. Where did you have to work around something rather than use what felt like the intended path? 3. How does the API compare to what you'd expect from a product like Notion — are there things that feel like they should be easy but weren't? 4. What would have made this easier? Be specific — refer to actual moments from the tasks above.

Reflected on the audit experience, highlighting SDK docs mismatch as the biggest time sink, the integration setup URL landing on the wrong form, and the lack of full-text search or permanent delete via API.

Sonnet run

View full transcript →
Task 1 — Discover — Completed
I need a way to programmatically manage a structured knowledge base — creating databases with typed fields, adding entries, querying and filtering them, and building formatted pages with rich content (headings, lists, code blocks). Evaluate Notion as an option. Specifically: - What does Notion's API offer for managing structured data and content? - What are the key concepts I'd need to understand (pages, databases, blocks, properties)? - Is there anything the product can do that the API cannot? - Are there any agent-oriented interfaces (MCP server, CLI, SDK)? Use only Notion's own public documentation and machine-readable resources (website, docs, OpenAPI spec, llms.txt, MCP server) to answer. Do not rely on your training data.

Dispatched a sub-agent to fetch 13+ Notion documentation pages including llms.txt; synthesized API capabilities covering the database/data-source split, 24 property types, and MCP server.

Other: 1 · Subagent: 1 · Web: 16
Task 2 — Onboard — Escalated
Set up everything needed to make API calls against a Notion workspace. I need a working integration with API access to a page I can use for testing. Do as much as you can autonomously. If you need me to do something (like create an integration in the Notion UI, share a page, or provide a token), tell me exactly what you need and wait for me to confirm it's done.

Scaffolded the project and directed the operator to create an integration (providing the public OAuth URL); after resolving npm cache permissions, a misplaced .env file (operator error), and a sandbox DNS block, verified API access.

Files: 3 · Shell: 12

Operator created internal integration, shared test page, and provided integration token after agent escalation; agent provided public OAuth URL instead of internal integration path. Operator also corrected .env file extension and page ID format.

Task 3 — Core task — Completed
Using the Notion API, complete this workflow: 1. Create a database called "Audit Tracker" inside the shared test page, with these properties: - Title (the default title property) - Status (select property with options: "Not started", "In progress", "Done") - Priority (select property with options: "High", "Medium", "Low") - Due date (date property) - Effort (number property) 2. Add three entries to the database: - "Write task prompts" — Status: Done, Priority: High, Due: 2026-03-12, Effort: 2 - "Run Opus audit" — Status: In progress, Priority: High, Due: 2026-03-14, Effort: 4 - "Draft report" — Status: Not started, Priority: Medium, Due: 2026-03-18, Effort: 6 3. Query the database to return only items where Status is not "Done", sorted by Due date ascending 4. Create a page called "Audit Summary" inside the shared test page with the following content: - A heading: "Notion Agent Experience Audit" - A paragraph of introductory text (one sentence is fine) - A bulleted list with three items summarising the database entries - A code block containing the JSON response from your database query in step 3 - A to-do item (checkbox): "Review audit results" After each step, verify it succeeded before moving to the next — show me the evidence (e.g. the API response confirming creation).

Hit the SDK v5 breaking change, fetched API reference to confirm the database/data-source split, created probe databases to discover the new response shape and dataSources.query namespace, cleaned up stray databases, then completed the full workflow.

Files: 3 · Other: 1 · Shell: 12 · Subagent: 1 · Web: 1
Task 4 — Error handling — Completed
Using the Notion API, do the following: 1. Attempt to add a database entry with an invalid property value (e.g. set the "Status" select to an option that doesn't exist, like "Cancelled") 2. Attempt to access a page that hasn't been shared with the integration (try any Notion page ID outside the shared test page, or use a made-up page ID) 3. Attempt to create a block with an invalid type or malformed content structure 4. Attempt to make an API call with an invalid or expired integration token For each error: show me the full error response, explain whether the error message gave you enough information to understand what went wrong, and describe what you would do to recover.

Tested all four error scenarios; confirmed silent select auto-creation, needed three attempts for the unshared page test (two rejected on UUID format before getting the intended 404), and received structured errors for malformed block and invalid token.

Files: 1 · Shell: 5
Task 5 — Cleanup — Completed
Clean up everything we created during this test: 1. Delete (or archive) all entries in the "Audit Tracker" database 2. Delete (or archive) the "Audit Tracker" database itself 3. Delete (or archive) the "Audit Summary" page 4. Confirm the shared test page is clean (no child content remaining) If any of these can't be done via the API, tell me what manual steps would be needed.

Completed cleanup in a single pass: trashed 3 entries, the database, and the summary page, then verified the test page was clean.

Files: 1 · Shell: 1
Task 6 — Reflection — Completed
Looking back at everything you just did — discovering Notion's developer platform, setting up integration access, building the database and content, handling errors, and cleaning up: 1. Which steps were straightforward and which felt unnecessarily complex? 2. Where did you have to work around something rather than use what felt like the intended path? 3. How does the API compare to what you'd expect from a product like Notion — are there things that feel like they should be easy but weren't? 4. What would have made this easier? Be specific — refer to actual moments from the tasks above.

Reflected on the audit, identifying the database/data-source split and stale SDK documentation as the primary friction, with select auto-creation as a data-integrity footgun.

Findings

API Workflow and Data Model

F-001 Positive

Authentication was clean and immediately verifiable

Both models set up API access with a single integration token passed as a Bearer header. The users.me() endpoint confirmed identity and workspace in one call. After the operator shared the test page, both models verified read and write access within two additional API calls. No OAuth flow, no multi-step token exchange, no scoping configuration was required.
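The setup described above reduces to two request headers and one verification call. A minimal sketch in JavaScript (the SDK language both models used); the helper name is ours, and the pinned version string is illustrative:

```javascript
// Build the headers Notion's REST API expects: a Bearer token and a
// pinned API version. No OAuth exchange or scope configuration needed.
function notionHeaders(token, version = "2025-09-03") {
  return {
    Authorization: `Bearer ${token}`,
    "Notion-Version": version,
    "Content-Type": "application/json",
  };
}

// Verifying access is then a single call, e.g.:
//   fetch("https://api.notion.com/v1/users/me", { headers: notionHeaders(token) })
```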

F-002 Positive

Database query filtering and sorting worked as documented

Both models constructed a filtered, sorted query on the first attempt. The filter syntax (Status ≠ "Done", sorted by Due date ascending) returned exactly the expected 2 of 3 entries. The response included all property values in a predictable structure. Neither model needed to retry or adjust the query.
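The query both models sent can be sketched as a plain payload builder. Property names follow the Audit Tracker schema from Task 3; the helper name is ours:

```javascript
// Filter + sort payload for the query endpoint (dataSources.query in
// SDK v5): return entries whose Status is not "Done", sorted by Due
// date ascending.
function buildOpenItemsQuery() {
  return {
    filter: {
      property: "Status",
      select: { does_not_equal: "Done" },
    },
    sorts: [{ property: "Due date", direction: "ascending" }],
  };
}
```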

F-003 Major

SDK v5 breaking change — properties silently ignored, new API shape undocumented in quickstarts

Both models' first workflow attempts failed identically. The SDK v5.12.0 (targeting API version 2025-09-03) introduced a breaking change: the properties parameter for databases.create() was renamed to initial_data_source, and databases.query() was removed in favour of dataSources.query().

The SDK emitted a warning — unknown parameters were ignored: ['properties'] — but did not error. It created an empty database (no properties) and returned success. The script then crashed when it tried to read properties from the response. The warning listed the valid parameters (including initial_data_source) but did not explain the migration path.

Both models recovered by reading the SDK's TypeScript declarations to reverse-engineer the new API shape. Opus spawned a sub-agent that read the SDK's .d.ts files to map the new parameter structure. Sonnet created probe databases to inspect response shapes experimentally.

The public documentation at developers.notion.com still described the old properties-based API at the time of testing.
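A sketch of the payload shape the models arrived at. The nesting under initial_data_source reflects what was reverse-engineered in this run from the SDK's TypeScript declarations; treat it as an assumption, not official documentation:

```javascript
// databases.create() payload in the v5 shape. Pre-v5, the inner
// `properties` object sat at the top level; v5 silently ignores a
// top-level `properties` and expects it under `initial_data_source`.
function buildCreateDatabasePayload(pageId) {
  return {
    parent: { page_id: pageId },
    title: [{ text: { content: "Audit Tracker" } }],
    initial_data_source: {
      properties: {
        Status: {
          select: {
            options: [
              { name: "Not started" },
              { name: "In progress" },
              { name: "Done" },
            ],
          },
        },
        Effort: { number: {} },
      },
    },
  };
}
```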

Model divergence: Sonnet additionally discovered that databases.query is not a function — the error read like a version mismatch or typo rather than a structural API change. Sonnet had to enumerate all SDK namespace methods to discover dataSources as a separate namespace.

Evidence Summary

SDK v5 silently ignores properties parameter

Both models ran their initial workflow scripts using the documented properties parameter. The SDK accepted the call without error but emitted a warning listing initial_data_source among the valid parameters. It created an empty database (no properties) and returned success — the script crashed only when it tried to read properties from the response. The warning pointed to the correct parameter but did not explain the migration path or the data source abstraction.

Agent reverse-engineers new API shape from TypeScript declarations

Opus spawned a sub-agent to read the SDK's TypeScript declaration files, which revealed the new data source abstraction: databases.create() now takes initial_data_source instead of properties, and databases.query() has been removed entirely in favour of dataSources.query(). The agent rewrote the workflow script with the corrected API calls and the second run succeeded. Sonnet reached the same conclusion by creating probe databases and enumerating SDK namespace methods.

Recommendation

Update public documentation and quickstarts to reflect the data source abstraction. The SDK v5 / API version 2025-09-03 breaking change (properties → initial_data_source, databases.query() → dataSources.query()) was the largest source of friction. The warning message should explain the migration path, not just list valid parameters.

F-004 Critical

Invalid select option silently accepted — schema extended without error

Both models attempted to create a database entry with a non-existent select value ("Cancelled") as part of error testing. Instead of returning a validation error, the API silently accepted the value and auto-created "Cancelled" as a new select option in the database schema.

Opus verified this by querying for Status = "Cancelled" — the "Ghost entry" row was returned with the silently created option. Opus's analysis: "a typo like 'In Progress' vs 'In progress' would silently create a duplicate option." Sonnet confirmed the same behaviour and noted: "There's no strict mode or schema-validation flag."

This contrasts with the structured, informative errors the API returned for other validation scenarios (invalid UUID format, malformed block content, invalid token). Those errors included exact field paths and expected values. The select auto-creation returned no error, no warning, and no indication that the schema had been modified.

Evidence Summary

Invalid select value silently accepted

Both models' error test scripts expected a validation error when creating an entry with Status = "Cancelled" (a non-existent option). Instead, the API returned success with no error or warning. Opus verified by querying for Status = "Cancelled" and confirmed the ghost entry existed with the auto-created option. Both models independently identified this as a significant data-integrity risk — a typo in a select value would silently pollute the database schema with no way to detect the error programmatically.

Recommendation

Add a strict mode or validation flag for select properties. The silent auto-creation of select options on write is a data integrity risk for programmatic use. An opt-in strict: true parameter or workspace-level setting that rejects unknown select values would prevent silent schema pollution.
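Until such a flag exists, client-side validation is the available workaround. A sketch, assuming the schema object follows the shape of a retrieved database's properties (property name → select.options list); the helper name is ours:

```javascript
// Reject unknown select values before writing, since the API will
// otherwise silently auto-create them as new schema options.
function assertKnownSelectOption(schema, propertyName, value) {
  const allowed = schema[propertyName].select.options.map((o) => o.name);
  if (!allowed.includes(value)) {
    throw new Error(
      `Unknown option "${value}" for "${propertyName}"; allowed: ${allowed.join(", ")}`
    );
  }
}
```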

F-005 Minor

Rich text 2,000-character limit required chunking workaround

Opus's workflow script embedded query results as a JSON code block in the Audit Summary page. The JSON was 4,738 characters, exceeding the 2,000-character limit per rich_text segment. The API returned a clear validation error with the exact field path and character count (body.children[5].code.rich_text[0].text.content.length should be ≤ 2000, instead was 4738).

Opus recovered by writing a separate script to chunk the code block across multiple rich_text segments. Sonnet did not hit this limit (its JSON output was shorter).

The error message was clear and actionable. The limit itself is low enough to hit routinely when embedding structured data, and the workaround (splitting content across segments) adds complexity that agents must discover.

Model divergence: Opus hit the limit (4,738 chars); Sonnet's output was shorter and did not trigger it.
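The workaround amounts to splitting long content across rich_text segments. A sketch of a chunking helper (the helper name is ours; the 2,000-character limit is the one reported in the error above):

```javascript
// Split text into rich_text segments of at most `limit` characters,
// the per-segment cap Opus hit when embedding a 4,738-character JSON
// blob in a code block.
function toRichTextChunks(text, limit = 2000) {
  const chunks = [];
  for (let i = 0; i < text.length; i += limit) {
    chunks.push({ type: "text", text: { content: text.slice(i, i + limit) } });
  }
  return chunks;
}
```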

Recommendation

Document the rich text character limit prominently. The 2,000-character limit per rich_text segment is discoverable only via error. Including it in the block content documentation and in SDK-level validation would prevent unnecessary API round-trips.

Documentation and Discovery

F-006 Positive

Both models discovered the API capabilities and data model from public docs

Both models fetched Notion's developer documentation, including llms.txt, the API reference, and help pages. Both identified the core concepts (pages, databases, blocks, properties), the block types (Opus counted 34; Sonnet counted 33+), the property types (Opus counted 20; Sonnet counted 24), and the data source abstraction. Both noted the API-vs-UI capability gap and the existence of an official MCP server (though neither attempted to use it).

F-007 Minor

Integration setup instructions pointed to wrong UI form

Both models instructed the operator to create an integration at https://www.notion.so/profile/integrations. This URL landed on the public/OAuth integration form, which requires Company name, Website, Tagline, Privacy Policy URL, and OAuth redirect URIs. The correct path for API testing is the "Internal integrations" link in the left sidebar.

The operator navigated to the correct form without agent assistance. Cross-model consistency suggests the incorrect URL is sourced from either training data or Notion's current developer documentation — both models produced the same integration setup path.

Recommendation

Surface the internal-vs-public integration distinction more prominently. The integration setup URL (/profile/integrations) defaults to the public OAuth form. A prominent link or redirect for "I just want an API token for my own workspace" would reduce onboarding friction.

Error Handling

F-008 Positive

Three of four error scenarios returned structured, actionable responses

Three of the four deliberate error scenarios produced clear, structured responses (the fourth, the invalid select value, failed silently; see F-004). The unshared-page test surfaced two distinct errors — a UUID format rejection and the intended 404:

  • Invalid UUID format: 400 validation_error with exact field path (path.page_id should be a valid uuid)
  • Malformed block content: 400 validation_error naming the specific property violation (body.children[0].heading_1.rich_text should be an array)
  • Invalid API token: 401 unauthorized with plain message (API token is invalid)
  • Unshared page (valid UUID): 404 object_not_found with actionable guidance (Make sure the relevant pages and databases are shared with your integration)

Both models recovered autonomously from all explicit errors without human intervention.

Model divergence: Sonnet's initial unshared-page test used a non-v4 UUID, which was rejected on format before reaching the authorization check. Sonnet required two additional attempts to produce a valid-format UUID that triggered the intended object_not_found response.

Evidence Summary

Structured error responses for validation, auth, and not-found

Opus tested all four error scenarios in a single script. The API returned structured JSON errors with status, code, and message fields for each: invalid UUID format (400 validation_error with exact field path), malformed block content (400 validation_error naming the specific property), invalid token (401 unauthorized), and unshared page with valid UUID (404 object_not_found with guidance to share pages with the integration). The agent recovered from each without human help.
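Because the error responses carry stable code fields, recovery can be dispatched mechanically. A sketch; the codes match the scenarios above, while the recovery actions paraphrase the behaviour described in this report:

```javascript
// Map a structured Notion error ({ status, code, message }) to a
// recovery action. The default branch escalates, mirroring the
// operator-escalation pattern used in these runs.
function recoveryFor(error) {
  switch (error.code) {
    case "validation_error":
      return "fix the request field named in the message and retry";
    case "unauthorized":
      return "refresh or re-enter the integration token";
    case "object_not_found":
      return "share the target page with the integration, then retry";
    default:
      return "unrecognised error; surface to operator";
  }
}
```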

Cleanup and Data Lifecycle

F-009 Minor

Cleanup semantics inconsistent — archive vs delete, no unified endpoint

The API uses two different mechanisms for removing content: pages.update({ archived: true }) for pages and database entries, and blocks.delete() for blocks and databases-as-blocks. Neither model found a permanent delete endpoint — the API supports only trashing (recoverable for 30 days via the UI). Neither model found a single "delete this page and everything under it" endpoint.

Opus required two cleanup passes. The first pass archived entries, deleted the database, and archived the summary page, but three leftover blocks remained (two verification paragraphs from Task 2 and an orphaned database from a failed Task 3 attempt). Sonnet completed cleanup in a single pass (the error-test entry created in Task 4 was no longer present at cleanup; its removal was not captured in evidence), having cleaned up stray databases during Task 3 itself.

Opus noted the semantic inconsistency: databases are blocks (deleted via blocks.delete) but database entries are pages (archived via pages.update).
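The split Opus noted can be captured in a small dispatcher. A sketch; the object type labels are ours, not API fields, and the method names refer to the SDK calls described above:

```javascript
// Pick the removal mechanism for a given object: pages and database
// entries are archived via pages.update, while blocks — including
// databases, which are blocks — go through blocks.delete. No permanent
// delete endpoint was found by either model.
function removalCall(object) {
  if (object.type === "page" || object.type === "database_entry") {
    return {
      method: "pages.update",
      args: { page_id: object.id, archived: true },
    };
  }
  return { method: "blocks.delete", args: { block_id: object.id } };
}
```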

Model divergence: Opus required two cleanup passes (leftover blocks from Task 2 and failed Task 3 attempt). Sonnet completed in a single pass.
Evidence Summary

Multi-pass cleanup and inconsistent deletion semantics

Opus's first cleanup pass archived 4 entries, deleted the database block, and archived the summary page, but verification found 3 leftover blocks: two paragraphs from the Task 2 connection test and an orphaned database from the failed first workflow attempt. A second pass deleted the remaining blocks. The agent noted the semantic inconsistency: entries are pages (archived via pages.update), but databases are blocks (deleted via blocks.delete), and no permanent delete endpoint exists. Sonnet completed cleanup in a single pass.

Recommendation

Add a recursive delete endpoint. A single API call to permanently delete a page and all its children would simplify automated test/eval workflows. The current mix of pages.update(archived) and blocks.delete() — with no permanent deletion found by either model — requires multiple calls and leaves artifacts in trash for 30 days.

Recommendations

  1. Update public documentation and quickstarts to reflect the data source abstraction. The SDK v5 / API version 2025-09-03 breaking change (properties → initial_data_source, databases.query() → dataSources.query()) was the largest source of friction. The warning message should explain the migration path, not just list valid parameters.
     Addresses: F-003

  2. Add a strict mode or validation flag for select properties. The silent auto-creation of select options on write is a data integrity risk for programmatic use. An opt-in strict: true parameter or workspace-level setting that rejects unknown select values would prevent silent schema pollution.
     Addresses: F-004

  3. Surface the internal-vs-public integration distinction more prominently. The integration setup URL (/profile/integrations) defaults to the public OAuth form. A prominent link or redirect for "I just want an API token for my own workspace" would reduce onboarding friction.
     Addresses: F-007

  4. Document the rich text character limit prominently. The 2,000-character limit per rich_text segment is discoverable only via error. Including it in the block content documentation and in SDK-level validation would prevent unnecessary API round-trips.
     Addresses: F-005

  5. Add a recursive delete endpoint. A single API call to permanently delete a page and all its children would simplify automated test/eval workflows. The current mix of pages.update(archived) and blocks.delete() — with no permanent deletion found by either model — requires multiple calls and leaves artifacts in trash for 30 days.
     Addresses: F-009

Agent Self-Assessments

Each model was asked to reflect on friction points after completing the workflow (Task 6). These are reproduced verbatim. They are inputs to the analysis — findings above were verified independently by the observer.

Opus 4.6 — Verbatim Reflection

Straightforward:

  • Authentication and connection — once the integration was created, the token worked immediately. The users.me() endpoint gave a clean confirmation. The sharing model (explicit page sharing) is restrictive but simple to reason about.
  • Creating pages with rich content — the block model maps well to what you see in Notion. Heading, paragraph, bulleted list, code block, to-do — each is a block type with a consistent rich_text structure.
  • Filtering and sorting the database — the query in step 3 read cleanly and returned exactly what was expected. The filter syntax is well-designed.

Unnecessarily complex:

  • Discovering the current API shape. The SDK v5.12 has breaking changes (initial_data_source replacing properties, dataSources.query() replacing databases.query()), but the public documentation at developers.notion.com still describes the old API. My first workflow script used the documented approach, which the SDK rejected with a cryptic warning (unknown parameters were ignored: ['properties']). I had to reverse-engineer the new interface from the SDK's TypeScript declarations. This was the single biggest time sink.
  • Integration setup UI. The URL I initially gave you landed on the public (OAuth) integration form. The internal integration flow — which is what you actually want for API testing — is behind a sidebar link on the same page. Minor, but a stumbling block at the first step.

Sonnet 4.6 — Verbatim Reflection

What was straightforward:

Auth setup. One token, one header, works immediately. The verify.js script confirmed access in two calls.

Page content creation (Step 4 of the core task). Appending blocks in a single pages.create call with a children array was clean and predictable.

Where I had to work around something:

The database/data source split. This was the biggest source of friction. The API version 2025-09-03 split databases from dataSources, but nothing in the SDK, the quickstart, or the error messages pointed at this. I only discovered it because the SDK emitted a warning about properties being an unknown parameter — silently ignoring it rather than failing.

Finding dataSources.query. The old databases.query simply doesn't exist on the databases namespace anymore, but the error was notion.databases.query is not a function — which reads like a version issue or a typo, not a structural API change. I had to enumerate the SDK's available methods to discover dataSources was a separate namespace entirely.

Select option validation. Setting a select property to a non-existent option feels like it should fail. Notion silently extends the schema. This is arguably a product choice, but from a data-integrity standpoint it's a footgun. There's no strict mode or schema-validation flag.

The underlying API is capable and the data model is sound. The friction was almost entirely in the transition from the old to the new version, and in documentation that hasn't caught up to the object model split. A developer who started today with the 2025-09-03 docs as the baseline would have a cleaner experience.

Methodology

Run Conditions

Starting State
No pre-existing Notion integration, API keys, or project context. Both models started from a clean working directory with task prompts only. Both runs conducted on the same day against the same workspace and test page. [Partially reconstructed]
Fixture Policy
minimal-prepared — Operator created internal integration and test page during Task 2 at agent direction. Integration creation and page sharing cannot be done via the Notion API — this is an inherent service limitation, not a methodology choice.
Credential Timing
during-onboarding — Operator created internal integration, shared test page, and provided integration token during Task 2 after agent escalation. Same pattern for both models.
Allowed Surfaces

REST API (via JavaScript SDK)

Claude Code sandbox blocked outbound requests to api.notion.com (not in default network allowlist). Both models retried with sandbox disabled. npm install also hit sandbox permissions.

Both models discovered Notion's official MCP server during Task 1 but neither attempted to use it.

Operator Intervention Policy
standard
Declared Deviations
  • Operator — saved the .env file with an incorrect extension (.env.m) and pasted the full Notion URL instead of the page ID during Sonnet Task 2 setup, causing two verification failures before correction.
  • Harness — sandbox restrictions on api.notion.com and npm required manual sandbox overrides for both models, inflating tool call counts.

Notes

  • Two-model run. This report covers two runs of the same agent (Claude Code) with different models: Opus 4.6 and Sonnet 4.6, both on 2026-03-13. Findings consistent across both models are stated once. Divergences are noted inline.
  • Task 2 required manual intervention (both models). The Notion API does not support creating integrations or sharing pages programmatically. Both models correctly escalated to the operator for: (a) creating an internal integration, (b) creating a test page, (c) sharing the page with the integration, (d) providing the integration token. This is an inherent limitation of Notion's security model, not an agent failure.
  • Operator error during Sonnet Task 2. The operator saved the .env file with an incorrect extension (.env.m) and pasted the full Notion URL instead of the page ID. This caused two verification failures before correction. These retries are not attributable to the agent or the Notion API.
  • Sandbox friction. Both models hit Claude Code's sandbox restrictions when calling api.notion.com (not in the default network allowlist) and during npm install. Both identified the cause and retried with sandbox disabled. This inflated the tool call count but is a test harness limitation, not a Notion finding.
  • Neither model used the MCP server. Both models identified Notion's official MCP server during Task 1 discovery but neither attempted to use it, defaulting to the REST API via the JavaScript SDK instead. The MCP server's impact on agent usability was not tested.

Test Configuration

Agent configuration:

  • Models: Claude Opus 4.6, Claude Sonnet 4.6
  • Permission mode: default (sandbox) with manual sandbox overrides for network access and npm
  • Available tools: Bash (node, npm), WebFetch, WebSearch, Read, Write, Edit, Glob, Grep
  • No pre-existing Notion integration, API keys, or project context

Environment:

  • Clean working directory with task prompts only
  • No access to the observation notes, project brief, or report template
  • Agent started each task with accumulated context from previous tasks (same session)

Token Usage

Metric          Opus       Sonnet
Input tokens    75         101
Output tokens   18,233     24,562
Cache creation  109,651    113,354
Cache read      1,577,890  2,877,977

Session IDs

Opus
8ea16bad-2601-4710-8375-fdc2c05b7a5e
Sonnet
994a782f-d7a5-4b7c-afe0-25e64b034b87