This workstream exists to make the project's claims provable, not just plausible.
Primary inspiration:
- TheFellow-fkyeah for black-box conformance discipline
- smartcomputer-ai-forge for conformance decomposition and gap-management rigor
- aliciapaz-attractor-rb for readable divergence tracking
Goal
Build and maintain a benchmark-grade black-box conformance suite with a visible scoreboard tied to executable evidence.
Implemented Surface
The repository now exposes a dedicated conformance harness in test/attractor_ex/conformance/ with one suite per domain:
- parsing_conformance_test.exs
- runtime_conformance_test.exs
- state_conformance_test.exs
- transport_conformance_test.exs
- agent_loop_conformance_test.exs
- unified_llm_conformance_test.exs
Shared fixture data for those suites lives in test/support/attractor_ex_conformance_fixtures.ex.
The published scoreboard is maintained in AttractorPhoenix.Conformance and rendered on the LiveView benchmark page at /benchmark.
Benchmark Matrix
Current black-box scorecard:
| Domain | Score | Evidence |
|---|---|---|
| Parsing | 4.5 | mix test test/attractor_ex/conformance/parsing_conformance_test.exs |
| Runtime | 4.0 | mix test test/attractor_ex/conformance/runtime_conformance_test.exs |
| State | 3.5 | mix test test/attractor_ex/conformance/state_conformance_test.exs |
| Transport | 4.0 | mix test test/attractor_ex/conformance/transport_conformance_test.exs |
| Agent loop | 4.0 | mix test test/attractor_ex/conformance/agent_loop_conformance_test.exs |
| Unified LLM | 4.0 | mix test test/attractor_ex/conformance/unified_llm_conformance_test.exs |
Composite conformance score: 4.0
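The composite can be reproduced from the table with a few lines of Elixir. This is a minimal sketch that assumes the composite is the unweighted mean of the six domain scores, rounded to one decimal place; the source does not state the formula, and `ScoreboardSketch` is a hypothetical module name, not the project's `AttractorPhoenix.Conformance`.

```elixir
defmodule ScoreboardSketch do
  # Domain scores copied from the scorecard table above.
  @scores [
    {"Parsing", 4.5},
    {"Runtime", 4.0},
    {"State", 3.5},
    {"Transport", 4.0},
    {"Agent loop", 4.0},
    {"Unified LLM", 4.0}
  ]

  def scores, do: @scores

  # Assumption: composite = unweighted mean, rounded to one decimal place.
  def composite(scores \\ @scores) do
    total = scores |> Enum.map(fn {_domain, score} -> score end) |> Enum.sum()
    Float.round(total / length(scores), 1)
  end
end

IO.inspect(ScoreboardSketch.composite())
```

Under that assumption the six scores sum to 24.0, which gives the published 4.0.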
Verification Commands
Run the public proof surface with:
mix test test/attractor_ex/conformance
mix test test/attractor_ex/http_test.exs
mix test test/attractor_ex/agent/session_test.exs
mix test test/attractor_ex/llm_client_test.exs
The first command is the benchmark-facing harness. The remaining focused suites provide deeper supporting evidence for areas still marked partial.
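For scripted runs, the four commands above could be driven from one helper. The sketch below is illustrative only: `ProofSurfaceSketch` and its functions are hypothetical names, and the runner is injectable so the logic can be exercised without invoking `mix`.

```elixir
defmodule ProofSurfaceSketch do
  # Paths copied from the verification commands above.
  @targets [
    "test/attractor_ex/conformance",
    "test/attractor_ex/http_test.exs",
    "test/attractor_ex/agent/session_test.exs",
    "test/attractor_ex/llm_client_test.exs"
  ]

  def targets, do: @targets

  # Calls the runner once per target and returns {target, exit_status} pairs.
  # The default runner shells out to `mix test <target>`.
  def run(runner \\ &mix_test/1) do
    Enum.map(@targets, fn target -> {target, runner.(target)} end)
  end

  # True only when every target exited with status 0.
  def all_green?(results) do
    Enum.all?(results, fn {_target, status} -> status == 0 end)
  end

  defp mix_test(target) do
    {_output, status} = System.cmd("mix", ["test", target], stderr_to_stdout: true)
    status
  end
end
```

Calling `ProofSurfaceSketch.run() |> ProofSurfaceSketch.all_green?()` from the repository root would then report whether the whole proof surface passes.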
Gap Ledger
Known proof gaps remain public and explicit:
- CONF-STATE-001: The file-backed HTTP manager proves restart persistence, but a wider cold-boot durability benchmark is still a runtime-foundation item.
  - Evidence: test/attractor_ex/conformance/state_conformance_test.exs
  - Roadmap: docs/plan/02-runtime-foundation.md
- CONF-TRANSPORT-001: The benchmark harness covers create/status/questions/answers, but SSE replay remains covered by focused transport tests rather than this compact scoreboard suite.
  - Evidence: test/attractor_ex/http_test.exs
  - Roadmap: docs/plan/03-operator-surface-and-debugger.md
- CONF-AGENT-001: The benchmark harness proves the default provider preset and event surface, but deeper multi-provider/subagent matrices still live in focused session tests.
  - Evidence: test/attractor_ex/agent/session_test.exs
  - Roadmap: docs/plan/06-unified-llm-and-agent-platform.md
- CONF-LLM-001: The scoreboard proves provider-agnostic JSON and stream normalization, while provider-native parity gaps remain tracked in the unified LLM compliance matrix.
  - Evidence: test/attractor_ex/conformance/unified_llm_conformance_test.exs
  - Roadmap: docs/plan/06-unified-llm-and-agent-platform.md
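One way to keep a ledger like this honest is to check that every evidence and roadmap path still resolves. The following is a minimal sketch under stated assumptions: `GapLedgerSketch` and the map shape are hypothetical and are not taken from the repository; only the IDs and paths come from the entries above.

```elixir
defmodule GapLedgerSketch do
  # IDs and paths copied from the ledger above; the map shape is illustrative.
  @gaps [
    %{
      id: "CONF-STATE-001",
      evidence: "test/attractor_ex/conformance/state_conformance_test.exs",
      roadmap: "docs/plan/02-runtime-foundation.md"
    },
    %{
      id: "CONF-TRANSPORT-001",
      evidence: "test/attractor_ex/http_test.exs",
      roadmap: "docs/plan/03-operator-surface-and-debugger.md"
    },
    %{
      id: "CONF-AGENT-001",
      evidence: "test/attractor_ex/agent/session_test.exs",
      roadmap: "docs/plan/06-unified-llm-and-agent-platform.md"
    },
    %{
      id: "CONF-LLM-001",
      evidence: "test/attractor_ex/conformance/unified_llm_conformance_test.exs",
      roadmap: "docs/plan/06-unified-llm-and-agent-platform.md"
    }
  ]

  def gaps, do: @gaps

  # Returns {id, path} for every evidence or roadmap file missing under `root`.
  def missing_files(root \\ File.cwd!()) do
    for gap <- @gaps,
        path <- [gap.evidence, gap.roadmap],
        not File.exists?(Path.join(root, path)),
        do: {gap.id, path}
  end
end
```

Run from the repository root, `GapLedgerSketch.missing_files()` returns an empty list when every referenced file exists, so a non-empty result flags a ledger entry that has drifted from the tree.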
Success Criteria
This workstream is considered implemented when:
- implementation claims can be traced to executable tests
- the repo exposes a public benchmark or conformance scorecard
- partial or missing areas are documented explicitly, not implied away
- the proof surface is maintained alongside the benchmark page and published docs