This document defines the benchmark contract used to judge whether attractor-phoenix is actually ahead of the strongest reference implementations.
Primary inspiration source:
../examples/FOCUSED-RESEARCH.md
Mission
attractor-phoenix should become the strongest overall implementation in the reference set across three dimensions at the same time:
- Runtime capability
- Conformance and implementation rigor
- Operator experience and premium product features
The target is not merely a broader feature list. The target is a system that combines:
- durable execution and resume semantics
- stronger black-box proof of behavior
- a first-class debugger and operator control plane
- a cohesive Attractor + coding-agent + unified-LLM story
Competitive Claim
The project should only claim leadership when all of the following are true at the same time:
- Runtime depth is stronger than the best runtime-focused reference.
- Conformance evidence quality is stronger than the best conformance-focused reference.
- Operator experience is stronger than the best dashboard-focused reference.
- Public documentation accurately states both strengths and known gaps.
Reference Set
The focused research document identifies seven primary comparison points:
- samueljklee-attractor
- TheFellow-fkyeah
- kilroy
- brynary-attractor
- smartcomputer-ai-forge
- attractor
- aliciapaz-attractor-rb
Benchmark Standard
To claim leadership, attractor-phoenix should match or exceed the current set as follows:
- Match or exceed samueljklee-attractor on execution and server surface.
- Match or exceed TheFellow-fkyeah on conformance evidence quality.
- Match or exceed kilroy on durable run state and resume fidelity.
- Match or exceed brynary-attractor on integration coherence across runtime, agent loop, and unified-LLM abstractions.
- Match or exceed attractor on dashboard and operator usability.
- Keep explicit honesty, like aliciapaz-attractor-rb and smartcomputer-ai-forge, around what remains incomplete.
Benchmark Dimensions And Weights
Leadership should be scored as a weighted composite so trade-offs are explicit:
- Runtime capability and durability: 40%
- Conformance and proof quality: 30%
- Operator experience and product surface: 20%
- Integration coherence across Attractor + coding-agent + unified LLM: 10%
Each dimension should have:
- measurable criteria
- executable evidence
- explicit gap notes when not complete
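As a minimal sketch, the weights above can be captured as data and sanity-checked so they cannot silently drift when rebalanced. The dimension keys are illustrative shorthand, not identifiers from the attractor-phoenix codebase:

```python
# Hypothetical weight table for the benchmark composite; the key names
# are shorthand invented for this sketch, not project identifiers.
WEIGHTS = {
    "runtime": 0.40,       # runtime capability and durability
    "conformance": 0.30,   # conformance and proof quality
    "operator": 0.20,      # operator experience and product surface
    "integration": 0.10,   # integration coherence
}

# Guard against drift if the weights are ever rebalanced: they must
# still describe a complete 100% split.
assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9
```

Keeping the weights in one place makes any future rebalancing an explicit, reviewable change rather than a scattered edit.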
Current Position
The repository already has a strong base:
- Broad AttractorEx parser, validator, engine, handlers, HTTP service, SSE, and resume support.
- Human-in-the-loop support, coding-agent loop primitives, and unified LLM foundations.
- Published documentation and maintained compliance matrices.
- Phoenix LiveView surfaces for dashboard, builder, setup, and pipeline library.
The main gaps are structural:
- Runtime state is still in-memory in the HTTP manager.
- The main UI is more inspector than operator console.
- The builder has its own JavaScript DOT interpretation path.
- Several unified-LLM and appendix-level spec areas remain partial or not implemented.
- Conformance evidence is strong, but not yet packaged as a visible black-box benchmark suite against competing implementations.
Required Evidence For Leadership
The benchmark should treat claims as valid only with reproducible evidence:
- Runtime: restart-safe persistence and replay tests that pass from a clean boot.
- Conformance: black-box fixtures and runners with published pass/fail output.
- Operator surface: live run inspection and debugger workflows demonstrable without internal-only tooling.
- Integration: end-to-end scenario proving runtime + human gate + agent loop + unified LLM behavior in one reproducible path.
Evidence should be linked in docs and traceable to runnable code or test targets.
Scoring Rubric
Use a 0-5 score per dimension:
- 0: not implemented
- 1: partial prototype, not dependable
- 2: functional but materially incomplete
- 3: production-credible baseline
- 4: strong implementation with documented evidence
- 5: best-in-set quality with benchmark-grade proof
Composite score formula:
(runtime * 0.40) + (conformance * 0.30) + (operator * 0.20) + (integration * 0.10)
Minimum bar to claim leadership:
- Composite score >= 4.2
- No dimension below 4.0
- No hidden known gaps in public docs
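The composite formula and the minimum bar can be sketched together as a single scoring gate. This is an illustrative implementation only; the function and parameter names are invented for this sketch and are not part of the project:

```python
def composite_score(runtime, conformance, operator, integration):
    """Weighted composite per the rubric; each input is a 0-5 score."""
    return (runtime * 0.40 + conformance * 0.30
            + operator * 0.20 + integration * 0.10)

def may_claim_leadership(runtime, conformance, operator, integration,
                         hidden_gaps_in_docs=False):
    """Apply the minimum bar: composite >= 4.2, no single dimension
    below 4.0, and no hidden known gaps in public docs."""
    scores = (runtime, conformance, operator, integration)
    return (composite_score(*scores) >= 4.2
            and min(scores) >= 4.0
            and not hidden_gaps_in_docs)
```

For example, `may_claim_leadership(4.5, 4.2, 4.0, 4.0)` passes (composite 4.26), while `may_claim_leadership(5, 5, 3.9, 5)` fails despite a high composite, because one dimension sits below 4.0.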
Strategic Priorities
The roadmap should optimize for the following order of importance:
- Durable runtime and replayable state
- First-class debugger
- Canonical builder fidelity
- Black-box conformance leadership
- Completion of the LLM and agent-platform story
- A cohesive operator-grade Phoenix product
Review Cadence
Benchmark reviews should run at a fixed cadence:
- Weekly internal score review against the rubric.
- Milestone-based public docs refresh when a dimension score changes.
- Pre-release leadership check requiring evidence links for all >= 4.0 claims.
Anti-Claim Rules
Do not claim "ahead of the set" if any of the following is true:
- durability or replay depends on in-memory-only state
- conformance status is not backed by executable fixtures
- debugger capabilities exist only as internal diagnostics
- unified LLM or agent-loop claims are broader than current implemented behavior
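The anti-claim rules above can be expressed as a single veto predicate: if any rule trips, the leadership claim is blocked regardless of composite score. The flag names are assumptions invented for this sketch, not fields from the codebase:

```python
def leadership_claim_blocked(state_is_in_memory_only,
                             conformance_backed_by_fixtures,
                             debugger_is_internal_only,
                             claims_exceed_implementation):
    """Return True if any anti-claim rule vetoes an
    'ahead of the set' claim."""
    return (state_is_in_memory_only
            or not conformance_backed_by_fixtures
            or debugger_is_internal_only
            or claims_exceed_implementation)
```

Modeling the rules as a veto (any one condition blocks the claim) mirrors the document's intent: these are hard gates, not weighted inputs to the composite.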