This document defines the benchmark contract used to judge whether attractor-phoenix is actually ahead of the strongest reference implementations.
Primary inspiration source:
../examples/FOCUSED-RESEARCH.md
Mission
attractor-phoenix should become the strongest overall implementation in the reference set across three dimensions at the same time:
- Runtime capability
- Conformance and implementation rigor
- Operator experience and premium product features
The target is not merely a broader feature list. The target is a system that combines:
- durable execution and resume semantics
- stronger black-box proof of behavior
- a first-class debugger and operator control plane
- a cohesive Attractor + coding-agent + unified-LLM story
Competitive Claim
The project should only claim leadership when all of the following are true at the same time:
- Runtime depth is stronger than the best runtime-focused reference.
- Conformance evidence quality is stronger than the best conformance-focused reference.
- Operator experience is stronger than the best dashboard-focused reference.
- Public documentation accurately states both strengths and known gaps.
Reference Set
The focused research document identifies seven primary comparison points:
- samueljklee-attractor
- TheFellow-fkyeah
- kilroy
- brynary-attractor
- smartcomputer-ai-forge
- attractor
- aliciapaz-attractor-rb
Benchmark Standard
To claim leadership, attractor-phoenix should match or exceed the current set as follows:
- Match or exceed samueljklee-attractor on execution and server surface.
- Match or exceed TheFellow-fkyeah on conformance evidence quality.
- Match or exceed kilroy on durable run state and resume fidelity.
- Match or exceed brynary-attractor on integration coherence across runtime, agent loop, and unified-LLM abstractions.
- Match or exceed attractor on dashboard and operator usability.
- Keep explicit honesty, like aliciapaz-attractor-rb and smartcomputer-ai-forge, around what remains incomplete.
Benchmark Dimensions And Weights
Leadership should be scored as a weighted composite so trade-offs are explicit:
- Runtime capability and durability: 40%
- Conformance and proof quality: 30%
- Operator experience and product surface: 20%
- Integration coherence across Attractor + coding-agent + unified LLM: 10%
Each dimension should have:
- measurable criteria
- executable evidence
- explicit gap notes when not complete
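As a minimal sketch, the weights above can be captured as data and sanity-checked so they cannot silently drift when rebalanced. The dimension keys are illustrative shorthand, not identifiers from the attractor-phoenix codebase:

```python
# Hypothetical weight table for the benchmark composite; the key names
# are shorthand invented for this sketch, not project identifiers.
WEIGHTS = {
    "runtime": 0.40,       # runtime capability and durability
    "conformance": 0.30,   # conformance and proof quality
    "operator": 0.20,      # operator experience and product surface
    "integration": 0.10,   # integration coherence
}

# Guard against drift if the weights are ever rebalanced: they must
# still describe a complete 100% split.
assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9
```

Keeping the weights in one place makes any future rebalancing an explicit, reviewable change rather than a scattered edit.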
Current Position
The repository already has a strong base:
- Broad AttractorEx parser, validator, engine, handlers, HTTP service, SSE, and resume support.
- Human-in-the-loop support, coding-agent loop primitives, and unified LLM foundations.
- Published documentation and maintained compliance matrices.
- Phoenix LiveView surfaces for dashboard, builder, setup, and pipeline library.
The main gaps are structural:
- Runtime state is still in-memory in the HTTP manager.
- The main UI is more inspector than operator console.
- The builder has its own JavaScript DOT interpretation path.
- Several unified-LLM and appendix-level spec areas remain partial or not implemented.
- Conformance evidence is strong, but not yet packaged as a visible black-box benchmark suite against competing implementations.
Required Evidence For Leadership
The benchmark should treat claims as valid only with reproducible evidence:
- Runtime: restart-safe persistence and replay tests that pass from a clean boot.
- Conformance: black-box fixtures and runners with published pass/fail output.
- Operator surface: live run inspection and debugger workflows demonstrable without internal-only tooling.
- Integration: end-to-end scenario proving runtime + human gate + agent loop + unified LLM behavior in one reproducible path.
Evidence should be linked in docs and traceable to runnable code or test targets.
Scoring Rubric
Use a 0-5 score per dimension:
- 0: not implemented
- 1: partial prototype, not dependable
- 2: functional but materially incomplete
- 3: production-credible baseline
- 4: strong implementation with documented evidence
- 5: best-in-set quality with benchmark-grade proof
Composite score formula:
(runtime * 0.40) + (conformance * 0.30) + (operator * 0.20) + (integration * 0.10)
Minimum bar to claim leadership:
- Composite score >= 4.2
- No dimension below 4.0
- No hidden known gaps in public docs
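The composite formula and the minimum bar can be sketched together as a single scoring gate. This is an illustrative implementation only; the function and parameter names are invented for this sketch and are not part of the project:

```python
def composite_score(runtime, conformance, operator, integration):
    """Weighted composite per the rubric; each input is a 0-5 score."""
    return (runtime * 0.40 + conformance * 0.30
            + operator * 0.20 + integration * 0.10)

def may_claim_leadership(runtime, conformance, operator, integration,
                         hidden_gaps_in_docs=False):
    """Apply the minimum bar: composite >= 4.2, no single dimension
    below 4.0, and no hidden known gaps in public docs."""
    scores = (runtime, conformance, operator, integration)
    return (composite_score(*scores) >= 4.2
            and min(scores) >= 4.0
            and not hidden_gaps_in_docs)
```

For example, `may_claim_leadership(4.5, 4.2, 4.0, 4.0)` passes (composite 4.26), while `may_claim_leadership(5, 5, 3.9, 5)` fails despite a high composite, because one dimension sits below 4.0.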
Strategic Priorities
The roadmap should optimize for the following order of importance:
- Durable runtime and replayable state
- First-class debugger
- Canonical builder fidelity
- Black-box conformance leadership
- Completion of the LLM and agent-platform story
- A cohesive operator-grade Phoenix product
Review Cadence
Benchmark reviews should run at a fixed cadence:
- Weekly internal score review against the rubric.
- Milestone-based public docs refresh when a dimension score changes.
- Pre-release leadership check requiring evidence links for all >= 4.0 claims.
Anti-Claim Rules
Do not claim "ahead of the set" if any of the following is true:
- durability or replay depends on in-memory-only state
- conformance status is not backed by executable fixtures
- debugger capabilities exist only as internal diagnostics
- unified LLM or agent-loop claims are broader than current implemented behavior
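The anti-claim rules above can be expressed as a single veto predicate: if any rule trips, the leadership claim is blocked regardless of composite score. The flag names are assumptions invented for this sketch, not fields from the codebase:

```python
def leadership_claim_blocked(state_is_in_memory_only,
                             conformance_backed_by_fixtures,
                             debugger_is_internal_only,
                             claims_exceed_implementation):
    """Return True if any anti-claim rule vetoes an
    'ahead of the set' claim."""
    return (state_is_in_memory_only
            or not conformance_backed_by_fixtures
            or debugger_is_internal_only
            or claims_exceed_implementation)
```

Modeling the rules as a veto (any one condition blocks the claim) mirrors the document's intent: these are hard gates, not weighted inputs to the composite.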