Confidential · Hypernym Research Arc · NDA · Do not redistribute or summarize externally

ROUND R18 · AI STACK VELOCITY

From six extensions to one control plane

2026-05-11 · 6 streams · 4/5 R1 + 3/5 R2 captured · Codex max-budget · ~$30-40 spend

R17 closed the positioning. R18 found the product architecture. Per user direction (drop federation; focus on AI-stack research with API-call adoption), the panel converged on a single Codex R2 framing: Modulum's R18 product is a developer platform, not six separate SDKs. pip install hypernym-stack orchestrates everything.

6 · streams converged
4 · verdicts sound
2 · partial → R19 fixes
3 · R2 outliers
1 · control plane
00 · What this is

A spec round, not a status report

Same plain-language framing as R16-R17. Outputs are design decisions, not bugs found in code.

"Modulum's R18 product should be a composability and reliability control plane, not six extension products."
Codex R2 · most novel R2 contribution
01 · The product architecture

hypernym-stack — one platform, six capabilities

Cross-panel synthesis. The six R18 streams aren't six products; they're capabilities of one developer SDK.

hypernym-stack (single platform · SDK + API)
├── compile/   ← Hypernym M5 Compiler (Grok Stream 5)
│   └── Drop in any base model → optimized M5 patterns in <10 GPU-min
├── context/   ← Hypernym Context Compiler (Codex Stream 2)
│   └── pip install; LangChain / LlamaIndex / DSPy plugins; standalone CLI
├── infer/     ← Modulum API (existing + new models)
│   ├── Speculative-decoding fast path (Stream 4; substrate-aware draft)
│   ├── Quantized variants (Stream 3; 4-bit → Pocket / Edge)
│   └── Multi-modal endpoints (Stream 1; audio first → vision second)
├── retrieve/  ← Modulum-aware retrieval (folded into context/)
├── receipt/   ← Unified Retention Receipt API (3-panel convergent)
│   └── Context Reliability Label + Counterfactual Context Audit
└── reasoning/ ← R19 placeholder (proof/claim/obligation graph)

Distribution model

OSS tier · pip install hypernym-stack with free credits / N-calls-per-day
Pro tier · paid endpoints at *.hypernym.ai
Enterprise · deploy in customer VPC + SOC 2 / HIPAA certifications (R17 carry-forward; Year-1 critical path)
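To make the single-control-plane idea concrete, here is a minimal Python sketch. The class, the tier-based routing, and the per-capability subdomain layout are all illustrative assumptions; the memo only specifies pip install hypernym-stack, the six capability namespaces, and paid endpoints at *.hypernym.ai.

```python
from dataclasses import dataclass


@dataclass
class HypernymStack:
    """Hypothetical single entry point routing to the six capability namespaces."""
    tier: str = "oss"  # "oss" | "pro" | "enterprise"
    capabilities: tuple = (
        "compile", "context", "infer", "retrieve", "receipt", "reasoning",
    )

    def endpoint(self, capability: str) -> str:
        # OSS tier runs locally; paid tiers route to *.hypernym.ai
        # (subdomain-per-capability is an assumed convention).
        if capability not in self.capabilities:
            raise ValueError(f"unknown capability: {capability}")
        if self.tier == "oss":
            return f"local://{capability}"
        return f"https://{capability}.hypernym.ai/v1"


stack = HypernymStack(tier="pro")
print(stack.endpoint("context"))  # https://context.hypernym.ai/v1
```

The point of the sketch is the shape, not the names: one object, one routing decision, six namespaces behind it.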

02 · Per-stream verdicts

Six streams · 4 sound · 2 partial-with-R19-fix

Cross-pollinated R2 verdicts. The Codex and Grok reframes upgraded Streams 2 and 5; Codex named fixes for Streams 4 and 6.

Stream · R2 verdict · Key resolution
1 — Multi-modal Modulum · sound (audio-first) · Long-meeting transcript wedge; LibriSpeech long-form / AMI; vision deferred 90 days
2 — Modulum-aware RAG · sound (Context Compiler reframe) · Codex: prompt assembly as compiler optimization, not retrieval ranking. Better abstraction.
3 — Quantization stability · sound (empirically pending) · Grok INT4 per-head-scale test = canonical first experiment; ≥+6pp at 128k = Pocket gate
4 — Modulum + speculative · partial (substrate-aware drafting needed) · Naive composition underperforms (interference); Codex Draft Distillation = architectural fix
5 — Cross-model transfer · sound (M5 Compiler reframe) · Grok: 50M differentiable attention compiler distills any base model in <10 GPU-min
6 — Long-context reasoning · partial (R19 reasoning-state arch) · Codex: proof/claim/obligation graph + dependency-trace receipt = R19 push
03 · Seven unanimous panel commits

What 3+ R2 panels agreed on

Cross-model convergence. These anchor R18 closeout and R19 scope.

04 · R19 push — locked next round

Modulum-Retained Evidence + Verifiable Reasoning-State

Unified architecture from R18 panel cross-pollination.

Codex framing

stream 6

Explicit proof / claim / obligation graph that the model updates and verifies during generation. Final answer includes verifiable dependency trace.
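A minimal sketch of what such a graph could look like in code. The node kinds (claim, obligation, proof) and the dependency trace come from the framing above; every class, field, and method name here is an illustrative assumption, not a confirmed design.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One unit of reasoning state: a claim the model asserts, an obligation
    it must discharge, or a proof step that discharges one."""
    kind: str                                      # "claim" | "obligation" | "proof"
    text: str
    supports: list = field(default_factory=list)   # ids of nodes this one depends on


class ReasoningGraph:
    """Hypothetical graph the model updates during generation."""

    def __init__(self):
        self.nodes: dict[int, Node] = {}
        self._next_id = 0

    def add(self, kind: str, text: str, supports=()) -> int:
        nid = self._next_id
        self.nodes[nid] = Node(kind, text, list(supports))
        self._next_id += 1
        return nid

    def dependency_trace(self, answer_id: int) -> list[int]:
        """Walk supports transitively: the verifiable trace attached to the answer."""
        seen, stack = [], [answer_id]
        while stack:
            nid = stack.pop()
            if nid not in seen:
                seen.append(nid)
                stack.extend(self.nodes[nid].supports)
        return seen
```

Usage would look like: add a claim, add the proof step that supports it, then emit `dependency_trace(answer_id)` alongside the final answer as the verifiable trace.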

Grok framing

stream 6

Lightweight reasoning scratchpad — re-inject only depth-stable bands into a second forward pass. Inference-time only; no fine-tuning constraint preserved.

Claude framing

stream 6 tie-in

Refusal-at-depth. Modulum's depth-stable retention enables refusal calibration at 128k+. Combines with R17 refusal-correctness benchmarks.

Codex outlier convergence

stream 11

Counterfactual Context Audit. Ablation-driven dependency scoring. Makes Receipts hard to fake; regulated-workflow ready.

R19 frame

R19 productizes long-context reasoning the same way R7-R17 productized long-context retention. Hypernym's second category-defining result if it lands. Frontier labs are stuck at ~30-50% multi-hop accuracy at 128k+; if Modulum-conditioned reasoning hits 70-80%, the moat is permanent. "Hypernym becomes the inference platform that makes reasoning at 128k+ commercially defensible."

05 · Three R2 outliers

R18 standouts (not in any R1)

Modulum Pocket

claude r2

Apple Silicon native flagship. 4-bit Gemma + Whisper + Context Compiler. "Ask any 200-page document on your iPhone. Doesn't hallucinate. Doesn't forget the middle." Direct B2C $19.99/mo. Ships before frontier labs ship long-context retention.

Modulum Compatibility Score

codex r2

Public per-base-model score combining zero-shot gain, calibrated gain, calibration cost, mask overlap, speedup, quantized survival. Buyer-facing metric for model selection. Hypernym = the standards body.
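A hedged sketch of how such a score could be assembled. The six component names come from the memo; the weights, the sign convention for calibration cost, and the [0, 1] normalization are invented for illustration.

```python
# Assumed weights (illustrative only); calibration cost subtracts from the score.
COMPONENTS = {
    "zero_shot_gain": 0.25,
    "calibrated_gain": 0.25,
    "calibration_cost": -0.10,
    "mask_overlap": 0.15,
    "speedup": 0.15,
    "quantized_survival": 0.20,
}


def compatibility_score(metrics: dict) -> float:
    """Weighted sum over component metrics, each pre-normalized to [0, 1]."""
    missing = set(COMPONENTS) - set(metrics)
    if missing:
        raise ValueError(f"missing components: {sorted(missing)}")
    return round(sum(w * metrics[k] for k, w in COMPONENTS.items()), 3)
```

A fixed public formula like this is what would let Hypernym act as the standards body: any base model scored the same way, buyer-comparable.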

Counterfactual Context Audit

codex r2

Rerun lightweight ablations that remove/move cited evidence blocks to test whether the answer actually depended on them. Returns "answer dependence" score in the receipt. Makes Receipts hard to fake; regulated-workflow ready.
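The audit loop could be sketched roughly as follows, assuming the caller supplies a model-call function and an answer-similarity function; the function name and signature are illustrative, not a confirmed API.

```python
def counterfactual_audit(answer_fn, context_blocks, cited_ids, similarity):
    """Score how much the answer depends on each cited evidence block.

    answer_fn:  list of context blocks -> answer string (stands in for a model call)
    similarity: (answer_a, answer_b) -> float in [0, 1]
    Returns {block_index: dependence}, where dependence = 1 - similarity between
    the baseline answer and the answer with that block ablated.
    """
    baseline = answer_fn(context_blocks)
    dependence = {}
    for bid in cited_ids:
        # Lightweight ablation: rerun with the cited block removed.
        ablated = [b for i, b in enumerate(context_blocks) if i != bid]
        dependence[bid] = 1.0 - similarity(baseline, answer_fn(ablated))
    return dependence
```

High dependence on the cited blocks is what makes the receipt hard to fake: the answer demonstrably changes when its claimed evidence is removed.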

Honorable mentions: Modulum Browser WebGPU shader (Grok — top-of-funnel free demo) · Retention-Aware KV Tiering (Codex — precision tiers by evidence importance, not just eviction) · Modulum-on-Embeddings (Claude — Cohere / Voyage competitor at long-doc embedding retention) · Modulum Draft Distillation (Codex — Stream 4 architectural fix) · Modulum-Induced Hallucination Control (Qwen R1 — depth-band confidence as probabilistic meter).
06 · R19 carry-forward

What R18 didn't close

R19's seed scope. Critical path items first.

07 · Closing

R7-R15 fixed retention. R18 productized it. R19 fixes reasoning.

R16 closed the algebra. R17 closed the positioning. R18 closed the architecture — Modulum is a developer platform, not six SDKs. The Context Compiler reframes RAG; the M5 Compiler reframes cross-model transfer; the Retention Receipt API standardizes how customers compare AI systems. Quantization survival unlocks the consumer surface (Modulum Pocket).

R19 fixes reasoning the same way R7-R17 fixed retention. Modulum-conditioned multi-hop reasoning at 128k+ would be Hypernym's second category-defining result. Frontier labs are stuck at 30-50%; Hypernym's structural advantage on retention should translate to reasoning if the panel-converged architecture (proof/claim/obligation graph + dependency trace + refusal-at-depth) holds.

From benchmark result to inference platform in two rounds. R19 ships the reasoning-state architecture. R20+ ships the integrated production stack.