V1 is still in-forum but it got too stale to bump.
That post was half-math, half deadline exhaustion. Good poem tho. I spent the year since trying to falsify it instead of selling it.
TL;DR: we can't say "we solved alignment."
But we can say we made the world's first traceability harness for agent systems. Several workbenches are game/sim-shaped, and every one has named failure cells.
https://sundog.cc
What actually holds:
Mesa cliff, located. Not ~quite~ Goodhart immunity. In a trained RL controller (256-unit final hidden layer), the basin-attractor isn't one neuron, a handful of features, or any linear decomposition. It's an entangled 5-D subspace at net.7 (top-5 PCs = 97.4% of the variance across the cliff, in both directions). We discovered the 5-dimensional hologram. There is a sharp behavioral cliff at lambda ~ 0.95-0.97. Where we pushed hardest we did NOT get immunity; we got a boundary we can point at. Patch heatmaps:
https://sundog.cc/mesa
Ask Sundog, our cheap in-house chatbot: 5,670 trace-conditioned chatbot trials across OpenAI / Anthropic / Meta builds, zero unsafe-accepts in the tested envelope. The geometrically contrived agent mathematically cannot say the wrong thing.
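For anyone sizing up the zero-in-5,670 number: a quick sketch of what a clean run of that length bounds statistically. This is not from the Sundog harness. It assumes the trials are independent and identically distributed, which trace-conditioned trials across different builds may not be, and the function name is made up for illustration.

```python
def zero_failure_upper_bound(n_trials: int, confidence: float = 0.95) -> float:
    """One-sided upper bound on the failure rate after n_trials with zero failures.

    Assumes independent, identically distributed trials (an assumption, not
    something the post states).
    """
    # With zero failures, the largest rate p still consistent with the data at
    # this confidence solves (1 - p)^n = 1 - confidence.
    return 1.0 - (1.0 - confidence) ** (1.0 / n_trials)


n_trials = 5670  # trial count quoted in the post
upper = zero_failure_upper_bound(n_trials)
rule_of_three = 3.0 / n_trials  # common quick approximation at 95% confidence

print(f"95% upper bound on unsafe-accept rate: {upper:.2e}")
print(f"rule-of-three approximation:           {rule_of_three:.2e}")
```

The bound only speaks to the tested envelope; it says nothing about inputs outside it.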
Same shape across all of it: a hidden low-D structure throwing a visible, traceable pattern, like the patterns on a wave of the ocean, with the place it breaks: written down.
Traceable.
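If you want to check a "top-5 PCs = X% of the variance" claim like the net.7 one on your own activations, a minimal sketch follows. It is not the repo's code: the activation matrix here is synthetic (five latent directions plus noise, mixed into 256 units), and the function name and shapes are assumptions for illustration. Swap in real activations of shape (n_samples, 256) to test the actual figure.

```python
import numpy as np


def cumulative_explained_variance(activations: np.ndarray) -> np.ndarray:
    """Cumulative fraction of total variance captured by the leading PCs."""
    # Center each unit so the SVD describes variance rather than the mean.
    centered = activations - activations.mean(axis=0, keepdims=True)
    # Squared singular values are proportional to per-component variance.
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variance = singular_values ** 2
    return np.cumsum(variance) / variance.sum()


# Synthetic stand-in for hidden-layer activations (hypothetical data):
# 5 latent directions mixed into 256 units, plus small isotropic noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(2000, 5))
mixing = rng.normal(size=(5, 256))
activations = latent @ mixing + 0.05 * rng.normal(size=(2000, 256))

cumulative = cumulative_explained_variance(activations)
top5_share = cumulative[4]  # index 4 = first five components
print(f"top-5 PCs explain {top5_share:.1%} of the variance")
```

A high top-5 share shows the variance is low-dimensional; it does not by itself show the subspace drives behavior, which is what the patching experiments are for.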
What I want here specifically: attacks, replications, dumb counterexamples, better toy environments. The game/sim workbenches (three-body, Pressure Mines, Balance) are the easiest place to come find where the trace stops being enough.
Repo and harness are public; the falsifiers are listed in the docs. Not asking for belief. Asking you to make it fail cleanly so I can go back to my day job engineering controls for quantum computers.
Transparent Repo:
https://github.com/humiliati/sundog 