A short note on resource-conditioned execution verification.
Cooper Veit. May 1, 2026. Patent pending.
When you send a request to a hosted AI system, the public interface tells you a model name, maybe a priority class, maybe a cache policy, maybe a region, maybe a price.
It does not tell you the computer you actually used.
Behind the API, the same apparent request can pass through different queues, cache states, memory tiers, backend kernels, precision policies, regions, model snapshots, batching regimes, fallback paths, and billing regimes. Some of these transitions are benign. Some are explicitly permitted. Some are exactly what the customer paid to avoid.
The old mental model is:
input -> model -> output
The better mental model is:
input + execution contract
-> resource-conditioned execution path
-> observable traces
-> attributed output
That is the patent-pending frame: resource-conditioned execution-contract verification.
The short version:
Ashiba verifies whether an AI workload executed under the class of execution that was declared, purchased, or relied on.
Not by asking the provider to reveal its whole cluster. Not by pretending exact internals are always knowable. By probing the boundaries where hidden execution state changes.
Most AI execution promises are not scalar facts.
"Cached" is not one fact. A prompt prefix can be resident in one tier, evicted to another, rematerialized from tokens, or billed differently from how it was served.
"Priority" is not one fact. A request can be admitted to a priority lane, miss a batch window, fall back during congestion, or receive priority only at one scheduling layer.
"FP32" is not one fact. The user-visible dtype, the runtime precision policy, the selected backend, the fused kernel, the strict reference path, and the hardware instruction path can disagree.
"Deterministic" is not one fact. A deterministic front-end setting may or may not control the specific recomputation, dropout, mutable-buffer, collective, or hardware path that matters.
So the verification target is usually not a component. It is a handoff: the moment a prefix is evicted, a batch window is missed, a backend is selected, a fallback is taken.
The interesting failures live at those transitions.
A verifier starts with a declared execution contract. For example (a code sketch follows the list):
primary regime: priority decode
boundary: priority capacity exhaustion
permitted fallback: standard decode after exhaustion
prohibited fallback: reduced-precision decode or cross-region decode
required traces: time-to-first-token, token cadence, billed usage
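Written down as data, that example contract might look like the following sketch. The field names and values are hypothetical, not an Ashiba schema.

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class ExecutionContract:
        primary_regime: str                     # regime that was declared or purchased
        boundary: str                           # transition the probe suite targets
        permitted_fallbacks: frozenset[str]     # fallbacks the contract allows
        prohibited_fallbacks: frozenset[str]    # fallbacks that violate the contract
        required_traces: tuple[str, ...]        # traces needed to attribute a transition

    priority_decode = ExecutionContract(
        primary_regime="priority_decode",
        boundary="priority_capacity_exhaustion",
        permitted_fallbacks=frozenset({"standard_decode"}),
        prohibited_fallbacks=frozenset({"reduced_precision_decode", "cross_region_decode"}),
        required_traces=("time_to_first_token", "token_cadence", "billed_usage"),
    )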
Then it compiles a redundant probe suite: several probes aimed at the same boundary, so that no single measurement has to carry the conclusion.
The verifier collects observable traces. These might be token timing, latency, cache counters, billed usage, error state, output residue, metadata, quota movement, or client-side measurements.
It then evaluates expected relationships among those traces. The pattern of satisfied and violated relationships is a transition syndrome.
That syndrome maps to a contract-relevant state (the mapping is sketched after the list):
remained in primary regime
transitioned to permitted fallback
transitioned to impermissible fallback
unresolved among an observational equivalence class
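One way to picture the mapping, with invented relationship and state names: a syndrome is the set of expected trace relationships that were violated, and a table sends each syndrome to the hidden states still consistent with it.

    # Hypothetical syndrome table for the priority-decode contract sketched earlier.
    # Keys: which expected trace relationships were violated.
    # Values: the hidden execution regimes still consistent with that pattern.
    SYNDROME_TO_CANDIDATES = {
        frozenset(): {"priority_decode"},                       # every relationship held
        frozenset({"ttft_in_priority_band"}): {"standard_decode"},
        frozenset({"cadence_matches_declared_precision"}): {"reduced_precision_decode"},
        # A cadence anomaly alone cannot separate a congestion fallback from a
        # precision fallback, so the candidate set stays ambiguous.
        frozenset({"token_cadence_steady"}): {"standard_decode", "reduced_precision_decode"},
    }

    def candidate_states(violated_relationships):
        # An unrecognized syndrome yields an empty set: "instrumentation needed".
        return SYNDROME_TO_CANDIDATES.get(frozenset(violated_relationships), set())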
The goal is not omniscience. The goal is the coarsest effective execution regime: the least-specific hidden state sufficient to decide whether the contract was satisfied.
If two internal states are indistinguishable and both are permitted, there is no need to pretend to know which one occurred. If two internal states are indistinguishable but one is permitted and one is not, the report should say that too: unresolved because telemetry is insufficient.
That is a better audit artifact than a fake exact answer.
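The decision rule in the last two paragraphs is small enough to write out. A sketch, assuming the candidate set comes from a syndrome table like the one above and the permitted set comes from the contract:

    def verdict(candidate_states, permitted_states):
        # candidate_states: hidden regimes consistent with the observed traces.
        # permitted_states: primary regime plus permitted fallbacks from the contract.
        if not candidate_states:
            return "unresolved: instrumentation needed"
        if candidate_states <= permitted_states:
            # Every indistinguishable state is allowed; no need to resolve further.
            return "satisfied"
        if candidate_states.isdisjoint(permitted_states):
            # Every indistinguishable state is prohibited.
            return "violated"
        # A permitted state and a prohibited state remain indistinguishable.
        return "unresolved: telemetry insufficient"

    allowed = {"priority_decode", "standard_decode"}
    print(verdict({"priority_decode"}, allowed))                              # satisfied
    print(verdict({"reduced_precision_decode"}, allowed))                     # violated
    print(verdict({"standard_decode", "reduced_precision_decode"}, allowed))  # unresolved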
Ordinary monitoring asks: did latency spike, did errors rise, did spend change?
Benchmarking asks: how fast or accurate is this system under a test?
Observability asks: what traces can I collect?
Execution-contract verification asks a different question:
Given the class of execution that was declared or bought, what hidden execution-regime transition best explains the traces?
That difference matters because a provider can be "up" and still fail the relevant execution contract. A request can finish quickly and still run under the wrong precision policy. A cache can reduce cost and still violate a retention promise. A billing line can look plausible and still hide an impermissible fallback.
This is why the unit of value is not a dashboard. It is a signed attribution report, sketched in code after this list:
contract manifest
probe manifest
trace bundle
policy records
tolerance records
transition syndrome
candidate-state set
coarsest effective execution regime
verdict
evidence hash
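A minimal sketch of assembling that report, with the evidence hash computed over a canonical serialization. Field names mirror the list above; the signature itself, and whatever key infrastructure applies it, is out of scope here.

    import hashlib
    import json

    def attribution_report(contract_manifest, probe_manifest, trace_bundle,
                           policy_records, tolerance_records, transition_syndrome,
                           candidate_state_set, coarsest_regime, verdict):
        report = {
            "contract_manifest": contract_manifest,
            "probe_manifest": probe_manifest,
            "trace_bundle": trace_bundle,
            "policy_records": policy_records,
            "tolerance_records": tolerance_records,
            "transition_syndrome": sorted(transition_syndrome),
            "candidate_state_set": sorted(candidate_state_set),
            "coarsest_effective_execution_regime": coarsest_regime,
            "verdict": verdict,
        }
        # Hash a canonical serialization so the report can be referenced, archived,
        # and disputed as a single artifact; a signature would cover this hash.
        canonical = json.dumps(report, sort_keys=True, separators=(",", ":")).encode()
        report["evidence_hash"] = hashlib.sha256(canonical).hexdigest()
        return report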
The dashboard helps humans navigate. The report is the thing that can be reviewed, disputed, archived, and used in procurement or compliance.
One lesson from the experiments is that a reference is not automatically pure.
Suppose you test a fused attention path against an "unfused FP32 reference." If the reference uses ordinary matrix multiplication under the same global precision policy as the tested path, then the reference may inherit the very policy you are trying to test.
Now a mismatch can mean at least four different things: a real policy violation on the tested path, a reference that itself ran under the relaxed policy, an ordinary numerical difference between two kernels that both honored the policy, or a tolerance set wrong for the comparison.
So the verifier needs strict-reference isolation. The reference path has its own policy record. The tested path has its own policy record. The discrepancy is classified, not merely observed.
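A minimal sketch of what strict-reference isolation can look like in PyTorch terms. The helper names are invented; the point is that the reference path pins its own precision policy and records it, rather than inheriting whatever global policy the tested path ran under.

    import torch
    import torch.nn.functional as F

    def policy_record():
        # Capture the ambient precision policy governing fp32 matmul paths.
        return {
            "float32_matmul_precision": torch.get_float32_matmul_precision(),
            "cuda_matmul_tf32": torch.backends.cuda.matmul.allow_tf32,
            "cudnn_tf32": torch.backends.cudnn.allow_tf32,
        }

    def tested_attention(q, k, v):
        # Tested path: fused scaled-dot-product attention under whatever policy
        # is already in force, recorded alongside the output.
        return F.scaled_dot_product_attention(q, k, v), policy_record()

    def strict_reference_attention(q, k, v):
        # Reference path: unfused FP32 attention that pins its own policy so it
        # cannot silently inherit the policy the tested path ran under.
        prev = torch.get_float32_matmul_precision()
        torch.set_float32_matmul_precision("highest")
        try:
            scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
            out = torch.softmax(scores, dim=-1) @ v
            return out, policy_record()
        finally:
            torch.set_float32_matmul_precision(prev)

    q = k = v = torch.randn(1, 4, 32, 64)
    out_t, pol_t = tested_attention(q, k, v)
    out_r, pol_r = strict_reference_attention(q, k, v)
    # The discrepancy is classified against both policy records,
    # not merely observed as a max-abs difference.
    print((out_t - out_r).abs().max().item(), pol_t, pol_r)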
This is the kind of small implementation detail that decides whether an audit is real.
The frame is abstract, but it came out of concrete probe work.
In checkpoint/rematerialization probes, deterministic controls matched, non-preserved-RNG falsifiers diverged, mutable forward state diverged, and train-mode running statistics created state drift even when gradients could remain equivalent. That forced the verifier to separate gradient equivalence from state equivalence.
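For the shape of such a probe, a minimal sketch against torch.utils.checkpoint (an illustration; the original probe suite is not reproduced here). A deterministic control with preserved RNG state should match the uncheckpointed baseline gradient, and the non-preserved-RNG falsifier should diverge, because the recomputed dropout mask no longer matches the forward mask.

    import torch
    import torch.nn.functional as F
    from torch.utils.checkpoint import checkpoint

    def block(x):
        # Dropout makes the forward pass RNG-dependent, which is exactly what a
        # preserve-RNG falsifier needs in order to bite.
        return F.dropout(x, p=0.5, training=True)

    def grad_of(forward_fn):
        torch.manual_seed(0)
        x = torch.randn(64, 64, requires_grad=True)
        torch.manual_seed(1)
        (g,) = torch.autograd.grad(forward_fn(x).sum(), x)
        return g

    baseline  = grad_of(block)
    control   = grad_of(lambda x: checkpoint(block, x, use_reentrant=False,
                                             preserve_rng_state=True))
    falsifier = grad_of(lambda x: checkpoint(block, x, use_reentrant=False,
                                             preserve_rng_state=False))

    print("control matches baseline:  ", torch.equal(control, baseline))    # expected True
    print("falsifier matches baseline:", torch.equal(falsifier, baseline))  # expected False

Train-mode running statistics are the subtler case: a batch-norm style buffer is updated once in the forward pass and again during recomputation, which is one way state drift can arise even where gradient equivalence holds, and why the two are reported separately.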
In H100 scaled-dot-product-attention probes under a strict precision policy, hundreds of measurements across backend and compile schedules came back compliant. That is not a non-result. It is a positive audit finding: substrate-enforced under the measured policy.
In TF32 precision-policy probes, a broad comparison first looked like an operator divergence. Strict-reference isolation showed the more precise story: some apparent violations were reference-policy artifacts, while the forced math backend exposed a real policy-sensitive path. "FP32 attention" was not one operational fact.
In matrix-validation work, some Good/Bad calibration cells fired cleanly, while other attempted probes stayed unresolved or failed to reproduce a hypothesized bite. Those negative results are part of the method. A verifier that cannot say "calibration investigation" or "instrumentation needed" is too eager to be trusted.
The point is not that these particular cases are the product. The point is the pattern:
declared semantic
-> boundary-targeted probe
-> observable trace
-> falsifier-gated attribution
-> coarsest effective state
That pattern is portable.
AI infrastructure is becoming less like one computer and more like a market of hidden execution regimes.
The customer sees a model name. The provider manages memory pressure, batching economics, cache tiers, precision tradeoffs, fallback paths, quotas, queueing, regions, and hardware availability.
That gap is not going away. It is the product surface.
The mature version of AI infrastructure will need execution receipts. Not just "the model answered." Not just "the API returned 200." A receipt for what class of execution was actually delivered, at what boundary, under what evidence.
That is the job of resource-conditioned execution verification.
It does not require full provider introspection. It does not require pretending all hidden state is knowable. It does require treating AI execution as something with contracts, transitions, traces, and evidence.
The AI API is not the computer. The computer is the API plus the hidden resource-conditioned execution system behind it.
If we are going to buy, regulate, insure, or rely on that system, we need a way to verify which computer we actually got.