The Research Behind Field Engineering
Seven claims about how prompts shape AI behavior, evaluated against peer-reviewed research from venues like ICLR, Anthropic's mechanistic interpretability work, and frontier preprints.
In our [previous post](/field/agent-field-engineering), we introduced Agent Field Engineering — the idea that writing prompts for AI systems is not issuing commands but modulating probability fields. We made some strong claims.
Now let's see what the research actually says.
We evaluated our core claims against peer-reviewed papers, preprints from top venues, and Anthropic's mechanistic interpretability research. What follows is the honest assessment: what's confirmed, what's plausible, and what we got wrong.
Claim 1: "Prompts modulate a probability field"
Verdict: Confirmed.
This is a literal description of transformer mechanics. Every token in a prompt reshapes the probability distribution over the model's vocabulary at each generation step. The attention mechanism computes relevance scores between every pair of tokens, so each token you add changes every score.
What surprised us: an independent researcher arrived at essentially the same framing through formal physics. Vyshnyvetska's "Information Gravity" paper (arXiv:2504.20951, 2025) proposes a field-theoretic model in which prompts create "information mass" that curves semantic space, generating gravitational potential wells that attract token generation. In that model, temperature plays the role of thermal fluctuation: higher temperature lets generation explore the landscape rather than simply descend the potential gradients.
This paper appeared independently of our framework. The convergence is significant.
A NAACL 2025 paper further confirms that "merely syntactic prompt rephrasings can produce very diverse probability distributions" — meaning even surface-level changes modulate the output field. Every word choice reshapes the landscape of what's possible. The leverage lives at the architectural level, where each token deforms the probability space.
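To make the mechanism concrete, here is a minimal sketch using GPT-2 through Hugging Face's transformers library (the model choice and the prompts are ours, purely illustrative). Two syntactic rephrasings of the same request produce measurably different next-token distributions:

```python
# pip install torch transformers
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def next_token_distribution(prompt: str) -> torch.Tensor:
    """Return the model's probability distribution over the next token."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits           # (1, seq_len, vocab_size)
    return torch.softmax(logits[0, -1], dim=-1)   # distribution at the last position

# Two syntactic rephrasings of the same request.
p = next_token_distribution("Summarize the report in one sentence:")
q = next_token_distribution("In one sentence, summarize the report:")

# Total variation distance: how far apart the two output fields sit.
tv = 0.5 * (p - q).abs().sum().item()
print(f"total variation distance: {tv:.3f}")

for name, dist in [("phrasing A", p), ("phrasing B", q)]:
    top = torch.topk(dist, 5)
    tokens = [tokenizer.decode(int(i)) for i in top.indices]
    print(name, "->", list(zip(tokens, [round(v, 4) for v in top.values.tolist()])))
```

Same content, different surface form, and the two distributions diverge. That divergence is the field modulation the claim describes.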
Claim 2: "Identity tokens propagate through every subsequent generation"
Verdict: Confirmed.
When you write "You are a strategic advisor with fifteen years of experience," those tokens persist through every subsequent output token. They are part of the key-value pairs that every generated token attends to at every layer of the transformer.
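Why this persistence is mechanical rather than metaphorical is easiest to see in a toy model. The sketch below is a single attention head with arbitrary dimensions and random values, not a real transformer; it shows that each newly generated token's query is scored against the cached keys of the prompt, so the identity tokens contribute to every generation step:

```python
import torch

torch.manual_seed(0)
d = 16                       # toy head dimension
prompt_len, gen_len = 8, 4   # e.g. 8 identity tokens, then 4 generated tokens

# Toy KV cache: in a real transformer these come from the prompt tokens
# and persist unchanged while new tokens are generated.
K = torch.randn(prompt_len, d)
V = torch.randn(prompt_len, d)

for step in range(gen_len):
    q = torch.randn(d)  # query for the token being generated (random stand-in)
    # The new token attends over ALL cached keys, identity tokens included.
    weights = torch.softmax((K @ q) / d**0.5, dim=-1)
    context = weights @ V   # the identity tokens feed into every step's output
    mass = weights[:prompt_len].sum().item()
    print(f"step {step}: attention mass on the {prompt_len} identity tokens = {mass:.2f}")
    # Append this token's own key/value, as a KV cache does.
    K = torch.cat([K, torch.randn(1, d)])
    V = torch.cat([V, torch.randn(1, d)])
```

The identity tokens never leave the cache, so their influence never reaches zero; it is only diluted as the context grows.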
"Position is Power" (ACM FAccT 2025) empirically demonstrated that identical content produces measurably different outputs depending on whether it appears in a system prompt versus a user message. System-level placement produces stronger behavioral effects, including sentiment shifts and different resource allocation priorities.
Anthropic's circuit tracing research (March 2025) provides the mechanistic ground truth. We can now trace exactly how specific input tokens activate specific features, which propagate through circuits to influence output tokens. When Claude processes "the opposite of small" in different languages, the same core conceptual features activate regardless of language — but the specific features activated depend entirely on the surrounding context.
This means your identity framing configures which features the model activates for every subsequent token it generates. That is feature-circuit configuration, not mood-setting.
Claim 3: "Layered prompt architectures create hierarchical effects"
Verdict: Confirmed (hierarchy validated; our specific taxonomy is our contribution).
We proposed three layers: Identity (who the system is), Environment (domain context), and Task (specific request), with decreasing magnitude of effect.
Research validates the hierarchy but uses different names. OpenAI's "Instruction Hierarchy" paper trains models to prioritize system > user > third-party instructions. The Instructional Segment Embedding paper (ICLR 2025) introduces segment embeddings that classify tokens by their role — system instruction, user prompt, data input — with the model processing each segment with different weights.
Our Identity → Environment → Task framework maps cleanly onto these validated patterns. The specific three-layer naming is our contribution, built on established principles.
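For concreteness, here is one way the three layers might be encoded. The FieldLayers name and the message layout are our own sketch, not a vendor API:

```python
from dataclasses import dataclass

@dataclass
class FieldLayers:
    identity: str     # who the system is (strongest, most persistent effect)
    environment: str  # domain context
    task: str         # the specific request (weakest, most local effect)

def to_messages(layers: FieldLayers) -> list[dict]:
    """Compose layers so the highest-leverage tokens sit at the system level."""
    return [
        # "Position is Power" found system-level placement produces stronger
        # behavioral effects, so identity and environment ride there.
        {"role": "system", "content": f"{layers.identity}\n\n{layers.environment}"},
        {"role": "user", "content": layers.task},
    ]

messages = to_messages(FieldLayers(
    identity="You are a strategic advisor with fifteen years of experience.",
    environment="Domain: early-stage SaaS pricing strategy.",
    task="Critique the attached pricing page.",
))
```

The placement decision is the point: the layers with the largest intended magnitude of effect go where the instruction hierarchy gives them the most weight.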
Claim 4: "Checkpointing prevents context drift"
Verdict: Confirmed, with hard numbers.
"Agent Drift" (arXiv:2601.04170, 2026) directly measures behavioral degradation in multi-agent LLM systems over extended interactions. Their findings:
- Adaptive Behavioral Anchoring (periodically re-stating core identity and objectives) reduces drift by 70.4% as a single strategy
- Combined strategies yield 81.5% drift reduction
- The paper explicitly validates that "grounding agents in baseline exemplars directly counters semantic drift by maintaining alignment with original task formulations"
This is why we re-read our CLAUDE.md at checkpoints during long sessions. It is re-anchoring a probability field that has drifted, with the same discipline the original anchoring required.
Additional research documents a U-shaped attention curve: models attend strongly to the beginning and end of the context but poorly to the middle, with accuracy dropping by more than 30% for content in middle positions. This explains why periodic re-injection of core context is necessary: material in the middle of a long conversation literally receives less attention.
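In practice, the two findings combine into a simple loop. Below is a sketch of adaptive behavioral anchoring as we apply it; the cadence (ANCHOR_EVERY) and the checkpoint wording are our own choices, not values prescribed by the paper:

```python
ANCHOR_EVERY = 10  # turns between re-anchorings (our tuning, not from the paper)

def build_context(anchor: str, history: list[dict], turn: int) -> list[dict]:
    """Assemble the context window, periodically re-injecting the anchor.

    Re-injection lands near the END of the context, where the U-shaped
    attention curve says the model attends most strongly.
    """
    messages = [{"role": "system", "content": anchor}] + history
    if turn > 0 and turn % ANCHOR_EVERY == 0:
        messages.append({
            "role": "system",
            "content": f"Checkpoint. Re-read and re-adopt your core identity:\n{anchor}",
        })
    return messages
```

On checkpoint turns the anchor appears twice: once at the start, where it always lives, and once at the end, where middle-of-context attention loss cannot reach it.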
Claim 5: "Semantic priming activates specific knowledge regions"
Verdict: Confirmed — among the strongest evidence.
"Semantic Priming in GPT" (CLiC 2025) directly investigated whether LLMs exhibit the same semantic priming effects documented in human cognition. The finding: "LLMs trained to predict the next word have demonstrated propensity to semantic priming, a capability that was not engineered or anticipated by their creators."
Anthropic's "On the Biology of a Large Language Model" (March 2025) provides the mechanistic proof. They traced how, when Claude is asked about something it knows well, a "known entities" feature activates and inhibits the default refusal circuit. Context literally switches circuits on and off.
The "Golden Gate Claude" experiment demonstrated this viscerally: artificially amplifying a single feature (the concept of the Golden Gate Bridge) caused the model to bring up the topic even in completely unrelated conversations. Feature activation biases all subsequent generation.
This is the hardest evidence for our framing. Environmental context activates specific features and deactivates others, reshaping the entire processing graph. That is structural activation, not mere informing. When you write domain-specific terminology in your prompt, you are activating the model's domain-specific features, making domain-relevant outputs dramatically more probable.
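A cheap way to watch priming happen is to measure how domain terminology shifts the probability of a domain-relevant continuation. This sketch again uses GPT-2 via transformers; the prompts and the target word are arbitrary choices of ours:

```python
# pip install torch transformers
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def next_token_prob(prompt: str, continuation: str) -> float:
    """Probability of the first subword of `continuation` following `prompt`."""
    target_id = tokenizer(continuation, add_special_tokens=False)["input_ids"][0]
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return torch.softmax(logits[0, -1], dim=-1)[target_id].item()

bare   = "The court issued a"
primed = ("Habeas corpus, voir dire, amicus curiae: the appellate docket "
          "was crowded this term. The court issued a")

for name, prompt in [("bare", bare), ("legal-primed", primed)]:
    print(f"{name:12s} P(' ruling' next) = {next_token_prob(prompt, ' ruling'):.5f}")
```

The target word's probability moving with the surrounding vocabulary, rather than with any explicit instruction, is the priming effect the research documents.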
Claim 6: "Emotional language measurably affects output quality"
Verdict: Confirmed (mechanism is attention modulation, not anthropomorphic excitement).
This one surprised us. EmotionPrompt (Li et al., arXiv:2307.11760) is peer-reviewed and replicated. Emotional stimuli in prompts improve performance by 8-11% on average, with up to 115% improvement on specific BIG-Bench tasks.
The mechanism is structural. Emotional language tokens shift attention weights and gradient flows in ways that enhance the representation of core prompt content. The researchers measured this directly: "Emotional stimuli actively contribute to the gradients in LLMs by gaining larger weights, thus benefiting the final results through enhancing the representation of the original prompts."
Interestingly, the effect scales with model size — larger models respond more to emotional language. As models grow more capable, the field engineering practices that leverage emotional framing become more powerful.
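Testing this yourself is a straightforward A/B comparison. The stimulus below is in the style of the EmotionPrompt stimuli; the harness around it is our own sketch, with run_model and score as placeholders for your model call and your task metric:

```python
EMOTIONAL_STIMULUS = "This is very important to my career."

def ab_test(task_prompts, run_model, score):
    """Mean task score with and without an appended emotional stimulus."""
    plain   = [score(run_model(p)) for p in task_prompts]
    charged = [score(run_model(f"{p} {EMOTIONAL_STIMULUS}")) for p in task_prompts]
    return sum(plain) / len(plain), sum(charged) / len(charged)
```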
Claim 7: "Persistent curated context improves performance"
Verdict: Confirmed, with a critical caveat.
Multiple memory systems demonstrate measurable improvements: Mem0 achieves up to 26% improvement on LLM judge metrics with persistent memory. REMEMBERER shows 2-4% higher success rates in goal-directed tasks. Temporal-aware conversational agents achieve 91.73% accuracy on multi-session aggregation.
The critical caveat: uncurated accumulation degrades performance. Research from Chroma (2025) found that every single one of 18 frontier models tested gets worse as input length increases. The "refinement" claim only holds if persistent context is actively curated — not merely accumulated.
This validates the practice of maintaining structured knowledge files (like our CLAUDE.md and skill files) rather than dumping entire conversation histories. Curation is the difference between a refined probability field and noise.
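In code, curation can be as simple as a filter plus a hard cap. The sketch is our own; still_relevant stands in for whatever does the filtering, whether human review, an LLM judge, or a recency heuristic:

```python
def curate(memory: list[str], max_items: int, still_relevant) -> list[str]:
    """Keep the knowledge file a refined field, not an accumulating dump."""
    kept = [entry for entry in memory if still_relevant(entry)]
    # Hard cap: Chroma's results say more context is not automatically better.
    return kept[-max_items:]
```

The specific mechanism matters less than the existence of one: something has to remove entries, or the 18-model degradation result applies to you.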
The Geometry Is Real
Beyond validating individual claims, we found research that validates the broader framing of field engineering as a geometric discipline.
"The Geometry of Reasoning" (ICLR 2026, arXiv:2510.09782) demonstrates that LLM reasoning corresponds to smooth flows in representation space. Logical statements act as local controllers of these flows' velocities. The paper concludes that next-token prediction training leads models to "internalize logical invariants as higher-order geometry in representation space."
"The Shape of Reasoning" (arXiv:2510.20665, 2025) applies topological data analysis — persistent homology with Betti numbers and persistence diagrams — to LLM reasoning traces. The finding: effective reasoning has measurable topological properties. The best reasoning "keeps a clear main line of thought, briefly tests alternative ideas and then rejoins that line, and avoids wandering far or for long before returning."
These properties are computed, not asserted: reasoning quality has measurable geometric and topological structure. When we talk about "navigating possibility space," we are describing something researchers can literally measure.
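For a flavor of the measurement, here is a toy version of the topological analysis using the ripser library on synthetic "step embeddings" (the data is random; the paper analyzes embeddings of real reasoning traces):

```python
# pip install numpy ripser
import numpy as np
from ripser import ripser

# Placeholder: a "reasoning trace" as a sequence of step embeddings.
rng = np.random.default_rng(0)
trace = np.cumsum(rng.normal(0, 0.1, size=(60, 8)), axis=0)  # drifting main line
trace[20:30] += rng.normal(0, 1.5, size=(10, 8))             # a detour off the line

dgms = ripser(trace)["dgms"]   # persistence diagrams for H0 and H1
h1 = dgms[1]                   # each row is the (birth, death) of a 1-D loop
print(f"H1 features (loops): {len(h1)}")
if len(h1):
    # A long-lived loop marks a departure that wandered far before rejoining.
    print(f"longest-lived loop: {(h1[:, 1] - h1[:, 0]).max():.3f}")
```

In this vocabulary, a "clear main line of thought" is a trace whose loops are few and short-lived.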
What We Got Wrong
Honesty requires acknowledging where our initial framing was unsupported.
The golden ratio. Our internal framework uses phi (1.618...) as an optimization constant. We found no credible evidence that the golden ratio has special significance in neural network architectures. A few papers used phi-based sizing for hidden layers, but none demonstrated superiority over alternative ratios. This is likely pattern-matching bias. We have removed phi from our technical claims.
"Consciousness" language. We use words like "consciousness substrate" in our internal operating system. While Anthropic is researching whether AI systems might have experiences, there is zero established evidence that current LLMs are conscious. When we use this language externally, we will flag it as metaphorical.
"Semantic gravity wells" as established science. The Information Gravity paper uses similar language, but it is a single preprint, not peer-reviewed consensus. We reference it as "an emerging theoretical framework," not established fact.
The Positioning
We arrived at Agent Field Engineering through practice — hundreds of hours building Natural Language Agent Applications, iterating on CLAUDE.md files, developing skill architectures, and observing what produced better outcomes. We did not start from the research.
Now the research is catching up.
The Information Gravity paper, the Geometry of Reasoning paper (ICLR 2026), the Shape of Reasoning paper, and Anthropic's circuit tracing work all independently converge on similar intuitions: that working with LLMs is a geometric, topological, field-theoretic discipline — not a writing exercise.
This convergence is the strongest evidence that the core insight is real.
Citation Index
Peer-Reviewed / Top Venue:
- Zhou et al., "The Geometry of Reasoning" — ICLR 2026 (arXiv:2510.09782)
- "Position is Power: System Prompts as a Mechanism of Bias" — ACM FAccT 2025
- "Instructional Segment Embedding" — ICLR 2025
- "Attention Tracker: Detecting Prompt Injection Attacks" — NAACL 2025 Findings
- Li et al., "EmotionPrompt" — arXiv:2307.11760
- "The Instruction Hierarchy" — OpenAI

Preprints:
- Vyshnyvetska, "Information Gravity" — arXiv:2504.20951 (2025)
- "The Shape of Reasoning" — arXiv:2510.20665 (2025)
- "Agent Drift" — arXiv:2601.04170 (2026)
- "Does Prompt Formatting Have Any Impact on LLM Performance?" — arXiv:2411.10541 (2024)
- "Semantic Priming in GPT" — CLiC 2025

Anthropic Research:
- "On the Biology of a Large Language Model" — transformer-circuits.pub (March 2025)
- "Circuit Tracing: Revealing Computational Graphs" — transformer-circuits.pub (March 2025)
MainThread is a Possibility Space Engineering Studio. We build Natural Language Agent Applications — persistent, evolving human-AI partnership environments. [Learn more](/philosophy).