Building World Model Agents for Onchain Games
Most game agents react. They see a state, they pick an action. It works—until it doesn't. When the game gets complex, when opponents adapt, when you need to think three moves ahead, reactive agents hit a wall.
World model agents think differently. They build an internal model of how the game works, then imagine what happens if they take different actions. They simulate futures before committing to one. This is closer to how humans actually play games—we don't just react to the board, we picture what happens next.
And here's the twist: onchain games are uniquely suited for this approach. The state is fully observable, transitions are deterministic, and historical data is permanently indexed. You don't have to guess how the game engine works—you can learn it from data that can't lie.
This article is technical. We'll cover the theory, the architecture, and actual implementation patterns. By the end, you'll understand how to build agents that don't just play games—they understand them.
What World Models Actually Are
Let's be precise about terminology. A world model is a learned function that predicts future states given current state and actions.
More formally:
s_{t+1} = f(s_t, a_t)

Where s_t is the state at time t, a_t is the action taken, and f is your world model. Planning is just composing f with itself: s_{t+2} = f(f(s_t, a_t), a_{t+1}), and so on out to your planning horizon. Simple in principle, complicated in practice.
The Classic Frame: Ha & Schmidhuber's Architecture
The 2018 "World Models" paper by Ha and Schmidhuber laid out an architecture that's become the standard reference point. It has three components:
- Vision Model (V): Compresses raw observations into a compact latent representation
- Memory Model (M): An RNN (they used MDN-RNN) that predicts future latent states
- Controller (C): A small neural network that selects actions given the latent state
The key insight was training inside the "dream"—the controller can be trained entirely within the world model's imagination, without touching the real environment. This is massively more sample-efficient than reinforcement learning directly on the environment.
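The three components can be sketched in a few lines of PyTorch. This is an illustrative skeleton with invented dimensions, not the paper's actual implementation (Ha and Schmidhuber used a VAE for V and an MDN-RNN for M):

```python
import torch
import torch.nn as nn

# Illustrative V / M / C skeleton; dimensions are made up for the sketch.
LATENT, HIDDEN, N_ACTIONS = 32, 64, 4

vision = nn.Linear(100, LATENT)                      # V: observation -> latent (stand-in for a VAE encoder)
memory = nn.GRUCell(LATENT + N_ACTIONS, HIDDEN)      # M: carries predictive state forward (stand-in for MDN-RNN)
controller = nn.Linear(LATENT + HIDDEN, N_ACTIONS)   # C: tiny policy over (latent, memory)

obs = torch.randn(1, 100)
h = torch.zeros(1, HIDDEN)

z = vision(obs)                                      # compress observation
action_logits = controller(torch.cat([z, h], dim=-1))
action = torch.zeros(1, N_ACTIONS)
action[0, action_logits.argmax()] = 1.0              # one-hot chosen action
h = memory(torch.cat([z, action], dim=-1), h)        # M updates its prediction state

print(action_logits.shape)  # torch.Size([1, 4])
```

The controller is deliberately tiny: all the heavy lifting lives in V and M, which can be trained from logged data, while C can be optimized entirely inside simulated rollouts.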
Why Latent Spaces Matter
You might wonder: why not just predict raw states directly? For visual games, that means predicting pixels. This works but it's wasteful. Most pixels in a game frame are irrelevant noise—background textures, UI elements, decorative particles.
A latent space compresses the observation to what matters strategically. In the learned representation, a "dangerous enemy nearby" and "safe path ahead" might be just a few dimensions, not thousands of pixels.
For onchain games, raw observations aren't pixels—they're structured data from Torii. But the principle still applies. A game might have hundreds of entity properties, but the strategically relevant state might be much smaller. Your world model learns that compression automatically.
# Raw game state from Torii: hundreds of fields
raw_state = {
"player": {"x": 45, "y": 120, "health": 80, "mana": 35, ...},
"enemies": [...], # Could be dozens
"items": [...],
"terrain": [...],
"events": [...],
# ... many more
}
# Latent representation: 64-dimensional vector
# Captures: "medium health, near enemy cluster, low resources"
latent = encoder(raw_state)  # shape: (64,)

The encoder learns which aspects of raw_state actually affect what happens next. Everything else gets discarded.
Determinism Changes Everything
Here's where onchain games diverge from the scenarios most world model research tackles. Traditional game AI deals with:
- Partially observable states (fog of war, hidden information)
- Stochastic transitions (random enemy spawns, dice rolls)
- Complex physics that are hard to model exactly
Onchain games, by design, have:
- Fully observable state: Everything is on-chain. You can query every player's position, every item's location, every variable in every contract.
- Deterministic transitions: Given the same state and action, you get the same next state. Always. (Even "random" events use verifiable randomness with known seeds.)
- Perfect historical record: Torii indexes every state change. You have complete ground truth for training.
This means your world model can be exact for the game mechanics. Not a statistical approximation—an exact function. The only uncertainty is what other players will do.
# In a stochastic environment (traditional games):
# Model must learn P(s_{t+1} | s_t, a_t) - a probability distribution
# In a deterministic onchain game:
# Model learns s_{t+1} = f(s_t, a_t) - a direct mapping
# Uncertainty only in opponent actions

This changes the architecture. You don't need probabilistic world models with variance estimation. You need accurate state prediction, plus opponent modeling as a separate concern.
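One practical consequence: for simple mechanics you may not need a learned model at all; you can mirror the contract's transition rules directly. A toy sketch, where the grid game and its rules are invented for illustration:

```python
from dataclasses import dataclass

# Toy illustration: with deterministic mechanics, f(s, a) can be written
# exactly, mirroring (hypothetical) contract logic instead of approximating it.
@dataclass(frozen=True)
class State:
    x: int
    y: int
    health: int

MOVES = {"right": (1, 0), "left": (-1, 0), "up": (0, 1), "down": (0, -1)}
GRID = 100  # assumed map size

def transition(s: State, action: str) -> State:
    """Exact next-state function: the same (s, a) always yields the same s'."""
    dx, dy = MOVES[action]
    return State(
        x=max(0, min(GRID - 1, s.x + dx)),
        y=max(0, min(GRID - 1, s.y + dy)),
        health=s.health,
    )

s = State(x=5, y=7, health=80)
assert transition(s, "right") == transition(s, "right")  # determinism
print(transition(s, "right"))  # State(x=6, y=7, health=80)
```

In practice you would still learn an encoder and use the learned model for parts of the dynamics that are expensive to reimplement, but exact sub-models like this slot cleanly into the same rollout machinery.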
The Architecture: Perception → World Model → Planning → Action
Let's break down the complete pipeline for a world model agent on Dojo.
1. Perception: Torii → Structured State
Your agent starts by querying Torii. Here's a practical Python setup:
import httpx
from dataclasses import dataclass
from typing import List, Optional
import json
TORII_URL = "https://api.cartridge.gg/x/your-game/torii/graphql"
@dataclass
class PlayerState:
address: str
position_x: int
position_y: int
health: int
resources: dict
inventory: List[str]
@dataclass
class GameState:
tick: int
players: List[PlayerState]
world_entities: List[dict]
async def fetch_game_state(client: httpx.AsyncClient) -> GameState:
"""Fetch complete game state from Torii."""
query = """
query GetWorld {
playerModels(first: 100) {
edges {
node {
address
position_x
position_y
health
resources
inventory
}
}
}
worldModels(first: 1) {
edges {
node {
current_tick
}
}
}
entityModels(first: 500) {
edges {
node {
entity_id
entity_type
position_x
position_y
properties
}
}
}
}
"""
response = await client.post(
TORII_URL,
json={"query": query}
)
data = response.json()["data"]
players = [
PlayerState(
address=edge["node"]["address"],
position_x=edge["node"]["position_x"],
position_y=edge["node"]["position_y"],
health=edge["node"]["health"],
resources=json.loads(edge["node"]["resources"]),
inventory=json.loads(edge["node"]["inventory"])
)
for edge in data["playerModels"]["edges"]
]
tick = data["worldModels"]["edges"][0]["node"]["current_tick"]
entities = [edge["node"] for edge in data["entityModels"]["edges"]]
    return GameState(tick=tick, players=players, world_entities=entities)

Notice we're pulling everything—not just our own player. World models need to understand the complete game state to predict what happens when we act.
2. State Encoding: Structured Data → Latent Vector
The world model operates on fixed-size vectors, not variable-length GraphQL responses. We need an encoder:
import torch
import torch.nn as nn
from typing import Tuple
class GameStateEncoder(nn.Module):
"""Encode variable-size game state into fixed latent vector."""
def __init__(
self,
player_dim: int = 32,
entity_dim: int = 16,
latent_dim: int = 128,
max_players: int = 50,
max_entities: int = 200
):
super().__init__()
self.max_players = max_players
self.max_entities = max_entities
# Player encoder
self.player_encoder = nn.Sequential(
nn.Linear(10, 64), # position, health, resources
nn.ReLU(),
nn.Linear(64, player_dim)
)
# Entity encoder
self.entity_encoder = nn.Sequential(
nn.Linear(8, 32),
nn.ReLU(),
nn.Linear(32, entity_dim)
)
# Attention pooling for variable-length sequences
self.player_attention = nn.MultiheadAttention(
embed_dim=player_dim,
num_heads=4,
batch_first=True
)
self.entity_attention = nn.MultiheadAttention(
embed_dim=entity_dim,
num_heads=4,
batch_first=True
)
# Final projection
self.project = nn.Sequential(
nn.Linear(player_dim + entity_dim, 256),
nn.ReLU(),
nn.Linear(256, latent_dim)
)
def forward(
self,
player_features: torch.Tensor, # (batch, n_players, 10)
entity_features: torch.Tensor, # (batch, n_entities, 8)
player_mask: torch.Tensor,
entity_mask: torch.Tensor
) -> torch.Tensor:
# Encode each player/entity
players_encoded = self.player_encoder(player_features)
entities_encoded = self.entity_encoder(entity_features)
# Attention pooling with masks
players_pooled, _ = self.player_attention(
players_encoded, players_encoded, players_encoded,
key_padding_mask=~player_mask
)
entities_pooled, _ = self.entity_attention(
entities_encoded, entities_encoded, entities_encoded,
key_padding_mask=~entity_mask
)
# Mean pool over sequence
player_repr = (players_pooled * player_mask.unsqueeze(-1)).sum(1)
player_repr = player_repr / player_mask.sum(1, keepdim=True).clamp(min=1)
entity_repr = (entities_pooled * entity_mask.unsqueeze(-1)).sum(1)
entity_repr = entity_repr / entity_mask.sum(1, keepdim=True).clamp(min=1)
# Combine and project
combined = torch.cat([player_repr, entity_repr], dim=-1)
return self.project(combined)
def preprocess_game_state(state: GameState, my_address: str) -> Tuple[torch.Tensor, torch.Tensor]:
"""Convert GameState to tensors for the encoder."""
# Player features: relative position to me, health, resources
my_player = next(p for p in state.players if p.address == my_address)
player_features = []
for p in state.players:
features = [
p.position_x - my_player.position_x, # Relative x
p.position_y - my_player.position_y, # Relative y
p.health / 100.0, # Normalized health
p.resources.get("wood", 0) / 1000.0,
p.resources.get("stone", 0) / 1000.0,
p.resources.get("gold", 0) / 1000.0,
1.0 if p.address == my_address else 0.0, # Is self
len(p.inventory) / 10.0, # Inventory fullness
# ... more features
]
player_features.append(features)
# Pad to fixed size
while len(player_features) < 50:
player_features.append([0.0] * 10)
# Entity features: type, position, properties
entity_features = []
for e in state.world_entities:
features = [
e["position_x"] - my_player.position_x,
e["position_y"] - my_player.position_y,
hash(e["entity_type"]) % 100 / 100.0, # Type embedding
# ... extract relevant properties
]
entity_features.append(features)
while len(entity_features) < 200:
entity_features.append([0.0] * 8)
return (
torch.tensor(player_features[:50], dtype=torch.float32),
torch.tensor(entity_features[:200], dtype=torch.float32)
    )

This encoder handles variable numbers of players and entities through attention pooling—a technique borrowed from transformer architectures. It's more sophisticated than simple concatenation but handles the variability gracefully.
3. The World Model: Predicting Future States
Now the core: a model that predicts what happens when actions are taken.
class WorldModel(nn.Module):
"""
Deterministic world model for onchain games.
Predicts next latent state given current state and action.
"""
def __init__(
self,
latent_dim: int = 128,
action_dim: int = 16, # One-hot or embedded action
hidden_dim: int = 256
):
super().__init__()
# Action embedding
self.action_embed = nn.Embedding(20, action_dim) # 20 possible actions
# Transition model
self.transition = nn.Sequential(
nn.Linear(latent_dim + action_dim, hidden_dim),
nn.LayerNorm(hidden_dim),
nn.ReLU(),
nn.Linear(hidden_dim, hidden_dim),
nn.LayerNorm(hidden_dim),
nn.ReLU(),
nn.Linear(hidden_dim, latent_dim)
)
# Reward predictor (auxiliary task)
self.reward_head = nn.Sequential(
nn.Linear(latent_dim, 64),
nn.ReLU(),
nn.Linear(64, 1)
)
# Done predictor (game over / death)
self.done_head = nn.Sequential(
nn.Linear(latent_dim, 64),
nn.ReLU(),
nn.Linear(64, 1),
nn.Sigmoid()
)
def forward(
self,
state: torch.Tensor, # (batch, latent_dim)
action: torch.Tensor # (batch,) action indices
) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
"""Predict next state, reward, and done probability."""
action_emb = self.action_embed(action)
combined = torch.cat([state, action_emb], dim=-1)
next_state = self.transition(combined)
reward = self.reward_head(next_state)
done = self.done_head(next_state)
return next_state, reward.squeeze(-1), done.squeeze(-1)
def rollout(
self,
initial_state: torch.Tensor,
action_sequence: torch.Tensor, # (batch, horizon)
) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
"""
Simulate a sequence of actions.
Returns trajectory of states, rewards, and done flags.
"""
batch_size, horizon = action_sequence.shape
states = [initial_state]
rewards = []
dones = []
current_state = initial_state
for t in range(horizon):
next_state, reward, done = self(current_state, action_sequence[:, t])
states.append(next_state)
rewards.append(reward)
dones.append(done)
current_state = next_state
return (
torch.stack(states, dim=1), # (batch, horizon+1, latent_dim)
torch.stack(rewards, dim=1), # (batch, horizon)
torch.stack(dones, dim=1) # (batch, horizon)
        )

The rollout method is where the magic happens. Given an initial state and a sequence of planned actions, the model imagines the entire trajectory. This lets us evaluate plans before executing them.
4. Training: Learning from Torii History
Here's the beautiful part of onchain games: you have perfect training data. Torii stores every state transition that ever happened.
async def fetch_historical_transitions(
client: httpx.AsyncClient,
limit: int = 10000
) -> List[dict]:
"""
Fetch historical state transitions from Torii.
Each transition is a (state, action, next_state, reward) tuple.
"""
query = """
query GetHistory($limit: Int!) {
eventMessagesHistorical(
first: $limit,
orderBy: { field: TIMESTAMP, direction: DESC }
) {
edges {
node {
keys
data
transactionHash
createdAt
}
}
}
}
"""
response = await client.post(
TORII_URL,
json={"query": query, "variables": {"limit": limit}}
)
events = response.json()["data"]["eventMessagesHistorical"]["edges"]
# Parse events into transitions
transitions = []
for event in events:
parsed = parse_game_event(event["node"])
if parsed:
transitions.append(parsed)
return transitions
def parse_game_event(event: dict) -> Optional[dict]:
"""
Parse raw event into a structured transition.
This is game-specific—adapt to your game's events.
"""
keys = event["keys"]
data = event["data"]
# Example: "PlayerMoved" event
if "PlayerMoved" in keys[0]:
return {
"event_type": "move",
"player": keys[1],
"from_x": int(data[0], 16),
"from_y": int(data[1], 16),
"to_x": int(data[2], 16),
"to_y": int(data[3], 16),
"timestamp": event["createdAt"]
}
# Example: "ResourceHarvested" event
if "ResourceHarvested" in keys[0]:
return {
"event_type": "harvest",
"player": keys[1],
"resource": data[0],
"amount": int(data[1], 16),
"timestamp": event["createdAt"]
}
return None
class TransitionDataset(torch.utils.data.Dataset):
"""Dataset of (state, action, next_state, reward) tuples."""
def __init__(self, transitions: List[dict], encoder: GameStateEncoder):
self.transitions = transitions
self.encoder = encoder
self.data = self._process_transitions()
def _process_transitions(self):
# Group events by timestamp to reconstruct state snapshots
# This is the tricky part—you need to reconstruct
# what the full game state was at each transition
processed = []
for i, transition in enumerate(self.transitions[:-1]):
# Get state before and after
state_before = self._reconstruct_state(i)
state_after = self._reconstruct_state(i + 1)
action = self._extract_action(transition)
reward = self._compute_reward(transition)
processed.append({
"state": state_before,
"action": action,
"next_state": state_after,
"reward": reward
})
return processed
def __len__(self):
return len(self.data)
def __getitem__(self, idx):
item = self.data[idx]
return {
"state": torch.tensor(item["state"], dtype=torch.float32),
"action": torch.tensor(item["action"], dtype=torch.long),
"next_state": torch.tensor(item["next_state"], dtype=torch.float32),
"reward": torch.tensor(item["reward"], dtype=torch.float32)
}
def train_world_model(
model: WorldModel,
dataset: TransitionDataset,
epochs: int = 100,
batch_size: int = 64,
lr: float = 1e-4
):
"""Train the world model on historical transitions."""
dataloader = torch.utils.data.DataLoader(
dataset,
batch_size=batch_size,
shuffle=True
)
optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
state_loss_fn = nn.MSELoss()
reward_loss_fn = nn.MSELoss()
for epoch in range(epochs):
total_loss = 0.0
for batch in dataloader:
optimizer.zero_grad()
pred_next, pred_reward, pred_done = model(
batch["state"],
batch["action"]
)
# State prediction loss
state_loss = state_loss_fn(pred_next, batch["next_state"])
# Reward prediction loss
reward_loss = reward_loss_fn(pred_reward, batch["reward"])
# Combined loss
loss = state_loss + 0.1 * reward_loss
loss.backward()
optimizer.step()
total_loss += loss.item()
if epoch % 10 == 0:
avg_loss = total_loss / len(dataloader)
            print(f"Epoch {epoch}: Loss = {avg_loss:.4f}")

The training loop is standard supervised learning. The key insight is that Torii gives you perfect supervision. Every state transition that occurred in the game's history is recorded and queryable. You're not estimating—you're learning from ground truth.
5. Planning: Imagining Futures
With a trained world model, your agent can simulate futures before acting. The simplest approach is random shooting:
class PlanningAgent:
"""
Agent that uses world model for planning.
Evaluates many action sequences and picks the best.
"""
def __init__(
self,
encoder: GameStateEncoder,
world_model: WorldModel,
n_candidates: int = 256,
horizon: int = 10
):
self.encoder = encoder
self.world_model = world_model
self.n_candidates = n_candidates
self.horizon = horizon
self.n_actions = 20 # Number of possible actions
def plan(self, current_state: torch.Tensor) -> int:
"""
Find the best action by simulating many possible futures.
Returns the first action of the best trajectory.
"""
# Generate random action sequences
action_sequences = torch.randint(
0, self.n_actions,
(self.n_candidates, self.horizon)
)
# Expand state for all candidates
states = current_state.unsqueeze(0).expand(self.n_candidates, -1)
# Simulate all candidates
with torch.no_grad():
_, rewards, dones = self.world_model.rollout(
states,
action_sequences
)
# Compute trajectory values (sum of discounted rewards)
discount = 0.99
discounts = discount ** torch.arange(self.horizon)
# Mask rewards after done
done_mask = torch.cumprod(1 - dones, dim=1)
masked_rewards = rewards * done_mask
trajectory_values = (masked_rewards * discounts).sum(dim=1)
# Pick best trajectory
best_idx = trajectory_values.argmax()
best_action = action_sequences[best_idx, 0].item()
return best_action
async def act(self, game_state: GameState, my_address: str) -> int:
"""Full pipeline: observe → encode → plan → return action."""
# Encode current state
player_features, entity_features = preprocess_game_state(
game_state,
my_address
)
latent = self.encoder(
player_features.unsqueeze(0),
entity_features.unsqueeze(0),
torch.ones(1, 50, dtype=torch.bool),
torch.ones(1, 200, dtype=torch.bool)
)
# Plan best action
action = self.plan(latent.squeeze(0))
        return action

Random shooting works surprisingly well for short horizons. For longer planning, you'd want something smarter:
- Cross-Entropy Method (CEM): Iteratively refine the action distribution toward higher rewards
- Model Predictive Control (MPC): Re-plan every step as new observations arrive
- Monte Carlo Tree Search (MCTS): Build a search tree, especially powerful for discrete action games
def cem_planning(
world_model: WorldModel,
initial_state: torch.Tensor,
n_iterations: int = 5,
n_candidates: int = 256,
n_elite: int = 32,
horizon: int = 10,
n_actions: int = 20
) -> int:
"""
Cross-Entropy Method planning.
Iteratively improves action distribution toward high-reward sequences.
"""
# Initialize uniform action distribution
action_means = torch.ones(horizon, n_actions) / n_actions
for iteration in range(n_iterations):
# Sample action sequences from current distribution
action_probs = action_means.unsqueeze(0).expand(n_candidates, -1, -1)
action_sequences = torch.multinomial(
action_probs.reshape(-1, n_actions),
num_samples=1
).reshape(n_candidates, horizon)
# Evaluate all sequences
states = initial_state.unsqueeze(0).expand(n_candidates, -1)
with torch.no_grad():
_, rewards, dones = world_model.rollout(states, action_sequences)
# Compute trajectory values
discount = 0.99
discounts = discount ** torch.arange(horizon)
done_mask = torch.cumprod(1 - dones, dim=1)
values = (rewards * done_mask * discounts).sum(dim=1)
# Select elite trajectories
elite_indices = values.topk(n_elite).indices
elite_sequences = action_sequences[elite_indices]
# Update action distribution toward elite actions
for t in range(horizon):
counts = torch.zeros(n_actions)
for seq in elite_sequences:
counts[seq[t]] += 1
action_means[t] = counts / n_elite
# Add small uniform noise to prevent collapse
action_means = 0.9 * action_means + 0.1 / n_actions
# Return most likely first action
    return action_means[0].argmax().item()

CEM typically finds better plans than random shooting because it focuses sampling on promising regions of action space.
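Model Predictive Control, listed above, is less a new planner than a loop discipline: re-plan from the freshly observed state every step, and execute only the first action of each plan. A minimal sketch with placeholder callables (plan_fn, observe_fn, and execute_fn are hypothetical stand-ins for your planner, Torii query, and Controller call):

```python
# MPC sketch: re-plan every step from the latest observation and commit
# only to the first action of each full-horizon plan.
def mpc_loop(plan_fn, observe_fn, execute_fn, n_steps=20):
    executed = []
    for _ in range(n_steps):
        state = observe_fn()       # fresh state each iteration
        action = plan_fn(state)    # plan a full horizon, but...
        execute_fn(action)         # ...only the first action is executed
        executed.append(action)
    return executed

# Toy usage with stub functions standing in for the real pipeline
actions = mpc_loop(
    plan_fn=lambda s: s % 4,
    observe_fn=iter(range(100)).__next__,
    execute_fn=lambda a: None,
    n_steps=4,
)
print(actions)  # [0, 1, 2, 3]
```

The re-planning is what makes model error tolerable: even if a ten-step rollout drifts from reality, only the first step ever gets executed before the model is re-anchored to the true state.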
6. Action Execution: Controller CLI
Once planning selects an action, execute it via Controller:
import subprocess
import json
ACTION_MAPPING = {
0: {"entrypoint": "move", "calldata_fn": lambda: ["1", "0"]}, # Move right
1: {"entrypoint": "move", "calldata_fn": lambda: ["-1", "0"]}, # Move left
2: {"entrypoint": "move", "calldata_fn": lambda: ["0", "1"]}, # Move up
3: {"entrypoint": "move", "calldata_fn": lambda: ["0", "-1"]}, # Move down
4: {"entrypoint": "harvest", "calldata_fn": lambda: []},
5: {"entrypoint": "attack", "calldata_fn": lambda: []},
# ... more actions
}
GAME_CONTRACT = "0x..." # Your game's contract address
RPC_URL = "https://api.cartridge.gg/x/starknet/sepolia"
def execute_action(action_id: int) -> dict:
"""Execute a planned action via Controller CLI."""
if action_id not in ACTION_MAPPING:
raise ValueError(f"Unknown action: {action_id}")
action = ACTION_MAPPING[action_id]
calldata = action["calldata_fn"]()
    cmd = [
        "controller", "execute",
        "--contract", GAME_CONTRACT,
        "--entrypoint", action["entrypoint"],
        "--rpc-url", RPC_URL,
        "--json"
    ]
    if calldata:
        # Only pass --calldata when the action takes arguments
        cmd += ["--calldata", ",".join(calldata)]
result = subprocess.run(cmd, capture_output=True, text=True)
if result.returncode != 0:
raise RuntimeError(f"Execution failed: {result.stderr}")
    return json.loads(result.stdout)

Handling Opponents: The Multi-Agent Problem
So far we've modeled the world as if our agent acts alone. Real games have opponents. Their actions affect our state too.
There are several approaches:
1. Opponent Modeling as Distribution
Treat opponent actions as a learned distribution:
class OpponentModel(nn.Module):
"""Predict probability distribution over opponent's next action."""
def __init__(self, latent_dim: int, n_actions: int):
super().__init__()
self.net = nn.Sequential(
nn.Linear(latent_dim, 128),
nn.ReLU(),
nn.Linear(128, n_actions),
nn.Softmax(dim=-1)
)
def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)

Train this on historical opponent behavior from Torii. Then sample from it during planning:
def plan_with_opponents(
world_model: WorldModel,
opponent_model: OpponentModel,
initial_state: torch.Tensor,
n_candidates: int = 256,
n_opponent_samples: int = 8
) -> int:
"""Plan considering likely opponent responses."""
# For each candidate action sequence, simulate multiple
# possible opponent behaviors
my_actions = torch.randint(0, 20, (n_candidates, 10))
total_values = torch.zeros(n_candidates)
for _ in range(n_opponent_samples):
states = initial_state.unsqueeze(0).expand(n_candidates, -1)
trajectory_reward = torch.zeros(n_candidates)
for t in range(10):
# Predict opponent action distribution
opp_probs = opponent_model(states)
opp_actions = torch.multinomial(opp_probs, 1).squeeze()
# Combined state transition (my action + opponent action)
# This requires a world model that handles multi-agent transitions
next_states, rewards, _ = world_model.step_multi_agent(
states,
my_actions[:, t],
opp_actions
)
trajectory_reward += rewards * (0.99 ** t)
states = next_states
total_values += trajectory_reward
# Average across opponent samples
total_values /= n_opponent_samples
    return my_actions[total_values.argmax(), 0].item()

2. Best-Response Planning
Assume opponents play optimally against you. This is conservative—it prevents strategies that only work against weak opponents:
def minimax_planning(world_model, state, depth=3, n_actions=20):
    """
    Minimax search: assume the opponent plays optimally.
    Good for zero-sum competitive games.
    Sketch only: assumes alternating turns and a small discrete action set.
    """
    def evaluate(s, is_my_turn, d):
        if d == 0:
            # Leaf: score the state with the world model's reward head
            return world_model.reward_head(s).item(), None
        if is_my_turn:
            best_value, best_action = -float('inf'), None
            for action in range(n_actions):
                next_s, _, _ = world_model(s, torch.tensor([action]))
                value, _ = evaluate(next_s, False, d - 1)
                if value > best_value:
                    best_value, best_action = value, action
            return best_value, best_action
        else:
            # Opponent's turn: assume they minimize our value
            worst_value = float('inf')
            for action in range(n_actions):
                next_s, _, _ = world_model(s, torch.tensor([action]))
                value, _ = evaluate(next_s, True, d - 1)
                worst_value = min(worst_value, value)
            return worst_value, None

    _, best_action = evaluate(state, True, depth)
    return best_action

3. Population-Based Opponent Modeling
Train your world model against a diverse population of agent policies. This creates robust behavior:
# Train world model that's seen many opponent types
opponent_policies = load_diverse_policies() # Historical winners, rule-based, etc.
for epoch in range(epochs):
# Sample opponent from population
opponent = random.choice(opponent_policies)
# Generate rollouts against this opponent
transitions = generate_rollouts(world_model, opponent)
# Train world model on these transitions
    train_step(world_model, transitions)

Challenges and Solutions
Compute Constraints
World models need GPU compute for training and inference. For onchain games, you're running this off-chain—the model lives on your server, not on Starknet.
Solutions:

- Train offline on historical data (cheap, one-time cost)
- Use small models (64-128 dim latent spaces work fine for most games)
- Cache planning results for similar states
- Run inference on CPU for simple games (it's fast enough)
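The caching bullet can be as simple as memoizing the planner on a coarsely bucketed state key, so nearby states reuse one plan. A sketch (the bucketing scheme and placeholder planner are invented for illustration):

```python
from functools import lru_cache

# Sketch of caching planning results for similar states: bucket continuous
# features coarsely, then memoize the expensive planner on the bucket key.
def state_key(x: float, y: float, health: float, grid: int = 10) -> tuple:
    """Coarse bucket so nearby states share one cached plan."""
    return (round(x / grid), round(y / grid), round(health / 25))

CALLS = {"n": 0}

@lru_cache(maxsize=4096)
def cached_plan(key: tuple) -> int:
    CALLS["n"] += 1          # stands in for an expensive rollout search
    return sum(key) % 4      # placeholder "best action"

a1 = cached_plan(state_key(41.0, 12.0, 80.0))
a2 = cached_plan(state_key(43.0, 14.0, 81.0))  # same bucket: cache hit
print(CALLS["n"])  # 1
```

The trade-off is the usual one: coarser buckets mean more cache hits but staler plans, so bucket sizes should reflect how fast your game's state actually changes.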
Partial Observability
Some onchain games have fog of war or hidden information (e.g., opponent's hand in a card game).
Solutions:

- Learn a belief state that represents uncertainty
- Use recurrent models that accumulate information over time
- Monte Carlo rollouts over possible hidden states
class BeliefWorldModel(nn.Module):
"""World model that maintains uncertainty over hidden state."""
def __init__(self, obs_dim, hidden_dim, latent_dim):
super().__init__()
# GRU to accumulate observations into belief
self.belief_rnn = nn.GRU(obs_dim, hidden_dim, batch_first=True)
# Standard world model on belief state
        self.transition = WorldModel(latent_dim=hidden_dim)  # world model over belief states
def forward(self, observation_sequence, actions):
# Build belief from observation history
belief, _ = self.belief_rnn(observation_sequence)
belief_current = belief[:, -1] # Final belief state
# Predict from belief
        return self.transition(belief_current, actions)

Distribution Shift
Your world model was trained on historical data. But as you deploy better agents, the game dynamics change—other players adapt, meta shifts. Your model becomes stale.
Solutions:

- Continual learning: fine-tune on recent data periodically
- Ensemble models: train multiple world models, use disagreement as uncertainty
- Online learning: update the model after every episode
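The ensemble bullet can be sketched with stand-in models: train several world models from different seeds and treat prediction variance across them as an uncertainty signal. Here untrained linear layers stand in for independently trained models:

```python
import torch
import torch.nn as nn

# Ensemble-disagreement sketch: K independently initialized models stand in
# for world models trained on different seeds/data shuffles. High variance
# across their predictions flags states the ensemble is unsure about.
torch.manual_seed(0)
LATENT = 16
ensemble = [nn.Linear(LATENT, LATENT) for _ in range(5)]

def disagreement(state: torch.Tensor) -> float:
    """Mean per-dimension std-dev across ensemble predictions."""
    preds = torch.stack([m(state) for m in ensemble])  # (K, latent)
    return preds.std(dim=0).mean().item()

state = torch.randn(LATENT)
score = disagreement(state)
assert score > 0.0  # independently initialized models disagree
print(round(score, 4))
```

During planning, you can penalize trajectories that pass through high-disagreement states, which keeps the agent from exploiting regions where its model is likely wrong.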
def update_world_model_online(model, buffer, batch_size=32):
"""Lightweight online update from recent experience."""
if len(buffer) < batch_size:
return
batch = random.sample(buffer, batch_size)
    # Quick gradient update (assumes a module-level optimizer and compute_loss)
optimizer.zero_grad()
loss = compute_loss(model, batch)
loss.backward()
optimizer.step()
# Keep buffer bounded
while len(buffer) > 10000:
        buffer.pop(0)

The Full Pipeline
Let's put it all together into a complete agent:
import asyncio
import httpx
async def run_world_model_agent(
my_address: str,
game_contract: str,
torii_url: str,
model_path: str
):
"""Complete world model agent loop."""
# Load trained models
encoder = GameStateEncoder()
world_model = WorldModel()
encoder.load_state_dict(torch.load(f"{model_path}/encoder.pt"))
world_model.load_state_dict(torch.load(f"{model_path}/world_model.pt"))
encoder.eval()
world_model.eval()
# Initialize planner
agent = PlanningAgent(encoder, world_model)
# Experience buffer for online learning
experience_buffer = []
async with httpx.AsyncClient() as client:
while True:
# 1. Observe
game_state = await fetch_game_state(client)
# 2. Encode & Plan
action = await agent.act(game_state, my_address)
# 3. Execute
print(f"Executing action {action}")
result = execute_action(action)
print(f"TX: {result.get('transaction_hash')}")
# 4. Wait for state update
await asyncio.sleep(5)
# 5. Observe new state
new_state = await fetch_game_state(client)
# 6. Store experience for online learning
experience_buffer.append({
"state": game_state,
"action": action,
"next_state": new_state,
"reward": compute_reward(game_state, new_state)
})
# 7. Periodic online update
if len(experience_buffer) % 100 == 0:
update_world_model_online(world_model, experience_buffer)
# Loop delay
await asyncio.sleep(30)
if __name__ == "__main__":
asyncio.run(run_world_model_agent(
my_address="0x...",
game_contract="0x...",
torii_url="https://api.cartridge.gg/x/your-game/torii/graphql",
model_path="./trained_models"
    ))

Looking Forward: Shared Infrastructure
Today, every developer trains their own world model for their own game. But there's an interesting future: shared world models.
Imagine a foundation model trained on every Dojo game's transition data. It learns the grammar of onchain games—how entities move, how resources flow, how combat resolves. Fine-tuning this foundation for a specific game would be much cheaper than training from scratch.
The data is already there. Torii indexes it all. The model architectures exist. Someone just needs to build it.
Even more speculative: decentralized world model training. Multiple agents contribute training compute and share the resulting model. The model itself could live on-chain (well, the weights could be committed to IPFS with hashes on-chain). True AI infrastructure for autonomous worlds.
We're not there yet. But the pieces are falling into place.
Conclusion
World model agents represent a shift from reactive to predictive game AI. Instead of learning a policy directly (state → action), they learn a simulator (state + action → next state) and plan within that simulation.
For onchain games built on Dojo:
- Torii gives you perfect training data—every transition ever, with ground truth labels
- Deterministic game logic means your world model can be exact, not probabilistic
- Controller CLI provides secure, scoped execution for agent actions
- The full pipeline—observe, encode, simulate, plan, act—runs off-chain while interacting with on-chain state
Start with the basics: a simple encoder, a feedforward world model, random shooting for planning. Get that working. Then add complexity—better architectures, smarter planning, opponent modeling.
The code in this article is a foundation, not a finished product. The architecture scales from simple experiments to sophisticated autonomous systems. Where you take it depends on how deeply you want to understand the game you're playing.
This article is part of the Agent-Native Games series. Previous: Building Your First Dojo Game Agent
Resources: