Select a task to view its GoalScope visualization
Each visualization shows what an activation oracle reads from the model's hidden state at every generated token — revealing the model's instrumental goals before they are implemented.
When a programmer enters a code block, they know what it's for before typing. Do language models do the same?
We show that a steering scaffold can make the LM introspect on its own latents and verbalize its instrumental goals in natural language before implementing them. This setup requires no modification to the base LM's weights.
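The read-without-modify pattern can be sketched as follows. This is a minimal, hypothetical illustration only: the toy block, the probe, the hook name, and the goal labels are all assumptions for demonstration, not the actual scaffold or oracle used here. The key point it shows is that the oracle only observes activations via a hook; the frozen base model's forward pass is untouched.

```python
# Hypothetical sketch: read a hidden state with a forward hook and let a
# separate "oracle" probe map it to a verbalized goal label. All names and
# labels are illustrative assumptions, not the real implementation.
import torch
import torch.nn as nn

torch.manual_seed(0)

HIDDEN = 16
GOAL_LABELS = ["write-loop", "define-helper", "return-result"]  # assumed labels

# Stand-in for one block of the frozen base LM (weights never updated).
base_block = nn.Linear(HIDDEN, HIDDEN)

# The oracle: a probe trained separately to map latents to goal labels.
oracle = nn.Linear(HIDDEN, len(GOAL_LABELS))

captured = {}

def read_hidden(module, inputs, output):
    # Forward hook: copy the block's output without altering the computation.
    captured["h"] = output.detach()

handle = base_block.register_forward_hook(read_hidden)

# One "token step": run the frozen block, then let the oracle read the latent.
x = torch.randn(1, HIDDEN)
_ = base_block(x)
goal_idx = oracle(captured["h"]).argmax(dim=-1).item()
print(GOAL_LABELS[goal_idx])  # the verbalized instrumental goal at this token

handle.remove()
```

In a real setup the hook would attach to a chosen transformer layer and the probe would be trained on labeled activations; the base weights stay frozen either way.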
Hover over the tokens below to see the LM's immediate instrumental goal (intent) at each step.