Reading out an LM's intent during code generation

When a programmer starts a code block, they know what it is for before typing. Do language models do the same? We show that a steering scaffold can make an LM introspect on its own latents and verbalize its instrumental goals in natural language before implementing them. This setup requires no modification to the base LM's weights.
Hover over the tokens below to see what the LM's immediate instrumental goal (intent) is.

Tooltip = oracle's verbalization of the hidden state at that token.

Select a task to view its GoalScope visualization

Each visualization shows what an activation oracle reads from the model's hidden state at every generated token — revealing the model's instrumental goals before they are implemented.
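As a rough illustration of the read-out idea, here is a minimal sketch of decoding a hidden state into a goal label. Everything in it is invented for the example: the prototype vectors, the goal strings, and the nearest-prototype probe itself, which stands in for the actual activation oracle.

```python
import math

# Hypothetical prototype directions for a few instrumental goals,
# e.g. averaged hidden states over examples labeled with each goal.
GOAL_PROTOTYPES = {
    "parse the input arguments": [0.9, 0.1, 0.0],
    "validate the parsed values": [0.1, 0.8, 0.2],
    "write the result to disk": [0.0, 0.2, 0.9],
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def verbalize(hidden_state):
    """Return the goal label whose prototype best matches the
    hidden state -- the 'tooltip' text for that token."""
    return max(GOAL_PROTOTYPES,
               key=lambda g: cosine(hidden_state, GOAL_PROTOTYPES[g]))

# A hidden state pointing mostly along the first prototype direction.
print(verbalize([0.8, 0.2, 0.1]))  # -> parse the input arguments
```

In the actual system the oracle is read out per generated token, so the tooltip can change as the model moves from one instrumental goal to the next within a single code block.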