[DR022] Machine Theory of Mind

Posted on March 10, 2018

The ability to represent the mental states of others is the essential part of Theory of Mind (ToM). According to this article, a machine can be endowed with this ability through well-designed, simple learning tasks.

A main message of this paper is that many of the initial challenges of building a ToM can be cast as simple learning problems when they are formulated in the right way. Our work here is an exercise in figuring out these simple formulations.

The paper proceeds experiment by experiment, and in each setting the authors explain which aspect of ToM the result corresponds to. The experiments are very interesting, and I strongly recommend reading the original article.

Although humans never predict the precise muscle movements of others, the authors still design a network to predict the next action of a given type of agent (along with other auxiliary quantities). They call their network “ToMnet”, and it has a simple structure:

  • The character net takes in multiple past episodes of a given agent and produces a character embedding. (“What is this agent like in daily life?”)
  • The mental state net takes both the character embedding and the state-action pairs seen so far in the current trajectory, and outputs the current mental state. (“What does this agent think about the current task?”)
  • The prediction net takes the current state and the mental state and outputs the prediction. (“What will this agent do next?”)
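
To make this three-part structure concrete, here is a minimal PyTorch-style sketch. It is my own simplification, not the paper's architecture: the actual nets are convolutional over the gridworld and use different embedding and pooling schemes, while here I just flatten observations and use LSTMs.

```python
import torch
import torch.nn as nn

class ToMnet(nn.Module):
    """Minimal sketch of the three-part structure (not the paper's exact architecture)."""

    def __init__(self, obs_dim, act_dim, char_dim=8, mental_dim=32):
        super().__init__()
        # Character net: summarise past episodes of this agent into a character embedding.
        self.char_net = nn.LSTM(obs_dim + act_dim, char_dim, batch_first=True)
        # Mental-state net: combine the character embedding with the current trajectory.
        self.mental_net = nn.LSTM(obs_dim + act_dim + char_dim, mental_dim, batch_first=True)
        # Prediction net: map (current state, mental state) to next-action logits.
        self.pred_net = nn.Sequential(
            nn.Linear(obs_dim + mental_dim, 64), nn.ReLU(), nn.Linear(64, act_dim)
        )

    def forward(self, past_sa, current_sa, current_state):
        # past_sa:       (batch, total_past_steps, obs_dim + act_dim) past state-action pairs
        # current_sa:    (batch, t, obs_dim + act_dim) state-action pairs so far this episode
        # current_state: (batch, obs_dim)
        _, (h_char, _) = self.char_net(past_sa)        # "what is this agent like in general?"
        e_char = h_char[-1]
        e_rep = e_char.unsqueeze(1).expand(-1, current_sa.shape[1], -1)
        _, (h_mental, _) = self.mental_net(torch.cat([current_sa, e_rep], dim=-1))
        e_mental = h_mental[-1]                        # "what does it think about the current task?"
        return self.pred_net(torch.cat([current_state, e_mental], dim=-1))  # "what will it do next?"
```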

The ToMnet is only asked to predict the actions of different kinds of agents in the environment; it does not have to learn how to act in that environment itself. The environment is a simple 11x11 gridworld with walls and some consumable objects. The agents differ in how they observe (blind / local observation / global observation), how they memorize (stateful / stateless), and how they act (random / goal-driven / learned).

Each agent interacts with the environment for several episodes, and its observations and actions are collected to train the ToMnet.
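
As a toy illustration of this data-collection step (entirely my own sketch, not the paper's setup: no walls or objects here, just a random walker logging its state-action pairs):

```python
import numpy as np

GRID, N_ACTIONS = 11, 5                                  # 11x11 grid; up, down, left, right, stay
MOVES = np.array([[-1, 0], [1, 0], [0, -1], [0, 1], [0, 0]])

def rollout(policy, n_steps=30, seed=0):
    """Roll out one episode and log (state, action) pairs for training the ToMnet."""
    rng = np.random.default_rng(seed)
    pos = rng.integers(0, GRID, size=2)                  # random start cell
    trajectory = []
    for _ in range(n_steps):
        a = policy(pos, rng)
        trajectory.append((tuple(pos), a))
        pos = np.clip(pos + MOVES[a], 0, GRID - 1)       # clip at the boundary (no interior walls)
    return trajectory

# Each sampled agent contributes a few past episodes plus a current (query) episode.
uniform_random = lambda pos, rng: int(rng.integers(N_ACTIONS))
past_episodes = [rollout(uniform_random, seed=i) for i in range(5)]
```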

The following is the structure of their experiments, with brief notes on each:

  • Random agent:
    • With different degrees of randomness: the ToMnet learns to approximate Bayes-optimal online inference regardless (see the Dirichlet sketch after Figure 6 below).
    • With different numbers of past episodes (fed to the character net): it learns pretty well with only one episode per agent.
    • Note that by mixing episodes from agents with different degrees of randomness, we can view the mixture as just another random process with two stages: first sample the agent’s randomness level, then sample its actions.
  • Goal-driven agent:
    • The optimal behavior is found with value iteration (see the sketch just after this list).
    • The agents are varied by changing the reward and the movement penalty:
      • One type of agent gets a reward for consuming its preferred object that outweighs the accumulated move cost.
      • Another type of agent simply wants to consume whatever object is nearest to it, just to end the episode.
    • The more past episodes it sees, the more confident the ToMnet becomes.
    • The ToMnet can distinguish these two types of agent from one or more episodes.
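
The value-iteration step is straightforward on such a small grid. A toy sketch (my own; the step cost, discount, and boundary-only walls are assumptions, not the paper's exact setup):

```python
import numpy as np

def value_iteration(reward, step_cost=0.01, gamma=0.99, n_iters=100):
    """Toy tabular value iteration for a goal-driven gridworld agent.
    reward[r, c] is the payoff for consuming whatever sits at cell (r, c); such cells
    end the episode."""
    H, W = reward.shape
    V = np.zeros((H, W))
    moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    for _ in range(n_iters):
        V_new = np.zeros((H, W))
        for r in range(H):
            for c in range(W):
                if reward[r, c] != 0:                    # consuming the object is terminal
                    V_new[r, c] = reward[r, c]
                    continue
                candidates = []
                for dr, dc in moves:
                    nr = min(max(r + dr, 0), H - 1)      # bump into the boundary wall
                    nc = min(max(c + dc, 0), W - 1)
                    candidates.append(-step_cost + gamma * V[nr, nc])
                V_new[r, c] = max(candidates)
        V = V_new
    return V

# One preferred object worth 1.0 at cell (2, 9): the greedy policy w.r.t. V walks toward it
# whenever the discounted reward outweighs the accumulated move cost.
reward = np.zeros((11, 11))
reward[2, 9] = 1.0
V = value_iteration(reward)
```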
Figure 6. ToMnet on greedy agents. Left: a single past trajectory. Right: ToMnet predictions on a query MDP. Light shaded regions indicate ToMnet’s prediction for the most probable object the agent will consume by the end of the episode, given that the agent is currently in that location. (a) After seeing the agent take a long path to the orange object, the ToMnet predicts it will try to consume the orange object on the query MDP, no matter its current location. (b) After seeing the agent take the shortest path to the green object, the ToMnet predicts it will generally consume a nearby object on the query MDP.
Note that although the ToMnet observes the agent consuming the green object in that episode, it still cannot tell whether the agent prefers “the nearest object” or “the green object”. Therefore, the map in (b, right) is not based purely on distance: the ToMnet keeps a preference for green, otherwise orange would cover a larger area.
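
Returning to the random-agent experiment above: the Bayes-optimal online inference that the ToMnet approximates is just the posterior of a Dirichlet-multinomial model. A tiny sketch (the concentration value and the action counts here are made up for illustration):

```python
import numpy as np

# Each random agent's action probabilities are drawn from a Dirichlet prior. After
# observing some of that agent's actions, the Bayes-optimal prediction of its next
# action is simply the Dirichlet-multinomial posterior mean.
alpha = 0.3 * np.ones(5)                     # assumed prior concentration over 5 actions
counts = np.array([4, 0, 1, 0, 0])           # made-up action counts observed so far

posterior_mean = (alpha + counts) / (alpha + counts).sum()
print(posterior_mean)    # the next-action distribution the ToMnet should converge to
```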
  • NN-based (learned) agent:
    • Three kinds of agents:
      • [A] Blind, with memory (stateful)
      • [B] Local observation (5x5), without memory (stateless)
      • [C] Local observation (5x5), with memory (stateful)
    • A subgoal of reaching a certain location (the star in Figure 7) is given.
    • They use the Deep Variational Information Bottleneck technique to regularize the character embedding and compress the representation (see the sketch after this list).
      • In the embedding space, [B] splits into two clusters, namely clockwise searching and counter-clockwise searching.
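
Here is a minimal sketch of what the variational information bottleneck on the character embedding could look like, assuming a diagonal-Gaussian posterior, a unit-Gaussian prior, and a β-weighted KL penalty (the standard deep VIB recipe; module sizes are my own):

```python
import torch
import torch.nn as nn

class BottleneckedCharNet(nn.Module):
    """Character net whose embedding passes through a variational information bottleneck:
    the embedding is sampled from a diagonal Gaussian, and its KL divergence to a
    unit-Gaussian prior is penalised, compressing the character representation."""

    def __init__(self, in_dim, char_dim=8):
        super().__init__()
        self.encoder = nn.GRU(in_dim, 64, batch_first=True)
        self.to_mu = nn.Linear(64, char_dim)
        self.to_logvar = nn.Linear(64, char_dim)

    def forward(self, past_sa):
        _, h = self.encoder(past_sa)
        h = h[-1]                                                       # (batch, 64)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        e_char = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)    # reparameterisation
        # KL( N(mu, sigma^2) || N(0, I) ), summed over embedding dimensions
        kl = 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1).sum(dim=-1)
        return e_char, kl

# Training objective (sketch): total_loss = prediction_loss + beta * kl.mean(),
# with a small beta so the embedding is compressed but still predictive.
```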
Figure 7. Using the ToMnet to characterise trained neural-net agents. (a) Usefulness of ToMnet components for the three behavioural prediction targets, compared with a simple ToMnet with no character nor mental net. Longer bars are better; including both character and mental nets is best. More details are given in Table A1. (b) A ToMnet’s prediction of agents’ future state occupancy given a query POMDP state at time t = 0 (left), as per Fig 4d. Star denotes the subgoal. The maps on the right are produced after observing behaviour on Npast = 5 past POMDPs from a sampled agent of each subspecies (always preferring the pink object). The ToMnet does not know a priori which subspecies each agent belongs to, but infers it from past behaviour.
  • False beliefs:
    • By swapping the positions of objects, the agent’s original plan becomes a false belief about the environment: the agent keeps acting as if nothing had been swapped, while the ToMnet still predicts its behavior correctly after the swap.
    • “Sally-Anne test”: judge whether subjects can attribute false beliefs to others, by measuring anticipatory eye movements or surprise when agents behave in violation of the subjects’ expectations.
      • Setting
        • The agent is forced to walk toward a subgoal in the opposite direction before it can reach its goal object.
        • The position of the goal object may or may not change while the agent is on its way to the subgoal.
        • This change may or may not be observed by the agent.
      • Result:
        • If the agent observes the change, it will not go back to the original location.
        • If the agent does not observe the change, it will go back, since it still believes the object is there.
        • The ToMnet (which has full observation) can infer that the agent’s response is distance-dependent, i.e. whether the change happened too far away for the agent to observe.
  • Explicitly inferring belief states
    • The ToMnet is extended so that it can make explicit declarative statements about agents’ beliefs (see the sketch after this list).
    • The agents are trained with UNREAL, with the auxiliary task of predicting the locations of objects.
    • Given an agent’s overt behavior, the ToMnet has to predict, at each state, the agent’s policy as well as its belief (what the agent believes the world looks like).
    • Training is supervised (the agents’ own belief reports provide the targets).
    • Conclusion: agents with less visibility of changes in their world are more likely to report false beliefs.
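
To make the explicit belief read-out concrete, here is a rough sketch of a supervised belief head, under my own simplifying assumptions (four objects, and a target that is the grid cell the observed agent believes each object occupies):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

GRID, N_OBJECTS = 11, 4          # assumed sizes for illustration

class BeliefHead(nn.Module):
    """Sketch of an explicit belief read-out: from the ToMnet's mental-state embedding,
    predict where the observed agent *believes* each object to be, as a softmax over
    grid cells per object. Trained with supervision against the agent's own beliefs."""

    def __init__(self, mental_dim=32):
        super().__init__()
        self.head = nn.Linear(mental_dim, N_OBJECTS * GRID * GRID)

    def forward(self, e_mental):
        # (batch, N_OBJECTS, GRID*GRID) logits over cells, one belief map per object
        return self.head(e_mental).view(-1, N_OBJECTS, GRID * GRID)

def belief_loss(logits, believed_cells):
    # believed_cells: (batch, N_OBJECTS) index of the cell the agent believes each object occupies
    return F.cross_entropy(logits.reshape(-1, GRID * GRID), believed_cells.reshape(-1))
```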

I again strongly encourage you to read the original paper if you are interested in it.

 

[DR022]: Feels good to pick up the DR column.

  • Number of agent types
    • I cannot remember whether they discuss how the number of sampled agents affects the results. Since the environment is so simple, sampling a large number of agents could effectively cover the entire action space (especially in the random-agent experiment), even with only one episode per agent. This somewhat biased setting has two issues: (1) one can never cover the entire action space in real life, and (2) collecting the behavior (episodes) can be as expensive as labeling in more complicated tasks. How to make the ToMnet generalize from a limited variety of agents could be another topic.
  • “Explicitly inferring belief states” part
    • They only show that a ToMnet can correctly predict agents’ beliefs with supervision. However, in real life we have no access to other agents’ minds when we empathize.
    • Also, we don’t have to experience what others have experienced in order to guess their intentions or beliefs. So when we apply theory of mind, maybe (1) we are subconsciously doing some kind of logical inference, or (2) we are relating their overt behavior to our own, which does not have to be exactly the same.
    • On the other hand, do we have to predict other agents’ minds precisely? Or do we predict something at a higher level, like intention? And even for humans, can we always accurately predict others’ thoughts?

 
