Learning Partially Observable Models Using Temporally Abstract Decision Trees

Erik Talvitie
Department of Mathematics and Computer Science
Franklin & Marshall College
Lancaster, PA 17604

Abstract

This paper introduces timeline trees, which are partial models of partially observable environments. Timeline trees are given some specific predictions to make and
learn a decision tree over history. The main idea of timeline trees is to use temporally abstract features to identify and split on features of key events, spread arbitrarily far apart in the past (whereas previous decision-tree-based methods have been limited to a ﬁnite sufﬁx of history). Experiments demonstrate that timeline trees can learn to make high quality predictions in complex, partially observable environments with high-dimensional observations (e.g. an arcade game).
1 Introduction

Learning a model of a high-dimensional environment can pose a significant challenge, but the ability to make predictions about future events is key to good decision making. One common approach is to avoid learning a complete, monolithic model of the environment, and to instead focus on learning partial models that are only capable of making a restricted set of predictions (for instance, predictions about some particular aspect of the environment, or predictions about future rewards). Partial models can often be simpler to learn than a complete model. In some cases they can be combined to form complete, structured models, which can then be used for planning purposes (e.g. factored MDPs, collections of partial models). In other cases, partial models can be directly useful for control (e.g. U-Tree, prediction profile models).
This paper introduces timeline trees, which are partial models for partially observable environments.
Timeline trees are focused on capturing a particular kind of partial observability; they assume that their predictions can be made by recalling a (ﬁnite) sequence of events in the past that may have occurred far apart from each other in time. While not all partially observable phenomena take this form, a good deal of everyday partial observability has this ﬂavor. For instance, you may know that your keys are in the next room because you remember putting them there. Most of the experiences since that event are probably irrelevant to making predictions about the location of your keys.
The main idea of timeline trees is to build a decision tree over history. As with similar approaches, the decision tree can split on features of observations in recent history. However, a timeline tree may also establish new timestamps in the past and is able to split on features of observations surrounding those events as well. For instance, there could be a timestamp representing the last time the agent saw its keys, and then features of the neighboring observations could identify the keys' location. In this way, timeline trees can make use of information arbitrarily spread out in history.
2 Partial Models

This paper will focus on discrete dynamical systems. Specifically, time proceeds in discrete steps t = 1, 2, 3, .... At every step t, the agent selects an action at from a finite set A and the environment (stochastically) emits an observation ot, taken from a finite set O. The history at time t is the sequence of actions and observations from the beginning of time, up through time t: ht = a1 o1 a2 o2 ... at ot (by definition). In the general partially observable case, the observation emitted at each step may depend upon the entire history (and the agent's action). So, an agent wishing to predict the next observation must model the conditional probability distribution Pr(Ot+1 | Ht, At+1).
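To make the setting concrete, the following minimal sketch (all names are illustrative, not from the paper) represents a history as a sequence of (action, observation) pairs and estimates Pr(Ot+1 | ht, at+1) by counting how often each observation follows a given history and action:

```python
from collections import Counter, defaultdict

# (history, action) -> Counter over next observations
counts = defaultdict(Counter)

def record(history, action, next_obs):
    """Record one observed transition from a trajectory."""
    counts[(tuple(history), action)][next_obs] += 1

def estimate(history, action, obs):
    """Empirical estimate of Pr(obs | history, action), or None if unseen."""
    c = counts[(tuple(history), action)]
    total = sum(c.values())
    return c[obs] / total if total else None

record([("stay", "keyA")], "move", "noKeyB")
record([("stay", "keyA")], "move", "noKeyB")
print(estimate([("stay", "keyA")], "move", "noKeyB"))  # 1.0
```

Note that such naive counting is only accurate for histories that recur many times, which is exactly the limitation of history-based estimation discussed in Section 2.1.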
If one is able to predict the next observation at any history and for any action (that is, if one has access to this conditional distribution), one can compute the probability of any future sequence of observations given any future sequence of actions and the history. Such a model is called a complete model because in any situation, it is capable of making any prediction about the future.
Examples in the partially observable setting include POMDPs [6, 7] and PSRs [5, 8]. A partial model is any model that does not represent this full conditional distribution.
This paper will focus on partial models that make conditional predictions about abstract features of the next observation, though many of the ideas can be straightforwardly adapted to work with predictions of other forms. Formally, let ω and κ be many-to-one mappings over the set of observations O. The task of the partial model at time t will be to predict the value of ω(ot+1 ), conditioned on the value of κ(ot+1 ). So it represents the distribution Pr(ω(Ot+1 ) | Ht, At+1, κ(Ot+1 )). For example, in the experiment in Section 5.3, observations are images and multiple partial models are learned, each predicting the color of a single pixel, conditioned on the colors of pixels above and to the left.
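The conditional prediction task above can be sketched in code. In this illustrative example (the class and function names are assumptions, not from the paper), observations are tiny "images" given as tuples of pixel colors; ω extracts the color of the target pixel and κ extracts the context pixels, and the model estimates Pr(ω(Ot+1) | ht, at+1, κ(Ot+1)) by counting:

```python
from collections import Counter, defaultdict

def omega(obs):
    """Abstract feature to predict: the color of pixel 2."""
    return obs[2]

def kappa(obs):
    """Conditioning abstraction: the colors of pixels 0 and 1 (above/left)."""
    return (obs[0], obs[1])

class PartialModel:
    """Counting estimate of Pr(omega(O_{t+1}) | history, a_{t+1}, kappa(O_{t+1}))."""
    def __init__(self):
        self.counts = defaultdict(Counter)

    def update(self, history, action, next_obs):
        key = (history, action, kappa(next_obs))
        self.counts[key][omega(next_obs)] += 1

    def predict(self, history, action, next_obs):
        key = (history, action, kappa(next_obs))
        c = self.counts[key]
        total = sum(c.values())
        return c[omega(next_obs)] / total if total else None

m = PartialModel()
m.update("h0", "right", ("black", "white", "black"))
print(m.predict("h0", "right", ("black", "white", "black")))  # 1.0
```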
2.1 Related Work: Partial Models in Partially Observable Environments
McCallum's U-Tree learns a decision tree over history, where the leaves of the tree map to the expected discounted sum of rewards at the associated history (though the method could be adapted to make other predictions). McCallum used binary features of the form "Feature X takes value Y at time-step t − k," where t is the current time-step. Thus the decision tree learns an abstraction both over observations (which could be high-dimensional in their own right) and over the sequence of observations (by using features from multiple time-steps). However, because it can only consider a finite number of such features, U-Tree has a finite memory horizon; events that occur before some arbitrary cutoff in the past cannot be taken into account when making predictions.
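The U-Tree feature form quoted above ("Feature X takes value Y at time-step t − k") can be sketched as a small feature factory. This is an illustrative reading of that description, with a history given as a list of (action, observation) pairs and observations given as tuples:

```python
def suffix_feature(x_index, value, k):
    """U-Tree-style binary feature: does observation component x_index equal
    `value` at time-step t - k, where t is the current step? Returns 0 when
    the history is too short, reflecting U-Tree's finite memory horizon."""
    def f(history):
        if k >= len(history):
            return 0
        _, obs = history[len(history) - 1 - k]
        return 1 if obs[x_index] == value else 0
    return f

h = [("stay", ("red",)), ("move", ("black",))]
f0 = suffix_feature(0, "black", 0)   # component 0 is "black" at the current step?
f1 = suffix_feature(0, "black", 1)   # ... at one step back?
print(f0(h), f1(h))  # 1 0
```

Because k must be drawn from a finite set of offsets, no such feature can reference events arbitrarily far back, which is the limitation timeline trees remove.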
Timeline trees are an extension of U-Tree that allows it to use features of observations arbitrarily far back in the past, though they are not the first attempt to address this issue. Looping predictive suffix trees (LPSTs) are prediction suffix trees that allow nodes to loop back to their ancestors.
Local agent state representations (LASR) map histories to a real number, and then learn a direct mapping from that number to the target predictions. McCallum and Mahmud both developed incremental hill-climbing algorithms to learn finite state machines (FSMs), where each state is associated with predictions about future rewards, and the transitions depend on both the action taken and the observation received. Prediction profile models are similar FSMs, but rather than hill-climbing, they are learned by pre-processing the data and then applying standard complete model learning methods (they were demonstrated using POMDPs and LPSTs).
All of these approaches can, in principle, represent arbitrarily long-range dependencies in time.
However, unlike U-Tree, they all treat observations as atomic, which limits their applicability to truly high-dimensional systems. Furthermore, despite their theoretical representational capacity, their learning algorithms have difﬁculty discovering long-range dependencies in practice.
The learning algorithm for LPSTs ﬁrst learns a full sufﬁx tree, and then adds loops as appropriate. Thus, to capture very long-range dependencies, one must ﬁrst build a very deep sufﬁx tree.
McCallum reported that his FSM-learning method was often unable to detect long-range temporal dependencies (since this typically involves multiple elaborations of the FSM, none of which would individually seem valuable to the hill-climbing algorithm). Mahmud’s similar approach would likely suffer a similar limitation. The learning algorithms for LASR and prediction proﬁle models both rely on estimating predictions at particular histories. Because estimates will only be accurate for histories that appear many times, these algorithms can only be effectively applied to data consisting of many short trajectories, which limits their ability to discover long-range dependencies in practice.
It should be noted that prediction profile models have been combined with an additional pre-processing step that learns an abstraction before the prediction profile model is learned. Because of this, and because their formulation most closely fits the setting of this paper, the experiments in Section 5 will directly compare against the performance of prediction profile models.
3 Timeline Trees

The goal of timeline trees is to combine the strengths of U-Tree with the ability to attend to events arbitrarily far apart in history (rather than being limited to a finite suffix). Unlike several of the above approaches, timeline trees are not arbitrarily recurrent (they do not contain loops except in a limited, implicit sense), which does restrict their representational capacity. However, in exchange they retain the straightforward decision tree training of U-Tree, which allows them to simultaneously learn an abstraction over both the history sequence and high-dimensional observations and which furthermore allows them to discover long-range temporal dependencies in practice (and not just in principle).
The decision tree built by U-Tree splits on features of observations at some temporal offset from the current timestep. The key idea of timeline trees is to allow multiple timestamps in history and to allow splits on features of observations at temporal offsets relative to any of these timestamps.
Timeline trees take a set F of binary features, where each feature f(ht, k) takes the history at time t, ht, and a timestep 0 ≤ k ≤ t + 1¹ and returns 1 or 0. For example, if the observations are images, f could return 1 if a black pixel existed anywhere at step k − 1 but did not exist at step k. It is assumed that f(ht, k) makes use of no information after step k (though it may access timestep k or before).
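The black-pixel example above can be written out directly. In this sketch (the representation is an assumption: a history is a list of observations, each a set of (x, y, color) pixels), the feature fires when a black pixel existed at step k − 1 but not at step k, and it consults no information after step k:

```python
def black_pixel_vanished(history, k):
    """f(h_t, k): 1 if some black pixel existed at step k-1 but none exists
    at step k; 0 otherwise. Steps outside the history count as having none."""
    def has_black(step):
        if step < 1 or step > len(history):
            return False
        return any(color == "black" for (_, _, color) in history[step - 1])
    return int(has_black(k - 1) and not has_black(k))

h = [{(0, 0, "black")}, {(0, 0, "white")}]
print(black_pixel_vanished(h, 2))  # 1: black at step 1, gone at step 2
print(black_pixel_vanished(h, 1))  # 0: no black pixel at step 0
```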
For a fixed vector τ of timestamps, the model is a standard decision tree, and only a small extension of U-Tree (which fixed the number of timestamps to 1: the current timestep). Each internal node in the tree is associated with a feature f, a timestamp index i, and a temporal offset δ (which may be negative), and has two children representing histories where the value of f(ht, τ[i] + δ) is 0 or 1, respectively. The leaves of the tree are associated with estimates of Pr(ω(Ot+1) | ht, at+1, κ(Ot+1)).

To use the timeline tree to make a prediction, one simply follows a path from the root to a leaf in the tree, choosing the appropriate child at each node according to the feature value f(ht, τ[i] + δ).
Timeline trees' real strength lies in their ability to add new timestamps. They do this via a special type of feature. For every feature f ∈ F, there is an additional timestamp feature ξf. The feature ξf(ht, j, k) = 1 if there is some timestep m with j < m ≤ k where f(ht, m) = 1. More importantly, the greatest such m (that is, the time of the most recent occurrence of f), call it mf, is added as a timestamp to all nodes in the subtree where ξf = 1.
When making a prediction for ht, one maintains a growing vector τ of timestamps (in order of least to most recent). Beginning at the root there is only one timestamp: τroot = ⟨t + 1⟩, where t is the current timestep. As one travels from the root to a leaf, one may encounter a node associated with a timestamp feature ξf. Such a node is also associated with an index i into the current timestamp vector. If ξf(ht, τ[i − 1], τ[i]) = 1, the path moves to the corresponding child and adds mf to τ (where, by definition, τ[−1] = −1). Nodes further down in the tree may refer to this new timestamp. As such, the tree is able to establish timestamps based on the occurrence of key events (the presence of some feature).
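The root-to-leaf prediction procedure just described can be sketched as follows. This is a minimal illustrative implementation, not the paper's code: node and function names are assumptions, ordinary nodes split on f(ht, τ[i] + δ), and timestamp-feature nodes test whether f occurred in (τ[i − 1], τ[i]] and, if so, append the most recent occurrence time mf to τ before descending:

```python
class Node:
    def __init__(self, f=None, i=0, delta=0, timestamp=False,
                 children=None, prediction=None):
        self.f, self.i, self.delta = f, i, delta
        self.timestamp = timestamp      # is this a timestamp-feature node?
        self.children = children or {}  # {0: Node, 1: Node}
        self.prediction = prediction    # set at leaves only

def predict(root, history, t):
    tau = [t + 1]                       # tau_root = <t + 1>
    node = root
    while node.prediction is None:
        if node.timestamp:
            lo = tau[node.i - 1] if node.i > 0 else -1   # tau[-1] := -1
            hi = tau[node.i]
            occurrences = [m for m in range(lo + 1, hi + 1)
                           if node.f(history, m)]
            bit = int(bool(occurrences))
            if bit:
                tau.append(occurrences[-1])  # m_f: most recent occurrence
                tau.sort()                   # keep least-to-most-recent order
        else:
            bit = node.f(history, tau[node.i] + node.delta)
        node = node.children[bit]
    return node.prediction

# Toy usage: did the feature "observation is 'key'" ever occur before t + 1?
def saw_key(history, m):
    return int(1 <= m <= len(history) and history[m - 1] == "key")

root = Node(f=saw_key, i=0, timestamp=True,
            children={0: Node(prediction="pocket"), 1: Node(prediction="room")})
print(predict(root, ["nokey", "key", "nokey"], 3))  # room
```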
Timestamp features are a form of temporal abstraction; they refer to an event in the past, but abstract away how long ago it was. They are limited, however. There are systems that would require an infinite timeline tree but that the approaches in Section 2.1 can capture easily (see Section 5.2). Nevertheless, they do capture a natural and intuitive form of partial observability, as can be seen by example.
3.2 An Example
As a simple illustrative example, consider an agent that must keep track of its key. The agent's key can be in room A or B, or in the agent's pocket (where it starts). The agent has three actions: move (which switches its location), stay (which does nothing), and pocket. The last action transfers the key between the agent's pocket and the current room (in either direction) unless the key is in neither (in which case it does nothing). The agent can observe its location and whether the key is in the room. A diagram is shown in the left of Figure 1 (missing arrows are self-loops).

¹For simplicity's sake, the notation f(ht, k) hides the fact that the features may also depend on at+1 and κ(ot+1). For this discussion, assume that k may equal t + 1 if the feature makes use of only these aspects of time t + 1. If k = t + 1 and the feature refers to other information, assume f(ht, k) = 0.

Figure 1: The Key World example.
On the right of Figure 1 an example timeline tree is shown that can predict whether the agent will see the key in the next timestep. At the root, there is only one timestamp: t+1, where t is the current step. The root checks if the agent can currently see the key. If so, the agent will only see the key in the next step if it stays. Otherwise, the agent must remember where the key was last seen. The square-shaped node is meant to indicate a timestamp feature, which checks if the agent has ever seen the key before the only timestamp. If not, the key is in the agent’s pocket. If so, a new timestamp m is added that marks the last time the key was seen. If the agent put the key in its pocket after m, it must take the key out to see it. Otherwise, it must be in the other room.
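The Key World dynamics described above can be captured in a few lines. This is a minimal sketch under stated assumptions (rooms named "A"/"B", key location one of "A", "B", or "pocket"; the observation is the agent's room plus whether the key is visible there):

```python
def step(agent, key, action):
    """One step of the Key World: returns (agent, key, observation)."""
    if action == "move":
        agent = "B" if agent == "A" else "A"
    elif action == "pocket":
        if key == "pocket":
            key = agent            # take the key out into the current room
        elif key == agent:
            key = "pocket"         # put the key from this room into the pocket
        # if the key is in the other room, pocket does nothing
    # action "stay" does nothing
    return agent, key, (agent, key == agent)

agent, key = "A", "pocket"                     # the key starts in the pocket
agent, key, obs = step(agent, key, "pocket")   # take the key out in room A
print(obs)  # ('A', True)
agent, key, obs = step(agent, key, "move")     # move to B; key stays in A
print(obs)  # ('B', False)
```

Tracing this simulator against the tree in Figure 1 shows why a timestamp at the last key sighting suffices: everything between that event and the present is irrelevant except whether a pocket action intervened.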
4 Learning Timeline Trees