Return to Past and Current Projects

Lifetime-limited memory networks

Many natural data sources can be characterized as event sequences, consisting of discrete observations stamped with a real-valued time of occurrence, e.g., outgoing smartphone calls, online product purchases, music selections. Some specialized deep architectures have been developed to incorporate temporal dynamics (e.g., CT-GRU, Hawkes process memories), but none have shown a benefit over the state-of-the-art in the field--Long Short-Term Memory (Hochreiter & Schmidhuber, 1995)--when LSTM is provided with the time stamp as additional input.

LSTM was originally conceived of as a latch that observes an interesting pattern and remembers its occurrence forever. If LSTM really operated in this manner, it would be poor at discarding information that was no longer task relevant, and for this reason, LSTM was augmented with a forgetting gate. However, the forgetting gate is itself problematic because memory leakage and gradient squashing may prevent robust training.

We are developing a Lifetime-Limited Memory (LLM) architecture premised on the assumption that information in a sequence can have a finite time-dependent lifetime, and therefore it is essential to store information about the age of the memory as well as to retrieve and erase memories based on age. We obtain age dependence by endowing each LLM cell with a bank of leaky integrators that have log-linear spaced time scales to ensure a wide dynamic range. Temporal localization can be performed via linear mixtures of these integrators, and their fixed decay incorporates forgetting yet ensures that gradients can propagate back in time regardless of the model parameters.

The goal of this research is to better handle issues of time scale. To illustrate the type of problems we hope to be able to tackle, consider the domain of network intrusion detection. Event patterns of relevance can occur on a time scale of microseconds to weeks (e.g., Mukherjee et al., 1994; Palanivel et al., 2014). Any architecture that quantizes time into the finest granularity necessary will be poor at detecting patterns at the coarse granularity (we're skeptical about the ability of the recent Phased LSTM model to learn via gradient descent). And consider the domain of online shopping. An individual's purchases may reflect long-term preferences or short-term needs, and thus recency itself isn't necessarily the best predictor of an individual's interests; predictions should be based both on long- and short-term histories.


Denis Kazakov (Computer Science, Boulder) Matt Meirhofer (Applied Math, Boulder)