Continual learning

Tags: Machine learning

Continual learning is a type of supervised learning in which there is no separate “testing phase” for the decision process. Instead, the algorithm keeps processing training samples, and must simultaneously make predictions and keep learning.

This is challenging for a fixed neural network architecture: its capacity is bounded, so it must eventually either forget old knowledge or fail to learn anything new.

A definition from the survey (De Lange et al. 2020):

The General Continual Learning setting considers an infinite stream of training data where at each time step, the system receives a (number of) new sample(s) drawn non i.i.d from a current distribution that could itself experience sudden or gradual changes.
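The defining loop (predict first, then learn from the very same sample) can be sketched as follows. The drifting-threshold stream and the learner below are made-up illustrations of the setting, not something taken from the survey:

```python
import random

random.seed(0)

class OnlineLearner:
    """Toy continual learner: tracks a drifting decision threshold."""

    def __init__(self):
        self.threshold = 0.5  # current estimate of the decision boundary

    def predict(self, x):
        return int(x > self.threshold)

    def update(self, x, y):
        # On a mistake, nudge the estimate toward the misclassified point.
        if self.predict(x) != y:
            self.threshold += 0.1 * (x - self.threshold)

def stream(n_steps, drift_at):
    """Non-i.i.d. stream: the labelling concept changes abruptly at drift_at."""
    for t in range(n_steps):
        x = random.random()
        true_threshold = 0.3 if t < drift_at else 0.7  # sudden change
        yield x, int(x > true_threshold)

learner = OnlineLearner()
mistakes = 0
for x, y in stream(n_steps=1000, drift_at=500):
    y_hat = learner.predict(x)  # the prediction counts before the label is seen
    mistakes += int(y_hat != y)
    learner.update(x, y)        # the same sample is then used for training
```

Note that there is no held-out test set anywhere: the cumulative number of mistakes on the stream itself is the performance measure.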

Theoretical foundations

Concept shift

(Bartlett et al. 1996) explores how to learn under the assumption of concept shift:

The learner sees a sequence of random examples, labelled according to a sequence of functions, and must provide an accurate estimate of the target function sequence.

Formally, at time \(t\) the learner sees a random example \(x_t\) from some domain \(X\), together with the value of \(f_t(x_t) \in \{0, 1\}\), where \(f_t\) is an unknown function from some known class \(F\).

The paper addresses two problems of learning with changing concepts:

  • Estimation: When can we estimate a sequence \((f_1, \cdots, f_n)\) from observations \(((x_1, f_1(x_1)), \cdots, (x_n, f_n(x_n)))\)?
  • Prediction: When can one predict the next concept \(f_{n+1}\) from a sequence of concepts \((f_1, \cdots, f_n)\)?
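To make the estimation problem concrete, here is a toy drifting-threshold concept and a windowed estimator; the specific concept class and estimator are illustrative devices, not the algorithm of the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# f_t(x) = 1[x > theta_t], with theta_t drifting slowly: the "structure of
# change" the estimator exploits is that consecutive concepts are close.
def theta(t):
    return 0.2 + 0.6 * t / 1000

xs, ys = [], []
errors = 0
window = 50  # recent examples are labelled by nearly the current concept
for t in range(1000):
    x = rng.random()
    # Estimate f_t: smallest threshold consistent with the windowed sample.
    positives = [xi for xi, yi in zip(xs[-window:], ys[-window:]) if yi == 1]
    theta_hat = min(positives) if positives else 0.5
    y_hat = int(x > theta_hat)  # predicted label under the estimated concept
    y = int(x > theta(t))       # true label under the current concept
    errors += int(y_hat != y)
    xs.append(x)
    ys.append(y)
```

The window size trades off bias (old examples were labelled by an outdated concept) against variance (few examples give a noisy estimate); quantifying when this trade-off permits accurate tracking is the kind of question the paper addresses.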

Formal definitions of different aspects of continual learning

Learning to learn

The paper (Baxter 1998) defines the problem of learning to learn as follows (notations are chosen to contrast with regular supervised learning):

  • an input space \(X\) and an output space \(Y\),
  • a loss function \(l: Y \times Y \rightarrow \mathbb{R}\),
  • an environment \((P, Q)\) where \(P\) is the set of all probability distributions on \(X \times Y\) and \(Q\) is a distribution on \(P\),
  • a hypothesis space family \(H = \{\mathcal{H}\}\) where each \(\mathcal{H} \in H\) is a set of functions \(h: X \rightarrow Y\).
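In this setup, the learner's job is to pick the hypothesis space \(\mathcal{H} \in H\) that performs best on average over tasks drawn from the environment. A minimal numerical illustration (the environment, the feature maps, and all names below are invented for the example):

```python
import numpy as np

rng = np.random.default_rng(1)

# Environment (P, Q): each task is a 1-D regression problem y = a * x**2 + noise,
# with the task parameter a drawn from Q (here: uniform on [1, 3]).
def sample_task():
    a = rng.uniform(1, 3)
    x = rng.uniform(-1, 1, size=20)
    y = a * x**2 + 0.05 * rng.standard_normal(20)
    return x, y

# Hypothesis space family H = {H_1, H_2, H_3}: linear fits on one feature each.
features = {1: lambda x: x, 2: lambda x: x**2, 3: lambda x: x**3}

def best_fit_loss(k, x, y):
    """Empirical loss of the best hypothesis in H_k on one task (least squares)."""
    phi = features[k](x)
    w = (phi @ y) / (phi @ phi)
    return np.mean((w * phi - y) ** 2)

# Learning to learn: choose the hypothesis space whose *average* best-fit loss
# over tasks sampled from the environment is smallest.
tasks = [sample_task() for _ in range(50)]
avg_loss = {k: np.mean([best_fit_loss(k, x, y) for x, y in tasks]) for k in features}
best_space = min(avg_loss, key=avg_loss.get)
```

Because every task in this environment is quadratic, the quadratic feature space wins: a bias learned from past tasks that then makes each new task easy, which is the point of the formulation.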

(Pentina, Ben-David 2015)

(Pentina, Urner 2016)

(Pentina, Lampert 2015)

(Balcan et al. 2020)

(Kifer et al. 2004)

(Ben-David, Borbely 2008)

(Ben-David, Schuller 2003)

(Geisa et al. 2022)

Examples of continual learning systems

Benchmarks

Computer vision based benchmarks

  • Split MNIST: the MNIST dataset is split into five 2-class tasks (Nguyen et al. 2017; Zenke et al. 2017; Shin et al. 2017).

  • Split CIFAR10: the CIFAR10 dataset is split into five 2-class tasks (Krizhevsky, Hinton 2009).

  • Split mini-ImageNet: the mini-ImageNet dataset (100 classes) is split into twenty 5-class tasks.

  • Continual Transfer Learning Benchmark: a benchmark from Facebook AI built from 7 computer vision datasets: MNIST, CIFAR10, CIFAR100, DTD, SVHN, Rainbow-MNIST, Fashion MNIST. The tasks are all 5-class or 10-class classification tasks. Some example task sequence constructions from (Veniat et al. 2021):

    The last task of \(S_{out}\) consists of a shuffling of the output labels of the first task. The last task of \(S_{in}\) is the same as its first task except that MNIST images have a different background color. \(S_{long}\) has 100 tasks, and it is constructed by first sampling a dataset, then 5 classes at random, and finally the amount of training data from a distribution that favors small tasks by the end of the learning experience.

  • Permuted MNIST: for each task, the pixels of the MNIST digits are shuffled by a fixed random permutation, generating a new task of the same difficulty as the original but with a different solution. This benchmark is not suitable if the model has a spatial prior (like a CNN). First used in (Goodfellow et al. 2014; Srivastava et al. 2013); also used in (Kirkpatrick et al. 2017).

  • Rotated MNIST: each task contains digits rotated by a fixed angle between 0 and 180 degrees.
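The construction of these task streams amounts to simple index manipulation. A sketch of the Split and Permuted variants on placeholder arrays standing in for MNIST (the real data would be 60000 × 28 × 28 images):

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder arrays standing in for the MNIST images and labels.
images = rng.random((1000, 28, 28))
labels = rng.integers(0, 10, size=1000)

# Split MNIST: five 2-class tasks, e.g. {0,1}, {2,3}, ..., {8,9}.
split_tasks = []
for pair in [(0, 1), (2, 3), (4, 5), (6, 7), (8, 9)]:
    mask = np.isin(labels, pair)
    split_tasks.append((images[mask], labels[mask]))

# Permuted MNIST: each task applies one fixed random pixel permutation to
# every image, giving equally hard tasks with unrelated solutions.
permuted_tasks = []
for _ in range(5):
    perm = rng.permutation(28 * 28)
    permuted = images.reshape(len(images), -1)[:, perm]
    permuted_tasks.append((permuted.reshape(-1, 28, 28), labels))
```

A continual learner is then trained on the tasks in sequence and evaluated on all tasks seen so far, which is what exposes catastrophic forgetting.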

Bibliography

  1. De Lange et al. 2020. "A Continual Learning Survey: Defying Forgetting in Classification Tasks". http://arxiv.org/abs/1909.08383.
  2. Bartlett, Ben-David, and Kulkarni. 1996. "Learning Changing Concepts by Exploiting the Structure of Change". In Proceedings of COLT, 131–39. ACM Press. DOI.
  3. Baxter. 1998. "Theoretical Models of Learning to Learn". In Learning to Learn, edited by Sebastian Thrun and Lorien Pratt, 71–94. Springer US. DOI.
  4. Pentina and Ben-David. 2015. "Multi-task and Lifelong Learning of Kernels". In Algorithmic Learning Theory, edited by Kamalika Chaudhuri, Claudio Gentile, and Sandra Zilles, 194–208. Springer International Publishing. DOI.
  5. Pentina and Urner. 2016. "Lifelong Learning with Weighted Majority Votes". In Advances in Neural Information Processing Systems. Vol. 29. Curran Associates, Inc. https://proceedings.neurips.cc/paper/2016/hash/f39ae9ff3a81f499230c4126e01f421b-Abstract.html.
  6. Pentina and Lampert. 2015. "Lifelong Learning with Non-i.i.d. Tasks". In Advances in Neural Information Processing Systems. Vol. 28. Curran Associates, Inc. https://proceedings.neurips.cc/paper/2015/hash/9232fe81225bcaef853ae32870a2b0fe-Abstract.html.
  7. Balcan et al. 2020. "Lifelong Learning in Costly Feature Spaces". Theoretical Computer Science 808 (February): 14–37. DOI.
  8. Kifer, Ben-David, and Gehrke. 2004. "Detecting Change in Data Streams". In Proceedings of VLDB, 4:180–91. Toronto, Canada.
  9. Ben-David and Borbely. 2008. "A Notion of Task Relatedness Yielding Provable Multiple-Task Learning Guarantees". Machine Learning 73 (3): 273–87. DOI.
  10. Ben-David and Schuller. 2003. "Exploiting Task Relatedness for Multiple Task Learning". In Learning Theory and Kernel Machines, edited by Bernhard Schölkopf and Manfred K. Warmuth, 567–80. Springer. DOI.
  11. Geisa et al. 2022. "Towards a Theory of Out-of-Distribution Learning". January 6, 2022. DOI.
  12. In Proceedings of the Conference on Artificial Intelligence (AAAI) (2010), 1306–13. DOI.
  13. Nguyen et al. 2017. "Variational Continual Learning". CoRR abs/1710.10628. http://arxiv.org/abs/1710.10628.
  14. Zenke, Poole, and Ganguli. 2017. "Continual Learning Through Synaptic Intelligence". In Proceedings of ICML, edited by Doina Precup and Yee Whye Teh, 70:3987–95. PMLR. http://proceedings.mlr.press/v70/zenke17a.html.
  15. Shin et al. 2017. "Continual Learning with Deep Generative Replay". In Advances in Neural Information Processing Systems, edited by Isabelle Guyon, Ulrike von Luxburg, Samy Bengio, Hanna M. Wallach, Rob Fergus, S. V. N. Vishwanathan, and Roman Garnett, 2990–99. https://proceedings.neurips.cc/paper/2017/hash/0efbe98067c6c73dba1250d2beaa81f9-Abstract.html.
  16. Krizhevsky and Hinton. 2009. "Learning Multiple Layers of Features from Tiny Images".
  17. Veniat, Denoyer, and Ranzato. 2021. "Efficient Continual Learning with Modular Networks and Task-Driven Priors". http://arxiv.org/abs/2012.12631.
  18. Goodfellow et al. 2014. "An Empirical Investigation of Catastrophic Forgetting in Gradient-Based Neural Networks". In ICLR, edited by Yoshua Bengio and Yann LeCun. http://arxiv.org/abs/1312.6211.
  19. Srivastava et al. 2013. "Compete to Compute". In Advances in Neural Information Processing Systems, edited by Christopher J. C. Burges, Léon Bottou, Zoubin Ghahramani, and Kilian Q. Weinberger, 2310–18. https://proceedings.neurips.cc/paper/2013/hash/8f1d43620bc6bb580df6e80b0dc05c48-Abstract.html.
  20. Kirkpatrick et al. 2017. "Overcoming Catastrophic Forgetting in Neural Networks". http://arxiv.org/abs/1612.00796.