Continual learning

Tags: Machine learning

Continual learning is a type of supervised learning in which there is no separate “testing phase” for the decision process. Instead, the algorithm keeps processing training samples, and must simultaneously make predictions and keep learning.

This is challenging for a fixed neural network architecture: its capacity is bounded, so it must eventually either forget old knowledge or fail to learn anything new.

A definition from the survey (De Lange et al. 2020):

The General Continual Learning setting considers an infinite stream of training data where at each time step, the system receives a (number of) new sample(s) drawn non i.i.d from a current distribution that could itself experience sudden or gradual changes.
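The defining loop (predict first, then learn from the very same sample) can be sketched as follows. The drifting-threshold stream and the learner below are made-up illustrations of the setting, not something taken from the survey:

```python
import random

random.seed(0)

class OnlineLearner:
    """Toy continual learner: tracks a drifting decision threshold."""

    def __init__(self):
        self.threshold = 0.5  # current estimate of the decision boundary

    def predict(self, x):
        return int(x > self.threshold)

    def update(self, x, y):
        # On a mistake, nudge the estimate toward the misclassified point.
        if self.predict(x) != y:
            self.threshold += 0.1 * (x - self.threshold)

def stream(n_steps, drift_at):
    """Non-i.i.d. stream: the labelling concept changes abruptly at drift_at."""
    for t in range(n_steps):
        x = random.random()
        true_threshold = 0.3 if t < drift_at else 0.7  # sudden change
        yield x, int(x > true_threshold)

learner = OnlineLearner()
mistakes = 0
for x, y in stream(n_steps=1000, drift_at=500):
    y_hat = learner.predict(x)  # the prediction counts before the label is seen
    mistakes += int(y_hat != y)
    learner.update(x, y)        # the same sample is then used for training
```

Note that there is no held-out test set anywhere: the cumulative number of mistakes on the stream itself is the performance measure.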

Theoretical foundations

Concept shift

(Bartlett et al. 1996) explores how to learn under the assumption of concept shift:

The learner sees a sequence of random examples, labelled according to a sequence of functions, and must provide an accurate estimate of the target function sequence.

Formally, at time \(t\) the learner sees a random example \(x_t\) from some domain \(X\), together with the value of \(f_t(x_t) \in \{0, 1\}\), where \(f_t\) is an unknown function from some known class \(F\).

The paper addresses two problems of learning with changing concepts:

  • Estimation: When can we estimate a sequence \((f_1, \cdots, f_n)\) from observations \(((x_1, f_1(x_1)), \cdots, (x_n, f_n(x_n)))\)?
  • Prediction: When can one predict the next concept \(f_{n+1}\) from a sequence of concepts \((f_1, \cdots, f_n)\)?
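To make the estimation problem concrete, here is a toy drifting-threshold concept and a windowed estimator; the specific concept class and estimator are illustrative devices, not the algorithm of the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# f_t(x) = 1[x > theta_t], with theta_t drifting slowly: the "structure of
# change" the estimator exploits is that consecutive concepts are close.
def theta(t):
    return 0.2 + 0.6 * t / 1000

xs, ys = [], []
errors = 0
window = 50  # recent examples are labelled by nearly the current concept
for t in range(1000):
    x = rng.random()
    # Estimate f_t: smallest threshold consistent with the windowed sample.
    positives = [xi for xi, yi in zip(xs[-window:], ys[-window:]) if yi == 1]
    theta_hat = min(positives) if positives else 0.5
    y_hat = int(x > theta_hat)  # predicted label under the estimated concept
    y = int(x > theta(t))       # true label under the current concept
    errors += int(y_hat != y)
    xs.append(x)
    ys.append(y)
```

The window size trades off bias (old examples were labelled by an outdated concept) against variance (few examples give a noisy estimate); quantifying when this trade-off permits accurate tracking is the kind of question the paper addresses.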

Formal definitions of different aspects of continual learning

Learning to learn

The paper (Baxter 1998) defines the problem of learning to learn as follows (notations are chosen to contrast with regular supervised learning):

  • an input space \(X\) and an output space \(Y\),
  • a loss function \(l: Y \times Y \rightarrow \mathbb{R}\),
  • an environment \((P, Q)\) where \(P\) is the set of all probability distributions on \(X \times Y\) and \(Q\) is a distribution on \(P\),
  • a hypothesis space family \(H = \{\mathcal{H}\}\) where each \(\mathcal{H} \in H\) is a set of functions \(h: X \rightarrow Y\).
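In this setup, the learner's job is to pick the hypothesis space \(\mathcal{H} \in H\) that performs best on average over tasks drawn from the environment. A minimal numerical illustration (the environment, the feature maps, and all names below are invented for the example):

```python
import numpy as np

rng = np.random.default_rng(1)

# Environment (P, Q): each task is a 1-D regression problem y = a * x**2 + noise,
# with the task parameter a drawn from Q (here: uniform on [1, 3]).
def sample_task():
    a = rng.uniform(1, 3)
    x = rng.uniform(-1, 1, size=20)
    y = a * x**2 + 0.05 * rng.standard_normal(20)
    return x, y

# Hypothesis space family H = {H_1, H_2, H_3}: linear fits on one feature each.
features = {1: lambda x: x, 2: lambda x: x**2, 3: lambda x: x**3}

def best_fit_loss(k, x, y):
    """Empirical loss of the best hypothesis in H_k on one task (least squares)."""
    phi = features[k](x)
    w = (phi @ y) / (phi @ phi)
    return np.mean((w * phi - y) ** 2)

# Learning to learn: choose the hypothesis space whose *average* best-fit loss
# over tasks sampled from the environment is smallest.
tasks = [sample_task() for _ in range(50)]
avg_loss = {k: np.mean([best_fit_loss(k, x, y) for x, y in tasks]) for k in features}
best_space = min(avg_loss, key=avg_loss.get)
```

Because every task in this environment is quadratic, the quadratic feature space wins: a bias learned from past tasks that then makes each new task easy, which is the point of the formulation.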

(Pentina, Ben-David 2015)

(Pentina, Urner 2016)

(Pentina, Lampert 2015)

(Balcan et al. 2020)

(Kifer et al. 2004)

(Ben-David, Borbely 2008)

(Ben-David, Schuller 2003)

(Geisa et al. 2022)

Examples of continual learning systems

Benchmarks

Computer vision based benchmarks

  • Split MNIST: the MNIST dataset is split into five 2-class tasks (Nguyen et al. 2017; Zenke et al. 2017; Shin et al. 2017).

  • Split CIFAR10: the CIFAR10 dataset is split into five 2-class tasks (Krizhevsky, Hinton 2009).

  • Split mini-ImageNet: the mini-ImageNet dataset (100 classes) is split into twenty 5-class tasks.

  • Continual Transfer Learning Benchmark: a benchmark from Facebook AI built from 7 computer vision datasets: MNIST, CIFAR10, CIFAR100, DTD, SVHN, Rainbow-MNIST, Fashion MNIST. The tasks are all 5-class or 10-class classification tasks. Some example task sequence constructions from (Veniat et al. 2021):

    The last task of \(S_{out}\) consists of a shuffling of the output labels of the first task. The last task of \(S_{in}\) is the same as its first task except that MNIST images have a different background color. \(S_{long}\) has 100 tasks, and it is constructed by first sampling a dataset, then 5 classes at random, and finally the amount of training data from a distribution that favors small tasks by the end of the learning experience.

  • Permuted MNIST: for each task, the pixels of the MNIST digits are shuffled by a fixed random permutation, generating a new task of the same difficulty as the original but with a different solution. This benchmark is not suitable if the model has a spatial prior (like a CNN). First used in (Goodfellow et al. 2014; Srivastava et al. 2013); also used in (Kirkpatrick et al. 2017).

  • Rotated MNIST: each task contains digits rotated by a fixed angle between 0 and 180 degrees.
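The construction of these task streams amounts to simple index manipulation. A sketch of the Split and Permuted variants on placeholder arrays standing in for MNIST (the real data would be 60000 × 28 × 28 images):

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder arrays standing in for the MNIST images and labels.
images = rng.random((1000, 28, 28))
labels = rng.integers(0, 10, size=1000)

# Split MNIST: five 2-class tasks, e.g. {0,1}, {2,3}, ..., {8,9}.
split_tasks = []
for pair in [(0, 1), (2, 3), (4, 5), (6, 7), (8, 9)]:
    mask = np.isin(labels, pair)
    split_tasks.append((images[mask], labels[mask]))

# Permuted MNIST: each task applies one fixed random pixel permutation to
# every image, giving equally hard tasks with unrelated solutions.
permuted_tasks = []
for _ in range(5):
    perm = rng.permutation(28 * 28)
    permuted = images.reshape(len(images), -1)[:, perm]
    permuted_tasks.append((permuted.reshape(-1, 28, 28), labels))
```

A continual learner is then trained on the tasks in sequence and evaluated on all tasks seen so far, which is what exposes catastrophic forgetting.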

Bibliography

  1. De Lange et al. 2020. "A Continual Learning Survey: Defying Forgetting in Classification Tasks". http://arxiv.org/abs/1909.08383.
  2. Bartlett, Ben-David, and Kulkarni. 1996. "Learning Changing Concepts by Exploiting the Structure of Change". In Proceedings of COLT, 131–39. ACM Press. DOI.
  3. Baxter. 1998. "Theoretical Models of Learning to Learn". In Learning to Learn, edited by Sebastian Thrun and Lorien Pratt, 71–94. Springer US. DOI.
  4. Pentina and Ben-David. 2015. "Multi-task and Lifelong Learning of Kernels". In Algorithmic Learning Theory, edited by Kamalika Chaudhuri, Claudio Gentile, and Sandra Zilles, 194–208. Springer International Publishing. DOI.
  5. Pentina and Urner. 2016. "Lifelong Learning with Weighted Majority Votes". In Advances in Neural Information Processing Systems. Vol. 29. Curran Associates, Inc. https://proceedings.neurips.cc/paper/2016/hash/f39ae9ff3a81f499230c4126e01f421b-Abstract.html.
  6. Pentina and Lampert. 2015. "Lifelong Learning with Non-i.i.d. Tasks". In Advances in Neural Information Processing Systems. Vol. 28. Curran Associates, Inc. https://proceedings.neurips.cc/paper/2015/hash/9232fe81225bcaef853ae32870a2b0fe-Abstract.html.
  7. Balcan et al. 2020. "Lifelong Learning in Costly Feature Spaces". Theoretical Computer Science 808 (February): 14–37. DOI.
  8. Kifer, Ben-David, and Gehrke. 2004. "Detecting Change in Data Streams". In Proceedings of VLDB, 4:180–91. Toronto, Canada.
  9. Ben-David and Borbely. 2008. "A Notion of Task Relatedness Yielding Provable Multiple-Task Learning Guarantees". Machine Learning 73 (3): 273–87. DOI.
  10. Ben-David and Schuller. 2003. "Exploiting Task Relatedness for Multiple Task Learning". In Learning Theory and Kernel Machines, edited by Bernhard Schölkopf and Manfred K. Warmuth, 567–80. Springer. DOI.
  11. Geisa et al. 2022. "Towards a Theory of Out-of-Distribution Learning". January 6, 2022. DOI.
  12. In Proceedings of the Conference on Artificial Intelligence (AAAI) (2010), 1306–13. DOI.
  13. Nguyen et al. 2017. "Variational Continual Learning". CoRR abs/1710.10628. http://arxiv.org/abs/1710.10628.
  14. Zenke, Poole, and Ganguli. 2017. "Continual Learning Through Synaptic Intelligence". In Proceedings of ICML, edited by Doina Precup and Yee Whye Teh, 70:3987–95. PMLR. http://proceedings.mlr.press/v70/zenke17a.html.
  15. Shin et al. 2017. "Continual Learning with Deep Generative Replay". In Advances in Neural Information Processing Systems, edited by Isabelle Guyon, Ulrike von Luxburg, Samy Bengio, Hanna M. Wallach, Rob Fergus, S. V. N. Vishwanathan, and Roman Garnett, 2990–99. https://proceedings.neurips.cc/paper/2017/hash/0efbe98067c6c73dba1250d2beaa81f9-Abstract.html.
  16. Krizhevsky and Hinton. 2009. "Learning Multiple Layers of Features from Tiny Images".
  17. Veniat, Denoyer, and Ranzato. 2021. "Efficient Continual Learning with Modular Networks and Task-Driven Priors". http://arxiv.org/abs/2012.12631.
  18. Goodfellow et al. 2014. "An Empirical Investigation of Catastrophic Forgetting in Gradient-Based Neural Networks". In ICLR, edited by Yoshua Bengio and Yann LeCun. http://arxiv.org/abs/1312.6211.
  19. Srivastava et al. 2013. "Compete to Compute". In Advances in Neural Information Processing Systems, edited by Christopher J. C. Burges, Léon Bottou, Zoubin Ghahramani, and Kilian Q. Weinberger, 2310–18. https://proceedings.neurips.cc/paper/2013/hash/8f1d43620bc6bb580df6e80b0dc05c48-Abstract.html.
  20. Kirkpatrick et al. 2017. "Overcoming Catastrophic Forgetting in Neural Networks". http://arxiv.org/abs/1612.00796.