Back to recurrent processing at the crossroad of transformers and state-space models

Brown, T. et al. Language models are few-shot learners. Adv. Neural Inf. Process. Syst. 33, 1877–1901 (2020).
Touvron, H. et al. Llama 2: open foundation and fine-tuned chat models. Preprint at https://doi.org/10.48550/arXiv.2307.09288 (2023).
Gemini Team, Google. Gemini: a family of highly capable multimodal models. Preprint at https://doi.org/10.48550/arXiv.2312.11805 (2023).
Vaswani, A. et al. Attention is all you need. In Advances in Neural Information Processing Systems Vol. 30 (NeurIPS, 2017).
Yu, Y., Si, X., Hu, C. & Zhang, J. A review of recurrent neural networks: LSTM cells and network architectures. Neural Comput. 31, 1235–1270 (2019).
Bengio, Y., Mori, R. & Gori, M. Learning the dynamic nature of speech with back-propagation for sequences. Pattern Recognit. Lett. 13, 375–385 (1992).
Gori, M., Hammer, B., Hitzler, P. & Palm, G. Perspectives and challenges for recurrent neural network training. Log. J. IGPL 18, 617–619 (2010).
Orvieto, A. et al. Resurrecting recurrent neural networks for long sequences. In International Conference on Machine Learning 26670–26698 (ACM, 2023).
Peng, B. et al. RWKV: reinventing RNNs for the transformer era. In Findings of the Association for Computational Linguistics: EMNLP 2023 (eds Bouamor, H. et al.) 14048–14077 (ACL, 2023).
Voelker, A., Kajić, I. & Eliasmith, C. Legendre memory units: continuous-time representation in recurrent neural networks. In Advances in Neural Information Processing Systems Vol. 32 (NeurIPS, 2019).
Gu, A., Dao, T., Ermon, S., Rudra, A. & Ré, C. HiPPO: recurrent memory with optimal polynomial projections. Adv. Neural Inf. Process. Syst. 33, 1474–1487 (2020).
Gu, A., Goel, K. & Ré, C. Efficiently modeling long sequences with structured state spaces. In International Conference on Learning Representations (Curran Associates, 2021).
Gu, A. et al. Combining recurrent, convolutional, and continuous-time models with linear state space layers. Adv. Neural Inf. Process. Syst. 34, 572–585 (2021).
Smith, J. T., Warrington, A. & Linderman, S. Simplified state space layers for sequence modeling. In 11th International Conference on Learning Representations (Curran Associates, 2023).
Gu, A. & Dao, T. Mamba: linear-time sequence modeling with selective state spaces. Preprint at https://doi.org/10.48550/arXiv.2312.00752 (2023).
De, S. et al. Griffin: mixing gated linear recurrences with local attention for efficient language models. Preprint at https://doi.org/10.48550/arXiv.2402.19427 (2024).
Marschall, O., Cho, K. & Savin, C. A unified framework of online learning algorithms for training recurrent neural networks. J. Mach. Learn. Res. 21, 5320–5353 (2020).
Elman, J. L. Finding structure in time. Cogn. Sci. 14, 179–211 (1990).
Siegelmann, H. T. Neural Networks and Analog Computation: Beyond the Turing Limit (Springer, 2012).
Li, Z., Han, J., E, W. & Li, Q. Approximation and optimization theory for linear continuous-time recurrent neural networks. J. Mach. Learn. Res. 23, 1997–2081 (2022).
Tallec, C. & Ollivier, Y. Can recurrent neural networks warp time? In International Conference on Learning Representations (Curran Associates, 2018).
Kitagawa, G. A self-organizing state-space model. J. Am. Stat. Assoc. 93, 1203–1215 (1998).
Lipton, Z. C., Berkowitz, J. & Elkan, C. A critical review of recurrent neural networks for sequence learning. Preprint at https://doi.org/10.48550/arXiv.1506.00019 (2015).
Salehinejad, H., Sankar, S., Barfett, J., Colak, E. & Valaee, S. Recent advances in recurrent neural networks. Preprint at https://doi.org/10.48550/arXiv.1801.01078 (2017).
Kidger, P. On neural differential equations. Preprint at https://doi.org/10.48550/arXiv.2202.02435 (2022).
Rumelhart, D. E. et al. Learning Internal Representations by Error Propagation (Institute for Cognitive Science, Univ. California, San Diego, 1985).
Werbos, P. J. Backpropagation through time: what it does and how to do it. Proc. IEEE 78, 1550–1560 (1990).
Pascanu, R., Mikolov, T. & Bengio, Y. On the difficulty of training recurrent neural networks. In International Conference on Machine Learning 1310–1318 (ACM, 2013).
Hochreiter, S. Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis, Technische Univ. München (1991).
Bengio, Y. & Frasconi, P. Credit assignment through time: alternatives to backpropagation. In Advances in Neural Information Processing Systems Vol. 6 (NeurIPS, 1993).
Hochreiter, S. & Schmidhuber, J. Long short-term memory. Neural Comput. 9, 1735–1780 (1997).
Cho, K. et al. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proc. 2014 Conference on Empirical Methods in Natural Language Processing 1724–1734 (EMNLP, 2014).
Mehta, H., Gupta, A., Cutkosky, A. & Neyshabur, B. Long range language modeling via gated state spaces. In 11th International Conference on Learning Representations (Curran Associates, 2023).
Hua, W., Dai, Z., Liu, H. & Le, Q. Transformer quality in linear time. In International Conference on Machine Learning 9099–9117 (ACM, 2022).
Sun, Y. et al. Retentive network: a successor to transformer for large language models. Preprint at https://doi.org/10.48550/arXiv.2307.08621 (2023).
Arjovsky, M., Shah, A. & Bengio, Y. Unitary evolution recurrent neural networks. In International Conference on Machine Learning 1120–1128 (ACM, 2016).
Mhammedi, Z., Hellicar, A., Rahman, A. & Bailey, J. Efficient orthogonal parametrisation of recurrent neural networks using Householder reflections. In International Conference on Machine Learning (ACM, 2017).
Kag, A. & Saligrama, V. Training recurrent neural networks via forward propagation through time. In International Conference on Machine Learning 5189–5200 (ACM, 2021).
Sutskever, I., Vinyals, O. & Le, Q. V. Sequence to sequence learning with neural networks. Adv. Neural Inf. Process. Syst. 27, 3104–3112 (2014).
Karuvally, A., Sejnowski, T. & Siegelmann, H. T. Hidden traveling waves bind working memory variables in recurrent neural networks. In 41st International Conference on Machine Learning (ACM, 2024).
Sieber, J., Alonso, C. A., Didier, A., Zeilinger, M. N. & Orvieto, A. Understanding the differences in foundation models: attention, state space models, and recurrent neural networks. Preprint at https://doi.org/10.48550/arXiv.2405.15731 (2024).
Tay, Y., Dehghani, M., Bahri, D. & Metzler, D. Efficient transformers: a survey. ACM Comput. Surv. 55, 109 (2023).
Katharopoulos, A., Vyas, A., Pappas, N. & Fleuret, F. Transformers are RNNs: fast autoregressive transformers with linear attention. In International Conference on Machine Learning 5156–5165 (ACM, 2020).
Peng, H. et al. Random feature attention. In 9th International Conference on Learning Representations (Curran Associates, 2021).
Schlag, I., Irie, K. & Schmidhuber, J. Linear transformers are secretly fast weight programmers. In International Conference on Machine Learning 9355–9366 (ACM, 2021).
Qin, Z. et al. The devil in linear transformer. In Proc. 2022 Conference on Empirical Methods in Natural Language Processing 7025–7041 (EMNLP, 2022).
Schmidhuber, J. Learning to control fast-weight memories: an alternative to dynamic recurrent networks. Neural Comput. 4, 131–139 (1992).
Qin, Z. et al. Scaling TransNormer to 175 billion parameters. Preprint at https://doi.org/10.48550/arXiv.2307.14995 (2023).
Katsch, T. GateLoop: fully data-controlled linear recurrence for sequence modeling. Preprint at https://doi.org/10.48550/arXiv.2311.01927 (2023).
Yang, S., Wang, B., Shen, Y., Panda, R. & Kim, Y. Gated linear attention transformers with hardware-efficient training. In 41st International Conference on Machine Learning (ACM, 2024).
Dauphin, Y. N., Fan, A., Auli, M. & Grangier, D. Language modeling with gated convolutional networks. In International Conference on Machine Learning 933–941 (ACM, 2017).
Ma, X. et al. MEGA: moving average equipped gated attention. In 11th International Conference on Learning Representations (Curran Associates, 2022).
Zhai, S. et al. An attention free transformer. Preprint at https://doi.org/10.48550/arXiv.2105.14103 (2021).
Peng, B. et al. Eagle and Finch: RWKV with matrix-valued states and dynamic recurrence. Preprint at https://doi.org/10.48550/arXiv.2404.05892 (2024).
Huang, F. et al. Encoding recurrence into transformers. In 11th International Conference on Learning Representations (Curran Associates, 2022).
Dai, Z. et al. Transformer-XL: attentive language models beyond a fixed-length context. In Proc. 57th Annual Meeting of the Association for Computational Linguistics (ACL, 2019).
Bulatov, A., Kuratov, Y. & Burtsev, M. Recurrent memory transformer. Adv. Neural Inf. Process. Syst. 35, 11079–11091 (2022).
Didolkar, A. et al. Temporal latent bottleneck: synthesis of fast and slow processing mechanisms in sequence learning. Adv. Neural Inf. Process. Syst. 35, 10505–10520 (2022).
Munkhdalai, T., Faruqui, M. & Gopal, S. Leave no context behind: efficient infinite context transformers with infini-attention. Preprint at https://doi.org/10.48550/arXiv.2404.07143 (2024).
Gu, A., Goel, K., Gupta, A. & Ré, C. On the parameterization and initialization of diagonal state space models. Adv. Neural Inf. Process. Syst. 35, 35971–35983 (2022).
Gupta, A., Gu, A. & Berant, J. Diagonal state spaces are as effective as structured state spaces. In 36th Conference on Neural Information Processing Systems (NeurIPS, 2022).
Massaroli, S. et al. Laughing hyena distillery: extracting compact recurrences from convolutions. In 37th Conference on Neural Information Processing Systems (NeurIPS, 2023).
Martin, E. & Cundy, C. Parallelizing linear recurrent neural nets over sequence length. In International Conference on Learning Representations (Curran Associates, 2018).
Dao, T. et al. Hungry hungry hippos: towards language modeling with state space models. In 11th International Conference on Learning Representations (Curran Associates, 2023).
Hasani, R. et al. Liquid structural state-space models. In 11th International Conference on Learning Representations (Curran Associates, 2023).
Hasani, R., Lechner, M., Amini, A., Rus, D. & Grosu, R. Liquid time-constant networks. In Proc. AAAI Conference on Artificial Intelligence Vol. 35, 7657–7666 (AAAI, 2021).
Olsson, C. et al. In-context learning and induction heads. Preprint at https://doi.org/10.48550/arXiv.2209.11895 (2022).
Ba, J., Hinton, G. E., Mnih, V., Leibo, J. Z. & Ionescu, C. Using fast weights to attend to the recent past. Adv. Neural Inf. Process. Syst. 29, 4338–4346 (2016).
Jing, L. et al. Gated orthogonal recurrent units: on learning to forget. Neural Comput. 31, 765–783 (2019).
Orvieto, A., De, S., Gulcehre, C., Pascanu, R. & Smith, S. L. Universality of linear recurrences followed by non-linear projections: finite-width guarantees and benefits of complex eigenvalues. In 41st International Conference on Machine Learning (ACM, 2024).
Tay, Y. et al. Long range arena: a benchmark for efficient transformers. In 9th International Conference on Learning Representations (Curran Associates, 2021).
Poli, M. et al. Hyena hierarchy: towards larger convolutional language models. In International Conference on Machine Learning 28043–28078 (ACM, 2023).
Poli, M. et al. Mechanistic design and scaling of hybrid architectures. In 41st International Conference on Machine Learning (ACM, 2024).
Li, S., Li, W., Cook, C., Zhu, C. & Gao, Y. Independently recurrent neural network (IndRNN): building a longer and deeper RNN. In Proc. IEEE Conference on Computer Vision and Pattern Recognition 5457–5466 (IEEE, 2018).
Beck, M. et al. xLSTM: extended long short-term memory. Adv. Neural Inf. Process. Syst. 37, 107547–107603 (2025).
Qin, Z., Yang, S. & Zhong, Y. Hierarchically gated recurrent neural network for sequence modeling. In 37th Conference on Neural Information Processing Systems (NeurIPS, 2023).
Mandic, D. P. & Chambers, J. A. Recurrent Neural Networks for Prediction: Learning Algorithms, Architectures and Stability (John Wiley & Sons, 2001).
Chen, R. T., Rubanova, Y., Bettencourt, J. & Duvenaud, D. K. Neural ordinary differential equations. In Advances in Neural Information Processing Systems Vol. 31 (NeurIPS, 2018).
Rubanova, Y., Chen, R. T. & Duvenaud, D. K. Latent ordinary differential equations for irregularly-sampled time series. In Advances in Neural Information Processing Systems Vol. 32 (NeurIPS, 2019).
Kidger, P., Morrill, J., Foster, J. & Lyons, T. Neural controlled differential equations for irregular time series. Adv. Neural Inf. Process. Syst. 33, 6696–6707 (2020).
Lechner, M. et al. Neural circuit policies enabling auditable autonomy. Nat. Mach. Intell. 2, 642–652 (2020).
Massaroli, S., Poli, M., Park, J., Yamashita, A. & Asama, H. Dissecting neural ODEs. Adv. Neural Inf. Process. Syst. 33, 3952–3963 (2020).
Rusch, T. K. & Mishra, S. UnICORNN: a recurrent model for learning very long time dependencies. In International Conference on Machine Learning 9168–9178 (ACM, 2021).
Effenberger, F., Carvalho, P., Dubinin, I. & Singer, W. The functional role of oscillatory dynamics in neocortical circuits: a computational perspective. Proc. Natl. Acad. Sci. USA 122, e2412830122 (2025).
Lanthaler, S., Rusch, T. K. & Mishra, S. Neural oscillators are universal. Adv. Neural Inf. Process. Syst. 36, 46786–46806 (2024).
Irie, K., Gopalakrishnan, A. & Schmidhuber, J. Exploring the promise and limits of real-time recurrent learning. In 12th International Conference on Learning Representations (Curran Associates, 2024).
Feng, L. et al. Attention as an RNN. Preprint at https://doi.org/10.48550/arXiv.2405.13956 (2024).
Dao, T. & Gu, A. Transformers are SSMs: generalized models and efficient algorithms through structured state space duality. In 41st International Conference on Machine Learning (ACM, 2024).
Merrill, W., Petty, J. & Sabharwal, A. The illusion of state in state-space models. In 41st International Conference on Machine Learning 35492–35506 (ACM, 2024).
Hahn, M. Theoretical limitations of self-attention in neural sequence models. Trans. Assoc. Comput. Linguist. 8, 156–171 (2020).
Peng, B., Narayanan, S. & Papadimitriou, C. On limitations of the transformer architecture. In First Conference on Language Modeling (COLM, 2024).
Zeng, A., Chen, M., Zhang, L. & Xu, Q. Are transformers effective for time series forecasting? In Proc. AAAI Conference on Artificial Intelligence Vol. 37, 11121–11128 (AAAI, 2023).
Jelassi, S., Brandfonbrener, D., Kakade, S. M. & Malach, E. Repeat after me: transformers are better than state space models at copying. In 41st International Conference on Machine Learning (ACM, 2024).
Jiang, A. Q. et al. Mistral 7B. Preprint at https://doi.org/10.48550/arXiv.2310.06825 (2023).
Jamba Team et al. Jamba-1.5: hybrid transformer-Mamba models at scale. Preprint at https://doi.org/10.48550/arXiv.2408.12570 (2024).
Glorioso, P. et al. Zamba: a compact 7B SSM hybrid model. Preprint at https://doi.org/10.48550/arXiv.2405.16712 (2024).
Dao, T. FlashAttention-2: faster attention with better parallelism and work partitioning. In 12th International Conference on Learning Representations (Curran Associates, 2024).
Casoni, M. et al. Pitfalls in processing infinite-length sequences with popular approaches for sequential data. In IAPR Workshop on Artificial Neural Networks in Pattern Recognition 37–48 (Springer, 2024).
Zucchet, N., Meier, R., Schug, S., Mujika, A. & Sacramento, J. Online learning of long-range dependencies. Adv. Neural Inf. Process. Syst. 36, 10477–10493 (2023).
Betti, A., Gori, M. & Melacci, S. Learning visual features under motion invariance. Neural Netw. 126, 275–299 (2020).
Tiezzi, M. et al. Stochastic coherence over attention trajectory for continuous learning in video streams. In Proc. 31st International Joint Conference on Artificial Intelligence 3480–3486 (IJCAI, 2022).