Expressivity of In-Context Learning in Large Language Models

Lorenzon, Nicola
2024/2025

Abstract

In-context learning (ICL) drives much of the practical utility of large language models (LLMs), but its limitations, particularly on tasks requiring algorithmic reasoning, lack a precise characterization. Theoretically, transformer networks with unlimited chain-of-thought tokens in their output should be able to simulate any learning algorithm, but recent work has found that LLMs fall far short in practice. In this paper, we contribute to the growing body of work examining this discrepancy by evaluating the ICL capabilities of several LLMs (ChatGPT, DeepSeek, Qwen, and Llama) on a suite of formal language recognition tasks, which provide a controlled testbed for assessing reasoning ability grounded in the theory of computation. Our experiments span a range of language classes, namely sub-regular, regular, deterministic context-free, context-free, and context-sensitive languages. Bearing in mind recent work showing that a transformer network’s expressive power increases with the number of padding tokens in its input, we test several ways of encoding exemplars that result in varying numbers of input tokens. To assess the role of chain-of-thought, we also compare prompts that require the model to produce a label immediately after reading the input with prompts that permit unrestricted reasoning before the label is produced. We find that pretrained LLMs perform very poorly on these reasoning tasks in all cases, successfully learning only the language of binary strings that begin with a 1. Also, contrary to expectation, adding padding and chain-of-thought tokens does not consistently improve accuracy. Still, ICL with pretrained LLMs is consistently more accurate than training a small transformer from scratch on the same data, suggesting that pretraining imbues transformers with a learning mechanism that is at least more sample-efficient than training from scratch. These results reveal a disconnect between theoretical models of transformer capacity and the practical behavior of LLMs in ICL.
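
For illustration only, the following minimal Python sketch (not taken from the thesis; the prompt wording, exemplar counts, string lengths, and helper names such as in_language and build_icl_prompt are assumptions) shows how labeled exemplars for one of the membership tasks mentioned above, the regular language of binary strings that begin with a 1, could be formatted into an ICL prompt.

# Illustrative sketch: building an in-context learning prompt for a
# formal-language membership task. All naming and prompt wording here are
# assumptions for illustration, not the thesis's actual protocol.
import random

def in_language(s: str) -> bool:
    """Reference recognizer: binary strings that begin with a 1."""
    return s.startswith("1")

def random_binary_string(max_len: int = 8) -> str:
    """Sample a random binary string of length 1..max_len."""
    length = random.randint(1, max_len)
    return "".join(random.choice("01") for _ in range(length))

def build_icl_prompt(n_exemplars: int, query: str) -> str:
    """Format labeled exemplars followed by an unlabeled query string."""
    lines = ["Decide whether each string belongs to the language."]
    for _ in range(n_exemplars):
        s = random_binary_string()
        label = "yes" if in_language(s) else "no"
        lines.append(f"String: {s} -> {label}")
    lines.append(f"String: {query} ->")  # the model must supply this label
    return "\n".join(lines)

if __name__ == "__main__":
    random.seed(0)
    print(build_icl_prompt(n_exemplars=5, query="1010"))

In an experiment of the kind the abstract describes, a prompt of this form would be sent to an LLM and the completion compared against the reference label for the query string; the recognizer here serves only to label exemplars, and the encoding of exemplars could be varied to change the number of input tokens.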
Keywords: NLP, Formal Languages, LLMs, In-context Learning, Expressivity
File: Lorenzon_Nicola.pdf (4.22 MB, Adobe PDF), restricted access.

Use this identifier to cite or link to this item: https://hdl.handle.net/20.500.12608/99633