arXiv

How Well do LLMs Compress Their Own Chain-of-Thought? A Token Complexity Approach

Ayeong Lee, Ethan Che, Tianyi Peng

Abstract

Chain-of-thought prompting has emerged as a powerful technique for enabling large language models (LLMs) to solve complex reasoning tasks. However, these reasoning chains can be verbose, raising concerns about efficiency. In response, recent work has sought to decrease response lengths through simple prompting strategies (e.g., 'be concise'). In this work, we conduct the first systematic study of the relationship between reasoning length and model performance across a diverse range of compression instructions (e.g., 'use 10 words or less' or 'remove all punctuation'). In doing so, we discover a universal tradeoff between reasoning length and accuracy that persists across even very distinct reasoning chains. We demonstrate that this tradeoff emerges from a sharp threshold behavior at the question level: each task has an intrinsic 'token complexity', a minimal number of tokens required for successful problem-solving. We show how token complexity enables us to compute information-theoretic limits on the accuracy-compression tradeoff, and find that prompt-based compression strategies operate far from these theoretical limits. This suggests there may be significant room for improvement, and our framework provides a benchmark to help researchers evaluate progress in reasoning efficiency. Our work also highlights the importance of adaptive compression (giving shorter responses for easier questions), and we show that token complexity is a useful tool for measuring this capability.
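
The threshold idea in the abstract can be made concrete with a small sketch. The Python below is illustrative only and is not the authors' estimator or their information-theoretic bound. It assumes that each question has a hypothetical threshold `tau` such that a response is correct exactly when its length is at least `tau`, and that responses to one question are available under several compression prompts. All function names and the toy numbers are assumptions made for this example. The oracle bound shown is a simplified stand-in that spends a shared token budget on the easiest questions first.

```python
import math


def estimate_token_complexity(observations):
    """Fit a per-question threshold tau (hypothetical helper).

    observations: list of (length, correct) pairs for ONE question,
    one pair per compression prompt. Returns the smallest tau that
    minimizes disagreements with the rule "correct iff length >= tau".
    math.inf means the question looks unsolvable at any observed length.
    """
    candidates = sorted({length for length, _ in observations}) + [math.inf]
    best_tau, best_errors = math.inf, None
    for tau in candidates:
        errors = sum((length >= tau) != bool(correct)
                     for length, correct in observations)
        if best_errors is None or errors < best_errors:
            best_tau, best_errors = tau, errors
    return best_tau


def fixed_length_accuracy(taus, length):
    """Accuracy if every question gets the same response length."""
    return sum(tau <= length for tau in taus) / len(taus)


def oracle_accuracy(taus, avg_budget):
    """Accuracy of an oracle that adapts length to each question.

    Assumption: the oracle knows every tau and may spend a total of
    avg_budget * len(taus) tokens. Solving the cheapest questions
    first maximizes the number solved under that budget.
    """
    remaining = avg_budget * len(taus)
    solved = 0
    for tau in sorted(taus):
        if tau > remaining:
            break
        remaining -= tau
        solved += 1
    return solved / len(taus)


# Toy data (made up): one question observed under four prompts.
obs = [(10, False), (25, False), (40, True), (80, True)]
tau_hat = estimate_token_complexity(obs)

# Toy thresholds for four questions; the last one is never solved.
taus = [10, 40, 100, math.inf]
fixed = fixed_length_accuracy(taus, 15)
adaptive = oracle_accuracy(taus, 15)
```

Under these toy numbers, the same average budget of 15 tokens solves one question out of four at a fixed length but two out of four when length is allocated per question. This is the kind of gap that adaptive compression is meant to close.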