A Geometric Notion of Causal Probing

Abstract

The linear subspace hypothesis (Bolukbasi et al., 2016) states that, in a language model's representation space, all information about a concept such as verbal number is encoded in a linear subspace. Prior work has relied on auxiliary classification tasks to identify and evaluate candidate subspaces that might support this hypothesis. We instead give a set of intrinsic criteria that characterize an ideal linear concept subspace and enable us to identify the subspace using only the language model distribution. Our information-theoretic framework accounts for spuriously correlated features in the representation space (Kumar et al., 2022) by reconciling the statistical notion of concept information with the geometric notion of how concepts are encoded in the representation space. As a byproduct of this analysis, we hypothesize a causal process for how a language model might leverage concepts during generation. Empirically, we find that, under our framework, linear concept erasure removes most concept information for verbal number as well as for some complex aspect-level sentiment concepts from a restaurant review dataset. Our causal intervention for controlled generation shows that, for at least one concept across two language models, the concept subspace can be used to manipulate the concept value of the generated word with precision.
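To make the geometric notions in the abstract concrete, the following is a minimal sketch of linear concept erasure and of a subspace-level intervention, written in dependency-free Python. It assumes the concept subspace is already known and is given by a few spanning vectors. The function names (`orthonormalize`, `split`, `erase`, `intervene`) and the toy vectors are hypothetical and chosen for illustration only, and the way the subspace is identified from the language model distribution is not shown here.

```python
import math


def orthonormalize(vectors, tol=1e-12):
    """Gram-Schmidt: return an orthonormal basis for the span of `vectors`."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            coeff = sum(wi * bi for wi, bi in zip(w, b))
            w = [wi - coeff * bi for wi, bi in zip(w, b)]
        norm = math.sqrt(sum(wi * wi for wi in w))
        if norm > tol:  # skip vectors that are (numerically) linearly dependent
            basis.append([wi / norm for wi in w])
    return basis


def split(h, basis):
    """Decompose h into (component inside the subspace, orthogonal remainder)."""
    inside = [0.0] * len(h)
    for b in basis:
        coeff = sum(hi * bi for hi, bi in zip(h, b))
        inside = [pi + coeff * bi for pi, bi in zip(inside, b)]
    outside = [hi - pi for hi, pi in zip(h, inside)]
    return inside, outside


def erase(h, basis):
    """Linear erasure: keep only the part of h orthogonal to the subspace."""
    return split(h, basis)[1]


def intervene(h, basis, target):
    """Illustrative intervention (an assumption, not the paper's exact method):
    replace the concept component of h with that of `target`."""
    target_inside, _ = split(target, basis)
    _, outside = split(h, basis)
    return [o + t for o, t in zip(outside, target_inside)]


# Toy example: a 3-dimensional representation and a 1-dimensional concept subspace.
concept_basis = orthonormalize([[1.0, 1.0, 0.0]])
h = [3.0, 1.0, 2.0]
erased = erase(h, concept_basis)  # approximately [1, -1, 2]
swapped = intervene(h, concept_basis, [-2.0, -2.0, 5.0])
```

After erasure, the representation has no component along any direction in the concept subspace, so no linear readout restricted to that subspace can recover the concept. The intervention keeps the orthogonal remainder of `h` untouched and changes only the part that lies in the subspace, which is the sense in which a concept subspace could be used to steer the concept value of a generated word.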