arXiv

Feature Subset Weighting for Distance-based Supervised Learning through Choquet Integration

Adnan Theerens, Yvan Saeys, Chris Cornelis

Abstract

This paper introduces feature subset weighting based on monotone measures for distance-based supervised learning. The Choquet integral is used to define a distance metric that incorporates these weights. As a result, the proposed distances can capture non-linear relationships and account for interactions both between conditional and decision attributes and among the conditional attributes themselves, yielding a more flexible distance measure. In particular, we show that the distances remain unaffected by the addition of duplicate and strongly correlated features. The approach also makes feature subset weighting computationally feasible: only $m$ feature subset weights need to be calculated each time, rather than all $2^m$, where $m$ is the number of attributes. We also examine how using the Choquet integral to measure similarity leads to a non-equivalent definition of distance, and we further explore the relationship between distance and similarity through dual measures. Additionally, we propose symmetric Choquet distances and similarities that preserve the classical symmetry between similarity and distance. Finally, we introduce a concrete feature subset weighting distance, evaluate its performance in a $k$-nearest neighbors (KNN) classification setting, and compare it against Mahalanobis distances and weighted distance methods.
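
To make the "$m$ weights instead of $2^m$" claim concrete, the following is a minimal sketch of a discrete Choquet integral applied to per-feature absolute differences. It is an illustration under stated assumptions, not the paper's concrete distance: the abstract does not specify the monotone measure, so `make_group_measure` is a hypothetical toy measure that counts how many groups of duplicated features a subset touches, and the names `choquet_distance` and `mu` are likewise invented here.

```python
from typing import Callable, FrozenSet, Sequence

# A monotone measure maps a feature subset (as indices) to a weight in [0, 1].
Measure = Callable[[FrozenSet[int]], float]


def choquet_distance(x: Sequence[float], y: Sequence[float], mu: Measure) -> float:
    """Discrete Choquet integral of the per-feature differences |x_i - y_i|.

    Assumption: the distance is taken to be the plain Choquet integral of
    absolute differences; the paper's concrete definition may differ.
    """
    diffs = [abs(a - b) for a, b in zip(x, y)]
    # Sort feature indices by increasing difference.
    order = sorted(range(len(diffs)), key=diffs.__getitem__)
    total, previous = 0.0, 0.0
    for rank, idx in enumerate(order):
        # Only the m nested "upper" subsets are ever evaluated by mu,
        # not all 2^m subsets of features.
        upper = frozenset(order[rank:])
        total += (diffs[idx] - previous) * mu(upper)
        previous = diffs[idx]
    return total


def make_group_measure(groups: Sequence[int]) -> Measure:
    """Hypothetical toy measure: the fraction of feature groups a subset touches.

    groups[i] is the group label of feature i; duplicated features share a label,
    so adding a duplicate never increases the weight of a subset.
    """
    n_groups = len(set(groups))

    def mu(subset: FrozenSet[int]) -> float:
        return len({groups[i] for i in subset}) / n_groups

    return mu


# Two distinct features versus the same data with feature 1 duplicated.
d_plain = choquet_distance((1.0, 4.0), (3.0, 0.0), make_group_measure([0, 1]))
d_dup = choquet_distance((1.0, 4.0, 4.0), (3.0, 0.0, 0.0), make_group_measure([0, 1, 1]))
print(d_plain, d_dup)
```

With this toy measure the duplicated feature contributes nothing extra, so `d_plain` and `d_dup` coincide, which mirrors the invariance to duplicate features described above. A function of this shape could in principle be passed as a custom metric to a KNN implementation, though that wiring is not shown here.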