Visualizing high-dimensional loss landscapes with Hessian directions

Abstract

Analyzing geometric properties of high-dimensional loss functions, such as local curvature and the existence of other optima around a given point in loss space, can help clarify the interplay between neural network structure, implementation attributes, and learning performance. In this work, we combine concepts from high-dimensional probability and differential geometry to study how curvature properties in lower-dimensional loss representations depend on those in the original loss space. We show that, if random projections are used, saddle points in the original space are rarely correctly identified as such in expected lower-dimensional representations. The principal curvature in the expected lower-dimensional representation is proportional to the mean curvature in the original loss space. Hence, the mean curvature in the original loss space determines whether saddle points appear, on average, as minima, maxima, or almost flat regions. We use the connection between expected curvature in random projections and mean curvature in the original space (i.e., the normalized Hessian trace) to compute Hutchinson-type trace estimates without calculating the Hessian-vector products that the original Hutchinson method requires. Because random projections are not suited to correctly identifying saddle information, we propose to study projections along dominant Hessian directions that are associated with the largest and smallest principal curvatures. We connect our findings to the ongoing debate on loss landscape flatness and generalizability. Finally, for different common image classifiers and a function approximator, we show and compare random and Hessian projections of loss landscapes with up to about $7\times 10^6$ parameters.
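
The two ideas in the abstract, trace estimation from curvature along random directions and projection along dominant Hessian directions, can be illustrated with a minimal NumPy sketch on a toy problem. This is not the paper's implementation. The quadratic loss, the explicit diagonal Hessian `H`, the helper names `curvature_along` and `trace_estimate`, the step size `eps`, and the sample counts are all assumptions made for illustration. The sketch approximates the directional curvature $v^\top H v$ by a second-order central difference of loss values, so no Hessian-vector product is needed, and averages it over Gaussian directions, for which $\mathbb{E}[v^\top H v] = \operatorname{tr}(H)$.

```python
import numpy as np


def curvature_along(loss, theta, v, eps=1e-3):
    """Approximate v^T H v with a central second difference of loss values."""
    return (loss(theta + eps * v) - 2.0 * loss(theta) + loss(theta - eps * v)) / eps**2


def trace_estimate(loss, theta, num_samples=4000, eps=1e-3, seed=0):
    """Hutchinson-type estimate of tr(H) that uses loss evaluations only.

    For v ~ N(0, I), the expectation of v^T H v equals tr(H).
    """
    rng = np.random.default_rng(seed)
    samples = [
        curvature_along(loss, theta, rng.standard_normal(theta.size), eps)
        for _ in range(num_samples)
    ]
    return float(np.mean(samples))


# Toy saddle (assumption): eigenvalues of mixed sign, positive mean curvature.
n = 20
eigenvalues = np.linspace(-1.0, 3.0, n)
H = np.diag(eigenvalues)


def quadratic_loss(theta):
    return 0.5 * theta @ H @ theta


theta0 = np.zeros(n)  # the saddle point itself

# 1) Trace estimate without Hessian-vector products.
est = trace_estimate(quadratic_loss, theta0)

# 2) Curvature seen along random unit directions: it concentrates around the
#    mean curvature tr(H) / n, so the saddle mostly looks like a minimum.
rng = np.random.default_rng(1)
dirs = rng.standard_normal((1000, n))
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
random_curvatures = np.array(
    [curvature_along(quadratic_loss, theta0, d) for d in dirs]
)
frac_positive = float(np.mean(random_curvatures > 0))

# 3) Curvature along the dominant Hessian directions exposes the saddle.
#    eigh returns eigenvalues in ascending order.
w, V = np.linalg.eigh(H)
c_min = curvature_along(quadratic_loss, theta0, V[:, 0])
c_max = curvature_along(quadratic_loss, theta0, V[:, -1])

print(f"trace estimate: {est:.2f} (exact: {np.trace(H):.2f})")
print(f"fraction of random directions with positive curvature: {frac_positive:.3f}")
print(f"curvature along extreme eigenvectors: {c_min:.3f}, {c_max:.3f}")
```

In this toy setting the Hessian is small enough to build and diagonalize explicitly. For a network with millions of parameters that is not feasible, and the extreme eigenvectors would have to come from an iterative eigensolver, which this sketch does not show.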