Optimal vintage factor analysis with deflation varimax

Abstract

Vintage factor analysis is an important type of factor analysis that first finds a low-dimensional representation of the original data and then seeks a rotation under which that representation is scientifically meaningful. The most widely used vintage factor analysis is Principal Component Analysis (PCA) followed by the varimax rotation. Despite its popularity, few theoretical guarantees are available to date, mainly because the varimax rotation requires solving a non-convex optimization problem over the set of orthogonal matrices. In this paper, we propose a deflation varimax procedure that solves for each row of an orthogonal matrix sequentially. In addition to its net computational gain and flexibility, we are able to fully establish theoretical guarantees for the proposed procedure in a broader context. Adopting this new deflation varimax as the second step after PCA, we further analyze the resulting two-step procedure under a general class of factor models. Our results show that it estimates the factor loading matrix at the minimax optimal rate when the signal-to-noise ratio (SNR) is moderate or large. In the low SNR regime, we offer a possible improvement over PCA followed by the deflation varimax when the additive noise under the factor model is structured. The modified procedure is shown to be minimax optimal in all SNR regimes. Our theory holds in finite samples and allows the number of latent factors to grow with the sample size, and the ambient dimension to grow with, or even exceed, the sample size. Extensive simulations and real data analyses further corroborate our theoretical findings.
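To make the two-step idea concrete, the following is a minimal sketch of PCA followed by a row-by-row (deflation-style) rotation. It is an illustration under stated assumptions, not the procedure analyzed in the paper: the mean fourth power of the projected scores is used as a stand-in for the varimax criterion, each row is found by a simple fixed-point update, and orthogonality to earlier rows is enforced by projection. The function names `pca_scores` and `deflation_rotation` and all parameters are hypothetical.

```python
import numpy as np


def pca_scores(X, r):
    """Step 1 (PCA): top-r principal component scores, scaled so Z.T @ Z / n = I."""
    Xc = X - X.mean(axis=0)
    U, _, _ = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :r] * np.sqrt(X.shape[0])


def deflation_rotation(Z, n_iter=200, seed=0):
    """Step 2 (assumed variant): build an r x r orthogonal matrix one row at a time.

    Each row q approximately maximizes mean((Z @ q) ** 4) over unit vectors
    that are orthogonal to the rows already found.
    """
    rng = np.random.default_rng(seed)
    n, r = Z.shape
    Q = np.zeros((r, r))
    for k in range(r):
        prev = Q[:k]  # rows found so far
        q = rng.standard_normal(r)
        q -= prev.T @ (prev @ q)  # start in the orthogonal complement
        q /= np.linalg.norm(q)
        for _ in range(n_iter):
            # Gradient of the mean fourth power, up to a constant factor.
            g = Z.T @ (Z @ q) ** 3 / n
            g -= prev.T @ (prev @ g)  # deflation: remove earlier directions
            norm = np.linalg.norm(g)
            if norm < 1e-12:
                break
            q = g / norm  # fixed-point update back onto the unit sphere
        Q[k] = q
    return Q


# Toy data (assumed): 3 heavy-tailed latent factors observed in 10 dimensions.
rng = np.random.default_rng(1)
n, p, r = 200, 10, 3
X = rng.laplace(size=(n, r)) @ rng.standard_normal((r, p))
X += 0.1 * rng.standard_normal((n, p))

Z = pca_scores(X, r)
Q = deflation_rotation(Z)
rotated = Z @ Q.T  # rotated low-dimensional representation
```

Because every row after the first is searched only within the orthogonal complement of the earlier rows, `Q` is orthogonal by construction, which is the structural point of solving the rotation sequentially rather than jointly over all orthogonal matrices.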