A New Approach to Compositional Data Analysis using \(L^{\infty}\)-normalization with Applications to Vaginal Microbiome

Abstract

We introduce a novel approach to compositional data analysis based on $L^{\infty}$-normalization, addressing the challenges posed by zero-rich high-throughput data. Traditional methods such as Aitchison's transformations require excluding zeros, yet omics datasets contain structural zeros that cannot be removed without violating the inherent biological structure. Such datasets lie exclusively on the boundary of compositional space, so approaches that focus on its interior are fundamentally misaligned with them. We present a family of $L^p$-normalizations, focusing on $L^{\infty}$-normalization because of its advantageous properties. This approach identifies compositional space with the $L^{\infty}$-simplex, represented as a union of top-dimensional faces called $L^{\infty}$-cells. Each cell consists of the samples in which one component's absolute abundance equals or exceeds all others, and it carries a coordinate system that identifies it with a $d$-dimensional unit cube. When applied to vaginal microbiome data, the $L^{\infty}$-decomposition aligns with established Community State Types (CSTs) while offering several advantages: each $L^{\infty}$-CST is named after its dominating component, has a clear biological meaning, remains stable under changes in the samples, resolves issues that arise with cluster-based approaches, and provides a coordinate system for exploring its internal structure. We extend homogeneous coordinates through cube embedding, which maps data into a $d$-dimensional unit cube. These embeddings can be integrated via Cartesian products, providing unified representations from multiple perspectives. Although demonstrated here through microbiome studies, these methods apply to any compositional data.
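
The core construction described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names are hypothetical, and it rests on two assumptions that the abstract does not state. First, a sample with $n$ components is normalized by dividing it by its largest component, so the dominating component becomes 1 and the remaining $n - 1$ values serve as cell coordinates in a unit cube (that is, $d = n - 1$). Second, ties between components are broken by taking the first index. Zeros are kept as they are, with no pseudo-counts.

```python
import numpy as np


def linf_normalize(counts):
    """Divide each sample (row) by its largest component (L-infinity norm).

    Assumes non-negative abundances; zeros are left untouched.
    """
    counts = np.asarray(counts, dtype=float)
    row_max = counts.max(axis=1, keepdims=True)
    if np.any(row_max <= 0):
        raise ValueError("each sample needs at least one positive component")
    return counts / row_max


def linf_cell(counts):
    """Index of the dominating component of each sample.

    Ties are broken by the first index (an assumption of this sketch).
    """
    return np.argmax(np.asarray(counts, dtype=float), axis=1)


def cell_coordinates(counts):
    """Return (dominant index, coordinates inside the cell) per sample.

    The coordinates are the normalized values of the non-dominating
    components, so each row lies in the unit cube [0, 1]^(n - 1).
    """
    normalized = linf_normalize(counts)
    dominant = linf_cell(counts)
    n_samples, n_components = normalized.shape
    keep = np.ones_like(normalized, dtype=bool)
    keep[np.arange(n_samples), dominant] = False  # drop the dominating entry
    coords = normalized[keep].reshape(n_samples, n_components - 1)
    return dominant, coords


# Toy abundance table: 3 samples, 3 components, with zeros kept as-is.
counts = np.array([[90, 10, 0],
                   [0, 5, 20],
                   [3, 3, 0]])
dominant, coords = cell_coordinates(counts)
```

In this toy table the first and third samples fall in the cell of component 0 and the second in the cell of component 2. The third sample sits on the face shared by two cells, which is why a tie-breaking rule is needed.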
