PROVGEN: A Privacy-Preserving Approach for Outcome Validation in Genomic Research
Abstract
Genomic research has grown rapidly in recent years, yet dataset sharing remains limited because of privacy concerns. This limitation hinders the reproducibility and validation of research outcomes, both of which are essential for identifying computational errors made during the research process. In this paper, we introduce PROVGEN, a privacy-preserving method for sharing genomic datasets that facilitates reproducibility and outcome validation in genome-wide association studies (GWAS). Our approach encodes genomic data into a binary space and applies a two-stage process. First, we generate a differentially private version of the dataset using an XOR-based mechanism that incorporates biological characteristics. Second, we restore data utility by using optimal transport to adjust the minor allele frequency (MAF) values of the noisy dataset so that they align with the published MAFs. We then convert the processed binary data back into its genomic representation and publish the resulting dataset. We evaluate PROVGEN on three real-world genomic datasets and compare it with local differential privacy and three synthesis-based methods. Our scheme outperforms all of these existing methods in detecting GWAS outcome errors, achieves better data utility, and provides stronger privacy protection against membership inference attacks (MIAs). We expect that our method will make genomic researchers more inclined to share differentially private datasets that retain the data quality needed to reproduce their findings.
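The two-stage pipeline summarized above can be pictured with a toy sketch. Everything in it is an illustrative assumption rather than the PROVGEN construction: the two-bits-per-SNP encoding, the function names (`encode`, `xor_noise`, `match_maf`), and the choice of mechanisms are all hypothetical. In particular, stage one is replaced by plain per-bit randomized response (XOR with Bernoulli noise, flip probability 1/(1 + e^ε)), which ignores the biological characteristics the abstract mentions, and stage two is replaced by a greedy bit-flipping step standing in for optimal transport.

```python
import numpy as np

def encode(genotypes):
    """Map genotypes in {0, 1, 2} (shape n x m) to bits (shape n x m x 2).

    Hypothetical encoding: genotype = sum of its two bits, so every bit
    pattern decodes to a valid genotype.
    """
    g = np.asarray(genotypes)
    return np.stack([g >= 1, g >= 2], axis=-1).astype(np.uint8)

def decode(bits):
    """Convert the binary representation back to genotypes in {0, 1, 2}."""
    return bits.sum(axis=-1)

def maf(bits):
    """Per-SNP frequency of the designated minor allele (fraction of 1-bits)."""
    return bits.mean(axis=(0, 2))

def xor_noise(bits, epsilon, rng):
    """Stand-in for stage one: XOR each bit with Bernoulli noise.

    This is ordinary randomized response with flip probability
    1 / (1 + e^epsilon), NOT the paper's XOR mechanism.
    """
    p_flip = 1.0 / (1.0 + np.exp(epsilon))
    mask = (rng.random(bits.shape) < p_flip).astype(np.uint8)
    return bits ^ mask

def match_maf(bits, target_maf, rng):
    """Stand-in for stage two: flip as few bits as needed per SNP so the
    noisy frequency matches the published one. A greedy simplification,
    NOT the optimal-transport adjustment described in the abstract.
    """
    out = bits.copy()
    n = bits.shape[0]
    for j in range(bits.shape[1]):
        col = out[:, j, :].ravel().copy()            # all 2n bits of SNP j
        want = int(round(float(target_maf[j]) * col.size))
        have = int(col.sum())
        if have > want:                              # too many minor alleles
            idx = rng.choice(np.flatnonzero(col == 1), size=have - want, replace=False)
            col[idx] = 0
        elif have < want:                            # too few minor alleles
            idx = rng.choice(np.flatnonzero(col == 0), size=want - have, replace=False)
            col[idx] = 1
        out[:, j, :] = col.reshape(n, 2)
    return out
```

A caller would run `encode`, then `xor_noise`, then `match_maf` with the published MAF vector, and finally `decode` to obtain a shareable genotype matrix. The sketch only conveys the data flow; it says nothing about the privacy or utility guarantees of the actual method.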