arXiv

NCAP: Scene Text Image Super-Resolution with Non-CAtegorical Prior

Abstract

Scene text image super-resolution (STISR) aims to enhance the resolution and quality of low-resolution scene text images. Unlike previous studies that treated scene text images as natural images, recent methods using a text prior (TP), extracted from a pre-trained text recognizer, have shown strong performance. However, two major issues emerge: (1) Explicit categorical priors, like TP, can negatively impact STISR if incorrect. We reveal that these explicit priors are unstable and propose replacing them with a Non-CAtegorical Prior (NCAP) based on penultimate-layer representations. (2) Pre-trained recognizers used to generate TP struggle with low-resolution images. To address this, most studies jointly train the recognizer with the STISR network to bridge the domain gap between low- and high-resolution images, but this can cause an overconfidence phenomenon in the prior modality. We highlight this issue and propose mitigating it by mixing hard and soft labels. Experiments on the TextZoom dataset demonstrate an improvement of 3.5% over existing methods, and our method significantly enhances generalization performance, by 14.8%, across four text recognition datasets. Our method generalizes to all TP-guided STISR networks.
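The abstract does not specify how hard and soft labels are mixed, so the following is only a minimal sketch of one common way to do it: a convex combination of a one-hot target and a soft distribution, used as the target of a cross-entropy loss. The mixing weight `alpha`, the function names, and the source of the soft labels (here a made-up `teacher_logits` vector) are all assumptions for illustration, not the paper's formulation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mix_labels(hard_index, soft_probs, alpha):
    """Blend a one-hot (hard) label with a soft distribution.

    alpha is a hypothetical mixing weight (an assumption, not from the paper):
    1.0 keeps only the hard label, 0.0 keeps only the soft label.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    return [
        alpha * (1.0 if i == hard_index else 0.0) + (1.0 - alpha) * p
        for i, p in enumerate(soft_probs)
    ]


def cross_entropy(target, logits):
    """Cross-entropy between a target distribution and softmax(logits)."""
    probs = softmax(logits)
    return -sum(t * math.log(p) for t, p in zip(target, probs) if t > 0.0)


# Toy example: one character position, four character classes.
student_logits = [2.0, 0.5, 0.1, -1.0]  # recognizer output being trained
teacher_logits = [1.2, 1.0, 0.2, -0.5]  # hypothetical source of soft labels
soft = softmax(teacher_logits)
mixed = mix_labels(hard_index=0, soft_probs=soft, alpha=0.5)
loss = cross_entropy(mixed, student_logits)
```

Because the mixed target never places all of its mass on a single class when `alpha < 1`, a recognizer trained against it is not pushed toward fully saturated predictions, which is the intuition behind using such a target to counter overconfidence.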