Abstract
Scene text image super-resolution (STISR) aims to improve the resolution and quality of low-resolution scene text images. Unlike earlier studies that treated scene text images as natural images, recent methods that use a text prior (TP) extracted from a pre-trained text recognizer have shown strong performance. However, two major issues remain: (1) Explicit categorical priors such as TP can harm STISR when they are incorrect. We show that these explicit priors are unstable and propose replacing them with a Non-CAtegorical Prior (NCAP) built from penultimate-layer representations. (2) The pre-trained recognizers used to generate TP perform poorly on low-resolution images. To address this, most studies jointly train the recognizer with the STISR network to bridge the domain gap between low- and high-resolution images, but this can make the prior modality overconfident. We identify this issue and propose mitigating it by mixing hard and soft labels. Experiments on the TextZoom dataset show that our method improves on existing methods by 3.5%, and it significantly improves generalization performance by 14.8% across four text recognition datasets. Our method can be applied to all TP-guided STISR networks.
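The abstract does not say how hard and soft labels are mixed. As a minimal sketch of one plausible reading, the target for each character can be a convex combination of the one-hot ground-truth label and a softer distribution, which keeps the supervised prediction from collapsing to full confidence. The mixing weight `alpha`, the source of the soft distribution, and all function names below are illustrative assumptions, not details taken from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mix_labels(hard_index, soft_probs, alpha):
    """Blend a one-hot (hard) label with a soft distribution.

    alpha is a hypothetical mixing weight: 1.0 keeps only the hard label,
    0.0 keeps only the soft label.
    """
    return [
        alpha * (1.0 if i == hard_index else 0.0) + (1.0 - alpha) * p
        for i, p in enumerate(soft_probs)
    ]


def cross_entropy(target, logits):
    """Cross-entropy between a target distribution and softmax(logits)."""
    probs = softmax(logits)
    return -sum(t * math.log(p) for t, p in zip(target, probs) if t > 0.0)


# Toy example: three character classes, ground truth is class 0.
# The soft distribution is a stand-in; where it comes from is an assumption.
soft = softmax([2.0, 1.0, 0.1])
target = mix_labels(0, soft, alpha=0.5)
loss = cross_entropy(target, [1.5, 0.5, 0.2])
```

With `alpha` below 1.0 the target always leaves some probability mass on the non-ground-truth classes, which is the property that would counteract overconfidence under this reading.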