Yanwu Xu, Shaoan Xie, Maxwell Reynolds, Matthew Ragoza, Mingming Gong, Kayhan Batmanghelich
An organ segmentation method that can generalize to unseen contrasts and scanner settings can significantly reduce the need for retraining of deep learning models. Domain Generalization (DG) aims to achieve this goal. However, most DG methods for segmentation require training data from multiple domains during training. We propose a novel adversarial domain generalization method for organ segmentation trained on data from a \emph{single} domain. We synthesize the new domains via learning an adversarial domain synthesizer (ADS) and presume that the synthetic domains cover a large enough area of plausible distributions so that unseen domains can be interpolated from synthetic domains. We propose a mutual information regularizer to enforce the semantic consistency between images from the synthetic domains, which can be estimated by patch-level contrastive learning. We evaluate our method for various organ segmentation for unseen modalities, scanning protocols, and scanner sites.
Yanwu Xu, Mingming Gong, Shaoan Xie, Wei Wei, Matthias Grundmann, Kayhan Batmanghelich, Tingbo Hou
Despite the proliferation of generative models, achieving fast sampling during inference without compromising sample diversity and quality remains challenging. Existing models such as Denoising Diffusion Probabilistic Models (DDPM) deliver high-quality, diverse samples but are slowed by an inherently high number of iterative steps. The Denoising Diffusion Generative Adversarial Networks (DDGAN) attempted to circumvent this limitation by integrating a GAN model for larger jumps in the diffusion process. However, DDGAN encountered scalability limitations when applied to large datasets. To address these limitations, we introduce a novel approach that tackles the problem by matching implicit and explicit factors. More specifically, our approach involves utilizing an implicit model to match the marginal distributions of noisy data and the explicit conditional distribution of the forward diffusion. This combination allows us to effectively match the joint denoising distributions. Unlike DDPM but similar to DDGAN, we do not enforce a parametric distribution for the reverse step, enabling us to take large steps during inference. Similar to the DDPM but unlike DDGAN, we take advantage of the exact form of the diffusion process. We demonstrate that our proposed method obtains comparable generative performance to diffusion-based models and vastly superior results to models with a small number of sampling steps.
Yanwu Xu, Yang Zhao, Zhisheng Xiao, Tingbo Hou
Text-to-image diffusion models have demonstrated remarkable capabilities in transforming textual prompts into coherent images, yet the computational cost of their inference remains a persistent challenge. To address this issue, we present UFOGen, a novel generative model designed for ultra-fast, one-step text-to-image synthesis. In contrast to conventional approaches that focus on improving samplers or employing distillation techniques for diffusion models, UFOGen adopts a hybrid methodology, integrating diffusion models with a GAN objective. Leveraging a newly introduced diffusion-GAN objective and initialization with pre-trained diffusion models, UFOGen excels in efficiently generating high-quality images conditioned on textual descriptions in a single step. Beyond traditional text-to-image generation, UFOGen showcases versatility in applications. Notably, UFOGen stands among the pioneering models enabling one-step text-to-image generation and diverse downstream tasks, presenting a significant advancement in the landscape of efficient generative models.
Yanwu Xu, Li Sun, Wei Peng, Shuyue Jia, Katelyn Morrison, Adam Perer, Afrooz Zandifar, Shyam Visweswaran, Motahhare Eslami, Kayhan Batmanghelich
This paper introduces an innovative methodology for producing high-quality 3D lung CT images guided by textual information. While diffusion-based generative models are increasingly used in medical imaging, current state-of-the-art approaches are limited to low-resolution outputs and underutilize radiology reports' abundant information. The radiology reports can enhance the generation process by providing additional guidance and offering fine-grained control over the synthesis of images. Nevertheless, expanding text-guided generation to high-resolution 3D images poses significant memory and anatomical detail-preserving challenges. Addressing the memory issue, we introduce a hierarchical scheme that uses a modified UNet architecture. We start by synthesizing low-resolution images conditioned on the text, serving as a foundation for subsequent generators for complete volumetric data. To ensure the anatomical plausibility of the generated samples, we provide further guidance by generating vascular, airway, and lobular segmentation masks in conjunction with the CT images. The model demonstrates the capability to use textual input and segmentation tasks to generate synthesized images. The results of comparative assessments indicate that our approach exhibits superior performance compared to the most advanced models based on GAN and diffusion techniques, especially in accurately retaining crucial anatomical features such as fissure lines, airways, and vascular structures. This innovation introduces novel possibilities. This study focuses on two main objectives: (1) the development of a method for creating images based on textual prompts and anatomical components, and (2) the capability to generate new images conditioning on anatomical elements. The advancements in image generation can be applied to enhance numerous downstream tasks.
Yanwu Xu, Mingming Gong, Shaoan Xie, Kayhan Batmanghelich
Deep learning has achieved remarkable success in medicalimage segmentation, but it usually requires a large numberof images labeled with fine-grained segmentation masks, andthe annotation of these masks can be very expensive andtime-consuming. Therefore, recent methods try to use un-supervised domain adaptation (UDA) methods to borrow in-formation from labeled data from other datasets (source do-mains) to a new dataset (target domain). However, due tothe absence of labels in the target domain, the performance ofUDA methods is much worse than that of the fully supervisedmethod. In this paper, we propose a weakly supervised do-main adaptation setting, in which we can partially label newdatasets with bounding boxes, which are easier and cheaperto obtain than segmentation masks. Accordingly, we proposea new weakly-supervised domain adaptation method calledBox-Adapt, which fully explores the fine-grained segmenta-tion mask in the source domain and the weak bounding boxin the target domain. Our Box-Adapt is a two-stage methodthat first performs joint training on the source and target do-mains, and then conducts self-training with the pseudo-labelsof the target domain. We demonstrate the effectiveness of ourmethod in the liver segmentation task. Weakly supervised do-main adaptation
Yanwu Xu, Shaoan Xie, Wenhao Wu, Kun Zhang, Mingming Gong, Kayhan Batmanghelich
Unpaired image-to-image translation (I2I) is an ill-posed problem, as an infinite number of translation functions can map the source domain distribution to the target distribution. Therefore, much effort has been put into designing suitable constraints, e.g., cycle consistency (CycleGAN), geometry consistency (GCGAN), and contrastive learning-based constraints (CUTGAN), that help better pose the problem. However, these well-known constraints have limitations: (1) they are either too restrictive or too weak for specific I2I tasks; (2) these methods result in content distortion when there is a significant spatial variation between the source and target domains. This paper proposes a universal regularization technique called maximum spatial perturbation consistency (MSPC), which enforces a spatial perturbation function (T ) and the translation operator (G) to be commutative (i.e., TG = GT ). In addition, we introduce two adversarial training components for learning the spatial perturbation function. The first one lets T compete with G to achieve maximum perturbation. The second one lets G and T compete with discriminators to align the spatial variations caused by the change of object size, object distortion, background interruptions, etc. Our method outperforms the state-of-the-art methods on most I2I benchmarks. We also introduce a new benchmark, namely the front face to profile face dataset, to emphasize the underlying challenges of I2I for real-world applications. We finally perform ablation experiments to study the sensitivity of our method to the severity of spatial perturbation and its effectiveness for distribution alignment.
Mingming Gong, Yanwu Xu, Chunyuan Li, Kun Zhang, Kayhan Batmanghelich
Conditional generative models enjoy remarkable progress over the past few years. One of the popular conditional models is Auxiliary Classifier GAN (AC-GAN), which generates highly discriminative images by extending the loss function of GAN with an auxiliary classifier. However, the diversity of the generated samples by AC-GAN tends to decrease as the number of classes increases, hence limiting its power on large-scale data. In this paper, we identify the source of the low diversity issue theoretically and propose a practical solution to solve the problem. We show that the auxiliary classifier in AC-GAN imposes perfect separability, which is disadvantageous when the supports of the class distributions have significant overlap. To address the issue, we propose Twin Auxiliary Classifiers Generative Adversarial Net (TAC-GAN) that further benefits from a new player that interacts with other players (the generator and the discriminator) in GAN. Theoretically, we demonstrate that TAC-GAN can effectively minimize the divergence between the generated and real-data distributions. Extensive experimental results show that our TAC-GAN can successfully replicate the true data distributions on simulated data, and significantly improves the diversity of class-conditional image generation on real datasets.
Yanwu Xu, Mingming Gong, Junxiang Chen, Tongliang Liu, Kun Zhang, Kayhan Batmanghelich
Majority of state-of-the-art deep learning methods are discriminative approaches, which model the conditional distribution of labels given inputs features. The success of such approaches heavily depends on high-quality labeled instances, which are not easy to obtain, especially as the number of candidate classes increases. In this paper, we study the complementary learning problem. Unlike ordinary labels, complementary labels are easy to obtain because an annotator only needs to provide a yes/no answer to a randomly chosen candidate class for each instance. We propose a generative-discriminative complementary learning method that estimates the ordinary labels by modeling both the conditional (discriminative) and instance (generative) distributions. Our method, we call Complementary Conditional GAN (CCGAN), improves the accuracy of predicting ordinary labels and can generate high-quality instances in spite of weak supervision. In addition to the extensive empirical studies, we also theoretically show that our model can retrieve the true conditional distribution from the complementarily-labeled data.
Yanwu Xu, Mingming Gong, Tongliang Liu, Kayhan Batmanghelich, Chaohui Wang
In recent years, the learned local descriptors have outperformed handcrafted ones by a large margin, due to the powerful deep convolutional neural network architectures such as L2-Net [1] and triplet based metric learning [2]. However, there are two problems in the current methods, which hinders the overall performance. Firstly, the widely-used margin loss is sensitive to incorrect correspondences, which are prevalent in the existing local descriptor learning datasets. Second, the L2 distance ignores the fact that the feature vectors have been normalized to unit norm. To tackle these two problems and further boost the performance, we propose a robust angular loss which 1) uses cosine similarity instead of L2 distance to compare descriptors and 2) relies on a robust loss function that gives smaller penalty to triplets with negative relative similarity. The resulting descriptor shows robustness on different datasets, reaching the state-of-the-art result on Brown dataset , as well as demonstrating excellent generalization ability on the Hpatches dataset and a Wide Baseline Stereo dataset.
Chendi Rao, Jiezhang Cao, Runhao Zeng, Qi Chen, Huazhu Fu, Yanwu Xu, Mingkui Tan
Recently, deep neural networks (DNNs) have made great progress on automated diagnosis with chest X-rays images. However, DNNs are vulnerable to adversarial examples, which may cause misdiagnoses to patients when applying the DNN based methods in disease detection. Recently, there is few comprehensive studies exploring the influence of attack and defense methods on disease detection, especially for the multi-label classification problem. In this paper, we aim to review various adversarial attack and defense methods on chest X-rays. First, the motivations and the mathematical representations of attack and defense methods are introduced in details. Second, we evaluate the influence of several state-of-the-art attack and defense methods for common thorax disease classification in chest X-rays. We found that the attack and defense methods have poor performance with excessive iterations and large perturbations. To address this, we propose a new defense method that is robust to different degrees of perturbations. This study could provide new insights into methodological development for the community.
Jinkui Hao, Huazhu Fu, Yanwu Xu, Yan Hu, Fei Li, Xiulan Zhang, Jiang Liu, Yitian Zhao
Precise characterization and analysis of iris shape from Anterior Segment OCT (AS-OCT) are of great importance in facilitating diagnosis of angle-closure-related diseases. Existing methods focus solely on analyzing structural properties identified from the 2D slice, while accurate characterization of morphological changes of iris shape in 3D AS-OCT may be able to reveal in addition the risk of disease progression. In this paper, we propose a novel framework for reconstruction and quantification of 3D iris surface from AS-OCT imagery. We consider it to be the first work to detect angle-closure glaucoma by means of 3D representation. An iris segmentation network with wavelet refinement block (WRB) is first proposed to generate the initial shape of the iris from single AS-OCT slice. The 3D iris surface is then reconstructed using a guided optimization method with Poisson-disk sampling. Finally, a set of surface-based features are extracted, which are used in detecting of angle-closure glaucoma. Experimental results demonstrate that our method is highly effective in iris segmentation and surface reconstruction. Moreover, we show that 3D-based representation achieves better performance in angle-closure glaucoma detection than does 2D-based feature.
Xu Sun, Huihui Fang, Yehui Yang, Dongwei Zhu, Lei Wang, Junwei Liu, Yanwu Xu
Retinal vessel segmentation is a fundamental step in screening, diagnosis, and treatment of various cardiovascular and ophthalmic diseases. Robustness is one of the most critical requirements for practical utilization, since the test images may be captured using different fundus cameras, or be affected by various pathological changes. We investigate this problem from a data augmentation perspective, with the merits of no additional training data or inference time. In this paper, we propose two new data augmentation modules, namely, channel-wise random Gamma correction and channel-wise random vessel augmentation. Given a training color fundus image, the former applies random gamma correction on each color channel of the entire image, while the latter intentionally enhances or decreases only the fine-grained blood vessel regions using morphological transformations. With the additional training samples generated by applying these two modules sequentially, a model could learn more invariant and discriminating features against both global and local disturbances. Experimental results on both real-world and synthetic datasets demonstrate that our method can improve the performance and robustness of a classic convolutional neural network architecture. The source code is available at \url{https://github.com/PaddlePaddle/Research/tree/master/CV/robust_vessel_segmentation}.
Junde Wu, Huihui Fang, Hoayi Xiong, Lixin Duan, Mingkui Tan, Weihua Yang, Huiying Liu, Yanwu Xu
On the medical images, many of the tissues/lesions may be ambiguous. That is why the medical segmentation is typically annotated by a group of clinical experts to mitigate the personal bias. However, this clinical routine also brings new challenges to the application of machine learning algorithms. Without a definite ground-truth, it will be difficult to train and evaluate the deep learning models. When the annotations are collected from different graders, a common choice is majority vote. However such a strategy ignores the difference between the grader expertness. In this paper, we consider the task of predicting the segmentation with the calibrated inter-observer uncertainty. We note that in clinical practice, the medical image segmentation is usually used to assist the disease diagnosis. Inspired by this observation, we propose diagnosis-first principle, which is to take disease diagnosis as the criterion to calibrate the inter-observer segmentation uncertainty. Following this idea, a framework named Diagnosis First segmentation Framework (DiFF) is proposed to estimate diagnosis-first segmentation from the raw images.Specifically, DiFF will first learn to fuse the multi-rater segmentation labels to a single ground-truth which could maximize the disease diagnosis performance. We dubbed the fused ground-truth as Diagnosis First Ground-truth (DF-GT).Then, we further propose Take and Give Modelto segment DF-GT from the raw image. We verify the effectiveness of DiFF on three different medical segmentation tasks: OD/OC segmentation on fundus images, thyroid nodule segmentation on ultrasound images, and skin lesion segmentation on dermoscopic images. Experimental results show that the proposed DiFF is able to significantly facilitate the corresponding disease diagnosis, which outperforms previous state-of-the-art multi-rater learning methods.
Shaoan Xie, Yang Zhao, Zhisheng Xiao, Kelvin C. K. Chan, Yandong Li, Yanwu Xu, Kun Zhang, Tingbo Hou
This study introduces Text-Guided Subject-Driven Image Inpainting, a novel task that combines text and exemplar images for image inpainting. While both text and exemplar images have been used independently in previous efforts, their combined utilization remains unexplored. Simultaneously accommodating both conditions poses a significant challenge due to the inherent balance required between editability and subject fidelity. To tackle this challenge, we propose a two-step approach DreamInpainter. First, we compute dense subject features to ensure accurate subject replication. Then, we employ a discriminative token selection module to eliminate redundant subject details, preserving the subject's identity while allowing changes according to other conditions such as mask shape and text prompts. Additionally, we introduce a decoupling regularization technique to enhance text control in the presence of exemplar images. Our extensive experiments demonstrate the superior performance of our method in terms of visual quality, identity preservation, and text control, showcasing its effectiveness in the context of text-guided subject-driven image inpainting.
Shouyue Liu, Jinkui Hao, Yanwu Xu, Huazhu Fu, Xinyu Guo, Jiang Liu, Yalin Zheng, Yonghuai Liu, Jiong Zhang, Yitian Zhao
Optical Coherence Tomography Angiography (OCTA) is a promising tool for detecting Alzheimer's disease (AD) by imaging the retinal microvasculature. Ophthalmologists commonly use region-based analysis, such as the ETDRS grid, to study OCTA image biomarkers and understand the correlation with AD. However, existing studies have used general deep computer vision methods, which present challenges in providing interpretable results and leveraging clinical prior knowledge. To address these challenges, we propose a novel deep-learning framework called Polar-Net. Our approach involves mapping OCTA images from Cartesian coordinates to polar coordinates, which allows for the use of approximate sector convolution and enables the implementation of the ETDRS grid-based regional analysis method commonly used in clinical practice. Furthermore, Polar-Net incorporates clinical prior information of each sector region into the training process, which further enhances its performance. Additionally, our framework adapts to acquire the importance of the corresponding retinal region, which helps researchers and clinicians understand the model's decision-making process in detecting AD and assess its conformity to clinical observations. Through evaluations on private and public datasets, we have demonstrated that Polar-Net outperforms existing state-of-the-art methods and provides more valuable pathological evidence for the association between retinal vascular changes and AD. In addition, we also show that the two innovative modules introduced in our framework have a significant impact on improving overall performance.
Jiansheng Fang, Yanwu Xu, Yitian Zhao, Yuguang Yan, Junling Liu, Jiang Liu
Chest X-rays are the most commonly available and affordable radiological examination for screening thoracic diseases. According to the domain knowledge of screening chest X-rays, the pathological information usually lay on the lung and heart regions. However, it is costly to acquire region-level annotation in practice, and model training mainly relies on image-level class labels in a weakly supervised manner, which is highly challenging for computer-aided chest X-ray screening. To address this issue, some methods have been proposed recently to identify local regions containing pathological information, which is vital for thoracic disease classification. Inspired by this, we propose a novel deep learning framework to explore discriminative information from lung and heart regions. We design a feature extractor equipped with a multi-scale attention module to learn global attention maps from global images. To exploit disease-specific cues effectively, we locate lung and heart regions containing pathological information by a well-trained pixel-wise segmentation model to generate binarization masks. By introducing element-wise logical AND operator on the learned global attention maps and the binarization masks, we obtain local attention maps in which pixels are $1$ for lung and heart region and $0$ for other regions. By zeroing features of non-lung and heart regions in attention maps, we can effectively exploit their disease-specific cues in lung and heart regions. Compared to existing methods fusing global and local features, we adopt feature weighting to avoid weakening visual cues unique to lung and heart regions. Evaluated by the benchmark split on the publicly available chest X-ray14 dataset, the comprehensive experiments show that our method achieves superior performance compared to the state-of-the-art methods.
Shuhan Li, Dong Zhang, Xiaomeng Li, Chubin Ou, Lin An, Yanwu Xu, Kwang-Ting Cheng
Optical Coherence Tomography Angiography (OCTA) is a crucial tool in the clinical screening of retinal diseases, allowing for accurate 3D imaging of blood vessels through non-invasive scanning. However, the hardware-based approach for acquiring OCTA images presents challenges due to the need for specialized sensors and expensive devices. In this paper, we introduce a novel method called TransPro, which can translate the readily available 3D Optical Coherence Tomography (OCT) images into 3D OCTA images without requiring any additional hardware modifications. Our TransPro method is primarily driven by two novel ideas that have been overlooked by prior work. The first idea is derived from a critical observation that the OCTA projection map is generated by averaging pixel values from its corresponding B-scans along the Z-axis. Hence, we introduce a hybrid architecture incorporating a 3D adversarial generative network and a novel Heuristic Contextual Guidance (HCG) module, which effectively maintains the consistency of the generated OCTA images between 3D volumes and projection maps. The second idea is to improve the vessel quality in the translated OCTA projection maps. As a result, we propose a novel Vessel Promoted Guidance (VPG) module to enhance the attention of network on retinal vessels. Experimental results on two datasets demonstrate that our TransPro outperforms state-of-the-art approaches, with relative improvements around 11.4% in MAE, 2.7% in PSNR, 2% in SSIM, 40% in VDE, and 9.1% in VDC compared to the baseline method. The code is available at: https://github.com/ustlsh/TransPro.
Huazhu Fu, Jun Cheng, Yanwu Xu, Changqing Zhang, Damon Wing Kee Wong, Jiang Liu, Xiaochun Cao
Glaucoma is a chronic eye disease that leads to irreversible vision loss. Most of the existing automatic screening methods firstly segment the main structure, and subsequently calculate the clinical measurement for detection and screening of glaucoma. However, these measurement-based methods rely heavily on the segmentation accuracy, and ignore various visual features. In this paper, we introduce a deep learning technique to gain additional image-relevant information, and screen glaucoma from the fundus image directly. Specifically, a novel Disc-aware Ensemble Network (DENet) for automatic glaucoma screening is proposed, which integrates the deep hierarchical context of the global fundus image and the local optic disc region. Four deep streams on different levels and modules are respectively considered as global image stream, segmentation-guided network, local disc region stream, and disc polar transformation stream. Finally, the output probabilities of different streams are fused as the final screening result. The experiments on two glaucoma datasets (SCES and new SINDI datasets) show our method outperforms other state-of-the-art algorithms.
Huaying Hao, Huazhu Fu, Yanwu Xu, Jianlong Yang, Fei Li, Xiulan Zhang, Jiang Liu, Yitian Zhao
Anterior chamber angle (ACA) classification is a key step in the diagnosis of angle-closure glaucoma in Anterior Segment Optical Coherence Tomography (AS-OCT). Existing automated analysis methods focus on a binary classification system (i.e., open angle or angle-closure) in a 2D AS-OCT slice. However, clinical diagnosis requires a more discriminating ACA three-class system (i.e., open, narrow, or synechiae angles) for the benefit of clinicians who seek better to understand the progression of the spectrum of angle-closure glaucoma types. To address this, we propose a novel sequence multi-scale aggregation deep network (SMA-Net) for open-narrow-synechiae ACA classification based on an AS-OCT sequence. In our method, a Multi-Scale Discriminative Aggregation (MSDA) block is utilized to learn the multi-scale representations at slice level, while a ConvLSTM is introduced to study the temporal dynamics of these representations at sequence level. Finally, a multi-level loss function is used to combine the slice-based and sequence-based losses. The proposed method is evaluated across two AS-OCT datasets. The experimental results show that the proposed method outperforms existing state-of-the-art methods in applicability, effectiveness, and accuracy. We believe this work to be the first attempt to classify ACAs into open, narrow, or synechia types grading using AS-OCT sequences.
Dalu Yang, Yehui Yang, Tiantian Huang, Binghong Wu, Lei Wang, Yanwu Xu
There are extensive researches focusing on automated diabetic reti-nopathy (DR) detection from fundus images. However, the accuracy drop is ob-served when applying these models in real-world DR screening, where the fun-dus camera brands are different from the ones used to capture the training im-ages. How can we train a classification model on labeled fundus images ac-quired from only one camera brand, yet still achieves good performance on im-ages taken by other brands of cameras? In this paper, we quantitatively verify the impact of fundus camera brands related domain shift on the performance of DR classification models, from an experimental perspective. Further, we pro-pose camera-oriented residual-CycleGAN to mitigate the camera brand differ-ence by domain adaptation and achieve increased classification performance on target camera images. Extensive ablation experiments on both the EyePACS da-taset and a private dataset show that the camera brand difference can signifi-cantly impact the classification performance and prove that our proposed meth-od can effectively improve the model performance on the target domain. We have inferred and labeled the camera brand for each image in the EyePACS da-taset and will publicize the camera brand labels for further research on domain adaptation.