GNN-ViTCap: GNN-Enhanced Multiple Instance Learning with Vision Transformers for Whole Slide Image Classification and Captioning
/ Authors
/ Abstract
Microscopic assessment of histopathology images is vital for accurate cancer diagnosis and treatment. Whole Slide Image (WSI) classification and captioning have become crucial tasks in computer-aided pathology. However, microscopic WSIs face challenges such as redundant patches and unknown patch positions, since the fields of view are captured subjectively by pathologists. Moreover, generating automatic pathology captions remains a significant challenge. To address these challenges, a novel framework, GNN-ViTCap, is introduced for classification and caption generation from histopathological microscopic images. A visual feature extractor first produces patch embeddings. Redundant patches are then removed by dynamically clustering images with deep embedded clustering and selecting representative images through a scalar dot attention mechanism. A graph is constructed by deriving edges from the similarity matrix, connecting each node to its nearest neighbors, and a graph neural network is then utilized to capture contextual information from both local and global regions. The aggregated image embeddings are projected into the language model's input space using a linear layer and combined with input caption tokens to fine-tune large language models for caption generation. Our proposed method is validated on the BreakHis and PatchGastric microscopic datasets. GNN-ViTCap achieves an F1-score of 0.934 and an AUC of 0.963 for classification, along with BLEU@4 = 0.811 and METEOR = 0.569 for captioning. Experimental analysis demonstrates that the GNN-ViTCap architecture outperforms state-of-the-art (SOTA) approaches, providing a reliable and efficient solution for patient diagnosis using microscopy images.
Venue: 2025 International Joint Conference on Neural Networks (IJCNN)
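To make the graph-construction step described in the abstract concrete, the following is a minimal sketch (not the authors' released code) of how patch embeddings can be turned into a k-nearest-neighbor graph from a similarity matrix and then aggregated with one GNN-style message-passing step. The function names, the choice of k, the cosine-similarity measure, and the simple mean aggregator are all illustrative assumptions rather than details taken from the paper.

```python
# Minimal sketch, assuming cosine similarity, k=8 neighbors, and a mean aggregator;
# the actual GNN-ViTCap layers and hyperparameters are defined in the paper.
import torch
import torch.nn.functional as F


def build_knn_graph(embeddings: torch.Tensor, k: int = 8) -> torch.Tensor:
    """Connect each patch embedding to its k most similar patches (undirected adjacency)."""
    normed = F.normalize(embeddings, dim=-1)
    sim = normed @ normed.T                        # pairwise cosine-similarity matrix
    sim.fill_diagonal_(float("-inf"))              # exclude self-loops from neighbor selection
    knn_idx = sim.topk(k, dim=-1).indices          # k nearest neighbors per node
    adj = torch.zeros_like(sim)
    adj.scatter_(1, knn_idx, 1.0)
    return ((adj + adj.T) > 0).float()             # symmetrize so edges are undirected


def aggregate(embeddings: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
    """One mean-aggregation message-passing step, standing in for a GNN layer."""
    deg = adj.sum(dim=-1, keepdim=True).clamp(min=1.0)
    neighbour_mean = adj @ embeddings / deg        # average over graph neighbors
    return embeddings + neighbour_mean             # combine local feature with neighborhood context


# Example: 100 representative patch embeddings of dimension 768 (e.g., from a ViT extractor)
patches = torch.randn(100, 768)
adj = build_knn_graph(patches, k=8)
slide_embedding = aggregate(patches, adj).mean(dim=0)  # pooled slide-level representation
```

The pooled slide-level representation would then be projected into the language model's input space (e.g., by a linear layer) before being combined with caption tokens, as the abstract describes.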