Erwan Ntoutoume Nguema, Rolyph, Forouzanfar, Mohamad and Traore, Ali. 2026. "Clinically oriented CNN–transformer architectures for reliable bronchoscopic recognition of lung lesions and anatomical structures". IEEE Access, vol. 14, pp. 35944-35957.
PDF: Forouzanfar-M-2026-33547.pdf (published version, 1MB). License: Creative Commons CC BY.
Abstract
Bronchoscopy is central to diagnosing central lung cancers but remains limited by reliance on operator expertise and variability in visual interpretation. In this work, we adapt and evaluate CNN–Transformer hybrid models for the classification and segmentation of bronchoscopic images, with a particular emphasis on clinically realistic patient-level evaluation. These models combine convolutional blocks, which capture fine-grained local features, with Transformer components that encode long-range dependencies and global context, yielding feature representations well suited to the complexity of bronchoscopic images. The primary objective of this study is to adapt CNN–Transformer hybrid architectures for bronchoscopic lesion and landmark recognition, while evaluating their performance under clinically relevant data partitioning conditions.

We evaluate our methods on BM-BronchoLC, a publicly available dataset of 2,921 annotated bronchoscopic images, and present two complementary frameworks: MedViT, a convolution-enhanced vision transformer for multi-label classification, and FCB-SwinV2, a dual-branch design coupling a convolutional encoder with a SwinV2 Transformer U-Net decoder for semantic segmentation. To directly address the study objective, we compare the performance of both models under random image-level splitting and rigorous patient-level partitioning, which prevents leakage of visual patterns between training and testing sets and provides a more clinically realistic evaluation.

MedViT achieves 94.7% accuracy (AUC 0.95) for anatomical landmarks under random splitting and maintains comparable performance at 93% (AUC 0.91) under patient-level separation. For lung lesions, results remain competitive at 82.3% (AUC 0.79) and 80% (AUC 0.69), respectively. FCB-SwinV2 yields Dice scores of 0.42 for landmarks and 0.33 for lesions with random splitting, which decline to 0.38 and 0.32 under patient-level evaluation.
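The patient-level partitioning described above can be illustrated with a minimal sketch: each patient's images are assigned wholesale to either the training or the test side, so no patient contributes to both. This is a generic illustration in plain Python, not the authors' released code; the function name and parameters are ours.

```python
import random

def patient_level_split(patient_ids, test_frac=0.2, seed=0):
    """Assign whole patients (not individual images) to train or test,
    so no patient's images appear on both sides of the split.

    patient_ids: per-image list of patient identifiers.
    Returns (train_idx, test_idx) lists of image indices.
    """
    patients = sorted(set(patient_ids))
    rng = random.Random(seed)
    rng.shuffle(patients)
    # Hold out roughly test_frac of the *patients*, not of the images.
    n_test = max(1, round(test_frac * len(patients)))
    test_patients = set(patients[:n_test])
    train_idx = [i for i, p in enumerate(patient_ids) if p not in test_patients]
    test_idx = [i for i, p in enumerate(patient_ids) if p in test_patients]
    return train_idx, test_idx
```

In practice the same effect is obtained with scikit-learn's `GroupShuffleSplit`, using patient IDs as the `groups` argument; a plain random image-level split (e.g. `train_test_split`) is what allows the leakage the paper measures.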
These results show that while the models maintain solid overall performance, they exhibit a consistent drop under patient-level validation, underscoring the risk of overestimation when relying solely on random splitting. This controlled comparison between the two evaluation protocols demonstrates that, despite the expected decrease in performance once data leakage is removed, the proposed architectures remain competitive and generalize to unseen patients. These findings indicate that our adapted CNN–Transformer architectures provide useful baselines for BM-BronchoLC and show encouraging generalization to unseen patients. They also reinforce that rigorous patient-level evaluation is central to the reliability of AI systems and should be systematically adopted to avoid inflated performance estimates. All code and models are released to support reproducibility and foster future research.
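For reference, the Dice scores reported for the segmentation results measure overlap between a predicted and a ground-truth mask, 2|A∩B| / (|A| + |B|). A minimal sketch over flat binary masks follows; the function name and the smoothing term `eps` are illustrative, not taken from the released code.

```python
def dice_score(pred, target, eps=1e-7):
    """Dice coefficient between two binary masks given as flat
    sequences of 0/1 values: 2*|intersection| / (|pred| + |target|).
    eps avoids division by zero when both masks are empty."""
    inter = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(pred) + sum(target)
    return (2.0 * inter + eps) / (total + eps)
```

A perfect prediction scores 1.0 and a disjoint one scores ~0.0, so the reported drop from 0.42 to 0.38 (landmarks) under patient-level evaluation is a direct loss of mask overlap on unseen patients.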
| Document type: | Peer-reviewed journal article |
|---|---|
| Researcher: | Forouzanfar, Mohamad |
| Affiliation: | Systems Engineering |
| Deposited on: | 01 Apr 2026 20:20 |
| Last modified: | 22 Apr 2026 19:38 |
| URI: | https://espace2.etsmtl.ca/id/eprint/33547 |