Kossi, Koffi, Coulombe, Stéphane and Desrosiers, Christian. 2024. "No-reference video quality assessment using transformers and attention recurrent networks". IEEE Access, vol. 12, pp. 140671-140680.
PDF: Coulombe-S-2024-29741.pdf - Published version. License: Creative Commons CC BY.
Abstract
In recent years, numerous studies have investigated the development of methods for video quality assessment (VQA). These studies have predominantly focused on specific types of video degradation tailored to the application of interest. However, natural videos and recent user-generated content (UGC) videos present complex distortions that are difficult to model. Consequently, most current VQA approaches struggle to achieve high performance when applied to these videos. In this paper, we propose a novel Transformer-based architecture that extracts spatial distortion features and spatio-temporal features from videos in two specialized branches. The spatial distortion branch leverages a transfer learning strategy in which a standard ViT is pre-trained with a masked autoencoder (MAE) self-supervised learning task and then fine-tuned to predict the distortion type of corrupted images from the CSIQ database. The features from this branch capture degradation at the level of individual frames. The second branch employs a 3D Shifted Windows Transformer (Swin-T) to extract spatio-temporal features across multiple frames. Again, transfer learning is used to obtain rich features by pre-training this 3D Swin-T model on a video dataset for human action recognition. Finally, a temporal memory block based on an attention recurrent neural network is proposed to predict the final video quality score from the spatio-temporal sequence of features. We evaluate the performance of our method on two popular UGC databases, KoNViD-1k and LIVE-VQC. Results show that it outperforms state-of-the-art models on the KoNViD-1k database, achieving an SROCC of 0.927 and a PLCC of 0.925, while also delivering highly competitive results on the LIVE-VQC database.
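To make the two-branch design concrete, the following is a minimal PyTorch sketch of the overall structure described in the abstract: per-frame spatial-distortion features and spatio-temporal features are fused, passed through a recurrent temporal memory with attention pooling over time, and regressed to a single quality score. The backbones are replaced by simple linear projections of precomputed features, and all layer names, dimensions, and the GRU-plus-attention pooling are illustrative assumptions, not the authors' exact implementation.

```python
# Illustrative sketch of a two-branch no-reference VQA model with an
# attention-recurrent temporal block. Backbones (MAE-pretrained ViT and
# 3D Swin-T in the paper) are stood in for by linear projections of
# precomputed per-frame features; sizes are hypothetical.
import torch
import torch.nn as nn

class TwoBranchVQA(nn.Module):
    def __init__(self, spatial_dim=768, st_dim=1024, hidden=256):
        super().__init__()
        self.spatial_proj = nn.Linear(spatial_dim, hidden)  # spatial-distortion branch features
        self.st_proj = nn.Linear(st_dim, hidden)            # spatio-temporal branch features
        # Temporal memory: GRU over the fused frame sequence,
        # then soft attention pooling over time.
        self.gru = nn.GRU(2 * hidden, hidden, batch_first=True)
        self.attn = nn.Linear(hidden, 1)
        self.head = nn.Linear(hidden, 1)  # regress one quality score per video

    def forward(self, spatial_feats, st_feats):
        # spatial_feats: (B, T, spatial_dim), st_feats: (B, T, st_dim)
        fused = torch.cat([self.spatial_proj(spatial_feats),
                           self.st_proj(st_feats)], dim=-1)  # (B, T, 2*hidden)
        h, _ = self.gru(fused)                               # (B, T, hidden)
        w = torch.softmax(self.attn(h), dim=1)               # (B, T, 1) attention over time
        pooled = (w * h).sum(dim=1)                          # (B, hidden)
        return self.head(pooled).squeeze(-1)                 # (B,) predicted quality scores

# Example with random features standing in for backbone outputs (batch of 2, 8 frames).
model = TwoBranchVQA()
scores = model(torch.randn(2, 8, 768), torch.randn(2, 8, 1024))
print(scores.shape)  # torch.Size([2])
```

In this sketch the attention weights let the temporal block emphasize frames whose features signal stronger degradation before the final regression; how the actual model fuses branches and pools over time may differ from this assumption.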
| Document type | Peer-reviewed journal article |
|---|---|
| Professors | Coulombe, Stéphane; Desrosiers, Christian |
| Affiliation | Software Engineering and Information Technology |
| Deposited on | 25 Oct. 2024 15:33 |
| Last modified | 28 Oct. 2024 16:49 |
| URI | https://espace2.etsmtl.ca/id/eprint/29741 |