CAT-VAE: A Cross-Attention Transformer-Enhanced Variational Autoencoder for Improved Image Synthesis

  • Khadija Rais Laboratory of Mathematics, Informatics and Systems (LAMIS), Echahid Cheikh Larbi Tebessi University, Tebessa, 12002, Algeria
  • Mohamed Amroune Laboratory of Mathematics, Informatics and Systems (LAMIS), Echahid Cheikh Larbi Tebessi University, Tebessa, 12002, Algeria
  • Mohamed Yassine Haouam Laboratory of Mathematics, Informatics and Systems (LAMIS), Echahid Cheikh Larbi Tebessi University, Tebessa, 12002, Algeria
  • Abdelmadjid Benmachiche Department of Computer Science, LIMA Laboratory, Chadli Bendjedid University, El-Tarf, PB 73, 36000, Algeria
Keywords: Variational Autoencoder (VAE), Cross-Attention Transformers (CAT), Synthetic images, Imbalanced classification, Data augmentation.

Abstract

Deep generative models are increasingly used in medical image analysis to address problems such as class imbalance in classification tasks, and this need has motivated a range of generative methods, among which the Variational Autoencoder (VAE) is one of the most popular image generators. However, the reliance on convolutional layers in VAEs limits their ability to model global context and long-range dependencies. This paper presents CAT-VAE, a hybrid approach that combines a VAE with Cross-Attention Transformers (CAT), in which a cross-attention mechanism is employed to capture long-range dependencies and improve the quality of the generated images. On the breast ultrasound cancer dataset, CAT-VAE achieved better image quality than the standard VAE (FID 8.7659 for the Malignant class and 7.8761 for the Normal class). In a further experiment, a CNN classifier was trained without data augmentation, with VAE-based augmentation, and with synthetic data generated by CAT-VAE; the CNN reached its highest accuracy (97.00%) when trained with CAT-VAE synthetic images. A classification accuracy of 86.67% was obtained on mixed datasets of real and synthetic images, indicating that CAT-VAE improves generalization and robustness. These results highlight CAT-VAE's ability to produce diverse and realistic synthetic datasets.
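
The abstract does not detail where the cross-attention block sits in CAT-VAE or how its tokens are formed. The sketch below is therefore only a minimal, hypothetical illustration (in PyTorch) of how a cross-attention layer could be placed between a convolutional VAE encoder and the latent projection; the `CrossAttentionBlock`, `TinyCATVAEEncoder`, learned context tokens, and all dimensions are assumptions for illustration, not the paper's architecture.

```python
# Minimal sketch, assuming a conv backbone followed by cross-attention before
# the latent statistics; CAT-VAE's actual design may differ.
import torch
import torch.nn as nn

class CrossAttentionBlock(nn.Module):
    def __init__(self, dim, num_heads=4, num_context_tokens=16):
        super().__init__()
        # Hypothetical design: learned context tokens serve as keys/values,
        # while flattened convolutional features serve as queries.
        self.context = nn.Parameter(torch.randn(1, num_context_tokens, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, feat):                      # feat: (B, C, H, W)
        b, c, h, w = feat.shape
        tokens = feat.flatten(2).transpose(1, 2)  # (B, H*W, C) query tokens
        ctx = self.context.expand(b, -1, -1)      # (B, T, C) key/value tokens
        attended, _ = self.attn(self.norm(tokens), ctx, ctx)
        tokens = tokens + attended                # residual connection
        return tokens.transpose(1, 2).reshape(b, c, h, w)

class TinyCATVAEEncoder(nn.Module):
    """Hypothetical encoder: conv backbone -> cross-attention -> mu/logvar."""
    def __init__(self, latent_dim=128):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(1, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
        )
        self.cross_attn = CrossAttentionBlock(dim=64)
        self.to_stats = nn.Linear(64, 2 * latent_dim)

    def forward(self, x):
        feat = self.cross_attn(self.backbone(x))   # (B, 64, H/4, W/4)
        pooled = feat.mean(dim=(2, 3))             # global average pooling
        mu, logvar = self.to_stats(pooled).chunk(2, dim=-1)
        return mu, logvar

if __name__ == "__main__":
    enc = TinyCATVAEEncoder()
    x = torch.randn(2, 1, 64, 64)                  # grayscale, ultrasound-like input
    mu, logvar = enc(x)
    print(mu.shape, logvar.shape)                  # torch.Size([2, 128]) each
```

The intent of such a block is that each spatial position of the convolutional feature map can attend to a shared, image-wide set of context tokens, giving the encoder access to global information that pure convolutions propagate only slowly through their limited receptive fields.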
Published
2025-07-13
How to Cite
Rais, K., Amroune, M., Haouam, M. Y., & Benmachiche, A. (2025). CAT-VAE: A Cross-Attention Transformer-Enhanced Variational Autoencoder for Improved Image Synthesis. Statistics, Optimization & Information Computing. https://doi.org/10.19139/soic-2310-5070-2546
Section
Research Articles