CAT-VAE: A Cross-Attention Transformer-Enhanced Variational Autoencoder for Improved Image Synthesis

  • Khadija Rais Laboratory of Mathematics, Informatics and Systems (LAMIS), Echahid Cheikh Larbi Tebessi University, Tebessa, 12002, Algeria
  • Mohamed Amroune Laboratory of Mathematics, Informatics and Systems (LAMIS), Echahid Cheikh Larbi Tebessi University, Tebessa, 12002, Algeria
  • Mohamed Yassine Haouam Laboratory of Mathematics, Informatics and Systems (LAMIS), Echahid Cheikh Larbi Tebessi University, Tebessa, 12002, Algeria
  • Abdelmadjid Benmachiche Department of Computer Science, LIMA Laboratory, Chadli Bendjedid University, El-Tarf, PB 73, 36000, Algeria
Keywords: Variational Autoencoder (VAE), Cross-Attention Transformers (CAT), Synthetic images, Imbalanced classification, Data augmentation.

Abstract

Deep generative models are increasingly used in medical image analysis to address problems such as class imbalance in classification tasks, and this need has motivated a range of generative methods, among which the Variational Autoencoder (VAE) is one of the most popular image generators. However, the reliance on convolutional layers in VAEs limits their ability to model global context and long-range dependencies. This paper presents CAT-VAE, a hybrid approach that combines a VAE with Cross-Attention Transformers (CAT), in which a cross-attention mechanism is employed to capture long-range dependencies and improve the quality of the generated images. On the breast ultrasound cancer dataset, CAT-VAE achieved better image quality than the standard VAE (FID 8.7659 for the Malignant class and 7.8761 for the Normal class). In a further experiment, a CNN classifier was trained without data augmentation, with VAE-based augmentation, and with synthetic data generated by CAT-VAE; the CNN reached its highest accuracy (97.00%) when trained with CAT-VAE synthetic images. A classification accuracy of 86.67% was obtained on mixed datasets of real and synthetic images, indicating that CAT-VAE improves generalization and robustness. These results highlight CAT-VAE's ability to produce diverse and realistic synthetic datasets.
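
The abstract does not detail where the cross-attention block sits in CAT-VAE or how its tokens are formed. The sketch below is therefore only a minimal, hypothetical illustration (in PyTorch) of how a cross-attention layer could be placed between a convolutional VAE encoder and the latent projection; the `CrossAttentionBlock`, `TinyCATVAEEncoder`, learned context tokens, and all dimensions are assumptions for illustration, not the paper's architecture.

```python
# Minimal sketch, assuming a conv backbone followed by cross-attention before
# the latent statistics; CAT-VAE's actual design may differ.
import torch
import torch.nn as nn

class CrossAttentionBlock(nn.Module):
    def __init__(self, dim, num_heads=4, num_context_tokens=16):
        super().__init__()
        # Hypothetical design: learned context tokens serve as keys/values,
        # while flattened convolutional features serve as queries.
        self.context = nn.Parameter(torch.randn(1, num_context_tokens, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, feat):                      # feat: (B, C, H, W)
        b, c, h, w = feat.shape
        tokens = feat.flatten(2).transpose(1, 2)  # (B, H*W, C) query tokens
        ctx = self.context.expand(b, -1, -1)      # (B, T, C) key/value tokens
        attended, _ = self.attn(self.norm(tokens), ctx, ctx)
        tokens = tokens + attended                # residual connection
        return tokens.transpose(1, 2).reshape(b, c, h, w)

class TinyCATVAEEncoder(nn.Module):
    """Hypothetical encoder: conv backbone -> cross-attention -> mu/logvar."""
    def __init__(self, latent_dim=128):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(1, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
        )
        self.cross_attn = CrossAttentionBlock(dim=64)
        self.to_stats = nn.Linear(64, 2 * latent_dim)

    def forward(self, x):
        feat = self.cross_attn(self.backbone(x))   # (B, 64, H/4, W/4)
        pooled = feat.mean(dim=(2, 3))             # global average pooling
        mu, logvar = self.to_stats(pooled).chunk(2, dim=-1)
        return mu, logvar

if __name__ == "__main__":
    enc = TinyCATVAEEncoder()
    x = torch.randn(2, 1, 64, 64)                  # grayscale, ultrasound-like input
    mu, logvar = enc(x)
    print(mu.shape, logvar.shape)                  # torch.Size([2, 128]) each
```

The intent of such a block is that each spatial position of the convolutional feature map can attend to a shared, image-wide set of context tokens, giving the encoder access to global information that pure convolutions propagate only slowly through their limited receptive fields.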
Published
2025-07-13
How to Cite
Rais, K., Amroune, M., Haouam, M. Y., & Benmachiche, A. (2025). CAT-VAE: A Cross-Attention Transformer-Enhanced Variational Autoencoder for Improved Image Synthesis. Statistics, Optimization & Information Computing. https://doi.org/10.19139/soic-2310-5070-2546
Section
Research Articles