Publications | Xiaoyu Lin

2024

PhD Thesis
Deep latent-variable generative models for multimedia processing

Xiaoyu Lin

Université Grenoble Alpes, 2024

Deep probabilistic generative models hold a crucial position within machine learning research. They serve as powerful tools for comprehending complex real-world data, such as images, audio, and text, by modeling their underlying distributions. This capability further enables the generation of new data samples. Moreover, these models can be utilized to discover hidden structures and the intrinsic factors of variation within data. The data representations learned through this process can be leveraged across a spectrum of downstream prediction tasks, thereby enhancing the decision-making process. Another research direction involves leveraging the flexibility and robust generalization ability of deep probabilistic generative models for solving intricate scientific and engineering problems. Though supervised deep learning methods built on carefully designed neural architectures have achieved state-of-the-art performance across various domains, their practical application to real-world situations remains constrained. These limitations arise from the need for large volumes of annotated training data and from limited model interpretability. In this PhD work, we explore an alternative approach using deep probabilistic generative models within an unsupervised or weakly supervised framework to overcome these hurdles. Specifically, the proposed approach first pre-trains a deep probabilistic generative model on natural or synthetic signals to embed prior knowledge about the complex data patterns. Subsequently, this pre-trained model is integrated into an extended latent variable generative model (LVGM) to address the specific practical problem. Our research focuses on a specific type of deep probabilistic generative model designed for sequential data, referred to as dynamical variational auto-encoders (DVAEs).
DVAEs are a family of deep latent variable models that extend the variational auto-encoder (VAE) to sequential data modeling. They use a sequence of latent vectors to capture the intricate temporal dependencies within the observed sequential data. By integrating DVAEs within an LVGM, we address a range of audio and visual tasks, namely multi-object tracking, single-channel audio source separation, and speech enhancement. The solutions are derived using variational inference methods. We also investigate a novel architecture, HiT-DVAE, which incorporates the Transformer architecture within the probabilistic framework of DVAEs. HiT-DVAE and its variant, LigHT-DVAE, both demonstrate excellent performance in speech modeling through robust sequential data handling. The findings from our experiments confirm the potential of deep probabilistic generative models to address real-world problems with limited labeled data, offering scalable and interpretable solutions. Furthermore, the introduction of HiT-DVAE represents a significant advancement in the field, combining the strengths of Transformer architectures with probabilistic modeling for enhanced sequential data analysis. These works not only contribute to the theoretical understanding of deep generative models, but also demonstrate their practical applicability across various domains, laying the groundwork for future innovations in machine learning.
@phdthesis{lin2024thesis,
  title  = {Deep latent-variable generative models for multimedia processing},
  author = {Lin, Xiaoyu},
  school = {Université Grenoble Alpes},
  year   = {2024},
}

2023

TMLR
Mixture of Dynamical Variational Autoencoders for Multi-Source Trajectory Modeling and Separation

Xiaoyu Lin, Laurent Girin, and Xavier Alameda-Pineda

Transactions on Machine Learning Research, 2023

In this paper, we propose a latent-variable generative model called mixture of dynamical variational autoencoders (MixDVAE) to model the dynamics of a system composed of multiple moving sources. A DVAE model is pre-trained on a single-source dataset to capture the source dynamics. Then, multiple instances of the pre-trained DVAE model are integrated into a multi-source mixture model with a discrete observation-to-source assignment latent variable. The posterior distributions of both the discrete observation-to-source assignment variable and the continuous DVAE variables representing the sources' content/position are estimated using the variational expectation-maximization algorithm, leading to multi-source trajectory estimation. We illustrate the versatility of the proposed MixDVAE model on two tasks: a computer vision task, namely multi-object tracking, and an audio processing task, namely single-channel audio source separation. Experimental results show that the proposed method works well on these two tasks and outperforms several baseline methods.
@article{lin2023mixture,
  title   = {Mixture of Dynamical Variational Autoencoders for Multi-Source Trajectory Modeling and Separation},
  author  = {Lin, Xiaoyu and Girin, Laurent and Alameda-Pineda, Xavier},
  journal = {Transactions on Machine Learning Research},
  issn    = {2835-8856},
  year    = {2023},
  url     = {https://openreview.net/forum?id=sbkZKBVC31},
}
ICASSP
Speech Modeling with a Hierarchical Transformer Dynamical VAE

Xiaoyu Lin, Xiaoyu Bie, Simon Leglaive, Laurent Girin, and Xavier Alameda-Pineda

In IEEE International Conference on Acoustics, Speech and Signal Processing, 2023

Dynamical variational autoencoders (DVAEs) are a family of latent-variable deep generative models that extends the VAE to model a sequence of observed data and a corresponding sequence of latent vectors. In almost all DVAEs in the literature, the temporal dependencies within each sequence and across the two sequences are modeled with recurrent neural networks. In this paper, we propose to model speech signals with the Hierarchical Transformer DVAE (HiT-DVAE), which is a DVAE with two levels of latent variables (sequence-wise and frame-wise) and in which the temporal dependencies are implemented with the Transformer architecture. We show that HiT-DVAE outperforms several other DVAEs for speech spectrogram modeling, while enabling a simpler training procedure, revealing its high potential for downstream low-level speech processing tasks such as speech enhancement.
@inproceedings{10096751,
  author    = {Lin, Xiaoyu and Bie, Xiaoyu and Leglaive, Simon and Girin, Laurent and Alameda-Pineda, Xavier},
  booktitle = {IEEE International Conference on Acoustics, Speech and Signal Processing},
  title     = {Speech Modeling with a Hierarchical Transformer Dynamical VAE},
  year      = {2023},
  pages     = {1-5},
  doi       = {10.1109/ICASSP49357.2023.10096751},
  location  = {Rhodes, Greece},
}
INTERSPEECH
Unsupervised speech enhancement with deep dynamical generative speech and noise models

Xiaoyu Lin, Simon Leglaive, Laurent Girin, and Xavier Alameda-Pineda

In Proceedings Interspeech, 2023

This work builds on previous work on unsupervised speech enhancement using a dynamical variational autoencoder (DVAE) as the clean speech model and non-negative matrix factorization (NMF) as the noise model. We propose to replace the NMF noise model with a deep dynamical generative model (DDGM) depending either on the DVAE latent variables, or on the noisy observations, or on both. This DDGM can be trained in three configurations: noise-agnostic, noise-dependent, and noise adaptation after noise-dependent training. Experimental results show that the proposed method achieves competitive performance compared to state-of-the-art unsupervised speech enhancement methods, while the noise-dependent training configuration yields a much more time-efficient inference process.
@inproceedings{interspeech2023,
  author    = {Lin, Xiaoyu and Leglaive, Simon and Girin, Laurent and Alameda-Pineda, Xavier},
  booktitle = {Proceedings Interspeech},
  title     = {Unsupervised speech enhancement with deep dynamical generative speech and noise models},
  year      = {2023},
  location  = {Dublin, Ireland},
}