EEG Representation Learning via Contrastive Embedding Alignment for Imagined Speech Decoding
VOLPATO, PIETRO
2025/2026
Abstract
This work investigates the potential of electroencephalography (EEG)-based brain–computer interfaces as an alternative communication system for individuals with speech impairments. Rather than focusing on isolated words or phonemes, the study adopts the syllabic structure of the Italian language as the fundamental unit of speech representation. A dataset was collected from 21 participants during the pronunciation and syllabification of a set of sentences. EEG segments were then aligned with acoustic representations by training a neural network to learn a shared latent space through a contrastive learning approach, using embeddings extracted from a pre-trained automatic speech recognition (ASR) model. The evaluation was formulated as an information retrieval task, in which a k-nearest neighbors search was performed within the learned EEG embedding space, and a language model was subsequently applied to refine the predictions. Both a multi-subject model and subject-specific models were trained and compared. The results show an average Top-10 accuracy of 37.4% for the multi-subject model and 38.1% for the subject-specific models over a 192-class syllabic vocabulary. Overall, this work demonstrates the feasibility of leveraging syllable-based representations and cross-modal alignment techniques for non-invasive speech decoding. The proposed framework represents a step toward more expressive and scalable assistive communication systems based on EEG signals.

| File | Size | Format | Access |
|---|---|---|---|
| Volpato_Pietro.pdf | 1.84 MB | Adobe PDF | Open access |
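The retrieval-style evaluation described in the abstract can be illustrated with a minimal sketch: EEG embeddings are compared against syllable reference embeddings in a shared space, and a prediction counts as correct if the true syllable appears among the k nearest neighbors. All names, shapes, and the synthetic data below are illustrative assumptions, not the thesis's actual implementation.

```python
# Minimal sketch of k-nearest-neighbors retrieval in a shared embedding space,
# evaluated with Top-k accuracy. Embeddings here are synthetic placeholders.
import numpy as np

rng = np.random.default_rng(0)
n_classes, dim, n_queries = 192, 64, 500   # 192-class syllabic vocabulary (illustrative dim)

# Hypothetical reference embeddings: one L2-normalized vector per syllable class.
refs = rng.normal(size=(n_classes, dim))
refs /= np.linalg.norm(refs, axis=1, keepdims=True)

# Hypothetical EEG query embeddings: noisy copies of their class reference.
labels = rng.integers(0, n_classes, size=n_queries)
queries = refs[labels] + 0.5 * rng.normal(size=(n_queries, dim))
queries /= np.linalg.norm(queries, axis=1, keepdims=True)

def top_k_accuracy(queries, refs, labels, k=10):
    """Fraction of queries whose true class is among the k most similar references."""
    sims = queries @ refs.T                   # cosine similarity (unit-norm vectors)
    top_k = np.argsort(-sims, axis=1)[:, :k]  # indices of the k nearest classes
    return float(np.mean([labels[i] in top_k[i] for i in range(len(labels))]))

acc = top_k_accuracy(queries, refs, labels, k=10)
print(f"Top-10 accuracy on synthetic data: {acc:.3f}")
```

In the thesis this search is run over embeddings learned by contrastive alignment with ASR features, and a language model then rescores the candidate syllables; the sketch covers only the nearest-neighbor lookup and the Top-10 metric.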
The text of this website © Università degli studi di Padova. Full texts are published under a non-exclusive license. Metadata are released under a CC0 license.
https://hdl.handle.net/20.500.12608/106238