PhD Position F/M Acoustic to Articulatory Inversion by using dynamic MRI images

March 16, 2023
Offered Salary: 2051€ gross/month (1st and 2nd year)
Working address: N/A
Contract Type: Other
Working Time: Negotiation
Working type: N/A
Ref info: N/A

Reference: 2023-05790

Level of qualifications required : Graduate degree or equivalent

Function : PhD Position


This project is a LUE (Lorraine Université d'Excellence) project; it will be conducted jointly by the Loria laboratory (MultiSpeech team) and the IADI laboratory (INSERM U1254), which have been working together for several years on speech production and vocal tract imaging.

In particular, this will allow us to use the IADI laboratory's real-time two-dimensional MRI acquisition system (50 images per second). This system, unique in France, enables imaging the vocal tract at 50 Hz in any orientation, which is valuable for recovering the area function.

Yves Laprie ([email protected]) and Pierre-André Vuissoz ([email protected]) will co-supervise this PhD work.


Scientific challenge

Articulatory synthesis mimics the speech production process by first generating the shape of the vocal tract from the sequence of phonemes to be pronounced, then the acoustic signal by solving the aeroacoustic equations [1, 2]. Compared to other approaches to speech synthesis, which offer a very high level of quality, its main interest is to control the whole production process, beyond the acoustic signal alone.

The objective of this PhD is to succeed in the inverse transformation, called acoustic to articulatory inversion, in order to recover the geometric shape of the vocal tract from the acoustic signal. A simple voice recording will allow the dynamics of the different articulators to be followed during the production of the sentence.

Beyond its interest as a scientific challenge, acoustic-to-articulatory inversion has many potential applications. On its own, it can be used as a diagnostic tool to evaluate articulatory gestures in an educational or medical context. Combined with articulatory synthesis tools, it can also provide audiovisual feedback for remediation (e.g. helping a hearing-impaired person produce the correct articulation of a phoneme), for learning (e.g. mastering phonetic contrasts in a foreign language), or for improving singing techniques in a professional context.


State of the art and innovative character

Almost all current inversion work relies on data from EMA (ElectroMagnetic Articulography), which gives the positions of a few sensors glued to the tongue and other easily accessible articulators. As for the inversion techniques themselves, deep learning is widely used because it exploits EMA corpora efficiently; currently, the LSTM (Long Short-Term Memory) approach and its bidirectional variant give the best results [3]. Despite their very good geometric accuracy, current approaches cannot retrieve the complete geometry of the vocal tract, because EMA data only cover the part of the vocal tract closest to the mouth; yet it is known, for example, that the larynx plays a determining role in the acoustics of the vocal tract. In practice, this considerably limits the interest of existing inversion techniques, since their results cannot be used to reconstruct the speech signal.

The objective of this project is to remove this limitation; its originality is to recover the complete geometry of the vocal tract using dynamic MRI data that we can acquire in Nancy at the IADI laboratory. This approach will open a truly operational bridge between articulatory gestures and acoustics in both directions (physical numerical simulation for the direct transformation, and inversion). Another innovative aspect of the proposed inversion is the identification of the role of each articulator, in order to account for a possible perturbation affecting a specific articulator.
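As an illustration of the supervised learning setup (not the project's actual model or data), the sketch below frames inversion as mapping per-frame acoustic features to articulator contour coordinates. A plain linear least-squares baseline stands in for the bidirectional LSTM; all shapes and data are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in data (synthetic, for illustration only): T acoustic frames of
# d MFCC-like features, and for each frame the (x, y) coordinates of k
# contour points of one articulator, flattened to 2*k targets.
T, d, k = 200, 13, 20
acoustic = rng.normal(size=(T, d))             # acoustic features per frame
true_map = rng.normal(size=(d, 2 * k))         # hidden linear relation (toy)
contours = acoustic @ true_map + 0.01 * rng.normal(size=(T, 2 * k))

# Frame-wise linear least-squares baseline; a bidirectional LSTM would
# replace this static map with a recurrent one exploiting temporal context.
W, *_ = np.linalg.lstsq(acoustic, contours, rcond=None)
pred = acoustic @ W

rmse = float(np.sqrt(np.mean((pred - contours) ** 2)))
print(f"frame-wise RMSE of linear baseline: {rmse:.4f}")
```

A recurrent model keeps the same input/output shapes per frame but conditions each prediction on the neighbouring frames, which matters because articulators move smoothly over time.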

Description of work

The first objective is the inversion of the acoustic signal to recover the temporal evolution of the medio-sagittal slice. Indeed, dynamic MRI provides two-dimensional images of very good quality in the medio-sagittal plane at 50 Hz, and the speech signal acquired with an optical microphone can be processed very efficiently with the algorithms developed in the MultiSpeech team (examples available on https: //).

We plan to use corpora already acquired or in the process of being acquired. These corpora represent a very large volume of data (several hundreds of thousands of images), and it is therefore necessary to preprocess them in order to identify the contours of the articulators involved in speech production (mandible, tongue, lips, velum, larynx, epiglottis). Last year we developed an approach for tracking the contours of articulators in MRI images that gives very good results [10]. Each articulator is tracked independently of the others, in order to keep the possibility of analyzing the individual behavior of an articulator, e.g. in case one of them fails. The automatically tracked contours can therefore be used to train the inversion.

Initially, the goal is to perform the inversion using the LSTM approach on data from a small number of speakers for whom sufficient data exists. This approach will have to be adapted to the nature of the data, and must be able to identify the contribution of each articulator.
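Since the images arrive at 50 Hz while acoustic features are typically computed at a higher rate (commonly every 10 ms, i.e. 100 Hz), building training pairs requires temporal alignment. A minimal sketch of this pairing, with illustrative rates and durations:

```python
import numpy as np

# Illustrative rates: MRI films at 50 images/s; acoustic features are
# assumed to be computed every 10 ms (100 Hz). Values are assumptions.
mri_rate, feat_rate = 50.0, 100.0
duration = 2.0                                  # seconds of toy recording
n_images = int(duration * mri_rate)             # 100 MRI frames
n_feats = int(duration * feat_rate)             # 200 acoustic frames

image_times = np.arange(n_images) / mri_rate    # timestamp of each image
feat_index = np.round(image_times * feat_rate).astype(int)
feat_index = np.clip(feat_index, 0, n_feats - 1)

# MRI frame i is paired with acoustic frame feat_index[i] for training.
print(feat_index[:5])  # -> [0 2 4 6 8]
```

In practice one would pair each image with a short window of acoustic frames rather than a single index, but the timestamp-based mapping is the same.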

In itself, a successful inversion recovering the shape of the vocal tract in the medio-sagittal plane will be a remarkable achievement, since current results only cover a very small part of the vocal tract (a few points on its front part). However, it is important to be able to transpose this result to any subject, which raises the question of speaker adaptation, the second objective.

The most recent speaker adaptation techniques are based on the construction of embeddings, widely used in speaker recognition and identification, with the idea of "embedding" an individual in a continuous space in order to adapt the system to a new speaker [6, 7]. Here, both acoustic and anatomical data are available. In the context of this thesis, the objective is to construct anatomical embeddings, because we wish to be able to study each articulator independently of the others, which requires fairly precise knowledge of its position and its immediate anatomical environment. This adaptation to the speaker, based on a few static MRI images only, answers a double constraint: the rarity and cost of dynamic MRI on the one hand, and the impossibility of using MRI in some cases on the other hand, for example after the insertion of a cochlear implant whose compatibility with MRI is not guaranteed.
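As a toy illustration of embedding-based speaker adaptation (the vectors below are random stand-ins, not real x-vectors or anatomical embeddings), a new speaker can be matched to the closest known speaker by cosine similarity in the embedding space:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical speaker embeddings (random stand-ins for vectors a trained
# network would produce for speakers already in the corpus).
rng = np.random.default_rng(1)
corpus = {name: rng.normal(size=8) for name in ["spk1", "spk2", "spk3"]}

# A new speaker's embedding, close to spk2 by construction.
new_speaker = corpus["spk2"] + 0.05 * rng.normal(size=8)

closest = max(corpus, key=lambda name: cosine(corpus[name], new_speaker))
print(closest)  # -> spk2
```

A real adaptation scheme would condition the inversion network on this embedding rather than simply picking a nearest neighbour, but the continuous-space matching idea is the same.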

We have already addressed the issue of anatomical adaptation through the construction of dynamic atlases of consonant articulation [8], which is based on the use of a fairly classical transformation in medical image processing [5]. It has the drawback of not identifying remarkable anatomical landmarks as such, and the path we intend to follow will instead be inspired by the anatomical embeddings recently proposed for the processing of radiological images [4]. In spirit, the idea behind these embeddings is quite close to LSTM (Long Short-Term Memory) networks, since they combine a global embedding and a local embedding.
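As a much-simplified stand-in for atlas-based anatomical adaptation (the atlas work cited above uses free-form deformations, not the affine fit shown here; all contour points are synthetic), the sketch below aligns a new speaker's contour onto an atlas by least squares:

```python
import numpy as np

def fit_affine(src, dst):
    """Least-squares 2D affine transform mapping src points onto dst."""
    A = np.hstack([src, np.ones((len(src), 1))])   # homogeneous coordinates
    M, *_ = np.linalg.lstsq(A, dst, rcond=None)    # (3, 2) transform matrix
    return M

rng = np.random.default_rng(2)
atlas = rng.normal(size=(30, 2))                   # toy atlas contour points

# A "new speaker" contour: scaled, rotated and shifted copy of the atlas.
theta = 0.3
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
speaker = 1.2 * atlas @ R.T + np.array([0.5, -0.3])

# Fit the transform and map the speaker's contour back onto the atlas.
M = fit_affine(speaker, atlas)
mapped = np.hstack([speaker, np.ones((len(speaker), 1))]) @ M
err = float(np.max(np.abs(mapped - atlas)))
print(f"max alignment error: {err:.2e}")
```

Nonrigid registration replaces this single global transform with a deformation field, which is what allows speaker-specific anatomical differences to be captured locally.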

Main activities


The PhD student will be able to use the databases already acquired in the framework of the ANR ArtSpeech project (about 10 minutes of speech for 10 speakers) and the much larger databases being acquired in the framework of the ANR Full3DTalkingHead project (about 2 hours of speech for 2 speakers). The PhD student will of course also be able to acquire complementary data using the MRI system available in the IADI laboratory (40% of the MRI time is reserved for research). The scientific environment of the two teams is very complementary, with very strong competence in all fields of MRI and anatomy in the IADI laboratory, and in deep learning in the MultiSpeech team of Loria. The two teams are geographically close (1.5 km). The PhD student will have an office in both laboratories and the technical means (computer, access to the computing clusters) to work in very good conditions. A follow-up meeting will take place every week, and each of the two teams organizes a weekly scientific seminar. The PhD student will also have the opportunity to participate in one or two summer schools and in conferences on MRI and automatic speech processing. He/she will also be assisted in writing conference and journal papers.

References

  1. Benjamin Elie and Yves Laprie. Extension of the single-matrix formulation of the vocal tract: consideration of bilateral channels and connection of self-oscillating models of the vocal folds with a glottal chink. Speech Communication 82, pp. 85-96 (2016).
  2. Benjamin Elie and Yves Laprie. Copy-synthesis of phrase-level utterances. EUSIPCO, Budapest, 2016.
  3. Maud Parrot, Juliette Millet, Ewan Dunbar. Independent and Automatic Evaluation of Speaker-Independent Acoustic-to-Articulatory Reconstruction. Interspeech 2020 - 21st Annual Conference of the International Speech Communication Association, Oct 2020, Shanghai / Virtual, China. ⟨hal-03087264⟩
  4. Ke Yan, Jinzheng Cai, Dakai Jin et al. Self-supervised Learning of Pixel-wise Anatomical Embeddings in Radiological Images. arXiv:2012.02383 [cs.CV], 2020.
  5. D. Rueckert, L. I. Sonoda, C. Hayes, D. L. Hill, M. O. Leach, D. J. Hawkes. Nonrigid registration using free-form deformations: application to breast MR images. IEEE Trans Med Imaging. 1999 Aug;18(8):712-21. doi: 10.1109/42.796284.
  6. David Snyder, Daniel Garcia-Romero, Daniel Povey, and Sanjeev Khudanpur. "Deep neural network embeddings for text-independent speaker verification." Interspeech 2017, pp. 999-1003.
  7. David Snyder, Daniel Garcia-Romero, Gregory Sell, Daniel Povey, and Sanjeev Khudanpur. "X-vectors: Robust DNN embeddings for speaker recognition." IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 5329-5333.
  8. Ioannis Douros, Ajinkya Kulkarni, Chrysanthi Dourou, Yu Xie, Jacques Felblinger, Karyna Isaieva, Pierre-André Vuissoz and Yves Laprie. Using Silence MR Image to Synthesise Dynamic MRI Vocal Tract Data of CV. INTERSPEECH 2020, Oct 2020, Shanghai / Virtual, China. ⟨hal-03090808⟩
  9. Slim Ouni. Tongue Gestures Awareness and Pronunciation Training. 12th Annual Conference of the International Speech Communication Association - Interspeech 2011, Aug 2011, Florence, Italy. ⟨inria-00602418⟩
  10. Karyna Isaieva, Yves Laprie, Nicolas Turpault, Alexis Houssard, Jacques Felblinger and Pierre-André Vuissoz (2020). Automatic Tongue Delineation from MRI Images with a Convolutional Neural Network Approach. Applied Artificial Intelligence, 34:14, 1115-1123.
  11. Karyna Isaieva, Y. Laprie, J. Leclère, Ioannis K. Douros, Jacques Felblinger and Pierre-André Vuissoz. Multimodal dataset of real-time 2D and static 3D MRI of healthy French speakers. Scientific Data 8, 258 (2021).
  12. Vinicius Ribeiro, Karyna Isaieva, Justine Leclere, Pierre-André Vuissoz, Yves Laprie. Towards the prediction of the vocal tract shape from the sequence of phonemes to be articulated. INTERSPEECH 2021, Aug 2021, Brno, Czech Republic. ⟨hal-03360113⟩
Skills

    deep learning, computer science, speech processing, applied mathematics

    Benefits package
  • Subsidized meals
  • Partial reimbursement of public transport costs
  • Leave: 7 weeks of annual leave + 10 extra days off due to RTT (statutory reduction in working hours) + possibility of exceptional leave (sick children, moving home, etc.)
  • Possibility of teleworking (after 6 months of employment) and flexible organization of working hours
  • Professional equipment available (videoconferencing, loan of computer equipment, etc.)
  • Social, cultural and sports events and activities
  • Access to vocational training
  • Social security coverage
Remuneration

    Salary: 2051€ gross/month for 1st and 2nd year. 2158€ gross/month for 3rd year.

    General Information
  • Theme/Domain : Language, Speech and Audio / Scientific computing (BAP E)

  • Town/city : Villers-lès-Nancy

  • Inria Center : CRI Nancy - Grand Est
  • Starting date : 2023-02-14
  • Duration of contract : 2 months
  • Deadline to apply : 2023-03-16
Contacts
  • Inria Team : MULTISPEECH
  • PhD Supervisor : Laprie Yves / [email protected]

About Inria

    Inria is the French national research institute dedicated to digital science and technology. It employs 2,600 people. Its 200 agile project teams, generally run jointly with academic partners, include more than 3,500 scientists and engineers working to meet the challenges of digital technology, often at the interface with other disciplines. The Institute also employs numerous talents in over forty different professions. 900 research support staff contribute to the preparation and development of scientific and entrepreneurial projects that have a worldwide impact.

    Instruction to apply

    Defence Security : This position is likely to be situated in a restricted area (ZRR), as defined in Decree No. 2011-1425 relating to the protection of national scientific and technical potential (PPST). Authorisation to enter an area is granted by the director of the unit, following a favourable Ministerial decision, as defined in the decree of 3 July 2012 relating to the PPST. An unfavourable Ministerial decision in respect of a position situated in a ZRR would result in the cancellation of the appointment.

    Recruitment Policy : As part of its diversity policy, all Inria positions are accessible to people with disabilities.

    Warning : you must enter your e-mail address in order to save your application to Inria. Applications must be submitted online on the Inria website. Processing of applications sent from other channels is not guaranteed.
