Speaker Recognition Research Internship

Snips is hiring!


Snips is an AI-powered voice platform for connected devices. It enables makers and companies to add voice recognition and understanding to their products or services.

Created in 2013, the vision behind Snips has been to put an AI assistant in every device, making technology so intuitive that it disappears into the background. We are now over 70 people, with offices in Paris and New York.

What makes Snips unique is that everything runs locally on the device the user is speaking to, meaning no data ever gets sent to the cloud. This guarantees Privacy by Design and resilience to internet outages, making Snips the first ever voice technology to be GDPR compliant.

Our web platform was launched in June last year, and already has more than 15,000 developers who created over 24,000 assistants.

Snips is looking for passionate people who want to contribute to our vision to make technology disappear. More than a company, we see ourselves as a community that values purpose, inclusiveness, curiosity, grit and impact. We expect future Snipsters to be curious and want to learn continuously, helping each other solve the various problems they are working on.

What you will do

Snips is looking for a Speaker Recognition Research Intern to join our team in Paris.

This internship offer fits within the scope of a collaboration between Snips and the Multispeech team of Inria Nancy - Grand Est (https://team.inria.fr/multispeech/). You will be based in our office in Paris and will have the opportunity to spend some time at Inria. This internship can be followed by a CIFRE PhD, co-supervised by Snips and Inria, on a closely-related subject.

English is the official language of the company, as we have over 12 different nationalities, so don't worry if you don't speak French!

On top of a competitive salary, we offer many perks, including relocation assistance and visa sponsorship, laptops, maker kits, language classes, sports classes, full health insurance, a transport card and free lunches!


State-of-the-art speaker recognition systems [1] rely on various speaker embedding methods, such as i-vectors [2], x-vectors [3] and d-vectors [4], which achieve good recognition performance under controlled conditions. However, speaker recognition remains a challenging problem under real-world conditions. For instance, when the system is trained on clean speech but recognition has to be performed on speech corrupted by background noise, accuracy drops severely. The emerging smart-assistant sector therefore strongly demands reliable speaker recognition methods that can authenticate users in adverse acoustic conditions. Existing DNN-based speaker recognition methods use short-term magnitude information but ignore phase information, which has proven useful in other speech processing applications such as speech recognition [5], speech enhancement [6] and speech separation [7].
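To make the magnitude/phase distinction concrete, here is a minimal NumPy sketch (the frame length, hop size and synthetic test signal are illustrative choices, not values from the project). It computes a complex STFT and splits it into the magnitude spectrogram that typical front ends keep and the phase spectrogram they discard:

```python
import numpy as np

def stft_mag_phase(signal, frame_len=512, hop=128):
    """Frame and window a signal, then return (magnitude, phase) spectrograms.

    Most DNN speaker-recognition front ends keep only the magnitude (or
    log-mel energies derived from it); the phase matrix computed here is
    exactly the information those systems throw away.
    """
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([
        signal[i * hop: i * hop + frame_len] * window
        for i in range(n_frames)
    ])
    spectrum = np.fft.rfft(frames, axis=1)       # complex STFT, one row per frame
    return np.abs(spectrum), np.angle(spectrum)  # magnitude, phase in [-pi, pi]

# Example on a synthetic 1-second signal at 16 kHz
rng = np.random.default_rng(0)
x = rng.standard_normal(16000)
mag, phase = stft_mag_phase(x)
print(mag.shape, phase.shape)
```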

The goal of this Master internship is to design and implement a phase-aware DNN-based noise-robust speaker recognition system and to evaluate it for practical applications. The intern will be responsible for (a) investigating the reliability of existing handcrafted phase representations such as the modified group delay, the relative phase, and the all-pole group delay function, and (b) developing an end-to-end DNN architecture that accounts for the phase. This work will involve both using the existing speaker recognition system in Kaldi [8] and developing additional software in Python using the publicly available PyTorch machine learning library.
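As a possible starting point for part (a), the (unmodified) group delay of a windowed frame can be computed without explicit phase unwrapping via the standard FFT identity τ(ω) = (X_R·Y_R + X_I·Y_I) / |X|², where Y is the FFT of n·x[n]. The sketch below implements only this plain form; the modified group delay of the literature additionally smooths the denominator cepstrally and compresses the result, which is omitted here:

```python
import numpy as np

def group_delay(frame):
    """Group delay of one windowed frame, tau(w) = Re(Y * conj(X)) / |X|^2,
    where X = FFT(x[n]) and Y = FFT(n * x[n]).

    This avoids differentiating the unwrapped phase directly. The
    'modified' group delay replaces |X|^2 with a cepstrally smoothed
    spectrum and compresses the output; that refinement is not shown.
    """
    n = np.arange(len(frame))
    X = np.fft.rfft(frame)
    Y = np.fft.rfft(n * frame)
    eps = 1e-10  # guards against division by zero at spectral nulls
    return (X.real * Y.real + X.imag * Y.imag) / (np.abs(X) ** 2 + eps)

frame = np.hanning(512) * np.random.default_rng(1).standard_normal(512)
tau = group_delay(frame)
print(tau.shape)  # one value per rfft bin
```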

The experiments will be conducted on the state-of-the-art SRE-18 speaker recognition corpus [9]. This corpus does not include recordings with background noise. Therefore, as a first step, in order to simulate noisy conditions, the intern will mix this speech data with noise signals from the publicly available MUSAN noise corpus [10].
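The noise-mixing step can be sketched as follows. The function scales a noise signal so that the mixture reaches a requested signal-to-noise ratio in dB; the function name and the synthetic signals are illustrative, not part of the MUSAN tooling:

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Add `noise` to `speech` at the requested SNR (in dB).

    The noise is tiled/truncated to the speech length, then scaled so
    that 10*log10(P_speech / P_scaled_noise) equals `snr_db`.
    """
    reps = int(np.ceil(len(speech) / len(noise)))
    noise = np.tile(noise, reps)[: len(speech)]
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise

# Example with synthetic stand-ins for an SRE-18 utterance and a MUSAN noise clip
rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)
noise = rng.standard_normal(8000)
mixture = mix_at_snr(speech, noise, snr_db=5.0)
```

In practice one samples the SNR (e.g. uniformly over a range) per utterance so the trained system sees a spread of noise levels.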


[1] Hansen, J.H. and Hasan, T., 2015. Speaker recognition by machines and humans: A tutorial review. IEEE Signal processing magazine, 32(6), pp.74-99. (Overview of speaker recognition)
[2] Dehak, N., Kenny, P.J., Dehak, R., Dumouchel, P. and Ouellet, P., 2011. Front-end factor analysis for speaker verification. IEEE Transactions on Audio, Speech, and Language Processing, 19(4), pp.788-798.
[3] Snyder, D., Garcia-Romero, D., Sell, G., Povey, D. and Khudanpur, S., 2018. X-vectors: Robust DNN embeddings for speaker recognition. Proc. of ICASSP 2018.
[4] Variani, E., Lei, X., McDermott, E., Moreno, I.L. and Gonzalez-Dominguez, J., 2014, May. Deep neural networks for small footprint text-dependent speaker verification. Proc. of ICASSP 2014.
[5] Kim, C., Sainath, T., Narayanan, A., Misra, A., Nongpiur, R. and Bacchiani, M., 2018. Spectral distortion model for training phase-sensitive deep-neural networks for far-field speech recognition. Proc. of ICASSP 2018.
[6] Zheng, N. and Zhang, X., 2019. Phase-aware speech enhancement based on deep neural networks. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 27(1), pp.63-76.
[7] Erdogan, H., Hershey, J.R., Watanabe, S. and Le Roux, J., 2015. Phase-sensitive and recognition-boosted speech separation using deep recurrent neural networks. Proc. of ICASSP 2015.
[8] https://github.com/kaldi-asr/kaldi
[9] https://www.nist.gov/itl/iad/mig/nist-2018-speaker-recognition-evaluation
[10] http://www.openslr.org/17/

What we are looking for

  • 2nd year Master student in computer science, machine learning, or audio signal processing
  • Experience programming in Python
  • Experience with PyTorch and/or TensorFlow is a plus

Recruitment Process

  • Application form
  • Challenge at home
  • Technical interview
  • Onsite trial

Additional Information

  • Contract type: Internship (6 months)
  • Start date: 01 March 2019
  • Location: Paris, France (75002)
  • Education Level: Fourth-Year University Level
  • Salary: 1100€ / month