Thanks to Speech Translation (ST) technology, translating speech in a source language into text in a target language is possible! Unfortunately, traditional cascaded ST approaches, which chain Automatic Speech Recognition (ASR) and Machine Translation (MT), are prone to error propagation. Because of that, end-to-end ST has recently been gaining popularity. Its drawback is that it depends heavily on direct ST data, which can be difficult to obtain. In this blog, we interview our former KE@Work student about her research paper on tackling data scarcity in Speech Translation, which she wrote in collaboration with Mediaan.

1. Could you please introduce yourself?

My name is Tu Anh. I’m 23 years old and I am from Vietnam. I came to The Netherlands to pursue my professional interest in Data Science at Maastricht University and it has been a very exciting journey ever since! Currently, I am a Master’s student at the University of Amsterdam, also focusing on Data Science. In my free time, I usually like to go out with my close friends for shopping, karaoke, or just a simple indoor get-together. I also like to sing and play guitar, although I am not an expert myself. Lastly, I like watching funny cat videos and sometimes even editing those videos just for fun!

2. What was your overall experience working at Mediaan Conclusion as a KE@Work student and what expertise did you develop within this timeframe?

Let me start by saying that it was a very valuable two years! When I first started at Mediaan Conclusion, I had zero working experience. Valentin, my direct supervisor, was very supportive, so I felt comfortable speaking up about my interests and asking for help whenever I needed it. One thing that I really like about Mediaan Conclusion is that the office has a very laid-back, informal vibe, which creates a very comfortable atmosphere.

The company was also very flexible and supportive of my working schedule, since I was pursuing my studies at Maastricht University at the time. During this timeframe, I gained experience in different Data Science domains, learned how to use various tools, and learned how to use Microsoft Azure Services: Automated Machine Learning and Custom Speech-To-Text. It was really nice to be able to explore different topics and find out where my interest lies!

3. Your research paper has been accepted by the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2022). This paper is derived from your Bachelor’s thesis assignment at Mediaan Conclusion. Can you tell us more about it?

Yes, the paper is about “Tackling data scarcity in speech translation using zero-shot multilingual machine translation techniques”. The main focus is to build an end-to-end Speech Translation model that translates speech in a source language directly into text in a target language. This type of model usually requires a large amount of Speech Translation (ST) data for training, which can be difficult to obtain.

The paper therefore proposes approaches that make use of other, more easily available data sources for training, namely Speech Recognition (ASR) data and Machine Translation (MT) data. For this topic, I researched the state-of-the-art architectures for Speech Translation and how to modify them so that they work for my model, which uses different data sources. I also researched the zero-shot techniques used for multilingual text translation, which are useful to apply to the domain of speech.

4. What were the challenges that you had to face when trying to find the right solution?

The challenge here is that speech and text are two different types of data. Speech is a continuous signal, while text is a sequence of discrete words. For the same sentence, the representation of the speech and the text would look very different digitally, even though they have the same semantic meaning. A model can only make use of both Speech Recognition data and Text Translation data for Speech Translation if it can learn how to semantically encode these text and audio modalities in a similar way.

5. What was the solution you came up with and what methods did you use?

The solution for using different types of training data in one model is to use two parallel encoders, one for text and one for audio. To encourage similar semantic representations for text and speech data, I used an auxiliary loss function that minimizes the difference between the encoder outputs of the audio and the text for the same sentence. Additionally, to better control the output language, I used a data augmentation approach in which I introduced an artificial language to the training data by reversing the source language character-wise.
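To make the data augmentation idea concrete, here is a minimal sketch of building an "artificial language" by reversing sentences character-wise. It is only an illustration: the function names, the `<rev>` tag, and the way each sentence is paired with its reversed form are assumptions made for this example, not details taken from the paper.

```python
from typing import List, Tuple

# (target-language tag, source sentence, target sentence)
Example = Tuple[str, str, str]


def reverse_characters(sentence: str) -> str:
    """Return the sentence with its characters in reverse order."""
    return sentence[::-1]


def add_reversed_language(sentences: List[str], tag: str = "<rev>") -> List[Example]:
    """Pair every sentence with its reversed form, marked by a hypothetical tag."""
    return [(tag, s, reverse_characters(s)) for s in sentences]


examples = add_reversed_language(["good morning", "thank you"])
for tag, source, target in examples:
    print(tag, source, "->", target)
```

Because the reversed text uses the same characters as the original language but in a different order, the tag is the only reliable signal of which output "language" is wanted, which is the sense in which such data can help control the output language.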

I implemented the approaches for my thesis by extending another project that was built in PyTorch. PyTorch is a framework for Machine Learning that is widely used nowadays. Thanks to my previous tasks at Mediaan Conclusion, I had already worked with PyTorch, which made the implementation process for my thesis much easier.
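The auxiliary loss mentioned above can be illustrated with a small sketch. The thesis code was written in PyTorch; this version uses plain Python lists only so that it runs without dependencies. How the two encoder outputs are actually compared in the paper is not described in this interview, so the mean-pooling over time and the squared distance used here are assumptions, as are all function names.

```python
from typing import List

Vector = List[float]


def mean_pool(states: List[Vector]) -> Vector:
    """Average a variable-length sequence of encoder states into one vector."""
    dim = len(states[0])
    return [sum(state[i] for state in states) / len(states) for i in range(dim)]


def auxiliary_loss(audio_states: List[Vector], text_states: List[Vector]) -> float:
    """Mean squared difference between the pooled audio and text encodings."""
    audio_vec = mean_pool(audio_states)
    text_vec = mean_pool(text_states)
    return sum((a - t) ** 2 for a, t in zip(audio_vec, text_vec)) / len(audio_vec)


# Toy encoder outputs for the same sentence: 3 audio frames, 2 text tokens.
audio_states = [[1.0, 0.0], [3.0, 2.0], [2.0, 4.0]]
matching_text = [[2.0, 2.0], [2.0, 2.0]]
different_text = [[0.0, 0.0]]

print(auxiliary_loss(audio_states, matching_text))   # encodings agree
print(auxiliary_loss(audio_states, different_text))  # encodings disagree
```

Pooling is used here because the audio encoder typically produces many more time steps than the text encoder produces tokens, so the two sequences cannot be compared position by position without some form of length alignment.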

6. What kind of advantages would you get from implementing this solution?

Since the proposed model makes better use of different data sources, it becomes possible to build Speech Translation systems even when Speech Translation data for training is scarce. As an example, if we want to build a model that translates English speech to German text, but do not have a large amount of English speech to German text data for training, we can use this model to also make use of English speech to English text data, as well as English text to German text data. The proposed approach for Speech Translation is built in an end-to-end fashion. That is, we only need one model to perform speech translation, as opposed to cascaded approaches, which use two models sequentially. I believe that this would make the system easier to use in production, since we do not need to store multiple models or the intermediate output of the models.
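The English-to-German example above can be sketched as a single pooled training set. This is a hypothetical illustration of the idea, not the paper's data pipeline: the `TrainingExample` fields, the file names, and the example sentences are all invented for this sketch.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class TrainingExample:
    modality: str     # "audio" or "text": which encoder would read the input
    source: str       # path to an audio file, or a source sentence
    target_lang: str  # language of the target text
    target: str       # target sentence


def pool_training_data(
    asr_pairs: List[Tuple[str, str]],
    mt_pairs: List[Tuple[str, str]],
    st_pairs: List[Tuple[str, str]],
) -> List[TrainingExample]:
    """Combine ASR, MT, and (scarce) ST data into one training set."""
    pooled = [TrainingExample("audio", wav, "en", text) for wav, text in asr_pairs]
    pooled += [TrainingExample("text", src, "de", tgt) for src, tgt in mt_pairs]
    pooled += [TrainingExample("audio", wav, "de", tgt) for wav, tgt in st_pairs]
    return pooled


pooled = pool_training_data(
    asr_pairs=[("clip_001.wav", "good morning")],    # English speech -> English text
    mt_pairs=[("good morning", "guten Morgen")],     # English text -> German text
    st_pairs=[("clip_002.wav", "vielen Dank")],      # English speech -> German text
)
for example in pooled:
    print(example)
```

In this sketch, the plentiful ASR and MT pairs sit alongside the few direct ST pairs, and a single model is trained on all of them instead of training one model per task.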

7. This assignment required expertise in the area of Natural Language Processing (NLP). How did Mediaan Conclusion provide guidance in this aspect?

Actually, it was Valentin who encouraged me to pursue a Speech-related thesis topic, since speech is currently a field of great interest with many practical applications. I had a weekly meeting with him and other students who were also working on NLP projects. Sometimes, other senior colleagues with expertise in NLP would also join. Since everyone was working on NLP, and especially in the domain of Speech, it was easy for us to exchange ideas and solutions for many problems. Additionally, I sometimes had the chance to present my work to the whole team in informal stand-up meetings. It was interesting to get feedback from different perspectives!

8. In what kind of business cases or practical examples is this solution applicable?

As the world is becoming more globalized, Speech Translation would be useful for many use cases. As an example, if a company meeting is in Dutch, then the Speech Translation system can be used to provide the English transcript of the meeting so that international colleagues can understand. Other applications could be creating movie subtitles, on-the-fly translation apps, etc.

9. Apart from hard skills, what else did you develop during your KE@Work program at Mediaan Conclusion?

One thing I learned was to speak up – either about technical issues, or personal circumstances. Looking back at myself when I first started at Mediaan Conclusion, I feel like a different person now!

Thanks to the daily stand-ups and the weekly supervisor meetings, I also learned how to communicate effectively, and how to present my work in a suitable way for different types of audiences. These frequent meetings were continued in Corona times, so they also taught me how to keep in touch and work effectively online. I consider the skills I developed at Mediaan Conclusion to be priceless, and they will be very helpful for my future career!

You can read more about Tu Anh’s research paper here. Do you want to know more about what Speech Translation and other NLP fields can offer to your business? Our team of Data Science experts is always ready to help you!