R&D Internship - Podcast Topic Modeling

  • 12 Rue d'Athènes, 75009 Paris, France
  • Intern

Company Description

Just Hack it!

With 53 million tracks and a presence in 180 countries, Deezer is the most personal music streaming service in the world.

Behind the code and the pixels is our team of 500 music lovers, and we’re building something incredible together. Want in? If you’re looking for an adventure, not just a job, and you fancy seeing ideas come to life in a heartbeat, you’re in the right place.

We dare to challenge the status quo and believe innovation is part of our DNA.

Job Description

Podcasts are a special type of audio content used for entertainment, information or advertisement [1]. They are frequently considered the “spoken” version of blog posts [2]. Allowing users to retrieve speech files effectively from an increasingly large item set and providing automatic podcast recommendations are essential for streaming services [1,2]. In contrast to music recommendation and retrieval, users place much more emphasis on a podcast's topics than on its audio style [1]. Consequently, annotating podcasts with topics is a prerequisite for both search and content-based recommendation.

Previous work has exploited different types of data to infer topic tags automatically: metadata such as the podcast title [3] and the transcribed spoken content [1,2,3], each proving suitable in different user search scenarios [3]. Although the transcribed speech is a richer source of data for topic extraction, using only metadata is considered a more economical alternative [3]. Additionally, with the advancements made in topic modeling on short texts, fostered by the latest NLP trends in language representation in low-dimensional embedding spaces [4,5,6], the question of whether relevant topic representations can be obtained even from short podcast titles is worth further investigation.

The objective of this internship is thus to propose a method for modeling topics on short text, starting from the latest related literature [4,5,6], and to assess its suitability in the podcast scenario. The proposed method has to be compared with topic modeling that uses only the transcribed speech. Moreover, topic modeling approaches combining both the short metadata and the noisy transcribed content may be investigated. For this, the intern is expected to review existing literature, propose a solution, design a suitable experimental protocol for evaluation and present the results in a scientific report or article.

The intern is supervised by research scientists and research engineers from the Deezer R&D team, who provide practical and scientific help with the task. The intern is nonetheless encouraged to propose solutions and work autonomously. For data experiments, Deezer provides cutting-edge technology and appropriate computing power.



[1] Longqi Yang, Yu Wang, Drew Dunne, Michael Sobolev, Mor Naaman, and Deborah Estrin. 2019. More Than Just Words: Modeling Non-Textual Characteristics of Podcasts. In Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining (WSDM '19). ACM, New York, NY, USA, 276-284. DOI: https://doi.org/10.1145/3289600.3290993

[2] J. Mizuno, J. Ogata, and M. Goto. 2008. A similar content retrieval method for podcast episodes. In Proceedings of the 2008 IEEE Spoken Language Technology Workshop, Goa, 297-300. DOI: https://doi.org/10.1109/SLT.2008.4777899

[3] J. Besser, M. Larson, and K. Hofmann. 2010. Podcast search: user goals and retrieval technologies. Online Information Review, Vol. 34 No. 3, 395-419. DOI: https://doi.org/10.1108/14684521011054053

[4] Chenliang Li, Haoran Wang, Zhiqian Zhang, Aixin Sun, and Zongyang Ma. 2016. Topic Modeling for Short Texts with Auxiliary Word Embeddings. In Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval (SIGIR '16). ACM, New York, NY, USA, 165-174. DOI: https://doi.org/10.1145/2911451.2911499

[5] Tian Shi, Kyeongpil Kang, Jaegul Choo, and Chandan K. Reddy. 2018. Short-Text Topic Modeling via Non-negative Matrix Factorization Enriched with Local Word-Context Correlations. In Proceedings of the 2018 World Wide Web Conference (WWW '18). International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, Switzerland, 1105-1114. DOI: https://doi.org/10.1145/3178876.3186009

[6] Felipe Viegas, Sérgio Canuto, Christian Gomes, Washington Luiz, Thierson Rosa, Sabir Ribas, Leonardo Rocha, and Marcos André Gonçalves. 2019. CluWords: Exploiting Semantic Word Clustering Representation for Enhanced Topic Modeling. In Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining (WSDM '19). ACM, New York, NY, USA, 753-761. DOI: https://doi.org/10.1145/3289600.3291032 



Qualifications

  • Master's or PhD student with a background in Computer Science / Computational Linguistics / Applied Mathematics / Statistics


  • Strong knowledge of natural language processing, applied machine learning and data mining

  • Good programming skills for data processing and experimentation (Python preferred, but we are open to other technologies too)

  • Creativity and autonomy

Additional Information

Life @ Deezer Paris

  • Start-up environment and philosophy
  • Highly motivated and product-focused people ready to drive innovation
  • In-house Deezer Sessions with your favorite artists, gig tickets
  • Hackathons & meetups
  • Friday drinks, summer and winter parties
  • A stocked kitchen with free drinks and snacks
  • Areas to relax and collaborate with beanbags, guitars and table football
  • An ‘at home’ vibe, with great outdoor spaces
  • Gym access at Deezer HQ, with lunch-time yoga, pilates and boxing classes