Machine Learning Research Intern

  • Intern

Company Description

Serving 4 million students and recent graduates; 80,000 businesses; and a network of 700 higher education institutions in 25 European countries, JobTeaser is now the European leader for the recruitment of young talent in Europe.

With 68 million euros raised to date and recently recognised in the Next40 list as one of the top 40 French tech startups with the most potential, we continue to build on our unique ecosystem that brings together a range of businesses, as well as schools and universities, around the next generation, giving them the tools they need to launch their working careers with confidence.

To support our growth in Europe, we are currently looking for people who want to evolve in a fast-growing company, with a start-up spirit and labeled Great Place to Work.

Job Description

The team

Our team is composed of 3 data scientists, 2 data engineers and 4 data analysts, working in synergy with the product and tech teams.

 

Context of the internship

Visual information extraction from structured documents.

The data science team at JobTeaser focuses on building machine learning based tools to help improve the product. From recommendation systems to talent sourcing, we train our algorithms on data from resumes and job postings, which are often in raw text format (using PDF parsers).  However, many new product features require having access to resume data in a structured manner. In this context, we believe that extracting information (education, work experiences, personal information ...) from the resumes could have a great impact on both the product and our algorithms. 

 

Your mission

While PDF parsers and OCR tools are still good solutions for simple documents, they perform poorly on column-structured and block-structured documents such as resumes. In addition, variations of text in PDF documents due to differences in size, style, orientation, and alignment, as well as complex background make the problem of automatic text extraction extremely challenging. One way to overcome this problem is to consider these documents as images and visually segment key regions by adapting object detection techniques to this case.  Object detection is often a two-step process that matches the mechanism of perception of the human brain to some extent:

  • Step 1: propose regions of interest in the image (boxes).

  • Step 2: classify each region according to its context.

The two steps could be learned jointly or separately as explained in the [1]. A first implementation of this method has proven to be successful for experience retrieval. These methods could be augmented by adding contextual features (positional features for instance) as explained in [2] and [3].

In this internship, we propose to explore and implement different methods of information extraction from resumes by combining recent object detection techniques in computer vision and text classification in natural language processing. If the proposed solution is successful, it could be directly implemented in production as an API. The candidate will be working closely with the team to drive the project, design and implement prototypes and ensure best practices are applied.

 

Bibliography

[1] Licheng Jiao and al. A Survey of Deep Learning-based Object Detection. arXiv:1907.09408v2 [cs.CV] 

[2]  Carlos X. Soto and al. Visual Detection with Context for Document Layout Analysis. https://www.aclweb.org/anthology/D19-1348.pdf

[3]  Yiheng Xu and al. LayoutLM: Pre-training of Text and Layout for Document Image Understanding. arXiv:1912.13318v5 [cs.CL]

Qualifications

We’re looking for talented, smart, kind, and genuinely curious individuals to work alongside us. 

  • Currently pursuing a master's degree in a quantitative field (Engineering, Mathematics, Statistics, Computer Science).

  • Interest in R&D topics.

  • Strong machine learning and deep learning backgrounds (Natural Language Processing and/or computer vision).

  • Good programming skills in Python and at least one deep learning framework (TensorFlow or Torch preferably).

  • Familiar with Linux environment.

  • Familiar with AWS and Cloud based infrastructures is a plus.

  • Excited about code craftsmanship, to build robust code with the best practices.

  • Autonomous but enjoys working as a team-player.

  • Good communication skills.

Privacy Policy