
French Resume Analyser

One third of French companies use the Internet as their preferred recruitment medium (source: IPSOS Liaisons Sociales/France Télécom survey of 300 companies with more than 100 employees).

E-recruitment is relatively new and is expected to grow very quickly over the next few years. A recent survey conducted by APEC in several European countries (Germany, Belgium, Spain, France, Italy, Luxembourg, the Netherlands and the United Kingdom) concluded that 49% of the companies surveyed use the web during their recruitment process. This figure reaches 81% in the Netherlands.

This medium generates a very large number of applications, and it is not unusual for large companies to receive more than 80,000 resumes per year.

Language processing tools can help manage this huge flow of information.

This document presents a resume analyser that automatically extracts all the information about a candidate from his or her resume:

  • Personal data (last name, first name, address, date of birth and/or age, marital status),
  • Phone numbers,
  • Email addresses,
  • Education (date, diplomas),
  • Work experience (date, occupation, company),
  • Foreign languages (language, level),
  • Computing skills,
  • Miscellaneous.

The application is based on a modular architecture in which each module is in charge of a specialised function:

  • file format detection,
  • text extraction (ASCII),
  • resume identification (to select the right file when a message contains several attachments),
  • pre-processing stage for cleaning and normalization of the resume,
  • analysis of the resume, with all extracted information structured as an XML file,
  • data validation against the selected XML schema,
  • state machine for management of the processing stages.
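
The last item can be illustrated with a minimal sketch of a state machine that drives the stages in order. The stage names, the linear transition table and the handler signature are assumptions made for this example; the original implementation is not described in this document.

```python
from enum import Enum, auto


class Stage(Enum):
    # Hypothetical stage names mirroring the module list above.
    DETECT_FORMAT = auto()
    EXTRACT_TEXT = auto()
    IDENTIFY_RESUME = auto()
    PREPROCESS = auto()
    ANALYSE = auto()
    VALIDATE = auto()
    DONE = auto()
    FAILED = auto()


# Assumed linear transition table: each successful stage leads to the next one.
NEXT_STAGE = {
    Stage.DETECT_FORMAT: Stage.EXTRACT_TEXT,
    Stage.EXTRACT_TEXT: Stage.IDENTIFY_RESUME,
    Stage.IDENTIFY_RESUME: Stage.PREPROCESS,
    Stage.PREPROCESS: Stage.ANALYSE,
    Stage.ANALYSE: Stage.VALIDATE,
    Stage.VALIDATE: Stage.DONE,
}


def run_pipeline(handlers, document):
    """Run the stages in order.

    `handlers` maps a Stage to a function taking the document and returning
    (ok, new_document). Stages without a handler are treated as no-ops.
    Returns the final stage, the document and the list of visited stages.
    """
    stage = Stage.DETECT_FORMAT
    history = []
    while stage not in (Stage.DONE, Stage.FAILED):
        history.append(stage)
        handler = handlers.get(stage, lambda doc: (True, doc))
        ok, document = handler(document)
        stage = NEXT_STAGE[stage] if ok else Stage.FAILED
    history.append(stage)
    return stage, document, history
```

A handler that returns `(False, document)` stops the run in the `FAILED` state, which makes it easy to see which stage rejected a file.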

The picture below presents the different stages of resume analysis.

[Figure: stages of resume analysis]

The following sections present some of the modules used during the analysis.

File format detection module

File format detection is based on file signatures (also known as "magic numbers"). Because no such simple byte sequence is available for some formats (Word, Excel, PowerPoint), more complex sequences are used in those cases.

This identification is implemented using two finite state automata.

[Figure: Finite state automata for file format identification]
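
As a rough illustration of signature-based detection, the sketch below uses plain prefix matching rather than the finite state automata mentioned above. The PDF, RTF and OLE2 signatures are well known; telling Word, Excel and PowerPoint apart by searching for their usual stream names inside the OLE2 container is a heuristic assumed for this example, not the author's method.

```python
# Signature of the OLE2 compound file container shared by legacy
# Word, Excel and PowerPoint documents.
OLE2_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"

# Simple leading signatures.
SIGNATURES = [
    (b"%PDF-", "pdf"),
    (b"{\\rtf", "rtf"),
]

# Stream names stored as UTF-16LE in the OLE2 directory (heuristic lookup).
OLE2_STREAMS = [
    ("WordDocument", "doc"),
    ("Workbook", "xls"),
    ("PowerPoint Document", "ppt"),
]


def detect_format(data: bytes) -> str:
    """Return a short format name for the given file content."""
    for magic, name in SIGNATURES:
        if data.startswith(magic):
            return name
    if data.startswith(OLE2_MAGIC):
        # The container signature alone is ambiguous, so look further.
        for stream, name in OLE2_STREAMS:
            if stream.encode("utf-16-le") in data:
                return name
        return "ole2"
    # HTML has no binary signature; inspect the start of the text instead.
    head = data[:1024].lstrip().lower()
    if head.startswith(b"<!doctype html") or head.startswith(b"<html"):
        return "html"
    return "unknown"
```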

Conversion to text format

Most resumes are available as Microsoft Word documents. The other formats encountered are PDF, RTF and HTML.

After testing several conversion tools, we decided to use Word, through its COM interface, to convert DOC and RTF documents.

A dedicated procedure based on the Word document object model has been developed for documents with a complex layout.

Specific converters are used for the PDF and HTML formats.

Resume identification

An application sent by email may contain several attached files. It is necessary to identify which file is the candidate's resume.

This module uses several regular expressions and heuristics to analyse the documents. A fuzzy logic based system combines the observed data and computes the probability that a document is a resume.
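
The idea of combining regular-expression cues with fuzzy logic can be sketched as follows. The cues, the saturation thresholds and the two rules are all hypothetical choices made for this example, and the result is a fuzzy membership degree rather than a calibrated probability.

```python
import re

# Hypothetical cues for French resumes.
SECTION_RE = re.compile(
    r"\b(formation|exp[ée]riences?|langues|comp[ée]tences|dipl[ôo]mes?)\b", re.I
)
CONTACT_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+|\b0\d(?:[ .]?\d{2}){4}\b")
TITLE_RE = re.compile(r"curriculum\s+vitae|\bC\.?V\.?\b", re.I)
YEAR_RE = re.compile(r"\b(?:19|20)\d{2}\b")


def degree(count, full):
    """Map a raw count to a membership degree in [0, 1], saturating at `full`."""
    return min(count / full, 1.0)


def resume_score(text):
    """Return a degree in [0, 1] expressing how resume-like the text is."""
    sections = degree(len({m.lower() for m in SECTION_RE.findall(text)}), 3)
    contact = degree(len(CONTACT_RE.findall(text)), 1)
    years = degree(len(YEAR_RE.findall(text)), 4)
    title = 1.0 if TITLE_RE.search(text) else 0.0
    # Rule 1: section headings AND contact details AND dates (fuzzy AND = min).
    structured = min(sections, contact, years)
    # Rule 2: an explicit "curriculum vitae" title AND contact details.
    titled = min(title, contact)
    # The rules are alternatives (fuzzy OR = max).
    return max(structured, titled)
```

Among the attachments of a message, the file with the highest score would be kept as the resume; a cover letter with no headings, dates or contact details scores 0.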

The following performance measurements were obtained:

  • Recall = (correct number)/(present number) = 100 %
  • Precision = (correct number)/(given number) = 99.3 %
[Figure: Evaluation measures for Information Extraction Systems]
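
The two measures defined above are simple ratios and can be computed as in the sketch below. The counts in the usage note are toy values chosen for illustration, not the author's data.

```python
def recall(correct, present):
    """Share of the items actually present that the system found correctly."""
    return correct / present if present else 0.0


def precision(correct, given):
    """Share of the answers given by the system that are correct."""
    return correct / given if given else 0.0
```

For example, if 8 resumes are present, the system returns 10 candidates and 7 of them are correct, then `recall(7, 8)` is 0.875 and `precision(7, 10)` is 0.7.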

Preprocessing

Some control characters are removed from the documents.

The documents are then normalized using regular expressions in order to facilitate the analysis stage.
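
A minimal sketch of such a cleaning pass is shown below. Which characters are removed and which normalizations are applied are assumptions for this example, since the exact rules are not listed here.

```python
import re
import unicodedata

# Control characters other than tab and line feed.
CONTROL_RE = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")


def preprocess(text):
    """Clean and normalize raw resume text before analysis."""
    # Unify line endings.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Drop stray control characters left over from the conversion.
    text = CONTROL_RE.sub("", text)
    # Use a single Unicode representation for accented letters.
    text = unicodedata.normalize("NFC", text)
    # Replace non-breaking spaces and collapse runs of blanks.
    text = text.replace("\u00a0", " ")
    text = re.sub(r"[ \t]+", " ", text)
    # Trim blanks around line breaks and limit consecutive empty lines.
    text = re.sub(r" ?\n ?", "\n", text)
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()
```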

Document analyser

The document analyser is the most important module of the system. Its aim is to identify and extract the selected information from the candidate's resume.

Document analysis is performed in several steps:

  • lexical analysis,
  • syntactic analysis,
  • complex entity identification,
  • part of speech tagging,
  • document structure analysis,
  • semantic analysis.

The picture hereafter presents the resume analysis steps.

[Figure: resume analysis steps]

Lexical analysis

During this step, the text is segmented into tokens. Tokens are the simplest elements encountered in the document text (words, numbers, punctuation, control characters, ...).

An example of the result of this stage is given in the picture below.

[Figure: lexical analysis]
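
The segmentation into tokens can be sketched with a single regular expression. The token classes and their names are assumptions for this example and may differ from those used by the analyser.

```python
import re

TOKEN_RE = re.compile(
    r"""
    (?P<WORD>[^\W\d_]+)      # letters, including accented ones
  | (?P<NUMBER>\d+)          # runs of digits
  | (?P<NEWLINE>\n)          # line breaks are kept as tokens
  | (?P<SPACE>[ \t]+)        # blanks, discarded below
  | (?P<PUNCT>[^\w\s]|_)     # any single punctuation character
    """,
    re.VERBOSE,
)


def tokenize(text):
    """Split text into (class, value) pairs, dropping blanks."""
    return [
        (m.lastgroup, m.group())
        for m in TOKEN_RE.finditer(text)
        if m.lastgroup != "SPACE"
    ]
```

Line breaks are kept as tokens because the layout of a resume carries information that later steps, such as section detection, can use.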

Complex entity identification

Tokens are combined using dedicated rules in order to identify more complex entities such as dates, phone numbers and email addresses.

Some named entities (person names, place names, job functions, diplomas) are identified during this step.

Specialised dictionaries are used to identify several entities, for example first names and courtesy titles.
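
The sketch below illustrates this step in a simplified way: it applies regular expressions directly to the text instead of combining tokens with rules, and it uses a tiny first-name set in place of the specialised dictionaries. The patterns, the labels and the dictionary content are all hypothetical.

```python
import re

# Patterns are tried in order; earlier matches take priority on overlap.
ENTITY_PATTERNS = [
    ("EMAIL", re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")),
    ("PHONE", re.compile(r"\b0\d(?:[ .-]?\d{2}){4}\b")),
    ("DATE", re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b")),
    ("YEAR", re.compile(r"\b(?:19|20)\d{2}\b")),
]

# Hypothetical stand-in for a first-name dictionary.
FIRST_NAMES = {"patrice", "marie", "jean"}


def find_entities(text):
    """Return (label, value) pairs in order of appearance."""
    found = []  # (start, end, label, value)

    def free(start, end):
        # True when the span does not overlap an entity already found.
        return all(end <= s or e <= start for s, e, _, _ in found)

    for label, pattern in ENTITY_PATTERNS:
        for m in pattern.finditer(text):
            if free(*m.span()):
                found.append((m.start(), m.end(), label, m.group()))
    # Dictionary lookup on the remaining words.
    for m in re.finditer(r"[^\W\d_]+", text):
        if m.group().lower() in FIRST_NAMES and free(*m.span()):
            found.append((m.start(), m.end(), "FIRSTNAME", m.group()))
    return [(label, value) for _, _, label, value in sorted(found)]
```

The overlap check prevents the year inside a full date, or the first name inside an email address, from being reported a second time.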

Bayesian network

This step uses probabilistic networks to identify several named entities without relying on dedicated dictionaries.

The Bayesian networks were built during a supervised learning phase, using a corpus of pre-tagged resumes. The trained networks are used to tag the words belonging to entities. Heuristics then identify full entity names, using the pre-tagged words as starting points (see "Bayesian Networks for Organization Name Identification" for more details).
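
The structure of the networks is not given in this document, so the sketch below substitutes a naive Bayes classifier, the simplest Bayesian network, to show how word tags can be learned from a pre-tagged corpus. The features (previous word, capitalisation), the labels and the training sentences are assumptions for this example.

```python
import math
from collections import Counter, defaultdict


def features(words, i):
    """Hypothetical context features for the word at position i."""
    word = words[i]
    prev = words[i - 1].lower() if i > 0 else "<s>"
    return (f"prev={prev}", f"upper={word.isupper()}", f"cap={word[:1].isupper()}")


class NaiveBayesTagger:
    def __init__(self):
        self.label_counts = Counter()
        self.feature_counts = defaultdict(Counter)
        self.vocab = set()

    def train(self, sentences):
        """`sentences` is a list of sentences, each a list of (word, label)."""
        for sent in sentences:
            words = [w for w, _ in sent]
            for i, (_, label) in enumerate(sent):
                self.label_counts[label] += 1
                for f in features(words, i):
                    self.feature_counts[label][f] += 1
                    self.vocab.add(f)

    def tag(self, words):
        """Return the most probable label for each word."""
        total = sum(self.label_counts.values())
        tags = []
        for i in range(len(words)):
            best, best_score = None, -math.inf
            for label, n in self.label_counts.items():
                counts = self.feature_counts[label]
                denom = sum(counts.values()) + len(self.vocab)
                # Log prior plus log likelihoods with add-one smoothing.
                score = math.log(n / total)
                for f in features(words, i):
                    score += math.log((counts[f] + 1) / denom)
                if score > best_score:
                    best, best_score = label, score
            tags.append(best)
        return tags
```

Trained on a few sentences in which company names follow "chez" and are written in capitals, the tagger labels an unseen capitalised word after "chez" as an organization word, without any company dictionary.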

The picture below presents the document parse tree obtained after the lexical analysis and named entity recognition steps:

[Figure: syntactic analysis]

Part of speech tagging

During this stage, the possible base forms and classes of each word are identified and stored in variables associated with that word.
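
A minimal sketch of this lookup is given below, assuming a lexicon that maps each surface form to its possible (base form, class) pairs. The lexicon entries and class names are hypothetical.

```python
from dataclasses import dataclass, field

# Hypothetical mini lexicon: surface form -> possible (base form, class) pairs.
LEXICON = {
    "marié": [("marier", "VERB"), ("marié", "ADJ")],
    "langues": [("langue", "NOUN")],
    "parlé": [("parler", "VERB")],
    "écrit": [("écrire", "VERB"), ("écrit", "NOUN")],
}


@dataclass
class Word:
    text: str
    # All candidate analyses are kept; no disambiguation is attempted here.
    analyses: list = field(default_factory=list)


def tag_words(tokens):
    """Attach the possible base forms and classes to each word."""
    words = []
    for tok in tokens:
        analyses = LEXICON.get(tok.lower(), [(tok.lower(), "UNKNOWN")])
        words.append(Word(tok, list(analyses)))
    return words
```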

Document structure analysis

The aim of this stage is to identify the different sections of the resume: personal data, work experience, education, etc.

Document segmentation is done using two neural networks: the first one identifies section headers, and the second one categorizes each identified section according to its content.

[Figure: Resume segmentation using neural network]

Semantic analysis

This analysis is based on the meaning of words. During this step, complex information such as work experience is identified.

This is done using rules based on all the information recorded during the preceding steps.

The extracted information is structured as an XML file.
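
As an illustration of this last step, the sketch below assembles one work experience entry from entities that are assumed to have been recognised already (a period, a function and a company). The element and attribute names follow the XML output shown in this document; the single "year - year" period rule is an assumption and covers only one simple case.

```python
import re
import xml.etree.ElementTree as ET

# Assumed rule: a period written as "1988 - 1993".
PERIOD_RE = re.compile(r"(?P<start>(?:19|20)\d{2})\s*-\s*(?P<end>(?:19|20)\d{2})")


def build_experience(period_text, function, company):
    """Build an <experience> element, or return None if the period is not understood."""
    literal = period_text.strip()
    m = PERIOD_RE.fullmatch(literal)
    if m is None:
        return None
    exp = ET.Element("experience", stage="false")
    period = ET.SubElement(exp, "period", literal=literal)
    for tag, group in (("from", "start"), ("to", "end")):
        bound = ET.SubElement(period, tag, literal=m.group(group))
        # Day and month are unknown when only a year is given.
        ET.SubElement(bound, "day").text = "0"
        ET.SubElement(bound, "month").text = "0"
        ET.SubElement(bound, "year").text = m.group(group)
    ET.SubElement(exp, "function").text = function
    ET.SubElement(exp, "company").text = company
    return exp
```

Calling `ET.tostring(build_experience("1988 - 1993", "Gérant", "DIGITHEME SARL"), encoding="unicode")` serializes the entry as text.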

The frame below presents the XML document generated from my resume:

<?xml version="1.0" encoding="ISO-8859-1"?>
<resume xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="D:\NewResume.xsd">
 <header>
   <name>
     <firstname gender="M">Patrice</firstname>
     <lastname>MELLOT</lastname>
   </name>
   <address>
     <street>14, résidence Le Clos de Verrières</street>
     <zip>91370</zip>
     <city>Verrières le Buisson</city>
   </address>
   <contact>
     <phones>
       <phone type="Tél. Répondeur">01.60.13.15.92</phone>
     </phones>
     <emails>
       <email></email>
     </emails>
   </contact>
   <personaldata>
     <age type="age">45</age>
     <situation>Marié</situation>
     <nationality/>
   </personaldata>
 </header>
 <academics>
   <degree level="Bac+8" diploma="Doctorat" title="de Microbiologie" obtention="1">
     <date literal="1987">
       <day>0</day>
       <month>0</month>
       <year>1987</year>
     </date>
   </degree>
   <degree level="Bac+5" diploma="Ecole Polytechnique" title=": Assistant de travaux pratiques de Biologie" obtention="1">
     <date literal="1988">
       <day>0</day>
       <month>0</month>
       <year>1988</year>
     </date>
   </degree>
   <degree level="Bac+4" diploma="Maîtrise" title="de Biochimie" obtention="1">
     <date literal="1982">
       <day>0</day>
       <month>0</month>
       <year>1982</year>
     </date>
   </degree>
   <degree level="Bac" diploma="Baccalauréat C" title="empty" obtention="1">
     <date literal="1978">
       <day>0</day>
       <month>0</month>
       <year>1978</year>
     </date>
   </degree>
 </academics>
 <experiences>
   <experience stage="false">
     <period literal="Depuis 1994">
       <from literal="1994">
         <day>0</day>
         <month>0</month>
         <year>1994</year>
       </from>
       <to literal="now">
         <day>0</day>
         <month>0</month>
         <year>now</year>
       </to>
     </period>
     <function>Consultant</function>
     <company>PATRICE MELLOT</company>
   </experience>
   <experience stage="false">
     <period literal="1988 - 1993">
       <from literal="1988">
         <day>0</day>
         <month>0</month>
         <year>1988</year>
       </from>
       <to literal="1993">
         <day>0</day>
         <month>0</month>
         <year>1993</year>
       </to>
     </period>
     <function>Gérant</function>
     <company>DIGITHEME SARL</company>
   </experience>
 </experiences>
 <languages>
   <language level="Lu, parlé et écrit">anglais</language>
 </languages>
 <skills>
   <skill type="lang">HTML</skill>
   <skill type="lang">JAVA</skill>
   <skill type="lang">XML</skill>
   <skill type="lang">APL</skill>
   <skill type="lang">C</skill>
   <skill type="lang">C++</skill>
   <skill type="lang">COLDFUSION</skill>
   <skill type="lang">JAVASCRIPT</skill>
   <skill type="lang">JSP</skill>
   <skill type="lang">NQL</skill>
   <skill type="lang">PYTHON</skill>
   <skill type="lang">SQL</skill>
   <skill type="lang">Visual Basic</skill>
   <skill type="lang">XSLT</skill>
   <skill type="graph">Photoshop</skill>
   <skill type="graph">Flash</skill>
   <skill type="method">Uml</skill>
   <skill type="os">DOS</skill>
   <skill type="os">WINDOWS 95</skill>
   <skill type="os">WINDOWS 98</skill>
   <skill type="os">WINDOWS NT</skill>
   <skill type="os">WINDOWS 2000</skill>
   <skill type="pao">Framemaker</skill>
   <skill type="server">Jrun</skill>
   <skill type="server">Weblogic</skill>
   <skill type="sgbd">ORACLE</skill>
   <skill type="sgbd">ACCESS</skill>
   <skill type="agl">AMC/Designor</skill>
 </skills>
 <misc/>
</resume>
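
A document of this shape is easy to consume downstream. The sketch below reads a shortened copy of the output above with Python's standard `xml.etree.ElementTree` module; the `summarize` function and the keys of the dictionary it returns are choices made for this example.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the generated document, for illustration only.
SAMPLE = """<resume>
 <header>
   <name>
     <firstname gender="M">Patrice</firstname>
     <lastname>MELLOT</lastname>
   </name>
 </header>
 <experiences>
   <experience stage="false">
     <period literal="1988 - 1993"/>
     <function>Gérant</function>
     <company>DIGITHEME SARL</company>
   </experience>
 </experiences>
 <skills>
   <skill type="lang">PYTHON</skill>
   <skill type="os">DOS</skill>
 </skills>
</resume>"""


def summarize(xml_text):
    """Extract the name, the jobs and the skills grouped by type."""
    root = ET.fromstring(xml_text)
    skills = {}
    for skill in root.iter("skill"):
        skills.setdefault(skill.get("type"), []).append(skill.text)
    first = root.findtext("header/name/firstname")
    last = root.findtext("header/name/lastname")
    return {
        "name": f"{first} {last}",
        "jobs": [
            (e.findtext("function"), e.findtext("company"))
            for e in root.iter("experience")
        ],
        "skills": skills,
    }
```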

Evaluation of the system

Performance was evaluated using resumes that were not used during the development stage.

For the candidate's personal data section (title, last name, first name, full address, phone, email, date of birth), mean recall is 93.66% and mean precision is 99.45% (mean P & R = 96.44%).

Results for the candidate's career history will be published later.