A rule-based coreference resolution tool (master's thesis)

I have two French master's degrees:

I have thus written two master's theses. This page summarizes the first one: a rule-based automatic coreference resolution tool. For the other one (a linguistic study of coreference chains in IMRaD research articles), read here.

The thesis is entitled ODACR: un Outil de Détection Automatique des Chaînes de Référence à base de règles linguistiques (a rule-based automatic coreference resolution tool). I created two resources for it: a dictionary of entities for coreference detection, built from Wikipedia and WordNet, and a dictionary of hypernyms.

Part of it has been published in a scientific paper: Détection automatique de chaînes de coréférence pour le français écrit: règles et ressources adaptées au repérage de phénomènes linguistiques spécifiques. Actes des Rencontres des Etudiants Chercheurs en Informatique pour le Traitement Automatique des Langues (TALN-RECITAL), Association française pour l'Intelligence Artificielle, Toulouse, July 2019.

Read the thesis (135 pages), read the paper, read the poster.

 

Both theses are related to coreference chains. A coreference chain is the set of all expressions in a text (referring expressions) that refer to the same referent. For example, all the expressions in brackets in the following text refer to the same entity, "Sophia Loren":

[Sophia Loren] says [she] will always be grateful to Bono. [The actress] revealed that the U2 singer helped [her] calm down when [she] became scared by a thunderstorm while travelling on a plane. (This example is from Mitkov's Anaphora Resolution book.)

There is a second chain for the entity "Bono".

Each expression that is part of a coreference chain is called a mention.
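The definitions above can be made concrete with a minimal sketch that stores the mentions of the example text and groups them into chains. The `Mention` class, the `entity` labels, and the `build_chains` function are hypothetical names chosen for illustration; they are not taken from ODACR, and the labels are assigned by hand here rather than resolved automatically.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Mention:
    """One referring expression (hypothetical structure, not ODACR's)."""
    text: str    # surface form as it appears in the text
    entity: str  # hand-assigned label of the referent


# Mentions of the example text, in reading order.
mentions = [
    Mention("Sophia Loren", "Sophia Loren"),
    Mention("she", "Sophia Loren"),
    Mention("Bono", "Bono"),
    Mention("The actress", "Sophia Loren"),
    Mention("the U2 singer", "Bono"),
    Mention("her", "Sophia Loren"),
    Mention("she", "Sophia Loren"),
]


def build_chains(mentions):
    """Group mentions by referent; each group is one coreference chain."""
    chains = defaultdict(list)
    for mention in mentions:
        chains[mention.entity].append(mention.text)
    return dict(chains)


chains = build_chains(mentions)
for entity, chain in chains.items():
    print(entity, "->", chain)
```

In a real system the hard part is precisely what this sketch takes for granted: deciding which mentions share a referent.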

 

I developed a new rule-based coreference resolution system for written French. This system takes into account linguistic phenomena that are often ignored by other, more machine-learning-oriented systems. For example:

The two lexical resources for French I have built are:

First, a dictionary of named entities and proper nouns from Wikipedia and WordNet, based on Yago (chapter 4.1 of the thesis). For each entity, it records:

Second, a dictionary of common noun hypernyms, built from Wiktionary (converted to XML by GLAWI) (chapter 4.2 of the thesis). Entry definitions usually start with a hypernym. For example:

I collected these hypernyms and turned them into a dictionary, for instance: chat > mammifère > animal > métazoaire (cat > mammal > animal > metazoan).
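As an illustration of how such a dictionary might be queried, here is a minimal sketch that follows hypernym links upward from a word. The flat word-to-hypernym mapping and the names `HYPERNYMS`, `hypernym_chain`, and `is_hypernym_of` are assumptions made for this example; the actual format of the resource is described in the thesis.

```python
# Hypothetical storage format: each noun maps to its direct hypernym.
HYPERNYMS = {
    "chat": "mammifère",
    "mammifère": "animal",
    "animal": "métazoaire",
}


def hypernym_chain(word, hypernyms):
    """Return the word followed by its successive hypernyms."""
    chain = [word]
    seen = {word}
    while chain[-1] in hypernyms:
        parent = hypernyms[chain[-1]]
        if parent in seen:  # guard against cycles in the data
            break
        chain.append(parent)
        seen.add(parent)
    return chain


def is_hypernym_of(candidate, word, hypernyms):
    """True if `candidate` appears above `word` in the hypernym chain."""
    return candidate in hypernym_chain(word, hypernyms)[1:]


print(" > ".join(hypernym_chain("chat", HYPERNYMS)))
print(is_hypernym_of("animal", "chat", HYPERNYMS))
```

A check like `is_hypernym_of` is one plausible way for a resolver to accept a noun phrase headed by "animal" as a later mention of an entity first introduced as "chat"; whether ODACR uses the dictionary exactly this way is not stated here.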

I also crafted rules to correct the tree produced by the syntactic parser I used (Talismane) (chapter 5 of the thesis). This was done after a careful error analysis. For example:

original (output from Talismane)

corrected with my rules

Other rules are used to simplify the tree, for example to unite different items of the same group:

original (output from Talismane)

corrected with my rules

Or to split coordinated items:

original (output from Talismane)

corrected with my rules

The coreference resolution algorithm itself has several passes:

 

read the full thesis (in French)

To see the other master's thesis: Coreference chains in IMRaD research articles.