I have two French master's degrees:
I have thus written two master's theses. This page summarizes the first one: a rule-based automatic coreference resolution tool. For the other one (a linguistic study of coreference chains in IMRaD research articles), read here.
The thesis is entitled ODACR: un Outil de Détection Automatique des Chaînes de Référence à base de règles linguistiques (a rule-based automatic coreference resolution tool). I created two resources: a dictionary of entities for coreference detection, built from Wikipedia and WordNet, and a dictionary of hypernyms.
Part of it has been published in a scientific paper: Détection automatique de chaînes de coréférence pour le français écrit: règles et ressources adaptées au repérage de phénomènes linguistiques spécifiques. Actes des Rencontres des Etudiants Chercheurs en Informatique pour le Traitement Automatique des Langues (TALN-RECITAL), Association française pour l'Intelligence Artificielle, Toulouse, July 2019.
Read the thesis (135 pages), read the paper, read the poster.
Both theses are related to coreference chains. A coreference chain is the set of all expressions in a text that refer to the same referent (referring expressions). For example, all the expressions in brackets in the following text refer to the same entity, "Sophia Loren":
[Sophia Loren] says [she] will always be grateful to Bono. [The actress] revealed that the U2 singer helped [her] calm down when [she] became scared by a thunderstorm while travelling on a plane. (This example is from Mitkov's Anaphora Resolution book.)
There is a second chain for the entity "Bono".
Each expression that is part of a coreference chain is called a mention.
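To make the vocabulary concrete, here is a minimal Python sketch that reads the bracketed example above and collects the mentions of the "Sophia Loren" chain together with their character offsets. The `Mention` class and the `parse_bracketed` function are hypothetical names chosen for this illustration; they are not part of ODACR, and this is only one possible way to represent a chain.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Mention:
    """One referring expression (hypothetical structure, for illustration)."""
    text: str
    start: int  # character offset in the text without brackets
    end: int


def parse_bracketed(annotated):
    """Return (plain_text, mentions) from text whose mentions are in [brackets]."""
    pieces, mentions = [], []
    offset = 0  # current length of the plain text built so far
    pos = 0     # current position in the annotated text
    for match in re.finditer(r"\[([^\[\]]+)\]", annotated):
        before = annotated[pos:match.start()]
        pieces.append(before)
        offset += len(before)
        surface = match.group(1)
        mentions.append(Mention(surface, offset, offset + len(surface)))
        pieces.append(surface)
        offset += len(surface)
        pos = match.end()
    pieces.append(annotated[pos:])
    return "".join(pieces), mentions


ANNOTATED = (
    "[Sophia Loren] says [she] will always be grateful to Bono. "
    "[The actress] revealed that the U2 singer helped [her] calm down "
    "when [she] became scared by a thunderstorm while travelling on a plane."
)

# The chain is simply the ordered list of mentions of one referent.
plain, chain = parse_bracketed(ANNOTATED)
for mention in chain:
    print(mention.start, mention.end, mention.text)
```

The second chain ("Bono", "the U2 singer") would be a separate list built the same way from its own annotations.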
I developed a new rule-based coreference resolution system for written French. This system takes into account linguistic phenomena often ignored by other (more machine-learning-oriented) systems. For example:
The two lexical resources for French I have built are:
First, a dictionary of named entities and proper nouns from Wikipedia and WordNet, based on Yago (chapter 4.1 of the thesis). For each entity, it records:
Second, a dictionary of common-noun hypernyms, built from Wiktionary (converted to XML by GLAWI) (chapter 4.2 of the thesis). Entry definitions usually start with a hypernym. For example:
So I collected these hypernyms and turned them into a dictionary, for instance: chat (cat) > mammifère (mammal) > animal > métazoaire (metazoan).
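As an illustration of how such a dictionary can be queried, the sketch below stores the example chain as a word-to-hypernym mapping and walks it upward. The storage format, the single-parent assumption, and the names `HYPERNYMS`, `hypernym_chain` and `is_hypernym_of` are assumptions made for this example; the thesis may organize the resource differently.

```python
# Hypothetical storage: each word maps to its direct hypernym.
# Only the example chain from this page is included.
HYPERNYMS = {
    "chat": "mammifère",
    "mammifère": "animal",
    "animal": "métazoaire",
}


def hypernym_chain(word, hypernyms=HYPERNYMS):
    """Return the word followed by its successive hypernyms."""
    chain = [word]
    seen = {word}
    while chain[-1] in hypernyms:
        parent = hypernyms[chain[-1]]
        if parent in seen:  # guard against cycles in the data
            break
        chain.append(parent)
        seen.add(parent)
    return chain


def is_hypernym_of(candidate, word, hypernyms=HYPERNYMS):
    """True if `candidate` appears above `word` in the hierarchy."""
    return candidate in hypernym_chain(word, hypernyms)[1:]


print(" > ".join(hypernym_chain("chat")))
print(is_hypernym_of("animal", "chat"))
```

A check of this kind could, for instance, help decide whether a noun phrase such as "l'animal" is a plausible later mention of "le chat".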
I also crafted rules to correct the tree produced by the syntactic parser I used (Talismane) (chapter 5 of the thesis). This was done after a careful error analysis. For example:
Other rules are used to simplify the tree, for example to unite different items of the same group:
Or to divide coordinated items:
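The uniting rule can be pictured with a toy dependency tree. The sketch below is not Talismane's output format and not the actual ODACR rule: it assumes a tree stored as a list of tokens, each pointing to the index of its governor (`-1` for the root), and the `Token` class and `merge_span` function are hypothetical names. It merges a contiguous group into a single node and re-indexes the remaining heads.

```python
from dataclasses import dataclass


@dataclass
class Token:
    form: str
    head: int  # index of the governor in the token list, -1 for the root


def merge_span(tokens, start, end):
    """Merge tokens[start:end] into one node and return a new token list."""
    span = range(start, end)
    # The group's root is the only token governed from outside the span.
    roots = [i for i in span if tokens[i].head not in span]
    if len(roots) != 1:
        raise ValueError("the span does not form a single group")

    removed = end - start - 1  # number of positions that disappear

    def remap(index):
        if index < start:      # also covers -1 (the root marker)
            return index
        if index < end:        # pointed inside the group
            return start
        return index - removed

    merged_form = " ".join(tokens[i].form for i in span)
    merged_head = tokens[roots[0]].head
    result = []
    for i, token in enumerate(tokens):
        if i == start:
            result.append(Token(merged_form, remap(merged_head)))
        elif i not in span:
            result.append(Token(token.form, remap(token.head)))
    return result


# "Sophia" depends on "says"; "Loren" depends on "Sophia".
tokens = [Token("Sophia", 2), Token("Loren", 0), Token("says", -1)]
for token in merge_span(tokens, 0, 2):
    print(token.form, token.head)
```

Treating a multiword name as one node is only one reading of "uniting items of the same group"; a rule that divides coordinated items would go in the opposite direction and is not sketched here.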
The coreference resolution algorithm itself has several passes:
read the full thesis (in French)
To see the other master's thesis: Coreference chains in IMRaD research articles.