Coreference databases and corpora for English and French (corefdb)

Analyse coreference in a corpus with a relational database containing tables for coreference data (mentions, chains, relations) as well as for textual structures (tokens, sentences, paragraphs, texts). Includes linguistic annotations (part of speech, named entities, etc.).

Enriched version of the Democrat corpus for French.


Introduction

Coreference is the relation between two expressions of a text that refer to the same world entity. When there are several such expressions, they form a coreference chain. For example, all the expressions in bold in the following text refer to the same entity Sophia Loren:

[Sophia Loren] says [she] will always be grateful to Bono. The actress revealed that the U2 singer helped [her] calm down when [she] became scared by a thunderstorm while travelling on a plane. (This example is from Mitkov's Anaphora Resolution (2002).)

Notice that there is a second chain for the entity Bono.

Each expression that is part of a coreference chain is called a mention.
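
To make the terminology concrete, the two chains of this example can be written down informally as lists of mentions (this is only a toy illustration, not the database format described below):

# Toy illustration: each chain is simply the list of its mentions.
chains = [
    ["Sophia Loren", "she", "her", "she"],   # the bracketed mentions above
    ["Bono", "the U2 singer"],               # the second chain, for Bono
]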

Corpora

Available corpora

There are several corpora with coreference annotation in English as well as in French. The best-known corpus in English is the CoNLL-2012 corpus (1.6M tokens), which was used to evaluate the systems presented in the CoNLL-2012 shared task (a competition in which several automatic coreference resolution systems competed). This corpus, based on the OntoNotes corpus, offers several annotation layers alongside coreference: parts of speech, lemmas, segmentation into phrases and clauses (constituency syntactic parses), named entities, argument structures, and references to WordNet.

In French, the Ancor corpus (460k tokens) is a corpus of transcribed spoken French, with some part-of-speech and named entity annotation. Because it is based on conversations and interviews, it contains features specific to oral language, such as disfluencies (word repetitions, interjections such as euh "er"), and has no punctuation or sentence boundaries.

The Democrat corpus (689k tokens in total, 285k for modern texts), on the other hand, is a multi-genre corpus of written texts from the 12th to the 21st century. Besides coreference, however, it has no reliable linguistic annotation. In this repository I offer an enriched version of the Democrat corpus, in various formats (including a relational database, as well as CoNLL, jsonlines and text formats), used in the following paper:

Wilkens Rodrigo, Oberle Bruno, Landragin Frédéric, Todirascu Amalia (2020). French coreference for spoken and written language, Proceedings of the 12th Edition of the Language Resources and Evaluation Conference (LREC 2020), Marseille, France.

This enriched version was used to train both ODACR and COFR, as described in the LREC paper. Note that ODACR was originally presented in:

Oberle Bruno, Détection automatique de chaînes de coréférence pour le français écrit: règles et ressources adaptées au repérage de phénomènes linguistiques spécifiques. TALN-RECITAL 2019

Annotation layers added to Democrat (19th to 21st-century texts)

Most of the texts in the Democrat corpus are narratives, usually 10k-word-long extracts from novels, short stories and biographies, but there are also Wikipedia articles and 103 press articles. In the original distribution these texts are concatenated into one document; I have separated them at their natural boundaries (the beginning and the end of each text). For the modern part of the Democrat corpus, I have kept only texts from the 19th to the 21st centuries, hence the name of the subcorpus: "Dem1921". Five documents are excerpts from legal texts; I have excluded them since they are written in a specialized legal language.

The main part of Dem1921 therefore consists of 126 modern documents: 103 are short but complete press articles from the newspaper L'Est Républicain, and 23 are 10k-word-long excerpts from mostly narrative texts. Note that a technical issue occurred when the three Wikipedia texts were concatenated in the original corpus; here I use a corrected version of these texts.

The Democrat corpus is provided in a TEI-compliant XML format. I have converted it to the CoNLL format (a tab-separated text file, one token per line), with a supplementary column for paragraph information. I have also manually harmonized the typography (e.g. the type of apostrophe or quotation marks).

The original version of Democrat only contains parts of speech from TreeTagger. Sentences are split at every full stop, so some mentions end up split across several sentences: the name J. B. L. Bard, for example, is spread over four sentences. For this reason I have decided not to keep the original tokenization and sentence splitting.

First, I used the StanfordNLP dependency parser to perform tokenization and sentence splitting, and to add syntactic parses. After an error analysis of the tokenization and sentence splitting, I adjusted parameters and input data to get the best possible results. I also corrected systematic errors, mostly to ensure that no mention is split across several sentences. The tagsets used for parts of speech and dependency labels are the Universal Dependencies tagsets.
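
For illustration, a minimal sketch of this kind of processing with the stanfordnlp package (since renamed Stanza, where the attribute names differ slightly) is given below; the exact models, settings and post-corrections used for Dem1921 are not reproduced here:

import stanfordnlp  # the package has since been renamed "stanza"

# One-time download of the French models.
stanfordnlp.download("fr")

# Tokenization, sentence splitting, tagging and dependency parsing.
nlp = stanfordnlp.Pipeline(lang="fr")
doc = nlp("Sophia Loren dit qu'elle sera toujours reconnaissante envers Bono.")

for sentence in doc.sentences:
    for word in sentence.words:
        # Universal Dependencies part-of-speech tags and dependency labels.
        print(word.text, word.upos, word.governor, word.dependency_relation)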

Lemmatization has been done by a lookup in a morphological and syntactic lexicon (Lefff).

I have added named entities with the Flair tool. Four types are used: persons (PER), locations (LOC), organizations (ORG) and a miscellaneous (MISC) category (usually assigned when the tool was unable to categorize a detected entity into one of the three previous categories). After an error analysis, I kept all the PER, LOC and ORG entities. Among the MISC entities, only those with a probability higher than 0.8 were kept, and the resulting list was corrected manually. When the text of a MISC entity was found elsewhere in the text with a different type, I substituted that other type for MISC. For example, if "Peter" was tagged PER in one place and MISC in another, I gave it the category PER everywhere.
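
A rough sketch of this named entity step with Flair is shown below (the model name and the label/score attributes vary slightly across Flair versions, and the manual corrections described above are of course not reproduced):

from flair.data import Sentence
from flair.models import SequenceTagger

# French 4-class NER model (PER, LOC, ORG, MISC).
tagger = SequenceTagger.load("fr-ner")

sentence = Sentence("Sophia Loren a remercié Bono à Paris.")
tagger.predict(sentence)

kept = []
for span in sentence.get_spans("ner"):
    # In older Flair versions the label and confidence are span.tag / span.score.
    label, score = span.tag, span.score
    if label in ("PER", "LOC", "ORG"):
        kept.append((span.text, label))
    elif label == "MISC" and score > 0.8:   # threshold described above
        kept.append((span.text, label))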

The output format is CoNLL-U, with supplementary columns for named entities, paragraph number, and, of course, coreference chain.

Note that the annotation layers of CoNLL and the layers I have added to Democrat are not the same (besides the language difference): CoNLL has constituency parses whereas Democrat has dependency parses; CoNLL has 18 named entity types whereas Democrat has only four; and so on.

For Ancor, parallel work was done by Rodrigo Wilkens, with the same tools, starting from the raw text.

Databases

The three corpora, CoNLL-2012, Democrat and Ancor, are here mainly presented as a relational database containing tables for textual structures (texts, paragraphs, sentences, tokens) and for coreference annotation (chains, mentions, relations).

All these tables are linked to one another through foreign keys: each mention points to its chain, sentence, paragraph and text; each relation points to its two mentions, its chain and its text; and each chain, sentence, paragraph and token points to the structures that contain it.

The list of fields is detailed in the next section.

Complete list of fields in the database

This section presents all the fields in the relational database. Since CoNLL and Democrat/Ancor do not have the same annotation layers (for example, CoNLL uses constituency parses while the syntactic information added to Democrat/Ancor consists of dependency parses), there are some differences between the corpora of the two languages. A UK flag indicates a feature present in the CoNLL corpus, while a French flag indicates a feature present in Democrat and Ancor.

There are three annotation complexity levels. The base level, indicated by a green disk, can be computed for all texts and corpora: it only uses the coreference annotation. You can use the available script (db_builder.py, see below) to compute this level for your own texts, without having to add any other linguistic annotation. The linguistic level (blue) requires the files to have the linguistic layers added: syntactic parses and named entities, as described above. The advanced level (purple) doesn't require more annotations, but rather external resources: WordNet and FastText, the Python modules used to compute them (NLTK and Gensim), and some processing time.

To save space, field names are sometimes indicated with a set notation: {a,b}_something must be understood as two fields, a_something and b_something. The term buckets indicates that values have been distributed across a range of buckets (or bins): [00-01[, [01-02[, [02-04[, [04-08[, [08-16[, [16-32[, [33+.


Table Chains

Primary and foreign keys:

id
text_id

Coverage of the chain in the text:

size: number of mentions
text_span: number of tokens between the first and the last mention of the chain
text_coverage: text_span divided by the number of tokens in the text

Rank of the mentions:

mean_mention_level: the level is the number of mentions in which a mention is nested (0 = outer mention, not nested)
{mean,median}_{sent,par,text}_mention_rank: the mention rank is the position of the mention in the chain (the first mention of the chain has rank 0). The sentence mention rank is the position of the mention in the chain within a specific sentence: for example, the second mention of the chain has rank 1, but if it is the first mention of the chain to appear in its sentence (the chain's first mention being in another sentence), its sentence rank is 0

Distances between two consecutive mentions in the chain:

{median,mean}_{token,mention,sent,par}_dist: median/mean distance between two consecutive mentions, in number of tokens, mentions, sentences, paragraphs
distribution_of_{token,mention,sent,par}_dist_{buckets}: distribution in buckets of the distances between consecutive mentions
distribution_of_mention_length_{buckets}: distribution in buckets of the lengths of the mentions of the chain

Shape of the distribution of the distances between two consecutive mentions in the chain (the distribution is virtually never normal):

dist_{skewness,kurtosis}
lafon: Lafon's burst coefficient (Lafon, Pierre (1984). Dépouillements et statistiques en lexicométrie, Slatkine) (see Oberle Bruno (2019). Chaînes de référence et structuration textuelle: quelques indicateurs... Journée d'études “Mesures statistiques et approches quantitatives”, 14 juin 2019, Strasbourg, France, and the slides for visual examples). The higher the value, the less uniformly the mentions are distributed in the text: for instance, in the chain xx.x.x..........xxx......xxxx the x's are not evenly distributed, they form “bursts”

Variation in the chain:

pattern_diversity: number of distinct part-of-speech tri-grams of consecutive mentions divided by the number of mentions in the chain
stability_coeff_string: stability coefficient, a number between 0 and 1: 0 means that all the nouns in the chain are different (e.g. the cat… the animal…), 1 means that they are all the same (e.g. the cat… the cat…), and intermediate values indicate a mix of repetition and variation
stability_coeff_h_lemma: same as before, but only the lemmas of the heads are considered (not the whole mention strings)

Named entity type of the chain (for chains that have at least one mention that is a named entity):

type

Proportions:

outer_proportion: proportion of outer mentions (an outer mention is a mention that is not nested)
proportion_of_named_entities
has_multiple_speakers
proportion_of_subjects
proportion_of_mentions_in_pp
proportion_of_mentions_in_main_clause
mean_node_depth: mean depth of the syntactic head in the syntactic tree
proportion_of_mention_with_dependents: proportion of mentions with at least one dependent
proportion_of_{n,p,d,v,o}: proportions of nouns, pronouns, determiners (my, your, etc.), verbs, others
proportion_of_proper_nouns
is_plural: the chain has at least one mention that is tagged as plural
proportion_of_first_arguments: proportion of mentions that are the first argument in their argument structure

The following fields are copied from the texts table for convenience:

text_genre
text_source
text_token_count
text_mention_count
text_chain_count
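
As an example of how these fields can be queried, here is a small pandas sketch, assuming chains.csv has been extracted from the database zip (see the formats section below):

import pandas as pd

chains = pd.read_csv("chains.csv")

# Average chain size and text coverage per genre.
print(chains.groupby("text_genre")[["size", "text_coverage"]].mean())

# Chains whose entity was recognized as a person
# ("PER" for Democrat/Ancor; CoNLL uses its own named entity types).
person_chains = chains[chains["type"] == "PER"]
print(len(person_chains), "person chains")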

Table Mentions

Primary and foreign keys:

id
chain_id
par_id
sent_id
text_id

Positions:

{start,end}: index of the first and last tokens of the mention in the sentence
par_{start,end}: index of the first and last tokens of the mention in the paragraph
text_{start,end}: index of the first and last tokens of the mention in the text
chain_mention_{index,rindex}: index of the mention in the chain (the first mention has index 0), counted from the start/from the end
{sent,par}_mention_index: index of the mention in the sentence/paragraph
par_sent_index: index of the sentence (in which the mention is) in the paragraph (the first sentence of the paragraph has index 0)
text_mention_index: index of the mention in the text (the first mention of the text has index 0)
text_par_index: index of the paragraph (in which the mention is) in the text (the first paragraph of the text has index 0)
text_sent_index: index of the sentence (in which the mention is) in the text (the first sentence of the text has index 0)
{sent,par,text}_mention_rank: the mention rank is the position of the mention in the chain (the first mention of the chain has rank 0). The sentence mention rank is the position of the mention in the chain within a specific sentence: for example, the second mention of the chain has rank 1, but if it is the first mention of the chain to appear in its sentence (the chain's first mention being in another sentence), its sentence rank is 0

Content:

string: the textual content of the mention
token_count: the number of tokens

Nestedness:

is_outer: an outer mention is a mention that is not nested
parent: id of the direct parent, that is, the mention in which the current mention is nested (for nested mentions)
level: the level of nestedness (an outer mention has a level of 0)

Dominant node in the syntactic tree:

node_id: head node id (this id is not referenced elsewhere in the database; it is an arbitrary number used only for the next two fields)
parent_node_id: parent node of the head node
grandparent_node_id: grandparent node of the head node

Named entity:

named_entity_type: types vary according to the corpus (18 types for CoNLL, 4 types for Democrat/Ancor, see the complete list of values)
is_named_entity: whether the mention is a named entity
is_name: whether the mention is a named entity of one of the types PERSON, FACILITY, ORG, GPE, WORK OF ART, NORP, LOCATION, PRODUCT, EVENT, LAW, LANGUAGE
is_pers: whether the mention is a named entity of the type PER

Speaker:

speaker: for CoNLL and Ancor (dummy value for Democrat)

Type of the mention:

is_clause
is_phrase
is_word

Morphosyntax:

tag
pspeech: alias for tag
pos
deplabel
subdeplabel

Categories (according to UD):

is_relative_pronoun
is_reciprocal
is_reflexive
is_expletive
is_complement
is_apposition
is_verb
is_verb_without_subject

Morphosyntax and syntax of the parent:

parent_phrase_tag
parent_pos
parent_deplabel
parent_subdeplabel
parent_clause_tag
parent_clause_pos
parent_clause_deplabel
parent_clause_subdeplabel
parent_phrase_id
parent_clause_id

Function:

is_subject
is_object
is_non_core: neither subject nor object

Preposition:

in_pp: whether the mention is in a prepositional phrase
preposition: the text of the preposition

Syntactic depth of the head node (for Democrat/Ancor) or of the constituent (for CoNLL):

node_depth
clause_depth
phrase_depth

Type of clauses:

is_in_main_clause
is_in_matrix
is_embedded
is_in_embedded

Dependents:

dependent_count
predependent_count
postdependent_count
adjective_dependent_counter
noun_dependent_counter
clause_dependent_counter
phrase_dependent_counter
other_dependent_counter
num_dependent_counter
appos_dependent_counter
is_dependent
dependent_type

Determiner:

determiner_string
determiner_head_string
determiner_head_lemma
determiner_type
has_bare_determiner
has_genetive_determiner
has_complex_determiner
is_determiner
is_possessive
is_possessive_determiner
is_genitive
is_genitive_determiner

Head:

head: the actual text of the head
h_lemma: the lemma of the head

Head position:

h_{start,end}: index of the head in the sentence
h_text_{start,end}: index of the head in the text

Morphosyntax and syntax of the head:

h_pspeech
h_ud_pspeech
h_broad_pspeech: one of [npdvo] (noun, pronoun, determiner, verb, other)
h_noun_type: proper or common (see the list of possible values)
h_pronoun_type
h_reflex: is reflexive (if pronoun)
h_poss: is possessive (if pronoun)
h_definite: is definite (if determiner)

Declension:

h_number
h_gender
h_person

Syntactic depth of the head:

h_level
h_node_depth

Argument structure:

struct_id: the id of the structure; it is not referenced elsewhere in the relational database, it is an arbitrary number used only to determine whether two mentions are in the same structure
struct_is_negative
struct_is_passive
struct_tense
struct_mood
struct_person

Arguments of the structure:

is_arg: the mention matches the boundaries of an argument
arg_index: index of the argument in the structure (the first argument has index 0)
arg_type: ARG0: agent; ARG1: patient; ARG2: instrument, benefactive, attribute; ARG3: starting point, benefactive, attribute; ARG4: ending point; ARGM: modifier, with a subclassification (LOC for locative, GOL for goal, etc.)
arg_is_agent: arg_type is ARG0

WordNet:

wn: WordNet synset

The following fields are copied from the texts table for convenience:

text_genre
text_source
text_token_count
text_mention_count
text_chain_count

The following fields are copied from the chains table for convenience:

chain_size
chain_type
chain_coverage
chain_mean_token_dist
chain_median_token_dist
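
For instance, the mention-level fields make it easy to compute the proportion of pronominal mentions per chain; here is a sketch assuming mentions.csv has been extracted from the database zip (boolean fields may be stored as 0/1):

import pandas as pd

mentions = pd.read_csv("mentions.csv")

# Proportion of mentions whose head is a pronoun, per chain.
is_pronoun = mentions["h_broad_pspeech"] == "p"
print(is_pronoun.groupby(mentions["chain_id"]).mean().head())

# Subject mentions realized as pronouns.
subject_pronouns = mentions[(mentions["is_subject"] == 1) & is_pronoun]
print(len(subject_pronouns), "pronominal subjects")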

Table Relations

Note that in the default databases, the only relations recorded are those between consecutive mentions.

Primary and foreign keys:

id
chain_id
m1_id
m2_id
text_id

Distance between the mentions in the relation:

{token,mention,sent,par}_dist: in tokens, mentions, sentences, paragraphs

Type of the relation:

is_consecutive: the mentions of the relation are consecutive (always the case in the default database)
is_to_first: the first mention of the relation is the first mention of the chain (never the case in the default database)

Comparing the two mentions:

same_sent_id: are the mentions in the same sentence?
same_par_id: are the mentions in the same paragraph?
same_string: do the mentions have the same textual content?
same_is_outer: are the mentions both nested or both not nested?

Part of speech type:

type: a string composed of the h_broad_pspeech of the two mentions, for example n-n for two nouns, n-p for a noun and a pronoun, etc.

Formal distance:

levensthein
sorensen_dice

Cosine similarity:

context_similarity: using FastText (English or French)

WordNet distance (computed with NLTK; WOLF is used for French):

{min,max}_wup_similarity: when several synsets were possible for one or both mentions, all combinations were computed; the min and the max values are reported
{min,max}_path_similarity
{min,max}_shortest_path_distance
hypernymy: whether the mentions are in a hypernymic relationship
meronymy: whether the mentions are in a meronymic relationship

The following fields are copied from the texts table for convenience:

text_genre
text_source
text_token_count
text_mention_count
text_chain_count

The following fields are copied from the chains table for convenience:

chain_size
chain_type
chain_coverage
chain_mean_token_dist
chain_median_token_dist
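
The min/max logic described above can be sketched with NLTK as follows (using NLTK's multilingual WordNet interface for French, whereas the databases rely on WOLF; the WordNet data must have been downloaded with nltk.download beforehand):

from itertools import product
from nltk.corpus import wordnet as wn

def min_max_wup(lemma1, lemma2, lang="fra"):
    """Min and max Wu-Palmer similarity over all synset pairs of two lemmas."""
    scores = [s1.wup_similarity(s2)
              for s1, s2 in product(wn.synsets(lemma1, lang=lang),
                                    wn.synsets(lemma2, lang=lang))]
    scores = [s for s in scores if s is not None]
    if not scores:
        return None, None
    return min(scores), max(scores)

print(min_max_wup("chat", "animal"))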

Table Tokens

Primary and foreign keys:

id
par_id
sent_id
text_id

Token features:

text_token_index: position of the token in the text
string: the content of the token
pos: the part of speech

Table Sentences

Primary and foreign keys:

id
par_id
text_id

Positions:

{first,last}_token_index: index (in the text) of the first and last tokens of the sentence
par_sent_index: index of the sentence in the paragraph (the first sentence has index 0)
text_sent_index: index of the sentence in the text (the first sentence has index 0)
text_par_index: index of the paragraph (in which the sentence is) in the text (the first paragraph has index 0)

Table Paragraphs

Primary and foreign keys:

id
text_id

Positions:

{first,last}_token_index: index (in the text) of the first and last tokens of the paragraph
text_par_index: index of the paragraph in the text (the first paragraph of the text has index 0)

Table Texts

Primary and foreign keys:

id

Counts:

{token,mention,chain,sent,par}_count

Characteristics of the text:

genre: one of the predefined genres for CoNLL (nw, pt, etc.); for Democrat: wk, pr, ot (Wikipedia, press, other); for Ancor, one of the subcorpora (ot, co, es, ub)
source: the source (for CoNLL: CNN, MSNBC, etc.)

Mentions in the text:

outer_proportion
mean_mention_level

Relations in the text:

{mean,median}_{token,mention,sent,par}_dist

Chains in the text:

chain_ngrams: number of different tri-grams of mentions (each chain being represented by an id) divided by the total number of mentions; see the sketch after this list. For example, with three chains A, B and C and the sequence of mentions AABABCC, the tri-grams are AAB (a mention of chain A followed by a mention of chain A followed by a mention of chain B), ABA, etc. See Landragin, Frédéric (2016). Conception d'un outil de visualisation et d'exploration de chaînes de coréférences. JADT 2016: 13ème Journées internationales d'Analyse statistique des Données Textuelles.

Lexical diversity of the text:

ttr: type-token ratio
yule_s_k: Yule's K (Yule, Udny (1944). The Statistical Study of Literary Vocabulary. Cambridge University Press.)
yule_s_i: Yule's I (Oakes, Michael (1998). Statistics for Corpus Linguistics. Edinburgh University Press.)
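
For reference, here is how chain_ngrams and the lexical diversity measures can be computed; this is my reading of the definitions above, not the repository's exact code:

from collections import Counter

def chain_ngram_diversity(chain_ids, n=3):
    """Distinct n-grams over the sequence of chain ids, divided by the number
    of mentions (e.g. "AABABCC" has the tri-grams AAB, ABA, BAB, ABC, BCC)."""
    ngrams = {tuple(chain_ids[i:i + n]) for i in range(len(chain_ids) - n + 1)}
    return len(ngrams) / len(chain_ids)

def lexical_diversity(tokens):
    """Type-token ratio and Yule's K = 10^4 * (sum_i i^2 * V_i - N) / N^2."""
    freqs = Counter(tokens)
    N = sum(freqs.values())                 # number of tokens
    V = len(freqs)                          # number of types
    spectrum = Counter(freqs.values())      # V_i: number of types occurring i times
    s2 = sum(i * i * v for i, v in spectrum.items())
    return V / N, 10_000 * (s2 - N) / (N * N)

print(chain_ngram_diversity(list("AABABCC")))   # 5 distinct tri-grams / 7 mentions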

Complete list of possible values

Please see here for the values recorded for each field.

Database formats and other formats

CSV files

The corpora are available in three formats: a relational database (a zip archive of CSV files), an extended CoNLL format, and a jsonlines format.

The database format is simply a zip file containing one CSV file per table, which can be imported into almost any software (even into Microsoft Excel; some hints on how to use it in Excel may be found here):

mentions.csv
chains.csv
relations.csv
texts.csv
paragraphs.csv
sentences.csv
tokens.csv
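
For instance, with Python and pandas you can read the tables straight from the zip archive and follow the foreign keys (the archive name below is only an example):

import zipfile
import pandas as pd

with zipfile.ZipFile("db_dem1921.zip") as archive:
    mentions = pd.read_csv(archive.open("mentions.csv"))
    chains = pd.read_csv(archive.open("chains.csv"))
    texts = pd.read_csv(archive.open("texts.csv"))

# Follow the foreign keys: mention -> chain -> text.
merged = (mentions
          .merge(chains, left_on="chain_id", right_on="id", suffixes=("", "_chain"))
          .merge(texts, left_on="text_id", right_on="id", suffixes=("", "_text")))
print(merged[["string", "chain_id", "text_genre"]].head())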

CoNLL format

The CoNLL format is a tabular format: each token is on a separate line, and the annotations for the token are in separate columns. Document boundaries are indicated by specific markers, and sentences are separated by a blank line.

Here is an example:

#begin document <name of the document>
1            Les             le     DET ... other columns ...
2         singes          singe    NOUN
3           sont           être     AUX
4            des             un     DET
5     mammifères      mammifère    NOUN
...

1           Bien           bien     ADV
2            que            que   SCONJ
3           leur            son     DET
4   ressemblance   ressemblance    NOUN
5           avec           avec     ADP
6             l'             le     DET
7          Homme          homme    NOUN
...
#end document

The original CoNLL-2012 format (used for the CoNLL-2012 corpus, which is not available in this repository because of copyright restrictions) is described in Pradhan et al. (2012), CoNLL-2012 Shared Task: Modeling Multilingual Unrestricted Coreference in OntoNotes. Here is the list of columns (which are separated by any number of spaces):

1.    Document ID
2.    Part number
3.    Word number
4.    Word
5.    Part of Speech
6.    Parse bit
7.    Lemma
8.    Predicate Frameset ID
9.    Word sense
10.   Speaker/Author
11.   Named Entities
12:N. Predicate Arguments
N.    Coreference

For Democrat (the augmented version, dem1921) and Ancor, the list of columns (separated by tabulations) is the CoNLL-U list augmented with columns for the speaker, the paragraph, the named entity (2 columns) and the coreference. So, in total:

  1. index of the token in the sentence
  2. form of the token
  3. lemma of the token
  4. universal part-of-speech tag
  5. always _ (language-specific part-of-speech tag, not used)
  6. morphological features (see Universal Dependencies)
  7. head of the current token (the index of another word, or 0 for the root)
  8. universal dependency relation to the head (or root) (see Universal Dependencies)
  9. always _ (enhanced dependencies, not used)
  10. always _ (other annotation, not used)
  11. speaker (or _ for Democrat, where no speaker is recorded)
  12. paragraph number
  13. named entity in the format (PER * * *) (example with 4 tokens)
  14. named entity in the format (PER PER PER PER) (example with 4 tokens)
  15. coreference in CoNLL-2012 style
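
A small reader for this extended format might look like the following sketch (the column indices follow the list above; only a subset of the columns is kept):

def read_extended_conll(path):
    """Yield sentences as lists of token dicts from the 15-column format."""
    sentence = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if line.startswith("#"):        # document boundaries
                continue
            if not line.strip():            # blank line = sentence break
                if sentence:
                    yield sentence
                sentence = []
                continue
            cols = line.split("\t")
            sentence.append({
                "form": cols[1], "lemma": cols[2], "upos": cols[3],
                "head": cols[6], "deprel": cols[7],
                "speaker": cols[10], "paragraph": cols[11],
                "named_entity": cols[13], "coref": cols[14],
            })
    if sentence:
        yield sentence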

Jsonlines format

The jsonlines format stores data for several texts (a corpus). Each line is a valid json document, as follows:

{
  "clusters": [],
  "doc_key": "nw:docname",
  "sentences": [["This", "is", "the", "first", "sentence", "."],
                ["This", "is", "the", "second", "."]],
  "speakers":  [["spk1", "spk1", "spk1", "spk1", "spk1", "spk1"],
                ["spk2", "spk2", "spk2", "spk2", "spk2"]]
  "pos":       [["DET", "V", "DET", "ADJ", "NOUN", "PUNCT"],
                ["DET", "V", "DET", "ADJ", "PUNCT"]],
  ...
}
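
Reading such a file only requires parsing each line as JSON (the file name below is only an example):

import json

def read_jsonlines(path):
    """Yield one document (a dict) per non-empty line."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)

for doc in read_jsonlines("dem1921.jsonlines"):
    n_tokens = sum(len(sentence) for sentence in doc["sentences"])
    print(doc["doc_key"], n_tokens, "tokens,", len(doc["clusters"]), "chains")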

Availability of the corpora

Legal stuff (licenses)

TL;DR: Democrat and Ancor are redistributed here under their original Creative Commons licenses (with the required citations given below); the CoNLL-2012 data, derived from OntoNotes, cannot be redistributed and must be rebuilt with the scripts provided in this repository.

Democrat

The Democrat corpus is originally distributed on the Ortolang platform. The corpus is distributed under the terms of the "Attribution-ShareAlike 4.0 International" Creative Commons license (CC BY-SA 4.0): you are therefore allowed to copy, redistribute and transform the corpus, as long as it is redistributed under the same conditions (that is, the same license) and the original authors are credited. The original corpus was developed in the context of the Democrat project, funded by the French National Research Agency (ANR-15-CE38-0008). The paper that must be cited is:

Frédéric Landragin. Description, modélisation et détection automatique des chaînes de référence (DEMOCRAT). Bulletin de l'Association Française pour l'Intelligence Artificielle, AFIA, 2016, pp.11-15.

The original license, and the license of the modified work I propose, is the CC BY-SA 4.0, which can be read in full here.

Ancor

The Ancor corpus is available on the ortolang.fr platform (access seems to be restricted to academics working in a French school or university), but also on the corpus website (the link is given on Ortolang), where you can download it without restrictions. According to this site, the license is Creative Commons CC BY-NC-SA for the ESLO part, and Creative Commons CC BY-SA for the rest. It is the same license as for Democrat, except for the non-commercial clause. The legal texts of the licenses can be found here and here. The papers that need to be cited are the following:

Muzerelle J., Lefeuvre A., Schang E., Antoine J.-Y, Pelletier A., Maurel D., Eshkol I., Villaneau J. 2014. ANCOR-Centre, a Large Free Spoken French Coreference Corpus: description of the Resource and Reliability Measures. LREC'2014, 9th Language Resources and Evaluation Conference

Muzerelle J., Lefeuvre A., Antoine J.-Y., Schang E., Maurel D., Villaneau J., Eshkol I. 2013. ANCOR : premier corpus de français parlé d'envergure annoté en coréférence et distribué librement. Actes TALN'2013.

The modified work is distributed under the terms of the same license as the original work, as required by that very license.

CoNLL

The CoNLL-2012 corpus is derived from the OntoNotes corpus, which comes with a specific license. The corpus cannot be redistributed, so you will have to compile the database with the scripts I have created (see below).

I only provide a sample database built from the trial data from the CoNLL-2012 shared task website, in which all the tokens have been masked with a [token] string.

Democrat (19th- to 21st-century texts): dem1921

The 19th- to 21st-century texts of Democrat (except the legal texts), with all the annotations (base, linguistic and advanced, as described above).

This subcorpus is available as a relational database, as CoNLL (extended) and as jsonlines.

Democrat (legal texts)

The five legal texts from Democrat have been converted to the CoNLL-2012 format. The original sentence boundaries and part of speech (from TreeTagger) have been kept.

In another version, they have also been parsed with StanfordNLP, but named entities have not been added.

This subcorpus is available as CoNLL-2012 and CoNLL extended formats.

Democrat as a whole

The whole corpus (from the 12th to the 21st century) is available in the CoNLL-2012 format, with the original sentence boundaries and part-of-speech tagging.

Ancor

The whole Ancor corpus, with all the annotations (base, linguistic and advanced), computed from a version annotated in the same way as Democrat.

Note that the original document boundaries have not been kept: the new texts correspond to thematic sections defined by the transcribers of the audio data. Also, long "sentences" (or rather speech turns, as there are no sentences in Ancor) of more than 100 words have been split at euh interjections.

This is the version used for the LREC paper cited in the introduction.

This corpus is available in the relational database format, as well as CoNLL (extended) and jsonlines.

CoNLL

The CoNLL corpus is based on the OntoNotes corpus, the copyright of which is held by the LDC. You will need to download the corpus from the LDC (which is free but requires you to register and agree to a specific license), then transform OntoNotes into CoNLL by using the scripts and instructions provided on the CoNLL-2012 shared task website. Then you will need to use the script I provide here.

First, concatenate all the documents you want to include in the database into one file, for example all three sets (dev, train, test):

# when you build the data from the conll-2012 website, they are in the `conll-2012` directory
cat path/to/conll-2012/*/data/*/data/english/annotations/*/*/*/*gold_conll > everything.conll

Run the setup script:

bash setup.sh

Then run the db_builder.py script:

python3 db_builder.py --corpus-name conll2012 --linguistic --advanced \
   -o db_conll.zip everything.conll

If you have any trouble using the scripts, please contact me through my website.

Compiling the database yourself

First run the setup script, which will download the fasttext filtered word vectors for each corpus:

bash setup.sh

Then run the db_builder.py script, according to the corpus you want to compile:

# in the `scripts` directory:
# conll
python3 db_builder.py --corpus-name conll2012 --linguistic --advanced \
   -o db_conll.zip INPUT.conll
# democrat
python3 db_builder.py --corpus-name dem1921 --linguistic --advanced \
   -o /tmp/db_dem1921.zip INPUT.conll
# ancor
python3 db_builder.py --corpus-name ancor --linguistic --advanced \
   -o /tmp/db_ancor.zip INPUT.conll

This assumes that you have the INPUT.conll files. For Ancor and Democrat, you can choose one of the conll files from the ancor and democrat directories. For CoNLL, you must build it from the data provided by the LDC and the CoNLL-2012 shared task (see above), or use the trial data.

If you have any trouble using the scripts, please contact me through my website.

Compiling databases for your own texts

You may want to compute a database for your own texts. You may annotate them with an automatic tool such as Stanford CoreNLP, or manually with a tool like SACR. These tools export to a CoNLL format (use these scripts to convert from SACR to CoNLL). Then run:

python3 db_builder.py --corpus-name conll -o output_db.zip input_file.conll

(Use --corpus-name conllu for the conll-u format.)

This will create a database with the base annotations (the green disks above). If your data have syntactic parses and named entities annotated in the same way as the Democrat corpus, you may add the --linguistic and --advanced options, and change --corpus-name to dem1921.

License (more legal stuff)

For the scripts (not the corpora): (c) Bruno Oberle 2020, released under the Mozilla Public Licence, version 2.0

If you use the modified corpora: