Cette page n'est disponible qu'en anglais !

Regular Expression For Classe Instances (`refci`)

Regular expressions features on arbitrary class instances. For example, given a list of token objects, search for patterns such as:

a noun: [pos="noun"]
a noun with more than 3 characters: [pos="noun" length>3]
a noun beginning with a c: [pos="noun" form=/c.*/]
a noun with a determiner before it: [pos="determiner"][pos="noun"]
a noun phrase with a determiner, then 0, 1 or more adjectives, then a noun: [pos="determiner"][pos="adjective"]*[pos="noun"]
etc.

Quick start below on this page, or a more complete use guide in the form of a Jupyter notebook that you can download with the code:

view jupyter notebook download code view github repo

On this page:

Introduction
Quickstart
More explanations

Introduction

Regular expressions features on arbitrary class instances. It is somewhat similar to TokensRegex or CQL, but is implemented in Python and offers some specific features.

Here is an example:

Let's say that you have a list of tokens, each token being an object, with

the form (form),
the part of speech (pos),
the length of the form (length):

tokens = [
    Token('The',    'determiner',   3),
    Token('little', 'adjective',    6),
    Token('cats',   'noun',         4),
    Token('eat',    'verb',         3),
    Token('a',      'determiner',   1),
    Token('fish',   'noun',         4),
    Token('.',      'punctuation',  1),
]

Then you can search patterns:

a noun: [pos="noun"]
a noun with more than 3 characters: [pos="noun" length>3]
a noun beginning with a c: [pos="noun" form=/c.*/]
a noun with a determiner before it: [pos="determiner"][pos="noun"]
a noun phrase with a determiner, then 0, 1 or more adjectives, then a noun: [pos="determiner"][pos="adjective"]*[pos="noun"]
and much, much more...

Quickstart

Setup

Let's define a Token class with a named tuple. The class has the following attributs:

form,
lemma,
pos (part of speech),
is_upper (whether the form starts with an upper case letter),
length.

from collections import namedtuple

Token = namedtuple('Token', 'form lemma pos is_upper length')

token = Token("cats" , "cat", "noun", False, 4)
print(token.form)
print(token.lemma)
print(token.pos)
print(token.is_upper)
print(token.length)

cats
cat
noun
False
4

Now let's build some sentences, in the form of a list of Tokens:

tokens = [
    Token('The',    'the',      'determiner',   True,   3),
    Token('little', 'little',   'adjective',    False,  6),
    Token('cats',   'cat',      'noun',         False,  4),
    Token('eat',    'eat',      'verb',         False,  3),
    Token('a',      'a',        'determiner',   False,  1),
    Token('fish',   'fish',     'noun',         False,  4),
    Token('.',      '.',        'punctuation',  False,  1),
    Token('They',   'they',     'pronoun',      True,   4),
    Token('are',    'be',       'verb',         False,  3),
    Token('happy',  'happy',    'adjective',    False,  5),
    Token(':',      ':',        'punctuation',  False,  1),
    Token('they',   'they',     'pronoun',      False,  4),
    Token('like',   'like',     'verb',         False,  4),
    Token('this',   'this',     'determiner',   False,  4),
    Token('Meal',   'meal',     'noun',         True,  4),
    Token('.',      '.',        'punctuation',  False,  1),
    Token('.',      '.',        'punctuation',  False,  1),
    Token('.',      '.',        'punctuation',  False,  1),
]

Let's import refci Pattern class:

from refci import Pattern

And now we can start search for patterns. To build a pattern, just use:

pat = Pattern('[pos="determiner"][pos="noun"]')

There are 4 main functions you can use:

pat.search(tokens): find the first occurrence of the pattern in the tokens,
pat.match(tokens): the pattern must be at the beginning of the tokens,
pat.fullmatch(tokens): the pattern must match the whole set of tokens
pat.finditer(tokens): loop over all the patterns that match in the tokens (by default not overlapping).

Simple patterns

So, two find all the determiners followed by a noun:

pat = Pattern('[pos="determiner"][pos="noun"]')
for seq in pat.finditer(tokens):
    print([token.form for token in seq])

['a', 'fish']
['this', 'Meal']

Note here that seq is a list of tokens. You can get position indices if you prefer:

pat = Pattern('[pos="determiner"][pos="noun"]')
for seq in pat.finditer(tokens, return_objects=False):
    print(seq)

(4, 6)
(13, 15)

If the determiner must have less than 4 characters, just add a condition:

pat = Pattern('[pos="determiner" length<4][pos="noun"]')
for seq in pat.finditer(tokens):
    print([token.form for token in seq])

['a', 'fish']

If the noun must be capitalized:

pat = Pattern('[pos="determiner"][pos="noun" is_upper=True]')
for seq in pat.finditer(tokens):
    print([token.form for token in seq])

['this', 'Meal']

If the noun must have a specific lemma, determined with a regular expression:

pat = Pattern('[pos="determiner"][]*?[pos="noun" lemma=/cats?/]')
for seq in pat.finditer(tokens):
    print([token.form for token in seq])

['The', 'little', 'cats']

Now we want noun phrase with a determiner and a noun, and 0 or 1 adjective in the middle:

pat = Pattern('[pos="determiner"][pos="adjective"]?[pos="noun"]')
for seq in pat.finditer(tokens):
    print([token.form for token in seq])

['The', 'little', 'cats']
['a', 'fish']
['this', 'Meal']

Or, really, any word in the middle:

pat = Pattern('[pos="determiner"][]*?[pos="noun"]')
for seq in pat.finditer(tokens):
    print([token.form for token in seq])

['The', 'little', 'cats']
['a', 'fish']
['this', 'Meal']

Variables

You can define variables. For example, if you want to search for contiguous words of the same length (even if overlapping):

pat = Pattern('[variable<-length][length==$variable]')
for seq in pat.finditer(tokens, overlapping=True):
    print([token.form for token in seq])

['they', 'like']
['like', 'this']
['this', 'Meal']
['.', '.']
['.', '.']

Or sequence of 2 words in which the second word is longer than the first:

pat = Pattern('[variable<-length][length>$variable]')
for seq in pat.finditer(tokens, overlapping=True):
    print([token.form for token in seq])

['The', 'little']
['a', 'fish']
['.', 'They']
['are', 'happy']
[':', 'they']

Groups

You can define groups, either to offer an alternative (OR operator), for example if you want either a full noun phrase or a pronoun:

pat = Pattern('( [pos="determiner"][]*?[pos="noun"] | [pos="pronoun"] )')
for seq in pat.finditer(tokens):
    print([token.form for token in seq])

['The', 'little', 'cats']
['a', 'fish']
['They']
['they']
['this', 'Meal']

or to capture only parts of the pattern, for example if your only interested in the noun, not the determiner or the adjectives:

pat = Pattern('[pos="determiner"][]*?(?P<interesting>[pos="noun"])')
for _ in pat.finditer(tokens):
    group_indices = pat.get_group('interesting')
    print(group_indices)
    group_tokens = pat.get_group('interesting', objs=tokens)
    print([token.form for token in group_tokens])

(2, 3)
['cats']
(5, 6)
['fish']
(14, 15)
['Meal']

Quantifiers

You can use the quantifiers familiar to any regular expression engine. For example, with no quantifier after the ponctuation:

pat = Pattern('[pos="noun"][pos="punctuation"]')
for seq in pat.finditer(tokens):
    print([token.form for token in seq])

['fish', '.']
['Meal', '.']

with a * (0, 1 or more punctuation):

pat = Pattern('[pos="noun"][pos="punctuation"]*')
for seq in pat.finditer(tokens):
    print([token.form for token in seq])

['cats']
['fish', '.']
['Meal', '.', '.', '.']

with a ? (0 or 1 punctuation):

pat = Pattern('[pos="noun"][pos="punctuation"]?')
for seq in pat.finditer(tokens):
    print([token.form for token in seq])

['cats']
['fish', '.']
['Meal', '.']

with a + (1 or more punctuations):

pat = Pattern('[pos="noun"][pos="punctuation"]+')
for seq in pat.finditer(tokens):
    print([token.form for token in seq])

['fish', '.']
['Meal', '.', '.', '.']

with a custom number of punctuation (here between 2 and 3):

pat = Pattern('[pos="noun"][pos="punctuation"]{2,3}')
for seq in pat.finditer(tokens):
    print([token.form for token in seq])

['Meal', '.', '.', '.']

`finditer` vs `search` vs `[full]match`

Rather than finditer, you can use search to get the first occurrence:

pat = Pattern('[pos="noun"]')
seq = pat.search(tokens)
print([token.form for token in seq])

['cats']

or the first occurrence after a certain point:

pat = Pattern('[pos="noun"]')
seq = pat.search(tokens, start=10)
print([token.form for token in seq])

['Meal']

The match function will only match at the beginning of the tokens:

pat = Pattern('[pos="determiner"][pos="adjective"]')
seq = pat.match(tokens)
print([token.form for token in seq])

['The', 'little']

While the fullmatch will only match for the whole token sequence:

pat = Pattern('[pos="determiner"][pos="adjective"]')
seq = pat.fullmatch(tokens)
print(seq)

None

More explanations

A Jupyter notebook is available in the code archive:

view jupyter notebook download code

Regular Expression For Classe Instances (refci)