Regular Expressions For Class Instances (refci)

refci provides regular expression features for arbitrary class instances. For example, given a list of token objects, you can search for patterns such as a determiner followed by a noun.

There is a quick start below on this page, and a more complete usage guide in the form of a Jupyter notebook that you can download with the code.


On this page:

- Introduction
- Quickstart
  - Setup
  - Simple patterns
  - Variables
  - Groups
  - Quantifiers
  - finditer vs search vs [full]match
- More explanations

Introduction

refci provides regular expression features for arbitrary class instances. It is somewhat similar to TokensRegex or CQL, but it is implemented in Python and offers some specific features.

Here is an example:

Let's say that you have a list of tokens, each token being an object with a form, a part-of-speech tag and a length:

tokens = [
    Token('The',    'determiner',   3),
    Token('little', 'adjective',    6),
    Token('cats',   'noun',         4),
    Token('eat',    'verb',         3),
    Token('a',      'determiner',   1),
    Token('fish',   'noun',         4),
    Token('.',      'punctuation',  1),
]

Then you can search for patterns.
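
For instance, here is a minimal sketch, using the Pattern class and the finditer method introduced in the quickstart below:

from refci import Pattern

pat = Pattern('[pos="determiner"][pos="noun"]')
for seq in pat.finditer(tokens):
    print([token.form for token in seq])
# prints ['a', 'fish'], the only determiner immediately followed by a noun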

Quickstart

Setup

Let's define a Token class as a named tuple with five attributes: form, lemma, pos, is_upper and length.

from collections import namedtuple

Token = namedtuple('Token', 'form lemma pos is_upper length')

token = Token("cats" , "cat", "noun", False, 4)
print(token.form)
print(token.lemma)
print(token.pos)
print(token.is_upper)
print(token.length)
cats
cat
noun
False
4

Now let's build some sentences, in the form of a list of Tokens:

tokens = [
    Token('The',    'the',      'determiner',   True,   3),
    Token('little', 'little',   'adjective',    False,  6),
    Token('cats',   'cat',      'noun',         False,  4),
    Token('eat',    'eat',      'verb',         False,  3),
    Token('a',      'a',        'determiner',   False,  1),
    Token('fish',   'fish',     'noun',         False,  4),
    Token('.',      '.',        'punctuation',  False,  1),
    Token('They',   'they',     'pronoun',      True,   4),
    Token('are',    'be',       'verb',         False,  3),
    Token('happy',  'happy',    'adjective',    False,  5),
    Token(':',      ':',        'punctuation',  False,  1),
    Token('they',   'they',     'pronoun',      False,  4),
    Token('like',   'like',     'verb',         False,  4),
    Token('this',   'this',     'determiner',   False,  4),
    Token('Meal',   'meal',     'noun',         True,   4),
    Token('.',      '.',        'punctuation',  False,  1),
    Token('.',      '.',        'punctuation',  False,  1),
    Token('.',      '.',        'punctuation',  False,  1),
]

Let's import the refci Pattern class:

from refci import Pattern

And now we can start searching for patterns. To build a pattern, just use:

pat = Pattern('[pos="determiner"][pos="noun"]')

There are 4 main functions you can use: finditer, search, match and fullmatch.
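
As a rough sketch of how they differ (each one is demonstrated in the sections below):

pat = Pattern('[pos="noun"]')
pat.finditer(tokens)   # iterate over all the matches
pat.search(tokens)     # get the first match, anywhere in the sequence
pat.match(tokens)      # match only at the beginning of the sequence
pat.fullmatch(tokens)  # match only the whole sequence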

Simple patterns

So, to find all the determiners followed by a noun:

pat = Pattern('[pos="determiner"][pos="noun"]')
for seq in pat.finditer(tokens):
    print([token.form for token in seq])
['a', 'fish']
['this', 'Meal']

Note here that seq is a list of tokens. You can get position indices instead if you prefer:

pat = Pattern('[pos="determiner"][pos="noun"]')
for seq in pat.finditer(tokens, return_objects=False):
    print(seq)
(4, 6)
(13, 15)
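
The pairs are (start, end) indices, with the end exclusive as in Python slices (the (4, 6) above covers positions 4 and 5), so you can use them to slice the original list:

pat = Pattern('[pos="determiner"][pos="noun"]')
for start, end in pat.finditer(tokens, return_objects=False):
    print([token.form for token in tokens[start:end]])
['a', 'fish']
['this', 'Meal']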

If the determiner must have fewer than 4 characters, just add a condition:

pat = Pattern('[pos="determiner" length<4][pos="noun"]')
for seq in pat.finditer(tokens):
    print([token.form for token in seq])
['a', 'fish']

If the noun must be capitalized:

pat = Pattern('[pos="determiner"][pos="noun" is_upper=True]')
for seq in pat.finditer(tokens):
    print([token.form for token in seq])
['this', 'Meal']

If the noun must have a specific lemma, specified with a regular expression:

pat = Pattern('[pos="determiner"][]*?[pos="noun" lemma=/cats?/]')
for seq in pat.finditer(tokens):
    print([token.form for token in seq])
['The', 'little', 'cats']

Now we want a noun phrase with a determiner and a noun, and 0 or 1 adjective in between:

pat = Pattern('[pos="determiner"][pos="adjective"]?[pos="noun"]')
for seq in pat.finditer(tokens):
    print([token.form for token in seq])
['The', 'little', 'cats']
['a', 'fish']
['this', 'Meal']

Or, really, any number of words in between:

pat = Pattern('[pos="determiner"][]*?[pos="noun"]')
for seq in pat.finditer(tokens):
    print([token.form for token in seq])
['The', 'little', 'cats']
['a', 'fish']
['this', 'Meal']

Variables

You can define variables. In the pattern below, the first token stores its length attribute in variable, and the second token must have a length equal to $variable. This finds contiguous words of the same length (even if the matches overlap):

pat = Pattern('[variable<-length][length==$variable]')
for seq in pat.finditer(tokens, overlapping=True):
    print([token.form for token in seq])
['they', 'like']
['like', 'this']
['this', 'Meal']
['.', '.']
['.', '.']

Or a sequence of 2 words in which the second is longer than the first:

pat = Pattern('[variable<-length][length>$variable]')
for seq in pat.finditer(tokens, overlapping=True):
    print([token.form for token in seq])
['The', 'little']
['a', 'fish']
['.', 'They']
['are', 'happy']
[':', 'they']

Groups

You can define groups, either to express an alternative (OR operator), for example if you want either a full noun phrase or a pronoun:

pat = Pattern('( [pos="determiner"][]*?[pos="noun"] | [pos="pronoun"] )')
for seq in pat.finditer(tokens):
    print([token.form for token in seq])
['The', 'little', 'cats']
['a', 'fish']
['They']
['they']
['this', 'Meal']

or to capture only part of the pattern, for example if you're only interested in the noun, not the determiner or the adjectives (the named group syntax (?P<name>...) is the same as in Python's re module):

pat = Pattern('[pos="determiner"][]*?(?P<interesting>[pos="noun"])')
for _ in pat.finditer(tokens):
    group_indices = pat.get_group('interesting')
    print(group_indices)
    group_tokens = pat.get_group('interesting', objs=tokens)
    print([token.form for token in group_tokens])
(2, 3)
['cats']
(5, 6)
['fish']
(14, 15)
['Meal']

Quantifiers

You can use the quantifiers familiar from any regular expression engine. For example, with no quantifier after the punctuation:

pat = Pattern('[pos="noun"][pos="punctuation"]')
for seq in pat.finditer(tokens):
    print([token.form for token in seq])
['fish', '.']
['Meal', '.']

with a * (0 or more punctuations):

pat = Pattern('[pos="noun"][pos="punctuation"]*')
for seq in pat.finditer(tokens):
    print([token.form for token in seq])
['cats']
['fish', '.']
['Meal', '.', '.', '.']

with a ? (0 or 1 punctuation):

pat = Pattern('[pos="noun"][pos="punctuation"]?')
for seq in pat.finditer(tokens):
    print([token.form for token in seq])
['cats']
['fish', '.']
['Meal', '.']

with a + (1 or more punctuations):

pat = Pattern('[pos="noun"][pos="punctuation"]+')
for seq in pat.finditer(tokens):
    print([token.form for token in seq])
['fish', '.']
['Meal', '.', '.', '.']

with a custom number of punctuations (here between 2 and 3):

pat = Pattern('[pos="noun"][pos="punctuation"]{2,3}')
for seq in pat.finditer(tokens):
    print([token.form for token in seq])
['Meal', '.', '.', '.']

finditer vs search vs [full]match

Rather than finditer, you can use search to get the first occurrence:

pat = Pattern('[pos="noun"]')
seq = pat.search(tokens)
print([token.form for token in seq])
['cats']

or the first occurrence after a certain point:

pat = Pattern('[pos="noun"]')
seq = pat.search(tokens, start=10)
print([token.form for token in seq])
['Meal']

The match function will only match at the beginning of the token sequence:

pat = Pattern('[pos="determiner"][pos="adjective"]')
seq = pat.match(tokens)
print([token.form for token in seq])
['The', 'little']
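
If the pattern does not match at the very beginning, match finds nothing. Presumably it returns None on failure, like fullmatch below; a sketch:

pat = Pattern('[pos="noun"]')
print(pat.match(tokens))
# the sequence starts with a determiner, so this should print None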

The fullmatch function, on the other hand, will only match the whole token sequence:

pat = Pattern('[pos="determiner"][pos="adjective"]')
seq = pat.fullmatch(tokens)
print(seq)
None
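
For fullmatch to succeed, the pattern must cover the tokens from the first one to the last one. For example, a sketch assuming a greedy wildcard []* (the greedy counterpart of the []*? used above):

pat = Pattern('[pos="determiner"][]*[pos="punctuation"]')
seq = pat.fullmatch(tokens)
print([token.form for token in seq])
# should print the forms of all 18 tokens, from 'The' to the final '.'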

More explanations

A Jupyter notebook with more explanations is available in the code archive.
