REFCI (Regular Expressions For Class Instances)

REFCI lets you use regular expressions on class instances (objects) in Python, using a syntax similar (but not identical) to CQL or Tregex. It also has features of its own.

Here is an example:

Let's say that you have a list of tokens, each token being an object, with

  • the form (form),
  • the part of speech (pos),
  • the length of the form (length):
tokens = [
    Token('The',    'determiner',   3),
    Token('little', 'adjective',    6),
    Token('cats',   'noun',         4),
    Token('eat',    'verb',         3),
    Token('a',      'determiner',   1),
    Token('fish',   'noun',         4),
    Token('.',      'punctuation',  1),
]

Then you can search patterns:

  • a noun: [pos="noun"]
  • a noun with more than 3 characters: [pos="noun" length>3]
  • a noun beginning with a c: [pos="noun" form=/c.*/]
  • a noun with a determiner before it: [pos="determiner"][pos="noun"]
  • a noun phrase with a determiner, then 0, 1 or more adjectives, then a noun: [pos="determiner"][pos="adjective"]*[pos="noun"]
  • and much, much more...

Quick start

Setup

Let's define a Token class with a named tuple. The class has the following attributes:

  • form,
  • lemma,
  • pos (part of speech),
  • is_upper (whether the form starts with an upper case letter),
  • length.
In [1]:
from collections import namedtuple

Token = namedtuple('Token', 'form lemma pos is_upper length')

token = Token("cats", "cat", "noun", False, 4)
print(token.form)
print(token.lemma)
print(token.pos)
print(token.is_upper)
print(token.length)
cats
cat
noun
False
4

Now let's build some sentences, in the form of a list of Tokens:

In [2]:
tokens = [
    Token('The',    'the',      'determiner',   True,   3),
    Token('little', 'little',   'adjective',    False,  6),
    Token('cats',   'cat',      'noun',         False,  4),
    Token('eat',    'eat',      'verb',         False,  3),
    Token('a',      'a',        'determiner',   False,  1),
    Token('fish',   'fish',     'noun',         False,  4),
    Token('.',      '.',        'punctuation',  False,  1),
    Token('They',   'they',     'pronoun',      True,   4),
    Token('are',    'be',       'verb',         False,  3),
    Token('happy',  'happy',    'adjective',    False,  5),
    Token(':',      ':',        'punctuation',  False,  1),
    Token('they',   'they',     'pronoun',      False,  4),
    Token('like',   'like',     'verb',         False,  4),
    Token('this',   'this',     'determiner',   False,  4),
    Token('Meal',   'meal',     'noun',         True,   4),
    Token('.',      '.',        'punctuation',  False,  1),
    Token('.',      '.',        'punctuation',  False,  1),
    Token('.',      '.',        'punctuation',  False,  1),
]

Let's import the refci Pattern class:

In [3]:
from refci import Pattern

And now we can start searching for patterns. To build a pattern, just use:

pat = Pattern('[pos="determiner"][pos="noun"]')

There are 4 main functions you can use:

  • pat.search(tokens): find the first occurrence of the pattern in the tokens,
  • pat.match(tokens): the pattern must be at the beginning of the tokens,
  • pat.fullmatch(tokens): the pattern must match the whole set of tokens,
  • pat.finditer(tokens): loop over all the patterns that match in the tokens (by default not overlapping).
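
These four functions mirror the behaviour of Python's built-in re module on strings; here is a quick stdlib analogy (not using refci itself):

```python
import re

text = "abc abc"
pat = re.compile(r"abc")

print(pat.search(text).span())                 # first occurrence anywhere: (0, 3)
print(pat.match(text).span())                  # must start at index 0: (0, 3)
print(pat.fullmatch(text))                     # must cover the whole string: None here
print([m.span() for m in pat.finditer(text)])  # all non-overlapping matches: [(0, 3), (4, 7)]
```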

Simple patterns

So, to find all the determiners followed by a noun:

In [4]:
pat = Pattern('[pos="determiner"][pos="noun"]')
for seq in pat.finditer(tokens):
    print([token.form for token in seq])
['a', 'fish']
['this', 'Meal']

Note here that seq is a list of tokens. You can get position indices if you prefer:

In [5]:
pat = Pattern('[pos="determiner"][pos="noun"]')
for seq in pat.finditer(tokens, return_objects=False):
    print(seq)
(4, 6)
(13, 15)

If the determiner must have less than 4 characters, just add a condition:

In [6]:
pat = Pattern('[pos="determiner" length<4][pos="noun"]')
for seq in pat.finditer(tokens):
    print([token.form for token in seq])
['a', 'fish']

If the noun must be capitalized:

In [7]:
pat = Pattern('[pos="determiner"][pos="noun" is_upper=True]')
for seq in pat.finditer(tokens):
    print([token.form for token in seq])
['this', 'Meal']

If the noun must have a specific lemma, determined with a regular expression:

In [8]:
pat = Pattern('[pos="determiner"][]*?[pos="noun" lemma=/cats?/]')
for seq in pat.finditer(tokens):
    print([token.form for token in seq])
['The', 'little', 'cats']

Now we want a noun phrase with a determiner and a noun, and 0 or 1 adjective in between:

In [9]:
pat = Pattern('[pos="determiner"][pos="adjective"]?[pos="noun"]')
for seq in pat.finditer(tokens):
    print([token.form for token in seq])
['The', 'little', 'cats']
['a', 'fish']
['this', 'Meal']

Or, really, any word in the middle:

In [10]:
pat = Pattern('[pos="determiner"][]*?[pos="noun"]')
for seq in pat.finditer(tokens):
    print([token.form for token in seq])
['The', 'little', 'cats']
['a', 'fish']
['this', 'Meal']

Variables

You can define variables. For example, if you want to search for contiguous words of the same length (even if overlapping):

In [11]:
pat = Pattern('[variable<-length][length==$variable]')
for seq in pat.finditer(tokens, overlapping=True):
    print([token.form for token in seq])
['they', 'like']
['like', 'this']
['this', 'Meal']
['.', '.']
['.', '.']

Or a sequence of 2 words in which the second word is longer than the first:

In [12]:
pat = Pattern('[variable<-length][length>$variable]')
for seq in pat.finditer(tokens, overlapping=True):
    print([token.form for token in seq])
['The', 'little']
['a', 'fish']
['.', 'They']
['are', 'happy']
[':', 'they']

Groups

You can define groups, either to offer an alternative (OR operator), for example if you want either a full noun phrase or a pronoun:

In [13]:
pat = Pattern('( [pos="determiner"][]*?[pos="noun"] | [pos="pronoun"] )')
for seq in pat.finditer(tokens):
    print([token.form for token in seq])
['The', 'little', 'cats']
['a', 'fish']
['They']
['they']
['this', 'Meal']

or to capture only parts of the pattern, for example if you're only interested in the noun, not the determiner or the adjectives:

In [14]:
pat = Pattern('[pos="determiner"][]*?(?P<interesting>[pos="noun"])')
for _ in pat.finditer(tokens):
    group_indices = pat.get_group('interesting')
    print(group_indices)
    group_tokens = pat.get_group('interesting', objs=tokens)
    print([token.form for token in group_tokens])
(2, 3)
['cats']
(5, 6)
['fish']
(14, 15)
['Meal']

Quantifiers

You can use the quantifiers familiar to any regular expression engine. For example, with no quantifier after the punctuation:

In [15]:
pat = Pattern('[pos="noun"][pos="punctuation"]')
for seq in pat.finditer(tokens):
    print([token.form for token in seq])
['fish', '.']
['Meal', '.']

with a * (0, 1 or more punctuation):

In [16]:
pat = Pattern('[pos="noun"][pos="punctuation"]*')
for seq in pat.finditer(tokens):
    print([token.form for token in seq])
['cats']
['fish', '.']
['Meal', '.', '.', '.']

with a ? (0 or 1 punctuation):

In [17]:
pat = Pattern('[pos="noun"][pos="punctuation"]?')
for seq in pat.finditer(tokens):
    print([token.form for token in seq])
['cats']
['fish', '.']
['Meal', '.']

with a + (1 or more punctuation marks):

In [18]:
pat = Pattern('[pos="noun"][pos="punctuation"]+')
for seq in pat.finditer(tokens):
    print([token.form for token in seq])
['fish', '.']
['Meal', '.', '.', '.']

with a custom number of punctuation marks (here between 2 and 3):

In [19]:
pat = Pattern('[pos="noun"][pos="punctuation"]{2,3}')
for seq in pat.finditer(tokens):
    print([token.form for token in seq])
['Meal', '.', '.', '.']

finditer vs search vs [full]match

Rather than finditer, you can use search to get the first occurrence:

In [27]:
pat = Pattern('[pos="noun"]')
seq = pat.search(tokens)
print([token.form for token in seq])
['cats']

or the first occurrence after a certain point:

In [28]:
pat = Pattern('[pos="noun"]')
seq = pat.search(tokens, start=10)
print([token.form for token in seq])
['Meal']

The match function will only match at the beginning of the tokens:

In [29]:
pat = Pattern('[pos="determiner"][pos="adjective"]')
seq = pat.match(tokens)
print([token.form for token in seq])
['The', 'little']

While fullmatch will only match the whole token sequence:

In [30]:
pat = Pattern('[pos="determiner"][pos="adjective"]')
seq = pat.fullmatch(tokens)
print(seq)
None

Detailed guide

In [1]:
from refci import Pattern, make_test_data, make_test_pattern

Test sets

REFCI needs a list of objects. Let's define some objects with two attributes:

  • data: a lower case letter,
  • upper: True if the letter should be rendered as uppercase.

There is a function to quickly make objects of this kind:

In [2]:
objs = make_test_data("aBcD")
print("\n".join("data: %s, upper: %s" % (obj.data, str(obj.upper)) for obj in objs))
data: a, upper: False
data: b, upper: True
data: c, upper: False
data: d, upper: True

You can use the same function to define simple patterns, using only letters: (a+b|f.). This will build the pattern as ([data="a"]+[data="b"]|[data="f"][]).

If you want a more complex pattern, you must define it yourself.

In [3]:
objs, pat = make_test_data("aBcD", "(a+b|f.)")
print("\n".join("data: %s, upper: %s" % (obj.data, str(obj.upper)) for obj in objs))
print(pat.get_string())
data: a, upper: False
data: b, upper: True
data: c, upper: False
data: d, upper: True
(([data="a"]+ [data="b"]) | ([data="f"] []))

Pattern

A pattern is a sequence of atoms ([...]). Each atom represents a class instance to match against. To match, the instance must satisfy a series of constraints expressed as "specifications" inside the atom: [attr1="foo" attr2>5].

Atoms can be grouped with parentheses, either to define an OR operator, or to define a capturing group.

Quantifiers can be added to both atoms and groups.
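
As a rough mental model (a minimal sketch, not REFCI's actual implementation), an atom can be seen as a list of (attribute, predicate) pairs that must all hold for an instance to match:

```python
import re
from collections import namedtuple

Token = namedtuple('Token', 'form pos length')

def atom_matches(instance, specs):
    """An instance matches an atom when every specification holds."""
    return all(pred(getattr(instance, attr)) for attr, pred in specs)

# Rough equivalent of the atom [pos="noun" length>3 form=/c.*/]
specs = [
    ('pos',    lambda v: v == 'noun'),
    ('length', lambda v: v > 3),
    ('form',   lambda v: re.fullmatch(r'c.*', v) is not None),
]

print(atom_matches(Token('cats', 'noun', 4), specs))  # True
print(atom_matches(Token('eat',  'verb', 3), specs))  # False
```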

Specifications

You can define:

  • string: [attr="string"] or [attr!="string"],
  • regex: [attr=/regex/] or [attr!=/regex/],
  • number: [attr==5] or [attr!=5]; you can use these operators: ==, !=, <, >, <=, >=,
  • set variable: [varname<-attr]
  • use variable: [attr==$varname] or [attr!=$varname]; you can use these operators: ==, !=, <, >, <=, >=. Please note that the operator is == even for a string,
  • bool: [attr=T] or [attr!=T], the value may be True, T, true, t, False, F, false, f,
  • sub pattern: [attr={attr1_of_subobject="foo" attr2_of_subobject=/bar/}], where attr refers to a list of objects that must match the subpattern. Available operators: = (match), == (fullmatch), ~ (search), all can be prefixed with ! to invert the result.

When you specify several specifications, as in [foo="bar" baz="truc"], all must match. Use groups to emulate an OR operator.

The no-spec atom [] matches every instance.
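
The variable mechanism can be sketched like this (an illustrative approximation, not the library's code): a set-variable spec stores an attribute value, and a later use-variable spec compares against it with the chosen operator:

```python
import operator
from collections import namedtuple

Token = namedtuple('Token', 'form length')

OPS = {'==': operator.eq, '!=': operator.ne, '<': operator.lt,
       '>': operator.gt, '<=': operator.le, '>=': operator.ge}

variables = {}

# [x<-length] on the token 'they': store its length under the name x
variables['x'] = Token('they', 4).length

# [length==$x] on the token 'like': compare its length with the stored value
print(OPS['=='](Token('like', 4).length, variables['x']))  # True
# [length>$x] on the token 'a': 1 > 4
print(OPS['>'](Token('a', 1).length, variables['x']))      # False
```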

Quantifiers

These are standard regex quantifiers:

  • default: one repetition, lazy
  • *: 0 or more, greedy
  • ?: 0 or 1, greedy
  • +: 1 or more, greedy
  • *?: 0 or more, lazy
  • ??: 0 or 1, lazy
  • +?: 1 or more, lazy
  • *+: 0 or more, possessive
  • ?+: 0 or 1, possessive
  • ++: 1 or more, possessive
  • {1,2} (greedy), {1,2}? (lazy), {2,}+ (possessive)
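
These quantifiers behave like their string-regex counterparts. Here is how greedy and lazy differ in Python's own re module (possessive quantifiers only exist in re from Python 3.11 on, so they are left out of this demo):

```python
import re

s = "aaab"
print(re.match(r"a+",  s).group())      # greedy: takes as much as possible -> 'aaa'
print(re.match(r"a+?", s).group())      # lazy: takes as little as possible -> 'a'
print(re.match(r"a{1,2}",  s).group())  # bounded greedy -> 'aa'
print(re.match(r"a{1,2}?", s).group())  # bounded lazy -> 'a'
```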

Groups

The Python syntax is used:

  • capturing group: ([][])
  • named capturing group: (?P<name>[][])
  • non capturing group: (?:[][])
  • the OR operator: ([] | [][] | ([] | [][]) )
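
Since the group syntax is the same as Python's, a plain re example translates directly: you can capture by name or by index, and non-capturing groups are skipped in the numbering:

```python
import re

m = re.fullmatch(r"(?P<head>the|a) (?:very )?(\w+)", "the very cat")
print(m.group(0))       # whole match: 'the very cat'
print(m.group('head'))  # named group: 'the'
print(m.group(2))       # second capturing group (the (?:...) is not counted): 'cat'
```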

Examples

[attr1="foo" attr2=/.foo/]++ []*? (?P<something> [attr1="bar"] | [attr1="baz"] )
[var<-attr1] []* [attr1==$var]

Match and search functions

Define sample data and pattern:

In [4]:
objs, pat = make_test_data(
    'aaababaabc',
    '(a+b)+',
)
print(pat.get_string())
def print_res(res):
    if res is not None:
        print("".join(str(x) for x in res))
    else:
        print("no result")
([data="a"]+ [data="b"])+

Match (the pattern must match at object index 0):

In [5]:
print_res(pat.match(objs, start=0))
aaababaab
In [6]:
print_res(pat.match(objs, start=3)) # no match at index 3
no result

Search (anywhere in the sequence of objects):

In [7]:
print_res(pat.search(objs, start=0))
aaababaab

Match from the start to the end of the sequence:

In [8]:
print_res(pat.fullmatch(objs, start=0)) # no match until the end
no result
In [9]:
print_res(make_test_pattern('(a+b)+c').fullmatch(objs, start=0))
aaababaabc

Iterate, without overlapping:

In [10]:
for res in make_test_pattern('aa+').finditer(objs, start=0):
    print_res(res)
aaa
aa

Iterate, with overlapping:

In [11]:
for res in make_test_pattern('aa+').finditer(objs, start=0, overlapping=True):
    print_res(res)
aaa
aa
aa

Returning indices or objects

Pattern has an attribute return_objects, set to True by default. If it is True, the above functions return a list of objects. Otherwise, they return a tuple (start, stop) (or None if no match).

Each function has a parameter of the same name: if it is None, then the value of the instance attribute is used. Otherwise, it can be True or False.

In [12]:
res = pat.search(objs, start=0, return_objects=False)
print(res)
start, stop = res
print_res(objs[start:stop])
(0, 9)
aaababaab

Groups

In [14]:
objs = make_test_data('aaabc')
In [15]:
pat = Pattern('(?P<one>[data="a"][data="b"]) ([])')
pat.search(objs)
print(pat.get_group(0)) # the whole match
res = pat.get_group(0, objs=objs)
print_res(res)
(2, 5)
abc

You can access the group by name or by index:

In [16]:
print_res(pat.get_group(1, objs=objs))
print_res(pat.get_group('one', objs=objs))
print_res(pat.get_group(2, objs=objs))
ab
ab
c

You can also get the (start, stop) tuple:

In [18]:
print(pat.get_group(1))
(2, 4)

Special behavior of group quantifiers

Regular expressions are defined in such a way that the quantifier of a group's content takes precedence over the quantifier of the group itself. In the following example, the quantifier of the content of the group is lazy, but the quantifier of the group is possessive. The pattern matches only a minimal number of repetitions of the group:

string: 'aaabababc'
regex: /(a++b??)++/
match: aaa

even if we might expect a longer match, like this one:

match: aaababa

You can check this with Perl:

In [19]:
with open('/tmp/tmp.pl', 'w') as f:  # 'with' ensures the file is flushed before Perl runs
    f.write(r"""
my $text = "aaabababc";

if ($text =~ m/(a++b??)++/) {
   print $1, "\n";
}
""")
!perl /tmp/tmp.pl
aaa

By setting the mode of the pattern, you can put the focus on the group quantifier, and get a longer match:

In [20]:
objs = make_test_data(
    'aaababaabc',
)
def show_diff(pat):
    pat.set_group_mode('normal') # default
    print_res(pat.match(objs))
    pat.set_group_mode('group')
    print_res(pat.match(objs))
In [21]:
show_diff(make_test_pattern('(a++b??)++'))
aaa
aaababaa
In [22]:
show_diff(make_test_pattern('(a++b?)++'))
aaababaab
aaababaab
In [23]:
show_diff(make_test_pattern('(a++b??)+?'))
aaa
aaa
In [24]:
show_diff(make_test_pattern('(a++b??){2,3}+'))
aaaba
aaababaa

OR group

In [8]:
objs, pat = make_test_data(
    'aaAbabaabcDeeF',
    '((c|e).|a+b)'
)
for res in pat.finditer(objs):
    print_res(res)
aaAb
ab
aab
cD
ee
In [14]:
objs = make_test_data(
    'aaAbabcDeeFa',
)
pat = Pattern('([data="f"] | [data="b"] | [upper=T] )')
for res in pat.finditer(objs):
    print_res(res)
A
b
b
D
F

Using subpatterns

In [2]:
objs = make_test_data(
    'aaab',
)
objs[0].content = make_test_data('ab')
objs[1].content = make_test_data('cd')
objs[2].content = make_test_data('cdd')
objs[3].content = make_test_data('ccddd')

patterns = [
    '[data="a" content={[data="c"][data="d"]}]++',
    '[data="a" content=={[data="c"][data="d"]}]++',
    '[data="a" content~{[data="c"][data="d"]}]++',
    '[data="a" content!={[data="c"][data="d"]}]++',
    '[data="a" content!=={[data="c"][data="d"]}]++',
    '[data="a" content!~{[data="c"][data="d"]}]++',
    '[data="b" content!={[data="c"][data="d"]}]++',
    '[data="b" content={[data="d"][data="d"]}]++',
    '[data=/a|b/ content~{[data="c"][data="d"]}]++',
]
for pat in patterns:
    pat = Pattern(pat)
    res = pat.search(objs, return_objects=False)
    print(pat.get_string(), res)
([data="a" content={([data="c"] [data="d"])}]++) (1, 3)
([data="a" content=={([data="c"] [data="d"])}]++) (1, 2)
([data="a" content~{([data="c"] [data="d"])}]++) (1, 3)
([data="a" content!={([data="c"] [data="d"])}]++) (0, 1)
([data="a" content!=={([data="c"] [data="d"])}]++) (0, 1)
([data="a" content!~{([data="c"] [data="d"])}]++) (0, 1)
([data="b" content!={([data="c"] [data="d"])}]++) (3, 4)
([data="b" content={([data="d"] [data="d"])}]++) None
([data=/a|b/ content~{([data="c"] [data="d"])}]++) (1, 4)

Using variables

In [10]:
objs = make_test_data(
    'aaAbabcDeeFa',
)
pat = Pattern('[a<-data upper=T][]+[data==$a]')
print_res(pat.search(objs))
AbabcDeeFa
In [11]:
pat = Pattern('[a<-upper][]+[upper==$a]')
print_res(pat.search(objs))
aaAbabcDeeFa
In [12]:
pat = Pattern('[a<-upper data="d"][]*[upper==$a]')
print_res(pat.search(objs))
DeeF

Using regex

In [16]:
objs = make_test_data(
    'aaAbabcDeeFa',
)
pat = Pattern('[upper=T data=/d|a/]')
for res in pat.finditer(objs):
    print_res(res)
A
D