REFCI lets you use regular expressions for class instances (objects) in Python, using a syntax similar (but not identical) to CQL or Tregex, plus some features of its own.
Here is an example:
Let's say that you have a list of tokens, each token being an object with the following attributes:
- `form`,
- `pos`,
- `length`.
tokens = [
Token('The', 'determiner', 3),
Token('little', 'adjective', 6),
Token('cats', 'noun', 4),
Token('eat', 'verb', 3),
Token('a', 'determiner', 1),
Token('fish', 'noun', 4),
Token('.', 'punctuation', 1),
]
Then you can search for patterns:
- nouns: `[pos="noun"]`
- nouns longer than 3 characters: `[pos="noun" length>3]`
- nouns whose form begins with a `c`: `[pos="noun" form=/c.*/]`
- a determiner followed by a noun: `[pos="determiner"][pos="noun"]`
- a determiner, any number of adjectives, then a noun: `[pos="determiner"][pos="adjective"]*[pos="noun"]`
Let's define a Token class with a named tuple. The class has the following attributes:
- `form`,
- `lemma`,
- `pos` (part of speech),
- `is_upper` (whether the form starts with an upper case letter),
- `length`.
from collections import namedtuple
Token = namedtuple('Token', 'form lemma pos is_upper length')
token = Token("cats", "cat", "noun", False, 4)
print(token.form)
print(token.lemma)
print(token.pos)
print(token.is_upper)
print(token.length)
Now let's build some sentences, in the form of a list of `Token`s:
tokens = [
Token('The', 'the', 'determiner', True, 3),
Token('little', 'little', 'adjective', False, 6),
Token('cats', 'cat', 'noun', False, 4),
Token('eat', 'eat', 'verb', False, 3),
Token('a', 'a', 'determiner', False, 1),
Token('fish', 'fish', 'noun', False, 4),
Token('.', '.', 'punctuation', False, 1),
Token('They', 'they', 'pronoun', True, 4),
Token('are', 'be', 'verb', False, 3),
Token('happy', 'happy', 'adjective', False, 5),
Token(':', ':', 'punctuation', False, 1),
Token('they', 'they', 'pronoun', False, 4),
Token('like', 'like', 'verb', False, 4),
Token('this', 'this', 'determiner', False, 4),
Token('Meal', 'meal', 'noun', True, 4),
Token('.', '.', 'punctuation', False, 1),
Token('.', '.', 'punctuation', False, 1),
Token('.', '.', 'punctuation', False, 1),
]
Let's import the refci `Pattern` class:
from refci import Pattern
And now we can start searching for patterns. To build a pattern, just use:
pat = Pattern('[pos="determiner"][pos="noun"]')
There are 4 main functions you can use:
- `pat.search(tokens)`: find the first occurrence of the pattern in the tokens,
- `pat.match(tokens)`: the pattern must match at the beginning of the tokens,
- `pat.fullmatch(tokens)`: the pattern must match the whole set of tokens,
- `pat.finditer(tokens)`: loop over all the matches in the tokens (by default not overlapping).
So, to find all the determiners followed by a noun:
pat = Pattern('[pos="determiner"][pos="noun"]')
for seq in pat.finditer(tokens):
print([token.form for token in seq])
Note here that `seq` is a list of tokens. You can get position indices if you prefer:
pat = Pattern('[pos="determiner"][pos="noun"]')
for seq in pat.finditer(tokens, return_objects=False):
print(seq)
If the determiner must have less than 4 characters, just add a condition:
pat = Pattern('[pos="determiner" length<4][pos="noun"]')
for seq in pat.finditer(tokens):
print([token.form for token in seq])
If the noun must be capitalized:
pat = Pattern('[pos="determiner"][pos="noun" is_upper=True]')
for seq in pat.finditer(tokens):
print([token.form for token in seq])
If the noun must have a specific lemma, determined with a regular expression:
pat = Pattern('[pos="determiner"][]*?[pos="noun" lemma=/cats?/]')
for seq in pat.finditer(tokens):
print([token.form for token in seq])
Now we want a noun phrase with a determiner and a noun, and 0 or 1 adjective in between:
pat = Pattern('[pos="determiner"][pos="adjective"]?[pos="noun"]')
for seq in pat.finditer(tokens):
print([token.form for token in seq])
Or, really, any word in the middle:
pat = Pattern('[pos="determiner"][]*?[pos="noun"]')
for seq in pat.finditer(tokens):
print([token.form for token in seq])
You can define variables. For example, if you want to search for contiguous words of the same length (even if overlapping):
pat = Pattern('[variable<-length][length==$variable]')
for seq in pat.finditer(tokens, overlapping=True):
print([token.form for token in seq])
Or a sequence of 2 words in which the second word is longer than the first:
pat = Pattern('[variable<-length][length>$variable]')
for seq in pat.finditer(tokens, overlapping=True):
print([token.form for token in seq])
You can define groups, either to offer an alternative (OR operator), for example if you want either a full noun phrase or a pronoun:
pat = Pattern('( [pos="determiner"][]*?[pos="noun"] | [pos="pronoun"] )')
for seq in pat.finditer(tokens):
print([token.form for token in seq])
or to capture only parts of the pattern, for example if you're only interested in the noun, not the determiner or the adjectives:
pat = Pattern('[pos="determiner"][]*?(?P<interesting>[pos="noun"])')
for _ in pat.finditer(tokens):
group_indices = pat.get_group('interesting')
print(group_indices)
group_tokens = pat.get_group('interesting', objs=tokens)
print([token.form for token in group_tokens])
You can use the quantifiers familiar from any regular expression engine. For example, with no quantifier after the punctuation:
pat = Pattern('[pos="noun"][pos="punctuation"]')
for seq in pat.finditer(tokens):
print([token.form for token in seq])
with a `*` (0, 1 or more punctuation marks):
pat = Pattern('[pos="noun"][pos="punctuation"]*')
for seq in pat.finditer(tokens):
print([token.form for token in seq])
with a `?` (0 or 1 punctuation mark):
pat = Pattern('[pos="noun"][pos="punctuation"]?')
for seq in pat.finditer(tokens):
print([token.form for token in seq])
with a `+` (1 or more punctuation marks):
pat = Pattern('[pos="noun"][pos="punctuation"]+')
for seq in pat.finditer(tokens):
print([token.form for token in seq])
with a custom number of punctuation marks (here between 2 and 3):
pat = Pattern('[pos="noun"][pos="punctuation"]{2,3}')
for seq in pat.finditer(tokens):
print([token.form for token in seq])
finditer vs search vs [full]match
Rather than `finditer`, you can use `search` to get the first occurrence:
pat = Pattern('[pos="noun"]')
seq = pat.search(tokens)
print([token.form for token in seq])
or the first occurrence after a certain point:
pat = Pattern('[pos="noun"]')
seq = pat.search(tokens, start=10)
print([token.form for token in seq])
The `match` function will only match at the beginning of the tokens:
pat = Pattern('[pos="determiner"][pos="adjective"]')
seq = pat.match(tokens)
print([token.form for token in seq])
While `fullmatch` will only match the whole token sequence:
pat = Pattern('[pos="determiner"][pos="adjective"]')
seq = pat.fullmatch(tokens)
print(seq)
from refci import Pattern, make_test_data, make_test_pattern
REFCI needs a list of objects. Let's define some objects with two attributes:
- `data`: a lower case letter,
- `upper`: True if the letter should be rendered as uppercase.
There is a function to quickly make this kind of object:
objs = make_test_data("aBcD")
print("\n".join("data: %s, upper: %s" % (obj.data, str(obj.upper)) for obj in objs))
You can use the same function to define simple patterns, using only letters: `(a+b|f.)`. This will build the pattern as `([data="a"]+[data="b"]|[data="f"][])`. If you want a more complex pattern, you must define it yourself.
objs, pat = make_test_data("aBcD", "(a+b|f.)")
print("\n".join("data: %s, upper: %s" % (obj.data, str(obj.upper)) for obj in objs))
print(pat.get_string())
A pattern is a sequence of atoms (`[...]`). Each atom represents a class instance to match against. To match, the instance must satisfy a series of constraints expressed as "specifications" inside the atom: `[attr1="foo" attr2>5]`.
Atoms can be grouped with parentheses, either to define an `OR` operator or to define a capturing group.
Quantifiers can be added to both atoms and groups.
You can define:
- `[attr="string"]` or `[attr!="string"]`,
- `[attr=/regex/]` or `[attr!=/regex/]`,
- `[attr==5]` or `[attr!=5]`; you can use these operators: `==`, `!=`, `<`, `>`, `<=`, `>=`,
- `[varname<-attr]` to store the value of `attr` in a variable, then `[attr==$varname]` or `[attr!=$varname]`; you can use these operators: `==`, `!=`, `<`, `>`, `<=`, `>=`. Please note that the operator is `==` even for a string,
- `[attr=T]` or `[attr!=T]`; the value may be `True`, `T`, `true`, `t`, `False`, `F`, `false`, `f`,
- `[attr={attr1_of_subobject="foo" attr2_of_subobject=/bar/}]`, where `attr` refers to a list of objects that must match the subpattern. Available operators: `=` (match), `==` (fullmatch), `~` (search); all can be prefixed with `!` to invert the result.
When you specify several specifications, as in `[foo="bar" baz="truc"]`, all must match. Use groups to emulate an `OR` operator.
The no-spec atom `[]` matches every instance.
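To see several of these specifications in action at once, here is a small sketch on the test objects from above (the output uses the `(start, stop)` tuples shown earlier with `return_objects=False`):

# A sketch combining several specification types on the test objects.
objs = make_test_data('aBcD')

for spec in [
    '[data="a"]',     # string equality
    '[data=/a|b/]',   # regular expression
    '[data!=/a|b/]',  # negated regular expression
    '[upper=T]',      # boolean value
    '[]',             # no-spec atom: matches every instance
]:
    pat = Pattern(spec)
    print(spec, list(pat.finditer(objs, return_objects=False)))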
These are standard regex quantifiers:
- `*`: 0 or more, greedy
- `?`: 0 or 1, greedy
- `+`: 1 or more, greedy
- `*?`: 0 or more, lazy
- `??`: 0 or 1, lazy
- `+?`: 1 or more, lazy
- `*+`: 0 or more, possessive
- `?+`: 0 or 1, possessive
- `++`: 1 or more, possessive
- `{1,2}` (greedy), `{1,2}?` (lazy), `{2,}+` (possessive)
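To see the three behaviors side by side, here is a small sketch with the letter-based helpers from above (it assumes `make_test_pattern` accepts the `*+` possessive quantifier, which this document only shows for `++` and `{2,}+`):

objs = make_test_data('aaa')

# Greedy: 'a*' grabs all three letters, then backtracks once so the
# final 'a' can match; the whole sequence matches.
print(make_test_pattern('a*a').match(objs, return_objects=False))

# Lazy: 'a*?' starts empty and expands only on demand, so the match
# stops after a single letter.
print(make_test_pattern('a*?a').match(objs, return_objects=False))

# Possessive: 'a*+' swallows all three letters and never gives one
# back, so the final 'a' cannot match; the result is None.
print(make_test_pattern('a*+a').match(objs, return_objects=False))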
For groups, the Python syntax is used:
- `([][])`: capturing group,
- `(?P<name>[][])`: named capturing group,
- `(?:[][])`: non-capturing group,
- the `OR` operator: `([] | [][] | ([] | [][]) )`
[attr1="foo" attr2=/.foo/]++ []*? (?P<something> [attr1="bar"] | [attr1="baz"] )
[var<-attr1] []* [attr1==$var]
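These two patterns are not tied to the test objects above; here is a quick sketch running the second one on ad-hoc objects (the `attr1` attribute and its values are invented for the illustration):

from collections import namedtuple

# Hypothetical objects carrying only the 'attr1' attribute.
Obj = namedtuple('Obj', 'attr1')
objs = [Obj('x'), Obj('y'), Obj('z'), Obj('x')]

# Store attr1 of the first object in 'var', skip any objects, then
# require a later object whose attr1 equals the variable.
pat = Pattern('[var<-attr1] []* [attr1==$var]')
seq = pat.search(objs)
print([o.attr1 for o in seq])  # expected: ['x', 'y', 'z', 'x']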
Define sample data and pattern:
objs, pat = make_test_data(
'aaababaabc',
'(a+b)+',
)
print(pat.get_string())
def print_res(res):
if res is not None:
print("".join(str(x) for x in res))
else:
print("no result")
Match (the pattern must match exactly at the given `start` index):
print_res(pat.match(objs, start=0))
print_res(pat.match(objs, start=3)) # no match at index 3
Search (anywhere in the sequence of objects):
print_res(pat.search(objs, start=0))
Match at `start` and at the end of the sequence:
print_res(pat.fullmatch(objs, start=0)) # no match until the end
print_res(make_test_pattern('(a+b)+c').fullmatch(objs, start=0))
Iterate, without overlapping:
for res in make_test_pattern('aa+').finditer(objs, start=0):
print_res(res)
Iterate, with overlapping:
for res in make_test_pattern('aa+').finditer(objs, start=0, overlapping=True):
print_res(res)
`Pattern` has an attribute `return_objects`, set to True by default. If it is True, the above functions return a list of objects; otherwise, they return a `(start, stop)` tuple (or None if there is no match).
Each function also has a parameter of the same name: if it is None, the value of the instance attribute is used; otherwise, it can be True or False.
res = pat.search(objs, start=0, return_objects=False)
print(res)
start, stop = res
print_res(objs[start:stop])
objs = make_test_data('aaabc')
pat = Pattern('(?P<one>[data="a"][data="b"]) ([])')
pat.search(objs)
print(pat.get_group(0)) # the whole match
res = pat.get_group(0, objs=objs)
print_res(res)
You can access the group by name or by index:
print_res(pat.get_group(1, objs=objs))
print_res(pat.get_group('one', objs=objs))
print_res(pat.get_group(2, objs=objs))
You can also get the (start, stop)
tuple:
print(pat.get_group(1))
Regular expressions are defined in such a way that the quantifier of the content of a group takes precedence over the quantifier of the group itself. In the following example, part of the group's content (`b??`) has a lazy quantifier, but the quantifier of the group is possessive. The pattern matches only a minimal number of repetitions of the group:
string: 'aaabababc'
regex: /(a++b??)++/
match: aaa
even if we might expect a longer match, like this one:
match: aaababa
You can test this with Perl:
open('/tmp/tmp.sh', 'w').write(
"""
my $text = "aaabababc";
if ($text =~ m/(a++b??)++/) {
print $1, "\n";
}
"""
)
!perl /tmp/tmp.sh
By setting the `mode` of the pattern, you can put the focus on the group quantifier, and get a longer match:
objs = make_test_data(
'aaababaabc',
)
def show_diff(pat):
pat.set_group_mode('normal') # default
print_res(pat.match(objs))
pat.set_group_mode('group')
print_res(pat.match(objs))
show_diff(make_test_pattern('(a++b??)++'))
show_diff(make_test_pattern('(a++b?)++'))
show_diff(make_test_pattern('(a++b??)+?'))
show_diff(make_test_pattern('(a++b??){2,3}+'))
More examples, with a group of alternatives:
objs, pat = make_test_data(
'aaAbabaabcDeeF',
'((c|e).|a+b)'
)
for res in pat.finditer(objs):
print_res(res)
objs = make_test_data(
'aaAbabcDeeFa',
)
pat = Pattern('([data="f"] | [data="b"] | [upper=T] )')
for res in pat.finditer(objs):
print_res(res)
Sub-patterns match against an attribute that is itself a list of objects. Let's attach a `content` attribute to each test object:
objs = make_test_data(
'aaab',
)
objs[0].content = make_test_data('ab')
objs[1].content = make_test_data('cd')
objs[2].content = make_test_data('cdd')
objs[3].content = make_test_data('ccddd')
patterns = [
'[data="a" content={[data="c"][data="d"]}]++',
'[data="a" content=={[data="c"][data="d"]}]++',
'[data="a" content~{[data="c"][data="d"]}]++',
'[data="a" content!={[data="c"][data="d"]}]++',
'[data="a" content!=={[data="c"][data="d"]}]++',
'[data="a" content!~{[data="c"][data="d"]}]++',
'[data="b" content!={[data="c"][data="d"]}]++',
'[data="b" content={[data="d"][data="d"]}]++',
'[data=/a|b/ content~{[data="c"][data="d"]}]++',
]
for pat in patterns:
pat = Pattern(pat)
res = pat.search(objs, return_objects=False)
print(pat.get_string(), res)
More examples with variables:
objs = make_test_data(
'aaAbabcDeeFa',
)
pat = Pattern('[a<-data upper=T][]+[data==$a]')
print_res(pat.search(objs))
pat = Pattern('[a<-upper][]+[upper==$a]')
print_res(pat.search(objs))
pat = Pattern('[a<-upper data="d"][]*[upper==$a]')
print_res(pat.search(objs))
Finally, several specifications in the same atom, all of which must match:
objs = make_test_data(
'aaAbabcDeeFa',
)
pat = Pattern('[upper=T data=/d|a/]')
for res in pat.finditer(objs):
print_res(res)