standoff2inline module user guide

Converting standoff annotations to inline annotations.

For example, in the sentence:

The little cat drinks milk.

you know that the third word, between the 12th and 14th characters, is a noun. You may want to surround it with some tags, like <noun> and </noun>:

The little <noun>cat</noun> drinks milk.
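
To make the idea concrete, here is a minimal plain-Python sketch of the conversion for a single annotation. It is not the module's implementation: the helper name `wrap` is hypothetical, and it assumes 0-based offsets with an inclusive end, so that "cat" spans offsets 11 to 13.

```python
def wrap(text, start, end, open_tag, close_tag):
    """Wrap text[start..end] (0-based, end inclusive) in a pair of tags."""
    return (
        text[:start]
        + open_tag
        + text[start:end + 1]
        + close_tag
        + text[end + 1:]
    )


sentence = "The little cat drinks milk."
result = wrap(sentence, 11, 13, "<noun>", "</noun>")
print(result)
```

This only handles one annotation; nested or repeated annotations need the offsets to be tracked more carefully.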

This module offers classes and functions to:

  • add inline annotations, like xml annotations, counting in characters or tokens,
  • highlight some chunks of text, for example with styled <span> tags,
  • remove parts without annotations and replace them with something like [...].

Overview

Adding annotations from character positions:

In [1]:
from standoff2inline import Standoff2Inline

string = "The little cat drinks milk."
inliner = Standoff2Inline()
inliner.add((0, "<sent>"), (26, "</sent>"))
inliner.add((0, "<gn>"), (13, "</gn>"))
inliner.add((11, "<noun>"), (13, "</noun>"))
inliner.add((22, "<noun>"), (25, "</noun>"))
inliner.add((0, "<det>"), (2, "</det>"))
inliner.apply(string)
Out[1]:
'<sent><gn><det>The</det> little <noun>cat</noun></gn> drinks <noun>milk</noun>.</sent>'

Adding annotations from token positions:

In [2]:
from standoff2inline import Standoff2Inline

tokens = "The little cat drinks milk .".split()
inliner = Standoff2Inline()
inliner.add((0, "<sent>"), (5, "</sent>"))
inliner.add((0, "<gn>"), (2, "</gn>"))
inliner.add((2, "<noun>"), (2, "</noun>"))
inliner.add((4, "<noun>"), (4, "</noun>"))
inliner.add((0, "<det>"), (0, "</det>"))
inliner.apply(tokens=tokens)
Out[2]:
'<sent><gn><det>The</det> little <noun>cat</noun></gn> drinks <noun>milk</noun> .</sent> '

Using the Highlighter class to highlight different parts of speech in different styles:

  • determiners in red
  • nouns underlined
  • verbs in bold and italic
In [1]:
from standoff2inline import Highlighter, highlight

hl_det = Highlighter()
hl_det.add_mark(0, 0)
hl_det.set_style(color="red")

hl_noun = Highlighter()
hl_noun.add_mark(2, 2)
hl_noun.add_mark(4, 4)
hl_noun.set_style(underline=True)

hl_verb = Highlighter(prefix="[", suffix="]")
hl_verb.add_mark(3, 3)
hl_verb.set_style(bold=True, italic=True)

tokens = "The little cat drinks milk ...".split()
res = highlight(tokens, hl_det, hl_noun, hl_verb)

print(res)

from IPython.display import display, HTML
display(HTML(res))
<span style="color: red; ">The</span> little <span style="text-decoration: underline; ">cat</span> <span style="font-weight: bold; font-style: italic; ">[drinks]</span> <span style="text-decoration: underline; ">milk</span> ...
The little cat [drinks] milk ...

Cutting long passages without annotations (and replacing them with, for example, "[...]"):

In [11]:
from standoff2inline import Highlighter, highlight

hl = Highlighter(suffix="</span>")
hl.add_mark(2, 2, '<span type="noun">')
hl.add_mark(12, 12 ,'<span type="noun">')
hl.add_mark(0, 0, '<span type="det">')
tokens = "The little cat who played yesterday with my " \
    "neighbor 's children drinks " \
    "milk ...  And the next sentence ...".split()
highlight(tokens, hl, margin=2, max_gap=4)
Out[11]:
'<span type="det">The</span> little <span type="noun">cat</span> who played [...] children drinks <span type="noun">milk</span> ... And [...]'

Inliner recipes

In [3]:
from standoff2inline import Standoff2Inline

Add annotations at some positions:

In [4]:
string = "The little cat drinks milk."
inliner = Standoff2Inline()
inliner.add((0, "(flag)"))
inliner.add((3, "(flag)"))
inliner.add((4, "(flag)"))
inliner.add((10, "(flag)"))
inliner.apply(string)
Out[4]:
'(flag)The(flag) (flag)little(flag) cat drinks milk.'
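
The output above suggests that a single marker is placed just before the character at the given offset. The plain-Python sketch below reproduces that behaviour under this assumption; the helper `insert_at` is hypothetical and is not part of the module.

```python
def insert_at(text, insertions):
    """Insert each marker before the character at its 0-based offset."""
    # Work from the rightmost offset so that earlier offsets stay valid
    # while the string grows.
    ordered = sorted(insertions, key=lambda pair: pair[0], reverse=True)
    for offset, marker in ordered:
        text = text[:offset] + marker + text[offset:]
    return text


flagged = insert_at(
    "The little cat drinks milk.",
    [(0, "(flag)"), (3, "(flag)"), (4, "(flag)"), (10, "(flag)")],
)
print(flagged)
```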

Put open and close tags:

In [5]:
string = "The little cat drinks milk."
inliner = Standoff2Inline()
inliner.add((0, "<sent>"), (26, "</sent>"))
inliner.add((0, "<gn>"), (13, "</gn>"))
inliner.add((11, "<noun>"), (13, "</noun>"))
inliner.add((22, "<noun>"), (25, "</noun>"))
inliner.add((0, "<det>"), (2, "</det>"))
inliner.apply(string)
Out[5]:
'<sent><gn><det>The</det> little <noun>cat</noun></gn> drinks <noun>milk</noun>.</sent>'

You can also use a predefined kind of annotation, such as XML. Note how the tag name and attributes are specified:

In [6]:
string = "The little cat drinks milk."
inliner = Standoff2Inline(kind='xml')
inliner.add((0, ('sent', dict(foo="bar", truc="chose"))), 26)
inliner.add((0, 'gn'), 13)
inliner.add((11, ("noun", dict())), 13)
inliner.add((22, "noun"), 25)
inliner.add((0, "det"), 2)
inliner.apply(string)
Out[6]:
'<sent foo="bar" truc="chose"><gn><det>The</det> little <noun>cat</noun></gn> drinks <noun>milk</noun>.</sent>'
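
For illustration, building an XML opening tag from a tag name and a dictionary of attributes could look like the sketch below. The helpers `open_tag` and `close_tag` are hypothetical and only mirror the output shown above; how the module builds its tags internally is not documented here. Attribute values are quoted with `quoteattr` from the standard library.

```python
from xml.sax.saxutils import quoteattr


def open_tag(name, attrs=None):
    """Build an opening tag such as <sent foo="bar"> from a name and a dict."""
    parts = [name]
    for key, value in (attrs or {}).items():
        # quoteattr adds the surrounding quotes and escapes special characters.
        parts.append(f"{key}={quoteattr(str(value))}")
    return "<" + " ".join(parts) + ">"


def close_tag(name):
    """Build the matching closing tag."""
    return f"</{name}>"


sent_open = open_tag("sent", dict(foo="bar", truc="chose"))
print(sent_open, close_tag("sent"))
```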

You can also use tokens instead of a string:

In [7]:
tokens = "The little cat drinks milk .".split()
inliner = Standoff2Inline()
inliner.add((0, "<sent>"), (5, "</sent>"))
inliner.add((0, "<gn>"), (2, "</gn>"))
inliner.add((2, "<noun>"), (2, "</noun>"))
inliner.add((4, "<noun>"), (4, "</noun>"))
inliner.add((0, "<det>"), (0, "</det>"))
inliner.apply(tokens=tokens)
Out[7]:
'<sent><gn><det>The</det> little <noun>cat</noun></gn> drinks <noun>milk</noun> .</sent> '
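
Token positions and character positions describe the same spans. As a rough sketch, assuming the tokens are joined by a single space (as in the output above), the character offsets of each token can be computed as follows. The helper `token_offsets` is hypothetical and not part of the module.

```python
def token_offsets(tokens, sep=" "):
    """Return (start, end) character offsets for each token, end inclusive."""
    offsets = []
    position = 0
    for token in tokens:
        offsets.append((position, position + len(token) - 1))
        # Skip the token itself and the separator that follows it.
        position += len(token) + len(sep)
    return offsets


tokens = "The little cat drinks milk .".split()
offsets = token_offsets(tokens)
print(offsets[2])  # offsets of "cat"
```

With these offsets, token 2 ("cat") corresponds to characters 11 to 13, which matches the character-based examples.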

What happens when two annotations have the same position? They are nested in the order in which they were added:

  • first is outer
  • last is inner
In [8]:
string = "The little cat drinks milk."
inliner = Standoff2Inline()
inliner.add((4, "<outer>"), (9, "</outer>"))
inliner.add((4, "<inner>"), (9, "</inner>"))
inliner.apply(string)
Out[8]:
'The <outer><inner>little</inner></outer> cat drinks milk.'
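
The nesting rule can be sketched in plain Python. The function below is an illustration only, not the module's implementation: the name `inline` and the tuple layout `(start, end, open_tag, close_tag)` are assumptions. Longer spans wrap shorter ones that start at the same offset, and spans with identical offsets keep the order in which they were given.

```python
def inline(text, spans):
    """Insert tag pairs into text; offsets are 0-based, end inclusive."""
    opens, closes = {}, {}
    # Stable sort: longer spans first for a given start, ties keep input order.
    for start, end, open_tag, close_tag in sorted(
        spans, key=lambda span: (span[0], -span[1])
    ):
        opens.setdefault(start, []).append(open_tag)
        # Inner spans come later in the sort, so their closing tag goes first.
        closes.setdefault(end, []).insert(0, close_tag)
    pieces = []
    for index, char in enumerate(text):
        pieces.extend(opens.get(index, []))
        pieces.append(char)
        pieces.extend(closes.get(index, []))
    return "".join(pieces)


nested = inline(
    "The little cat drinks milk.",
    [(4, 9, "<outer>", "</outer>"), (4, 9, "<inner>", "</inner>")],
)
print(nested)
```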

You can iterate over the result:

In [9]:
string = "The little cat drinks milk."
inliner = Standoff2Inline()
inliner.add((0, "<sent>"), (26, "</sent>"))
inliner.add((0, "<gn>"), (13, "</gn>"))
inliner.add((11, "<noun>"), (13, "</noun>"))
inliner.add((22, "<noun>"), (25, "</noun>"))
inliner.add((0, "<det>"), (2, "</det>"))
for kind, chunk in inliner.iter_result(string):
    print(kind, '"%s"' % chunk)
prefix "<sent>"
prefix "<gn>"
prefix "<det>"
string "The"
suffix "</det>"
string " little "
prefix "<noun>"
string "cat"
suffix "</noun>"
suffix "</gn>"
string " drinks "
prefix "<noun>"
string "milk"
suffix "</noun>"
string "."
suffix "</sent>"
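
Iterating is useful when the text and the tags need different treatment. The sketch below escapes only the text chunks and leaves the tags untouched. The `events` list is hypothetical data written by hand in the same `(kind, text)` shape as the output above; it is not produced by the module here.

```python
from html import escape

# Hypothetical events, in the (kind, text) shape printed above.
events = [
    ("prefix", "<noun>"),
    ("string", "cat & mouse"),
    ("suffix", "</noun>"),
]

# Escape the text chunks only; prefixes and suffixes are kept as they are.
rendered = "".join(
    escape(text) if kind == "string" else text
    for kind, text in events
)
print(rendered)
```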

If you pass tokens, you may want to get tokens back rather than strings. In that case, use the return_tokens parameter:

In [10]:
tokens = "The very little cat drinks milk .".split()
inliner = Standoff2Inline()
inliner.add((0, "<sent>"), (6, "</sent>"))
inliner.add((0, "<gn>"), (3, "</gn>"))
inliner.add((3, "<noun>"), (3, "</noun>"))
inliner.add((5, "<noun>"), (5, "</noun>"))
inliner.add((0, "<det>"), (0, "</det>"))
for kind, chunk in inliner.iter_result(tokens=tokens, return_tokens=True):
    print(kind, chunk)
prefix <sent>
prefix <gn>
prefix <det>
string ['The']
suffix </det>
string ['very', 'little']
prefix <noun>
string ['cat']
suffix </noun>
suffix </gn>
string ['drinks']
prefix <noun>
string ['milk']
suffix </noun>
string ['.']
suffix </sent>

Highlighter recipes

In [1]:
from standoff2inline import Highlighter, highlight

Use the Highlighter class to set prefixes and suffixes common to several annotations. You can define several Highlighter instances and pass them all to the highlight() function to merge them.

Basics

For example, to put the nouns between square brackets, you just need to define the opening and closing brackets in the constructor:

In [12]:
hl = Highlighter(prefix="[", suffix="]")
hl.add_mark(4, 4)
hl.add_mark(2, 2)
tokens = "The little cat drinks milk ...".split()
highlight(tokens, hl)
Out[12]:
'The little [cat] drinks [milk] ...'

You can set custom prefixes and suffixes for each annotation, though:

In [13]:
hl = Highlighter()
hl.add_mark(4, 4, "(", ")")
hl.add_mark(2, 2, "<", ">")
tokens = "The little cat drinks milk ...".split()
highlight(tokens, hl)
Out[13]:
'The little <cat> drinks (milk) ...'

And you can combine both, for example a common prefix with different suffixes for each annotation:

In [14]:
hl = Highlighter(prefix="[")
hl.add_mark(4, 4, suffix=")")
hl.add_mark(2, 2, suffix=">")
tokens = "The little cat drinks milk ...".split()
highlight(tokens, hl)
Out[14]:
'The little [cat> drinks [milk) ...'

Or the other way around:

In [15]:
hl = Highlighter(suffix="]")
hl.add_mark(4, 4, prefix="(")
hl.add_mark(2, 2, prefix="<")
tokens = "The little cat drinks milk ...".split()
highlight(tokens, hl)
Out[15]:
'The little <cat] drinks (milk] ...'

Or with XML-like tags:

In [23]:
hl = Highlighter(suffix="</span>")
hl.add_mark(0, 5, '<span type="sent">')
hl.add_mark(0, 2, '<span type="gn">')
hl.add_mark(4, 4 ,'<span type="noun">')
hl.add_mark(2, 2, '<span type="noun">')
hl.add_mark(0, 0, '<span type="det">')
tokens = "The little cat drinks milk ...".split()
highlight(tokens, hl)
Out[23]:
'<span type="sent"><span type="gn"><span type="det">The</span> little <span type="noun">cat</span></span> drinks <span type="noun">milk</span> ...</span>'

Note that, for now, you can't set a default affix in the constructor and then override it for individual marks:

In [5]:
# -- This doesn't work --
hl = Highlighter(prefix="[", suffix="]") # don't define `suffix` here, or set it to None
hl.add_mark(4, 4, suffix=")")
hl.add_mark(2, 2, suffix=">")
tokens = "The little cat drinks milk ...".split()
highlight(tokens, hl)
Out[5]:
'The little [cat) drinks [milk] ...'

Prefixes and suffixes may be given as lists (matching the number of tokens):

In [16]:
hl = Highlighter(prefix="[", suffix="]D ]A ]N ]V ]M ]P".split())
tokens = "The little cat drinks milk ...".split()
hl.add_marks((x, x) for x in range(len(tokens)))
highlight(tokens, hl)
Out[16]:
'[The]D [little]A [cat]N [drinks]V [milk]M [...]P'
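
The pairing of tokens with a list of suffixes can be pictured with a plain `zip`. This is only a sketch of the idea, written without the module; the variable names are illustrative.

```python
tokens = "The little cat drinks milk ...".split()
suffixes = "]D ]A ]N ]V ]M ]P".split()

# One suffix per token is required for the pairing to make sense.
if len(suffixes) != len(tokens):
    raise ValueError("expected one suffix per token")

labelled = " ".join(
    f"[{token}{suffix}" for token, suffix in zip(tokens, suffixes)
)
print(labelled)
```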

Gaps

You may want to keep only a context window around your annotations and remove long chunks of text between two annotations. Use the following parameters of highlight():

  • margin: the margin (left and right) to keep, in characters or tokens (depending on the char parameter),
  • max_gap: the maximum number of characters or tokens allowed between two annotations.
In [24]:
hl = Highlighter(suffix="</span>")
hl.add_mark(2, 2, '<span type="noun">')
hl.add_mark(12, 12 ,'<span type="noun">')
hl.add_mark(0, 0, '<span type="det">')
tokens = "The little cat who played yesterday with my " \
    "neighbor 's children drinks " \
    "milk ...  And the next sentence ...".split()
highlight(tokens, hl, margin=2, max_gap=4)
Out[24]:
'<span type="det">The</span> little <span type="noun">cat</span> who played [...] children drinks <span type="noun">milk</span> ... And [...]'
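
The plain-Python sketch below shows one possible reading of margin and max_gap. It is not the module's implementation: the helper `cut_gaps` is hypothetical, and its rule (cut any unannotated run longer than max_gap, keeping margin tokens next to each annotation) is an assumption chosen because it reproduces the token sequence of the output above, without the tags.

```python
def cut_gaps(tokens, marked, margin=2, max_gap=4, filler="[...]"):
    """Replace long unannotated runs with a filler, keeping a margin."""
    marked = set(marked)
    out = []
    i, n = 0, len(tokens)
    while i < n:
        if i in marked:
            out.append(tokens[i])
            i += 1
            continue
        # Find the end (exclusive) of the current unannotated run.
        j = i
        while j < n and j not in marked:
            j += 1
        run = tokens[i:j]
        # Keep a margin only on the sides that touch an annotation.
        left = margin if i > 0 else 0
        right = margin if j < n else 0
        if len(run) <= max_gap or left + right >= len(run):
            out.extend(run)
        else:
            out.extend(run[:left])
            out.append(filler)
            out.extend(run[len(run) - right:])
        i = j
    return " ".join(out)


tokens = ("The little cat who played yesterday with my "
          "neighbor 's children drinks "
          "milk ...  And the next sentence ...").split()
shortened = cut_gaps(tokens, marked=[0, 2, 12], margin=2, max_gap=4)
print(shortened)
```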

Several highlighters

In [25]:
hl1 = Highlighter(suffix="</span>")
hl1.add_mark(0, 5, '<span type="sent">')
hl1.add_mark(0, 2, '<span type="gn">')
hl1.add_mark(4, 4 ,'<span type="noun">')
hl1.add_mark(2, 2, '<span type="noun">')
hl1.add_mark(0, 0, '<span type="det">')
hl2 = Highlighter(prefix="[", suffix="]")
hl2.add_mark(2, 2)
hl2.add_mark(4, 4)
tokens = "The little cat drinks milk ...".split()
highlight(tokens, hl1, hl2)
Out[25]:
'<span type="sent"><span type="gn"><span type="det">The</span> little <span type="noun">[cat]</span></span> drinks <span type="noun">[milk]</span> ...</span>'

Styled highlighters

You can apply predefined styles to a highlighter to get HTML <span> elements. You can combine prefixes and suffixes set in the constructor with set_style(), but call set_style() after setting the other prefixes and suffixes.

In [2]:
from standoff2inline import Highlighter, highlight

hl_det = Highlighter()
hl_det.add_mark(0, 0)
hl_det.set_style(color="red")

hl_noun = Highlighter()
hl_noun.add_mark(2, 2)
hl_noun.add_mark(4, 4)
hl_noun.set_style(underline=True)

hl_verb = Highlighter(prefix="[", suffix="]")
hl_verb.add_mark(3, 3)
hl_verb.set_style(bold=True, italic=True)

tokens = "The little cat drinks milk ...".split()
res = highlight(tokens, hl_det, hl_noun, hl_verb)
print(res)

from IPython.display import display, HTML
display(HTML(res))
<span style="color: red; ">The</span> little <span style="text-decoration: underline; ">cat</span> <span style="font-weight: bold; font-style: italic; ">[drinks]</span> <span style="text-decoration: underline; ">milk</span> ...
The little cat [drinks] milk ...
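
To show how style options map to the inline CSS seen in the output above, here is a small sketch. The helper `style_span` is hypothetical; the declarations are copied from the printed result, and the order in which several declarations are combined is an assumption.

```python
def style_span(color=None, bold=False, italic=False, underline=False):
    """Build an opening <span> tag with inline CSS for the given options."""
    declarations = []
    if color is not None:
        declarations.append(f"color: {color}; ")
    if bold:
        declarations.append("font-weight: bold; ")
    if italic:
        declarations.append("font-style: italic; ")
    if underline:
        declarations.append("text-decoration: underline; ")
    return '<span style="' + "".join(declarations) + '">'


verb_open = style_span(bold=True, italic=True)
print(verb_open + "drinks" + "</span>")
```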

Using character positions rather than token positions

Use the char parameter:

In [27]:
hl_det = Highlighter()
hl_det.add_mark(0, 2)
hl_det.set_style(color="red")
hl_noun = Highlighter()
hl_noun.add_mark(11, 13)
hl_noun.add_mark(22, 25)
hl_noun.set_style(underline=True)
hl_verb = Highlighter()
hl_verb.add_mark(15, 20)
hl_verb.set_style(bold=True, italic=True)
string = "The little cat drinks milk..."
res = highlight(string, hl_det, hl_noun, hl_verb, char=True)

print(res)

from IPython.display import display, HTML
display(HTML(res))
<span style="color: red; ">The</span> little <span style="text-decoration: underline; ">cat</span> <span style="font-weight: bold; font-style: italic; ">drinks</span> <span style="text-decoration: underline; ">milk</span>...
The little cat drinks milk...
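
Character positions like those used above can be obtained with the standard `re` module. The sketch below is independent of the module and assumes an inclusive end offset, so the exclusive end returned by `Match.end()` is reduced by one.

```python
import re

string = "The little cat drinks milk..."

# (word, start, end) with 0-based offsets and an inclusive end.
word_spans = [
    (match.group(), match.start(), match.end() - 1)
    for match in re.finditer(r"\w+", string)
]
for word, start, end in word_spans:
    print(word, start, end)
```

The offsets found for "The", "cat", "drinks" and "milk" are the ones passed to add_mark() in the example above.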

Using a highlighter to insert a line break

In [29]:
hl1 = Highlighter(suffix="</span>")
hl1.add_mark(4, 4 ,'<span type="noun">')
hl1.add_mark(2, 2, '<span type="noun">')
hl2 = Highlighter(prefix="<br />")
hl2.add_mark(6, 6)
tokens = "The little cat drinks milk ...  And is happy".split()
highlight(tokens, hl1, hl2)
Out[29]:
'The little <span type="noun">cat</span> drinks <span type="noun">milk</span> ... <br />And is happy'