standoff2inline module user guide
Converting standoff annotations to inline annotations.
For example, in the sentence:
The little cat drinks milk.
you know that the third word, spanning the 12th to 14th characters (positions 11 to 13 when counting from 0), is a noun. You may want to surround it with tags such as <noun> and </noun>:
The little <noun>cat</noun> drinks milk.
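To make the position arithmetic concrete, here is a minimal sketch in plain Python that does not use the module at all. It only shows how the 0-based, inclusive positions 11 and 13 relate to the word "cat"; the variable names are illustrative.

```python
sentence = "The little cat drinks milk."

# 0-based offset of the first character of the word.
start = sentence.index("cat")
# 0-based offset of its last character (inclusive).
end = start + len("cat") - 1

# The opening tag goes before `start`, the closing tag after `end`.
tagged = (
    sentence[:start]
    + "<noun>"
    + sentence[start:end + 1]
    + "</noun>"
    + sentence[end + 1:]
)
print(start, end)
print(tagged)
```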
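This module offers classes and functions to:
- convert standoff annotations to inline annotations,
- highlight parts of a text, for example with HTML <span> tags,
- cut long passages without annotations and replace them with a marker such as [...].
Adding annotations from character positions: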
from standoff2inline import Standoff2Inline
string = "The little cat drinks milk."
inliner = Standoff2Inline()
inliner.add((0, "<sent>"), (26, "</sent>"))
inliner.add((0, "<gn>"), (13, "</gn>"))
inliner.add((11, "<noun>"), (13, "</noun>"))
inliner.add((22, "<noun>"), (25, "</noun>"))
inliner.add((0, "<det>"), (2, "</det>"))
inliner.apply(string)
Adding annotations from token positions:
from standoff2inline import Standoff2Inline
tokens = "The little cat drinks milk .".split()
inliner = Standoff2Inline()
inliner.add((0, "<sent>"), (5, "</sent>"))
inliner.add((0, "<gn>"), (2, "</gn>"))
inliner.add((2, "<noun>"), (2, "</noun>"))
inliner.add((4, "<noun>"), (4, "</noun>"))
inliner.add((0, "<det>"), (0, "</det>"))
inliner.apply(tokens=tokens)
Using the Highlighter class to highlight different parts of speech in different styles:
from standoff2inline import Highlighter, highlight
hl_det = Highlighter()
hl_det.add_mark(0, 0)
hl_det.set_style(color="red")
hl_noun = Highlighter()
hl_noun.add_mark(2, 2)
hl_noun.add_mark(4, 4)
hl_noun.set_style(underline=True)
hl_verb = Highlighter(prefix="[", suffix="]")
hl_verb.add_mark(3, 3)
hl_verb.set_style(bold=True, italic=True)
tokens = "The little cat drinks milk ...".split()
res = highlight(tokens, hl_det, hl_noun, hl_verb)
print(res)
from IPython.display import display, HTML
display(HTML(res))
Cut long passages without annotations (and replace by e.g. "[...]"):
from standoff2inline import Highlighter, highlight
hl = Highlighter(suffix="</span>")
hl.add_mark(2, 2, '<span type="noun">')
hl.add_mark(12, 12, '<span type="noun">')
hl.add_mark(0, 0, '<span type="det">')
tokens = "The little cat who played yesterday with my " \
"neighbor 's children drinks " \
"milk ... And the next sentence ...".split()
highlight(tokens, hl, margin=2, max_gap=4)
from standoff2inline import Standoff2Inline
Add annotations at some positions:
string = "The little cat drinks milk."
inliner = Standoff2Inline()
inliner.add((0, "(flag)"))
inliner.add((3, "(flag)"))
inliner.add((4, "(flag)"))
inliner.add((10, "(flag)"))
inliner.apply(string)
Put open and close tags:
string = "The little cat drinks milk."
inliner = Standoff2Inline()
inliner.add((0, "<sent>"), (26, "</sent>"))
inliner.add((0, "<gn>"), (13, "</gn>"))
inliner.add((11, "<noun>"), (13, "</noun>"))
inliner.add((22, "<noun>"), (25, "</noun>"))
inliner.add((0, "<det>"), (2, "</det>"))
inliner.apply(string)
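The idea behind open and close tags can be sketched in plain Python. This is not the module's implementation: `inline` is a hypothetical helper, and the tie-breaking rule (longer spans open first and close last) is an assumption chosen so that the tags nest properly. End positions are inclusive, as in the example above.

```python
def inline(text, spans):
    """Insert tags into `text`.

    `spans` is an iterable of (start, end, open_tag, close_tag),
    with 0-based positions and an inclusive `end`.
    """
    events = []
    for order, (start, end, open_tag, close_tag) in enumerate(spans):
        length = end - start
        # Opening tags: longer spans first, so they end up outermost.
        events.append((start, 1, -length, order, open_tag))
        # Closing tags go after the last character; shorter spans close
        # first. The 0 makes closes sort before opens at one position.
        events.append((end + 1, 0, length, -order, close_tag))
    events.sort()

    pieces = []
    cursor = 0
    for position, _, _, _, tag in events:
        pieces.append(text[cursor:position])
        pieces.append(tag)
        cursor = position
    pieces.append(text[cursor:])
    return "".join(pieces)


string = "The little cat drinks milk."
spans = [
    (0, 26, "<sent>", "</sent>"),
    (0, 13, "<gn>", "</gn>"),
    (11, 13, "<noun>", "</noun>"),
    (22, 25, "<noun>", "</noun>"),
    (0, 2, "<det>", "</det>"),
]
result = inline(string, spans)
print(result)
```

Sorting all insertions as events keeps the sketch short; the module itself may order tags at the same position differently.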
You can also use a predefined kind of markup. Example with XML (note how the tag name and attributes are specified):
string = "The little cat drinks milk."
inliner = Standoff2Inline(kind='xml')
inliner.add((0, ('sent', dict(foo="bar", truc="chose"))), 26)
inliner.add((0, 'gn'), 13)
inliner.add((11, ("noun", dict())), 13)
inliner.add((22, "noun"), 25)
inliner.add((0, "det"), 2)
inliner.apply(string)
You can also use tokens instead of a string:
tokens = "The little cat drinks milk .".split()
inliner = Standoff2Inline()
inliner.add((0, "<sent>"), (5, "</sent>"))
inliner.add((0, "<gn>"), (2, "</gn>"))
inliner.add((2, "<noun>"), (2, "</noun>"))
inliner.add((4, "<noun>"), (4, "</noun>"))
inliner.add((0, "<det>"), (0, "</det>"))
inliner.apply(tokens=tokens)
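Token positions and character positions describe the same spans. The following sketch, independent of the module, maps each token to its inclusive character offsets, assuming the tokens are joined with a single space. `token_offsets` is a hypothetical helper name.

```python
def token_offsets(tokens, separator=" "):
    """Return the (start, end) character offsets of each token.

    Offsets are 0-based and `end` is inclusive. The tokens are assumed
    to be joined with `separator`.
    """
    offsets = []
    cursor = 0
    for token in tokens:
        start = cursor
        end = start + len(token) - 1
        offsets.append((start, end))
        # Skip the token and the separator that follows it.
        cursor = end + 1 + len(separator)
    return offsets


tokens = "The little cat drinks milk .".split()
offsets = token_offsets(tokens)
print(offsets)
```

With these offsets, the token span (2, 2) for "cat" corresponds to the character span (11, 13).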
What happens when two annotations have the same position? The following example shows the order in which they appear in the resulting string:
string = "The little cat drinks milk."
inliner = Standoff2Inline()
inliner.add((4, "<outer>"), (9, "</outer>"))
inliner.add((4, "<inner>"), (9, "</inner>"))
inliner.apply(string)
You can iterate over the result:
string = "The little cat drinks milk."
inliner = Standoff2Inline()
inliner.add((0, "<sent>"), (26, "</sent>"))
inliner.add((0, "<gn>"), (13, "</gn>"))
inliner.add((11, "<noun>"), (13, "</noun>"))
inliner.add((22, "<noun>"), (25, "</noun>"))
inliner.add((0, "<det>"), (2, "</det>"))
for kind, text in inliner.iter_result(string):
    print(kind, '"%s"' % text)
If you pass tokens, you may want to get tokens back rather than a string; in that case, use the return_tokens parameter:
tokens = "The very little cat drinks milk .".split()
inliner = Standoff2Inline()
inliner.add((0, "<sent>"), (6, "</sent>"))
inliner.add((0, "<gn>"), (3, "</gn>"))
inliner.add((3, "<noun>"), (3, "</noun>"))
inliner.add((5, "<noun>"), (5, "</noun>"))
inliner.add((0, "<det>"), (0, "</det>"))
for kind, string in inliner.iter_result(tokens=tokens, return_tokens=True):
    print(kind, string)
from standoff2inline import Highlighter, highlight
Use the Highlighter class to set prefixes and suffixes common to several annotations. You can define several Highlighter instances and pass them all to the highlight() function to merge them.
For example, to put the nouns between square brackets, you just need to define the opening and closing brackets in the constructor:
hl = Highlighter(prefix="[", suffix="]")
hl.add_mark(4, 4)
hl.add_mark(2, 2)
tokens = "The little cat drinks milk ...".split()
highlight(tokens, hl)
You can set custom prefixes and suffixes for each annotation, though:
hl = Highlighter()
hl.add_mark(4, 4, "(", ")")
hl.add_mark(2, 2, "<", ">")
tokens = "The little cat drinks milk ...".split()
highlight(tokens, hl)
And you can combine both, for example a common prefix with different suffixes for each annotation:
hl = Highlighter(prefix="[")
hl.add_mark(4, 4, suffix=")")
hl.add_mark(2, 2, suffix=">")
tokens = "The little cat drinks milk ...".split()
highlight(tokens, hl)
Or the other way around:
hl = Highlighter(suffix="]")
hl.add_mark(4, 4, prefix="(")
hl.add_mark(2, 2, prefix="<")
tokens = "The little cat drinks milk ...".split()
highlight(tokens, hl)
Or with xmlish tags:
hl = Highlighter(suffix="</span>")
hl.add_mark(0, 5, '<span type="sent">')
hl.add_mark(0, 2, '<span type="gn">')
hl.add_mark(4, 4, '<span type="noun">')
hl.add_mark(2, 2, '<span type="noun">')
hl.add_mark(0, 0, '<span type="det">')
tokens = "The little cat drinks milk ...".split()
highlight(tokens, hl)
Note that you can't, for now, put a default affix and then change it:
# -- This doesn't work --
hl = Highlighter(prefix="[", suffix="]") # leave `suffix` undefined here, or set it to None
hl.add_mark(4, 4, suffix=")")
hl.add_mark(2, 2, suffix=">")
tokens = "The little cat drinks milk ...".split()
highlight(tokens, hl)
Prefixes and suffixes may be given as a list (matching the number of tokens):
hl = Highlighter(prefix="[", suffix="]D ]A ]N ]V ]M ]P".split())
tokens = "The little cat drinks milk ...".split()
hl.add_marks((x, x) for x in range(len(tokens)))
highlight(tokens, hl)
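The effect of shared and per-token affixes can be sketched without the module. `wrap_tokens` below is a hypothetical helper, not the module's API; it assumes that a list affix is indexed by the token it is attached to, and that tokens are joined with single spaces. The module's actual output may differ.

```python
def wrap_tokens(tokens, marks, prefix="", suffix=""):
    """Wrap each marked (start, end) token span with a prefix and suffix.

    An affix is either one string shared by all marks, or a list with
    one string per token (an assumption made for this sketch).
    """
    def pick(affix, index):
        return affix if isinstance(affix, str) else affix[index]

    out = list(tokens)
    for start, end in marks:
        out[start] = pick(prefix, start) + out[start]
        out[end] = out[end] + pick(suffix, end)
    return " ".join(out)


tokens = "The little cat drinks milk ...".split()

# One shared prefix and suffix, applied to the two nouns.
shared = wrap_tokens(tokens, [(4, 4), (2, 2)], prefix="[", suffix="]")

# One shared prefix, one suffix per token.
suffixes = "]D ]A ]N ]V ]M ]P".split()
per_token = wrap_tokens(
    tokens, [(x, x) for x in range(len(tokens))], prefix="[", suffix=suffixes
)
print(shared)
print(per_token)
```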
You may want to keep only a context window around your annotations and remove long chunks of text between two annotations. Use the following parameters of highlight():
- margin: the margin (left and right) to keep, in characters or tokens (depending on the char parameter),
- max_gap: the maximum number of characters or tokens allowed between two annotations.
hl = Highlighter(suffix="</span>")
hl.add_mark(2, 2, '<span type="noun">')
hl.add_mark(12, 12, '<span type="noun">')
hl.add_mark(0, 0, '<span type="det">')
tokens = "The little cat who played yesterday with my " \
"neighbor 's children drinks " \
"milk ... And the next sentence ...".split()
highlight(tokens, hl, margin=2, max_gap=4)
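One possible reading of margin and max_gap is sketched below in plain Python. The exact rule is an assumption, not taken from the module: a stretch of unmarked tokens is kept whole if it has at most max_gap tokens; otherwise only the margin tokens next to a mark are kept and the rest is replaced by a filler. `cut_gaps` and `filler` are hypothetical names.

```python
def cut_gaps(tokens, marked, margin=2, max_gap=4, filler="[...]"):
    """Drop long unmarked stretches, keeping `margin` tokens around marks."""
    marked = sorted(set(marked))
    keep = [False] * len(tokens)
    for index in marked:
        keep[index] = True

    # Stretches before the first mark, between marks, and after the last.
    bounds = [-1] + marked + [len(tokens)]
    for left, right in zip(bounds, bounds[1:]):
        gap = range(left + 1, right)
        if len(gap) <= max_gap:
            selected = list(gap)
        else:
            # Keep a margin only on the sides that touch a mark.
            head = list(gap[:margin]) if left >= 0 else []
            tail_start = max(0, len(gap) - margin)
            tail = list(gap[tail_start:]) if right < len(tokens) else []
            selected = head + tail
        for index in selected:
            keep[index] = True

    out = []
    skipping = False
    for index, token in enumerate(tokens):
        if keep[index]:
            out.append(token)
            skipping = False
        elif not skipping:
            # Emit the filler once per removed stretch.
            out.append(filler)
            skipping = True
    return out


tokens = ("The little cat who played yesterday with my "
          "neighbor 's children drinks "
          "milk ... And the next sentence ...").split()
shortened = " ".join(cut_gaps(tokens, [0, 2, 12], margin=2, max_gap=4))
print(shortened)
```

The sketch leaves out the tags themselves to keep the focus on which tokens survive.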
hl1 = Highlighter(suffix="</span>")
hl1.add_mark(0, 5, '<span type="sent">')
hl1.add_mark(0, 2, '<span type="gn">')
hl1.add_mark(4, 4, '<span type="noun">')
hl1.add_mark(2, 2, '<span type="noun">')
hl1.add_mark(0, 0, '<span type="det">')
hl2 = Highlighter(prefix="[", suffix="]")
hl2.add_mark(2, 2)
hl2.add_mark(4, 4)
tokens = "The little cat drinks milk ...".split()
highlight(tokens, hl1, hl2)
You may set predefined styles on a highlighter to get HTML <span> elements. Note that you can combine prefixes and suffixes given in the constructor with set_style(), but apply the style after setting the other prefixes and suffixes.
from standoff2inline import Highlighter, highlight
hl_det = Highlighter()
hl_det.add_mark(0, 0)
hl_det.set_style(color="red")
hl_noun = Highlighter()
hl_noun.add_mark(2, 2)
hl_noun.add_mark(4, 4)
hl_noun.set_style(underline=True)
hl_verb = Highlighter(prefix="[", suffix="]")
hl_verb.add_mark(3, 3)
hl_verb.set_style(bold=True, italic=True)
tokens = "The little cat drinks milk ...".split()
res = highlight(tokens, hl_det, hl_noun, hl_verb)
print(res)
from IPython.display import display, HTML
display(HTML(res))
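For readers curious about what a style might translate to, here is a minimal sketch that builds an HTML <span> with inline CSS from the same keyword names used above. The exact markup produced by set_style() is not documented here, so the CSS properties and the helper name `style_span` are assumptions.

```python
import html


def style_span(text, color=None, bold=False, italic=False, underline=False):
    """Wrap `text` in a <span> with inline CSS (hypothetical helper)."""
    rules = []
    if color:
        rules.append(f"color: {color}")
    if bold:
        rules.append("font-weight: bold")
    if italic:
        rules.append("font-style: italic")
    if underline:
        rules.append("text-decoration: underline")

    escaped = html.escape(text)
    if not rules:
        return escaped
    css = "; ".join(rules)
    return f'<span style="{css}">{escaped}</span>'


det = style_span("The", color="red")
verb = style_span("drinks", bold=True, italic=True)
print(det)
print(verb)
```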
To mark character positions rather than token positions, use the char parameter:
hl_det = Highlighter()
hl_det.add_mark(0, 2)
hl_det.set_style(color="red")
hl_noun = Highlighter()
hl_noun.add_mark(11, 13)
hl_noun.add_mark(22, 25)
hl_noun.set_style(underline=True)
hl_verb = Highlighter()
hl_verb.add_mark(15, 20)
hl_verb.set_style(bold=True, italic=True)
string = "The little cat drinks milk..."
res = highlight(string, hl_det, hl_noun, hl_verb, char=True)
print(res)
from IPython.display import display, HTML
display(HTML(res))
hl1 = Highlighter(suffix="</span>")
hl1.add_mark(4, 4, '<span type="noun">')
hl1.add_mark(2, 2, '<span type="noun">')
hl2 = Highlighter(prefix="<br />")
hl2.add_mark(6, 6)
tokens = "The little cat drinks milk ... And is happy".split()
highlight(tokens, hl1, hl2)