Parsing
Before we get into semantics, we need to learn about tokens and parse trees.
A tokenizer breaks a sentence up into smaller parts, tokens, which form the input to the parser. As an example sentence let’s take “John loves Mary”. A tokenizer could cut up the sentence into the tokens John, loves, and Mary. However, this would limit the analysis to the word level, and would require another analyser to analyse the morphemes of a word. This system combines parsing and morphological analysis and to that end it splits the sentence into characters: J o h n l o v e s M a r y `. Technically, the character is the token, but since there’s not much use in that, I call a token a sequence of characters: either a fixed character string or the match of a regular expression.
All whitespace sequences of space, tabs and newlines are replaced by a single space.
A parse tree is a tree-representation of a sentence. This syntactic representation is based on the hierarchical nature of language: a sentence is a compound of phrases, and these phrases are themselves composed of other phrases. A parser parses a sentence to form such a tree. It uses rewrite rules to transform the root node “s” into the branches “np” and “vp”. These branches are then themselves rewritten into new branches. It’s a recursive process that ends in the words of the sentence.
To see this in action, copy this sample script and run it.
from richard.core.System import System
from richard.entity.SentenceRequest import SentenceRequest
from richard.processor.parser.BasicParser import BasicParser
from richard.block.FindOne import FindOne
def parser_demo():
grammar = [
{ "syn": "s(V) -> np(E1) vp(V, E1)" },
{ "syn": "vp(V, E1) -> verb(V) np(E1)" },
{ "syn": "np(E1) -> noun(E1)" },
{ "syn": "noun(E1) -> proper_noun(E1)" },
{ "syn": "proper_noun(E1) -> 'john'" },
{ "syn": "proper_noun(E1) -> 'mary'" },
{ "syn": "verb(V) -> 'loves'" },
]
parser = BasicParser(grammar)
system = System(
model=model,
input_pipeline=[
FindOne(parser)
],
output_generator=generator
)
request = SentenceRequest("John loves Mary")
parse_tree = system.enter(request)
print(parse_tree)
if __name__ == '__main__':
parser_demo()
The variables E1 and V that you see after each category within brackets, help to integrate the semantics of child nodes with their parent. You will see how they are important when we get to semantics.
Something about the choice of the rewrite rules that form the grammar: these rules, like vp -> verb np, are not the only ones possible. Linguistic frameworks have many ways of decomposing a sentence. Read Wikipedia on Grammar for more information. I have not yet found a grammar that fits all purposes, and I will be using slightly different grammar rules throughout this documentation. You may look at X-bar theory as matches my approach best. But other types of grammar may be used as well.
An important characteristic of a grammar for a semantic parser is that there are many rewrite rules at the sentence level. The results that a sentence produces largely depend on the top-level construct of a sentence.
This library uses Earley’s parser, which is fast and efficient, and doesn’t fall into infinite recursion with left-recursive rules (i.e. a -> a b). The algorithm is adapted to use spans of tokens (here: characters), rather than work with individual tokens.
You may be missing a lexicon in this example. A lexicon is a collection of all the individual words of a language, together with their meanings. This library integrates the lexicon in the grammar to simplify the definition of idioms. An idiom is a group of words that does not form a phrase but contains a specific meaning. For example: “How many countries have population above 100 million?”
When you run the script, you should see the following parse tree representation.
s
+- np
| +- noun
| +- proper_noun
| +- john 'John'
+- vp
+- verb
| +- loves 'loves'
+- np
+- noun
+- proper_noun
+- mary 'Mary'
character strings and regular expressions
A character string like ‘mary’ matches a sequence of 4 tokens: M, a, r and y.
In stead of character strings, you can also use a regular expression like /\w+/ or /\d+/.
{
"syn": "proper_noun(E1) -> /\w+/",
"sem": lambda token: [('resolve_name', token, E1)]
}
Token delimiters
The default delimiter between two phrases or tokens is the space (” “). However, when no space is expected, use the + delimiter to glue together two tokens:
{
"syn": "s(E1) -> 'does' np(E1) vp_nosub_obj(E1)+'?'",
}
The + delimiter is essential in analysing morphemes. In this example the plural “s” is simply discarded:
{
"syn": "common_noun(E1) -> common_noun(E1)+'s'",
"sem": lambda common_noun: common_noun
}
If a space is optional but not required, use the tilde delimeter: ~:
{
"syn": "s(E1) -> 'does' np(E1) vp_nosub_obj(E1)~'?'",
}
This says: the sentence ends with a question mark, possibly preceded by space.
Optional tokens
A token can be made optional by adding a “?” after it.
{
"syn": "s(E1) -> np(E1) 'is' 'a'? 'part' 'of' np(E2)",
}
This will match both “is a part of” and “is part of”.
Morphological analysis
The following rules ignore the plural suffix “s”. The first rule is the common rule. The second one is for words that end with “ies”: cities -> city.
{
"syn": "noun(E1) -> noun(E1) + 's'",
"sem": lambda noun: noun
},
{
"syn": "noun(E1) -> /\w+/ + 'ies'",
"sem": lambda token: [(token+'y', E1)], "dialog": lambda token: [('dialog_isa', e1, token+'y')]
}
Multiple parse trees
Even if the input consists of multiple sentences, the default output of the parser is a single tree with s as the root category.
It’s possible to produce multiple parse trees; one for each sentence. In this case, specify the categories that form the root categories of these trees.
parser = BasicParser(read_grammar, sentence_categories=["decl", "question"])
In this example, the topmost occurrences of the specified categories will serve as sentence roots.
Parse tree pruning
After all parse trees have been extracted, there are many meaningless trees that were formed by the regular expression rules that could be applied on about any word. To get rid of these meaningless parse trees, only the ones with the least amount of regexp nodes are kept.
Parse tree ordering
A bigger grammar will produce more parse trees. The most important factor for ambiguity is the use of the token category in a rewrite rule, since it matches any word.
The pipeline will try the alternative parse trees one by one, starting with the one that happens to be the first to be produced. Without support, the order of the parse trees would only be influenced by the order of the rewrite rules in the grammar. Depending on this can be tricky and in some cases it is impossible to have the right interpretation come up first.
That’s why the parser is equuipped with a few sorting heuristics that help place the best tree up front. These heuristics, combined in BasicParseTreeSortHeuristics produce a reasonable result, but you may find it insufficient. It may be needed to replace these basic heuristics with your own.
Sort by tree depth
The first heuristic sorts parse trees by decreasing tree depth. The most deeply nested sentence is placed first. To illustrate this idea, take the following sentence
What are the continents no country in which contains more than two cities whose population exceeds 1 million?
It can be parsed (among many others) like this
What are the continents
+ no country in which contains more than two cities
+ whose population exceeds 1 million?
and like this
What are the continents
+ no country in which contains more than two cities
+ whose population exceeds 1 million?
The clause that starts with “whose” can modify either “continents” or “two cities”. The latter is a more likely interpretation, since this np is nearer.
I don’t know of any literature that supports this claim, however. Let me know if you know of any.
Sort by token count
The category token is used for proper nouns and other entities that can’t be listed in full in the lexicon. It matches any token. A grammar with more tokens will create more parse trees and make it accept more sentences. However, when execution starts these tokens are turned into names to be resolved. A sentence with many random tokens will fail. A sentence with the least amount of tokens has the best chance of succeeding and is therefore placed up front.
Sort by boost
When you just know that one interpretation of a sentence should be preferred over the other, you can “boost” that sentence, like this:
{
"syn": "s(E1) -> 'what' 'are' np(E1) '?'",
"sem": lambda np: apply(np, []),
"dialog": [("format", "list"), ("format_list", e1)],
},
{
"syn": "s(E1, E2) -> 'what' 'are' 'the' noun(E1) 'of' np(E2) '?'",
"sem": lambda noun, np: noun + [('of', E1, E2)] + apply(np, []),
"dialog": [("format", "table"), ("format_table", [e2, e1], [None, None])],
"boost": 1
}
The sentence “What are the capitals of european cities?” is matches by both rules, but the second one is more specific and should be preferred. Therefore it is boosted. The default boost value is 0. Multiple boost values can be used if needed.