Universal Parser

Version 1.0

Hiroshi Uchida, Meiying Zhu

UNL Center

UNDL Foundation

Created in January 2003 / Revised on 21 July 2003

 

1.        Introduction

 

The purpose of a parser for natural language processing is generally to analyze sentences of a target language syntactically and semantically.  Information on the syntax and grammatical features of each language is therefore necessary for a parser.  Recently, some attempts have been made to develop parsers for annotated natural language texts in order to obtain more accurate results from sentences with ambiguities that cannot be resolved by syntactic and grammatical information alone.  However, the main aim of current annotations is to clarify the syntactic structure of target languages, and tag sets for such annotations have been designed separately for each language.  A language dependent parser is therefore necessary for analyzing texts annotated with such a tag set.

 

The Universal Parser proposed here generates UNL expressions from input sentences without using language dependent grammatical information but only language independent annotations.  Unlike current annotations, those for the Universal Parser are the set of tags required for generating the target meaning representations of sentences.  The Universal Parser generates UNL expressions of the target meaning representations by interpreting such tags of annotation inserted in input sentences.

 

Thus, the Universal Parser can deal with any language because it does not require any language dependent grammatical information.  It can generate meaning representations of input sentences in any language once tags for annotation have been inserted in those sentences.

 

 

2.        Structure of the Universal Parser

 

The Universal Parser uses an improved version of the EnConverter to process the tags.  Figure 1 shows the structure of the Universal Parser.

 


 

 

 

 

 

Figure 1.  Structure of the Universal Parser

 

Inputs of the Universal Parser are texts annotated with the set of tags defined in the UNL Annotation specifications (http://www.undl.org/doc/unla.pdf).  The Universal Parser analyzes annotated input texts using Universal Parser Rules and a UW dictionary.  The Universal Parser Rules describe the appropriate actions for generating the target UNL expressions by analyzing tags inserted in input sentences.  The UW dictionary is a collection of entries containing links between words (in base form) in each language and corresponding UWs.

 

2.1 Inputs of the Universal Parser

 

The inputs of the Universal Parser must be annotated texts with UNL annotations.  If necessary, it is also permitted to change or replace a word, or insert a UW directly in the input text.  The use of UNL annotation tags is described in the document “UNL Annotation”.  This section will focus on the requirements of input texts for the Universal Parser.

 

[1]         Replacement of Words

 

Various forms of words should be replaced with their base forms in the input texts when a language has declension, conjugation or inflection of words according to tense, voice, person, and so on.  Because the words of each language are linked with corresponding UWs in base form in UW dictionaries, this replacement is necessary to retrieve the corresponding UWs from such a UW Dictionary through the words in the input texts.  When a replacement is carried out, information represented by the original form should also be described using corresponding tags.

 

For instance, the word “said” in “He said to me” needs to be replaced with its base form “say”, with the attribute tag “.@past” attached to it, as in “he say.@past to me”.  Likewise, in the example below “included” is replaced by “include” with “.@past”, and “parts” is replaced by “part” with “.@pl”.

 

The conference.@def{<aoj,>p} include.p.@past.@entry two{<qua,>n} part.n.@pl{<obj,<p}.

 

However, this replacement is not necessary when using a language specific parser having a facility for the morphological analysis of the target language.
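
The replacement described above can be sketched in a few lines of Python.  This is only an illustration under stated assumptions: the table BASE_FORMS and the functions to_base_form and rewrite_sentence are hypothetical names, not part of the Universal Parser, and in practice the replacement is done by an annotator or by a morphological analyzer rather than by a hand-written table.

```python
# Hypothetical lookup table: surface form -> (base form, attribute tags).
# The three entries are taken from the examples in this section.
BASE_FORMS = {
    "said": ("say", [".@past"]),
    "included": ("include", [".@past"]),
    "parts": ("part", [".@pl"]),
}


def to_base_form(word):
    """Return the base form with its attribute tags, or the word unchanged."""
    base, attrs = BASE_FORMS.get(word.lower(), (word, []))
    return base + "".join(attrs)


def rewrite_sentence(sentence):
    """Replace every whitespace-separated word that has a table entry."""
    return " ".join(to_base_form(word) for word in sentence.split())


print(rewrite_sentence("He said to me"))  # He say.@past to me
```

The sketch only handles the base-form replacement; relation tags such as {<agt,>p} still have to be added separately.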

 

[2]         Insertion of UWs

 

It is permitted to insert a UW or to replace a word with its UW in the input text.  This function is useful in dealing with an elliptic construction of a sentence or when a desired UW cannot be retrieved from a UW dictionary through a word in the input text.  A UW is normally enclosed by the pair of tags <uw> and </uw>.  However, these tags can be omitted when the UW takes a single word as its headword and is restricted simply, in the form of “(rel>headword)” or “(rel<headword)”.  When the Universal Parser finds a relation description in the form of either “rel>” or “rel<”, it searches for “(” and “)” on the left and right sides, and makes up a UW using the part enclosed by “(” and “)” as the restriction.

 

For example, in the sentence ‘The Conference on "Universal Knowledge and Language" held in Goa at 25-29 of November was a big success’, “25-29” is used to indicate a period of days without any explicit word.  In this case, a UW ‘day(icl>date)’ is necessary to clarify the meaning of “25-29”.  It is enclosed by <uw> and </uw> as shown in the example below.  However, the tags <uw> and </uw> can be omitted in this case because the UW ‘day(icl>date)’ takes a single word “day” as the headword, and its restriction ‘(icl>date)’ consists of a single relation with a simple headword.

 

The Conference.n.@def.@topic{<aoj,>p} <c>on.p{>aoj,<n} <c>"Universal{>aoj,>n} Knowledge.n{<and,>n} and Language.n.@entry"</c>{<obj,<p} hold.p.@past{>obj,<n} in Goa{<plc,<p} on <c>25{<fmt,>n}-29.n.@entry</c>{<mod,>n} <uw>day(icl>date)</uw>.n{<tim,<p} of November{<tim,<n}</c> was a big{>aoj,>p} success.p.@past.@entry.

 

In the example below, a UW ‘web site(icl>address)’ is used instead of the word “web site”.  This description is used when the UW is found in a UW dictionary through the word “web site”.  The tags <uw> and </uw> cannot be omitted in this case because the headword of the UW ‘web site(icl>address)’ is a complex word (“web site”), and it is necessary to indicate the range of the UW.

 

The paper.@def.@pl{<aoj,>p} are available.p.@entry in the <uw>web site(icl>address)</uw>.n.@def{<plc,<p} of the Conference.@def{<mod,<n} <w>http://www.cfilt.iitb.ac.in/icukl2002/</w>{<cnt,<n}.
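
The two ways of writing a UW in the input text can be illustrated with a small Python sketch.  It is a simplification, not the Universal Parser's own procedure: the function find_uws and both regular expressions are assumptions made for this example, and the bare form is matched with a single pattern instead of searching outward from “rel>” or “rel<”.

```python
import re

# UWs written explicitly between <uw> and </uw> (the headword may contain spaces).
EXPLICIT_UW = re.compile(r"<uw>(.*?)</uw>")
# Bare UWs: a single-word headword with one simple restriction, e.g. day(icl>date).
BARE_UW = re.compile(r"\w+\(\w+[<>]\w+\)")


def find_uws(text):
    """Return the explicit UWs followed by the bare UWs found in text."""
    explicit = EXPLICIT_UW.findall(text)
    # Blank out the explicit spans so their contents are not matched a second time.
    remainder = EXPLICIT_UW.sub(" ", text)
    return explicit + BARE_UW.findall(remainder)


print(find_uws("on 25-29 day(icl>date).n{<tim,<p} of November{<tim,<n}"))
print(find_uws("in the <uw>web site(icl>address)</uw>.n.@def{<plc,<p}"))
```

Without the <uw> tags, the second input would yield only “site(icl>address)”, which is why a complex headword such as “web site” needs the explicit tags.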

 

[3]         Description of a Complex Word

 

Names written in English letters, such as e-mail or homepage addresses, should be enclosed by <w> and </w>, because they do not need to be defined in a UW dictionary and can be used as they are in any language.  Such complex words are output unchanged in the target UNL expressions and are treated as temporary UWs.

 

For instance, the URL address “http://www.cfilt.iitb.ac.in/icukl2002/” in the example below is enclosed by <w> and </w>.

 

The paper.@def.@pl{<aoj,>p} are available.p.@entry in the <uw>web site(icl>address)</uw>.n.@def{<plc,<p} of the Conference.@def{<mod,<n} <w>http://www.cfilt.iitb.ac.in/icukl2002/</w>{<cnt,<n}.

 

2.2  Dictionary used in the Universal Parser

 

A UW dictionary is a collection of dictionary entries, each containing a word (in base form) of a language and a corresponding UW in IBAM[1] format.  As the Universal Parser aims at obtaining meaning representations of sentences without using language dependent information such as grammatical features, a UW dictionary is sufficient for the purpose of obtaining the corresponding UWs for the words in input sentences.

 

Since a UW dictionary uses exactly the same format as a word dictionary, the Universal Parser can use a word dictionary instead of a UW dictionary.  In this case, the Universal Parser can utilize grammatical information just by adding enconversion rules based on that information.  If the grammatical information is universal to every language, the universality of the Universal Parser is retained.  If it is common to a group of languages, the Universal Parser retains its universality within that group.  If any language dependent information is used, the Universal Parser is no longer universal, but the number of annotations necessary for perfect enconversion will decrease.

 

Accordingly, whether to use the Universal Parser without any grammatical information, with universal (language independent) grammatical information, or with language dependent grammatical information depends on the developers and their purpose.  However, the most important aim and characteristic of the Universal Parser is to enable sentence enconversion using only annotations instead of a dictionary containing grammatical features.  Moreover, the fact that the Universal Parser can deal with any language, using a set of annotation tags common to all languages, is an extremely important feature.

 

The detailed definitions of the tags for UNL annotation are given in the document “UNL Annotation”.  Because the tags for annotation are inserted into input sentences, they are stored in a UW dictionary in the same format as dictionary entries for words: each tag is stored in a dictionary entry with its own tag notation and a code.  This enables the Universal Parser to treat tags like any other word in a sentence during input analysis, without special treatment.

 

2.3   Rules of the Universal Parser

 

The rules of the Universal Parser describe the appropriate actions by recognizing tags for annotation inserted in an input sentence.  The Universal Parser analyzes input sentences according to the rules and generates target UNL expressions.  As mentioned above, it is also possible to describe rules to refer to grammatical features when a word dictionary is used instead of a UW dictionary.

 

2.4  Output of the Universal Parser

 

The output of the Universal Parser consists of the UNL expressions of each sentence.

For details of the UNL, it is recommended to refer to “UNL Specifications” at http://www.undl.org/unlsys/unl/UNL%20Specifications.htm.

 

When no UW is found for a word in an input sentence, the surface of the word is outputted to the UNL expression as a “temporary UW”.  The Universal Parser outputs such “temporary UWs” following the message “undefined words”.

 

For example, a URL like “http://www.cfilt.iitb.ac.in/icukl2002/” should be enclosed by the tags <w> and </w> as shown in the example below.  The Universal Parser will treat it as one element (word) and output it in the target UNL expression as it is.  It is unnecessary to register such a URL in a dictionary, although it is also output following the message “undefined words”.  Except for names such as URLs that can be used as they are, words output following “undefined words” must be defined in a UW dictionary or described by their UW in the input sentence.

 

[S:15]

{org:el}

The paper.@def.@pl{<aoj,>p} are available.p.@entry in the <uw>web site(icl>address)</uw>.n.@def{<plc,<p} of the Conference.@def{<mod,<n} <w>http://www.cfilt.iitb.ac.in/icukl2002/</w>{<cnt,<n}.

{unl}

aoj(available(aoj>thing):0W.@entry, paper(icl>stationery):04.@pl.@def)

plc(available(aoj>thing):0W.@entry, web site(icl>address):1P.@def)

cnt(web site(icl>address):1P.@def, http://www.cfilt.iitb.ac.in/icukl2002/:3H)

mod(web site(icl>address):1P.@def, conference(icl>meeting):2P.@def)

{/unl}

;;undefined words

;; http://www.cfilt.iitb.ac.in/icukl2002/

[/S]
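
The handling of words without a dictionary entry can be pictured with the following Python sketch.  It rests on assumptions: a plain Python dict stands in for the UW dictionary (the real dictionary is in IBAM format), and the function names resolve_words and undefined_report are hypothetical.

```python
def resolve_words(words, uw_dictionary):
    """Map each word to its UW; unknown words are kept as temporary UWs."""
    nodes, undefined = [], []
    for word in words:
        uw = uw_dictionary.get(word.lower())
        if uw is None:
            nodes.append(word)       # the surface form is used as a temporary UW
            undefined.append(word)   # and is reported after the UNL expression
        else:
            nodes.append(uw)
    return nodes, undefined


def undefined_report(undefined):
    """Format the report lines in the style of the sample output."""
    return [";;undefined words"] + [f";; {word}" for word in undefined]


# Stand-in dictionary with two entries taken from the example above.
uw_dictionary = {
    "paper": "paper(icl>stationery)",
    "available": "available(aoj>thing)",
}
words = ["paper", "available", "http://www.cfilt.iitb.ac.in/icukl2002/"]
nodes, undefined = resolve_words(words, uw_dictionary)
print("\n".join(undefined_report(undefined)))
```

Here only the URL is reported, which matches the situation in the sample output: it is usable as it is and does not need a dictionary entry.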

 

When a UW is described directly in the input sentence using the tags <uw> and </uw>, the Universal Parser checks whether the UW is defined in a UW dictionary.  If it is not found in the UW dictionary, the Universal Parser outputs the UW following the message “undefined UWs”.

 

 

3.        How the Universal Parser Functions

 

The purpose of the rules of the Universal Parser is to analyze tags inserted in input sentences for annotation.  The Universal Parser generates UNL expressions by analyzing tagged texts according to the rules.  The Universal Parser functions as follows for various input sentences.

 

[1]        When it finds a referent tag such as {1} or {2}, it combines the tag with the element immediately in front of it, and makes a copy.  It then shifts the copy to the place immediately in front of the element attached to the corresponding reference tag such as {<1} or {<2}, and replaces the combination of the reference tag and the element with the copy of the referent.

 

For instance, in the example below, “John” is combined with “{1}” first and then its copy replaces “him{<1}”.  The relation tag “{<obj,<p}” attached to “him{<1}” is then attached to the copy of “John{1}”.  As a result, “John” becomes the ‘obj’ of “support” in the output UNL expression.

 

John{1}{<agt,>p} appeal.p.@past.@entry to Michael.n{<gol,<p} to support.p{>agt,<n} him{<1}{<obj,<p}
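
Operation [1] can be sketched in Python as follows.  This is a minimal illustration under assumptions, not the EnConverter implementation: elements are taken to be whitespace-separated tokens, and the function name resolve_references and both patterns are hypothetical.

```python
import re

# A referent tag such as {1}, together with the element in front of it.
REFERENT = re.compile(r"^(.*?\{(\d+)\})")
# A reference tag such as {<1}, together with the element it is attached to.
REFERENCE = re.compile(r"^.*?\{<(\d+)\}")


def resolve_references(sentence):
    """Replace each element carrying {<n} with a copy of the referent marked {n}."""
    tokens = sentence.split()
    referents = {}
    for token in tokens:
        match = REFERENT.match(token)
        if match:
            referents[match.group(2)] = match.group(1)   # e.g. "1" -> "John{1}"
    resolved = []
    for token in tokens:
        match = REFERENCE.match(token)
        if match and match.group(1) in referents:
            # Keep the tags that follow the reference tag, e.g. {<obj,<p}.
            token = referents[match.group(1)] + token[match.end():]
        resolved.append(token)
    return " ".join(resolved)


source = ("John{1}{<agt,>p} appeal.p.@past.@entry to Michael.n{<gol,<p} "
          "to support.p{>agt,<n} him{<1}{<obj,<p}")
print(resolve_references(source))
```

For the example sentence the last element becomes “John{1}{<obj,<p}”, while every other token is left unchanged.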

 

[2]        It combines each word with all the tags that follow it.  This process is carried out once operation [1] is completed.

 

For instance, in the example below, five elements combined with tags are created: “John{1}{<agt,>p}”, “appeal.p.@past.@entry”, “Michael.n{<gol,<p}”, “support.p{>agt,<n}” and “John{1}{<obj,<p}”.

 

John{1}{<agt,>p} appeal.p.@past.@entry to Michael.n{<gol,<p} to support.p{>agt,<n} John{1}{<obj,<p}
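
One way to picture a combined element is as a small record holding the word and its tags.  The Python sketch below is an assumption made for illustration: the function parse_element, the field names, and the reading of one-letter markers such as “n” and “p” as type markers are not taken from the specification, and only simple tokens of the kind shown above are supported.

```python
import re

# word, then dotted markers (.n, .p, .@past, ...), then brace tags ({1}, {<agt,>p}, ...)
ELEMENT = re.compile(r"^([^.{]+)((?:\.[@\w]+)*)((?:\{[^}]*\})*)$")


def parse_element(token):
    """Split one token into its word, type markers, attributes and brace tags."""
    match = ELEMENT.match(token)
    if match is None:
        raise ValueError(f"unsupported token: {token!r}")
    head, dotted, braces = match.groups()
    parts = [part for part in dotted.split(".") if part]
    return {
        "head": head,
        "types": [part for part in parts if not part.startswith("@")],
        "attributes": [part for part in parts if part.startswith("@")],
        "tags": re.findall(r"\{([^}]*)\}", braces),
    }


print(parse_element("appeal.p.@past.@entry"))
print(parse_element("John{1}{<agt,>p}"))
```

Tokens containing <c>, <uw> or <w> markup would need additional handling that this sketch does not attempt.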

 

[3]        Elements without a tag are deleted.

 

For instance, in the example below, “to” will be deleted automatically because no tag is attached.

 

John{1}{<agt,>p} appeal.p.@past.@entry to Michael.n{<gol,<p} to support.p{>agt,<n} John{1}{<obj,<p}
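
The deletion in operation [3] amounts to a simple filter.  In the Python sketch below, the test for “has a tag” is an assumption: a token counts as tagged if it contains a brace tag or a dotted marker such as “.p” or “.@past”.  The names TAGGED and drop_untagged are hypothetical.

```python
import re

# A brace tag such as {<gol,<p}, or a dotted marker such as .p or .@past.
TAGGED = re.compile(r"\{[^}]*\}|\.@?\w")


def drop_untagged(sentence):
    """Keep only the whitespace-separated elements that carry at least one tag."""
    return [token for token in sentence.split() if TAGGED.search(token)]


source = ("John{1}{<agt,>p} appeal.p.@past.@entry to Michael.n{<gol,<p} "
          "to support.p{>agt,<n} John{1}{<obj,<p}")
for element in drop_untagged(source):
    print(element)
```

Both occurrences of “to” are dropped and the five tagged elements remain.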

 

[4]        Next, it begins generating UNL expressions according to Relations Description Tags.  The parts enclosed by <c> and </c> are processed first.  A scope will be created only when the word to which “.@entry” is attached becomes the head of the words enclosed between <c> and </c>.  If “.@entry” is not attached to the head between <c> and </c>, a scope will not be created.

 

For instance, ‘<c>"Universal{>aoj,>n} Knowledge.n{<and,>n} and Language.n.@entry"</c>’ and “<c>25{<fmt,>n}-29.n.@entry</c>” below both express a scope; “Language” and “29” are the respective entries.  The outer pair of <c> and </c>, by contrast, contains no “.@entry” head and does not create a scope.  It is used to prevent “Conference.n.@def.@topic{<aoj,>p}” from referring to “hold.p”, so that it refers to “success.p” instead.

 

The Conference.n.@def.@topic{<aoj,>p} <c>on.p{>aoj,<n} <c>"Universal{>aoj,>n} Knowledge.n{<and,>n} and Language.n.@entry"</c>{<obj,<p} hold.p.@past{>obj,<n} in Goa{<plc,<p} on <c>25{<fmt,>n}-29.n.@entry</c>{<mod,>n} <uw>day(icl>date)</uw>.n{<tim,<p} of November{<tim,<n}</c> was a big{>aoj,>p} success.p.@past.@entry.
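
The nesting of <c> and </c> and the scope rule can be sketched in Python as follows.  The sketch simplifies the rule: instead of determining the head of the enclosed words, it only checks whether a group directly contains an element with “.@entry”.  The functions parse_groups and creates_scope and the shortened sample sentence are assumptions made for this illustration.

```python
import re


def parse_groups(text):
    """Build a tree of <c>...</c> groups; the root node stands for the whole sentence."""
    root = {"parts": [], "children": []}
    stack = [root]
    for piece in re.split(r"(<c>|</c>)", text):
        if piece == "<c>":
            node = {"parts": [], "children": []}
            stack[-1]["children"].append(node)
            stack.append(node)
        elif piece == "</c>":
            if len(stack) == 1:
                raise ValueError("</c> without a matching <c>")
            stack.pop()
        elif piece.strip():
            stack[-1]["parts"].append(piece.strip())
    if len(stack) != 1:
        raise ValueError("<c> without a matching </c>")
    return root


def creates_scope(node):
    """Assumed test: the group itself (not a nested group) holds an .@entry element."""
    return any(".@entry" in part for part in node["parts"])


# Shortened, made-up sentence with the same nesting pattern as the example above.
sentence = ("A.n{<aoj,>p} <c>on.p{>aoj,<n} <c>X{>aoj,>n} Y.n.@entry</c>{<obj,<p} "
            "hold.p{>obj,<n}</c> be good.p.@entry")
outer = parse_groups(sentence)["children"][0]
inner = outer["children"][0]
print(creates_scope(outer), creates_scope(inner))  # False True
```

The outer group only restricts which elements can refer to each other, while the inner group, which holds an entry, would become a scope.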

 

[5]        It creates links between elements specified by relation tags in the main sentence.  Because a relation tag is attached to the modifier, once a relation between two elements has been created the modifier is removed from further processing, provided that no other element modifies it.  The process is repeated until only one element (the head of the sentence) is left.

 

John{1}{<agt,>p} appeal.p.@past.@entry to Michael.n{<gol,<p} to support.p{>agt,<n} him{<1}{<obj,<p}

 

For instance, the following binary relations are created from the above five elements “John{1}{<agt,>p}”, “appeal.p.@past.@entry”, “Michael.n{<gol,<p}”, “support.p{>agt,<n}”, and “John{1}{<obj,<p}”.

 

{unl}

gol(appeal:0I.@entry.@past,           Michael:15)

agt(appeal:0I.@entry.@past,           John(iof>person):00)

agt(support:1U,             Michael:15)

obj(support:1U,             John(iof>person):00)

{/unl}

;;undefined words

;; appeal

;; Michael

;; appeal

;; Michael

;; support

 

In the above output, “appeal”, “Michael” and “support” are output as undefined words.  Except for “Michael”, these words must be defined in a UW dictionary.
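
The linking step can be approximated by the following Python sketch.  It is a simplification under stated assumptions and not the actual rule set: scopes and the removal of modifiers are ignored, each element simply links to the nearest element of the requested type, and the reading of a tag such as “<agt,>p” (first sign: whether the tagged element is the second or the first argument of the relation; second sign: whether the partner is searched to the left or to the right) is inferred from the examples in this document.  The function name link and the tuple layout are hypothetical.

```python
import re

# role sign, relation label, search direction, partner type: e.g. "<agt,>p"
TAG = re.compile(r"^([<>])(\w+),([<>])(\w+)$")


def link(elements):
    """elements: list of (word, type or None, relation tag or None)."""
    relations = []
    for i, (word, _type, tag) in enumerate(elements):
        match = TAG.match(tag) if tag else None
        if match is None:
            continue
        role, rel, direction, wanted = match.groups()
        if direction == ">":
            candidates = range(i + 1, len(elements))   # search to the right
        else:
            candidates = range(i - 1, -1, -1)          # search to the left
        partner = next(
            (elements[j][0] for j in candidates if elements[j][1] == wanted), None
        )
        if partner is None:
            continue
        # "<" : the tagged element is the second argument; ">" : the first one.
        pair = (partner, word) if role == "<" else (word, partner)
        relations.append((rel,) + pair)
    return relations


elements = [
    ("John", None, "<agt,>p"),
    ("appeal", "p", None),
    ("Michael", "n", "<gol,<p"),
    ("support", "p", ">agt,<n"),
    ("John", None, "<obj,<p"),
]
for relation in link(elements):
    print(relation)
```

For the five elements of the example, the sketch pairs “John” with “appeal” (agt), “Michael” with “appeal” (gol), “support” with “Michael” (agt), and “John” with “support” (obj).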

 

The Conference.n.@def.@topic{<aoj,>p} <c>on.p{>aoj,<n} <c>"Universal{>aoj,>n} Knowledge.n{<and,>n} and Language.n.@entry"</c>{<obj,<p} hold.p.@past{>obj,<n} in Goa{<plc,<p} on <c>25{<fmt,>n}-29.n.@entry</c>{<mod,>n} <uw>day(icl>date)</uw>.n{<tim,<p} of November{<tim,<n}</c> was a big{>aoj,>p} success.p.@past.@entry.

 

The following UNL expression will be generated from the above sentence.

 

{unl}

aoj(success(icl>state):80.@entry.@past,            conference(icl>meeting):04.@topic.@def)

aoj(big(aoj>thing):7N,   success(icl>state):80.@entry.@past)

obj(hold(agt>thing,obj>meeting):3R.@past,        conference(icl>meeting):04.@topic.@def)

aoj(on(icl>about):15,     conference(icl>meeting):04.@topic.@def)

obj(on(icl>about):15,     :01)

and:01(language(icl>word):2V.@entry,              knowledge(icl>information):26)

aoj:01(universal(aoj>thing):1N ,     knowledge(icl>information):26)

tim(hold(agt>thing,obj>meeting):3R.@past,         day(icl>date):5Y)

plc(hold(agt>thing,obj>meeting):3R.@past,        Goa(iof>state):4A)

tim(day(icl>date):5Y,      November:6V)

mod(day(icl>date):5Y,   :02)

fmt:02(29:55.@entry,     25:4T)

{/unl}
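
For readers who want to process such output, the following Python sketch splits one binary relation line into its label, optional scope identifier and two arguments.  The function parse_relation is a hypothetical helper written for this illustration; it assumes one relation per line in the layout shown above and is not part of the UNL system.

```python
def parse_relation(line):
    """Return (relation, scope or None, first argument, second argument)."""
    label, _, rest = line.partition("(")
    body = rest.rstrip()
    if not label.strip() or not body.endswith(")"):
        raise ValueError(f"not a binary relation: {line!r}")
    body = body[:-1]
    depth = 0
    for i, ch in enumerate(body):
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        elif ch == "," and depth == 0:
            # Only a comma outside any UW restriction separates the two arguments.
            relation, _, scope = label.strip().partition(":")
            return relation, scope or None, body[:i].strip(), body[i + 1:].strip()
    raise ValueError(f"no argument separator found: {line!r}")


print(parse_relation("tim(hold(agt>thing,obj>meeting):3R.@past,         day(icl>date):5Y)"))
print(parse_relation("and:01(language(icl>word):2V.@entry,              knowledge(icl>information):26)"))
```

Tracking the parenthesis depth matters because a restriction such as “(agt>thing,obj>meeting)” contains a comma of its own.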

 

 

4.        Conclusion

 

The UNL Annotation necessary to generate UNL expressions with the Universal Parser can be mastered through simple training.  Since the Universal Parser is language independent, it can be applied easily to any language.  The Universal Parser is not only helpful for generating meaning representations, as UNL expressions, from sentences in any language, but can hopefully also be used as a basic tool for natural language processing and understanding of meaning.

 

The Universal Parser uses the language independent parser “EnConverter[2]” developed by the UNL Center.  The EnConverter can analyze input sentences of any language using a word dictionary and the enconversion rules of a target language.  The structural flexibility of the Universal Parser facilitates the introduction of both language independent and language dependent grammatical information.  This makes it possible to extend the Universal Parser into a parser specific to a language or group of languages that requires fewer annotations.

 

5.        Related Documents

 

[1]        UNL Annotation, http://www.undl.org/doc/unla.pdf (English version), http://www.undl.org/doc/unla_jp.pdf (Japanese version), Hiroshi Uchida and Meiying Zhu, UNL Center of UNDL Foundation, Jan. 2003

[2]        UNL Specifications http://www.undl.org/unlsys/unl/UNL%20Specifications.htm, UNL Center of UNDL Foundation, 2002

[3]        EnConverter Specifications http://www.undl.org/unlsys/ds.html, UNL Center of UNDL Foundation, 2002



[1] An acronym for “Index Based Access Method”, an electronic dictionary access method developed by Libra Corporation in 1995, then improved and used in the UNL system since 1997.

[2] A language independent parser designed and developed by Hiroshi Uchida and Meiying Zhu, Libra Corporation 1995-1996, UNL Center since 1997