A simple NLP application for ambiguity resolution

  • 11 May 2021

How to resolve the ambiguity of homographs and polysemous words using expert.ai NLP technology.

 

[Cover image: picture frames of different colours and dimensions. Photo by Markus Spiske on Unsplash]

 

Read the article on Towards Data Science.

 

Ambiguity is one of the biggest challenges in NLP. When trying to understand the meaning of a word we consider several different aspects, such as the context in which it is used, our own knowledge of the world, and how a given word is generally used in society. Words change meaning over time and can also mean one thing in a certain domain and another in a different one. This phenomenon can be observed in homographs (two words that happen to be written in the same way, usually with different etymologies) and in polysemy (one word that carries several meanings).
In this tutorial, we’ll see how to resolve ambiguity in PoS tagging and semantic tagging, using expert.ai technology.

 

Before you start

 

Please check how to install the expert.ai NL API Python SDK, either in this Towards Data Science article or in the official documentation, here.
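If you only need a quick reminder, the setup usually boils down to installing the SDK from PyPI and exporting your expert.ai developer credentials. The package name and environment variable names in the sketch below reflect the SDK documentation at the time of writing, so double-check them against the official docs:

# Setup sketch (package name and credential variables are as documented by the SDK;
# verify against the official documentation before relying on them):
#
#   pip install expertai-nlapi
#
# The client reads your expert.ai developer credentials from the environment:
import os

os.environ["EAI_USERNAME"] = "your.account@example.com"  # placeholder credentials
os.environ["EAI_PASSWORD"] = "your-password"             # placeholder credentials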

 

Part of Speech tagging

 

Language is ambiguous: not only can a sentence be written in different ways and still convey the same meaning, but even lemmas, which are supposed to be far less ambiguous, can carry different meanings.

For example, the word play could refer to several different things. Let’s take a look at the following examples:
I really enjoyed the play.
I’m in a band and I play the guitar.

Not only can the same word have different meanings, it can also be used in different roles: in the first sentence, play is a noun, while in the second it is a verb. Assigning the correct grammatical label to each token is called PoS (Part of Speech) tagging, and it is no piece of cake.

Let’s see how to resolve PoS ambiguity with expert.ai — first, let’s import the library and create the client:

from expertai.nlapi.cloud.client import ExpertAiClient
client = ExpertAiClient()

We’ll see the PoS tagging for two sentences — notice how the lemma key is the same in both sentences, while its PoS changes:

# Two sentences in which the same word, "key", has a different grammatical label
key_as_noun = "The key broke in the lock."
key_as_adjective = "The key problem was not one of quality but of quantity."

To analyze each sentence we need to create a request to the NL API: the most important parameters, shown in the code below, are the text to analyze, the language, and the analysis we are requesting, represented by the resource parameter.
Please note that the expert.ai NL API currently supports five languages (en, it, es, fr, de). The resource we use is disambiguation, which returns the multi-level tagging produced by the expert.ai NLP pipeline.
Without further ado, let’s create our first request:

# Request disambiguation of the first sentence, key_as_noun
# Notice: the resource parameter specifies the kind of analysis we want to perform on the document.
document = client.specific_resource_analysis(
    body={"document": {"text": key_as_noun}},
    params={'language': 'en', 'resource': 'disambiguation'})

Now we need to iterate over the PoS of the text and check which one was assigned to the lemma key:

# Produce and print the PoS tagging of the first sentence
# Notice: to retrieve the textual form of each element we slice document.content using the element's start and end characters
print(f'Parts of speech for "{key_as_noun}"\n')
for token in document.tokens:
    print(f'{document.content[token.start:token.end]:{15}}\tPOS: {token.pos}')
[Image: output of the code above, with the sentence on the first line followed by two columns: tokens on the left, their PoS labels on the right.]

What is printed above is a list of PoS tags following the Universal Dependencies (UD) labels, where NOUN indicates that the lemma key is used here as a noun. This should not be the case for its homograph in the second sentence, where key is used as an adjective:

# Request disambiguation of the second sentence, key_as_adjective
document = client.specific_resource_analysis(
    body={"document": {"text": key_as_adjective}},
    params={'language': 'en', 'resource': 'disambiguation'})

# Produce and print the PoS tagging of the second sentence
# Notice: to retrieve the textual form of each element we slice document.content using the element's start and end characters
print(f'Part of speech for "{key_as_adjective}"\n')
for token in document.tokens:
    print(f'{document.content[token.start:token.end]:{15}}\tPOS: {token.pos}')
[Image: output of the code above, with the sentence on the first line followed by two columns: tokens on the left, their PoS labels on the right.]

As you can see in the output above, the lemma key was correctly recognized as an adjective in this sentence.
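To make the comparison explicit, we can pull out just the PoS tag assigned to the token key in each document. The small helper below (pos_of is a name introduced here for illustration) reuses only the attributes already shown in this tutorial, namely token.start, token.end and token.pos, and simply matches the token's surface form:

# Illustrative helper: return the PoS tag assigned to a given surface form in a document
def pos_of(document, word):
    for token in document.tokens:
        if document.content[token.start:token.end].lower() == word.lower():
            return token.pos
    return None

# Compare how "key" is tagged in the two sentences
for sentence in (key_as_noun, key_as_adjective):
    doc = client.specific_resource_analysis(
        body={"document": {"text": sentence}},
        params={'language': 'en', 'resource': 'disambiguation'})
    print(f'"{sentence}" -> key is tagged as {pos_of(doc, "key")}')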

 

Semantic tagging

 

One word can also keep the same grammatical label and still have different meanings: this phenomenon is called polysemy. Inferring the correct meaning of each word is the task of semantic tagging.

More common words tend to accumulate more meanings over time. For example, the lemma paper can have multiple meanings, as seen here:
I like to take notes on paper.
Every morning my husband reads the news from the local paper.

Identifying the correct meaning of every single lemma is an important task, since a document's meaning or focus can change based on it. To do so, we must rely on well-developed, robust technology, because semantic tagging depends heavily on many pieces of information drawn from the text.

For semantic tagging, IDs are often used: each ID identifies a concept, and every concept has its own ID. For the same lemma, e.g. paper, we will have one ID x for its meaning as a material and another ID y for its meaning as a newspaper.
These IDs are usually stored in a Knowledge Graph, i.e. a graph in which each node is a concept and the edges are connections between concepts that follow a certain logic (e.g. an edge could link two concepts if one is the hyponym of the other).
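To make the idea concrete, here is a toy, purely illustrative sketch of such a structure in plain Python; the concept IDs, lemmas and relations below are made up for the example and have nothing to do with expert.ai's actual Knowledge Graph:

# Toy knowledge graph: each node is a concept (identified by an arbitrary ID)
# grouping synonymous lemmas; edges link a concept to a broader concept (its hypernym).
toy_concepts = {
    100: {"lemmas": ["paper"], "gloss": "material made of cellulose pulp"},
    200: {"lemmas": ["paper", "newspaper", "daily"], "gloss": "a daily publication"},
    300: {"lemmas": ["publication"], "gloss": "a printed work offered for distribution"},
}
toy_hypernyms = {200: 300}  # the "newspaper" concept IS-A "publication"

# The same lemma ("paper") appears under two different concept IDs,
# which is exactly the choice a disambiguator has to make.
ids_for_paper = [cid for cid, c in toy_concepts.items() if "paper" in c["lemmas"]]
print(ids_for_paper)  # [100, 200]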
Let’s now look at how expert.ai performs semantic tagging. We begin by choosing the sentences in which we will compare the two occurrences of the lemma solution:

solution_as_tactic = "Work out the solution in your head."
solution_as_chemical_mixture = "Heat the chlorine solution to 75° Celsius."

And now the request for the first sentence — using the same parameters as the previous example:

# Request disambiguation of the first sentence, solution_as_tactic
# Notice: the resource parameter specifies the kind of analysis we want to perform on the document.
document = client.specific_resource_analysis(
    body={"document": {"text": solution_as_tactic}},
    params={'language': 'en', 'resource': 'disambiguation'})

Semantic information is found in the syncon attribute of each token: a syncon is a concept stored in expert.ai’s Knowledge Graph, and each concept groups one or more lemmas that are synonyms.
Let’s see how the information is presented in the document object:

# Produce and print the semantic tagging of the first sentence
# Notice: to retrieve the textual form of each element we slice document.content using the element's start and end characters
print(f'Semantic tagging for "{solution_as_tactic}"\n')
for token in document.tokens:
    print(f'{document.content[token.start:token.end]:{15}}\tCONCEPT_ID: {token.syncon}')
[Image: output of the code above, with the sentence on the first line followed by two columns: tokens on the left, their concept IDs on the right.]

Each token has its own syncon, though some tokens show -1 as their concept ID: this is the default ID assigned to tokens that do not carry any concept, such as punctuation or articles.
So, while for the previous sentence we obtain concept ID 25789 for the lemma solution, for the second sentence we should obtain a different one, since the lemma has a different meaning in the two sentences:

# Request disambiguation of the second sentence, solution_as_chemical_mixture
# Notice: the resource parameter specifies the kind of analysis we want to perform on the document.
document = client.specific_resource_analysis(
    body={"document": {"text": solution_as_chemical_mixture}},
    params={'language': 'en', 'resource': 'disambiguation'})

# Produce and print the semantic tagging of the second sentence
# Notice: to retrieve the textual form of each element we slice document.content using the element's start and end characters
print(f'Semantic tagging for "{solution_as_chemical_mixture}"\n')
for token in document.tokens:
    print(f'{document.content[token.start:token.end]:{15}}\tCONCEPT_ID: {token.syncon}')
[Image: output of the code above, with the sentence on the first line followed by two columns: tokens on the left, their concept IDs on the right.]

As expected, the lemma solution corresponds to a different concept ID, indicating that it carries a different meaning than in the previous sentence.
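The same check can be automated by collecting the concept ID of every content token and comparing the one assigned to solution in the two documents. The helper below (concept_map is a name introduced here for illustration) reuses only attributes already shown, namely token.start, token.end and token.syncon, and skips the -1 placeholder:

# Illustrative helper: map each token's surface form to its concept ID,
# skipping the -1 placeholder assigned to tokens without a concept (punctuation, articles, ...)
def concept_map(document):
    return {
        document.content[token.start:token.end]: token.syncon
        for token in document.tokens
        if token.syncon != -1
    }

doc_tactic = client.specific_resource_analysis(
    body={"document": {"text": solution_as_tactic}},
    params={'language': 'en', 'resource': 'disambiguation'})
doc_mixture = client.specific_resource_analysis(
    body={"document": {"text": solution_as_chemical_mixture}},
    params={'language': 'en', 'resource': 'disambiguation'})

print(concept_map(doc_tactic)["solution"], concept_map(doc_mixture)["solution"])
# The two printed concept IDs differ, reflecting the two meanings of "solution".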

Please find this article as a notebook on GitHub.

 

Conclusion

 

NLP is hard because language is ambiguous: one word, one phrase, or one sentence can mean different things depending on the context. With technologies such as expert.ai, we can resolve this ambiguity and build solutions that are more accurate when dealing with the meaning of words.



1 reply


Thanks Laura!

Amazing content! 

It’s amazing that this contribution has been featured on Towards Data Science; it’s great recognition for a great job!

Keep posting on our Community!

 

Ciao,

Francesco Baldassarri