On Knowledge Graphs and how they enhance Natural Language Understanding technology

  • 22 October 2021

We’ve stepped into a second phase for Machine Learning, one in which it’s become fairly well known that it isn’t very efficient to build a basic understanding of language into a Machine Learning model from scratch every time a new project starts. Language is the common baseline of every document, and most of it has nothing to do with the specifics of an industry or a use case. A small portion of it is domain-specific but, for the most part, a document is full of words and concepts any person would understand, even those who don’t work in the industry the text is meant for.

by Jayarathina - Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=37135596

Therefore, it is faster to infuse the training of a model with pre-established knowledge, and the best way to do this is to adopt a knowledge graph. This new level of maturity in the AI world has made conversations about knowledge graphs more common than they used to be. Consequently, many people are now wondering what a knowledge graph is, how it works, and how it can be leveraged. The advantages knowledge graphs bring can’t all be covered in one article, so I’ll start with one that goes straight to the point and jump right in with an example.

Let’s say we need to catalog or search content (documents, products, etc.) that belongs to very specific categories such as Music, Sports, Movies, or Furniture.

One document reads: “Susan’s the pianist who’s playing now.”

A second document reads: “Mary’s the goalkeeper who’s playing now.”

Why is it so easy for a person to immediately get that the first document is about music, while the second one is about sports? The answer is straightforward: the concept “pianist” is associated with the domain of music, while “goalkeeper” is associated with soccer, which is a sport. This is what we call a semantic association.

A knowledge graph is a data structure that stores all these concepts and associations, respecting their dependencies in the form of tree branches (human activities -> sports -> soccer -> goalkeeper) as well as differentiating the types of relations between concepts (a “pug” and a “paw” are both connected to “dog”, but the former is a type of dog while the latter is a part of a dog).
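
To make that description concrete, here is a minimal sketch of such a structure in Python. It is only an illustration: the class name `KnowledgeGraph`, its methods, and the relation labels `belongs_to`, `is_a` and `part_of` are assumptions made for this example, not the API of any real knowledge graph product.

```python
from collections import defaultdict


class KnowledgeGraph:
    """Toy graph of concepts joined by typed, directed relations (hypothetical)."""

    def __init__(self):
        # edges[source][relation] is the set of concepts that `source`
        # points to through `relation`.
        self.edges = defaultdict(lambda: defaultdict(set))

    def add(self, source, relation, target):
        self.edges[source][relation].add(target)

    def related(self, source, relation):
        """Concepts reachable from `source` in one step via `relation`."""
        return set(self.edges.get(source, {}).get(relation, set()))

    def branch(self, concept, relation="belongs_to"):
        """Walk upward along one relation type, nearest ancestor first.

        If a concept has several parents, this toy version simply follows
        the alphabetically first one.
        """
        path, seen = [], {concept}
        current = concept
        while True:
            parents = sorted(self.related(current, relation))
            if not parents or parents[0] in seen:
                return path
            current = parents[0]
            seen.add(current)
            path.append(current)


kg = KnowledgeGraph()
# The tree branch from the text: human activities -> sports -> soccer -> goalkeeper
kg.add("goalkeeper", "belongs_to", "soccer")
kg.add("soccer", "belongs_to", "sports")
kg.add("sports", "belongs_to", "human activities")
kg.add("pianist", "belongs_to", "music")
kg.add("music", "belongs_to", "human activities")
# Two different relation types that both point at "dog"
kg.add("pug", "is_a", "dog")
kg.add("paw", "part_of", "dog")

print(kg.branch("goalkeeper"))   # ['soccer', 'sports', 'human activities']
print(kg.related("pug", "is_a"))  # {'dog'}
print(kg.related("paw", "is_a"))  # set(): a paw is a part of a dog, not a kind of dog
```

Keeping the relation label on every edge is what lets the graph tell "pug" and "paw" apart, even though both are connected to "dog".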

Additionally, if we look again at those sentences, we notice there’s more: something not apparent at word level, hidden in the semantics of those statements. The two instances of the word “playing” refer to entirely different actions, the first being “playing the piano” and the second having to do with engaging in a sporting activity. This shows how many subtle differences in meaning routinely run through entire sentences and paragraphs without any change in the form of the words. An accurate analysis and indexing would treat those two instances of the verb “to play” as if they were two different things, and act accordingly. Understanding this particular feature of knowledge graphs also happens to highlight a bigger problem we’ve observed during the first phase of this modern iteration of Machine Learning: if an ML model (that is, the engine produced by an ML algorithm) is only trained using words, as opposed to concepts, it can only get so far.



We can’t always rely on finding words that are devoid of any ambiguity; in fact, such words are pretty rare in most languages. The AI software we adopt to process our archives of unstructured data (AKA freeform textual content) should be able to index words and portions of documents based on what the author actually meant, so that if I’m looking for documents about sports I’ll be presented with those that use the verb “to play” with that particular meaning, and not with every instance of the word “play” regardless of its meaning in context.
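
As a rough illustration of indexing by meaning rather than by word, the sketch below picks a sense for “play” from the other concepts found in the same sentence, then searches by domain. Everything in it is a simplifying assumption for this example: the lookup tables `CONCEPT_DOMAIN`, `SENSES` and `LEMMAS` stand in for what a real knowledge graph would provide, and a production system would use far richer context than a single matching concept.

```python
import re

# Hypothetical stand-ins for knowledge graph lookups.
CONCEPT_DOMAIN = {"pianist": "music", "goalkeeper": "sports"}
SENSES = {
    "play": {
        "music": "play (perform on an instrument)",
        "sports": "play (take part in a game)",
    }
}
LEMMAS = {"playing": "play", "plays": "play", "played": "play"}


def tokenize(text):
    """Lowercase the text and keep alphabetic runs only."""
    return re.findall(r"[a-z]+", text.lower())


def disambiguate(text, word="play"):
    """Return (domain, sense) for `word` in `text`, or None if undecidable."""
    tokens = tokenize(text)
    if not any(LEMMAS.get(t, t) == word for t in tokens):
        return None  # the word does not occur at all
    domains = {CONCEPT_DOMAIN[t] for t in tokens if t in CONCEPT_DOMAIN}
    if len(domains) != 1:
        return None  # no contextual evidence, or conflicting evidence
    domain = domains.pop()
    sense = SENSES[word].get(domain)
    return (domain, sense) if sense else None


def search_by_domain(documents, domain):
    """Keep only documents whose use of "play" belongs to `domain`."""
    hits = []
    for doc in documents:
        result = disambiguate(doc)
        if result is not None and result[0] == domain:
            hits.append(doc)
    return hits


docs = [
    "Susan's the pianist who's playing now.",
    "Mary's the goalkeeper who's playing now.",
]

print(disambiguate(docs[0]))  # ('music', 'play (perform on an instrument)')
print(disambiguate(docs[1]))  # ('sports', 'play (take part in a game)')
print(search_by_domain(docs, "sports"))
```

A plain keyword search for “playing” would return both documents, while the search by domain returns only the one about Mary.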

