Modern approaches to Natural Language Processing (NLP) streamline document analysis through simplification.
Simply put, there’s a tendency to drop the hard part (understanding the content) in favor of more direct techniques: looking at words, how often they appear in documents, and which other words show up next to them or elsewhere in the same document. This kind of statistical information is collected and carefully optimized during what Machine Learning calls the Training stage. Practically speaking, a person manually tags a document that talks about, for instance, sports with the label “Sports” (known as the Target). When that document is processed, the engine collects all the words present and marks each of them as a potential indicator of a sports context. As more content from the Training Set is analyzed, some of those words will appear again (reinforcing the idea that they truly are indicative of the sports domain), while others will be absent (weakening the possibility that they have domain significance).
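To make the training step above more concrete, here is a minimal sketch of word counting per target. It is a deliberately naive illustration, not how any particular engine works: the toy training set, the `tokenize`, `train` and `predict` helpers, and the scoring rule (sum of raw word counts) are all assumptions made for this example.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def train(labeled_docs):
    """Count how often each word appears under each manually assigned target."""
    counts = defaultdict(Counter)
    for text, target in labeled_docs:
        counts[target].update(tokenize(text))
    return counts


def predict(counts, text):
    """Pick the target whose training words overlap most with the text."""
    tokens = tokenize(text)
    scores = {
        target: sum(word_counts[token] for token in tokens)
        for target, word_counts in counts.items()
    }
    return max(scores, key=scores.get)


# Hypothetical, hand-tagged training set (text, target).
TRAINING_SET = [
    ("The team won the match after a late goal", "Sports"),
    ("The coach praised the team after the match", "Sports"),
    ("The bank raised interest rates again", "Finance"),
    ("Investors expect the bank to cut rates", "Finance"),
]

model = train(TRAINING_SET)
prediction = predict(model, "A late goal decided the match")
print(prediction)
```

Note that a word like “the” is counted under both targets, so it adds to every score without helping to tell the domains apart. Words such as “match” or “goal” only gain weight because they keep reappearing in documents tagged “Sports”.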
Naturally, while simplification is appealing (it is fast, and it is forgiving in terms of the skills needed to tackle a challenge involving unstructured data), it has its drawbacks too. The most obvious one is that collecting words as if they were abstract symbols has nothing to do with understanding the meaning of what’s described in a document. The approach might seem moderately effective for documents as a whole, but it becomes much harder to apply the more laser-like our focus gets (understanding a sentence or, even harder, who’s performing an action in a sentence).
Looking at documents more closely through this lens, it also becomes immediately apparent that a major portion of our language is present in every text, no matter the topic. This forces these simplified approaches to use very large training sets to compensate for ambiguous repetition: most words carry many different meanings, and we only discern them thanks to their context. An ML algorithm therefore needs many instances of the same word for each meaning that word has, which multiplies the number of documents needed for proper training. Since the simplification we’re talking about requires effort in the form of humans manually assigning the targets I mentioned above, the bigger the training set, the more work has to be carried out. And no shortcuts are allowed: if lots of documents use a word with one specific meaning (and not as many use the same word with its other meanings), the result will be an engine that assumes the word can only have that one meaning (a skew in the training data that the model simply reproduces, closely related to the problem known in ML as overfitting).
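The effect of a skewed training set can be sketched in a few lines. The example below is an intentionally simplistic, context-free model invented for illustration: the word “bank”, the labeled sentences, and the `train_majority_sense` and `predict_sense` helpers are all hypothetical.

```python
from collections import Counter

# Hypothetical hand-labeled occurrences of the word "bank": (sentence, sense).
# Nine documents use the financial sense, only one uses the river sense.
LABELED_OCCURRENCES = [("The bank approved the loan", "finance")] * 9 + [
    ("We sat on the river bank", "river")
]


def train_majority_sense(occurrences):
    """Learn a single sense for the word: the one seen most often in training."""
    sense_counts = Counter(sense for _, sense in occurrences)
    majority_sense = sense_counts.most_common(1)[0][0]
    return majority_sense, sense_counts


learned_sense, sense_counts = train_majority_sense(LABELED_OCCURRENCES)


def predict_sense(sentence):
    """Ignore the sentence's context and always return the learned sense."""
    return learned_sense


print(sense_counts)
print(predict_sense("We fished from the bank of the river"))
```

Because the river sense is so rare in the training data, this model labels every occurrence as financial, including the sentence about fishing. Balancing the senses would require collecting and manually tagging more documents for each meaning.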
A more precise way to address the issues described above is to adopt a technology that doesn’t skip the understanding step in the analysis of natural language. In NLP, not many components have the same impact as Part-of-Speech (POS) Tagging. A POS Tagger is expected to understand every sentence at a grammatical level. More elaborate ones will recognize proper nouns, phrases and idioms, so that multiple words can be grouped together when that makes sense. Very advanced POS Taggers will also propagate information through the document: something that was recognized thanks to context in one sentence will still be recognized in other sentences of the same document where that helpful context isn’t present. This works just the same even if the necessary information appears only later in the document. Usually this happens through a first analysis meant only to resolve these ambiguities, followed by a second analysis that leverages the information gathered during the first pass to properly understand everything else.
Finally, a POS Tagger isn’t complete unless it can resolve a sufficient number of anaphora (where a pronoun, like “she”, is linked to an actual name in the document) and correctly guess the role of proper nouns based on how they’re used.
A couple of examples to better grasp some of the features mentioned above:
- “My wife Washington and I took off on a road trip on the East Coast. Washington really liked New York.”: clearly, “Washington” here is a person (as shown in the screenshot above, which also displays propagation capabilities), and this can only be recognized correctly through a proper understanding of the whole statement and the context around every word.
- “The BMW I bought to replace my Mercedes is a good car.”: an effective POS Tagger will recognize that the “good car” we’re talking about is the BMW, even if, positionally, we read “[…] my Mercedes is a good car”.
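The two-pass propagation idea, applied to the “Washington” example, can be sketched roughly as follows. This is a toy illustration under strong assumptions, not a real POS Tagger: the `PERSON_CUE` pattern (a kinship word right before a capitalized name) and the `first_pass` and `second_pass` helpers are hypothetical, and a real tagger relies on full grammatical analysis rather than a regular expression.

```python
import re

# Hypothetical contextual cue: a kinship word right before a capitalized name
# suggests that the name refers to a person.
PERSON_CUE = re.compile(r"\b(?:wife|husband|sister|brother|friend)\s+([A-Z][a-z]+)")


def first_pass(document):
    """Collect names whose type is revealed by context anywhere in the document."""
    return {name: "PERSON" for name in PERSON_CUE.findall(document)}


def second_pass(document, known_types):
    """Tag each sentence, reusing the types gathered during the first pass."""
    sentences = re.split(r"(?<=[.!?])\s+", document.strip())
    tagged = []
    for sentence in sentences:
        mentions = {
            name: kind
            for name, kind in known_types.items()
            if re.search(rf"\b{re.escape(name)}\b", sentence)
        }
        tagged.append((sentence, mentions))
    return tagged


document = (
    "My wife Washington and I took off on a road trip on the East Coast. "
    "Washington really liked New York."
)
known_types = first_pass(document)
for sentence, mentions in second_pass(document, known_types):
    print(mentions, "<-", sentence)
```

The second sentence contains no cue of its own, yet “Washington” is still tagged as a person there because the first pass already resolved it. Since the first pass scans the whole document before any sentence is tagged, the same holds when the revealing context only appears later in the text.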
An advanced POS Tagger, typically combined with other NLP components (such as a Knowledge Graph), is what leads to platforms known as Natural Language Understanding (NLU). In Artificial Intelligence applied to document analysis, whether it is ML, Symbolic/Semantic, or Hybrid AI, NLU is an elevated form of NLP that takes the processing of language to a level at which a deeper understanding of context is expected.
Expert.ai’s POS Tagger and other NLP functionalities can be tested at try.expert.ai