Companies today are starting to understand that there’s a lot of value hidden in the unstructured data they handle daily, buried in archives that have grown immensely over the years. We’re observing the (re)birth of an industry: many players offering Artificial Intelligence-based solutions, organizations asking for their help in understanding the content of their documents, and important new roles such as the Data Scientist.
Because this industry is still very young, we’re also seeing real difficulty in understanding what Natural Language Processing is actually about. Most companies look at it as one big technology, and assume the vendors’ offerings might differ in product quality and price but are ultimately much the same. The truth is, NLP is not one thing; it’s not one tool, but rather a toolbox. There’s great diversity when we consider the market as a whole, even though most vendors have only one tool each at their disposal, and that tool isn’t the right one for every problem. It’s understandable that a technical partner, when approached by a prospective client, will try to address the business case using the tool it has, but from the client’s standpoint this isn’t ideal. Each problem demands a different solution.
Over the years I’ve worked with many clients from every industry, and since I was lucky enough to work for a company that had many tools in its toolbox, I could pick a different approach every time: the most appropriate instrument for the job. My typical questions are:
- Is the methodology relevant? Given the same functionality, does it matter whether we prefer, for instance, a Deep Learning approach to a Symbolic one?
- What is the AI solution expected to deliver? Given a specific use case, which NLP feature is the most fitting?
While this topic could easily fill a two-week seminar to be properly investigated, I’ll try to summarize my experience using a few examples (and, of course, applying the necessary simplification).
I’ll start by saying that, the way I look at this problem, those two questions are very much connected. Some approaches (ML-based ones, for instance) can meet a short time-to-market requirement: it is possible to deliver a good-enough solution very quickly, at least for some use cases (for example, those where you can ignore the balance between Precision and Recall), especially when the solution happens to be built on a large archive that was, for some reason, manually pre-tagged in the past. On the other hand, a project might demand both high precision and high recall while mostly revolving around proper nouns or unique codes (ones that rarely present any ambiguity), in which case it’s easier to approach the problem with a straightforward list of keywords. Unfortunately, there are no strict guidelines on when one methodology beats the others; the choice is tightly linked to the specific solution we want to build. But there are a few general rules. Since everything in life comes with advantages and drawbacks, here’s a (again, simplified) view:
- Keyword technology (aka shallow linguistics) is preferable when lists of unambiguous words are involved, and not advisable when relevant words can carry multiple meanings
- Symbolic technology (syntactic analysis, Semantics, deep linguistics) collects information in great detail and is ideal when you want to remove noise from the results, but it’s not the best choice when a goal needs to be reached quickly or the effort kept to a minimum (unless we’re talking about an already-customized solution; some NLP vendors specialize in a single industry, which makes development faster)
- Machine Learning (the statistical approach) has made a strong comeback in recent years in the form of techniques we generally label Deep Learning, mostly on the promise of requiring very little effort and time to deliver a solution from scratch. And it is true that sometimes it’s incredibly easy to reach 75% Accuracy with a very basic algorithm (assuming you have a large-enough tagged corpus, or are willing to put in the work). This is probably why so many startups, notoriously careful when it comes to spending, are riding this horse. But if your application expects production-grade accuracy (which I personally define as north of 85% F-score), the problem can seem insurmountable, depending on the use case. Lately, in fact, we’ve read more and more articles arguing that Machine Learning is not the ideal approach for NLP problems, and some big players have reshaped their message into something like “Machine Learning is here to work with you”.
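To make the keyword trade-off above concrete, here’s a minimal sketch of a shallow keyword tagger. The tags and word lists are invented for illustration; the point is to show exactly where the approach breaks down, since a word like “apple” belongs to two lists and shallow matching has no way to pick the right one:

```python
import re

# A shallow keyword tagger: a hypothetical tag -> keyword-list mapping.
# Note the ambiguous word "apple", which appears in both lists.
KEYWORDS = {
    "fruit": {"apple", "banana", "orange"},
    "tech":  {"apple", "microsoft", "google"},
}

def keyword_tags(text):
    """Return every tag whose keyword list matches a word in the text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return {tag for tag, kws in KEYWORDS.items() if words & kws}

# Unambiguous keywords work exactly as intended:
print(keyword_tags("Microsoft released a new compiler"))  # {'tech'}

# ...but the ambiguous keyword drags in a wrong tag alongside the right one:
print(keyword_tags("I ate an apple for breakfast"))  # both 'fruit' and 'tech'
```

Every spurious tag of this kind is a false positive, which is precisely what drives Precision down; this is why the keyword tool is best reserved for vocabularies that are unambiguous to begin with.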
But let’s talk about the tools in the toolbox. Here’s a brief, non-comprehensive list: Classification, Entity Extraction, Sentiment, Summarization, Part-of-Speech, Triples (SAO), Relations, Fact Mining, Linked Data, Heuristics, Emotions/Feelings/Moods. Almost every use case in Computational Linguistics can be traced back to meta-tagging: a document goes through an engine and comes out richer, decorated with a list of tags carrying key intelligence about the document. It’s probably this simple concept that leads so many companies to think all NLP technologies are the same, but the point is: what do you want to tag your documents with? Categories from a standard taxonomy? Names of companies mentioned in the text? An indication of the document’s overall sentiment, stated as “positive” or “negative”? Maybe a combination of tools (sentiment for each entity extracted from the document)?
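The meta-tagging idea can be pictured as a simple data shape. The engine, document, and tags below are all invented for illustration; the point is only that the output is the original text plus one layer of tags per tool applied:

```python
# A hypothetical enriched document as it might come out of an NLP engine:
# the original text, plus one tag layer per tool in the toolbox.
document = {
    "text": "CompanyX acquires CompanyY; analysts react positively.",
    "tags": {
        "categories": ["Mergers & Acquisitions"],  # Classification
        "entities": ["CompanyX", "CompanyY"],      # Entity Extraction
        "sentiment": "positive",                   # Sentiment
    },
}

# Downstream applications consume only the tag layers they need:
print(document["tags"]["entities"])  # ['CompanyX', 'CompanyY']
```

Seen from the outside, every vendor’s product produces something shaped like this, which is exactly why they can look interchangeable; the real differences live in which layers are available and how accurately each one is filled in.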
Text Analytics providers are, by nature, problem solvers. If a vendor only offers, say, Classification, it’s pretty easy to make an argument for addressing any use case by classifying content. In the same way, I can drive a nail into a wall with my shoe, but if I had a hammer I’d probably use that. How can we recognize the tool we need for each project? Some cases are more obvious than others, but let me give you a few pointers about the most famous tools:
- Classification should be used when the final objective of your application comes from recognizing that a document belongs to a very specific, predefined class (sports, food, insurance policies, financial reports about the energy market in south-east Asia, …), like storing magazines in a box and putting a label on it. Note that the name of a class isn’t necessarily mentioned in the documents belonging to it
- Entity Extraction is useful when you’re interested in the variable part of your content: the non-predefined elements, topics, or names that are actually mentioned
- Summarization helps when your solution requires speeding up investigation: you want to automatically build short abstracts that give a sense of a document’s content, without having to read the entire document before knowing it’s relevant to your research
- Sentiment and Emotions (or Feelings, or Moods, depending on the vendor) are pretty much self-explanatory; they’re popular in Analytics and BI applications, especially when it comes to measuring brand/product reputation in the consumer market (through analysis of Social Media)
- Relations and Triples/SAO (e.g. “CompanyX acquires CompanyY” tagged as CompanyX + Acquisition + CompanyY) are useful when the information we’re looking for is a little more complex than usual; sometimes we’re just interested in co-occurrences of different named entities (people, companies, etc.) in the same document, other times we need to know whether a specific entity was the object of an action involving another entity
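The Triples/SAO case from the list above (“CompanyX acquires CompanyY”) can be sketched with a deliberately naive pattern matcher. A real engine would rely on syntactic analysis rather than a regular expression, and the tiny verb-to-action lexicon here is made up for the example:

```python
import re

# A made-up miniature lexicon mapping verbs to normalized action labels.
ACTIONS = {"acquires": "Acquisition", "sues": "Lawsuit", "hires": "Hiring"}

def extract_sao(sentence):
    """Return (Subject, Action, Object) triples matching '<Name> <verb> <Name>'."""
    pattern = r"([A-Z]\w+)\s+(" + "|".join(ACTIONS) + r")\s+([A-Z]\w+)"
    return [(subj, ACTIONS[verb], obj)
            for subj, verb, obj in re.findall(pattern, sentence)]

print(extract_sao("CompanyX acquires CompanyY"))
# [('CompanyX', 'Acquisition', 'CompanyY')]
```

Even this toy version shows why triples are more than co-occurrence: the output records who did what to whom, not merely that two names appeared in the same document.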
It is impossible to list the complete feature set of every technology vendor in the NLP space, and, more importantly, NLP keeps expanding every year, which is probably why it is sometimes hard to sort through everything the market offers. But knowing that every product is profoundly different helps in making the right choice.