Machine Learning is slowly but surely leaving academic circles and enthusiasts’ nighttime projects to relocate to business applications. While some fields, image recognition for example, have quickly and successfully adopted modern ML-based algorithms (Deep Learning), others have struggled to achieve production-grade results, leading to the stark statistic that 85% of ML projects in major corporations never see the light of day. This is a defining step for ML as a whole: the expectations companies have when onboarding a new technology are high, the quality demanded is higher than what one usually chases in a lab, and a positive return on investment is considered the norm.
Nevertheless, experiments in this field are the new normal for actors across all industries. And when ML is applied to Document Analysis, some conclusions are fairly common; one in particular: wondering whether ML is even the best fit for a solution that requires understanding language and communication. Anyone who has attempted to deploy a functioning solution aimed at optimizing a real-life business workflow knows how much easier it was, at some point, to support ML with other techniques (commonly referred to as a Hybrid approach), or to use a different technology entirely (Symbolic, for instance).
While it is true that Machine Learning today isn’t ready for prime time in many business cases that revolve around Document Analysis, there are indeed scenarios where a pure ML approach can be considered. Here follows a list of what I’ll informally refer to as Machine Learning’s Sweet Spots.
In order to frame my thought process, I’m going to first list the dimensions I’m considering:
- Document size
- Content structure
- Language specialists
- Languages to cover
The first scenario I consider ML-friendly is one that focuses on production cost. Let’s imagine that achieving Accuracy above 60-70% is not a priority (or perhaps not even possible, because of poor document quality or other reasons); it would still be great to at least deliver some minor workflow optimization at a lower cost. An ML solution trained on a small set of pre-tagged documents can often deliver an immediate Accuracy score of around 65% without demanding the investment of tagging hundreds of thousands of documents.
As with all that follows, the above concepts and numbers vary based on the specific content and use case at hand, but as a general rule of thumb we can assume that 60 to 70% Accuracy can be reached without having to hire 8 Data Analysts and ask them to tag documents for the next 12 months. In other words, I’m ultimately stating something very obvious: when you can compromise on the quality front, you at least have a chance at a cheaper, speedier deployment.
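To make the low-cost scenario concrete, here is a minimal sketch of the kind of cheap model this implies: a bag-of-words Naive Bayes classifier trained on just a handful of pre-tagged documents. The document texts, labels, and tokenization below are all invented for illustration, not drawn from any real workflow.

```python
from collections import Counter, defaultdict
import math

def tokenize(text):
    return text.lower().split()

def train(docs):
    """docs: list of (text, label). Returns per-label word counts, label counts, vocab."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    vocab = set()
    for text, label in docs:
        toks = tokenize(text)
        word_counts[label].update(toks)
        label_counts[label] += 1
        vocab.update(toks)
    return word_counts, label_counts, vocab

def predict(text, word_counts, label_counts, vocab):
    total_docs = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label in label_counts:
        # log prior + log likelihood with Laplace (add-one) smoothing
        score = math.log(label_counts[label] / total_docs)
        label_total = sum(word_counts[label].values())
        for tok in tokenize(text):
            score += math.log((word_counts[label][tok] + 1) / (label_total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# A toy "pre-tagged archive": four documents, two classes
training_set = [
    ("invoice payment due amount total", "invoice"),
    ("total amount invoice remit payment", "invoice"),
    ("contract parties agreement term signature", "contract"),
    ("agreement signed parties clause term", "contract"),
]
model = train(training_set)
print(predict("please remit the invoice amount", *model))  # prints "invoice"
```

Even this naive model separates the two classes on the toy data; the point is not the algorithm but the budget: a few tagged examples and a few dozen lines of code, rather than an army of annotators.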
Another scenario where sub-par Accuracy is acceptable is when all the processed documents will be checked manually anyway. There’s not much more I can say about these use cases without getting into more detail than this text allows, but I have encountered processes that would never, under any circumstances, remove the human component from the mix. Sometimes it was because the documents they worked with were just too messy, and other times it was because there was a need for 100% Recall; the point is, they would have to check everything at the end of the pipeline. Still, they could enjoy some optimizations/extractions/highlights that would make the checking easier, faster, and less tedious. The software didn’t really need to catch everything.
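As a sketch of that "check everything anyway" setup, the snippet below highlights candidate fields for the reviewer and orders the review queue, without ever removing a document from human review. The patterns and documents are invented for illustration; a real system would use a trained extractor rather than two regexes.

```python
import re

# Hypothetical fields a model might surface; here stand-in regexes
PATTERNS = {
    "amount": re.compile(r"\$\d[\d,]*(?:\.\d{2})?"),
    "date": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
}

def annotate(doc_text):
    """Return (field, matched text, offset) spans to highlight for the reviewer."""
    hits = []
    for field, pattern in PATTERNS.items():
        for m in pattern.finditer(doc_text):
            hits.append((field, m.group(), m.start()))
    return hits

queue = [
    "Invoice dated 2023-05-01 for $1,250.00, net 30.",
    "Handwritten note, barely legible, no obvious fields.",
]
# Every document stays in the queue; those with fewer machine hints surface
# first, since they will need the most reviewer attention.
queue.sort(key=lambda d: len(annotate(d)))
for doc in queue:
    print(len(annotate(doc)), doc)
```

The model never decides anything final here; it just makes the mandatory human pass faster and less tedious, which is exactly the value proposition of this scenario.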
A third scenario that favors a pure ML approach is when a logical approach is not applicable because we don’t really know how to solve the problem we’re facing. Granted, this has more to do with cancer research than Natural Language Processing, but I thought this list wouldn’t be complete if I didn’t briefly mention it. Other situations where logical approaches aren’t favored are when the ideal solution doesn’t seem to follow logical patterns, and when the problem is just too complicated (in document size or structure) or too chaotic. Again, this hardly applies to applications in the world of Document Analysis, but it deserves a space here. There is the occasional exception, when the compromise of poor performance in an NLP solution is accepted anyway: when a linguistic solution needs to handle multiple languages and, at the same time, must be deployed quickly and at a low cost.
One uncommon, though incredibly fortunate, condition in which Machine Learning tends to be the first option one explores is when a company already happens to have a large amount of pre-tagged content. Some workflows have historically required content to be identified and labeled manually; when the desire to automate portions of a process comes into play, there might be a sufficiently large archive of documents ready to train an algorithm. And, depending on the size of this archive, even simple algorithms can reach Accuracy of 80% and beyond. An example of this is Classification in a largely heterogeneous context (like channels in the Media & Publishing industry, where subjects are as semantically distant as Finance and Sports, making it easier for ML to find highly recognizable patterns).
Finally, one of the best scenarios, probably one many would consider ML’s natural place, is when cost isn’t an issue. I am specifically talking about contexts in which a company can afford to have an army of Data Scientists working for years on training and perfecting an algorithm, spending all the money and time this requires, as long as quality hits 99%. I should clarify that an actual 99% is something we can expect from a chess engine, not a realistic target in the realm of natural language and communication in general (in fact, not even humans achieve that result very often); so let’s call “99%” any performance level that is at least not too far below what a person would achieve in a given job. In some cases, though today still not many, this is possible with a pure ML approach; and, within this small group, there’s a sub-group where the same wouldn’t be possible following a logical approach (aka Symbolic). Sure, this is typically the moment when someone mentions that a hybrid approach combining Symbolic and ML would cost half as much and give far more control in terms of explainability, but that’s a topic for another day.