Machine Learning’s Sweet Spot: 5 scenarios for a pure ML approach in NLP and Document Analysis
Machine Learning is slowly but surely leaving academic circles and enthusiasts' nighttime projects to relocate to business applications. While some solutions have quickly and successfully adopted modern ML-based algorithms (Deep Learning), image recognition for example, others have been struggling to achieve production-grade results, leading to stark statistics highlighting that 85% of ML projects in major corporations never see the light of day. This is a defining step for ML as a whole: the expectations companies have when onboarding a new technology are high, there is a demand for higher quality than what one usually chases in a lab, and a positive return on investment is considered the norm.

Nevertheless, experiments in this field are the new normal for actors in all industries. And when ML is applied to Document Analysis, some conclusions are fairly common. One in particular: wondering whether ML is even the best fit for a solution that requires understanding language and communications. Anyone who has attempted to deploy a functioning solution aimed at optimizing a real-life business workflow knows how much easier it was, at some point, to support ML with other techniques (commonly referred to as a Hybrid approach), or to use a different technology entirely (Symbolic, for instance).

While it is true that Machine Learning today isn't ready for prime time in many business cases that revolve around Document Analysis, there are indeed scenarios where a pure ML approach can be considered. Here follows a list of what I'll informally refer to as Machine Learning's Sweet Spots. In order to frame my thought process, I'm going to first list the dimensions I'm considering:
- Cost
- Accuracy
- Speed
- Document size
- Content structure
- Subject
- Language specialists
- Languages to cover

The first scenario I consider ML-friendly is one that focuses on production cost. Let's imagine achieving accuracy over 60-70% is not a priority (or maybe not even possible because of the documents' poor quality or other reasons); then it would be great to at least deliver some minor workflow optimization at a lower cost. An ML solution trained on a small amount of pre-tagged documents can often deliver an immediate 65% Accuracy score without asking for the investment of tagging hundreds of thousands of documents. As in all that follows, the above concepts and numbers vary based on the specific content and use case at hand, but as a general rule of thumb we can assume that 60 to 70% Accuracy can be reached without having to hire 8 Data Analysts and ask them to tag documents for the next 12 months. In other words, I'm ultimately stating something very obvious: when you can compromise on the quality front, you at least have a chance at a cheaper, speedier deployment.

Another scenario where sub-par Accuracy is acceptable is when all the processed documents will be checked manually anyway. There's not much more I can say about this use case without getting into more detail than this text allows, but in my life I have encountered processes that would never, under any circumstances, remove the human component from the mix. Sometimes it was because the documents they worked with were just too messy, and other times it was because there was a need for 100% Recall; but the point is, they would have to check everything at the end of the pipeline. Still, they could enjoy some optimizations/extractions/highlights that would make the checking easier, faster, less tedious. The software didn't really need to catch everything.
The third scenario that favors a pure ML approach is when a logical approach is not applicable because we don't really know how to solve the problem we're facing. Granted, this has more to do with cancer research than Natural Language Processing, but I thought this list wouldn't be complete if I didn't briefly mention it. Other situations where logical approaches aren't favored are when the ideal solution doesn't seem to follow logical patterns, and when the problem is just way too complicated (document size, structure) or chaotic. Again, this hardly applies to applications in the world of Document Analysis, but they deserve a space here. We register the occasional exception to this, when the compromise of facing poor performance in an NLP solution is accepted anyway: when a linguistic solution needs to handle multiple languages, and at the same time there's a necessity to deploy it quickly and at a low cost.

One uncommon, though incredibly fortunate, condition in which Machine Learning tends to be the first option one explores is when a company already happens to have a large amount of pre-tagged content. Some workflows have historically required content to be identified and labeled manually; when the desire to automate portions of a process comes into play, there might be a sufficiently large archive of documents ready to train an algorithm. And, depending on the size of this archive, even simple algorithms can reach an Accuracy of 80% and beyond. An example of this is Classification in a largely heterogeneous context like channels in the Media & Publishing industry, where subjects are as semantically distant as Finance and Sports, making it easier for ML to find highly recognizable patterns (see the sketch at the end of this article).

Finally, one of the best scenarios (probably one many would consider ML's natural place) is when cost isn't an issue. I am specifically talking about contexts in which a company can afford to have an army of Data Scientists working for years on training and perfecting an algorithm, spending all the money and time this requires, as long as quality hits 99%. I should clarify that an actual 99% is something we can consider for a chess engine, not really a realistic expectation in the realm of natural language and communication in general (in fact, not even humans would achieve that result very often), so let's call "99%" any performance level that is at least not too far below what a person would achieve in a given job. In some cases, though today still not many, this is possible with a pure ML approach; and, in this small group, there's a sub-group where the same wouldn't be possible following a logical approach (aka Symbolic). Sure, this is typically when someone mentions that a hybrid approach putting together Symbolic and ML would cost half as much and give way more control in terms of explainability, but that's a topic for another day.
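To ground the pre-tagged-archive scenario, here is a minimal sketch, assuming scikit-learn and a hypothetical load_pretagged_archive() helper that returns document texts together with the labels an existing workflow has already assigned (e.g. Finance vs. Sports). With semantically distant classes, even a pipeline this simple can land in the accuracy ranges discussed above; treat it as an illustration, not a production recipe.

```python
# Minimal sketch: train a simple classifier on documents that were already
# labeled by an existing workflow and measure Accuracy on a held-out split.
# load_pretagged_archive() is a hypothetical loader for your own archive.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

docs, labels = load_pretagged_archive()  # e.g. article texts, labels like "Finance"/"Sports"

train_docs, test_docs, train_y, test_y = train_test_split(
    docs, labels, test_size=0.2, stratify=labels, random_state=0
)

model = make_pipeline(TfidfVectorizer(min_df=2), LogisticRegression(max_iter=1000))
model.fit(train_docs, train_y)

print("Accuracy:", accuracy_score(test_y, model.predict(test_docs)))
```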
Check Your Biases - Symbolic Engines and Unexpected Results
Abstract

Back in June 2016, only days before the Brexit vote, I was flying from Aberdeen, Scotland on a British Airways plane, heading back to Italy after the K-Drive Project had just wrapped. I was carrying a laptop with me, containing a small NLP model I had developed over the past weeks to predict the Brexit outcome. Together with the model, I had the infographics detailing the prediction itself, which was uncomfortably different from the mainstream one.

The K-Drive Project, under the umbrella of the Marie Curie actions, is a European-funded project based at the University of Aberdeen, aimed at bringing together industry and academia for mutual exchange and joint effort in the field of semantic technologies. I joined it as a MER (More Experienced Researcher) in January 2016, when the main paper was already being published, so I was assigned training and public speaking tasks to showcase the NLP technology involved.

During the final month leading to the end of the project, we decided to employ a model we had initially developed for the Scottish Independence Referendum in order to monitor opinions on social media about Brexit, searching for insights and trends to predict the outcome.

After some consultations, we decided to use Twitter as a source, based on previous experience and know-how: tweets are short, relevant and to the point, written by a wide variety of users, often with reliable hashtags. Facebook posts, on the other hand, tend to be trivial, too long and written in broken English, and Reddit threads are sometimes confusing to follow and can be riddled with trolling, whereas on Twitter comments are well distinguished from the original tweet and can easily be ignored in the analysis.

First of all, a software engineer developed a spider to download tweets posted in Scotland, England, Northern Ireland and Wales, so that they could be processed separately, while I was in charge of adapting and further developing the linguistic engine as a knowledge engineer. The spider was designed to download only tweets containing hashtags related to the Brexit referendum, belonging to a fine-tuned list I prepared.

After an initial analysis of a training corpus of about 400 random tweets, I decided to develop a fully symbolic engine, heavily relying on knowledge graphs to grasp the concepts contained in the tweets, in order to understand the opinions as well as the related mood. I tried to cover the entire semantic field, also including Scottish slang.

So every day we would crawl Twitter, get tweets, convert them and feed them to the linguistic engine. No machine learning involved. The engine remained the same throughout the entire process, so that we could record the trends that were being expressed online, according to geographical area.

But when the first results began to come in, there was some embarrassment, as the output was showing a solid, undisputed advantage for Leave. Actually, every single day ended with a clear Leave outcome and the trend kept going, while the entire world was agreeing that Stay was consistently winning by a couple of percentage points.

Of course I was asked why my results were out of tune, so I gave a (self-)convincing speech about how referendum campaigners are always more vocal than the silent majority, so my results were to be intended as insights for social sciences, more than actual predictions for political decision-making.
Obviously I avoided mentioning the shy Tory factor.

But while I was completely absorbed by my coding, I could not help noticing that I was trapped between the narrative of mainstream media and my anecdotal experience of actually meeting a significant number of Leavers on a daily basis. My landlady, fellow hillwalkers, fellow bird-watchers, oil&gas people at the pub, the cashier lady of the place down the road. They were young, they were from all sorts of places and backgrounds, some of them had PhDs, and they would ask questions.

There I was, a (southern) European citizen on a European-funded project, comfortably sitting in the venerable Meston building, merrily coding my working day away, or walking the muddy paths in search of eider ducks and razorbills and puffins and the occasional fulmar at the weekend. I was a controversial figure, so questions were asked.

"I'm on a business trip" was generally frowned upon.
"I'm working at the University of Aberdeen, on a European-funded project" was met with some nervous embarrassment.
"I work in IT" just sounded suspicious, since all expats in Aberdeen work in oil&gas really.

Leavers really were vocal. They would tell me about feeling trapped inside European bureaucracy and hoping to be able to join more large-scale projects. Or they would talk about the UK not wanting to pay European taxes, or not being able to sustain an increasing immigration flow. Of course I could not tell them: oh, don't worry! I have just invented a magic computer thingy that says Leave is going to win! The AI hype had not properly started yet, so they would not have believed me.

Anyway, press releases were written and our results were presented. The morning the Referendum outcome was officially announced, I was at home in Italy. I was having breakfast and I thought to myself: well, I don't want to become an alien in the UK, but wouldn't it be cool if we were right and everybody else was wrong. Well, it actually turned out that we were right and everybody else was wrong. Apparently, a wise choice of source (Twitter in this case) and the careful development of a symbolic engine were key to our success.

But then the mainstream analysis came out and I was completely taken aback. The general picture of Brexit UK that was given, where only rural old folks with no schooling had voted Leave, as opposed to the dynamic young professionals in the cities who had all voted Stay, felt completely surreal compared to my direct experience. Obviously, I had to brush off the suspicion that mainstream media were simply framing the matter according to a political agenda. Maybe they knew all along about the strong Leave vote but did not tell for fear of leading even more people to vote Leave, and now they were manufacturing a stereotypical picture of the UK, inspired by the radio drama The Archers, to make Leavers reconsider. But that would be a conspiracy theory, so we can rule it out.

So slowly doubt started creeping in. What if my code was biased in the first place, and I got the right result by mistake? But how? No machine learning was involved, so the engine could not have accidentally learned from an unbalanced training set. It was a fully symbolic engine, therefore any bias must have been put there by me, however unintentionally.

The point is that a symbolic engine is always theory-free. According to the scientific method you should form a hypothesis based on observations, then you make a prediction, you test it, and you iterate the test until you have data consistent enough to draw conclusions.
A symbolic engine does not work like that. It is never developed according to a theory; there are no observations, hypotheses or predictions. There might be assumptions, I guess, but they should be regarded as such and not coded into the engine. A symbolic engine consists of a complex set of generalized and explainable conditions that is developed based on known requirements as well as the analysis of a related training set. Generalization requires intuition, and that will be the human spark in the engine, which will allow it to work autonomously. When the development is finished, a new data set is processed and the code is executed depending on which conditions are true, thus returning an output that mirrors the initial requirements and outlines information and correlations that may or may not be the expected ones.

In the end we do trust calculations over assumptions, just like Galileo.

Suddenly the revelation came. What if the everyday contact with Leave people had led me to develop code that was better geared to understand Leave vocabulary, concepts, phrasing and wording? Or maybe, Leave people being more vocal, in real life as well as on social media, they ended up teaching me their language, and I passed this knowledge on to the engine. Maybe I could not learn properly from Stay people since they were much quieter. After all, it could be that in the half-hearted self-defense of my work, I was actually describing the very sociolinguistic process underlying the success of my model.

So right now I'm worried about my code being unintentionally biased, i.e. suffering from the most dangerous disease in the AI world. Then again, there could be another element. The hype around biases in the current AI narrative is so strong that the hype itself might be affecting my judgement.

In conclusion, when dealing with the development of symbolic engines, ML models and hybrid models alike, I think it is useful to integrate sociolinguistic elements into our analysis. They can help us find a better-balanced training set for our ML model, and/or they can help us develop better-balanced code for our symbolic engine, thus improving our ability to obtain automated, reliable insight from large data sets.
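To make the idea of a "set of generalized and explainable conditions" concrete, here is a tiny, purely illustrative sketch in Python. It is not the K-Drive engine, and the hashtag and keyword lists are hypothetical; the point is that every decision path is an explicit condition a human wrote, which is also exactly where a developer's exposure to one side's vocabulary can quietly slip in.

```python
# Purely illustrative sketch of a condition-based (symbolic) stance classifier.
# The marker lists are hypothetical stand-ins; a real engine would rely on a
# knowledge graph and far richer linguistic conditions.
LEAVE_MARKERS = {"#voteleave", "#takebackcontrol", "brussels bureaucracy"}
STAY_MARKERS = {"#strongerin", "#voteremain", "remain in the eu"}

def classify_stance(tweet: str) -> str:
    text = tweet.lower()
    leave_hits = sum(marker in text for marker in LEAVE_MARKERS)
    stay_hits = sum(marker in text for marker in STAY_MARKERS)
    # Each branch is an explicit, inspectable condition rather than a learned weight.
    if leave_hits > stay_hits:
        return "Leave"
    if stay_hits > leave_hits:
        return "Stay"
    return "Unclear"

print(classify_stance("Time to leave Brussels bureaucracy behind #TakeBackControl"))
```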
Insurance Policies: Document Clustering through Hybrid NLP
Insurance documents and policies: a complex use case

It is common knowledge that up to 87% of data science projects fail to go from Proof of Concept to production; NLP projects for the Insurance domain are no exception. On the contrary, they must overcome several hardships which are inevitably connected to this space and its intricacies. The best-known difficulties come from:
- the complex layout of Insurance-related documents
- the lack of sizeable corpora with related annotations.

The complexity of the layout is so great that the exact same linguistic concept can greatly change its meaning and value depending on where it is placed in a document. Let's look at a simple example: if we try to build an engine to identify the presence or absence of a "Terrorism" coverage in a policy, we will have to assign a different value depending on whether it is placed in:
- The Sub-limit section of the Declaration Page.
- The "Exclusions" chapter of the policy.
- An Endorsement adding a single coverage or more than one.
- An Endorsement adding a specific inclusion for that coverage.

The lack of good-quality, decently sized corpora of annotated insurance documents is directly connected to the inherent difficulty of annotating such complex documents, as well as the amount of work that would be required to annotate tens of thousands of policies. And this is only the tip of the iceberg. On top of this, we must also consider the need for the normalization of insurance concepts.

Linguistic normalization: an invisible, yet powerful, force in the Insurance language

The normalization of concepts is a well-understood process when working on databases, but it is also pivotal for NLP in the Insurance domain, as it is the key to applying inferences and increasing the speed of the annotation process. Normalizing concepts means grouping under the same label linguistic elements which may look extremely different. The examples are many, but a prime one comes from insurance policies against Natural Hazards. In this case different sub-limits will be applied to different Flood Zones. The ones with the highest risk of flood are usually called "High Risk Flood Zones"; however, this concept can be expressed as:
- Tier I Flood Zones
- SFHA
- Flood Zone A
- And so on…

Virtually any coverage can have many terms that can be grouped together, and the most important Natural Hazard coverages even have a 2- or 3-layer distinction (Tier I, II and III) according to specific geographical zones and their inherent risk. Multiply this by all the possible elements we can find, and the number of variants will soon become very large. This causes both the ML annotators and the NLP engines to struggle when trying to retrieve, infer, or even label the correct information.

A new type of linguistic clustering: the hybrid approach

A better approach to solving complex NLP tasks is based on hybrid (ML/Symbolic) technology, which improves the results and life cycle of an insurance workflow via micro-linguistic clustering based on Machine Learning, then inherited by a Symbolic engine. While traditional text clustering is used in unsupervised learning approaches to infer semantic patterns and group together documents with similar topics, sentences with similar meanings, etc., a hybrid approach is substantially different. Micro-linguistic clusters are created at a granular level through ML algorithms trained on labeled data, using pre-defined normalized values.
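What follows is a minimal sketch of what such a normalization layer boils down to. The variant lists are hypothetical and hard-coded for illustration; in the hybrid approach described here, they would be produced by the ML component trained on annotated examples rather than written by hand.

```python
# Minimal sketch of concept normalization: map surface variants to one label.
# The variant lists are hypothetical; in the hybrid approach they come from an
# ML model trained on labeled data, not from a hand-written dictionary.
NORMALIZATION_MAP = {
    "tier i flood zones": "High Risk Flood Zone",
    "sfha": "High Risk Flood Zone",
    "flood zone a": "High Risk Flood Zone",
    "fine arts": "Fine Arts",
    "work of arts": "Fine Arts",
    "artistic items": "Fine Arts",
    "jewelry": "Fine Arts",
}

def normalize(term: str) -> str:
    """Return the normalized concept for a surface form, or the form itself."""
    return NORMALIZATION_MAP.get(term.strip().lower(), term)

print(normalize("SFHA"))          # -> "High Risk Flood Zone"
print(normalize("Work of Arts"))  # -> "Fine Arts"
```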
Once the micro-linguistic clustering is inferred, it can then be used for further ML activities or in a Hybrid pipeline which applies inference logic through a Symbolic layer. This goes in the direction of the traditional golden rule of programming: "breaking down the problem". The first step to solving a complex use case (as most in the Insurance domain are) is to break it into smaller, easier-to-take-on chunks.

What tasks can hybrid linguistic clustering accomplish, and how scalable is it?

Symbolic engines are often labeled as extremely precise but not scalable, as they do not have the flexibility of ML when it comes to handling cases unseen during the training stage. However, this type of linguistic clustering goes in the direction of solving this matter by leveraging ML for the identification of concepts, which are consequently passed on to the complex (and precise) logic of the Symbolic engine coming next in the pipeline. The possibilities are endless: for instance, the Symbolic step can alter the intrinsic value of the ML identification according to the document segment the concept falls in.

The following is an example which uses the Symbolic process of "Segmentation" (splitting a text into its relevant zones) to understand how to use the label passed along by the ML module. Let us imagine that our model needs to understand whether certain insurance coverages are excluded from a 100-page policy. The ML engine will first cluster together all the possible variations of the "Fine Arts" coverage:
- "Fine Arts"
- "Work of Arts"
- "Artistic Items"
- "Jewelry"
- etc.

Immediately after, the Symbolic part of the pipeline will check whether the "Fine Arts" label is mentioned in the "Exclusions" section, thus understanding whether that coverage is excluded from the policy or whether it is instead covered (as part of the sub-limits list). Thanks to this, the ML annotators will not have to worry about assigning a different label to all the "Fine Arts" variants according to where they are placed in a policy: they only need to annotate the normalized value of "Fine Arts" to its variants, which will act as a micro-linguistic cluster.
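A minimal sketch of that Symbolic step, assuming a hypothetical segment_policy() helper that splits a policy into named sections and the normalized labels produced by the ML layer:

```python
# Minimal sketch of the Symbolic "Segmentation" step: decide whether a
# normalized coverage label appears in the Exclusions section of a policy.
# segment_policy() is a hypothetical helper returning {section_name: text}.

def is_coverage_excluded(policy_text: str, normalized_label: str) -> bool:
    sections = segment_policy(policy_text)  # e.g. {"Declarations": "...", "Exclusions": "..."}
    exclusions_text = sections.get("Exclusions", "").lower()
    # The ML layer has already collapsed variants ("Work of Arts", "Jewelry", ...)
    # into one normalized label, so the symbolic check can stay this simple.
    return normalized_label.lower() in exclusions_text

# Usage: the ML module tags a mention as "Fine Arts"; the Symbolic layer then
# derives its actual meaning from the section the mention falls in.
# is_coverage_excluded(policy_text, "Fine Arts")  ->  True if excluded
```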
Another useful example of a complex task is the aggregation of data. If a hybrid engine aims at extracting the sub-limits of specific coverages, then along with the coverage normalization issue there is an additional layer of complexity to handle: the order of the linguistic items for their aggregation. Let's consider that the task at hand is to extract not only the sub-limit for a specific coverage but also its qualifier (per occurrence, in the aggregate, etc.). These items can be placed in several different orders:
- Fine Arts $100,000 Per Item
- Fine Arts Per Item $100,000
- Per Item $100,000 Fine Arts
- $100,000 Fine Arts
- Fine Arts $100,000

Covering all these permutations while aggregating data can considerably increase the complexity of a Machine Learning model. A hybrid approach, on the other hand, would have the ML model identify the normalized labels and then have the Symbolic reasoning identify the correct order based on the input coming from the ML part. Clearly, these are just two examples; an endless number of complex Symbolic logics and inferences can be applied on top of the scalable ML algorithm for the identification of normalized concepts.

A more scalable workflow which is also easier to build and maintain

In addition to scalability, symbolic reasoning brings other positives to the whole project workflow:
- There is no need to implement different ML workflows for complex tasks, with different labeling to be implemented and maintained. It is also quicker and less resource-intensive to retrain a single ML model than multiple ones.
- Since the complex portion of the business logic is dealt with symbolically, adding manual annotations to the ML pipeline is much easier for data annotators.
- For the same reasons, it is also easier for testers to directly provide feedback on the ML normalization process.
- Moreover, since linguistic elements are normalized by the ML portion of the workflow, users will have a smaller list of labels with which to tag documents.
- Symbolic rules do not need to be updated often: what will be updated more often is the ML part, which can also benefit from users' feedback.

Summary and conclusions
- ML in complex projects in the Insurance domain can suffer because inference logic can hardly be condensed into simple labels; this also makes life harder for the annotators.
- Text position and inferences can dramatically change the actual meaning of concepts that share the same linguistic form.
- In a pure ML workflow, the more complex a logic is, the more training documents are usually needed to achieve production-grade accuracy.
- For this reason, ML would need thousands (or even tens of thousands) of pre-tagged documents to build effective models.
- Complexity can be reduced by adopting a Hybrid approach: ML and users' annotations create linguistic clusters/tags, which are then used as the starting point or building blocks for a Symbolic engine, which will manage all the complexity of a specific use case.
- Feedback from users, once validated, can be leveraged to retrain a model without changing the most delicate part (which can be handled by the Symbolic portion of the workflow).
The Word “word” Has 13 Meanings
Thoughts around Knowledge Graphs, the semantic nature of language, and the two main types of word ambiguity.

Depending on the dictionary you use to look this up, the word "word" can have 13 meanings or more, but no matter the final count (or why one dictionary would differ from another…), what is certain is that the total number isn't 1. It would be really hard for us to understand what people are saying if we were to take words just for what they are: words. A word carries meaning, and this meaning is different based on the context of that word. Ultimately, meaning is what matters; words are just a vehicle.

Linguists, NLP (Natural Language Processing) practitioners, developers and even search engine users are all somewhat aware of the concept of "word ambiguity" (polysemy). You look for "beds" and in your search results you do find beds (the ones we sleep in) but also flowerbeds and other types of beds. Cases like this one are only minimally annoying, because the more frequent meaning of the word you're using is very likely going to be the one you're interested in…but what if it's the other way around? What if you're looking for a "house" in the sense of a family dynasty instead of a building? What if you're searching using a word that has 2 or 3 very common meanings instead of just 1 (like, for instance, "light")? It goes without saying that no technology can effectively support any activity that involves content, language, documents or communications without moving beyond words. This journey to happy document-processing land can only be completed if we are able to truly understand what the words in a document are trying to convey.

There's a second type of ambiguity, one that has to do with a lack of information, but before getting into that I need to expand a little more on the first type. If I only say one word, it is impossible to determine what I'm talking about, but I can offer context for that word by placing it in a sentence ("house, like in I live in a lovely house"; "house, like in I studied the history of the house of Tudor"); another way that doesn't require thinking of an actual example is to offer a synonym chain: "house…apartment", "house…family". A synonym chain, which sometimes requires more than just one additional word to solve an ambiguity, is a common device when semantics are applied to NLP and Computational Linguistics (a science focused on technologies that process language, documents and communication). Semantics, that is the act of looking at words as concepts instead of just words, when integrated with NLP leads to Natural Language Understanding (NLU). NLU's goal is to analyze text to extract concrete, clear information, not simply to see a document as a sequence of characters.

The second main type of ambiguity I mentioned above has to do with the fact that every concept carries information that goes beyond the concept itself. For instance, a dog is not just a dog (in terms of something that has very specific features like 4 legs, a tail, etc.); a dog is also a mammal, an animal, a living being, and so on. All of those other things a dog is come with features that imply a lot more (e.g., a dog is a mammal, therefore it doesn't lay eggs). By the same token, a house/apartment is a man-made object, while a house/family is an abstract concept indicating a group of people related to each other.
Being aware of a concept's semantic ancestors (its hyperonym chain, or superordination) and of the features it inherits from them is necessary to understand content, because real documents will rarely explain that the word "house" is referring to an apartment; we'll simply read "I live in a house", and we are expected to know that the intended meaning of "house" is the one you can live in. Why is this valuable? Because when you search for different types of buildings, you want to be able to just write "buildings" and find every document that talks about houses, villas, castles, etc. And because if you're interested in animals you can't possibly write the name of every animal; you just want to search for "animals". It's valuable because the real world of content is full of implications that are not explicitly spelled out.

These relations and features belonging to different concepts, as well as the words that are used to represent those concepts in language, are what makes up a Knowledge Graph. This modern-day repository, halfway between a dictionary and a taxonomy of reality, is what AI (Artificial Intelligence) technology uses to understand documents at a deeper, more human-like level than the alternative, simpler approach that looks at words just as words. This highly detailed level of understanding is what unleashes new experiences and advanced forms of automation that were once hard to reach.
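For a hands-on feel of polysemy and hyperonym chains, here is a minimal sketch using WordNet through the NLTK library. WordNet is a lexical database rather than the kind of Knowledge Graph described above, but it exposes the same two ideas; the sketch assumes NLTK is installed and the WordNet data has been downloaded.

```python
# Minimal sketch: polysemy and hyperonym (hypernym) chains via WordNet.
# Requires: pip install nltk, then nltk.download("wordnet") once.
from nltk.corpus import wordnet as wn

# Polysemy: the different senses of "house", each with its own definition.
for sense in wn.synsets("house"):
    print(sense.name(), "-", sense.definition())

# Semantic ancestors: one hypernym path for the animal sense of "dog",
# climbing from dog up through mammal, animal, living thing, and so on.
dog = wn.synset("dog.n.01")
print(" -> ".join(s.name() for s in reversed(dog.hypernym_paths()[0])))
```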