Precision and Recall, F-score and Accuracy – Measuring NLP performance

  • 20 October 2021

When discussing Artificial Intelligence and Natural Language Processing, we often hear mention of Precision, Recall, F-score, and Accuracy. These are ways to measure the quality of software that returns information about an analyzed document.

These same scoring systems can also be used to assess the performance of other technologies, for instance search engines, since the ultimate goal is to quantify, on a percentage scale, their ability to retrieve the desired data.

Let’s look at a slightly more in-depth definition of each, before addressing the most important point: why they’re relevant.

  • Precision: Given a set of results from a processed document, Precision is the percentage value indicating how many of those results are correct (correct being based on the expectations of a given application). It can apply to any class of a predictive AI system, like Search, Categorization, Entity Recognition, etc. Example: given a document mentioning 10 dog breeds and an application that’s supposed to find all the dog breeds in a document, if the application returns 5 values and all 5 are indeed dog breeds, then the system performed at 100% Precision (even though 5 of the 10 dog-breed mentions were missed, every result that was returned was correct).
  • Recall: Given a set of results from a processed document, Recall is the percentage value indicating how many of the correct results were found (correct being based on the expectations of a given application). It can apply to any class of a predictive AI system, like Search, Categorization, Entity Recognition, etc. Example: given a document mentioning 10 instances of dog breeds and an application that’s supposed to find all the dog breeds in a document, if the application returns 5 values and all 5 are indeed dog breeds, then the system performed at 50% Recall (only 5 of the 10 dog-breed mentions were found).
  • F-score (F-measure, F1 score): The F-score is the harmonic mean of the Precision and Recall values of a system, calculated as: 2 x [(Precision x Recall) / (Precision + Recall)]. Criticism of using F-score values to judge the quality of a predictive system is based on the fact that a moderately high F-score can result from an imbalance between Precision and Recall, and therefore doesn’t tell the whole story. On the other hand, most systems, past a high level of quality, struggle to improve one of the two indicators (Precision, Recall) without negative effects on the other. Risk-critical applications that value retrieving information over its precision (producing a large number of false positives, but virtually guaranteeing that all the true positives are found) sometimes adopt a different scoring system called the F2 measure, where Recall carries a higher weight. The opposite (a higher weight on Precision) is achieved with the F0.5 measure.
  • Accuracy: Accuracy is a scoring system for binary classification (i.e., determining whether an answer or returned piece of information is correct or not), sometimes used as an alternative to the F-score, and it’s calculated as: (True Positives + True Negatives) / (True Positives + True Negatives + False Positives + False Negatives). The code sketch after this list shows how all four metrics translate into practice.
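To make the definitions concrete, here is a minimal Python sketch of all four metrics, reusing the hypothetical dog-breed example above (10 breed mentions in the document, 5 returned, all 5 correct):

```python
def precision(tp: int, fp: int) -> float:
    """Fraction of returned results that are correct."""
    return tp / (tp + fp)

def recall(tp: int, fn: int) -> float:
    """Fraction of correct results that were actually returned."""
    return tp / (tp + fn)

def f_beta(p: float, r: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of Precision and Recall.
    beta=1 is the F1 score; beta=2 weights Recall higher (F2);
    beta=0.5 weights Precision higher (F0.5)."""
    return (1 + beta**2) * (p * r) / (beta**2 * p + r)

def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of all decisions (positive and negative) that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)

# Dog-breed example: 5 returned, all correct (tp=5, fp=0), 5 missed (fn=5).
p = precision(tp=5, fp=0)        # 1.00 -> 100% Precision
r = recall(tp=5, fn=5)           # 0.50 -> 50% Recall
print(f_beta(p, r))              # F1  = 2*(1.0*0.5)/(1.0+0.5) ~ 0.667
print(f_beta(p, r, beta=2))      # F2  ~ 0.556, punishes the low Recall more
print(f_beta(p, r, beta=0.5))    # F0.5 ~ 0.833, rewards the high Precision
```

Note how the same Precision/Recall pair yields three different F values depending on the weighting, which is exactly the imbalance issue mentioned in the F-score definition above.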

[Image: Precision and recall diagram. By Walber - Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=36926283]

Since F-score and Accuracy are, very simply, quick and easy ways to rank a system with one single number meant to offer a holistic perspective (naturally, as explained above, losing some of the detail along the way and therefore failing to give a complete view), we don’t need to discuss those two to understand why measuring NLP performance is valuable. For that, we can focus on Precision and Recall.

Simply put, Precision can be considered an inverted measure of noise: the farther this value is from a perfect score, the more incorrect data will be part of the output a system produces. Similarly, Recall is an inverted measure of silence: the farther this value is from a perfect score, the more relevant data will be missing from the output a system produces.
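As a quick illustration of this reading, reusing the dog-breed numbers from the sketch above, noise and silence are simply the complements of Precision and Recall:

```python
p, r = 1.0, 0.5      # 100% Precision, 50% Recall from the dog-breed example
noise = 1.0 - p      # 0.0 -> no incorrect items in the output
silence = 1.0 - r    # 0.5 -> half of the relevant items are missing
print(f"noise={noise:.0%}, silence={silence:.0%}")   # noise=0%, silence=50%
```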

Knowing how NLP software performs on these standard scales is extremely important at the business level, since it directly impacts ROI, and at the technological level, because it impacts the architecture of a solution. As an example: a bank that wants to automate part of a loan-processing workflow can use NLP to speed up the segment of this process that identifies fraudulent applications. The organization will accept that some applications are flagged for manual verification, which implies labor costs; the lack of Precision will therefore affect the return on investment of this technological solution. It will also affect the architecture design, because the process will need a seamless integration between the work the NLP solution performs and the interface the operators adopt during manual verification. Finally, the bank will not accept missing a single fraudulent application, which means the solution will have to offer close to 100% Recall, even if that means Precision suffers and even more manual revision is required.

One key disclaimer should be clear when analyzing these scores and evaluating the performance of an AI solution applied to unstructured data: no system is ever perfect. These technologies can be very effective or less so depending on the complexity of a problem and on how clean the processed documents are (the divide can be very wide), but generally speaking one never encounters software that can consistently deliver 100% Precision and 100% Recall at the same time. In fact, in most real-world scenarios, not even humans are able to deliver perfect results every hour of every day. Having said that, this kind of software is flexible and can be tailored to a specific use case so that it delivers very high scores on one of the two principal indicators, as long as it is acceptable to take a hit on the other.

For instance, following the example of the bank and the automated flagging of fraudulent applications for manual revision, let’s say that one trigger is an inconsistent residential address across multiple documents. If we know that a specific type of document (a proof of change of address), when present, justifies the inconsistency, then we don’t need to flag the application (this is a system with high Precision and high Recall). On the other hand, if that proof can be faked, then we have to assess its validity and flag only the applications that look unreliable, therefore possibly getting it wrong and not flagging some applications that should have been flagged (in this scenario Precision is just as high as before, but Recall dropped). Finally, if the bank cannot accept missing a single fraudulent application, the NLP software can disregard the change-of-address proof and flag all documents presenting the address inconsistency, naturally sending some non-fraudulent applications to manual revision (Precision is now lower and reviewers will sometimes spend time on an application they didn’t need to see, but Recall is 100%).
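Below is a toy Python simulation of the flagging policies just described. The applications, fields, and rules are all made up for illustration; the point is only to show how a policy choice moves Precision and Recall in opposite directions:

```python
# Each hypothetical application is a tuple:
# (address_inconsistent, has_proof_of_change, is_fraud)
applications = [
    (True,  False, True),   # fraud: address mismatch, no proof
    (True,  True,  False),  # legitimate move with a genuine proof
    (True,  True,  True),   # fraud hiding behind a faked proof
    (False, False, False),  # nothing suspicious
]

def score(flags):
    """Compute (Precision, Recall) for a list of flag decisions."""
    tp = sum(f and fraud for f, (_, _, fraud) in zip(flags, applications))
    fp = sum(f and not fraud for f, (_, _, fraud) in zip(flags, applications))
    fn = sum(not f and fraud for f, (_, _, fraud) in zip(flags, applications))
    p = tp / (tp + fp) if tp + fp else 1.0
    r = tp / (tp + fn) if tp + fn else 1.0
    return p, r

# Policy 1: trust any proof of change of address. In a world where proofs
# cannot be faked this gives high Precision AND high Recall; here it
# misses the faked-proof fraud, so Recall drops.
trusting = [inc and not proof for inc, proof, _ in applications]
print(score(trusting))   # (1.0, 0.5)

# Policy 2: the bank cannot miss a single fraud, so flag every address
# inconsistency regardless of proof. Recall is 100%, Precision drops.
paranoid = [inc for inc, _, _ in applications]
print(score(paranoid))   # (~0.667, 1.0)
```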

The above barely scratches the surface of how important these scoring systems are, but when Quality Assurance is properly integrated into a workflow, Precision and Recall values can be used to directly calculate the real-time value (monetary or resource-centric) of Automation in an organization.
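As a back-of-the-envelope sketch of that kind of calculation, here is one possible way to turn a Precision figure into a labor-cost estimate; every number below is a made-up assumption, not a benchmark:

```python
monthly_volume = 10_000         # documents processed per month (assumed)
automation_rate = 0.80          # share no longer needing a human pass (assumed)
cost_per_manual_review = 12.0   # loaded labor cost per review, in dollars (assumed)
flagging_precision = 0.85       # share of flagged items actually worth reviewing (assumed)

flagged = monthly_volume * (1 - automation_rate)   # 2,000 items sent to humans
wasted_reviews = flagged * (1 - flagging_precision)  # false positives reviewed for nothing
gross_savings = monthly_volume * automation_rate * cost_per_manual_review
false_positive_cost = wasted_reviews * cost_per_manual_review

print(f"gross monthly savings:   ${gross_savings:,.0f}")       # $96,000
print(f"cost of false positives: ${false_positive_cost:,.0f}")  # $3,600
```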
