# Introduction

DeepView-Validator reports precision, recall, and accuracy to measure how well a model finds and identifies ground truth objects in an image. It also reports the model's inference timings to quantify its real-time performance.

These metrics are reported in two forms: as overall values computed across the entire dataset, and as mean average precision, recall, and accuracy computed per class.

# Prerequisites

- Familiarity with the matching and classification rules established by DeepView-Validator.

# Definitions

## Precision vs. Recall

Precision and recall are common metrics for evaluating object detectors in machine learning. According to Fränti and Mariescu-Istodor (2023), "*Precision* is the number of correct results (true positives) relative to the number of all results. *Recall* is the number of correct results relative to the number of expected results" (p.1). Here, "all results" is interpreted as the model's detections and "expected results" as the ground truth in the dataset. Precision is therefore the fraction of correct detections out of the total detections, and recall is the fraction of correct detections out of the total ground truth.

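Following Vignesh-Babu (2020) and Padilla, Passos, Dias, Netto, & Da Silva (2021), precision and recall are defined as follows, where $TP$, $FP$, and $FN$ are the numbers of true positives, false positives, and false negatives.

$$
\text{precision} = \frac{TP}{TP + FP}, \qquad \text{recall} = \frac{TP}{TP + FN}
$$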

However, DeepView-Validator further categorizes false positives as either localization or classification false positives. The total number of detections is therefore the sum of true positives, classification false positives, and localization false positives. The total number of ground truths is the sum of true positives, false negatives, and classification false positives, because a classification false positive is matched to a ground truth object that the model located but labeled incorrectly.

The following image result from DeepView-Validator supports this claim.

*playing_cards_v7; 000000000027.png*

In this image there are two true positives, one false negative, and one classification false positive, for a total of four ground truth objects. For recall to remain the fraction of correct detections over all ground truths, the number of ground truths must be the sum of true positives, false negatives, and classification false positives. The formulas are therefore adjusted as follows, which is how DeepView-Validator implements them. Here $FP_{cls}$ and $FP_{loc}$ denote the numbers of classification and localization false positives.

$$
\text{precision} = \frac{TP}{TP + FP_{cls} + FP_{loc}}, \qquad \text{recall} = \frac{TP}{TP + FN + FP_{cls}}
$$
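As a rough illustration, the adjusted definitions can be sketched in Python using the counts from the playing-card image. The function names are hypothetical and are not taken from DeepView-Validator. The text does not mention any localization false positives for this image, so zero is assumed, and returning 0.0 for an empty denominator is also an assumption.

```python
def precision(tp: int, cfp: int, lfp: int) -> float:
    """Correct detections over all detections (TP + classification FP + localization FP)."""
    total_detections = tp + cfp + lfp
    # Assumption: report 0.0 when there are no detections at all.
    return tp / total_detections if total_detections else 0.0


def recall(tp: int, fn: int, cfp: int) -> float:
    """Correct detections over all ground truths (TP + FN + classification FP)."""
    total_ground_truths = tp + fn + cfp
    # Assumption: report 0.0 when there are no ground truths at all.
    return tp / total_ground_truths if total_ground_truths else 0.0


# Counts from the example image: 2 TP, 1 FN, 1 classification FP.
# Localization FP is assumed to be 0 because none are mentioned.
tp, fn, cfp, lfp = 2, 1, 1, 0
p = precision(tp, cfp, lfp)
r = recall(tp, fn, cfp)
print(f"precision={p:.3f} recall={r:.3f}")  # precision=0.667 recall=0.500
```

The recall denominator adds up to the four ground truth objects in the image, which is the consistency check the adjusted formula is meant to satisfy.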

___________________________________________________________________________________________________

According to Fränti and Mariescu-Istodor (2023), "The performance is a trade-off between precision and recall. Recall can be increased by lowering the selection threshold to provide more predictions at the cost of decreased precision." In DeepView-Validator the selection threshold is called the "detection threshold". It is passed to non-maximum suppression (NMS), which keeps only the detections whose confidence scores meet the threshold. Lowering the threshold is more lenient, so more detections are kept. This can help find more ground truth objects (higher recall), but at the expense of more incorrect detections (lower precision).
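The trade-off can be seen in a small, self-contained sketch. The detections, scores, and ground truth count below are invented for illustration, and the sketch uses the basic definitions (every incorrect detection is treated as a generic false positive). Keeping detections whose score is at or above the threshold is an assumption about the comparison, not a statement of how DeepView-Validator's NMS behaves.

```python
# Hypothetical detections for one image: (confidence score, matches a ground truth).
DETECTIONS = [(0.9, True), (0.8, True), (0.6, False), (0.4, True), (0.3, False)]
NUM_GROUND_TRUTHS = 4  # hypothetical number of annotated objects


def precision_recall_at(threshold: float) -> tuple[float, float]:
    """Keep detections scoring at or above the threshold, then score them."""
    kept = [correct for score, correct in DETECTIONS if score >= threshold]
    tp = sum(kept)  # True counts as 1
    precision = tp / len(kept) if kept else 0.0
    recall = tp / NUM_GROUND_TRUTHS
    return precision, recall


for threshold in (0.7, 0.25):
    p, r = precision_recall_at(threshold)
    print(f"threshold={threshold:.2f} precision={p:.2f} recall={r:.2f}")
# threshold=0.70 precision=1.00 recall=0.50
# threshold=0.25 precision=0.60 recall=0.75
```

With the lower threshold, one more ground truth object is found, but two incorrect detections are also admitted.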

## Accuracy

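Traditionally, the F1-score is also reported when evaluating object detection models, because it summarizes the model's overall performance in finding and identifying ground truth objects. DeepView-Validator currently reports accuracy for this purpose instead. Accuracy is defined as the fraction of correct detections out of all detections and ground truth objects, with each matched detection and ground truth pair counted once. In the formula below, $TP$ is the number of true positives, $FN$ the number of false negatives, and $FP_{cls}$ and $FP_{loc}$ the numbers of classification and localization false positives.

$$
\text{accuracy} = \frac{TP}{TP + FP_{cls} + FP_{loc} + FN}
$$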

## Overall Metrics

The overall metrics are reported as "Overall Precision", "Overall Recall", and "Overall Accuracy". They are based on the precision, recall, and accuracy of the entire dataset, without any per-class computation. The same equations defined above are used, but on the basis of the total numbers of true positives, false positives, and false negatives gathered throughout validation.

**Overall Precision**

Precision alone does not summarize model performance, because it considers only the ratio of correct detections to total detections. Consider a model that makes 9 detections, all of them correct, on a dataset with 200 ground truth annotations. Precision is 100%, but the model missed the other 191 annotations, so recall is only 4.5%.

**Overall Recall**

Recall has a similar limitation, because it considers only the ratio of correct detections to the total number of ground truths. A model can correctly find every ground truth annotation and still generate a large number of localization false positives.

**Overall Accuracy**

The accuracy metric represents overall model performance better than precision or recall alone. It combines both by considering correct detections (TP), false detections (localization FP and classification FP), and missed detections (FN). Accuracy is the ratio of correct detections to all model detections and all ground truth objects. It measures how well the model's detections align with the ground truth: perfect alignment means zero missed annotations and zero false detections.
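The example of 9 correct detections against 200 ground truths can be worked through in a short sketch. The function name is hypothetical. The accuracy denominator `TP + CFP + LFP + FN` is one reading of "all model detections and all ground truth objects" in which each matched pair is counted once; it is an assumption rather than a confirmed detail of DeepView-Validator.

```python
def overall_metrics(tp: int, cfp: int, lfp: int, fn: int) -> tuple[float, float, float]:
    """Return (precision, recall, accuracy) from dataset-wide totals."""
    detections = tp + cfp + lfp      # everything the model reported
    ground_truths = tp + fn + cfp    # everything that was annotated
    union = tp + cfp + lfp + fn      # assumed accuracy denominator
    precision = tp / detections if detections else 0.0
    recall = tp / ground_truths if ground_truths else 0.0
    accuracy = tp / union if union else 0.0
    return precision, recall, accuracy


# 9 detections, all correct, against 200 annotations (191 missed).
p, r, a = overall_metrics(tp=9, cfp=0, lfp=0, fn=191)
print(f"precision={p:.1%} recall={r:.1%} accuracy={a:.1%}")
```

In this case accuracy stays as low as recall, so the perfect precision score does not hide the 191 missed annotations.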

## Mean Average Metrics

The mean average metrics are reported as "mAP", "mAR", and "mACC" for the validation IoU thresholds 0.50, 0.75, and 0.50-0.95.

**Mean Average Precision**

The mean average precision is based on the area under the precision vs. recall curve which plots the tradeoff between precision and recall by adjusting the thresholds.

According to Tenyks Blogger (2023), "The Precision-Recall curve is often represented in one of two ways: with a fixed IoU and varying confidence threshold (i.e. PASCAL VOC challenge), or with a varying IoU and a fixed confidence threshold (i.e. COCO challenge)."

DeepView-Validator follows the COCO implementation of the precision vs. recall curve, which varies the IoU threshold while keeping the confidence threshold fixed. The average precision is the area under the precision vs. recall curve for each class at each IoU threshold. The mean average precision at 0.50 and at 0.75 is the mean of the average precision across the classes at that single IoU threshold. For the 0.50-0.95 range, the average precision of each class is first averaged across the IoU thresholds, and the mean average precision 0.50-0.95 is then the mean of these per-class values across the classes.
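The averaging steps described above can be sketched as follows. The sketch takes per-class average precision values as given and covers only the averaging, not the area-under-curve computation. The class names, AP values, and function names are hypothetical. IoU thresholds are keyed as integer percentages (50, 55, ..., 95) to avoid floating-point dictionary keys.

```python
from statistics import mean

IOU_PERCENTS = list(range(50, 100, 5))  # 50, 55, ..., 95 (10 thresholds)

# Hypothetical AP values: {class name: {IoU threshold in percent: AP}}.
AP_PER_CLASS = {
    "ace": {t: 0.90 - 0.01 * (t - 50) for t in IOU_PERCENTS},
    "king": {t: 0.80 - 0.01 * (t - 50) for t in IOU_PERCENTS},
}


def map_at(ap_per_class: dict, iou_percent: int) -> float:
    """Mean of the per-class AP at a single IoU threshold."""
    return mean(ap[iou_percent] for ap in ap_per_class.values())


def map_range(ap_per_class: dict) -> float:
    """Average each class over the IoU thresholds, then average over classes."""
    per_class = [mean(ap[t] for t in IOU_PERCENTS) for ap in ap_per_class.values()]
    return mean(per_class)


print(f"mAP@0.50      = {map_at(AP_PER_CLASS, 50):.3f}")
print(f"mAP@0.75      = {map_at(AP_PER_CLASS, 75):.3f}")
print(f"mAP@0.50-0.95 = {map_range(AP_PER_CLASS):.3f}")
```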

**Mean Average Recall**

The mean average recall at a validation IoU threshold of 0.50 or 0.75 is calculated with the equation below, where $N$ is the number of classes and $\text{recall}_c(t)$ is the recall of class $c$ at validation IoU threshold $t$.

$$
\text{mAR}_t = \frac{1}{N} \sum_{c=1}^{N} \text{recall}_c(t)
$$

That is, the metric is the sum of the recall values of each class divided by the number of classes. The validation thresholds determine how strictly a true positive is defined. A detection is a true positive if it correctly identifies the ground truth, has a score greater than the validation score threshold, and has an IoU greater than the validation IoU threshold.

According to Tenyks Blogger (2023), "AR is defined as the recall averaged over a range of IoU thresholds (from 0.50 to 1.0) ... We can compute mean average recall (mAR) as the mean of AR across all classes".

DeepView-Validator calculates this metric for mAR 0.50-0.95 by taking the sum of mAR (AR) values at validation IoU thresholds 0.50, 0.55, ..., 0.95 and then dividing by the number of validation IoU thresholds (in this case 10).
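A minimal sketch of this two-step average is shown below. The per-class recall values and the function names are hypothetical, and IoU thresholds are keyed as integer percentages to avoid floating-point dictionary keys.

```python
from statistics import mean

IOU_PERCENTS = list(range(50, 100, 5))  # 50, 55, ..., 95 (10 thresholds)

# Hypothetical per-class recall: {IoU threshold in percent: {class name: recall}}.
RECALL_BY_IOU = {
    t: {"ace": 1.00 - 0.01 * (t - 50), "king": 0.80 - 0.01 * (t - 50)}
    for t in IOU_PERCENTS
}


def mar_at(recall_by_iou: dict, iou_percent: int) -> float:
    """Sum of per-class recall at one IoU threshold divided by the number of classes."""
    return mean(recall_by_iou[iou_percent].values())


def mar_range(recall_by_iou: dict) -> float:
    """Sum of mAR over the IoU thresholds divided by the number of thresholds."""
    return sum(mar_at(recall_by_iou, t) for t in IOU_PERCENTS) / len(IOU_PERCENTS)


print(f"mAR@0.50      = {mar_at(RECALL_BY_IOU, 50):.3f}")
print(f"mAR@0.75      = {mar_at(RECALL_BY_IOU, 75):.3f}")
print(f"mAR@0.50-0.95 = {mar_range(RECALL_BY_IOU):.3f}")
```

Because the mean average accuracy is described as being calculated like mAR, the same two functions would apply to per-class accuracy values in place of recall.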

**Mean Average Accuracy**

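Similarly, the mean average accuracy is calculated like mAR, following the equation below, where $N$ is the number of classes and $\text{accuracy}_c(t)$ is the accuracy of class $c$ at validation IoU threshold $t$.

$$
\text{mACC}_t = \frac{1}{N} \sum_{c=1}^{N} \text{accuracy}_c(t)
$$

This metric is the sum of the accuracy values of each class divided by the number of classes, at the specified validation IoU threshold of 0.50 or 0.75.

The equation below calculates the mean average accuracy over the range of validation IoU thresholds 0.50-0.95, in the same way as for mean average recall.

$$
\text{mACC}_{0.50\text{-}0.95} = \frac{1}{10} \sum_{t \in \{0.50,\, 0.55,\, \ldots,\, 0.95\}} \text{mACC}_t
$$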

## False Positive Rates

As mentioned previously, there are two categories of false positives: classification and localization. A classification false positive is matched to a ground truth based on the validation IoU threshold, but the model's predicted label does not match the ground truth label. A localization false positive is a model prediction that does not correspond to any ground truth in the image.
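The distinction can be sketched with a simplified classifier for a single detection. This is only an illustration: the function names and box format `(xmin, ymin, xmax, ymax)` are hypothetical, the detection is compared against its best-overlapping ground truth only, and the actual matching rules in DeepView-Validator (covered by the prerequisite article) may differ, for example in how ties and one-to-one matching are handled.

```python
def iou(a: tuple, b: tuple) -> float:
    """Intersection over union of two boxes given as (xmin, ymin, xmax, ymax)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def classify_detection(box: tuple, label: str, ground_truths: list,
                       iou_threshold: float = 0.5) -> str:
    """Label one detection using its best-overlapping ground truth (simplified)."""
    best_iou, best_label = 0.0, None
    for gt_box, gt_label in ground_truths:
        overlap = iou(box, gt_box)
        if overlap > best_iou:
            best_iou, best_label = overlap, gt_label
    if best_iou > iou_threshold:
        # Located an object: correct label is a TP, wrong label is a classification FP.
        return "true_positive" if label == best_label else "classification_fp"
    # No ground truth overlaps enough: localization FP.
    return "localization_fp"


ground_truths = [((0, 0, 10, 10), "ace")]  # hypothetical annotation
print(classify_detection((0, 0, 10, 10), "ace", ground_truths))
print(classify_detection((1, 1, 10, 10), "king", ground_truths))
print(classify_detection((20, 20, 30, 30), "ace", ground_truths))
```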

**Localization False Positive Error**

This is the ratio of the localization false positives over the total false positives.

**Classification False Positive Error**

This is the ratio of the classification false positives over the total false positives.

These errors give information about the model's behavior. They show where the model makes the most errors: through misidentification of labels (classification) or mislocalization of objects (localization).
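Both ratios share the same denominator, so they can be computed together, as in the sketch below. The function name and the counts are hypothetical, and returning zeros when there are no false positives is an assumption.

```python
def false_positive_errors(cfp: int, lfp: int) -> tuple[float, float]:
    """Return (localization FP error, classification FP error) as fractions of all FPs."""
    total_fp = cfp + lfp
    if total_fp == 0:
        # Assumption: with no false positives, report both errors as 0.0.
        return 0.0, 0.0
    return lfp / total_fp, cfp / total_fp


# Hypothetical totals: 3 classification FPs and 9 localization FPs.
loc_error, cls_error = false_positive_errors(cfp=3, lfp=9)
print(f"localization={loc_error:.0%} classification={cls_error:.0%}")
# localization=75% classification=25%
```

Whenever there is at least one false positive, the two values sum to 1, so a high localization error implies a low classification error and vice versa.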

## Timings

The timings represent how fast the model and DeepView-Validator process a given image, in milliseconds:

- Load Time: How long it takes to preprocess the image, such as resizing or letterbox transformations, to meet the model's input requirements. This set of timings measures DeepView-Validator's processing time.

- Inference Time: How long it takes for the model to perform inference to detect objects in an image. This set of timings measures the model's timing performance.

- Box (Decode) Time: How long it takes to postprocess the model outputs into proper bounding boxes, scores, and labels. This can include NMS processing, conversion of labels into strings, or reshaping of the outputs.

Mean Load Time is the sum of all load times in all images divided by the total number of images.

Mean Inference Time is the sum of all inference times in all images divided by the total number of images.

Mean Box Time is the sum of all box times in all images divided by the total number of images.
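The three means can be computed in one pass over per-image timings, as in this sketch. The timing values, dictionary keys, and function name are hypothetical and do not reflect DeepView-Validator's internal data structures.

```python
# Hypothetical per-image timings in milliseconds.
TIMINGS_MS = [
    {"load": 4.0, "inference": 12.0, "box": 2.0},
    {"load": 5.0, "inference": 14.0, "box": 2.0},
    {"load": 6.0, "inference": 16.0, "box": 5.0},
]


def mean_timings(per_image: list) -> dict:
    """Sum each stage's timings over all images and divide by the image count."""
    count = len(per_image)
    return {
        stage: sum(timing[stage] for timing in per_image) / count
        for stage in ("load", "inference", "box")
    }


means = mean_timings(TIMINGS_MS)
print(means)  # {'load': 5.0, 'inference': 14.0, 'box': 3.0}
```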

# Conclusion

This article has shown how to calculate the metrics reported in DeepView-Validator. The basic equations for precision, recall, and accuracy were shown. The article described how to calculate the overall metrics, the mean average metrics, false positive error rates, and the mean timings.

# References

Fränti, P., & Mariescu-Istodor, R. (2023, March 1). Soft precision and recall. https://doi.org/10.1016/j.patrec.2023.02.005

Babu, G. V. (2021, December 13). Metrics on Object Detection - gandham vignesh babu - Medium. Retrieved from https://vignesh943628.medium.com/metrics-on-object-detection-b9fe3f1bac59

Padilla, R., Passos, W. L., Dias, T. L. B., Netto, S. L., & Da Silva, E. A. B. (2021, January 25). A Comparative Analysis of Object Detection Metrics with a Companion Open-Source Toolkit. https://doi.org/10.3390/electronics10030279

Blogger, T. (2023, November 7). Mean Average Precision (mAP): Definitions & Misconceptions | Medium. Retrieved from https://medium.com/@tenyks_blogger/mean-average-precision-definition-and-common-myths-c679a809807a
