Discover How the AHHA Labs Team Analyzes AI Training Results and Continuously Improves Model Performance
By Doohee Lee, AI Researcher at AHHA Labs

If you want to boost your AI model’s performance, being able to analyze how it’s learning is key. In this article, we’ll walk you through how the AI team at AHHA Labs evaluates model performance—and how those insights help us build smarter and more reliable models.
1. Confusion Matrix: Analyzing Classification, Segmentation, and Anomaly Detection Performance
At AHHA Labs, we conduct detailed post-training evaluations to guide continuous model improvement. The confusion matrix, which compares predicted labels with ground truth, is one of the tools we use to assess model accuracy. This approach is applicable across classification, segmentation, and anomaly detection tasks.
It helps us identify two common issues:
- False Positives—when the model flags a defect that isn’t actually there
- False Negatives—when the model fails to detect an existing defect
The confusion matrix allows us to quickly pinpoint where the model is misclassifying data. Based on these insights, we refine labeling strategies and augment datasets to improve performance in future training cycles.

You can easily identify what types of data the model misclassified by using the confusion matrix. Treating the defective class as the positive class, the top-left corner represents cases where the model correctly predicted a normal product as normal (true negative). The top-right corner indicates cases where it incorrectly predicted a normal product as defective (false positive). The bottom-left corner shows instances where a defective product was incorrectly predicted as normal (false negative), and the bottom-right corner represents correct predictions of defective products as defective (true positive). Image Credit: AHHA Labs
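To make this concrete, here is a minimal sketch of how the four cells of such a 2×2 matrix can be tallied from predicted and ground-truth labels, treating the defective class as the positive class. The label arrays are hypothetical and only illustrate the mechanics:

```python
import numpy as np

# Hypothetical ground-truth and predicted labels: 0 = normal, 1 = defective
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])
y_pred = np.array([0, 1, 1, 0, 0, 1, 0, 1])

# Tally the four cells, with "defective" as the positive class
tp = np.sum((y_true == 1) & (y_pred == 1))  # defect correctly detected
fp = np.sum((y_true == 0) & (y_pred == 1))  # normal flagged as defective
fn = np.sum((y_true == 1) & (y_pred == 0))  # defect missed
tn = np.sum((y_true == 0) & (y_pred == 0))  # normal correctly passed

print(f"TP={tp}  FP={fp}  FN={fn}  TN={tn}")
```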
2. Bounding Box Evaluation: Object Detection Metrics That Matter
For object detection models, performance isn’t just about recognizing a defect—it’s about where and how accurately the model locates it. At AHHA Labs, we use the following metrics to evaluate bounding box performance:
(1) Bounding box (coordinates)
- IoU (Intersection over Union): Measures how closely the predicted bounding box overlaps with the actual one.
- IoU ≈ 0.9 → High accuracy
- IoU ≈ 0.5 → Moderate alignment
- IoU = 0 → No overlap

A value close to 0.9 indicates a highly accurate prediction, while a value of 0 means the prediction does not match the ground truth at all. Image Credit: AHHA Labs
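To make the metric concrete, here is a minimal sketch of how IoU between two axis-aligned boxes can be computed; the `[x1, y1, x2, y2]` box format and the example coordinates are assumptions for illustration, and coordinate conventions vary between frameworks:

```python
def iou(box_a, box_b):
    """Intersection over Union for two boxes given as [x1, y1, x2, y2]."""
    # Coordinates of the intersection rectangle
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])

    # Clamp to zero when the boxes do not overlap
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Example: a prediction that overlaps the ground truth fairly well
print(iou([10, 10, 50, 50], [12, 12, 52, 52]))  # ≈ 0.82
```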
- mAP (mean Average Precision): Averages precision scores across various IoU thresholds (e.g., 0.3, 0.5, 0.9) to provide a single overall performance score.
(2) Bounding box (class)
Next, we assess the classification performance of bounding boxes using three key metrics: Recall, Precision, and F1-score.
As illustrated in the diagram below:
- A represents the ground truth (actual labels)
- B represents the model’s predictions
Among these:
- b = correctly predicted bounding boxes (true positives)
- a = ground truth labels the model missed (false negatives)
- c = incorrect predictions by the model (false positives)

Venn diagram of ground-truth labels (A) and model predictions (B), which we use to compute the three key classification metrics for bounding boxes: Recall, Precision, and F1-score. Image Credit: AHHA Labs
📌 Recall = b / (a + b)
- TP / (TP+FN)
- Recall measures the ratio of true positives to all actual defects.
- Because its denominator includes false negatives, recall indicates the model’s ability to find all relevant defects.
- A low recall means the model is missing a significant number of defects, which can be critical in industrial applications.
📌 Precision = b / (b + c)
- TP / (TP+FP)
- Precision measures the ratio of true positives to all predicted defects.
- This reflects the reliability of the model’s predictions. A low precision score means the model is often misclassifying normal items as defective.
📌 F1-score = 2 × (Recall × Precision) / (Recall + Precision)
- The F1-score balances both precision and recall, offering a more comprehensive view of model performance.
- This is especially useful in cases of class imbalance, where one metric alone may not tell the full story.
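As a quick illustration, the three metrics can be computed directly from the counts a, b, and c defined above; the numbers below are hypothetical:

```python
# Hypothetical counts from matching predicted boxes to ground truth
b = 80   # true positives: correctly predicted boxes
a = 20   # false negatives: ground-truth boxes the model missed
c = 10   # false positives: predicted boxes with no matching ground truth

recall = b / (a + b)                               # TP / (TP + FN)
precision = b / (b + c)                            # TP / (TP + FP)
f1 = 2 * recall * precision / (recall + precision)

print(f"Recall={recall:.2f}  Precision={precision:.2f}  F1={f1:.2f}")
# Recall=0.80  Precision=0.89  F1=0.84
```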
By analyzing both recall and precision, we gain insight into how well the model handles false positives and false negatives. At AHHA Labs, we use this analysis to refine datasets and ensure models perform with high accuracy in real-world industrial environments.
3. Overfitting: A Key Challenge in Model Generalization
Overfitting is one of the most common challenges in AI training. It occurs when a model performs well on training data but struggles with new, unseen data—undermining its reliability in real-world applications.
At AHHA Labs, we carefully analyze the model’s performance report once training is complete. We monitor both the training and validation loss curves to confirm strong performance and to detect overfitting early.
(1) Monitoring loss values and comparing training and validation datasets
As shown in the graph below, we use loss curves to evaluate whether the model is learning appropriately during training.
- What is Loss?
- Loss measures the difference between the model’s predictions and the actual labels.
- The lower the loss, the closer the predictions are to the true values.
- Training Dataset vs. Validation Dataset
- The training dataset is used to teach the model by adjusting weights and minimizing loss.
- The validation dataset is used to monitor the model’s performance on data it hasn’t seen before, helping to detect overfitting during training.
Monitoring both helps us verify that the model isn’t simply overfitting to the training data, but is instead learning patterns that can generalize to unseen data.
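The sketch below illustrates this mechanism on a deliberately tiny, synthetic regression problem: only the training split is used to update the weight, while the loss is reported on both splits each epoch. The data, model, and learning rate are made up purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny synthetic regression task: y = 3x + noise
x = rng.uniform(-1, 1, size=200)
y = 3 * x + rng.normal(scale=0.1, size=200)

# Hold out part of the data as a validation set
x_train, y_train = x[:150], y[:150]
x_val, y_val = x[150:], y[150:]

w = 0.0   # single learnable weight
lr = 0.5  # learning rate

for epoch in range(20):
    # Gradient descent step on the training data only
    grad = np.mean(2 * (w * x_train - y_train) * x_train)
    w -= lr * grad

    # Loss = mean squared error between predictions and labels
    train_loss = np.mean((w * x_train - y_train) ** 2)
    val_loss = np.mean((w * x_val - y_val) ** 2)
    print(f"epoch {epoch:2d}  train_loss={train_loss:.4f}  val_loss={val_loss:.4f}")
```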
(2) Identifying overfitting—analyzing the loss curve
- Normal Training:
- When both the training loss and validation loss decrease together, it indicates that the model is learning properly.
- In this case, the model is being optimized without overfitting, and its performance is likely to generalize well to new data.
- Overfitting:
- If the training loss continues to decrease while the validation loss starts to rise at some point, this is a clear sign of overfitting.
- This means the model is becoming too specialized to the training data and is no longer performing well on unseen validation data.
- In such cases, the model starts to memorize specific patterns instead of learning features that generalize—resulting in poor performance on real-world data.

If the training loss continues to decrease while the validation loss starts to rise at some point, this is a clear sign of overfitting. Image Credit: AHHA Labs
At AHHA Labs, we address overfitting by ensuring diverse training data and carefully selecting the optimal point to save model weights—specifically, when the validation loss reaches its minimum. By doing so, we prevent overfitting and maintain the model’s best possible performance.
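A minimal sketch of that checkpoint-selection logic is shown below, combined with a simple early-stopping rule. The per-epoch validation losses are hypothetical, and in a real training loop the marked line would save the model weights using whatever framework is in use:

```python
# Hypothetical per-epoch validation losses recorded during training
val_losses = [0.92, 0.71, 0.58, 0.49, 0.46, 0.44, 0.45, 0.47, 0.52, 0.60]

best_loss = float("inf")
best_epoch = None
patience, wait = 3, 0  # stop after 3 epochs without improvement

for epoch, loss in enumerate(val_losses):
    if loss < best_loss:
        best_loss, best_epoch = loss, epoch
        wait = 0
        # In a real training loop, the model weights would be saved here
    else:
        wait += 1
        if wait >= patience:
            print(f"Early stopping at epoch {epoch}")
            break

print(f"Best checkpoint: epoch {best_epoch}, val_loss={best_loss:.2f}")
```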
With precise loss curve analysis, AHHA Labs builds models that strike the right balance—achieving target accuracy while maintaining robustness in real-world operating conditions.