Accuracy Calculator
Enter values in the Confusion Matrix tab first, then return here for the full worked solution.
What is Accuracy?
Accuracy is a classification metric that measures the proportion of correct predictions out of all predictions made. It answers one direct question: out of every case evaluated, how many did the model or test classify correctly? Formally, accuracy is defined as the number of true positives plus true negatives divided by the total number of observations.
In both machine learning and diagnostic testing, accuracy is computed from a confusion matrix, a two-by-two table that records how a classifier's predictions compare to the actual ground truth labels. The four cells of that table (TP, TN, FP, and FN) are the raw material from which accuracy and every related metric are derived.
Accuracy Formula Library
Seven formulas govern classification performance evaluation. Accuracy alone does not tell the full story; each formula below targets a different aspect of predictive quality. Understanding when each applies is as important as knowing how to compute it.
Basic Accuracy Formula
Accuracy = (TP + TN) / (TP + TN + FP + FN)
In percentage:
Accuracy (%) = Accuracy × 100
Error Rate Formula
Error Rate = 1 − Accuracy
Equivalently:
Error Rate = (FP + FN) / Total
Precision Formula
Precision = TP / (TP + FP)
Answers: Of all positive predictions, how many were actually correct?
Recall (Sensitivity) Formula
Recall = TP / (TP + FN)
Answers: Of all actual positives, how many did the model catch?
Specificity Formula
Specificity = TN / (TN + FP)
Answers: Of all actual negatives, how many were correctly ruled out?
F1 Score & Balanced Accuracy
F1 = 2 × (P × R) / (P + R), where P = Precision and R = Recall
Balanced Accuracy = (Sensitivity + Specificity) / 2
These formulas are standard across machine learning, clinical epidemiology, and quality assurance. The scikit-learn documentation on model evaluation covers their Python implementations, while the BMJ's Statistics at Square One addresses their use in clinical research.
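The formulas above can be sketched in a few lines of plain Python. This is a minimal illustration, not the calculator's actual implementation: the function name `classification_metrics` and the choice to return `None` for an undefined ratio (zero denominator) are assumptions made here, and the input counts are the ones used in the page's 200-patient screening example.

```python
def classification_metrics(tp, tn, fp, fn):
    """Return the formula-library metrics as fractions between 0 and 1."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("confusion matrix is empty")

    def ratio(numerator, denominator):
        # Assumption: an undefined ratio (zero denominator) is reported as None.
        return numerator / denominator if denominator else None

    accuracy = (tp + tn) / total
    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)        # also called sensitivity
    specificity = ratio(tn, tn + fp)

    f1 = None
    if precision is not None and recall is not None and precision + recall > 0:
        f1 = 2 * precision * recall / (precision + recall)

    balanced_accuracy = None
    if recall is not None and specificity is not None:
        balanced_accuracy = (recall + specificity) / 2

    return {
        "accuracy": accuracy,
        "error_rate": 1 - accuracy,
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
        "balanced_accuracy": balanced_accuracy,
    }


metrics = classification_metrics(tp=85, tn=90, fp=10, fn=15)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Returning fractions rather than percentages keeps the function composable; multiply by 100 only when displaying the result.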
How to Calculate Accuracy from a Confusion Matrix — Step by Step
To calculate accuracy: build your confusion matrix, sum the four cells to get the total, then divide the number of correct predictions (TP + TN) by that total. Here is the complete method with a worked numerical example.
Step 1: From your model's predictions versus actual labels, record TP (predicted positive, actually positive), TN (predicted negative, actually negative), FP (predicted positive, actually negative), and FN (predicted negative, actually positive). Example: a disease screening test on 200 patients yields TP = 85, TN = 90, FP = 10, FN = 15.
Step 2: Total = TP + TN + FP + FN = 85 + 90 + 10 + 15 = 200. This is the denominator for accuracy and error rate; precision, recall, and specificity each divide by their own subset of the matrix.
Step 3: Accuracy = (TP + TN) / Total = (85 + 90) / 200 = 175 / 200 = 0.875.
Step 4: 0.875 × 100 = 87.5%. The test correctly classifies 87.5% of all patients.
Step 5: Precision = 85 / (85 + 10) = 89.5%. Recall = 85 / (85 + 15) = 85.0%. F1 Score = 2 × (0.895 × 0.850) / (0.895 + 0.850) = 87.2%. Because the classes are balanced here (100 actual positives vs. 100 actual negatives), accuracy is a fair summary. On imbalanced datasets, F1 score and balanced accuracy carry more weight.
Result: TP = 85, TN = 90, FP = 10, FN = 15, Total = 200. Accuracy = 87.5%, Error Rate = 12.5%, Precision = 89.5%, Recall = 85.0%, Specificity = 90.0%, F1 = 87.2%. Verify all six values using the calculator above.
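If you start from raw labels rather than a finished matrix, the four cells have to be counted first. The sketch below shows one way to do that in plain Python; the helper name `confusion_counts` and the convention that `1` marks the positive class are assumptions for illustration. The label lists are synthetic, built only to reproduce the 200-patient counts used above.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, TN, FP, FN by comparing predictions with actual labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    tp = tn = fp = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1   # predicted positive, actually positive
            else:
                fp += 1   # predicted positive, actually negative
        else:
            if actual == positive:
                fn += 1   # predicted negative, actually positive
            else:
                tn += 1   # predicted negative, actually negative
    return tp, tn, fp, fn


# Synthetic labels arranged to give TP = 85, TN = 90, FP = 10, FN = 15.
y_true = [1] * 85 + [0] * 90 + [0] * 10 + [1] * 15
y_pred = [1] * 85 + [0] * 90 + [1] * 10 + [0] * 15

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
total = tp + tn + fp + fn
accuracy = (tp + tn) / total

print(tp, tn, fp, fn)                                  # 85 90 10 15
print(f"Accuracy = {accuracy:.3f} ({accuracy:.1%})")   # Accuracy = 0.875 (87.5%)
```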
Worked Examples Across Three Domains
Example 1 — Disease Screening Test
TP = 850 (sick patients correctly identified), TN = 8,850 (healthy patients correctly cleared), FP = 150 (healthy patients falsely flagged), FN = 150 (sick patients missed). Total = 10,000.
Accuracy = (850 + 8,850) / 10,000 = 9,700 / 10,000 = 97.0%.
Recall = 850 / (850 + 150) = 850 / 1,000 = 85.0%. This means 15% of truly sick patients are missed by the test, which is clinically significant.
Specificity = 8,850 / (8,850 + 150) = 8,850 / 9,000 = 98.3%. Very few healthy patients are falsely flagged.
Interpretation: 97% accuracy sounds impressive, but 15% of sick patients go undetected (recall = 85%). For a serious disease, this miss rate could be clinically unacceptable. This is why the WHO's diagnostic test guidance calls for sensitivity and specificity to be reported alongside accuracy.
Example 2 — Spam Detection System (NLP)
TP = 2,300 (spam correctly flagged), TN = 2,400 (legitimate emails correctly passed), FP = 100 (legitimate emails wrongly flagged as spam), FN = 200 (spam emails missed). Total = 5,000.
Accuracy = (2,300 + 2,400) / 5,000 = 4,700 / 5,000 = 94.0%.
Precision = 2,300 / (2,300 + 100) = 2,300 / 2,400 = 95.8%. Very few legitimate emails are lost to the spam folder.
Recall = 2,300 / (2,300 + 200) = 92.0%. F1 = 2 × (0.958 × 0.920) / (0.958 + 0.920) = 93.9%. Balanced model.
Interpretation: The model achieves 94% accuracy with precision prioritized over recall — the correct trade-off for spam filtering, where sending legitimate email to spam (false positive) costs more than letting occasional spam through (false negative).
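The F1 figure in this example is computed from rounded precision and recall. As a cross-check, F1 can also be written directly in terms of counts, F1 = 2·TP / (2·TP + FP + FN), which is algebraically the same harmonic mean but avoids intermediate rounding. The snippet below is a small sketch that compares the two forms on the spam counts; the variable names are illustrative.

```python
# Spam-detection counts from the example above (TN is not needed for F1).
tp, fp, fn = 2300, 100, 200

precision = tp / (tp + fp)
recall = tp / (tp + fn)

# Harmonic-mean form: F1 = 2PR / (P + R)
f1_harmonic = 2 * precision * recall / (precision + recall)

# Count form: F1 = 2TP / (2TP + FP + FN)
f1_counts = 2 * tp / (2 * tp + fp + fn)

print(f"precision = {precision:.3f}, recall = {recall:.3f}")
print(f"F1 (harmonic mean) = {f1_harmonic:.3f}")
print(f"F1 (from counts)   = {f1_counts:.3f}")
```

Both forms agree up to floating-point error, and neither uses TN, which is why F1 says nothing about how well the negative class is handled.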
Example 3 — Credit Risk Prediction (Financial Modeling)
TP = 120 (defaulters correctly flagged), TN = 1,750 (repayers correctly approved), FP = 50 (repayers wrongly rejected), FN = 80 (defaulters missed and approved). Total = 2,000.
Accuracy = (120 + 1,750) / 2,000 = 1,870 / 2,000 = 93.5%.
A model that approves everyone would achieve (0 + 1,800) / 2,000 = 90% accuracy without catching a single defaulter. The difference in accuracy between our model (93.5%) and this naive strategy (90%) understates the real improvement in predictive value.
Recall = 120 / 200 = 60.0%. Specificity = 1,750 / 1,800 = 97.2%. Balanced Accuracy = (0.600 + 0.972) / 2 = 78.6%. This is the more honest summary on imbalanced data.
Interpretation: Despite 93.5% accuracy, the model catches only 60% of defaulters. Balanced accuracy (78.6%) and recall (60%) are the metrics that lenders and regulators actually need to evaluate. This illustrates the accuracy paradox directly.
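The comparison with the approve-everyone strategy can be made concrete with a short sketch. The code below scores both the model and the naive baseline on accuracy and balanced accuracy using the counts from this example; the function names are illustrative, and the baseline matrix assumes all 200 defaulters are approved (FN = 200) and all 1,800 repayers are approved (TN = 1,800).

```python
def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)


def balanced_accuracy(tp, tn, fp, fn):
    sensitivity = tp / (tp + fn)   # share of defaulters caught
    specificity = tn / (tn + fp)   # share of repayers approved
    return (sensitivity + specificity) / 2


model = dict(tp=120, tn=1750, fp=50, fn=80)
# Naive strategy: approve everyone, so no applicant is ever flagged.
approve_all = dict(tp=0, tn=1800, fp=0, fn=200)

for label, cells in [("model", model), ("approve everyone", approve_all)]:
    print(
        f"{label}: accuracy = {accuracy(**cells):.3f}, "
        f"balanced accuracy = {balanced_accuracy(**cells):.3f}"
    )
```

On accuracy the two strategies differ by only 3.5 points, while balanced accuracy separates them clearly, because the naive strategy scores exactly 0.5 there.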
Accuracy Score Interpretation: Quick Reference
What counts as a "good" accuracy score depends entirely on the application, class balance, and the cost of errors. The table below provides general benchmarks, but always pair them with recall and precision before drawing conclusions.
Table: Accuracy Score Ranges and Interpretation
| Accuracy Score | Interpretation | Common Context | Watch Out For |
|---|---|---|---|
| 95–100% | Excellent | Image classification, OCR, medical imaging AI | Overfitting; check on unseen test data |
| 90–94% | Very Good | NLP classifiers, diagnostic tests | Still inspect recall on minority class |
| 80–89% | Good | Fraud detection, churn prediction | May need tuning; check F1 score |
| 70–79% | Fair | Early-stage models, noisy data | Review feature engineering |
| Below 70% | Poor | Near-random performance on balanced binary data | Compare against the majority-class baseline; check label quality and features |
Accuracy vs. Precision vs. Recall vs. F1 Score
Accuracy measures overall correctness. Precision targets the quality of positive predictions. Recall targets the completeness of positive detection. F1 Score balances precision and recall into one number for imbalanced datasets. Choosing the right metric depends on what type of error is more costly in your application.
Table: Classification Metrics Compared
| Metric | Formula | Best Used When | Limitation |
|---|---|---|---|
| Accuracy | (TP+TN) / Total | Balanced classes, general overview | Misleading on imbalanced data (accuracy paradox) |
| Precision | TP / (TP+FP) | Cost of FP is high (spam, fraud alerts) | Ignores false negatives entirely |
| Recall | TP / (TP+FN) | Cost of FN is high (cancer screening, safety) | Can be maximized by predicting everything positive |
| F1 Score | 2PR / (P+R) | Imbalanced classes, when both FP and FN matter | Does not include TN in the calculation |
| Balanced Accuracy | (Sensitivity+Specificity)/2 | Highly imbalanced binary classification | Less interpretable than F1 in some contexts |
| Specificity | TN / (TN+FP) | Ruling out conditions (clinical screening) | Says nothing about positive prediction quality |
Confusion Matrix and Accuracy: Complete Formula Reference
The table below lists every key term and formula related to accuracy and confusion matrix evaluation. It is structured for direct reference by students, researchers, and practitioners.
Table: Accuracy Metric Glossary — 10 Key Entities
| Term | Symbol / Formula | Plain-English Definition | Primary Use Case |
|---|---|---|---|
| Accuracy | (TP+TN) / Total | Proportion of all predictions that are correct | General model evaluation on balanced data |
| Error Rate | (FP+FN) / Total | Proportion of all predictions that are wrong; equals 1 minus accuracy | Communicating failure rate to non-technical audiences |
| Precision | TP / (TP+FP) | Of all positive predictions, the share that were actually positive | When false positives are costly: spam detection, fraud alerts |
| Recall (Sensitivity) | TP / (TP+FN) | Of all actual positives, the share the model correctly identified | When false negatives are costly: disease detection, safety systems |
| Specificity | TN / (TN+FP) | Of all actual negatives, the share correctly ruled out | Clinical screening; ruling out conditions in diagnostic testing |
| F1 Score | 2×(P×R)/(P+R) | Harmonic mean of precision and recall; penalizes extreme imbalance between them | Imbalanced classification: fraud, rare disease, defect detection |
| Balanced Accuracy | (Sensitivity+Specificity)/2 | Average of sensitivity and specificity; treats both classes equally | Binary classification with severe class imbalance |
| True Positive (TP) | — | Predicted positive and actually positive; a correct hit | Counts correct detections in the positive class |
| False Positive (FP) | — | Predicted positive but actually negative; a Type I error | The "false alarm" cell in the confusion matrix |
| False Negative (FN) | — | Predicted negative but actually positive; a Type II error | The "missed detection" cell; high FN = low recall |
Diagnostic Accuracy in Healthcare and Epidemiology
In healthcare, diagnostic accuracy describes how well a test separates people who have a condition from those who do not. The terminology differs slightly from machine learning: sensitivity is the clinical equivalent of recall, while specificity keeps the same name and the same formula in both fields.
The STARD (Standards for Reporting Diagnostic Accuracy) guidelines require that clinical studies report sensitivity, specificity, and their confidence intervals alongside overall accuracy. This is because a test with 95% accuracy but only 60% sensitivity catches too few sick patients to be clinically useful.
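As an illustration of what such a confidence interval looks like, the sketch below computes a 95% Wilson score interval for a proportion such as sensitivity or specificity. The Wilson method is a choice made here for the example, not something this page or STARD prescribes; the function name `wilson_interval` and the default `z = 1.96` are likewise assumptions. The counts come from the disease screening example earlier on the page (850 of 1,000 sick patients detected, 8,850 of 9,000 healthy patients cleared).

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z = 1.96 for about 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half_width = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return center - half_width, center + half_width


# Sensitivity = TP / (TP + FN), specificity = TN / (TN + FP)
sens_low, sens_high = wilson_interval(850, 850 + 150)
spec_low, spec_high = wilson_interval(8850, 8850 + 150)

print(f"Sensitivity 0.850, 95% CI ({sens_low:.3f}, {sens_high:.3f})")
print(f"Specificity 0.983, 95% CI ({spec_low:.3f}, {spec_high:.3f})")
```

The specificity interval is narrower than the sensitivity interval because it rests on nine times as many patients.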
Table: ML Terminology vs. Clinical Testing Terminology
| Concept | ML / Data Science Term | Clinical / Epidemiology Term |
|---|---|---|
| Correctly identified positive | True Positive (TP) | True Positive |
| Correctly identified negative | True Negative (TN) | True Negative |
| Type I Error | False Positive (FP) | False Positive |
| Type II Error | False Negative (FN) | False Negative |
| Overall correctness | Accuracy | Diagnostic Accuracy |
| TP / (TP + FN) | Recall | Sensitivity |
| TN / (TN + FP) | Specificity | Specificity |
| TP / (TP + FP) | Precision | Positive Predictive Value (PPV) |
Frequently Asked Questions
In machine learning, accuracy is the fraction of predictions a model gets right out of all predictions made. It equals (TP + TN) / (TP + TN + FP + FN), where TP and TN are correct predictions and FP and FN are errors. It is the most intuitive metric but can be misleading on imbalanced datasets where one class dominates the data. On balanced data with roughly equal class sizes, accuracy is a reliable headline metric for model quality.
The accuracy formula is: Accuracy = (TP + TN) / (TP + TN + FP + FN). In words: the number of correct predictions (true positives plus true negatives) divided by the total number of predictions. Multiplying by 100 gives the accuracy percentage. The error rate is the complement: Error Rate = 1 − Accuracy = (FP + FN) / Total.
A good accuracy percentage depends on the problem. In general: 95–100% is excellent, 90–94% is very good, 80–89% is good, 70–79% is fair, and below 70% is poor. But these thresholds shift with context. If only 0.5% of transactions are fraudulent, a detector that flags nothing already scores 99.5%, so a 99% accurate fraud detector can still miss most of the fraud. On imbalanced datasets, always evaluate precision, recall, and F1 score alongside accuracy.
Accuracy measures overall correctness across all classes: (TP + TN) / Total. Precision measures the correctness of positive predictions only: TP / (TP + FP). A model with high accuracy can have low precision if it correctly handles many negatives but frequently produces false alarms on positive predictions. Precision is the metric to optimize when false positives carry high costs — for example, flagging a legitimate credit card transaction as fraudulent, or sending a legitimate email to the spam folder.
Accuracy covers all four cells of the confusion matrix. Recall (also called sensitivity) focuses solely on the positive class: TP / (TP + FN). It measures what fraction of actual positives the model correctly captured. High recall is critical when missing a true positive carries a severe consequence: a cancer screening test that misses 20% of tumors has a recall of only 80%, whatever its overall accuracy score. In clinical testing, recall is called sensitivity and is routinely reported alongside specificity in diagnostic studies.
A model can score high accuracy and still be useless; this is the well-documented accuracy paradox. On a dataset where 99% of cases are negative, a classifier that predicts negative for every observation achieves 99% accuracy while being completely useless. The model catches zero true positives, yet the accuracy metric flatters it. Whenever class distributions are skewed, which is common in fraud detection, medical diagnosis, and defect detection, use balanced accuracy, F1 score, or the area under the ROC curve (AUC-ROC) as your primary evaluation metrics instead.
Balanced accuracy is the arithmetic mean of sensitivity and specificity: (Sensitivity + Specificity) / 2. It gives equal weight to both classes regardless of their relative size, making it appropriate whenever the positive and negative classes appear in very different proportions. If a test correctly identifies 80% of sick patients (sensitivity) and 90% of healthy patients (specificity), its balanced accuracy is (80% + 90%) / 2 = 85%. The scikit-learn library provides balanced_accuracy_score() for computing this directly from predicted and true labels.
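For readers who want to see the computation spelled out, here is a minimal plain-Python sketch that derives balanced accuracy from label lists as the mean of sensitivity and specificity. The helper name and the synthetic data (10 sick patients with 8 detected, 100 healthy patients with 90 cleared) are assumptions chosen to match the 80% / 90% figures above; for binary labels, scikit-learn's `balanced_accuracy_score(y_true, y_pred)` should give the same value.

```python
def balanced_accuracy(y_true, y_pred, positive=1):
    """Mean of sensitivity and specificity for binary labels."""
    pos_total = pos_correct = neg_total = neg_correct = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == positive:
            pos_total += 1
            pos_correct += predicted == positive
        else:
            neg_total += 1
            neg_correct += predicted != positive
    if pos_total == 0 or neg_total == 0:
        raise ValueError("both classes must be present in y_true")
    sensitivity = pos_correct / pos_total
    specificity = neg_correct / neg_total
    return (sensitivity + specificity) / 2


# Synthetic, imbalanced data: 10 sick (8 detected), 100 healthy (90 cleared).
y_true = [1] * 10 + [0] * 100
y_pred = [1] * 8 + [0] * 2 + [0] * 90 + [1] * 10

plain_accuracy = sum(a == p for a, p in zip(y_true, y_pred)) / len(y_true)
print(f"Accuracy          = {plain_accuracy:.3f}")
print(f"Balanced accuracy = {balanced_accuracy(y_true, y_pred):.3f}")
```

Plain accuracy comes out higher than balanced accuracy here because the larger healthy class dominates the count of correct predictions.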
Step 1: Add all four confusion matrix values to get the total: Total = TP + TN + FP + FN. Step 2: Add the two correct prediction cells: Correct = TP + TN. Step 3: Divide: Accuracy = Correct / Total. Step 4: Multiply by 100 for percentage. Example: TP = 85, TN = 90, FP = 10, FN = 15. Total = 200. Correct = 175. Accuracy = 175 / 200 = 0.875 = 87.5%. The calculator at the top of this page automates all steps and also computes precision, recall, specificity, F1 score, and balanced accuracy from the same four inputs.
The mathematics are identical. Diagnostic accuracy in medicine and classification accuracy in machine learning both use the same formula: (TP + TN) / Total. The terminology differs: clinicians say "sensitivity" for recall and "positive predictive value (PPV)" for precision, but the calculations are equivalent. Clinical studies additionally report 95% confidence intervals around accuracy, sensitivity, and specificity, which you can compute using the confidence interval calculator.
95% accuracy means the model or test correctly classified 95 out of every 100 cases. Equivalently, the error rate is 5%. Whether this is good depends on context: 95% accuracy in image recognition for a photo app is fine; 95% accuracy in an autonomous vehicle's obstacle detection system may be dangerously insufficient. Always ask: what does the 5% error look like? Are the errors false positives, false negatives, or a mix? The answer determines whether 95% accuracy is acceptable for your specific application.