BY: Statistics Fundamentals Team
Reviewed By: Minsa A (Senior Statistics Editor)

Accuracy Calculator: Classification & Diagnostic Performance

Calculate classification accuracy, error rate, precision, recall, specificity, and F1 score from your confusion matrix in seconds. Enter your true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN) to get a complete picture of your model or test's performance — with step-by-step solutions and worked examples, no signup required.

Accuracy Calculator

Accuracy Formula: Accuracy = (TP + TN) / (TP + TN + FP + FN)
TP: Correctly identified as positive
TN: Correctly identified as negative
FP: Predicted positive, actually negative (Type I error)
FN: Predicted negative, actually positive (Type II error)

What is Accuracy?

Accuracy is a classification metric that measures the proportion of correct predictions out of all predictions made. It answers one direct question: out of every case evaluated, how many did the model or test classify correctly? Formally, accuracy is defined as the number of true positives plus true negatives divided by the total number of observations.

In both machine learning and diagnostic testing, accuracy is computed from a confusion matrix, a two-by-two table that records how a classifier's predictions compare to the actual ground truth labels. The four cells of that table (TP, TN, FP, and FN) are the raw material from which accuracy and every related metric are derived.

In short: Accuracy is a statistical classification metric measuring the overall proportion of correct predictions. It is calculated by dividing the sum of True Positives and True Negatives by the total number of observations: (TP + TN) ÷ (TP + TN + FP + FN).

Accuracy Formula Library

Seven formulas, grouped under the six headings below, cover classification performance evaluation. Accuracy alone does not tell the full story; each formula targets a different aspect of predictive quality. Understanding when each applies is as important as knowing how to compute it.

Basic Accuracy Formula

Accuracy = (TP + TN) / (TP + TN + FP + FN)
In percentage: Accuracy (%) = Accuracy × 100

Error Rate Formula

Error Rate = 1 − Accuracy
Equivalently: Error Rate = (FP + FN) / Total

Precision Formula

Precision = TP / (TP + FP)
Answers: Of all positive predictions, how many were actually correct?

Recall (Sensitivity) Formula

Recall = TP / (TP + FN)
Answers: Of all actual positives, how many did the model catch?

Specificity Formula

Specificity = TN / (TN + FP)
Answers: Of all actual negatives, how many were correctly ruled out?

F1 Score & Balanced Accuracy

F1 = 2 × (P × R) / (P + R), where P = Precision and R = Recall
Balanced Accuracy = (Sensitivity + Specificity) / 2

These formulas are standard across machine learning, clinical epidemiology, and quality assurance. The scikit-learn documentation on model evaluation covers their Python implementations, while the BMJ's Statistics at Square One addresses their use in clinical research.
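As a rough sketch of how the formulas above translate into code, the plain-Python function below computes all of them from the four confusion matrix counts. The names `classification_metrics` and `_safe_div`, and the choice to return 0.0 when a denominator is zero, are assumptions made for illustration; they are not taken from scikit-learn or any other library.

```python
def _safe_div(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is zero.

    Returning 0.0 is an illustrative convention, not a universal rule.
    """
    return numerator / denominator if denominator else 0.0


def classification_metrics(tp, tn, fp, fn):
    """Compute the formula-library metrics from confusion matrix counts."""
    total = tp + tn + fp + fn
    precision = _safe_div(tp, tp + fp)
    recall = _safe_div(tp, tp + fn)        # also called sensitivity
    specificity = _safe_div(tn, tn + fp)
    return {
        "accuracy": _safe_div(tp + tn, total),
        "error_rate": _safe_div(fp + fn, total),
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": _safe_div(2 * precision * recall, precision + recall),
        "balanced_accuracy": (recall + specificity) / 2,
    }


# Counts from the 200-patient screening example used on this page.
metrics = classification_metrics(tp=85, tn=90, fp=10, fn=15)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Keeping the zero-denominator handling in one helper makes the convention easy to change, for example to raise an error instead when a class has no observations.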

How to Calculate Accuracy from a Confusion Matrix — Step by Step

To calculate accuracy: build your confusion matrix, sum the four cells to get the total, then divide the number of correct predictions (TP + TN) by that total. Here is the complete method with a worked numerical example.

Step 1: Identify the four confusion matrix values

From your model's predictions versus actual labels, record TP (predicted positive, actually positive), TN (predicted negative, actually negative), FP (predicted positive, actually negative), and FN (predicted negative, actually positive). Example: a disease screening test on 200 patients yields TP = 85, TN = 90, FP = 10, FN = 15.

Step 2: Sum the total number of observations

Total = TP + TN + FP + FN = 85 + 90 + 10 + 15 = 200. This is the denominator for accuracy and error rate; precision, recall, and specificity each divide by a subset of the four cells.

Step 3: Apply the accuracy formula

Accuracy = (TP + TN) / Total = (85 + 90) / 200 = 175 / 200 = 0.875.

Step 4: Convert to a percentage

0.875 × 100 = 87.5%. The test correctly classifies 87.5% of all patients.

Step 5: Compute additional metrics and interpret in context

Precision = 85 / (85 + 10) = 89.5%. Recall = 85 / (85 + 15) = 85.0%. F1 Score = 2 × (0.895 × 0.850) / (0.895 + 0.850) = 87.2%. Because the classes are balanced here (100 actual positives vs. 100 actual negatives), accuracy is a fair summary. On imbalanced datasets, F1 score and balanced accuracy carry more weight.

Result: TP = 85, TN = 90, FP = 10, FN = 15, Total = 200. Accuracy = 87.5%, Error Rate = 12.5%, Precision = 89.5%, Recall = 85.0%, Specificity = 90.0%, F1 = 87.2%. Verify all six values using the calculator above.
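Step 1 assumes the four cell counts are already known. When you start from raw labels instead, the counts can be tallied directly. The sketch below shows one way to do that in plain Python; the function name `confusion_counts` and the encoding 1 = positive, 0 = negative are assumptions for illustration, and the label lists are simply the 200-patient example rebuilt in list form.

```python
def confusion_counts(actual, predicted, positive=1):
    """Tally TP, TN, FP, FN by comparing predicted labels to actual labels."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    tp = tn = fp = fn = 0
    for truth, guess in zip(actual, predicted):
        if guess == positive and truth == positive:
            tp += 1
        elif guess != positive and truth != positive:
            tn += 1
        elif guess == positive:
            fp += 1   # predicted positive, actually negative
        else:
            fn += 1   # predicted negative, actually positive
    return tp, tn, fp, fn


# The worked example as label lists (1 = disease, 0 = no disease).
actual    = [1] * 85 + [0] * 90 + [0] * 10 + [1] * 15
predicted = [1] * 85 + [0] * 90 + [1] * 10 + [0] * 15

tp, tn, fp, fn = confusion_counts(actual, predicted)
total = tp + tn + fp + fn
accuracy = (tp + tn) / total

print(tp, tn, fp, fn)                 # 85 90 10 15
print(f"accuracy = {accuracy:.3f}")   # accuracy = 0.875
```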

Worked Examples Across Three Domains

Example 1 — Disease Screening Test

Scenario: A rapid diagnostic test is evaluated on 10,000 patients. Physicians need to know whether the test's 97% accuracy is sufficient for routine screening, and how often the test misses true cases.

Confusion matrix values

TP = 850 (sick patients correctly identified), TN = 8,850 (healthy patients correctly cleared), FP = 150 (healthy patients falsely flagged), FN = 150 (sick patients missed). Total = 10,000.

Accuracy calculation

Accuracy = (850 + 8,850) / 10,000 = 9,700 / 10,000 = 97.0%.

Recall (critical for screening)

Recall = 850 / (850 + 150) = 850 / 1,000 = 85.0%. This means 15% of truly sick patients are missed by the test, which is clinically significant.

Specificity

Specificity = 8,850 / (8,850 + 150) = 8,850 / 9,000 = 98.3%. Very few healthy patients are falsely flagged.

Interpretation: 97% accuracy sounds impressive, but 15% of sick patients go undetected (recall = 85%). For a serious disease, this miss rate could be clinically unacceptable. This is why medical test evaluation in the WHO's diagnostic test guidance requires sensitivity and specificity alongside accuracy.

Example 2 — Spam Detection System (NLP)

Scenario: A natural language processing (NLP) binary classifier is trained to detect spam emails. Evaluation on a 5,000-email test set with balanced classes produces the following confusion matrix.

Confusion matrix

TP = 2,300 (spam correctly flagged), TN = 2,400 (legitimate emails correctly passed), FP = 100 (legitimate emails wrongly flagged as spam), FN = 200 (spam emails missed). Total = 5,000.

Accuracy

(2,300 + 2,400) / 5,000 = 4,700 / 5,000 = 94.0%.

Precision

2,300 / (2,300 + 100) = 2,300 / 2,400 = 95.8%. Very few legitimate emails are lost to the spam folder.

F1 Score

Recall = 2,300 / (2,300 + 200) = 92.0%. F1 = 2 × (0.958 × 0.920) / (0.958 + 0.920) = 93.9%. Precision and recall are close to each other, so the model is well balanced between the two.

Interpretation: The model achieves 94% accuracy with precision prioritized over recall — the correct trade-off for spam filtering, where sending legitimate email to spam (false positive) costs more than letting occasional spam through (false negative).

Example 3 — Credit Risk Prediction (Financial Modeling)

Scenario: A logistic regression model predicts whether loan applicants will default. The dataset is imbalanced: 90% of applicants repay (negative class) and 10% default (positive class). The model is evaluated on 2,000 applicants.

Confusion matrix

TP = 120 (defaulters correctly flagged), TN = 1,750 (repayers correctly approved), FP = 50 (repayers wrongly rejected), FN = 80 (defaulters missed and approved). Total = 2,000.

Accuracy

(120 + 1,750) / 2,000 = 1,870 / 2,000 = 93.5%.

Why accuracy misleads here

A model that approves everyone would achieve (0 + 1,800) / 2,000 = 90% accuracy without catching a single defaulter. The difference in accuracy between our model (93.5%) and this naive strategy (90%) understates the real improvement in predictive value.

Balanced Accuracy

Recall = 120 / 200 = 60.0%. Specificity = 1,750 / 1,800 = 97.2%. Balanced Accuracy = (0.600 + 0.972) / 2 = 78.6%. This is the more honest summary on imbalanced data.

Interpretation: Despite 93.5% accuracy, the model catches only 60% of defaulters. Balanced accuracy (78.6%) and recall (60%) are the metrics that lenders and regulators actually need to evaluate. This illustrates the accuracy paradox directly.
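To make the comparison with the approve-everyone strategy concrete, the sketch below computes accuracy and balanced accuracy for both sets of counts from Example 3. The helper names are illustrative assumptions; the counts come straight from the example.

```python
def accuracy(tp, tn, fp, fn):
    """Share of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def balanced_accuracy(tp, tn, fp, fn):
    """Mean of sensitivity and specificity (assumes both classes are present)."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2


# Counts from Example 3: 200 defaulters and 1,800 repayers.
model = dict(tp=120, tn=1750, fp=50, fn=80)
approve_everyone = dict(tp=0, tn=1800, fp=0, fn=200)   # never flags a default

for name, cells in [("model", model), ("approve everyone", approve_everyone)]:
    print(
        f"{name}: accuracy = {accuracy(**cells):.3f}, "
        f"balanced accuracy = {balanced_accuracy(**cells):.3f}"
    )
```

The two strategies differ by only 3.5 percentage points on accuracy, but the naive strategy drops to a balanced accuracy of 0.5, the same as guessing, while the model sits near 0.786.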

Accuracy Score Interpretation: Quick Reference

What counts as a "good" accuracy score depends entirely on the application, class balance, and the cost of errors. The table below provides general benchmarks, but always pair them with recall and precision before drawing conclusions.

Table: Accuracy Score Ranges and Interpretation

| Accuracy Score | Interpretation | Common Context | Watch Out For |
|---|---|---|---|
| 95–100% | Excellent | Image classification, OCR, medical imaging AI | Overfitting; check on unseen test data |
| 90–94% | Very Good | NLP classifiers, diagnostic tests | Still inspect recall on minority class |
| 80–89% | Good | Fraud detection, churn prediction | May need tuning; check F1 score |
| 70–79% | Fair | Early-stage models, noisy data | Review feature engineering |
| Below 70% | Poor | Near-random performance | Check for data leakage or class imbalance |

At a glance:
95%+: Excellent. Production ready.
80–94%: Good to Very Good. Check recall too.
<70%: Poor. Revisit model.

Accuracy vs. Precision vs. Recall vs. F1 Score

Accuracy measures overall correctness. Precision targets the quality of positive predictions. Recall targets the completeness of positive detection. F1 Score balances precision and recall into one number for imbalanced datasets. Choosing the right metric depends on what type of error is more costly in your application.

Table: Classification Metrics Compared

| Metric | Formula | Best Used When | Limitation |
|---|---|---|---|
| Accuracy | (TP+TN) / Total | Balanced classes, general overview | Misleading on imbalanced data (accuracy paradox) |
| Precision | TP / (TP+FP) | Cost of FP is high (spam, fraud alerts) | Ignores false negatives entirely |
| Recall | TP / (TP+FN) | Cost of FN is high (cancer screening, safety) | Can be maximized by predicting everything positive |
| F1 Score | 2PR / (P+R) | Imbalanced classes, when both FP and FN matter | Does not include TN in the calculation |
| Balanced Accuracy | (Sensitivity+Specificity) / 2 | Highly imbalanced binary classification | Less interpretable than F1 in some contexts |
| Specificity | TN / (TN+FP) | Ruling out conditions (clinical screening) | Says nothing about positive prediction quality |

⚠ The Accuracy Paradox: On a dataset where 99% of cases are negative, a classifier that always predicts "negative" achieves 99% accuracy but catches zero true positives. This is the accuracy paradox. Whenever your positive class makes up less than 20% of the dataset, treat accuracy as a secondary metric and prioritize recall, precision, and F1 score. The Google Machine Learning crash course covers this distinction in depth.

Confusion Matrix and Accuracy: Complete Formula Reference

The table below lists every key term and formula related to accuracy and confusion matrix evaluation. It is structured for direct reference by students, researchers, and practitioners.

Table: Accuracy Metric Glossary — 10 Key Entities

| Term | Symbol / Formula | Plain-English Definition | Primary Use Case |
|---|---|---|---|
| Accuracy | (TP+TN) / Total | Proportion of all predictions that are correct | General model evaluation on balanced data |
| Error Rate | (FP+FN) / Total | Proportion of all predictions that are wrong; equals 1 minus accuracy | Communicating failure rate to non-technical audiences |
| Precision | TP / (TP+FP) | Of all positive predictions, the share that were actually positive | When false positives are costly: spam detection, fraud alerts |
| Recall (Sensitivity) | TP / (TP+FN) | Of all actual positives, the share the model correctly identified | When false negatives are costly: disease detection, safety systems |
| Specificity | TN / (TN+FP) | Of all actual negatives, the share correctly ruled out | Clinical screening; ruling out conditions in diagnostic testing |
| F1 Score | 2×(P×R) / (P+R) | Harmonic mean of precision and recall; penalizes extreme imbalance between them | Imbalanced classification: fraud, rare disease, defect detection |
| Balanced Accuracy | (Sensitivity+Specificity) / 2 | Average of sensitivity and specificity; treats both classes equally | Binary classification with severe class imbalance |
| True Positive | TP | Predicted positive and actually positive; a correct hit | Counts correct detections in the positive class |
| False Positive | FP | Predicted positive but actually negative; a Type I error | The "false alarm" cell in the confusion matrix |
| False Negative | FN | Predicted negative but actually positive; a Type II error | The "missed detection" cell; high FN = low recall |
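The glossary describes F1 as a harmonic mean that penalizes extreme imbalance between precision and recall. A small sketch can show what that means in practice by comparing F1 with a plain arithmetic mean. The first pair of values comes from the spam example on this page; the second, lopsided pair is hypothetical and chosen only to make the effect visible. The function names are illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def arithmetic_mean(precision, recall):
    """Simple average, shown only for comparison with F1."""
    return (precision + recall) / 2


pairs = [
    ("spam example", 0.958, 0.920),   # values from Example 2
    ("lopsided", 1.0, 0.1),           # hypothetical: perfect precision, poor recall
]

for label, p, r in pairs:
    print(
        f"{label}: arithmetic mean = {arithmetic_mean(p, r):.3f}, "
        f"F1 = {f1_score(p, r):.3f}"
    )
```

When precision and recall are close, the two averages nearly coincide. When one is far below the other, F1 is pulled toward the weaker value, which is why a model cannot score well on F1 by excelling at only one of the two.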

Diagnostic Accuracy in Healthcare and Epidemiology

In healthcare, diagnostic accuracy describes how well a test separates people who have a condition from those who do not. The terminology differs slightly from machine learning: sensitivity is the clinical equivalent of recall, and specificity directly maps to the same formula used in classification.

The STARD (Standards for Reporting Diagnostic Accuracy) guidelines require that clinical studies report sensitivity, specificity, and their confidence intervals alongside overall accuracy. This is because a test with 95% accuracy but only 60% sensitivity catches too few sick patients to be clinically useful.

Table: ML Terminology vs. Clinical Testing Terminology

| Concept | ML / Data Science Term | Clinical / Epidemiology Term |
|---|---|---|
| Correctly identified positive | True Positive (TP) | True Positive |
| Correctly identified negative | True Negative (TN) | True Negative |
| Type I Error | False Positive (FP) | False Positive |
| Type II Error | False Negative (FN) | False Negative |
| Overall correctness | Accuracy | Diagnostic Accuracy |
| TP / (TP + FN) | Recall | Sensitivity |
| TN / (TN + FP) | Specificity | Specificity |
| TP / (TP + FP) | Precision | Positive Predictive Value (PPV) |
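The confidence intervals that STARD asks for can be approximated in several ways. The sketch below uses the Wilson score interval, which is one common choice for a proportion; the guidelines are not quoted here as prescribing any particular method, so treat the method, the function name `wilson_interval`, and the default z = 1.96 (roughly 95% coverage) as assumptions. The counts are those of Example 1.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a proportion; z = 1.96 gives roughly 95% coverage."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p_hat = successes / trials
    z2 = z * z
    denom = 1 + z2 / trials
    center = (p_hat + z2 / (2 * trials)) / denom
    half_width = (
        z * math.sqrt(p_hat * (1 - p_hat) / trials + z2 / (4 * trials ** 2)) / denom
    )
    return center - half_width, center + half_width


# Example 1 counts: TP = 850, FN = 150, TN = 8,850, FP = 150.
sens_low, sens_high = wilson_interval(850, 850 + 150)       # sensitivity
spec_low, spec_high = wilson_interval(8850, 8850 + 150)     # specificity

print(f"sensitivity 95% CI: {sens_low:.3f} to {sens_high:.3f}")
print(f"specificity 95% CI: {spec_low:.3f} to {spec_high:.3f}")
```

Because sensitivity is estimated from the 1,000 sick patients only, its interval is wider than the one for specificity, which rests on 9,000 healthy patients.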

Frequently Asked Questions

In machine learning, accuracy is the fraction of predictions a model gets right out of all predictions made. It equals (TP + TN) / (TP + TN + FP + FN), where TP and TN are correct predictions and FP and FN are errors. It is the most intuitive metric but can be misleading on imbalanced datasets where one class dominates the data. On balanced data with roughly equal class sizes, accuracy is a reliable headline metric for model quality.

The accuracy formula is: Accuracy = (TP + TN) / (TP + TN + FP + FN). In words: the number of correct predictions (true positives plus true negatives) divided by the total number of predictions. Multiplying by 100 gives the accuracy percentage. The error rate is the complement: Error Rate = 1 − Accuracy = (FP + FN) / Total.

A good accuracy percentage depends on the problem. In general: 95–100% is excellent, 90–94% is very good, 80–89% is good, 70–79% is fair, and below 70% is poor. But these thresholds shift with context. A 99% accurate fraud detector might still miss too many fraudulent transactions to be useful if the fraud rate is 0.5%. On imbalanced datasets, always evaluate precision, recall, and F1 score alongside accuracy.

Accuracy measures overall correctness across all classes: (TP + TN) / Total. Precision measures the correctness of positive predictions only: TP / (TP + FP). A model with high accuracy can have low precision if it correctly handles many negatives but frequently produces false alarms on positive predictions. Precision is the metric to optimize when false positives carry high costs — for example, flagging a legitimate credit card transaction as fraudulent, or sending a legitimate email to the spam folder.

Accuracy covers all four cells of the confusion matrix. Recall (also called sensitivity) focuses solely on the positive class: TP / (TP + FN). It measures what fraction of actual positives the model correctly captured. High recall is critical when missing a true positive carries a severe consequence: a cancer screening test that misses 20% of tumors has a recall of only 80%, whatever its overall accuracy score. In clinical testing, recall is called sensitivity and is routinely reported alongside specificity in diagnostic accuracy studies.

Yes, accuracy can be misleading; this is the well-documented accuracy paradox. On a dataset where 99% of cases are negative, a classifier that predicts negative for every observation achieves 99% accuracy while being completely useless. The model catches zero true positives, yet the accuracy metric flatters it. Whenever class distributions are skewed, which is common in fraud detection, medical diagnosis, and defect detection, use balanced accuracy, F1 score, or the area under the ROC curve (AUC-ROC) as your primary evaluation metrics instead.

Balanced accuracy is the arithmetic mean of sensitivity and specificity: (Sensitivity + Specificity) / 2. It gives equal weight to both classes regardless of their relative size, making it appropriate whenever the positive and negative classes appear in very different proportions. If a test correctly identifies 80% of sick patients (sensitivity) and 90% of healthy patients (specificity), its balanced accuracy is (80% + 90%) / 2 = 85%. The scikit-learn library provides balanced_accuracy_score() for computing this directly from predicted and true labels.

Step 1: Add all four confusion matrix values to get the total: Total = TP + TN + FP + FN. Step 2: Add the two correct prediction cells: Correct = TP + TN. Step 3: Divide: Accuracy = Correct / Total. Step 4: Multiply by 100 for percentage. Example: TP = 85, TN = 90, FP = 10, FN = 15. Total = 200. Correct = 175. Accuracy = 175 / 200 = 0.875 = 87.5%. The calculator at the top of this page automates all steps and also computes precision, recall, specificity, F1 score, and balanced accuracy from the same four inputs.

The mathematics are identical. Diagnostic accuracy in medicine and classification accuracy in machine learning both use the same formula: (TP + TN) / Total. The terminology differs: clinicians say "sensitivity" for recall and "positive predictive value (PPV)" for precision, but the calculations are equivalent. Clinical studies additionally report 95% confidence intervals around accuracy, sensitivity, and specificity, which you can compute using the confidence interval calculator.

95% accuracy means the model or test correctly classified 95 out of every 100 cases. Equivalently, the error rate is 5%. Whether this is good depends on context: 95% accuracy in image recognition for a photo app is fine; 95% accuracy in an autonomous vehicle's obstacle detection system may be dangerously insufficient. Always ask: what does the 5% error look like? Are the errors false positives, false negatives, or a mix? The answer determines whether 95% accuracy is acceptable for your specific application.