{"id":112102,"date":"2026-06-03T20:32:08","date_gmt":"2026-06-03T15:02:08","guid":{"rendered":"https:\/\/www.guvi.in\/blog\/?p=112102"},"modified":"2026-06-15T13:27:37","modified_gmt":"2026-06-15T07:57:37","slug":"understanding-f1-score-in-machine-learning","status":"publish","type":"post","link":"https:\/\/www.guvi.in\/blog\/understanding-f1-score-in-machine-learning\/","title":{"rendered":"Understanding F1 Score in Machine Learning"},"content":{"rendered":"\n<p>Imagine you are comparing two spam filters. Model A catches 90% of spam but marks 40% of legitimate emails as spam. Model B catches 70% of spam and only marks 10% of legitimate emails as spam. Which is better? Looking at just one number would be easier than juggling two metrics.<\/p>\n\n\n\n<p>This is exactly what the F1 score does. It combines precision and recall into a single number that tells you how well your classification model performs. Instead of choosing between two metrics, you get one balanced score.<\/p>\n\n\n\n<p>If you are building classification models, comparing different algorithms, or reporting model performance, understanding the F1 score is essential. 
It is one of the most widely used evaluation metrics in machine learning.<\/p>\n\n\n\n<p>This guide explains what the F1 score is, how to calculate it, when to use it, and how to interpret it for your specific classification problem.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>Quick TL;DR Summary<\/strong><\/h2>\n\n\n\n<ol>\n<li>This guide explains the F1 score, a single metric that combines precision and recall into one number using the harmonic mean, making it easier to evaluate and compare classification models.<br><\/li>\n\n\n\n<li>You will learn why the F1 score uses harmonic mean instead of arithmetic mean, how it balances precision and recall, and why it penalizes models that are good at one metric but poor at the other.<br><\/li>\n\n\n\n<li>The guide covers different variants including binary F1 score for two-class problems, and macro, micro, and weighted F1 scores for multi-class classification problems.<br><\/li>\n\n\n\n<li>Step-by-step examples show you how to calculate F1 score by hand and using Python libraries like scikit-learn, interpret the results, and understand when F1 score is appropriate.<br><\/li>\n\n\n\n<li>You will understand the limitations of F1 score, when to use alternative metrics, and how to choose between F1 variants based on your dataset characteristics and business requirements.<\/li>\n<\/ol>\n\n\n\n<div class=\"guvi-answer-card\" style=\"margin: 40px 0;\">\n\n  <div style=\"\n    position: relative;\n    background: linear-gradient(135deg, #f0fff4, #e6f7ee);\n    border: 1px solid #cfeedd;\n    padding: 26px 24px 22px 24px;\n    border-radius: 14px;\n    font-family: Arial, sans-serif;\n    box-shadow: 0 6px 16px rgba(0,0,0,0.05);\n  \">\n\n    <!-- Top accent -->\n    <div style=\"\n      position: absolute;\n      top: 0;\n      left: 0;\n      height: 6px;\n      width: 100%;\n      background: linear-gradient(to right, #099f4e, #6dd5a3);\n      border-radius: 14px 14px 0 0;\n    \"><\/div>\n\n    <!-- Title -->\n    <h3 
style=\"\n      margin: 10px 0 12px 0;\n      color: #099f4e;\n      font-size: 20px;\n    \">\n      What Is the F1 Score?\n    <\/h3>\n\n    <!-- Content -->\n    <p style=\"\n      margin: 0;\n      color: #2f4f3f;\n      font-size: 16px;\n      line-height: 1.7;\n    \">\n      The F1 score is a performance metric used in machine learning and classification tasks that combines precision and recall into a single value using their harmonic mean. It provides a balanced measure of a model\u2019s accuracy, especially when dealing with imbalanced datasets where both false positives and false negatives matter. The F1 score ranges from 0 to 1, where 1 indicates perfect precision and recall, and 0 represents the poorest possible performance.\n    <\/p>\n\n  <\/div>\n\n<\/div>\n\n\n\n<p>The formula is: F1 = 2 \u00d7 (Precision \u00d7 Recall) \/ (Precision + Recall)<\/p>\n\n\n\n<p>The F1 score answers the question: how well does my model balance finding positives and avoiding false alarms? It gives you one number instead of tracking two separate metrics.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>Why F1 Score Uses Harmonic Mean<\/strong><\/h2>\n\n\n\n<figure class=\"wp-block-image size-full\"><img decoding=\"async\" width=\"1200\" height=\"628\" src=\"https:\/\/www.guvi.in\/blog\/wp-content\/uploads\/2026\/06\/1-48.png\" alt=\"Why F1 Score Uses Harmonic Mean\" class=\"wp-image-116535\" srcset=\"https:\/\/www.guvi.in\/blog\/wp-content\/uploads\/2026\/06\/1-48.png 1200w, https:\/\/www.guvi.in\/blog\/wp-content\/uploads\/2026\/06\/1-48-300x157.png 300w, https:\/\/www.guvi.in\/blog\/wp-content\/uploads\/2026\/06\/1-48-768x402.png 768w, https:\/\/www.guvi.in\/blog\/wp-content\/uploads\/2026\/06\/1-48-150x79.png 150w\" sizes=\"(max-width: 1200px) 100vw, 1200px\" title=\"\"><\/figure>\n\n\n\n<ol>\n<li><strong>Arithmetic mean would be misleading<\/strong><\/li>\n<\/ol>\n\n\n\n<p>The arithmetic mean of precision and recall is (Precision + Recall) \/ 2. 
This treats both metrics equally but does not penalize extreme imbalances. If precision is 100% and recall is 10%, the arithmetic mean is 55%, which sounds decent but hides that you are missing 90% of positives.<\/p>\n\n\n\n<ol start=\"2\">\n<li><strong>Harmonic mean penalizes imbalanced performance<\/strong><\/li>\n<\/ol>\n\n\n\n<p>The harmonic mean is always lower than or equal to the arithmetic mean. It is much more sensitive to low values. If precision is 100% and recall is 10%, the F1 score is only about 18% (2 \u00d7 1.0 \u00d7 0.1 \/ 1.1 \u2248 0.18). This accurately reflects that your model has a serious problem despite perfect precision. Understanding these mathematical foundations is crucial whether you are working with traditional <a href=\"https:\/\/www.guvi.in\/blog\/types-of-machine-learning-algorithms\/\" target=\"_blank\" rel=\"noreferrer noopener\">machine learning algorithms<\/a> or modern <a href=\"https:\/\/www.guvi.in\/blog\/deep-learning-and-neural-network\/\" target=\"_blank\" rel=\"noreferrer noopener\">deep learning<\/a> systems.<\/p>\n\n\n\n<ol start=\"3\">\n<li><strong>Both metrics must be good for high F1<\/strong><\/li>\n<\/ol>\n\n\n\n<p>To achieve a high F1 score, both precision and recall must be reasonably high. You cannot game the metric by excelling at one while ignoring the other. A model with 90% precision and 90% recall gets F1 = 0.90. A model with 99% precision and 50% recall only gets F1 of about 0.66.<\/p>\n\n\n\n<ol start=\"4\">\n<li><strong>The mathematical reason behind harmonic mean<\/strong><\/li>\n<\/ol>\n\n\n\n<p>The harmonic mean is the reciprocal of the arithmetic mean of reciprocals. For precision P and recall R: F1 = 2 \/ (1\/P + 1\/R). 
This formulation ensures that if either metric approaches zero, the F1 score also approaches zero regardless of how high the other metric is.<\/p>\n\n\n\n<div style=\"background-color: #099f4e; border: 3px solid #110053; border-radius: 12px; padding: 18px 22px; color: #FFFFFF; font-size: 18px; font-family: Montserrat, Helvetica, sans-serif; line-height: 1.6; box-shadow: 0 4px 12px rgba(0, 0, 0, 0.15); max-width: 750px;\">\n  <strong style=\"font-size: 22px; color: #FFFFFF;\">\ud83d\udca1 Did You Know?<\/strong>\n  <p style=\"margin-top: 14px; margin-bottom: 0;\">\n    The <strong style=\"color: #FFFFFF;\">F1 score<\/strong> gets its name from the broader <strong style=\"color: #FFFFFF;\">F-measure<\/strong> family introduced in <strong style=\"color: #FFFFFF;\">information retrieval research<\/strong>. The \u201c1\u201d in F1 indicates that <strong style=\"color: #FFFFFF;\">precision<\/strong> and <strong style=\"color: #FFFFFF;\">recall<\/strong> are weighted equally when computing the harmonic mean between them. Researchers later generalized the metric into the <strong style=\"color: #FFFFFF;\">F-beta score<\/strong>, where different beta values allow one metric to matter more than the other\u2014for example, <strong style=\"color: #FFFFFF;\">F2<\/strong> emphasizes recall more heavily, while <strong style=\"color: #FFFFFF;\">F0.5<\/strong> prioritizes precision. 
Despite these variations, F1 remains the most widely used version in modern machine learning evaluation.\n  <\/p>\n<\/div>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>How to Calculate F1 Score Step by Step<\/strong><\/h2>\n\n\n\n<figure class=\"wp-block-image size-full\"><img decoding=\"async\" width=\"1200\" height=\"628\" src=\"https:\/\/www.guvi.in\/blog\/wp-content\/uploads\/2026\/06\/2-45.png\" alt=\"How to Calculate F1 Score Step by Step\" class=\"wp-image-116537\" srcset=\"https:\/\/www.guvi.in\/blog\/wp-content\/uploads\/2026\/06\/2-45.png 1200w, https:\/\/www.guvi.in\/blog\/wp-content\/uploads\/2026\/06\/2-45-300x157.png 300w, https:\/\/www.guvi.in\/blog\/wp-content\/uploads\/2026\/06\/2-45-768x402.png 768w, https:\/\/www.guvi.in\/blog\/wp-content\/uploads\/2026\/06\/2-45-150x79.png 150w\" sizes=\"(max-width: 1200px) 100vw, 1200px\" title=\"\"><\/figure>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Step 1: Calculate precision from your confusion matrix<\/strong><\/h3>\n\n\n\n<p>First, you need precision. From your <a href=\"https:\/\/www.guvi.in\/blog\/confusion-matrix-in-machine-learning\/\" target=\"_blank\" rel=\"noreferrer noopener\">confusion matrix<\/a>, count true positives (TP) and false positives (FP). Precision = TP \/ (TP + FP). For example, if TP = 80 and FP = 20, then Precision = 80 \/ 100 = 0.80 or 80%.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Step 2: Calculate recall from your confusion matrix<\/strong><\/h3>\n\n\n\n<p>Next, calculate recall. Count true positives (TP) and false negatives (FN). Recall = TP \/ (TP + FN). Using the same example, if TP = 80 and FN = 30, then Recall = 80 \/ 110 = 0.727 or 72.7%.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Step 3: Apply the F1 formula<\/strong><\/h3>\n\n\n\n<p>Now plug both values into the F1 formula: F1 = 2 \u00d7 (Precision \u00d7 Recall) \/ (Precision + Recall). 
With Precision = 0.80 and Recall = 0.727: F1 = 2 \u00d7 (0.80 \u00d7 0.727) \/ (0.80 + 0.727) = 2 \u00d7 0.582 \/ 1.527 = 0.762 or 76.2%.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Step 4: Interpret the result<\/strong><\/h3>\n\n\n\n<p>Your F1 score of 0.762 indicates reasonably balanced performance. The model has good precision and decent recall. Neither metric is extremely low. This single number summarizes that your model performs fairly well at both avoiding false alarms and finding actual positives.<\/p>\n\n\n\n<p><strong>Read More: <\/strong><a href=\"https:\/\/www.guvi.in\/blog\/machine-learning-pipeline\/\" target=\"_blank\" rel=\"noreferrer noopener\"><strong>Machine Learning Pipeline Explained: Beginner to Pro Guide<\/strong><\/a><\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>Calculating F1 with Python and scikit-learn<\/strong><\/h2>\n\n\n\n<p>In practice, use libraries instead of manual calculation. With scikit-learn:<\/p>\n\n\n\n<p>from sklearn.metrics import f1_score<\/p>\n\n\n\n<p>f1 = f1_score(y_true, y_pred)<\/p>\n\n\n\n<p>This calculates F1 automatically from your true labels and predictions. By default, f1_score uses average=&#039;binary&#039; and scores the class labeled 1. For multi-class problems, specify the averaging method with the average parameter.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>F1 Score for Binary Classification<\/strong><\/h2>\n\n\n\n<ol>\n<li><strong>Single positive class F1<\/strong><\/li>\n<\/ol>\n\n\n\n<p>In binary classification, you have one positive class and one negative class. The standard F1 score measures performance on the positive class. It tells you how well you identify that specific class while balancing precision and recall.<\/p>\n\n\n\n<ol start=\"2\">\n<li><strong>When F1 works well for binary problems<\/strong><\/li>\n<\/ol>\n\n\n\n<p>F1 score suits binary classification when both false positives and false negatives matter and you want balanced performance. Because it ignores true negatives, it is particularly useful for imbalanced datasets where accuracy is misleading. 
F1 focuses on the positive class you care about.<\/p>\n\n\n\n<ol start=\"3\">\n<li><strong>Example: Medical diagnosis<\/strong><\/li>\n<\/ol>\n\n\n\n<p>In disease detection, positive means the patient has the disease. You want high precision (patients you diagnose actually have it) and high recall (you catch most cases). F1 = 0.85 means good balanced performance. F1 = 0.50 suggests serious problems with one or both metrics. Medical machine learning systems and <a href=\"https:\/\/www.guvi.in\/blog\/ai-in-healthcare-applications\/\" target=\"_blank\" rel=\"noreferrer noopener\">AI in healthcare <\/a>rely heavily on F1 scores to ensure patient safety and diagnostic accuracy.&nbsp;<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>F1 Score Variants for Multi-Class Classification<\/strong><\/h2>\n\n\n\n<figure class=\"wp-block-image size-full\"><img decoding=\"async\" width=\"1200\" height=\"628\" src=\"https:\/\/www.guvi.in\/blog\/wp-content\/uploads\/2026\/06\/3-44.png\" alt=\"F1 Score Variants for Multi-Class Classification\" class=\"wp-image-116538\" srcset=\"https:\/\/www.guvi.in\/blog\/wp-content\/uploads\/2026\/06\/3-44.png 1200w, https:\/\/www.guvi.in\/blog\/wp-content\/uploads\/2026\/06\/3-44-300x157.png 300w, https:\/\/www.guvi.in\/blog\/wp-content\/uploads\/2026\/06\/3-44-768x402.png 768w, https:\/\/www.guvi.in\/blog\/wp-content\/uploads\/2026\/06\/3-44-150x79.png 150w\" sizes=\"(max-width: 1200px) 100vw, 1200px\" title=\"\"><\/figure>\n\n\n\n<ol>\n<li><strong>Macro F1: Average F1 across all classes<\/strong><\/li>\n<\/ol>\n\n\n\n<p><a href=\"https:\/\/www.ibm.com\/docs\/en\/watsonx\/saas?topic=metrics-macro-f1-score\" target=\"_blank\" rel=\"noreferrer noopener nofollow\">Macro F1<\/a> calculates F1 score for each class separately, then takes the simple average. Each class contributes equally regardless of how many examples it has. Formula: (F1_class1 + F1_class2 + &#8230; + F1_classN) \/ N. 
Use macro F1 when all classes are equally important.<\/p>\n\n\n\n<ol start=\"2\">\n<li><strong>Micro F1: Global precision and recall<\/strong><\/li>\n<\/ol>\n\n\n\n<p>Micro F1 aggregates all true positives, false positives, and false negatives across all classes, then calculates a single precision and recall. Formula: calculate global TP, FP, FN, then F1 from those totals. Micro F1 gives more weight to classes with more examples. In single-label multi-class problems, it equals overall accuracy.<\/p>\n\n\n\n<ol start=\"3\">\n<li><strong>Weighted F1: Class-size weighted average<\/strong><\/li>\n<\/ol>\n\n\n\n<p>Weighted F1 calculates F1 for each class, then takes a weighted average where weights are the number of true instances of each class. Larger classes contribute more to the final score. Use weighted F1 when you want the final metric to reflect how common each class is.<\/p>\n\n\n\n<ol start=\"4\">\n<li><strong>Choosing the right variant<\/strong><\/li>\n<\/ol>\n\n\n\n<p>Use macro F1 when all classes are equally important and you want to treat rare classes the same as common ones. Use micro F1 when larger classes should dominate the metric. Use weighted F1 when you want a middle ground that acknowledges class sizes but still considers all classes. The choice depends on your problem.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>Example comparing the variants<\/strong><\/h2>\n\n\n\n<p>You have 3 classes: Class A (900 examples, F1=0.90), Class B (80 examples, F1=0.60), Class C (20 examples, F1=0.40). Macro F1 = (0.90 + 0.60 + 0.40) \/ 3 = 0.63. Micro F1 needs the pooled TP, FP, and FN counts rather than per-class F1 values, so it cannot be computed from these numbers alone; it might be around 0.87. Weighted F1 = (900\u00d70.90 + 80\u00d70.60 + 20\u00d70.40) \/ 1000 = 0.866, or about 0.87. Notice how macro F1 is much lower because it gives the poorly performing rare classes the same weight as the large one.<\/p>\n\n\n\n<p><strong><em>Did You Know?<\/em><\/strong><em> In multi-class classification competitions, macro F1 is often preferred because it prevents solutions that ignore minority classes. 
A model that performs excellently on 90% of classes but completely fails on 10% gets a low macro F1 score, forcing competitors to build systems that work for all classes. This is especially important in medical diagnosis or fraud detection where rare classes are often the most critical.<\/em><\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>When to Use F1 Score<\/strong><\/h2>\n\n\n\n<ol>\n<li><strong>Imbalanced datasets where accuracy is misleading<\/strong><\/li>\n<\/ol>\n\n\n\n<p>When one class dominates, accuracy is not informative. If 95% of examples are negative, predicting everything as negative gives 95% accuracy but is useless. F1 score focuses on the minority positive class and reveals the model is performing terribly.<\/p>\n\n\n\n<ol start=\"2\">\n<li><strong>When precision and recall both matter<\/strong><\/li>\n<\/ol>\n\n\n\n<p>F1 is ideal when you care about both false positives and false negatives. In fraud detection, you want to catch fraud (recall) without blocking too many legitimate transactions (precision). F1 balances these competing concerns.<\/p>\n\n\n\n<ol start=\"3\">\n<li><strong>Comparing models with different tradeoffs<\/strong><\/li>\n<\/ol>\n\n\n\n<p>When evaluating multiple models, some might favor precision while others favor recall. F1 gives you a single number for comparison. The model with the highest F1 achieves the best balance, making selection easier.<\/p>\n\n\n\n<ol start=\"4\">\n<li><strong>Binary and multi-class classification<\/strong><\/li>\n<\/ol>\n\n\n\n<p>F1 works for both binary problems (one positive class) and multi-class problems (multiple classes to predict). For multi-class, choose the appropriate variant based on whether all classes are equally important.<\/p>\n\n\n\n<ol start=\"5\">\n<li><strong>Reporting standard metric in research<\/strong><\/li>\n<\/ol>\n\n\n\n<p>F1 score is widely used in academic papers and industry reports. Using it makes your results comparable to published work. 
It is the standard metric in many fields including natural language processing and computer vision.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>How to Improve Your F1 Score<\/strong><\/h2>\n\n\n\n<figure class=\"wp-block-image size-full\"><img decoding=\"async\" width=\"1200\" height=\"628\" src=\"https:\/\/www.guvi.in\/blog\/wp-content\/uploads\/2026\/06\/4-27.png\" alt=\"How to Improve Your F1 Score\" class=\"wp-image-116539\" srcset=\"https:\/\/www.guvi.in\/blog\/wp-content\/uploads\/2026\/06\/4-27.png 1200w, https:\/\/www.guvi.in\/blog\/wp-content\/uploads\/2026\/06\/4-27-300x157.png 300w, https:\/\/www.guvi.in\/blog\/wp-content\/uploads\/2026\/06\/4-27-768x402.png 768w, https:\/\/www.guvi.in\/blog\/wp-content\/uploads\/2026\/06\/4-27-150x79.png 150w\" sizes=\"(max-width: 1200px) 100vw, 1200px\" title=\"\"><\/figure>\n\n\n\n<ol>\n<li><strong>Balance your dataset<\/strong><\/li>\n<\/ol>\n\n\n\n<p>Class imbalance hurts F1 score. Techniques like oversampling the minority class, undersampling the majority class, or using synthetic data generation (SMOTE) can help. More balanced training data often improves F1.<\/p>\n\n\n\n<ol start=\"2\">\n<li><strong>Adjust classification threshold<\/strong><\/li>\n<\/ol>\n\n\n\n<p>Most models output probabilities. The default 0.5 threshold might not be optimal. Plot precision-recall curves and choose the threshold that maximizes F1 for your validation set. This simple adjustment often provides significant improvement.<\/p>\n\n\n\n<ol start=\"3\">\n<li><strong>Feature engineering<\/strong><\/li>\n<\/ol>\n\n\n\n<p>Better features that distinguish classes improve both precision and recall. Domain knowledge helps identify informative features. Remove noisy features that add confusion. Feature importance analysis reveals what helps versus what hurts.<\/p>\n\n\n\n<ol start=\"4\">\n<li><strong>Try different algorithms<\/strong><\/li>\n<\/ol>\n\n\n\n<p>Different algorithms have different precision-recall characteristics. 
Ensemble methods like Random Forest or Gradient Boosting often achieve better F1 than single models. Try multiple approaches and select based on F1 performance.<\/p>\n\n\n\n<ol start=\"5\">\n<li><strong>Use class weights<\/strong><\/li>\n<\/ol>\n\n\n\n<p>Many algorithms support class weights that penalize mistakes on minority classes more heavily. Setting appropriate weights helps the model pay more attention to the positive class, typically improving F1 on imbalanced datasets.<\/p>\n\n\n\n<ol start=\"6\">\n<li><strong>Ensemble multiple models<\/strong><\/li>\n<\/ol>\n\n\n\n<p>Combining predictions from multiple models often improves F1. Averaging probabilities or using voting can balance the strengths of different models. Ensembles typically achieve higher F1 than individual models.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>Implementing F1 Score in Python<\/strong><\/h2>\n\n\n\n<ol>\n<li><strong>Using scikit-learn for binary classification<\/strong><\/li>\n<\/ol>\n\n\n\n<p>from sklearn.metrics import f1_score, precision_score, recall_score<\/p>\n\n\n\n<p># Binary classification<\/p>\n\n\n\n<p>f1 = f1_score(y_true, y_pred)<\/p>\n\n\n\n<p>precision = precision_score(y_true, y_pred)<\/p>\n\n\n\n<p>recall = recall_score(y_true, y_pred)<\/p>\n\n\n\n<p>print(f&quot;F1 Score: {f1:.3f}&quot;)<\/p>\n\n\n\n<p>print(f&quot;Precision: {precision:.3f}&quot;)<\/p>\n\n\n\n<p>print(f&quot;Recall: {recall:.3f}&quot;)<\/p>\n\n\n\n<ol start=\"2\">\n<li><strong>Multi-class F1 with different averaging<\/strong><\/li>\n<\/ol>\n\n\n\n<p># Macro F1 (average F1 across classes)<\/p>\n\n\n\n<p>f1_macro = f1_score(y_true, y_pred, average=&#039;macro&#039;)<\/p>\n\n\n\n<p># Micro F1 (global precision and recall)<\/p>\n\n\n\n<p>f1_micro = f1_score(y_true, y_pred, average=&#039;micro&#039;)<\/p>\n\n\n\n<p># Weighted F1 (weighted by class size)<\/p>\n\n\n\n<p>f1_weighted = f1_score(y_true, y_pred, average=&#039;weighted&#039;)<\/p>\n\n\n\n<p># 
Per-class F1 scores<\/p>\n\n\n\n<p>f1_per_class = f1_score(y_true, y_pred, average=None)<\/p>\n\n\n\n<ol start=\"3\">\n<li><strong>Getting a full classification report<\/strong><\/li>\n<\/ol>\n\n\n\n<p>from sklearn.metrics import classification_report<\/p>\n\n\n\n<p>report = classification_report(y_true, y_pred)<\/p>\n\n\n\n<p>print(report)<\/p>\n\n\n\n<p>This shows precision, recall, F1 score, and support for each class in a formatted table, making it easy to diagnose performance.<\/p>\n\n\n\n<p>To learn more about F1 Score in Machine Learning, do not miss the chance to enroll in <strong>HCL GUVI\u2019s <\/strong><a href=\"https:\/\/www.guvi.in\/courses\/machine-learning-and-ai\/mastering-ai-and-machine-learning\/?utm_source=blog&amp;utm_medium=hyperlink&amp;utm_campaign=understanding-f1-score-in-machine-learning\" target=\"_blank\" rel=\"noreferrer noopener\"><strong>AI and Machine Learning course<\/strong><\/a>, which covers machine learning fundamentals, feature engineering, deep learning, and practical implementation through hands-on projects and expert guidance with certification.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>Conclusion<\/strong><\/h2>\n\n\n\n<p>The F1 score is a single metric that balances precision and recall using the harmonic mean. It provides one number that summarizes classification performance, making model comparison easier.<\/p>\n\n\n\n<p>F1 score is especially valuable for imbalanced datasets where accuracy is misleading. It focuses on the positive class and requires both good precision and good recall for a high score.<\/p>\n\n\n\n<p>For multi-class problems, choose between macro, micro, and weighted F1 based on whether all classes are equally important or larger classes should dominate the metric.<\/p>\n\n\n\n<p>Always examine precision and recall individually alongside F1 to understand your model&#8217;s behavior. 
F1 is a convenient summary but hides important details about the precision-recall tradeoff.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>FAQs<\/strong><\/h2>\n\n\n<div id=\"rank-math-faq\" class=\"rank-math-block\">\n<div class=\"rank-math-list \">\n<div id=\"faq-question-1779704199473\" class=\"rank-math-list-item\">\n<h3 class=\"rank-math-question \">1. <strong>What is a good F1 score?<\/strong><\/h3>\n<div class=\"rank-math-answer \">\n\n<p>It depends on your problem and baseline performance. For balanced datasets, F1 above 0.80 is generally good. For difficult or highly imbalanced problems, F1 above 0.60 might be excellent. Compare against baselines and previous work in your domain rather than using absolute thresholds.<\/p>\n\n<\/div>\n<\/div>\n<div id=\"faq-question-1779704266120\" class=\"rank-math-list-item\">\n<h3 class=\"rank-math-question \">2. <strong>What is the difference between F1 and accuracy?<\/strong><\/h3>\n<div class=\"rank-math-answer \">\n\n<p>Accuracy measures overall correctness across all classes. F1 combines precision and recall for the positive class. For imbalanced datasets, accuracy can be misleading while F1 provides meaningful evaluation. F1 focuses on how well you identify the positive class specifically.<\/p>\n\n<\/div>\n<\/div>\n<div id=\"faq-question-1779704276676\" class=\"rank-math-list-item\">\n<h3 class=\"rank-math-question \">3. <strong>When should I use macro vs micro vs weighted F1?<\/strong><\/h3>\n<div class=\"rank-math-answer \">\n\n<p>Use macro F1 when all classes are equally important regardless of size. Use micro F1 when larger classes should dominate the metric. Use weighted F1 for a middle ground that considers class sizes. The choice depends on whether rare classes are as important as common ones in your application.<\/p>\n\n<\/div>\n<\/div>\n<div id=\"faq-question-1779704289070\" class=\"rank-math-list-item\">\n<h3 class=\"rank-math-question \">4. 
<strong>Can F1 score be higher than both precision and recall?<\/strong><\/h3>\n<div class=\"rank-math-answer \">\n\n<p>No. The harmonic mean always lies between the lower and the higher of the two values, so F1 can never exceed the higher of precision and recall. F1 equals precision and recall only when they are equal to each other. If precision and recall differ, F1 is lower than the higher metric and higher than the lower metric, and it sits closer to the lower one.<\/p>\n\n<\/div>\n<\/div>\n<div id=\"faq-question-1779704299057\" class=\"rank-math-list-item\">\n<h3 class=\"rank-math-question \">5. <strong>How do I calculate F1 score in Python?<\/strong><\/h3>\n<div class=\"rank-math-answer \">\n\n<p>Use scikit-learn: from sklearn.metrics import f1_score then f1 = f1_score(y_true, y_pred). For multi-class, specify averaging: f1_score(y_true, y_pred, average=&#039;macro&#039;) for macro F1, average=&#039;micro&#039; for micro F1, or average=&#039;weighted&#039; for weighted F1.<\/p>\n\n<\/div>\n<\/div>\n<\/div>\n<\/div>","protected":false},"excerpt":{"rendered":"<p>Imagine you are comparing two spam filters. Model A catches 90% of spam but marks 40% of legitimate emails as spam. Model B catches 70% of spam and only marks 10% of legitimate emails as spam. Which is better? Looking at just one number would be easier than juggling two metrics. 
This is exactly what [&hellip;]<\/p>\n","protected":false},"author":63,"featured_media":116534,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[933],"tags":[],"views":"394","authorinfo":{"name":"Vishalini Devarajan","url":"https:\/\/www.guvi.in\/blog\/author\/vishalini\/"},"thumbnailURL":"https:\/\/www.guvi.in\/blog\/wp-content\/uploads\/2026\/06\/Feature-image-25-300x116.png","_links":{"self":[{"href":"https:\/\/www.guvi.in\/blog\/wp-json\/wp\/v2\/posts\/112102"}],"collection":[{"href":"https:\/\/www.guvi.in\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.guvi.in\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.guvi.in\/blog\/wp-json\/wp\/v2\/users\/63"}],"replies":[{"embeddable":true,"href":"https:\/\/www.guvi.in\/blog\/wp-json\/wp\/v2\/comments?post=112102"}],"version-history":[{"count":3,"href":"https:\/\/www.guvi.in\/blog\/wp-json\/wp\/v2\/posts\/112102\/revisions"}],"predecessor-version":[{"id":116540,"href":"https:\/\/www.guvi.in\/blog\/wp-json\/wp\/v2\/posts\/112102\/revisions\/116540"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.guvi.in\/blog\/wp-json\/wp\/v2\/media\/116534"}],"wp:attachment":[{"href":"https:\/\/www.guvi.in\/blog\/wp-json\/wp\/v2\/media?parent=112102"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.guvi.in\/blog\/wp-json\/wp\/v2\/categories?post=112102"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.guvi.in\/blog\/wp-json\/wp\/v2\/tags?post=112102"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}