Understanding Data Leakage

One of the most common mistakes in machine learning is data leakage.

Data leakage occurs when information from the test dataset, which would not be available when the model makes real predictions, influences the training process.

Examples include:

  • Training on the full dataset before splitting
  • Accidentally exposing test labels during training
  • Fitting transformations (such as scaling or imputation) on the full dataset instead of on the training split only
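
One way a transformation can leak is through the statistics it learns. The sketch below is illustrative only: the data is randomly generated, the 80/20 split is an arbitrary choice, and all variable names are hypothetical. It standardizes features with NumPy and contrasts statistics computed on all rows (leaky) with statistics computed on the training rows only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 100 samples, 3 features.
X = rng.normal(loc=5.0, scale=2.0, size=(100, 3))

# Split first: the first 80 rows are for training, the last 20 are held out.
X_train, X_test = X[:80], X[80:]

# Leaky version: the mean and standard deviation are computed on ALL rows,
# so the held-out rows influence how the training data is scaled.
leaky_mean = X.mean(axis=0)
leaky_std = X.std(axis=0)

# Leak-free version: the statistics come from the training rows only.
train_mean = X_train.mean(axis=0)
train_std = X_train.std(axis=0)

# The same training-derived statistics are applied to both splits.
X_train_scaled = (X_train - train_mean) / train_std
X_test_scaled = (X_test - train_mean) / train_std
```

The held-out rows are transformed with the training statistics, just as unseen data would be once the model is in use.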

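Data leakage leads to overly optimistic evaluation results: the model is scored on data it has, in effect, already seen, so the reported performance overstates how it will do on genuinely new data.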

By splitting the dataset before training, and before fitting any preprocessing step, we reduce the risk of introducing bias into our evaluation.
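
A minimal sketch of a split-first workflow, assuming scikit-learn is available. The synthetic dataset, the 25% test size, and the choice of a scaler plus logistic regression are assumptions made for illustration, not something the text prescribes.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical synthetic dataset: 200 samples, 5 features.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Split BEFORE anything is fitted, so the test rows stay unseen.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# The pipeline fits the scaler and the classifier on the training rows only.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)

# The test split is used once, for evaluation.
test_accuracy = model.score(X_test, y_test)
print(f"Test accuracy: {test_accuracy:.3f}")
```

Bundling preprocessing and model in one pipeline means a single call to fit cannot accidentally touch the test rows.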