Project 4: Mushroom Classification

Overview

This project performs exploratory data analysis and classification modeling on the UCI Mushroom Dataset. The goal is to predict whether a mushroom is edible or poisonous based on its physical characteristics using a Decision Tree Classifier.

Dataset

The Mushroom dataset was loaded directly from the UCI Machine Learning Repository. It contains 8,124 observations, each described by 22 categorical features such as cap shape, odor, gill color, and habitat. The target variable, poisonous, indicates whether a mushroom is poisonous or edible.
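
The loading step might look like the minimal sketch below. The URL, the `load_mushrooms` and `summarize` helpers, and the two sample rows are assumptions made for illustration; the notebook may load the data differently. The column names follow the UCI dataset documentation, with the class column renamed to `poisonous` to match this project, and `?` is treated as the missing-value marker.

```python
import io

import pandas as pd

# Assumed location of the raw file; this README only says "UCI repository".
UCI_URL = (
    "https://archive.ics.uci.edu/ml/machine-learning-databases/"
    "mushroom/agaricus-lepiota.data"
)

# The raw file has no header row, so column names are supplied here.
COLUMNS = [
    "poisonous", "cap-shape", "cap-surface", "cap-color", "bruises", "odor",
    "gill-attachment", "gill-spacing", "gill-size", "gill-color",
    "stalk-shape", "stalk-root", "stalk-surface-above-ring",
    "stalk-surface-below-ring", "stalk-color-above-ring",
    "stalk-color-below-ring", "veil-type", "veil-color", "ring-number",
    "ring-type", "spore-print-color", "population", "habitat",
]

# Two hypothetical rows in the file's format, so the sketch runs offline.
SAMPLE = (
    "p,x,s,n,t,p,f,c,n,k,e,e,s,s,w,w,p,w,o,p,k,s,u\n"
    "e,x,s,y,t,a,f,c,b,k,e,?,s,s,w,w,p,w,o,p,n,n,g\n"
)


def load_mushrooms(source=UCI_URL):
    """Read the comma-separated data from a URL, path, or file-like object."""
    return pd.read_csv(source, header=None, names=COLUMNS, na_values="?")


def summarize(df):
    """Collect shape, dtypes, and per-column missing counts for inspection."""
    return {
        "shape": df.shape,
        "dtypes": df.dtypes.astype(str).to_dict(),
        "missing": df.isna().sum().to_dict(),
    }


sample_df = load_mushrooms(io.StringIO(SAMPLE))
print(summarize(sample_df)["shape"])
```

Calling `load_mushrooms()` with no argument would fetch the full file from the assumed URL instead of the in-memory sample.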

Project Structure

ml04_webb_midterm.ipynb: Main Jupyter notebook containing the full analysis

README.md: Project documentation

Key Steps

  1. Data Import and Inspection
     - Load the mushroom dataset from the UCI repository
     - Inspect structure, feature names, and data types
     - Check for missing values and describe overall dataset composition

  2. Data Exploration and Preparation
     - Visualize feature distributions using count plots
     - Identify outliers and patterns in categorical variables
     - Handle missing values (imputed stalk-root using mode)
     - Drop uninformative column (veil-type)
     - Encode categorical features using label encoding

  3. Feature Selection
     - Selected five key predictors: odor, spore-print-color, gill-color, population, and habitat
     - Defined target variable: poisonous
     - Justified feature choice based on interpretability and biological relevance

  4. Model Training and Evaluation
     - Trained a Decision Tree Classifier using stratified train/test split
     - Evaluated model performance with accuracy, precision, recall, F1-score, and confusion matrix
     - Initial model achieved 100% accuracy, revealing potential feature leakage

  5. Model Tuning and Interpretation
     - Removed suspected leakage features (odor and spore-print-color)
     - Re-trained Decision Tree achieving 94% accuracy with balanced precision and recall
     - Visualized tuned tree and interpreted top decision nodes (gill-color, stalk-shape, habitat)

  6. Summary and Insights
     - Discussed the impact of feature leakage on model performance
     - Compared original and tuned model accuracy and generalization
     - Reflected on lessons learned about data preprocessing, model interpretability, and evaluation
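
The preparation, training, and tuning steps above can be sketched end to end as follows. This is an illustration under stated assumptions, not the notebook's code: the helper names (`make_toy_frame`, `prepare`, `fit_and_score`), the 80/20 split, and the random seed are hypothetical, and the data is a small synthetic frame in which odor is deliberately made a perfect predictor. Its scores will therefore not match the 100% and 94% reported for the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier

TARGET = "poisonous"
FEATURES = ["odor", "spore-print-color", "gill-color", "population", "habitat"]
SUSPECTED_LEAKS = ["odor", "spore-print-color"]


def make_toy_frame(n=400, seed=0):
    """Build a small SYNTHETIC frame (not the UCI data) with matching column names."""
    rng = np.random.default_rng(seed)
    target = rng.choice(["e", "p"], size=n)
    # odor is made a perfect predictor on purpose, to mimic the leakage symptom.
    odor = np.where(target == "p", "f", "n")
    # gill-color agrees with the class most of the time, but not always.
    gill = np.where(target == "p", "b", "k")
    gill = np.where(rng.random(n) < 0.2, rng.choice(["b", "k"], size=n), gill)
    # Some stalk-root entries are blanked out to exercise the imputation step.
    stalk_root = pd.Series(rng.choice(["b", "c", "?"], size=n))
    stalk_root = stalk_root.mask(stalk_root == "?")
    return pd.DataFrame({
        TARGET: target,
        "odor": odor,
        "spore-print-color": rng.choice(["k", "n", "w"], size=n),
        "gill-color": gill,
        "population": rng.choice(["s", "v", "y"], size=n),
        "habitat": rng.choice(["d", "g", "u"], size=n),
        "stalk-root": stalk_root,
        "veil-type": "p",  # constant column, hence uninformative
    })


def prepare(df):
    """Impute stalk-root with its mode, drop veil-type, label-encode every column."""
    out = df.copy()
    if "stalk-root" in out.columns:
        out["stalk-root"] = out["stalk-root"].fillna(out["stalk-root"].mode()[0])
    out = out.drop(columns=["veil-type"], errors="ignore")
    # One LabelEncoder per column maps each category letter to an integer code.
    return out.apply(lambda col: LabelEncoder().fit_transform(col))


def fit_and_score(encoded, features, seed=42):
    """Train a decision tree on a stratified split and return test-set metrics."""
    X, y = encoded[features], encoded[TARGET]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=seed
    )
    model = DecisionTreeClassifier(random_state=seed).fit(X_train, y_train)
    pred = model.predict(X_test)
    return {
        "accuracy": accuracy_score(y_test, pred),
        "confusion": confusion_matrix(y_test, pred),
        "report": classification_report(y_test, pred),
    }


encoded = prepare(make_toy_frame())
reduced_features = [f for f in FEATURES if f not in SUSPECTED_LEAKS]
full = fit_and_score(encoded, FEATURES)
reduced = fit_and_score(encoded, reduced_features)
print(f"all five features:       {full['accuracy']:.2f}")
print(f"without suspected leaks: {reduced['accuracy']:.2f}")
```

Scoring both feature sets with the same split and seed keeps the comparison between the original and the reduced model like for like.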

Technologies Used

Python 3.x

pandas

numpy

matplotlib

seaborn

scikit-learn

Requirements

All libraries used ship with standard data science distributions such as Anaconda. In a plain Python environment, install pandas, numpy, matplotlib, seaborn, and scikit-learn, plus Jupyter to run the notebook.
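
One way to confirm that the libraries listed above are importable before opening the notebook is the small check below. It is only a sketch: the `REQUIRED` mapping and the `missing_packages` helper are hypothetical names, and the check looks up import names without verifying versions. Note that scikit-learn is imported as `sklearn`.

```python
from importlib.util import find_spec

# Maps each package name to the module name used in import statements.
REQUIRED = {
    "pandas": "pandas",
    "numpy": "numpy",
    "matplotlib": "matplotlib",
    "seaborn": "seaborn",
    "scikit-learn": "sklearn",
}


def missing_packages(required=REQUIRED):
    """Return the package names whose import module cannot be located."""
    return [name for name, module in required.items() if find_spec(module) is None]


missing = missing_packages()
print("Missing:", ", ".join(missing) if missing else "none")
```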