Project 4: Mushroom Classification

Overview

This project performs exploratory data analysis and classification modeling on the UCI Mushroom Dataset. The goal is to predict whether a mushroom is edible or poisonous based on its physical characteristics using a Decision Tree Classifier.

Dataset

The Mushroom dataset was loaded directly from the UCI Machine Learning Repository. It contains 8,124 observations and 22 categorical features describing attributes such as cap shape, odor, gill color, and habitat. The target variable, poisonous, labels each mushroom as edible or poisonous.

Project Structure

ml04_webb_midterm.ipynb: Main Jupyter notebook containing the full analysis

README.md: Project documentation

Key Steps

Data Import and Inspection
Load the mushroom dataset from the UCI repository
Inspect structure, feature names, and data types
Check for missing values and describe overall dataset composition
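The import and inspection steps above can be sketched as follows. This is a minimal sketch rather than the notebook's actual code: the raw-file URL, the `load_mushrooms` and `summarize` helpers, and the choice to name the first column `poisonous` (the raw file has no header row) are all assumptions, and the three sample rows are only illustrative rows in the file's format.

```python
import io

import pandas as pd

# Assumed location of the raw data file; verify against the UCI repository page.
UCI_URL = (
    "https://archive.ics.uci.edu/ml/machine-learning-databases/"
    "mushroom/agaricus-lepiota.data"
)

# The raw file has no header row. The first column is named "poisonous"
# here to match this README (an assumption about the notebook's naming).
COLUMNS = [
    "poisonous", "cap-shape", "cap-surface", "cap-color", "bruises", "odor",
    "gill-attachment", "gill-spacing", "gill-size", "gill-color",
    "stalk-shape", "stalk-root", "stalk-surface-above-ring",
    "stalk-surface-below-ring", "stalk-color-above-ring",
    "stalk-color-below-ring", "veil-type", "veil-color", "ring-number",
    "ring-type", "spore-print-color", "population", "habitat",
]


def load_mushrooms(source=UCI_URL):
    """Read the comma-separated file, treating '?' as a missing value."""
    return pd.read_csv(source, header=None, names=COLUMNS, na_values="?")


def summarize(df):
    """Return dtype, distinct-value count, and missing count per column."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_unique": df.nunique(),
        "n_missing": df.isna().sum(),
    })


# Illustrative rows in the same format, so the sketch runs offline.
SAMPLE = io.StringIO(
    "p,x,s,n,t,p,f,c,n,k,e,e,s,s,w,w,p,w,o,p,k,s,u\n"
    "e,x,s,y,t,a,f,c,b,k,e,c,s,s,w,w,p,w,o,p,n,n,g\n"
    "e,b,s,w,t,l,f,c,b,n,e,?,s,s,w,w,p,w,o,p,n,n,m\n"
)
sample_df = load_mushrooms(SAMPLE)
summary = summarize(sample_df)
print(summary.loc[["stalk-root", "veil-type"]])
```

Calling `load_mushrooms()` with no argument would read from the assumed URL instead of the in-memory sample.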
Data Exploration and Preparation
Visualize feature distributions using count plots
Identify rare categories and patterns in the categorical variables
Handle missing values by imputing stalk-root with its mode
Drop the uninformative veil-type column, which takes a single value in this dataset
Encode categorical features using label encoding
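The preparation steps above (mode imputation, dropping veil-type, label encoding) might look roughly like this. The `prepare` helper and the toy frame are hypothetical and only illustrate the technique. scikit-learn's `LabelEncoder` is applied column by column here, which is one common way to label-encode features; the notebook may do it differently.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder


def prepare(df):
    """Impute stalk-root with its mode, drop veil-type, label-encode all columns.

    Hypothetical helper; returns the encoded frame plus the fitted encoders
    so that integer codes can be mapped back to the original letters.
    """
    out = df.copy()

    # Fill missing stalk-root values with the most frequent category.
    mode_value = out["stalk-root"].mode(dropna=True).iloc[0]
    out["stalk-root"] = out["stalk-root"].fillna(mode_value)

    # veil-type carries no information when it has a single value.
    out = out.drop(columns=["veil-type"])

    # Map each column's categories to integers 0..k-1 (sorted by label).
    encoders = {}
    for col in out.columns:
        encoder = LabelEncoder()
        out[col] = encoder.fit_transform(out[col])
        encoders[col] = encoder
    return out, encoders


# Tiny made-up frame with one missing stalk-root value.
toy = pd.DataFrame({
    "poisonous": ["p", "e", "e", "p"],
    "odor": ["p", "a", "l", "f"],
    "stalk-root": ["e", None, "b", "b"],
    "veil-type": ["p", "p", "p", "p"],
})
encoded, encoders = prepare(toy)
print(encoded)
```

Label encoding assigns an arbitrary order to the categories. A decision tree can still split on such codes, but the thresholds it learns should not be read as meaningful orderings.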
Feature Selection
Selected five key predictors: odor, spore-print-color, gill-color, population, and habitat
Defined target variable: poisonous
Justified feature choice based on interpretability and biological relevance
Model Training and Evaluation
Trained a Decision Tree Classifier using a stratified train/test split
Evaluated model performance with accuracy, precision, recall, F1-score, and confusion matrix
The initial model achieved 100% accuracy, which raised the suspicion of feature leakage
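A minimal sketch of the training and evaluation step, assuming an already encoded frame. The `train_and_evaluate` helper, the 80/20 split, and `random_state=42` are assumptions, not values taken from the notebook. The synthetic data is built so that the label depends only on odor, which shows how a single dominant feature can produce a perfect score.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

FEATURES = ["odor", "spore-print-color", "gill-color", "population", "habitat"]


def train_and_evaluate(df, features=FEATURES, target="poisonous",
                       test_size=0.2, random_state=42):
    """Fit a decision tree on a stratified split and score it on the test set."""
    X_train, X_test, y_train, y_test = train_test_split(
        df[features], df[target],
        test_size=test_size,
        stratify=df[target],      # keep the class ratio equal in both splits
        random_state=random_state,
    )
    model = DecisionTreeClassifier(random_state=random_state)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    metrics = {
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred),
        "recall": recall_score(y_test, y_pred),
        "f1": f1_score(y_test, y_pred),
        "confusion_matrix": confusion_matrix(y_test, y_pred),
    }
    return model, metrics


# Synthetic, already-encoded data (not the real dataset): the label is a
# deterministic function of odor alone.
i = np.arange(360)
toy = pd.DataFrame({
    "odor": i % 6,
    "spore-print-color": (i // 6) % 4,
    "gill-color": (i // 24) % 5,
    "population": i % 3,
    "habitat": (i // 3) % 7,
})
toy["poisonous"] = (toy["odor"] >= 3).astype(int)

model, metrics = train_and_evaluate(toy)
for name, value in metrics.items():
    print(name, value, sep=": ")
```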
Model Tuning and Interpretation
Removed suspected leakage features (odor and spore-print-color)
Retrained the Decision Tree, which achieved 94% accuracy with balanced precision and recall
Visualized tuned tree and interpreted top decision nodes (gill-color, stalk-shape, habitat)
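The tuning step above amounts to refitting on a reduced feature set and inspecting the resulting tree. The sketch below does this on synthetic data, so the accuracies it prints are not the project's results. The `fit_tree` helper and the split settings are assumptions, and `export_text` is used as a text-only stand-in for a plotted tree.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text


def fit_tree(df, features, target="poisonous", random_state=42):
    """Fit a tree on a stratified 80/20 split; return it with test accuracy."""
    X_train, X_test, y_train, y_test = train_test_split(
        df[features], df[target],
        test_size=0.2, stratify=df[target], random_state=random_state,
    )
    model = DecisionTreeClassifier(random_state=random_state)
    model.fit(X_train, y_train)
    return model, model.score(X_test, y_test)


# Synthetic encoded data: odor fully determines the label, while
# gill-color is only partly informative and habitat is unrelated.
i = np.arange(360)
toy = pd.DataFrame({
    "odor": i % 6,
    "gill-color": i % 4,
    "habitat": (i // 12) % 3,
})
toy["poisonous"] = (toy["odor"] >= 3).astype(int)

full_features = ["odor", "gill-color", "habitat"]
reduced_features = ["gill-color", "habitat"]  # suspected feature removed

full_model, full_acc = fit_tree(toy, full_features)
reduced_model, reduced_acc = fit_tree(toy, reduced_features)
print(f"with odor: {full_acc:.2f}, without odor: {reduced_acc:.2f}")

# Rank the remaining features and show the top levels of the reduced tree.
importances = pd.Series(
    reduced_model.feature_importances_, index=reduced_features
).sort_values(ascending=False)
print(importances)
print(export_text(reduced_model, feature_names=reduced_features, max_depth=2))
```

Comparing the two scores side by side makes the contribution of the removed feature explicit, and the importance ranking indicates which remaining features drive the top splits.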
Summary and Insights
Discussed the impact of feature leakage on model performance
Compared original and tuned model accuracy and generalization
Reflected on lessons learned about data preprocessing, model interpretability, and evaluation

Technologies Used

Python 3.x

pandas

numpy

matplotlib

seaborn

scikit-learn

Requirements

All libraries used ship with standard data science distributions such as Anaconda. In a plain Python environment, install pandas, numpy, matplotlib, seaborn, and scikit-learn before running the notebook.