Project 4: Mushroom Classification
Overview
This project performs exploratory data analysis and classification modeling on the UCI Mushroom Dataset. The goal is to predict whether a mushroom is edible or poisonous based on its physical characteristics using a Decision Tree Classifier.
Dataset
The Mushroom dataset is loaded directly from the UCI Machine Learning Repository. It contains 8,124 observations described by 22 categorical features such as cap shape, odor, gill color, and habitat. The target variable, poisonous, indicates whether a mushroom is poisonous or edible.
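As a rough illustration of the loading step, the sketch below reads the raw comma-separated file with pandas. The URL, the `load_mushrooms` helper, and the column names are assumptions: the names follow the dataset's published attribute list, with the class column called `poisonous` to match this README, and the notebook itself may load the data differently.

```python
import pandas as pd

# Assumed location of the raw data file in the UCI repository.
UCI_URL = (
    "https://archive.ics.uci.edu/ml/machine-learning-databases/"
    "mushroom/agaricus-lepiota.data"
)

# The raw file has no header row, so column names are supplied here.
# The first column is the class label (p = poisonous, e = edible).
COLUMNS = [
    "poisonous", "cap-shape", "cap-surface", "cap-color", "bruises", "odor",
    "gill-attachment", "gill-spacing", "gill-size", "gill-color",
    "stalk-shape", "stalk-root", "stalk-surface-above-ring",
    "stalk-surface-below-ring", "stalk-color-above-ring",
    "stalk-color-below-ring", "veil-type", "veil-color", "ring-number",
    "ring-type", "spore-print-color", "population", "habitat",
]


def load_mushrooms(source=UCI_URL):
    """Read the mushroom data from a URL, file path, or file-like object.

    Missing entries are written as "?" in the raw file, so they are
    converted to NaN while reading.
    """
    return pd.read_csv(source, header=None, names=COLUMNS, na_values="?")
```

Calling `load_mushrooms()` with no arguments fetches the file over the network; passing a local path works the same way. `df.info()` and `df.isna().sum()` on the result are a quick way to inspect data types and missing values.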
Project Structure
ml04_webb_midterm.ipynb: Main Jupyter notebook containing the full analysis
README.md: Project documentation
Key Steps
- Data Import and Inspection
  - Load the mushroom dataset from the UCI repository
  - Inspect structure, feature names, and data types
  - Check for missing values and describe overall dataset composition
- Data Exploration and Preparation
  - Visualize feature distributions using count plots
  - Identify outliers and patterns in categorical variables
  - Handle missing values (imputed stalk-root using mode)
  - Drop uninformative column (veil-type)
  - Encode categorical features using label encoding
- Feature Selection
  - Selected five key predictors: odor, spore-print-color, gill-color, population, and habitat
  - Defined target variable: poisonous
  - Justified feature choice based on interpretability and biological relevance
- Model Training and Evaluation
  - Trained a Decision Tree Classifier using stratified train/test split
  - Evaluated model performance with accuracy, precision, recall, F1-score, and confusion matrix
  - Initial model achieved 100% accuracy, revealing potential feature leakage
- Model Tuning and Interpretation
  - Removed suspected leakage features (odor and spore-print-color)
  - Re-trained Decision Tree achieving 94% accuracy with balanced precision and recall
  - Visualized tuned tree and interpreted top decision nodes (gill-color, stalk-shape, habitat)
- Summary and Insights
  - Discussed the impact of feature leakage on model performance
  - Compared original and tuned model accuracy and generalization
  - Reflected on lessons learned about data preprocessing, model interpretability, and evaluation
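The preparation, training, and tuning steps above can be sketched roughly as follows. This is a minimal illustration rather than the notebook's actual code: the helper names `prepare` and `fit_and_score`, the 80/20 split, and the random seed are assumptions, and because the exact feature list of the tuned model is not stated here, `FEATURES_REDUCED` simply drops the two suspected leakage features from the original five.

```python
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier

# The five predictors named above, and the same list without the two
# suspected leakage features (assumed; the tuned list is not stated).
FEATURES_ALL = ["odor", "spore-print-color", "gill-color", "population", "habitat"]
FEATURES_REDUCED = [f for f in FEATURES_ALL if f not in ("odor", "spore-print-color")]


def prepare(df):
    """Impute stalk-root with its mode, drop veil-type, label-encode all columns."""
    df = df.copy()
    if "stalk-root" in df.columns:
        mode_value = df["stalk-root"].mode().iloc[0]
        df["stalk-root"] = df["stalk-root"].fillna(mode_value)
    # veil-type takes a single value in this dataset, so it carries no signal.
    df = df.drop(columns=["veil-type"], errors="ignore")
    for col in df.columns:
        df[col] = LabelEncoder().fit_transform(df[col])
    return df


def fit_and_score(encoded, features, target="poisonous", seed=42):
    """Fit a decision tree on a stratified split and return (model, test accuracy)."""
    X = encoded[features]
    y = encoded[target]
    # test_size and seed are illustrative choices, not taken from the notebook.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=seed
    )
    model = DecisionTreeClassifier(random_state=seed)
    model.fit(X_train, y_train)
    return model, accuracy_score(y_test, model.predict(X_test))
```

Running `fit_and_score` once with `FEATURES_ALL` and once with `FEATURES_REDUCED` on the prepared frame gives the two accuracies to compare. For precision, recall, and F1, `sklearn.metrics.classification_report` and `confusion_matrix` can be applied to the same held-out predictions, and `sklearn.tree.export_text` prints the fitted splits for interpretation.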
Technologies Used
Python 3.x
pandas
numpy
matplotlib
seaborn
scikit-learn
Requirements
All libraries used ship with standard data science distributions such as Anaconda. In other environments, install pandas, numpy, scikit-learn, matplotlib, and seaborn before running the notebook.