This project predicts whether a person has diabetes based on various factors. It includes a dataset file, `diabetes_prediction_dataset.csv`, and a Jupyter Notebook, `notebook.ipynb`, which contains the code and analysis.
The dataset file `diabetes_prediction_dataset.csv` contains 100,000 rows and 9 columns with the following column names:
- gender: Three unique values - male, female, other.
- age: Age of the individual.
- hypertension: Binary value (0 or 1) indicating the presence of hypertension.
- heart_disease: Binary value (0 or 1) indicating the presence of heart disease.
- smoking_history: Six unique values - never, no info, current, former, ever, not current.
- bmi: Body Mass Index (BMI) of the individual.
- HbA1c_level: Level of HbA1c (glycated hemoglobin) in the blood.
- blood_glucose_level: Blood glucose level of the individual.
- diabetes: Binary value (0 or 1) indicating the presence of diabetes.
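
The column list above can be turned into a quick sanity check before any analysis. The following is a minimal sketch, assuming pandas is installed; `EXPECTED_COLUMNS`, `BINARY_COLUMNS`, and `check_schema` are hypothetical names introduced here for illustration and do not come from the notebook.

```python
import pandas as pd

# Column names as listed above.
EXPECTED_COLUMNS = [
    "gender", "age", "hypertension", "heart_disease", "smoking_history",
    "bmi", "HbA1c_level", "blood_glucose_level", "diabetes",
]
# Columns described above as binary (0 or 1).
BINARY_COLUMNS = ["hypertension", "heart_disease", "diabetes"]


def check_schema(df):
    """Return a list of problems found; an empty list means the frame matches."""
    problems = []
    missing = [c for c in EXPECTED_COLUMNS if c not in df.columns]
    if missing:
        problems.append(f"missing columns: {missing}")
    for col in BINARY_COLUMNS:
        # Only check columns that are present; missing ones are reported above.
        if col in df.columns and not df[col].isin([0, 1]).all():
            problems.append(f"{col} has values other than 0 and 1")
    return problems
```

With the real file in the working directory, this would be called as `check_schema(pd.read_csv("diabetes_prediction_dataset.csv"))`.
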
The Jupyter Notebook `notebook.ipynb` contains the code and analysis for the project. Here is an overview of the steps performed in the notebook:
- Importing required libraries.
- Importing the `diabetes_prediction_dataset.csv` file.
- Removing duplicate rows from the dataset.
- Data visualization:
  - Count plot of individuals by smoking history.
  - Count plot of individuals by smoking history and diabetes status.
  - Count plot of individuals by gender.
  - Count plot of individuals by gender and diabetes status.
  - Histogram of the age distribution.
  - Box plot of age by diabetes status.
  - Count plot of individuals by hypertension and diabetes status.
  - Count plot of individuals by heart disease and diabetes status.
  - Box plot of BMI by diabetes status.
  - Box plot of HbA1c level by diabetes status.
  - Box plot of blood glucose level by diabetes status.
  - Correlation heatmap.
- Performing one-hot encoding on the `gender` and `smoking_history` columns.
- Concatenating the encoded columns with the other columns (`age`, `hypertension`, `heart_disease`, `bmi`, `HbA1c_level`, `blood_glucose_level`) and saving the resulting dataframe in the variable `X`.
- Normalizing `X` using MinMaxScaler.
- Defining the target variable `y` as `df['diabetes']`.
- Balancing the classes using SMOTE, as the count of 0 (non-diabetic) is 87,664 and the count of 1 (diabetic) is 8,482.
- Defining a dictionary of algorithms in the `algos` variable.
- Training the algorithms with different hyperparameters and saving the model, best score, and best parameters in the `scores` variable.
- Converting `scores` into a dataframe.
- Splitting `X` and `y` into training, testing, and validation datasets using `train_test_split`.
- Training a Random Forest Classifier with `n_estimators=100` and `criterion='entropy'`.
. - Calculating the accuracy score.
- Predicting the values for the held-out data `X_valid` and saving the predictions in `y_pred`.
- Creating a confusion matrix and a heatmap of the confusion matrix.
- Creating a classification report.
- Applying PCA (Principal Component Analysis) to `X` for dimensionality reduction.
- Training the model again with the Random Forest Classifier using the same parameters.
- Creating a confusion matrix and classification report based on the reduced dimensions.
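
The modelling steps above can be sketched roughly as follows. This is an illustrative outline, not the notebook's code: the function names (`make_demo_frame`, `build_features`, `train_and_evaluate`), the synthetic stand-in data, the seed, `test_size=0.2`, and the PCA component count are all assumptions made here. The sketch also deviates from the listed order in three ways: it uses a single train/validation split instead of three sets, it fits the scaler and PCA on the training split only, and it applies SMOTE only to the training split so that no synthetic samples reach the evaluation data. SMOTE comes from the separate imbalanced-learn package and is skipped if that package is not installed. The hyperparameter search over `algos` is not reproduced, since the algorithms and grids are not listed here.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

try:  # SMOTE is provided by imbalanced-learn, an optional dependency here
    from imblearn.over_sampling import SMOTE
except ImportError:
    SMOTE = None

OTHER_COLS = ["age", "hypertension", "heart_disease", "bmi",
              "HbA1c_level", "blood_glucose_level"]


def make_demo_frame(n=300, seed=0):
    """Synthetic stand-in using the column names above; NOT the real data."""
    rng = np.random.default_rng(seed)
    glucose = rng.uniform(80, 300, n)
    return pd.DataFrame({
        "gender": rng.choice(["male", "female", "other"], n),
        "age": rng.uniform(1, 80, n),
        "hypertension": rng.integers(0, 2, n),
        "heart_disease": rng.integers(0, 2, n),
        "smoking_history": rng.choice(
            ["never", "no info", "current", "former", "ever", "not current"], n),
        "bmi": rng.uniform(15, 45, n),
        "HbA1c_level": rng.uniform(3.5, 9.0, n),
        "blood_glucose_level": glucose,
        # Arbitrary rule so the demo has a learnable target.
        "diabetes": (glucose > 220).astype(int),
    })


def build_features(df):
    """One-hot encode the categorical columns and join them with the rest."""
    encoded = pd.get_dummies(df[["gender", "smoking_history"]], dtype=float)
    X = pd.concat([encoded, df[OTHER_COLS]], axis=1)
    y = df["diabetes"]
    return X, y


def train_and_evaluate(X, y, n_components=None, seed=42):
    """Scale, optionally reduce with PCA, balance, train, and score."""
    X_train, X_valid, y_train, y_valid = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=seed)

    # Fit the scaler on the training split, then apply it to both splits.
    scaler = MinMaxScaler().fit(X_train)
    X_train, X_valid = scaler.transform(X_train), scaler.transform(X_valid)

    if n_components is not None:
        pca = PCA(n_components=n_components).fit(X_train)
        X_train, X_valid = pca.transform(X_train), pca.transform(X_valid)

    if SMOTE is not None:
        # Oversample the minority class in the training split only.
        X_train, y_train = SMOTE(random_state=seed).fit_resample(X_train, y_train)

    model = RandomForestClassifier(
        n_estimators=100, criterion="entropy", random_state=seed)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_valid)

    return {
        "accuracy": accuracy_score(y_valid, y_pred),
        "confusion_matrix": confusion_matrix(y_valid, y_pred, labels=[0, 1]),
        "report": classification_report(y_valid, y_pred, zero_division=0),
    }
```

Calling `train_and_evaluate(*build_features(make_demo_frame()))` runs the full-feature variant, and passing `n_components=5` (an arbitrary choice) runs the PCA variant. To try it on the real data, replace `make_demo_frame()` with the dataframe loaded from `diabetes_prediction_dataset.csv` after duplicate removal.
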
Please refer to the `notebook.ipynb` file for the detailed code implementation and further analysis of the diabetes prediction project.