Diabetes Prediction Project

This project predicts whether a person has diabetes from demographic and clinical features. It includes the dataset file diabetes_prediction_dataset.csv and a Jupyter Notebook, notebook.ipynb, which contains the code and analysis.

Dataset

The dataset file diabetes_prediction_dataset.csv contains 100,000 rows and the following 9 columns:

gender: Three unique values - male, female, other.
age: Age of the individual.
hypertension: Binary value (0 or 1) indicating the presence of hypertension.
heart_disease: Binary value (0 or 1) indicating the presence of heart disease.
smoking_history: Six unique values - never, no info, current, former, ever, not current.
bmi: Body Mass Index (BMI) of the individual.
HbA1c_level: Level of HbA1c (glycated hemoglobin) in the blood.
blood_glucose_level: Blood glucose level of the individual.
diabetes: Binary value (0 or 1) indicating the presence of diabetes.
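
As a quick sanity check on the schema described above, the following sketch verifies that a CSV has the nine listed columns and that the binary columns hold only 0 or 1. It uses only Python's standard library. The function name check_schema and the two in-memory sample rows are illustrative assumptions; they are not taken from the notebook or from the real data.

```python
import csv
import io

# Column names as listed in this README.
EXPECTED_COLUMNS = [
    "gender", "age", "hypertension", "heart_disease", "smoking_history",
    "bmi", "HbA1c_level", "blood_glucose_level", "diabetes",
]
BINARY_COLUMNS = ["hypertension", "heart_disease", "diabetes"]


def check_schema(csv_file):
    """Return the number of data rows; raise ValueError on a schema mismatch."""
    reader = csv.DictReader(csv_file)
    if reader.fieldnames != EXPECTED_COLUMNS:
        raise ValueError(f"unexpected columns: {reader.fieldnames}")
    n_rows = 0
    for row in reader:
        for col in BINARY_COLUMNS:
            if row[col] not in ("0", "1"):
                raise ValueError(
                    f"data row {n_rows + 1}: {col}={row[col]!r} is not 0 or 1"
                )
        n_rows += 1
    return n_rows


# Made-up sample rows, for illustration only (hypothetical values).
sample = io.StringIO(
    "gender,age,hypertension,heart_disease,smoking_history,"
    "bmi,HbA1c_level,blood_glucose_level,diabetes\n"
    "female,54.0,0,0,never,27.3,6.6,140,0\n"
    "male,67.0,1,0,former,30.1,7.0,200,1\n"
)
n_sample_rows = check_schema(sample)
```

To run the same check on the real file, open diabetes_prediction_dataset.csv with newline="" and pass the file object to check_schema. This assumes the binary columns are stored as the literal text 0 and 1.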

Notebook

The Jupyter Notebook file notebook.ipynb contains the code and analysis for the diabetes prediction project. Here is an overview of the steps performed in the notebook:

Importing required libraries.
Importing the diabetes_prediction_dataset.csv file.
Removing duplicate rows from the dataset.
Data visualization:
- Count plot of individuals by smoking history.
- Count plot of individuals by smoking history and diabetes status.
- Count plot of individuals by gender.
- Count plot of individuals by gender and diabetes status.
- Histogram of the age distribution.
- Box plot of age by diabetes status.
- Count plot of individuals by hypertension and diabetes status.
- Count plot of individuals by heart disease and diabetes status.
- Box plot of BMI by diabetes status.
- Box plot of HbA1c level by diabetes status.
- Box plot of blood glucose level by diabetes status.
- Correlation heatmap.
Performing one-hot encoding on the gender and smoking_history columns.
Concatenating the encoded columns with other columns (age, hypertension, heart_disease, bmi, HbA1c_level, blood_glucose_level) and saving the resulting dataframe in the variable X.
Normalizing X using MinMaxScaler.
Defining the target variable y as df['diabetes'].
Balancing the classes using SMOTE, because the target is imbalanced: 87,664 non-diabetic (0) samples versus 8,482 diabetic (1) samples.
Defining a dictionary of algorithms in the algos variable.
Training the algorithms with different hyperparameters and saving the model, best score, and best parameters in the scores variable.
Converting the scores into a dataframe.
Splitting X and y into training, test, and validation sets using train_test_split.
Training a Random Forest Classifier with n_estimators=100 and criterion='entropy'.
Calculating the accuracy score.
Predicting the labels for the held-out data X_valid and saving the predictions in y_pred.
Creating a confusion matrix and heatmap of the confusion matrix.
Creating a classification report.
Applying PCA (Principal Component Analysis) on X for dimensionality reduction.
Training the model again with the Random Forest Classifier using the same parameters.
Creating a confusion matrix and classification report based on the reduced dimensions.
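
The evaluation steps above (accuracy score, confusion matrix, classification report) reduce to a few counts over the true and predicted labels. The sketch below computes them in plain Python for the binary case, treating 1 (diabetic) as the positive class. The helper names and the label lists are hypothetical and are not results from the notebook, which presumably relies on scikit-learn's metrics functions instead.

```python
def confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary labels, with 1 as the positive class."""
    tn = fp = fn = tp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1  # diabetic, predicted diabetic
        elif truth == 1:
            fn += 1  # diabetic, predicted non-diabetic
        elif pred == 1:
            fp += 1  # non-diabetic, predicted diabetic
        else:
            tn += 1  # non-diabetic, predicted non-diabetic
    return tn, fp, fn, tp


def binary_report(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 for the positive class."""
    tn, fp, fn, tp = confusion_counts(y_true, y_pred)
    total = tn + fp + fn + tp
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    accuracy = (tp + tn) / total if total else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Made-up labels, for illustration only (not output from the notebook).
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 0]
counts = confusion_counts(y_true, y_pred)
report = binary_report(y_true, y_pred)
```

On an imbalanced target like this one, precision and recall for the diabetic class are usually more informative than accuracy alone, which is one reason to inspect the full report rather than a single score.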

Please refer to the notebook.ipynb file for detailed code implementation and further analysis of the diabetes prediction project.
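
For readers who want to see what the encoding and normalization steps in the overview do, here is a minimal library-free sketch. It one-hot encodes a categorical column and applies min-max scaling, x' = (x - min) / (max - min), which maps each value into the range [0, 1]. The helper names and sample values are hypothetical; the notebook itself uses MinMaxScaler for the scaling step, and its exact encoding call is not shown here.

```python
def one_hot(values):
    """One-hot encode a list of category labels.

    Returns (categories, rows): categories is the sorted list of distinct
    labels, and each row has a 1 at its label's position and 0 elsewhere.
    """
    categories = sorted(set(values))
    index = {cat: i for i, cat in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1
        rows.append(row)
    return categories, rows


def min_max_scale(column):
    """Rescale numeric values to [0, 1]; a constant column maps to all zeros."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]


# Made-up values, for illustration only.
smoking = ["never", "current", "never", "former"]
bmi = [20.0, 25.0, 30.0, 40.0]

categories, encoded = one_hot(smoking)
scaled_bmi = min_max_scale(bmi)
```

Scaling matters here because the features sit on very different ranges (for example, binary flags versus blood glucose levels), and SMOTE picks neighbors by distance, so unscaled features with large ranges would dominate.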

