Diabetes Projection

Predicting based on diagnostic measurements using Random Forest Classification algorithm

Posted by Begen Yussupov on October 25, 2021 · 10 mins read

In this project, we explore a diabetes dataset and hope the findings prove helpful. The dataset contains diagnostic measurements: number of pregnancies, glucose level, blood pressure, skin thickness, insulin, BMI, the Diabetes Pedigree Function (DPF), which gives some information on hereditary risk, and age. We will build a model that predicts whether a patient has diabetes based on those measurements.

Diabetes occurs when the pancreas cannot produce enough insulin or the body cannot use it effectively. Insulin’s role in the human body is to control glucose levels: it acts as a directional tool that helps deliver glucose from the blood into the cells. Without insulin, glucose keeps circulating in the blood and cannot reach the cells. As a result, the blood glucose level rises.

import pandas as pd
from sklearn.preprocessing import MinMaxScaler,StandardScaler
from matplotlib import pyplot as plt
from sklearn.cluster import KMeans
import seaborn as sns
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report,confusion_matrix, roc_auc_score,r2_score

We import the necessary libraries, then load the data. The dataset is hosted on a personal GitHub page and can be read directly into the notebook:

df = pd.read_csv("https://raw.githubusercontent.com/begen/diabetes/master/diabetes.csv")
df.head()
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age Outcome
0 6 148 72 35 0 33.6 0.627 50 1
1 1 85 66 29 0 26.6 0.351 31 0
2 8 183 64 0 0 23.3 0.672 32 1
3 1 89 66 23 94 28.1 0.167 21 0
4 0 137 40 35 168 43.1 2.288 33 1

We can see that the data is small with 768 rows and 9 columns.

df.shape 
(768, 9)

We would like to see the correlation between the given variables:

df.corr()
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age Outcome
Pregnancies 1.000000 0.129459 0.141282 -0.081672 -0.073535 0.017683 -0.033523 0.544341 0.221898
Glucose 0.129459 1.000000 0.152590 0.057328 0.331357 0.221071 0.137337 0.263514 0.466581
BloodPressure 0.141282 0.152590 1.000000 0.207371 0.088933 0.281805 0.041265 0.239528 0.065068
SkinThickness -0.081672 0.057328 0.207371 1.000000 0.436783 0.392573 0.183928 -0.113970 0.074752
Insulin -0.073535 0.331357 0.088933 0.436783 1.000000 0.197859 0.185071 -0.042163 0.130548
BMI 0.017683 0.221071 0.281805 0.392573 0.197859 1.000000 0.140647 0.036242 0.292695
DiabetesPedigreeFunction -0.033523 0.137337 0.041265 0.183928 0.185071 0.140647 1.000000 0.033561 0.173844
Age 0.544341 0.263514 0.239528 -0.113970 -0.042163 0.036242 0.033561 1.000000 0.238356
Outcome 0.221898 0.466581 0.065068 0.074752 0.130548 0.292695 0.173844 0.238356 1.000000

To visualize our correlation matrix, we will use seaborn’s heatmap.

Correlation Matrix

From this correlation heatmap, we can see that glucose level (0.47), BMI (0.29), age (0.24) and DPF (0.17) are all positively correlated with the diabetes diagnosis. It means that people with a higher BMI, people who are older and people whose blood glucose level is higher have a higher chance of being diagnosed with diabetes. This is mainly related to type 2 diabetes, not type 1.
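As a small illustration of reading the Outcome column of the correlation matrix, the sketch below ranks the features by their correlation with the diagnosis. The values are copied from the table above; the names outcome_corr and ranked are illustrative and not part of the original notebook.

```python
import pandas as pd

# Correlation of each feature with Outcome, copied from the df.corr() output above
outcome_corr = pd.Series({
    "Pregnancies": 0.221898,
    "Glucose": 0.466581,
    "BloodPressure": 0.065068,
    "SkinThickness": 0.074752,
    "Insulin": 0.130548,
    "BMI": 0.292695,
    "DiabetesPedigreeFunction": 0.173844,
    "Age": 0.238356,
})

# Strongest positive correlation first
ranked = outcome_corr.sort_values(ascending=False)
print(ranked)
```

With the full dataset loaded, the same ranking should be available from df.corr()['Outcome'].drop('Outcome').sort_values(ascending=False).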

Another helpful visualization shows only the lower triangle of the correlation matrix, so each pair of variables appears once. It can be produced with the following lines of code:

plt.figure(figsize=(9,9))
sns.heatmap(df.corr(), annot=True, mask=np.triu(df.corr()))
plt.ylim(9,0);

Correlation Heatmap
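To see what the mask argument does, here is a minimal sketch on a hypothetical 3x3 matrix (the toy array is an assumption for illustration). np.triu keeps the upper triangle, including the diagonal, and seaborn hides the cells where the mask is true, so only the values below the diagonal are drawn.

```python
import numpy as np

# Hypothetical 3x3 correlation-like matrix, for illustration only
toy = np.array([
    [1.0, 0.2, 0.5],
    [0.2, 1.0, 0.3],
    [0.5, 0.3, 1.0],
])

# True on and above the diagonal: these are the cells a heatmap would hide
mask = np.triu(np.ones_like(toy, dtype=bool))
print(mask)

# The cells that stay visible are the ones strictly below the diagonal
visible = toy[~mask]
print(visible)
```

Building the mask from an explicit boolean array, as here, avoids relying on the correlation values themselves being non-zero.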

Next, we need to check whether there are any null values in the dataset. For that, we can run the following command:

df.isnull().sum()
Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64

We can also look at the summary statistics for each column:

df.describe().transpose()
count mean std min 25% 50% 75% max
Pregnancies 768.0 3.845052 3.369578 0.000 1.00000 3.0000 6.00000 17.00
Glucose 768.0 120.894531 31.972618 0.000 99.00000 117.0000 140.25000 199.00
BloodPressure 768.0 69.105469 19.355807 0.000 62.00000 72.0000 80.00000 122.00
SkinThickness 768.0 20.536458 15.952218 0.000 0.00000 23.0000 32.00000 99.00
Insulin 768.0 79.799479 115.244002 0.000 0.00000 30.5000 127.25000 846.00
BMI 768.0 31.992578 7.884160 0.000 27.30000 32.0000 36.60000 67.10
DiabetesPedigreeFunction 768.0 0.471876 0.331329 0.078 0.24375 0.3725 0.62625 2.42
Age 768.0 33.240885 11.760232 21.000 24.00000 29.0000 41.00000 81.00
Outcome 768.0 0.348958 0.476951 0.000 0.00000 0.0000 1.00000 1.00
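The summary shows a minimum of 0 for Glucose, BloodPressure, SkinThickness, Insulin and BMI, even though isnull() reports no missing values. Treating those zeros as likely placeholders for missing measurements is an assumption on our part, not something the dataset states. A minimal sketch for counting them, using the five rows shown by df.head() so that it runs on its own (sample and zero_cols are illustrative names):

```python
import pandas as pd

# The five rows printed by df.head(), re-typed so the sketch is self-contained
sample = pd.DataFrame({
    "Glucose": [148, 85, 183, 89, 137],
    "BloodPressure": [72, 66, 64, 66, 40],
    "SkinThickness": [35, 29, 0, 23, 35],
    "Insulin": [0, 0, 0, 94, 168],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1],
})

# Columns where a value of 0 is assumed to be physiologically implausible
zero_cols = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]

# Count the zeros per column
zero_counts = (sample[zero_cols] == 0).sum()
print(zero_counts)
```

On the full dataset, (df[zero_cols] == 0).sum() would give the same kind of count across all 768 rows.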

Let’s look at the outcome values:

df['Outcome'].value_counts() # want to see outcome values of diabetes
0    500
1    268
Name: Outcome, dtype: int64

268 people have been diagnosed with diabetes and 500 have not.
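The class counts also give a baseline to compare any model against: always predicting the majority class. A small sketch of that arithmetic, using the counts above (the variable names are illustrative):

```python
# Counts from df['Outcome'].value_counts() above
n_negative = 500
n_positive = 268
n_total = n_negative + n_positive

# Share of patients diagnosed with diabetes
positive_share = n_positive / n_total

# Accuracy of a model that always predicts "no diabetes"
majority_baseline = n_negative / n_total

print(f"positive share: {positive_share:.3f}")
print(f"majority-class baseline accuracy: {majority_baseline:.3f}")
```

Roughly 35% of the patients are positive, so a classifier has to beat about 65% accuracy to be more useful than this trivial rule.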

It would be helpful to plot a histogram of BMI for the patients with a positive outcome, using 6 bins.

plt.figure(figsize=(20,10))
sns.histplot(df[df['Outcome']==1]['BMI'], bins=6);

BMI Histogram (Outcome = 1)

From this plot, we can see that the people diagnosed with diabetes have a BMI between 22 and 56. A BMI between 18.5 and 24.9 is considered the normal range.
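To relate BMI values to the normal range mentioned above, here is a minimal sketch that labels a BMI value by category. The cut-offs for underweight, overweight and obese follow the commonly used adult classification and are an assumption here; only the normal range appears in the text. The function name bmi_category is illustrative.

```python
from bisect import bisect_right

# Lower bounds of each category above "underweight" (assumed adult cut-offs)
BOUNDS = [18.5, 25.0, 30.0]
LABELS = ["underweight", "normal", "overweight", "obese"]

def bmi_category(bmi):
    """Return the category label for a BMI value."""
    return LABELS[bisect_right(BOUNDS, bmi)]

# BMI values from the first five rows shown by df.head()
head_bmi = [33.6, 26.6, 23.3, 28.1, 43.1]
categories = [bmi_category(b) for b in head_bmi]
print(categories)
```

The same function could be applied to the whole column with df['BMI'].map(bmi_category) to count how many positive cases fall into each category.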

Now, we would like to see a distribution plot, a histogram and a box plot for each of our variables. We can do this by running the following function, plotgr:

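def plotgr(col):
    """Draw a distribution plot, a histogram and a box plot for each column name in col."""
    for num in col:
        print('Plots : ', num)
        plt.figure(figsize=(20, 10))

        # Distribution: density histogram with a KDE curve
        # (replaces the deprecated sns.distplot)
        plt.subplot(1, 3, 1)
        sns.histplot(x=df[num], kde=True, stat='density')
        plt.title('Distribution Plot')

        # Histogram
        plt.subplot(1, 3, 2)
        sns.histplot(x=df[num])
        plt.title('Histogram plot')

        # Box plot
        plt.subplot(1, 3, 3)
        sns.boxplot(x=df[num])
        plt.title('Box Plot')

        plt.show()

# Plot every measurement column (all columns except the Outcome label)
plotgr(df.columns.drop('Outcome'))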
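
Plots for each variable (distribution plot, histogram and box plot, side by side):

Pregnancies: [image]

Glucose: [image]

Blood Pressure: [image]

Skin Thickness: [image]

Insulin: [image]

BMI: [image]

DiabetesPedigreeFunction: [image]

Age: [image]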

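Now let’s build our model. First, we split our data into train and test sets in a 0.8 to 0.2 ratio, respectively.

# Features are every column except the label; keeping Outcome in the features would leak the answer
X = df.drop(columns='Outcome')
y = df['Outcome']
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=15)

# Fit the scaler on the training data only, then apply it to the test data
scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)

classifier = RandomForestClassifier()
classifier.fit(x_train, y_train)
y_predicted = classifier.predict(x_test)

The outputs below were recorded with the whole df, Outcome column included, passed to train_test_split as the features. Because the label was among the inputs, the model scored a perfect 1.00 on every metric. With Outcome dropped from the features, as in the code above, perfect scores should not be expected. Note also that r2_score is a regression metric; for a classifier, the accuracy in the classification report is the more meaningful number.

r2_score(y_test, y_predicted)
1.0
print(classification_report(y_test, y_predicted))
              precision    recall  f1-score   support

           0       1.00      1.00      1.00       108
           1       1.00      1.00      1.00        46

    accuracy                           1.00       154
   macro avg       1.00      1.00      1.00       154
weighted avg       1.00      1.00      1.00       154
print(confusion_matrix(y_test, y_predicted))
[[108   0]
 [  0  46]]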
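
Perfect scores like the 1.00 values above are usually a sign that the label has leaked into the features rather than a sign of a perfect model. The sketch below illustrates the effect on synthetic data; the random features, the sample size and the helper name fit_and_score are assumptions for illustration and are not taken from the diabetes dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic data: four pure-noise features and a random binary label
X_clean = rng.normal(size=(400, 4))
y = rng.integers(0, 2, size=400)

# Leaky variant: the label itself is appended as an extra feature column
X_leaky = np.column_stack([X_clean, y])

def fit_and_score(X, y):
    """Train a random forest and return its accuracy on a held-out 20% split."""
    x_train, x_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=15
    )
    model = RandomForestClassifier(random_state=0)
    model.fit(x_train, y_train)
    return accuracy_score(y_test, model.predict(x_test))

clean_acc = fit_and_score(X_clean, y)
leaky_acc = fit_and_score(X_leaky, y)

print(f"accuracy without the label as a feature: {clean_acc:.2f}")
print(f"accuracy with the label as a feature:    {leaky_acc:.2f}")
```

On noise features the first score should hover around chance, while the second should be close to 1.0. That is the same pattern as training on a DataFrame that still contains the Outcome column.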