Using R and Random Forest to Predict Diabetes Risk

Spencer R Hall
2 min read · Dec 25, 2020


Time for something a bit different from my previous posts, and hopefully something I'll do more of in the future.

I was looking through the UCI Machine Learning Repository for a couple of data sets I could use for some simple machine learning problems, both to work on interesting questions and to keep my abilities sharp. This week I found the Early Stage Diabetes Risk Prediction data set, which comes from a paper on early prediction of diabetes risk. The abstract describes the methods used and indicates that Random Forest performed particularly well. Since I don't have access to the full paper, I wanted to use R to try this model out myself.

It turns out that this data set isn't terribly interesting when using Random Forest: not much cleaning or tuning is needed to get good accuracy. For the sake of completeness, though, I'll share my snippets and results.

First, we need to load the appropriate libraries and the data set. Here I read in every column except age as a factor and clean up the column names.

library(tidyverse)
library(randomForest)
library(reprtree)

# Read Age as numeric ("d") and the remaining 16 columns as factors ("f")
diabetesdf <- read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/00529/diabetes_data_upload.csv",
                       col_types = "dffffffffffffffff")

# Make the column names syntactically valid and rename the target column
colnames(diabetesdf) <- make.names(colnames(diabetesdf))
names(diabetesdf)[names(diabetesdf) == 'class'] <- "result"
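
A quick structural check (optional, but a cheap sanity test on the column types) confirms everything was read in as intended:

glimpse(diabetesdf)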

Now, so we can estimate our out-of-sample error rate, let's use 80% of the data for training our model and test it with the remaining 20%.

# Fix the seed so the 80/20 split is reproducible
set.seed(1234)
sample_size <- round(nrow(diabetesdf) * .80)
index <- sample(seq_len(nrow(diabetesdf)), size = sample_size)
train <- diabetesdf[index, ]
test <- diabetesdf[-index, ]

# Numeric 0/1 copies of the outcome (prepared for a boosting model; not used below)
trainboost <- train
testboost <- test
trainboost$result <- as.numeric(trainboost$result) - 1
testboost$result <- as.numeric(testboost$result) - 1
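
Since the split is random, it's worth a quick check that both subsets contain a mix of Positive and Negative cases before training:

table(train$result)
table(test$result)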

Now, let’s train our Random Forest model using the default parameters and calculate the accuracy.

# Train a Random Forest with the package defaults (500 trees)
ddffit <- randomForest(result ~ ., data = train)

# Predict classes on the held-out 20% and compute accuracy
ddftest <- predict(ddffit, newdata = test, type = "response")
sum(ddftest == test$result) / nrow(test)

This gives a prediction accuracy of around 96% on the held-out test set.
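
To sanity-check that number, it helps to look past a single accuracy figure. Printing the fitted model reports the out-of-bag error estimate, and a test-set confusion matrix plus a variable importance plot show where the model errs and which symptoms drive it. A quick sketch using standard randomForest helpers:

# OOB error estimate and training confusion matrix
print(ddffit)
# Confusion matrix on the held-out test set
table(predicted = ddftest, actual = test$result)
# Mean decrease in Gini impurity per predictor
varImpPlot(ddffit)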

Now, just for the sake of an interesting visualization, let's use the reprtree package by Abhijit Dasgupta. This lets us extract and plot a representative tree that summarizes the Random Forest model built from the training set.
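
Note that reprtree isn't on CRAN; a minimal install sketch, assuming the package is still hosted at the araastat/reprtree GitHub repository and that devtools is available:

if (!requireNamespace("reprtree", quietly = TRUE)) {
  devtools::install_github("araastat/reprtree")
}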

# Extract and plot a representative tree from the trained forest
plot(ReprTree(ddffit, train, metric = 'd2'))

So, while this isn't a particularly challenging data set for Random Forest to classify, it does illustrate the basics of building a Random Forest model, as well as using reprtree to visualize a representative decision tree from the forest.

Since I know this data set works well and I have more experience implementing machine learning algorithms in R, I will come back to it when transferring more of my skills over to Python.

Originally published at https://heril.github.io/.
