Predictive Models on the 2013 NCDB Colon Cancer Data

Published: 4 May 2021 | Version 1 | DOI: 10.17632/jg44fgspzk.1


The attached file contains R code that carries out and documents the full analysis: loading and cleaning the data, selecting variables, imputing missing values, creating training and test sets, and building and evaluating models. The code also produces the graphs and tables used to describe the data and evaluate the models. The goal was to build a logistic regression model to predict outcomes after surgery for colon cancer and to compare its performance with machine learning algorithms. Three machine learning models were built and compared with logistic regression: an XGBoost model, a Random Forest model, and an XGBoost model trained on data oversampled with SMOTE. Overall, the machine learning algorithms achieved a higher AUC than logistic regression.
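The AUC comparison described above can be illustrated with a minimal sketch. It is written in Python rather than R, is not the author's code, and uses made-up toy labels and scores rather than NCDB data; the names `roc_auc`, `model_a` and `model_b` are hypothetical. It computes AUC as the probability that a randomly chosen positive case receives a higher predicted score than a randomly chosen negative case, with ties counted as one half.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Return the AUC for binary labels (0/1) and predicted scores.

    Uses the pairwise (Mann-Whitney) definition: the fraction of
    positive/negative pairs in which the positive case scores higher,
    counting ties as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example only: invented outcomes and predicted probabilities
# for two hypothetical models scored on the same test set.
labels = [0, 0, 1, 0, 1, 1, 0, 1]
model_a = [0.10, 0.40, 0.35, 0.20, 0.80, 0.70, 0.50, 0.45]
model_b = [0.05, 0.30, 0.60, 0.10, 0.90, 0.75, 0.40, 0.55]

auc_a = roc_auc(labels, model_a)
auc_b = roc_auc(labels, model_b)
print(f"AUC model A: {auc_a:.4f}")
print(f"AUC model B: {auc_b:.4f}")
```

The pairwise loop is quadratic in the number of cases, which is fine for a small illustration; on a dataset the size of a national registry, a rank-based implementation from an established package would be the more practical choice.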

Files are not publicly available

You can contact the author to request the files

Steps to reproduce

Run the code in R with access to the 2013 colon section of the National Cancer Database (NCDB).


Elsevier BV


Health Sciences