Introduction

Research conducted by the Division of Endocrinology, Diabetes, and Metabolism at Ohio State University suggests that hospital readmission is an important contributor to total medical expenditure and an emerging indicator of quality of care. Many factors, such as patient demographics, diagnostic procedures, and medications, influence patient readmission. The goal of this project is to analyze the key factors that affect the readmission of a patient and to build a classification model that predicts readmission from those factors. The work presented here is limited to the readmission of diabetic patients. We started with an exploratory analysis of the features in the dataset to detect interesting patterns. The statistical testing section highlights the results of the hypothesis tests we performed to assess the significance of features for the readmission outcome. Finally, we used machine learning to build several classification models on the data. The accuracy of the classifiers is outlined in the Machine Learning section.

Tools & Techniques Used:

R was used primarily for data preparation. Python was used to perform the hypothesis tests, and machine learning was performed using the Weka GUI.

For the visualizations, we used D3.js and the Google Charts API.

Data Collection

The data contains over 50 features representing patient and hospital outcomes. Information was extracted from the database for encounters that satisfied the following criteria:

  1. It is an inpatient encounter (a hospital admission).
  2. It is a diabetic encounter, that is, one during which any kind of diabetes was entered into the system as a diagnosis.
  3. The length of stay was at least 1 day and at most 14 days.
  4. Laboratory tests were performed during the encounter.
  5. Medications were administered during the encounter.
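
As a rough illustration, the five criteria above can be read as a single predicate over an encounter record. The sketch below is not the extraction query that was actually run against the database; the `Encounter` class and all of its field names are hypothetical and only mirror the wording of the list.

```python
from dataclasses import dataclass


@dataclass
class Encounter:
    # Hypothetical fields; the real database schema is not described here.
    is_inpatient: bool
    has_diabetes_diagnosis: bool
    length_of_stay_days: int
    num_lab_tests: int
    num_medications: int


def meets_criteria(e: Encounter) -> bool:
    """Return True if the encounter satisfies all five inclusion criteria."""
    return (
        e.is_inpatient                        # 1. inpatient encounter
        and e.has_diabetes_diagnosis          # 2. diabetes entered as a diagnosis
        and 1 <= e.length_of_stay_days <= 14  # 3. stay of 1 to 14 days
        and e.num_lab_tests > 0               # 4. lab tests were performed
        and e.num_medications > 0             # 5. medications were administered
    )
```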

The data contains attributes such as patient number, race, gender, age, admission type, time in hospital, medical specialty of the admitting physician, number of lab tests performed, HbA1c test result, diagnosis, number of medications, diabetic medications, and the number of outpatient, inpatient, and emergency visits in the year before the hospitalization.

The original dataset can be found here.

Data Preparation

Missing Values

Weight was removed from the analysis because a very large share of its values are missing. Payer Code was judged irrelevant to the analysis and was also removed. A special value, "missing", was used to fill missing values for Medical Specialty. For Race, Diagnosis-1, Diagnosis-2, and Diagnosis-3, the rows with missing values were removed from the analysis.
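
The preparation itself was done in R. Purely as an illustration, the same rules could be sketched in Python as follows. The column names (`weight`, `payer_code`, `medical_specialty`, `race`, `diag_1`, `diag_2`, `diag_3`) and the `"?"` marker for missing values are assumptions about how the file is laid out, not details taken from this write-up.

```python
MISSING_MARKERS = (None, "", "?")  # assumption: how missing values are encoded

DROPPED_COLUMNS = ("weight", "payer_code")               # removed entirely
REQUIRED_COLUMNS = ("race", "diag_1", "diag_2", "diag_3")  # rows need all of these


def clean_rows(rows):
    """Apply the missing-value rules to a list of dict rows."""
    cleaned = []
    for row in rows:
        # Drop rows with a missing race or diagnosis value.
        if any(row.get(col) in MISSING_MARKERS for col in REQUIRED_COLUMNS):
            continue
        # Remove the weight and payer code columns.
        new_row = {k: v for k, v in row.items() if k not in DROPPED_COLUMNS}
        # Fill a missing medical specialty with the special value "missing".
        if new_row.get("medical_specialty") in MISSING_MARKERS:
            new_row["medical_specialty"] = "missing"
        cleaned.append(new_row)
    return cleaned
```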

Remapping Output Variable

Original

After Mapping

The original dataset had three values for the readmission variable: "<30", ">30", and "NO". These indicate whether the patient was readmitted within 30 days, readmitted after 30 days, or not readmitted at all. Since we were focused on analyzing the factors that affect readmission, we combined "<30" and ">30" into a single value, "YES".
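
The remapping amounts to a small lookup. A minimal sketch, assuming the values appear as the strings quoted above (the function name `remap_readmitted` is ours, for illustration only):

```python
def remap_readmitted(value: str) -> str:
    """Collapse the three-valued readmission label into YES / NO."""
    value = value.strip()  # tolerate stray whitespace around the label
    if value in ("<30", ">30"):
        return "YES"
    if value == "NO":
        return "NO"
    # Fail loudly on anything unexpected rather than guessing.
    raise ValueError(f"unexpected readmission value: {value!r}")
```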

Removing Outliers

This graph shows how often patients are readmitted. The number of patients falls as the number of readmissions rises. We treated patients who were readmitted more than 5 times as outliers and excluded them from further analysis.
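
One possible way to express this filter is sketched below. It assumes that a patient's readmission count equals their number of encounters minus one and that patients are identified by a `patient_nbr` field; both are our assumptions for illustration, and the original R code may have counted readmissions differently.

```python
from collections import Counter


def drop_frequent_readmitters(rows, max_readmissions=5):
    """Remove every encounter of patients readmitted more than max_readmissions times."""
    encounters = Counter(row["patient_nbr"] for row in rows)
    return [
        row
        for row in rows
        # Assumption: readmissions = total encounters - 1 (the first admission).
        if encounters[row["patient_nbr"]] - 1 <= max_readmissions
    ]
```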

The original data contained all patient admissions. Building machine learning models requires that each row be an independent instance. Because the same patient could be admitted multiple times, this assumption would not hold. Hence, the data prepared for the machine learning part contained only the first encounter of every patient.
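
Keeping one row per patient can be sketched as follows. The field names `patient_nbr` and `encounter_id` are assumptions, as is the idea that the smallest `encounter_id` marks a patient's first encounter; the write-up does not say how "first" was determined.

```python
def first_encounters(rows):
    """Keep only the earliest encounter for each patient."""
    first = {}
    for row in rows:
        pid = row["patient_nbr"]
        # Assumption: a lower encounter_id means an earlier encounter.
        if pid not in first or row["encounter_id"] < first[pid]["encounter_id"]:
            first[pid] = row
    return list(first.values())
```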

Data Exploration

Age distribution by race for all patients

Age distribution by race for all readmitted patients

After analyzing the age distribution of diabetic patients across races, we observed that for Asians the distribution of diabetic patients is more or less uniform across age groups compared with other races. However, the readmission rate in the 70-80 age group is highest among Asians.

Summary of the Time spent by patients in the hospital

This graph indicates that the majority of diabetic patients spend at most 3 days in the hospital.

Analyzing Readmission vs Glucose Serum Test levels

Hover over the sectors of the pie chart to observe the distribution of patients by Glucose Serum level and readmission outcome. For Glucose Serum levels above 300, a higher proportion of the patients were readmitted. For levels above 200 and for Normal levels, more patients were not readmitted than readmitted.
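
The comparison shown in the pie chart comes down to a readmission rate per Glucose Serum level. A minimal sketch of that calculation is given below; the field names `max_glu_serum` and `readmitted` are assumptions, and the readmission label is assumed to be already remapped to "YES" / "NO".

```python
from collections import defaultdict


def readmission_rate_by_level(rows):
    """Return the share of readmitted patients for each glucose serum level."""
    counts = defaultdict(lambda: [0, 0])  # level -> [readmitted, total]
    for row in rows:
        entry = counts[row["max_glu_serum"]]
        entry[1] += 1
        if row["readmitted"] == "YES":
            entry[0] += 1
    return {
        level: readmitted / total
        for level, (readmitted, total) in counts.items()
    }
```

A rate above 0.5 for a level means that more patients at that level were readmitted than not, which is the pattern described for levels above 300.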