Feature Selection

Information Gain is a widely used technique for selecting a subset of features. The following graph shows the Information Gain at the feature level. The approach we followed for feature selection was backward elimination, i.e., iteratively removing the features with the lowest information gain until we reached a higher level of model accuracy.
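The backward-elimination loop described above can be sketched as follows. This is an illustrative re-implementation in scikit-learn (the report itself uses Weka); the synthetic data, the logistic-regression stand-in model, and the stopping rule are assumptions, not the report's actual setup. Information gain is approximated here by mutual information with the target.

```python
# Sketch of information-gain-based backward elimination.
# The dataset and model below are illustrative placeholders.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=10, n_informative=4,
                           random_state=0)

# Rank features by information gain (mutual information with the target).
gains = mutual_info_classif(X, y, random_state=0)
order = np.argsort(gains)  # lowest information gain first

model = LogisticRegression(max_iter=1000)
best_score = cross_val_score(model, X, y, cv=10).mean()
kept = list(range(X.shape[1]))

# Try dropping features in order of increasing information gain,
# keeping the removal only if cross-validated accuracy does not drop.
for idx in order:
    candidate = [f for f in kept if f != idx]
    score = cross_val_score(model, X[:, candidate], y, cv=10).mean()
    if score >= best_score:
        best_score, kept = score, candidate

print(len(kept), round(best_score, 3))
```

The same loop applies to the diabetes dataset once the categorical medication features are encoded numerically.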

The information gain results are consistent with the hypothesis tests. diabetesMed and change (of medication) show significant information gain and are included in the final model as well.

Based on the above chart, the following features were removed in order to aim for better model accuracy:

acarbose, miglitol, glyburide.metformin, glimepiride, chlorpropamide, nateglinide, acetohexamide, tolbutamide, metformin.pioglitazone, glipizide.metformin, tolazamide, troglitazone, citoglipton, examide, metformin.rosiglitazone, & glimepiride.pioglitazone

While the accuracy improves only slightly, a model with fewer features is preferable because it is simpler.

Feature Transformation

In order to experiment with the effects of feature transformation, the number_emergency, number_inpatient, and number_outpatient features were converted from numerical to binary variables. In each case, the variable indicates whether or not the incident occurred in the previous year.

Number of Emergencies

Number of Outpatient Visits

Number of Inpatient Visits
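The numeric-to-binary transformation above can be sketched in a few lines. This is a minimal illustration assuming the data is loaded into a pandas DataFrame with the named visit-count columns; the sample values are made up.

```python
# Convert visit counts to binary indicators:
# 1 if at least one such incident occurred in the previous year, else 0.
import pandas as pd

df = pd.DataFrame({
    "number_emergency":  [0, 2, 0, 1],   # illustrative values
    "number_inpatient":  [1, 0, 0, 3],
    "number_outpatient": [0, 0, 4, 0],
})

for col in ["number_emergency", "number_inpatient", "number_outpatient"]:
    df[col] = (df[col] > 0).astype(int)

print(df["number_emergency"].tolist())  # → [0, 1, 0, 1]
```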

Comparing Models

All the models were trained in Weka using 10-fold cross-validation in order to avoid overfitting.
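For readers unfamiliar with the protocol, 10-fold cross-validation holds out each of ten folds once while training on the remaining nine. The sketch below reproduces the same evaluation in scikit-learn; the synthetic data and Naive Bayes model are illustrative, not the report's Weka setup.

```python
# 10-fold cross-validation: ten train/test splits, ten accuracy scores.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=300, random_state=0)
scores = cross_val_score(GaussianNB(), X, y, cv=10)
print(len(scores))  # → 10
```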

Given the importance of understanding the factors that help predict readmissions, we focus on achieving a better True Positive Rate for the classifier.

Based on the above TPR values, the Bayesian Network was found to perform best. With the transformed features, Naive Bayes gave a comparable TPR. More information about Bayesian networks can be found in this Wikipedia article.
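The True Positive Rate used to compare the classifiers is TP / (TP + FN), i.e., the fraction of actual readmissions the model catches. A minimal sketch, with made-up labels and predictions:

```python
# Compute TPR (recall) from a confusion matrix.
from sklearn.metrics import confusion_matrix

y_true = [1, 1, 1, 0, 0, 1, 0, 1]   # illustrative labels
y_pred = [1, 0, 1, 0, 1, 1, 0, 1]   # illustrative predictions

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
tpr = tp / (tp + fn)   # TPR = TP / (TP + FN)
print(tpr)  # → 0.8
```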