Automatic Feature Selection with R & CART

For a recent work project I faced a statistical problem I had not dealt with before: multi-class classification. We have a system that can classify about 75% of phone calls, and I needed to classify the remainder into one of five types. The data also had what seemed to me a large number of features (~300). Because the classification would run weekly and the call data could change significantly from week to week, I wanted a way to select features dynamically.

After some research I implemented a framework developed in this paper by Ratanamahatana & Gunopulos, which tried many different feature selection methods and then compared naive Bayes classification results. Using a modified version of their feature selection method plus naive Bayes, the first version of the model correctly classified 76% of the validation sample. Without feature selection, only 18% of calls were correctly classified.
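
To give a rough feel for the general idea, here is a minimal sketch of tree-based feature selection in R. It assumes the selection step amounts to keeping only the predictors that a CART tree (fit with the rpart package) actually splits on; that is a simplification, not the modified method described above. The helper name `select_tree_features`, the `cp` value, and the use of the built-in `iris` data in place of the call data are all illustrative assumptions.

```r
# Sketch only: CART-based feature selection with rpart (ships with R).
library(rpart)

# Hypothetical helper: return the predictors that a fitted CART tree splits on.
select_tree_features <- function(formula, data, cp = 0.01) {
  fit <- rpart(formula, data = data, method = "class",
               control = rpart.control(cp = cp))
  # fit$frame has one row per node; leaf nodes are labelled "<leaf>".
  split_vars <- as.character(fit$frame$var)
  unique(split_vars[split_vars != "<leaf>"])
}

# iris stands in for the call data: Species is a multi-class outcome.
selected <- select_tree_features(Species ~ ., iris)
print(selected)

# Reduced data set that a downstream classifier (for example naive Bayes)
# could then be trained on.
reduced <- iris[, c(selected, "Species")]
```

Because the tree is refit each time the function is called, the selected features can change as the data changes, which is the property that matters for a model that is rerun every week.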

I have now generalized the feature selection algorithm so that it can be used as an initial step for other classification and prediction modeling processes. This presentation walks through the code as I apply this approach to a Kaggle competition data set.

Please try this approach out on your own problems and let me know how it performs! You can tweet @chrisumphlett or email christopher.umphlett [at] gmail.com.