CHAID vs. ranger vs. xgboost
Imagine yourself in a fictional company faced with the task of trying to predict which customers are going to leave your business for another provider.
Being able to predict churn even a little bit better could save us lots of money, especially if we can identify the key indicators and influence them. In the original post I spent a great deal of time explaining the mechanics of loading and prepping the data.
We have data on 7,043 customers across 21 variables. Churn is what we want to predict, so we have 19 potential predictor variables to work with. No way of knowing for certain, but it could be that these are just the newest customers with very little time using our service. Replacing with the median value is simple and easy, but it may well not be the most accurate choice. From the ?preProcess documentation: Imputation via bagging fits a bagged tree model for each predictor as a function of all the others.
This method is simple, accurate and accepts missing values, but it has much higher computational cost. Imputation via medians takes the median of each predictor in the training set, and uses them to fill missing values. This method is simple, fast, and accepts missing values, but treats each predictor independently, and may be inaccurate.
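The trade-off described above can be sketched with caret's preProcess(). The toy data frame below is an illustrative stand-in, not the real churn data; column names are borrowed from the variables mentioned in the post.

```r
# A minimal sketch of the two imputation choices with caret's preProcess();
# the toy columns below are illustrative stand-ins, not the real data.
library(caret)

set.seed(42)
df <- data.frame(
  tenure       = c(1, 5, NA, 20, 34, NA, 60, 72),
  TotalCharges = c(29.85, NA, 151.65, 820.5, NA, 1889.5, 4820.0, 7362.9)
)

# Simple and fast: fill each NA with that column's median
med_pp <- preProcess(df, method = "medianImpute")
df_med <- predict(med_pp, df)

# Slower but multivariate: a bagged tree per predictor, fit on the others
# (needs the ipred package behind the scenes)
bag_pp <- preProcess(df, method = "bagImpute")
df_bag <- predict(bag_pp, df)

colSums(is.na(df_med))  # all zero after imputation
```

Note that medianImpute treats each column independently, while bagImpute can use the relationship between tenure and TotalCharges when filling a gap.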
Then we can see how the results compare. There is no warning about missing values, and if you scroll back and compare with the original plots of the raw variables, the shapes of tenure and TotalCharges have changed significantly because of the transformation. We really only needed it to compare a few specific cases. One more step before we start using CHAID, ranger, and xgboost, while we still have the data in one frame.
We could invoke the individual model functions directly, but caret will allow us to use some common steps. This is important because our data is already pretty lop-sided for outcomes. The two subsequent lines take the vector intrain and produce two separate dataframes, testing and training.
They hold the training and testing customers respectively. It turns out that many models do not perform well when you feed them a formula for the model, even if they claim to support a formula interface, as CHAID does.
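The partitioning step can be sketched as follows. A toy data frame stands in for the real one, and p = 0.7 is an assumed split fraction, not necessarily the post's exact value.

```r
library(caret)

set.seed(2018)
# Toy stand-in for the real churn data (names are assumptions)
churn_df <- data.frame(
  Churn  = factor(rep(c("Yes", "No"), times = c(30, 70))),
  tenure = runif(100, 0, 72)
)

# createDataPartition samples within each level of Churn, so the Yes/No
# proportions are preserved in both pieces despite the lop-sided outcome
intrain  <- createDataPartition(churn_df$Churn, p = 0.7, list = FALSE)
training <- churn_df[intrain, ]   # rows selected for training
testing  <- churn_df[-intrain, ]  # everything else held out for testing

prop.table(table(training$Churn))  # roughly the same mix as the full data
```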
Chapter 5 in the caret documentation covers it in great detail. As a matter of fact it will allow us to build a grid of those parameters and test all the permutations we like, using the same cross-validation process. The relevant argument to caret's train is tuneGrid. The output gives us a nice concise summary, including an idea of how many of the cases were used in the individual folds ("Summary of sample sizes: …"). If you need a review of what alpha2, alpha3, and alpha4 are, please review the ?chaid_control documentation.
As a matter of fact we will be creating one object per run and then using the stored information to build a nice comparison later. Note that the scaling deceives the eye; the results are close across the plot. Check on variable importance with varImp applied to the chaid model object.
Look inside the stored chaid model object. This is a key step because it reassures us that we have not overfit our model (if you want a fuller understanding, please consider reading this post on EliteDataScience). Our accuracy on testing actually exceeds the accuracy we achieved in training.

Random Forest via ranger

One of the nicest things about using caret is that it is pretty straightforward to move from one model to another. The amount of work we have to do while moving from CHAID to ranger and eventually xgboost is actually quite modest.
When we consult the documentation for ranger within caret we see that we can adjust mtry, splitrule, and min.node.size. The only additional line we need besides changing from chaid to ranger is to tell it what to use to capture variable importance, e.g. importance = "permutation". Now we can run the exact same set of commands as we did with chaid. Once again our accuracy on testing actually exceeds the accuracy we achieved in training.
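Concretely, the move mostly amounts to swapping in a new tuning grid; the values below are illustrative, and the commented train() call assumes the training objects from the post.

```r
# ranger's three tunable parameters in caret; grid values are assumptions
rf_grid <- expand.grid(
  mtry          = c(2, 4, 6),              # predictors sampled per split
  splitrule     = c("gini", "extratrees"), # split criterion
  min.node.size = c(1, 5)                  # minimum terminal node size
)
nrow(rf_grid)  # 12 combinations

# The fit itself changes only the method and the importance argument:
# ranger_fit <- train(x = training[predictors], y = training$Churn,
#                     method = "ranger", importance = "permutation",
#                     trControl = train_control, tuneGrid = rf_grid)
```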
As far as tuning goes, caret supports 7 of the many parameters that you could feed to xgboost. I initially ran it that way, but below, for the purposes of this post, I have chosen only a few that seem to make the largest difference to accuracy and set the rest to a constant.
One final important note about the code below: native xgboost works with an xgb.DMatrix as the input. After a relatively brief moment the results are back, including the average accuracy on training. We can run the same additional commands simply by listing xgboost. Once again our accuracy on testing actually exceeds the accuracy we achieved in training. Separately, grab the elapsed times for training from the times element that caret stores in each model object. What do we know? What data should we focus on, and what conclusions can we draw from our little exercise in comparative modeling? I will draw your attention back to this webpage to review the terminology for classification models and how to interpret a confusion matrix.
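The comparison machinery can be sketched end to end with two tiny stand-in models; in the post the list would hold the stored chaid, ranger, and xgboost objects instead. The toy data and glm/rpart methods here are assumptions chosen only so the example runs quickly.

```r
library(caret)

set.seed(2018)
# Tiny stand-in data and models so resamples() can be demonstrated
toy <- data.frame(
  Churn = factor(rep(c("Yes", "No"), 50)),
  x1 = rnorm(100), x2 = rnorm(100)
)
ctrl <- trainControl(method = "cv", number = 3)
m1 <- train(Churn ~ ., data = toy, method = "glm", trControl = ctrl)
m2 <- train(Churn ~ ., data = toy, method = "rpart",
            trControl = ctrl, tuneLength = 1)

# Collect the cross-validation results from each stored model object
results <- resamples(list(glm = m1, rpart = m2))
summary(results)$statistics$Accuracy  # Accuracy across all CV folds

m1$times$everything["elapsed"]  # elapsed training time, saved per model
```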
So Accuracy, Kappa, and F1 are all measures of overall accuracy. There are merits to each.
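These measures all fall out of caret's confusionMatrix(). A toy example follows; with the real models the inputs would be the predictions on testing and the true testing outcomes.

```r
library(caret)

# Toy predicted and true labels standing in for model output
pred  <- factor(c("Yes", "No", "No", "Yes", "No", "No"))
truth <- factor(c("Yes", "No", "Yes", "Yes", "No", "No"))

cm <- confusionMatrix(pred, truth)

cm$overall[c("Accuracy", "Kappa")]  # whole-table measures
cm$byClass["F1"]                    # computed for the "positive" class
```

Accuracy here is simply the fraction of matching labels (5 of 6), while Kappa discounts agreement expected by chance and F1 balances precision and recall for the positive class.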
CHAID and caret – a good combo – June 6, 2018