Note: This tutorial was based on an older version of the abalone data that had a binary old variable rather than a numeric age variable. It has been modified lightly so that it uses a manual old variable (is the abalone older than 10 or not) and ignores the numeric age variable.

Materials prepared by Rebecca Barter. Package developed by Max Kuhn.

An interactive Jupyter Notebook version of this tutorial is also available. Feel free to download it and use it for your own learning or teaching adventures!

R has a large number of packages for machine learning (ML), which is great, but also quite frustrating, since each package was designed independently and has very different syntax, inputs and outputs. This means that if you want to do machine learning in R, you have to learn a large number of separate methods.

Recognizing this, Max Kuhn (at the time working in drug discovery at Pfizer, now at RStudio) put together a single package for performing any machine learning method you like. This package is called caret. Caret stands for Classification And Regression Training.
The development version can be installed with library(devtools) followed by install_github('topepo/caret/pkg/caret'), but you would have to be able to compile from source. There is a compiled version here but only for another day or so.
Apparently caret has little to do with our orange friend, the carrot.

Not only does caret allow you to run a plethora of ML methods, it also provides tools for auxiliary techniques such as:
- Data preparation (imputation, centering/scaling data, removing correlated predictors, reducing skewness)
- Data splitting
- Variable selection
- Model evaluation

An extensive vignette for caret is also available.

A simple view of caret: the default train function

To implement your machine learning model of choice using caret you will use the train function. The many modeling options available are listed in the caret documentation. In this tutorial's example, we use the ranger implementation of random forest to predict whether abalone are “old” or not based on a bunch of physical properties of the abalone (sex, height, weight, diameter, etc).

Pre-processing (preProcess)

There are a number of pre-processing steps that are easily implemented by caret. Several stand-alone functions from caret target specific issues that might arise when setting up the model. These include:
- dummyVars: creating dummy variables from categorical variables with multiple categories
- nearZeroVar: identifying zero- and near zero-variance predictors (these may cause issues when subsampling)
- findCorrelation: identifying correlated predictors
- findLinearCombos: identifying linear dependencies between predictors

In addition to these individual functions, there also exists the preProcess function, which can be used to perform more common tasks such as centering and scaling, imputation and transformation.
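As a rough illustration of preProcess, here is a minimal sketch. It assumes the caret package is installed and uses the built-in iris data as a stand-in for the abalone data, which is not reproduced here; the object names are made up for this example.

```r
library(caret)

# iris stands in for the abalone data (an assumption for this sketch);
# keep only the four numeric predictor columns
predictors <- iris[, 1:4]

# estimate the transformations from the data; nothing is modified yet
pp <- preProcess(
  predictors,
  method = c("center", "scale", "YeoJohnson", "nzv", "pca")
)

# apply the estimated transformations to obtain the processed data frame
predictors_pp <- predict(pp, predictors)
head(predictors_pp)
```

Note that preProcess only estimates the transformations. The processed data are produced by the separate predict call, so the same object can later be applied to new data.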
preProcess takes in a data frame to be processed and a method, which can be any of “BoxCox”, “YeoJohnson”, “expoTrans”, “center”, “scale”, “range”, “knnImpute”, “bagImpute”, “medianImpute”, “pca”, “ica”, “spatialSign”, “corr”, “zv”, “nzv”, and “conditionalX”. For example, a single call can center, scale and perform a YeoJohnson transformation, identify and remove variables with near zero variance, and perform PCA.

Data splitting (createDataPartition and groupKFold)

Generating subsets of the data is easy with the createDataPartition function.
While this function can be used to simply generate training and testing sets, it can also be used to subset the data while respecting important groupings that exist within the data.

First, we show an example of performing general sample splitting: identifying the indices of 10 different 80% subsamples of the iris data.
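The subsampling described above can be sketched as follows, assuming caret is installed. The seed value and the names train_index, iris_train and iris_test are chosen for this illustration only.

```r
library(caret)

set.seed(1)  # arbitrary seed, only so that the split is reproducible

# identify the indices of 10 different 80% subsamples of the iris data,
# sampling within each level of Species so class proportions are preserved
train_index <- createDataPartition(
  iris$Species,
  p = 0.8,
  list = FALSE,
  times = 10
)

# one column of row indices per subsample
dim(train_index)

# use the first subsample as a training set and the remainder as a test set
iris_train <- iris[train_index[, 1], ]
iris_test  <- iris[-train_index[, 1], ]
```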
Resampling options (trainControl)

One of the most important parts of training ML models is tuning parameters. You can use the trainControl function to specify a number of parameters (including sampling parameters) in your model. The object that is output from trainControl is then provided as an argument to train. In the example, the seed is set with set.seed(998) and the data are then split into a training and a testing set.
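A minimal sketch of this workflow is given below. It assumes the caret, ranger and e1071 packages are installed, and it uses iris in place of the abalone data. The resampling settings (5-fold cross-validation repeated twice) and the 75% split are illustrative choices, not necessarily those of the original tutorial.

```r
library(caret)

set.seed(998)

# create a training and a testing set (iris is a stand-in for abalone)
in_training <- createDataPartition(iris$Species, p = 0.75, list = FALSE)
training <- iris[in_training, ]
testing  <- iris[-in_training, ]

# resampling scheme: 5-fold cross-validation, repeated 2 times
fit_control <- trainControl(method = "repeatedcv", number = 5, repeats = 2)

# fit a random forest via ranger, passing the trainControl object to train
fit <- train(
  Species ~ .,
  data = training,
  method = "ranger",
  trControl = fit_control
)
fit

# evaluate on the held-out testing set
predictions <- predict(fit, newdata = testing)
confusionMatrix(predictions, testing$Species)
```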
Model parameter tuning options (tuneGrid =)

You can specify your own tuning grid for model parameters using the tuneGrid argument of the train function. For example, you can define a grid of parameter combinations to try and pass it to train.
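The following sketch shows one way such a grid might look for the ranger method, whose tuning parameters in caret are mtry, splitrule and min.node.size. It assumes caret, ranger and e1071 are installed and uses iris as a stand-in for the abalone data; the particular values in the grid are illustrative.

```r
library(caret)

# define a grid of parameter options to try:
# every combination of the values below is evaluated by train
rf_grid <- expand.grid(
  mtry = c(2, 3, 4),
  splitrule = c("gini", "extratrees"),
  min.node.size = c(1, 3, 5)
)

set.seed(998)

# fit the model over the grid using 5-fold cross-validation
fit <- train(
  Species ~ .,
  data = iris,
  method = "ranger",
  trControl = trainControl(method = "cv", number = 5),
  tuneGrid = rf_grid
)

# the parameter combination selected by resampling
fit$bestTune
```

The column names of the grid must match the tuning parameter names that caret expects for the chosen method, which is why the three columns above are named exactly as they are.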