Create indices of folds with blocking and stratification (cvo object)

Create a cross-validation object (cvo), which contains a list of indices for each fold of (repeated) k-fold cross-validation. Blocking and stratification options are available. See more in "Details".
cvo_create_folds(
  data = NULL,
  stratify_by = NULL,
  block_by = NULL,
  folds = 5,
  times = 1,
  seeds = NA_real_,
  kind = NULL,
  mode = c("caret", "mlr")[1],
  returnTrain = c(TRUE, FALSE, "both")[1],
  predict = c("test", "train", "both")[1],
  k = folds
)

# S3 method for cvo
print(x, ...)
data | A data frame that contains the variables whose names are denoted by arguments |
---|---|
stratify_by | A vector or a name of factor variable in |
block_by | A vector or a name of variable in |
folds, k | ( |
times | ( |
seeds | (
For more information about random number generation see
|
kind | ( Generator |
mode | ( |
returnTrain | ( |
predict | ( |
x | A |
... | (any)
|
(list) A list of folds. Each fold contains indices of observations. The structure of the output is similar to that created by either function createFolds from caret or function makeResampleInstance in mlr.
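When the returned list holds test-set indices (as in the example output printed with returnTrain = FALSE, where the info attribute reports indices "Test"), the matching training sets are simply the complements. The following base R sketch illustrates this with the indices copied from the str(Folds1_a) output in the examples; the names test_folds, n_obs and train_folds are hypothetical and not part of manyROC.

```r
# Test indices copied from the str(Folds1_a) output in the examples
test_folds <- list(
  Rep1_Fold1 = c(7L, 8L, 19L, 20L, 25L, 26L, 39L, 40L),
  Rep1_Fold2 = c(3L, 4L, 13L, 14L, 27L, 28L, 35L, 36L),
  Rep1_Fold3 = c(9L, 10L, 17L, 18L, 21L, 22L, 31L, 32L),
  Rep1_Fold4 = c(1L, 2L, 15L, 16L, 23L, 24L, 37L, 38L),
  Rep1_Fold5 = c(5L, 6L, 11L, 12L, 29L, 30L, 33L, 34L)
)

# Sample size, as reported in the "info" attribute of the cvo object
n_obs <- 40L

# Training indices of a fold = all observations not in its test set
train_folds <- lapply(test_folds, function(test_idx) {
  setdiff(seq_len(n_obs), test_idx)
})

# Number of training observations per fold
lengths(train_folds)
```

In k-fold cross-validation every observation should appear in exactly one test set, so the test indices of all folds together cover each row once.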
Function cvo_create_folds randomly divides observations into folds that are used for (repeated) k-fold cross-validation. In these folds observations are:
blocked by values in variable block_by (i.e., observations with the same "ID" or other kind of blocking factor are treated as one unit (a block) and are always in the same fold);
stratified by levels of factor variable stratify_by (the proportions of these grouped units of observations in each group (level) are kept approximately constant across all folds).
If folds is too large, so that cases of at least one group (i.e., level in stratify_by) are missing from at least one fold, an error is returned. In that case a smaller value of folds is recommended.
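The idea of combining blocking with stratification can be sketched in base R as follows. This is an illustrative sketch only, not the algorithm used by cvo_create_folds: the helper sketch_blocked_stratified_folds is hypothetical, and it assumes that every block belongs to exactly one group. Whole blocks are assigned to folds, and the assignment is done separately within each group, so each fold receives a similar share of every group.

```r
# Hypothetical helper (not part of manyROC): assigns whole blocks to folds,
# separately within each group, and returns test indices per fold.
sketch_blocked_stratified_folds <- function(block, group, k = 5) {
  # One row per block; assumes each block belongs to exactly one group
  blocks <- unique(data.frame(block = block, group = group))
  stopifnot(!anyDuplicated(blocks$block))
  blocks$fold <- NA_integer_

  for (g in as.character(unique(blocks$group))) {
    idx <- which(blocks$group == g)
    # With fewer blocks than folds, some fold would miss this group
    if (length(idx) < k) {
      stop("Too few blocks in group '", g, "' for ", k, " folds.")
    }
    # Recycle fold labels 1..k over the blocks of this group and shuffle them
    blocks$fold[idx] <- sample(rep_len(seq_len(k), length(idx)))
  }

  # Map block-level fold labels back to individual observations
  fold_of_obs <- blocks$fold[match(block, blocks$block)]
  split(seq_along(block), fold_of_obs)
}

# Same layout as DataSet1 in the examples: 20 IDs, 2 rows each, 4 groups
set.seed(1)
d <- data.frame(ID = rep(1:20, each = 2),
                gr = gl(4, 10, labels = LETTERS[1:4]))
folds <- sketch_blocked_stratified_folds(d$ID, d$gr, k = 5)

# Group counts in the first fold
table(d$gr[folds[[1]]])
```

With this layout each group has exactly five blocks, so every fold receives one block (two rows) per group. Asking for more folds than a group has blocks triggers the error, which mirrors the situation described above where a smaller number of folds is needed.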
Function createFolds from package caret.
Function makeResampleInstance from package mlr.
Test if folds are blocked and stratified: cvo_test_bs
Vilmantas Gegzna
library(manyROC)
set.seed(123456)

# Data
DataSet1 <- data.frame(ID = rep(1:20, each = 2),
                       gr = gl(4, 10, labels = LETTERS[1:4]),
                       .row = 1:40)

# Explore data
str(DataSet1)
#> 'data.frame': 40 obs. of 3 variables:
#>  $ ID  : int 1 1 2 2 3 3 4 4 5 5 ...
#>  $ gr  : Factor w/ 4 levels "A","B","C","D": 1 1 1 1 1 1 1 1 1 1 ...
#>  $ .row: int 1 2 3 4 5 6 7 8 9 10 ...
#>    ID
#> gr  1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
#>   A 2 2 2 2 2 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0
#>   B 0 0 0 0 0 2 2 2 2  2  0  0  0  0  0  0  0  0  0  0
#>   C 0 0 0 0 0 0 0 0 0  0  2  2  2  2  2  0  0  0  0  0
#>   D 0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  2  2  2  2  2
summary(DataSet1)
#>        ID        gr          .row
#>  Min.   : 1.00   A:10   Min.   : 1.00
#>  1st Qu.: 5.75   B:10   1st Qu.:10.75
#>  Median :10.50   C:10   Median :20.50
#>  Mean   :10.50   D:10   Mean   :20.50
#>  3rd Qu.:15.25          3rd Qu.:30.25
#>  Max.   :20.00          Max.   :40.00

# Explore functions
nFolds <- 5

# If variables of data frame are provided:
Folds1_a <- cvo_create_folds(data = DataSet1,
                             stratify_by = "gr", block_by = "ID",
                             k = nFolds, returnTrain = FALSE)
Folds1_a
#> --- A cvo object: ----------------------------------------------------
#>  indices stratified blocked cv_type k repetitions sample_size
#>     Test       TRUE    TRUE  k-fold 5           1          40
#> ----------------------------------------------------------------------
str(Folds1_a)
#> List of 5
#>  $ Rep1_Fold1: int [1:8] 7 8 19 20 25 26 39 40
#>  $ Rep1_Fold2: int [1:8] 3 4 13 14 27 28 35 36
#>  $ Rep1_Fold3: int [1:8] 9 10 17 18 21 22 31 32
#>  $ Rep1_Fold4: int [1:8] 1 2 15 16 23 24 37 38
#>  $ Rep1_Fold5: int [1:8] 5 6 11 12 29 30 33 34
#>  - attr(*, "class")= chr [1:2] "cvo_caret" "cvo"
#>  - attr(*, "info")='data.frame': 1 obs. of 7 variables:
#>   ..$ indices    : chr "Test"
#>   ..$ stratified : logi TRUE
#>   ..$ blocked    : logi TRUE
#>   ..$ cv_type    : chr "k-fold"
#>   ..$ k          : num 5
#>   ..$ repetitions: num 1
#>   ..$ sample_size: int 40
#>  - attr(*, "seeds")=List of 2
#>   ..$ generator: NULL
#>   ..$ seeds    : NULL
#> ************************************************************
#> ____________________________________________________________
#>  Test for STRATIFICATION
#>
#>            A B C D      <<<   >>>              A    B    C    D
#> Rep1_Fold1 2 2 2 2 <-Counts | Proportions-> 0.25 0.25 0.25 0.25
#> Rep1_Fold2 2 2 2 2 <-Counts | Proportions-> 0.25 0.25 0.25 0.25
#> Rep1_Fold3 2 2 2 2 <-Counts | Proportions-> 0.25 0.25 0.25 0.25
#> Rep1_Fold4 2 2 2 2 <-Counts | Proportions-> 0.25 0.25 0.25 0.25
#> Rep1_Fold5 2 2 2 2 <-Counts | Proportions-> 0.25 0.25 0.25 0.25
#>
#> If stratified, the proportions of each group in each fold
#> (row) should be (approximately) equal and with no zero values.
#> Test is not valid if data is blocked and number of cases in
#> each block differs significantly.
#> ____________________________________________________________
#>  Test for BLOCKING: BLOCKED
#>
#>            ID
#>            1 2 3 4 5 6 7 8 9 10 ..
#> Rep1_Fold1 0 0 0 2 0 0 0 0 0  2 ..
#> Rep1_Fold2 0 2 0 0 0 0 2 0 0  0 ..
#> Rep1_Fold3 0 0 0 0 2 0 0 0 2  0 ..
#> Rep1_Fold4 2 0 0 0 0 0 0 2 0  0 ..
#> Rep1_Fold5 0 0 2 0 0 2 0 0 0  0 ..
#>
#> Table shows number of observations in each fold.
#> If blocked, the same ID appears just in one fold.
#> 10 (of 20) first columns are displayed.
#> ************************************************************

# If "free" variables are provided:
Folds1_b <- cvo_create_folds(stratify_by = DataSet1$gr,
                             block_by = DataSet1$ID,
                             k = nFolds, returnTrain = FALSE)
# str(Folds1_b)
cvo_test_bs(Folds1_b, "gr", "ID", DataSet1)
#> ************************************************************
#> ____________________________________________________________
#>  Test for STRATIFICATION
#>
#>            A B C D      <<<   >>>              A    B    C    D
#> Rep1_Fold1 2 2 2 2 <-Counts | Proportions-> 0.25 0.25 0.25 0.25
#> Rep1_Fold2 2 2 2 2 <-Counts | Proportions-> 0.25 0.25 0.25 0.25
#> Rep1_Fold3 2 2 2 2 <-Counts | Proportions-> 0.25 0.25 0.25 0.25
#> Rep1_Fold4 2 2 2 2 <-Counts | Proportions-> 0.25 0.25 0.25 0.25
#> Rep1_Fold5 2 2 2 2 <-Counts | Proportions-> 0.25 0.25 0.25 0.25
#>
#> If stratified, the proportions of each group in each fold
#> (row) should be (approximately) equal and with no zero values.
#> Test is not valid if data is blocked and number of cases in
#> each block differs significantly.
#> ____________________________________________________________
#>  Test for BLOCKING: BLOCKED
#>
#>            ID
#>            1 2 3 4 5 6 7 8 9 10 ..
#> Rep1_Fold1 0 2 0 0 0 0 0 0 2  0 ..
#> Rep1_Fold2 0 0 0 2 0 0 0 2 0  0 ..
#> Rep1_Fold3 0 0 2 0 0 0 2 0 0  0 ..
#> Rep1_Fold4 0 0 0 0 2 0 0 0 0  2 ..
#> Rep1_Fold5 2 0 0 0 0 2 0 0 0  0 ..
#>
#> Table shows number of observations in each fold.
#> If blocked, the same ID appears just in one fold.
#> 10 (of 20) first columns are displayed.
#> ************************************************************

# Not blocked but stratified
Folds1_c <- cvo_create_folds(stratify_by = DataSet1$gr,
                             k = nFolds, returnTrain = FALSE)
# str(Folds1_c)
cvo_test_bs(Folds1_c, "gr", "ID", DataSet1)
#> ************************************************************
#> ____________________________________________________________
#>  Test for STRATIFICATION
#>
#>            A B C D      <<<   >>>              A    B    C    D
#> Rep1_Fold1 2 2 2 2 <-Counts | Proportions-> 0.25 0.25 0.25 0.25
#> Rep1_Fold2 2 2 2 2 <-Counts | Proportions-> 0.25 0.25 0.25 0.25
#> Rep1_Fold3 2 2 2 2 <-Counts | Proportions-> 0.25 0.25 0.25 0.25
#> Rep1_Fold4 2 2 2 2 <-Counts | Proportions-> 0.25 0.25 0.25 0.25
#> Rep1_Fold5 2 2 2 2 <-Counts | Proportions-> 0.25 0.25 0.25 0.25
#>
#> If stratified, the proportions of each group in each fold
#> (row) should be (approximately) equal and with no zero values.
#> Test is not valid if data is blocked and number of cases in
#> each block differs significantly.
#> ____________________________________________________________
#>  Test for BLOCKING: NOT BLOCKED
#>
#>            ID
#>            1 2 3 4 5 6 7 8 9 10 ..
#> Rep1_Fold1 0 2 0 0 0 0 0 1 1  0 ..
#> Rep1_Fold2 1 0 1 0 0 0 1 0 0  1 ..
#> Rep1_Fold3 1 0 0 0 1 1 0 1 0  0 ..
#> Rep1_Fold4 0 0 1 1 0 1 0 0 1  0 ..
#> Rep1_Fold5 0 0 0 1 1 0 1 0 0  1 ..
#>
#> Table shows number of observations in each fold.
#> If blocked, the same ID appears just in one fold.
#> 10 (of 20) first columns are displayed.
#> ************************************************************

# Blocked but not stratified
Folds1_d <- cvo_create_folds(block_by = DataSet1$ID,
                             k = nFolds, returnTrain = FALSE)
# str(Folds1_d)
cvo_test_bs(Folds1_d, "gr", "ID", DataSet1)
#> ************************************************************
#> ____________________________________________________________
#>  Test for STRATIFICATION
#>
#>            A B C D      <<<   >>>              A    B    C    D
#> Rep1_Fold1 0 2 4 2 <-Counts | Proportions-> 0.00 0.25 0.50 0.25
#> Rep1_Fold2 4 2 2 0 <-Counts | Proportions-> 0.50 0.25 0.25 0.00
#> Rep1_Fold3 2 2 2 2 <-Counts | Proportions-> 0.25 0.25 0.25 0.25
#> Rep1_Fold4 2 0 0 6 <-Counts | Proportions-> 0.25 0.00 0.00 0.75
#> Rep1_Fold5 2 4 2 0 <-Counts | Proportions-> 0.25 0.50 0.25 0.00
#>
#> If stratified, the proportions of each group in each fold
#> (row) should be (approximately) equal and with no zero values.
#> Test is not valid if data is blocked and number of cases in
#> each block differs significantly.
#> ____________________________________________________________
#>  Test for BLOCKING: BLOCKED
#>
#>            ID
#>            1 2 3 4 5 6 7 8 9 10 ..
#> Rep1_Fold1 0 0 0 0 0 0 0 0 2  0 ..
#> Rep1_Fold2 0 2 0 0 2 0 0 0 0  2 ..
#> Rep1_Fold3 2 0 0 0 0 0 2 0 0  0 ..
#> Rep1_Fold4 0 0 0 2 0 0 0 0 0  0 ..
#> Rep1_Fold5 0 0 2 0 0 2 0 2 0  0 ..
#>
#> Table shows number of observations in each fold.
#> If blocked, the same ID appears just in one fold.
#> 10 (of 20) first columns are displayed.
#> ************************************************************