Create a cvo (cross-validation object)

Create indices of folds with blocking and stratification (cvo object). The cross-validation object (cvo) contains a list of indices for each fold of (repeated) k-fold cross-validation. Blocking and stratification options are available. See "Details" for more.

cvo_create_folds(
  data = NULL,
  stratify_by = NULL,
  block_by = NULL,
  folds = 5,
  times = 1,
  seeds = NA_real_,
  kind = NULL,
  mode = c("caret", "mlr")[1],
  returnTrain = c(TRUE, FALSE, "both")[1],
  predict = c("test", "train", "both")[1],
  k = folds
)

# S3 method for cvo
print(x, ...)

Arguments

data	A data frame that contains the variables named in arguments `block_by` and `stratify_by`.
stratify_by	A vector, or the name of a factor variable in `data`, whose levels will be used for stratification, e.g., a vector of medical groups.
block_by	A vector, or the name of a variable in `data`, that contains identification codes/numbers (IDs). These codes will be used for blocking.
folds, k	(`integer`) The number of folds. Default is `folds = 5`; `k` defaults to the value of `folds`.
times	(`integer`) The number of repetitions for repeated cross-validation. Default is `times = 1`.
seeds	(`NA_real_` | `NULL` | vector of integers) Seeds for the random number generator, one for each repetition. If `seeds = NA_real_` (default), no seeds are set and parameter `kind` is ignored. If `seeds = NULL`, random seeds are generated automatically and registered in attribute `"seeds"`. If a numeric vector is given, its values are used as the seeds for the repetitions of cross-validation. If the number of repetitions is greater than the number of provided seeds, additional seeds are generated and added to the vector. The first seed will be used to ensure reproducibility of the randomly generated seeds. For more information about random number generation, see `set.seed`.
kind	(`NULL` | `character`) The kind of (pseudo)random number generator. Default is `NULL`, which selects the currently used generator (including the one used in the previous session if the workspace has been restored); if no generator has been used, it selects `"default"`. Generator `"L'Ecuyer-CMRG"` is recommended if package parallel is used for parallel computing. In this case, each seed should have 6 elements; neither the first three nor the last three should be all zero. More information at `set.seed`.
mode	(`character`) Whether to create a caret-like or an mlr-like cvo object. This option is not implemented yet!
returnTrain	(`logical` | `character`) If `TRUE`, returns indices of observations in a training set (caret style). If `FALSE`, returns indices of observations in a test set (caret style). If `"both"`, returns indices of observations in both training and test sets (mlr style).
predict	(`character(1)`) What to predict during resampling: “train”, “test” or “both” sets. Default is “test”.
x	A `cvo` object.
...	(any) Further parameters for strategies:
	iters (`integer(1)`): Number of iterations, for “CV”, “Subsample” and “Bootstrap”.
	split (`numeric(1)`): Proportion of training cases for “Holdout” and “Subsample”, between 0 and 1. Default is 2 / 3.
	reps (`integer(1)`): Repeats for “RepCV”. Here `iters = folds * reps`. Default is 10.
	folds (`integer(1)`): Folds in the repeated CV for “RepCV”. Here `iters = folds * reps`. Default is 10.
	horizon (`numeric(1)`): Number of observations in the forecast test set for “GrowingWindowCV” and “FixedWindowCV”. When `horizon > 1`, this is treated as the number of observations to forecast; otherwise it is a fraction of the initial window. I.e., for 100 observations, an initial window of .5, and a horizon of .2, the test set will have 10 observations. Default is 1.
	initial.window (`numeric(1)`): Fraction of observations to start with in the training set for “GrowingWindowCV” and “FixedWindowCV”. When `initial.window > 1`, this is treated as the number of observations in the initial window; otherwise it is treated as the fraction of observations to have in the initial window. Default is 0.5.
	skip (`numeric(1)`): How many resamples to skip to thin the total amount for “GrowingWindowCV” and “FixedWindowCV”. This is passed through as the “by” argument in `seq()`. When `skip > 1`, this is treated as the increment of the sequence of resampling indices; otherwise it is a fraction of the total training indices. I.e., for 100 training sets and a value of .2, the increment of the resampling indices will be 20. Default is “horizon”, which gives mutually exclusive chunks of test indices.
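
The idea behind per-repetition seeds (argument `seeds`) can be sketched in base R. This is only an illustration of the principle, not the code used inside cvo_create_folds; the helper name `permute_per_repetition` is hypothetical. Each repetition sets its own seed before drawing a random order of the observations, so repeating the call with the same seeds gives the same result.

```r
# Hypothetical helper (not part of manyROC): one random permutation per
# repetition, each drawn right after setting that repetition's seed.
permute_per_repetition <- function(n, seeds) {
  lapply(seeds, function(s) {
    set.seed(s)      # fix the RNG state for this repetition
    sample.int(n)    # random order of the n observations
  })
}

run1 <- permute_per_repetition(10, seeds = c(11, 22, 33))
run2 <- permute_per_repetition(10, seeds = c(11, 22, 33))

identical(run1, run2)  # the same seeds reproduce the same permutations
```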

Value

(list) A list of folds. Each fold contains indices of observations. The structure of the output is similar to the one created by either function createFolds from caret or function makeResampleInstance from mlr.

Details

Function cvo_create_folds randomly divides observations into folds that are used for (repeated) k-fold cross-validation. In these folds observations are:

blocked by values in variable block_by (i.e. observations with the same "ID" or other kind of blocking factor are treated as one unit (a block) and are always in the same fold);
stratified by levels of factor variable stratify_by (the proportions of these grouped units of observations in each group (level) are kept approximately constant across all folds).
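
The two properties above can be illustrated with a short base R sketch. It is a simplified stand-in, not the algorithm implemented in cvo_create_folds: the helper `assign_blocked_stratified` is hypothetical, and it assumes that every block belongs to exactly one group. Blocks, rather than single observations, are dealt to folds separately within each group.

```r
# Hypothetical helper (an illustration only, not the code used by
# cvo_create_folds): assign whole blocks to folds, group by group.
assign_blocked_stratified <- function(block, stratum, k) {
  # One row per block; assumes each block belongs to a single group.
  blocks <- unique(data.frame(block = block, stratum = stratum))
  blocks$fold <- NA_integer_
  for (s in as.character(unique(blocks$stratum))) {
    rows <- which(blocks$stratum == s)
    # Deal fold labels 1..k over this group's blocks in random order.
    blocks$fold[rows] <- sample(rep_len(seq_len(k), length(rows)))
  }
  # Map the block-level fold labels back to individual observations.
  blocks$fold[match(block, blocks$block)]
}

set.seed(1)
id <- rep(1:20, each = 2)                 # 20 blocks of 2 observations
gr <- gl(4, 10, labels = LETTERS[1:4])    # 4 groups of 10 observations
fold <- assign_blocked_stratified(id, gr, k = 5)

# Stratified: every fold holds 2 observations from every group.
table(fold, gr)
# Blocked: each ID appears in exactly one fold.
all(tapply(fold, id, function(f) length(unique(f))) == 1)
```

Because fold labels always start at 1 within each group, this sketch would overfill the low-numbered folds when group sizes are not multiples of `k`; a production implementation would need to balance that.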

Note

If folds is too large, so that at least one fold contains no cases from at least one group (i.e., a level of stratify_by), an error is raised. In that case, a smaller value of folds is recommended.
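
A rough pre-check for this situation can be written in base R. The helper `max_folds` below is hypothetical and not part of manyROC. It relies only on the observation that, when blocking is used, a group can be present in every fold only if it has at least as many distinct blocks as there are folds; the exact condition that triggers the error in cvo_create_folds may differ.

```r
# Hypothetical pre-check (not part of manyROC): the smallest number of
# distinct blocks found in any group bounds the usable number of folds.
max_folds <- function(block, stratum) {
  blocks_per_group <- tapply(block, stratum, function(b) length(unique(b)))
  min(blocks_per_group)
}

id <- rep(1:20, each = 2)
gr <- gl(4, 10, labels = LETTERS[1:4])
max_folds(id, gr)  # 5: every group has 5 distinct IDs
```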

Author

Vilmantas Gegzna

Examples

library(manyROC)
set.seed(123456)

# Data
DataSet1 <- data.frame(ID = rep(1:20, each = 2),
  gr = gl(4, 10, labels = LETTERS[1:4]),
  .row = 1:40)

# Explore data
str(DataSet1)
#> 'data.frame':	40 obs. of  3 variables:
#>  $ ID  : int  1 1 2 2 3 3 4 4 5 5 ...
#>  $ gr  : Factor w/ 4 levels "A","B","C","D": 1 1 1 1 1 1 1 1 1 1 ...
#>  $ .row: int  1 2 3 4 5 6 7 8 9 10 ...

table(DataSet1[, c("gr", "ID")])
#>    ID
#> gr  1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
#>   A 2 2 2 2 2 0 0 0 0  0  0  0  0  0  0  0  0  0  0  0
#>   B 0 0 0 0 0 2 2 2 2  2  0  0  0  0  0  0  0  0  0  0
#>   C 0 0 0 0 0 0 0 0 0  0  2  2  2  2  2  0  0  0  0  0
#>   D 0 0 0 0 0 0 0 0 0  0  0  0  0  0  0  2  2  2  2  2

summary(DataSet1)
#>        ID        gr          .row      
#>  Min.   : 1.00   A:10   Min.   : 1.00  
#>  1st Qu.: 5.75   B:10   1st Qu.:10.75  
#>  Median :10.50   C:10   Median :20.50  
#>  Mean   :10.50   D:10   Mean   :20.50  
#>  3rd Qu.:15.25          3rd Qu.:30.25  
#>  Max.   :20.00          Max.   :40.00  


# Explore functions
nFolds <- 5

# If names of variables in a data frame are provided:
Folds1_a <- cvo_create_folds(data = DataSet1,
  stratify_by = "gr", block_by = "ID",
  k = nFolds, returnTrain = FALSE)
Folds1_a
#> --- A cvo object: ----------------------------------------------------
#>  indices stratified blocked cv_type k repetitions sample_size
#>     Test       TRUE    TRUE  k-fold 5           1          40
#> ----------------------------------------------------------------------

str(Folds1_a)
#> List of 5
#>  $ Rep1_Fold1: int [1:8] 7 8 19 20 25 26 39 40
#>  $ Rep1_Fold2: int [1:8] 3 4 13 14 27 28 35 36
#>  $ Rep1_Fold3: int [1:8] 9 10 17 18 21 22 31 32
#>  $ Rep1_Fold4: int [1:8] 1 2 15 16 23 24 37 38
#>  $ Rep1_Fold5: int [1:8] 5 6 11 12 29 30 33 34
#>  - attr(*, "class")= chr [1:2] "cvo_caret" "cvo"
#>  - attr(*, "info")='data.frame':	1 obs. of  7 variables:
#>   ..$ indices    : chr "Test"
#>   ..$ stratified : logi TRUE
#>   ..$ blocked    : logi TRUE
#>   ..$ cv_type    : chr "k-fold"
#>   ..$ k          : num 5
#>   ..$ repetitions: num 1
#>   ..$ sample_size: int 40
#>  - attr(*, "seeds")=List of 2
#>   ..$ generator: NULL
#>   ..$ seeds    : NULL

cvo_test_bs(Folds1_a, "gr", "ID", DataSet1)
#> ************************************************************\n____________________________________________________________\n                Test for STRATIFICATION 
#> 
#>            A B C D      <<<     >>>              A    B    C    D
#> Rep1_Fold1 2 2 2 2  <-Counts | Proportions->  0.25 0.25 0.25 0.25
#> Rep1_Fold2 2 2 2 2  <-Counts | Proportions->  0.25 0.25 0.25 0.25
#> Rep1_Fold3 2 2 2 2  <-Counts | Proportions->  0.25 0.25 0.25 0.25
#> Rep1_Fold4 2 2 2 2  <-Counts | Proportions->  0.25 0.25 0.25 0.25
#> Rep1_Fold5 2 2 2 2  <-Counts | Proportions->  0.25 0.25 0.25 0.25
#> 
#> If stratified, the proportions of each group in each fold
#> (row) should be (approximately) equal and with no zero values.
#> Test is not valid if data is blocked and number of cases in 
#> each block differs significantly.
#> ____________________________________________________________\n                Test for BLOCKING: BLOCKED
#> 
#>       ID
#>            1 2 3 4 5 6 7 8 9 10 ..
#> Rep1_Fold1 0 0 0 2 0 0 0 0 0  2 ..
#> Rep1_Fold2 0 2 0 0 0 0 2 0 0  0 ..
#> Rep1_Fold3 0 0 0 0 2 0 0 0 2  0 ..
#> Rep1_Fold4 2 0 0 0 0 0 0 2 0  0 ..
#> Rep1_Fold5 0 0 2 0 0 2 0 0 0  0 ..
#> 
#> Table shows number of observations in each fold.
#> If blocked, the same ID appears just in one fold.
#> 10 (of 20) first columns are displayed.
#> ************************************************************\n

# If "free" variables are provided:
Folds1_b <- cvo_create_folds(stratify_by = DataSet1$gr,
  block_by = DataSet1$ID,
  k = nFolds,
  returnTrain = FALSE)
# str(Folds1_b)
cvo_test_bs(Folds1_b, "gr", "ID", DataSet1)
#> ************************************************************\n____________________________________________________________\n                Test for STRATIFICATION 
#> 
#>            A B C D      <<<     >>>              A    B    C    D
#> Rep1_Fold1 2 2 2 2  <-Counts | Proportions->  0.25 0.25 0.25 0.25
#> Rep1_Fold2 2 2 2 2  <-Counts | Proportions->  0.25 0.25 0.25 0.25
#> Rep1_Fold3 2 2 2 2  <-Counts | Proportions->  0.25 0.25 0.25 0.25
#> Rep1_Fold4 2 2 2 2  <-Counts | Proportions->  0.25 0.25 0.25 0.25
#> Rep1_Fold5 2 2 2 2  <-Counts | Proportions->  0.25 0.25 0.25 0.25
#> 
#> If stratified, the proportions of each group in each fold
#> (row) should be (approximately) equal and with no zero values.
#> Test is not valid if data is blocked and number of cases in 
#> each block differs significantly.
#> ____________________________________________________________\n                Test for BLOCKING: BLOCKED
#> 
#>       ID
#>            1 2 3 4 5 6 7 8 9 10 ..
#> Rep1_Fold1 0 2 0 0 0 0 0 0 2  0 ..
#> Rep1_Fold2 0 0 0 2 0 0 0 2 0  0 ..
#> Rep1_Fold3 0 0 2 0 0 0 2 0 0  0 ..
#> Rep1_Fold4 0 0 0 0 2 0 0 0 0  2 ..
#> Rep1_Fold5 2 0 0 0 0 2 0 0 0  0 ..
#> 
#> Table shows number of observations in each fold.
#> If blocked, the same ID appears just in one fold.
#> 10 (of 20) first columns are displayed.
#> ************************************************************\n

# Not blocked but stratified
Folds1_c <- cvo_create_folds(stratify_by = DataSet1$gr,
  k = nFolds,
  returnTrain = FALSE)
# str(Folds1_c)
cvo_test_bs(Folds1_c, "gr", "ID", DataSet1)
#> ************************************************************\n____________________________________________________________\n                Test for STRATIFICATION 
#> 
#>            A B C D      <<<     >>>              A    B    C    D
#> Rep1_Fold1 2 2 2 2  <-Counts | Proportions->  0.25 0.25 0.25 0.25
#> Rep1_Fold2 2 2 2 2  <-Counts | Proportions->  0.25 0.25 0.25 0.25
#> Rep1_Fold3 2 2 2 2  <-Counts | Proportions->  0.25 0.25 0.25 0.25
#> Rep1_Fold4 2 2 2 2  <-Counts | Proportions->  0.25 0.25 0.25 0.25
#> Rep1_Fold5 2 2 2 2  <-Counts | Proportions->  0.25 0.25 0.25 0.25
#> 
#> If stratified, the proportions of each group in each fold
#> (row) should be (approximately) equal and with no zero values.
#> Test is not valid if data is blocked and number of cases in 
#> each block differs significantly.
#> ____________________________________________________________\n                Test for BLOCKING: NOT BLOCKED
#> 
#>       ID
#>            1 2 3 4 5 6 7 8 9 10 ..
#> Rep1_Fold1 0 2 0 0 0 0 0 1 1  0 ..
#> Rep1_Fold2 1 0 1 0 0 0 1 0 0  1 ..
#> Rep1_Fold3 1 0 0 0 1 1 0 1 0  0 ..
#> Rep1_Fold4 0 0 1 1 0 1 0 0 1  0 ..
#> Rep1_Fold5 0 0 0 1 1 0 1 0 0  1 ..
#> 
#> Table shows number of observations in each fold.
#> If blocked, the same ID appears just in one fold.
#> 10 (of 20) first columns are displayed.
#> ************************************************************\n

# Blocked but not stratified
Folds1_d <- cvo_create_folds(block_by = DataSet1$ID,
  k = nFolds,
  returnTrain = FALSE)
# str(Folds1_d)
cvo_test_bs(Folds1_d, "gr", "ID", DataSet1)
#> ************************************************************\n____________________________________________________________\n                Test for STRATIFICATION 
#> 
#>            A B C D      <<<     >>>              A    B    C    D
#> Rep1_Fold1 0 2 4 2  <-Counts | Proportions->  0.00 0.25 0.50 0.25
#> Rep1_Fold2 4 2 2 0  <-Counts | Proportions->  0.50 0.25 0.25 0.00
#> Rep1_Fold3 2 2 2 2  <-Counts | Proportions->  0.25 0.25 0.25 0.25
#> Rep1_Fold4 2 0 0 6  <-Counts | Proportions->  0.25 0.00 0.00 0.75
#> Rep1_Fold5 2 4 2 0  <-Counts | Proportions->  0.25 0.50 0.25 0.00
#> 
#> If stratified, the proportions of each group in each fold
#> (row) should be (approximately) equal and with no zero values.
#> Test is not valid if data is blocked and number of cases in 
#> each block differs significantly.
#> ____________________________________________________________\n                Test for BLOCKING: BLOCKED
#> 
#>       ID
#>            1 2 3 4 5 6 7 8 9 10 ..
#> Rep1_Fold1 0 0 0 0 0 0 0 0 2  0 ..
#> Rep1_Fold2 0 2 0 0 2 0 0 0 0  2 ..
#> Rep1_Fold3 2 0 0 0 0 0 2 0 0  0 ..
#> Rep1_Fold4 0 0 0 2 0 0 0 0 0  0 ..
#> Rep1_Fold5 0 0 2 0 0 2 0 2 0  0 ..
#> 
#> Table shows number of observations in each fold.
#> If blocked, the same ID appears just in one fold.
#> 10 (of 20) first columns are displayed.
#> ************************************************************\n
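
The relation between test-set indices (`returnTrain = FALSE`) and training-set indices can be checked by hand in base R. The sketch below copies the test indices printed by `str(Folds1_a)` above into a plain list (the names `test_folds` and `train_folds` are introduced here for illustration) and takes the complement within each fold. It does not call cvo_create_folds and makes no claim about how the package computes training indices internally.

```r
# Test-set indices copied from the str(Folds1_a) output above.
test_folds <- list(
  Rep1_Fold1 = c(7L, 8L, 19L, 20L, 25L, 26L, 39L, 40L),
  Rep1_Fold2 = c(3L, 4L, 13L, 14L, 27L, 28L, 35L, 36L),
  Rep1_Fold3 = c(9L, 10L, 17L, 18L, 21L, 22L, 31L, 32L),
  Rep1_Fold4 = c(1L, 2L, 15L, 16L, 23L, 24L, 37L, 38L),
  Rep1_Fold5 = c(5L, 6L, 11L, 12L, 29L, 30L, 33L, 34L)
)
n <- 40  # sample size of DataSet1

# Training indices of a fold: all observations not in its test set.
train_folds <- lapply(test_folds, function(test) setdiff(seq_len(n), test))

lengths(train_folds)  # 32 training observations in every fold

# Each observation is in exactly one test fold.
identical(sort(unlist(test_folds, use.names = FALSE)), seq_len(n))
```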