Estimating the true generalization performance of a predictor is harder than you might think. This talk will build up nested cross-validation from basic principles, point out common mistakes, and provide rules of thumb about which kind of model selection and evaluation to use and when.
It is common to perform model selection while also estimating accuracy on a held-out set. The traditional solution is to split the data into training, validation, and test subsets. On small datasets, however, this strategy suffers from high variance. A common way to reuse a small number of samples for model selection is cross-validation, which is typically applied across the entire training data; the best model is then evaluated on the test set. This approach has a fundamental flaw: if the test set is small, the performance estimate has high variance. The solution is double (or nested) cross-validation, which will be explained in this talk.
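As a concrete illustration of the idea, here is a minimal sketch of nested cross-validation using scikit-learn (the dataset, model, and parameter grid are arbitrary choices for the example, not taken from the talk): an inner loop performs hyperparameter selection, and an outer loop estimates generalization on data the inner search never sees.

```python
# Minimal sketch of nested (double) cross-validation with scikit-learn.
# Dataset, model, and grid are illustrative assumptions.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Inner loop: model selection (hyperparameter search) via cross-validation.
inner_cv = KFold(n_splits=5, shuffle=True, random_state=0)
param_grid = {"C": [0.1, 1, 10]}
clf = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=inner_cv)

# Outer loop: performance estimation. Each outer test fold is never
# touched by the inner search, so the estimate is not optimistically
# biased by model selection.
outer_cv = KFold(n_splits=5, shuffle=True, random_state=1)
scores = cross_val_score(clf, X, y, cv=outer_cv)
print(scores.mean())
```

Averaging over the outer folds reuses every sample for evaluation, which is what makes this attractive on small datasets.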