TrainSet Academy

Cross-Validation the Right Way
1. Introduction

I assume you know the basic idea behind Cross-Validation (CV). Here are just a few points that I consider to be important:

  1. CV is widely used for model selection, because it allows you to estimate the performance of the fitted model on unseen data.
  2. Typically you want to use:
    • KFold CV for regression problems
    • StratifiedKFold CV for classification problems (especially if the distribution of target labels is not uniform)
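To make the difference between the two splitters concrete, here is a minimal sketch. The imbalanced label array is a made-up example (not part of the dataset used later), and the helper name `positives_per_fold` is hypothetical.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Hypothetical imbalanced labels: 90 negatives followed by 10 positives.
y = np.array([0] * 90 + [1] * 10)
X = np.zeros((len(y), 1))  # the features do not matter for the split itself


def positives_per_fold(splitter):
    """Count the positive labels in the validation part of every fold."""
    return [int(y[val_idx].sum()) for _, val_idx in splitter.split(X, y)]


kfold_counts = positives_per_fold(KFold(n_splits=5, shuffle=True, random_state=0))
stratified_counts = positives_per_fold(
    StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
)

print("KFold:          ", kfold_counts)       # counts may differ from fold to fold
print("StratifiedKFold:", stratified_counts)  # every fold holds exactly 2 positives
```

With stratification each validation fold keeps the 10% share of positives, which is why it is usually preferred when the label distribution is skewed.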

We will focus entirely on Cross-Validation, eliminating all the other steps of a data science pipeline (such as EDA, pre-processing, etc). Then we will cover some of the methods for performing CV with code samples.

2. Dependencies & Dataset

In this article we will use the Boston House Prices dataset, a small and simple dataset that works well for regression problems.

Snippet 1. Dependencies & Dataset

As you can see from Snippet 1, the dataset contains 506 records with 13 features and 1 target variable.
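If you want to follow along without the original data, the sketch below builds a stand-in of the same shape. It is an assumption on my side, not the article's Snippet 1: `load_boston` was removed from scikit-learn in version 1.2, so the code generates a synthetic regression problem with 506 records and 13 features instead. The feature names are hypothetical.

```python
from sklearn.datasets import make_regression

# Synthetic stand-in with the same shape as the Boston data:
# 506 records, 13 features and 1 target variable.
X, y = make_regression(
    n_samples=506,
    n_features=13,
    n_informative=8,
    noise=10.0,
    random_state=42,
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]  # hypothetical names

print(X.shape, y.shape)  # (506, 13) (506,)
```

The numbers produced on this synthetic data will of course differ from anything reported for the real dataset.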

3. Cross-Validation

1. It is crucial to split your data into train and test sets before moving on to CV. That way, once you are done with model selection, you can still get an unbiased estimate of model performance on unseen data.
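A minimal sketch of this workflow is shown here. The synthetic data, the `Ridge` model and the three candidate `alpha` values are assumptions chosen only to keep the example short; the point is the order of the steps.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_regression(n_samples=506, n_features=13, noise=10.0, random_state=42)

# Step 1: hold out a test set before any model selection happens.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Step 2: model selection with CV on the training part only.
cv_mse = {}
for alpha in (0.1, 1.0, 10.0):
    neg_mse = cross_val_score(
        Ridge(alpha=alpha), X_train, y_train, cv=5, scoring="neg_mean_squared_error"
    )
    cv_mse[alpha] = -neg_mse.mean()
best_alpha = min(cv_mse, key=cv_mse.get)

# Step 3: refit the chosen model and use the test set exactly once.
final_model = Ridge(alpha=best_alpha).fit(X_train, y_train)
test_mse = mean_squared_error(y_test, final_model.predict(X_test))
print(best_alpha, test_mse)
```

Because the test set never influenced the choice of `alpha`, `test_mse` is not biased by the selection step.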

2. I often see people do all the data preprocessing prior to CV. However, this can introduce data leakage. The general rule can be summarized in two points:

  1. All the row-wise transformations (where each transformed value depends on a single value, not on the whole column) can be performed outside the CV loop. Examples: converting kilometers to meters; splitting “full_name” into “first_name” and “last_name”.
  2. All the column-wise transformations (where each transformed value depends on the values of the whole column) should be performed inside the CV loop. Examples: standardization (because you need to calculate the mean and standard deviation); rank transformation.
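The two rules above can be sketched with a scikit-learn pipeline. The synthetic data, the `Ridge` model and the pretend kilometers column are assumptions for illustration. Wrapping the scaler in a pipeline means it is refit on the training part of every fold, so the validation part never contributes to the mean and standard deviation.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=506, n_features=13, noise=10.0, random_state=42)

# Row-wise transformation: each new value depends only on its own old value,
# so it is safe to apply once, outside the CV loop.
X[:, 0] = X[:, 0] * 1000.0  # pretend column 0 is in kilometers; convert to meters

# Column-wise transformation: StandardScaler needs the column mean and std,
# so it goes inside the pipeline and is fit on each training fold separately.
pipeline = make_pipeline(StandardScaler(), Ridge(alpha=1.0))

cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipeline, X, y, cv=cv, scoring="neg_mean_squared_error")
print(scores)
```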

However, transforming data inside the CV loop can significantly slow down the whole process. Thus, the smart approach is to perform as much data preprocessing prior to CV as possible.

For example, standardization requires the mean and standard deviation of the whole column. However, if the dataset is big enough and we shuffle the data before splitting it into folds, we can assume that the data in different folds come from the same distribution and therefore have approximately the same mean and standard deviation. In this case, even a column-wise transformation can be performed outside the CV loop.
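One way to sanity-check this assumption is to compare the statistics of each training fold with those of the full column. The sketch below does that on a made-up, normally distributed feature with 100,000 rows; it illustrates the check and does not prove that leakage is harmless for your own data.

```python
import numpy as np
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
# Hypothetical large feature column: 100,000 values with mean 50 and std 10.
column = rng.normal(loc=50.0, scale=10.0, size=100_000)

global_mean, global_std = column.mean(), column.std()

cv = KFold(n_splits=5, shuffle=True, random_state=0)
max_mean_gap = 0.0
max_std_gap = 0.0
for train_idx, _ in cv.split(column.reshape(-1, 1)):
    train_part = column[train_idx]
    # Track how far each training fold's statistics drift from the global ones.
    max_mean_gap = max(max_mean_gap, abs(train_part.mean() - global_mean))
    max_std_gap = max(max_std_gap, abs(train_part.std() - global_std))

print(f"largest mean gap: {max_mean_gap:.4f}")
print(f"largest std gap:  {max_std_gap:.4f}")
```

If the gaps are tiny relative to the scale of the feature, standardizing once up front changes very little. On small datasets the gaps can be much larger, and the transformation belongs inside the loop.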

4. Cross-Validation Methods

In this section, for simplicity, we will stick to just one model (LGBMRegressor) and use cross-validation to select its hyperparameters. To make the hyperparameter space easy to visualize, we will tune just two parameters (_max_depth_ and _learning_rate_). We will consider the most popular methods for performing CV as well as some less popular ones that are very powerful.

4.1 Grid Search CV

Grid Search CV performs an exhaustive search over the specified range of hyperparameters (grid). For this method you need to specify every single value for each parameter (which can be tricky, especially for the continuous parameters) that you want your model to try.
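Here is a hedged sketch of such a search. To keep it dependent on scikit-learn only, it swaps in `GradientBoostingRegressor` for LGBMRegressor (both accept `max_depth` and `learning_rate`), runs on synthetic data, and uses a small 3x3 grid. None of these values come from the article's own snippet.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, KFold

X, y = make_regression(n_samples=200, n_features=13, noise=10.0, random_state=42)

# Every value to try must be listed explicitly for each parameter.
param_grid = {
    "max_depth": [2, 3, 4],
    "learning_rate": [0.05, 0.1, 0.2],
}

search = GridSearchCV(
    estimator=GradientBoostingRegressor(n_estimators=30, random_state=0),
    param_grid=param_grid,
    scoring="neg_mean_squared_error",
    cv=KFold(n_splits=3, shuffle=True, random_state=0),
)
search.fit(X, y)

n_combinations = len(search.cv_results_["params"])
print(n_combinations)       # 9, i.e. 3 values x 3 values
print(search.best_params_)
print(search.best_score_)   # negative, because the scorer is negative MSE
```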

Figure 1. Grid Search CV Hyperparameters Space Example

Snippet 2. Grid Search CV Code Example

As you can see from Snippet 2 (cell 11), the best_score has a negative value. It happened because the metric we pass to GridSearchCV is Negative MSE (“neg_mean_squared_error”).
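The sign convention comes from scikit-learn treating every scorer as "higher is better", so error metrics are negated. A small sketch (synthetic data and `LinearRegression` are assumptions for brevity) shows how to turn the scores back into a plain MSE or RMSE:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=200, n_features=13, noise=10.0, random_state=42)

neg_mse = cross_val_score(
    LinearRegression(), X, y, cv=5, scoring="neg_mean_squared_error"
)

mse = -neg_mse       # flip the sign to get the usual, non-negative MSE
rmse = np.sqrt(mse)  # RMSE is in the same units as the target

print(neg_mse)
print(rmse.mean())
```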

One of the major downsides of Grid Search CV is that it keeps evaluating values that are clearly bad. Suppose, for example, that _learning_rate_=0.45 always leads to terrible performance no matter what values the other parameters take. In the example above, _learning_rate_=0.45 is still used 5 times (see Figure 1), so those 5 trials are essentially wasted.

Another disadvantage of Grid Search CV is that it scales poorly with dimensionality: each additional hyperparameter multiplies the number of combinations, so the hyperparameter space grows exponentially.
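The growth is easy to verify with `ParameterGrid`. In the sketch below, each parameter gets 5 candidate values; the parameter names beyond the two tuned in this article are illustrative only.

```python
from sklearn.model_selection import ParameterGrid

values_per_parameter = 5
# Illustrative parameter names; only the count matters here.
names = ["max_depth", "learning_rate", "num_leaves", "subsample"]

sizes = []
for n_params in range(1, len(names) + 1):
    grid = {name: list(range(values_per_parameter)) for name in names[:n_params]}
    sizes.append(len(ParameterGrid(grid)))

print(sizes)  # [5, 25, 125, 625]
```

With 5-fold CV, the four-parameter grid already means 625 x 5 = 3125 model fits.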

4.2 Randomized Search CV

In contrast to Grid Search CV, Randomized Search CV doesn’t set up a grid of hyperparameter values. Instead, we specify a distribution for each hyperparameter we want to tune. Randomized Search CV then samples values from these distributions and tries random combinations of them. This allows you to explicitly control the number of parameter combinations that are attempted. The number of search iterations is set based on time requirements or available resources.
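A hedged sketch of this setup follows. As an assumption to avoid an extra dependency, `GradientBoostingRegressor` stands in for LGBMRegressor, the data is synthetic, and the distribution bounds are my own choice rather than the article's.

```python
from scipy.stats import randint, uniform
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold, RandomizedSearchCV

X, y = make_regression(n_samples=200, n_features=13, noise=10.0, random_state=42)

# A distribution per parameter instead of an explicit list of values.
param_distributions = {
    "max_depth": randint(2, 7),            # integers 2..6 (upper bound is exclusive)
    "learning_rate": uniform(0.01, 0.49),  # uniform on [0.01, 0.50] (loc, scale)
}

search = RandomizedSearchCV(
    estimator=GradientBoostingRegressor(n_estimators=30, random_state=0),
    param_distributions=param_distributions,
    n_iter=10,  # the trial budget is set directly
    scoring="neg_mean_squared_error",
    cv=KFold(n_splits=3, shuffle=True, random_state=0),
    random_state=0,
)
search.fit(X, y)

n_trials = len(search.cv_results_["params"])
print(n_trials)            # 10
print(search.best_params_)
```

Note that `n_iter` stays fixed no matter how many parameters are added, which is what makes the budget easy to control.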

Figure 2. Randomized Search CV Hyperparameters Space Example

Snippet 3. Randomized Search CV Code Example

4.3 Bayesian Methods

Both Grid Search CV and Randomized Search CV run their trials independently of each other. The next set of hyperparameters is therefore selected in a so-called uninformed manner, meaning that the history of past trials is not used to choose it.

More advanced approaches, however, use the history of past trials to select the hyperparameters for each new trial in an informed manner. This often results in a faster hyperparameter tuning process and more accurate resulting models.
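The difference between uninformed and informed selection can be shown with a deliberately simple toy. This is not Bayesian optimization and not what Hyperopt or Optuna do internally; it only illustrates the idea of using trial history. The loss function `cv_loss` is a hypothetical stand-in for a cross-validated score, and both search functions are made up for this sketch.

```python
import random


def cv_loss(learning_rate):
    """Hypothetical stand-in for a cross-validated loss; minimum at 0.20."""
    return (learning_rate - 0.20) ** 2


def uninformed_search(n_trials, rng):
    """Every trial is drawn independently of all previous ones."""
    trials = []
    for _ in range(n_trials):
        lr = rng.uniform(0.01, 0.5)
        trials.append((lr, cv_loss(lr)))
    return trials


def informed_search(n_trials, n_warmup, rng):
    """After a random warm-up, sample close to the best trial seen so far."""
    history = uninformed_search(n_warmup, rng)
    for step in range(n_trials - n_warmup):
        best_lr, _ = min(history, key=lambda trial: trial[1])
        width = 0.1 / (step + 1)  # narrow the neighbourhood as history grows
        lr = min(0.5, max(0.01, rng.gauss(best_lr, width)))
        history.append((lr, cv_loss(lr)))
    return history


history = informed_search(n_trials=30, n_warmup=10, rng=random.Random(0))
warmup_best = min(loss for _, loss in history[:10])
final_best = min(loss for _, loss in history)
print(warmup_best, final_best)
```

Real libraries replace the "sample near the best trial" step with a probabilistic model of the loss surface, but the loop has the same shape: evaluate, record, and let the record guide the next proposal.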

I will provide examples of two of these methods: Hyperopt and Optuna. You can read more about them here and here respectively.

4.3.1 Hyperopt

Figure 3. Hyperopt Hyperparameters Space Example

Take a look at the color bar on the right side of the graph. It shows the order in which hyperparameter combinations were selected over the course of the search. You can see that the search converges at _max_depth_=3 and _learning_rate_≈0.20. Note that for Grid Search CV and Randomized Search CV we didn’t plot the color bar, because in those cases all the trials were performed independently.

Snippet 4. Hyperopt Code Example

5. Conclusion

In this article we’ve discussed some important details about Cross-Validation procedures, as well as some of the most popular methods for performing it. As a bonus for those who were determined enough and made it to the end of the article, here are some useful links:

Written by Andrew Wolf

May 25, 2020
