
Your First Machine Learning Pipeline - Part II

Motivation


I see you’re ready for the next instalment of the "Your First Machine Learning Pipeline" series – Part II! If you’re new to the series, begin with the first blog post, "Your First Machine Learning Pipeline – Part I", where we outlined the various uses of the scikit-learn package. In this post, we will focus on the ease and advantages of using feature selection to optimise your model’s predictions.


First things first, we need to define what feature selection is. When building a Machine Learning (ML) model, it is trained on features (also known as variables or predictors) – these are usually the columns of your training set. Just as when you’re picking ingredients for pancakes or any other delicious dessert, some ingredients are more important than others. The same goes for features, which is why we use feature elimination – removing a subset of detrimental features from the initial ones to obtain a better feature set. Optimising the feature set can vastly improve the predictive power of the model, which is why it is a key aspect of any ML project.


There are many ways of going about feature elimination. The most intuitive (but least efficient) is heuristic: using domain knowledge and iterative manual analysis to decide which features to remove. This approach is time-consuming, making it unrealistic for larger feature sets. Furthermore, the importance of each feature is not always known, so manual elimination may not even be viable.


As explained before, each feature has a degree of importance associated with it, which in turn affects the predictive power of the model. It is therefore important to rank which features are most valuable when building a model. One popular feature selection method is Gini importance, commonly used with Random Forest models. Unfortunately, issues typically arise here too; in particular, Random Forest models are inclined to favour variables with more categories, which biases the variable selection.


Recursive Feature Elimination (RFE)


Recursive Feature Elimination (RFE) helps resolve some of the issues commonly faced during feature elimination. RFE is an ML algorithm that is typically used on small samples to improve classification performance. It differs from other feature elimination approaches by recursively discarding the least significant features, which tends to result in better predictions.

Some of the benefits of using RFE include:

  • It can be used on very large feature sets.

  • It does not require knowledge of what each feature represents.

  • It is compatible with a plethora of different models.

So... we now know about the advantages of using RFE, but how exactly does it work? The following step-by-step guide shows how RFE works (in theory):

  1. Fit your model on the training dataset.

  2. Record the corresponding scoring metric(s) for the model, e.g. accuracy, precision or recall.

  3. Determine which feature is the least important for making predictions on the testing dataset and drop it. The feature set has now shrunk by one.

  4. If more than one feature remains, repeat from step 1; otherwise move on to step 5.

  5. Select the feature set that gives the best scoring metric (highest or lowest, depending on the metric you use). In this case, we would pick the feature set that gives us the highest accuracy score.

After following these steps, we can use RFE to generate a candidate feature set for each scoring metric we care about. It must be noted that when dealing with large datasets, it is wise to subset the features: keeping everything is not only computationally expensive in terms of runtime, it also makes our model susceptible to the curse of dimensionality.
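To make the loop above concrete, here is a minimal sketch in code. It assumes the breast cancer dataset whose features appear later in this post and an SGDClassifier whose absolute coefficients serve as importance scores; the variable names are illustrative, and in practice you would use scikit-learn’s built-in RFE/RFECV, covered further down.

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Load the data and hold out a testing set (assumed setup, for illustration only).
X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

remaining = list(X_train.columns)   # the current feature set
history = []                        # (feature set, accuracy) recorded at every pass

while remaining:
    model = SGDClassifier(random_state=42)
    model.fit(X_train[remaining], y_train)                            # step 1: fit
    acc = accuracy_score(y_test, model.predict(X_test[remaining]))    # step 2: score
    history.append((list(remaining), acc))
    if len(remaining) == 1:
        break
    # step 3: drop the feature with the smallest absolute coefficient
    weakest = remaining[int(np.argmin(np.abs(model.coef_).ravel()))]
    remaining.remove(weakest)

best_features, best_accuracy = max(history, key=lambda pair: pair[1])  # step 5
print(len(best_features), round(best_accuracy, 3))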


To better illustrate how RFE works, please refer to the figure below:

Figure 1. Accuracy vs. the number of features in the feature set (1-30).


Figure 1 shows how model accuracy changes as the number of features in the feature set increases from 1 to 30. The highest accuracy for the lowest model complexity was achieved with 9 features, indicating that this could be an optimal feature set.

This graph was generated using scikit-learn’s RFECV class, and it can be reproduced by following the step-by-step code guide in the later part of this post.


The Principle of Parsimony


After running RFE, it is common for several feature sets to have similarly high accuracy scores. There are also times when a larger feature set has a slightly higher accuracy score than a smaller one. When this happens, it is best practice to use the feature set with the fewest features, since the accuracy difference is negligible. Not only will the model benefit from a faster runtime, it is also less likely to over-fit.


The more complicated our model is (e.g. the more features it has), the higher the chance of over-fitting. In simple terms, this means our model becomes so closely tuned to the data it was built and evaluated on that it only produces accurate results on that particular data. If we applied an over-fitted model to a new dataset and compared it to a model that does not suffer from over-fitting, its results would not be as accurate. This is why simpler models tend to be preferred. We call this the Principle of Parsimony.


RFE - How does it work practically?


Similar to other ML algorithms, implementing RFE follows the same four steps (instantiate, fit, score and predict) that were discussed in Part I of this blog series.

As with most ML algorithms, RFE has a ready-built implementation in Python’s scikit-learn package.

Step 1 - Instantiating


Before we can instantiate the RFE class, we must first create an instance of the model class to pass into it as the estimator.
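As a sketch (assuming the SGDClassifier that appears later in this post and scikit-learn’s cross-validated variant, RFECV), this step might look like:

from sklearn.feature_selection import RFECV
from sklearn.linear_model import SGDClassifier

# Instantiate the model first, then pass it to RFECV as the estimator.
model = SGDClassifier(random_state=42)
rfecv = RFECV(estimator=model, step=1, cv=5, scoring="accuracy")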

Step 2 - Fitting


Next comes fitting the instance of the class – this also happens to be the most computationally intensive and expensive step of the overall process.

The model is fitted on a training dataset made up of a feature set without the response (denoted X in the scikit-learn documentation) and a response variable (denoted y). Note that there is a slight difference compared to when we first fitted the model: the estimator you pass into RFECV is cloned internally rather than being fitted in place, so your original model instance is not modified.


Now that our data has been fitted to the model, there are several useful attributes of RFE that we can call. The one we shall focus on is the support_ attribute. This gives us a boolean array (a mask) of which features to keep. To put it simply, it tells us which feature set will optimise our scoring metric.
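Continuing the sketch from Step 1 (and again assuming the breast cancer dataset and a standard train/test split), the fit and the support_ mask look like this:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

rfecv.fit(X_train, y_train)   # the estimator passed in is cloned internally

print(rfecv.support_)                          # boolean mask of features to keep
print(list(X_train.columns[rfecv.support_]))   # names of the selected features
print(rfecv.n_features_)                       # size of the optimal feature set
# In recent scikit-learn versions, rfecv.cv_results_["mean_test_score"] holds the
# mean cross-validated accuracy for each feature-set size (the data behind Figure 1).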


Exercise 1


Run RFE on the same dataset, but instead of using the SGDClassifier, use a Random Forest model.


Hint: Import the Random Forest model’s estimator class from its associated module.

Do you end up with the same optimal feature set? If not, what are some reasons that might have happened?
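If you get stuck, one possible starting point (a sketch reusing the X_train from the fitting step, with illustrative parameters) is:

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV

rf_rfecv = RFECV(estimator=RandomForestClassifier(random_state=42),
                 step=1, cv=5, scoring="accuracy")
rf_rfecv.fit(X_train, y_train)
print(list(X_train.columns[rf_rfecv.support_]))   # compare with the SGDClassifier's mask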


Automation of the RFE


RFE enables us to generate an improved feature set and, as a result, the predictive power of the model improves. It is also possible to automate the feature selection process to streamline our ML pipeline. We shall briefly outline how this can be done below.


Start by importing the following model classes (classifiers): SGDClassifier, RandomForestClassifier and LogisticRegression.
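In scikit-learn, these live in the following modules:

from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression, SGDClassifier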

We can then define a function to run RFE optimisation, taking into account the model class and the appropriate training and testing dataset:
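A sketch of such a function is shown below; the name run_rfe and its parameters are illustrative rather than taken from a library, and it relies on RFECV's internal cross-validation rather than a separate testing set:

from sklearn.feature_selection import RFECV

def run_rfe(model, X_train, y_train, cv=5, scoring="accuracy"):
    """Fit RFECV for the given estimator and return the selected feature names."""
    selector = RFECV(estimator=model, step=1, cv=cv, scoring=scoring)
    selector.fit(X_train, y_train)
    return list(X_train.columns[selector.support_])

# Run the optimisation once per classifier and keep each estimator's ideal feature set.
classifiers = {
    "sgd": SGDClassifier(random_state=42),
    "random_forest": RandomForestClassifier(random_state=42),
    "logistic_regression": LogisticRegression(max_iter=5000),
}
best_features = {name: run_rfe(clf, X_train, y_train) for name, clf in classifiers.items()}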

We have now successfully automated the RFE process and can use each estimator’s respective ideal feature set whenever we work with that estimator. We could then either make the models built from these estimators compete, selecting the best of them, or stack them to make use of them all.


Exercise 2


Create some additional features in your training dataset and run RFE on the new training set.

Hint: It might be useful to create new features using the Python package pandas.

Not to worry if you’re still a bit lost about what kind of features you want to create. Why not try creating:

  • The sum of the mean perimeter and the mean area

  • The range between the mean fractal dimension and the worst fractal dimension
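For instance, the new columns could be derived with pandas along these lines (a sketch; the column names assume the breast cancer dataset and the X_train from earlier):

X_train_new = X_train.copy()
X_train_new["perimeter_plus_area"] = X_train["mean perimeter"] + X_train["mean area"]
X_train_new["fractal_dimension_range"] = (
    X_train["worst fractal dimension"] - X_train["mean fractal dimension"]
)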

Now compare the feature sets. Are the newly generated features kept in the final set? If so, did they improve your overall accuracy score?


Conclusion


Prior to reading this blog post, you were probably able to fit an ML model to a training dataset. Now you have officially levelled up (well done!) and can optimise your feature set using RFE, improving the accuracy of your ML model.


In case you were wondering if there were other means of improving your feature set, the creation (not the elimination) of new features is always an option. Some examples include:

  • The time elapsed (difference) between two date columns, e.g. start and end dates

  • The average of two float columns which are related

  • The range between two float columns that are related


As you can see, there are endless ways of going about feature selection, so it is your responsibility to determine the most effective feature selection method to improve the predictive power of the model.


Stay tuned for the next part of this blog series, "Your First Machine Learning Pipeline - Part III", where we will learn how to analyse the results generated from our ML algorithms.

