How to create useful features for Machine Learning
This article is originally published at https://www.dataschool.io/
Recently, a member of Data School Insiders asked the following question in our private forum:
I'm new to Machine Learning. When I want to create a predictive model, what are the techniques I should use to do "feature engineering"?
Great question! Let's start with the basics:
What is feature engineering?
Feature engineering is the process of creating features (also called "attributes") that don't already exist in the dataset. This means that if your dataset already contains enough "useful" features, you don't necessarily need to engineer additional features.
But what is a "useful" feature? It's a feature that your Machine Learning model can learn from in order to more accurately predict the value of your target variable. In other words, it's a feature that helps your model to make better predictions!
Let's pretend you have a dataset with a "datetime" column:
- If your goal is to predict the temperature, you might use the datetime column to engineer an integer
hourfeature (0-23), since the hour of the day is a useful predictor of the temperature.
- If your goal is to predict the number of cars on the road, you might use the datetime column to engineer boolean
is_holidayfeatures (True/False), since those are useful predictors of traffic.
Thus when creating features, you have to take the target variable ("temperature" or "number of cars") into account, because different features will be useful for different targets.
Considering model type
Ideally, you should also take into account the type of Machine Learning model you're using:
- If you're using a linear model (such as linear regression), the
hourfeature might not be useful for predicting temperature since there's a non-linear relationship between
hour(0-23) and temperature. Instead, you might create an
is_nightboolean feature that represents the hours 0-4 and 20-23.
- If you're using a non-linear model (such as decision trees), the
hourfeature might work well for predicting temperature since decision trees can learn from non-linear features.
Art versus science?
There were some insightful comments in the forum about the process of feature engineering:
Feature engineering is more of an art than a formal process. You need to know the domain you're working with to come up with reasonable features. If you're trying to predict credit card fraud, for example, study the most common fraud schemes, the main vulnerabilities in the credit card system, which behaviors are considered suspicious in an online transaction, etc. Then, look for data to represent those features, and test which combinations of features lead to the best results.
Also, keep in mind that feature engineering uses different methods on different data types: categorical, numerical, spacial, text, audio, images, missing data, etc.
If you'd like to learn more about feature engineering, here are a few resources I recommend:
Non-Mathematical Feature Engineering Techniques for Data Science (blog post) includes a handful of examples of basic feature engineering techniques.
A Few Useful Things to Know about Machine Learning is a highly readable paper by Pedro Domingos (author of The Master Algorithm) about feature engineering, overfitting, the curse of dimensionality and other crucial Machine Learning topics.
Feature Engineering Made Easy (book) covers the feature engineering workflow in-depth and includes a ton of Python and scikit-learn code. (It was written by Sinan Ozdemir, one of my former data science co-instructors!)
Machine Learning with Text in Python is my online course that gives you hands-on experience with feature engineering, Natural Language Processing, ensembling, model evaluation, and much more to help you to master Machine Learning and extract value from your text-based data.
Do you have questions about feature engineering or resources to suggest? Let me know in the comments below!
P.S. Are you new to Machine Learning? Check out my free video series, Introduction to Machine Learning in Python with scikit-learn.
Please visit source website for post related comments.