Feature engineering is the process of transforming data to improve the predictive performance of machine learning models. It is a technique for creating new features from the existing data that help us gain more insight into it.

It is used to prepare and analyze the available data according to the machine learning algorithm's requirements. Categorical data, for example, is incompatible with most machine learning algorithms, so we need to convert such columns to numeric form before the algorithm can make use of all the relevant information.

It improves the accuracy of machine learning models. Every predictive model's ultimate aim is to achieve the best possible results. Choosing a suitable algorithm and tuning its parameters are two ways to boost results, but adding new features often contributes the most to improving performance.
Feature engineering strategies such as standardization and normalization also result in better weighting of variables, increasing precision and sometimes leading to faster convergence.
Another reason feature engineering is recommended is that exploratory data analysis helps us understand the data and spot opportunities for new features.
By the end of this article, you should be familiar with:
- Preparation of data
- Merge train and test
- Handling missing values
- How to handle categorical features
- Using groupby() and transform() for aggregation features/statistics
- Normalizing / standardizing data
- Date and time features
Preparation of data
The first step is to load the data into a platform from which to extract these features. The platform may be a Jupyter notebook, and you can start feature extraction by importing the required libraries.
In Python, the import keyword is used to load the relevant libraries.
Input data can be retrieved from sources such as spreadsheet files and used in a Python environment. For instance, pandas can read many kinds of files (e.g. CSV, XLS, XLSX) from a computer or from the internet. Reading a CSV file can be done with:
```python
df = pd.read_csv(path)
```
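Other formats can be read in much the same way; for example, an Excel file (the file name here is only a placeholder):

```python
import pandas as pd

# Hypothetical Excel file; replace with your own path
df = pd.read_excel('data.xlsx', sheet_name=0)
```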
Once done, we can get a better understanding of our data. Let's look at the example below.
```python
# Importing libraries
import pandas as pd

# Importing and printing data
data = pd.read_csv('train.csv')
data.head()
```
Our example is the Big Mart Sales Prediction data. Given the variables below, the challenge is to forecast the sales of goods sold in different stores across different cities.
| | Item_Id | Item_Weight | Item_Fat_Content | Item_Visibility | Item_Type | Item_MRP | Outlet_Establishment_Year | Outlet_Size | Outlet_Location_Type | Outlet_Type | Item_Outlet_Sales |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | BAA19 | 9.3 | Low Fat | 0.016047 | Dairy | 249.8092 | 1999 | Medium | Tier 1 | Supermarket Type 1 | 3735.138 |
| 1 | DBC05 | 5.92 | Regular | 0.019278 | Soft Drinks | 48.2692 | 2009 | Medium | Tier 3 | Supermarket Type 2 | 443.4228 |
| 2 | ACN11 | 17.5 | Low Fat | 0.01676 | Meat | 141.618 | 1999 | Medium | Tier 1 | Supermarket Type 1 | 2097.27 |
| 3 | EDX06 | 19.2 | Regular | 0 | Fruits and Vegetables | 182.095 | 1998 | NaN | Tier 3 | Grocery Store | 732.38 |
| 4 | NCD19 | 8.93 | Low Fat | 0 | Household | 53.8614 | 1987 | High | Tier 3 | Supermarket Type 1 | 994.7052 |
Merge train and test
What happens when we have data from several sources? In such a scenario, we can combine the frames, for example with pandas' concat() or merge() functions.
When conducting feature engineering, it is often advised to operate on the entire DataFrame so the resulting model generalizes; if you have two files (train and test), combine them first. An example code snippet is shown below.
```python
# Stack the train and test values of a column into one Series
# (for test rows, the target column is not available, so it stays NaN)
df = pd.concat([train[col], test[col]], axis=0)

# FEATURE ENGINEERING HERE

# Split the engineered column back into train and test
train[col] = df.iloc[:len(train)]
test[col] = df.iloc[len(train):]
```
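Alternatively, the whole DataFrames can be concatenated with a marker column and split apart again afterwards. A minimal sketch, assuming train and test DataFrames and an illustrative is_train column:

```python
import pandas as pd

# Mark the origin of each row before concatenating
train['is_train'] = 1
test['is_train'] = 0
full = pd.concat([train, test], axis=0, ignore_index=True)

# FEATURE ENGINEERING ON `full` HERE

# Split back into train and test using the marker
train = full[full['is_train'] == 1].drop(columns='is_train')
test = full[full['is_train'] == 0].drop(columns='is_train')
```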
Handling missing values
After preparing your data, one of the common problems to handle in feature engineering is missing data. For example, we might have a dataset that looks like this:
```python
import numpy as np

X = np.array([[np.nan, 0,      3],
              [3,      7,      9],
              [3,      5,      2],
              [4,      np.nan, 6],
              [8,      8,      1]])
y = np.array([14, 16, -1, 8, -5])
```
Before applying a standard machine learning model to such data, we must first fill in the gaps with suitable fill values. This is known as imputation of missing values. scikit-learn offers the SimpleImputer class, a baseline imputation method that uses the mean, median, or most frequent value of each column.
```python
from sklearn.impute import SimpleImputer

imp = SimpleImputer(strategy='mean')
X2 = imp.fit_transform(X)
X2
```
Out[1]:
```python
array([[4.5, 0. , 3. ],
       [3. , 7. , 9. ],
       [3. , 5. , 2. ],
       [4. , 5. , 6. ],
       [8. , 8. , 1. ]])
```
The two missing values in the resulting data have been replaced with the mean of the remaining values in their respective columns. This imputed data can then be fed directly into, for example, a LinearRegression estimator:
```python
from sklearn.linear_model import LinearRegression

model = LinearRegression().fit(X2, y)
model.predict(X2)
```
Out[2]:
```python
array([13.14869292, 14.3784627 , -1.15539732, 10.96606197, -5.33782027])
```
How to handle categorical features
Categorical data is a common type of non-numerical data. Consider, for example, the following records:
```python
data = [
    {'price': 450000, 'suite': 4, 'local': 'jade'},
    {'price': 200000, 'suite': 3, 'local': 'jack'},
    {'price': 350000, 'suite': 3, 'local': 'mark'},
    {'price': 400000, 'suite': 2, 'local': 'jack'}
]
```
You may be tempted to use a simple numerical mapping to encode this data:
```python
{'jade': 1, 'jack': 2, 'mark': 3}
```
However, such a mapping implies an ordering and a numeric distance between the labels that does not actually exist, and a model would happily exploit it. One-hot encoding, which generates extra columns indicating the presence or absence of each category with a 1 or 0, is the established technique in this situation. When your data is in the form of a list of dictionaries, you can use scikit-learn's DictVectorizer to do the following:
```python
from sklearn.feature_extraction import DictVectorizer

vec = DictVectorizer(sparse=False, dtype=int)
vec.fit_transform(data)
```
Out[3]:
```python
array([[     0,      1,      0, 450000,      4],
       [     1,      0,      0, 200000,      3],
       [     0,      0,      1, 350000,      3],
       [     1,      0,      0, 400000,      2]], dtype=int64)
```
The 'local' column has been expanded into three columns representing the three local labels, and each row has a 1 in the column corresponding to its local. Once these categorical features are encoded, you can fit a scikit-learn model as usual.
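If your data is already in a pandas DataFrame, as with the Big Mart data above, pd.get_dummies() gives a similar one-hot encoding; here is a minimal sketch applied to the Item_Fat_Content column:

```python
import pandas as pd

data = pd.read_csv('train.csv')

# Each category of Item_Fat_Content becomes its own 0/1 column
dummies = pd.get_dummies(data['Item_Fat_Content'], prefix='Item_Fat_Content')

# Replace the original categorical column with its one-hot columns
data = pd.concat([data.drop(columns='Item_Fat_Content'), dummies], axis=1)
data.head()
```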
Using groupby() and transform() for aggregation features/statistics
The groupby() function can split data into groups to obtain information that was not directly available before. It helps us group our data based on various characteristics and get more detailed insights. To perform tasks ranging from data analysis to feature engineering, we can combine it with other functions such as apply, agg, transform, and filter.
We can apply groupby() to any categorical variable together with any aggregation function, such as mean, median, mode, or count.
For this illustration, we'll compute the mean Item_Outlet_Sales after grouping by the variables Item_Identifier and Item_Type.
```python
data['Item_Outlet_Sales_mean'] = (
    data.groupby(['Item_Identifier', 'Item_Type'])['Item_Outlet_Sales']
        .transform('mean')
)
```
| | Item_Identifier | Item_Type | Item_Outlet_Sales | Item_Outlet_Sales_mean |
|---|---|---|---|---|
| 0 | FDF22 | Snack Foods | 2778.3834 | 3232.542225 |
| 1 | FDS36 | Baking Goods | 549.285 | 2636.568 |
| 2 | NCJ29 | Health and Hygiene | 1193.1136 | 1221.521067 |
| 3 | FDN46 | Snack Foods | 1845.5976 | 2067.752867 |
| 4 | DRG01 | Soft Drinks | 765.67 | 1225.072 |
We can deduce from the first row that if the Item_Identifier is FDF22 and the Item_Type is Snack Foods, the average sales are 3232.54. Be cautious when conducting this type of feature engineering: using the target variable to build new features can cause your model to become biased.
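Besides transform(), combining groupby() with agg() lets us compute several statistics at once and merge them back as features. A minimal sketch, grouping the Big Mart data by Item_Type only:

```python
# Several sales statistics per Item_Type in a single pass
stats = (
    data.groupby('Item_Type')['Item_Outlet_Sales']
        .agg(['mean', 'median', 'count'])
        .add_prefix('Item_Type_sales_')
        .reset_index()
)

# Attach the aggregated statistics to every row as new features
data = data.merge(stats, on='Item_Type', how='left')
```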
Normalizing / Standardizing data
Normalization usually means rescaling values into the range 0 to 1 ([0, 1]), putting numeric columns on a common scale without distorting differences in their ranges. A related form of normalization rescales each observation (row) to have a length of 1 (a unit norm in linear algebra), so that the sum of the squared values in each row equals 1.

Normalization is most suitable for algorithms that weight input values, such as neural networks, and for algorithms that use distance measures, such as k-Nearest Neighbours. In Python, row-wise normalization can be done with the normalize() function in scikit-learn, and min-max scaling with the MinMaxScaler class.
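Below is a minimal sketch of both forms, reusing the small imputed array X2 from the imputation example above:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, normalize

# The imputed array from the earlier example
X2 = np.array([[4.5, 0., 3.],
               [3., 7., 9.],
               [3., 5., 2.],
               [4., 5., 6.],
               [8., 8., 1.]])

# Min-max normalization: rescale each column to the range [0, 1]
X_minmax = MinMaxScaler().fit_transform(X2)

# Row-wise normalization: rescale each row to unit (L2) norm,
# so the squared values in each row sum to 1
X_unit = normalize(X2)
```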
Standardization means rescaling data to have a mean of 0 and a standard deviation of 1 (unit variance). In other words, the data are centered around 0 and scaled by the standard deviation. It is most suitable for techniques that assume a normal distribution in the input variables, such as linear regression, logistic regression, and linear discriminant analysis.

In Python, data is standardized using scikit-learn's StandardScaler class.
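As a short sketch (assuming the Big Mart DataFrame `data` loaded earlier, and picking a few of its numeric columns for illustration), standardization could look like this:

```python
from sklearn.preprocessing import StandardScaler

# Numeric columns from the Big Mart data chosen for illustration
num_cols = ['Item_Weight', 'Item_Visibility', 'Item_MRP']

# Rescale each column to zero mean and unit variance
scaler = StandardScaler()
data[num_cols] = scaler.fit_transform(data[num_cols])

data[num_cols].describe()
```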
Date and time features
During feature engineering, do you ever encounter a date, time, or date-time column and wonder whether such a variable is relevant and what insights it can provide? You might be amazed by how much information it holds.
For starters, a date-time variable can create many new variables. Apart from being able to analyze the day, month, year, hour, minutes, and seconds, we can also generate features such as the day of the week, the quarter of the year, the day of the month, and much more.
These variables can come in handy for one-to-one analysis and event logging. Although this form of feature engineering is helpful, generating too many features produces irrelevant variables and increases the amount of noise in your model.
Pandas can extract these features from a dataset using the .dt accessor (and much more), but we first have to convert the column into a datetime format.
Converting data format from string to date type
Let's use the dataset below to illustrate how we can convert raw data into a date format. Dates are often stored as strings, which limits our analysis, but we can use pandas to convert them into the format we need.
For example, to convert a date column with values in the format 20 DEC 2019, we can use this piece of code:
```python
df['date'] = pd.to_datetime(df[col], format='%d %b %Y')
```
Once the column is converted to a datetime type, we may need to extract its date-time components:
```python
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day'] = df['date'].dt.day

# Confirm that the column types have changed
df.dtypes
```
For the next illustration, we'll use a taxi trip duration dataset to show how to extract date-time features.
| id | driver_id | pickup_datetime | dropoff_datetime | passenger_count | pickup_longitude | pickup_latitude | dropoff_longitude | dropoff_latitude |
|---|---|---|---|---|---|---|---|---|
| id1875421 | 2 | 14/03/2016 17:24 | 14/03/2016 17:32 | 1 | -73.9822 | 40.76794 | -73.9646 | 40.7656 |
| id1377394 | 1 | 12/06/2016 00:43 | 12/06/2016 00:54 | 1 | -73.9804 | 40.73856 | -73.9995 | 40.73115 |
| id1858529 | 2 | 19/01/2016 11:35 | 19/01/2016 12:10 | 1 | -73.979 | 40.76394 | -74.0053 | 40.71009 |
| id2504673 | 2 | 06/04/2016 19:32 | 06/04/2016 19:39 | 1 | -74.01 | 40.71997 | -74.0123 | 40.70672 |
As we can see, we have two date-time columns: pickup_datetime and dropoff_datetime.
Let's use the pickup_datetime column to extract features with the .dt accessor.
This code snippet extracts the features we want:
```python
# Make sure the column is a datetime type before using the .dt accessor
data['pickup_datetime'] = pd.to_datetime(data['pickup_datetime'], format='%d/%m/%Y %H:%M')

data['pickup_year'] = data['pickup_datetime'].dt.year
data['pickup_dayofmonth'] = data['pickup_datetime'].dt.day
data['pickup_monthofyear'] = data['pickup_datetime'].dt.month
data['pickup_hourofday'] = data['pickup_datetime'].dt.hour
data['pickup_dayofweek'] = data['pickup_datetime'].dt.dayofweek
data['pickup_weekofyear'] = data['pickup_datetime'].dt.isocalendar().week
```
| | pickup_year | pickup_dayofmonth | pickup_monthofyear | pickup_hourofday | pickup_dayofweek | pickup_weekofyear |
|---|---|---|---|---|---|---|
| 0 | 2016 | 29 | 2 | 16 | 0 | 9 |
| 1 | 2016 | 11 | 3 | 23 | 4 | 10 |
| 2 | 2016 | 21 | 2 | 17 | 6 | 7 |
| 3 | 2016 | 5 | 1 | 9 | 1 | 1 |
| 4 | 2016 | 17 | 2 | 6 | 2 | 7 |
As you can see, the results contain a lot of information we can use for modeling, letting us examine the metrics at a deeper level. We can use date-time features in many other scenarios, depending on the dataset we have.
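For instance, with this taxi dataset we could also derive the trip duration from the two timestamps; a minimal sketch, assuming pickup_datetime was already converted above and converting dropoff_datetime the same way:

```python
# Convert the drop-off column just like the pickup column
data['dropoff_datetime'] = pd.to_datetime(data['dropoff_datetime'], format='%d/%m/%Y %H:%M')

# Trip duration in minutes, derived from the two timestamps
data['trip_duration_min'] = (
    data['dropoff_datetime'] - data['pickup_datetime']
).dt.total_seconds() / 60
```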
Cover photo credit: https://k21academy.com/