Feature engineering is the process of transforming data to improve the predictive performance of machine learning models. It is a technique for creating new features from the existing data that help us gain more insight into it.

It is used to prepare and analyze the available data according to the machine learning algorithm's requirements. Categorical data, for example, is incompatible with most machine learning algorithms, so we need to convert such columns to numeric form before the algorithm can make use of all the relevant information.

It improves the accuracy of machine learning models. Every predictive model's ultimate aim is to achieve the best possible results. Choosing a suitable algorithm and tuning its parameters are two ways to boost results, but adding new features often contributes the most to improving performance.
Feature engineering strategies such as standardization and normalization also result in better weighting of variables, increasing precision and sometimes leading to faster convergence.
Another reason feature engineering is recommended is that exploratory data analysis helps us understand the data and spot opportunities for new features.
By the end of this article, you should be familiar with:
- Preparation of data
- Merge train and test
- Handling missing values
- How to handle categorical features
- Using groupby() and transform() for aggregation features/statistics
- Normalizing / standardizing data
- Date and time features
Preparation of data
The first step is to load the data into a platform from which to extract these features. The platform may be a Jupyter notebook, and you can start feature extraction by importing the required libraries.
In Python, the import keyword is used to load the relevant libraries.
Input data can be retrieved from sources such as spreadsheet files and used in a Python environment. For instance, pandas can read many kinds of files (e.g. CSV, XLS, XLSX) from a computer or from the internet. Reading a CSV file can be done with:
```python
df = pd.read_csv(path)
```
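Other formats can be read in much the same way; for example, an Excel file (the file name here is only a placeholder):

```python
import pandas as pd

# Hypothetical Excel file; replace with your own path
df = pd.read_excel('data.xlsx', sheet_name=0)
```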
Once done, we can get a better understanding of our data. Let's look at the example below.
```python
# Importing libraries
import pandas as pd

# Importing and printing data
data = pd.read_csv('train.csv')
data.head()
```
Our example is the Big Mart Sales Prediction data. Given the variables below, the challenge is to forecast the sales of goods sold in different stores across different cities.
| | Item_Id | Item_Weight | Item_Fat_Content | Item_Visibility | Item_Type | Item_MRP | Outlet_Establishment_Year | Outlet_Size | Outlet_Location_Type | Outlet_Type | Item_Outlet_Sales |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | BAA19 | 9.3 | Low Fat | 0.016047 | Dairy | 249.8092 | 1999 | Medium | Tier 1 | Supermarket Type 1 | 3735.138 |
| 1 | DBC05 | 5.92 | Regular | 0.019278 | Soft Drinks | 48.2692 | 2009 | Medium | Tier 3 | Supermarket Type 2 | 443.4228 |
| 2 | ACN11 | 17.5 | Low Fat | 0.01676 | Meat | 141.618 | 1999 | Medium | Tier 1 | Supermarket Type 1 | 2097.27 |
| 3 | EDX06 | 19.2 | Regular | 0 | Fruits and Vegetables | 182.095 | 1998 | NaN | Tier 3 | Grocery Store | 732.38 |
| 4 | NCD19 | 8.93 | Low Fat | 0 | Household | 53.8614 | 1987 | High | Tier 3 | Supermarket Type 1 | 994.7052 |
Merge train and test
What happens when we have data from several sources? In such a scenario, we can combine the frames, for example with pandas' concat() or merge() functions.
When conducting feature engineering, it is often advised to operate on the entire DataFrame so the resulting model generalizes; if you have two files (train and test), combine them first. An example code snippet is shown below.
```python
# Stack the train and test values of a column into one Series
# (for test rows, the target column is not available, so it stays NaN)
df = pd.concat([train[col], test[col]], axis=0)

# FEATURE ENGINEERING HERE

# Split the engineered column back into train and test
train[col] = df.iloc[:len(train)]
test[col] = df.iloc[len(train):]
```
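Alternatively, the whole DataFrames can be concatenated with a marker column and split apart again afterwards. A minimal sketch, assuming train and test DataFrames and an illustrative is_train column:

```python
import pandas as pd

# Mark the origin of each row before concatenating
train['is_train'] = 1
test['is_train'] = 0
full = pd.concat([train, test], axis=0, ignore_index=True)

# FEATURE ENGINEERING ON `full` HERE

# Split back into train and test using the marker
train = full[full['is_train'] == 1].drop(columns='is_train')
test = full[full['is_train'] == 0].drop(columns='is_train')
```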
Handling missing values
After preparing your data, one of the common problems to handle in feature engineering is missing data. For example, we might have a dataset that looks like this:
```python
import numpy as np

X = np.array([[np.nan, 0,      3],
              [3,      7,      9],
              [3,      5,      2],
              [4,      np.nan, 6],
              [8,      8,      1]])
y = np.array([14, 16, -1, 8, -5])
```
Before applying a standard machine learning model to such data, we must first fill in the gaps with suitable fill values. This is known as imputation of missing values. scikit-learn offers the SimpleImputer class, a baseline imputation method that uses the mean, median, or most frequent value of each column.
```python
from sklearn.impute import SimpleImputer

imp = SimpleImputer(strategy='mean')
X2 = imp.fit_transform(X)
X2
```
Out[1]:
```python
array([[4.5, 0. , 3. ],
       [3. , 7. , 9. ],
       [3. , 5. , 2. ],
       [4. , 5. , 6. ],
       [8. , 8. , 1. ]])
```
The two missing values in the resulting data have been replaced with the mean of the remaining values in their respective columns. This imputed data can then be fed directly into, for example, a LinearRegression estimator:
```python
from sklearn.linear_model import LinearRegression

model = LinearRegression().fit(X2, y)
model.predict(X2)
```
Out[2]:
```python
array([13.14869292, 14.3784627 , -1.15539732, 10.96606197, -5.33782027])
```
How to handle categorical features
Categorical data is a common type of non-numerical data. Consider, for example, the following records:
```python
data = [
    {'price': 450000, 'suite': 4, 'local': 'jade'},
    {'price': 200000, 'suite': 3, 'local': 'jack'},
    {'price': 350000, 'suite': 3, 'local': 'mark'},
    {'price': 400000, 'suite': 2, 'local': 'jack'}
]
```
You may be tempted to use a simple numerical mapping to encode this data:
```python
{'jade': 1, 'jack': 2, 'mark': 3}
```
However, such a mapping implies an ordering and a numeric distance between the labels that does not actually exist, and a model would happily exploit it. One-hot encoding, which generates extra columns indicating the presence or absence of each category with a 1 or 0, is the established technique in this situation. When your data is in the form of a list of dictionaries, you can use scikit-learn's DictVectorizer to do the following:
```python
from sklearn.feature_extraction import DictVectorizer

vec = DictVectorizer(sparse=False, dtype=int)
vec.fit_transform(data)
```
Out[3]:
```python
array([[     0,      1,      0, 450000,      4],
       [     1,      0,      0, 200000,      3],
       [     0,      0,      1, 350000,      3],
       [     1,      0,      0, 400000,      2]], dtype=int64)
```
The 'local' column has been expanded into three columns representing the three local labels, and each row has a 1 in the column corresponding to its local. Once these categorical features are encoded, you can fit a scikit-learn model as usual.
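If your data is already in a pandas DataFrame, as with the Big Mart data above, pd.get_dummies() gives a similar one-hot encoding; here is a minimal sketch applied to the Item_Fat_Content column:

```python
import pandas as pd

data = pd.read_csv('train.csv')

# Each category of Item_Fat_Content becomes its own 0/1 column
dummies = pd.get_dummies(data['Item_Fat_Content'], prefix='Item_Fat_Content')

# Replace the original categorical column with its one-hot columns
data = pd.concat([data.drop(columns='Item_Fat_Content'), dummies], axis=1)
data.head()
```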
Using groupby() and transform() for aggregation features/statistics
The groupby() function can split data into groups to obtain information that was not directly available before. It helps us group our data based on various characteristics and get more detailed insights. To perform tasks ranging from data analysis to feature engineering, we can combine it with other functions such as apply, agg, transform, and filter.
We can apply groupby() to any categorical variable together with any aggregation function, such as mean, median, mode, or count.
For this illustration, we'll compute the mean Item_Outlet_Sales after grouping by the variables Item_Identifier and Item_Type.
```python
data['Item_Outlet_Sales_mean'] = (
    data.groupby(['Item_Identifier', 'Item_Type'])['Item_Outlet_Sales']
        .transform('mean')
)
```
| | Item_Identifier | Item_Type | Item_Outlet_Sales | Item_Outlet_Sales_mean |
|---|---|---|---|---|
| 0 | FDF22 | Snack Foods | 2778.3834 | 3232.542225 |
| 1 | FDS36 | Baking Goods | 549.285 | 2636.568 |
| 2 | NCJ29 | Health and Hygiene | 1193.1136 | 1221.521067 |
| 3 | FDN46 | Snack Foods | 1845.5976 | 2067.752867 |
| 4 | DRG01 | Soft Drinks | 765.67 | 1225.072 |
We can deduce from the first row that if the Item_Identifier is FDF22 and the Item_Type is Snack Foods, the average sales are 3232.54. Be cautious when conducting this type of feature engineering: using the target variable to build new features can cause your model to become biased.
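Besides transform(), combining groupby() with agg() lets us compute several statistics at once and merge them back as features. A minimal sketch, grouping the Big Mart data by Item_Type only:

```python
# Several sales statistics per Item_Type in a single pass
stats = (
    data.groupby('Item_Type')['Item_Outlet_Sales']
        .agg(['mean', 'median', 'count'])
        .add_prefix('Item_Type_sales_')
        .reset_index()
)

# Attach the aggregated statistics to every row as new features
data = data.merge(stats, on='Item_Type', how='left')
```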
Normalizing / Standardizing data
Normalization usually means rescaling values into the range 0 to 1 ([0, 1]), putting numeric columns on a common scale without distorting differences in their ranges. A related form of normalization rescales each observation (row) to have a length of 1 (a unit norm in linear algebra), so that the sum of the squared values in each row equals 1.

Normalization is most suitable for algorithms that weight input values, such as neural networks, and for algorithms that use distance measures, such as k-Nearest Neighbours. In Python, row-wise normalization can be done with the normalize() function in scikit-learn, and min-max scaling with the MinMaxScaler class.
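Below is a minimal sketch of both forms, reusing the small imputed array X2 from the imputation example above:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, normalize

# The imputed array from the earlier example
X2 = np.array([[4.5, 0., 3.],
               [3., 7., 9.],
               [3., 5., 2.],
               [4., 5., 6.],
               [8., 8., 1.]])

# Min-max normalization: rescale each column to the range [0, 1]
X_minmax = MinMaxScaler().fit_transform(X2)

# Row-wise normalization: rescale each row to unit (L2) norm,
# so the squared values in each row sum to 1
X_unit = normalize(X2)
```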
Standardization means rescaling data to have a mean of 0 and a standard deviation of 1 (unit variance). In other words, the data are centered around 0 and scaled by the standard deviation. It is most suitable for techniques that assume a normal distribution in the input variables, such as linear regression, logistic regression, and linear discriminant analysis.

In Python, data is standardized using scikit-learn's StandardScaler class.
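As a short sketch (assuming the Big Mart DataFrame `data` loaded earlier, and picking a few of its numeric columns for illustration), standardization could look like this:

```python
from sklearn.preprocessing import StandardScaler

# Numeric columns from the Big Mart data chosen for illustration
num_cols = ['Item_Weight', 'Item_Visibility', 'Item_MRP']

# Rescale each column to zero mean and unit variance
scaler = StandardScaler()
data[num_cols] = scaler.fit_transform(data[num_cols])

data[num_cols].describe()
```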
Date and time features
During feature engineering, do you ever encounter a date, time, or date-time column and wonder whether such a variable is relevant and what insights it can provide? You might be amazed by how much information it holds.
For starters, a date-time variable can create many new variables. Apart from being able to analyze the day, month, year, hour, minutes, and seconds, we can also generate features such as the day of the week, the quarter of the year, the day of the month, and much more.
These variables can come in handy for one-to-one analysis and event logging. Although this form of feature engineering is helpful, generating too many features produces irrelevant variables and increases the amount of noise in your model.
Pandas can extract these features from a dataset using the .dt accessor (and much more), but we first have to convert the column into a datetime format.
Converting data format from string to date type
Let's use the dataset below to illustrate how we can convert raw data into a date format. Dates are often stored as strings, which limits our analysis, but we can use pandas to convert them into the format we need.
For example, to convert a date column with values in the format 20 DEC 2019, we can use this piece of code:
```python
df['date'] = pd.to_datetime(df[col], format='%d %b %Y')
```
Once the column is converted to a datetime type, we may need to extract its date-time components:
```python
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day'] = df['date'].dt.day

# Confirm that the column types have changed
df.dtypes
```
For the next illustration, we'll use a taxi trip duration dataset to show how to extract date-time features.
| id | driver_id | pickup_datetime | dropoff_datetime | passenger_count | pickup_longitude | pickup_latitude | dropoff_longitude | dropoff_latitude |
|---|---|---|---|---|---|---|---|---|
| id1875421 | 2 | 14/03/2016 17:24 | 14/03/2016 17:32 | 1 | -73.9822 | 40.76794 | -73.9646 | 40.7656 |
| id1377394 | 1 | 12/06/2016 00:43 | 12/06/2016 00:54 | 1 | -73.9804 | 40.73856 | -73.9995 | 40.73115 |
| id1858529 | 2 | 19/01/2016 11:35 | 19/01/2016 12:10 | 1 | -73.979 | 40.76394 | -74.0053 | 40.71009 |
| id2504673 | 2 | 06/04/2016 19:32 | 06/04/2016 19:39 | 1 | -74.01 | 40.71997 | -74.0123 | 40.70672 |
As we can see, we have two date-time columns: pickup_datetime and dropoff_datetime.
Let's use the pickup_datetime column to extract features with the .dt accessor.
This code snippet extracts the features we want:
```python
# Make sure the column is a datetime type before using the .dt accessor
data['pickup_datetime'] = pd.to_datetime(data['pickup_datetime'], format='%d/%m/%Y %H:%M')

data['pickup_year'] = data['pickup_datetime'].dt.year
data['pickup_dayofmonth'] = data['pickup_datetime'].dt.day
data['pickup_monthofyear'] = data['pickup_datetime'].dt.month
data['pickup_hourofday'] = data['pickup_datetime'].dt.hour
data['pickup_dayofweek'] = data['pickup_datetime'].dt.dayofweek
data['pickup_weekofyear'] = data['pickup_datetime'].dt.isocalendar().week
```
| | pickup_year | pickup_dayofmonth | pickup_monthofyear | pickup_hourofday | pickup_dayofweek | pickup_weekofyear |
|---|---|---|---|---|---|---|
| 0 | 2016 | 29 | 2 | 16 | 0 | 9 |
| 1 | 2016 | 11 | 3 | 23 | 4 | 10 |
| 2 | 2016 | 21 | 2 | 17 | 6 | 7 |
| 3 | 2016 | 5 | 1 | 9 | 1 | 1 |
| 4 | 2016 | 17 | 2 | 6 | 2 | 7 |
As you can see, the results contain a lot of information we can use for modeling, letting us examine the metrics at a deeper level. We can use date-time features in many other scenarios, depending on the dataset we have.
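For instance, with this taxi dataset we could also derive the trip duration from the two timestamps; a minimal sketch, assuming pickup_datetime was already converted above and converting dropoff_datetime the same way:

```python
# Convert the drop-off column just like the pickup column
data['dropoff_datetime'] = pd.to_datetime(data['dropoff_datetime'], format='%d/%m/%Y %H:%M')

# Trip duration in minutes, derived from the two timestamps
data['trip_duration_min'] = (
    data['dropoff_datetime'] - data['pickup_datetime']
).dt.total_seconds() / 60
```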
Cover photo credit: https://k21academy.com/