Feature Engineering with Pandas in Python

Feature engineering is the process of transforming data to improve the predictive performance of machine learning models. It is a technique used to create new features from the existing data that can help us gain more insight into the data.

It is used to prepare and analyse the available data according to the machine learning algorithm’s requirements. Categorical data, for example, is incompatible with most machine learning algorithms, so we need to convert such columns to numeric form so that the algorithm can take in all of the relevant data.

It improves the machine learning models’ accuracy. Every predictive model’s ultimate aim is to achieve the best possible results. Using a suitable algorithm and tuning its parameters are two ways to boost results. However, I believe that adding new features aids the most in improving performance.

Feature engineering strategies such as standardization and normalization also result in better weighting of variables, increasing precision and sometimes leading to faster convergence.

One of the main reasons feature engineering is recommended is that exploratory data analysis lets us understand the data and spot the potential for new features.

By the end of this article, you should know about:

  1. Preparation of data
  2. Merge train and test
  3. Handling missing values
  4. How to handle categorical features
  5. Using groupby() and transform() for aggregation features/statistics
  6. Normalizing / standardizing data
  7. Date and time features

Preparation of data

The first step is to load the data into a platform from which to extract features. The platform may be a Jupyter notebook, and you can start feature extraction by importing the required libraries.

In Python, the import keyword is used to load the relevant libraries.

Input data can be retrieved from sources such as spreadsheet files. Pandas can read many kinds of files (e.g. CSV, XLS, XLSX) from a local disk or the internet; reading a CSV file is done with pd.read_csv().
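For example, a minimal sketch (the file name big_mart_sales.csv is only an assumed placeholder for wherever the dataset lives):

```python
import pandas as pd

# Read the data from a CSV file into a DataFrame
data = pd.read_csv("big_mart_sales.csv")

# Inspect the first five rows
data.head()
```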

Once done, we can get a better understanding of our data. Let’s look at the example below.

Our example is the Big Mart Sales Prediction dataset. Given the variables below, the challenge is to forecast the sales of goods in different stores across different cities.

| | Item_Id | Item_Weight | Item_Fat_Content | Item_Visibility | Item_Type | Item_MRP | Outlet_Establishment_Year | Outlet_Size | Outlet_Location_Type | Outlet_Type | Item_Outlet_Sales |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | BAA19 | 9.3 | Low Fat | 0.016047 | Dairy | 249.8092 | 1999 | Medium | Tier 1 | Supermarket Type 1 | 3735.138 |
| 1 | DBC05 | 5.92 | Regular | 0.019278 | Soft Drinks | 48.2692 | 2009 | Medium | Tier 3 | Supermarket Type 2 | 443.4228 |
| 2 | ACN11 | 17.5 | Low Fat | 0.01676 | Meat | 141.618 | 1999 | Medium | Tier 1 | Supermarket Type 1 | 2097.27 |
| 3 | EDX06 | 19.2 | Regular | 0 | Fruits and Vegetables | 182.095 | 1998 | NaN | Tier 3 | Grocery store | 732.38 |
| 4 | NCD19 | 8.93 | Low Fat | 0 | Household | 53.8614 | 1987 | High | Tier 3 | Supermarket Type 1 | 994.7052 |

Merge train and test

What happens when we have data from many sources? In such a scenario, we can use pandas functions such as merge() and concat().

It is often advised to operate on the entire DataFrame when conducting feature engineering so that the transformations generalise; if you have two files (train and test), merge them. An example code snippet is shown below.
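A minimal sketch, assuming the two files are called train.csv and test.csv and share the same columns; because the goal is to stack their rows, pd.concat() is used here, with a source flag so the frames can be separated again later:

```python
import pandas as pd

# Assumed file names; adjust to your own paths
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

# Flag each row with its origin so the combined frame can be split back later
train["source"] = "train"
test["source"] = "test"

# Stack the two frames into one DataFrame for feature engineering
data = pd.concat([train, test], ignore_index=True, sort=False)
print(data.shape)
```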

Handling missing values

After preparing your data, one of the common problems to handle in feature engineering is missing data. For example, we might have a dataset that looks like this:
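For illustration, assume a small numeric feature matrix X with two missing entries and a target vector y (these particular values are only an assumed example, chosen to be consistent with the outputs shown below):

```python
import numpy as np
from numpy import nan

# A small feature matrix with two missing values, plus a target vector
X = np.array([[nan, 0,   3],
              [3,   7,   9],
              [3,   5,   2],
              [4,   nan, 6],
              [8,   8,   1]])
y = np.array([14, 16, -1, 8, -5])
```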

Before using a standard machine learning model on such data, we must first fill in the gaps with sensible values. This is known as imputation of missing values. For a baseline imputation approach, scikit-learn offers the SimpleImputer class, which can fill gaps with the mean, median, or most frequent value of each column.
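A minimal sketch of mean imputation (in scikit-learn 0.22 and later the class is SimpleImputer in sklearn.impute; earlier versions used Imputer from sklearn.preprocessing):

```python
from sklearn.impute import SimpleImputer

# Replace each NaN with the mean of its column
imp = SimpleImputer(strategy="mean")
X2 = imp.fit_transform(X)
X2
```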

Out[1]:

array([[ 4.5,  0. ,  3. ],
       [ 3. ,  7. ,  9. ],
       [ 3. ,  5. ,  2. ],
       [ 4. ,  5. ,  6. ],
       [ 8. ,  8. ,  1. ]])

 

The two missing values in the resulting data have been replaced with the mean of the column’s remaining values. This imputed data can then be fed directly into, for example, a Linear Regression estimator:
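A minimal sketch, continuing with X2 and y from above (predicting on the training data itself, purely for illustration):

```python
from sklearn.linear_model import LinearRegression

# Fit on the imputed features and predict on the same rows
model = LinearRegression().fit(X2, y)
model.predict(X2)
```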

Out[2]:

array([ 13.14869292,  14.3784627 ,  -1.15539732,  10.96606197,  -5.33782027])

How to handle categorical features

Categorical data is a common type of non-numerical data.
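For illustration, assume a small, hypothetical list of dictionaries in which the categorical column is called local (its values match the mapping shown below), alongside a numeric column rooms:

```python
# Hypothetical example data: 'local' is categorical, 'rooms' is numeric
data = [
    {"local": "jade", "rooms": 4},
    {"local": "jack", "rooms": 3},
    {"local": "mark", "rooms": 3},
    {"local": "jade", "rooms": 2},
]
```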

You may be tempted to use a simple numerical mapping to encode this data:

{'jade': 1, 'jack': 2, 'mark': 3}

However, such a mapping implies an ordering between the categories that does not actually exist, and most models will wrongly exploit it. One-hot encoding, which essentially generates extra columns showing the presence or absence of a category with a value of 1 or 0, is the established technique in this situation. When your data is in the form of a list of dictionaries, you can use scikit-learn’s DictVectorizer to do the following:
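A minimal sketch, assuming the data list defined above:

```python
from sklearn.feature_extraction import DictVectorizer

# One-hot encode string-valued keys; numeric keys pass through unchanged
vec = DictVectorizer(sparse=False, dtype=int)
encoded = vec.fit_transform(data)

# Column names: local=jack, local=jade, local=mark, rooms
# (use get_feature_names() on older scikit-learn versions)
print(vec.get_feature_names_out())
print(encoded)
```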


The 'local' column has been split into three columns to represent the three local labels, and each row has a 1 in the column that corresponds to its local value. After you’ve encoded these categorical features, you can fit a scikit-learn model as usual.

Using groupby() and transform() for aggregation features/statistics

The groupby() function can split data into groups to obtain information that is not directly available otherwise. It helps us group our data on various characteristics to get more detailed insights. To perform tasks ranging from data analysis to feature engineering, we can combine it with other functions such as apply(), agg(), transform(), and filter().

We can use groupby() on any categorical variable together with any aggregation function, such as mean, median, mode, or count.

For this illustration, we’ll look at the mean Item_Outlet_Sales after grouping by Item_Identifier and Item_Type.
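A minimal sketch, assuming the combined DataFrame is called data and uses the column names below:

```python
# Mean sales per (Item_Identifier, Item_Type) group, broadcast back to every row
data["Item_Outlet_Sales_Mean"] = (
    data.groupby(["Item_Identifier", "Item_Type"])["Item_Outlet_Sales"]
        .transform("mean")
)

data[["Item_Identifier", "Item_Type", "Item_Outlet_Sales", "Item_Outlet_Sales_Mean"]].head()
```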

| | Item_Identifier | Item_Type | Item_Outlet_Sales | Item_Outlet_Sales_Mean |
|---|---|---|---|---|
| 0 | FDF22 | Snack Foods | 2778.3834 | 3232.542225 |
| 1 | FDS36 | Baking Goods | 549.285 | 2636.568 |
| 2 | NCJ29 | Health and Hygiene | 1193.1136 | 1221.521067 |
| 3 | FDN46 | Snack Foods | 1845.5976 | 2067.752867 |
| 4 | DRG01 | Soft Drinks | 765.67 | 1225.072 |

We can deduce from the first row that when the Item_Identifier is FDF22 and the Item_Type is Snack Foods, the average sales are 3232.54. When conducting this type of feature engineering, be cautious: using the target variable to build new features can cause your model to become biased.

Normalizing / Standardizing data

Normalizing means rescaling values into the range 0 to 1 [0, 1]. A closely related technique, row normalization, rescales each observation (row) to have a length of 1 (called a unit norm in linear algebra); that is, the sum of the squares in each row adds up to 1.

Rescaling is most suitable for algorithms that weight input values, such as neural networks, and algorithms that use distance measures, such as k-Nearest Neighbours. In Python, min-max rescaling is done with scikit-learn’s MinMaxScaler, and row normalization with its normalize() function.

In short, normalization converts the values of numeric columns in a dataset to a common scale without distorting differences in their ranges.
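A minimal sketch of both variants on a small, hypothetical array:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, normalize

# Hypothetical numeric data: two columns on very different scales
X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])

# Min-max scaling: each column is rescaled to the range [0, 1]
X_minmax = MinMaxScaler().fit_transform(X)

# Row normalization: each row is rescaled to unit length (L2 norm of 1)
X_unit = normalize(X)

print(X_minmax)
print(X_unit)
```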

Standardization means rescaling data so that it has a mean of 0 and a standard deviation of 1 (unit variance). It implies that the data is centred around 0 and scaled by the standard deviation. It is most suitable for techniques that assume a normal distribution in the input variables, such as linear regression, logistic regression, and linear discriminant analysis.

In Python, data is standardized using scikit-learn’s StandardScaler class.
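A minimal sketch, reusing the X array from the previous snippet:

```python
from sklearn.preprocessing import StandardScaler

# Standardization: each column ends up with mean 0 and standard deviation 1
X_std = StandardScaler().fit_transform(X)
print(X_std.mean(axis=0), X_std.std(axis=0))
```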

 

Date and time features

When engineering features from a dataset, do you encounter a date, time, or date-time column and wonder about its relevance and what insights can be gained from it? You might be amazed by how much information it holds.

For starters, a date-time variable can yield many new variables. Apart from the day, month, year, hour, minute, and second, we can also derive features such as the day of the week, the quarter of the year, the day of the month, and much more.

These variables come in handy when doing one-to-one analysis and logging of events. Although this form of feature engineering is helpful, generating too many features creates irrelevant variables and increases the amount of noise in your model.

Pandas can extract these features from a dataset using the .dt accessor, but first we have to convert the column into a date-time format.

 

Converting data format from string to date type

Let’s illustrate how to convert raw data into a date format. Dates are often stored as strings, which can limit our analysis, but pandas can convert them into the format we need.

For example, to convert a date column whose values look like 20 DEC 2019, we can use the following piece of code:
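A minimal sketch, assuming the column is simply called date (adjust the column name to your dataset):

```python
import pandas as pd

# Hypothetical column of dates stored as strings
df = pd.DataFrame({"date": ["20 DEC 2019", "05 JAN 2020"]})

# Parse the strings into datetime values; "%d %b %Y" matches "20 DEC 2019"
df["date"] = pd.to_datetime(df["date"], format="%d %b %Y")
print(df.dtypes)
```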

Once your column is converted to date-time, we can extract its date-time components.

For the next illustration, we’ll use a taxi trip duration dataset to show how to extract date-time features.

| id | driver_id | pickup_datetime | dropoff_datetime | passenger_count | pickup_longitude | pickup_latitude | dropoff_longitude | dropoff_latitude |
|---|---|---|---|---|---|---|---|---|
| id1875421 | 2 | 14/03/2016 17:24 | 14/03/2016 17:32 | 1 | -73.9822 | 40.76794 | -73.9646 | 40.7656 |
| id1377394 | 1 | 12/06/2016 00:43 | 12/06/2016 00:54 | 1 | -73.9804 | 40.73856 | -73.9995 | 40.73115 |
| id1858529 | 2 | 19/01/2016 11:35 | 19/01/2016 12:10 | 1 | -73.979 | 40.76394 | -74.0053 | 40.71009 |
| id2504673 | 2 | 06/04/2016 19:32 | 06/04/2016 19:39 | 1 | -74.01 | 40.71997 | -74.0123 | 40.70672 |

As we can see, we have two date-time columns: pickup_datetime and dropoff_datetime.

Let’s use the pickup_datetime column to extract features with the .dt accessor.

Here is a code snippet to extract the features we want.
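A minimal sketch, assuming the taxi data has been loaded into a DataFrame called df and that pickup_datetime is stored day-first, as in the table above:

```python
import pandas as pd

# Parse the pickup timestamps (day-first, e.g. 14/03/2016 17:24)
df["pickup_datetime"] = pd.to_datetime(df["pickup_datetime"], dayfirst=True)

# Extract date-time components with the .dt accessor
df["pickup_year"] = df["pickup_datetime"].dt.year
df["pickup_dayofyear"] = df["pickup_datetime"].dt.dayofyear
df["pickup_monthofyear"] = df["pickup_datetime"].dt.month
df["pickup_hourofday"] = df["pickup_datetime"].dt.hour
df["pickup_dayofweek"] = df["pickup_datetime"].dt.dayofweek
df["pickup_weekofyear"] = df["pickup_datetime"].dt.isocalendar().week
```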

The resulting DataFrame now contains the new columns pickup_year, pickup_dayofyear, pickup_monthofyear, pickup_hourofday, pickup_dayofweek, and pickup_weekofyear for each trip.

As you can see, the results contain a lot of information we can use for modelling, letting us look at the metrics on a deeper level. We can use date-time features in many other scenarios, depending on the dataset at hand.

 

 

Cover photo credit: https://k21academy.com/
