Beginner’s Guide to Exploratory Data Analysis and Feature Engineering


Introduction

Exploratory data analysis (EDA) is an approach to analyzing data sets to summarize their main characteristics, often with visual methods. A statistical model can be used or not, but primarily EDA is for seeing what the data can tell us beyond the formal modeling or hypothesis testing task. When I started my journey in data science, I always had difficulty finding a starting point for a problem. After reading a few exemplary kernels on Kaggle, I realized the power of exploratory data analysis and its impact on data modeling and predictions.

In this post I try to explain, in the simplest way I can, how to use EDA and feature engineering to gain some insight into the Titanic disaster. I have included only the specific code snippets needed before each visualization and analysis; if you are interested in the full code, refer to the link provided at the end.

The sinking of the Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone on board, resulting in the death of 1502 out of 2224 passengers and crew.

I have used the dataset provided by Kaggle for the Titanic: Machine Learning from Disaster competition.

Features Analysis

Let’s import required libraries for EDA

#Importing required libraries
import numpy as np
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
import plotly.graph_objects as go
import plotly.express as px
from sklearn.ensemble import RandomForestClassifier
#Importing train data set
ds_train=pd.read_csv("/<InputDirectory>/train.csv")
#Checking features in train data set
ds_train.head()

Now, let’s analyze these features/variables one by one.

Survived is the target variable: the survival of a passenger is encoded in binary format, i.e. 0 for Not Survived and 1 for Survived

PassengerId and Ticket can be treated as identifiers of passengers; we assume they have no impact on survival, hence we can ignore them
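Before dropping a column as a pure identifier, it can help to check how unique its values really are. The snippet below is only a sketch on a made-up four-row frame (the `toy` data and the name `unique_ratio` are my own, not part of the Kaggle notebook). Tickets can be shared by passengers travelling together, so the Ticket column may turn out to be less unique than PassengerId.

```python
import pandas as pd

# Made-up rows for illustration only; in the article this would be ds_train
toy = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4],
    "Ticket": ["A/5 21171", "PC 17599", "PC 17599", "113803"],
})

# Share of distinct values per column: 1.0 means every row has its own value
unique_ratio = toy.nunique() / len(toy)
print(unique_ratio)
```

A ratio of 1.0 suggests a true row identifier, while a lower ratio hints that the column groups some passengers together.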

Pclass is an ordinal variable for the ticket class. It can be considered a proxy for the passenger’s Socio-Economic Status and it may impact the passenger’s survival chances, so we will keep it in our analysis. Its unique values are 1 = Upper Class, 2 = Middle Class and 3 = Lower Class

Name is self-explanatory, we will skip this variable from our analysis

Sex or Gender could have played an important role in survival, because during an evacuation preference is often given to women. To test this notion we will consider gender in our analysis

SibSp and Parch represent the total number of the passenger’s siblings/spouse and parents/children on board respectively, they could be used to create a new variable called ‘Family Size’ (Creating a new feature/variable is an example of Feature Engineering)

Age could have also played a role in survival, so we will keep this in our feature list

Fare is also an indicator of the Socio-Economic Status of passengers, let’s keep this in our feature list

Cabin is the Cabin number of the passenger and it can be used in Feature engineering to get an approximate position of the passenger when the accident happened, also from deck level, we can deduce Socioeconomic status. However, after looking at the data it looks like there are many null values so we can drop this column from our feature list.
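The two ideas in the Cabin paragraph (counting nulls and reading the deck from the cabin number) can be sketched as below. This is a minimal illustration on made-up rows, not the code from the full notebook. The `Deck` column and the "U" placeholder for unknown cabins are my own assumptions.

```python
import pandas as pd

# Made-up stand-in for a few rows of train.csv
toy = pd.DataFrame({
    "Age": [22.0, 38.0, None, 35.0],
    "Cabin": [None, "C85", None, "E46"],
    "Embarked": ["S", "C", "S", None],
})

# Share of missing values per column, highest first
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)

# The deck is the first character of the cabin number;
# missing cabins get the hypothetical placeholder "U" (unknown)
toy["Deck"] = toy["Cabin"].str[0].fillna("U")
print(toy["Deck"].tolist())
```

If the missing share of Cabin is very high on the real data, dropping the column as described above is a reasonable choice; the deck letter is only worth keeping if enough rows have it.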

Embarked is the port of embarkation of the passenger and it may have an impact on the target variable, so we will keep this variable for now. It has 3 unique values: C = Cherbourg, Q = Queenstown and S = Southampton

Visualization

Now, we will try to see the relation between the selected features by creating Seaborn and Plotly visualizations

First, start with the passenger’s Age

#Converting Age into series and visualizing the age distribution
age_series=pd.Series(ds_train['Age'].value_counts())
fig=px.scatter(age_series,y=age_series.values,x=age_series.index)
fig.update_layout(
    title="Age Distribution",
    xaxis_title="Age in Years",
    yaxis_title="Count of People",
    font=dict(
        family="Courier New, monospace",
        size=18,
    )
)
fig.show()

We can deduce a few points from the above graph

The majority of passengers were more than 20 and less than 50 years old
The most common age is 24 years, shared by 30 passengers
164 passengers share the same age
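Observations like the ones above can also be checked numerically instead of being read off the scatter plot. The sketch below bins ages with `pd.cut` and finds the most common age. The ages, bin edges and labels are made up for illustration; on the real data you would pass `ds_train['Age']` instead.

```python
import pandas as pd

# Made-up ages, including one missing value as in the real column
ages = pd.Series([2, 19, 22, 24, 24, 31, 45, 58, None], name="Age")

# Bins are right-inclusive: (0, 20], (20, 50], (50, 120]
age_group = pd.cut(ages, bins=[0, 20, 50, 120],
                   labels=["up to 20", "20 to 50", "over 50"])

# Share of passengers per age group (missing ages are ignored)
group_share = age_group.value_counts(normalize=True).sort_index()
print(group_share)

# Most frequent age in the column
most_common_age = ages.mode().iloc[0]
print(most_common_age)
```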

Let’s check how Gender is distributed among passengers

print("Number of Passengers Gender Wise \n{}".format(ds_train['Sex'].value_counts()))
#Gender wise distribution
fig = go.Figure(data=[go.Pie(labels=ds_train['Sex'],hole=.4)])
fig.update_layout(
    title="Sex Distribution",
    font=dict(
        family="Courier New, monospace",
        size=18
    ))
fig.show()

It’s quite evident that the number of male passengers is almost double the number of female passengers.

Let’s see how many females and males survived across different age groups.

#Create categorical variable graph for Age,Sex and Survived variables
sns.catplot(x="Survived", y="Age", hue="Sex", kind="swarm", data=ds_train,height=10,aspect=1.5)
plt.title('Passengers Survival Distribution: Age and Sex',size=25)
plt.show()

It’s pretty evident from the above graph that the majority of female passengers survived.

The majority of male passengers aged between 20 and 50 years did not survive, which means most of the young men did not survive this disaster
The oldest male passenger, aged 80 years, survived
Age and Sex were major factors in deciding a passenger’s fate
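The swarm plot gives a visual impression; a survival rate per gender puts a number on it. Because Survived is 0/1, its mean within a group is the share of survivors. The rows below are made up for illustration, so the printed rates are not the real Titanic figures.

```python
import pandas as pd

# Made-up rows; replace with ds_train on the real data
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "male", "female"],
    "Survived": [0, 1, 1, 0, 1, 0],
})

# Mean of a 0/1 column per group = survival rate of that group
rate_by_sex = toy.groupby("Sex")["Survived"].mean()
print(rate_by_sex)
```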

Now, let’s see how the Pclass variable relates to survival

#Visualize the distribution of passengers across Pclass
fig = go.Figure(data=[go.Pie(labels=ds_train['Pclass'],hole=.4)])
fig.update_layout(
    title="PClass Distribution",
    font=dict(
        family="Courier New, monospace",
        size=18
    ))
fig.show()

More than half of the passengers were traveling in Lower Class.

Let’s see how survival is linked with Pclass

#Visualize PClass and Survival
#Create categorical variable graph for Age,Pclass and Survived variables
sns.catplot(x="Survived", y="Age", hue="Pclass", kind="swarm", data=ds_train,height=10,aspect=1.5)
plt.title('Passengers Survival Distribution: Age and Pclass',size=25)
plt.show()

Again, the majority of young male passengers aged between 20 and 50 years and travelling in lower class did not survive
The oldest male passenger who survived the disaster was travelling in upper class
Young men who survived the disaster were travelling in upper class

If the passenger was a man aged between 20 and 50 years and not so rich at the time of travel, then his chances of survival were very low.
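To look at class and gender together, as the conclusion above does, a pivot table of survival rates is a compact companion to the swarm plot. This is a sketch on made-up rows; the variable name `rate` is my own.

```python
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame({
    "Pclass": [1, 1, 2, 3, 3, 3, 1, 3],
    "Sex": ["male", "female", "female", "male", "male", "female", "male", "male"],
    "Survived": [1, 1, 1, 0, 0, 1, 0, 0],
})

# Rows = ticket class, columns = gender, cells = survival rate
rate = toy.pivot_table(index="Pclass", columns="Sex",
                       values="Survived", aggfunc="mean")
print(rate)
```

Cells with no passengers (here, second-class men) show up as NaN, which is a useful reminder of how thin some groups can be.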

To support our Socio-Economic Status theory, let’s focus on one more variable: Fare

#Visualize Fare and Survival
#Create categorical variable graph for Sex,Fare and Survived variables
sns.catplot(x="Survived", y="Fare", hue="Sex", kind="swarm", data=ds_train,height=10,aspect=1.5)
plt.title('Passengers Survival Distribution: Fare and Sex',size=25)
plt.show()

In the above graph, for the feature ‘Sex’, read 1 as female and 0 as male. It’s evident that many female passengers survived even with lower ticket fares, and that a few male passengers with the highest fares also survived.

In other words, women got preference across all classes; apart from gender, Socio-Economic Status played an important role in survival.
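Fare is a continuous variable, so one way to compare it against survival is to cut it into equally sized bands with `pd.qcut` and compute the survival rate per band. The fares, the four-band split and the band labels below are illustrative assumptions on my side.

```python
import pandas as pd

# Made-up fares and outcomes for illustration only
toy = pd.DataFrame({
    "Fare": [7.25, 7.9, 8.05, 13.0, 26.0, 53.1, 71.3, 512.3],
    "Survived": [0, 0, 0, 1, 0, 1, 1, 1],
})

# Quartile-based bands: each band holds roughly the same number of passengers
toy["Fare_Band"] = pd.qcut(toy["Fare"], q=4,
                           labels=["low", "mid-low", "mid-high", "high"])

# Survival rate per fare band
rate_by_band = toy.groupby("Fare_Band", observed=True)["Survived"].mean()
print(rate_by_band)
```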

Now, we will look at the impact of the Embarked variable on survival

#Visualize the distribution of passengers across Embarked
fig = go.Figure(data=[go.Pie(labels=ds_train['Embarked'],hole=.4)])
fig.update_layout(
    title="Embarked Distribution",
    font=dict(
        family="Courier New, monospace",
        size=18
    ))
fig.show()

The majority of passengers embarked from Southampton. Let’s visualize the survival distribution by port.

#Visualize Embarked and Survival
#Create categorical variable graph for Embarked,Age and Survived variables
sns.catplot(x="Survived", y="Age", hue="Embarked", kind="swarm", data=ds_train,height=10,aspect=1.5)
plt.title('Passengers Survival Distribution: Embarked and Age',size=25)
plt.show()

We cannot deduce any direct relation between Embarked and Survival.
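When a plot does not show a clear pattern, a normalized cross-tabulation is a quick second check. The sketch below uses made-up rows; each row of the result sums to 1, so the column labelled 1 is the survival rate for that port.

```python
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame({
    "Embarked": ["S", "S", "S", "C", "C", "Q", "S", "Q"],
    "Survived": [0, 1, 0, 1, 1, 0, 0, 1],
})

# normalize="index" turns counts into row-wise shares per port
embarked_rates = pd.crosstab(toy["Embarked"], toy["Survived"], normalize="index")
print(embarked_rates)
```

Even if the rates differ between ports on the real data, that difference may simply reflect who boarded where (for example, the class mix), so it is worth comparing against Pclass before drawing conclusions.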

Let’s check the correlation coefficient between these features

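# Correlation matrix of the numeric columns in the training set
# (numeric_only needs pandas 1.5+; pandas 2.x raises an error on text columns without it)
ds_train.corr(numeric_only=True)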

We can see a direct correlation between the ‘Survived’ and ‘Fare’ variables; the other variables appear to be only indirectly related to survival

Age is correlated with Fare, Fare is correlated with Survived, and our analysis above also shows that Age played a role in survival, so we can say that Age is related to survival. Correlation is not transitive in general, so this is a hypothesis supported by the plots rather than something the coefficients prove
SibSp and Parch are related to each other, and both are also related to Fare, which makes sense because more people means more fare; by virtue of this, both may be related to Survived
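Reading a full correlation matrix by eye is error-prone, so it can help to pull out only the column for the target and sort it by strength. The sketch below also encodes Sex as a number so that it takes part in the correlation. The rows and the `Sex_Code` column are made up by me, and `numeric_only` assumes pandas 1.5 or newer.

```python
import pandas as pd

# Made-up rows for illustration only; Sex lines up with Survived on purpose here
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0],
    "Pclass": [3, 1, 2, 3, 1, 3],
    "Sex": ["male", "female", "female", "male", "female", "male"],
    "Fare": [7.25, 71.3, 13.0, 8.05, 53.1, 8.46],
})

# Hypothetical numeric encoding so that gender enters the correlation matrix
toy["Sex_Code"] = toy["Sex"].map({"male": 0, "female": 1})

# Correlation of every numeric feature with the target, strongest first
corr_with_target = (
    toy.corr(numeric_only=True)["Survived"]
    .drop("Survived")
    .sort_values(key=lambda s: s.abs(), ascending=False)
)
print(corr_with_target)
```

Pclass comes out negative in this toy data because class 1 is the upper class, so a lower number goes with better survival.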

Feature Engineering

Feature engineering is the process of using domain knowledge to extract features from raw data via data mining techniques. These features can be used to improve the performance of machine learning algorithms. Having and engineering good features will allow you to most accurately represent the underlying structure of the data and therefore create the best model.

Features can be engineered by decomposing or splitting features, from external data sources, or aggregating or combining features to create new features.
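As an illustration of the "decomposing or splitting" route, the Name column in the Kaggle file follows the pattern "Surname, Title. Given names", so a title can be split out of it. This is only a sketch of the technique: the analysis in this post skips Name, and the regular expression is my own assumption about the format.

```python
import pandas as pd

# A few names in the format used by the Kaggle training file
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
    "Heikkinen, Miss. Laina",
    "Palsson, Master. Gosta Leonard",
])

# Capture the text between the comma and the first full stop
titles = names.str.extract(r",\s*([^.]+)\.", expand=False).str.strip()
print(titles.tolist())
```

A title carries hints about gender, age group and social standing, which is why a decomposed feature can sometimes add more than the raw column it came from.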

Let’s start feature engineering by creating a new variable, Family Size, as the sum of SibSp, Parch, and one (the current passenger)

#Add new column 'Family_Size' to the training data set
ds_train['Family_Size'] = ds_train['SibSp'] + ds_train['Parch'] + 1
print("Family Size column created successfully")
ds_train.head()

Now we will see how Family Size is related to the Survived variable

#Visualize Family size and Survival
sns.barplot(x="Family_Size", y="Age", hue="Survived", data=ds_train,palette = 'rainbow')
plt.title('Family Size - Age Survival Distribution',size=20)
plt.show()

sns.catplot(y="Family_Size", x="Survived", hue='Sex',kind="swarm", data=ds_train,height=8,aspect=1.5)
plt.title('Family Size - Gender Survival Distribution',size=25)
plt.show()

Chances of survival are lower for large families (more than 5 members)
If the family size is small, then the main passenger’s gender decides survival; this supports the previous deduction about the role of gender in survival

Note: Survival data is marked for the main passenger and not for the whole family. Family members’ names must also be in the list, and they may or may not have survived. In other words, by just looking at the Survived column we cannot deduce that the fate of all family members was the same.
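The observations above can be backed with numbers by grouping Family_Size into a few buckets and computing the survival rate per bucket. The rows, the bucket edges and the `Family_Group` column below are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame({
    "SibSp": [0, 1, 0, 4, 1, 0, 5],
    "Parch": [0, 0, 0, 2, 2, 0, 2],
    "Survived": [0, 1, 1, 0, 1, 0, 0],
})

# Same engineered feature as in the article: relatives on board plus the passenger
toy["Family_Size"] = toy["SibSp"] + toy["Parch"] + 1

# Hypothetical buckets: travelling alone, 2 to 5 members, more than 5 members
toy["Family_Group"] = pd.cut(toy["Family_Size"], bins=[0, 1, 5, 20],
                             labels=["alone", "small (2-5)", "large (>5)"])

# Survival rate per family bucket
rate_by_group = toy.groupby("Family_Group", observed=True)["Survived"].mean()
print(rate_by_group)
```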

Last Word

We can see that just by visualizing the relation between a few variables we gained many insights. We can use this newly gained knowledge about the features when training models, by adding new features and removing the unnecessary ones.

Refer to the Kaggle Kernel or Jupyter Notebook for the whole analysis and data modeling

Disclaimer: Just to let you know, this blog post was originally published on Medium. If you’d like to check out the original, you can find it at this link.


Kush Bhatnagar
