Exploratory Data Analysis: A Beginner’s Guide

yeshwanth kumar maringanti
11 min read · May 5, 2021

Photo by Luke Chesser on Unsplash

Exploratory data analysis (EDA) is a good place to start if you want to get into machine learning. So, let’s get started.

What is EDA?

Exploratory data analysis is a way of finding the most important information in a dataset by examining its attributes (the independent variables), individually and in combination, to see which of them help classify the output variable.

Why do we need EDA?

The primary goal of EDA is to assist in the analysis of data prior to making any assumptions. It can aid in the detection of obvious errors, as well as a better understanding of data patterns, the detection of outliers or anomalous events, and the discovery of interesting relationships between variables.

You can easily download the dataset from Kaggle by following the link provided in the References section below.

This dataset contains cases from a study on the survival of patients who had undergone breast cancer surgery at the University of Chicago’s Billings Hospital between 1958 and 1970.

Let’s take it one step at a time, first understanding the data set and then trying out a few different plots to see what we can learn from it.

  1. Understanding the data
  2. The goal of our experiment
  3. Univariate analysis
  4. Bivariate analysis
  5. Conclusion
  6. References

1. Understanding the dataset

1.1 Import the libraries required

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

1.2 Loading the dataset into a pandas DataFrame

data = pd.read_csv("haberman.csv")
# shape gives the number of rows and columns in the dataset
data.shape

Output: (306, 4)

# head() returns the first 5 rows of the data along with the column names
data.head()

Output:

1.3 Description of attributes

1. “Age”: the age of the patient at the time of the operation. It is numerical data.

2. “Year”: the year in which the operation was performed. It is numerical data.

3. “Nodes”: the number of positive axillary nodes detected in the patient (I renamed this column to nodes in the dataframe). Axillary nodes are lymph nodes in the underarm; a positive node is one in which cancer was detected, which indicates that the cancer has spread.

4. “Status”: whether the patient survived 5 years or longer after the operation (1) or less than 5 years (2).

5. The input variables in our dataset are Age, Year and Nodes. They are also called features, and we use them to analyse the data.

6. “Status” is the output variable, also called the class variable.
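To make the split between features and output variable concrete, here is a minimal sketch. The rows are hypothetical and only mimic the layout of haberman.csv; the names toy, feature_columns, features and target are my own.

```python
import pandas as pd

# Hypothetical rows that only mimic the layout of haberman.csv
toy = pd.DataFrame({
    "age": [30, 34, 41, 52],
    "year": [64, 60, 65, 69],
    "nodes": [1, 0, 23, 4],
    "status": [1, 1, 2, 2],
})

feature_columns = ["age", "year", "nodes"]   # input variables (features)
features = toy[feature_columns]
target = toy["status"]                       # output / class variable

print(features.shape, target.shape)
```

With the real data, the same two selections would be applied to data instead of toy.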

# summary statistics (count, mean, std, min, quartiles, max) for each attribute
data.describe()

1.4 Checking null values present in any of the rows

data.isnull().values.any()

Output: False

In our dataset, there are no null values. We’ll look at how to deal with data that has missing or null values in one of the next few blogs.
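If the check above had returned True, a per-column count would show where the gaps are. The sketch below assumes a hypothetical toy frame with one missing value added on purpose; it is not the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing age added on purpose
toy = pd.DataFrame({
    "age": [30, 34, np.nan, 41],
    "year": [64, 62, 65, 60],
    "nodes": [1, 3, 0, 23],
    "status": [1, 1, 2, 2],
})

# Number of missing values in each column
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Rows that contain at least one missing value
rows_with_gaps = toy[toy.isnull().any(axis=1)]
print(rows_with_gaps)
```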

# To get information about the data types of the attributes
data.info()

Output:

# checking the different values present in the status variable
data['status'].value_counts()

Output:

The output above shows the number of 1s and 2s in the status variable.

The status attribute takes only two values, “1” or “2”:

1 -> patient survived 5 or more years

2 -> patient survived less than 5 years
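The numeric codes are easier to read once they are mapped to labels, and the share of each class shows how balanced the data is. This is a minimal sketch on a hypothetical status column; the label strings are my own wording, not part of the dataset.

```python
import pandas as pd

# Hypothetical status column; the real one comes from haberman.csv
status = pd.Series([1, 1, 1, 2, 1, 2, 1, 1], name="status")

# Map the numeric codes to readable labels (label text is illustrative)
labels = status.map({1: "survived 5+ years", 2: "survived < 5 years"})

# Share of each class instead of raw counts
class_share = labels.value_counts(normalize=True)
print(class_share)
```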

2. The Goal of Our Experiment

Our main goal is to use exploratory data analysis techniques such as pair plots and univariate plots to see how well the independent variables (age, year and nodes) separate patients into survival (class 1) and non-survival (class 2), and so to determine which of them are the deciding factors for our classification problem.

3. Univariate Analysis

  1. Univariate analysis is statistical analysis of one variable at a time. It helps us identify patterns based on a single feature.

  2. Univariate plots such as PDFs, CDFs, box plots and violin plots help us understand which features are useful for the classification.

3.1 PDF (Probability Density Function)

  1. A PDF is obtained by smoothing the histogram of a numerical variable through a process called KDE (kernel density estimation).
  2. We look at each input variable in turn to see how well it separates the output classes on its own, and then compare the variables to see which one is most useful for classifying the output as 1 (survival) or 2 (non-survival).
  3. In the following PDF plots, the x-axis shows the values of the independent variable under consideration and the y-axis shows the estimated density at each value. The histogram bars are normalised so that their total area is 1, so the y-axis is a density rather than a raw count.
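To see what "smoothing the histogram" means, here is a small hand-written Gaussian KDE in NumPy. It is a sketch for intuition only, not the estimator seaborn uses internally; the function gaussian_kde_1d, the bandwidth of 5.0 and the ages are all assumptions of mine.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Average of Gaussian bumps centred on each sample, evaluated on grid."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # z[i, j] = standardised distance from grid point i to sample j
    z = (grid[:, None] - samples[None, :]) / bandwidth
    bumps = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2 * np.pi))
    return bumps.mean(axis=1)

# Hypothetical ages, not taken from the dataset
ages = np.array([30, 34, 38, 41, 45, 47, 52, 55, 60, 63, 70])

# Histogram normalised so that the bar areas sum to 1
heights, edges = np.histogram(ages, bins=5, density=True)
hist_area = np.sum(heights * np.diff(edges))

# Smooth estimate on a grid wide enough to cover the tails
grid = np.linspace(0, 100, 1001)
density = gaussian_kde_1d(ages, grid, bandwidth=5.0)
kde_area = np.sum(density) * (grid[1] - grid[0])   # Riemann-sum approximation

print(round(hist_area, 3), round(kde_area, 3))
```

Both areas come out at about 1, which is why the y-axis of a PDF plot reads as density and not as a count. Notice also that the smooth curve is non-zero outside the range of the samples.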

PDF for Age

# PDF for the "age" input variable
# histplot(kde=True) replaces sns.distplot, which is deprecated in recent seaborn releases
sns.set_style("whitegrid")
age_plot = sns.FacetGrid(data, hue="status", height=6)
age_plot.map(sns.histplot, "age", kde=True, stat="density").add_legend()
plt.xlabel("Age")
plt.ylabel("Density")
plt.title("PDF of age by survival status")
plt.show()

Output:

Observations:

1. Most of the data points from the two classes (1 and 2) overlap, which suggests that survival does not depend strongly on the patient’s age.

2. Between ages 30 and 40 the chances of survival are higher; between ages 40 and 60 they are lower.

3. Between 60 and 75, survival and non-survival are roughly equally likely.

4. Beyond 90 the plot shows no chance of survival, but the tails of a KDE curve extend past the observed ages, so this range should not be over-interpreted.

5. Even though we can draw some inferences from this attribute, we cannot determine survival from a single parameter.

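PDF for Year

# PDF for the "year" input variable
# histplot(kde=True) replaces sns.distplot, which is deprecated in recent seaborn releases
sns.set_style("whitegrid")
year_plot = sns.FacetGrid(data, hue="status", height=6)
year_plot.map(sns.histplot, "year", kde=True, stat="density").add_legend()
plt.xlabel("Year of operation")
plt.ylabel("Density")
plt.title("PDF of year of operation by survival status")
plt.show()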

Observations :

1. The curves for classes 1 and 2 overlap over most of the range, so the year of operation cannot be the deciding factor for the patient’s survival.

2. However, non-survival is most prevalent for operations between 1957 and 1965, as shown by the curve for class 2 (not surviving more than 5 years) being highest in that range.

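PDF for Nodes

# PDF for the "nodes" input variable
# histplot(kde=True) replaces sns.distplot, which is deprecated in recent seaborn releases
sns.set_style("whitegrid")
nodes_plot = sns.FacetGrid(data, hue="status", height=6)
nodes_plot.map(sns.histplot, "nodes", kde=True, stat="density").add_legend()
plt.xlabel("Number of positive nodes")
plt.ylabel("Density")
plt.title("PDF of nodes by survival status")
plt.show()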

Observations:

1. Chances of survival are high if the patient has 0 to 2 positive nodes.

2. If the patient has more than 26 nodes, the chances of survival are very low.

3.2 CDF (Cumulative Distribution Function)

  1. A CDF is very useful for seeing where the majority of the data lies, because it gives the percentage of all values that fall at or below a given value of x.

  2. For instance, at some x = x1, the CDF tells us the percentage of values that are less than or equal to x1.
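The definition can be checked directly without any binning. This is a minimal sketch of an empirical CDF; the helper names and the node counts are hypothetical, chosen only to illustrate the idea.

```python
import numpy as np

def empirical_cdf(values):
    """Sorted values and the cumulative fraction i/n at the i-th sorted point."""
    x = np.sort(np.asarray(values, dtype=float))
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y

def fraction_at_most(values, x1):
    """Fraction of observations that are less than or equal to x1."""
    return float(np.mean(np.asarray(values) <= x1))

# Hypothetical node counts, not taken from the dataset
nodes = [0, 0, 1, 1, 2, 3, 4, 8, 13, 22]

x, y = empirical_cdf(nodes)
print(fraction_at_most(nodes, 4))   # share of values that are <= 4
```

Here 7 of the 10 values are at most 4, so the CDF at x1 = 4 is 0.7. The plots that follow approximate the same quantity from a 10-bin histogram.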

Plotting CDF of Ages

# splitting the data frame into two data frames having status 1 and 2
# data_yes = all the rows where status is 1
# data_no = all the rows where status is 2
data_yes = data.loc[data["status"] == 1]
data_no = data.loc[data["status"] == 2]
# status = 1
counts, bin_edges = np.histogram(data_yes['age'], bins=10, density=True)
# pdf = fraction of the class that falls in each bin of x (age in our case)
pdf = counts / sum(counts)
cdf = np.cumsum(pdf)
plt.plot(bin_edges[1:], cdf, label=1)
# status = 2
counts, bin_edges = np.histogram(data_no['age'], bins=10, density=True)
pdf = counts / sum(counts)
cdf = np.cumsum(pdf)
plt.plot(bin_edges[1:], cdf, label=2)
plt.title("CDF for age of patients")
plt.xlabel("Age of Patients")
plt.ylabel('cumulative fraction of patients')
plt.legend()
plt.show()

Output:

Observations :

1. The plot makes it clear that the attribute “age” alone cannot be used to determine whether or not a patient will survive, since the orange and blue lines stay close together across the whole range.

2. As you can see, at age ≤ 50 the CDF is about 40 % for both classes, so a patient in this range may or may not survive. As a result, we can’t rely on this attribute to classify the result correctly.

Plotting CDF of Year

# data_yes = all the rows where status is 1
# data_no = all the rows where status is 2
data_yes = data.loc[data["status"] == 1]
data_no = data.loc[data["status"] == 2]
# status = 1
counts, bin_edges = np.histogram(data_yes['year'], bins=10, density=True)
pdf = counts / sum(counts)
cdf = np.cumsum(pdf)
plt.plot(bin_edges[1:], cdf, label=1)
# status = 2
counts, bin_edges = np.histogram(data_no['year'], bins=10, density=True)
pdf = counts / sum(counts)
cdf = np.cumsum(pdf)
plt.plot(bin_edges[1:], cdf, label=2)
plt.title("CDF for year of operation")
plt.xlabel("year of operation")
plt.ylabel('cumulative fraction of patients')
plt.legend()
plt.show()

Output:

Observation :

1. The year attribute does not help us classify a patient into one of the two classes either, as the two curves overlap and rise in the same way.

Plotting CDF of Nodes

# data_yes = all the rows where status is 1
# data_no = all the rows where status is 2
data_yes = data.loc[data["status"] == 1]
data_no = data.loc[data["status"] == 2]
# status = 1
counts, bin_edges = np.histogram(data_yes['nodes'], bins=10, density=True)
pdf = counts / sum(counts)
cdf = np.cumsum(pdf)
plt.plot(bin_edges[1:], cdf, label=1)
# status = 2
counts, bin_edges = np.histogram(data_no['nodes'], bins=10, density=True)
pdf = counts / sum(counts)
cdf = np.cumsum(pdf)
plt.plot(bin_edges[1:], cdf, label=2)
plt.title("CDF for number of nodes of patients")
plt.xlabel("Nodes")
plt.ylabel('cumulative fraction of patients')
plt.legend()
plt.show()

Output:

Observation:

1. This attribute is superior to age and year in terms of gaining insight into the patient’s survival.

2. Nearly 83 percent of the patients who survived had 4 or fewer axillary nodes. Strictly speaking, the CDF gives the share of survivors with few nodes and not the probability of surviving given few nodes, but it supports the same reading: as the number of axillary nodes increases, the chances of survival decrease significantly.
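A per-class CDF answers "what share of survivors have at most k nodes", which is not the same question as "what share of patients with at most k nodes survived". The sketch below computes both on a hypothetical toy frame, so the numbers are illustrative only and say nothing about the real dataset.

```python
import pandas as pd

# Hypothetical toy rows using the article's column names
toy = pd.DataFrame({
    "nodes":  [0, 1, 2, 4, 0, 3, 7, 12, 25, 9],
    "status": [1, 1, 1, 2, 1, 2, 2, 2, 2, 1],
})

survived = toy["status"] == 1
few_nodes = toy["nodes"] <= 4

# What the class-1 CDF reads off: share of survivors that have <= 4 nodes
share_of_survivors_with_few_nodes = few_nodes[survived].mean()

# Survival rate within the group of patients that have <= 4 nodes
survival_rate_given_few_nodes = survived[few_nodes].mean()

print(share_of_survivors_with_few_nodes, survival_rate_given_few_nodes)
```

On the real data, replacing toy with data gives both figures for any node threshold.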

3.3 Box Plots

A box plot represents groups of numerical data through their quartiles.

The lower edge of the box represents the 25th percentile (the lower quartile), the middle line represents the 50th percentile (the median) and the upper edge represents the 75th percentile (the upper quartile) of the values plotted.
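The three lines of the box can also be computed as plain numbers. This is a minimal sketch on hypothetical rows; the column names q1, median, q3 and iqr are my own, and iqr (the height of the box) is simply q3 minus q1.

```python
import pandas as pd

# Hypothetical toy rows using the article's column names
toy = pd.DataFrame({
    "nodes":  [0, 0, 1, 2, 3, 1, 4, 9, 13, 23],
    "status": [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
})

# 25th, 50th and 75th percentiles of nodes within each class
quartiles = toy.groupby("status")["nodes"].quantile([0.25, 0.5, 0.75]).unstack()
quartiles.columns = ["q1", "median", "q3"]

# Height of the box: distance between the upper and lower quartile
quartiles["iqr"] = quartiles["q3"] - quartiles["q1"]
print(quartiles)
```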

Box plot for Age , Year and Nodes

#box plot for status vs age
sns.boxplot(x='status',y='age', data=data)
plt.title('status vs age boxplot')
plt.show()
#box plot for status vs year of operation
sns.boxplot(x='status',y='year', data=data)
plt.title('status vs year boxplot')
plt.show()
#box plot for status vs nodes
sns.boxplot(x='status',y='nodes', data=data)
plt.title('status vs node boxplot')
plt.show()

Outputs of all 3 plots:

The width of the box (in the horizontal direction) does not carry any meaning.

Observations :

1. For the box plots of age and year, it is hard to classify because the two classes share most of their range, so we cannot place patients cleanly under class “1” or “2”. In the year plot, however, one inference is that the chances of survival are very low before 1960 and improve after 1965.

2. The box plot of nodes shows less overlap than those of age and year, where the overlap is large. Survival is mostly concentrated among patients with fewer than 4 nodes.

3.4 Violin Plot

It combines a box plot (the highlighted region in the centre) with a PDF (the density curve is drawn vertically and mirrored symmetrically on the left and right sides of the violin).

In a violin plot, denser regions of the data are fatter and sparser regions are thinner.

Violin plots of Age , Year and nodes

# violin plot for status vs age
sns.violinplot(x='status', y='age', data=data)
plt.title('status vs age violinplot')
plt.show()
# violin plot for status vs year of operation
sns.violinplot(x='status', y='year', data=data)
plt.title('status vs year violinplot')
plt.show()
# violin plot for status vs nodes
sns.violinplot(x='status', y='nodes', data=data)
plt.title('status vs nodes violinplot')
plt.show()

Outputs:

The black region at the centre represents the box plot.

Observations:

As with the box plots, the overlap in the status vs age and status vs year plots is high, and some overlap remains in the violin plot of status vs nodes too, so we cannot classify exactly by status. The observations are the same as for the box plots.

OVERALL UNIVARIATE ANALYSIS OBSERVATION(S)

1. The number of nodes is inversely related to the chances of survival: the more nodes, the lower the chances of survival.

2. Patients with 0 or 1 nodes survived the most, yet some patients in this range did not survive either, so we cannot classify them exactly by status.

3. Age alone cannot be the deciding parameter for determining the survival of the patient.

4. Nodes is the most useful of the three parameters, but even it does not always predict the outcome.

4. Bivariate Analysis

In this analysis, two variables are analysed simultaneously to determine the relationship between them and how well they classify the output variable. Examples are pair plots and 2D scatter plots.

4.1 Scatter Plot

# scatter plot between age and year
sns.set_style("whitegrid")
age_year_plot = sns.FacetGrid(data, hue="status", height=5)
age_year_plot.map(plt.scatter, "age", "year").add_legend()
plt.title('Scatter plot between Age vs Year')
plt.show()

Output:

Observation:

We can’t judge a patient’s survival from the age vs year plot because the data appears imbalanced and the two classes are mixed together. The status points stay just as randomly mixed as the patient’s age increases, so we plot the other pairs and check their relationships.

4.2 Pair Plots

A pair plot draws the scatter plots for every pair of independent variables (age, year, nodes) at once, so that we can compare the combinations and see which pair of variables best separates the classes of the output variable.

sns.set_style("whitegrid")
sns.pairplot(data, hue="status", height = 5)
plt.show()

Output:

Understanding plots :

  1. Plots 2 and 4 show the same pair (age, year) with the axes interchanged; similarly, plots 3 and 7 show the same pair (age, nodes), and plots 6 and 8 show the same pair (year, nodes).
  2. Plots 1, 5 and 9 show the PDFs of age, year and nodes respectively.
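The panel numbering can be reproduced with a few lines of plain Python. This sketch assumes the panels are numbered 1 to 9 row by row and that the features appear in the order age, year, nodes; the names panels, diagonal and mirrored are my own.

```python
features = ["age", "year", "nodes"]

# Number the panels 1..9 row by row; each panel is (x variable, y variable)
panels = {}
for row, y_var in enumerate(features):
    for col, x_var in enumerate(features):
        number = row * len(features) + col + 1
        panels[number] = (x_var, y_var)

# Diagonal panels plot a variable against itself (drawn as PDFs)
diagonal = [n for n, (x_var, y_var) in panels.items() if x_var == y_var]

# Off-diagonal panels come in pairs with the axes swapped
mirrored = [(a, b) for a in panels for b in panels
            if a < b and panels[a] == panels[b][::-1]]

print(diagonal)
print(mirrored)
```

So only three of the nine panels carry distinct scatter plots; the rest are mirror images or PDFs.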

Observations :

  1. Of all the pair plots, the one between year and nodes (plot 6 or plot 8) is comparatively better.
  2. The other plots do not offer much valuable information.

5. Conclusions

1. Classifying the survival status of a patient who has undergone the operation is very difficult with the given data because the data is imbalanced, so we need more balanced data or other features that can help draw valuable insights.

2. An attribute like age is not helpful on its own for categorising the survival of patients.

3. The greater the number of nodes, the lower the chance of survival, but survival is not assured even when the number of nodes is very low or zero, so we cannot say for certain.

6. References

  1. https://www.kaggle.com/gilsousa/habermans-survival-data-set
  2. http://archive.ics.uci.edu/ml/datasets/Haberman's+Survival
  3. https://unsplash.com/@lukechesser?utm_source=medium&utm_medium=referral

Thank you for your patience and for reading the entire blog. Please let me know if there are any errors in the blog so that I can correct them.

I’m also happy to answer your questions. See you in the next blog :)
