Exploratory Data Analysis with Python
What is Exploratory Data Analysis? EDA is the practice of analysing the data you have by, quite literally, taking a look at it. It should therefore come as no surprise that most of EDA is visual, e.g. printing values and plotting graphs.
This does not mean the data is never transformed; on the contrary, we transform it into a form more suitable for display, so that it conveys as much information as possible in line with the objectives.
EDA in Python mostly involves, but is not limited to, the following tools:
- Pandas
- Matplotlib, Seaborn and Plotly
- Geopandas
In this article we take a look at:
- Reading General Statistics from the data using pandas
- Plotting to observe different statistics
You can also find the entire notebook on my GitHub.
1. Initialisation and importing relevant libraries
Mount your drive if your workspace is on Colab. If your files live elsewhere, e.g. on another website, you can explore alternative ways to stage them on Colab.
Mount your drive using the following code and follow the prompts.
from google.colab import drive
drive.mount('/content/drive')
Next we import all relevant libraries.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")
2. Importing data from CSV or other file formats
Use the pandas library to read the CSV and display the results.
Print out the first few lines of the dataframe to see what the data looks like.
expresso_churn_df = pd.read_csv('/content/drive/MyDrive/Expresso Churn/Train.csv')
expresso_churn_df.head()
3. Exploring the properties of the dataset
Have a look at the shape of the dataset and the data type of each column. Data types are essential when you want to apply certain functions or transformations, since each type restricts what can be done with it.
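Both checks are one-liners; a quick sketch using the dataframe loaded above:
# (rows, columns) of the dataframe
print(expresso_churn_df.shape)
# Data type of each column
print(expresso_churn_df.dtypes)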
If you haven't already noticed, shape always outputs a tuple, starting with the number of rows/observations/records followed by the number of columns/labels, i.e. we have 19 columns in our dataframe.
Next we take a look at the general statistics of our data using describe().
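In code, that is simply:
# Count, mean, standard deviation, min, quartiles and max for the numeric columns
expresso_churn_df.describe()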
Now we check to see how many missing values are in our dataset.
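A minimal way to do this with pandas:
# Number of missing values per column
expresso_churn_df.isnull().sum()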
Missing data can skew your model and results. A common approach is to drop rows with missing values beyond a threshold you are comfortable with, to fill in the nulls using whichever criteria you see fit, or to do both. In this case I chose to drop rows with too many missing values.
# Remove any row with more than 50% missing values
expresso_churn_few_nulls = expresso_churn_df[
    expresso_churn_df.count(axis=1) > int(expresso_churn_df.shape[1] * 0.5)
]
Another thing I chose to do was combine all the information on data types, missing values and unique counts into one dataframe.
columns = ['Non-null', 'Dtype', 'Unique']
df = pd.DataFrame([expresso_churn_few_nulls.count(),
                   expresso_churn_few_nulls.dtypes,
                   expresso_churn_few_nulls.nunique()]).T
df.columns = columns
df
The “Unique” column can help you decide, when you get to data preparation, where to use encoders and how to group the categories/values; this will also depend on the data type. It can also hint at how significant a column is and how much variance it contributes, e.g. MRG has only one value all through, so you can drop it later on.
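As a sketch, dropping that single-valued column could look like this (assuming MRG is still present at that point):
# MRG holds a single value throughout, so it carries no information
expresso_churn_few_nulls = expresso_churn_few_nulls.drop(columns=['MRG'])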
We take a look at the current state of our data and see how many missing values we still have. We can also list the columns (handy when you want to reference column names).
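For example:
# Remaining missing values per column
print(expresso_churn_few_nulls.isnull().sum())
# Column names, for easy reference later
print(expresso_churn_few_nulls.columns)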
4. Formatting data and handling missing values
We now fill in the null values.
fill_not_given = ['REGION', 'TENURE', 'TOP_PACK']
fill_zeros = ['FREQUENCE_RECH', 'MONTANT', 'DATA_VOLUME', 'REVENUE', 'ARPU_SEGMENT', 'FREQUENCE',
              'ON_NET', 'ORANGE', 'TIGO', 'ZONE1', 'ZONE2', 'FREQ_TOP_PACK']
# Assume a NaN REVENUE implies the user earns nothing;
# ARPU_SEGMENT and FREQUENCE will then also be zero when revenue is zero
expresso_churn_few_nulls[fill_not_given] = expresso_churn_few_nulls[fill_not_given].fillna('NotGiven')
expresso_churn_few_nulls[fill_zeros] = expresso_churn_few_nulls[fill_zeros].fillna(0)
# Fill the columns that still contain nulls with their mode
null_count_series = expresso_churn_few_nulls.isnull().sum()
fill_mode = null_count_series.index[null_count_series > 0]
for col in fill_mode:
    col_mode = expresso_churn_few_nulls[col].mode()[0]
    expresso_churn_few_nulls[col] = expresso_churn_few_nulls[col].fillna(col_mode)
    # Replace any infinities in numeric columns with the mode as well
    expresso_churn_few_nulls[col] = expresso_churn_few_nulls[col].replace([np.inf, -np.inf], col_mode)
Here we add another column, which is the row-wise sum of a set of call-volume columns.
# Total number of calls per user
expresso_churn_few_nulls['TOTAL_CALLS'] = expresso_churn_few_nulls[['ON_NET','ORANGE','TIGO','ZONE1','ZONE2']].sum(axis=1)
5. Plotting the Visualisations
Next we plot charts. This helps us:
- Observe trends in our data
- See where many values lie
- Observe outliers
- See how one value relates to another
We create a plotting function and call it to produce the figures.
Plotting Bar Graphs
def plot_for_frequency_count(col_name, plot_title):
    # Count the occurrences of each unique value in the column
    keys, values = np.unique(expresso_churn_few_nulls[col_name], return_counts=True)
    data_ = pd.DataFrame({col_name: keys, 'Values': values})
    plt.figure(figsize=(17, 7))
    sns_plot = sns.barplot(x=col_name, y='Values', data=data_)
    # Annotate each bar with its count
    for index, row in data_.iterrows():
        sns_plot.text(row.name, row.Values, row.Values, color='black', ha='center', fontsize=14)
    plt.title(plot_title)
    plt.savefig(plot_title + ".png")
    plt.show()
plot_for_frequency_count('TENURE', 'user count per duration in the network')
plot_for_frequency_count('REGION', 'SUBSCRIBERS PER REGION')
# Plot volume of calls to each 'LINE'/Call Service Provider
total_vol_of_calls = expresso_churn_few_nulls[['ON_NET', 'ORANGE', 'TIGO', 'ZONE1', 'ZONE2']].sum().sum()
per_csp = expresso_churn_few_nulls[['ON_NET', 'ORANGE', 'TIGO', 'ZONE1', 'ZONE2']].sum()
percentage_volume = (per_csp / total_vol_of_calls) * 100
call_vol_df = pd.DataFrame({'Provider': per_csp.index,
                            'Volume': per_csp.values})
plt.figure(figsize=(14, 7))
call_vol_plot = sns.barplot(x='Provider', y='Volume', data=call_vol_df)
# Zip the volume data with the percentage data and iterate over them;
# each 'row_tuple' holds the index and the values of that row
for percent, row_tuple in zip(percentage_volume, call_vol_df.iterrows()):
    row = row_tuple[1]
    text = str(round(row.Volume / 1000000, 1)) + ' (' + str(round(percent, 1)) + '%)'
    call_vol_plot.text(row.name, row.Volume, text, color='black', ha='center', fontsize=14)
plt.title('Call Volume to Other Providers (in Millions)', fontsize=17)
plt.savefig("Call Volume to Other Providers.png")
plt.show()
Plotting a Pie Chart
# Churn ratios (Pie Chart)
keys, values = np.unique(expresso_churn_few_nulls['CHURN'], return_counts=True)
def func(percent, allvalues):
    # Convert the wedge percentage back into an absolute count
    actual_val = round((percent / 100) * allvalues.sum(), 1)
    return "{:,d}\n ({:.1f}%) ".format(int(actual_val), percent)
plt.figure(figsize=(5, 5))
wedges, texts, autotexts = plt.pie(values, autopct=lambda x: func(x, values),
                                   pctdistance=0.6,
                                   textprops=dict(color='black'))
legend = plt.legend(wedges, ["No", "Yes"],
                    title="Churned",
                    frameon=False,
                    loc="best",
                    bbox_to_anchor=(1, 0, 0.5, 1),
                    prop={'size': 14})
plt.setp(autotexts, size=14)
plt.setp(legend.get_title(), fontsize=14)
plt.title("EXPRESSO CHURN")
plt.savefig("Churn ratio.png")
plt.show()
Plotting Diverging and Stacked Bar Charts for Comparison
# Churn per region
cols = ['user_id', 'CHURN', 'REGION']
churn_per_region = expresso_churn_few_nulls[cols].pivot_table(values='user_id',
                                                              index='REGION',
                                                              columns='CHURN',
                                                              aggfunc='count')
# Negate the churned counts so they plot below the axis
churn_per_region[1] = churn_per_region[1] * (-1)
plt.figure(figsize=(20, 7))
ax = sns.barplot(x=churn_per_region.index.values,
                 y=churn_per_region[0],
                 color='lightblue')
ax = sns.barplot(x=churn_per_region.index.values,
                 y=churn_per_region[1],
                 color='gray')
for plot in ax.patches:
    signed = 1
    selected_color = 'black'
    if plot.get_height() < 0:
        signed = -1
        selected_color = 'red'
    # Label each bar with its absolute count, offset away from the axis
    ax.annotate('{:,d}'.format(abs(int(plot.get_height()))),
                (plot.get_x() + plot.get_width() / 2, plot.get_height()),
                ha='center', va='center',
                size=10, xytext=(0, signed * 10),
                textcoords='offset points',
                color=selected_color)
plt.ylim(churn_per_region[1].min() - 100000, churn_per_region[0].max() + 100000)
plt.tick_params(right=False, left=False, labelleft=False)
plt.title('CHURN NUMBERS PER REGION')
plt.savefig('churn numbers per region.png')
plt.show()
# Call volume per region to the various CSPs
cols = ['REGION', 'ON_NET', 'ORANGE', 'TIGO', 'ZONE1', 'ZONE2']
csp_cols = ['ON_NET', 'ORANGE', 'TIGO', 'ZONE1', 'ZONE2']
call_vol_per_region = expresso_churn_few_nulls[cols].groupby('REGION').agg('sum')
call_vol_per_region = call_vol_per_region.sort_values(csp_cols, ascending=True)
fig = plt.figure(figsize=(17, 10))
ax = fig.subplots()
# Stack the bars by starting each provider where the previous one ended
left = np.zeros(len(call_vol_per_region))
bars = []
for csp in csp_cols:
    bars.append(ax.barh(call_vol_per_region.index.values,
                        call_vol_per_region[csp],
                        left=left,
                        label=csp))
    left = left + call_vol_per_region[csp].values
h1, h2, h3, h4, h5 = bars
for r1, r2, r3, r4, r5 in zip(h1, h2, h3, h4, h5):
    # Widths of the five stacked segments in this row
    w1, w2, w3, w4, w5 = (r.get_width() for r in (r1, r2, r3, r4, r5))
    y = r1.get_y() + r1.get_height() / 2
    total = w1 + w2 + w3 + w4 + w5  # End of the stacked bar
    plt.text(total + 7000000, y, '{:,.0f}'.format(total), ha='center', va='bottom')
plt.legend(title='Call Service Provider', loc='best')
# Leave headroom on the right for the total labels
x_limit = call_vol_per_region.sum(axis=1).max() + 20000000
plt.xlim(0, x_limit)
plt.title('Call Volume (per region) to various providers')
plt.savefig('Call Volume to various providers.png')
plt.show()