Tips for Exploratory Data Analysis with Python

Understand your data at a high level

Xia Song
4 min read · Aug 31, 2019

Before diving into the data, it is very helpful to understand its attributes, structure, data types, and missing values. The following are some basic properties of the data we should explore.

1. Data dimensions

df.shape

2. Data columns

df.columns

3. Data types of each column

df.dtypes

Select columns of a specific data type (for example, datetime64) from a DataFrame:

df_cols = df.select_dtypes(include=['datetime64']).columns.tolist()

4. Percentage of missing values of each column

df.isnull().sum()/df.shape[0] * 100

5. For a categorical column, check the counts of each category

df[column].value_counts()

6. For a categorical column, check the total number of categories

df[column].nunique()

7. For a categorical column, list the categories

df[column].unique()
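Putting the inspection calls above together on a small, hypothetical DataFrame (the column names and values here are made up for illustration):

```python
import pandas as pd

# hypothetical sample data for illustration
df = pd.DataFrame({
    'city': ['NY', 'LA', 'NY', None],
    'sales': [10.0, 20.0, None, 40.0],
    'when': pd.to_datetime(['2019-01-01'] * 4),
})

dims = df.shape                                                  # (rows, columns)
date_cols = df.select_dtypes(include=['datetime64']).columns.tolist()
missing_pct = df.isnull().sum() / df.shape[0] * 100              # percent missing per column
city_counts = df['city'].value_counts()                          # NaN excluded by default
n_categories = df['city'].nunique()
```

Note that `value_counts` and `nunique` exclude missing values by default; pass `dropna=False` to count them as well.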

Check how many data types one specific column contains

A single column may contain multiple data types, typically because some datetime values were interpreted as numbers and some numbers were interpreted as datetimes. In these situations, it is necessary to make the data type consistent within the column; otherwise, inconsistent types can produce wrong statistical results.

1. Check the data types in each column of the whole data frame

for col in df.columns:
    print(col)
    print(df[col].apply(type).value_counts())
    print('---------')

2. If there is only one data type, leave the column alone; if there is more than one, convert the incorrect values to the proper type. The two most frequently used functions are listed below.

# convert numerical data into dates
import datetime
import pandas as pd

def numtodate(x, _epoch0=datetime.datetime(1899, 12, 31)):
    if pd.isnull(x):
        return x
    if isinstance(x, str):
        try:
            return pd.to_datetime(x)
        except (ValueError, TypeError):
            return None
    if isinstance(x, datetime.date):
        return x
    # the original data is an Excel serial number
    if x > 59:
        x -= 1  # Excel considers 1900 a leap year; it is not
    return (_epoch0 + datetime.timedelta(days=x)).replace(microsecond=0)

# convert date time data into numerical data
def datetonumber(x):
    epoch0 = datetime.datetime(1899, 12, 31)
    if pd.isnull(x):
        return x
    if isinstance(x, (float, int)):
        return x
    dif = (x - epoch0).total_seconds() / 86400
    if dif > 59:
        return dif + 1
    return dif
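As a self-contained sketch of both steps (the series values here are hypothetical, and `excel_serial_to_date` is a simplified stand-in mirroring the epoch logic above): first detect the mixed types, then convert the column to a single type.

```python
import datetime
import pandas as pd

EXCEL_EPOCH = datetime.datetime(1899, 12, 31)

def excel_serial_to_date(serial):
    # Excel wrongly counts 1900 as a leap year, so serials above 59
    # (dates after the phantom 1900-02-29) are off by one day
    if serial > 59:
        serial -= 1
    return EXCEL_EPOCH + datetime.timedelta(days=serial)

# a hypothetical column mixing date strings with Excel serial numbers
s = pd.Series(['2019-01-01', 43466.0, '2019-02-01'])

# step 1: detect the mixed types
type_counts = s.apply(type).value_counts().to_dict()

# step 2: make the column consistent
consistent = s.apply(
    lambda x: excel_serial_to_date(x) if isinstance(x, (int, float))
    else pd.to_datetime(x)
)
```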

Group by one column then aggregate other columns

The following code groups by the 'id' column while aggregating the other columns: for each group, it sums the 'col1' values, takes the last record of 'col2' by row order, takes the first non-null value of 'col3', averages the 'col4' values, and counts the records in 'col5'.

df_group = df.groupby('id').agg({'col1': 'sum', 'col2': lambda x: x.iloc[-1], 'col3': 'first', 'col4': 'mean', 'col5': 'count'})
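A quick sanity check of this aggregation pattern on hypothetical data (showing only the 'sum', positional-lambda, and 'first' parts):

```python
import pandas as pd

df = pd.DataFrame({
    'id':   [1, 1, 2, 2],
    'col1': [10, 20, 5, 5],
    'col2': ['a', 'b', 'c', 'd'],
    'col3': [None, 'x', 'y', None],
})

df_group = df.groupby('id').agg({
    'col1': 'sum',                 # total per group
    'col2': lambda x: x.iloc[-1],  # last record by row order
    'col3': 'first',               # first non-null value per group
})
```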

Pandas Grouper is a handy function for grouping by a datetime column. The code below groups by two columns, col1 and date, at annual frequency with December as the last month of the year.

df.groupby(['col1', pd.Grouper(key='date', freq='A-DEC')])['price'].sum()
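For example, on hypothetical data, two rows from 2018 and one from 2019 roll up into year-end buckets labeled December 31:

```python
import pandas as pd

df = pd.DataFrame({
    'col1':  ['a', 'a', 'a'],
    'date':  pd.to_datetime(['2018-03-01', '2018-07-01', '2019-02-01']),
    'price': [1.0, 2.0, 3.0],
})

# one sum per (col1, calendar year) pair
annual = df.groupby(['col1', pd.Grouper(key='date', freq='A-DEC')])['price'].sum()
```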

If there is only one column to aggregate, the following code, for example, lists all the unique values of 'col1' in each group and saves the result to a dictionary.

df_group_dict = df.groupby('id')['col1'].unique().to_dict()

For the dictionary df_group_dict, the key is the id and the value is a list of the unique values of col1. You may want to reverse the keys and values; however, a list is mutable and cannot be a dictionary key. One way around this is to make each unique value of col1 a key and its corresponding id the value.

key_value_reverse_dict = {v:k for k in df_group_dict for v in df_group_dict[k]}

If another data frame has the same column col1, you can create a new column on it whose values are looked up from key_value_reverse_dict by key.

df_new['new_column'] = df_new['col1'].map(key_value_reverse_dict)
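End to end, on hypothetical data, the group dictionary, the reversed dictionary, and the mapped column look like this:

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 2], 'col1': ['x', 'y', 'z']})

# id -> unique col1 values per group
df_group_dict = df.groupby('id')['col1'].unique().to_dict()

# reverse: each unique col1 value -> its id
key_value_reverse_dict = {v: k for k in df_group_dict for v in df_group_dict[k]}

# another frame sharing col1 gets the ids via map
df_new = pd.DataFrame({'col1': ['z', 'x']})
df_new['new_column'] = df_new['col1'].map(key_value_reverse_dict)
```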

Calculate time delta between different columns

It is very common to calculate the difference between two dates. The following code calculates the difference in days between two date columns, 'col1' and 'col2', and saves it to a new column, 'dif_days'. The second line calculates the difference in months between the two columns.

df['dif_days'] = (df['col1'] - df['col2']).dt.days
# calendar-month difference between the two dates
df['dif_months'] = (df['col1'].dt.to_period('M') - df['col2'].dt.to_period('M')).apply(lambda m: m.n)
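For instance, with two hypothetical dates two calendar months apart:

```python
import pandas as pd

df = pd.DataFrame({
    'col1': pd.to_datetime(['2019-03-15']),
    'col2': pd.to_datetime(['2019-01-15']),
})

df['dif_days'] = (df['col1'] - df['col2']).dt.days
# period subtraction yields month offsets; .n extracts the count
df['dif_months'] = (df['col1'].dt.to_period('M') - df['col2'].dt.to_period('M')).apply(lambda m: m.n)
```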

It is also common to get the last date of the month from a datetime column.

from pandas.tseries.offsets import MonthEnd
df['lastdaymonth'] = pd.to_datetime(df['datetime'], format="%Y%m") + MonthEnd(0)
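A small check on hypothetical dates: `MonthEnd(0)` rolls a date forward to its month end and leaves dates already at the month end unchanged.

```python
import pandas as pd
from pandas.tseries.offsets import MonthEnd

s = pd.to_datetime(pd.Series(['2019-02-10', '2019-08-31']))
last_day = s + MonthEnd(0)  # 2019-02-28 and 2019-08-31
```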

Column wise calculation

There are two methods for column-wise calculation: define a function and then apply it to the column, or use a lambda function. The following code shows both, adding one to each value of a column.

# using a named function to add one to each value of a column
def plusone(x):
    return x + 1

df['plusone'] = df['col'].apply(plusone)

# using a lambda function to add one to each value of a column
df['plusone'] = df['col'].apply(lambda x: x + 1)
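Both variants produce the same result; shown here on a hypothetical three-row column:

```python
import pandas as pd

df = pd.DataFrame({'col': [1, 2, 3]})
df['plusone'] = df['col'].apply(lambda x: x + 1)
```

For simple arithmetic like this, the vectorized form `df['col'] + 1` is equivalent and faster than `apply`.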

Visualize data

1. Visualize the cumulative percentage distribution. A cumulative percentage distribution helps you understand the probability of falling above a certain value, below a certain value, or between two values.
import matplotlib
import matplotlib.pyplot as plt

# calculate the cumulative value of column col1
df['col1_cum'] = df['col1'].cumsum()
cols_sum = df['col1'].sum()

# calculate the cumulative percentage for each cumulative value
def col_cum_perc(x):
    return round(x / cols_sum * 100, 2)

df['col1_cum_perc'] = df['col1_cum'].apply(col_cum_perc)

# get the location index of each desired percentage
perc_lists = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
perc_dict = {}
for perc_list in perc_lists:
    index = df.loc[(df.col1_cum_perc - perc_list).abs() < 0.18].index.values[-1]
    perc_dict[perc_list] = index

perc_dict_dict = {'percentage': list(perc_dict.keys()),
                  'count': list(perc_dict.values())}
df_perc_dict = pd.DataFrame.from_dict(perc_dict_dict)
df_perc_dict_index = df_perc_dict.set_index('count')

# use matplotlib to draw the cumulative plot
plt.style.use(['dark_background'])
fig = plt.figure(figsize=(9, 6))
plt.plot(df['col1_cum_perc'], color='red')
plt.annotate(str(df_perc_dict_index), (3000, 5), size=15, color='orange',
             bbox=dict(boxstyle='round, pad=0.2', fc='black', alpha=1))
plt.xlabel('Count of Accounts', fontsize=16)
plt.ylabel('Accumulated Amount Percentage', fontsize=16)
matplotlib.rc('ytick', labelsize=14)
matplotlib.rc('xtick', labelsize=14)
plt.show()
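The core cumulative-percentage computation can be verified on a tiny hypothetical column before plotting:

```python
import pandas as pd

df = pd.DataFrame({'col1': [50, 30, 15, 5]})
df['col1_cum'] = df['col1'].cumsum()
# each row's share of the running total, as a percentage
df['col1_cum_perc'] = (df['col1_cum'] / df['col1'].sum() * 100).round(2)
```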
