
A Beginner’s Guide to Exploratory Data Analysis with Python

February 27, 2024



Introduction

What is Exploratory Data Analysis?

Exploratory Data Analysis (EDA) is one of the main components of the data science life cycle. It is a technique for understanding the various aspects of a dataset. To perform EDA with Python, you will need libraries such as Pandas, NumPy, Matplotlib, and Seaborn. Pandas is used for loading and exploring the data, while Matplotlib and Seaborn are used for plotting the dataset to get more insight from it.

Even if you are new to the field and don’t have much practice with these libraries, you can still learn EDA with Python from this article, because each part of the code is described clearly.

DataFrame Built-In Functions

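.head() shows the first 5 records by default.

.tail() shows the last 5 records by default.

.shape gives the shape of the dataset as (number of rows, number of columns). It is an attribute, so it is written without parentheses.

.describe() returns summary statistics (count, mean, standard deviation, minimum, quartiles, maximum) for the numeric columns by default.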
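As a quick illustration of the four built-ins above, here is a minimal sketch on a tiny made-up DataFrame. The column names and values are hypothetical and only stand in for a real dataset.

```python
import pandas as pd

# Hypothetical toy data; not taken from the IPL dataset.
df = pd.DataFrame({
    "match_id": range(1, 9),
    "runs": [10, 25, 0, 40, 5, 0, 15, 30],
})

first_rows = df.head()      # first 5 rows by default
last_rows = df.tail(3)      # pass a number to change how many rows you get
n_rows, n_cols = df.shape   # attribute, not a method: no parentheses
summary = df.describe()     # summary statistics for the numeric columns

print(first_rows)
print(last_rows)
print(n_rows, n_cols)
print(summary)
```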

Import Libraries

The first step of exploring any dataset is to import the required libraries.
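A typical import cell for the libraries named in the introduction might look like this. The aliases are the common community conventions, not something the dataset requires.

```python
import numpy as np               # numerical arrays
import pandas as pd              # loading and exploring tabular data
import matplotlib.pyplot as plt  # basic plotting
import seaborn as sns            # statistical plots built on Matplotlib
```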


Upload the Dataset

Pandas is the library used to load and manipulate the data. So, we will use the pandas alias ‘pd’ to read the dataset into the Jupyter notebook.
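Loading is usually a single call to pd.read_csv. The sketch below reads a small in-memory CSV with made-up rows so that it runs anywhere; in your notebook you would pass the path of your own file instead (the file name in the comment is an assumption).

```python
import io
import pandas as pd

# In a notebook you would typically write something like
#   df = pd.read_csv("matches.csv")   # hypothetical file name
# Here a few made-up rows stand in for the real file.
csv_text = """id,season,city,winner
1,2008,City A,Team A
2,2008,City B,Team B
3,2009,City A,Team A
"""

df = pd.read_csv(io.StringIO(csv_text))
print(df.head())
```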

Dataset of IPL


.shape

The shape of the data shows the number of records (636) and the number of columns (18) in the dataset.

Shape of Dataset


.describe()

This is how .describe() summarizes the numeric columns of the data.

Details of Column


Null value

To see if there are any null values in the data, you can count the nulls in each column.

The output shows that ‘umpire3’ is entirely null, so we can remove that column.
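One common way to count nulls per column is isnull().sum(). This is a sketch on made-up rows; only the ‘umpire3’ column name comes from the article, the rest is hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up rows; 'umpire3' is entirely empty, as described in the article.
df = pd.DataFrame({
    "winner": ["Team A", "Team B", None, "Team A"],
    "umpire3": [np.nan, np.nan, np.nan, np.nan],
})

null_counts = df.isnull().sum()  # number of nulls in each column
# Columns where every single row is null
all_null_cols = null_counts[null_counts == len(df)].index.tolist()

print(null_counts)
print(all_null_cols)
```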

Null Values


Removing a column

Sometimes we find columns in the dataset that are not useful or don’t contain any information. In such a scenario we drop the extra columns.

In my analysis, I found that the column ‘dl_applied’ gives no useful information and the column ‘umpire3’ is entirely null, therefore these two columns are dropped.
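Dropping the two columns can be sketched with DataFrame.drop. The rows here are made up; the two column names are the ones mentioned above.

```python
import pandas as pd

# Made-up rows with the two columns the article decides to drop.
df = pd.DataFrame({
    "winner": ["Team A", "Team B"],
    "dl_applied": [0, 0],
    "umpire3": [None, None],
})

# drop returns a new DataFrame, so assign the result back
df = df.drop(columns=["dl_applied", "umpire3"])
print(df.columns.tolist())
```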

Dropping Columns


Correlation

Correlation is a statistical method to see how strongly variables are related to each other. The correlation chart shows some positive and some negative values, but in our case it’s not that informative, so we have to explore the variables’ relationships separately.
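A correlation matrix can be computed with DataFrame.corr on the numeric columns. The sketch below uses made-up rows, and the column names other than ‘season’ are assumptions about how the dataset is laid out.

```python
import pandas as pd

# Made-up rows; 'win_by_runs' and 'win_by_wickets' are assumed column names.
df = pd.DataFrame({
    "season": [2008, 2008, 2009, 2009, 2010],
    "win_by_runs": [0, 12, 35, 0, 4],
    "win_by_wickets": [6, 0, 0, 8, 0],
    "winner": ["Team A", "Team B", "Team A", "Team C", "Team B"],
})

# Keep only numeric columns, then compute pairwise Pearson correlations
corr = df.select_dtypes(include="number").corr()
print(corr.round(2))
```

Values close to 1 or -1 indicate a strong linear relationship, and values near 0 indicate a weak one. If you want a chart, this matrix is what you would pass to a heatmap.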

Correlation


Pairplot

A pair plot draws a plot for every pair of numeric variables in the dataset, so you get to know which columns contain more information and can explore those variables further.

Pair-plot


If hue is set, the points are colored according to the values of the chosen column.

Season’s pair plot


Player of match

To see how many players won the “player of the match” title, and how many times each of them won it.

Player of the match


Top 10 players from “Player of match” column

To see the 10 best performers recorded in the dataset.
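Counting awards per player and keeping the top performers can be sketched with value_counts. The player names are placeholders, and the column name ‘player_of_match’ is an assumption.

```python
import pandas as pd

# Placeholder names; 'player_of_match' is an assumed column name.
df = pd.DataFrame({
    "player_of_match": ["Player X", "Player Y", "Player X",
                        "Player Z", "Player X", "Player Y"],
})

awards = df["player_of_match"].value_counts()  # sorted from most to fewest
n_players = df["player_of_match"].nunique()    # how many different winners
top_10 = awards.head(10)                       # at most 10 rows

print(n_players)
print(top_10)
```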

Player-of-match


Bar-plot of top 3 players

Plots let us see the data visually for better understanding. The given plot shows that “Ch Gayle” has won more player-of-the-match titles than the other two players.
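A bar plot of the top 3 can be drawn with Matplotlib as in the sketch below. The names and counts are made up, the column name is an assumption, and the Agg backend line can be removed in a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen backend; not needed inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Placeholder names: X wins 4 times, Y 3 times, Z twice, W once.
df = pd.DataFrame({
    "player_of_match": ["Player X"] * 4 + ["Player Y"] * 3
                       + ["Player Z"] * 2 + ["Player W"],
})

top_3 = df["player_of_match"].value_counts().head(3)

fig, ax = plt.subplots()
ax.bar(top_3.index.tolist(), top_3.tolist())
ax.set_xlabel("Player")
ax.set_ylabel("Player-of-the-match awards")
ax.set_title("Top 3 players")
```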

Bar-plot of 3 Best Players


Type of Match Results

Matches mostly have a normal result, which means one team wins and the other loses. A tie means both teams have the same score, and sometimes there is no result, which means the match was cancelled for some reason.

To see what types of results have been recorded in the data, we count each result type.
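A sketch of that count, with the share of each type added for context. The column name ‘result’ and its labels are assumptions, and the rows are made up.

```python
import pandas as pd

# Made-up rows; the column name and the labels are assumptions.
df = pd.DataFrame({
    "result": ["normal", "normal", "tie", "normal", "no result"],
})

result_counts = df["result"].value_counts()                # absolute counts
result_share = df["result"].value_counts(normalize=True)   # fractions of 1

print(result_counts)
print(result_share)
```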

Type of Match Results


Toss Winners

To see the number of tosses won by each team.

Toss Winners


Won-by-runs

If you want to see the number of matches won by the team that batted first, you can count the matches won by runs for each team. This shows the best-performing and worst-performing teams by the number of such wins.
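One way to sketch this is to keep only the rows with a positive run margin and then count the winners. The column names ‘winner’ and ‘win_by_runs’ are assumptions, and the rows are made up.

```python
import pandas as pd

# Made-up rows; a margin of 0 means the match was not won by runs.
df = pd.DataFrame({
    "winner": ["Team A", "Team B", "Team A", "Team C",
               "Team A", "Team B", "Team B"],
    "win_by_runs": [12, 0, 35, 4, 9, 20, 7],
})

won_by_runs = df[df["win_by_runs"] > 0]           # wins when batting first
wins_per_team = won_by_runs["winner"].value_counts()

most_wins = wins_per_team.idxmax()    # team with the most such wins
fewest_wins = wins_per_team.idxmin()  # team with the fewest such wins

print(wins_per_team)
```

Calling wins_per_team.head(5) on the result gives the top 5 teams.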

Won-by-Run


Plot of Win-by-runs

This plot shows the distribution of winning margins in runs. From the plot you can see that most matches won by runs were won by 1 to 10 runs, and that the largest winning margin was 140 runs.
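A distribution like this can be sketched with a Matplotlib histogram using 10-run bins. The margins are made up, the column name is an assumption, and the Agg backend line can be dropped in a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen backend; not needed inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Made-up margins; 0 means the match was not won by runs.
df = pd.DataFrame({"win_by_runs": [0, 3, 8, 0, 15, 22, 9, 0, 41, 6]})

margins = df.loc[df["win_by_runs"] > 0, "win_by_runs"]

fig, ax = plt.subplots()
# Bin edges every 10 runs: 0-10, 10-20, 20-30, 30-40, 40-50
counts, bin_edges, _ = ax.hist(margins, bins=[0, 10, 20, 30, 40, 50])
ax.set_xlabel("Winning margin (runs)")
ax.set_ylabel("Number of matches")

print(counts)
print(margins.max())
```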

Win-by-Runs


Top 5 win-by-run teams

To find the teams that won the most matches after batting first.

Teams win-by-run


Win-by-Run Percentage Distribution through Pie Chart

In the pie chart you can see each team’s percentage of the wins by runs.
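A pie chart of these shares can be sketched with Matplotlib. The team names and counts are made up, the column names are assumptions, and the Agg backend line is optional in a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen backend; not needed inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Made-up rows; only wins with a positive run margin are counted.
df = pd.DataFrame({
    "winner": ["Team A", "Team A", "Team B", "Team C", "Team A", "Team B"],
    "win_by_runs": [12, 35, 20, 4, 9, 7],
})

wins_per_team = df[df["win_by_runs"] > 0]["winner"].value_counts()
share = wins_per_team / wins_per_team.sum() * 100  # percentage per team

fig, ax = plt.subplots()
ax.pie(wins_per_team.tolist(),
       labels=wins_per_team.index.tolist(),
       autopct="%1.1f%%")  # print the percentage on each slice
ax.set_title("Share of wins by runs")

print(share.round(1))
```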

Pie Chart of Win-by-Run


Win-by-Wickets

To see which teams have won the most matches when batting second.
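The same idea works for wins by wickets. This sketch groups by team and uses nlargest to pick the top teams. The column names are assumptions and the rows are made up.

```python
import pandas as pd

# Made-up rows; a value of 0 means the match was not won by wickets.
df = pd.DataFrame({
    "winner": ["Team A", "Team B", "Team B", "Team C",
               "Team D", "Team B", "Team C"],
    "win_by_wickets": [6, 7, 4, 5, 0, 8, 3],
})

# Number of wins per team when batting second
chasing_wins = df[df["win_by_wickets"] > 0].groupby("winner").size()
top_3 = chasing_wins.nlargest(3)

print(top_3)
```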

win-by-wicket


Histogram on Win-by-Wickets

The histogram shows the number of matches won by teams that batted second.

Here you can see the exact number of matches won by wickets.

Top 3 teams Win-by-Wickets

These are the top 3 teams with the most wins by wickets, that is, teams that won when batting second.

Win-by-Wicket Percentage Distribution through Pie Chart

In the pie chart you can see each team’s percentage of the wins by wickets.

Year

To see the number of matches played every year.

Note: In the dataset the year column is named ‘season’. If you want to change the name, this can be done directly in the Excel/CSV file.
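As an alternative to editing the file by hand, the column can also be renamed inside pandas and then counted per year. This is a sketch on made-up rows; the new name ‘year’ is just an example.

```python
import pandas as pd

# Made-up rows using the 'season' column name from the dataset.
df = pd.DataFrame({"season": [2008, 2008, 2009, 2010, 2010, 2010]})

# Rename without touching the file on disk
df = df.rename(columns={"season": "year"})

# Matches per year, ordered by year rather than by count
matches_per_year = df["year"].value_counts().sort_index()
print(matches_per_year)
```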

Matches won in a City

To see the exact number of matches won in each city.
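Counting per city can be sketched with groupby, and pd.crosstab breaks the same counts down by winning team. The column names ‘city’ and ‘winner’ are assumptions, and the rows are made up.

```python
import pandas as pd

# Made-up rows; 'city' and 'winner' are assumed column names.
df = pd.DataFrame({
    "city": ["City A", "City A", "City B", "City B", "City A"],
    "winner": ["Team A", "Team A", "Team B", "Team A", "Team B"],
})

matches_per_city = df.groupby("city").size()        # matches in each city
wins_by_city = pd.crosstab(df["city"], df["winner"])  # city x team counts

print(matches_per_city)
print(wins_by_city)
```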

 

Toss-Winning V/S Match-Winning

To see if there is any relation between the toss-winning team and the match-winning team, we compare the two.

The output indicates that there is no relation between toss-winning and match-winning.
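One simple way to check this is to measure how often the toss winner is also the match winner. The column names ‘toss_winner’ and ‘winner’ are assumptions, and the rows are made up.

```python
import pandas as pd

# Made-up rows; in half of them the toss winner also wins the match.
df = pd.DataFrame({
    "toss_winner": ["Team A", "Team B", "Team A", "Team C", "Team B", "Team C"],
    "winner":      ["Team A", "Team A", "Team A", "Team B", "Team B", "Team A"],
})

toss_and_match = df["toss_winner"] == df["winner"]  # True where both match
share_same = toss_and_match.mean()                  # fraction of True values

print(toss_and_match.value_counts())
print(share_same)
```

Since each match has two teams, a share close to 0.5 is roughly what you would expect if the toss made no difference. This is only a rough check and not a formal statistical test.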

 

 

The main idea of EDA is to get the maximum useful information from the dataset. In this article, we have tried to look at every aspect of the data by using libraries, different charts and plots, built-in functions, and methods. We have derived some really interesting information about win-by-run, win-by-wicket, the top 3 players of the match, the number of matches played in a year, the number of matches won in a city, and the relationship between toss-winning and match-winning.

Exploratory Data Analysis live Sessions and Project Code through Omdena Pakistan Chapter

The EDA project was completed in a week and was based on six sessions. Every day there was a new agenda, decided collaboratively by the entire team of the Omdena Pakistan Chapter. The 1st session was on “Introduction to NumPy & Matplotlib”, the 2nd session was on “Introduction to Pandas & Python Dictionaries”, the 3rd session was on “DataFrames & Aggregating Data”, the 4th session was on “Slicing, Indexing, creating & Visualizing data”, the 5th session was on “Joining the data”, and the last was on “Filter join & merging”.

Below are the videos of Live sessions of “Exploratory Data Analysis”

Solved Project

How to contact Omdena Pakistan Chapter?

The Omdena Pakistan Chapter is highly active on social media platforms and can assist you with any issue. You can follow us on the social accounts mentioned below to stay updated about ongoing projects and workshops.

This article is written by Iqra Anwar.
