Introduction
Data science is the study of data to extract knowledge and actionable insights. In this tutorial, we will work through an IPL data analysis and visualization project in Python, exploring insights from IPL matches such as the most runs scored by a player and the most wickets taken by a bowler, across the 2008-2020 seasons.
So if you are an IPL cricket fan and love data analysis with Python, this project is perfect for you.
Importing Libraries
In this tutorial, we will use the NumPy and Pandas libraries for data analysis and the Seaborn and Matplotlib libraries for data visualization.
import math
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
IPL Dataset
Our IPL dataset contains ball-by-ball records from the first match of the 2008 season through the end of the 2020 season.
Importing IPL Dataset
We import the CSV dataset below with the pandas read_csv() function. We can then view the first rows of the dataset with the head() function.
df=pd.read_csv('data.csv')
df.head()
| | match_id | season | start_date | venue | innings | ball | batting_team | bowling_team | striker | non_striker | … | runs_off_bat | extras | wides | noballs | byes | legbyes | wicket_type | player_dismissed | run | over |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 335982 | 2008 | 2008-04-18 | M.Chinnaswamy Stadium | 1 | 0.1 | Kolkata Knight Riders | Royal Challengers Bangalore | SC Ganguly | BB McCullum | … | 0 | 1 | 0.0 | 0.0 | 0.0 | 1.0 | | | 1 | 0 |
| 1 | 335982 | 2008 | 2008-04-18 | M.Chinnaswamy Stadium | 1 | 0.2 | Kolkata Knight Riders | Royal Challengers Bangalore | BB McCullum | SC Ganguly | … | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | | | 0 | 0 |
| 2 | 335982 | 2008 | 2008-04-18 | M.Chinnaswamy Stadium | 1 | 0.3 | Kolkata Knight Riders | Royal Challengers Bangalore | BB McCullum | SC Ganguly | … | 0 | 1 | 1.0 | 0.0 | 0.0 | 0.0 | | | 1 | 0 |
| 3 | 335982 | 2008 | 2008-04-18 | M.Chinnaswamy Stadium | 1 | 0.4 | Kolkata Knight Riders | Royal Challengers Bangalore | BB McCullum | SC Ganguly | … | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | | | 0 | 0 |
| 4 | 335982 | 2008 | 2008-04-18 | M.Chinnaswamy Stadium | 1 | 0.5 | Kolkata Knight Riders | Royal Challengers Bangalore | BB McCullum | SC Ganguly | … | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | | | 0 | 0 |
5 rows × 21 columns
Checking IPL Dataset Attributes
Before we proceed with our Python data analysis of IPL data, we should know what columns are present in the dataset, their count, and data type. For this, we use Pandas info() function.
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 193617 entries, 0 to 193616
Data columns (total 21 columns):
 #   Column            Non-Null Count   Dtype
---  ------            --------------   -----
 0   match_id          193617 non-null  int64
 1   season            193617 non-null  int64
 2   start_date        193617 non-null  object
 3   venue             193617 non-null  object
 4   innings           193617 non-null  int64
 5   ball              193617 non-null  float64
 6   batting_team      193617 non-null  object
 7   bowling_team      193617 non-null  object
 8   striker           193617 non-null  object
 9   non_striker       193617 non-null  object
 10  bowler            193617 non-null  object
 11  runs_off_bat      193617 non-null  int64
 12  extras            193617 non-null  int64
 13  wides             193617 non-null  float64
 14  noballs           193617 non-null  float64
 15  byes              193617 non-null  float64
 16  legbyes           193617 non-null  float64
 17  wicket_type       193617 non-null  object
 18  player_dismissed  193617 non-null  object
 19  run               193617 non-null  int64
 20  over              193617 non-null  int64
dtypes: float64(5), int64(7), object(9)
memory usage: 25.9+ MB
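The Non-Null Count column reports no missing values even for wicket_type and player_dismissed, which suggests that deliveries without a wicket are stored as blank strings rather than NaN. A minimal sketch of normalizing such blanks follows; the toy Series is made up for illustration and only the column name comes from the dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the wicket_type column (values are illustrative only).
wicket_type = pd.Series(['caught', ' ', '', 'bowled'])

# Turn empty or whitespace-only strings into NaN so pandas treats them as missing.
cleaned = wicket_type.replace(r'^\s*$', np.nan, regex=True)

# Count the deliveries on which a wicket actually fell.
wicket_balls = int(cleaned.notna().sum())
print(wicket_balls)
```

After this step, info() and isna() would report the blank entries as missing, which makes later wicket filters easier to reason about.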
IPL Data Analysis and Visualization with Python
Now that we have a basic understanding of the attributes, let us start the data analysis and visualization of the IPL dataset with Python. We will begin with simple statistical analysis and then gradually build up to more advanced analysis.
i) General Analysis of IPL Matches
1. List of Seasons
We can get the list of seasons by applying the unique() function to the season column. The result confirms that our dataset covers every season from 2008 to 2020.
df['season'].unique()
array([2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2019, 2018, 2020], dtype=int64)
2. First ball of IPL history
Each row describes one delivery through the columns match_id, season, start_date, venue, innings, ball, batting_team, bowling_team, striker, non_striker, bowler, runs_off_bat, extras, wides, noballs, byes, legbyes, wicket_type, player_dismissed, run, and over, which are largely self-explanatory.
Here we fetch the first row of the dataset, which corresponds to the first ball of the first match in IPL history, played between KKR and RCB on 18 April 2008.
df.iloc[0]
Unnamed: 0          0
match_id            335982
season              2008
start_date          2008-04-18
venue               M Chinnaswamy Stadium
innings             1
ball                0.1
batting_team        Kolkata Knight Riders
bowling_team        Royal Challengers Bangalore
striker             SC Ganguly
non_striker         BB McCullum
bowler              P Kumar
runs_off_bat        0
extras              1
wides               0
noballs             0
byes                0
legbyes             1
wicket_type
player_dismissed
Name: 0, dtype: object
3. Season Wise IPL Matches
We can find the number of matches played in each season by grouping on the match_id and season columns, taking the resulting index, and dropping its first level (match_id) so that each match contributes exactly one season label. Counting those labels gives the matches per season.
We then visualize the result with the Seaborn and Matplotlib libraries.
plt.figure(figsize=(10,8))
data = df.groupby(['match_id','season']).count().index.droplevel(level=0).value_counts().sort_index()
sns.barplot(y=data.index,x=data,orient='h')
plt.xlabel('Matches Played')
plt.ylabel('Season')
plt.show()
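As a cross-check on the chained groupby above, the same count can be sketched with nunique(). The toy DataFrame is made up for illustration and stands in for the real ball-by-ball data; only the match_id and season column names come from the dataset.

```python
import pandas as pd

# Toy ball-by-ball rows (made-up values): three matches across two seasons.
toy = pd.DataFrame({
    'match_id': [1, 1, 2, 2, 3, 3],
    'season':   [2008, 2008, 2008, 2008, 2009, 2009],
})

# Count distinct match_ids per season instead of counting ball rows.
matches_per_season = toy.groupby('season')['match_id'].nunique()
print(matches_per_season)
```

Both approaches should agree as long as every match_id belongs to a single season.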
4. Most IPL Matches played in a Venue
The analysis shows that most IPL matches were played in Bangalore, Kolkata, Mumbai, Delhi, and Hyderabad.
df.groupby(['venue','match_id']).count().droplevel(level=1).index.value_counts()
M.Chinnaswamy Stadium                                   80
Eden Gardens                                            77
Wankhede Stadium, Mumbai                                74
Arun Jaitley Stadium                                    74
Rajiv Gandhi International Stadium, Uppal               64
MA Chidambaram Stadium, Chepauk, Chennai                59
Punjab Cricket Association IS Bindra Stadium, Mohali    56
Sawai Mansingh Stadium                                  47
Dubai International Cricket Stadium                     33
Sheikh Zayed Stadium                                    29
Maharashtra Cricket Association Stadium                 21
Sharjah Cricket Stadium                                 18
Subrata Roy Sahara Stadium                              17
Dr DY Patil Sports Academy                              17
Kingsmead                                               15
Dr. Y.S. Rajasekhara Reddy ACA-VDCA Cricket Stadium     13
Sardar Patel Stadium, Motera                            12
SuperSport Park                                         12
Brabourne Stadium                                       11
Saurashtra Cricket Association Stadium                  10
Himachal Pradesh Cricket Association Stadium             9
Holkar Cricket Stadium                                   9
New Wanderers Stadium                                    8
JSCA International Stadium Complex                       7
Barabati Stadium                                         7
Newlands                                                 7
St George's Park                                         7
Shaheed Veer Narayan Singh International Stadium         6
Nehru Stadium                                            5
Green Park                                               4
Vidarbha Cricket Association Stadium, Jamtha             3
De Beers Diamond Oval                                    3
Buffalo Park                                             3
OUTsurance Oval                                          2
Name: venue, dtype: int64
5. IPL Matches Played by Each Team
We can find the matches played by each team with the same process: group by the match_id and batting_team columns, drop the first index level (match_id), and count how often each team appears.
plt.figure(figsize=(10,8))
# One (match_id, batting_team) pair per team per match, so the counts are matches, not balls
data = df.groupby(['match_id','batting_team']).count().index.droplevel(level=0).value_counts()
sns.barplot(y=data.index,x=data,orient='h')
plt.xlabel('Matches Played')
plt.ylabel('Team')
plt.show()
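Because the dataset has one row per delivery, counting rows for a team gives balls rather than matches. The sketch below contrasts the two on a made-up toy DataFrame; the team labels 'A', 'B', and 'C' are placeholders, and only the column names come from the dataset.

```python
import pandas as pd

# Toy deliveries: team A bats in matches 1 and 2, teams B and C in one match each.
toy = pd.DataFrame({
    'match_id':     [1, 1, 1, 1, 2, 2],
    'batting_team': ['A', 'A', 'B', 'B', 'A', 'C'],
    'bowling_team': ['B', 'B', 'A', 'A', 'C', 'A'],
})

# Rows per team: this counts deliveries faced, not matches.
balls_per_team = toy['batting_team'].value_counts()

# Distinct matches per team.
matches_per_team = (
    toy.groupby('batting_team')['match_id']
       .nunique()
       .sort_values(ascending=False)
)
print(matches_per_team)
```

This assumes every team bats at least once in each match it plays, which holds for completed matches.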
ii) IPL Batting Analysis
6. Most Runs Scored by IPL Teams
To calculate the most runs scored by a team across all seasons, we group by batting team, sum the runs scored, and sort the totals in descending order.
Unsurprisingly, Mumbai Indians (MI) top the list.
df.groupby(['batting_team'])['run'].sum().sort_values(ascending=False)
batting_team
Mumbai Indians                 32329
Royal Challengers Bangalore    30255
Kings XI Punjab                30064
Kolkata Knight Riders          29419
Chennai Super Kings            28372
Rajasthan Royals               24542
Delhi Daredevils               24296
Sunrisers Hyderabad            19362
Deccan Chargers                11463
Pune Warriors                   6358
Delhi Capitals                  5309
Gujarat Lions                   4862
Rising Pune Supergiant          4533
Kochi Tuskers Kerala            1901
Name: run, dtype: int64
7. Most IPL Runs by a Batsman
From the visualization below, we can see that the run machine, Virat Kohli, tops this list with more than 6,000 runs, followed by Suresh Raina and Shikhar Dhawan.
plt.figure(figsize=(10,8))
data = df.groupby(['striker'])['runs_off_bat'].sum().sort_values(ascending=False)[:10]
sns.barplot(y=data.index,x=data,orient='h')
plt.xlabel('Runs')
plt.ylabel('Batsman')
plt.show()
8. Avg Run by Teams in Powerplay
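Note that the [2:] slice in the code skips the first two entries of the sorted result. Among the teams shown, Sunrisers Hyderabad have the best powerplay average at about 48 runs, closely followed by Delhi Capitals and Rising Pune Supergiant.
df[df['over']<6].groupby(['match_id','batting_team'])['run'].sum().groupby('batting_team').mean().sort_values(ascending=False)[2:]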
batting_team
Sunrisers Hyderabad            47.959677
Delhi Capitals                 47.666667
Rising Pune Supergiant         47.433333
Kings XI Punjab                47.126316
Kolkata Knight Riders          46.390625
Delhi Daredevils               45.714286
Deccan Chargers                45.560000
Mumbai Indians                 45.551724
Chennai Super Kings            45.264045
Rajasthan Royals               44.912500
Royal Challengers Bangalore    44.820513
Pune Warriors                  42.111111
Name: run, dtype: float64
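The powerplay average is a two-step aggregation: first total the runs per match and team over the first six overs (over values 0 to 5), then average those totals per team. A minimal sketch on made-up data, where the team label 'A' and all run values are illustrative:

```python
import pandas as pd

# Toy deliveries for one team across two matches (made-up values).
toy = pd.DataFrame({
    'match_id':     [1, 1, 1, 2, 2, 2],
    'batting_team': ['A', 'A', 'A', 'A', 'A', 'A'],
    'over':         [0, 5, 6, 0, 3, 10],
    'run':          [4, 6, 1, 2, 2, 6],
})

# Step 1: keep powerplay deliveries only (overs are zero-indexed, so 0-5).
powerplay = toy[toy['over'] < 6]

# Step 2: powerplay total for each (match, team) pair.
per_match = powerplay.groupby(['match_id', 'batting_team'])['run'].sum()

# Step 3: average those per-match totals for each team.
avg_powerplay = per_match.groupby(level='batting_team').mean()
print(avg_powerplay)
```

Selecting the run column before calling sum() keeps the aggregation away from the text columns, which is usually faster and avoids type errors when those columns contain mixed values.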
9. Most IPL Centuries by a Player
The Universe Boss, Chris Gayle, tops the list with the most centuries in IPL history. He has hit six tons and has scored 4804 runs in the IPL.
His former teammate Virat Kohli is second with five hundreds, followed by Shane Watson and David Warner with four each and AB de Villiers with three.
This is calculated by grouping on the striker and match_id columns, summing the runs for each innings, and counting the innings of 100 or more.
runs = df.groupby(['striker','match_id'])['runs_off_bat'].sum()
runs[runs >= 100].droplevel(level=1).groupby('striker').count().sort_values(ascending=False)[:10]
striker
CH Gayle          6
V Kohli           5
SR Watson         4
DA Warner         4
AB de Villiers    3
BA Stokes         2
S Dhawan          2
M Vijay           2
HM Amla           2
BB McCullum       2
Name: runs_off_bat, dtype: int64
10. Most IPL Fifties by a Player
When it comes to fifties, Warner tops the list, followed by Virat Kohli and Shikhar Dhawan. The calculation follows the same method as above; because the filter is runs >= 50, centuries are counted as well. A bar chart makes the comparison easier to read.
plt.figure(figsize=(10,8))
runs = df.groupby(['striker','start_date'])['runs_off_bat'].sum()
data = runs[runs >= 50].droplevel(level=1).groupby('striker').count().sort_values(ascending=False)[:10]
sns.barplot(y=data.index,x=data,orient='h')
plt.xlabel('Half-Centuries')
plt.ylabel('Batsman')
plt.show()
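Since a score of 100 or more also passes a >= 50 filter, it can help to separate the two milestones. The sketch below is one way to do that on made-up data; the player labels 'X' and 'Y' and all scores are illustrative.

```python
import pandas as pd

# Toy deliveries: X scores 106 and 99 in two matches, Y scores exactly 50.
toy = pd.DataFrame({
    'striker':      ['X', 'X', 'X', 'X', 'X', 'Y', 'Y'],
    'match_id':     [1, 1, 1, 2, 2, 1, 1],
    'runs_off_bat': [60, 40, 6, 50, 49, 30, 20],
})

# Runs per batsman per match.
per_innings = toy.groupby(['striker', 'match_id'])['runs_off_bat'].sum()

# Innings of 100 or more, counted per batsman.
centuries = (per_innings >= 100).groupby(level='striker').sum()

# Innings of 50 to 99 only, so centuries are not double counted.
fifties = per_innings.between(50, 99).groupby(level='striker').sum()
print(centuries, fifties, sep='\n')
```

Grouping by match_id rather than start_date also guards against two matches on the same day being merged, although a single player cannot appear in both.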
11. Orange Cap Holder Each Season
The batsman with the most runs in the tournament during the course of the season would wear the Orange Cap while fielding, with the overall leading run-scorer at the conclusion of the tournament winning the actual Orange Cap award on the day of the season’s final.
Shaun Marsh became the first winner of the award in 2008. The complete list, computed from the dataset, is presented below.
# Highest individual run total in each season
data = df.groupby(['season','striker'])['runs_off_bat'].sum().groupby('season').max()
# Season totals for every batsman, used to look up who scored that maximum
temp_df=pd.DataFrame(df.groupby(['season','striker'])['runs_off_bat'].sum())
print("{0:10}{1:20}{2:30}".format("Season","Player","Runs"))
for season,run in data.items():
    player = temp_df.loc[season][temp_df.loc[season]['runs_off_bat'] == run].index[0]
    print(season,'\t ',player,'\t\t',run)
Season    Player            Runs
2008      SE Marsh          616
2009      ML Hayden         572
2010      SR Tendulkar      618
2011      CH Gayle          608
2012      CH Gayle          733
2013      MEK Hussey        733
2014      RV Uthappa        660
2015      DA Warner         562
2016      V Kohli           973
2017      DA Warner         641
2018      KS Williamson     735
2019      DA Warner         692
2020      KL Rahul          676
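The same lookup can be sketched more compactly with idxmax(), which returns the row label of each group's maximum. This is an alternative under the assumption of a toy DataFrame with made-up players 'P', 'Q', and 'R'; it is not the code that produced the list above.

```python
import pandas as pd

# Toy deliveries (made-up values) covering two seasons.
toy = pd.DataFrame({
    'season':       [2008, 2008, 2008, 2009, 2009],
    'striker':      ['P', 'Q', 'P', 'Q', 'R'],
    'runs_off_bat': [4, 6, 4, 1, 2],
})

# Season total for every batsman, kept as ordinary columns.
totals = toy.groupby(['season', 'striker'], as_index=False)['runs_off_bat'].sum()

# Row label of the highest total in each season, then select those rows.
orange_cap = totals.loc[totals.groupby('season')['runs_off_bat'].idxmax()]
print(orange_cap.to_string(index=False))
```

If two batsmen tie on runs in a season, idxmax() keeps only the first one it encounters, so a real analysis may need an explicit tie-break rule.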
12. Most Sixes in an IPL Innings
Chris Gayle holds the record for the most sixes in an innings in IPL history with 17. Brendon McCullum (13) and AB de Villiers (12) are the next players on the list, while Gayle himself appears three more times in the top ten.
df[df['runs_off_bat'] == 6].groupby(['start_date','striker']).count()['season'].sort_values(ascending=False).droplevel(level=0)[:10]
striker
CH Gayle          17
BB McCullum       13
CH Gayle          13
CH Gayle          12
AB de Villiers    12
AD Russell        11
ST Jayasuriya     11
M Vijay           11
CH Gayle          11
SV Samson         10
Name: season, dtype: int64
13. Most Boundaries (4s) Hit by a Batsman
The Indian Gabbar, Shikhar Dhawan, tops the list with more than 600 boundaries, followed by Virat Kohli and David Warner.
plt.figure(figsize=(10,8))
data = df[df['runs_off_bat'] == 4]['striker'].value_counts()[:10]
sns.barplot(y=data.index,x=data,orient='h')
plt.xlabel('Fours')
plt.ylabel('Batsman')
plt.show()
14. Most runs in an IPL season by Player
The run machine, Virat Kohli, tops the list with 973 runs in the 2016 season, followed by David Warner and Kane Williamson with 848 and 735 runs in the 2016 and 2018 seasons respectively.
df.groupby(['striker','season'])['runs_off_bat'].sum().sort_values(ascending=False)[:10]
striker         season
V Kohli         2016      973
DA Warner       2016      848
KS Williamson   2018      735
MEK Hussey      2013      733
CH Gayle        2012      733
                2013      720
DA Warner       2019      692
AB de Villiers  2016      687
RR Pant         2018      684
KL Rahul        2020      676
Name: runs_off_bat, dtype: int64
15. No. of Sixes in IPL Seasons
2018 is the season with the most sixes, followed by 2019 and 2020.
plt.figure(figsize=(10,8))
data = df[df['runs_off_bat'] == 6].groupby('season').count()['match_id'].sort_values(ascending=False)
sns.barplot(y=data.index,x=data,orient='h')
plt.xlabel('Sixes')
plt.ylabel('Season')
plt.show()
16. Highest Total by IPL Teams
Royal Challengers Bangalore hold the highest team total, 263, scored against Pune Warriors in the 2013 season.
df.groupby(['start_date','batting_team'])['run'].sum().droplevel(level=0).sort_values(ascending=False)[:10]
batting_team
Royal Challengers Bangalore    263
Royal Challengers Bangalore    248
Chennai Super Kings            246
Kolkata Knight Riders          245
Chennai Super Kings            240
Royal Challengers Bangalore    235
Kolkata Knight Riders          232
Kings XI Punjab                232
Sunrisers Hyderabad            231
Delhi Daredevils               231
Name: run, dtype: int64
17. Most IPL Sixes Hit by a Batsman
The Universe Boss, Chris Gayle, tops the list of most sixes, followed by AB de Villiers and MS Dhoni.
plt.figure(figsize=(10,8))
data = df[df['runs_off_bat'] == 6]['striker'].value_counts()[:10]
sns.barplot(y=data.index,x=data,orient='h')
plt.xlabel('Sixes')
plt.ylabel('Batsman')
plt.show()
18. Highest Individual Score
Chris Gayle holds the highest individual score, 175, made against Pune Warriors in the 2013 season. Brendon McCullum and AB de Villiers are second and third on the list.
df.groupby(['striker','start_date'])['runs_off_bat'].sum().sort_values(ascending=False)[:10]
striker         start_date
CH Gayle        2013-04-23    175
BB McCullum     2008-04-18    158
AB de Villiers  2015-05-10    133
KL Rahul        2020-09-24    132
AB de Villiers  2016-05-14    129
CH Gayle        2012-05-17    128
RR Pant         2018-05-10    128
M Vijay         2010-04-03    127
DA Warner       2017-04-30    126
V Sehwag        2014-05-30    122
Name: runs_off_bat, dtype: int64
iii) Bowling Statistics
19. Most Runs Conceded by a Bowler in an Innings
Basil Thampi, playing for SRH against RCB in the 2018 season, conceded 70 runs and tops the list, followed by Afghanistan's Mujeeb Ur Rahman, Ishant Sharma, and Sandeep Sharma with 66 each.
df.groupby(['bowler','start_date'])['run'].sum().droplevel(level=1).sort_values(ascending=False)[:10]
bowler
Basil Thampi        70
Mujeeb Ur Rahman    66
I Sharma            66
Sandeep Sharma      66
PJ Cummins          65
UT Yadav            65
AS Rajpoot          64
S Kaul              64
VR Aaron            63
TA Boult            63
Name: run, dtype: int64
20. Purple Cap Holders
The bowler with the most wickets in the tournament during the course of the season would wear the Purple Cap while fielding, with the overall leading wicket-taker at the conclusion of the tournament winning the actual Purple Cap award on the day of the season’s final.
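Below is the list produced from the dataset. The code keeps only the 30 highest bowler-season wicket totals and prints the first bowler listed for each season, so a season can be missing (2010 here) and the bowler shown is not guaranteed to be that season's leading wicket-taker.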
# Dismissals credited to the bowler (run outs and similar are excluded)
wicket_kinds = ['caught','bowled','lbw','stumped','caught and bowled','hit wicket']
# isin() matches whole values and treats blank or missing entries as False
data = df[df['wicket_type'].isin(wicket_kinds)].groupby(['season','bowler']).count()['ball']
# Keep the 30 largest bowler-season totals, then order them by season
data=data.sort_values(ascending=False)[:30].sort_index(level=0)
val=0
print("{0:10}{1:20}{2:30}".format("Season","Player","Wickets"))
for (season,bowler),wicket in data.items():
    # Print only the first row encountered for each season
    if season != val:
        print(season,'\t ',bowler,'\t\t',wicket)
        val = season
Season    Player            Wickets
2008      Sohail Tanvir     22
2009      A Kumble          21
2011      MM Patel          22
2012      M Morkel          25
2013      DJ Bravo          32
2014      MM Sharma         23
2015      A Nehra           22
2016      B Kumar           23
2017      B Kumar           26
2018      AJ Tye            24
2019      DL Chahar         22
2020      A Nortje          22
21. Most IPL Wickets by a Bowler
Sri Lankan bowler Lasith Malinga tops the list with 170 wickets, followed by Amit Mishra and Piyush Chawla with 160 and 156 wickets respectively.
lst = ['caught','bowled','lbw','stumped','caught and bowled','hit wicket']
df[df['wicket_type'].isin(lst)]['bowler'].value_counts()[:10]
SL Malinga         170
A Mishra           160
PP Chawla          156
DJ Bravo           153
Harbhajan Singh    150
R Ashwin           138
B Kumar            136
SP Narine          127
YS Chahal          121
UT Yadav           119
Name: bowler, dtype: int64
22. Most Dot Balls by a Bowler
The Indian bowler Harbhajan Singh has bowled the most dot balls, followed by R. Ashwin and Bhuvneshwar Kumar.
plt.figure(figsize=(10,8))
data = df[df['run'] == 0].groupby('bowler').count()['match_id'].sort_values(ascending=False)[:10]
sns.barplot(y=data.index,x=data,orient='h')
plt.xlabel('Dot Balls')
plt.ylabel('Bowler')
plt.show()
23. Most Maiden Overs by a Bowler
Indian right-arm medium pacer Praveen Kumar tops the list with the most maiden overs, followed by Irfan Pathan and Dale Steyn.
data = df.groupby(['start_date','bowler','over'])['run'].sum()
data = data[data.values == 0].droplevel(level=[0,2])
data.index.value_counts()[:10]
P Kumar           12
IK Pathan          9
DW Steyn           8
SL Malinga         8
B Kumar            7
DS Kulkarni        7
Sandeep Sharma     6
DJ Bravo           6
DL Chahar          5
Z Khan             5
Name: bowler, dtype: int64
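The calculation above treats any over whose run total is zero as a maiden. A stricter sketch, under the assumption that a maiden needs six legal deliveries and that byes and leg byes are not charged to the bowler, could look like the following. The balls() helper, the charged and legal columns, and all toy values are hypothetical and only serve the illustration.

```python
import pandas as pd

def balls(bowler, over, runs, legbyes=None):
    """Build toy delivery rows for one over (hypothetical helper)."""
    legbyes = legbyes or [0] * len(runs)
    return [
        dict(match_id=1, innings=1, bowler=bowler, over=over,
             runs_off_bat=r, wides=0, noballs=0, legbyes=lb)
        for r, lb in zip(runs, legbyes)
    ]

toy = pd.DataFrame(
    balls('A', 0, [0, 0, 0, 0, 0, 0], legbyes=[0, 0, 1, 0, 0, 0])  # leg bye only
    + balls('B', 1, [0, 0, 0, 1, 0, 0])                             # one run off the bat
    + balls('A', 2, [0, 0, 0])                                      # incomplete over
)

# Runs charged to the bowler: runs off the bat plus wides and no-balls.
toy['charged'] = toy['runs_off_bat'] + toy['wides'] + toy['noballs']
# A delivery is legal when it is neither a wide nor a no-ball.
toy['legal'] = ((toy['wides'] == 0) & (toy['noballs'] == 0)).astype(int)

per_over = toy.groupby(['match_id', 'innings', 'bowler', 'over']).agg(
    charged=('charged', 'sum'),
    legal_balls=('legal', 'sum'),
)

# Maiden: a complete over with nothing charged to the bowler.
maidens = per_over[(per_over['charged'] == 0) & (per_over['legal_balls'] == 6)]
maiden_counts = maidens.groupby(level='bowler').size()
print(maiden_counts)
```

Grouping on match_id and innings rather than start_date keeps overs from different innings apart; the counts this produces on the real data may differ from the list above.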
24. Most Wickets by an IPL Team
Mumbai Indians have taken the most wickets in the IPL, followed by Royal Challengers Bangalore and Chennai Super Kings.
plt.figure(figsize=(10,8))
lst = ['caught','bowled','lbw','stumped','caught and bowled','hit wicket']
data = df[df['wicket_type'].isin(lst)]['bowling_team'].value_counts()
sns.barplot(y=data.index,x=data,orient='h')
plt.xlabel('Wickets')
plt.ylabel('Teams')
plt.show()
25. Most No Balls by an IPL team
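The code below groups by batting_team, so the totals are the no-balls each team received while batting: Royal Challengers Bangalore lead with 105, followed by Chennai Super Kings with 95 and Rajasthan Royals and Mumbai Indians with 94 each. To count the no-balls a team bowled, group by bowling_team instead.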
df.groupby(['batting_team'])['noballs'].agg('sum').sort_values(ascending=False)
batting_team
Royal Challengers Bangalore    105.0
Chennai Super Kings             95.0
Rajasthan Royals                94.0
Mumbai Indians                  94.0
Kolkata Knight Riders           90.0
Delhi Daredevils                73.0
Kings XI Punjab                 71.0
Sunrisers Hyderabad             53.0
Deccan Chargers                 49.0
Pune Warriors                   24.0
Delhi Capitals                  20.0
Gujarat Lions                   17.0
Kochi Tuskers Kerala            11.0
Rising Pune Supergiant           8.0
Name: noballs, dtype: float64
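No-balls, wides, and other extras are conceded by the fielding side, so the grouping column decides whether a total means "conceded" or "received". A minimal sketch of the difference on made-up data, with teams 'A' and 'B' as placeholders:

```python
import pandas as pd

# Toy deliveries: team B bowls one no-ball, team A bowls three.
toy = pd.DataFrame({
    'batting_team': ['A', 'A', 'B', 'B'],
    'bowling_team': ['B', 'B', 'A', 'A'],
    'noballs':      [1.0, 0.0, 2.0, 1.0],
})

# No-balls conceded: attribute them to the team that was bowling.
conceded = toy.groupby('bowling_team')['noballs'].sum().sort_values(ascending=False)

# No-balls received: attribute them to the team that was batting.
received = toy.groupby('batting_team')['noballs'].sum().sort_values(ascending=False)
print(conceded, received, sep='\n')
```

The same swap of batting_team for bowling_team applies to the extras and wides columns.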
26. Most No Balls by an IPL Bowler
Jasprit Bumrah and S Sreesanth share the top spot with 23 no-balls each, followed by Amit Mishra and Ishant Sharma with 21 each.
df[df['noballs'] != 0]['bowler'].value_counts()[:10]
JJ Bumrah      23
S Sreesanth    23
A Mishra       21
I Sharma       21
UT Yadav       19
SL Malinga     18
AB Dinda       14
RP Singh       13
M Morkel       13
JA Morkel      13
Name: bowler, dtype: int64
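27. Most Runs Given by a Team in Extras
Grouped by batting_team as in the code below, Mumbai Indians top the chart for extras (byes, leg byes, no-balls, wides), followed by Kolkata Knight Riders and Kings XI Punjab. These are extras received while batting; grouping by bowling_team would give the extras each team conceded.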
plt.figure(figsize=(10,8))
data = df.groupby(['batting_team'])['extras'].agg('sum').sort_values(ascending=False)
sns.barplot(y=data.index,x=data,orient='h')
plt.xlabel('Runs')
plt.ylabel('Teams')
plt.show()
28. Most Wides Conceded by an IPL team
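Grouped by batting_team as in the code below, Mumbai Indians lead with 1000 wides, followed by Kolkata Knight Riders with 905 and Royal Challengers Bangalore with 805. These are wides received while batting; grouping by bowling_team would give the wides each team conceded.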
df.groupby(['batting_team'])['wides'].agg('sum').sort_values(ascending=False)
batting_team
Mumbai Indians                 1000.0
Kolkata Knight Riders           905.0
Royal Challengers Bangalore     805.0
Kings XI Punjab                 786.0
Chennai Super Kings             783.0
Delhi Daredevils                717.0
Rajasthan Royals                652.0
Sunrisers Hyderabad             543.0
Deccan Chargers                 279.0
Pune Warriors                   169.0
Gujarat Lions                   134.0
Rising Pune Supergiant          118.0
Delhi Capitals                  114.0
Kochi Tuskers Kerala             89.0
Name: wides, dtype: float64
Conclusion
We hope you liked our project on IPL data analysis and visualization using Python. We covered basic to moderately advanced analyses here to give you an idea of how to use the dataset. You can come up with your own analyses of IPL data with Python libraries and even build machine learning projects on it.
IPL Dataset Download
The IPL dataset used in this tutorial can be downloaded from this link. Enjoy exploring it!
This is Afham Fardeen, who loves the field of Machine Learning and enjoys reading and writing on it. The idea of enabling a machine to learn strikes me.
9 Responses
Hey !! Can I get the data set
Hello Sameer you can find the dataset on Kaggle.
Hey!! Can i get the same dataset which is used in this program. please send the link or the dataset same as used in this. i couldn’t found on kaggle. please help.
Hello Vishwajeet, the link for downloading the IPL dataset used in this tutorial has been updated at the end of the article.
Thank You so much.
link pls
Hello, the link for downloading the IPL dataset used in this tutorial has been updated at the end of the article.
Hey!! I am getting this error TypeError: ‘in ‘ requires string as left operand, not float. In program no 20. purple cap Holders, please help.