IPL Data Analysis and Visualization Project using Python

Data science is the study of data to extract knowledge and insights from the data and apply knowledge and actionable insights. In this tutorial, we will work on IPL Data Analysis and Visualization Project using Python where we will explore interesting insights from the data of IPL matches like most run by a player, most wicket taken by a player, and much more from IPL season 2008-2020.

So if you are an IPL cricket fan and love data analysis with Python this project is perfect for you.

IPL data analysis python project

Importing Libraries

In this tutorial, we will use NumPy and Pandas libraries of Python for data analysis and for data visualization Seaborn and Matplotlib libraries.

In [1]:
import math
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

IPL Dataset

Our IPL dataset contains ball by ball records from the first match played in the 2008 season till the complete 2020 season.

Importing IPL Dataset

We have imported the CSV dataset below with the help of pandas read_csv functions We can see the content of the dataset by using head() function.

In [2]:
[Out] :
match_id season start_date venue innings ball batting_team bowling_team striker non_striker runs_off_bat extras wides noballs byes legbyes wicket_type player_dismissed run over
0 335982 2008 2008-04-18 M.Chinnaswamy Stadium 1 0.1 Kolkata Knight Riders Royal Challengers Bangalore SC Ganguly BB McCullum 0 1 0.0 0.0 0.0 1.0 1 0
1 335982 2008 2008-04-18 M.Chinnaswamy Stadium 1 0.2 Kolkata Knight Riders Royal Challengers Bangalore BB McCullum SC Ganguly 0 0 0.0 0.0 0.0 0.0 0 0
2 335982 2008 2008-04-18 M.Chinnaswamy Stadium 1 0.3 Kolkata Knight Riders Royal Challengers Bangalore BB McCullum SC Ganguly 0 1 1.0 0.0 0.0 0.0 1 0
3 335982 2008 2008-04-18 M.Chinnaswamy Stadium 1 0.4 Kolkata Knight Riders Royal Challengers Bangalore BB McCullum SC Ganguly 0 0 0.0 0.0 0.0 0.0 0 0
4 335982 2008 2008-04-18 M.Chinnaswamy Stadium 1 0.5 Kolkata Knight Riders Royal Challengers Bangalore BB McCullum SC Ganguly 0 0 0.0 0.0 0.0 0.0 0 0

5 rows × 21 columns

Checking IPL Dataset Attributes

Before we proceed with our Python data analysis of IPL data, we should know what columns are present in the dataset, their count, and data type. For this, we use Pandas info() function.

In [3]:
[Out] :
<class 'pandas.core.frame.DataFrame'>
Int64Index: 193617 entries, 0 to 193616
Data columns (total 21 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   match_id          193617 non-null  int64  
 1   season            193617 non-null  int64  
 2   start_date        193617 non-null  object 
 3   venue             193617 non-null  object 
 4   innings           193617 non-null  int64  
 5   ball              193617 non-null  float64
 6   batting_team      193617 non-null  object 
 7   bowling_team      193617 non-null  object 
 8   striker           193617 non-null  object 
 9   non_striker       193617 non-null  object 
 10  bowler            193617 non-null  object 
 11  runs_off_bat      193617 non-null  int64  
 12  extras            193617 non-null  int64  
 13  wides             193617 non-null  float64
 14  noballs           193617 non-null  float64
 15  byes              193617 non-null  float64
 16  legbyes           193617 non-null  float64
 17  wicket_type       193617 non-null  object 
 18  player_dismissed  193617 non-null  object 
 19  run               193617 non-null  int64  
 20  over              193617 non-null  int64  
dtypes: float64(5), int64(7), object(9)
memory usage: 25.9+ MB

IPL Data Analysis and Visualization with Python

Now, with a basic understanding of the attributes let us now start our project of data analysis and visualization of the IPL dataset with Python. We will initially perform simple statistical analysis and then slowly build to more advanced analysis.

i) General Analysis of IPL Matches 

1. List of Seasons

We can get the list of seasons from the dataset by applying unique() function on the season column which confirms that our dataset contains data of matches played from season 2008-2020.

The data set we have includes the data of each and every match played from season 2008 to 2021.

In [4]:
[Out] :
array([2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2019,
       2018, 2020], dtype=int64)

2. First ball of IPL history

Each data point describes the match_id, season, start_date, venue, innings, ball, batting_team, bowling_team, striker, non_striker, bowler, runs_off_bat, extras, wides, no balls, byes, leg byes, wicket_type, player_dismissed, run which are self-explanatory.

Here we have fetched the first row of the data sets which corresponds to the first ball of the first match of IPL history played between KKR and RCB played on 4th July 2008.

In [5]:
[Out] :
Unnamed: 0                                    0
match_id                                 335982
season                                     2008
start_date                           2008-04-18
venue                     M Chinnaswamy Stadium
innings                                       1
ball                                        0.1
batting_team              Kolkata Knight Riders
bowling_team        Royal Challengers Bangalore
striker                              SC Ganguly
non_striker                         BB McCullum
bowler                                  P Kumar
runs_off_bat                                  0
extras                                        1
wides                                         0
noballs                                       0
byes                                          0
legbyes                                       1
Name: 0, dtype: object

3. Season Wise IPL Matches

We can find the number of matches played in each season by grouping the match_id, season column and counting out the data, and then calling the index out of it by dropping the first index layer that is the match_id.

We can see the visualization of the IPL matches using the Matlotlib library.

In [6] :
data = df.groupby(['match_id','season']).count().index.droplevel(level=0).value_counts().sort_index()
plt.xlabel('Matches Played')
[Out] :
IPL data analysis python project

4. Most IPL Matches played in a Venue

The analysis shows most of the IPL matches were played in Chennai, Mumbai, Kolkata, Banglore, and Delhi.

In [7]:
[Out] :
M.Chinnaswamy Stadium                                   80
Eden Gardens                                            77
Wankhede Stadium, Mumbai                                74
Arun Jaitley Stadium                                    74
Rajiv Gandhi International Stadium, Uppal               64
MA Chidambaram Stadium, Chepauk, Chennai                59
Punjab Cricket Association IS Bindra Stadium, Mohali    56
Sawai Mansingh Stadium                                  47
Dubai International Cricket Stadium                     33
Sheikh Zayed Stadium                                    29
Maharashtra Cricket Association Stadium                 21
Sharjah Cricket Stadium                                 18
Subrata Roy Sahara Stadium                              17
Dr DY Patil Sports Academy                              17
Kingsmead                                               15
Dr. Y.S. Rajasekhara Reddy ACA-VDCA Cricket Stadium     13
Sardar Patel Stadium, Motera                            12
SuperSport Park                                         12
Brabourne Stadium                                       11
Saurashtra Cricket Association Stadium                  10
Himachal Pradesh Cricket Association Stadium             9
Holkar Cricket Stadium                                   9
New Wanderers Stadium                                    8
JSCA International Stadium Complex                       7
Barabati Stadium                                         7
Newlands                                                 7
St George's Park                                         7
Shaheed Veer Narayan Singh International Stadium         6
Nehru Stadium                                            5
Green Park                                               4
Vidarbha Cricket Association Stadium, Jamtha             3
De Beers Diamond Oval                                    3
Buffalo Park                                             3
OUTsurance Oval                                          2
Name: venue, dtype: int64

5. IPL Matches Played by Each Team

We can find out the matches played by each team by the same process which is grouping the batting_team and the match_id column and counting the data and then dropping the first index layer which is match_id.

In [8]:
data = df['bowling_team'].value_counts().sort_values(ascending=False)
plt.xlabel('Matches Played')
[Out] :
IPL data analysis python project

ii) IPL Batting Analysis

6. Most Run Scored by IPL Teams

To calculate the most run scored by a team across all seasons we have grouped by Team and have summed up the run scored by them. And finally, sort them in descending order.

Without any surprise, MI is at the top of the list.

In [9]:
[Out] :
Mumbai Indians                 32329
Royal Challengers Bangalore    30255
Kings XI Punjab                30064
Kolkata Knight Riders          29419
Chennai Super Kings            28372
Rajasthan Royals               24542
Delhi Daredevils               24296
Sunrisers Hyderabad            19362
Deccan Chargers                11463
Pune Warriors                   6358
Delhi Capitals                  5309
Gujarat Lions                   4862
Rising Pune Supergiant          4533
Kochi Tuskers Kerala            1901
Name: run, dtype: int64

7. Most IPL Runs by a Batsman

From the below visualization we can see that the Run-Machine, Virat Kohli is at the top of this list with more than 6,000 runs followed by Suresh Raina and Shikhar Dhawan.

In [10]:
data = df.groupby(['striker'])['runs_off_bat'].sum().sort_values(ascending=False)[:10]
[Out] :
IPL data analysis python project

8. Avg Run by Teams in Powerplay

Team Delhi Capital has the best average in the powerplay with an average of 48 runs followed by SRH and RPS.

In [11]:
[Out] :
Sunrisers Hyderabad            47.959677
Delhi Capitals                 47.666667
Rising Pune Supergiant         47.433333
Kings XI Punjab                47.126316
Kolkata Knight Riders          46.390625
Delhi Daredevils               45.714286
Deccan Chargers                45.560000
Mumbai Indians                 45.551724
Chennai Super Kings            45.264045
Rajasthan Royals               44.912500
Royal Challengers Bangalore    44.820513
Pune Warriors                  42.111111
Name: run, dtype: float64

9. Most IPL Century by a Player

The Universe Boss Chris Gayle is at the top of the list in scoring the most number of centuries in IPL history. He has hit six tons and has scored 4804 runs in IPL.

His former teammate Virat Kohli has scored five hundred’s and he is at the second spot in the list followed by Watson, AB de Villiers, Brendon McCullum, and David Warner.

This can be calculated by grouping the columns striker and match_id and then calculating the sum.

In [12]:
runs = df.groupby(['striker','match_id'])['runs_off_bat'].sum()
runs[runs >= 100].droplevel(level=1).groupby('striker').count().sort_values(ascending=False)[:10]
[Out] :
CH Gayle          6
V Kohli           5
SR Watson         4
DA Warner         4
AB de Villiers    3
BA Stokes         2
S Dhawan          2
M Vijay           2
HM Amla           2
BB McCullum       2
Name: runs_off_bat, dtype: int64

10. Most IPL Fifty by Player

When a number of the fifties comes Warner is top in the list followed by Virat Kohli and Shikhar Dhawan. This will also be calculated by the same method as above, plus we have shown a bar graph visualization for better representation.

In [13]:
runs = df.groupby(['striker','start_date'])['runs_off_bat'].sum()
data = runs[runs >= 50].droplevel(level=1).groupby('striker').count().sort_values(ascending=False)[:10]
[Out] :
IPL data visualization python project

11. Orange Cap Holder Each Season

The batsman with the most runs in the tournament during the course of the season would wear the Orange Cap while fielding, with the overall leading run-scorer at the conclusion of the tournament winning the actual Orange Cap award on the day of the season’s final.

Shaun Marsh became the first winner of the award in 2008, the complete list is presented below from the dataset.

In [14]:
data = df.groupby(['season','striker'])['runs_off_bat'].sum().groupby('season').max()
for season,run in data.items():
    player = temp_df.loc[season][temp_df.loc[season]['runs_off_bat'] == run].index[0]
    print(season,'\t ',player,'\t\t',run)
[Out] :
Season    Player                 Runs                          
2008 	  SE Marsh 		 616
2009 	  ML Hayden 		 572
2010 	  SR Tendulkar 		 618
2011 	  CH Gayle 		 608
2012 	  CH Gayle 		 733
2013 	  MEK Hussey 		 733
2014 	  RV Uthappa 		 660
2015 	  DA Warner 		 562
2016 	  V Kohli 		 973
2017 	  DA Warner 		 641
2018 	  KS Williamson		 735
2019 	  DA Warner 		 692
2020 	  KL Rahul 		 676

12. Most Sixes in an IPL Inning

Chris Gayle has hit the highest number of sixes in an inning with the number being 17 in the entire IPL history. Followed by Brendon McCullum and AB De Villiers.

In [15]:
df[df['runs_off_bat'] == 6].groupby(['start_date','striker']).count()['season'].sort_values(ascending=False).droplevel(level=0)[:10]
[Out] :
CH Gayle          17
BB McCullum       13
CH Gayle          13
CH Gayle          12
AB de Villiers    12
AD Russell        11
ST Jayasuriya     11
M Vijay           11
CH Gayle          11
SV Samson         10
Name: season, dtype: int64

13. Most Boundary (4s) hit by a Batsman

The Indian Gabbar, Shikhar Dhawan is at the top of the list with more than 600 boundaries followed by Virat Kohli and David warner.

In [16]:
data = df[df['runs_off_bat'] == 4]['striker'].value_counts()[:10]
[Out] :
most boundries by batsman

14. Most runs in an IPL season by Player

The run machine, Virat Kohli is at the top of the list with 973 runs in 2016 season followed by David Warner and Kane Williamson with 848 and 735 runs in the 2016 and 2018 season respectively.

In [17]:
[Out] :
striker         season
V Kohli         2016      973
DA Warner       2016      848
KS Williamson   2018      735
MEK Hussey      2013      733
CH Gayle        2012      733
                2013      720
DA Warner       2019      692
AB de Villiers  2016      687
RR Pant         2018      684
KL Rahul        2020      676
Name: runs_off_bat, dtype: int64

15. No. of Sixes in IPL Seasons

2018 is the season with the most number of sixes hit. Followed by season 2019 and 2020 in the list of most sixes in a season.

In [19]:
data = df[df['runs_off_bat'] == 6].groupby('season').count()['match_id'].sort_values(ascending=False)
[Out] :

16. Highest Total by IPL Teams

Royal Challengers Bangalore is at the top of the list of highest run by a team. The match was played against Pune Warrior in the 2019 season.

In [20]:
[Out] :
Royal Challengers Bangalore    263
Royal Challengers Bangalore    248
Chennai Super Kings            246
Kolkata Knight Riders          245
Chennai Super Kings            240
Royal Challengers Bangalore    235
Kolkata Knight Riders          232
Kings XI Punjab                232
Sunrisers Hyderabad            231
Delhi Daredevils               231
Name: run, dtype: int64

17. Most IPL Sixes Hit by a batsman

The universe Boss, Chris gale is at the top of the list in the most hitting sixes followed by AB De Villiers and MS Dhoni

In [21]:
data = df[df['runs_off_bat'] == 6]['striker'].value_counts()[:10]
[Out] :
18. Highest Individual IPL Score

Chris Gayle playing against Pune Warrior has hit the highest individual score in the 2013 season. Brendon McCullum and Ab de Villiers are in the second and third positions on the list.

In [22]:
[Out] :
striker         start_date
CH Gayle        2013-04-23    175
BB McCullum     2008-04-18    158
AB de Villiers  2015-05-10    133
KL Rahul        2020-09-24    132
AB de Villiers  2016-05-14    129
CH Gayle        2012-05-17    128
RR Pant         2018-05-10    128
M Vijay         2010-04-03    127
DA Warner       2017-04-30    126
V Sehwag        2014-05-30    122
Name: runs_off_bat, dtype: int64

iii) Bowling Statistics

19. Most run conceded by a bowler in an inning

Basil Thampi playing for SRH against RCB in the 2008 season has conceded 70 runs and is at the top of the list followed by Bangladesh player Mujeeb Ur Rahman and Ishant Sharma.

In [18]:
[Out] :
Basil Thampi        70
Mujeeb Ur Rahman    66
I Sharma            66
Sandeep Sharma      66
PJ Cummins          65
UT Yadav            65
AS Rajpoot          64
S Kaul              64
VR Aaron            63
TA Boult            63
Name: run, dtype: int64

20. Purple Cap Holders

The bowler with the most wickets in the tournament during the course of the season would wear the Purple Cap while fielding, with the overall leading wicket-taker at the conclusion of the tournament winning the actual Purple Cap award on the day of the season’s final.

Below is the list of bowlers with purple caps.

In [23]:
lst = 'caught,bowled,lbw,stumped,caught and bowled,hit wicket'
data = df[df['wicket_type'].apply(lambda x: True if x in lst and x != ' ' else False)].groupby(['season','bowler']).count()['ball']
for (season,bowler),wicket in data.items():
    if season == val:
        print(season,'\t ',bowler,'\t\t',wicket)
        val = season
[Out] :
Season    Player              Runs                          
2008 	  Sohail Tanvir 	 22
2009 	  A Kumble 		 21
2011 	  MM Patel 		 22
2012 	  M Morkel 		 25
2013 	  DJ Bravo 		 32
2014 	  MM Sharma 		 23
2015 	  A Nehra 		 22
2016 	  B Kumar 		 23
2017 	  B Kumar 		 26
2018 	  AJ Tye 		 24
2019 	  DL Chahar 		 22
2020 	  A Nortje 		 22

21. Most IPL Wickets by a Bowler

Srilankan bowler Malinga is at the top of the list with 170 wickets followed by Amit Mishra and Push Chawla with 160 and 156 wickets respectively.

In [24]:
lst = 'caught,bowled,lbw,stumped,caught and bowled,hit wicket'
df[df['wicket_type'].apply(lambda x: True if x in lst and x != ' ' else False)]['bowler'].value_counts()[:10]
[Out] :
SL Malinga         170
A Mishra           160
PP Chawla          156
DJ Bravo           153
Harbhajan Singh    150
R Ashwin           138
B Kumar            136
SP Narine          127
YS Chahal          121
UT Yadav           119
Name: bowler, dtype: int64

22. Most Dot Ball by a Bowler

The Indian bowler Harbhajan Singh has bowled the most number of Dot balls followed by R. Ashwin and Bhuvneshwar Kumar

In [25]:
data = df[df['run'] == 0].groupby('bowler').count()['match_id'].sort_values(ascending=False)[:10]
plt.xlabel('Dot Balls')
[Out] :
IPL data visualization python project

23. Most Maiden over by a Bowler

Indian right-hand medium-pacer bowler Praveen Kumar is at the top of the list with the most maiden overs followed by Irfan Pathan and Dale Stain.

In [26]:
data = df.groupby(['start_date','bowler','over'])['run'].sum()
data = data[data.values == 0].droplevel(level=[0,2])
[Out] :
P Kumar           12
IK Pathan          9
DW Steyn           8
SL Malinga         8
B Kumar            7
DS Kulkarni        7
Sandeep Sharma     6
DJ Bravo           6
DL Chahar          5
Z Khan             5
Name: bowler, dtype: int64

24. Most Wickets by an IPL Team

The Mumbai Indian has taken the most number of wickets in IPL followed by Royal Challengers Banglore and Chennai Super Kings

In [27]:
lst = 'caught,bowled,lbw,stumped,caught and bowled,hit wicket'
data = df[df['wicket_type'].apply(lambda x: True if x in lst and x != ' ' else False)]['bowling_team'].value_counts()
[Out] :
IPL data visualization python project

25. Most No Balls by an IPL team

Royal Challengers Bangalore has given most no balls followed by Mumbai Indians and Chennai Super Kings

In [28]:
[Out] :
Royal Challengers Bangalore    105.0
Chennai Super Kings             95.0
Rajasthan Royals                94.0
Mumbai Indians                  94.0
Kolkata Knight Riders           90.0
Delhi Daredevils                73.0
Kings XI Punjab                 71.0
Sunrisers Hyderabad             53.0
Deccan Chargers                 49.0
Pune Warriors                   24.0
Delhi Capitals                  20.0
Gujarat Lions                   17.0
Kochi Tuskers Kerala            11.0
Rising Pune Supergiant           8.0
Name: noballs, dtype: float64

26. Most No Balls by an IPL Bowler

Indian bowler S Sreesanth has bowled the most number of no balls followed by Jasprit Bumrah and Amit Mishra

In [29]:
df[df['noballs'] != 0]['bowler'].value_counts()[:10]
[Out] :
JJ Bumrah      23
S Sreesanth    23
A Mishra       21
I Sharma       21
UT Yadav       19
SL Malinga     18
AB Dinda       14
RP Singh       13
M Morkel       13
JA Morkel      13
Name: bowler, dtype: int64

27. Most run given by a team in Extras

Mumbai Indians have given the most number of extras (byes, no balls, wides) followed by Kolkata Knight Riders and Kings XI Punjab.

In [30]:
data = df.groupby(['batting_team'])['extras'].agg('sum').sort_values(ascending=False)
[Out] :

28. Most Wides Conceded by an IPL team

Mumbai Indians has given most wides followed by Kolkata Knight Riders and Royal Challengers Bangalore

In [31]:
[Out] :
Mumbai Indians                 1000.0
Kolkata Knight Riders           905.0
Royal Challengers Bangalore     805.0
Kings XI Punjab                 786.0
Chennai Super Kings             783.0
Delhi Daredevils                717.0
Rajasthan Royals                652.0
Sunrisers Hyderabad             543.0
Deccan Chargers                 279.0
Pune Warriors                   169.0
Gujarat Lions                   134.0
Rising Pune Supergiant          118.0
Delhi Capitals                  114.0
Kochi Tuskers Kerala             89.0
Name: wides, dtype: float64



Hope you liked our project on IPL Data analysis and Visualization using Python. We just listed some basics to medium-advanced analysis over here, to give you an idea of how to use the data set. You can come up with your own data analysis of IPL data with Python libraries and even do some machine learning projects.

IPL Dataset Download

The IPL dataset used in this tutorial can be downloaded from this link. Enjoy exploring it!

  • Afham Fardeen

    This is Afham Fardeen, who loves the field of Machine Learning and enjoys reading and writing on it. The idea of enabling a machine to learn strikes me.

Follow Us

9 Responses

      1. Hey!! Can i get the same dataset which is used in this program. please send the link or the dataset same as used in this. i couldn’t found on kaggle. please help.

        1. Hello Vishwajeet, the link for downloading the IPL dataset used in this tutorial has been updated at the end of the article.

  1. Hey!! I am getting this error TypeError: ‘in ‘ requires string as left operand, not float. In program no 20. purple cap Holders, please help.

Leave a Reply

Your email address will not be published. Required fields are marked *