# IPL Data Analysis and Visualization Project using Python

Data science is the study of data to extract knowledge and actionable insights. In this tutorial, we will work on an IPL data analysis and visualization project using Python, exploring IPL match data from the 2008 to 2020 seasons to find insights such as the most runs scored by a player, the most wickets taken by a bowler, and much more.

So if you are an IPL cricket fan and love data analysis with Python, this project is perfect for you.

## Importing Libraries

In this tutorial, we use the NumPy and Pandas libraries for data analysis and the Seaborn and Matplotlib libraries for data visualization.

In [1]:
```python
import math
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
```

## IPL Dataset

Our IPL dataset contains ball-by-ball records from the first match of the 2008 season through the end of the 2020 season.

### Importing IPL Dataset

We import the CSV dataset with the pandas read_csv() function and preview the first five rows with the head() function.

In [2]:
```python
df = pd.read_csv('data.csv')
df.head()  # preview the first five rows
```
[Out] :
| | match_id | season | start_date | venue | innings | ball | batting_team | bowling_team | striker | non_striker | … | runs_off_bat | extras | wides | noballs | byes | legbyes | wicket_type | player_dismissed | run | over |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 335982 | 2008 | 2008-04-18 | M.Chinnaswamy Stadium | 1 | 0.1 | Kolkata Knight Riders | Royal Challengers Bangalore | SC Ganguly | BB McCullum | … | 0 | 1 | 0.0 | 0.0 | 0.0 | 1.0 | | | 1 | 0 |
| 1 | 335982 | 2008 | 2008-04-18 | M.Chinnaswamy Stadium | 1 | 0.2 | Kolkata Knight Riders | Royal Challengers Bangalore | BB McCullum | SC Ganguly | … | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | | | 0 | 0 |
| 2 | 335982 | 2008 | 2008-04-18 | M.Chinnaswamy Stadium | 1 | 0.3 | Kolkata Knight Riders | Royal Challengers Bangalore | BB McCullum | SC Ganguly | … | 0 | 1 | 1.0 | 0.0 | 0.0 | 0.0 | | | 1 | 0 |
| 3 | 335982 | 2008 | 2008-04-18 | M.Chinnaswamy Stadium | 1 | 0.4 | Kolkata Knight Riders | Royal Challengers Bangalore | BB McCullum | SC Ganguly | … | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | | | 0 | 0 |
| 4 | 335982 | 2008 | 2008-04-18 | M.Chinnaswamy Stadium | 1 | 0.5 | Kolkata Knight Riders | Royal Challengers Bangalore | BB McCullum | SC Ganguly | … | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | | | 0 | 0 |

5 rows × 21 columns

### Checking IPL Dataset Attributes

Before we proceed with our Python data analysis of IPL data, we should know what columns are present in the dataset, their count, and data type. For this, we use Pandas info() function.

In [3]:
```df.info()
```
[Out] :
```<class 'pandas.core.frame.DataFrame'>
Int64Index: 193617 entries, 0 to 193616
Data columns (total 21 columns):
#   Column            Non-Null Count   Dtype
---  ------            --------------   -----
0   match_id          193617 non-null  int64
1   season            193617 non-null  int64
2   start_date        193617 non-null  object
3   venue             193617 non-null  object
4   innings           193617 non-null  int64
5   ball              193617 non-null  float64
6   batting_team      193617 non-null  object
7   bowling_team      193617 non-null  object
8   striker           193617 non-null  object
9   non_striker       193617 non-null  object
10  bowler            193617 non-null  object
11  runs_off_bat      193617 non-null  int64
12  extras            193617 non-null  int64
13  wides             193617 non-null  float64
14  noballs           193617 non-null  float64
15  byes              193617 non-null  float64
16  legbyes           193617 non-null  float64
17  wicket_type       193617 non-null  object
18  player_dismissed  193617 non-null  object
19  run               193617 non-null  int64
20  over              193617 non-null  int64
dtypes: float64(5), int64(7), object(9)
memory usage: 25.9+ MB```
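
The info() output lists start_date as object, which means the dates were read as plain strings. As a minimal sketch, assuming a small hypothetical extract with only a few of the columns, the dates can be parsed while loading by passing parse_dates to read_csv. The CSV text below is illustrative and is not the tutorial's data.csv file.

```python
import io
import pandas as pd

# Hypothetical three-row extract with a subset of the dataset's columns
csv_text = """match_id,season,start_date,ball,striker,runs_off_bat,extras
335982,2008,2008-04-18,0.1,SC Ganguly,0,1
335982,2008,2008-04-18,0.2,BB McCullum,0,0
335982,2008,2008-04-18,0.3,BB McCullum,0,1
"""

# parse_dates converts start_date to datetime64 instead of leaving it as object
sample = pd.read_csv(io.StringIO(csv_text), parse_dates=['start_date'])
print(sample.head(2))
print(sample.dtypes)
```

With a datetime column, date-based filtering and sorting no longer depend on string comparison.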

## IPL Data Analysis and Visualization with Python

With a basic understanding of the attributes, let us start the data analysis and visualization of the IPL dataset with Python. We will begin with simple statistical analysis and gradually build up to more advanced analysis.

## i) General Analysis of IPL Matches

### 1. List of Seasons

We can get the list of seasons by applying the unique() function to the season column, which confirms that the dataset covers every season from 2008 to 2020.

In [4]:
```df['season'].unique()
```
[Out] :
```array([2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2019,
2018, 2020], dtype=int64)```

### 2. First ball of IPL history

Each row describes one ball through the columns match_id, season, start_date, venue, innings, ball, batting_team, bowling_team, striker, non_striker, bowler, runs_off_bat, extras, wides, noballs, byes, legbyes, wicket_type, player_dismissed, run, and over, most of which are self-explanatory.

Here we fetch the first row of the dataset, which corresponds to the first ball of the first match in IPL history, played between KKR and RCB on 18 April 2008.

In [5]:
```df.iloc[0]
```
[Out] :
```Unnamed: 0                                    0
match_id                                 335982
season                                     2008
start_date                           2008-04-18
innings                                       1
ball                                        0.1
batting_team              Kolkata Knight Riders
bowling_team        Royal Challengers Bangalore
striker                              SC Ganguly
non_striker                         BB McCullum
bowler                                  P Kumar
runs_off_bat                                  0
extras                                        1
wides                                         0
noballs                                       0
byes                                          0
legbyes                                       1
wicket_type
player_dismissed
Name: 0, dtype: object```

### 3. Season Wise IPL Matches

We can find the number of matches played in each season by grouping on the match_id and season columns, so that each match appears once in the resulting index, then dropping the match_id level from that index and counting how often each season occurs.

The bar chart below is drawn with Seaborn, which is built on top of Matplotlib.

In [6] :
```plt.figure(figsize=(10,8))
data = df.groupby(['match_id','season']).count().index.droplevel(level=0).value_counts().sort_index()
sns.barplot(y=data.index,x=data,orient='h')
plt.xlabel('Matches Played')
plt.ylabel('Season')
plt.show()
```
[Out] :
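
The same count can also be reached more directly. The following is a minimal sketch, using hypothetical toy rows rather than the real dataset, that counts the distinct match_id values per season with nunique(). It is an alternative to the index-based approach above, not the tutorial's own method.

```python
import pandas as pd

# Toy ball-by-ball rows: three matches across two seasons (hypothetical IDs)
balls = pd.DataFrame({
    'match_id': [1, 1, 1, 2, 2, 3],
    'season':   [2008, 2008, 2008, 2008, 2008, 2009],
})

# Count distinct matches per season rather than rows (balls)
matches_per_season = balls.groupby('season')['match_id'].nunique()
print(matches_per_season)
```

Because the result is already indexed by season, it can be passed to sns.barplot in the same way as the data series above.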

### 4. Most IPL Matches played in a Venue

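The output below shows that M.Chinnaswamy Stadium in Bangalore has hosted the most IPL matches (80), followed by Eden Gardens in Kolkata (77), with venues in Hyderabad, Chennai, and Mohali also among the most used.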

In [7]:
```df.groupby(['venue','match_id']).count().droplevel(level=1).index.value_counts()
```
[Out] :
```M.Chinnaswamy Stadium                                   80
Eden Gardens                                            77
Rajiv Gandhi International Stadium, Uppal               64
MA Chidambaram Stadium, Chepauk, Chennai                59
Punjab Cricket Association IS Bindra Stadium, Mohali    56
Dr DY Patil Sports Academy                              17
Dr. Y.S. Rajasekhara Reddy ACA-VDCA Cricket Stadium     13
SuperSport Park                                         12
Newlands                                                 7
St George's Park                                         7
Shaheed Veer Narayan Singh International Stadium         6
Green Park                                               4
Vidarbha Cricket Association Stadium, Jamtha             3
De Beers Diamond Oval                                    3
Buffalo Park                                             3
OUTsurance Oval                                          2
Name: venue, dtype: int64```

### 5. IPL Matches Played by Each Team

We can find the number of matches played by each team by grouping on the batting_team column and counting the distinct match_id values in each group. Counting the rows of batting_team or bowling_team directly would count balls, not matches.

In [8]:
```python
plt.figure(figsize=(10,8))
# Count distinct matches per team (plain row counts would count balls instead)
data = df.groupby('batting_team')['match_id'].nunique().sort_values(ascending=False)
sns.barplot(y=data.index,x=data,orient='h')
plt.xlabel('Matches Played')
plt.ylabel('Team')
plt.show()
```
[Out] :

## ii) IPL Batting Analysis

### 6. Most Run Scored by IPL Teams

To find the most runs scored by a team across all seasons, we group by batting_team, sum the run column, and sort the totals in descending order.

Unsurprisingly, MI is at the top of the list.

In [9]:
```df.groupby(['batting_team'])['run'].sum().sort_values(ascending=False)
```
[Out] :
```batting_team
Mumbai Indians                 32329
Royal Challengers Bangalore    30255
Kings XI Punjab                30064
Kolkata Knight Riders          29419
Chennai Super Kings            28372
Rajasthan Royals               24542
Delhi Daredevils               24296
Deccan Chargers                11463
Pune Warriors                   6358
Delhi Capitals                  5309
Gujarat Lions                   4862
Rising Pune Supergiant          4533
Kochi Tuskers Kerala            1901
Name: run, dtype: int64```

### 7. Most IPL Runs by a Batsman

From the below visualization we can see that the Run-Machine, Virat Kohli is at the top of this list with more than 6,000 runs followed by Suresh Raina and Shikhar Dhawan.

In [10]:
```python
plt.figure(figsize=(10,8))
data = df.groupby(['striker'])['runs_off_bat'].sum().sort_values(ascending=False)[:10]
sns.barplot(y=data.index,x=data,orient='h')
plt.xlabel('Runs')     # the x-axis holds the run totals
plt.ylabel('Batsman')  # the y-axis holds the batsman names
plt.show()
```
[Out] :

### 8. Avg Run by Teams in Powerplay

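In the output below, Delhi Capitals average about 47.7 runs in the powerplay (the first six overs), followed by Rising Pune Supergiant and Kings XI Punjab. Note that the trailing [2:] slice drops the two highest averages from the sorted result, so the two best powerplay teams are not shown; remove the slice to see the complete ranking.

In [11]:
```python
# Sum runs per match and team in overs 0-5, then average across matches.
# The trailing [2:] skips the first two entries of the sorted result.
df[df['over']<6].groupby(['match_id','batting_team'])['run'].sum().groupby('batting_team').mean().sort_values(ascending=False)[2:]
```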
[Out] :
```batting_team
Delhi Capitals                 47.666667
Rising Pune Supergiant         47.433333
Kings XI Punjab                47.126316
Kolkata Knight Riders          46.390625
Delhi Daredevils               45.714286
Deccan Chargers                45.560000
Mumbai Indians                 45.551724
Chennai Super Kings            45.264045
Rajasthan Royals               44.912500
Royal Challengers Bangalore    44.820513
Pune Warriors                  42.111111
Name: run, dtype: float64```
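
The powerplay average is a two-step aggregation: first total the runs per match and team, then average those totals per team. The following is a minimal sketch on hypothetical toy rows (teams A and B are made up) that spells out the intermediate result.

```python
import pandas as pd

# Hypothetical toy rows; 'over' is zero-based as in the dataset
balls = pd.DataFrame({
    'match_id':     [1, 1, 1, 1, 2, 2, 2],
    'batting_team': ['A', 'A', 'A', 'B', 'A', 'A', 'B'],
    'over':         [0, 5, 6, 2, 1, 3, 4],
    'run':          [4, 6, 1, 2, 1, 1, 6],
})

powerplay = balls[balls['over'] < 6]  # overs 0-5 are the first six overs

# Step 1: powerplay total for each (match, team) pair
per_match = powerplay.groupby(['match_id', 'batting_team'])['run'].sum()

# Step 2: average those totals for each team
average = per_match.groupby('batting_team').mean().sort_values(ascending=False)
print(average)
```

Team A scores 10 and 2 in its two powerplays, so its average is 6.0; team B scores 2 and 6, giving 4.0.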

### 9. Most IPL Century by a Player

The Universe Boss, Chris Gayle, tops the list for the most centuries in IPL history. He has hit six tons and has scored 4804 runs in the IPL.

His former teammate Virat Kohli is second with five hundreds, followed by Shane Watson and David Warner with four each and AB de Villiers with three.

This is calculated by grouping on the striker and match_id columns, summing runs_off_bat to get each batsman's score per match, and counting the scores of 100 or more.

In [12]:
```runs = df.groupby(['striker','match_id'])['runs_off_bat'].sum()
runs[runs >= 100].droplevel(level=1).groupby('striker').count().sort_values(ascending=False)[:10]
```
[Out] :
```striker
CH Gayle          6
V Kohli           5
SR Watson         4
DA Warner         4
AB de Villiers    3
BA Stokes         2
S Dhawan          2
M Vijay           2
HM Amla           2
BB McCullum       2
Name: runs_off_bat, dtype: int64```

### 10. Most IPL Fifty by Player

For half-centuries, David Warner tops the list, followed by Virat Kohli and Shikhar Dhawan. The method is the same as above, with the threshold lowered to 50 runs (so scores of 100 or more are counted as well) and with start_date used in place of match_id to identify a match. A bar chart makes the comparison easier to read.

In [13]:
```plt.figure(figsize=(10,8))
runs = df.groupby(['striker','start_date'])['runs_off_bat'].sum()
data = runs[runs >= 50].droplevel(level=1).groupby('striker').count().sort_values(ascending=False)[:10]
sns.barplot(y=data.index,x=data,orient='h')
plt.xlabel('Half-Centuries')
plt.ylabel('Batsman')
plt.show()
```
[Out] :
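
Because the filter above keeps every score of 50 or more, centuries are included in the half-century count. The following is a minimal sketch on hypothetical toy rows, with made-up strikers X and Y, of separating the two bands if that is preferred.

```python
import pandas as pd

# Hypothetical toy rows; each row stands in for a block of runs in one match
balls = pd.DataFrame({
    'striker':      ['X', 'X', 'X', 'Y', 'Y'],
    'match_id':     [1, 1, 2, 1, 2],
    'runs_off_bat': [60, 45, 100, 99, 30],
})

# Score of each striker in each match
scores = balls.groupby(['striker', 'match_id'])['runs_off_bat'].sum()

# Scores of 100 or more
centuries = scores[scores >= 100].groupby('striker').count()

# Scores from 50 to 99 only, so centuries are not counted twice
fifties = scores[(scores >= 50) & (scores < 100)].groupby('striker').count()

print(centuries)
print(fifties)
```

Whether a century should also count as a fifty is a convention, so the original filter is not wrong; this variant only makes the choice explicit.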

### 11. Orange Cap Holder Each Season

The batsman with the most runs in the tournament during the course of the season wears the Orange Cap while fielding, and the overall leading run-scorer at the conclusion of the tournament wins the actual Orange Cap award on the day of the season's final.

Shaun Marsh became the first winner of the award in 2008. The complete list computed from the dataset is presented below.

In [14]:
```python
# Total runs for every (season, striker) pair
season_runs = df.groupby(['season','striker'])['runs_off_bat'].sum()
# Highest individual total in each season
data = season_runs.groupby('season').max()
temp_df = pd.DataFrame(season_runs)
print("{0:10}{1:20}{2:30}".format("Season","Player","Runs"))
for season,run in data.items():
    # First striker whose season total equals that season's maximum
    player = temp_df.loc[season][temp_df.loc[season]['runs_off_bat'] == run].index[0]
    print(season,'\t ',player,'\t\t',run)
```
[Out] :
```Season    Player                 Runs
2008 	  SE Marsh 		 616
2009 	  ML Hayden 		 572
2010 	  SR Tendulkar 		 618
2011 	  CH Gayle 		 608
2012 	  CH Gayle 		 733
2013 	  MEK Hussey 		 733
2014 	  RV Uthappa 		 660
2015 	  DA Warner 		 562
2016 	  V Kohli 		 973
2017 	  DA Warner 		 641
2018 	  KS Williamson		 735
2019 	  DA Warner 		 692
2020 	  KL Rahul 		 676```
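
The loop above can also be written without explicit iteration. The following is a minimal sketch on hypothetical toy rows that sorts the season totals and keeps the first row per season with drop_duplicates. It is one possible alternative, and ties are resolved arbitrarily in favour of whichever row sorts first.

```python
import pandas as pd

# Hypothetical toy rows standing in for the ball-by-ball data
balls = pd.DataFrame({
    'season':       [2008, 2008, 2008, 2009, 2009],
    'striker':      ['A', 'B', 'A', 'B', 'C'],
    'runs_off_bat': [4, 6, 3, 1, 2],
})

# One row per (season, striker) with the total runs
totals = balls.groupby(['season', 'striker'], as_index=False)['runs_off_bat'].sum()

orange_cap = (
    totals.sort_values('runs_off_bat', ascending=False)
          .drop_duplicates('season')   # keeps the highest total per season
          .sort_values('season')
          .reset_index(drop=True)
)
print(orange_cap)
```

The result is a regular DataFrame, so it can be printed, plotted, or merged with other per-season tables.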

### 12. Most Sixes in an IPL Inning

Chris Gayle has hit the most sixes in a single innings in IPL history, with 17. Brendon McCullum is next with 13, a figure Gayle has also reached, and AB de Villiers has an innings with 12.

In [15]:
```df[df['runs_off_bat'] == 6].groupby(['start_date','striker']).count()['season'].sort_values(ascending=False).droplevel(level=0)[:10]
```
[Out] :
```striker
CH Gayle          17
BB McCullum       13
CH Gayle          13
CH Gayle          12
AB de Villiers    12
ST Jayasuriya     11
M Vijay           11
CH Gayle          11
SV Samson         10
Name: season, dtype: int64```

### 13. Most Boundary (4s) hit by a Batsman

The Indian Gabbar, Shikhar Dhawan, is at the top of the list with more than 600 fours, followed by Virat Kohli and David Warner.

In [16]:
```plt.figure(figsize=(10,8))
data = df[df['runs_off_bat'] == 4]['striker'].value_counts()[:10]
sns.barplot(y=data.index,x=data,orient='h')
plt.xlabel('Fours')
plt.ylabel('Batsman')
plt.show()```
[Out] :

### 14. Most runs in an IPL season by Player

The run machine, Virat Kohli is at the top of the list with 973 runs in 2016 season followed by David Warner and Kane Williamson with 848 and 735 runs in the 2016 and 2018 season respectively.

In [17]:
```df.groupby(['striker','season'])['runs_off_bat'].sum().sort_values(ascending=False)[:10]
```
[Out] :
```
striker         season
V Kohli         2016      973
DA Warner       2016      848
KS Williamson   2018      735
MEK Hussey      2013      733
CH Gayle        2012      733
                2013      720
DA Warner       2019      692
AB de Villiers  2016      687
RR Pant         2018      684
KL Rahul        2020      676
Name: runs_off_bat, dtype: int64
```

### 15. No. of Sixes in IPL Seasons

2018 is the season with the most sixes, followed by 2019 and 2020.

In [19]:
```plt.figure(figsize=(10,8))
data = df[df['runs_off_bat'] == 6].groupby('season').count()['match_id'].sort_values(ascending=False)
sns.barplot(y=data.index,x=data,orient='h')
plt.xlabel('Sixes')
plt.ylabel('Season')
plt.show()
```
[Out] :

### 16. Highest Total by IPL Teams

Royal Challengers Bangalore hold the highest team total, 263, scored against Pune Warriors in the 2013 season.

In [20]:
```python
# Select 'run' before summing so that only that column is aggregated
df.groupby(['start_date','batting_team'])['run'].sum().droplevel(level=0).sort_values(ascending=False)[:10]
```
[Out] :
```batting_team
Royal Challengers Bangalore    263
Royal Challengers Bangalore    248
Chennai Super Kings            246
Kolkata Knight Riders          245
Chennai Super Kings            240
Royal Challengers Bangalore    235
Kolkata Knight Riders          232
Kings XI Punjab                232
Delhi Daredevils               231
Name: run, dtype: int64```

### 17. Most IPL Sixes Hit by a batsman

The Universe Boss, Chris Gayle, is at the top of the list for the most sixes, followed by AB de Villiers and MS Dhoni.

In [21]:
```plt.figure(figsize=(10,8))
data = df[df['runs_off_bat'] == 6]['striker'].value_counts()[:10]
sns.barplot(y=data.index,x=data,orient='h')
plt.xlabel('Sixes')
plt.ylabel('Batsman')
plt.show()
```
[Out] :

### 18. Highest Individual IPL Score

Chris Gayle's 175 against Pune Warriors in the 2013 season is the highest individual score. Brendon McCullum (158) and AB de Villiers (133) are second and third on the list.

In [22]:
```df.groupby(['striker','start_date'])['runs_off_bat'].sum().sort_values(ascending=False)[:10]
```
[Out] :
```striker         start_date
CH Gayle        2013-04-23    175
BB McCullum     2008-04-18    158
AB de Villiers  2015-05-10    133
KL Rahul        2020-09-24    132
AB de Villiers  2016-05-14    129
CH Gayle        2012-05-17    128
RR Pant         2018-05-10    128
M Vijay         2010-04-03    127
DA Warner       2017-04-30    126
V Sehwag        2014-05-30    122
Name: runs_off_bat, dtype: int64```

## iii) Bowling Statistics

### 19. Most run conceded by a bowler in an inning

Basil Thampi, playing for SRH against RCB in the 2018 season, conceded 70 runs and is at the top of the list, followed by Afghanistan's Mujeeb Ur Rahman, Ishant Sharma, and Sandeep Sharma with 66 each.

In [18]:
```df.groupby(['bowler','start_date'])['run'].sum().droplevel(level=1).sort_values(ascending=False)[:10]
```
[Out] :
```bowler
Basil Thampi        70
Mujeeb Ur Rahman    66
I Sharma            66
Sandeep Sharma      66
PJ Cummins          65
AS Rajpoot          64
S Kaul              64
VR Aaron            63
TA Boult            63
Name: run, dtype: int64```
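
The totals above sum the run column, which appears to hold every run scored on the ball (runs_off_bat plus extras). In cricket scoring, byes and leg byes are not charged to the bowler, so the following minimal sketch, on hypothetical toy rows and under the assumption that run = runs_off_bat + extras, subtracts them before summing. Penalty runs are ignored here.

```python
import pandas as pd

# Hypothetical toy rows; 'run' is assumed to be runs_off_bat + extras
balls = pd.DataFrame({
    'bowler':   ['P', 'P', 'P', 'Q'],
    'match_id': [1, 1, 1, 1],
    'run':      [4, 1, 1, 6],
    'byes':     [0.0, 1.0, 0.0, 0.0],
    'legbyes':  [0.0, 0.0, 1.0, 0.0],
})

# Runs charged to the bowler exclude byes and leg byes
balls['bowler_runs'] = balls['run'] - balls['byes'] - balls['legbyes']

conceded = balls.groupby(['bowler', 'match_id'])['bowler_runs'].sum()
print(conceded.sort_values(ascending=False))
```

Grouping on match_id rather than start_date also keeps two matches played on the same day separate.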

### 20. Purple Cap Holders

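The bowler with the most wickets in the tournament during the course of the season wears the Purple Cap while fielding, and the overall leading wicket-taker at the conclusion of the tournament wins the actual Purple Cap award on the day of the season's final.

Below is the list produced from the dataset. Note that the code keeps only the 30 highest season tallies and, after sorting the index, prints the alphabetically first bowler among them for each season. A season can therefore be absent (2010 here), and the printed bowler is not guaranteed to be that season's actual Purple Cap winner.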

In [23]:
```python
lst = 'caught,bowled,lbw,stumped,caught and bowled,hit wicket'
# Keep only dismissals credited to the bowler, then count them per season and bowler
data = df[df['wicket_type'].apply(lambda x: True if x in lst and x != ' ' else False)].groupby(['season','bowler']).count()['ball']
# Keep the 30 highest season tallies, then order them by season and bowler
data=data.sort_values(ascending=False)[:30].sort_index(level=0)
val=0
print("{0:10}{1:20}{2:30}".format("Season","Player","Wickets"))
for (season,bowler),wicket in data.items():
    # Print only the first bowler encountered for each season
    if season != val:
        print(season,'\t ',bowler,'\t\t',wicket)
        val = season
```
[Out] :
```Season    Player              Wickets
2008 	  Sohail Tanvir 	 22
2009 	  A Kumble 		 21
2011 	  MM Patel 		 22
2012 	  M Morkel 		 25
2013 	  DJ Bravo 		 32
2014 	  MM Sharma 		 23
2015 	  A Nehra 		 22
2016 	  B Kumar 		 23
2017 	  B Kumar 		 26
2018 	  AJ Tye 		 24
2019 	  DL Chahar 		 22
2020 	  A Nortje 		 22```
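
The loop above prints one bowler per season from the 30 highest tallies. The following is a minimal sketch, on hypothetical toy rows, of selecting the bowler with the most wickets in every season using idxmax. The list of dismissal types mirrors the one in the code above, and the names balls, tally, and leaders are illustrative.

```python
import pandas as pd

# Hypothetical toy rows; ' ' marks a ball without a dismissal
balls = pd.DataFrame({
    'season':      [2008, 2008, 2008, 2008, 2008, 2009, 2009],
    'bowler':      ['A', 'A', 'B', 'B', 'B', 'A', 'C'],
    'wicket_type': ['caught', 'run out', 'bowled', 'lbw', ' ', 'lbw', ' '],
})

# Dismissal types credited to the bowler (run outs are excluded)
bowler_wickets = ['caught', 'bowled', 'lbw', 'stumped', 'caught and bowled', 'hit wicket']
dismissals = balls[balls['wicket_type'].isin(bowler_wickets)]

# One row per (season, bowler) with the wicket count
tally = dismissals.groupby(['season', 'bowler']).size().reset_index(name='wickets')

# idxmax gives the row label of the highest tally within each season
leaders = tally.loc[tally.groupby('season')['wickets'].idxmax()]
print(leaders)
```

Every season that has at least one qualifying dismissal appears in the result. When two bowlers are tied, idxmax returns the first one, so ties need separate handling.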

### 21. Most IPL Wickets by a Bowler

Sri Lankan bowler Lasith Malinga is at the top of the list with 170 wickets, followed by Amit Mishra and Piyush Chawla with 160 and 156 wickets respectively.

In [24]:
```lst = 'caught,bowled,lbw,stumped,caught and bowled,hit wicket'
df[df['wicket_type'].apply(lambda x: True if x in lst and x != ' ' else False)]['bowler'].value_counts()[:10]
```
[Out] :
```SL Malinga         170
A Mishra           160
PP Chawla          156
DJ Bravo           153
Harbhajan Singh    150
R Ashwin           138
B Kumar            136
SP Narine          127
YS Chahal          121
Name: bowler, dtype: int64```
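
The filter above tests `x in lst` against a comma-separated string, which is a substring check rather than a membership check. That is why the extra `x != ' '` condition is needed. The following minimal sketch, with a hypothetical sample of wicket_type values, shows the pitfall and an alternative based on isin() with a real list.

```python
import pandas as pd

lst = 'caught,bowled,lbw,stumped,caught and bowled,hit wicket'

# Substring checks accept fragments that are not dismissal types
print('' in lst)         # True: the empty string is a substring of every string
print('and' in lst)      # True: matches part of 'caught and bowled'
print('run out' in lst)  # False

# Hypothetical sample of wicket_type values, including blanks
wicket_types = pd.Series(['caught', '', ' ', 'run out', 'lbw'])

# Membership in a list matches whole values only
valid = ['caught', 'bowled', 'lbw', 'stumped', 'caught and bowled', 'hit wicket']
mask = wicket_types.isin(valid)
print(mask.tolist())
```

With isin(), blank values are rejected without a special case, and the mask can be used directly as `df[mask]`.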

### 22. Most Dot Ball by a Bowler

Indian bowler Harbhajan Singh has bowled the most dot balls, followed by R Ashwin and Bhuvneshwar Kumar.

In [25]:
```plt.figure(figsize=(10,8))
data = df[df['run'] == 0].groupby('bowler').count()['match_id'].sort_values(ascending=False)[:10]
sns.barplot(y=data.index,x=data,orient='h')
plt.xlabel('Dot Balls')
plt.ylabel('bowler')
plt.show()
```
[Out] :

### 23. Most Maiden over by a Bowler

Indian right-arm medium-pacer Praveen Kumar is at the top of the list with 12 maiden overs, followed by Irfan Pathan and Dale Steyn.

In [26]:
```data = df.groupby(['start_date','bowler','over'])['run'].sum()
data = data[data.values == 0].droplevel(level=[0,2])
data.index.value_counts()[:10]
```
[Out] :
```P Kumar           12
IK Pathan          9
DW Steyn           8
SL Malinga         8
B Kumar            7
DS Kulkarni        7
Sandeep Sharma     6
DJ Bravo           6
DL Chahar          5
Z Khan             5
Name: bowler, dtype: int64```
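
The calculation above treats any over whose run total is zero as a maiden. A stricter reading of the cricket definition requires six legal deliveries with no runs charged to the bowler, where byes and leg byes are not charged. The following is a minimal sketch of that stricter check on hypothetical toy rows. It assumes the wides, noballs, byes, and legbyes columns hold the runs from each kind of extra and that run is the total for the ball.

```python
import pandas as pd

# Hypothetical toy rows: P bowls a full scoreless over, Q bowls a full over
# with a single bye on the last ball, and P starts a third over of two balls.
balls = pd.DataFrame({
    'match_id': [1] * 14,
    'innings':  [1] * 14,
    'bowler':   ['P'] * 6 + ['Q'] * 6 + ['P'] * 2,
    'over':     [0] * 6 + [1] * 6 + [2] * 2,
    'run':      [0] * 6 + [0, 0, 0, 0, 0, 1] + [0, 0],
    'wides':    [0.0] * 14,
    'noballs':  [0.0] * 14,
    'byes':     [0.0] * 11 + [1.0] + [0.0] * 2,
    'legbyes':  [0.0] * 14,
})

# A legal delivery is neither a wide nor a no-ball
balls['legal'] = (balls['wides'] == 0) & (balls['noballs'] == 0)
# Runs charged to the bowler exclude byes and leg byes
balls['bowler_runs'] = balls['run'] - balls['byes'] - balls['legbyes']

overs = balls.groupby(['match_id', 'innings', 'bowler', 'over']).agg(
    legal_balls=('legal', 'sum'),
    bowler_runs=('bowler_runs', 'sum'),
)

# A maiden needs a complete over and no runs against the bowler
maidens = overs[(overs['legal_balls'] == 6) & (overs['bowler_runs'] == 0)]
maiden_counts = maidens.groupby('bowler').size()
print(maiden_counts)
```

Here Q's over still counts as a maiden because its only run is a bye, while P's two-ball over does not because it is incomplete.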

### 24. Most Wickets by an IPL Team

Mumbai Indians have taken the most wickets in the IPL, followed by Royal Challengers Bangalore and Chennai Super Kings.

In [27]:
```python
plt.figure(figsize=(10,8))
lst = 'caught,bowled,lbw,stumped,caught and bowled,hit wicket'
# Count bowler-credited dismissals for each bowling team
data = df[df['wicket_type'].apply(lambda x: True if x in lst and x != ' ' else False)]['bowling_team'].value_counts()
sns.barplot(y=data.index,x=data,orient='h')
plt.xlabel('Wickets')
plt.ylabel('Teams')
plt.show()
```
[Out] :

### 25. Most No Balls by an IPL team

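Royal Challengers Bangalore top this list with 105, followed by Chennai Super Kings (95) and then Rajasthan Royals and Mumbai Indians (94 each). Because the code groups by batting_team, these are the no-ball runs each team received while batting; to count the no-balls a team has bowled, group by bowling_team instead.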

In [28]:
```df.groupby(['batting_team'])['noballs'].agg('sum').sort_values(ascending=False)
```
[Out] :
```batting_team
Royal Challengers Bangalore    105.0
Chennai Super Kings             95.0
Rajasthan Royals                94.0
Mumbai Indians                  94.0
Kolkata Knight Riders           90.0
Delhi Daredevils                73.0
Kings XI Punjab                 71.0
Deccan Chargers                 49.0
Pune Warriors                   24.0
Delhi Capitals                  20.0
Gujarat Lions                   17.0
Kochi Tuskers Kerala            11.0
Rising Pune Supergiant           8.0
Name: noballs, dtype: float64```
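
The grouping above uses batting_team, which attributes each no-ball to the side that was batting. The following minimal sketch, on hypothetical toy rows with made-up teams A and B, shows how the same sum changes when grouped by bowling_team, which attributes it to the side that bowled it.

```python
import pandas as pd

# Hypothetical toy rows: team B bowls one no-ball, team A bowls two
balls = pd.DataFrame({
    'batting_team': ['A', 'A', 'B', 'B', 'B'],
    'bowling_team': ['B', 'B', 'A', 'A', 'A'],
    'noballs':      [1.0, 0.0, 1.0, 1.0, 0.0],
})

# No-ball runs received while batting
received = balls.groupby('batting_team')['noballs'].sum()

# No-ball runs conceded while bowling
conceded = balls.groupby('bowling_team')['noballs'].sum()

print(received)
print(conceded)
```

The same switch from batting_team to bowling_team applies to the extras and wides columns.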

### 26. Most No Balls by an IPL Bowler

Jasprit Bumrah and S Sreesanth have bowled the most no-balls, with 23 each, followed by Amit Mishra and Ishant Sharma with 21 each.

In [29]:
```df[df['noballs'] != 0]['bowler'].value_counts()[:10]
```
[Out] :
```JJ Bumrah      23
S Sreesanth    23
A Mishra       21
I Sharma       21
SL Malinga     18
AB Dinda       14
RP Singh       13
M Morkel       13
JA Morkel      13
Name: bowler, dtype: int64```

### 27. Most run given by a team in Extras

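Mumbai Indians are at the top of the list for extras (byes, leg byes, no-balls, and wides), followed by Kolkata Knight Riders and Kings XI Punjab. Because the code groups by batting_team, the totals are the extras each team received while batting; to measure the extras a team has given away, group by bowling_team instead.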

In [30]:
```plt.figure(figsize=(10,8))
data = df.groupby(['batting_team'])['extras'].agg('sum').sort_values(ascending=False)
sns.barplot(y=data.index,x=data,orient='h')
plt.xlabel('Runs')
plt.ylabel('Teams')
plt.show()
```
[Out] :

### 28. Most Wides Conceded by an IPL team

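Mumbai Indians top this list with 1,000 runs from wides, followed by Kolkata Knight Riders (905) and Royal Challengers Bangalore (805). Because the code groups by batting_team, these are wides bowled to each team while it was batting; group by bowling_team to count the wides a team has conceded.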

In [31]:
```df.groupby(['batting_team'])['wides'].agg('sum').sort_values(ascending=False)
```
[Out] :
```batting_team
Mumbai Indians                 1000.0
Kolkata Knight Riders           905.0
Royal Challengers Bangalore     805.0
Kings XI Punjab                 786.0
Chennai Super Kings             783.0
Delhi Daredevils                717.0
Rajasthan Royals                652.0
Deccan Chargers                 279.0
Pune Warriors                   169.0
Gujarat Lions                   134.0
Rising Pune Supergiant          118.0
Delhi Capitals                  114.0
Kochi Tuskers Kerala             89.0
Name: wides, dtype: float64```


## Conclusion

We hope you liked our project on IPL data analysis and visualization using Python. We have covered basic to moderately advanced analyses here to give you an idea of how to use the dataset. You can come up with your own analyses of the IPL data with Python libraries and even build machine learning projects on top of it.