Introduction
As an initial step in machine learning or data science projects, we explore the data to understand it. When the data is handled with the pandas library, this exploration is straightforward thanks to functions such as describe(), head(), unique() and count(). In this article, we look at these functions and see, with examples, how each one is used for data exploration.
Importing Pandas Library
We start this tutorial by importing the pandas library, along with NumPy, which is used later to create a missing value.
import pandas as pd
import numpy as np
We start with the pandas describe() function.
Pandas Describe : describe()
The describe() function is used for generating descriptive statistics of a dataset.
It summarizes the central tendency, dispersion, and shape of a dataset’s distribution, excluding NaN values.
Syntax
pandas.DataFrame.describe(percentiles=None, include=None, exclude=None)
self : DataFrame or Series – The dataframe or series on which describe() is called. It is passed implicitly and is not written in the call.
percentiles : list-like of numbers – Here we provide the desired percentiles which should be included in the output. The default values are 0.25,0.5 and 0.75 i.e. 25th percentile, 50th percentile and 75th percentile. All the values should be between 0 and 1.
include : ‘all’, list-like of dtypes or None (optional) – The data types to include in the output. The default, None, summarizes only the numeric columns, while ‘all’ includes every column.
exclude : list-like of dtypes or None (optional) – The data types to leave out of the output.
As an output, we get summarized statistics of series or dataframe.
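The percentiles parameter described above can be illustrated with a short sketch. The series below is hypothetical sample data, not part of the article's examples, and the percentiles are chosen only for illustration.

```python
import pandas as pd

# Hypothetical sample data, chosen so the percentiles are easy to verify by hand
scores = pd.Series([10, 20, 30, 40, 50], name="scores")

# Ask for the 10th, 50th and 90th percentiles instead of the default 25/50/75
summary = scores.describe(percentiles=[0.1, 0.5, 0.9])
print(summary)

# Individual statistics can be read from the result by their label
print(summary["10%"], summary["90%"])
```

The result is itself a series, so any single statistic can be picked out by its label, which is handy when only one or two values are needed.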
Example 1: describing a series
Here we will apply describe() function over a series.
s = pd.Series([7, 9, 11])
s
0     7
1     9
2    11
dtype: int64
Calling describe() on this series returns descriptive statistics such as count, mean, std (standard deviation), min, max and the percentiles.
s.describe()
count     3.0
mean      9.0
std       2.0
min       7.0
25%       8.0
50%       9.0
75%      10.0
max      11.0
dtype: float64
Example 2: describing a series of text data
The pandas describe() function can also be used on non-numeric data, such as a series of strings.
s = pd.Series(['P', 'P', 'Q', 'R'])
s
0    P
1    P
2    Q
3    R
dtype: object
For such data, describe() reports the number of values (count), the number of distinct values (unique), the most frequent value (top) and how often it occurs (freq).
s.describe()
count     4
unique    3
top       P
freq      2
dtype: object
Example 3: describing a dataframe
As we mostly deal with dataframes, let’s see how they are described using pandas describe() function.
df = pd.DataFrame({'categorical': pd.Categorical(['A','B','C']),
'numeric': [3, 6, 9],
'object': ['P', 'Q', 'R']
})
df
| | categorical | numeric | object |
---|---|---|---|
0 | A | 3 | P |
1 | B | 6 | Q |
2 | C | 9 | R |
By default, describe() on a dataframe summarizes only the numeric columns, so only the numeric column appears in the output.
df.describe()
| | numeric |
---|---|
count | 3.0 |
mean | 6.0 |
std | 3.0 |
min | 3.0 |
25% | 4.5 |
50% | 6.0 |
75% | 7.5 |
max | 9.0 |
By passing include='all', we get descriptive statistics for every column, whatever its data type. Statistics that do not apply to a column are shown as NaN.
df.describe(include='all')
| | categorical | numeric | object |
---|---|---|---|
count | 3 | 3.0 | 3 |
unique | 3 | NaN | 3 |
top | C | NaN | R |
freq | 1 | NaN | 1 |
mean | NaN | 6.0 | NaN |
std | NaN | 3.0 | NaN |
min | NaN | 3.0 | NaN |
25% | NaN | 4.5 | NaN |
50% | NaN | 6.0 | NaN |
75% | NaN | 7.5 | NaN |
max | NaN | 9.0 | NaN |
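Besides 'all', the include and exclude parameters also accept lists of data types. The following is a minimal sketch that rebuilds the dataframe from this example and selects columns by type. Using np.number as the selector is one reasonable choice among several.

```python
import numpy as np
import pandas as pd

# Same dataframe as in the example above
df = pd.DataFrame({
    "categorical": pd.Categorical(["A", "B", "C"]),
    "numeric": [3, 6, 9],
    "object": ["P", "Q", "R"],
})

# Describe only the numeric columns (same as the default behaviour here)
numeric_summary = df.describe(include=[np.number])
print(numeric_summary)

# Describe everything except the numeric columns
non_numeric_summary = df.describe(exclude=[np.number])
print(non_numeric_summary)
```

The non-numeric summary contains the count, unique, top and freq rows instead of the mean, std and percentile rows.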
The next function in the list is the pandas head() function.
Pandas head : head()
The head() function returns the first n rows of an object. It is useful for quickly checking the data and data types an object contains.
Syntax
pandas.DataFrame.head(n=5)
n : int (default 5) – The number of rows to return.
The head() function returns an object of the same type containing the first n rows.
Example 1: Simple example of head() function
In this example, we look at how head() returns the first ‘n’ rows of a dataframe. Since no value is passed, the default of 5 rows is used.
stud = pd.DataFrame({'Students': ['Jack', 'Dale', 'Shaun', 'Shane',
'Brett', 'Patrick', 'Mitchell', 'David', 'Zoe']})
stud
| | Students |
---|---|
0 | Jack |
1 | Dale |
2 | Shaun |
3 | Shane |
4 | Brett |
5 | Patrick |
6 | Mitchell |
7 | David |
8 | Zoe |
stud.head()
| | Students |
---|---|
0 | Jack |
1 | Dale |
2 | Shaun |
3 | Shane |
4 | Brett |
Example 2: providing value of ‘n’
We can also pass a value for ‘n’ explicitly.
Here we pass ‘n’ as 3, so the output contains the first three rows.
stud.head(3)
| | Students |
---|---|
0 | Jack |
1 | Dale |
2 | Shaun |
Example 3: using tail function
To access the last rows of a dataframe, we use the tail() function. By default, it returns the last 5 rows.
stud.tail()
| | Students |
---|---|
4 | Brett |
5 | Patrick |
6 | Mitchell |
7 | David |
8 | Zoe |
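The ‘n’ parameter of head() and tail() also accepts negative values and values larger than the number of rows. The sketch below rebuilds the stud dataframe from the examples above to show both cases. The variable names are only illustrative.

```python
import pandas as pd

# Same dataframe as in the head() examples above
stud = pd.DataFrame({"Students": ["Jack", "Dale", "Shaun", "Shane", "Brett",
                                  "Patrick", "Mitchell", "David", "Zoe"]})

# A negative n drops rows from the opposite end instead
all_but_last_two = stud.head(-2)    # everything except the last 2 rows
all_but_first_two = stud.tail(-2)   # everything except the first 2 rows

# Asking for more rows than exist simply returns the whole dataframe
oversized = stud.head(20)

print(all_but_last_two)
print(all_but_first_two)
print(len(oversized))
```

This can be convenient when the number of rows to skip is known but the total length of the data is not.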
The third function in the list is pandas unique function.
Pandas unique : unique()
The unique() function returns unique values present in series object. The values are returned in the order of appearance.
Syntax
Series.unique()
Here unique() is called on a series object and takes no arguments.
The output is a NumPy array for most data types. For extension types such as categorical data, the matching array type (for example a Categorical) is returned.
Example 1: using pandas unique() over series object
In the below-given example, we will be applying unique() function on the series object.
In the output, we get an array with unique values.
pd.Series([7, 14, 9, 9], name='Test').unique()
array([ 7, 14, 9], dtype=int64)
Example 2: unique function on categorical data
Categorical data takes a limited, fixed set of possible values, called categories. Let’s see how unique() operates on a series containing categorical data.
In this first example, the result lists the unique values together with the categories of the data.
pd.Series(pd.Categorical(list('gpprs'))).unique()
[g, p, r, s]
Categories (4, object): [g, p, r, s]
In this example, the same categorical data is created with ordered=True, so the categories are displayed together with their order.
pd.Series(pd.Categorical(list('gpprs'), categories=list('gprs'),
ordered=True)).unique()
[g, p, r, s]
Categories (4, object): [g < p < r < s]
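The two points made about unique(), namely the order of appearance and the type of the returned object, can be checked with a short sketch. The numeric series is hypothetical sample data. The categorical series is the one from the example above.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric series with repeated values
values = pd.Series([7, 14, 9, 9, 7])

# Unique values come back in order of first appearance, not sorted
uniques = values.unique()
print(type(uniques), uniques)

# For categorical data the result is a Categorical rather than a NumPy array
cat_uniques = pd.Series(pd.Categorical(list("gpprs"))).unique()
print(type(cat_uniques), list(cat_uniques))

# The number of distinct values is simply the length of the result
print(len(uniques))
```

Because the values are not sorted, the result should be sorted explicitly if a sorted list of distinct values is needed.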
The last function in this article which we’ll look at is pandas count.
Pandas Count : count()
The pandas count() function helps in counting non-NA cells of each column or row.
Syntax
pandas.DataFrame.count(axis=0,level=None,numeric_only=False)
axis : {0 or ‘index’, 1 or ‘columns’}, default 0 – If the value provided is 0, then counts are generated for each column. If value provided is 1, then counts are generated for rows.
level : int or str (optional) – The level along which counting should be done, for hierarchical i.e. multi-index dataframes. Note that this parameter was deprecated in pandas 1.3 and removed in pandas 2.0, where grouping by the level and then counting is used instead.
numeric_only : bool, default False – If True, only float, int or boolean columns are included in the count.
The output is a Series or DataFrame. For each column/row, the non-NA entries are counted.
Example 1: counting non-NA values
Here a dataframe is created with the help of a dictionary.
df = pd.DataFrame({"Employee":
["Rakesh", "Ramesh", "Suresh", "Jayesh", "Bhavesh"],
"Age": [27, 36, 30, np.nan, 23],
"Married_Status": [False, True, False, True, False]})
df
| | Employee | Age | Married_Status |
---|---|---|---|
0 | Rakesh | 27.0 | False |
1 | Ramesh | 36.0 | True |
2 | Suresh | 30.0 | False |
3 | Jayesh | NaN | True |
4 | Bhavesh | 23.0 | False |
The below output shows the results of count() function.
df.count()
Employee          5
Age               4
Married_Status    5
dtype: int64
Example 2: applying count() function over columns
In this example, we pass axis='columns', so count() works across the columns and returns one count per row. The row at index 3 has a count of 2 because its Age value is NaN, while every other row has all 3 values present.
df.count(axis='columns')
0    3
1    3
2    3
3    2
4    3
dtype: int64
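The numeric_only parameter and the relation between count() and missing values can be sketched as follows, using the same employee dataframe. Pairing count() with isna().sum() is an illustrative choice here, not something the examples above rely on.

```python
import numpy as np
import pandas as pd

# Same dataframe as in the count() examples above
df = pd.DataFrame({
    "Employee": ["Rakesh", "Ramesh", "Suresh", "Jayesh", "Bhavesh"],
    "Age": [27, 36, 30, np.nan, 23],
    "Married_Status": [False, True, False, True, False],
})

# Non-NA values per column, and the complementary number of missing values
non_missing = df.count()
missing = df.isna().sum()
print(non_missing)
print(missing)

# Restrict the count to float, int and boolean columns
numeric_counts = df.count(numeric_only=True)
print(numeric_counts)
```

For every column, the non-missing and missing counts add up to the number of rows, which makes this pair a quick completeness check.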
Conclusion
In this tutorial, we covered four pandas functions that help us understand and explore our data before preprocessing it or making decisions based on it: describe(), head(), unique() and count(). Together they give a quick view of the statistics, contents, distinct values and completeness of the data stored in a series or dataframe.
Reference – https://pandas.pydata.org/docs/
I am Palash Sharma, an undergraduate student who loves to explore and garner in-depth knowledge in the fields like Artificial Intelligence and Machine Learning. I am captivated by the wonders these fields have produced with their novel implementations. With this, I have a desire to share my knowledge with others in all my capacity.