Introduction
In this tutorial, we will see how to implement Linear Regression in the Python Sklearn library. We will see the LinearRegression module of Scitkit Learn, understand its syntax, and associated hyperparameters. And then we will deep dive into an example to see the proper implementation of linear regression in Sklearn with a dataset.
But first of all, we will have a quick overview of linear regression.
What is Linear Regression
Linear Regression is a kind of modeling technique that helps in building relationships between a dependent scalar variable and one or more independent variables. They are also known as the outcome variable and predictor variables.
Although it has roots in statistics, Linear Regression is also an essential tool in machine learning for tasks like predictive modeling.
Linear regression attempts to model the relationship between two variables by fitting a linear equation to observed data.
Hyperparameter in Linear Regression
Hyperparameters are parameters that are given as input by the users to the machine learning algorithmsÂ Hyperparameter tuning can increase the accuracy of the model.
However, in simple linear regression, there is no hyperparameter tuning
Linear Regression in Python Sklearn
If we want to perform linear regression in Python, we have a function LinearRegression() available in the Scikit Learn package that can make our job quite easy.
Let us understand the syntax of LinearRegression() below.
Syntax of LinearRegression()
class sklearn.linear_model.LinearRegression(*, fit_intercept=True, normalize=False, copy_X=True, n_jobs=None, positive=False)
Parameters
 fit_intercept: This parameter determines whether an intercept has to be calculated or not. It is a boolean value and by default it is True.
 normalize: When the value of the above fit_intercept is False then this parameter is ignored. When fit_intercept is True then regressor X will be normalized. The default value is False.
 copy_X: If set to True, then X will be copied otherwise it is overwritten. By default it is True.
 n_jobs: It denotes the number of jobs to be used for computation. It is int value and by default it is None.
 positive: If set as true, coefficients are forced to be true. It only works for dense arrays. By default it is False.
Example of Linear Regression with Python Sklearn
In this section, we will see an example of endtoend linear regression with the Sklearn library with a proper dataset.
We will work with water salinity data and will try to predict the temperature of the water using salinity
1. Loading the Libraries
We first load the necessary libraries for our example like numpy, pandas, matplotlib, and seaborn. Finally, we load several modules from sklearn including our LinearRegression.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn import preprocessing, svm
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
2. Loading the Dataset
Now we will load the dataset for building the linear regression model. Since itâ€™s a huge dataset as we can see below, weâ€™ll be focusing on two main columns for the purpose of this tutorial.
df = pd.read_csv('bottle.csv')
df.head()
Output:
Cst_Cnt  Btl_Cnt  Sta_ID  Depth_ID  Depthm  T_degC  Salnty  O2ml_L  STheta  O2Sat  â€¦  R_PHAEO  R_PRES  R_SAMP  DIC1  DIC2  TA1  TA2  pH2  pH1  DIC Quality Comment  

0  1  1  054.0 056.0  194903CRHY0600930054005600000A3  0  10.50  33.440  NaN  25.649  NaN  â€¦  NaN  0  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN 
1  1  2  054.0 056.0  194903CRHY0600930054005600008A3  8  10.46  33.440  NaN  25.656  NaN  â€¦  NaN  8  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN 
2  1  3  054.0 056.0  194903CRHY0600930054005600010A7  10  10.46  33.437  NaN  25.654  NaN  â€¦  NaN  10  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN 
3  1  4  054.0 056.0  194903CRHY0600930054005600019A3  19  10.45  33.420  NaN  25.643  NaN  â€¦  NaN  19  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN 
4  1  5  054.0 056.0  194903CRHY0600930054005600020A7  20  10.45  33.421  NaN  25.643  NaN  â€¦  NaN  20  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN 
5 rows Ã— 74 columns
3. Exploratory Data Analysis
Since the dataset is quite huge, weâ€™ll be utilizing only the first 500 values of this dataset. So, letâ€™s first build a dataframe that contains only 500 values, and then, weâ€™ll plot a scatter plot to understand the trend of the dataset.
This plot gives us an idea about the trend of our data and we can try to fit the linear regression model here.
df_binary500 = df_binary[:][:500]
sns.scatterplot(x ="Sal", y ="Temp", data = df_binary500)
4. Data Preprocessing
We only want to work with two relevant columns that will tell about the salinity and temperature of oceans and will be helpful to create the regression model.
We make use of the below code to create a new dataframe with Salinity and Temperature.
df_binary = df[['Salnty', 'T_degC']]
df_binary.columns = ['Sal', 'Temp']
df_binary.head()
Sal  Temp  

0  33.440  10.50 
1  33.440  10.46 
2  33.437  10.46 
3  33.420  10.45 
4  33.421  10.45 
To get our dataset to perform better, we will fill the null values in the dataframes using fillna() function.
df_binary500.fillna(method ='ffill', inplace = True)
We also have to reshape the two columns of our dataframe, this will then be passed as variables for model building.
X = np.array(df_binary500['Sal']).reshape(1, 1)
y = np.array(df_binary500['Temp']).reshape(1, 1)
5. Train Test Split
We will now split our dataset into train and test sets. The training set will be used for creating a linear regression model and then its accuracy will be tested with the testing dataset.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25)
6. Model Training
Now we will train the model using LinearRegression() module of sklearn using the training dataset.
We create an instance of LinearRegression() and then we fit X_train and y_train.
regr = LinearRegression()
regr.fit(X_train, y_train)
7. Linear Regression Score
Now we will evaluate the linear regression model on the training data and then on test data using the score function of sklearn.
In [13]:
train_score = regr.score(X_train, y_train)
print("The training score of model is: ", train_score)
The training score of model is: 0.8442369113235618
test_score = regr.score(X_test, y_test)
print("The score of the model on test data is:", test_score )
The score of the model on test data is: 0.839197956273302
As we can see, the linear regression model has achieved a score of 0.839 on the test data set and it was 0.842 on the train data set. So overall we have created a good linear regression model in Sklearn.
8. Visualizing the Results
As a final step, we will visualize the result of the linear regression model by plotting the regression line with test data.
y_pred = regr.predict(X_test)
plt.scatter(X_test, y_test, color ='b')
plt.plot(X_test, y_pred, color ='k')
plt.show()
 Also Read â€“ Regression vs Classification â€“ No More Confusion !!
Conclusion
In this tutorial, we learned about the implementation of linear regression in the Python sklearn library. We discussed the syntax of the linear regression function in sklearn and finally saw an endtoend example of linear regression with sklearn using a dataset.

I am Palash Sharma, an undergraduate student who loves to explore and garner indepth knowledge in the fields like Artificial Intelligence and Machine Learning. I am captivated by the wonders these fields have produced with their novel implementations. With this, I have a desire to share my knowledge with others in all my capacity.
View all posts