**Linear regression** is one of the most popular Machine Learning / Data Science algorithms people study when they enter the field. The concepts behind it are relatively easy, and it helps aspiring data scientists and machine learning developers build a solid knowledge foundation for more advanced topics.

*Interested in more stories like this? Follow me on Twitter at @b_dmarius and I'll post there every new article.*

In this article we will take a look at the basic concepts behind linear regression, explaining the logic and taking a quick peek at the maths behind this family of algorithms. Finally, we will take a real-world dataset and use Python and Scikit-Learn to implement linear regression to predict house prices based on a few indicators. In short, here's the outline of the article:

- Linear Regression explained
- Linear Regression applications
- Linear Regression project setup
- Linear Regression dataset analysis - Boston House Dataset
- Linear Regression implementation using Python and Scikit-Learn
- Conclusions

# Linear Regression explained

**Linear Regression** is a type of algorithm used to identify and model relationships between variables. Let's say that over a certain period of time we have observed *n* characteristics of a certain phenomenon. Imagine we have data about all the houses sold during the last few years in a city.

For every house sold we know its *n* characteristics: for example, the number of rooms, the size of the house, whether it has a garden or not, the size of the garden, the number of floors and so on. We also know the price for which each house was sold.

Now imagine that we want to predict the price for which a new house can be sold in the same area. For obvious reasons we think there will be a relation between all these characteristics and the price.

This is where Linear Regression comes in handy. Linear Regression helps us figure out whether there is a relation between the observed characteristics and a certain target characteristic. And if there is a relation, this method will allow us to estimate the target characteristics based on the observed characteristics of a new entry in our dataset.

The observed characteristics are called **independent variables**. The target characteristic (in our example, the price of the house) is called the **dependent variable**.

Now, for simple models/relationships, we might need only one independent variable. In this case, we talk about **Simple Linear Regression**. In real-world cases, we can rarely predict the target variable based on only one observed variable. So when we have more than one independent variable, we are performing a **Multiple Linear Regression**.

The advanced mathematics behind Linear Regression is beyond the scope of this article, but if you are curious, the Wikipedia page on the subject explains it very well.

But in simple terms, here's how linear regression works. Let's say we have a model with 2 independent variables (*x1* and *x2*) and 1 dependent variable (*y*). Then we assume we can model the relationship between *x1*, *x2* and *y* with an equation of this form:

*y = a·x1 + b·x2 + c*

We know the values of *y*, *x1* and *x2* for every entry in the dataset, so the goal is to find the values of *a*, *b* and *c*. There's a catch, though: for complex models, we may never know the exact values of *a*, *b* and *c*. We will only find approximations of them (which we of course want to be as close to the ground truth as possible), so when we plug them into the equation above, we also get an approximation of *y*. Getting *a*, *b* and *c* as close as possible to their real values gives us an estimate of *y* that is very close to the real *y*. This is typically done by measuring the error with the mean squared error (or its square root, the root mean squared error, RMSE) and using gradient descent to iteratively update *a*, *b* and *c*.

After we have found our final approximations of *a*, *b* and *c*, we can use them on new entries in our dataset (new houses whose price we want to predict).
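To make the idea above concrete, here is a minimal sketch (on made-up data, not the article's dataset) of fitting *a*, *b* and *c* by gradient descent on the mean squared error:

```python
import numpy as np

# Toy data generated from a known ground truth: a=3, b=-2, c=5
rng = np.random.default_rng(42)
x1 = rng.uniform(-1, 1, 200)
x2 = rng.uniform(-1, 1, 200)
y = 3.0 * x1 - 2.0 * x2 + 5.0

a = b = c = 0.0
lr = 0.1  # learning rate
for _ in range(2000):
    y_hat = a * x1 + b * x2 + c  # current prediction
    err = y_hat - y
    # Gradients of the mean squared error with respect to each parameter
    a -= lr * 2 * np.mean(err * x1)
    b -= lr * 2 * np.mean(err * x2)
    c -= lr * 2 * np.mean(err)

print(round(a, 2), round(b, 2), round(c, 2))  # approaches the true 3, -2, 5
```

In practice Scikit-Learn's `LinearRegression` solves this with a closed-form least-squares method rather than gradient descent, but the goal is the same: parameters that minimize the squared error.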

# Linear Regression applications

Linear Regression can be used whenever we think there is a dependency relationship between certain variables in our dataset. We want to be able to predict the value of a variable after we observe the real values of the other characteristics. This can be used for certain applications like:

- Price prediction
- Trading algorithms
- Trends modelling
- Other, more advanced Machine Learning concepts like Neural Networks; some might even argue that a simple neural network is just a linear regression model on steroids 😃.

# Linear Regression project setup

In this article we are going to use Python and Scikit-Learn to implement a Multiple Regression model that will try to predict house prices based on a few characteristics of those houses and historical data.

We need to install a few dependencies before we can continue: scikit-learn and seaborn (for example with `pip install scikit-learn seaborn`).

We will also use pandas, numpy and matplotlib, but these three are pulled in automatically when we install the dependencies above. With that said, let's import all our dependencies into the project.
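The snippets in the rest of this article assume imports along these lines. Note that `load_boston` was deprecated in scikit-learn 1.0 and removed in 1.2, so the import is guarded here; to follow along with this exact dataset you may need to pin an older scikit-learn version.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# load_boston was deprecated in scikit-learn 1.0 and removed in 1.2;
# guard the import so this snippet also runs on newer versions.
try:
    from sklearn.datasets import load_boston
except ImportError:
    load_boston = None  # pin scikit-learn<1.2 to use this dataset
```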

# Linear Regression dataset analysis - Boston House Dataset

The Boston House dataset is included in the Scikit-Learn library, which provides a nice utility function we can use to download and load it.

```
# Loading the dataset
dataset = load_boston()
```

Let's now take a look at the data. First let's see a description of the dataset.

```
# Describe the dataset
print(dataset.DESCR)
```

We now need to rearrange the data a little to make it easier to manipulate later. We will load it into a pandas data frame and add the target values.

```
# Load into data frame
dataFrame = pd.DataFrame(dataset.data)
# Fetch column names
dataFrame.columns = dataset.feature_names
dataFrame['ACTUAL_PRICE'] = dataset.target
print (dataFrame.head())
```

Having loaded the dataset into a dataframe, we can already get a few interesting statistics about it.

```
# Statistics
print (dataFrame.describe())
```

Before we proceed with our Linear Regression implementation, we need to check whether the variables are correlated, that is, whether the variables we consider independent actually have an influence on the dependent variable.

For this we can use the **Pearson coefficient**, which measures the correlation between any 2 variables. The Pearson coefficient is a value in the interval [-1, 1] that indicates the strength and direction of the correlation between the 2 variables.

- The closer the Pearson coefficient is to the margins of the interval (-1 or 1), the stronger the correlation is.
- A value closer to 0 indicates a lower correlation.
- A negative value indicates a negative correlation: as the independent variable increases, the dependent variable tends to decrease.
- A positive value indicates a positive correlation: as the independent variable increases, the dependent variable tends to increase as well.
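As a quick illustration of these rules, here is a toy example (with made-up series, not our dataset) using numpy's `corrcoef`:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_up = np.array([2.0, 4.1, 5.9, 8.2, 10.0])   # rises with x
y_down = np.array([9.8, 8.0, 6.1, 4.2, 2.0])  # falls with x

# corrcoef returns a 2x2 matrix; the off-diagonal entry is the Pearson coefficient
print(np.corrcoef(x, y_up)[0, 1])    # close to +1: strong positive correlation
print(np.corrcoef(x, y_down)[0, 1])  # close to -1: strong negative correlation
```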

We can use pandas to display the correlations between each and every variable.

```
# Correlations
correlations = dataFrame.corr(method='pearson')
print (correlations)
```

We can also use seaborn to display a heatmap of our correlations.

```
# Visualize correlations as a heatmap
sns.heatmap(data=correlations, cmap="YlGnBu")
plt.show()
```

So we can conclude that the independent variables and the dependent variable are indeed correlated (some more, some less), and it's safe to move on to the Linear Regression implementation.

# Linear Regression implementation using Python and Scikit-Learn

We'll first split our dataset into X and Y, meaning our independent and dependent variables.

```
# Split features and target
X = dataFrame.drop('ACTUAL_PRICE', axis=1)
Y = dataFrame['ACTUAL_PRICE']
```

Now we want to perform a train-test split. This means we randomly select a portion of our dataset for training the model, while keeping a separate subset for testing and evaluation. Evaluating on data the model has never seen gives us a fairer, less biased estimate of its performance.

For the purpose of this article we are going to use 70% of the dataset for the training set, while the remaining 30% goes to the test set.

```
# Train-test split validation
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.3)
```

Now it's time to perform Linear regression.

```
# Train the model
model = LinearRegression()
model.fit(X_train, Y_train)
```

From this moment on, our model holds the estimated parameters of our equation. We can use them to obtain price predictions for the test set.

```
# Test the model
Y_pred = model.predict(X_test)
```

As a quick peek into our model evaluation, let's print our RMSE value.

```
# Print RMSE
print(np.sqrt(mean_squared_error(Y_test, Y_pred)))
# Prints 4.753674559409141
```

How do we interpret this value? Prices in this dataset are expressed in thousands of dollars, so it means that our model's predictions of the median house price are off by roughly $4,753 on average. This is not such a big error, considering the small size of our dataset.

This means that if tomorrow we register new values for the characteristics tracked in the dataset, we can feed them through our model and get an estimate for the median house price in the Boston area. But we have to be aware that the estimated price will typically be about $4,753 higher or lower than the actual value. Individual errors may be larger or smaller, but that is the average magnitude.

Let's also build a visual representation of our model to see how well it performed. For ease of visualisation, let's take only the first 50 entries in the dataset and compare the actual prices with the predicted ones.
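A minimal sketch of such a comparison plot follows. It uses placeholder arrays so it runs standalone; in the actual project you would drop those and use the `Y_test` and `Y_pred` produced by the earlier snippets.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this also runs in scripts
import matplotlib.pyplot as plt

# Placeholder data standing in for the real Y_test / Y_pred from earlier
rng = np.random.default_rng(0)
Y_test = rng.uniform(10, 50, size=100)
Y_pred = Y_test + rng.normal(0, 4.75, size=100)

n = 50  # compare only the first 50 entries, as in the text
plt.figure(figsize=(10, 5))
plt.plot(range(n), np.asarray(Y_test)[:n], label="Actual price")
plt.plot(range(n), np.asarray(Y_pred)[:n], label="Predicted price")
plt.xlabel("Entry index")
plt.ylabel("Median house price ($1000s)")
plt.legend()
plt.savefig("actual_vs_predicted.png")
```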

Not bad I'd say! We can also take a look at each and every coefficient determined by our Linear Regression model.

```
# Explain results
print(list(zip(X.columns, model.coef_)))
[('CRIM', -0.06931430702717156), ('ZN', 0.03955444157679673), ('INDUS', 0.054245604950544764), ('CHAS', 2.0830343274015717), ('NOX', -17.483500368235607), ('RM', 3.8989734326661645), ('AGE', 0.0022662258867003834), ('DIS', -1.4021122673518704), ('RAD', 0.2936192749806645), ('TAX', -0.012887975521511379), ('PTRATIO', -1.1012155759509838), ('B', 0.010559654442887385), ('LSTAT', -0.5595486955800708)]
```

To understand how to interpret this, let's take a look at the first coefficient.

('CRIM', -0.06931430702717156)

This means that if the per capita crime rate increases by 1, with all other variables held constant, the predicted median house price decreases by approximately $693. We can go through each coefficient and write a similar explanation for how it affects the median house price.

# Conclusions

Today we discussed Linear Regression. We went through a brief explanation of how Linear Regression works, then presented some applications of the method. We then loaded a dataset and analyzed it a bit to check whether Linear Regression was a good fit for it, and finally walked through the implementation using Python and Scikit-Learn. I hope you've enjoyed this article as much as I've enjoyed writing it. The code for this article is available here.

*Interested in more stories like this? Follow me on Twitter at @b_dmarius and I'll post there every new article.*