The Most Beloved Linear Regression Project: House Price Prediction

In this blog, you will learn how to use linear regression for house price prediction. I am a huge believer in learning on the go while doing projects. It teaches us a lot of new things that we could not pick up just by learning from videos and books. In this blog, I am intentionally pasting screenshots of my code so that you will go to your Jupyter notebook and type it yourself instead of just copy-pasting.

In this blog, I am not going to explain what linear regression is, as you may have learned it before jumping into the project. If you have not, don’t worry: you can check my latest thread on Twitter here.

Let’s get started.

Link to the dataset: Dataset

The dataset is a CSV file with 7 columns and 5000 rows. It contains the following columns:

‘Avg. Area Income’: Avg. income of householders in the city the house is located in.
‘Avg. Area House Age’: Avg. age of houses in the same city.
‘Avg. Area Number of Rooms’: Avg. number of rooms for houses in the same city.
‘Avg. Area Number of Bedrooms’: Avg. number of bedrooms for houses in the same city.
‘Area Population’: Population of the city.
‘Price’: Price that the house sold at.
‘Address’: Address of the houses.

The first step is of course importing all the required libraries.

-Pandas is mainly used for data analysis and the associated manipulation of tabular data in DataFrames. Pandas allows importing data from various file formats such as comma-separated values, JSON, Parquet, SQL database tables or queries, and Microsoft Excel.

-NumPy is a Python library used for working with arrays. It also has functions for working in the domain of linear algebra, Fourier transform, and matrices.

-Seaborn is an open-source Python library built on top of matplotlib. It is used for data visualization and exploratory data analysis. Seaborn works easily with dataframes and the Pandas library.

-Matplotlib is a cross-platform, data visualization and graphical plotting library for Python and its numerical extension NumPy.

-You can use the magic function %matplotlib inline to enable inline plotting, where the plots/graphs are displayed just below the cell in which your plotting commands are written.
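If you want to check what you typed against something, the import cell could look roughly like the sketch below. The aliases (pd, np, plt, sns) are the conventional ones and are an assumption about what the screenshot shows.

```python
# Core libraries for the project, imported under their conventional aliases.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# In a Jupyter notebook, also run the magic below so plots render under the cell.
# It is left as a comment here because it is not valid syntax in a plain .py script.
# %matplotlib inline
```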

Here the HouseDF variable is created by reading the CSV file into a DataFrame with pd.read_csv(). You can explore pd.read_csv() here.

.head() will show the first five rows of our dataset.

.info() prints information about the dataset: the number of columns, column labels, column data types, memory usage, range index, and the number of non-null values in each column.

.describe() generates descriptive statistics that summarize the central tendency, dispersion, and shape of a dataset’s distribution, excluding NaN values. Explore more about .describe() here.
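As a rough sketch of the loading step: the snippet below reads a tiny in-memory CSV so that it runs anywhere. The two rows are made up for illustration and are not taken from the dataset. With the downloaded file, you would pass its path to pd.read_csv() instead; the path shown in the comment is a placeholder.

```python
import io
import pandas as pd

# Made-up stand-in for the real file (two invented rows, real column names).
# With the downloaded dataset you would instead write something like:
#   HouseDF = pd.read_csv("path/to/dataset.csv")   # placeholder path
csv_text = io.StringIO(
    "Avg. Area Income,Avg. Area House Age,Avg. Area Number of Rooms,"
    "Avg. Area Number of Bedrooms,Area Population,Price,Address\n"
    '60000.0,6.0,7.0,4.0,30000.0,1000000.0,"1 Example St, Sampletown"\n'
    '75000.0,5.5,6.5,3.0,40000.0,1400000.0,"2 Example Ave, Sampletown"\n'
)

# read_csv accepts a file path or any file-like object and returns a DataFrame.
HouseDF = pd.read_csv(csv_text)
print(HouseDF.shape)  # (rows, columns)
```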

.columns will show the column labels of the DataFrame.
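The four inspection calls above can be tried together as in this sketch. It builds a small random DataFrame with the dataset’s column names, so the numbers it prints are invented and will not match the real data.

```python
import numpy as np
import pandas as pd

# Made-up stand-in data with the same column names as the dataset.
# The values are random and only exist so the calls below have something to show.
rng = np.random.default_rng(0)
n = 100
HouseDF = pd.DataFrame({
    "Avg. Area Income": rng.normal(68000, 10000, n),
    "Avg. Area House Age": rng.normal(6, 1, n),
    "Avg. Area Number of Rooms": rng.normal(7, 1, n),
    "Avg. Area Number of Bedrooms": rng.integers(2, 7, n).astype(float),
    "Area Population": rng.normal(36000, 10000, n),
    "Price": rng.normal(1200000, 350000, n),
    "Address": [f"{i} Example St" for i in range(n)],
})

print(HouseDF.head())      # first five rows
HouseDF.info()             # prints its summary directly and returns None
print(HouseDF.describe())  # statistics for the numeric columns only, by default
print(HouseDF.columns)     # column labels
```

Note that .describe() skips the text column ‘Address’ by default, so its output has six columns rather than seven.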

sns.pairplot(DataFrame) plots pairwise relationships in a dataset. By default, this function will create a grid of Axes such that each numeric variable in the data is shared across the y-axes of a single row and the x-axes of a single column. The diagonal plots are treated differently: a univariate distribution plot is drawn to show the marginal distribution of the data in each column. Explore more about pairplot here.

Pair plots are used to find the set of features that best describes the relationship between two variables or that forms the most clearly separated clusters. They also help in sketching simple classification models by drawing straight lines or linear separations through the data. For this project, the row for Price shows how each numeric feature relates to the value we want to predict.

sns.heatmap() plots rectangular data as a color-encoded matrix. ‘annot=True’ writes the data value in each cell. Explore more about .heatmap here.

The main purpose of a heatmap is to make the size of each value in a matrix visible at a glance through color, so that the viewer can go straight to the cells that matter most.
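A common choice of matrix for this step is the correlation matrix of the numeric columns. That is an assumption here, since the post does not say which matrix is plotted. The sketch below uses made-up data and selects the numeric columns first, because a text column such as ‘Address’ cannot be correlated.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen backend so this runs without a display; not needed in Jupyter
import numpy as np
import pandas as pd
import seaborn as sns

# Made-up data with an invented relation between the features and Price.
rng = np.random.default_rng(0)
n = 200
income = rng.normal(68000, 10000, n)
rooms = rng.normal(7, 1, n)
demo = pd.DataFrame({
    "Avg. Area Income": income,
    "Avg. Area Number of Rooms": rooms,
    "Price": 20 * income + 120000 * rooms + rng.normal(0, 100000, n),
    "Address": [f"{i} Example St" for i in range(n)],
})

# Keep only numeric columns, then compute pairwise Pearson correlations.
corr = demo.select_dtypes(include="number").corr()

# annot=True writes each correlation value inside its cell.
ax = sns.heatmap(corr, annot=True)
print(corr.round(2))
```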

Here X and y are assigned: X holds the feature columns and y holds the target, Price, which we are going to predict with a linear model.

Explore more about train_test_split here. To see exactly how it works, here’s an illustration.
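A sketch of the assignment, under one assumption: X is taken to be the five numeric columns other than Price, and ‘Address’ is left out because it is text. The post does not list the exact columns, so check your own choice against the screenshot. The data is made up.

```python
import numpy as np
import pandas as pd

# Made-up stand-in data with the dataset's column names.
rng = np.random.default_rng(0)
n = 100
HouseDF = pd.DataFrame({
    "Avg. Area Income": rng.normal(68000, 10000, n),
    "Avg. Area House Age": rng.normal(6, 1, n),
    "Avg. Area Number of Rooms": rng.normal(7, 1, n),
    "Avg. Area Number of Bedrooms": rng.integers(2, 7, n).astype(float),
    "Area Population": rng.normal(36000, 10000, n),
    "Price": rng.normal(1200000, 350000, n),
    "Address": [f"{i} Example St" for i in range(n)],
})

# Assumed feature set: every numeric column except the target.
feature_cols = [
    "Avg. Area Income",
    "Avg. Area House Age",
    "Avg. Area Number of Rooms",
    "Avg. Area Number of Bedrooms",
    "Area Population",
]
X = HouseDF[feature_cols]  # DataFrame of features
y = HouseDF["Price"]       # Series holding the target
print(X.shape, y.shape)
```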

Image by Michael Galarnyk

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.40, random_state=101)

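test_size represents the proportion of the dataset to include in the test split. With test_size=0.40 and 5000 rows, 2000 rows go to the test set and the remaining 3000 to the training set.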

random_state controls the shuffling applied to the data before applying the split. Pass an int for reproducible output across multiple function calls.
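To see both parameters in action, here is a small sketch on made-up data with 100 rows. The two feature columns are an arbitrary pick for illustration. It shows the 60/40 split and checks that the same random_state gives the same split on a second call.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up features and target, 100 rows.
rng = np.random.default_rng(0)
n = 100
X = pd.DataFrame({
    "Avg. Area Income": rng.normal(68000, 10000, n),
    "Area Population": rng.normal(36000, 10000, n),
})
y = pd.Series(rng.normal(1200000, 350000, n), name="Price")

# 40% of the rows go to the test set, the rest to the training set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.40, random_state=101
)
print(X_train.shape, X_test.shape)  # (60, 2) (40, 2)

# Calling again with the same random_state selects the same rows.
_, X_test_again, _, _ = train_test_split(X, y, test_size=0.40, random_state=101)
print(X_test.index.equals(X_test_again.index))  # True
```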