In this blog, you will learn how to use linear regression for house price prediction. I am a huge believer in learning on the go by doing projects. It teaches us a lot of new things that we cannot pick up just from videos and books. In this blog, I am intentionally pasting screenshots of my code, so that you go to your Jupyter notebook and type it yourself instead of just copy-pasting.
I am not going to explain what linear regression is, as you may have learned it before jumping into the project. If you have not, don’t worry: you can check my latest thread on Twitter here.
Let’s get started.
Link to the dataset: Dataset
The dataset is a CSV file with 7 columns and 5000 rows. It contains the following columns:
‘Avg. Area Income’: Avg. income of residents of the city the house is located in.
‘Avg. Area House Age’: Avg. age of houses in the same city.
‘Avg. Area Number of Rooms’: Avg. number of rooms for houses in the same city.
‘Avg. Area Number of Bedrooms’: Avg. number of bedrooms for houses in the same city.
‘Area Population’: Population of the city.
‘Price’: Price that the house sold at.
‘Address’: Address of the houses.
The first step is of course importing all the required libraries.
-Pandas is mainly used for data analysis and manipulation of tabular data in DataFrames. Pandas allows importing data from various file formats such as comma-separated values, JSON, Parquet, SQL database tables or queries, and Microsoft Excel.
-NumPy is a Python library used for working with arrays. It also has functions for working in the domain of linear algebra, Fourier transform, and matrices.
-Seaborn is an open-source Python library built on top of matplotlib. It is used for data visualization and exploratory data analysis. Seaborn works easily with DataFrames and the Pandas library.
-Matplotlib is a cross-platform data visualization and graphical plotting library for Python and its numerical extension NumPy.
-You can use the magic function %matplotlib inline to enable inline plotting, where the plots/graphs are displayed just below the cell where your plotting commands are written.
Here the HouseDF variable is created by reading the CSV file into a DataFrame. You can explore pd.read_csv() here.
.head() shows the first five rows of our dataset.
.info() prints information about the dataset: the range index, the number of columns, the column labels, the column data types, the count of non-null values in each column, and the memory usage.
.describe() generates descriptive statistics. These summarize the central tendency, dispersion, and shape of a dataset’s distribution, excluding NaN values. Explore more about .describe() here.
.columns will show the column labels of the DataFrame.
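If you want something to compare your own notebook against, here is a minimal sketch of the loading and inspection steps. It reads a tiny made-up sample from a string instead of the real file, so the three rows and the csv_text variable are my assumptions for illustration only; with the real dataset you would pass the path of your downloaded CSV to pd.read_csv().

```python
import io

import pandas as pd

# Made-up sample rows with the same column names as the dataset (NOT real data).
csv_text = """Avg. Area Income,Avg. Area House Age,Avg. Area Number of Rooms,Avg. Area Number of Bedrooms,Area Population,Price,Address
60000,5.0,6.0,3.0,30000,1000000,"1 Example Street, Sampletown"
70000,6.0,7.0,4.0,35000,1250000,"2 Example Street, Sampletown"
80000,7.0,8.0,5.0,40000,1500000,"3 Example Street, Sampletown"
"""

# With the real file, replace io.StringIO(csv_text) with the path to your CSV.
HouseDF = pd.read_csv(io.StringIO(csv_text))

print(HouseDF.head())      # first five rows (only three exist in this sample)
HouseDF.info()             # dtypes, non-null counts, memory usage
print(HouseDF.describe())  # summary statistics for the numeric columns
print(HouseDF.columns)     # column labels
```

Note that .describe() only summarizes the six numeric columns; the text column ‘Address’ is left out by default.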
sns.pairplot(DataFrame) plots pairwise relationships in a dataset. By default, this function creates a grid of Axes such that each numeric variable in the data is shared across the y-axes of a single row and the x-axes of a single column. The diagonal plots are treated differently: a univariate distribution plot is drawn to show the marginal distribution of the data in each column. Explore more about pairplot here.
Pair plots help you see which features best describe the relationship between two variables or form the most clearly separated clusters. They also help in sketching simple classification models by drawing lines or linear separations through the data.
sns.heatmap() plots rectangular data as a color-encoded matrix. ‘annot=True’ writes the data value in each cell. Explore more about .heatmap here.
The main purpose of a heatmap is to make the magnitude of the values in a table easy to see at a glance, so the viewer can go straight to the cells that matter most. When it is applied to a correlation matrix, as is common in this kind of analysis, the strongest colors mark the most strongly related pairs of columns.
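To make the numbers behind such a heatmap concrete, here is a small sketch that computes a correlation matrix with pandas. I am assuming the heatmap is drawn over the correlation matrix of the numeric columns, which is the usual choice but is not stated above. The data is randomly generated (not the real dataset), and the relationship between the columns is invented.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data: values and relationships are invented for illustration.
rng = np.random.default_rng(0)
n = 200
income = rng.normal(68000, 10000, n)
age = rng.normal(6, 1, n)
df = pd.DataFrame({
    "Avg. Area Income": income,
    "Avg. Area House Age": age,
    "Price": 20 * income + 150000 * age + rng.normal(0, 50000, n),
})

# Keep numeric columns only (a text column such as Address cannot be correlated).
corr = df.select_dtypes(include="number").corr()
print(corr.round(2))
```

In a notebook, passing this matrix to sns.heatmap(corr, annot=True) draws it as a color-encoded grid with the value written in each cell.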
Here the X (features) and y (target) values are assigned, as we are going to fit a linear model to predict Price.
Explore more about train_test_split here. To see exactly how it works, here’s an illustration.
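A minimal sketch of that assignment is below. I am assuming the five numeric columns are used as features and that ‘Address’ is left out because it is text; the feature_cols name and the three sample rows are made up for illustration.

```python
import pandas as pd

# Made-up rows with the dataset's column names (NOT real data).
HouseDF = pd.DataFrame({
    "Avg. Area Income": [60000, 70000, 80000],
    "Avg. Area House Age": [5.0, 6.0, 7.0],
    "Avg. Area Number of Rooms": [6.0, 7.0, 8.0],
    "Avg. Area Number of Bedrooms": [3.0, 4.0, 5.0],
    "Area Population": [30000, 35000, 40000],
    "Price": [1000000, 1250000, 1500000],
    "Address": ["1 Example Street", "2 Example Street", "3 Example Street"],
})

# Hypothetical name for the list of numeric feature columns.
feature_cols = [
    "Avg. Area Income",
    "Avg. Area House Age",
    "Avg. Area Number of Rooms",
    "Avg. Area Number of Bedrooms",
    "Area Population",
]

X = HouseDF[feature_cols]  # features: a DataFrame with 5 columns
y = HouseDF["Price"]       # target: a Series

print(X.shape, y.shape)
```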
Image by Michael Galarnyk
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.40, random_state=101)
test_size represents the proportion of the dataset to include in the test split.
random_state controls the shuffling applied to the data before applying the split. Pass an int for reproducible output across multiple function calls.
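Here is a small sketch of what these two parameters do, using the same call as above on randomly generated stand-in data (100 invented rows, not the real dataset).

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 invented rows.
rng = np.random.default_rng(0)
n = 100
X = pd.DataFrame({
    "Avg. Area Income": rng.normal(68000, 10000, n),
    "Avg. Area House Age": rng.normal(6, 1, n),
})
y = pd.Series(
    20 * X["Avg. Area Income"] + 150000 * X["Avg. Area House Age"],
    name="Price",
)

# test_size=0.40 keeps 40% of the rows for testing; random_state fixes the shuffle.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.40, random_state=101
)
print(len(X_train), len(X_test))  # 60 40

# Calling it again with the same random_state returns the same rows.
_, X_test_again, _, _ = train_test_split(X, y, test_size=0.40, random_state=101)
print(X_test.index.equals(X_test_again.index))  # True
```

With the real dataset of 5000 rows, the same settings give 3000 training rows and 2000 test rows.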
Explore more about LinearRegression here.
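.fit() trains the model using the training set.
coef_ holds the estimated coefficient for each feature of your dataset: the change in the predicted Price for a one-unit increase in that feature while the other features are held fixed.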
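As a rough sketch of the fitting step, the code below trains a LinearRegression on invented data where I set the true coefficients myself (20 per unit of income, 150000 per year of house age), so you can see coef_ recover them approximately. The names lm and coeff_df are my own choices, not necessarily the ones in the screenshots.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with known (invented) coefficients plus noise.
rng = np.random.default_rng(0)
n = 500
X = pd.DataFrame({
    "Avg. Area Income": rng.normal(68000, 10000, n),
    "Avg. Area House Age": rng.normal(6, 1, n),
})
y = (
    20 * X["Avg. Area Income"]
    + 150000 * X["Avg. Area House Age"]
    + rng.normal(0, 50000, n)
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.40, random_state=101
)

lm = LinearRegression()   # hypothetical variable name
lm.fit(X_train, y_train)  # learn the intercept and one coefficient per feature

# Pair each coefficient with its column name so it is easy to read.
coeff_df = pd.DataFrame(lm.coef_, index=X.columns, columns=["Coefficient"])
print(lm.intercept_)
print(coeff_df)
```

The printed coefficients should land close to 20 and 150000, the values used to generate the data.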
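Here is a scatter plot of the predicted prices against the actual prices. The good thing about it is that the points fall along a straight line, which tells us that our model’s predictions track the actual prices closely.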
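A numeric companion to the scatter plot, sketched on invented data: predict on the test set and measure how tightly the predictions follow the actual values. The correlation check is my addition, not something from the screenshots, and lm and predictions are hypothetical names.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (invented relationship, NOT the real dataset).
rng = np.random.default_rng(0)
n = 500
X = pd.DataFrame({
    "Avg. Area Income": rng.normal(68000, 10000, n),
    "Avg. Area House Age": rng.normal(6, 1, n),
})
y = (
    20 * X["Avg. Area Income"]
    + 150000 * X["Avg. Area House Age"]
    + rng.normal(0, 50000, n)
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.40, random_state=101
)
lm = LinearRegression().fit(X_train, y_train)

# One predicted price per row of the test set.
predictions = lm.predict(X_test)

# Correlation between actual and predicted values: near 1 means a tight line.
r = np.corrcoef(y_test, predictions)[0, 1]
print(round(r, 3))
```

In a notebook, plt.scatter(y_test, predictions) draws the same comparison as a picture.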
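As you can see, the histogram of the residuals (actual price minus predicted price) is a bell-shaped curve centered near zero. This means the errors are approximately normally distributed, which is a good sign that a linear model fits this data well.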
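The same check can be sketched in numbers. I am assuming the plot shows the residuals, y_test minus the predictions; the data below is again invented, and np.histogram stands in for the plotted bars.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (invented relationship, NOT the real dataset).
rng = np.random.default_rng(0)
n = 500
X = pd.DataFrame({
    "Avg. Area Income": rng.normal(68000, 10000, n),
    "Avg. Area House Age": rng.normal(6, 1, n),
})
y = (
    20 * X["Avg. Area Income"]
    + 150000 * X["Avg. Area House Age"]
    + rng.normal(0, 50000, n)
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.40, random_state=101
)
lm = LinearRegression().fit(X_train, y_train)
predictions = lm.predict(X_test)

# Residual = actual price minus predicted price, one per test row.
residuals = y_test - predictions
print(residuals.mean(), residuals.std())

# Bin the residuals; a bell shape shows up as counts peaking in the middle bins.
counts, bin_edges = np.histogram(residuals, bins=20)
print(counts)
```

In a notebook, sns.histplot(residuals, kde=True) draws these bins with a smooth curve on top.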
Conclusion
We have created a linear regression model that can help real estate agents estimate house prices.
Github link here.
Thank you for reading.
Have a nice day!