Predicting Ames Housing Data

Fnu Vishal
3 min read · Aug 2, 2021

Reports

1. Overview

The project is based on the Ames Housing Data and is a machine learning problem. The data comes from Kaggle, as it was a Kaggle challenge. The aim of the project is to build a regression model on the Ames Housing Dataset that, as its final output, predicts the price of a house at sale. We followed several steps to get there: data cleaning and EDA, prediction models, model evaluation, and results. First, we cleaned the data and performed EDA to check the trends and patterns in it. After that, we created three regression models and tested them on an unseen dataset. Based on the RMSE values, we selected the best one; that model achieved good accuracy, and we used it to predict the final results.

2. Context & Challenge

o Background & Description

This type of model is generally used to make predictions about the future. Such a project helps a client understand the value of their business in the coming year. Once the model is built, it can be deployed in a cloud environment and persisted, so that the client can check predicted house values on new, unseen data, such as the coming year's data.
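A minimal sketch of how a fitted model could be persisted and reused on future data with joblib; the file name and the placeholder training data are illustrative, not part of the original project.

```python
import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Placeholder data, only to make the sketch runnable; in the project these
# would be the cleaned Ames features and sale prices.
X_train, y_train = np.random.rand(100, 5), np.random.rand(100)

model = LinearRegression().fit(X_train, y_train)

# Persist the fitted model so it can be reloaded in a cloud deployment.
joblib.dump(model, "ames_price_model.joblib")

# Later (e.g., on next year's data), reload it and predict on unseen rows.
loaded = joblib.load("ames_price_model.joblib")
future_prices = loaded.predict(np.random.rand(10, 5))
```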

o Problem

Two datasets were given for the project. Based on the training data, we have to build a linear regression model, then pass the test dataset to that model and predict the sale prices of the Ames houses.

o Goals & Objectives

The goal of this project is to create a regression model based on the Ames Housing Dataset. This model will predict the price of a house at sale.

3. Process & Insight

The training dataset, train.csv, contains 2051 rows and 81 columns, so it is large. First, we have to understand which features are needed and which are not. I started by generating the percentage of null values in each feature after the initial import into a DataFrame. This gives a broad overview of which features have many missing values; features with around 90% null values were discarded. I then moved on to correlations and several visualizations of the data to check for outliers and heavily missing features. I processed the columns and applied LabelEncoder to categorical features. I also added one extra feature, total square footage, which is the sum of Total Bsmt SF, 1st Flr SF, and 2nd Flr SF. Finally, I converted the categorical features into dummies, which expanded the data to a shape of (2927, 224).
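A minimal sketch of these cleaning steps with pandas and scikit-learn. Column names follow the author's description of the Ames data; the 90% threshold, the choice of ordinal columns to label-encode, and the name of the engineered feature are illustrative assumptions.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

train = pd.read_csv("train.csv")  # 2051 rows x 81 columns

# Percentage of missing values per feature, for a broad overview of the data.
null_pct = train.isnull().mean().sort_values(ascending=False) * 100
print(null_pct.head(10))

# Discard features that are roughly 90% (or more) null.
train = train.drop(columns=null_pct[null_pct >= 90].index)

# Label-encode some ordinal-style categorical features (columns chosen here
# only for illustration).
for col in ["Kitchen Qual", "Exter Qual"]:
    train[col] = LabelEncoder().fit_transform(train[col].astype(str))

# Engineered feature: total square footage of the house.
train["Total SF"] = train["Total Bsmt SF"] + train["1st Flr SF"] + train["2nd Flr SF"]

# Expand the remaining categorical features into dummy variables.
train = pd.get_dummies(train)
print(train.shape)
```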

4. Solution

I have used three regression models; the results are given below.

o Linear Regression: mean_squared_error = 438897985.1559217, R2 score = 93%.

o Ridge Regression: mean_squared_error = 451262096.295255, R2 score = 92.8%.

o Lasso Regression: mean_squared_error = 449613781.456452, R2 score = 92.8%.
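A sketch of how these three models and their metrics might be computed with scikit-learn. The target column name SalePrice, the train/validation split, and the default hyperparameters are assumptions; `train` is the cleaned DataFrame from the sketch above.

```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.metrics import mean_squared_error, r2_score

X = train.drop(columns=["SalePrice"])  # assumed target column name
y = train["SalePrice"]

X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25, random_state=42)

models = {
    "Linear": LinearRegression(),
    "Ridge": Ridge(),   # default hyperparameters, for illustration only
    "Lasso": Lasso(),
}

for name, model in models.items():
    model.fit(X_tr, y_tr)
    preds = model.predict(X_val)
    print(name,
          "MSE:", mean_squared_error(y_val, preds),
          "R2:", r2_score(y_val, preds))
```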

Comparing these three, the most accurate regression model is the linear regression model. In cross-validation, the linear model achieved a mean score of 90%, the Ridge model 91%, and the Lasso model 90%.
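A sketch of the cross-validation step, reusing the X, y, and models from the sketch above; the 5-fold setting and the R2 scoring choice are assumptions, as the post does not state them.

```python
from sklearn.model_selection import cross_val_score

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")
    print(name, "mean CV score:", scores.mean())
```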

5. The Results

Since the linear model is the best one, we tested the test data on that model, predicted the SalePrice values, and exported the predictions in CSV format as instructed. The price increases with factors such as Neighborhood_StoneBr, Condition 2_PosA, Exterior 1st_CBlock, Sale Type_Con, Roof Matl_WdShngl, etc. The price decreases with factors such as Misc Feature_TenC, MS Zoning_A (agr), Heating_OthW, Condition 2_RRAe, Roof Style_Mansard, etc.
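A sketch of the final prediction, CSV export, and coefficient inspection, again reusing X, y, and models from the earlier sketches. The Id column name and output file name are assumptions, and the test-set preparation is simplified; in practice the test data must be cleaned and dummified the same way as the training data.

```python
import pandas as pd

test = pd.read_csv("test.csv")
X_test = pd.get_dummies(test).reindex(columns=X.columns, fill_value=0)

final_model = models["Linear"].fit(X, y)
submission = pd.DataFrame({"Id": test["Id"],
                           "SalePrice": final_model.predict(X_test)})
submission.to_csv("submission.csv", index=False)

# Coefficients show which dummy features push the predicted price up or down.
coefs = pd.Series(final_model.coef_, index=X.columns).sort_values()
print(coefs.tail())   # largest positive effects (e.g., Neighborhood_StoneBr)
print(coefs.head())   # largest negative effects (e.g., Misc Feature_TenC)
```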

Stone Brook seems like it might be a good neighborhood to invest in. The Misc Feature value TenC (tennis court) hurts the value of a home the most. To increase the value of a home, the owner could choose cinder block as the exterior covering and a sale type of Con (contract, 15% down payment, regular terms).
