Data import
Exploratory data analysis (looking for missing and duplicated data points)
Splitting the data into the training set and testing set
Building a predictive model to predict future house prices
Evaluating the model results to identify the features that most significantly affect the price
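The missing- and duplicate-value checks in the exploratory step can be sketched as follows. This is a minimal illustration in plain Python rather than the R code from the original analysis; the rows and the helper names count_missing and count_duplicates are made up for the example.

```python
# Toy rows standing in for the housing table; the values are illustrative only.
rows = [
    {"rm": 6.5, "lstat": 4.9, "medv": 24.0},
    {"rm": 6.4, "lstat": 9.1, "medv": 21.6},
    {"rm": None, "lstat": 4.0, "medv": 34.7},  # one missing value
    {"rm": 6.5, "lstat": 4.9, "medv": 24.0},   # exact duplicate of the first row
]

def count_missing(rows):
    """Return a dict mapping each column to its number of missing (None) values."""
    counts = {}
    for row in rows:
        for col, value in row.items():
            counts[col] = counts.get(col, 0) + (1 if value is None else 0)
    return counts

def count_duplicates(rows):
    """Count rows that exactly repeat an earlier row."""
    seen = set()
    duplicates = 0
    for row in rows:
        # Sort by column name so the key does not depend on dict ordering.
        key = tuple(sorted(row.items(), key=lambda item: item[0]))
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)
    return duplicates

print(count_missing(rows))
print(count_duplicates(rows))
```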
A brief generalization of what the predictor variables are describing:
Each predictor variable describes either the house itself, such as the age of the housing and the number of rooms, or its location, such as the town crime rate and the weighted distance from the city's employment centers.
The following is the exploratory data analysis I implemented:
The histogram above shows the response variable. As the accompanying box plot shows, values above 40k are flagged as outliers. If we exclude these outliers, the median house price is approximately normally distributed, with a mean of 22.5k and a standard deviation of 9.2k.
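A box plot typically flags points lying more than 1.5 times the interquartile range beyond the quartiles. The sketch below applies that rule in plain Python to a handful of made-up prices. The function name iqr_outliers is hypothetical, and quartile conventions differ slightly between tools, so the exact cutoff may not match the one in the plot.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

# Made-up prices in thousands of dollars, with one extreme value.
prices = [18.0, 20.0, 21.0, 22.0, 23.0, 24.0, 25.0, 50.0]
print(iqr_outliers(prices))
```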
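The above plot is the density plot of house age: the distribution is skewed to the left, with most values close to 100. In this dataset the age variable records the percentage of owner-occupied units built before 1940, so this means that most towns in the data are dominated by older housing.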
A correlation matrix depicting the correlation between the independent variables and the dependent variable:
From the correlation plot, we conclude that the predictor variables most highly correlated with the house price are:
o 1- lstat: lower status of the population. Negatively correlated with medv (r = -0.74)
o 2- rm: average number of rooms. Positively correlated with medv (r = 0.69)
o 3- ptratio: pupil-teacher ratio by town. Negatively correlated with medv (r = -0.50)
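Each entry of a correlation matrix is a Pearson correlation coefficient between two columns. A minimal sketch of that calculation in plain Python, using made-up numbers rather than the real data (the function name pearson_r is an assumption for illustration):

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)

# Made-up values: price falls as lstat rises, so r should be negative.
lstat = [5.0, 10.0, 15.0, 20.0]
medv = [30.0, 25.0, 18.0, 12.0]
print(pearson_r(lstat, medv))
```

A value near -1 or +1 indicates a strong linear relationship, and the sign gives its direction.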
Modeling and results:
- We partition the data into training and testing sets, fit a linear regression model for the median house price on the training set, and then check how well it predicts the dependent variable on the testing set.
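The split-then-fit workflow can be illustrated with a deliberately simplified sketch: one predictor instead of twelve, synthetic noise-free data, and plain Python rather than the R model from the original analysis. The 80/20 split, the seed, and all function names are assumptions made for the example.

```python
import math
import random

def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of rows and hold out a fraction for testing."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return mean_y - slope * mean_x, slope

def rmse(actual, predicted):
    """Root mean squared error between actual and predicted values."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

# Synthetic data following medv = 1.0 + 3.5 * rm exactly (hypothetical numbers).
rooms = [4.0 + 0.5 * i for i in range(10)]
data = [(rm, 1.0 + 3.5 * rm) for rm in rooms]

train, test = train_test_split(data)
intercept, slope = fit_line([x for x, _ in train], [y for _, y in train])
predictions = [intercept + slope * x for x, _ in test]
test_rmse = rmse([y for _, y in test], predictions)
print(intercept, slope, test_rmse)
```

Because the synthetic data has no noise, the fit recovers the generating line and the test error is essentially zero; on real data the test error measures how well the model generalizes to houses it has not seen.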
Interpretation of the above model results:
- The F-statistic is significant, so at least one feature is related to the dependent variable.
- After checking the significance level of each feature, 10 out of 12 predictors are related to the median house price.
- The model shows that 73% of the variability in the median house price is explained by the predictors.
- The features marked with stars in the model output are statistically significant predictors of the median house price.
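The share of variability explained by the predictors is the model's R-squared, defined as one minus the ratio of the residual sum of squares to the total sum of squares. A minimal sketch of that formula in plain Python, with made-up actual and predicted values (the function name r_squared is an assumption):

```python
def r_squared(actual, predicted):
    """R-squared: 1 - (residual sum of squares / total sum of squares)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

# Made-up prices (in thousands) and model predictions.
actual = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 37.0]
print(r_squared(actual, predicted))
```

A model that always predicts the mean price scores 0, and a model with perfect predictions scores 1.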
Possible recommendations to the client:
The key for the client is to pay attention to the statistically significant features, which carry the most weight in determining the median house price. For the model to make accurate predictions, these features should be prioritized. The ones that matter most for predicting the price are:
Each one-percentage-point increase in the lower status of the population is associated with a 0.5k decrease in the median house price, holding everything else constant.
Each 1 unit increase in the weighted distance to the five Boston employment centers is associated with a 1.36k decrease in the median house price, holding everything else constant.
Each one-room increase in the average number of rooms is associated with a 3.47k increase in the median house price, holding everything else constant.
Each 1 unit increase in the index of accessibility to radial highways is associated with a 0.29k increase in the median house price, holding everything else constant.
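Because the model is linear, these effects add up: the predicted change in price is each coefficient multiplied by the change in its feature, summed over the features that change. The sketch below works through that arithmetic using the four coefficients quoted above. The keys lstat and rm come from the text, while dis and rad are assumed column names for the distance and highway-access variables, and predicted_change is a hypothetical helper.

```python
# Coefficients quoted in the text, in thousands of dollars per unit change.
# "dis" and "rad" are assumed names for the distance and highway-access columns.
COEFFICIENTS = {
    "lstat": -0.50,
    "dis": -1.36,
    "rm": 3.47,
    "rad": 0.29,
}

def predicted_change(deltas):
    """Predicted change in median price (in thousands) for the given feature changes,
    holding every feature not listed in deltas constant."""
    return sum(COEFFICIENTS[name] * change for name, change in deltas.items())

# One extra room on average, combined with a two-point rise in lstat.
print(predicted_change({"rm": 1, "lstat": 2}))
```

In this example the gain from the extra room (3.47k) is partly offset by the rise in lstat (2 x 0.5k), leaving a predicted increase of about 2.47k.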
The code is available in the following GitHub repository:
https://github.com/israajashaami/Codes/blob/main/House%20Prices.R