Partition Data In BostonHousing Into Training (400 Records

Partition the data in BostonHousing.xls into training (400 records) and test (106 records) sets.

The BostonHousing.xls dataset contains housing data for 506 census tracts of Boston from the 1970 census. It comprises 14 variables, including the target variable MEDV, the median value of owner-occupied homes in thousands of USD. The dataset is widely used for regression analysis and for studying the factors that influence property values. This assignment involves partitioning the data, building multiple linear regression and logistic regression models, and interpreting their statistical outputs using SPSS Modeler.

Paper for the Above Instructions

The Boston Housing dataset is widely used in statistical learning, econometrics, and real estate research. Its range of predictors supports modeling of the factors that affect housing prices. The first step is to divide the dataset into training and test subsets, so that a model fitted on one subset can be evaluated on records it has not seen; this shows how well the model generalizes and exposes overfitting. In SPSS Modeler, the partition is made by randomly selecting 400 records for training and reserving the remaining 106 records for testing. Random selection guards against systematic differences between the two subsets, although it does not guarantee that either is perfectly representative.
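The random 400/106 split described above can be sketched outside SPSS Modeler in a few lines of Python. This is only an illustration of the idea, not the SPSS Modeler workflow: the seed value and the function name `partition_indices` are assumptions introduced here, and the code works on row indices rather than on the actual file.

```python
import random

N_RECORDS = 506   # census tracts in the dataset
N_TRAIN = 400     # training-set size from the assignment
SEED = 42         # assumed seed, used only to make the split repeatable


def partition_indices(n_records, n_train, seed):
    """Return (train, test) lists of row indices drawn without replacement."""
    rng = random.Random(seed)
    shuffled = list(range(n_records))
    rng.shuffle(shuffled)                  # random order of all row indices
    train = sorted(shuffled[:n_train])     # first n_train indices go to training
    test = sorted(shuffled[n_train:])      # the remainder goes to testing
    return train, test


train_idx, test_idx = partition_indices(N_RECORDS, N_TRAIN, SEED)
print(len(train_idx), len(test_idx))  # 400 106
```

Because every index lands in exactly one of the two lists, no record can appear in both the training and the test set.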

Multiple linear regression of the median home value (MEDV) estimates the combined effect of the selected predictors on property values. The chosen independent variables are CRIM (per capita crime rate), CHAS (a dummy variable equal to 1 if the tract bounds the Charles River and 0 otherwise), and RM (average number of rooms per dwelling). Each is expected to influence housing prices: higher crime rates typically reduce a neighborhood's desirability, proximity to natural amenities such as rivers tends to raise prices, and dwellings with more rooms tend to be valued higher.

Fitting the regression model means estimating a coefficient for each predictor and then evaluating the fit with statistics such as the R-squared value, the F-statistic for overall significance, and the t-statistic for each individual predictor. These metrics gauge the explanatory power of the model and the significance of each term, and they guide the interpretation of each predictor's effect on median home values.
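For reference, the model and the fit statistics named above can be written out as follows. This is a standard textbook formulation, assuming ordinary least squares on the n = 400 training records with p = 3 predictors; the symbols are notation introduced here, not output from SPSS Modeler.

```latex
\begin{align*}
\text{MEDV}_i &= \beta_0 + \beta_1\,\text{CRIM}_i + \beta_2\,\text{CHAS}_i + \beta_3\,\text{RM}_i + \varepsilon_i \\[4pt]
R^2 &= 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} \\[4pt]
F &= \frac{R^2 / p}{(1 - R^2)/(n - p - 1)} = \frac{R^2 / 3}{(1 - R^2)/396} \\[4pt]
t_j &= \frac{\hat{\beta}_j}{\operatorname{SE}(\hat{\beta}_j)}, \qquad j = 1, 2, 3
\end{align*}
```

Here y_i is the observed MEDV for tract i, \hat{y}_i is its fitted value, and \bar{y} is the training-set mean of MEDV.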

The fitted regression equation gives a quantitative basis for predicting median home values from specific predictor values. For example, for a tract where CRIM equals 0.325, CHAS equals 1 (the tract bounds the river), and RM equals 6.5 rooms, substituting these values into the regression equation yields an estimated median home value.
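Written out, the substitution for this example looks as follows. The coefficients are left symbolic because their values come from the fitted model; \hat{\beta}_0 denotes the estimated intercept and \hat{\beta}_1, \hat{\beta}_2, \hat{\beta}_3 the estimated coefficients of CRIM, CHAS, and RM (notation assumed here).

```latex
\[
\widehat{\text{MEDV}}
  = \hat{\beta}_0 + \hat{\beta}_1 (0.325) + \hat{\beta}_2 (1) + \hat{\beta}_3 (6.5)
\]
```

Since MEDV is recorded in thousands of USD, the result is read in the same units.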

In addition to linear regression, the analysis extends to logistic regression, with the binary target variable CAT.MEDV representing high (1) and low (0) median home values. Logistic regression predicts the probability that a given census tract falls into the high-value category, based on the same set of predictors. Setting the logistic model's statistics (pseudo R-squared values, likelihood ratio tests, and classification accuracy) beside the linear model's R-squared and F-test shows how well each approach captures the underlying patterns, although the measures are not directly comparable because the two models have different target variables.
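The probability calculation and the classification-accuracy measure mentioned above can be sketched in Python as follows. The function names, the 0.5 cutoff, and the all-zero placeholder coefficients are assumptions made for illustration; they are not estimates from the Boston data or output from SPSS Modeler.

```python
import math


def high_price_probability(intercept, coefs, predictors):
    """Logistic model: P(CAT.MEDV = 1) = 1 / (1 + exp(-(b0 + sum of b_j * x_j)))."""
    logit = intercept + sum(b * x for b, x in zip(coefs, predictors))
    return 1.0 / (1.0 + math.exp(-logit))


def classification_accuracy(actual, probabilities, cutoff=0.5):
    """Share of records whose predicted class matches the actual class."""
    # A record is classified as high (1) when its probability reaches the cutoff.
    predicted = [1 if p >= cutoff else 0 for p in probabilities]
    hits = sum(1 for a, c in zip(actual, predicted) if a == c)
    return hits / len(actual)


# Placeholder coefficients (NOT fitted values), in the order CRIM, CHAS, RM.
b0, b = 0.0, [0.0, 0.0, 0.0]
p = high_price_probability(b0, b, [0.325, 1, 6.5])
print(p)  # 0.5, because all-zero coefficients give a logit of 0
```

With fitted coefficients in place of the placeholders, `classification_accuracy` would be applied to the 106 test records to measure how often the predicted class agrees with the actual CAT.MEDV value.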

Overall, using both regression techniques gives a fuller picture of how crime rate, river proximity, and dwelling size relate to housing prices in Boston, with practical applications in real estate valuation, policy-making, and urban planning. SPSS Modeler supports these analyses with tools for data partitioning, model fitting, and statistical evaluation.
