Analyzing AirBnB Data
Introduction
In the module in Canvas where you found this document, you will also find the file "AirBnBData.xlsx". You will need that dataset to complete this project. Follow the instructions on this worksheet to complete the project. The project is due Monday, August 17th. It is optional, but can improve your exam score if you choose to complete it.
The AirBnB dataset contains a mix of qualitative (categorical) and quantitative (numerical) variables relating to listings in Seattle in 2019. There are eight measured variables for each listing: Neighborhood, Room Type, Price per Night, Minimum Nights, Number of Reviews, Reviews per Month, Number of Host Listings, and Days Available. Summary statistics such as means, standard deviations, and five-number summaries are provided for the numerical variables. Your task is to perform a comprehensive data analysis based on these variables, including classification, visualization, interval estimation, and interpretation.
Introduction
The AirBnB dataset offers a rich context for understanding the distribution and characteristics of lodging options in Seattle during 2019. By focusing on both qualitative and quantitative variables, this analysis aims to uncover patterns and insights that can inform hosts, guests, and policymakers. Employing probability, inferential techniques, and visualization methods, the objective is to provide a detailed statistical profile of the listings and their variability across different neighborhoods and room types.
1. Identification of Variables
The dataset includes the following eight variables:
- Neighborhood (Qualitative)
- Room Type (Qualitative)
- Price/Night (Quantitative)
- Min. Nights (Quantitative)
- # Reviews (Quantitative)
- Reviews/Mo. (Quantitative)
- # Host Listings (Quantitative)
- Days Available (Quantitative)
In summary, two are qualitative variables (Neighborhood, Room Type), and six are quantitative variables (Price/Night, Min. Nights, # Reviews, Reviews/Mo., # Host Listings, Days Available).
2. Analysis of a Qualitative Variable
a. Selected Variable
I chose the variable Neighborhood for analysis.
b. Relative Frequency Distribution
Based on the dataset, the relative frequencies of listings in each neighborhood are calculated by dividing the number of listings in each neighborhood by the total number of listings (100). Suppose, for instance, that the "Capitol Hill" neighborhood has 20 listings; then its relative frequency is 20/100 = 0.20 or 20%.
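The calculation described above can be sketched in a few lines of Python. This is only an illustration: the neighborhood counts below are hypothetical placeholders (the 20 and 25 echo the examples used in this paper, and "Other" is an invented catch-all), not values read from AirBnBData.xlsx.

```python
from collections import Counter

def relative_frequencies(values):
    """Return {category: count / total} for a list of categorical values."""
    counts = Counter(values)
    total = len(values)
    return {category: count / total for category, count in counts.items()}

# Hypothetical sample of 100 listings (not the real dataset).
neighborhoods = ["Capitol Hill"] * 20 + ["Ballard"] * 25 + ["Other"] * 55
rel_freq = relative_frequencies(neighborhoods)

for name, share in sorted(rel_freq.items()):
    print(f"{name}: {share:.2f}")
```

The relative frequencies always sum to 1, which is a quick check that no listing was dropped or counted twice.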
c. Graphical Display
An appropriate graph for the Neighborhood variable is a bar chart showing the proportions of listings per neighborhood. The x-axis represents neighborhoods, and the y-axis displays the relative frequency in percentage or proportion. The bars' heights correspond to the relative frequencies, making it easy to compare neighborhood popularity or concentration of listings.
d. Sample Proportion Calculation
Suppose I selected the neighborhood "Ballard". If 25 listings are in Ballard, then the sample proportion (p̂) is 25/100 = 0.25, indicating that 25% of listings are in this neighborhood.
e. Confidence Interval for the Population Proportion
Using the sample proportion p̂ = 0.25, the sample size n = 100, and a confidence level of 90%, the standard error (SE) is calculated as √[p̂(1−p̂)/n] = √[0.25 × 0.75/100] ≈ 0.0433. For a 90% confidence interval, z* ≈ 1.645. Therefore, the interval is:
0.25 ± 1.645 × 0.0433 ≈ (0.179, 0.321)
This interval suggests that between 18% and 32% of listings in the broader population are in the chosen neighborhood, with 90% confidence.
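As a check on the arithmetic, the interval can be recomputed with a short Python sketch. It assumes the normal-approximation (Wald) formula used in this section; the function name proportion_ci is just an illustrative choice.

```python
import math

def proportion_ci(successes, n, z):
    """Normal-approximation (Wald) confidence interval for a proportion."""
    p_hat = successes / n
    se = math.sqrt(p_hat * (1 - p_hat) / n)  # standard error of p-hat
    margin = z * se
    return p_hat - margin, p_hat + margin

# 25 of 100 listings in the chosen neighborhood, z* = 1.645 for 90% confidence.
low, high = proportion_ci(25, 100, 1.645)
print(f"({low:.3f}, {high:.3f})")  # prints (0.179, 0.321)
```

The Wald interval is reasonable here because both n·p̂ = 25 and n·(1−p̂) = 75 are comfortably above 10.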
f. Interpretation
We are 90% confident that the true proportion of all Seattle AirBnB listings located in the selected neighborhood lies within this interval. This describes the neighborhood's relative share among all listings while accounting for sampling variability.
3. Analysis of a Quantitative Variable
a. Selected Variable
I selected Price per Night for analysis.
b. Frequency Distribution/Table
Using the summary statistics provided, such as the five-number summary, the minimum price is $32, Q1 is $78.75, median is $108.50, Q3 is $375, and maximum is $750. A frequency table can categorize prices into intervals—e.g., $30–$100, $101–$300, $301–$750—and count the number of listings in each bin.
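A minimal sketch of how such a frequency table could be built is shown below. The price list is hypothetical, since the raw data are not reproduced in this paper. The bin edges follow the intervals suggested above, treated as (low, high] ranges so that non-integer prices are not lost between bins.

```python
def frequency_table(prices, edges):
    """Count prices per bin. Each bin is (low, high]; the first bin also includes its low edge."""
    counts = [0] * (len(edges) - 1)
    for price in prices:
        for i in range(len(counts)):
            low, high = edges[i], edges[i + 1]
            if low < price <= high or (i == 0 and price == low):
                counts[i] += 1
                break
    return counts

edges = [30, 100, 300, 750]
# Hypothetical prices for illustration only (not the real dataset).
prices = [32, 65, 79, 95, 108, 109, 150, 220, 375, 750]
counts = frequency_table(prices, edges)

for i, count in enumerate(counts):
    share = count / len(prices)
    print(f"${edges[i]}-${edges[i + 1]}: {count} listings ({share:.0%})")
```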
c. Histogram
A histogram can be generated with price ranges on the x-axis and frequency counts or relative frequencies on the y-axis. This visual displays the distribution shape across different pricing levels.
d. Distribution Shape
The histogram indicates a right-skewed distribution: most listings cluster around the lower to median prices, with a tail extending toward higher prices. The median ($108.50) is lower than the reported mean ($136.24), which is typical of right skew. The shape suggests a spread toward higher prices, and possibly outliers.
e. Mean vs. Median Comparison
Given the median ($108.50) and the mean ($136.24), the mean exceeds the median, consistent with a right-skewed distribution. This suggests that a minority of listings with high prices are pulling the average above the median.
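The effect of a few expensive listings on the mean can be illustrated with a small, made-up example. The six prices below are hypothetical and chosen only to show the mechanism; they are not taken from the dataset.

```python
import statistics

# Hypothetical prices: five modest listings and one high-end listing.
sample_prices = [60, 80, 95, 110, 120, 750]

sample_mean = statistics.mean(sample_prices)
sample_median = statistics.median(sample_prices)

print(f"mean = {sample_mean}, median = {sample_median}")
# The single $750 listing raises the mean well above the median.
```

Removing the $750 listing would bring the mean close to the median, which is the same pattern the AirBnB summary statistics suggest on a larger scale.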
f. Outlier Detection
Using the five-number summary, a data point is flagged as an outlier if it exceeds Q3 + 1.5 × IQR or falls below Q1 - 1.5 × IQR. The IQR is Q3 - Q1 = 375 - 78.75 = 296.25. Calculations:
- Lower bound: 78.75 - 1.5 × 296.25 = -365.625 (no outliers below, since the minimum price of $32 is above this bound)
- Upper bound: 375 + 1.5 × 296.25 = 819.375
Since the maximum price of $750 is below 819.375, no outliers are detected by this rule.
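The 1.5 × IQR rule can be verified with a short Python sketch using the five-number summary quoted in this section. The function name tukey_fences is an illustrative choice, not something from the assignment.

```python
def tukey_fences(q1, q3, k=1.5):
    """Return the (lower, upper) outlier bounds Q1 - k*IQR and Q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Five-number summary values reported for Price per Night.
minimum, q1, q3, maximum = 32.0, 78.75, 375.0, 750.0

lower, upper = tukey_fences(q1, q3)
has_outliers = minimum < lower or maximum > upper

print(lower, upper, has_outliers)  # prints -365.625 819.375 False
```

Because only the minimum and maximum are checked, this confirms that no value lies outside the bounds without needing the full list of prices.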
g. 95% Confidence Interval for the Mean
Using the sample mean ($136.24), standard deviation ($99.55), and sample size n = 100, the standard error (SE) is 99.55/√100 = 9.955. For 95% confidence, z* ≈ 1.96. The interval:
136.24 ± 1.96 × 9.955 ≈ (116.73, 155.75)
This interval estimates the true mean price per night, with 95% confidence, to be between approximately $116.73 and $155.75.
h. Interpretation of Confidence Interval
We are 95% confident that the average price per night for all Seattle AirBnB listings falls between $116.73 and $155.75. This information can guide hosts in pricing strategies or travelers in budget planning.
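The 95% interval for the mean can be recomputed from the summary statistics with the sketch below. It assumes the z-based formula used in part g; the function name mean_ci is illustrative.

```python
import math

def mean_ci(mean, sd, n, z):
    """z-based confidence interval for a mean from summary statistics."""
    se = sd / math.sqrt(n)  # standard error of the sample mean
    margin = z * se
    return mean - margin, mean + margin

# Reported summary statistics for Price per Night, z* = 1.96 for 95% confidence.
low, high = mean_ci(136.24, 99.55, 100, 1.96)
print(f"({low:.2f}, {high:.2f})")  # prints (116.73, 155.75)
```

Because the population standard deviation is estimated from the sample, a t critical value with 99 degrees of freedom could be used instead of z*; with n = 100 it would widen the interval only slightly.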
i. Noteworthy Insights and Takeaways
The skewness in the price data indicates that while most listings are priced relatively modestly, there are some high-end accommodations that significantly increase the average. Understanding this distribution helps in setting competitive prices and identifying potential outliers or premium listings. The fact that the median is substantially lower than the mean highlights the importance of considering median values in such skewed distributions when making decisions or summaries.