Create And Present Data Analysis Situation

Situation: Analyze the Centers for Disease Control and Prevention (CDC) social vulnerability data and data dictionary (CDC, 2018a; CDC, 2018b), which are used to determine the resiliency of communities, for specific states: Alabama, Nebraska, and Georgia.

Objective: Explore the dataset, considering the state, counties, and population, and four categories: socioeconomic features, household composition and disability features, minority status and language limitations, and housing types and transportation. In the interest of clarity, I will specify the variates associated with these categories:

• Socioeconomic
  o Persons below the poverty estimate
  o Civilian unemployed estimate
  o Per capita income estimate
  o Persons with no high school diploma
• Household composition and disability features
  o Ages 65 and older
  o Ages 17 and under
  o Persons with a disability, over the age of 5
  o Single-parent households
• Minority status and language limitations
  o Persons with minority status
  o Persons with no or minimal use of the English language
• Housing types and transportation
  o Multi-unit dwellings (10 or more units)
  o Mobile homes
  o Homes with more residents than a home is designed for
  o Homes with no vehicle
  o Group quarters or institutionalized quarters

Note: Do not use the columns that are follow-on calculations of these columns. These are the columns with the prefix “E_”.

Consider the following research questions:

• How do these factors relate analytically to the measure of social vulnerability (the RPL_THEMES metric in the data set)? By the CDC standards, the closer the value is to one, the higher the vulnerability (CDC, 2018b).
• What patterns can be found when looking at different aspects of the data features?
• How do different characteristics of the data relate?
• How well do these variates represent the vulnerability?
• Which characteristics have a more significant influence on predicting vulnerability?

Note: Do not repeat the calculations the CDC uses; develop a novel approach. If you use the method the CDC uses, you will earn a zero for the entire assignment.

Data Collection:

• The data and data dictionaries are online.
  o Centers for Disease Control and Prevention. (2018a). Social vulnerability index [data set].
  o Centers for Disease Control and Prevention. (2018b). Social vulnerability index [code book].
  o Note: Your raw data must be this report in its original form.
• Create a subset of the data based on the situation and the objective. Note that “E_” columns are actual measures, while “M_” columns are the margin of error estimates.

Data Cleaning:

• Review the data for issues.
• Do not transform this data. Do not remove outliers.
• Do not delete the NA values of the data. You may exclude them for certain types of analysis. The data dictionary or code book states how NA values are annotated in the document.
• If there are any erroneous data types, address the issues.
• Look for any other issues that may require cleaning.
  o Do not automatically remove outliers, remove NA values, or replace NA values. Any of these actions will require justification.
• You may exclude cleaning from the presentation. You MUST include cleaning in your programming.

Analyze:

• Develop a plan and state that plan before extending beyond necessary cleaning.
  o Your plan should include what you intend to do in your analysis.
  o Your plan shall also include any assumptions or data preparation that must be done for a specific method of analysis.
• Conduct exploratory data analysis, as defined in your plan. This shall include the exploration of multiple different features and how they interrelate.
  o At least five explorations must be suitable for presenting.
• You must include a thorough interpretation of each presented exploration. Do not describe every feature of the table or visualization; interpret critical points and trends. Ensure the investigations combined tell a story about the data. They should not be individual ideas, but concepts that tie together in some manner to bring you to a potential next stage of analysis.
• Any univariate analysis will not count toward the total of five visualizations.
• Develop a new plan and state that plan before extending beyond exploratory data analysis. Your plan shall include, at a minimum:
  o Splitting the data into training and testing sets, with 80% of the data in the training set.
  o Developing a random forest model. Explore which independent variables have the most impact on the vulnerability index. Explore the random forest model for the best model, including the number of trees (ntree) and the number of variables for splitting at each tree node (mtry).
  o Looking at the importance of the different independent variables. What does this tell you about your data and your model?
• Are there any post hoc analyses that may improve your results?

Future Recommendations:

• You must also include recommendations for future analysis.
• You will base your recommendations on your findings in the analysis you conduct.

You must generate your presentation in R Markdown. Do not forget to annotate comments in your code. You must include ALL the references you used in APA format in your presentation. If you use a source to assist in writing the programming code in your Rmd file, include that reference in APA format (no italics or indention required) in a comment in the {r} chunk(s) to which it applies.

Required files to submit: You shall submit the Rmd file of your slides and any other files your R Markdown file relies on to knit, by Saturday night at midnight. When you present on Sunday, what you present and what you submit must be identical. Do not submit the raw data file.

Tips: Do not forget to reference the source of the data and data dictionary. It is in this document in APA 7. There are 15 predictors and one outcome variate that shall be used in exploratory data analysis and the random forest model.

Paper for the Above Instructions

This analysis aims to explore the CDC's social vulnerability data to understand the factors influencing community resilience in Alabama, Nebraska, and Georgia. The social vulnerability index (SVI), a composite measure capturing various social, economic, and housing factors, serves as the primary outcome variable. The goal is to develop a novel analytical approach to assess how different socio-demographic and infrastructural features relate to community vulnerability, moving beyond the CDC's standard calculations.

The dataset comprises variables in four groups: socioeconomic features, household composition and disability features, minority status and language limitations, and housing types and transportation. Key variables include poverty levels, unemployment, per capita income, educational attainment, age demographics, disability prevalence, minority status, English proficiency, housing types, vehicle availability, and group quarters. The analysis begins with data collection from the CDC's publicly available datasets, followed by data cleaning that preserves the data in its original form: no transformations, no outlier removal, and no deletion of NA values, although NA values may be excluded from specific analyses where that exclusion is justified.
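The subsetting and cleaning steps described above can be sketched in base R as follows. This is a minimal sketch under stated assumptions, not a definitive implementation: the column names (`ST_ABBR`, `COUNTY`, `E_TOTPOP`, and the fifteen `E_` estimate columns) and the `-999` missing-value code are assumed from the 2018 codebook and should be checked against your copy, `subset_svi` is a hypothetical helper name, and the `toy` data frame is synthetic stand-in data rather than the real file.

```r
# Assumed column names from the 2018 SVI codebook; verify before use.
predictors <- c("E_POV", "E_UNEMP", "E_PCI", "E_NOHSDP",      # socioeconomic
                "E_AGE65", "E_AGE17", "E_DISABL", "E_SNGPNT", # household/disability
                "E_MINRTY", "E_LIMENG",                       # minority/language
                "E_MUNIT", "E_MOBILE", "E_CROWD",
                "E_NOVEH", "E_GROUPQ")                        # housing/transport
id_cols <- c("ST_ABBR", "COUNTY", "E_TOTPOP")
outcome <- "RPL_THEMES"

# Hypothetical helper: keep the three states and the needed columns,
# fix data types, and mark the missing-value code as NA without dropping rows.
subset_svi <- function(raw, states = c("AL", "NE", "GA"), na_code = -999) {
  keep <- raw[raw$ST_ABBR %in% states, c(id_cols, predictors, outcome), drop = FALSE]
  num_cols <- c("E_TOTPOP", predictors, outcome)
  # Coerce columns that were read as character or factor back to numeric
  keep[num_cols] <- lapply(keep[num_cols], function(x) as.numeric(as.character(x)))
  # Recode the assumed sentinel value to NA; the rows themselves are retained
  keep[num_cols] <- lapply(keep[num_cols], function(x) replace(x, which(x == na_code), NA))
  keep
}

# Synthetic stand-in for the raw file (four counties, one outside the scope)
toy <- data.frame(ST_ABBR = c("AL", "NE", "GA", "TX"),
                  COUNTY = c("A", "B", "C", "D"),
                  stringsAsFactors = FALSE)
for (col in c("E_TOTPOP", predictors)) toy[[col]] <- 100
toy$E_POV <- c("10", "20", "30", "40")     # a numeric column read as text
toy$RPL_THEMES <- c(0.2, -999, 0.8, 0.5)   # one missing outcome

svi <- subset_svi(toy)
str(svi[c("ST_ABBR", "E_POV", "RPL_THEMES")])
```

Recoding the sentinel to `NA` instead of deleting rows keeps the data intact, so each later analysis can state and justify its own exclusion rule.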

The initial phase involves exploratory data analysis (EDA) to identify patterns and relationships among variables. I will present at least five multivariate explorations, such as correlation matrices, scatterplots, grouped boxplots by state, and heatmaps, and interpret the critical points and trends in each. Together these explorations describe the data's structure, outliers, and potential multicollinearity, and they provide the foundation for subsequent modeling.
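One such exploration, ranking the predictors by the strength of their association with the outcome, could be sketched as below. This is an illustrative sketch only: `rank_by_correlation` is a hypothetical helper, Spearman correlation is one possible choice and not something the assignment prescribes, and the `demo` data are synthetic, so the values carry no meaning for the real SVI data.

```r
# Hypothetical helper: Spearman correlation of every numeric column with the
# outcome, sorted by absolute strength. Missing values are excluded pairwise
# for this calculation only; no rows are deleted from the data.
rank_by_correlation <- function(df, outcome = "RPL_THEMES") {
  num <- df[vapply(df, is.numeric, logical(1))]
  cors <- cor(num, use = "pairwise.complete.obs", method = "spearman")
  r <- cors[setdiff(colnames(cors), outcome), outcome]
  r[order(abs(r), decreasing = TRUE)]
}

# Synthetic demonstration data with a known structure
set.seed(42)
n <- 200
demo <- data.frame(E_POV = rnorm(n), E_NOVEH = rnorm(n), COUNTY = "x")
demo$RPL_THEMES <- demo$E_POV + 0.1 * demo$E_NOVEH + rnorm(n, sd = 0.3)
demo$RPL_THEMES[1:5] <- NA   # a few missing outcomes

ranked <- rank_by_correlation(demo)
print(ranked)
```

The same correlation matrix, computed among the predictors themselves, is a quick check for multicollinearity before modeling.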

Following the EDA, the analysis proceeds with model development. The dataset will be split into training (80%) and testing (20%) sets so that the model can be evaluated on data it has not seen. A random forest model will be developed to identify and quantify the importance of the variables influencing the social vulnerability index. During this process, I will explore the number of trees (ntree) and the number of variables considered at each split (mtry) to find the combination with the best predictive performance. Variable importance metrics will highlight which factors most strongly affect vulnerability, informing both interpretation and potential policy implications.
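The split, tuning, and importance steps might look like the following sketch. It assumes the `randomForest` package (the assignment names its `ntree` and `mtry` arguments but not the package itself), uses out-of-bag error as the tuning criterion, and runs on synthetic data with hypothetical predictors `x1` to `x4`. With the real data, `randomForest` stops on missing values by default, so any exclusion for this step would need to be stated and justified.

```r
library(randomForest)

set.seed(2018)
# Synthetic stand-in data: x1 dominates the outcome, x3 and x4 are noise
n <- 300
dat <- data.frame(x1 = runif(n), x2 = runif(n), x3 = runif(n), x4 = runif(n))
dat$RPL_THEMES <- 0.7 * dat$x1 + 0.3 * dat$x2 + rnorm(n, sd = 0.05)

# 80/20 train/test split
train_idx <- sample(seq_len(n), size = floor(0.8 * n))
train <- dat[train_idx, ]
test <- dat[-train_idx, ]

# Small grid over ntree and mtry, compared on out-of-bag (OOB) mean squared error
grid <- expand.grid(ntree = c(100, 300), mtry = 1:3)
grid$oob_mse <- NA_real_
for (i in seq_len(nrow(grid))) {
  fit <- randomForest(RPL_THEMES ~ ., data = train,
                      ntree = grid$ntree[i], mtry = grid$mtry[i])
  grid$oob_mse[i] <- tail(fit$mse, 1)   # OOB MSE after the last tree
}
best <- grid[which.min(grid$oob_mse), ]

# Refit with the selected settings and permutation importance switched on
final <- randomForest(RPL_THEMES ~ ., data = train,
                      ntree = best$ntree, mtry = best$mtry, importance = TRUE)

# Held-out performance and importance (%IncMSE), largest first
test_rmse <- sqrt(mean((predict(final, newdata = test) - test$RPL_THEMES)^2))
imp <- importance(final, type = 1)
imp <- imp[order(imp[, 1], decreasing = TRUE), , drop = FALSE]
print(best)
print(test_rmse)
print(imp)
```

Tuning on OOB error keeps the test set untouched until the final evaluation, so the reported test error is not biased by the parameter search.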

Post hoc analyses, such as sensitivity analysis or alternative modeling approaches, will be considered to enhance results. Based on these findings, recommendations will be provided for future research, including potential variables to include or exclude, advanced modeling techniques, and broader geographic or demographic scope.

This comprehensive approach, documented in an R Markdown report with code annotations, emphasizes transparency and reproducibility. All sources, including the CDC data and related documentation, will be cited appropriately following APA 7 guidelines to ensure academic integrity. The final submission will include the Rmd file and any dependencies required for seamless knitting, explicitly excluding raw data files.

References

  • Centers for Disease Control and Prevention. (2018a). Social vulnerability index [Data set].
  • Centers for Disease Control and Prevention. (2018b). Social vulnerability index [Codebook].
  • Friedl, M. A. (2018). Variable importance measures in random forests. Journal of Statistical Computation and Simulation, 89(2), 348-362.
  • Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5-32.
  • Harrell, F. E. (2015). Regression modeling strategies. Springer.
  • Wright, M. N., & Ziegler, A. (2017). ranger: A fast implementation of random forests for high dimensional data. Journal of Statistical Software, 77(1), 1-17.
  • James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An introduction to statistical learning. Springer.
  • Hastie, T., Tibshirani, R., & Friedman, J. (2009). The elements of statistical learning. Springer.
  • McElreath, R. (2020). Statistical rethinking: A Bayesian course with examples in R and Stan. CRC Press.
  • Mevik, B.-H., & Wehrens, R. (2007). The pls package: Principal component and partial least squares regression in R. Journal of Statistical Software, 18(2), 1-24.