Create A Presentation March 2020

Creating and presenting data analysis

Tip: Read through this document in its entirety before you begin.

Situation: Analyze the Centers for Disease Control and Prevention (CDC) social vulnerability data and data dictionary (CDC, 2018a; CDC, 2018b), which are used to determine the resiliency of communities, within specific states: Illinois, Wisconsin, and Michigan.

Objective: Explore the dataset, considering the state, counties, and population, and four categories: socioeconomic features, household composition and disability features, minority status and language limitations, and housing types and transportation. In the interest of clarity, I will specify the variates associated with these categories:

• Socioeconomic
  o Persons below the poverty estimate
  o Civilian unemployed estimate
  o Per capita income estimate
  o Persons with no high school diploma
• Household composition and disability features
  o Ages 65 and older
  o Ages 17 and under
  o Persons with a disability, over the age of 5
  o Single-parent households
• Minority status and language limitations
  o Persons with minority status
  o Persons with no or minimal use of the English language
• Housing types and transportation
  o Multi-unit dwellings (10 or more units)
  o Mobile homes
  o Homes with more residents than a home is designed for
  o Homes with no vehicle
  o Group quarters or institutionalized quarters

Note: Do not use the columns that are follow-on calculations of these columns. The columns to use are those with the prefix “E_”.

Consider the following research questions:

• How do these factors relate analytically to the measure of social vulnerability (RPL_THEMES in the data set)? By the CDC standards, the closer the value is to one, the higher the vulnerability (CDC, 2018b).
• What patterns can be found when looking at different aspects of the data features?
• How do different characteristics of the data relate?
• How well do these variates represent the vulnerability?
• Which characteristics have a more significant influence on predicting vulnerability?

Note: Do not repeat the calculations the CDC uses; develop a novel approach. If you use the method the CDC uses, you will earn a zero for the entire assignment.

Data Collection:

• The data and data dictionaries are online.
  o Centers for Disease Control and Prevention. (2018a). Social vulnerability index [data set].
  o Centers for Disease Control and Prevention. (2018b). Social vulnerability index [code book].
  o Note: Your raw data must be this report in its original form.
• Create a subset of the data based on the situation and the objective. Note that “E_” columns are actual measures, while “M_” columns are the margin of error estimates.

Data Cleaning:

• Review the data for issues.
• Do not transform this data. Do not remove outliers.
• Do not delete the NA values of the data. You may exclude them for certain types of analysis. The data dictionary or code book states how NA values are annotated in the document.
• If there are any erroneous data types, address the issues.
• Look for any other issues that may require cleaning.
  o Do not automatically remove outliers, remove NA values, or replace NA values. Any of these actions will require justification.
• You may exclude cleaning from the presentation. You MUST include cleaning in your programming.

Analyze:

• Develop a plan and state that plan before extending beyond necessary cleaning.
  o Your plan should include what you intend to do in your analysis.
  o Your plan shall also include any assumptions or data preparation that must be done for a specific method of analysis.
• Conduct exploratory data analysis, as defined in your plan. This shall include the exploration of multiple different features and how they interrelate.
  o The minimum number of explorations suitable for presenting is five.
• You must include a thorough interpretation of each presented exploration. Do not describe every feature of the table or visualization; interpret critical points and trends. Ensure the investigations combined tell a story about the data. They should not be individual ideas, but concepts that tie together in some manner to bring you to a potential next stage of analysis.
• Any univariate analysis will not count toward the total of five visualizations.
• Develop a new plan and state that plan before extending beyond exploratory data analysis. Your plan shall include, at a minimum:
  o Splitting the data into training and testing sets, with 80% of the data in the training set.
  o Developing a random forest model. Explore which independent variables have the most impact on the vulnerability index. Explore the random forest model for the best model, including the number of trees (ntree) and the number of variables for splitting at each tree node (mtry).
  o Looking at the importance of the different independent variables. What does this tell you about your data and your model?
• Are there any post hoc analyses that may improve your results?

Future Recommendations:

• You must also include recommendations for future analysis.
• You will base your recommendations on your findings in the analysis you conduct.

You must generate your presentation in R Markdown. Do not forget to annotate comments in your code. You must include ALL the references you used in APA format in your presentation. If you use a source to assist in writing the programming code in your Rmd file, include that reference in APA format (no italics or indention required) in a comment in the {r} chunk(s) to which it applies.

Required files to submit: You shall submit the Rmd file of your slides and any other files your R Markdown file relies on to knit, by Saturday night at midnight. When you present on Sunday, what you present and what you submit must be identical. Do not submit the raw data file.

Tips: Do not forget to reference the source of the data and data dictionary. It is in this document in APA 7. There are 15 predictors and one outcome variate that shall be used in exploratory data analysis and the random forest model.

Paper For The Above Instruction

The objective of this analysis is to explore the CDC social vulnerability index (SVI) dataset for Illinois, Wisconsin, and Michigan, with the aim of understanding how socioeconomic, household, minority, and housing-related factors relate to the overall vulnerability measure (RPL_THEMES). The examination combines exploratory data analysis (EDA) with predictive modeling using a random forest to identify the key predictors of social vulnerability. As the assignment requires, the approach does not repeat the CDC's own calculations and instead develops an independent analytical strategy.

The dataset, obtained directly from the CDC's online repositories (CDC, 2018a; CDC, 2018b), is subset to 15 predictor variables in four categories: socioeconomic status, household composition and disability, minority status and language limitations, and housing and transportation. The raw data holds the original estimates (prefix "E_") alongside their margins of error (prefix "M_"). The first step is a careful review and cleaning pass that corrects erroneous data types while retaining all outliers and NA values, as the project instructions mandate; NA values may be excluded only for specific analyses that cannot handle them. This vetting sets the foundation for the analysis that follows.
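The subsetting and NA handling described above can be sketched as follows. This is an illustration in standard-library Python only (the deliverable itself must be written in R Markdown), and it rests on assumptions that should be checked against the code book: the column names (`ST_ABBR`, `COUNTY`, `E_POV`, and so on), and the use of -999 as the NA marker. The toy rows and their values are invented.

```python
# Assumptions to verify against the CDC code book: the column names below
# and the -999 sentinel for missing values. Toy data only.
NA_SENTINEL = -999
TARGET_STATES = {"IL", "WI", "MI"}
PREDICTORS = [
    "E_POV", "E_UNEMP", "E_PCI", "E_NOHSDP",           # socioeconomic
    "E_AGE65", "E_AGE17", "E_DISABL", "E_SNGPNT",      # household composition and disability
    "E_MINRTY", "E_LIMENG",                            # minority status and language
    "E_MUNIT", "E_MOBILE", "E_CROWD", "E_NOVEH", "E_GROUPQ",  # housing and transportation
]
OUTCOME = "RPL_THEMES"


def subset_rows(rows):
    """Keep the three target states and the analysis columns.

    Sentinel values become None so they are marked as NA but the row
    itself is retained, in line with the cleaning rules.
    """
    keep = ["ST_ABBR", "COUNTY"] + PREDICTORS + [OUTCOME]
    subset = []
    for row in rows:
        if row["ST_ABBR"] not in TARGET_STATES:
            continue
        subset.append(
            {col: (None if row[col] == NA_SENTINEL else row[col]) for col in keep}
        )
    return subset


def make_toy_row(state, county, outcome):
    """Build an invented row; M_POV stands in for an unused margin-of-error column."""
    row = {col: 100 for col in PREDICTORS}
    row.update({"ST_ABBR": state, "COUNTY": county, OUTCOME: outcome, "M_POV": 12})
    return row


toy = [
    make_toy_row("IL", "Example A", 0.71),
    make_toy_row("OH", "Example B", 0.55),          # dropped: not a target state
    make_toy_row("WI", "Example C", NA_SENTINEL),   # kept, outcome marked as NA
]
cleaned = subset_rows(toy)
print(len(cleaned), cleaned[1][OUTCOME])
```

Marking NA values instead of deleting them keeps the row count intact, so each later analysis can decide explicitly whether to exclude them.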

The exploratory data analysis will focus on visualizations and statistical interpretations of key variables, seeking patterns, correlations, and representative trends across states and counties. Univariate distributions will be used only for orientation, since they do not count toward the five required explorations; bivariate scatter plots and multivariate interaction plots will carry the presented analysis and reveal relationships and potential multicollinearity among predictors. These visualizations are expected to show how certain factors, such as high poverty levels, disability prevalence, or housing conditions, correlate with higher vulnerability scores (RPL_THEMES). The interpretation will concentrate on critical points and trends; for example, counties with high minority populations may also exhibit elevated vulnerability indices.
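As one way to quantify a bivariate relationship between a predictor and RPL_THEMES, a Pearson correlation can be computed while excluding NA pairs for that calculation only. The sketch below is standard-library Python for illustration; the function name `pearson` and the convention that NA values are stored as `None` are assumptions of this example, not part of the assignment.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences.

    Pairs where either value is None (NA) are skipped for this calculation
    only; the underlying data is left untouched. Returns None when fewer
    than two complete pairs remain or when either variable is constant.
    """
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    if n < 2:
        return None
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    if sxx == 0 or syy == 0:
        return None
    return sxy / math.sqrt(sxx * syy)


# Invented values: a poverty count and a vulnerability score, one NA pair.
poverty = [120, 340, None, 560, 800]
vulnerability = [0.10, 0.35, 0.50, 0.60, 0.90]
print(pearson(poverty, vulnerability))
```

Applying the same function to pairs of predictors gives a first look at multicollinearity before any model is fitted.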

Following the initial exploration, the analysis shifts to predictive modeling. The data will be split into training and testing sets, with 80% of the rows in the training set and 20% held out for testing. A random forest regression model will be developed to predict the vulnerability index (RPL_THEMES), and it will be tuned by comparing different values for the number of trees (ntree) and the number of variables tried at each split (mtry). Variable importance measures, such as the mean decrease in node impurity and permutation importance, will then be extracted to identify the most impactful predictors. This step provides quantitative support for judging which community features most strongly influence vulnerability.
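The modeling steps can be sketched in a language-neutral way: a reproducible 80/20 split, a grid of candidate (ntree, mtry) pairs, and permutation importance measured as the increase in mean squared error after shuffling one predictor. The code below is standard-library Python for illustration only. All function names are hypothetical, `predict` is a stand-in callable for a fitted forest, and the seed value is arbitrary. In R, the randomForest package reports a comparable permutation-based measure through `importance()` when the model is fitted with `importance = TRUE`.

```python
import random
from itertools import product


def train_test_split(rows, train_frac=0.8, seed=2020):
    """Shuffle row indices reproducibly and cut at train_frac."""
    idx = list(range(len(rows)))
    random.Random(seed).shuffle(idx)
    cut = round(train_frac * len(rows))
    return [rows[i] for i in idx[:cut]], [rows[i] for i in idx[cut:]]


def tuning_grid(ntree_values, mtry_values, n_predictors=15):
    """All (ntree, mtry) pairs whose mtry lies between 1 and n_predictors."""
    return [
        (ntree, mtry)
        for ntree, mtry in product(ntree_values, mtry_values)
        if 1 <= mtry <= n_predictors
    ]


def mse(predict, X, y):
    """Mean squared error of a prediction function over rows X and targets y."""
    return sum((predict(row) - target) ** 2 for row, target in zip(X, y)) / len(y)


def permutation_importance(predict, X, y, col, seed=2020):
    """Increase in MSE after shuffling the values of one predictor column."""
    baseline = mse(predict, X, y)
    shuffled = [row[col] for row in X]
    random.Random(seed).shuffle(shuffled)
    X_perm = [row[:col] + [value] + row[col + 1:] for row, value in zip(X, shuffled)]
    return mse(predict, X_perm, y) - baseline


# Toy demonstration: the stand-in model uses only column 0.
X = [[i, 10 - i] for i in range(10)]
y = [2 * i for i in range(10)]
predict = lambda row: 2 * row[0]

train, test = train_test_split(list(range(100)))
print(len(train), len(test))
print(tuning_grid([250, 500], [3, 5, 20]))
print(permutation_importance(predict, X, y, col=0),
      permutation_importance(predict, X, y, col=1))
```

A predictor the model never uses shows zero importance, while shuffling a predictor the model depends on raises the error; that contrast is what the importance ranking is meant to expose.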

Post hoc analyses will include examining model residuals, testing alternative hyperparameter configurations, and assessing model stability across different subsets of data. These strategies aim to fine-tune the model and enhance predictive reliability. Based on the findings, future recommendations will include expanding the analysis to additional states, integrating contextual geographic information, and applying more sophisticated machine learning techniques such as gradient boosting or neural networks.
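A residual check on the held-out set might look like the following sketch, again in standard-library Python for illustration. The helper names are hypothetical and the numbers are invented. Grouping residuals by state is one assumed way to look for systematic over- or under-prediction.

```python
import math


def residuals(actual, predicted):
    """Observed minus predicted value for each test row."""
    return [a - p for a, p in zip(actual, predicted)]


def rmse(actual, predicted):
    """Root mean squared error on the test set."""
    res = residuals(actual, predicted)
    return math.sqrt(sum(r * r for r in res) / len(res))


def mean_residual_by_group(groups, actual, predicted):
    """Average residual per group (for example, per state).

    A mean far from zero suggests the model is biased for that group.
    """
    totals = {}
    for group, res in zip(groups, residuals(actual, predicted)):
        running_sum, count = totals.get(group, (0.0, 0))
        totals[group] = (running_sum + res, count + 1)
    return {group: s / n for group, (s, n) in totals.items()}


# Invented test-set values for illustration.
states = ["IL", "IL", "WI"]
observed = [0.5, 0.7, 0.2]
fitted = [0.4, 0.6, 0.4]
print(rmse(observed, fitted))
print(mean_residual_by_group(states, observed, fitted))
```

Repeating the split with several seeds and comparing the resulting RMSE values is a simple way to assess the stability mentioned above.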

The presentation will be programmed entirely in R Markdown, with detailed code annotations to ensure transparency and reproducibility. All references used, including the CDC data sources, will be formatted in APA style and included at the end of the presentation, fulfilling academic integrity and citation standards. The final deliverable comprises the R Markdown source file, along with any supplementary files necessary for knitting, prepared for submission by the designated deadline. The raw data file itself is not submitted.