Download The R Studio Tidyverse And Associated Packages
Use the R package tidyverse (see chapter 4 of Introduction to Data Science: Data Analysis and Prediction Algorithms with R by Rafael A. Irizarry). Work through chapter 4 before attempting the following questions. Using dplyr functions (i.e., filter, mutate, select, summarise, group_by, etc.) and the "murders" dataset (available in the dslabs R package), write appropriate R syntax to answer the following:
a. Calculate the regional total murders excluding OH, AL, and AZ (Hint: filter(!abb %in% x), where x is the exclusion vector)
b. Display the regional population and regional murder numbers.
c. How many states are there in each region? (Hint: n())
d. What is Ohio's murder rank in the North Central region? (Hint: use rank(), row_number())
e. How many states have a murder number greater than their regional average? (Hint: nrow())
f. Display the 2 least populated states in each region (Hint: slice_min())
Use the pipe operator %>% for all the queries. Show all the output results.
Paper for the Above Instructions
Introduction
The R programming language, and the tidyverse collection of packages in particular, offers an efficient way to analyze and summarize tabular datasets. The 'murders' dataset from the dslabs R package records murder counts and populations for US states, each assigned to a region. This paper demonstrates how to use dplyr functions within the tidyverse to perform specific data analysis tasks, including filtering, summarizing, ranking, and selecting data based on regional and state-level murder statistics.
Data Context and Preparation
The 'murders' dataset contains the state name (state), state abbreviation (abb), regional classification (region), population (population), and murder count (total). Before running the analysis, the tidyverse package (or at least dplyr) must be installed and loaded, along with the dslabs package that provides the 'murders' dataset. The analysis uses the dataset as shipped, with the data manipulation techniques covered in chapter 4 of Irizarry's introductory text.
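As a starting point, the setup described above might look like the following sketch. It assumes dplyr and dslabs are already installed; the dataset in dslabs is named murders and its murder-count column is named total, which is why the later queries sum total rather than a column called murder.

```r
# Assumes the packages are already installed, e.g. via
# install.packages(c("tidyverse", "dslabs"))
library(dplyr)
library(dslabs)

data(murders)      # load the dataset into the workspace
glimpse(murders)   # column names and types
head(murders, 3)   # first few rows
```

Inspecting the column names first helps avoid errors such as referring to a column that does not exist.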
Analysis Tasks and R Syntax
The following are detailed steps with corresponding R syntax to fulfill the specified questions:
a. Calculating regional total murders excluding certain states
Using the filter() function, exclude Ohio (OH), Alabama (AL), and Arizona (AZ) by filtering out their abbreviations:
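library(dplyr)
library(dslabs)
data(murders)
# States to exclude, identified by abbreviation
exclusion_states <- c("OH", "AL", "AZ")
# Drop the excluded states, then sum murders (column: total) by region
total_murders_excluding <- murders %>%
  filter(!abb %in% exclusion_states) %>%
  group_by(region) %>%
  summarise(total_murder = sum(total))
print(total_murders_excluding)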
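The exclusion filter relies on operator precedence: in R, %in% binds more tightly than !, so !abb %in% x is read as !(abb %in% x). A small base R sketch with made-up vectors (not taken from the dataset) illustrates this:

```r
# Hypothetical abbreviations, for illustration only
abb <- c("OH", "NY", "AL", "TX")
x   <- c("OH", "AL", "AZ")

# Evaluated as !(abb %in% x): TRUE for entries NOT in the exclusion vector
keep <- !abb %in% x
kept_states <- abb[keep]
print(kept_states)
```

Only the abbreviations absent from x remain, which is the behaviour filter() depends on in the query above.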
b. Displaying regional population and murder numbers
Summarize the total population and total murders for each region:
# Total population and total murders (column: total) per region
regional_stats <- murders %>%
  group_by(region) %>%
  summarise(total_population = sum(population),
            total_murder = sum(total))
print(regional_stats)
c. Counting states per region
Group the data with group_by(), then call n() inside summarise() to count the rows (states) in each region:
# One row per state, so n() gives the number of states per region
states_per_region <- murders %>%
  group_by(region) %>%
  summarise(state_count = n())
print(states_per_region)
d. Ohio's murder rank in the North Central region
First filter for the North Central region (the dataset labels it "North Central"), then rank the states by murder count in descending order and look up Ohio:
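# Keep only the North Central states
nc_region <- murders %>%
  filter(region == "North Central")
# Rank 1 = highest murder count within the region
nc_region <- nc_region %>%
  mutate(rank = row_number(desc(total)))
ohio_rank <- nc_region %>%
  filter(abb == "OH") %>%
  select(state, total, rank)
print(ohio_rank)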
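The hint mentions both rank() and row_number(). They agree unless two states share the same count, in which case they treat the tie differently. The sketch below uses a made-up vector (not values from the dataset) to show the difference, along with dplyr's min_rank():

```r
library(dplyr)

# Hypothetical counts with a tie between the first and third entries
x <- c(30, 10, 30, 20)

r_row <- row_number(desc(x))  # ties broken by position: 1 4 2 3
r_min <- min_rank(desc(x))    # ties share the lowest rank: 1 4 1 3
r_avg <- rank(-x)             # base R default averages ties: 1.5 4.0 1.5 3.0

print(r_row)
print(r_min)
print(r_avg)
```

Which variant is appropriate depends on how ties should be reported; row_number() always yields distinct ranks.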
e. Counting states with murder numbers exceeding regional averages
Calculate regional averages and compare each state's murders:
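# Average murder count per region
average_murder <- murders %>%
  group_by(region) %>%
  summarise(regional_avg = mean(total))
# Attach each state's regional average and keep states above it
murder_with_avg <- murders %>%
  left_join(average_murder, by = "region") %>%
  filter(total > regional_avg)
number_states_above_avg <- nrow(murder_with_avg)
print(number_states_above_avg)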
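The same count can also be obtained without a join, by filtering within groups so that mean() is evaluated per region. A minimal sketch on a hypothetical toy table follows; the state names and counts are invented for illustration, and only the column names mirror the real dataset.

```r
library(dplyr)

# Hypothetical toy data, not the real murders dataset
toy <- tibble(
  state  = c("A", "B", "C", "D", "E"),
  region = c("East", "East", "East", "West", "West"),
  total  = c(10, 20, 60, 5, 15)
)

# Inside a grouped filter(), mean(total) is the regional mean
above_avg <- toy %>%
  group_by(region) %>%
  filter(total > mean(total)) %>%
  ungroup()

n_above <- nrow(above_avg)
print(n_above)
```

Here the East mean is 30 and the West mean is 10, so states C and E are retained. Applied to the real data, the toy table would simply be replaced by murders.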
f. Displaying two least populated states in each region
Group by region and use slice_min() to select the two states with the smallest populations in each region:
# Two smallest populations within each region
least_populated_states <- murders %>%
  group_by(region) %>%
  slice_min(order_by = population, n = 2)
print(least_populated_states)
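One detail worth knowing is that slice_min() keeps ties by default (with_ties = TRUE), so it can return more than n rows per group when values are tied at the cutoff. The sketch below uses a hypothetical toy table with invented populations to show the effect:

```r
library(dplyr)

# Hypothetical toy data with a tie at the second-smallest population
toy <- tibble(
  state      = c("A", "B", "C", "D"),
  region     = c("East", "East", "East", "East"),
  population = c(100, 200, 200, 500)
)

# Default behaviour: tied rows at the cutoff are all kept
with_ties_kept <- toy %>%
  group_by(region) %>%
  slice_min(order_by = population, n = 2)

# with_ties = FALSE returns exactly n rows per group
exactly_two <- toy %>%
  group_by(region) %>%
  slice_min(order_by = population, n = 2, with_ties = FALSE)

print(nrow(with_ties_kept))
print(nrow(exactly_two))
```

Whether this matters for the real data depends on whether any populations are tied; setting with_ties = FALSE guarantees exactly two states per region.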
Conclusion
This approach uses dplyr functions within the tidyverse to answer each question about the 'murders' dataset. Filtering, grouping, summarizing, ranking, and selecting rows each map onto a single dplyr verb chained with the pipe operator, which keeps the regional and state-level analysis of murder statistics short and readable.
References
- Irizarry, R. A. (2019). Introduction to Data Science: Data Analysis and Prediction Algorithms with R. CRC Press.
- Wickham, H., & Grolemund, G. (2016). R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. O'Reilly Media.
- Wickham, H. (2014). Tidy data. Journal of Statistical Software, 59(10), 1-23.
- Wickham, H., François, R., Henry, L., & Müller, K. (2022). dplyr: A Grammar of Data Manipulation. R package version 1.0.10.
- R Core Team. (2023). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing.
- Becker, R. A., Wilks, A. R., Brownrigg, R., & Minka, T. (2018). maps: Draw Geographical Maps. R package version 3.3.0.
- RStudio Team. (2023). RStudio: Integrated Development Environment for R.
- Chambers, J. M. (2008). Software for Data Analysis: Programming with R. Springer.
- Peng, R. D. (2016). R Programming for Data Science. www.rprogramming.net.