Download The R Studio Tidyverse And Associated Packages

Use the R package tidyverse (see chapter 4 of Introduction to Data Science: Data Analysis and Prediction Algorithms with R by Rafael A. Irizarry). You need to go through chapter 4 before attempting the following questions. Using dplyr functions (i.e., filter, mutate, select, summarise, group_by, etc.) and the "murders" dataset (available in the dslabs R package), write appropriate R syntax to answer the following:

a. Calculate the regional total murders excluding OH, AL, and AZ (Hint: filter(!abb %in% x), where x is the exclusion vector)
b. Display the regional population and regional murder numbers.
c. How many states are there in each region? (Hint: n())
d. What is Ohio's murder rank in the North Central region? (Hint: use rank(), row_number())
e. How many states have a murder number greater than their regional average? (Hint: nrow())
f. Display the 2 least populated states in each region (Hint: slice_min())

Use the pipe operator %>% for all the queries. Show all the output results.

Introduction

The R programming language, and in particular the tidyverse collection of packages, offers an efficient way to analyze and interpret tabular datasets. The 'murders' dataset from the dslabs R package records gun murder counts for US states in 2010, together with each state's population and region. This paper demonstrates how to use dplyr functions from the tidyverse to perform specific data analysis tasks, including filtering, summarizing, ranking, and selecting data based on regional and state-level murder statistics.

Data Context and Preparation

The 'murders' dataset contains five variables: state name (state), state abbreviation (abb), regional classification (region), population (population), and murder count (total). Before running the analysis, the tidyverse package (or at least dplyr) must be installed and loaded, along with the dslabs package that provides the 'murders' dataset. The analysis follows the data manipulation techniques covered in chapter 4 of Irizarry's introductory text.
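The setup described above can be sketched as follows. This is a minimal example, assuming both packages are already installed from CRAN and that the dataset is the `murders` data frame shipped with `dslabs`, with the murder count stored in the `total` column:

```r
# Assumes install.packages(c("dplyr", "dslabs")) has already been run
library(dplyr)
library(dslabs)

# Load the dataset into the workspace
data(murders)

# Inspect column names and types before writing any queries
glimpse(murders)

# List the region labels that later filters must match exactly
levels(murders$region)
```

Checking the region labels first helps avoid filters that silently return zero rows because of a misspelled level.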

Analysis Tasks and R Syntax

The following are detailed steps with corresponding R syntax to fulfill the specified questions:

a. Calculating regional total murders excluding certain states

Using the filter() function, exclude Ohio (OH), Alabama (AL), and Arizona (AZ) by filtering out their abbreviations:

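library(dplyr)
library(dslabs)
data(murders)

# States to exclude from the regional totals
exclusion_states <- c("OH", "AL", "AZ")

# Drop the excluded states, then sum murder counts within each region
total_murders_excluding <- murders %>%
  filter(!abb %in% exclusion_states) %>%
  group_by(region) %>%
  summarise(total_murder = sum(total))

print(total_murders_excluding)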

b. Displaying regional population and murder numbers

Summarize the total population and total murders for each region:

# One row per region with its total population and total murder count
regional_stats <- murders %>%
  group_by(region) %>%
  summarise(total_population = sum(population),
            total_murder = sum(total))

print(regional_stats)

c. Counting states per region

Use n() inside summarise() after group_by() to count the rows (states) in each region:

# n() returns the number of rows in the current group
states_per_region <- murders %>%
  group_by(region) %>%
  summarise(state_count = n())

print(states_per_region)
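As an aside, dplyr's count() combines the grouping and counting steps. The sketch below assumes the `murders` data frame from `dslabs`; note that the dataset includes the District of Columbia as a row, so the counts cover 51 entries rather than 50:

```r
library(dplyr)
library(dslabs)
data(murders)

# count() is shorthand for group_by(region) %>% summarise(n = n())
states_per_region_alt <- murders %>%
  count(region, name = "state_count")

print(states_per_region_alt)
```

The assignment hint asks for n(), so the group_by() and summarise() form remains the expected answer; count() is only a convenient cross-check.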

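d. Ohio's murder rank in the North Central region

First filter for the North Central region (the label used in the dataset), then rank the states by murder count and extract Ohio's row:

# Keep only the states in the North Central region
nc_region <- murders %>%
  filter(region == "North Central")

# Rank 1 is the state with the most murders
nc_region <- nc_region %>%
  mutate(rank = row_number(desc(total)))

ohio_rank <- nc_region %>%
  filter(abb == "OH") %>%
  select(state, total, rank)

print(ohio_rank)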
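The hint mentions both rank() and row_number(), which differ only in how they treat ties. The following sketch, assuming the `murders` data frame from `dslabs` (where the region label is "North Central" and the count column is `total`), computes the ranking three ways so the results can be compared side by side:

```r
library(dplyr)
library(dslabs)
data(murders)

nc_ranked <- murders %>%
  filter(region == "North Central") %>%
  mutate(
    # row_number() breaks ties by order of appearance
    rank_rownum = row_number(desc(total)),
    # min_rank() gives tied states the same (lowest) rank
    rank_min = min_rank(desc(total)),
    # Base R equivalent of min_rank() on the negated counts
    rank_base = rank(-total, ties.method = "min")
  ) %>%
  arrange(rank_rownum)

nc_ranked %>%
  filter(abb == "OH") %>%
  select(state, total, rank_rownum, rank_min, rank_base)
```

If no two states in the region share the same murder count, all three columns agree; they diverge only when ties are present.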

e. Counting states with murder numbers exceeding regional averages

Calculate regional averages and compare each state's murders:

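# Average murder count per region
average_murder <- murders %>%
  group_by(region) %>%
  summarise(regional_avg = mean(total))

# Attach each state's regional average and keep states above it
murder_with_avg <- murders %>%
  left_join(average_murder, by = "region") %>%
  filter(total > regional_avg)

number_states_above_avg <- nrow(murder_with_avg)

print(number_states_above_avg)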
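The same count can be obtained without a join by computing the regional average with a grouped mutate(). This is an alternative sketch rather than the required form, again assuming the `murders` data frame from `dslabs` with the count in `total`:

```r
library(dplyr)
library(dslabs)
data(murders)

# Grouped mutate() adds the regional mean to every row of that region
above_avg <- murders %>%
  group_by(region) %>%
  mutate(regional_avg = mean(total)) %>%
  ungroup() %>%
  filter(total > regional_avg)

# Number of states whose count exceeds their regional average
nrow(above_avg)
```

Calling ungroup() before the final filter keeps the result an ordinary data frame, which avoids surprises if more steps are chained afterwards.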

f. Displaying two least populated states in each region

Use slice_min() for each region to select two states with minimum populations:

# slice_min() operates within each group, so it returns the 2 smallest per region
least_populated_states <- murders %>%
  group_by(region) %>%
  slice_min(order_by = population, n = 2)

print(least_populated_states)
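One detail worth knowing: slice_min() keeps ties by default, so a group can return more than n rows when populations are equal. The sketch below, assuming dplyr 1.0.0 or later and the `murders` data frame from `dslabs`, sets with_ties = FALSE to guarantee exactly two rows per region:

```r
library(dplyr)
library(dslabs)
data(murders)

# with_ties = FALSE caps the result at exactly n rows per group
least_populated_strict <- murders %>%
  group_by(region) %>%
  slice_min(order_by = population, n = 2, with_ties = FALSE) %>%
  ungroup() %>%
  select(region, state, population)

print(least_populated_strict)
```

Tied populations are unlikely for state-level data, so both forms should normally give the same answer; the explicit argument simply makes the intent clear.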

Conclusion

This approach uses dplyr functions from the tidyverse to answer each question about the 'murders' dataset with a short, readable pipeline. Filtering, grouping, summarizing, ranking, and selecting rows each map onto a single verb, which makes regional and state-level analysis of US murder statistics straightforward.

References

  1. Irizarry, R. A. (2019). Introduction to Data Science: Data Analysis and Prediction Algorithms with R. CRC Press.
  2. Wickham, H., & Grolemund, G. (2016). R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. O'Reilly Media.
  3. Wickham, H. (2014). Tidy data. Journal of Statistical Software, 59(10), 1-23.
  4. Wickham, H., François, R., Henry, L., & Müller, K. (2022). dplyr: A Grammar of Data Manipulation. R package version 1.0.10.
  5. R Core Team. (2023). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing. https://www.r-project.org/.
  6. Becker, R. A., Wilks, A. R., Brownrigg, R., & Minka, T. (2018). maps: Draw Geographical Maps. R package version 3.3.0.
  7. RStudio Team. (2023). RStudio: Integrated Development Environment for R.
  8. Chambers, J. M. (2008). Software for Data Analysis: Programming with R. Springer.
  9. Peng, R. D. (2016). R Programming for Data Science. www.rprogramming.net.