ITS 530: Quiz 2 Sample Report: Visualization with the ggplot2 Library
ITS 530: Quiz 2 sample report Visualization with ggplot2 library. Read a CSV dataset, inspect it with str(), show null values, and create ggplot2 visualizations: (1) a scatterplot with x = R_C_PCT_CLASSES_GT_50 and y = IS_RANKED and interpret the relationship between class size and university rank; (2) a second scatterplot with an encoding (color or size) and interpret that encoding; (3) three additional charts created from the same dataset with code and figures based on your dataset (do not copy sample code or figures). Submit the code and figures.
Paper For Above Instructions
Overview
This report demonstrates a reproducible workflow for reading a CSV dataset, inspecting its structure, identifying missing values, and producing five ggplot2 visualizations derived from the dataset. The dataset contains institutional-level variables including R_C_PCT_CLASSES_GT_50 (percentage of classes with >50 students), IS_RANKED (a numeric or ordinal ranking indicator), and additional demographic and categorical variables used for encoding. All code examples below are written in R using tidyverse/ggplot2 conventions (Wickham, 2016; Wickham & Grolemund, 2017).
Data import and initial inspection
Code to read the CSV and inspect structure:
# Load libraries
library(readr)
library(dplyr)
library(ggplot2)
# Read CSV (replace 'data.csv' with the actual filename)
df <- read_csv("data.csv")
# Inspect structure
str(df)
summary(df)
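str() reports the number of observations and variables along with each column's type; summary() gives per-variable summaries (quartiles for numeric columns, lengths for character columns). For example, str(df) shows whether R_C_PCT_CLASSES_GT_50 and IS_RANKED were parsed as numeric or as character; read_csv() does not create factors by default, so a character type here usually signals non-numeric entries that need cleaning (Wickham & Grolemund, 2017).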
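Before plotting, it can help to confirm that the two key columns exist and were parsed as numeric. The following is a minimal sketch, assuming a small made-up CSV written to a temporary file (the values are illustrative, not taken from the real dataset). `check_columns` is a hypothetical helper written for this example, not a readr function.

```r
# Toy CSV with made-up values, written to a temporary file for illustration
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "R_C_PCT_CLASSES_GT_50,IS_RANKED",
  "12.5,10",
  "3.0,45",
  "NA,120"
), tmp)

toy <- read.csv(tmp)

# Hypothetical helper: stop early if a required column is missing or non-numeric
check_columns <- function(data, cols) {
  missing_cols <- setdiff(cols, names(data))
  if (length(missing_cols) > 0) {
    stop("Missing columns: ", paste(missing_cols, collapse = ", "))
  }
  non_numeric <- cols[!vapply(data[cols], is.numeric, logical(1))]
  if (length(non_numeric) > 0) {
    stop("Non-numeric columns: ", paste(non_numeric, collapse = ", "))
  }
  invisible(TRUE)
}

check_columns(toy, c("R_C_PCT_CLASSES_GT_50", "IS_RANKED"))
str(toy)
```

Failing early on a wrong type is cheaper than debugging a plot whose axis was silently treated as categorical.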
Missing value check
Identify missing values and present counts per column:
# Count missing values per column
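missing_counts <- colSums(is.na(df))
# Show the columns with the most missing values first
missing_counts <- sort(missing_counts, decreasing = TRUE)
missing_counts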
Plotting a small visualization of missingness can help communicate the issue (e.g., a barplot of counts). If many missing values are concentrated in specific columns, document reasons and handle them appropriately (imputation, exclusion, or flagging) (Gelman & Hill, 2006).
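The bar plot of missing counts mentioned above could look roughly like this. It is a sketch on a toy data frame with made-up values; the object names `toy`, `missing_df`, and `p_missing` are introduced here for illustration only.

```r
library(ggplot2)

# Toy data with made-up values; NA marks a missing entry
toy <- data.frame(
  R_C_PCT_CLASSES_GT_50 = c(12.5, NA, 3.0, 8.1, NA),
  IS_RANKED = c(10, 45, NA, 120, 7),
  INSTITUTION_TYPE = c("Public", "Private", "Public", NA, "Private")
)

# One row per column: its name and its number of NA entries
missing_df <- data.frame(
  variable = names(toy),
  n_missing = unname(colSums(is.na(toy)))
)

# Horizontal bars, ordered so the column with the most NAs is on top
p_missing <- ggplot(missing_df, aes(x = reorder(variable, n_missing), y = n_missing)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  labs(x = "Column", y = "Missing values", title = "Missing values per column") +
  theme_minimal()

p_missing
```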
Scatterplot 1 – R_C_PCT_CLASSES_GT_50 vs IS_RANKED
Objective: examine whether institutions with higher percentages of large classes (>50 students) tend to have different ranks.
# Basic scatterplot
ggplot(df, aes(x = R_C_PCT_CLASSES_GT_50, y = IS_RANKED)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "loess", se = TRUE, color = "blue") +
  labs(
    x = "Pct classes > 50 students (R_C_PCT_CLASSES_GT_50)",
    y = "Institution Rank (IS_RANKED)",
    title = "Scatterplot: Class size vs Institution Rank"
  ) +
  theme_minimal()
Interpretation: If the loess curve slopes upward as R_C_PCT_CLASSES_GT_50 increases, a higher percentage of large classes corresponds to a numerically higher rank value. Whether that means a better or worse standing depends on the rank scale (for example, whether 1 denotes the top institution). In many datasets, institutions with fewer resources show higher proportions of large lectures; conversely, large research universities may also run many large introductory courses. The interpretation must therefore be tied to the dataset's rank directionality (Tufte, 2001).
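A rank correlation can back up the visual reading of the smoothed curve with a single number. This is a sketch on made-up values, assuming that rank 1 is the best rank; that assumption has to be checked against the real dataset.

```r
# Toy data with made-up values; rank 1 is ASSUMED to be the best rank
toy <- data.frame(
  R_C_PCT_CLASSES_GT_50 = c(2, 5, 8, 12, 15, 20),
  IS_RANKED = c(5, 12, 30, 28, 60, 95)
)

# Spearman's rho measures monotone association using only the ordering of values
rho <- cor(toy$R_C_PCT_CLASSES_GT_50, toy$IS_RANKED,
           method = "spearman", use = "complete.obs")
rho
```

Under the stated assumption, a positive rho means that institutions with more large classes tend to have worse (numerically larger) ranks; if the scale runs the other way, the reading flips.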
Scatterplot 2 – encoding with color or size
Objective: add a categorical or continuous encoding (for example, color by Institution Type: Public/Private, or size by total enrollment) to reveal multivariate patterns.
# Scatterplot with color encoding and size encoding
ggplot(df, aes(x = R_C_PCT_CLASSES_GT_50, y = IS_RANKED,
               color = INSTITUTION_TYPE, size = TOTAL_ENROLLMENT)) +
  geom_point(alpha = 0.7) +
  scale_size_continuous(range = c(1, 6), guide = guide_legend(title = "Enrollment")) +
  labs(
    x = "Pct classes > 50 students",
    y = "Institution Rank",
    color = "Institution Type",
    title = "Class Size vs Rank with Institution Type and Enrollment"
  ) +
  theme_minimal()
Interpretation: Color separates institution types (for example, public and private); if public institutions cluster at higher values of R_C_PCT_CLASSES_GT_50, that suggests public schools have a larger share of large classes. Size encodes enrollment, so you can see whether very large institutions (large circles) show distinct rank and class-size patterns (Wilke, 2019).
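A per-group summary is one way to check what the color and size encodings appear to show. The sketch below uses base R on made-up values; the column names follow the plotting code above, and `type_summary` is a name chosen for this example.

```r
# Toy data with made-up values
toy <- data.frame(
  INSTITUTION_TYPE = c("Public", "Public", "Public", "Private", "Private", "Private"),
  R_C_PCT_CLASSES_GT_50 = c(15, 20, 10, 3, 5, 4),
  TOTAL_ENROLLMENT = c(30000, 42000, 18000, 6000, 9000, 3000)
)

# Mean large-class percentage and mean enrollment for each institution type
type_summary <- aggregate(
  cbind(R_C_PCT_CLASSES_GT_50, TOTAL_ENROLLMENT) ~ INSTITUTION_TYPE,
  data = toy,
  FUN = mean
)
type_summary
```

If the group means differ in the same direction as the clusters in the plot, the encoding is reflecting a real pattern rather than a few conspicuous points.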
Three additional charts
Chart 3 – Histogram of R_C_PCT_CLASSES_GT_50: distribution of the percentage of large classes.
# Histogram
ggplot(df, aes(x = R_C_PCT_CLASSES_GT_50)) +
  geom_histogram(binwidth = 5, fill = "steelblue", color = "white", alpha = 0.8) +
  labs(x = "Pct classes > 50 students", y = "Count", title = "Distribution of Large Class Percentages") +
  theme_minimal()
Interpretation: Look for skewness, modality, and whether many institutions report zero or very low percentages, which affects modeling choices (Cleveland, 1993).
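Two quick numbers can complement the visual check for skewness and zero inflation. This is a sketch on a made-up vector; `skewness` is defined here as a small helper (the moment-based estimator), not taken from a package.

```r
x <- c(0, 0, 1, 2, 2, 3, 5, 8, 15, 40)  # made-up percentages

# Share of institutions reporting exactly zero large classes
share_zero <- mean(x == 0, na.rm = TRUE)

# Moment-based skewness: third central moment over the cubed standard deviation
skewness <- function(v) {
  v <- v[!is.na(v)]
  m <- mean(v)
  mean((v - m)^3) / mean((v - m)^2)^1.5
}

share_zero
skewness(x)
```

A clearly positive skewness together with a sizeable share of zeros would suggest a transformation or a separate treatment of the zeros before modeling.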
Chart 4 – Boxplot of IS_RANKED by Institution Type: compare rank distributions.
# Boxplot
ggplot(df, aes(x = INSTITUTION_TYPE, y = IS_RANKED, fill = INSTITUTION_TYPE)) +
  geom_boxplot(alpha = 0.7) +
  labs(x = "Institution Type", y = "Rank", title = "Rank Distribution by Institution Type") +
  theme_minimal() +
  theme(legend.position = "none")
Interpretation: Boxplots reveal median rank and interquartile ranges across categories, highlighting whether one type is systematically higher or lower ranked (Few, 2009).
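The medians and interquartile ranges that a boxplot draws can also be tabulated so they can be quoted in the write-up. A minimal base R sketch on made-up values:

```r
# Toy data with made-up values
toy <- data.frame(
  INSTITUTION_TYPE = c("Public", "Public", "Public", "Private", "Private", "Private"),
  IS_RANKED = c(40, 60, 80, 10, 20, 30)
)

# Median and interquartile range of rank within each institution type
rank_median <- tapply(toy$IS_RANKED, toy$INSTITUTION_TYPE, median, na.rm = TRUE)
rank_iqr <- tapply(toy$IS_RANKED, toy$INSTITUTION_TYPE, IQR, na.rm = TRUE)

data.frame(
  type = names(rank_median),
  median = as.vector(rank_median),
  iqr = as.vector(rank_iqr)
)
```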
Chart 5 – Grouped bar chart showing the joint distribution of two categorical variables (binned IS_RANKED and institution type):
# Bin rank into quintiles for a grouped bar chart
# unique() guards against duplicated quantile breaks, which would make cut() fail
df <- df %>%
  mutate(RANK_BIN = cut(IS_RANKED, breaks = unique(quantile(IS_RANKED, probs = seq(0, 1, 0.2), na.rm = TRUE)), include.lowest = TRUE))
ggplot(df, aes(x = RANK_BIN, fill = INSTITUTION_TYPE)) +
  geom_bar(position = "dodge") +
  labs(x = "Rank bin", y = "Count", title = "Counts by Rank Bin and Institution Type") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
Interpretation: This plot shows whether institution types concentrate in particular rank bins, which can inform downstream modeling or stratified analysis (Kelleher & Wagener, 2011).
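The counts behind a bar chart of this kind can be checked with a cross-tabulation. The sketch below uses made-up values and two fixed bins chosen purely for illustration (the chart code bins by quantiles instead).

```r
# Toy data with made-up values
toy <- data.frame(
  IS_RANKED = c(5, 12, 30, 28, 60, 95, 40, 75),
  INSTITUTION_TYPE = c("Private", "Private", "Public", "Private",
                       "Public", "Public", "Private", "Public")
)

# Two fixed bins, (0, 50] and (50, 100], chosen for illustration only
toy$RANK_BIN <- cut(toy$IS_RANKED, breaks = c(0, 50, 100),
                    labels = c("1-50", "51-100"))

# Counts per rank bin (rows) and institution type (columns)
counts <- table(toy$RANK_BIN, toy$INSTITUTION_TYPE)
counts

# Share of each institution type within each rank bin (each row sums to 1)
round(prop.table(counts, margin = 1), 2)
```

Row proportions make bins of different sizes comparable, which raw bar heights do not.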
Reproducibility and good practices
All plots include descriptive axis labels, a title, restrained use of color, and alpha transparency to reduce overplotting. Call ggsave() right after creating each plot, since by default it saves the most recently displayed plot; this produces the figure files to submit alongside the code (Wickham, 2016):
ggsave("scatter_class_vs_rank.png", width = 8, height = 5, dpi = 300)
Document any data cleaning steps (e.g., removing rows with NA in key variables or imputing) and the exact CSV used. Provide script files and figure images as the submission package. Avoid copying sample figures; instead, adapt code to your dataset variables and justify each visualization choice based on the data patterns (Tufte, 2001; Wilke, 2019).
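When several figures have to be submitted, storing each plot in a named list and passing it to ggsave() explicitly avoids relying on whichever plot was displayed last. This is a sketch with made-up data; the list names and the temporary output folder are placeholders.

```r
library(ggplot2)

toy <- data.frame(x = c(1, 2, 3, 4), y = c(2, 4, 3, 5))  # made-up values

# Keep every figure in a named list; the names become the file names
plots <- list(
  scatter = ggplot(toy, aes(x, y)) + geom_point(),
  histogram = ggplot(toy, aes(x)) + geom_histogram(bins = 4)
)

out_dir <- tempdir()  # placeholder: replace with your submission folder
for (name in names(plots)) {
  ggsave(
    filename = file.path(out_dir, paste0(name, ".png")),
    plot = plots[[name]],
    width = 8, height = 5, dpi = 300
  )
}
```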
Summary
This report demonstrated a sequence from data import and missing-value assessment to five ggplot2 visualizations: a primary scatterplot of R_C_PCT_CLASSES_GT_50 vs IS_RANKED, a second scatterplot with color/size encodings, and three additional plots (histogram, boxplot, and binned bar chart). Interpretations emphasize context: rank scale direction, institutional type, and enrollment-driven patterns. All code is reproducible in R and examples follow visualization best practices (Wickham, 2016; Wilke, 2019).
References
- Wickham, H. (2016). ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York.
- Wickham, H., & Grolemund, G. (2017). R for Data Science. O’Reilly Media.
- Wilke, C. O. (2019). Fundamentals of Data Visualization. O’Reilly Media.
- Tufte, E. R. (2001). The Visual Display of Quantitative Information (2nd ed.). Graphics Press.
- Cleveland, W. S. (1993). Visualizing Data. Hobart Press.
- Few, S. (2009). Now You See It: Simple Visualization Techniques for Quantitative Analysis. Analytics Press.
- Kelleher, C., & Wagener, T. (2011). Ten Guidelines for Effective Data Visualization in Scientific Publications. Environmental Modelling & Software.
- R Core Team (2024). R: A language and environment for statistical computing. R Foundation for Statistical Computing.
- RStudio (2024). ggplot2 Cheat Sheet. RStudio. Available at: https://rstudio.com/resources/cheatsheets/
- Heer, J., Bostock, M., & Ogievetsky, V. (2010). A Tour through the Visualization Zoo. Communications of the ACM, 53(6), 59–67.
- Gelman, A., & Hill, J. (2006). Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press.