Semester 2, 2020 BUS105 Computing Assignment

Title: Semester 2, 2020 BUS105 computing assignment
Name:
Student number:
Sample: I am using the sample that is allocated to me based on my student number.

Overview: Use the provided materials to answer 10 questions based on three datasets and an automatic dataset summarizer. Materials: an Excel file with datasets (each student uses their allocated sample to extract three datasets), an automatic dataset summarizer, and instructions for checking that you have properly found your sample. You must use your allocated sample.

Datasets:

Dataset 1: Survey of students in a statistics course. Questions: (1) Do you think the course is useful and do you understand why? (2) How many videos have you watched?

Dataset 2: Survey of students in a statistics course. Questions: (1) What style of YouTube video do you prefer, chatty or direct? (2) Are you scared of maths? (3) How many videos did you watch?

Dataset 3: Business uses videos to replace meetings. Variables: duration (seconds) and engagement score (lower if people only watch the first part).

Questions:

1) Paste dataset 1 into the dataset summarizer.

a) Paste descriptive statistics into your report.

b) Using the output, describe the relationship between "course useful?" and "number of videos watched?" using one of: difference between sample means, difference between sample proportions, or correlation coefficient r.

2) Paste dataset 2 into the dataset summarizer.

a) Paste descriptive statistics into your report.

b) Using the output, describe the relationship between "Preferred style?" and "Scared of maths?" using one of: difference between sample means, difference between sample proportions, or correlation coefficient r.

3) Paste dataset 3 into the dataset summarizer.

a) Paste descriptive statistics and the scatterplot into your report.

b) Describe the relationship between "Duration?" and "Engagement score?" using one of: difference between sample means, difference between sample proportions, or correlation coefficient r.

c) Predict the engagement score of a video with duration 600.

4) Using the output for question 1a:

a) For people who do not find the course useful, find the z-score of the sample mean assuming population mean µ=5 and population standard deviation σ=3.

b) For people who do find the course useful, find the z-score of the sample mean assuming µ=5 and σ=3.

5) a) For people who prefer the chatty style, find a 90% confidence interval for the proportion who are scared of maths.

b) For people who prefer the direct style, find a 90% confidence interval for the proportion who are scared of maths.

6) Paste dataset 1 into the dataset summarizer.

a) Paste inferential statistics that measure evidence for a relationship between "course useful?" and "number of videos watched?" for the population.

b) Comment on the output.

c) Using the dataset summarizer output, replace a blank with a number that would make the p-value lower than in part (a).

7) Paste dataset 2 into the dataset summarizer.

a) Paste inferential statistics that measure evidence for a relationship between "Preferred style?" and "Scared of maths?" for the population.

b) Comment on the output.

c) Replace blanks in provided output with numbers that give a smaller p-value than in part (a), maintaining totals.

8) Paste dataset 3 into the dataset summarizer.

a) Paste inferential statistics that measure evidence for a relationship between "Duration?" and "Engagement score?" for the population.

b) Comment on the output.

c) If another sample had a higher correlation, would you expect the p-value to be lower or higher?

9) Briefly discuss a provided sample report (download link). In about 300 words, discuss the dataset, analysis, main message, and communication. Do not cut and paste; paraphrase.

10) Comment (about 300 words) on a provided discussion of p-values. For each case, discuss the relationship between p-value and the percentile distribution of p-values: first case (large difference in population means), and second case (almost no difference).

Paper For Above Instructions

Introduction and approach

This paper outlines the steps and typical analytical results required by the assignment for the three supplied datasets, explains how to compute and interpret descriptive and inferential statistics, and provides worked examples with plausible numeric values to illustrate z-scores, confidence intervals, correlation/regression, and p-value interpretation. All statistical interpretations follow common best practices (Wasserstein & Lazar, 2016; Cumming, 2014).

Question 1 — Dataset 1 descriptive and relationship

After pasting dataset 1 into the summarizer, extract the group summaries for "course useful?" (Yes/No) and the numeric summaries of "number of videos watched". For example, suppose the summarizer reports mean_videos_yes = 5.6 (n = 60) and mean_videos_no = 4.2 (n = 40). The difference between sample means = 5.6 − 4.2 = 1.4, so in this sample students who find the course useful watched on average 1.4 more videos. If using correlation instead, compute r between the binary-coded useful variable (1 = yes, 0 = no) and videos; r ≈ 0.28 in this hypothetical example, indicating a weak-to-moderate positive association (Bland & Altman, 1995).
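
As a minimal sketch of the two candidate summaries, the Python below computes the difference between sample means and Pearson's r for a binary-coded grouping variable. The eight data rows are invented purely for illustration and are not taken from any allocated sample; the function name pearson_r is likewise hypothetical.

```python
from math import sqrt
from statistics import mean

def pearson_r(x, y):
    """Pearson correlation coefficient computed from its definition."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Invented illustration data: useful (1 = yes, 0 = no) and videos watched.
useful = [1, 1, 1, 1, 1, 0, 0, 0]
videos = [7, 5, 6, 4, 6, 3, 5, 4]

mean_yes = mean([v for u, v in zip(useful, videos) if u == 1])
mean_no = mean([v for u, v in zip(useful, videos) if u == 0])
diff_means = mean_yes - mean_no
r = pearson_r(useful, videos)

print(f"difference between sample means = {diff_means:.2f}")
print(f"correlation r = {r:.2f}")
```

With a yes/no variable coded 1/0, r always has the same sign as the difference between the group means, so either summary describes the same direction of relationship.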

Question 2 — Dataset 2 descriptive and relationship

From the summarizer, obtain the proportion "scared of maths" within each preferred style. Example: chatty: 18/50 = 0.36; direct: 8/40 = 0.20. The difference between sample proportions = 0.36 − 0.20 = 0.16 (16 percentage points), so in this sample a larger share of students who prefer chatty videos report being scared of maths. Interpret the effect size alongside the sample sizes and confidence intervals (Agresti, 2018).

Question 3 — Dataset 3 descriptive, scatterplot and prediction

Use the summarizer scatterplot and sample statistics. Example hypothetical values: mean_duration = 400 s (sd_x = 200), mean_engagement = 50 (sd_y = 20), correlation r = 0.45. Slope b = r × (sd_y/sd_x) = 0.45 × (20/200) = 0.045. Intercept a = mean_y − b × mean_x = 50 − 0.045 × 400 = 32. Predicted engagement for duration 600: y_hat = a + b × 600 = 32 + 0.045 × 600 = 59. So a 600-second video is predicted to get an engagement score of about 59 (see Moore & McCabe, 2017).
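
The slope, intercept and prediction steps can be sketched in Python as follows. This is an illustration under the hypothetical summary values used in this section, and the function name fit_line_from_summary is an assumption rather than anything the dataset summarizer provides.

```python
def fit_line_from_summary(mean_x, sd_x, mean_y, sd_y, r):
    """Least-squares slope and intercept recovered from summary statistics."""
    slope = r * sd_y / sd_x
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical summary values: duration in seconds (x), engagement score (y).
slope, intercept = fit_line_from_summary(
    mean_x=400, sd_x=200, mean_y=50, sd_y=20, r=0.45
)
predicted = intercept + slope * 600

print(f"slope = {slope:.3f}, intercept = {intercept:.1f}")
print(f"predicted engagement at 600 s = {predicted:.1f}")
```

Substituting the means, standard deviations and r from your own summarizer output gives the prediction for your allocated sample.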

Question 4 — Z-scores for sample means

Z = (x̄ - µ)/(σ/√n). Using the dataset 1 subgroup summaries above:

a) Not useful: x̄ = 4.2, n = 40, µ = 5, σ = 3 → SE = 3/√40 ≈ 0.474, z = (4.2−5)/0.474 ≈ −1.69.

b) Useful: x̄ = 5.6, n = 60, SE = 3/√60 ≈ 0.387, z = (5.6−5)/0.387 ≈ 1.55.

These z-scores indicate the sample means are about 1.5–1.7 standard errors away from the hypothesized population mean (Moore & McCabe, 2017).
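
A short Python sketch of the same calculation, using the hypothetical subgroup means and sample sizes from this section (the function name z_for_sample_mean is illustrative):

```python
from math import sqrt

def z_for_sample_mean(xbar, mu, sigma, n):
    """z-score of a sample mean when the population SD is known."""
    standard_error = sigma / sqrt(n)
    return (xbar - mu) / standard_error

# Hypothetical subgroup summaries; mu = 5 and sigma = 3 come from the question.
z_not_useful = z_for_sample_mean(xbar=4.2, mu=5, sigma=3, n=40)
z_useful = z_for_sample_mean(xbar=5.6, mu=5, sigma=3, n=60)

print(f"z (not useful) = {z_not_useful:.2f}")
print(f"z (useful)     = {z_useful:.2f}")
```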

Question 5 — 90% confidence intervals for proportions

90% CI uses z* = 1.645. Example:

a) Chatty: p̂ = 0.36, n = 50 → SE = √(0.36 × 0.64/50) ≈ 0.0679 → CI = 0.36 ± 1.645 × 0.0679 → (0.25, 0.47).

b) Direct: p̂ = 0.20, n = 40 → SE = √(0.20 × 0.80/40) ≈ 0.0632 → CI = 0.20 ± 1.645 × 0.0632 → (0.10, 0.30).

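Interpretation: each interval quantifies the sampling uncertainty in one group's proportion. The two intervals overlap, so these data are compatible with only a small difference between the groups, but overlap alone is not a formal test of that difference (Agresti, 2018).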
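
The interval calculation can be sketched in Python for the hypothetical counts used above (18 of 50 and 8 of 40). This is the normal-approximation (Wald) interval; the function name proportion_ci is an assumption, and z* is obtained from the standard library rather than a table.

```python
from math import sqrt
from statistics import NormalDist

def proportion_ci(successes, n, confidence=0.90):
    """Normal-approximation (Wald) confidence interval for a proportion."""
    p_hat = successes / n
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z_star * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin

# Hypothetical counts of students who are scared of maths in each group.
chatty_ci = proportion_ci(successes=18, n=50)
direct_ci = proportion_ci(successes=8, n=40)

print(f"chatty: ({chatty_ci[0]:.2f}, {chatty_ci[1]:.2f})")
print(f"direct: ({direct_ci[0]:.2f}, {direct_ci[1]:.2f})")
```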

Questions 6–8 — Inferential statistics and p-value mechanics

Q6 (dataset 1): Use a two-sample t-test (comparing mean videos watched) or a regression/correlation test for the population. If the t-test yields p = 0.03, that indicates moderate evidence against the null hypothesis of no difference. To lower the p-value in part (c), replace the blank with a number that widens the difference between the group means, increases a sample size, or reduces a group standard deviation; each of these changes increases the magnitude of the test statistic, all else equal (Cohen, 1992).
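
The direction of these effects can be checked with a rough Python sketch. It uses a normal approximation to the two-sample test so that it runs with the standard library alone; the summarizer may well report a t-test, whose p-values will differ somewhat. The group standard deviations of 3 are assumed purely for illustration, as is the function name two_sample_p_value.

```python
from math import sqrt
from statistics import NormalDist

def two_sample_p_value(mean1, sd1, n1, mean2, sd2, n2):
    """Two-sided p-value from a normal approximation to the two-sample test."""
    standard_error = sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
    z = (mean1 - mean2) / standard_error
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Baseline: hypothetical means and sample sizes, assumed SD of 3 in each group.
p_base = two_sample_p_value(5.6, 3, 60, 4.2, 3, 40)
# Change one ingredient at a time and compare with the baseline.
p_bigger_gap = two_sample_p_value(6.0, 3, 60, 4.2, 3, 40)
p_bigger_n = two_sample_p_value(5.6, 3, 120, 4.2, 3, 80)
p_smaller_sd = two_sample_p_value(5.6, 2, 60, 4.2, 2, 40)

for label, p in [("baseline", p_base), ("bigger gap", p_bigger_gap),
                 ("bigger n", p_bigger_n), ("smaller sd", p_smaller_sd)]:
    print(f"{label:>10}: p = {p:.4f}")
```

Each of the three modified scenarios gives a smaller p-value than the baseline, which is the pattern part (c) asks you to produce.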

Q7 (dataset 2): Use a chi-square or Fisher's exact test to evaluate association between style and math anxiety. A small p-value (e.g., p = 0.02) suggests evidence of association. To obtain a smaller p-value in the hypothetical cell-filling task, change cell counts to accentuate imbalance while keeping row/column totals constant (Field, 2013).
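
As a hedged illustration of the cell-filling idea, the Python below computes the Pearson chi-square statistic (without continuity correction) for a 2×2 table and compares two tables that share the same row and column totals. The counts are the hypothetical ones used earlier plus one invented alternative; the summarizer may use a different test or a correction, so its p-values need not match.

```python
from math import sqrt
from statistics import NormalDist

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic and p-value for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    observed = ((a, b), (c, d))
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed[i][j] - expected) ** 2 / expected
    # With 1 degree of freedom, chi2 is the square of a standard normal,
    # so the upper-tail probability follows from the normal CDF.
    p_value = 2 * (1 - NormalDist().cdf(sqrt(chi2)))
    return chi2, p_value

# Rows: chatty, direct. Columns: scared, not scared. Totals are 50/40 and 26/64.
chi2_base, p_base = chi_square_2x2(18, 32, 8, 32)
# Same totals, but the counts are moved to make the groups more different.
chi2_more, p_more = chi_square_2x2(20, 30, 6, 34)

print(f"base table:       chi2 = {chi2_base:.2f}, p = {p_base:.3f}")
print(f"more imbalanced:  chi2 = {chi2_more:.2f}, p = {p_more:.3f}")
```

Moving counts so that the two rows differ more, while every total stays fixed, increases the chi-square statistic and therefore lowers the p-value.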

Q8 (dataset 3): Use a Pearson correlation test. A larger observed correlation in another sample of the same size would give a lower p-value (stronger observed association → stronger evidence against no relationship); conversely, a weaker r gives a higher p-value. This follows from the t-statistic for correlation, t = r√((n−2)/(1−r²)) with n−2 degrees of freedom, whose magnitude grows as |r| grows (Bland & Altman, 1995).
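
A small Python sketch makes the monotonic relationship visible. The sample size n = 30 and the three r values are hypothetical, and the function name correlation_t_statistic is illustrative.

```python
from math import sqrt

def correlation_t_statistic(r, n):
    """t-statistic for testing zero correlation, with n - 2 degrees of freedom."""
    return r * sqrt((n - 2) / (1 - r ** 2))

n = 30  # hypothetical sample size, held fixed across the comparison
t_values = {r: correlation_t_statistic(r, n) for r in (0.30, 0.45, 0.60)}

for r, t in t_values.items():
    print(f"r = {r:.2f} -> t = {t:.2f}")
```

For a fixed n, a larger t lies further into the tail of the same t distribution, so the two-sided p-value is smaller.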

Question 9 — Sample report critique (approach)

When reviewing the provided sample report, focus on dataset description (sampling frame, variables, sample size), methods (summary stats, graphs, inferential tests), and conclusions (are they supported by results?). A good report transparently presents methods, displays appropriate visualizations, quantifies uncertainty (CIs, p-values), and communicates limitations (Cumming, 2014; Gelman & Stern, 2006). Evaluate whether the main message follows directly from the analyses and whether communication is clear for the intended audience.

Question 10 — Discussion of p-value distributions

When the true population effect is large, p-values from repeated samples are typically small: their distribution concentrates near 0, which is what high power means in practice. When the true effect is zero, the p-value distribution is uniform on [0,1] (under ideal assumptions), and it stays close to uniform when the effect is almost zero, so small p-values occur mostly by chance (Wasserstein & Lazar, 2016; Nuzzo, 2014). This distinction explains why p-values alone do not measure effect size and why presenting effect estimates with CIs is preferred (Cumming, 2014).
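
The two cases can be illustrated with a small simulation. The sketch below is an assumption-laden toy, not the discussion provided with the assignment: it draws two groups of n = 30 from normal populations with known σ = 1, uses a z-test, and compares a true difference of 1.0 with a true difference of 0. All of these settings, and the function names, are hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist

def simulate_p_values(true_diff, n=30, sigma=1.0, reps=2000, seed=1):
    """Two-sided z-test p-values from repeated two-group samples."""
    rng = random.Random(seed)
    standard_error = sigma * sqrt(2 / n)
    std_normal = NormalDist()
    p_values = []
    for _ in range(reps):
        mean_a = sum(rng.gauss(true_diff, sigma) for _ in range(n)) / n
        mean_b = sum(rng.gauss(0.0, sigma) for _ in range(n)) / n
        z = (mean_a - mean_b) / standard_error
        p_values.append(2 * (1 - std_normal.cdf(abs(z))))
    return p_values

def share_below(p_values, cutoff):
    """Fraction of simulated p-values that fall below the cutoff."""
    return sum(p < cutoff for p in p_values) / len(p_values)

p_large = simulate_p_values(true_diff=1.0)  # large difference in population means
p_none = simulate_p_values(true_diff=0.0)   # no difference in population means

for cutoff in (0.05, 0.25, 0.50):
    print(f"share of p < {cutoff:.2f}: large effect {share_below(p_large, cutoff):.2f}, "
          f"no effect {share_below(p_none, cutoff):.2f}")
```

In the no-difference case the share of p-values below any cutoff is close to the cutoff itself, which is the uniform pattern; in the large-difference case almost all p-values fall below 0.05.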

Conclusion

Follow the summarizer outputs to insert exact descriptive and inferential tables into the report. Use the numerical examples above as templates for calculation and interpretation. Emphasize effect sizes and uncertainty, not p-values alone, and ensure the visualizations (scatterplot, pivot-table summaries) support the written interpretation (Efron & Tibshirani, 1993).

References

  • Wasserstein, R. L., & Lazar, N. A. (2016). The ASA statement on p-values: context, process, and purpose. The American Statistician, 70(2), 129–133.
  • Cumming, G. (2014). The new statistics: Why and how. Psychological Science, 25(1), 7–29.
  • Field, A. (2013). Discovering Statistics Using IBM SPSS Statistics. Sage Publications.
  • Moore, D. S., & McCabe, G. P. (2017). Introduction to the Practice of Statistics (9th ed.). W.H. Freeman.
  • Agresti, A. (2018). Statistical Methods for the Social Sciences. Pearson.
  • Cohen, J. (1992). A power primer. Psychological Bulletin, 112(1), 155–159.
  • Bland, J. M., & Altman, D. G. (1995). Statistics notes: Calculating correlation coefficients. BMJ, 310(6977), 597.
  • Gelman, A., & Stern, H. (2006). The difference between "significant" and "not significant" is not itself statistically significant. The American Statistician, 60(4), 328–331.
  • Nuzzo, R. (2014). Scientific method: Statistical errors. Nature, 506(7487), 150–152.
  • Efron, B., & Tibshirani, R. J. (1993). An Introduction to the Bootstrap. CRC Press.