Homework 1: Statistical Inference II (J. Lee) Assignment

HW1_STAT206.pdf — Statistical Inference II: J. Lee, Assignment 1

Problem 1. Suppose the day after the Drexel–Northeastern basketball game, a poll of 1000 Drexel students was conducted and it was determined that 850 of the 1000 watched the game (live or on television). Assume that this was a simple random sample and that the Drexel undergraduate population is 20,000.
(a) Generate an unbiased estimate of the true proportion of Drexel undergraduate students who watched the game.
(b) What is your estimated standard error for the proportion estimate in (a)?
(c) Give a 95% confidence interval for the true proportion of Drexel undergraduate students who watched the game.

Problem 2. (Exercise 18 in Chapter 7 of Rice) From independent surveys of two populations, 90% confidence intervals for the population means are constructed. What is the probability that neither interval contains the respective population mean? That both do?

Problem 3. (Exercise 23 in Chapter 7 of Rice)
(a) Show that the standard error of an estimated proportion is largest when p = 1/2.
(b) Use this result and Corollary B of Section 7.3.2 (also on page 17 of the lecture notes) to conclude that the quantity (1/2)√[(N − n)/(N(n − 1))] is a conservative estimate of the standard error of p̂ no matter what the value of p may be.
(c) Use the central limit theorem to conclude that the interval p̂ ± √[(N − n)/(N(n − 1))] contains p with probability at least .95.

HW2_STAT206.pdf — Statistical Inference II: J. Lee, Assignment 2

Problem 1. The following data set represents the number of NBA games in January 2016 watched by 10 randomly selected students in STAT 206: 7, 0, 4, 2, 2, 1, 0, 1, 2, 3.
(a) What is the sample mean?
(b) Calculate the sample variance.
(c) Estimate the mean number of NBA games watched by a student in January 2016.
(d) Estimate the standard error of the estimated mean.

Problem 2. True or false? Explain why for the false statements.
(a) The center of a 95% confidence interval for the population mean is a random variable.
(b) A 95% confidence interval for μ contains the sample mean with probability .95.
(c) A 95% confidence interval contains 95% of the population.
(d) Out of one hundred 95% confidence intervals for μ, 95 will contain μ.

Problem 3. An investigator quantifies her uncertainty about the estimate of a population mean by reporting X̄ ± s_X̄. What size confidence interval is this?

Problem 4. For a random sample of size n from a population of size N, consider the following as an estimate of μ: X_c = Σ_{i=1}^{n} c_i X_i, where the c_i are fixed numbers and X_1, ..., X_n are the sample. Find a condition on the c_i such that the estimate is unbiased.

Problem 5. A sample of size 100 has sample mean X̄ = 10. Suppose we know that the population standard deviation is σ = 5. Find a 95% confidence interval for the population mean μ.

Problem 6. Suppose we know that the population standard deviation is σ = 5. How large should a sample be to estimate the population mean μ with a margin of error not exceeding 0.5?

Problem 7. You flip a fair coin n times and keep track of the sample mean X̄(n), the fraction of heads among the n flips. Of course, when n is very large, you expect the random variable X̄(n) to be very close to 0.5 (since the coin is fair).
(a) Use the Central Limit Theorem to estimate how large n must be in order to be 95% confident that X̄(n) is between 0.45 and 0.55.
(b) Use Chebyshev's inequality to obtain a number K such that you can guarantee that if n is at least K, then the probability that X̄(n) is between 0.45 and 0.55 is at least 0.95.

Rice_HW1.pdf — Survey Sampling (Dr. Jinwook Lee; Ref: Ch. 7.1–7.3.3 in Rice)

Introduction. Many applications of statistics involve inference on a fixed, finite population: estimation of population parameters and some quantification of the accuracy of those estimates. Typically the estimates and their accuracies are generated via some sort of "random" sampling of the population.
This lecture describes the appropriate probability (and hence statistical) models for the results of random sampling.

Population parameters. A population is a class of things/elements, and we denote its size by N. Associated with each element is a number x_i, the characteristic of interest, so the population is x_1, x_2, ..., x_N. The population mean is μ = (1/N) Σ x_i, the population total is τ = Σ x_i = Nμ, and the population variance is σ² = (1/N) Σ (x_i − μ)². An important special case is when all of the x_i are 0 or 1: the population consists of elements having or not having a particular characteristic. In that case we refer to the population mean as the population proportion, denote it by p, and the variance becomes σ² = p(1 − p).

Sampling. Definition: for a population of size N, a random sample of size n is a simple random sample (srs) if (i) sampling is done without replacement, and (ii) all "N choose n" subsets of size n in the population have an equally likely chance of being chosen. Remark: actually carrying out an srs can be very hard to do in practice.

Expectation and variance for an srs. Suppose X_1, X_2, ..., X_n are random variables representing an srs from a population of size N with mean μ and variance σ². They are not independent (due to sampling without replacement), but they do have a common mean and variance: E(X_i) = μ and Var(X_i) = σ² for each i.

Sample mean as estimate. Definitions: the sample mean is X̄ = (1/n) Σ X_i; in the case where the population values are 0 or 1, the same average is the sample proportion p̂; and a natural estimate of the population total τ is T = N X̄. These are the natural estimates for the respective population parameters μ, p, and τ.

MSE, bias, and variance of estimators. As in prediction, we want to quantify how good these estimates are; it is natural to do this via mean-squared error, MSE(θ̂) = E[(θ̂ − θ)²], which can be defined for any estimator. The bias of an estimate is defined as E(θ̂) − θ, and MSE = variance + bias².

Sample means are unbiased. Corollary: if X_1, ..., X_n is an srs from a population with mean μ and total τ, then (a) X̄ is an unbiased estimate of μ; (b) T is an unbiased estimate of τ; and (c) if the population consists of 0s and 1s, the sample proportion p̂ is an unbiased estimate of p.

Variance of the sample mean from an srs. Theorem: if X_1, ..., X_n is an srs from a population of size N with mean μ and variance σ², then
Var(X̄) = (σ²/n) · (N − n)/(N − 1).
The factor (N − n)/(N − 1) is the finite population correction, and n/N is the sampling fraction; when n/N is small, Var(X̄) ≈ σ²/n, the value under sampling with replacement.

Estimation of the population variance. Typically one does not know the mean or the variance of a population; that is why one samples and estimates in the first place. The standard errors of X̄ and T depend on the underlying population standard deviation, and the standard error of p̂ depends on the population proportion p, the very parameter we are trying to estimate, so the standard errors must themselves be estimated from the sample.

Unbiased estimation of standard errors. Let s² = (1/(n − 1)) Σ (X_i − X̄)² be the sample variance. Corollary A: s_X̄² = (s²/n)(1 − n/N) is an unbiased estimate of Var(X̄). Corollary A′: s_T = N s_X̄. Corollary B: s_p̂² = [p̂(1 − p̂)/(n − 1)] · (N − n)/N is an unbiased estimate of Var(p̂). We refer to s_X̄, s_T, and s_p̂ as estimated standard errors; for each, the squared value is an unbiased estimate of the corresponding variance.

CLT approximation for an srs. In the case of sampling with replacement, one can invoke the CLT to derive the approximate sampling distribution for reasonably sized n, i.e., n ≥ 25. There is a generalization of the CLT which applies to an srs: essentially, the CLT applies if the sample size n is large enough and the sampling fraction n/N is small enough, in which case (X̄ − μ)/s_X̄ is approximately standard normal.

Introduction to confidence intervals. It is desirable to have a more direct statement quantifying the accuracy of an estimate; one standard way of doing this is via confidence intervals. Definition: suppose θ is a (general) population parameter and X_1, ..., X_n is an srs. A 100(1 − α)% confidence interval is (a) a random interval I computed from the srs such that (b) in advance we know P(θ ∈ I) ≥ 1 − α.

Remarks. Once the data are collected and a confidence interval is computed, the parameter is either in or out of the interval; there is no longer any probability, hence the terminology "confidence" interval (as opposed to "probability" interval). One potentially helpful interpretation: if one were to collect srs's and compute 95% confidence intervals over and over again (say 1000 times), approximately 95% of those intervals would contain the true parameter; we simply do not know which ones. Typical values for α are .10, .05, and .01, resulting in 90, 95, and 99 percent confidence intervals. The specific intervals are X̄ ± z(α/2) s_X̄ for μ and p̂ ± z(α/2) s_p̂ for p.

Paper for the Above Assignments

Statistical inference plays a pivotal role in understanding population parameters based on sample data. In the context of the first problem, a survey conducted among Drexel students provides such an opportunity. The objective is to estimate the proportion of undergraduate students who watched a recent basketball game, along with quantifying the uncertainty associated with this estimate. The survey's findings—850 watchers out of 1000 students sampled—serve as the basis for estimating the true population proportion, calculating standard errors, and constructing confidence intervals.

To generate an unbiased estimate of the proportion of students who watched the game, we use the sample proportion p̂. This estimator is unbiased because, under the assumptions of simple random sampling, it equals the true proportion p in expectation. Specifically, p̂ = 850/1000 = 0.85. Since the sampling is random and representative, this proportion is a reliable estimator of the population parameter. The use of simple random sampling ensures each student has an equal chance of selection, making the sample proportion a valid and unbiased estimate.
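The arithmetic of part (a) is trivial but worth pinning down; a minimal Python sketch using only the survey numbers given in the problem:

```python
# Problem 1(a): sample proportion as an unbiased estimate of p.
# 850 watchers out of n = 1000 sampled students (from the survey).
watched = 850
n = 1000
p_hat = watched / n
print(p_hat)  # 0.85
```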

Calculating the standard error (SE) of the estimated proportion involves understanding the variability inherent in the sampling process. The standard error of p̂ is the square root of its variance, which depends on the true proportion p, the sample size n, and the population size N (because sampling is without replacement). In practice, since p is unknown, we replace it with p̂. The standard error formula becomes SE = √[p̂(1 − p̂)/n] × √[(N − n)/(N − 1)], where N = 20,000 and n = 1,000. Substituting the values, SE ≈ √[0.85 × 0.15/1000] × √[19,000/19,999] ≈ 0.0113 × 0.975 ≈ 0.0110. The finite population correction factor is close to 1 here because the sampling fraction n/N = 0.05 is small. This quantifies the precision of our estimate and reflects the sampling variability.
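The SE computation above can be reproduced in a few lines of Python, using exactly the formula stated in the text:

```python
import math

# Problem 1(b): estimated SE of p_hat with the finite population
# correction: sqrt(p(1-p)/n) * sqrt((N-n)/(N-1)).
p_hat, n, N = 0.85, 1000, 20000
se = math.sqrt(p_hat * (1 - p_hat) / n) * math.sqrt((N - n) / (N - 1))
print(round(se, 4))  # 0.011
```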

Constructing a 95% confidence interval (CI) uses the sample proportion, the standard error, and the critical value from the standard normal distribution (z ≈ 1.96). The CI is given by p̂ ± z × SE. Plugging in the values, 0.85 ± 1.96 × 0.0110 yields an interval from approximately 0.828 to 0.872. This interval suggests that, with 95% confidence, the true proportion of Drexel undergraduates who watched the game is between 82.8% and 87.2%. Such an interval provides a range of plausible values for the population parameter, accounting for sampling uncertainty.
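Putting parts (a) through (c) together, the whole interval can be computed end to end:

```python
import math

# Problem 1(c): 95% CI for the true proportion.
p_hat, n, N, z = 0.85, 1000, 20000, 1.96
se = math.sqrt(p_hat * (1 - p_hat) / n) * math.sqrt((N - n) / (N - 1))
lo, hi = p_hat - z * se, p_hat + z * se
print(round(lo, 3), round(hi, 3))  # 0.828 0.872
```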

The second problem involves understanding the probability that confidence intervals from independent surveys do or do not contain the true population means. Because each confidence interval is constructed with a 90% confidence level, the probability that a given interval does not contain the true mean is 10%. Consequently, the probability that both intervals fail to contain their respective means is (0.10)^2 = 0.01 or 1%. Conversely, the probability that both intervals contain their respective means is (0.90)^2 = 0.81 or 81%. These probabilities assume independence and are based on the properties of confidence intervals developed with known confidence levels.
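Because the two surveys are independent, the two probabilities multiply; a one-liner confirms the numbers:

```python
# Problem 2: two independent 90% confidence intervals.
miss = 0.10                    # each interval misses with probability 0.10
p_neither = miss ** 2          # both intervals miss the mean
p_both = (1 - miss) ** 2       # both intervals cover the mean
print(round(p_neither, 2), round(p_both, 2))  # 0.01 0.81
```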

The third problem demonstrates a fundamental property of the standard error of a proportion: it is maximized when p = 0.5. To see this, note that p(1 − p) is a downward-opening quadratic with its vertex at p = 1/2, so √[p(1 − p)/n] attains its maximum there. Corollary B gives the estimated standard error s_p̂ = √[p̂(1 − p̂)(N − n)/(N(n − 1))]; substituting the worst case p̂ = 1/2 yields the upper bound (1/2)√[(N − n)/(N(n − 1))], which is therefore a conservative estimate of the standard error no matter what the value of p may be. This is valuable in practical applications, where the true p is unknown, because conservative estimates prevent understating variability.
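The vertex claim is easy to verify numerically, scanning p over a grid of hundredths:

```python
# Numerical check that p(1 - p), and hence the SE of a proportion,
# peaks at p = 1/2 with maximum value 1/4.
vals = [(k / 100, (k / 100) * (1 - k / 100)) for k in range(101)]
p_max, f_max = max(vals, key=lambda t: t[1])
print(p_max, f_max)  # 0.5 0.25
```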

Using the central limit theorem (CLT), the sampling distribution of p̂ is approximately normal, so the interval p̂ ± 1.96 s_p̂ covers p with probability about .95. Replacing 1.96 s_p̂ with the larger quantity 2 × (1/2)√[(N − n)/(N(n − 1))] = √[(N − n)/(N(n − 1))] can only widen the interval, so p̂ ± √[(N − n)/(N(n − 1))] contains p with probability at least .95. This underpins many inference procedures in practice, as it yields a valid confidence set for a proportion even when p is entirely unknown.
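For the survey sizes in Problem 1 (N = 20,000, n = 1,000), the conservative half-width √[(N − n)/(N(n − 1))] works out as follows:

```python
import math

# Half-width of the conservative interval p_hat ± sqrt((N-n)/(N(n-1))),
# i.e., twice the conservative SE (1/2)sqrt((N-n)/(N(n-1))).
N, n = 20000, 1000
half_width = math.sqrt((N - n) / (N * (n - 1)))
print(round(half_width, 4))  # 0.0308
```

Note this is wider than the 1.96 × SE margin computed earlier (about 0.022), as a conservative bound must be.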

The subsequent parts of the assignment extend these classical ideas to other estimators, such as sample means and variances, highlighting their properties, biases, and the logic behind confidence interval construction. For example, the sample mean's unbiasedness and variance govern the reliability of population mean estimates, with the standard error quantifying precision. When the population standard deviation is known, the confidence interval simplifies; otherwise, sample estimates are used, which require correction for additional variability.
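As a concrete instance, the NBA-viewing data from HW2 Problem 1 can be worked through directly; the numbers below come from the assignment itself:

```python
import math
import statistics

# HW2 Problem 1: NBA games watched in January 2016 by 10 students.
games = [7, 0, 4, 2, 2, 1, 0, 1, 2, 3]
xbar = statistics.mean(games)       # sample mean (divides by n)
s2 = statistics.variance(games)     # sample variance (divides by n - 1)
se = math.sqrt(s2 / len(games))     # estimated SE of the sample mean
print(xbar, s2, round(se, 3))       # 2.2 4.4 0.663
```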

Furthermore, the importance of the Central Limit Theorem (CLT) is emphasized in sampling distributions, especially the approximation of these distributions for large sample sizes. This is crucial because many real-world sampling scenarios involve finite populations, where the finite population correction (FPC) factor adjusts standard errors, ensuring accurate inference.

Reporting an estimate as X̄ ± s_X̄, as in Problem 3 of the second assignment, corresponds under the normal approximation to a confidence interval of roughly 68% (one standard error on each side). Unbiasedness of linear estimators with fixed weights, as in Problem 4, requires that the weights sum to one: E(Σ c_i X_i) = μ Σ c_i, which equals μ exactly when Σ c_i = 1. Sample size calculations for desired margins of error, based on known population parameters, enable effective planning of data collection efforts. For instance, to estimate a mean with specified precision, researchers can determine the necessary sample size from the population standard deviation and the confidence level.
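The sample size calculation of Problem 6 (σ = 5 known, margin of error at most 0.5 at 95% confidence) follows from requiring z σ/√n ≤ 0.5:

```python
import math

# Problem 6: smallest n with 1.96 * sigma / sqrt(n) <= margin.
z, sigma, margin = 1.96, 5, 0.5
n = math.ceil((z * sigma / margin) ** 2)  # (19.6)^2 = 384.16, round up
print(n)  # 385
```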

When dealing with binomial data, such as flipping coins, the CLT provides conditions under which the proportion of heads will be close to 0.5 with high confidence, given sufficiently large n. Chebyshev's inequality offers a more conservative, distribution-free bound on the probability that the sample proportion falls within a specified interval, illustrating its broader applicability but looser constraints compared to CLT-based methods.
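Both bounds from the coin-flip problem can be computed side by side; for a fair coin Var(X̄) = 0.25/n, so the CLT requirement is 1.96√(0.25/n) ≤ 0.05 while Chebyshev requires (0.25/n)/0.05² ≤ 0.05:

```python
import math

# Problem 7: fair coin, want P(0.45 <= Xbar <= 0.55) >= 0.95.
# (a) CLT: 1.96 * sqrt(0.25/n) <= 0.05  =>  n >= (1.96*0.5/0.05)^2.
n_clt = math.ceil((1.96 * 0.5 / 0.05) ** 2)
# (b) Chebyshev: P(|Xbar - 0.5| >= 0.05) <= (0.25/n)/0.05^2 <= 0.05.
n_cheb = math.ceil(0.25 / (0.05 * 0.05 ** 2))
print(n_clt, n_cheb)  # 385 2000
```

The gap between 385 and 2000 illustrates how much looser the distribution-free Chebyshev bound is than the CLT approximation.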

Finally, the exploration of survey sampling models, including population parameters, variance estimation, and confidence intervals, underscores the robustness and versatility of statistical inference. It also emphasizes that while parametric assumptions and approximations such as the CLT are powerful, understanding their limitations, for example when sample sizes are small or sampling fractions are large, is essential for valid conclusions.

References

  • Rice, J. (2007). Mathematical Statistics and Data Analysis. Cengage Learning.
  • Feller, W. (1968). An Introduction to Probability Theory and Its Applications (Vol. 1, 3rd ed.). Wiley.
  • Casella, G., & Berger, R. L. (2002). Statistical Inference. Duxbury.
  • Lohr, S. L. (2010). Sampling: Design and Analysis (2nd ed.). Brooks/Cole.