


Question 1:
You have a dataset of 1000 observations on two variables, X and Y. The correlation between X and Y is 0.8. You want to fit a linear regression model to predict Y based on X. Estimate the regression coefficients and write the equation of the regression line.
Solution:
The regression coefficient β1 can be estimated as:
β1 = Cov(X,Y) / Var(X)
where Cov(X,Y) is the covariance between X and Y, and Var(X) is the variance of X. We can estimate these quantities from the sample as follows:
Cov(X,Y) = ∑(Xi - X̄)(Yi - Ȳ) / (n - 1)
Var(X) = ∑(Xi - X̄)^2 / (n - 1)
where X̄ and Ȳ are the sample means of X and Y, respectively, and n is the sample size.
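The problem gives only the correlation r = 0.8, not the variances or means, so we assume that both variables are standardized (mean 0 and variance 1). Under that assumption:
Var(X) = sX^2 = 1
Var(Y) = sY^2 = 1
Cov(X,Y) = 0.8 * sqrt(Var(X) * Var(Y)) = 0.8
β1 = Cov(X,Y) / Var(X) = 0.8
In general, β1 = r * sY / sX, so without this assumption the slope would depend on the two standard deviations.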
The intercept β0 can be estimated as:
β0 = Ȳ - β1 X̄

Using the sample means (assumed here to be 0, since the problem does not give them) and the estimated regression coefficient, we get:
β0 = Ȳ - β1 X̄ = 0 - 0.8 * 0 = 0
Therefore, the equation of the regression line is:
Y = β0 + β1X = 0 + 0.8X
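The calculation above can be sketched in Python. This is a minimal illustration, not part of the original solution: the function names are hypothetical, and the summary values passed in (standard deviations of 1 and means of 0) are the same assumptions the solution makes, since the problem states only the correlation.

```python
def coefficients_from_summary(r, s_x, s_y, x_bar, y_bar):
    """Slope and intercept from summary statistics.

    Uses beta1 = Cov(X,Y) / Var(X) = r * s_y / s_x
    and  beta0 = y_bar - beta1 * x_bar.
    """
    beta1 = r * s_y / s_x
    beta0 = y_bar - beta1 * x_bar
    return beta0, beta1


def coefficients_from_data(xs, ys):
    """Slope and intercept computed directly from paired observations."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sample covariance and variance, both with the (n - 1) divisor.
    cov_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)
    var_x = sum((x - x_bar) ** 2 for x in xs) / (n - 1)
    beta1 = cov_xy / var_x
    beta0 = y_bar - beta1 * x_bar
    return beta0, beta1


# Assumed inputs: r = 0.8 is given; the standard deviations and means are
# NOT given in the problem and are assumed to be 1 and 0 respectively.
beta0, beta1 = coefficients_from_summary(r=0.8, s_x=1.0, s_y=1.0,
                                         x_bar=0.0, y_bar=0.0)
print(f"Y = {beta0} + {beta1}X")
```

When the raw observations are available, coefficients_from_data applies the covariance and variance formulas directly, so no assumption about the standard deviations is needed.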
Question 2:
You have a dataset of 500 observations on a variable X, which has a mean of 50 and a standard deviation of 10. You want to test the hypothesis that the true mean of X is 55, using a significance level of 0.05. Perform the hypothesis test and interpret the result.
Solution:
We can use a one-sample t-test to test the hypothesis that the true mean of X is 55. The test statistic is calculated as:
t = (X̄ - μ) / (s / sqrt(n))
where X̄ is the sample mean, μ is the hypothesized population mean, s is the sample standard deviation, and n is the sample size. Using the given information and the formula, we get:
t = (50 - 55) / (10 / sqrt(500)) = -5 / 0.447 = -11.18

The degrees of freedom for the t-distribution are n - 1 = 499. Using a t-table or statistical software, we find that the critical value of t for a two-tailed test with a significance level of 0.05 and 499 degrees of freedom is approximately ±1.96.
Since the absolute value of the calculated t exceeds the critical value (t falls below -1.96), we reject the null hypothesis that the true mean of X is 55 at the 0.05 level. Because the sample mean of 50 is below 55, the data provide strong evidence that the true mean of X is less than 55.
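The test can be checked with a short Python sketch that uses only the standard library. The function name one_sample_t is hypothetical. Because the standard library has no t-distribution, the critical value below is taken from the standard normal distribution, which is an approximation that is very close to the t critical value at 499 degrees of freedom.

```python
import math
from statistics import NormalDist


def one_sample_t(sample_mean, mu0, s, n):
    """Test statistic t = (sample_mean - mu0) / (s / sqrt(n))."""
    return (sample_mean - mu0) / (s / math.sqrt(n))


alpha = 0.05
t_stat = one_sample_t(sample_mean=50, mu0=55, s=10, n=500)

# Two-tailed critical value from the normal approximation (assumption:
# with 499 degrees of freedom the t and normal quantiles nearly coincide).
critical = NormalDist().inv_cdf(1 - alpha / 2)

reject = abs(t_stat) > critical
print(f"t = {t_stat:.2f}, critical = ±{critical:.2f}, reject H0: {reject}")
```

With the raw observations in hand, a library routine for the one-sample t-test would give an exact p-value instead of relying on this approximation.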
Question 3:
You have a dataset of 200 observations on two variables, X and Y. You want to perform a linear regression analysis to predict Y based on X. However, you suspect that the relationship between X and Y may not be strictly linear. How can you check for nonlinearity in the relationship between X and Y?
Solution:
To check for nonlinearity in the relationship between X and Y, we can plot the data and examine the scatterplot. If the relationship between X and Y is nonlinear, the scatterplot will not follow a straight line and may exhibit a curved or nonlinear pattern.
In addition to visual inspection of the scatterplot, we can also use a residual plot to check for nonlinearity. A residual plot is a scatterplot of the residuals (the differences between the observed Y values and the predicted Y values) against the X values. If the relationship between X and Y is nonlinear, the residual plot will exhibit a pattern that is not random, such as a curved or U-shaped pattern.
If we find evidence of nonlinearity in the relationship between X and Y, we can consider using a nonlinear regression model instead of a linear regression model to better capture the true relationship between the two variables.
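The residual check can also be approximated numerically when a plot is not convenient. The sketch below is one possible illustration, not a standard named test: it fits a straight line, sorts the residuals by X, and averages them within three bins. The data are simulated (the question supplies none), and the helper names fit_line and binned_residual_means are hypothetical. Bin means that swing from positive to negative and back suggest a U-shaped pattern, while bin means near zero are consistent with a linear relationship.

```python
import random


def fit_line(xs, ys):
    """Least-squares intercept and slope for y = b0 + b1 * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    cov_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)
    var_x = sum((x - x_bar) ** 2 for x in xs) / (n - 1)
    b1 = cov_xy / var_x
    return y_bar - b1 * x_bar, b1


def binned_residual_means(xs, ys, bins=3):
    """Mean residual of the linear fit within equal-count bins of X."""
    b0, b1 = fit_line(xs, ys)
    # Pair each x with its residual and sort by x.
    pairs = sorted((x, y - (b0 + b1 * x)) for x, y in zip(xs, ys))
    size = len(pairs) // bins
    means = []
    for i in range(bins):
        # The last bin absorbs any leftover observations.
        chunk = pairs[i * size:] if i == bins - 1 else pairs[i * size:(i + 1) * size]
        means.append(sum(r for _, r in chunk) / len(chunk))
    return means


# Simulated data (an assumption for illustration): 200 observations.
random.seed(0)
xs = [random.uniform(-3, 3) for _ in range(200)]
ys_linear = [2 * x + random.gauss(0, 0.5) for x in xs]
ys_curved = [x ** 2 + random.gauss(0, 0.5) for x in xs]

linear_means = binned_residual_means(xs, ys_linear)
curved_means = binned_residual_means(xs, ys_curved)
print("linear:", [round(m, 2) for m in linear_means])
print("curved:", [round(m, 2) for m in curved_means])
```

This numeric summary complements the residual plot rather than replacing it, since a plot can reveal patterns that three bin averages would miss.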
