주축량(Pivotal Quantity)과 지수분포에서 모수의 신뢰구간 - 시뮬레이션

지난 포스팅 "주축량과 지수분포에서 모수의 신뢰구간-이론"(https://moogie.tistory.com/150)에서

$X_1, X_2, \ldots, X_n \sim Exp(\theta)$인 경우 MLE와 주축량을 이용해 구한 $\theta$의 신뢰구간은 다음과 같았습니다.

1. MLE를 이용한 경우

$$[\hat{\theta}-z_{\alpha/2}\frac{\hat{\theta}}{\sqrt{n}}, \hat{\theta}+z_{\alpha/2}\frac{\hat{\theta}}{\sqrt{n}}]$$

2. Pivotal Quantity를 이용한 경우

$$[\frac{2\sum X_i}{\chi^2_{\alpha/2}(2n)}, \frac{2\sum X_i}{\chi^2_{1-\alpha/2}(2n)}]$$

이번 포스팅에서는 이를 시뮬레이션을 통해 검증해보고, 표본 수에 따라 신뢰구간의 신뢰도가 어떻게 변하는지 확인해 보겠습니다.

1. 지수분포 모수 설정과 표본 생성

먼저 지수분포의 모수 $\theta$를 설정하기 위해 자유도가 2인 카이제곱분포 $\chi^2(2)$에서 랜덤하게 값을 추출합니다.

theta = rchisq(n=1, df = 2)

왜 자유도가 2인 카이제곱분포을 선택했냐면, 지수분포의 모수는 포아송분포의 모수의 역수이므로 1이하의 값도 빈번하게 나오도록 하기위함입니다.

(참고 : $X \sim \chi^2(2)$일 때, $P(X<1)=0.3934$로 약 40% 확률로 1 이하의 소수가 추출됩니다.)

이제 지수분포 $Exp(\theta)$에서 표본 100개를 추출해 보겠습니다.

n=100
sample=rexp(n = n, rate = 1/theta)

2. 신뢰구간 계산 – MLE와 Pivotal Quantity

MLE 기반의 신뢰구간에 필요한 값은 아래와 같으므로

$$\hat{\theta} = \frac{\sum_i X_i}{n}, \quad \hat{se(\theta)} = \frac{\hat{\theta}}{\sqrt{n}}$$

MLE 기반의 95% 신뢰구간을 아래와 같이 구할 수 있습니다.

theta_hat = sum(sample)/n
se_theta_hat = theta_hat/sqrt(n)
mle_ci = c(theta_hat-qnorm(p=0.025, lower.tail = F)*theta_hat/sqrt(n), 
           theta_hat + qnorm(p=0.025, lower.tail = F)*theta_hat/sqrt(n))

또한 Pivotal Quantity를 기반의 신뢰구간은 아래와 같이 구할 수 있습니다.

sum_sample=sum(sample)
pivotal_ci = c(2*sum_sample/qchisq(p = 0.025, df = 2*n, lower.tail = F), 
               2*sum_sample/qchisq(p=0.975, df=2*n, lower.tail = F))

한번 시뮬레이션 해본 결과 $\theta=0.5304$가 나왔고 $Exp(0.5304)$에서 표본추출한 random sample의 100개의 관측값의 표본평균이 0.4610이 나옵니다. (원래는 표본평균의 기댓값이 0.5304가 나와야하는데 차이가 좀 있는 편이네요)

이때, MLE 기반의 신뢰구간은 [0.3707, 0.5514], Pivotal 기반의 신뢰구간이 [0.3825, 0.5666]으로 모두 모수의 참값 0.5304를 포함하는 것을 확인할 수 있습니다.

3. 시뮬레이션 – 신뢰도 검증

이번에는 10,000번의 시뮬레이션을 통해 MLE, Pivotal 기반의 신뢰구간이 우리가 기대하는 신뢰도와 비슷한지 확인해 봅시다.

아래와 같이 실험수를 10000번으로 설정하고 모수 $\theta$의 값을 자유도가 2인 카이제곱분포에서 랜덤추출하였습니다.

library(tidyverse)
times=10000
exp_frame = tibble(id=1:times, theta=rchisq(n=times, df=2), n=100)

exp_frame에 있는 theta(=지수분포의 모수)와 n(=표본 수 $x_1, x_2, \ldots, x_n$)를 통해 샘플과 신뢰구간을 생성하는 함수를 다음과 같이 만들 수 있습니다.

theta_CI <- function(theta, n, alpha=0.05, method=c("mle", "pivotal")){
  
  method <- match.arg(method)
  sample = rexp(n=n, rate=1/theta)
  
  # method에 따라 다른 작업 수행
  if (method == "mle") {
    theta_hat = sum(sample)/n
    se_theta_hat = theta_hat/sqrt(n)
    mle_ci =  c(theta_hat-qnorm(p=alpha/2, lower.tail = F)*theta_hat/sqrt(n), 
                theta_hat + qnorm(p=alpha/2, lower.tail = F)*theta_hat/sqrt(n))
    return(invisible(mle_ci))
  } else if (method == "pivotal") {
    sum_sample=sum(sample)
    pivotal_ci = c(2*sum_sample/qchisq(p = alpha/2, df = 2*n, lower.tail = F),
                   2*sum_sample/qchisq(p=1-alpha/2, df=2*n, lower.tail = F))
    return(invisible(pivotal_ci))
  }
}

이제 exp_frame과 theta_CI 함수를 통해 시뮬레이션을 진행하였습니다.

exp_frame |> 
  mutate(mle = map2(theta, n, ~theta_CI(.x, .y, alpha=0.05, method="mle")),
         pivotal = map2(theta, n, ~theta_CI(.x, .y, alpha=0.05, method="pivotal"))) |> 
  unnest_wider(col=mle, names_sep = "_") |> 
  unnest_wider(col=pivotal, names_sep = "_") |> 
  rename(mle_lower = mle_1, mle_upper = mle_2, pivotal_lower = pivotal_1, pivotal_upper = pivotal_2) -> exp_frame_ci

모수의 참값이 MLE 기반의 CI, Pivotal 기반의 CI에 포함되는 비율을 구하면 다음과 같이 94.6%, 95.2%가 나오네요.

(물론 시뮬레이션 할때마다 결과가 다르게 나오지만 백만번 실행했을 때 94.5%, 95%가 나옵니다.)

exp_frame_ci |> 
  mutate(mle_flag = between(theta, mle_lower, mle_upper),
         pivotal_flag = between(theta, pivotal_lower, pivotal_upper)) |> 
  summarise(mle_prob = mean(mle_flag),
            pivotal_flag = mean(pivotal_flag))

또한, 300개 정도 샘플링해서 모수 값에 따른 신뢰구간을 시각화해보았습니다.

exp_frame_ci |> 
  mutate(mle_flag = between(theta, mle_lower, mle_upper),
         pivotal_flag = between(theta, pivotal_lower, pivotal_upper)) |> 
  slice_sample(n=300) |> 
  arrange(theta) |> mutate(id=1:300) |> 
  ggplot(mapping=aes(x=id, y=theta)) + geom_line() + 
  geom_errorbar(mapping=aes(ymin=mle_lower, ymax=mle_upper, colour = mle_flag)) + 
  scale_y_sqrt() +
  theme(axis.text.x = element_blank(), axis.title.x = element_blank(), axis.ticks.x = element_blank())

4. 표본 수가 작을 때의 신뢰구간

이번에는 각 시뮬레이션 반복마다 샘플의 수를 10개 ($X_1, X_2, \ldots, X_{10}$)로 설정해 보았습니다.

exp_frame에서 n의 값을 100에서 10으로 수정하였고, times를 1,000,000로 하여 백만번 시뮬레이션을 진행했습니다.

이때 모수의 참값이 MLE 기반의 CI, Pivotal 기반의 CI에 포함되는 비율을 구하면 이전과 다르게 90.4%, 95%가 나오는 것을 확인할 수 있습니다.

pivotal 기반의 신뢰구간은 일정하게 95%가 나오는데 MLE 기반의 신뢰구간은 n=100일때는 94.6%에서 n=10일때 90.4%로 변하는 걸까요?

MLE의 Asymptotic Normality는 아래와 같이 표준정규분포에 분포수렴합니다.

$$\frac{\hat{\theta}-\theta}{\hat{se}} \overset{d}{\rightarrow} N(0,1)$$

분포수렴하기 위해서는 표본수가 충분히 커야하는데 표본의 수가 10으로 작아 정규 분포 근사가 제대로 이루어지지 않기 때문에 95% 신뢰구간을 구해도 95% 신뢰도가 보장되지 못합니다.

따라서 표본 수가 작다면 표본 크기와 상관없이 정확한 Pivotal Quantity 기반의 신뢰구간을 구하는 것이 정확하고 표본 수가 클때는 MLE 기반의 신뢰구간을 구하는 것도 괜찮습니다. (Pivotal Quantity 기반의 신뢰구간은 구하기 쉽지 않음)

'Statistics > Mathmetical Statistics' 카테고리의 다른 글

표본분산(Sample Variance)의 특징 (0)	2024.12.22
모비율 신뢰구간 with Chebyshev & Hoeffding Inequality (0)	2024.12.18
주축량(Pivotal Quantity)과 지수분포에서 모수의 신뢰구간 - 이론 (1)	2024.12.17
[확률과 통계적 추론] 8-4. 비모수적 검정 (Non-parametric Test) (0)	2024.03.25
[확률과 통계적 추론] 8-3. 비율에 대한 가설검정 (0)	2024.03.25

1. 지수분포 모수 설정과 표본 생성

2. 신뢰구간 계산 – MLE와 Pivotal Quantity

3. 시뮬레이션 – 신뢰도 검증

4. 표본 수가 작을 때의 신뢰구간

'Statistics > Mathmetical Statistics' 카테고리의 다른 글

티스토리툴바