Data Structure and Model

  • Data structure

response$\quad$ explanatory

$\begin{pmatrix} y_{1} \\ y_{2} \\ \vdots \\ y_{n} \end{pmatrix}$ $\begin{pmatrix} x_{11} & \cdots & x_{1p} \\ x_{21} & \cdots & x_{2p} \\ \vdots & \ddots & \vdots \\ x_{n1} & \cdots & x_{np} \end{pmatrix}$

Multiple Linear Regression

  • Multiple (linear) regression model with $p$ explanatory variables:
    $\qquad\qquad y_i=\beta_0+\beta_1x_{i1}+\dots+\beta_{p}x_{ip}+\epsilon_i\quad i = 1,2,\dots,n$

    • Regression parameters: $\beta_0,\beta_1,\dots,\beta_p\qquad\longrightarrow (p+1)$ parameters
    • Explanatory (independent) variables:
      $X_1 = (x_{11},\dots,x_{n1})^T,\dots,X_p = (x_{1p},\dots,x_{np})^T$
    • Response (dependent) variable: $Y=(y_1,\dots,y_n)^T$
    • Error terms: $\epsilon_1,\dots,\epsilon_n\;\overset{i.i.d.}{\sim}\;N(0,\sigma^2)$
  • Multiple (linear) regression model with $p$ explanatory variables, in matrix form:
    $\qquad\qquad Y=X\beta +\epsilon$

    • Regression parameters: $\beta = (\beta_0,\beta_1,\dots,\beta_p)^T$, a $(p+1)$ vector
    • Explanatory (independent) variables:
      $X = (1,X_1,\dots,X_p)$, an $n\times(p+1)$ matrix
    • Response (dependent) variable: $Y = (y_1,\dots,y_n)^T$, an $n$ vector
    • Error term: $\epsilon = (\epsilon_1,\dots,\epsilon_n)^T$, an $n$ vector
    • Data structure

    $\begin{pmatrix} y_{1} \\ y_{2} \\ \vdots \\ y_{n} \end{pmatrix}$ = $\begin{pmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1p} \\ 1 & x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & x_{n2} & \cdots & x_{np} \end{pmatrix}$ $\begin{pmatrix} \beta_{0} \\ \beta_{1} \\ \vdots \\ \beta_{p} \end{pmatrix}$ + $\begin{pmatrix} \epsilon_{1} \\ \epsilon_{2} \\ \vdots \\ \epsilon_{n} \end{pmatrix}$

    $\quad\; n\times 1\qquad\qquad n\times(p+1)\qquad\;(p+1)\times 1\quad n\times 1$

Least Squares Estimation

  • Least squares estimator:
    $$(\hat{\beta}_0,\hat{\beta}_1,\dots,\hat{\beta}_p) = \underset{(\beta_0,\dots,\beta_p)\in \mathbb{R}^{p+1}}{\operatorname{argmin}}\displaystyle\sum_{i=1}^{n}\{y_i - (\beta_0+\beta_1x_{i1}+\dots +\beta_px_{ip})\}^2$$

or, equivalently,
$\qquad\hat{\beta} = \underset{\beta\in \mathbb{R}^{p+1}}{\operatorname{argmin}}\;\|Y - X\beta\|^2 = (X^T X)^{-1}X^T Y$
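As a quick numerical sketch, the closed form $\hat{\beta}=(X^TX)^{-1}X^TY$ can be computed with NumPy. The data below are hypothetical (not from the notes); solving the normal equations avoids forming the inverse explicitly.

```python
import numpy as np

# Hypothetical data: n = 8 observations, p = 2 explanatory variables.
rng = np.random.default_rng(0)
x1 = np.arange(1.0, 9.0)
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0, 8.0, 7.0])
n, p = 8, 2
y = 1.0 + 2.0 * x1 - 0.5 * x2 + rng.normal(0.0, 0.1, n)

# Design matrix X: a leading column of ones for the intercept -> n x (p+1).
X = np.column_stack([np.ones(n), x1, x2])

# beta_hat = (X^T X)^{-1} X^T Y, computed by solving the normal equations.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)  # close to the true values (1, 2, -0.5)
```

The residual vector $Y - X\hat{\beta}$ is orthogonal to the columns of $X$, which is what the normal equations encode.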

Estimation of error variance

  • Residual: $e_i = y_i - \hat{y}_i$
  • Estimation of the error variance $\sigma^2$:
    • Residual (or error) sum of squares:
      $$SSE = \displaystyle\sum_{i=1}^{n}(y_i - \hat{y}_i)^2=\displaystyle\sum_{i=1}^{n}e_i^2$$
    • Mean squared error: $MSE=\frac{SSE}{n-(p+1)}$
    • Estimate of the error variance: $\hat{\sigma}^2 = MSE$
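The steps above translate directly into code; a minimal sketch with hypothetical data:

```python
import numpy as np

# Hypothetical data: n = 8, p = 2, true error sd = 0.1.
rng = np.random.default_rng(0)
x1 = np.arange(1.0, 9.0)
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0, 8.0, 7.0])
n, p = 8, 2
y = 1.0 + 2.0 * x1 - 0.5 * x2 + rng.normal(0.0, 0.1, n)
X = np.column_stack([np.ones(n), x1, x2])
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Residuals e_i = y_i - y_hat_i.
y_hat = X @ beta_hat
e = y - y_hat

SSE = np.sum(e ** 2)          # residual sum of squares
MSE = SSE / (n - (p + 1))     # mean squared error, df = n - (p+1) = 5
sigma_hat = np.sqrt(MSE)      # estimate of sigma
print(SSE, MSE, sigma_hat)
```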

Decomposition of deviations

  • Decomposition of the total deviation
    • $y_i - \overline y = (y_i-\hat{y}_i)+(\hat{y}_i-\overline y),\;\forall i$
    • Total deviation: $y_i - \overline y$
    • Residual: $y_i-\hat{y}_i$; deviation of the fitted value: $\hat{y}_i-\overline y$
      $\Rightarrow$ total deviation = residual + deviation of the fitted value
  • Decomposition of the sums of squares: SST = SSE + SSR
    $$\displaystyle\sum_{i=1}^{n}(y_i - \overline y)^2 = \displaystyle\sum_{i=1}^{n}(y_i - \hat{y}_i)^2 + \displaystyle\sum_{i=1}^{n}(\hat{y}_i - \overline y)^2$$

| Sum of squares | Definition and notation | Degrees of freedom |
|---|---|---|
| Total sum of squares | $SST = \sum_{i=1}^{n}(y_i - \overline y)^2$ | $n-1$ |
| Residual sum of squares | $SSE = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2$ | $n-(p+1)$ |
| Regression sum of squares | $SSR = \sum_{i=1}^{n}(\hat{y}_i - \overline y)^2$ | $p$ |
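The decomposition SST = SSE + SSR can be verified numerically; a sketch with hypothetical data:

```python
import numpy as np

# Hypothetical data; fit by least squares, then check SST = SSE + SSR.
rng = np.random.default_rng(0)
x1 = np.arange(1.0, 9.0)
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0, 8.0, 7.0])
n = 8
y = 1.0 + 2.0 * x1 - 0.5 * x2 + rng.normal(0.0, 0.1, n)
X = np.column_stack([np.ones(n), x1, x2])
y_hat = X @ np.linalg.solve(X.T @ X, X.T @ y)

SST = np.sum((y - y.mean()) ** 2)      # total sum of squares
SSE = np.sum((y - y_hat) ** 2)         # residual sum of squares
SSR = np.sum((y_hat - y.mean()) ** 2)  # regression sum of squares

# The cross term vanishes because the model contains an intercept,
# so the decomposition holds exactly (up to floating-point error).
print(SST, SSE + SSR)
```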


Coefficient of determination

  • Coefficient of determination:
    $$R^2 = \frac{SSR}{SST} = 1-\frac{SSE}{SST}$$
  • Adjusted coefficient of determination:
    $$R_{adj}^2 = 1-\frac{SSE/(n-p-1)}{SST/(n-1)}$$
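Both quantities follow directly from the sums of squares; a sketch with hypothetical data:

```python
import numpy as np

# Hypothetical data: compute R^2 and adjusted R^2 from SST and SSE.
rng = np.random.default_rng(0)
x1 = np.arange(1.0, 9.0)
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0, 8.0, 7.0])
n, p = 8, 2
y = 1.0 + 2.0 * x1 - 0.5 * x2 + rng.normal(0.0, 0.1, n)
X = np.column_stack([np.ones(n), x1, x2])
y_hat = X @ np.linalg.solve(X.T @ X, X.T @ y)

SST = np.sum((y - y.mean()) ** 2)
SSE = np.sum((y - y_hat) ** 2)

R2 = 1.0 - SSE / SST
# Adjusted R^2 penalizes for the number of explanatory variables,
# so it never exceeds R^2.
R2_adj = 1.0 - (SSE / (n - p - 1)) / (SST / (n - 1))
print(R2, R2_adj)
```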

Significance Test of the Regression

  • Significance test of the regression (F-test)
    • Hypotheses: $H_0 : \beta_1 = \dots = \beta_p=0$ vs. $H_1:$ not $H_0$
    • Test statistic: $F = \frac{MSR}{MSE}=\frac{SSR/p}{SSE/(n-(p+1))}\overset{H_0}{\sim}F(p,n-p-1)$
    • Observed value of the test statistic: $f$
    • Rejection region at significance level $\alpha$: $f\geq F_{\alpha}(p,n-p-1)$
    • p-value: $P(F\geq f)$
  • ANOVA table for the significance test of the regression

| Source | Sum of squares | df | Mean square | $f$ | p-value |
|---|---|---|---|---|---|
| Regression | $SSR$ | $p$ | $MSR=\frac{SSR}{p}$ | $f=\frac{MSR}{MSE}$ | $P(F\geq f)$ |
| Residual | $SSE$ | $n-(p+1)$ | $MSE=\frac{SSE}{n-p-1}$ | | |
| Total | $SST$ | $n-1$ | | | |
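The F-test can be sketched with NumPy and SciPy on hypothetical data; `scipy.stats.f.sf` gives the upper-tail probability $P(F \geq f)$:

```python
import numpy as np
from scipy import stats

# Hypothetical data: overall F-test of H0: beta_1 = beta_2 = 0.
rng = np.random.default_rng(0)
x1 = np.arange(1.0, 9.0)
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0, 8.0, 7.0])
n, p = 8, 2
y = 1.0 + 2.0 * x1 - 0.5 * x2 + rng.normal(0.0, 0.1, n)
X = np.column_stack([np.ones(n), x1, x2])
y_hat = X @ np.linalg.solve(X.T @ X, X.T @ y)

SSE = np.sum((y - y_hat) ** 2)
SSR = np.sum((y_hat - y.mean()) ** 2)
MSR = SSR / p
MSE = SSE / (n - p - 1)

f = MSR / MSE                           # observed F statistic
p_value = stats.f.sf(f, p, n - p - 1)   # P(F >= f), the p-value
print(f, p_value)
```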


  • Reduced model (RM) vs. Full model (FM)
    $\qquad FM : y_i = \beta_0+\beta_1x_{i1}+\dots+\beta_qx_{iq}+\beta_{q+1}x_{i,q+1}+\dots+\beta_p x_{ip}+\epsilon_i$
    $\qquad RM : y_i = \beta_0+\beta_1x_{i1}+\dots+\beta_qx_{iq}+\epsilon_i$
    • Hypotheses: $H_0:\beta_{q+1} =\dots= \beta_p=0$ vs. $H_1:$ not $H_0$
    • Test statistic:
      $$F = \frac{(SSR_{FM}-SSR_{RM})/(p-q)}{SSE_{FM}/(n-p-1)}\overset{H_0}{\sim}F(p-q,n-p-1)$$
      • Observed value of the test statistic: $f$
      • Rejection region at significance level $\alpha$: $f\geq F_{\alpha}(p-q,n-p-1)$
      • p-value: $P(F\geq f)$
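Since both models share the same $SST$, $SSR_{FM}-SSR_{RM}=SSE_{RM}-SSE_{FM}$, so the partial F statistic can be computed from the two residual sums of squares. A sketch on hypothetical data, testing $H_0:\beta_2=0$ (so $q=1$):

```python
import numpy as np

# Hypothetical data: p = 2; test whether x2 adds anything beyond x1.
rng = np.random.default_rng(0)
x1 = np.arange(1.0, 9.0)
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0, 8.0, 7.0])
n, p, q = 8, 2, 1
y = 1.0 + 2.0 * x1 - 0.5 * x2 + rng.normal(0.0, 0.1, n)

def sse(X, y):
    """Residual sum of squares of the least squares fit."""
    b = np.linalg.solve(X.T @ X, X.T @ y)
    r = y - X @ b
    return r @ r

X_full = np.column_stack([np.ones(n), x1, x2])  # FM: intercept, x1, x2
X_red = np.column_stack([np.ones(n), x1])       # RM: drops x2

SSE_FM, SSE_RM = sse(X_full, y), sse(X_red, y)
# SSR_FM - SSR_RM = SSE_RM - SSE_FM because SST is the same for both models.
f = ((SSE_RM - SSE_FM) / (p - q)) / (SSE_FM / (n - p - 1))
print(f)  # compare against F_alpha(p-q, n-p-1)
```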
  • General linear hypothesis
    $$H_0:H\beta=0\;\text{ vs. }\;H_1:\text{ not }H_0$$
    • $H$: an $r\times(p+1)$ matrix with $rank(H) = r$
    • $\beta = (\beta_0,\beta_1,\dots,\beta_p)^T$
    • Test statistic (RM: the model restricted by $H\beta=0$):
      $$F=\frac{(SSR_{FM}-SSR_{RM})/r}{SSE_{FM}/(n-p-1)}\overset{H_0}{\sim} F(r,n-p-1)$$

Inference on the Regression Coefficients

  • Inference on $\beta_1,\beta_2,\dots,\beta_p$
    • $\hat{\beta}=(X^TX)^{-1}X^T Y$
    • $\frac{\hat{\beta}_i-\beta_i}{s.e.(\hat{\beta}_i)}\sim t(n-p-1),\quad s.e.(\hat{\beta}_i)=\sqrt{d_{ii}}\,\hat{\sigma}$
    • $d_{ii}$: diagonal elements of $D^{-1},\;i=1,\dots,p$, where
      $D = \begin{pmatrix} s_{11} & \cdots & s_{1p} \\ \vdots & \ddots & \vdots \\ s_{p1} & \cdots & s_{pp} \end{pmatrix}$,
      $s_{ij} = \sum_{k=1}^{n}(x_{ki}-\overline x_i)(x_{kj}-\overline x_j)$
    • Hypothesis test: $H_0 : \beta_i=\beta_i^0$
    • Test statistic: $T = \frac{\hat{\beta}_i-\beta_i^0}{\sqrt{d_{ii}}\,\hat{\sigma}}\overset{H_0}{\sim} t(n-p-1)$; observed value: $t$

| Alternative hypothesis | p-value | Rejection region at level $\alpha$ |
|---|---|---|
| $H_1:\beta_i>\beta_i^0$ | $P(T\geq t)$ | $t\geq t_{\alpha}(n-p-1)$ |
| $H_1:\beta_i<\beta_i^0$ | $P(T\leq t)$ | $t\leq -t_{\alpha}(n-p-1)$ |
| $H_1:\beta_i\neq\beta_i^0$ | $P(|T|\geq |t|)$ | $|t|\geq t_{\alpha/2}(n-p-1)$ |
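A sketch of the coefficient t-tests on hypothetical data. The code uses the diagonal of $\hat{\sigma}^2(X^TX)^{-1}$ for the standard errors; for the slope coefficients these agree with the $d_{ii}$ taken from $D^{-1}$ in the notes.

```python
import numpy as np
from scipy import stats

# Hypothetical data: two-sided t-tests of H0: beta_i = 0 for each coefficient.
rng = np.random.default_rng(0)
x1 = np.arange(1.0, 9.0)
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0, 8.0, 7.0])
n, p = 8, 2
y = 1.0 + 2.0 * x1 - 0.5 * x2 + rng.normal(0.0, 0.1, n)
X = np.column_stack([np.ones(n), x1, x2])

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ beta_hat
sigma_hat = np.sqrt(e @ e / (n - p - 1))

# Var(beta_hat) = sigma^2 (X^T X)^{-1}; its diagonal gives the squared s.e.
se = sigma_hat * np.sqrt(np.diag(np.linalg.inv(X.T @ X)))

t = beta_hat / se                                # observed t statistics
p_values = 2 * stats.t.sf(np.abs(t), n - p - 1)  # P(|T| >= |t|)
print(t, p_values)
```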


  • Inference on the intercept $\beta_0$
    • $\frac{\hat{\beta}_0-\beta_0}{s.e.(\hat{\beta}_0)}\sim t(n-p-1)$,
    • $s.e.(\hat{\beta}_0) = \hat{\sigma}\bigg(\frac{1}{n}+\displaystyle\sum_{i=1}^{p}\displaystyle\sum_{j=1}^{p}\overline x_i d_{ij}\overline x_{j}\bigg)^{1/2}$

Prediction of the Mean Response

  • Prediction of the mean response when $X=X_0=(x_{01},\dots,x_{0p})^T$ is given

    • Mean response: $\mu_0 = E(Y|X_0) = \beta_0+\beta_1x_{01}+\dots+\beta_px_{0p}$
    • Estimator of the mean response: $\hat{\mu}_0=\hat{\beta}_0+\hat{\beta}_1x_{01}+\dots+\hat{\beta}_px_{0p}$
    • $\frac{\hat{\mu}_0-\mu_0}{s.e.(\hat{\mu}_0)}\sim t(n-p-1)$
    • $s.e.(\hat{\mu}_{0}) = \hat{\sigma}\bigg(\frac{1}{n}+\displaystyle\sum_{i=1}^{p}\displaystyle\sum_{j=1}^{p}(x_{0i}-\overline x_i)d_{ij}(x_{0j}-\overline x_{j})\bigg)^{1/2}$
    • $100(1-\alpha)$% confidence interval for $\mu_0$: $\hat{\mu}_{0}\pm t_{\alpha/2}(n-p-1)\,s.e.(\hat{\mu}_{0})$
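A sketch of the confidence interval for the mean response on hypothetical data, at an arbitrary illustrative point $X_0$. In augmented form $x_0=(1,x_{01},\dots,x_{0p})^T$, the quantity $x_0^T(X^TX)^{-1}x_0$ equals the $\frac{1}{n}+\sum\sum(\cdot)d_{ij}(\cdot)$ term above.

```python
import numpy as np
from scipy import stats

# Hypothetical data: 95% confidence interval for the mean response at X_0.
rng = np.random.default_rng(0)
x1 = np.arange(1.0, 9.0)
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0, 8.0, 7.0])
n, p = 8, 2
y = 1.0 + 2.0 * x1 - 0.5 * x2 + rng.normal(0.0, 0.1, n)
X = np.column_stack([np.ones(n), x1, x2])

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ beta_hat
sigma_hat = np.sqrt(e @ e / (n - p - 1))

# Illustrative new point x0 = (1, x_01, x_02), intercept entry included.
x0 = np.array([1.0, 4.5, 3.5])
mu0_hat = x0 @ beta_hat
se_mu0 = sigma_hat * np.sqrt(x0 @ np.linalg.inv(X.T @ X) @ x0)

t_crit = stats.t.ppf(0.975, n - p - 1)  # t_{alpha/2}(n-p-1), alpha = 0.05
ci = (mu0_hat - t_crit * se_mu0, mu0_hat + t_crit * se_mu0)
print(mu0_hat, ci)
```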

Prediction

  • Prediction of $y=y_0$ when $X=X_0$ is given
    • $y_0=\beta_0+\beta_1 x_{01}+\dots+\beta_p x_{0p}+\epsilon_0$
    • Predicted value: $\hat{y}_0=\hat{\beta}_0+\hat{\beta}_1x_{01}+\dots+\hat{\beta}_{p}x_{0p}$
    • $\frac{\hat{y}_0-y_0}{s.e.(\hat{y}_0)}\sim t(n-p-1)$
    • $s.e.(\hat{y}_0) = \hat{\sigma}\bigg(1+\frac{1}{n}+\displaystyle\sum_{i=1}^{p}\displaystyle\sum_{j=1}^{p}(x_{0i}-\overline x_i)d_{ij}(x_{0j}-\overline x_{j})\bigg)^{1/2}$
    • $100(1-\alpha)$% prediction interval for $y_0$: $\hat{y}_0\pm t_{\alpha/2}(n-p-1)\,s.e.(\hat{y}_{0})$
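A sketch of the prediction interval on hypothetical data, at an arbitrary illustrative point $X_0$. The extra "$1+$" in the standard error accounts for the new error $\epsilon_0$, so the prediction interval is always wider than the confidence interval for the mean response.

```python
import numpy as np
from scipy import stats

# Hypothetical data: 95% prediction interval for a new observation at X_0.
rng = np.random.default_rng(0)
x1 = np.arange(1.0, 9.0)
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0, 8.0, 7.0])
n, p = 8, 2
y = 1.0 + 2.0 * x1 - 0.5 * x2 + rng.normal(0.0, 0.1, n)
X = np.column_stack([np.ones(n), x1, x2])

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ beta_hat
sigma_hat = np.sqrt(e @ e / (n - p - 1))

x0 = np.array([1.0, 4.5, 3.5])  # illustrative (1, x_01, x_02)
y0_hat = x0 @ beta_hat

h0 = x0 @ np.linalg.inv(X.T @ X) @ x0    # = 1/n + sum-sum term of the notes
se_mean = sigma_hat * np.sqrt(h0)        # s.e. for the mean response
se_pred = sigma_hat * np.sqrt(1.0 + h0)  # the extra 1 accounts for eps_0

t_crit = stats.t.ppf(0.975, n - p - 1)
pi = (y0_hat - t_crit * se_pred, y0_hat + t_crit * se_pred)
print(y0_hat, pi)
```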