January 1, 1970

2086 Lecture 3 Estimation And Maximum Likelihood



$\hat{}$ (hat): adding a hat to a variable denotes an estimate of that variable, e.g. $\hat{\theta}$ is an estimate of $\theta$.

Sum of squared errors (SSE):

\mathrm{SSE}(\mu) = \sum_{i=1}^{n} (y_i - \mu)^2

This is the sum of the squared errors between each value $y_i$ and the mean $\mu$. We can use it to find the estimated mean $\hat{\mu}$.

The smaller a squared error is, the closer the point $y_i$ is to $\mu$, so the SSE measures how close $\mu$ is to the samples overall. Since the mean should be a point that is relatively close to all sample points, the best $\mu$ is the one that makes the SSE smallest.

\hat{\mu} = \arg \min_{\mu} \left\{ \sum_{i=1}^{n} (y_i - \mu)^2 \right\}.

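Hence this is the equation for $\hat{\mu}$. Setting the derivative of the SSE with respect to $\mu$ to zero shows that the minimizer is the sample mean, $\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} y_i$. This estimated mean is also a good guess for the mean of the real data: because the samples are drawn at random from the real data, the sample mean should be close to the true mean, especially as the sample size grows.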
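A minimal sketch of this minimization, assuming NumPy is available. The sample values and the helper name `sse` are made up for illustration. It evaluates the SSE on a grid of candidate values of $\mu$ and compares the arg min with the sample mean.

```python
import numpy as np

def sse(mu, y):
    """Sum of squared errors between each sample y_i and a candidate mean mu."""
    y = np.asarray(y, dtype=float)
    return float(np.sum((y - mu) ** 2))

# Hypothetical sample values, chosen only for illustration.
y = np.array([2.0, 3.5, 4.0, 5.5, 7.0])

# Evaluate the SSE on a grid of candidate values for mu and take the arg min.
candidates = np.linspace(y.min(), y.max(), 501)
sse_values = np.array([sse(mu, y) for mu in candidates])
mu_hat_grid = float(candidates[np.argmin(sse_values)])

# The sample mean, which should agree with the grid search up to the grid spacing.
mu_hat_mean = float(y.mean())

print("arg min on grid:", mu_hat_grid)
print("sample mean:    ", mu_hat_mean)
```

The grid search is only there to make the arg min visible; in practice the sample mean is computed directly.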

Maximum Likelihood Estimation (MLE)

\hat{\theta}_{\mathrm{ML}}(\mathbf{y}) = \arg \max_{\theta} \left\{ p(\mathbf{y} \mid \theta) \right\}.

To find the best parameter estimates for a probability distribution, suppose we have a sample $\mathbf{y} = \{y_1, y_2, \ldots, y_n\}$.

$p(\mathbf{y} \mid \theta)$ is the probability (or density) of the sample $\mathbf{y}$ under the distribution with parameters $\theta$. Viewed as a function of $\theta$, it is called the likelihood.

MLE chooses the parameters $\theta$ under which the observed sample $\mathbf{y}$ has the highest probability.

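For example, suppose we flip a coin 5 times and get the result $\mathbf{y} = \{1,1,1,0,0\}$.

Suppose the candidate values are $\theta \in \{0.1, 0.2, 0.6, 1\}$, where $\theta$ is the probability of observing a 1.

Using MLE, our best estimate is 0.6, because under the Bernoulli distribution 0.6 is the candidate that gives the highest probability of observing $\mathbf{y}$. ($\theta = 1$ gives probability 0, since zeros were observed.)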
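The coin example can be checked numerically. This is a small sketch in plain Python; the function name `bernoulli_likelihood` is an illustrative choice, not something from the lecture.

```python
# Observed coin flips and candidate parameter values from the example.
y = [1, 1, 1, 0, 0]
candidates = [0.1, 0.2, 0.6, 1.0]

def bernoulli_likelihood(theta, y):
    """Probability of the iid sample y when P(y_i = 1) = theta."""
    likelihood = 1.0
    for yi in y:
        likelihood *= theta if yi == 1 else (1.0 - theta)
    return likelihood

# Likelihood of the sample under each candidate, then the arg max.
likelihoods = {theta: bernoulli_likelihood(theta, y) for theta in candidates}
theta_hat = max(likelihoods, key=likelihoods.get)

for theta, value in likelihoods.items():
    print(f"theta = {theta}: likelihood = {value:.5f}")
print("best candidate:", theta_hat)
```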

Since the $y_i$ are assumed to be iid, $p(y_1, y_2, \ldots, y_n \mid \theta) = p(y_1 \mid \theta) \cdots p(y_n \mid \theta)$, which can be written as:

p(\mathbf{y} \mid \theta) = \prod_{i=1}^{n} p(y_i \mid \theta)

Instead of maximizing the likelihood, we can equivalently minimize the negative log-likelihood:

\hat{\theta}_{\mathrm{ML}}(\mathbf{y}) = \arg \min_{\theta} \left\{ L(\mathbf{y} \mid \theta) \right\}.

Where:

L(\mathbf{y} \mid \theta) = -\log p(\mathbf{y} \mid \theta)

Using $\log(ab) = \log a + \log b$ together with the product form of the likelihood, this becomes:

L(\mathbf{y} \mid \theta) = -\sum_{i=1}^{n} \log p(y_i \mid \theta)
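As a rough illustration, the sum form can be applied to the coin example from earlier. The sketch below uses only the standard library; the helper name `neg_log_likelihood` and the choice to return infinity for an impossible observation are assumptions made for this example.

```python
import math

# Coin-flip sample and candidate parameters from the earlier example.
y = [1, 1, 1, 0, 0]
candidates = [0.1, 0.2, 0.6, 1.0]

def neg_log_likelihood(theta, y):
    """Negative log-likelihood of an iid Bernoulli sample: -sum_i log p(y_i | theta)."""
    total = 0.0
    for yi in y:
        p = theta if yi == 1 else 1.0 - theta
        if p == 0.0:
            return math.inf  # an impossible observation: -log(0) is +infinity
        total -= math.log(p)
    return total

# Minimizing the negative log-likelihood picks the same candidate as
# maximizing the likelihood.
theta_hat = min(candidates, key=lambda theta: neg_log_likelihood(theta, y))
print("best candidate:", theta_hat)
```

Working with sums of logs rather than products of probabilities also avoids numerical underflow when $n$ is large.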

A local maximum or minimum of a differentiable function is a turning point, where the first derivative is 0.

So we can find $\hat{\theta}$ by solving:

0 = \frac{\partial L(\mathbf{y} \mid \theta)}{\partial \theta}
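As a worked instance of this step, consider the coin example, assuming the Bernoulli model $p(y_i \mid \theta) = \theta^{y_i}(1-\theta)^{1-y_i}$ and writing $k = \sum_{i=1}^{n} y_i$ for the number of ones (the symbol $k$ is introduced here only for illustration):

```latex
\begin{aligned}
L(\mathbf{y} \mid \theta) &= -k \log \theta - (n - k) \log(1 - \theta) \\
\frac{\partial L(\mathbf{y} \mid \theta)}{\partial \theta} &= -\frac{k}{\theta} + \frac{n - k}{1 - \theta} = 0 \\
k(1 - \theta) &= (n - k)\,\theta \\
\hat{\theta}_{\mathrm{ML}} &= \frac{k}{n}
\end{aligned}
```

For $\mathbf{y} = \{1,1,1,0,0\}$ this gives $\hat{\theta}_{\mathrm{ML}} = 3/5 = 0.6$, matching the value picked from the candidates in the coin example.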

Estimator

Point estimation: an estimator such as MLE gives us a single specific value for the parameter, denoted $\hat{\theta}_{\mathrm{ML}}$.

Sampling distribution

When we apply the ML estimator to different samples, we get a different sample mean each time. These are all close to the actual mean, but a single estimate does not tell us how accurate it is.

To quantify how close our sample mean is to the true mean, we first assume the population is normally distributed, $Y \sim N(\mu, \sigma^2)$.

We also have samples $Y_1, Y_2, \ldots, Y_n$ drawn independently from this population.

Here are two properties of the normal distribution (the first requires $Y_1$ and $Y_2$ to be independent):

\text{if } Y_1 \sim N(\mu_1, \sigma_1^2) \text{ and } Y_2 \sim N(\mu_2, \sigma_2^2) \text{ then } Y_1 + Y_2 \sim N(\mu_1 + \mu_2, \sigma_1^2 + \sigma_2^2);

\text{and if } Y \sim N(\mu, \sigma^2), \text{ then } \frac{Y}{c} \sim N\left(\frac{\mu}{c}, \frac{\sigma^2}{c^2}\right).

This means that if we add the $n$ samples from the population together, the sum is also normally distributed: $\sum_{i=1}^{n} Y_i \sim N(n\mu, n\sigma^2)$

Dividing this by $n$ gives the distribution of the sample mean:

\hat{\mu}_{\mathrm{ML}}(Y) \sim N\left(\mu, \frac{\sigma^2}{n}\right)

This indicates that the larger the sample size $n$, the smaller the variance, which means there is less randomness in $\hat{\mu}_{\mathrm{ML}}$.
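A quick simulation can make the $\sigma^2/n$ behavior concrete. This is a sketch assuming NumPy; the population parameters, the sample sizes and the number of repeats are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population parameters, chosen only for illustration.
mu, sigma = 5.0, 2.0
n_repeats = 5000  # number of independent samples drawn for each n

empirical_var = {}
for n in (5, 20, 80):
    # Each row is one sample of size n; the row mean is one realisation of mu_hat.
    samples = rng.normal(loc=mu, scale=sigma, size=(n_repeats, n))
    mu_hats = samples.mean(axis=1)
    empirical_var[n] = float(mu_hats.var())
    print(f"n = {n:3d}: empirical Var = {empirical_var[n]:.4f}, "
          f"sigma^2 / n = {sigma**2 / n:.4f}")
```

Each fourfold increase in $n$ should cut the variance of the sample mean by roughly a factor of four.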

Analyzing the Estimator

To find out how good our estimator is, we have 4 different metrics: bias, variance, mean squared error and consistency.

Bias

This is the equation for the bias of an estimator:

b_{\theta}(\hat{\theta}) = \mathbb{E}\!\left[\hat{\theta}(Y)\right] - \theta

Bias = the expectation of the estimated parameter minus the actual parameter.

In other words, it measures the difference between the average estimate and the actual parameter.

If $b > 0$, the expected estimate is greater than the actual value, so the estimator over-estimates on average.

If $b < 0$, the expected estimate is smaller than the actual value, so the estimator under-estimates on average.

If $b = 0$, they match and the estimator is unbiased.

Variance

\mathrm{Var}_{\theta}(\hat{\theta}) = \mathbb{E}\!\left[\left(\hat{\theta}(Y) - \mathbb{E}\!\left[\hat{\theta}(Y)\right]\right)^2\right] = \mathbb{V}\!\left[\hat{\theta}(Y)\right]

This is the formula for the variance of an estimator. It is the same as the variance formula for a random variable, applied to the estimate $\hat{\theta}(Y)$.

Mean Squared Error (MSE)

\mathrm{MSE}_{\theta}(\hat{\theta}) = \mathbb{E}\!\left[\left(\hat{\theta}(Y) - \theta\right)^2\right]

MSE measures how well our estimator does overall. $(\hat{\theta}(Y)-\theta)^2$ is the squared error, the squared distance between the estimated parameter and the actual parameter. Taking the expectation gives the expected, or average, squared error.

We can also write this formula using bias and variance:

\mathrm{MSE}_{\theta}(\hat{\theta}) = b_{\theta}^2(\hat{\theta}) + \mathrm{Var}_{\theta}(\hat{\theta})
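The three quantities can be approximated by simulation. The sketch below assumes NumPy and compares the sample mean with a deliberately biased "shrunk" estimator, $0.8 \times$ the sample mean. That second estimator, the helper name `summarize` and all parameter values are hypothetical and used only to produce a nonzero bias.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical setup: normal population, fixed sample size, many repeated samples.
mu, sigma, n, n_repeats = 5.0, 2.0, 10, 20000
samples = rng.normal(loc=mu, scale=sigma, size=(n_repeats, n))

def summarize(estimates, true_value):
    """Empirical bias, variance and MSE of an array of estimates."""
    bias = float(estimates.mean() - true_value)
    variance = float(estimates.var())
    mse = float(np.mean((estimates - true_value) ** 2))
    return bias, variance, mse

# Estimator 1: the sample mean. Estimator 2: an illustrative shrunk version of it.
mean_estimates = samples.mean(axis=1)
shrunk_estimates = 0.8 * mean_estimates

results = {
    "sample mean": summarize(mean_estimates, mu),
    "0.8 * sample mean": summarize(shrunk_estimates, mu),
}
for name, (bias, variance, mse) in results.items():
    print(f"{name}: bias = {bias:.3f}, var = {variance:.3f}, "
          f"mse = {mse:.3f}, bias^2 + var = {bias**2 + variance:.3f}")
```

In this setup the shrunk estimator has lower variance than the sample mean but a large bias, so its MSE ends up higher. The decomposition holds for both.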

Consistency

Consistency tells us whether an estimator converges to the true parameter as the sample size $n$ grows.

\hat{\theta}_n \xrightarrow{p} \theta

It states that as $n \to \infty$, the estimate $\hat{\theta}_n$ converges in probability to $\theta$. A sufficient condition is that both the bias and the variance go to 0 as $n \to \infty$: the estimator then has no systematic error in the limit, and its random variation vanishes.
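Convergence in probability can be illustrated by estimating $P(|\hat{\mu}_n - \mu| > \varepsilon)$ for growing $n$. The sketch below assumes NumPy and a normal population; the tolerance $\varepsilon$ and all other numbers are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical population and tolerance, chosen only for illustration.
mu, sigma, epsilon = 5.0, 2.0, 0.5
n_repeats = 2000

miss_probability = {}
for n in (10, 100, 1000):
    samples = rng.normal(loc=mu, scale=sigma, size=(n_repeats, n))
    mu_hats = samples.mean(axis=1)
    # Fraction of repeated samples whose estimate misses mu by more than epsilon.
    miss_probability[n] = float(np.mean(np.abs(mu_hats - mu) > epsilon))
    print(f"n = {n:4d}: P(|mu_hat - mu| > {epsilon}) ~ {miss_probability[n]:.4f}")
```

For a consistent estimator such as the sample mean, this probability should shrink toward 0 as $n$ grows, for any fixed $\varepsilon > 0$.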
