1. Populations and Parameters
A population is any large collection of objects or individuals, such as
Koreans, Japanese, flowers, or students, about which information is desired. A
parameter is any summary number, like an average or percentage, that describes
the entire population. However, in the majority of cases it is impossible to
know a population parameter exactly. What we can do is estimate the parameter.
For example, what is the average height of Koreans? What is the average weight
of Americans? Can you measure the height of every Korean or the weight of every
American?
2. Samples and Statistics
A sample is a representative group drawn from the population. A statistic is
any summary number, like an average or percentage, that describes the sample.
For example, $\hat{p}$ is a sample proportion.
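The difference between a parameter and a statistic can be sketched with a small simulation. The population below is hypothetical (simulated heights in cm, not real measurements), and the sample size of 100 is an arbitrary choice for illustration.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: 100,000 simulated heights in cm (not real data).
population = [random.gauss(170, 7) for _ in range(100_000)]
parameter = statistics.mean(population)   # population mean (a parameter)

# Draw a simple random sample of 100 individuals without replacement.
sample = random.sample(population, 100)
statistic = statistics.mean(sample)       # sample mean (a statistic)

print(f"parameter: {parameter:.2f}  statistic: {statistic:.2f}")
```

The statistic is computed from only 100 values, yet it usually lands close to the parameter, which is why a sample can be used to estimate a quantity we cannot measure directly.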
3. Mathematical Notation
It is useful to introduce some mathematical notation for numerical summaries.
For convenience, variables are commonly given one-letter names: x, y, z, ...
Since the number of letters is limited, we use subscripts to identify the
individual values in a set of data: $x_1, x_2, x_3, \ldots, x_n$.
Some numerical summaries involve summation, which is represented by the symbol
$\Sigma$ (sigma).
4. Central Tendency
A measure of central tendency describes the center of a data set or the
location of a 'typical value' of the data set. A few measures of central
tendency are commonly used; we will deal with the average (arithmetic mean),
median, and mode.
1. Mean
The most commonly used measure of centrality is the mean. If you have a data
set $x_1, x_2, x_3, \ldots, x_n$, you can compute the mean of the data set
using the following equation.
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i = \frac{x_1 + x_2 + \cdots + x_n}{n}$$
For example, the mean of 45, 50, 55, 60, and 65 is
$(45 + 50 + 55 + 60 + 65)/5 = 55$. A physical interpretation can be given to the
mean, which helps to explain its properties. Imagine a horizontal dot plot where
each dot is a point mass with the same weight and the axis itself has
negligible mass. The mean is the position on the axis where the axis would
balance.
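As a minimal sketch, the mean of the example data can be computed directly from its definition. The function name `mean` is only illustrative. The second print checks the balance-point idea: the deviations from the mean cancel out.

```python
def mean(values):
    """Arithmetic mean: the sum of the values divided by how many there are."""
    return sum(values) / len(values)

data = [45, 50, 55, 60, 65]
print(mean(data))  # 55.0

# Balance-point property: deviations from the mean sum to zero.
print(sum(x - mean(data) for x in data))  # 0.0
```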

2. Median
The median is defined as the score corresponding to the 50th percentile. It is
the middle score in the distribution when the scores are put in order of size.
If there is an even number of scores, the median is computed by averaging the
two middle scores. When the distribution of scores is symmetric, the mean and
the median are equal. So, the data set we already used to compute the mean (45,
50, 55, 60, and 65) gives the same value, 55, for the mean and the median.
However, for the data set 3, 4, 4, 4, 5, 6, 7, and 39, the mean is 9.0 and the
median is 4.5. The median stays at 4.5 even if the extreme value 39 is changed
to 390000000000000000, whereas the mean changes drastically.
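The robustness of the median to an extreme value can be checked with Python's standard `statistics` module. This is only a sketch using the two data sets from the text.

```python
import statistics

data = [3, 4, 4, 4, 5, 6, 7, 39]
print(f"mean = {statistics.mean(data):.1f}")      # mean = 9.0
print(f"median = {statistics.median(data):.1f}")  # median = 4.5

# Replace the extreme value 39 with a far larger one.
extreme = [3, 4, 4, 4, 5, 6, 7, 390000000000000000]
print(f"median = {statistics.median(extreme):.1f}")  # median = 4.5 (unchanged)
print(f"mean = {statistics.mean(extreme):.3g}")      # enormous: dominated by one value
```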
3. Mode
The mode is the most
frequently obtained score in the data set. The mode is at best a rough measure
and is generally less useful than the mean or median.
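A short sketch of finding the mode by counting frequencies follows. The helper name `modes` is hypothetical. It returns every value tied for the highest count, which also shows why the mode is only a rough measure: when no value repeats, every value is a mode.

```python
from collections import Counter

def modes(values):
    """Return every value that occurs with the highest frequency."""
    counts = Counter(values)
    top = max(counts.values())
    return [v for v, c in counts.items() if c == top]

print(modes([3, 4, 4, 4, 5, 6, 7, 39]))  # [4]
print(modes([45, 50, 55, 60, 65]))       # [45, 50, 55, 60, 65] (all tied)
```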
5. Variability
In addition to general location, there is another important attribute of a
distribution of scores, which is called variability. Variability refers to how
spread out or scattered the scores in a distribution are. The minimum possible
variability is zero, which occurs only if all of the scores are exactly the
same. We will talk about a few variability measures: range, variance, and
standard deviation.
1. Range
The range, R, of a data set is the difference between the largest and the
smallest values in the set; that is, $R = x_{\max} - x_{\min}$. The range is
very easy to compute.
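As a quick sketch, the range of the two example data sets used earlier can be computed as follows (the function name `data_range` is illustrative):

```python
def data_range(values):
    """Range: largest value minus smallest value."""
    return max(values) - min(values)

print(data_range([45, 50, 55, 60, 65]))       # 20
print(data_range([3, 4, 4, 4, 5, 6, 7, 39]))  # 36
```

Because the range uses only the two most extreme values, a single unusual score such as 39 determines it almost entirely.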
2. Variance
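Another measure of the spread of values in a data set is the variance, which is
based on the squared deviations of the values from the mean. The variance of a
data set $x_1, x_2, \ldots, x_n$ with mean $\bar{x}$ can be computed as
$$\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2$$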

This is a basic measure of the variability of any set of data.
However, when the data of a sample are used to estimate the variance of the
population from which the sample was drawn, the population variance estimate
($s^2$) is computed as
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$$

There is a reason why we use n-1 rather than n. Whenever you divide by n, the
population variance is underestimated on average, because the deviations are
measured from the sample mean rather than from the true population mean.
Statisticians have proven that dividing by n-1 removes this bias, so the
estimate is on average equal to the actual population variance.
Although the variance does reflect the spread of values, it is not an easily
interpreted quantity and is not well suited as a descriptive numerical summary
of the data. The problem is that the variance is defined in terms of squared
deviations, and therefore its units are the square of the units of the sample
values. For example, if you have a data set measured in kg and compute the
variance of the data, you will get the variance in square kg.
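The two divisors can be compared in a small sketch. The function names are illustrative, and the simulation at the end uses a hypothetical population (normal with true variance 1.0) purely to show how the two estimates behave on average.

```python
import random

def population_variance(values):
    """Mean squared deviation from the mean (divides by n)."""
    m = sum(values) / len(values)
    return sum((x - m) ** 2 for x in values) / len(values)

def sample_variance(values):
    """Estimate of the population variance from a sample (divides by n - 1)."""
    m = sum(values) / len(values)
    return sum((x - m) ** 2 for x in values) / (len(values) - 1)

data = [45, 50, 55, 60, 65]
print(population_variance(data))  # 50.0
print(sample_variance(data))      # 62.5

# Simulation (hypothetical setup): many samples of size 5 drawn from a
# population whose true variance is 1.0.
random.seed(0)
trials = 20_000
avg_n = 0.0
avg_n_minus_1 = 0.0
for _ in range(trials):
    sample = [random.gauss(0, 1) for _ in range(5)]
    avg_n += population_variance(sample) / trials
    avg_n_minus_1 += sample_variance(sample) / trials

print(f"divide by n:   {avg_n:.2f}")          # close to 0.8, too small
print(f"divide by n-1: {avg_n_minus_1:.2f}")  # close to 1.0
```

With samples of size 5, dividing by n gives on average about (n-1)/n = 0.8 of the true variance, while dividing by n-1 recovers it.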
3. Standard deviation
Since the unit of variance is not convenient, we simply take the square root of
the variance, which puts the measure in the same unit as the raw data. This
variability measure is called the standard deviation. Therefore, the standard
deviation (SD) of data, shown here with the n-1 divisor used for sample data,
can be computed as
$$SD = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$$
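A minimal sketch of the standard deviation, assuming the n-1 divisor for sample data, could look like this. The function name is illustrative, and the example treats the five values as weights in kg only to make the units concrete.

```python
import math

def standard_deviation(values):
    """Square root of the sample variance (n - 1 divisor)."""
    m = sum(values) / len(values)
    variance = sum((x - m) ** 2 for x in values) / (len(values) - 1)
    return math.sqrt(variance)

data = [45, 50, 55, 60, 65]  # treated as kg for illustration
print(f"{standard_deviation(data):.2f} kg")  # 7.91 kg (the variance is 62.5 square kg)
```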

6. Confidence Intervals
1. General form of confidence interval
Although we want to estimate the actual population mean $\mu$, what we can
compute from data is $\bar{x}$, the sample mean. With a confidence interval, we
give a range within which we can be confident that the actual population mean
falls, such as $L < \mu < U$.
This range of values is called a confidence interval. The general form of
most confidence intervals is Sample estimate ± Margin of error. Therefore, the
lower limit L is the estimate - margin of error and the upper limit U is the
estimate + margin of error. We are confident that the value of the population
parameter is somewhere between L and U.
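2. (1-α)100% t-interval for population mean

Formula in words: Sample mean ± (t-multiplier × standard error)
Formula in notation: $\bar{x} \pm t_{\alpha/2,\,n-1}\,\dfrac{s}{\sqrt{n}}$
The t-multiplier $t_{\alpha/2,\,n-1}$ depends on the α level and the degrees of
freedom, which is the sample size minus 1 (n-1).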
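The t-interval can be sketched as follows. The function name `t_interval` is illustrative, and the t-multiplier is passed in as a number looked up from a t-table rather than computed: 2.776 is the standard table value for 95% confidence with 4 degrees of freedom.

```python
import math
import statistics

def t_interval(sample, t_multiplier):
    """Return (L, U) = sample mean -/+ t-multiplier * standard error."""
    n = len(sample)
    sample_mean = statistics.mean(sample)
    standard_error = statistics.stdev(sample) / math.sqrt(n)  # stdev uses n - 1
    margin = t_multiplier * standard_error
    return sample_mean - margin, sample_mean + margin

data = [45, 50, 55, 60, 65]
# n = 5, so the degrees of freedom are 4; table value for 95% confidence: 2.776
lower, upper = t_interval(data, 2.776)
print(f"95% CI: {lower:.2f} < mu < {upper:.2f}")  # 95% CI: 45.19 < mu < 64.81
```

Taking the multiplier from a table keeps the sketch within the standard library. A statistics package could compute it instead.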
3. Determining t-multiplier
Typical t-multipliers in science
| Confidence Coefficient (1-α) | Confidence Level (1-α) × 100% | 1-α/2 |
| 0.90 | 90% | 0.950 |
| 0.95 | 95% | 0.975 |
| 0.99 | 99% | 0.995 |
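The last column of the table follows from the confidence coefficient by simple arithmetic, sketched below. The function name is hypothetical, and the reading of the third column as 1-α/2 (the cumulative probability used to look up the t-multiplier) is an interpretation of the values shown.

```python
def lookup_probability(confidence_coefficient):
    """Cumulative probability 1 - alpha/2 used to look up a two-sided t-multiplier."""
    alpha = 1 - confidence_coefficient
    return 1 - alpha / 2

for coefficient in (0.90, 0.95, 0.99):
    print(f"{coefficient:.2f} -> {lookup_probability(coefficient):.3f}")
# 0.90 -> 0.950
# 0.95 -> 0.975
# 0.99 -> 0.995
```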

7. Hypothesis Testing
1. General idea
Once we make an
initial assumption, we collect data and decide whether we will reject or not
reject the initial assumption based on the available evidence (data).

2. Making the decision
Given the initial assumption, the observed data are either 'likely' or
'unlikely'. When they are 'likely', we do not reject the initial assumption.
When they are 'unlikely', we reject the initial assumption. In other words, if
the data are unlikely, then either our initial assumption is correct and we
experienced a very unusual event, or our initial assumption is incorrect. A
hypothesis can be a null hypothesis ($H_0$), which plays the role of the
initial assumption, or an alternative hypothesis ($H_A$).
For example, $H_0$: the heights of Americans are the same as the heights of
Koreans. $H_A$: Americans are taller than Koreans.
3. Important point
1) Even if we reject the null hypothesis, we do not prove that the alternative hypothesis is true.
2) Even if we do not reject the null hypothesis, we do not prove that the alternative hypothesis is not true.
3) We just say whether we have enough evidence to treat the hypothesis one way or the other.
4) In statistics, there is always a possibility of making a wrong decision (error).
4. Errors in hypothesis testing
1) Type I Error: The null hypothesis is rejected when it is true.
2) Type II Error: The null hypothesis is not rejected when it is false.
Choosing a smaller α decreases the probability of a Type I error, and
increasing the sample size decreases the probability of a Type II error.
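Both error types can be illustrated with a simulation. To stay within the standard library, this sketch swaps in a two-tail z-test with a known standard deviation instead of a t-test. All numbers (null mean 3, true mean 3.5, standard deviation 1, sample sizes 10 and 40) are hypothetical.

```python
import math
import random
from statistics import NormalDist

random.seed(1)
alpha = 0.05
mu0, sigma = 3.0, 1.0                         # H0: mu = 3, known sigma (assumed)
z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # two-tail critical value, about 1.96

def rejection_rate(true_mu, n, trials=4000):
    """Fraction of simulated samples for which H0 is rejected."""
    rejections = 0
    for _ in range(trials):
        sample_mean = sum(random.gauss(true_mu, sigma) for _ in range(n)) / n
        z = (sample_mean - mu0) / (sigma / math.sqrt(n))
        if abs(z) > z_crit:
            rejections += 1
    return rejections / trials

# H0 is true: every rejection is a Type I error, so the rate is near alpha.
type_1 = rejection_rate(true_mu=3.0, n=20)

# H0 is false (true mean 3.5): every non-rejection is a Type II error.
type_2_small_n = 1 - rejection_rate(true_mu=3.5, n=10)
type_2_large_n = 1 - rejection_rate(true_mu=3.5, n=40)

print(f"Type I rate: {type_1:.3f}")
print(f"Type II rate, n=10: {type_2_small_n:.3f}  n=40: {type_2_large_n:.3f}")
```

The Type I rate tracks the chosen α, and the Type II rate drops as the sample size grows.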
5. Possible hypotheses about mean μ
In many cases in scientific research, we would like to compare two means. For
example, after we train our volleyball players with a new training strategy for
2 weeks, we would like to compare their performance with that of other
volleyball players who did not experience the new training.
We can use a right-tail, left-tail, or two-tail test.
| Type | Null Hypothesis | Alternative Hypothesis |
| Right-tail | $H_0$: μ = 3 | $H_A$: μ > 3 |
| Left-tail | $H_0$: μ = 3 | $H_A$: μ < 3 |
| Two-tail | $H_0$: μ = 3 | $H_A$: μ ≠ 3 |
6. Critical value approach
1) Assume that the null hypothesis is true
2) Calculate the value of the test statistic using sample data
3) Set the significance level, α, which is the probability of making a Type I error, to be small, e.g. 0.05 or 0.01
4) Compare the value of the test statistic to the known distribution of the test statistic (use a t-table with the sample size and α to find the critical value)
5) If the test statistic is more extreme than the critical value, reject the null hypothesis; doing so allows for an α chance of error
6) Otherwise, do not reject the null hypothesis
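The critical value steps can be sketched for a one-sample t-test of $H_0$: μ = 3. The sample values are hypothetical, the function names are illustrative, and the critical values are standard t-table entries for 4 degrees of freedom at α = 0.05 (2.132 one-tail, 2.776 two-tail).

```python
import math
import statistics

def t_statistic(sample, mu0):
    """One-sample t statistic: (sample mean - mu0) / standard error."""
    n = len(sample)
    standard_error = statistics.stdev(sample) / math.sqrt(n)
    return (statistics.mean(sample) - mu0) / standard_error

def reject_null(t, critical_value, tail):
    """Decide whether t is more extreme than the (positive) critical value."""
    if tail == "right":
        return t > critical_value
    if tail == "left":
        return t < -critical_value
    if tail == "two":
        return abs(t) > critical_value
    raise ValueError("tail must be 'right', 'left', or 'two'")

sample = [3.1, 3.6, 3.4, 3.9, 3.5]   # hypothetical measurements
t = t_statistic(sample, mu0=3)
print(f"t = {t:.2f}")                 # t = 3.83

print(reject_null(t, 2.132, "right"))  # True:  evidence that mu > 3
print(reject_null(t, 2.132, "left"))   # False: no evidence that mu < 3
print(reject_null(t, 2.776, "two"))    # True:  evidence that mu differs from 3
```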



7. P-value approach
1) Assume that the null hypothesis is true
2) Calculate the value of the test statistic using sample data
3) Using the known distribution of the test statistic, calculate the P-value
4) P-value = If the null hypothesis is true, what is the probability that we would observe a test statistic more extreme than the one we obtained?
5) Set the significance level, α, which is the probability of making a Type I error, to be small, e.g. 0.05 or 0.01
6) If the P-value is smaller than α (regardless of the type of test), reject the null hypothesis. Otherwise, do not reject the null hypothesis.
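The P-value steps can be sketched without a statistics package by numerically integrating the Student's t density with Simpson's rule. This is an illustrative approximation, not how a statistics library computes it. The function names, the test statistic 3.83, and the 4 degrees of freedom are hypothetical example values.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def t_upper_tail(t, df, steps=2000):
    """P(T > t), using Simpson's rule on [0, |t|] (steps must be even)."""
    a = abs(t)
    h = a / steps
    total = t_pdf(0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3          # integral of the density from 0 to |t|
    upper = 0.5 - area            # P(T > |t|), by symmetry around 0
    return upper if t >= 0 else 1 - upper

def p_value(t, df, tail):
    """P-value for a right-tail, left-tail, or two-tail t-test."""
    if tail == "right":
        return t_upper_tail(t, df)
    if tail == "left":
        return 1 - t_upper_tail(t, df)
    if tail == "two":
        return 2 * t_upper_tail(abs(t), df)
    raise ValueError("tail must be 'right', 'left', or 'two'")

alpha = 0.05
p = p_value(3.83, df=4, tail="two")   # hypothetical test statistic
print(f"P-value = {p:.4f}")
print("reject H0" if p < alpha else "do not reject H0")  # reject H0
```

As a sanity check, the two-tail P-value at the table critical value 2.776 with 4 degrees of freedom comes out at about 0.05, so the two approaches lead to the same decision.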


