Fnd. Stat

08/06/04


 

1. Populations and Parameters

          A population is any large collection of objects or individuals, such as Koreans, Japanese, flowers, or students, about which information is desired. A parameter is any summary number, like an average or percentage, that describes the entire population. In the majority of cases, however, it is impossible to know a population parameter exactly; all we can do is estimate it. For example, what is the average height of Koreans? What is the average weight of Americans? Can you measure the height of every Korean or the weight of every American?

2. Samples and Statistics

          A sample is a representative group drawn from the population. A statistic is any summary number, like an average or percentage, that describes the sample. For example, a sample proportion $\hat{p}$ is a statistic.

3. Mathematical Notation

         It is useful to introduce some mathematical notation for numerical summaries. For convenience, variables are commonly given one-letter names: x, y, z,... Since the number of letters is limited, we use subscripts to identify the individual values in a set of data: x1, x2, x3,..., xn. Some numerical summaries involve summation, which is represented by the symbol $\Sigma$ (sigma).

4. Central Tendency

          A measure of central tendency describes the center of a data set, that is, the location of a 'typical value' of the data set. Several measures of central tendency are in common use; we will deal with the average (arithmetic mean), the median, and the mode.

1. Mean

         The most commonly used measure of centrality is the mean. If you have a data set x1, x2, x3,..., xn, you can compute the mean of the data set using the following equation.

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

For example, the mean of 45, 50, 55, 60, and 65 is (45 + 50 + 55 + 60 + 65)/5 = 55. A physical interpretation can be given to the mean, which helps to explain its properties. Imagine a horizontal dot plot where all dots are equally sized, each dot is a point mass with the same weight, and the axis itself is a beam of negligible mass. The mean is the position on the axis where the beam balances.
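The balance-point picture can be checked numerically: deviations to the left of the mean cancel those to the right. The following is a minimal sketch in Python; the helper name `mean` is chosen here for illustration and is not part of the original page.

```python
def mean(values):
    """Arithmetic mean: the sum of the values divided by how many there are."""
    return sum(values) / len(values)

data = [45, 50, 55, 60, 65]
x_bar = mean(data)

# Signed distance of each point from the mean (the "lever arms" on the beam).
deviations = [x - x_bar for x in data]

print(x_bar)            # 55.0
print(sum(deviations))  # 0.0, so the beam balances at the mean
```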

2. Median

         The median is defined as the score corresponding to the 50th percentile. It is the middle score in the distribution when the scores are put in order of size. If there is an even number of scores, the median is computed by averaging the two middle scores. When the distribution of scores is symmetric, the mean and the median will be equal. So, the data set we already used to compute the mean (45, 50, 55, 60, and 65) gives the same value for the mean and the median, which is 55. However, for the data set 3, 4, 4, 4, 5, 6, 7, and 39, the mean is 9.0 and the median is 4.5. The median remains 4.5 even if the extreme value 39 is changed to 390000000000000000, whereas the mean changes drastically.
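A short Python sketch of this comparison, using the standard-library `statistics.median` and the data set from the text; the variable names are illustrative only.

```python
import statistics

scores = [3, 4, 4, 4, 5, 6, 7, 39]
with_huge_outlier = [3, 4, 4, 4, 5, 6, 7, 390000000000000000]

mean_scores = sum(scores) / len(scores)
mean_outlier = sum(with_huge_outlier) / len(with_huge_outlier)

print(mean_scores)                           # 9.0
print(statistics.median(scores))             # 4.5
print(statistics.median(with_huge_outlier))  # 4.5, unchanged by the outlier
print(mean_outlier > 1e16)                   # True: the mean is dragged far away
```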

3. Mode

        The mode is the most frequently obtained score in the data set. The mode is at best a rough measure and is generally less useful than the mean or median.
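One reason the mode is only a rough measure is that a data set can have several modes. The sketch below counts frequencies with `collections.Counter`; the function name `modes` is a hypothetical helper, not something defined on the original page.

```python
from collections import Counter

def modes(values):
    """Return every value that occurs with the highest frequency."""
    counts = Counter(values)
    highest = max(counts.values())
    return [value for value, count in counts.items() if count == highest]

print(modes([3, 4, 4, 4, 5, 6, 7, 39]))  # [4]
print(modes([1, 1, 2, 2, 3]))            # [1, 2], two modes in one data set
```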

5. Variability

          In addition to general location, there is another important attribute of a distribution of scores, called variability. Variability refers to how spread out or scattered the scores in a distribution are. The minimum possible variability is zero, which occurs only if all of the scores are exactly the same. We will talk about a few variability measures: range, variance, and standard deviation.

1. Range

         The range, R, of a data set is the difference between the largest and the smallest values in the set; that is: R = Xmax-Xmin. The range is very easy to compute.

2. Variance

         Another measure of the spread of values in a data set is the variance, which is based on squared deviations from the mean. The variance of a data set x1, x2, x3,..., xn can be computed as

$$\sigma^2 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2$$

This is a basic measure of the variability of any set of data. However, when the data of a sample are used to estimate the variance of the population from which the sample was drawn, the population variance estimate ($s^2$) is computed as

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$$

There is a reason why we use n-1 rather than n. When you divide by n, the variance of the population tends to be underestimated, because the deviations are measured from the sample mean rather than from the true population mean. Statisticians have proven that dividing by n-1 removes this bias, so that on average the estimate equals the population variance. Although the variance does reflect the spread of values, it is not an easily interpreted quantity and is awkward as a descriptive numerical summary of the data. The problem is that the variance is defined in terms of squared deviations, and therefore its units are the square of the units of the sample values. For example, if you have a data set measured in kg and compute its variance, you will get the variance in square kg.
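The effect of dividing by n versus n-1 can be checked exactly on a tiny example. The sketch below assumes a made-up population of four values and enumerates every possible sample of size 2 drawn with replacement; the population and the helper `variance` are illustrative choices, not taken from the original page.

```python
from itertools import product

def variance(values, divisor_offset):
    """Sum of squared deviations from the mean, divided by (n - divisor_offset)."""
    n = len(values)
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - divisor_offset)

population = [2, 4, 6, 8]                 # hypothetical population
true_variance = variance(population, 0)   # whole population: divide by N

# Every possible ordered sample of size 2, drawn with replacement (16 samples).
samples = list(product(population, repeat=2))

avg_with_n = sum(variance(s, 0) for s in samples) / len(samples)
avg_with_n_minus_1 = sum(variance(s, 1) for s in samples) / len(samples)

print(true_variance)       # 5.0
print(avg_with_n)          # 2.5, too small on average
print(avg_with_n_minus_1)  # 5.0, matches the population variance on average
```

Averaged over all possible samples, the n-1 version recovers the population variance, while the n version falls short.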

3. Standard deviation

         Since the unit of the variance is not convenient, we simply take the square root of the variance, which brings the unit back to that of the raw data. This variability measure is called the standard deviation. Therefore, the standard deviation (SD) of sample data can be computed as

$$s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2}$$
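As a minimal sketch, the sample variance and standard deviation (both using n-1) can be computed as follows. The data reuse the values from the mean example, treated here as weights in kg purely for illustration; the function names are assumptions.

```python
import math

def sample_variance(values):
    """Population variance estimate: squared deviations divided by n - 1."""
    n = len(values)
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)

def sample_sd(values):
    """Standard deviation: square root of the variance, in the data's own units."""
    return math.sqrt(sample_variance(values))

weights_kg = [45, 50, 55, 60, 65]
print(sample_variance(weights_kg))      # 62.5 (square kg)
print(round(sample_sd(weights_kg), 2))  # 7.91 (kg)
```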

6. Confidence Intervals

 1. General form of confidence interval

         We want to know the actual population mean μ, but all we have is an estimate of it, the sample mean $\bar{x}$. A confidence interval is a range of values within which we can be confident that the actual population mean falls, such as L < μ < U. The general form of most confidence intervals is Sample estimate ± Margin of error. Therefore, the lower limit L is the estimate - margin of error and the upper limit U is the estimate + margin of error. We are confident that the value of the population parameter is somewhere between L and U.

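2. (1-α)100% t-interval for population mean

       Formula in words: Sample mean ± (t-multiplier x standard error)

       Formula in notation:

$$\bar{x} \pm t_{\alpha/2,\,n-1}\,\frac{s}{\sqrt{n}}$$

       Here s is the sample standard deviation and n is the sample size, so $s/\sqrt{n}$ is the standard error. The t-multiplier depends on the α level and the degrees of freedom, which is the sample size minus 1 (n-1).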
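The interval "sample mean ± t-multiplier x standard error" can be sketched in Python as below. The t-multiplier is passed in by hand, as if read from a printed t-table; 2.776 is the commonly tabulated value for 95% confidence with 4 degrees of freedom. The function name `t_interval` and the data are illustrative assumptions.

```python
import math

def t_interval(values, t_multiplier):
    """Return (lower, upper) limits: sample mean -/+ t_multiplier * standard error."""
    n = len(values)
    x_bar = sum(values) / n
    s = math.sqrt(sum((x - x_bar) ** 2 for x in values) / (n - 1))
    standard_error = s / math.sqrt(n)
    margin = t_multiplier * standard_error
    return x_bar - margin, x_bar + margin

data = [45, 50, 55, 60, 65]   # n = 5, so degrees of freedom = 4
T_95_DF4 = 2.776              # t-table value for 95% confidence, df = 4

lower, upper = t_interval(data, T_95_DF4)
print(round(lower, 2), round(upper, 2))  # 45.19 64.81
```

With only five observations the interval is wide; a larger sample shrinks both the standard error and the t-multiplier.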

3. Determining t-multiplier

Typical t-multipliers in science

Confidence Coefficient   Confidence Level   t-table Probability
(1-α)                    (1-α) x 100%       (1-α/2)
0.90                     90%                0.950
0.95                     95%                0.975
0.99                     99%                0.995
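The third number in each row of the table is consistent with 1-α/2, the cumulative probability used to look up a two-sided t-multiplier, because the error probability α is split equally between the two tails. A small sketch of that arithmetic, with a hypothetical function name:

```python
def t_table_probability(confidence_coefficient):
    """Cumulative probability 1 - alpha/2 for a two-sided interval."""
    alpha = 1 - confidence_coefficient
    return 1 - alpha / 2

for coefficient in (0.90, 0.95, 0.99):
    # Rounded to three decimals to hide floating-point noise.
    print(coefficient, round(t_table_probability(coefficient), 3))
```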

 

 

 

 

 

 

7. Hypothesis Testing

1. General idea

          Once we make an initial assumption, we collect data and decide whether we will reject or not reject the initial assumption based on the available evidence (data).

2. Making the decision

         Assuming the initial assumption is true, the data we observed are either 'likely' or 'unlikely'. When the data are 'likely', we do not reject the initial assumption. When they are 'unlikely', we reject it. In other words, if the data are unlikely, then either our initial assumption is correct and we experienced a very unusual event, or our initial assumption is incorrect. The initial assumption is called the null hypothesis (H0), and the competing claim is the alternative hypothesis (HA). For example, H0: the heights of Americans are the same as the heights of Koreans. HA: Americans are taller than Koreans.

3. Important point

       1) Even if we reject the null hypothesis, we do not prove the alternative hypothesis is true.
       2) Even if we do not reject the null hypothesis, we do not prove the alternative hypothesis is not true.
       3) We just say that we have enough evidence to treat the hypothesis one way or the other.
       4) In statistics, there is always a possibility of making a wrong decision (error).

4. Errors in hypothesis testing

       1) Type I Error: The null hypothesis is rejected when it is true.
       2) Type II Error: The null hypothesis is not rejected when it is false.
Lowering the significance level α decreases the probability of a Type I error, and increasing the sample size decreases the probability of a Type II error.

5. Possible hypotheses about mean μ

          In many cases in scientific research, we would like to compare two means. For example, after training our volleyball players with a new training strategy for 2 weeks, we would like to compare their performance with that of other volleyball players who did not experience the new training. We can use a right-tail, left-tail, or two-tail test.

Type         Null Hypothesis   Alternative Hypothesis
Right-tail   H0: μ = 3         HA: μ > 3
Left-tail    H0: μ = 3         HA: μ < 3
Two-tail     H0: μ = 3         HA: μ ≠ 3

6. Critical value approach

       1) Assume that the null hypothesis is true.
       2) Calculate the value of the test statistic using sample data.
       3) Set the significance level α, which is the probability of making a Type I error, to be small, e.g. 0.05 or 0.01.
       4) Compare the value of the test statistic to the known distribution of the test statistic (use a t-table with the number of samples and α to find the critical value).
       5) If the test statistic is more extreme than the critical value, reject the null hypothesis. This decision allows for an α chance of error.
       6) Otherwise, do not reject the null hypothesis.
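The steps above can be sketched for a one-sample t-test of H0: μ = 3. The sample values are made up for illustration, and the critical values 2.132 (one-tail) and 2.776 (two-tail) are the commonly tabulated t-table entries for α = 0.05 with 4 degrees of freedom. The function names are assumptions.

```python
import math

def t_statistic(values, mu_0):
    """One-sample t statistic: (sample mean - mu_0) / standard error."""
    n = len(values)
    x_bar = sum(values) / n
    s = math.sqrt(sum((x - x_bar) ** 2 for x in values) / (n - 1))
    return (x_bar - mu_0) / (s / math.sqrt(n))

def reject_null(t_stat, critical_value, tail):
    """critical_value is the positive t-table entry; tail is 'right', 'left' or 'two'."""
    if tail == "right":
        return t_stat > critical_value
    if tail == "left":
        return t_stat < -critical_value
    if tail == "two":
        return abs(t_stat) > critical_value
    raise ValueError("tail must be 'right', 'left' or 'two'")

sample = [3.4, 3.9, 3.1, 4.2, 3.6]     # hypothetical measurements, n = 5
t_stat = t_statistic(sample, mu_0=3)

print(round(t_stat, 2))                     # 3.35
print(reject_null(t_stat, 2.132, "right"))  # True: reject H0 in favor of mu > 3
print(reject_null(t_stat, 2.776, "two"))    # True: reject H0 in favor of mu != 3
print(reject_null(t_stat, 2.132, "left"))   # False: no evidence that mu < 3
```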

7. P-value approach

       1) Assume that the null hypothesis is true.
       2) Calculate the value of the test statistic using sample data.
       3) Using the known distribution of the test statistic, calculate the P-value.
       4) P-value = if the null hypothesis is true, the probability that we would observe a test statistic at least as extreme as the one obtained.
       5) Set the significance level α, which is the probability of making a Type I error, to be small, e.g. 0.05 or 0.01.
       6) If the P-value is smaller than α (regardless of the type of test), reject the null hypothesis. Otherwise, do not reject the null hypothesis.
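In practice the P-value is usually taken from statistical software. As a rough, self-contained sketch, it can also be approximated by numerically integrating the density of Student's t distribution with Simpson's rule, as below. The function names, the number of integration steps, and the example t statistic of 3.35 with 4 degrees of freedom are all illustrative assumptions.

```python
import math

def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coefficient = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coefficient * (1 + t * t / df) ** (-(df + 1) / 2)

def upper_tail_probability(t_stat, df, steps=2000):
    """Approximate P(T > t_stat) using Simpson's rule on [0, |t_stat|]; steps must be even."""
    b = abs(t_stat)
    h = b / steps
    total = t_pdf(0.0, df) + t_pdf(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area_0_to_b = total * h / 3
    upper = 0.5 - area_0_to_b          # the distribution is symmetric about 0
    return upper if t_stat >= 0 else 1 - upper

def p_value(t_stat, df, tail):
    """P-value for a 'right', 'left' or 'two' tail test."""
    upper = upper_tail_probability(t_stat, df)
    if tail == "right":
        return upper
    if tail == "left":
        return 1 - upper
    if tail == "two":
        return 2 * min(upper, 1 - upper)
    raise ValueError("tail must be 'right', 'left' or 'two'")

alpha = 0.05
p_right = p_value(3.35, 4, "right")   # hypothetical t statistic with df = 4
print(p_right < alpha)                       # True, so reject the null hypothesis
print(round(p_value(2.776, 4, "two"), 3))    # 0.05, agreeing with the t-table
```

Both approaches lead to the same decision: a test statistic beyond the critical value corresponds to a P-value below α.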

 

 


This site was last updated 10/18/03