Pearson's chi-square test

Pearson's chi-square test is a statistical test used to assess goodness of fit or to test whether there is a difference between samples of data. A one-sample chi-square test assesses the goodness of fit between observed frequencies and theoretical expected frequencies. An example of a one-sample chi-square is soil type on a farm: the farm is the single sample being tested, and the soil types are the categories. The sample can have multiple categories, but only one sample is tested. A two-or-more-samples chi-square test is used to test differences between samples of data. An example of a two-sample chi-square is testing the differences between soil types on two separate farms: the two farms are the two samples.

The chi-square test can be used to assess geographic data. The one-sample test enables geographers to examine the differences between observed data and expected data. A two-or-more-samples test enables geographers to examine the differences between samples.

Criteria for a Chi-Square Test
- In terms of scale of measurement, the data must be nominal.

- The data may also be "categorized" ordinal or interval data.

- The categories of data must be mutually exclusive.

- The data must be in frequencies, i.e. the number of discrete objects occurring in different categories; the data cannot be in percentages or proportions.

One-Sample Chi-Square Test
The chi-square statistic is a sum of differences between observed and expected outcome frequencies, each squared and divided by the expectation:


 * $$ \chi^2 = \sum_{i=1}^n {\frac{(O_i - E_i)^2}{E_i}}$$

where:
 * $$O_i$$ = an observed frequency for the $$i^{th}$$ bin
 * $$E_i$$ = an expected (theoretical) frequency for the $$i^{th}$$ bin, asserted by the null hypothesis

The resulting value can be compared to the chi-square distribution to determine the goodness of fit.

To determine the degrees of freedom of the chi-squared distribution, one takes the total number of observed frequency categories and subtracts one. For example, if there are eight categories, one would compare the statistic to a chi-squared distribution with seven degrees of freedom.
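The one-sample test can be sketched in Python. The soil-type counts below are assumed illustrative values, not data from the text; the expected frequencies represent a null hypothesis of equal proportions.

```python
import numpy as np
from scipy.stats import chi2

# Hypothetical soil-type counts on one farm (assumed data for illustration).
observed = np.array([30, 20, 25, 25])  # observed plot counts per soil type
expected = np.array([25, 25, 25, 25])  # expected counts under the null hypothesis

# Chi-square statistic: sum of (O - E)^2 / E over all categories.
chi_sq = np.sum((observed - expected) ** 2 / expected)

# Degrees of freedom: number of categories minus one.
df = len(observed) - 1

# p-value from the chi-square distribution's survival function.
p_value = chi2.sf(chi_sq, df)
print(chi_sq, df, p_value)
```

A large statistic (small p-value) would lead one to reject the hypothesis that the observed frequencies match the expected ones.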

Another way to describe the chi-squared statistic is with the differences weighted based on measurement error:
 * $$ \chi^2 = \sum {\frac{(O - E)^2}{\sigma^2}}$$

where $$\sigma^2$$ is the variance of the observation. This definition is useful when one has estimates for the error on the measurements.
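The variance-weighted form can be sketched as follows; the measurements, model predictions, and error bars are assumed values chosen only to illustrate the computation.

```python
import numpy as np

# Hypothetical measurements with known measurement errors (assumed values).
observed = np.array([10.2,  9.7, 11.1, 10.5])   # measured values
expected = np.array([10.0, 10.0, 10.0, 10.0])   # model predictions
sigma    = np.array([ 0.5,  0.4,  0.6,  0.5])   # one-sigma measurement errors

# Each squared residual is weighted by the variance of that observation,
# so precisely measured points contribute more to the statistic.
chi_sq = np.sum((observed - expected) ** 2 / sigma ** 2)
print(chi_sq)
```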

The reduced chi-squared statistic is simply the chi-squared divided by the number of degrees of freedom:


 * $$ \chi_{red}^2 = \frac{\chi^2}{\nu} = \frac{1}{\nu} \sum {\frac{(O - E)^2}{\sigma^2}}$$

where $$\nu$$ is the number of degrees of freedom, usually given by $$N-n-1$$, where $$N$$ is the number of bins and $$n$$ is the number of fit parameters. The advantage of the reduced chi-squared is that it already normalizes for the number of data points and model complexity. As a rule of thumb, a large $$\chi_{red}^2$$ indicates a poor model fit. A $$\chi_{red}^2 < 1$$ indicates that the model is 'over-fitting' the data (either the model is improperly fitting noise, or the error bars have been over-estimated), while a $$\chi_{red}^2 > 1$$ indicates that the fit has not fully captured the data (or that the error bars have been under-estimated). In principle, $$\chi_{red}^2 = 1$$ is the best fit for the given data and error bars.
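The reduced chi-squared can be sketched with assumed residuals from a hypothetical straight-line fit (slope and intercept, so $$n = 2$$ fit parameters), using the $$\nu = N - n - 1$$ convention given above.

```python
import numpy as np

# Hypothetical linear-fit residuals (assumed data for illustration).
observed = np.array([1.1, 2.0, 2.9, 4.2, 5.1])  # measurements
expected = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # values from a fitted line
sigma    = np.full(5, 0.1)                      # one-sigma error bars

# Variance-weighted chi-square statistic.
chi_sq = np.sum((observed - expected) ** 2 / sigma ** 2)

# nu = N - n - 1: N bins, n fit parameters (slope and intercept -> n = 2).
N, n = len(observed), 2
nu = N - n - 1

# Reduced chi-square: chi-square per degree of freedom.
chi_sq_red = chi_sq / nu
print(chi_sq_red)
```

Here the value is well above 1, suggesting (under the rule of thumb above) that the fit has not fully captured the data or that the error bars are under-estimated.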

Binomial Case
A binomial experiment is a sequence of independent trials, each of which can result in one of two outcomes, success or failure. There are n trials, each with probability of success denoted by p. Provided that $$np_i \gg 1$$ for every i (where i = 1, 2, ..., k), then

$$ \chi^2 = \sum_{i=1}^{k} {\frac{(N_i - np_i)^2}{np_i}} = \sum_{\mathrm{all\ cells}}^{} {\frac{(\mathrm{O} - \mathrm{E})^2}{\mathrm{E}}}.$$

This has approximately a chi-squared distribution with k − 1 df. The fact that df = k − 1 is a consequence of the restriction $$ \sum N_i=n$$. There are k observed cell counts, but once any k − 1 of them are known, the remaining one is uniquely determined. Thus only k − 1 cell counts are freely determined, so df = k − 1.
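The case above can be sketched as follows; the trial count, category probabilities, and observed counts are assumed values for illustration.

```python
import numpy as np

# Hypothetical experiment: n trials over k categories (assumed data).
n = 100
p = np.array([0.5, 0.3, 0.2])      # hypothesized category probabilities
counts = np.array([55, 25, 20])    # observed counts N_i, summing to n

# Check the large-sample condition n * p_i >> 1 for every category.
assert np.all(n * p > 1)

# Chi-square statistic with expected counts E_i = n * p_i.
chi_sq = np.sum((counts - n * p) ** 2 / (n * p))

# df = k - 1 because the counts are constrained to sum to n.
df = len(counts) - 1
print(chi_sq, df)
```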