Hypotheses suggested by the data fallacy

The testing hypotheses suggested by the data fallacy arises in scientific research that uses statistics and spatial analysis. It occurs when the data set used to formulate a hypothesis is also used to test that hypothesis. Testing of this kind frequently produces false positives (Type I errors).

When the testing hypotheses suggested by the data fallacy occurs
A hypothesis is often formed from observations of data or nature. If an interesting phenomenon is observed in a data set, scientists may want to test whether it is statistically significant. This is where the testing hypotheses suggested by the data fallacy becomes a risk. If the scientist runs the test on the same data set in which the phenomenon was observed, the test may come out significant even though no real effect exists, a Type I error. The data set in which the phenomenon was observed could be unusual, dependent on an outside factor, sampled incorrectly, or contain outliers. Because the data set itself suggested the relationship, a test on that same data is more likely to declare the relationship significant even when it does not, in reality, exist.
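The inflation described above can be made concrete with a small simulation. The sketch below (an illustrative construction, not drawn from the article) generates many pure-noise variables, lets the data "suggest" a hypothesis by picking the variable with the most extreme sample mean, and then tests that same variable on the same data with a simple two-sided z-test. Although every variable is noise, the selected one is declared significant far more often than the nominal 5%.

```python
import random
import statistics
import math

random.seed(0)

def looks_significant(sample, z_crit=1.96):
    # Two-sided z-test of mean 0, known standard deviation 1,
    # at the nominal alpha = 0.05 level.
    return abs(statistics.fmean(sample)) * math.sqrt(len(sample)) > z_crit

trials = 2000
false_positives = 0
for _ in range(trials):
    # 20 pure-noise "variables" of 30 observations each; no real effect exists.
    data = [[random.gauss(0, 1) for _ in range(30)] for _ in range(20)]
    # The hypothesis "suggested by the data": the variable whose sample
    # mean is farthest from zero. It is then tested on the SAME data.
    chosen = max(data, key=lambda s: abs(statistics.fmean(s)))
    if looks_significant(chosen):
        false_positives += 1

rate = false_positives / trials
print(f"false-positive rate: {rate:.2f}")  # well above the nominal 0.05
```

With 20 candidate variables, the chance that at least one noise variable clears the 5% threshold is roughly 1 - 0.95^20, about 64%, which is what selecting the extreme variable and re-testing it on the same data reproduces.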

Avoiding the testing hypotheses suggested by the data fallacy
One simple way to avoid this error is to test the hypothesis on a data set that was not used to generate it. If a significant result is found in data not used to formulate the hypothesis, the chance of a Type I error is much lower. In geospatial predictive modeling, cross-validation can be used to avoid this fallacy. Bootstrapping and Scheffé's method are other statistical techniques that guard against it.
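The hold-out remedy can be sketched the same way (again an illustrative construction, not from the article): let an exploration data set suggest which noise variable looks interesting, but run the significance test on a fresh confirmation data set. The false-positive rate then stays near the nominal 5% level.

```python
import random
import statistics
import math

random.seed(1)

def looks_significant(sample, z_crit=1.96):
    # Two-sided z-test of mean 0, known standard deviation 1,
    # at the nominal alpha = 0.05 level.
    return abs(statistics.fmean(sample)) * math.sqrt(len(sample)) > z_crit

trials = 2000
hits = 0
for _ in range(trials):
    # Exploration set: 20 pure-noise variables suggest a "best" candidate.
    explore = [[random.gauss(0, 1) for _ in range(30)] for _ in range(20)]
    best = max(range(20), key=lambda i: abs(statistics.fmean(explore[i])))
    # Confirmation set: fresh, independent data for the same 20 variables.
    confirm = [[random.gauss(0, 1) for _ in range(30)] for _ in range(20)]
    if looks_significant(confirm[best]):
        hits += 1

rate = hits / trials
print(f"held-out false-positive rate: {rate:.2f}")  # close to the nominal 0.05
```

Because the confirmation data played no role in choosing the hypothesis, the selected variable is just another noise variable there, and the test retains its advertised error rate.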

Related fallacies

 * Overfitting
 * Post hoc theorizing