An important use of statistics is to quantify how significant an observed phenomenon is compared with what we would expect from theory or chance. For example, say that we flip a coin 100 times and get 60 heads. Is this sufficient grounds to claim that the coin is probably biased? Or, say that we give a certain drug, drug A, to 100 patients and 38 of them get better. We then give another drug, drug B, to 100 different patients and 45 of them get better. Is there enough evidence to reasonably state that drug B is superior to drug A?
This is where hypothesis testing comes in. The method of testing hypotheses that is used most frequently today was put forth by Jerzy Neyman and Egon Pearson in the 1930s. There are two competing hypotheses under the Neyman-Pearson theory. There is a null hypothesis (denoted H0), and an alternative hypothesis (denoted H1). Informally speaking, H0 asserts that nothing unusual is going on that couldn't be explained by chance or other known factors, and H1 asserts that something different is taking place. So, if we were to test a coin to see whether it is biased towards heads or not, we might formulate the following hypotheses:
H0: The coin is fair (the probability of heads is 0.5).
H1: The coin is biased towards heads (the probability of heads is greater than 0.5).
Alternatively, depending on the aim of the experiment, we could construct different hypotheses, as long as they are mutually exclusive.
We formulate these hypotheses before performing the experiment, so that they are not influenced by what we find. We also decide beforehand what degree of confidence we require. If our results are close to 50% heads and 50% tails, we do not have enough evidence to reject H0, so we retain it. (Strictly speaking, we fail to reject H0; this does not prove that H0 is true.) If the number of heads is "large enough" (where "large enough" is defined statistically, depending on how confident we want to be in our result), then we reject H0 and accept H1.
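One way to make "large enough" concrete is sketched below. It assumes 100 flips (as in the opening example) and a one-sided test for bias towards heads, and finds the smallest number of heads that a fair coin would reach or exceed with probability no greater than a chosen significance level. The function names upper_tail and critical_heads are illustrative only and do not come from any library.

```python
from math import comb

def upper_tail(n, k, p=0.5):
    """Exact P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def critical_heads(n, alpha):
    """Smallest k with P(X >= k | fair coin) <= alpha (one-sided test)."""
    for k in range(n + 1):
        if upper_tail(n, k) <= alpha:
            return k
    return n + 1  # no outcome is extreme enough to reject H0

# Assumed setup: 100 flips, significance levels of 5% and 1%.
for alpha in (0.05, 0.01):
    print(f"alpha = {alpha}: reject H0 at {critical_heads(100, alpha)} or more heads")
```

Under these assumptions the threshold is 59 heads at the 5% level and 63 heads at the 1% level. Fixing the threshold before the experiment is what keeps the decision from being shaped by the data.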
Let's say that we flip our coin 100 times and get 60 heads. Whether we reject H0 depends on the degree of confidence we chose before the experiment. If the coin were fair, there would be fewer than 60 heads about 97.2% of the time and 60 or more heads about 2.8% of the time. So, if we had chosen a 95% confidence level (that is, a 5% significance level), we would have enough evidence to reject the hypothesis that the coin is fair (97.2% > 95%), and we would accept the hypothesis that the coin is biased towards heads. On the other hand, if we had chosen a 99% confidence level, we would not have enough evidence to reject the null hypothesis (97.2% < 99%), so we could not conclude that the coin is biased.
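The 2.8% figure can be checked with a short exact calculation. The sketch below sums the binomial probabilities of 60 through 100 heads for a fair coin and compares the result with the two significance levels discussed above. The helper names upper_tail and decide are hypothetical, chosen only for this illustration.

```python
from math import comb

def upper_tail(n, k, p=0.5):
    """Exact P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def decide(p_value, alpha):
    """Reject H0 when the tail probability falls below the significance level."""
    return "reject H0" if p_value < alpha else "fail to reject H0"

# Probability of 60 or more heads in 100 flips of a fair coin (about 0.028).
p_value = upper_tail(100, 60)
print(f"P(at least 60 heads | fair coin) = {p_value:.3f}")

# A 95% confidence level corresponds to alpha = 0.05, and 99% to alpha = 0.01.
for alpha in (0.05, 0.01):
    print(f"alpha = {alpha}: {decide(p_value, alpha)}")
```

The tail probability computed here is what is usually called the p-value of the one-sided test.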
You might be able to see from the above discussion that there are two types of errors that we could make. We might conclude that the coin is unfair when it really is fair (and it was just by chance that we got so many heads), or we might conclude that the coin is fair when it really is unfair. These are called Type I and Type II errors, respectively.
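The two error probabilities can be computed for a specific decision rule. The sketch below rests on two assumptions that are not part of the discussion above: the rule "reject H0 when 100 flips give 59 or more heads" (which keeps the Type I error rate below 5%), and a hypothetical biased coin that lands heads 60% of the time. The Type II error rate always depends on which alternative is actually true, so the value 0.6 is purely illustrative.

```python
from math import comb

def upper_tail(n, k, p):
    """Exact P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

n = 100
threshold = 59      # assumed rule: reject H0 when heads >= 59
biased_p = 0.6      # hypothetical bias, chosen only for illustration

# Type I error: the coin is fair, but we reject H0 anyway.
type_1 = upper_tail(n, threshold, 0.5)

# Type II error: the coin is biased, but we fail to reject H0.
type_2 = 1 - upper_tail(n, threshold, biased_p)

print(f"Type I error rate:  {type_1:.3f}")
print(f"Type II error rate: {type_2:.3f}")
```

With this rule the Type I error rate is about 4.4%, while the Type II error rate against the assumed 60% coin is considerably larger. Raising the threshold lowers the first rate and raises the second.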