A p-value is a quantitative summary of the evidence for or against a hypothesis of interest. It is computed using a statistical test, and it is used in situations where random noise (sampling error) may be the root cause of a finding. The technical definition of a p-value is: the probability of the observed data, or data showing a more extreme departure from the null hypothesis, when the null hypothesis is true.
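In symbols, this definition can be sketched as follows, where T stands for whatever test statistic the chosen test uses to measure departure from the null hypothesis H0, and t_obs is its observed value (this notation is illustrative, not part of the definition above):

```latex
% The p-value is the probability, computed under the null hypothesis H0,
% of a test statistic T at least as extreme as the observed value t_obs.
p = \Pr\bigl( T \ge t_{\text{obs}} \mid H_0 \bigr)
```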
Example
A study identified that 56% of 499 Chinese people believe that they have minimal say in “getting the government to address issues that interest” them, whereas only 25% of 287 Mexicans believe this of their government. At face value, the level of disenfranchisement of the Chinese is substantially higher than that of the Mexicans. The question of interest is: given that the study has found a difference between the Mexicans and the Chinese, how confident can we be that this difference is real?
The null hypothesis
One way we can attempt to answer this question is to start from the assumption that there is no difference between the Chinese and Mexicans -- and then test the soundness of that assumption. The assumption of no difference is called the null hypothesis. That is, the null hypothesis is that the true difference between the Mexicans and Chinese in terms of disenfranchisement is 0%.
If the assumption is not sound, the conclusion we draw is that the Chinese and Mexicans are truly different in terms of their levels of disenfranchisement.
Possible explanations for a difference when the null hypothesis is true
Even if it were true that the Chinese and Mexicans are the same in terms of disenfranchisement, we would still expect any study to identify some difference between the two countries.
One reason to expect differences relates to randomness in the process of selecting people to participate in the study. The study spoke only to 499 Chinese people. Had a different 499 people been interviewed, we would likely have obtained a different result (e.g., perhaps 52% rather than 56%). Unless you collect data from everybody in the population, there will always be some error due to who has been selected (i.e., sampled). This form of error is known as sampling error.
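The short Python sketch below illustrates sampling error under an assumed scenario: it supposes the true level of disenfranchisement really is 56% and repeatedly draws samples of 499 people. The sample sizes and percentages are from the study; the simulation itself is purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

true_proportion = 0.56  # suppose 56% is the truth in the whole population
sample_size = 499       # the number of Chinese respondents in the study

# Draw 10 independent samples of 499 people and record the percentage
# agreeing in each; the spread across samples is sampling error.
for _ in range(10):
    agree = rng.binomial(n=sample_size, p=true_proportion)
    print(f"{100 * agree / sample_size:.1f}%")
```

Each run of the loop plays the role of a fresh study: even though the underlying truth never changes, the observed percentage bounces around 56%.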
There are many causes of differences other than sampling error. For example, was the meaning of the question wording comparable after translation? Were similar processes for checking the data in place? What inducements were offered to encourage people to participate? The extent to which the observed value differs from the truth due to such factors is known as non-sampling error. Non-sampling error is either ignored or addressed with additional adjustments; in the rest of this article it is ignored.
Statistical tests of the null hypothesis
We have observed a difference of 31% between the Chinese and Mexicans. Our null hypothesis is that the true difference is 0%. If the null hypothesis is true, the question of interest is: how likely is it that we would observe a difference of 31% or more just due to sampling error? The probability that we would observe a difference of 31% or more between the Chinese and Mexicans -- when the truth is that there is no difference -- is known as the p-value, and in this case it is p = 0.00000000000000004 (i.e., 4 × 10^-17). By convention, if a p-value is less than or equal to 0.05, the null hypothesis is rejected. That is, the conclusion is that the difference between the Chinese and Mexicans is statistically significant.
The process of computing a p-value is known as statistical testing.
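As a concrete sketch of such a test, the Python snippet below runs a standard two-proportion z-test on counts reconstructed from the reported percentages (56% of 499 ≈ 279 and 25% of 287 ≈ 72; this rounding is an assumption, and this may not be the exact test used in the study). It reproduces a p-value of the same order of magnitude as the one quoted above.

```python
from math import sqrt
from scipy.stats import norm

# Counts reconstructed from the reported percentages (an approximation:
# 56% of 499 and 25% of 287, rounded to whole respondents).
x1, n1 = 279, 499  # Chinese respondents agreeing / interviewed
x2, n2 = 72, 287   # Mexican respondents agreeing / interviewed

p1, p2 = x1 / n1, x2 / n2
pooled = (x1 + x2) / (n1 + n2)  # common proportion under the null of no difference

# Standard error of the difference in proportions, assuming the null is true.
se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
z = (p1 - p2) / se

# Two-sided p-value: probability of a difference at least this extreme
# in either direction, just due to sampling error.
p_value = 2 * norm.sf(abs(z))
print(f"difference = {p1 - p2:.0%}, z = {z:.2f}, p = {p_value:.1e}")
```

Ready-made implementations, such as proportions_ztest in the statsmodels package, wrap the same calculation.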
Acknowledgments
The definition is from B. S. Everitt (2002), The Cambridge Dictionary of Statistics, Second Edition, Cambridge University Press.
The data are from Jonathan Wand (2012, conditionally accepted), “Credible Comparisons Using Interpersonally Incomparable Data,” American Journal of Political Science, Table 2.