Odds ratio, misclassification, non-differential and differential mismeasurement, log odds ratio.


The case-control study is still one of the most commonly used study designs in epidemiological research. Misclassification of case-control status remains a significant issue because it biases the results of a case-control study. There are two types of misclassification: differential and nondifferential. It is commonly accepted that nondifferential misclassification biases the results of a study towards the null hypothesis. In contrast, no reports have assessed the impact and direction of differential misclassification on the odds ratio (OR) estimate. The goal of the present study is to demonstrate by statistical derivation that patterns exist in the bias induced by differential misclassification.

In epidemiological studies, where assessing the relationship between exposure and outcome variables is the main goal, misclassification of the exposure variable leads to a biased estimate of the odds ratio. In multicenter clinical trials, the possibility of misclassifying the exposure variable, and the bias it induces, grows as the number of centers increases. Methods to correct a possibly misclassified exposure, so that the strength of the association between exposure and outcome can be assessed accurately, have been a focus of statistical and epidemiological research for over 30 years. In 1977. Simulation analyses show that a considerable number of biased odds ratios move away from the null hypothesis and approach zero or infinity as the proportion of misclassification increases among cases, controls, or both. These patterns are associated with the exposure status and with the value of the unbiased odds ratio (<1, 1, or >1).

1.1 Odds ratio

The odds ratio is one measure used in epidemiology to study the association between an exposure and the corresponding disease. It is the ratio of the odds of disease among exposed people to the odds of disease among unexposed people. The odds of an event are the ratio of the probability of the event to the probability of its complement.

image 1

Suppose that a case-control study is conducted to examine the relationship between smoking habit and lung cancer. Let S denote smokers, NS denote nonsmokers, D denote having lung cancer, and D- denote not having lung cancer. Assume that the researchers fix the numbers of cases and controls in advance at 600 and 550.

Table 1.1: The frequencies of a hypothetical case-control study.

              Lung cancer   No lung cancer   Total
Smokers           400            300           700
Nonsmokers        200            250           450
Total             600            550          1150

From the frequencies of the four combinations of smoking status and disease status in Table 1.1, the odds ratio is

OR = (400 × 250) / (300 × 200) = 1.67

The OR indicates that the odds of lung cancer among smokers are 1.67 times the odds among nonsmokers. Because the OR is larger than 1, the exposure (smoking status) is treated as a risk factor: there is a positive association between the exposure and the disease (a risk effect). If the OR is smaller than 1, the exposure is said to have a protective effect on the disease and is treated as a protective factor. An OR equal to 1 indicates no association. The coefficients of a logistic regression are related to odds ratios through the exponential function: the exponential of a coefficient is the odds ratio for a one-unit increase in the corresponding predictor.
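As a quick check of the arithmetic, the cross-product calculation for Table 1.1 can be sketched in Python. The function name `odds_ratio` and its argument names are illustrative choices, not part of the original study.

```python
def odds_ratio(exposed_cases, exposed_controls, unexposed_cases, unexposed_controls):
    """Cross-product odds ratio of a 2x2 table (illustrative helper)."""
    return (exposed_cases * unexposed_controls) / (exposed_controls * unexposed_cases)


# Table 1.1: smokers 400 cases / 300 controls, nonsmokers 200 cases / 250 controls
table_1_1_or = odds_ratio(400, 300, 200, 250)
print(round(table_1_1_or, 2))  # 1.67
```

The same value results from dividing the odds of exposure among cases (400/200) by the odds of exposure among controls (300/250), which is why the odds ratio can be estimated from a case-control design.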

1.2 Odds ratio and Log odds ratio

Unless the sample size is extremely large, the sampling distribution of the odds ratio is highly skewed. When θ = 1, for example, the sample odds ratio θ̂ cannot be much smaller than the population odds ratio θ (since θ̂ ≥ 0), but it could be much larger with nonnegligible probability. Because of this skewness, statistical inference for the odds ratio uses an alternative but equivalent measure, its natural logarithm, log(θ). Independence corresponds to log(θ) = 0; that is, an odds ratio of 1.0 is equivalent to a log odds ratio of 0.0. An odds ratio of 2.0 has a log odds ratio of 0.7. The log odds ratio is symmetric about zero, in the sense that reversing rows or reversing columns changes its sign. Two values of log(θ) that are the same except for sign, such as log(2.0) = 0.7 and log(0.5) = -0.7, represent the same strength of association. Doubling a log odds ratio corresponds to squaring an odds ratio. For instance, the log odds ratios 2(0.7) = 1.4 and 2(-0.7) = -1.4 correspond to odds ratios of 2^2 = 4 and 0.5^2 = 0.25.
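The symmetry and doubling properties can be checked numerically. The following is a minimal sketch using the counts of Table 1.1; the helper name `log_odds_ratio` and the cell labels n11, n12, n21, n22 (row by row) are assumptions made for illustration.

```python
import math


def log_odds_ratio(n11, n12, n21, n22):
    """Natural log of the cross-product odds ratio of a 2x2 table."""
    return math.log((n11 * n22) / (n12 * n21))


# Table 1.1 as given, and the same table with its two rows swapped
original = log_odds_ratio(400, 300, 200, 250)
rows_reversed = log_odds_ratio(200, 250, 400, 300)
print(round(original, 3), round(rows_reversed, 3))  # same magnitude, opposite sign

# Doubling a log odds ratio squares the odds ratio
squared_from_2 = math.exp(2 * math.log(2.0))    # odds ratio 2 -> about 4
squared_from_half = math.exp(2 * math.log(0.5))  # odds ratio 0.5 -> about 0.25
print(round(squared_from_2, 2), round(squared_from_half, 2))
```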

The sample log odds ratio, log θ̂, has a less skewed, bell-shaped distribution. Its approximating normal distribution has a mean of log θ and a standard error of

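SE = sqrt(1/n11 + 1/n12 + 1/n21 + 1/n22),

where n11, n12, n21, and n22 are the four cell counts of the 2×2 table.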

The SE decreases as the cell counts increase. Because the sampling distribution is closer to normality for log θ̂ than for θ̂, it is better to construct a confidence interval for log θ and then transform back to form a confidence interval for θ. A large-sample confidence interval for log θ is log θ̂ ± z(α/2)·SE. For Table 1.1, the natural log of θ̂ equals log(1.67) = 0.512, and the SE of log θ̂ equals 0.121. For the population, a 95% confidence interval for log θ equals 0.512 ± 1.96(0.121), or (0.275, 0.749). The corresponding confidence interval for θ is [exp(0.275), exp(0.749)] = (1.316, 2.115).

The symbol e^x = exp(x) denotes the exponential function evaluated at x. The exponential function is the antilog for the logarithm on the natural log scale: e^x = c is equivalent to log(c) = x. For instance, e^0 = exp(0) = 1 corresponds to log(1) = 0. Since the confidence interval (1.316, 2.115) for θ does not contain 1.0, the true odds of lung cancer seem different for the two groups. We estimate that the odds of lung cancer are at least 31% higher for smokers than for nonsmokers. The end points of the interval are not equidistant from θ̂ = 1.67, because the sampling distribution of θ̂ is skewed to the right.
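The interval construction can be sketched as follows. This is an illustration under a stated assumption: the standard error is taken to be the usual large-sample formula sqrt(1/n11 + 1/n12 + 1/n21 + 1/n22) over the four cell counts. The function name `odds_ratio_ci` and the default z = 1.96 are illustrative choices.

```python
import math


def odds_ratio_ci(n11, n12, n21, n22, z=1.96):
    """Large-sample (Wald) confidence interval for an odds ratio.

    Assumes SE(log OR) = sqrt(1/n11 + 1/n12 + 1/n21 + 1/n22).
    Returns the log odds ratio, its SE, and the interval on the OR scale.
    """
    log_or = math.log((n11 * n22) / (n12 * n21))
    se = math.sqrt(1 / n11 + 1 / n12 + 1 / n21 + 1 / n22)
    # Build the interval on the log scale, then transform back with exp
    lower = math.exp(log_or - z * se)
    upper = math.exp(log_or + z * se)
    return log_or, se, (lower, upper)


# Table 1.1 counts, row by row
log_or, se, (lower, upper) = odds_ratio_ci(400, 300, 200, 250)
print(f"log OR = {log_or:.3f}, SE = {se:.3f}")
print(f"95% CI for OR: ({lower:.2f}, {upper:.2f})")
```

Because the code carries full precision instead of starting from the rounded value 1.67, its results may differ from the figures quoted in the text in the third decimal place.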

1.3 Non-differential versus Differential Mismeasurement

Mismeasurement can be separated into two types: non-differential and differential. Non-differential mismeasurement is present if the observed exposure carries no additional information about the response variable once the true value of that exposure is given. If the response variable is binary (disease or no disease) and the exposure is categorical, the non-differential case simply means that the misclassification probabilities do not vary between cases and controls. Non-differential misclassification can be caused by random errors, such as fallible memory and misunderstood questions, and by systematic errors, such as test failures, as long as the errors are equally likely to occur at all levels.

Prospective cohort studies are by nature more likely to involve non-differential than differential mismeasurement, because subjects are not aware of their future disease status and cannot alter their exposure values on the basis of that unknown status. It is therefore highly probable that patients make the same number of errors in reporting exposure. However, a non-differential case may turn into a differential one when continuous data that are subject to measurement error are converted into categorical data.

1.4 Misclassification Rates

When a categorical exposure is subject to misclassification, the probabilities of its observed status given its true status are called misclassification rates. Misclassification rates reflect the accuracy of a measurement tool and can help investigators correct for misclassification. They can be written in matrix form, with the diagonal entries representing the probabilities of correct classification and the off-diagonal entries representing the misclassification rates. For the simple case with one binary exposure and one binary outcome (the 2×2 case), the matrix has four cells: sensitivity, 1-sensitivity, specificity, and 1-specificity.

The sensitivity of a test is the probability that a person is classified as exposed given that the person is truly exposed. The specificity of a test is the probability that a person is classified as unexposed given that the person is truly unexposed. Sensitivity and specificity are classification probabilities (probabilities of correct classification). The complement of sensitivity, 1-sensitivity, is called the false negative rate, and the complement of specificity, 1-specificity, is called the false positive rate; both are misclassification probabilities. For non-differential misclassification there is one pair of sensitivity and specificity, because cases and controls experience the same amount of misclassification. For differential misclassification there are two pairs, one for cases and one for controls.

Besides sensitivity and specificity, researchers sometimes need the inverse probabilities, that is, the probabilities of true status given observed status. The probability of being truly exposed given classification as exposed is the positive predictive value (PPV), and the probability of being truly unexposed given classification as unexposed is the negative predictive value (NPV). Then sensitivity is 0.522 and specificity is 0.478.
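The relationship between the misclassification matrix and the predictive values can be illustrated with a short sketch. The numbers are assumptions chosen for illustration: a sensitivity of 0.95 and a specificity of 0.92 (the complements of the 5% and 8% error rates used in Table 1.2), applied to the 400 truly exposed and 200 truly unexposed cases of Table 1.1. The function name `predictive_values` is hypothetical.

```python
def predictive_values(true_exposed, true_unexposed, sensitivity, specificity):
    """PPV and NPV implied by sensitivity, specificity, and the true counts."""
    true_pos = sensitivity * true_exposed            # exposed, classified exposed
    false_neg = (1 - sensitivity) * true_exposed     # exposed, classified unexposed
    true_neg = specificity * true_unexposed          # unexposed, classified unexposed
    false_pos = (1 - specificity) * true_unexposed   # unexposed, classified exposed
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv


# Assumed rates for illustration only
sensitivity, specificity = 0.95, 0.92

# Rows: true status (exposed, unexposed); columns: observed status (exposed, unexposed)
misclassification_matrix = [
    [sensitivity, 1 - sensitivity],
    [1 - specificity, specificity],
]

ppv, npv = predictive_values(400, 200, sensitivity, specificity)
print(f"PPV = {ppv:.3f}, NPV = {npv:.3f}")
```

Unlike sensitivity and specificity, the predictive values depend on how many subjects are truly exposed, so they change from one group to another even when the measurement tool is the same.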

1.5 Misclassification in relation to an odds ratio

Table 1.2: The effects of non-differential misclassification when 5% of smokers are misclassified as nonsmokers and 8% of nonsmokers are misclassified as smokers.

              Lung cancer       No lung cancer    Total
Smokers       400-20+16=396     300-15+20=305       701
Nonsmokers    200-16+20=204     250-20+15=245       449
Total         600               550                1150

Table 1.2 shows the potential influence of a misclassified binary exposure on the OR. Five percent of smokers are misclassified into the unexposed group, and 8 percent of nonsmoking individuals are misclassified into the smoking category. This misclassification is non-differential because cases and controls experience the same misclassification rates. In the presence of misclassification the OR falls to (396 × 245)/(305 × 204) = 1.56, compared with 1.67 without misclassification. The OR is therefore underestimated; this underestimation is referred to as attenuation, or bias towards the null.
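The construction of Table 1.2 can be reproduced with a minimal sketch. The helper names `misclassify` and `odds_ratio` are illustrative; the rates are the ones stated for Table 1.2, expressed as a sensitivity of 0.95 and a specificity of 0.92 applied identically to cases and controls.

```python
def misclassify(exposed, unexposed, sensitivity, specificity):
    """Expected observed (exposed, unexposed) counts after misclassification."""
    observed_exposed = sensitivity * exposed + (1 - specificity) * unexposed
    observed_unexposed = (1 - sensitivity) * exposed + specificity * unexposed
    return observed_exposed, observed_unexposed


def odds_ratio(exposed_cases, exposed_controls, unexposed_cases, unexposed_controls):
    """Cross-product odds ratio of a 2x2 table."""
    return (exposed_cases * unexposed_controls) / (exposed_controls * unexposed_cases)


SENS, SPEC = 0.95, 0.92  # same rates for cases and controls: non-differential

obs_cases = misclassify(400, 200, SENS, SPEC)      # about (396, 204)
obs_controls = misclassify(300, 250, SENS, SPEC)   # about (305, 245)

true_or = odds_ratio(400, 300, 200, 250)
biased_or = odds_ratio(obs_cases[0], obs_controls[0], obs_cases[1], obs_controls[1])
print(f"true OR = {true_or:.2f}, observed OR = {biased_or:.2f}")
```

The column totals (600 cases, 550 controls) are unchanged, since misclassification only moves subjects between exposure rows.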

Table 1.3: The effects of differential misclassification when 20% of smoking cases, but not controls, are misclassified as nonsmokers.

              Lung cancer     No lung cancer   Total
Smokers       400-80=320      300                620
Nonsmokers    200+80=280      250                530
Total         600             550               1150

Table 1.4: The effects of differential misclassification when 20% of nonsmoking cases, but not controls, are misclassified as smokers.

              Lung cancer     No lung cancer   Total
Smokers       400+40=440      300                740
Nonsmokers    200-40=160      250                410
Total         600             550               1150

Table 1.3 illustrates differential misclassification in this case-control study: 20% of the smokers who develop lung cancer are wrongly placed in the nonsmoking group, no nonsmoking cases are misclassified as smokers, and no misclassification occurs in the control group.

In Table 1.3 the OR becomes (320 × 250)/(300 × 280) = 0.95, compared with 1.67 without misclassification. Here cases and controls receive different amounts of misclassification; specifically, the controls are not misclassified at all. This situation may be due to the influence of the recall process. In Table 1.4, 20 percent of the nonsmoking cases are misclassified into the exposed group and there is no misclassification among the controls. The OR then becomes (440 × 250)/(300 × 160) = 2.29, which indicates that the parameter is overestimated, that is, biased away from the null.
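Both differential scenarios can be reproduced by giving cases and controls their own (sensitivity, specificity) pair. The sketch below is illustrative only; the function name `observed_odds_ratio` and the tuple layout are assumptions, while the rates are those stated for Tables 1.3 and 1.4.

```python
def observed_odds_ratio(cases, controls, case_rates, control_rates):
    """Odds ratio after applying group-specific misclassification.

    cases, controls: true (exposed, unexposed) counts.
    case_rates, control_rates: (sensitivity, specificity) for each group.
    """
    def misclassify(counts, rates):
        exposed, unexposed = counts
        sens, spec = rates
        return (sens * exposed + (1 - spec) * unexposed,
                (1 - sens) * exposed + spec * unexposed)

    case_exp, case_unexp = misclassify(cases, case_rates)
    ctrl_exp, ctrl_unexp = misclassify(controls, control_rates)
    return (case_exp * ctrl_unexp) / (ctrl_exp * case_unexp)


cases, controls = (400, 200), (300, 250)  # true counts from Table 1.1
perfect = (1.0, 1.0)                      # no misclassification

# Table 1.3: 20% of smoking cases recorded as nonsmokers (case sensitivity 0.8)
or_table_1_3 = observed_odds_ratio(cases, controls, (0.8, 1.0), perfect)
# Table 1.4: 20% of nonsmoking cases recorded as smokers (case specificity 0.8)
or_table_1_4 = observed_odds_ratio(cases, controls, (1.0, 0.8), perfect)

print(f"Table 1.3 OR = {or_table_1_3:.2f}, Table 1.4 OR = {or_table_1_4:.2f}")
```

With the same error rate of 20% among cases, the observed OR falls below the true value in one scenario and rises above it in the other, so the direction of the bias depends on which exposure group is misclassified.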


In this study we examine mismeasurement. We work out the effects of non-differential misclassification when 5% of smokers are misclassified as nonsmokers and 8% of nonsmokers are misclassified as smokers, and the effects of differential misclassification when 20% of smoking cases and 20% of nonsmoking cases, but not controls, are misclassified as nonsmokers and smokers, respectively. We also explain misclassification rates.