# Abstract

The odds ratio is a measure used in epidemiology to study the association between an exposure and the corresponding disease. We consider a case-control study conducted to examine the relationship between smoking habit and lung cancer, and we study the effect of mismeasurement of the exposure. Mismeasurement can be separated into two types, non-differential and differential. Misclassification of exposure variables in epidemiologic studies may lead to biased parameter estimates and loss of power in statistical inference. Simple estimates for predictive values when misclassification is non-differential are presented. Using them, we estimated the corrected log odds ratio. We examine the effects of non-differential misclassification when 5% of smokers are misclassified as nonsmokers and 8% of nonsmokers are misclassified as smokers. We also study the effects of differential misclassification when 20% of smoking cases or 20% of nonsmoking cases, but no controls, are misclassified as nonsmokers or smokers, respectively.

**Keywords**

Odds ratio, misclassification, non-differential and differential mismeasurement, log odds ratio.

**Introduction**

The case-control study is still one of the most commonly used study designs in epidemiological research. Misclassification of case-control status remains a significant issue because it biases the results of a case-control study. There are two types of misclassification, differential and non-differential. It is commonly accepted that non-differential misclassification biases the results of a study toward the null hypothesis. By contrast, no reports have assessed the impact and direction of differential misclassification on the odds ratio (OR) estimate. The goal of the present study is to demonstrate by statistical derivation that the bias induced by differential misclassification follows identifiable patterns.

In epidemiological studies, where assessing the relationship between exposure and outcome variables is the main goal, misclassification of the exposure variable leads to a biased estimate of the odds ratio. In multicenter clinical trials, the possibility of misclassifying the exposure variable, and the bias it induces, grows with the number of centers. Methods to correct a possibly misclassified exposure, so that the strength of the association between exposure and outcome can be accurately assessed, have been a focus of statistical and epidemiological research for over 30 years. In 1977. Simulation analyses show that many biased odds ratios move away from the null hypothesis, approaching zero or infinity as the proportion of misclassification among cases, controls, or both increases. These patterns depend on the exposure status and on the value of the unbiased odds ratio (<1, 1, or >1).

**1.1 Odds ratio**

The odds ratio is a measure used in epidemiology to study the association between an exposure and the corresponding disease. It is the ratio of the odds of disease among exposed people to the odds of disease among unexposed people. The odds of an event are the ratio of the probability of the event to the probability of its complement.

Suppose that a case-control study is conducted to examine the relationship between smoking habit and lung cancer. Let S denote smokers, NS denote nonsmokers, D denote having lung cancer, and $\bar{D}$ denote not having lung cancer. Assume that the numbers of cases (600) and controls (550) are fixed by the researchers in advance.

| | Lung cancer | No lung cancer | Total |
|---|---|---|---|
| Smokers | 400 | 300 | 700 |
| Nonsmokers | 200 | 250 | 450 |
| Total | 600 | 550 | 1150 |

Table 1: Frequencies of the four combinations of smoking status and disease status.

For the frequencies above, the sample odds ratio is $\hat{\theta} = (400 \times 250)/(300 \times 200) = 1.67$, so the odds of lung cancer are 1.67 times as high among smokers as among nonsmokers. The exposure (smoking status) is treated as a risk factor because the OR is larger than 1, which indicates a positive association between the exposure and the disease (a risk effect). If the OR is smaller than 1, the exposure is said to have a protective effect on the disease and is treated as a protective factor. An OR equal to 1 indicates no association. The coefficients in a logistic regression are log odds ratios, so exponentiating a coefficient gives the corresponding odds ratio.
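As a minimal sketch of this calculation, the cross-product ratio can be computed directly from the four cell counts of the smoking and lung cancer table. The function name `odds_ratio` and its argument names are illustrative choices and are not part of the original analysis.

```python
def odds_ratio(exposed_cases, exposed_controls, unexposed_cases, unexposed_controls):
    """Cross-product odds ratio for a 2x2 exposure-by-disease table."""
    return (exposed_cases * unexposed_controls) / (exposed_controls * unexposed_cases)


# Counts from the smoking and lung cancer table:
# 400 smoking cases, 300 smoking controls, 200 nonsmoking cases, 250 nonsmoking controls.
or_hat = odds_ratio(400, 300, 200, 250)
print(round(or_hat, 2))  # 1.67
```

Swapping the two exposure rows gives the reciprocal, `odds_ratio(200, 250, 400, 300)`, which is 0.6 and describes the same association from the nonsmokers' side.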

**1.2 Odds ratio and Log odds ratio**

Unless the sample size is extremely large, the sampling distribution of the odds ratio is highly skewed. When θ = 1, for example, the estimate $\hat{\theta}$ cannot be much smaller than θ (since $\hat{\theta} \ge 0$), but it could be much larger with non-negligible probability. Because of this skewness, statistical inference for the odds ratio uses an alternative but equivalent measure, its natural logarithm, log(θ). Independence corresponds to log(θ) = 0. That is, an odds ratio of 1.0 is equivalent to a log odds ratio of 0.0. An odds ratio of 2.0 has a log odds ratio of 0.7. The log odds ratio is symmetric about zero, in the sense that reversing the rows or reversing the columns changes its sign. Two values of log(θ) that are the same except for sign, such as log(2.0) = 0.7 and log(0.5) = -0.7, represent the same strength of association. Doubling a log odds ratio corresponds to squaring an odds ratio. For instance, the log odds ratios 2(0.7) = 1.4 and 2(-0.7) = -1.4 correspond to the odds ratios $2^2 = 4$ and $0.5^2 = 0.25$.

The sample log odds ratio, $\log\hat{\theta}$, has a less skewed, bell-shaped distribution. Its approximating normal distribution has a mean of $\log\theta$ and a standard error of

$$SE = \sqrt{\frac{1}{n_{11}} + \frac{1}{n_{12}} + \frac{1}{n_{21}} + \frac{1}{n_{22}}},$$

where $n_{11}, n_{12}, n_{21}, n_{22}$ are the four cell counts of the 2×2 table.

The SE decreases as the cell counts increase. Because the sampling distribution is closer to normality for $\log\hat{\theta}$, it is better to construct confidence intervals for $\log\theta$ and then transform back to obtain a confidence interval for θ. A large-sample confidence interval for $\log\theta$ is $\log\hat{\theta} \pm z_{\alpha/2}(SE)$. For Table 1, the natural log of $\hat{\theta}$ equals log(1.67) = 0.512, and the SE of $\log\hat{\theta}$ equals 0.121. For the population, a 95% confidence interval for $\log\theta$ equals 0.512 ± 1.96(0.121), or (0.275, 0.749). The corresponding confidence interval for θ is [exp(0.275), exp(0.749)] = (1.316, 2.115). The symbol $e^x = \exp(x)$ denotes the exponential function evaluated at x. The exponential function is the antilog for the logarithm on the natural log scale, which means that $e^x = c$ is equivalent to log(c) = x. For instance, $e^0 = \exp(0) = 1$ corresponds to log(1) = 0. Since the confidence interval (1.316, 2.115) for θ does not contain 1.0, the true odds of lung cancer seem different for the two groups. We estimate that the odds of lung cancer are at least 31% higher for smokers than for nonsmokers. The endpoints of the interval are not equally distant from $\hat{\theta} = 1.67$, because the sampling distribution of $\hat{\theta}$ is skewed to the right.
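The interval calculation can be sketched in a few lines, assuming the usual large-sample standard error of the log odds ratio (the square root of the sum of the reciprocal cell counts) and a normal critical value of 1.96. The function name `log_or_confidence_interval` is a hypothetical helper introduced only for this illustration.

```python
import math


def log_or_confidence_interval(n11, n12, n21, n22, z=1.96):
    """Wald interval for the log odds ratio, back-transformed to the OR scale.

    n11, n12: exposed cases and controls; n21, n22: unexposed cases and controls.
    """
    log_or = math.log((n11 * n22) / (n12 * n21))
    se = math.sqrt(1 / n11 + 1 / n12 + 1 / n21 + 1 / n22)
    lower, upper = log_or - z * se, log_or + z * se
    return log_or, se, (lower, upper), (math.exp(lower), math.exp(upper))


# Cell counts from the smoking and lung cancer table.
log_or, se, ci_log, ci_or = log_or_confidence_interval(400, 300, 200, 250)
print(f"log OR = {log_or:.3f}, SE = {se:.3f}")
print(f"95% CI for log OR: ({ci_log[0]:.3f}, {ci_log[1]:.3f})")
print(f"95% CI for OR:     ({ci_or[0]:.3f}, {ci_or[1]:.3f})")
```

Because this sketch works from the unrounded odds ratio, its output should agree with the figures quoted in the text to about two decimal places. Small differences in the third decimal come from rounding intermediate values.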

**1.3 Non-differential versus Differential Mismeasurement**

Mismeasurement can be separated into two types, non-differential and differential. Non-differential mismeasurement is present if the observed exposure carries no additional information about the response variable once the true value of that exposure is given. If the response variable is binary (disease or no disease) and the exposure is categorical, the non-differential case simply means that the misclassification probabilities do not vary between cases and controls; when they do vary, the mismeasurement is differential. Non-differential misclassification can be caused by random errors, such as fallible memory or misunderstood questions, and by systematic errors, such as test failures, as long as the errors are equally likely to occur in all levels. Prospective cohort studies are by nature more likely to involve non-differential than differential mismeasurement, because subjects are not aware of their future disease status and cannot alter their exposure values on the basis of it. It is highly probable for patients to make the same number of errors in exposure. A non-differential case may, however, turn into a differential one when continuous data that are subject to measurement error are turned into categorical data.

**1.4 Misclassification Rates**

When a categorical exposure is subject to misclassification, the probabilities of its observed status given its true status are called misclassification rates. Misclassification rates reflect the accuracy of a measurement tool and can help investigators correct for misclassification. They can be written in matrix form, with the diagonal entries giving the probabilities of correct classification and the off-diagonal entries giving the misclassification rates. For the simple case of one binary exposure and one binary outcome (the 2×2 case), the matrix has four cells: sensitivity, 1 − sensitivity, specificity, and 1 − specificity. The sensitivity of a test is the probability that a person is classified as exposed given that the person is truly exposed. The specificity is the probability that a person is classified as unexposed given that the person is truly unexposed. Sensitivity and specificity are probabilities of correct classification. Their complements, 1 − sensitivity (the false negative rate) and 1 − specificity (the false positive rate), are misclassification probabilities.

For non-differential misclassification there is a single pair of sensitivity and specificity, because cases and controls are misclassified at the same rates. For differential misclassification there are two pairs, one for cases and one for controls. Researchers sometimes need the inverse of these conditional probabilities: the probability that a person classified as exposed is truly exposed is the positive predictive value (PPV), and the probability that a person classified as unexposed is truly unexposed is the negative predictive value (NPV). Then sensitivity is 0.522 and specificity is 0.478.
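The relationship between these quantities can be sketched as follows. The sensitivity of 0.95 and specificity of 0.92 are assumptions taken from the 5% and 8% error rates of the non-differential scenario in Section 1.5, and the exposure prevalence is assumed to be that of the controls in the first table (300 smokers out of 550). These are illustrative inputs, not the values quoted in the text, and the function names are hypothetical.

```python
def misclassification_matrix(sensitivity, specificity):
    """Rows: true status (exposed, unexposed); columns: observed status."""
    return [
        [sensitivity, 1 - sensitivity],  # truly exposed: correct, false negative
        [1 - specificity, specificity],  # truly unexposed: false positive, correct
    ]


def predictive_values(sensitivity, specificity, exposure_prevalence):
    """PPV and NPV from Bayes' rule for a given true exposure prevalence."""
    p = exposure_prevalence
    ppv = sensitivity * p / (sensitivity * p + (1 - specificity) * (1 - p))
    npv = specificity * (1 - p) / (specificity * (1 - p) + (1 - sensitivity) * p)
    return ppv, npv


# Assumed inputs (see the paragraph above): 5% false negatives, 8% false
# positives, and the exposure prevalence observed among the controls.
sens, spec = 0.95, 0.92
matrix = misclassification_matrix(sens, spec)
ppv, npv = predictive_values(sens, spec, 300 / 550)
print(matrix)
print(f"PPV = {ppv:.3f}, NPV = {npv:.3f}")
```

Each row of the matrix sums to 1, since every truly exposed or truly unexposed person ends up in exactly one observed category. Unlike sensitivity and specificity, the predictive values depend on the exposure prevalence, so they would differ between cases and controls even when the misclassification is non-differential.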

**1.5 Misclassification in relation to an odds ratio**

| | Lung cancer | No lung cancer | Total |
|---|---|---|---|
| Smokers | 400-20+16=396 | 300-15+20=305 | 701 |
| Nonsmokers | 200-16+20=204 | 250-20+15=245 | 449 |
| Total | 600 | 550 | 1150 |

The table above shows the potential influence of a misclassified binary exposure on the OR. Five percent of smokers are misclassified into the unexposed group, and 8% of nonsmokers are misclassified into the smoking category. This misclassification is non-differential because cases and controls are misclassified at the same rates. With misclassification, the OR falls to (396 × 245)/(305 × 204) = 1.56, compared with 1.67 without misclassification. The OR is therefore underestimated; this underestimation is referred to as attenuation, or bias toward the null.
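The non-differential scenario can be reproduced with a short sketch that applies the same two error rates to cases and controls and then recomputes the OR. The helper names `odds_ratio` and `misclassify` are illustrative, and the counts are treated as expected values rather than random draws.

```python
def odds_ratio(a, b, c, d):
    """a, b: exposed cases and controls; c, d: unexposed cases and controls."""
    return (a * d) / (b * c)


def misclassify(exposed, unexposed, false_negative_rate, false_positive_rate):
    """Expected observed counts after misclassifying a true exposure split."""
    moved_out = exposed * false_negative_rate    # exposed recorded as unexposed
    moved_in = unexposed * false_positive_rate   # unexposed recorded as exposed
    return exposed - moved_out + moved_in, unexposed - moved_in + moved_out


# Same error rates for cases and controls, which makes the scenario non-differential.
fn_rate, fp_rate = 0.05, 0.08
cases = misclassify(400, 200, fn_rate, fp_rate)      # observed (smokers, nonsmokers)
controls = misclassify(300, 250, fn_rate, fp_rate)   # observed (smokers, nonsmokers)

true_or = odds_ratio(400, 300, 200, 250)
observed_or = odds_ratio(cases[0], controls[0], cases[1], controls[1])
print(round(true_or, 2), round(observed_or, 2))  # 1.67 1.56
```

The observed OR lies between the true OR and 1, which is the attenuation described in the text.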

| | Lung cancer | No lung cancer | Total |
|---|---|---|---|
| Smokers | 400-80=320 | 300 | 620 |
| Nonsmokers | 200+80=280 | 250 | 530 |
| Total | 600 | 550 | 1150 |

| | Lung cancer | No lung cancer | Total |
|---|---|---|---|
| Smokers | 400+40=440 | 300 | 740 |
| Nonsmokers | 200-40=160 | 250 | 410 |
| Total | 600 | 550 | 1150 |

Differential misclassification can also occur in this case-control study. In the first of the two tables above, 20% of the smokers who develop lung cancer are wrongly put into the nonsmoking group, no nonsmoking cases are misclassified as smokers, and no misclassification happens in the control group. The OR becomes (320 × 250)/(300 × 280) = 0.95, which is below the value of 1.67 obtained without misclassification. Here cases and controls receive different amounts of misclassification; specifically, the controls are not misclassified at all. This situation may arise from the recall process.

In the second table, 20% of the nonsmoking cases are misclassified into the exposed group, and again there is no misclassification among the controls. The OR then becomes (440 × 250)/(300 × 160) = 2.29, which indicates that the parameter is overestimated, that is, biased away from the null.
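Both differential scenarios can be checked with a small sketch in which cases and controls carry their own error rates. The function `observed_or` and its keyword arguments are hypothetical names chosen for this illustration. The true counts are those of the original smoking and lung cancer table, and the counts are treated as expected values.

```python
def odds_ratio(exposed_cases, exposed_controls, unexposed_cases, unexposed_controls):
    """Cross-product odds ratio for a 2x2 table."""
    return (exposed_cases * unexposed_controls) / (exposed_controls * unexposed_cases)


def observed_or(case_fn=0.0, case_fp=0.0, control_fn=0.0, control_fp=0.0):
    """Observed OR when cases and controls have their own error rates.

    fn: share of true smokers recorded as nonsmokers.
    fp: share of true nonsmokers recorded as smokers.
    """
    true_cases, true_controls = (400, 200), (300, 250)  # (smokers, nonsmokers)

    def shift(group, fn, fp):
        smokers, nonsmokers = group
        out, into = smokers * fn, nonsmokers * fp
        return smokers - out + into, nonsmokers - into + out

    s_case, ns_case = shift(true_cases, case_fn, case_fp)
    s_ctrl, ns_ctrl = shift(true_controls, control_fn, control_fp)
    return odds_ratio(s_case, s_ctrl, ns_case, ns_ctrl)


print(round(observed_or(case_fn=0.20), 2))  # 0.95: smoking cases recorded as nonsmokers
print(round(observed_or(case_fp=0.20), 2))  # 2.29: nonsmoking cases recorded as smokers
```

With all four rates left at zero the function returns the unbiased OR of about 1.67. The two calls show that differential misclassification can push the estimate in either direction, depending on which group of cases is misclassified.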

**Conclusion**

In this study we examined the effect of exposure mismeasurement on the odds ratio. We found the effects of non-differential misclassification when 5% of smokers are misclassified as nonsmokers and 8% of nonsmokers are misclassified as smokers. We also studied the effects of differential misclassification when 20% of smoking cases or 20% of nonsmoking cases, but no controls, are misclassified as nonsmokers or smokers, respectively. In addition, we explained the misclassification rates.
