ON THE PREFERENCE OF ZERO-INFLATION MODELS WITH THE PRESENCE OF DATA CONTAMINATION
Abstract
Data has become central to how researchers solve problems and improve quality of life, and different data sources naturally produce data with different characteristics. In fields such as engineering, epidemiology, psychology, sociology, public health, agriculture, road safety, economics, biology, and medicine, the dependent variable is often a count that takes the value zero with high probability, a phenomenon known as zero-inflation. For example, the number of deaths in a car accident is most likely zero, and a count of rare birds in a specific region will yield zero in most sectors. When the outcome is a non-negative count with excess zeros, classical models cannot properly infer the relationship between the covariates and the dependent variable. Furthermore, in some approaches, such as Neural Networks and Logistic Regression, modeling the data typically calls for balanced output classes, for instance an equal number of observations for class 0 and class 1. Balancing the classes means discarding a large portion of the data, so inference rests on a sample that does not represent the entire population. To cope with these drawbacks, scientists have focused on developing methods that model the entire set of Zero-Inflated count data. Several approaches have been constructed to express the dependency between the covariates and a dependent variable with excess zero counts. However, their performance has been found to differ depending on several factors, and each algorithm has its own limitation: Poisson and Zero-Inflated Poisson regression are restricted by the assumption of equidispersion, the Negative Binomial cannot handle Zero-Inflation in the response, and the Zero-Inflated Negative Binomial fails to converge in some cases.
Moreover, other factors can influence model selection, such as the proportion of zeros, the sample size, and the degree of dispersion; different values of these factors can change how a model performs. The structure of the selected population and the condition of the data, whether it contains outliers, missing values, or measurement errors, can also affect the results. Considering all of the above, this study aims to lay down basic guidelines that researchers can follow when dealing with Zero-Inflated phenomena.
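To make the notions of excess zeros and dispersion concrete, the following is a minimal sketch and not the study's own code. It assumes the standard Zero-Inflated Poisson mixture (a structural zero with probability pi, otherwise a Poisson(lam) count), which the abstract does not spell out. The function names and the values lam = 2.0 and pi = 0.4 are illustrative assumptions.

```python
import math
import random


def zip_pmf(k, lam, pi):
    """P(Y = k) under the assumed ZIP mixture with structural-zero probability pi."""
    poisson = math.exp(-lam) * lam ** k / math.factorial(k)
    return pi + (1.0 - pi) * poisson if k == 0 else (1.0 - pi) * poisson


def sample_poisson(lam, rng):
    """Draw one Poisson(lam) variate with Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, product = 0, rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k


def sample_zip(n, lam, pi, rng):
    """Draw n ZIP variates: a structural zero with probability pi, else Poisson(lam)."""
    return [0 if rng.random() < pi else sample_poisson(lam, rng) for _ in range(n)]


rng = random.Random(0)
lam, pi = 2.0, 0.4  # illustrative values, not taken from the study
counts = sample_zip(20000, lam, pi, rng)

mean = sum(counts) / len(counts)
variance = sum((c - mean) ** 2 for c in counts) / (len(counts) - 1)
zero_share = counts.count(0) / len(counts)
# Share of zeros that a plain Poisson model with the same mean would predict.
poisson_zero_share = math.exp(-mean)

print(f"mean={mean:.3f} variance={variance:.3f}")
print(f"observed zeros={zero_share:.3f} Poisson-implied zeros={poisson_zero_share:.3f}")
```

Under these assumptions the simulated counts should show more zeros than a plain Poisson with the same mean would predict, and a sample variance above the sample mean, which is the kind of departure from equidispersion that a plain Poisson regression cannot absorb.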
DOI/handle
http://hdl.handle.net/10576/44991
Collections
- Mathematics, Statistics & Physics