Demographic and epidemiological characteristics of scorpion envenomation and daily forecasting of scorpion sting counts in Touggourt, Algeria

OBJECTIVES This study was conducted to provide better insight into the demographic and epidemiological characteristics of scorpion envenomation in an endemic area of Algeria and to identify the model that best predicts daily scorpion sting counts.
METHODS Daily sting data from January 1, 2013 to August 31, 2016 were extracted from questionnaires designed to elicit information on scorpion stings from the two emergency medical service providers in Touggourt, Algeria. Count regression models were applied to the daily sting data.
RESULTS A total of 4,712 scorpion sting cases were documented, of which 70% occurred in people aged between 10 and 49 years. The male-to-female ratio was 1.3. The upper and lower limbs were the most common locations of scorpion stings (90.4% of cases). Most stings (92.8%) were mild. In total, 68.8% of people were stung inside dwellings. The hourly distribution of stings showed a peak between 10:00 a.m. and 11:00 a.m. The daily number of stings ranged from 0 to 24. The occurrence of stings was highest on Sundays. The incidence of scorpion stings increased sharply in the summer. The mean annual incidence rate was 542 cases per 100,000 inhabitants. The fitted count regression models showed that a negative binomial hurdle model was appropriate for forecasting daily stings as a function of temperature and relative humidity, and the fitted values agreed closely with the observed data.
CONCLUSIONS This study showed that daily scorpion sting data provided meaningful insights and that the negative binomial hurdle model was preferable for predicting daily scorpion sting counts.


Negative Binomial
We can relax the variance assumption of Poisson regression and allow for an over-dispersion parameter by using the negative binomial (NB) model. The NB model accounts for over-dispersion by adding an error term, $\varepsilon_{ij}$, to the conditional mean of the Poisson regression model (Sheu et al., 2004), i.e., $\mu_{ij} = \exp(x_{ij}'\beta + \varepsilon_{ij})$. We normally assume that $\exp(\varepsilon_{ij})$ has a gamma distribution with mean 1 and variance $\alpha$, so that the conditional mean of $y_{ij}$ is still $\mu_{ij}$ but the conditional variance of $y_{ij}$ becomes $\mu_{ij}(1 + \alpha\mu_{ij})$. As $\alpha$ approaches zero, the distribution of $y_{ij}$ approaches the Poisson distribution, and as $\alpha$ becomes larger the distribution becomes more dispersed. The NB probability distribution for participant $i$ at dose $j$ is given by:

$$\Pr(Y_{ij} = y_{ij}) = \frac{\Gamma(y_{ij} + 1/\alpha)}{y_{ij}!\,\Gamma(1/\alpha)} \cdot \frac{(\alpha\mu_{ij})^{y_{ij}}}{(1 + \alpha\mu_{ij})^{y_{ij} + 1/\alpha}} \qquad (2)$$

where $\mu_{ij}$, $\alpha$, and $\Gamma(\cdot)$ refer to the mean of the count distribution, the NB dispersion parameter, and the gamma function, respectively. The NB model is generally adequate for addressing over-dispersion due to unobserved heterogeneity and/or temporal dependency, but may be inadequate for over-dispersion resulting from excess zeroes.
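The mean and variance properties of the NB distribution can be checked numerically. The sketch below assumes the standard NB form with mean mu and dispersion alpha (the form referred to as Eq. (2)); the function names and the parameter values are illustrative and are not taken from the study.

```python
import math

def poisson_pmf(y, mu):
    """Poisson probability of count y with mean mu."""
    return math.exp(-mu + y * math.log(mu) - math.lgamma(y + 1))

def nb_pmf(y, mu, alpha):
    """NB probability of count y with mean mu and dispersion alpha (assumed Eq. (2) form)."""
    inv = 1.0 / alpha
    log_p = (math.lgamma(y + inv) - math.lgamma(y + 1) - math.lgamma(inv)
             + y * math.log(alpha * mu) - (y + inv) * math.log1p(alpha * mu))
    return math.exp(log_p)

# Illustrative values only, not estimates from the sting data.
mu, alpha = 3.0, 0.5
ys = range(0, 400)  # the upper tail beyond 400 is negligible for these values
probs = [nb_pmf(y, mu, alpha) for y in ys]

total = sum(probs)
mean = sum(y * p for y, p in zip(ys, probs))
var = sum((y - mean) ** 2 * p for y, p in zip(ys, probs))

print("sum of probabilities:", total)
print("mean:", mean, "variance:", var, "mu(1 + alpha*mu):", mu * (1 + alpha * mu))

# With a very small alpha the NB probability is close to the Poisson probability.
print(nb_pmf(2, mu, 1e-6), poisson_pmf(2, mu))
```

The numerical variance matches mu(1 + alpha*mu), and shrinking alpha toward zero recovers the Poisson probabilities.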
In recent years, zero-inflated and hurdle models have gained popularity for modeling count data with excess zeroes. According to Cameron and Trivedi (1998), zero-inflated and hurdle models can be viewed as finite mixture models with a degenerate distribution whose mass is concentrated at zero. Excess zeroes arise when the event of interest is not experienced by many of the subjects.

Zero-Inflated Poisson
The zero-inflated Poisson (ZIP) distribution for participant $i$ at dose $j$ can be defined as

$$\Pr(Y_{ij} = y_{ij}) = \begin{cases} p_{ij} + (1 - p_{ij})\exp(-\mu_{ij}), & y_{ij} = 0 \\ (1 - p_{ij})\dfrac{\exp(-\mu_{ij})\,\mu_{ij}^{y_{ij}}}{y_{ij}!}, & y_{ij} > 0 \end{cases} \qquad (3)$$

The probability of being an excess zero ($p_{ij}$ in Eq. (3)) is often modeled using logistic regression. Here, for all zero-inflated and hurdle models, we use the logistic model to estimate $p_{ij}$. Hence, $p_{ij}$ is estimated using

$$p_{ij} = \frac{\exp(z_{ij}'\gamma)}{1 + \exp(z_{ij}'\gamma)}$$

where $p_{ij}$ is related to a set of explanatory variables ($z_{ij}$) through the coefficient vector $\gamma$. Zero-inflated models put more weight on the probability of observing a zero by using a mixing distribution. Hence, for the ZIP model in Eq. (3), the probability of observing a zero is the probability of an excess zero plus the probability of observing a zero in the Poisson model. As illustrated, the ZIP model allows for two separate processes. Conceptually, the first step models the structural zeroes (e.g., by logistic regression) and the second step models the Poisson distribution conditional on the excess zeroes, i.e., Poisson regression models the sampling zeroes and the counts. The mean and variance of the ZIP model are given by

$$E(Y_{ij}) = (1 - p_{ij})\mu_{ij}, \qquad \mathrm{Var}(Y_{ij}) = (1 - p_{ij})\mu_{ij}(1 + p_{ij}\mu_{ij})$$

It can be seen from the ZIP mean and variance that when $p_{ij}$ equals zero the ZIP model reduces to the standard Poisson model. In contrast, as $p_{ij}$ increases, the variance-to-mean ratio $1 + p_{ij}\mu_{ij}$ increases and the data exhibit greater over-dispersion. The over-dispersion accounted for in the ZIP model is conceptually a result of the structural zeroes. Interpretation of the ZIP model depends upon what is being modeled.
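The two-process structure of the ZIP model can be illustrated with a short numerical sketch. It assumes the usual ZIP form, in which a zero arises either as an excess zero with probability p or from the Poisson component, and it checks the mean (1 - p)mu and variance (1 - p)mu(1 + p*mu) by direct summation. The function names and parameter values are hypothetical.

```python
import math

def poisson_pmf(y, mu):
    """Poisson probability of count y with mean mu."""
    return math.exp(-mu + y * math.log(mu) - math.lgamma(y + 1))

def zip_pmf(y, mu, p):
    """ZIP probability: excess zero with probability p, otherwise Poisson(mu)."""
    base = (1.0 - p) * poisson_pmf(y, mu)
    return p + base if y == 0 else base

# Illustrative values only, not estimates from the sting data.
mu, p = 4.0, 0.3
ys = range(0, 200)
probs = [zip_pmf(y, mu, p) for y in ys]

zip_total = sum(probs)
zip_mean = sum(y * q for y, q in zip(ys, probs))
zip_var = sum((y - zip_mean) ** 2 * q for y, q in zip(ys, probs))

print("P(Y = 0):", probs[0], "vs Poisson zero:", poisson_pmf(0, mu))
print("mean:", zip_mean, "formula:", (1 - p) * mu)
print("variance:", zip_var, "formula:", (1 - p) * mu * (1 + p * mu))
```

Setting p to zero in `zip_pmf` returns the plain Poisson probabilities, which matches the reduction to the standard Poisson model.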

Zero-Inflated Negative Binomial (ZINB)
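Zero-inflated negative binomial (ZINB) models are sometimes preferred because they allow for additional flexibility in the variance. Using Eq. (2), we can express the ZINB model for participant $i$ at dose $j$ as

$$\Pr(Y_{ij} = y_{ij}) = \begin{cases} p_{ij} + (1 - p_{ij})(1 + \alpha\mu_{ij})^{-1/\alpha}, & y_{ij} = 0 \\ (1 - p_{ij})\dfrac{\Gamma(y_{ij} + 1/\alpha)}{y_{ij}!\,\Gamma(1/\alpha)} \cdot \dfrac{(\alpha\mu_{ij})^{y_{ij}}}{(1 + \alpha\mu_{ij})^{y_{ij} + 1/\alpha}}, & y_{ij} > 0 \end{cases}$$

where all terms have been defined previously. The mean is as for the ZIP model, but the variance is given by

$$\mathrm{Var}(Y_{ij}) = (1 - p_{ij})\mu_{ij}\bigl(1 + \mu_{ij}(p_{ij} + \alpha)\bigr)$$

Note that the variance depends on $p_{ij}$ and the dispersion parameter $\alpha$. The ZINB model allows for added flexibility compared to the ZIP model. It allows for over-dispersion arising from both excess zeroes and heterogeneity, whereas the ZIP model only accommodates over-dispersion from excess zeroes.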
Interpretation of the ZINB model is as for the ZIP model.
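As a rough check of the ZINB mean and variance, the sketch below sums the ZINB probabilities directly. It assumes the standard form in which an excess zero occurs with probability p and counts otherwise follow an NB distribution with mean mu and dispersion alpha; names and values are illustrative.

```python
import math

def nb_pmf(y, mu, alpha):
    """NB probability of count y with mean mu and dispersion alpha."""
    inv = 1.0 / alpha
    return math.exp(math.lgamma(y + inv) - math.lgamma(y + 1) - math.lgamma(inv)
                    + y * math.log(alpha * mu) - (y + inv) * math.log1p(alpha * mu))

def zinb_pmf(y, mu, alpha, p):
    """ZINB probability: excess zero with probability p, otherwise NB(mu, alpha)."""
    base = (1.0 - p) * nb_pmf(y, mu, alpha)
    return p + base if y == 0 else base

# Illustrative values only, not estimates from the sting data.
mu, alpha, p = 3.0, 0.5, 0.2
ys = range(0, 400)
probs = [zinb_pmf(y, mu, alpha, p) for y in ys]

zinb_mean = sum(y * q for y, q in zip(ys, probs))
zinb_var = sum((y - zinb_mean) ** 2 * q for y, q in zip(ys, probs))

print("mean:", zinb_mean, "formula:", (1 - p) * mu)
print("variance:", zinb_var, "formula:", (1 - p) * mu * (1 + mu * (p + alpha)))
```

Both p and alpha inflate the variance, which is why the ZINB model can absorb over-dispersion from excess zeroes and from heterogeneity at the same time.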

Hurdle Models: Poisson and Negative Binomial
In contrast to zero-inflated models, hurdle models can be interpreted as two-part models. The first part is typically a binary response model and the second part is usually a truncated-at-zero count model (Cameron and Trivedi, 1998). Hence, the hurdle model is a modified count model in which the separate processes generating the zeroes and the positive counts are not constrained to be the same. This allows us to interpret the positive outcomes (> 0) as resulting from passing the zero hurdle (threshold). The hurdle portion of the two-part model estimates the probability that the threshold is crossed. Theoretically the threshold could be any value, but it is usually taken as zero because this is most often meaningful in the context of the study objectives. Mullahy (1986) laid out the basic form of hurdle count models. Assume that $f_1$ and $f_2$ are any probability density functions for nonnegative integers. A hurdle model can be expressed as

$$\Pr(Y = y) = \begin{cases} f_1(0), & y = 0 \\ \dfrac{1 - f_1(0)}{1 - f_2(0)}\,f_2(y) = \bigl(1 - f_1(0)\bigr) f_2'(y), & y > 0 \end{cases} \qquad (6)$$

Note that $f_1(\cdot)$ governs the hurdle part and $f_2(\cdot)$ the count process once the hurdle has been crossed. Furthermore, $1 - f_1(0)$ is the probability of crossing the hurdle and $f_2'(y) = f_2(y) / (1 - f_2(0))$ is the truncated normalization of $f_2(\cdot)$.
Note that if 1 () = 2 () the hurdle model collapses to the standard count model. Hurdle models can be specified in various ways by choosing different distributions for 1 ()and 2 () As for the zero-inflated models we use logistic regression to model p. Here we define two hurdle models by specifying f2_•_ as the Poisson and NB distributions. For example, substitution of Eq. (1) into Eq. (6)  All terms are as defined previously and specification of the log-likelihood can be obtained using Eq. (7). The expected value for the Poisson hurdle (PH) model is given by

Substitution of Eq.
(2) into Eq. (6) for 2 () results in the Negative Binomial hurdle (NBH) model. Computing the expected value for the NBH model is as for the PH model.
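The hurdle construction can be sketched generically: a zero occurs with probability f1(0), and positive counts follow f2 truncated at zero and rescaled by the probability of crossing the hurdle. The code below is a minimal illustration under that assumption, with the Poisson and NB distributions plugged in as f2. The function names, the parameter values, and the closed-form NBH mean (derived by analogy with the PH case) are assumptions made for illustration.

```python
import math

def poisson_pmf(y, mu):
    """Poisson probability of count y with mean mu."""
    return math.exp(-mu + y * math.log(mu) - math.lgamma(y + 1))

def nb_pmf(y, mu, alpha):
    """NB probability of count y with mean mu and dispersion alpha."""
    inv = 1.0 / alpha
    return math.exp(math.lgamma(y + inv) - math.lgamma(y + 1) - math.lgamma(inv)
                    + y * math.log(alpha * mu) - (y + inv) * math.log1p(alpha * mu))

def hurdle_pmf(y, p_zero, f2):
    """Hurdle probability: p_zero = f1(0); f2 is the untruncated count pmf."""
    if y == 0:
        return p_zero
    return (1.0 - p_zero) * f2(y) / (1.0 - f2(0))

# Illustrative values only, not estimates from the sting data.
mu, alpha, p_zero = 2.0, 0.5, 0.4
ys = range(0, 400)

ph = [hurdle_pmf(y, p_zero, lambda k: poisson_pmf(k, mu)) for y in ys]
nbh = [hurdle_pmf(y, p_zero, lambda k: nb_pmf(k, mu, alpha)) for y in ys]

ph_mean = sum(y * q for y, q in zip(ys, ph))
nbh_mean = sum(y * q for y, q in zip(ys, nbh))

# Closed forms: mean of the zero-truncated count model times P(crossing the hurdle).
ph_formula = (1 - p_zero) * mu / (1 - math.exp(-mu))
nbh_formula = (1 - p_zero) * mu / (1 - (1 + alpha * mu) ** (-1 / alpha))

print("PH mean:", ph_mean, "formula:", ph_formula)
print("NBH mean:", nbh_mean, "formula:", nbh_formula)
```

Unlike the zero-inflated models, all zeroes here come from the hurdle part, so the probability of a zero is exactly `p_zero` regardless of the count distribution chosen for f2.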

Score test
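A score test, first developed by Dean and Lawless (1989), evaluates whether the amount of over-dispersion in a Poisson model is sufficient to violate the basic assumptions of the model. It may be defined as:

$$z_i = \frac{(y_i - \mu_i)^2 - y_i}{\mu_i\sqrt{2}}$$

where $y_i$ is the observed count and $\mu_i$ is the fitted Poisson mean for observation $i$. The test is post hoc, i.e., it is performed subsequent to modeling the data.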
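A minimal sketch of this kind of post hoc check is given below. It assumes the per-observation form z = ((y - mu)^2 - y) / (mu * sqrt(2)), computed from fitted Poisson means, and summarizes z with a one-sample t statistic for a mean of zero. The counts are made up for illustration, and the intercept-only fit (every fitted mean equal to the sample mean) is a simplifying assumption, not the model used in the study.

```python
import math
import statistics

def score_test_z(y, mu):
    """Per-observation score values from observed counts y and fitted Poisson means mu."""
    return [((yi - mi) ** 2 - yi) / (mi * math.sqrt(2)) for yi, mi in zip(y, mu)]

def t_statistic(z):
    """One-sample t statistic for the hypothesis that the mean of z is zero."""
    n = len(z)
    return statistics.mean(z) / (statistics.stdev(z) / math.sqrt(n))

# Hypothetical daily counts with many zeroes and a few large values.
y = [0, 0, 0, 1, 0, 12, 0, 3, 0, 9, 0, 0, 24, 1, 0, 5]

# Intercept-only Poisson fit: the maximum likelihood mean is the sample mean.
mu_hat = [statistics.mean(y)] * len(y)

z = score_test_z(y, mu_hat)
t_stat = t_statistic(z)
print("mean z:", statistics.mean(z), "t statistic:", t_stat)
```

A clearly positive t statistic points toward over-dispersion relative to the Poisson assumption; in practice mu_hat would come from the fitted regression model rather than from the sample mean.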

Lagrange multiplier test
The Lagrange multiplier test is evaluated using a chi-squared ($\chi^2$) test rather than the t-test probability required for the score test.

Likelihood-ratio test
The likelihood-ratio (LR) test is a commonly used comparative fit test. It is generally used for nested models, but has also been used to compare different models (e.g., whether data are better modeled using a negative binomial or a Poisson model). The traditional likelihood-ratio test is defined as

$$LR = -2(\mathcal{L}_R - \mathcal{L}_F)$$

where $\mathcal{L}_F$ is the log-likelihood for a full or more complete model and $\mathcal{L}_R$ is the log-likelihood for a reduced model.
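The LR statistic is simple to compute once the two log-likelihoods are available. The sketch below assumes the usual form, minus twice the difference between the reduced and full log-likelihoods, and refers the statistic to a chi-squared distribution with one degree of freedom, i.e., it assumes the two models differ by a single parameter. The log-likelihood values are hypothetical.

```python
import math

def lr_statistic(ll_full, ll_reduced):
    """Likelihood-ratio statistic from the full and reduced model log-likelihoods."""
    return -2.0 * (ll_reduced - ll_full)

def chi2_sf_df1(x):
    """Upper-tail probability of a chi-squared variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))

# Hypothetical log-likelihoods for a reduced and a full model.
ll_reduced = -3050.2
ll_full = -2987.6

lr = lr_statistic(ll_full, ll_reduced)
p_value = chi2_sf_df1(lr)
print("LR:", lr, "p-value (1 df):", p_value)
```

For models that differ by more than one parameter, the reference distribution needs the corresponding degrees of freedom, which this one-parameter helper does not cover.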

Akaike and Bayesian Information Criterion
The Akaike Information Criterion (AIC) was developed by Hirotsugu Akaike in 1974. However, it did not begin to enjoy widespread use until the twenty-first century. It is now one of the most commonly used fit statistics, if not the most commonly used, displayed in statistical model output. The second foremost contemporary comparative fit statistic for likelihood-based statistical models is the Bayesian Information Criterion (BIC). Like the AIC, this statistic has undergone a variety of parameterizations. The original formulation was given by the University of Washington's Adrian Raftery in 1986. The two criteria are commonly written as

$$\mathrm{AIC} = -2\mathcal{L} + 2k, \qquad \mathrm{BIC} = -2\mathcal{L} + k\ln(n)$$

where $\mathcal{L}$ is the model log-likelihood, $k$ is the number of predictors including the intercept, and $n$ represents the number of model observations. A smaller AIC or BIC indicates a better-fitting model.
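Both criteria can be computed directly from a model's log-likelihood. The sketch below assumes the common forms AIC = -2L + 2k and BIC = -2L + k ln(n); the log-likelihoods, parameter counts, and number of observations are hypothetical and are used only to show how candidate models would be ranked.

```python
import math

def aic(log_lik, k):
    """Akaike Information Criterion for log-likelihood log_lik and k parameters."""
    return -2.0 * log_lik + 2.0 * k

def bic(log_lik, k, n):
    """Bayesian Information Criterion for k parameters and n observations."""
    return -2.0 * log_lik + k * math.log(n)

# Hypothetical fits: (model name, log-likelihood, number of parameters).
n_obs = 1000
fits = [
    ("Poisson", -3050.2, 3),
    ("NB", -2987.6, 4),
    ("NB hurdle", -2960.1, 7),
]

scores = {name: (aic(ll, k), bic(ll, k, n_obs)) for name, ll, k in fits}
for name, (a, b) in scores.items():
    print(f"{name}: AIC = {a:.1f}, BIC = {b:.1f}")

best_by_aic = min(scores, key=lambda name: scores[name][0])
print("lowest AIC:", best_by_aic)
```

Because BIC penalizes each parameter by ln(n) rather than 2, it favors smaller models than AIC once n exceeds about 7 observations.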