Predicting over-the-counter antibiotic use in rural Pune, India, using machine learning methods

OBJECTIVES Over-the-counter (OTC) antibiotic use can cause antibiotic resistance, threatening global public health gains. To counter OTC use, this study used machine learning (ML) methods to identify predictors of OTC antibiotic use in rural Pune, India. METHODS The features of OTC antibiotic use were selected using stepwise logistic, lasso, random forest, XGBoost, and Boruta algorithms. Regression and tree-based models with all confirmed and tentatively important features were built to predict the use of OTC antibiotics. Five-fold cross-validation was used to tune the models’ hyperparameters. The final model was selected based on the highest area under the receiver operating characteristic curve (AUROC) with a 95% confidence interval (CI) and the lowest log-loss. RESULTS In rural Pune, the prevalence of OTC antibiotic use was 35.9% (95% CI, 31.6 to 40.5). The perception that buying medicines directly from a medicine shop/pharmacy is useful, using antibiotics for eye-related complaints, more household members consuming antibiotics, and longer duration and higher doses of antibiotic consumption in rural blocks and other social groups were confirmed as important features by the Boruta algorithm. The final model was the XGBoost+Boruta model with 7 predictors (AUROC, 0.934; 95% CI, 0.891 to 0.978; log-loss, 0.279). CONCLUSIONS XGBoost+Boruta, with 7 predictors, was the most accurate model for predicting OTC antibiotic use in rural Pune. Using OTC antibiotics for eye-related complaints, higher consumption of antibiotics, and the perception that buying antibiotics directly from a medicine shop/pharmacy is useful were identified as key factors for planning interventions to improve awareness about proper antibiotic use.


INTRODUCTION
The emergence of antimicrobial-resistant (AMR) bacterial species that are beyond the reach of medical treatment is a consequence of the over-the-counter (OTC) consumption of antibiotics in human and veterinary medicine [1][2][3][4]. The misuse and overuse of antibiotics, along with self-medication, have accelerated the rise of AMR in bacteria. According to a World Health Organization report, 50% of antibiotic prescriptions worldwide are inappropriate, with India being one of the largest consumers of these drugs [5][6][7]. The prevalence of OTC antibiotic practices in India can be linked to its highly privatised healthcare infrastructure, informal sectors, and the widespread availability of retail medical stores that sell medicines without valid prescriptions [1]. Previous studies have indicated that the high volume of antibiotic consumption in India [8] is associated with a lack of public knowledge, resource limitations in rural areas, the close proximity of retail pharmacies to the population, cultural practices, inadequate formal healthcare services, and a weak regulatory framework and law enforcement [1,2,4,9]. In an effort to promote antibiotic stewardship, India has enacted the Drugs and Cosmetics Act, 1940, the Drugs and Cosmetics Rules, 1945, and Schedule H1 (an amendment to Schedule H, 2014), and has launched a public awareness campaign known as "Medicines with the Red Line" [1,5,9]. Despite these measures, the OTC sale of antibiotics continues to be a widespread practice in the country. Recently, Kerala became the first state in India to initiate Operation Amrith ("Antimicrobial Resistance Intervention for Total Health"). This operation involves conducting surprise inspections at retail medical shops to curb the OTC sale of antibiotics. Additionally, a toll-free number (18004253182) has been established for the public to report complaints against medical shops. Each complaint is forwarded to the relevant zonal office for investigation, and prompt departmental action is taken if any violations are found [10].
As a step forward in antibiotic stewardship, global studies have utilised artificial intelligence (AI) and machine learning (ML) methods to predict AMR across various bacterial strains [11,12] and to assess the susceptibility of bacterial species to AMR, guiding antibiotic prescriptions with personalised antibiograms. After training with whole-genome sequencing data, several machine learning algorithms, such as support vector machines (SVM), logistic regression (LR) models, and random forests (RF), have demonstrated high accuracy in predicting AMR [12,13]. The efficacy of deep learning algorithms in identifying new antibiotics, AMR genes, and AMR peptides has also been recently established [14,15]. Studies employing "off-the-shelf" supervised ML algorithms to create predictive models for antibiotic prescribing have yielded promising results, indicating that ML-based solutions can offer essential tools to assist in antimicrobial prescribing and contribute to the fight against AMR [16][17][18]. Despite these promising results in controlled environments [16][17][18], the current literature indicates that the application of predictive models to support clinical decisions in antibiotic prescribing and antimicrobial management remains limited and has not yet fully leveraged the significant advancements in data and algorithm development [11,16]. The research has primarily relied on available secondary datasets for conducting AI and ML analyses, with very few studies situated in low-income and middle-income countries, particularly in India.
However, in addition to hospital and laboratory settings, it is essential to implement antibiotic stewardship interventions in community settings. This approach recognises and addresses the behaviours and preferences of both community members and healthcare providers. Against this backdrop, our study sought to identify predictors of OTC antibiotic use in the rural areas of Pune district, India. By employing ML methods on a primary dataset, our study contributes to the identification of these predictors of OTC antibiotic use.

Study design
For primary data collection, a cross-sectional descriptive study was conducted in 2 blocks, Junnar and Mulshi, of Pune district, Maharashtra, to understand antibiotic usage. These blocks were selected based on their proximity to urban settings, with Junnar being distant and tribal, and Mulshi being closer to Pune City and rural.

Sampling
Pune district is divided into 2 rural sub-divisions. The first, Shirur, is relatively more distant from urban Pune and includes the Junnar, Ambegaon, Khed, and Shirur blocks. The second, Maval, is more accessible and comprises the Maval and Mulshi blocks. These 2 sub-divisions, consisting of 6 blocks, served as the sampling frame for our study. From these, 2 blocks, Junnar and Mulshi, were randomly selected. Within these blocks, a total of 23 villages were chosen: 12 from Junnar and 11 from Mulshi. These villages were selected based on their higher human and livestock populations, using a proportionate sampling approach that accounted for both human and animal population sizes.

Data collection
Data collection was conducted in 2 phases within the Pune district of Maharashtra State. The first phase included key informant interviews and focus group discussions. Based on the insights gained from the first phase, 3 distinct semi-structured interview schedules were developed for the second phase. This subsequent phase involved gathering both quantitative and qualitative data through semi-structured interviews to understand the perspectives of community members, farmers, and healthcare and veterinary care practitioners on antibiotic use.

Variables and datasets
The analysis utilised quantitative data from semi-structured interviews. The outcome or dependent variable, OTC antibiotic use, was defined as a binary variable. It was coded as 0 when doctors prescribed antibiotics and household members obtained them from the pharmacy, and as 1 when individuals purchased antibiotics from the pharmacy without a doctor's advice. This latter category included instances where antibiotics were self-purchased, used from an old prescription, shared by friends, neighbours, or relatives, or suggested and purchased at the pharmacy.
The analysis included a total of 29 predictor/independent variables, which encompassed (1) socio-demographic characteristics of the households, (2) help-seeking behaviour, (3) causes, duration, dosage, and the number of household members who used antibiotics in the past year, and (4) knowledge, awareness, and perceptions about antibiotics. A detailed description of the predictor/independent variables can be found in Supplementary Material 1.
A total of 458 households participated in the survey. Following the exclusion of missing values and non-responses, 443 households remained for inclusion in the analysis. The dataset was randomly split into a training dataset (70% of cases, n = 311) and a testing dataset (30% of cases, n = 132) for the purpose of selecting predictors and developing ML models. We employed 5-fold cross-validation on the training dataset for hyperparameter tuning to minimise prediction error. The performance of the model was assessed using the testing dataset.
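The split and resampling scheme described above can be sketched as follows. This is an illustrative Python sketch, not the authors' code (the study was run in R); the function name, the seed, and the use of a simple unstratified shuffle are assumptions made for the example.

```python
import random

def split_and_folds(n, train_n, k, seed=0):
    """Shuffle row indices, hold out a test set, and cut the training rows into k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)          # seed is arbitrary, for reproducibility only
    train, test = idx[:train_n], idx[train_n:]
    folds = [train[i::k] for i in range(k)]   # k disjoint validation folds
    return train, test, folds

# Sizes taken from the text: 443 households, 311 for training, 132 for testing.
train, test, folds = split_and_folds(n=443, train_n=311, k=5)

for i, val in enumerate(folds, start=1):
    held_out = set(val)
    fit = [j for j in train if j not in held_out]
    # A model would be fitted on `fit` and scored on `val` for each candidate hyperparameter.
    print(f"fold {i}: fit on {len(fit)} rows, validate on {len(val)} rows")
```

The test rows are never touched during tuning; they are used once, for the final performance estimate.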

Statistical analysis
All analyses were conducted in RStudio using R version 4.2.3 (R Foundation for Statistical Computing, Vienna, Austria). The exploratory data analysis utilised a complete dataset, with categorical variables described in terms of counts and percentages (%). To examine the association between categorical predictor variables and OTC antibiotic use, we applied the chi-square test of independence. We considered results statistically significant at p-value ≤ 0.05. We calculated the estimated proportions of OTC antibiotic use and their 95% confidence intervals (CIs) using the method proposed by Agresti-Coull, which was implemented with the "prevalence" package [19]. In the Agresti-Coull's CI formula,
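For reference, the standard Agresti-Coull interval adds z²/2 successes and z²/2 failures to the observed counts and then applies the usual Wald formula to the adjusted proportion. The Python sketch below is illustrative only (the study used the R "prevalence" package); the count of 159 OTC users among 443 households is an assumption inferred from the reported 35.9%, not a figure stated in the text.

```python
from math import sqrt
from statistics import NormalDist

def agresti_coull(x, n, level=0.95):
    """Agresti-Coull confidence interval for a binomial proportion x/n."""
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)   # about 1.96 for a 95% interval
    n_adj = n + z ** 2                              # adjusted sample size
    p_adj = (x + z ** 2 / 2) / n_adj                # adjusted proportion
    half = z * sqrt(p_adj * (1 - p_adj) / n_adj)    # half-width of the interval
    return max(0.0, p_adj - half), min(1.0, p_adj + half)

# Assumed count: 159 of 443 households (inferred from the reported 35.9%).
lo, hi = agresti_coull(159, 443)
print(f"{159 / 443:.1%} (95% CI, {lo:.1%} to {hi:.1%})")
```

Under that assumed count, the bounds round to the interval reported in the Results (31.6 to 40.5).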

Selecting predictors
The predictors of OTC antibiotic use were identified by applying LR, the least absolute shrinkage and selection operator (lasso), and Boruta algorithms to the training dataset using the "caret" package.
LR employs the Akaike information criterion (AIC) for stepwise predictor selection. It eliminates predictors with a p-value greater than 0.10 and compares the AIC of the reduced model at each step to the AIC of the preceding model. The variables that remain in the LR model with the lowest AIC are considered the final predictors. The lasso algorithm, also referred to as L1 penalised/regularised regression, reduces the regression coefficients of unimportant variables to zero [20]. The predictors/variables with non-zero coefficients of the lasso regression model were selected as the final predictors.
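To see why an L1 penalty removes predictors outright rather than merely shrinking them, consider the special case of an orthonormal design with the objective ½‖y − Xβ‖² + λ‖β‖₁, where the lasso solution is the soft-thresholded least-squares coefficient. The Python sketch below illustrates only that special case; the predictor names, coefficient values, and λ are hypothetical and are not estimates from this study.

```python
def soft_threshold(beta, lam):
    """Shrink a coefficient towards zero by lam; set it to exactly zero if |beta| <= lam."""
    if beta > lam:
        return beta - lam
    if beta < -lam:
        return beta + lam
    return 0.0

# Hypothetical unpenalised coefficients (illustrative values only).
ols = {
    "useful_to_buy_direct": 1.40,
    "eye_complaint": 0.85,
    "spouse_decides": 0.05,
    "accident": -0.10,
}
lam = 0.20  # hypothetical penalty; in practice it is tuned by cross-validation

lasso = {name: soft_threshold(b, lam) for name, b in ols.items()}
selected = [name for name, b in lasso.items() if b != 0.0]
print(selected)
```

Coefficients smaller in magnitude than λ are dropped, and the survivors are shrunk by λ, which is the selection rule described above (keep predictors with non-zero coefficients).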
The Boruta algorithm, which is based on the RF approach, generates dummy, or shadow, variables corresponding to each of the dataset's original predictor or independent variables. It then employs a random forest classifier to compare the original predictors with their shadow counterparts using the mean decrease in accuracy and calculates z-scores. An equality test is used to compare the maximum z-score of the shadow predictors against that of the original predictors. If the z-score of an original predictor exceeds the maximum z-score of its shadow, the predictor is retained in the training dataset; otherwise, both the original and its shadow predictor are removed from the dataset. This iterative process continues until all predictors are classified as "confirmed," "rejected," or "tentatively important (tntv)" [21]. The predictors identified by the Boruta algorithm as "confirmed important (cnf)" and "tntv" are collectively referred to as "non-rejected predictors (nonrej)."
RF is an ensemble algorithm based on the "bagging" approach, which stands for "bootstrap aggregating." It constructs multiple independent decision tree classifiers (ntree) using a subset of randomly selected variables and two-thirds of bootstrap sample data. The algorithm then validates the predictions with the remaining one-third of the data, known as "out-of-bag" data. RF combines the predictions from all the decision trees, which are trained in parallel, and determines the final predicted class of the outcome variable by the "majority vote" of all the predictions [22]. The extreme gradient boosting tree (XGBtree) algorithm is another ensemble method that enhances prediction accuracy through gradient boosting. Unlike RF, XGBtree builds decision tree classifiers sequentially, learning from the prediction errors of the preceding tree to minimise the error in the subsequent tree. The final prediction is the sum of all individual tree predictions [23,24]. Both the RF and XGBtree algorithms utilise all available variables/predictors, and variable importance (VarImp) is crucial for understanding the significance of these variables/predictors in the model. However, to effectively plan targeted program intervention strategies to reduce the OTC use of antibiotics, it is essential to identify the most important predictors. Therefore, 3 sets of predictors were employed to develop the RF and XGBtree models: (1) all 29 predictors, (2) nonrej selected using the Boruta algorithm, and (3) confirmed important predictors (cnf) also selected using Boruta [25].
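The shadow-variable idea behind Boruta can be sketched in a few lines. This Python sketch is a deliberate simplification and not the Boruta implementation: it swaps the random forest importance z-score for the absolute Pearson correlation with the outcome, and it only counts "hits" instead of applying Boruta's statistical test to confirm or reject predictors. All data, names, and seeds are synthetic assumptions.

```python
import random

def abs_corr(xs, ys):
    """Absolute Pearson correlation, used here as a stand-in importance score."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    return abs(sxy) / (sxx * syy) ** 0.5 if sxx > 0 and syy > 0 else 0.0

def shadow_hits(features, y, n_iter=50, seed=1):
    """Count how often each feature beats the best shuffled (shadow) feature."""
    rng = random.Random(seed)
    hits = {name: 0 for name in features}
    for _ in range(n_iter):
        shadow_max = 0.0
        for col in features.values():
            shadow = col[:]
            rng.shuffle(shadow)               # shuffling breaks any link with y
            shadow_max = max(shadow_max, abs_corr(shadow, y))
        for name, col in features.items():
            if abs_corr(col, y) > shadow_max:
                hits[name] += 1
    return hits

# Synthetic data: one perfectly informative feature and two pure-noise features.
rng = random.Random(0)
y = [i % 2 for i in range(200)]
features = {
    "signal": y[:],
    "noise_a": [rng.random() for _ in y],
    "noise_b": [rng.random() for _ in y],
}
hits = shadow_hits(features, y)
print(hits)
```

A feature that consistently beats the best shadow would be "confirmed", one that rarely does would be "rejected", and one in between would remain "tentative"; the real algorithm makes that call with a formal test on the hit counts.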

Developing predictive models
Initially, all 29 variables were included in the comprehensive LR model, and the "glmStepAIC" method was employed for the stepwise selection of predictors. The model that yielded the lowest AIC was deemed the final model, and the predictors that remained were chosen as the final predictors. The hyperparameters of lasso (λ), RF (mtry and ntree), and XGBtree (nrounds, max_depth, colsample_bytree, learning rate eta, gamma, min_child_weight, and subsample) were tuned using cross-validation. The regression coefficients of the selected variables of stepwise logistic and lasso regression, the variable importance from RF and XGBtree, and the mean variable importance with decisions about predictors from the Boruta algorithm are reported. The training dataset was used for selecting predictors, and 5-fold cross-validation was conducted to tune the hyperparameters of the models with selected predictors.
The selected predictors and the best-tuned hyperparameters were used to construct the StepLog and lasso regression models. The RF and XGBtree models were developed using 3 sets of predictors: all 29 predictors for RF and XGBtree; 9 non-rejected predictors for RF+Boruta (nonrej) and XGBtree+Boruta (nonrej); and 7 confirmed important predictors for RF+Boruta (cnf) and XGBtree+Boruta (cnf), each employing the optimally tuned hyperparameters. Model performance was assessed by calculating various metrics: the area under the receiver operating characteristic curve (AUROC) with a 95% CI using the "pROC" package, log-loss, accuracy, sensitivity, specificity, F1-score, and balanced accuracy, using the "ConfusionTableR" package, all based on the test dataset.
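The two probability-based metrics used for model selection can be computed directly from predicted probabilities. The Python sketch below is illustrative (the study used R packages); the labels and probabilities are made-up toy values, and the confidence interval for the AUROC is not computed here.

```python
from math import log

def log_loss(y_true, p_pred, eps=1e-15):
    """Mean negative log-likelihood of the true labels under the predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)          # clip to avoid log(0)
        total -= y * log(p) + (1 - y) * log(1 - p)
    return total / len(y_true)

def auroc(y_true, p_pred):
    """Probability that a random positive is scored above a random negative (ties count half)."""
    pos = [p for y, p in zip(y_true, p_pred) if y == 1]
    neg = [p for y, p in zip(y_true, p_pred) if y == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))

# Toy example: 1 = OTC use, 0 = prescribed use (made-up values).
y_true = [1, 0, 1, 0, 1, 0]
p_pred = [0.9, 0.2, 0.7, 0.4, 0.3, 0.1]

auc = auroc(y_true, p_pred)
ll = log_loss(y_true, p_pred)
print(f"AUROC = {auc:.3f}, log-loss = {ll:.3f}")
```

A higher AUROC means better ranking of users above non-users, while a lower log-loss means better-calibrated probabilities, which is why the two are read together when choosing the final model.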

Confusion matrix
Rows: predicted OTC antibiotic use, Yes (1) or No (0). Columns: actual OTC antibiotic use, Yes (1) or No (0), with row and column totals.
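The threshold-based metrics listed above all derive from the four cells of a 2×2 confusion matrix. The Python sketch below uses the standard definitions; the cell counts are made-up and are not the study's results.

```python
def confusion_metrics(tp, fp, fn, tn):
    """Standard classification metrics from the four confusion-matrix cells."""
    sensitivity = tp / (tp + fn)      # true positive rate (recall)
    specificity = tn / (tn + fp)      # true negative rate
    precision = tp / (tp + fp)        # positive predictive value
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": 2 * precision * sensitivity / (precision + sensitivity),
        "balanced_accuracy": (sensitivity + specificity) / 2,
    }

# Made-up counts: tp/fn are actual OTC users, fp/tn are actual non-users.
m = confusion_metrics(tp=8, fp=2, fn=1, tn=9)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```

Balanced accuracy averages sensitivity and specificity, so it is less affected than plain accuracy when OTC users and non-users are unequally represented in the test set.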

Ethics statement
The study was approved by the Institutional Ethics Committee of Savitribai Phule Pune University (Ref. No. SPPU/IEC/2020/84).

RESULTS
The socio-demographic profile, along with knowledge and practices regarding OTC antibiotic use in households, is presented in Table 1.
Of the 443 households surveyed, 217 (49.0%) were from the tribal Junnar block and 226 (51.0%) from the rural Mulshi block of Pune district. In the rural areas of Pune district, the use of OTC antibiotics was 35.9% (95% CI, 31.6 to 40.5). The use of OTC antibiotics was significantly higher for complaints related to the ear, nose, and throat (ENT) at 53.3% (95% CI, 36.1 to 69.8), eyes at 53.6% (95% CI, 42.0 to 64.9), and gastrointestinal system (GIS) at 43.7% (95% CI, 32.7 to 55.2). Additionally, in households where more than 1 person used OTC antibiotics, the usage rate was 46.1% (95% CI, 36.1 to 56.4). A significant 39.9% (95% CI, 34.0 to 46.2) of households spent less than 200 Indian rupees (Rs) on purchasing OTC antibiotics. Moreover, 62.5% (95% CI, 47.0 to 75.8) of households perceived that their health condition either did not improve or deteriorated after using antibiotics. Only 23.8% (95% CI, 15.9 to 34.0) of households were aware that not completing the prescribed antibiotic dosage could lead to a deterioration in health.
A strikingly large proportion of households, 97.5% (95% CI, 92.6 to 99.5), believed that the practice of buying antibiotics directly from the pharmacy was useful.
In the tribal block of Junnar, the use of OTC antibiotics was high, with 75.0% (95% CI, 40.1 to 93.7) for ENT complaints, 52.8% (95% CI, 37.0 to 68.0) for GIS issues, and 33.7% (95% CI, 25.3 to 43.2) for respiratory system-related complaints. In the rural block of Mulshi, OTC antibiotics were consumed by more than 1 person per household in 53.5% (95% CI, 38.9 to 67.5) of cases, for more than 10 days in 47.8% (95% CI, 36.2 to 59.5) of cases, and the use was highest at 69.4% (95% CI, 55.4 to 80.6) for eye-related complaints. In Junnar, 41.0% (95% CI, 29.5 to 53.5) of households reported that antibiotic medications were not affordable, and 35.6% (95% CI, 27.5 to 44.6) spent more than Rs 200 on purchasing these medicines. Meanwhile, in Mulshi, only one-fifth of the households reported the unaffordability of OTC antibiotics. In Junnar, 70.0% (95% CI, 47.9 to 85.7) of households perceived that their health condition was not cured or had deteriorated, 57.9% (95% CI, 36.2 to 76.9) reported problems after consuming the medications, and only 18.5% (95% CI, 7.7 to 37.2) reported that purchasing medicines directly from medicine shops or pharmacies was not useful. However, more than 95% of households in both blocks believed that antibiotics are beneficial for human health.
The regression coefficients and the importance of predictors/features are shown in Table 2.
The perception that buying antibiotics directly from the pharmacy is useful was the most important predictor/feature across all 9 algorithms. Antibiotics used for eye-related complaints ranked as the second most significant predictor. The third most important predictor, according to regression and RF algorithms, was the greater distance of households from healthcare facilities; however, this was not supported by the Boruta algorithm. Rural blocks and membership in other social groups were deemed important by the Boruta algorithm. Additionally, the Boruta algorithm highlighted the significance of having more than 1 person in a household consuming antibiotics, taking antibiotics for longer than 10 days, and administering more than 2 doses as important factors. Completing the prescribed antibiotic course was also considered a tntv feature by the Boruta algorithm. The stepwise LR (StepLog) and lasso regression algorithms identified 3 key features as significant predictors: assistance from government healthcare facilities, antibiotics used for respiratory complaints, and the general usefulness of antibiotics for humans. The Boruta algorithm distinguished 7 confirmed and 2 tntv features. The variable importance as determined by the Boruta algorithm is depicted in Figure 1.
The results from evaluating the models' prediction performance are shown in Table 3.
The final StepLog model had an AIC of 168.52 and included 14 predictors. Its log-loss was 0.378, which was higher than that of other prediction models, and it also had the lowest accuracy (0.864), specificity (0.853), F1-score (0.786), and balanced accuracy (0.872). For the lasso model, the optimally tuned λ was 0.021, which utilised 9 predictors and achieved a log-loss of 0.326. This model also had the highest sensitivity (0.971) for predicting the use of OTC antibiotics. All RF models were set with ntree = 500. The mtry was 15 for the RF model with all predictors and 2 for the RF+Boruta models with 9 non-rejected and 7 confirmed important predictors. The best-tuned hyperparameters for all 3 XGBtree models were: nrounds at 100, max_depth at 20, eta at 0.1, gamma at 0, min_child_weight at 1, and subsample at 1. The hyperparameter colsample_bytree was set at 0.5, 0.

DISCUSSION
Our study aimed to identify predictors of OTC antibiotic use in rural communities through the application of ML methods. To the best of our knowledge, this is the first study to employ ML methods to investigate predictors of OTC antibiotic use based on a primary dataset. To minimise geographical and demographic biases, we included multiple study sites, with one located near a city and another situated farther away.
Our study findings indicate that the most significant predictor of OTC antibiotic use was the belief that it is useful to purchase antibiotics directly from pharmacies. This behaviour underscores the cultural and socio-demographic closeness of pharmacists to the rural communities they serve, in contrast to medical doctors. The results also emphasise the need for regulatory interventions to curb OTC antibiotic use, as outlined in Kerala State's AMR intervention program, Operation Amrith [10]. Additionally, the use of antibiotics for eye-related and GIS complaints emerged as the second most significant predictor, likely due to the higher prevalence of these conditions. In our analysis, the XGBtree+Boruta (cnf) model with 7 predictors was identified as the most accurate in terms of prediction performance. This model outperformed
This study demonstrated the potential use of ML models for predicting OTC antibiotic use. ML models have proven to be helpful in the medical and health sciences, particularly in the areas of diagnosis and outcome prediction [26]. Previous research has suggested that the application of ML models in the healthcare industry, although still in the early stages, is primarily focused on the early diagnosis of chronic diseases, predicting future disease incidence, conducting epidemiological studies, and facilitating evidence-based decision-making [26][27][28][29][30][31]. There is also evidence supporting the use of AI and ML models to predict AMR among bacterial species based on whole genome sequencing [12,13,32-35]. As part of antibiotic stewardship efforts, AI and ML have been employed to guide targeted empiric antibiotic prescribing [14,16,18,36], to profile and analyse drug resistance and design targeted drug therapies [37,38], and in pharmacometrics [39] and antibiotic discovery [40]. Previously conducted studies in the health and medicine domains have employed several methods, including recursive decision tree-based models, XGBoost [41,42], a fuzzy logic model [43], AdaBoost, RF, convolutional neural networks, SVM, LR, lasso regression, and classification and regression trees [44,45].
As this study represents one of the initial attempts of its kind, we contend that employing AI and ML models can assist in the planning and enhancement of public health interventions in other states. This approach could mirror the successes of Operation Amrith in Kerala State [10], potentially increasing the novelty and impact of our study. Additionally, our findings highlight the imperative for more research into the patterns of OTC antibiotic usage that contribute to AMR. Such research should leverage AI and ML to inform targeted antibiotic therapies. Building on the results of our study, we advocate for further investigations that could guide the development of structured health interventions in rural Pune. There is also a pressing need for community-level health education interventions that focus on antibiotic stewardship and the broader implications of AMR.
To summarise, the perception that purchasing medicines directly from a pharmacy is useful, antibiotic use for eye-related complaints, longer durations of antibiotic use, higher doses of antibiotic medications, more household members using antibiotics, residence in the rural block, and membership in other social groups were confirmed as significant predictors of OTC antibiotic use. The XGBtree ML algorithm in conjunction with the Boruta feature selection method, which identified 7 significant predictors, emerged as the best model with the lowest prediction error. Predictions of OTC antibiotic use for individual households can be instrumental in devising intervention strategies aimed at curbing the non-prescription use of antibiotics in the rural areas of Pune district, Maharashtra.

Figure 1. Predictor selection by the random forest-based Boruta algorithm for predicting over-the-counter antibiotic use in rural Pune, India.

Table 1. Socio-demographic characteristics, reasons for antibiotic consumption, and knowledge and awareness about OTC antibiotic use

Table 2. Predictor/feature importance by various machine learning methods for predicting OTC antibiotic use in rural Pune, India