### Data

The search sources used to obtain the data were: Medline, Econlit, Social Science Citation Index, regional Index Medicus, Eldis (for developing-country data), Commonwealth Agricultural Bureau (CAB), and the British Library for Development Studies Databases. The range of years was set at 1960 to the present. Data covering costs and charges were included.

The search terms used were: "costs and cost analysis" and hospital costs or health centre or the abbreviations HC (health centre) or PHC (primary health centre) or outpatient care. Sources in English, French, Spanish and Arabic were searched; no Arabic-language study was found. In addition, a number of studies were found in the grey literature, from such sources as electronic databases, government regulatory bodies, research institutions, and individual health economists known to the authors [2, 15–54]. Also included were data from a number of WHO-commissioned studies on unit costs.

A standard template was used for extracting data from all sources. Database variables include: ownership; level of facility (see Additional file 1: Annex 1 for a definition of facility types as coded in the unit cost database); number of beds; number of inpatient and outpatient specialties; cost data (cost per bed-day, outpatient visit, and admission); utilization data (bed-days, outpatient visits, admissions); types of cost included in the cost analysis (capital, drugs, ancillary, food) and whether they were based on costs or charges; capacity utilization (occupancy rate, average length of stay, bed turnover, and average number of visits per doctor per day); reference year for cost data; currency; and methods of allocation of joint costs. The database consists of unit-cost data from 49 countries for various years between 1973 and 2000, totalling 2173 country-years of observations. Some studies provided information on 100% of the variables described above; at the other extreme, some provided information on less than 15%. The number of observations used in this analysis was 1171 (see Additional file 1: Annex 2 for the percentage of missing data in the model variables and Additional file 1: Annex 3 for the list of countries).

Data cleaning comprised consistency checks and direct derivation of some of the missing variables, when possible, from other variables from the same observation (e.g., occupancy rate was calculated from number of beds and number of bed-days). STATA software was used for data analysis [55].
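As an illustration of the derivation mentioned above, the following minimal sketch computes occupancy rate from number of beds and bed-days. The function name, the 365-day reference period, and the example figures are assumptions made for illustration; they are not taken from the study database.

```python
def occupancy_rate(bed_days: float, beds: float, days_in_year: int = 365) -> float:
    """Share of available bed-days that were actually used."""
    if beds <= 0:
        raise ValueError("beds must be positive")
    return bed_days / (beds * days_in_year)


# Hypothetical facility: 120 beds and 35,040 bed-days in the reference year.
rate = occupancy_rate(bed_days=35_040, beds=120)
print(f"{rate:.2f}")  # 35,040 / (120 * 365) = 0.80
```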

Cost data were converted to 1998 International dollars by means of GDP deflators [56] and purchasing-power-parity exchange rates used for WHO's national health accounts estimates (PPP exchange rates used in this analysis are available from the WHO-CHOICE website: http://www.who.int/evidence/cea).
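The conversion can be sketched as follows, assuming that a cost is first re-priced to 1998 in local currency with the ratio of GDP deflators and then divided by the 1998 PPP exchange rate. The order of the two steps, the function name, and all numbers are illustrative assumptions rather than the procedure documented by the authors.

```python
def to_1998_international_dollars(cost_local, deflator_cost_year, deflator_1998, ppp_1998):
    """Re-price a local-currency cost to 1998, then convert at the 1998 PPP rate."""
    cost_1998_local = cost_local * (deflator_1998 / deflator_cost_year)
    return cost_1998_local / ppp_1998


# Hypothetical inputs: 500 local currency units (LCU) per bed-day in 1994,
# GDP deflator of 80 in 1994 and 100 in 1998, PPP rate of 12.5 LCU per I$ in 1998.
cost_i_dollars = to_1998_international_dollars(500.0, 80.0, 100.0, 12.5)
print(cost_i_dollars)  # 500 * (100 / 80) / 12.5 = 50.0
```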

### Data Imputation

Most statistical procedures rely on complete-data methods of analysis: computational programs require that all cases contain values for all variables to be analyzed. Thus, by default, most software programs exclude from the analysis observations with missing data on any of the variables (list-wise deletion). This can give rise to two problems: compromised analytical power, and estimation bias. The latter occurs, for example, if the probability that a particular value is missing is correlated with certain determinants. For example, if the complete observation sets tend to be from observations with unit costs that are systematically higher or lower than average, the conclusions for out-of-sample estimation drawn from an analysis based on list-wise deletion will be biased upwards or downwards [57].

There is a growing literature on how to deal with missing data in a way that does not require incomplete observation sets to be deleted, and several software programs have been developed for this purpose. If data are not missing in a systematic way, missing data can be imputed using the observed values for complete sets of observations as covariates for prediction purposes. Multiple imputation is an effective method for general-purpose handling of missing data in multivariate analysis; it allows subsequent analysis to take account of the level of uncertainty surrounding each imputed value, as described below [58–61]. The statistical model used for multiple imputation is the joint multivariate normal distribution. One of its main advantages is that it produces reliable estimates of standard errors: single imputation methods do not allow for the additional error introduced by imputation. In addition, the introduction of random error into the imputation process makes it possible to obtain largely unbiased estimates of all parameters [58].

In this study, multiple imputation was performed with *Amelia*, a statistical software program designed specifically for multiple imputation of missing data [57, 59, 62, 63]. First, five completed-data sets are created by imputing the unobserved data five times, using five independent draws from an imputation model. The model is constructed to approximate the true distributional relationship between the unobserved data and the available information. This reduces potential bias due to systematic difference between the observed and the unobserved data. Second, five complete-data analyses are performed by treating each completed-data set as an actual complete-data set; this permits standard complete-data procedures and software to be utilized directly. Third, the results from the five complete-data analyses are combined [64] to obtain the so-called repeated-imputation inference, which takes into account the uncertainty in the imputed values.
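The third step can be sketched for a single coefficient as follows. The sketch assumes the standard combination rules for multiple imputation (Rubin's rules): the pooled estimate is the mean across imputations, and its variance is the average within-imputation variance plus the between-imputation variance inflated by (1 + 1/m). The source only cites [64] for this step, so the exact procedure used by the authors may differ; the function name and input values are hypothetical.

```python
import numpy as np


def combine_imputations(estimates, variances):
    """Pool one coefficient across m completed-data analyses (Rubin's rules)."""
    estimates = np.asarray(estimates, dtype=float)
    variances = np.asarray(variances, dtype=float)
    m = len(estimates)
    pooled = estimates.mean()
    within = variances.mean()            # average squared standard error
    between = estimates.var(ddof=1)      # variance of the estimates across imputations
    total = within + (1 + 1 / m) * between
    return pooled, np.sqrt(total)


# Hypothetical coefficient estimates and squared standard errors from five imputed data sets.
est = [0.70, 0.72, 0.68, 0.71, 0.69]
var = [0.0025, 0.0024, 0.0026, 0.0025, 0.0025]
pooled, se = combine_imputations(est, var)
print(round(pooled, 3), round(se, 4))
```

The pooled standard error is larger than the average within-imputation standard error, which is how the uncertainty in the imputed values enters the final inference.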

### Model specifications

Following the tradition of using cost functions to explain observed variations in unit costs, we estimate a long-run cost function by Ordinary Least Squares (OLS) regression; the dependent variable is the natural log of cost per bed-day [2, 3, 6–8, 65]. The primary reason for using unit cost rather than total cost as the dependent variable is to avoid non-uniform error variance (heteroscedasticity) in the estimated regression, which could arise if total cost were used, because the error term could then be correlated with hospital size [2, 3]. The reason for using cost per bed-day rather than cost per admission is that "bed-days" are better than "admissions" as a proxy for such hospital services as nursing, accommodation and other "hotel services" [3], permitting more flexibility in the use of estimated unit costs.

As the relationship between unit costs and the explanatory variables is expected to be non-linear, the Cobb-Douglas transformation was used to approximate the normal distribution of the model variables. Natural logs were used. The Cobb-Douglas functional form can be written as follows:

#### Equation 2

$$
\ln(Y) = \delta + \alpha_1 \ln(X_1) + \alpha_2 \ln(X_2)
$$

where $\delta = \ln(\alpha_0)$. This function is non-linear in the variables $Y$, $X_1$ and $X_2$, but it is linear in the parameters $\delta$, $\alpha_1$ and $\alpha_2$, and can be readily estimated using Ordinary Least Squares [66].

Log transformation has the added advantage that coefficients can be readily interpreted as elasticities [3, 66].
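The log-linear form of Equation 2 and the elasticity interpretation can be illustrated with a short sketch. The data are simulated from a Cobb-Douglas function with hypothetical parameters (a multiplicative constant of 2.0 and exponents of 0.7 and -0.3), and the model is fitted with NumPy's least-squares solver; none of these values come from the study.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
x1 = rng.uniform(1.0, 10.0, n)
x2 = rng.uniform(1.0, 10.0, n)

# Hypothetical Cobb-Douglas data: Y = 2.0 * X1^0.7 * X2^(-0.3) * multiplicative noise.
noise = rng.normal(0.0, 0.05, n)
y = 2.0 * x1**0.7 * x2**-0.3 * np.exp(noise)

# Taking logs makes the model linear in the parameters, so OLS applies directly.
design = np.column_stack([np.ones(n), np.log(x1), np.log(x2)])
coef, *_ = np.linalg.lstsq(design, np.log(y), rcond=None)
delta_hat, alpha1_hat, alpha2_hat = coef

print(f"intercept delta: {delta_hat:.3f}")
print(f"elasticity with respect to X1: {alpha1_hat:.3f}")
print(f"elasticity with respect to X2: {alpha2_hat:.3f}")
```

In this setting a fitted slope of about 0.7 on ln(X1) means that a 1% increase in X1 is associated with roughly a 0.7% increase in Y, holding X2 constant.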

Therefore, the cost function specification of the OLS regression model may be written as:

#### Equation 3

$$
UC_i = \alpha + \sum_{k=1}^{9} \beta_k X_k + e
$$

where $UC_i$ is the natural log (ln) of cost per bed-day in 1998 international dollars in the $i$th hospital; $X_1$ is ln of GDP per capita in 1998 international dollars; $X_2$ is ln of occupancy rate; $X_3$ and $X_4$ are dummy variables indicating the inclusion of drug or food costs (included = 1); $X_5$ and $X_6$ are dummy variables for hospital levels 1 and 2 (the comparator is level 3 hospital); $X_7$ and $X_8$ are dummy variables indicating facility ownership (comparator is private not-for-profit hospitals); $X_9$ is a dummy variable controlling for USA data (USA = 1); and $e$ denotes the error term.

The choice of explanatory variables is partly related to economic theory and partly determined by the purpose of the exercise, which is to estimate unit costs for countries where the data are not available. In this case, the chosen explanatory variables must be available in the out-of-sample countries. Country-specific – or in the case of large countries such as China, province-specific – GDP per capita in international dollars (I $) is used as a proxy for level of technology [12–14]; occupancy rate as a proxy for level of capacity utilization; and hospital level as a proxy for case mix. Unit costs are expected to be correlated positively with GDP per capita and case mix and negatively with capacity utilization.

The inclusion of the seven control variables makes it possible to estimate unit cost for different purposes to suit different types of analysis – for example, cost per bed-day in a primary-level hospital, which does not provide drugs or food; or the cost in a tertiary level hospital, with drugs and food included.

The dummy for the USA was included because all data from that country were charges rather than costs and because there were a large number of observations from that country. Dummies for countries other than the USA with a large number of observations, such as China and the United Kingdom, were also tested, as was the use of dummy variables to capture whether the cost estimates included capital or ancillary costs. These variables were not included in the model which best fit the data. Utilization variables, such as number of bed-days or outpatient visits, and hospital indicators, such as average length of stay, were not included as explanatory variables because most out-of-sample countries do not have data on these variables, and prediction of unit costs would, therefore, be impossible.

### Model-fit

Regression diagnostics were used to judge the goodness-of-fit of the model. They included the tolerance test for multicollinearity and its reciprocal, the variance inflation factor (VIF), together with the adjusted R-squared and the F statistic of the regression model.
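The relationship between tolerance and the variance inflation factor can be sketched as follows. Each explanatory variable is regressed on the others; tolerance is one minus the R-squared of that auxiliary regression, and the VIF is its reciprocal. The function name and the simulated data are illustrative assumptions and do not reproduce the diagnostics reported for the study model.

```python
import numpy as np


def variance_inflation_factors(X):
    """VIF for each column of X (no intercept column), via auxiliary regressions."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    vifs = []
    for j in range(k):
        target = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        fitted = others @ np.linalg.lstsq(others, target, rcond=None)[0]
        ss_res = np.sum((target - fitted) ** 2)
        ss_tot = np.sum((target - target.mean()) ** 2)
        r_squared = 1.0 - ss_res / ss_tot
        vifs.append(1.0 / (1.0 - r_squared))
    return np.array(vifs)


# Simulated regressors: c is nearly collinear with a, while b is unrelated to both.
rng = np.random.default_rng(1)
a = rng.normal(size=300)
b = rng.normal(size=300)
c = a + 0.1 * rng.normal(size=300)

vif = variance_inflation_factors(np.column_stack([a, b, c]))
tolerance = 1.0 / vif
print(np.round(vif, 2), np.round(tolerance, 3))
```

Large VIF values (equivalently, tolerance values close to zero) for `a` and `c` flag the collinear pair, whereas the value for `b` stays close to 1.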

### Predicted values and uncertainty analysis

Two types of uncertainty arise from using statistical models: estimation uncertainty, which arises from not knowing β and α perfectly and is an unavoidable consequence of having a finite number of observations; and fundamental uncertainty, represented by the stochastic component, which results from unobservable factors that may influence the dependent variable but are not included in the explanatory variables [62]. To account for both types of uncertainty, statistical simulation was used to compute the quantities of interest, namely average cost per bed-day and the uncertainty around these estimates. Statistical simulation uses the logic of survey sampling to learn about any feature of the probability distribution of the quantities of interest, such as its mean or variance [62].

It does so in two steps. First, simulated parameter values are obtained by drawing random values from the data set to obtain a new value of the parameter estimate. This is repeated 1000 times. Then the mean, standard deviation, and 95% confidence interval around the parameter estimates are computed. Second, simulated predicted values of ŷ (the quantity of interest) are calculated, as follows: (1) one value is set for each explanatory variable; (2) taking the simulated coefficients from the previous step, the systematic component (g) of the statistical model is estimated, where g = f(X, β); (3) the predicted value is simulated by taking a random draw from the systematic component of the statistical model; (4) these steps are repeated 1000 times to produce 1000 predicted values, thus approximating the entire probability distribution of ŷ. From these simulations, the mean predicted value, standard deviation, and 95% confidence interval around the predicted values are computed. In this way, this analysis accounts for both fundamental and parameter uncertainty.
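The two steps can be sketched for a linear model on the log scale. The sketch assumes that simulated coefficients are drawn from a multivariate normal distribution centred on the point estimates, with the estimated covariance matrix, and that the stochastic component is normal with a known residual standard deviation. These distributional choices, the function name, and all numerical inputs are assumptions made for illustration and are not taken from the fitted model.

```python
import numpy as np


def simulate_predictions(beta_hat, cov_beta, sigma, x, n_sims=1000, seed=0):
    """Simulate predicted values that reflect parameter and fundamental uncertainty."""
    rng = np.random.default_rng(seed)
    # Estimation uncertainty: draw coefficient vectors around the point estimates.
    betas = rng.multivariate_normal(beta_hat, cov_beta, size=n_sims)
    # Systematic component g = f(X, beta) for each simulated coefficient vector.
    g = betas @ x
    # Fundamental uncertainty: one draw from the stochastic component per simulation.
    y_sim = rng.normal(loc=g, scale=sigma)
    lo, hi = np.percentile(y_sim, [2.5, 97.5])
    return y_sim.mean(), y_sim.std(ddof=1), (lo, hi)


# Hypothetical fitted model with an intercept and one slope.
beta_hat = np.array([1.0, 0.7])
cov_beta = np.array([[0.0100, 0.0000],
                     [0.0000, 0.0025]])
x = np.array([1.0, 2.0])   # intercept term and one chosen covariate value

mean, sd, (lo, hi) = simulate_predictions(beta_hat, cov_beta, sigma=0.3, x=x)
print(round(mean, 2), round(sd, 2), (round(lo, 2), round(hi, 2)))
```

The spread of the simulated values combines the variance contributed by the coefficients with the residual variance, so the resulting interval is wider than one based on parameter uncertainty alone.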

The predicted log of cost per bed-day, $\ln \widehat{UC}$, can then be calculated from:

#### Equation 4

$$
\ln \widehat{UC} = \hat{\alpha} + \sum_{k=1}^{n} \hat{\beta}_k X_k
$$

where $\hat{\alpha}$ and $\hat{\beta}_k$ are the estimated parameters, and $X_1, \dots, X_n$ are the independent variables. If $\hat{a} = e^{\hat{\alpha}}$ and $X = \ln x$, back-transforming Equation 4 (reduced to 1 independent log-transformed variable for simplicity) gives the power function.

#### Equation 5

$$
\widehat{UC} = \hat{a}\, x^{\hat{\beta}}
$$

where $\widehat{UC}$ denotes a biased estimate of the mean cost per bed-day due to back-transformation. This is because one of the implicit assumptions of using log-transformed models is that the least-squares regression residuals in the transformed space are normally distributed. In this case, back-transforming to estimate unit costs gives the median and not the mean. To estimate the mean it is necessary to use a bias correction technique. The smearing method described by Duan (1983) was used to correct for the back-transformation bias [67]. The smearing method is non-parametric, since it does not require the regression errors to have any specified distribution (e.g., normality). If the $n$ residuals in log space are denoted by $r_i$, and $b$ is the base of logarithm used, the smearing correction factor, $\hat{C}_b$, for the logarithmic transformation is given by:

#### Equation 6

$$
\hat{C}_b = \frac{1}{n} \sum_{i=1}^{n} b^{r_i}
$$

Multiplying the right side of Equation 5 by Equation 6 almost removes the bias, so that:

#### Equation 7

$$
\widetilde{UC} = \hat{C}_b\, \hat{a}\, x^{\hat{\beta}}
$$

where $\widetilde{UC}$ is the bias-corrected estimate of the mean cost per bed-day. The smearing correction factor ($\hat{C}_b$) for our model was 1.25.
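The smearing correction can be sketched on simulated data as follows. The factor is computed as the mean of the back-transformed residuals of a log-scale regression and is then multiplied into the back-transformed prediction. The simulated model, its parameters, and the function name are assumptions made for illustration; the factor of 1.25 reported above comes from the study data and is not reproduced here.

```python
import numpy as np


def smearing_factor(residuals, base=np.e):
    """Smearing factor: mean of base**residual over the log-scale residuals."""
    residuals = np.asarray(residuals, dtype=float)
    return np.mean(np.power(base, residuals))


# Hypothetical log-linear model: ln(y) = 1.0 + 0.7 * ln(x) + error.
rng = np.random.default_rng(2)
x = rng.uniform(1.0, 10.0, 5000)
log_y = 1.0 + 0.7 * np.log(x) + rng.normal(0.0, 0.5, 5000)

design = np.column_stack([np.ones_like(x), np.log(x)])
coef, *_ = np.linalg.lstsq(design, log_y, rcond=None)
residuals = log_y - design @ coef

factor = smearing_factor(residuals)
naive = np.exp(design @ coef)     # back-transformed prediction without correction
corrected = factor * naive        # smearing-corrected prediction of the mean

print(f"smearing factor: {factor:.3f}")
print(f"mean observed: {np.exp(log_y).mean():.2f}")
print(f"mean naive: {naive.mean():.2f}, mean corrected: {corrected.mean():.2f}")
```

Because the residuals of a regression with an intercept average zero, the factor is at least 1, so the uncorrected back-transformation understates the mean on the original scale.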