Early Cost Estimating for Road Construction Projects Using Multiple Regression Techniques

The objective of this study is to develop early cost estimating models for road construction projects using multiple regression techniques, based on 131 sets of data collected in the West Bank in Palestine. As cost estimates are required at the early stages of a project, consideration was given to ensuring that the input data for the regression models could be easily extracted from sketches or the scope definition of the project. 11 regression models are developed to estimate the total cost of a road construction project in US dollars; 5 of them include bid quantities as input variables and 6 include road length and road width. The coefficient of determination r2 for the developed models ranges from 0.92 to 0.98, which indicates that the values predicted by the models fit the real-life data well. The mean absolute percentage error (MAPE) of the developed regression models ranges from 13% to 31%. These results compare favorably with past research, which has shown that estimate accuracy in the early stages of a project is between ±25% and ±50%.


Introduction
The term early estimate is used to describe the process of predicting a project's cost before the design of the project is completed (Sanders et al., 1992). The technique is used to estimate one characteristic of a system, usually its cost, from other physical and/or performance characteristics of the system (Rose, 1982). This technique involves life cycle costs, a detailed database, and the application of multivariable correlation (Black, 1984).
Early cost estimating is considered the most significant starting process influencing the fate of a new project (Sodikov, 2005). The accuracy of cost estimation improves toward the end of the project as more detailed and precise information becomes available. The early or conceptual phase is the first phase of a project, in which the need is examined, alternatives are assessed, the goals and objectives of the project are established, and a sponsor is identified (Holm et al., 2005). At this stage, estimate accuracy is between ±25% and ±50% (Schexnayder et al., 2003) because project details are less defined.
Accurate cost estimation at the early phase of project development is crucial for planning and feasibility studies. Construction clients require early and accurate cost advice prior to site acquisition and commitment to build, so that they can make the right decision regarding the feasibility of the proposed project. However, a number of difficulties arise when conducting cost estimation during the early phase. Major problems include lack of preliminary information, lack of a database of works costs, lack of appropriate cost estimation methods, and the involvement of many environmental, political, social and external uncertainties. Given the significance of the problem, conventional tools such as regression analysis have been widely employed to tackle it.
In Palestine, the construction industry is one of the main economic driving sectors, supporting the Palestinian national economy. It contributes 26% of the Palestinian GDP. It also plays a basic role in providing homes, public facilities and infrastructure, absorbing the work force and improving the Palestinian national economy as a whole. However, many construction projects report poor cost, time and quality performance. According to Mahamid et al. (2010), cost overrun is a phenomenon in road construction projects in Palestine. Through a field study that investigated cost divergence in road construction based on data from 169 road construction projects, they concluded that 100% of the projects suffered from cost divergence. They found that among these 169 projects, 76.33% were underestimated and 23.67% were overestimated. Therefore, it is very important to develop accurate models that help reduce cost divergence in construction projects.
This study presents regression models that describe the total cost of a road construction project as a function of bill of quantity (BOQ) and project size (i.e. road length and width). The estimating models were developed based on data collected for 131 awarded road construction projects in the West Bank in Palestine. As these cost estimates are required at the early stages of the project, consideration was given to ensuring that the input data for the required model could be easily extracted from sketches or the scope definition of the project.

Objectives
The objectives of this study are:
- To develop preliminary cost estimating models of road construction projects as a function of unit quantities of road construction activities
- To develop early cost estimating models of road construction projects as a function of project size (i.e. road length and road width)

Literature Review
Ahuja et al. (1994) state that estimating is the primary function of the construction industry; the accuracy of cost estimates, from the early phase of a project through to the tender estimate, can affect the success or failure of a construction project. They also state that many failures of construction projects are caused by inaccurate estimates.
A cost estimate establishes the baseline of the project cost at different stages of development of the project. As Hendrickson et al. (1989) point out, a cost estimate at a given stage of project development represents a prediction provided by the cost engineer or estimator on the basis of available data. Gould (2005) defined an estimate as an appraisal, an opinion, or an approximation of the cost of a project prior to its actual construction. According to Jelen et al. (1983), estimating is the heart of the cost engineer's work and consequently it has received appropriate attention over the years. Three cost prediction models were developed by Christian and Newton (1998) in order to determine an accurate cost for road maintenance. These models were developed in the province of New Brunswick based on historical data for the period 1965-1994. Based on the models and the management review, it was concluded that maintenance funding needed to be increased by 25%.
Lowe et al. (2006) developed linear regression models to predict the construction cost of buildings, based on 286 sets of data collected in the United Kingdom. They identified 41 potential independent variables and, through the regression process, showed five significant influencing variables such as gross internal floor area (GIFA), function, duration, mechanical installations, and piling.
Han et al. (2008) investigated the actual budgeting process in highway construction projects, in research collaboration with the Korean Ministry of Construction and Transportation. They then developed two-tiered cost estimation models for highway construction projects, considering the target goals for forecasting, the allowable accuracy, and the information available at each phase of project budgeting and initiation.
Recent work reveals that there are still many problems in cost estimation at the conceptual stage of a project cycle. The World Bank developed an international database of road construction costs in developing countries; the data were made available in the form of the ROad Costs Knowledge System (ROCKS). It was designed as an international knowledge system of road work costs, providing average and range unit costs based on historical data that could ultimately improve the reliability of new cost estimates. In that study, data from 65 developing countries were used to compare estimated costs at appraisal with actual costs at completion. Among these projects, 62% were overestimated and the rest underestimated (ROCKS, 2002).

Research Methodology
The methodological approach of this research is shown in Figure 1; each step is discussed in the following paragraphs.

Problem Definition
The objective of this study is to develop regression models that estimate the total cost of road construction projects as a function of bid quantities and project size.

Data Collection
The estimating technique requires an extensive historical database. The data were collected from contracts awarded in the West Bank by Palestinian agencies, the clients for road construction projects. The data collected comprised 131 projects awarded over the years 2004-2008. The data were tabulated to ensure that all costs were considered, none was double-counted and all were clearly defined.
All the data were deflated to year 2008. To deflate the cost of a project in a certain year to its cost in year 2008, the cost is divided by the cost index value of that year.
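The deflation step can be illustrated with a minimal Python sketch. The index values below are hypothetical placeholders chosen only for illustration (the actual PECDAR values are those of Table 1, with 2008 = 1), and the names `COST_INDEX` and `deflate_to_2008` are assumptions introduced here, not part of the study.

```python
# Hypothetical index values for illustration only; the real values are in Table 1.
# The base year is 2008, so its index is 1.
COST_INDEX = {2004: 0.80, 2005: 0.85, 2006: 0.90, 2007: 0.95, 2008: 1.00}


def deflate_to_2008(cost_usd, award_year, index=COST_INDEX):
    """Express a contract cost in year-2008 US dollars.

    Follows the rule stated in the text: divide the cost by the
    cost index value of the year in which the project was awarded.
    """
    if award_year not in index:
        raise KeyError(f"no cost index available for year {award_year}")
    return cost_usd / index[award_year]


if __name__ == "__main__":
    # A $100,000 contract awarded in 2004, restated in 2008 dollars.
    print(round(deflate_to_2008(100_000, 2004), 2))
```

With an index below 1 for earlier years, the restated 2008 cost is higher than the awarded cost, which is the expected direction when prices have risen.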
The following guidelines were taken into consideration during the data collection:
1. Distribution among year of award
Care was taken to have an approximately equal number of awarded projects over the years 2005-2008. For 2004, only projects awarded in the last quarter of the year were considered in the study, in order to avoid the effects of the Second Intifada.
Table 2 shows the distribution of projects based on year of award in the collected data.

Selection of Forecasting Method
Previous studies describe many methods used to forecast future construction costs. Some of these models describe construction cost as a function of factors believed to influence it. The relationship between construction cost and these factors has been established from past records of construction cost. Typically, the models established in this manner have been used to estimate the cost of individual contracts. These models, with their relational structure, are the only models expected to provide reliable long-term estimates (Wilmot and Cheng, 2003).
Regression estimating models are widely used in cost estimation. They are effective because of their well-defined mathematical approach and their ability to explain the significance of each variable and the relationships between independent variables (Sodikov, 2005). The models developed in this study are of this type.

Proposed Model's Variables
Many qualitative and quantitative factors affect the contract price, as shown in previous studies. It is striking, however, that most past construction cost models used only a few of them, because little information is available in the early stages of a project and information about the qualitative factors surrounding each project is difficult to obtain.
As the objective of this study is to develop estimating models that can be handled easily using calculators or simple computer programs in the early stages of a project, the models were developed based on quantitative factors that have a significant impact on contract price and that can be easily extracted from sketches at an early stage of a project. The following variables are used:
- Total project cost as the output (dependent) variable, expressed in year 2008 US dollars; it ranges from $42,000 to $2,000,000
- Input (independent) variables, as shown in Table 6.

Models Development
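Once the variables to be included in the estimate equation had been identified, a series of models was developed using multiple regression analysis techniques. Regression models are intended to find the linear combination of independent variables which best correlates with the dependent variable. The regression equation is expressed in the general form:
$$Y = b_0 + b_1 X_1 + b_2 X_2 + \dots + b_n X_n$$
where $Y$ is the dependent variable (total project cost), $X_1, \dots, X_n$ are the independent variables, and $b_0, \dots, b_n$ are the regression coefficients. All variables are significant at the 0.05 level. The variables are described in Table 6.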
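As a rough illustration of how such a linear combination is fitted, the sketch below estimates the coefficients by ordinary least squares using only the Python standard library. The data are synthetic and chosen so that the true coefficients are known; the function names (`fit_ols`, `predict`, `r_squared`) and the variable names are assumptions made for this example and do not reproduce the study's data set or software.

```python
def fit_ols(rows, y):
    """Return [b0, b1, ..., bn] minimizing squared error for y = b0 + sum(bi * xi)."""
    X = [[1.0] + [float(v) for v in row] for row in rows]  # prepend intercept column
    k = len(X[0])
    # Augmented normal equations [X'X | X'y].
    A = [
        [sum(x[i] * x[j] for x in X) for j in range(k)]
        + [sum(x[i] * yi for x, yi in zip(X, y))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    return [A[i][k] / A[i][i] for i in range(k)]


def predict(coefs, row):
    """Evaluate b0 + b1*x1 + ... + bn*xn for one project."""
    return coefs[0] + sum(b * v for b, v in zip(coefs[1:], row))


def r_squared(actual, fitted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - f) ** 2 for a, f in zip(actual, fitted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Synthetic example: three quantities per project (in thousands of units) and
# a cost (in thousands of dollars) generated as 5 + 2*x1 + 5*x2 + 10*x3.
ROWS = [(1, 3, 2), (2, 1, 4), (3, 4, 1), (4, 2, 5), (5, 5, 3), (2, 4, 4)]
COSTS = [42, 54, 41, 73, 70, 69]

coefs = fit_ols(ROWS, COSTS)
fitted = [predict(coefs, row) for row in ROWS]

if __name__ == "__main__":
    print([round(b, 4) for b in coefs])
    print(round(r_squared(COSTS, fitted), 4))
```

Because the synthetic costs are an exact linear function of the inputs, the fit recovers the generating coefficients and r2 is essentially 1; with real project data the residuals would be non-zero, and significance tests such as the p-values reported in the tables would be needed.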

Cost as a Function of Bid Quantities
Once a set of probable predictors was identified for each project, a statistical model was developed using the multiple regression technique. Given the quantities per project for the predictor variables, the regression model predicted the total cost in US dollars for that project, based on six statistically significant variables as shown in Table 7.
Data analyses show that earthworks, base works, asphalt works and furniture works (i.e. all road construction activities excluding earthworks, base works and asphalt works) are the major activities in the 131 examined road construction projects. Figure 2 shows the average contribution of these activities to the total project cost.

Figure 2 Average contribution of major road construction activities in the total project cost
As the furniture works include many work items, an equation that describes the total project cost as a function of the quantities of the major construction activities excluding furniture works is formulated; this makes the model easier to use in the early stages, when little information is available. The coefficient of determination r2 of the developed model is 0.97. The analysis of variance test confirmed the statistical significance of the model at a significance level of 0.05, as shown in Table 8.
Another way to estimate the total project cost is to estimate the cost of each construction activity and then sum the results. The formulated model takes the form:
Total cost ($) = Earthworks cost + Base course works cost + Asphalt works cost + Furniture works cost
Three individual regression models that describe the cost of earthworks, base course works and asphalt works as a function of bid quantities are developed. The results are shown in Table 9. The analysis of variance test confirmed the statistical significance of the models at a significance level of 0.05. The furniture cost can be estimated using one of the equations given in Table 10. Table 10 shows that Equation 3 is the best to use for furniture cost estimation, as it has the highest r2 and lowest MAPE. The coefficient of determination r2 for model 4, using Equation 4 for furniture cost estimation (shown in Table 11), is 0.97. Then the developed model is:
Model 3 is similar to model 2, but model 3 can be used to estimate the cost of an individual construction activity (i.e. asphalt cost, base course cost, earthworks cost, or furniture cost).
It can be seen that model 2 is better for estimating the total project cost because it has better accuracy (MAPE) than model 3, as shown later in Table 20.
The correlation among input variables is tested; the results are shown in Table 11. The r2 results show that there is a high correlation between pavement quantity and base course quantity, a medium correlation between pavement quantity and earthworks quantity, and a low correlation between pavement quantity and the other input variables. Given this correlation between input variables, a regression model describing the total project cost as a function of pavement quantity alone is developed. The regression statistics results are shown in Table 12. The model is useful for estimating project cost at the early stages of a project, since the only information needed is the pavement quantity, so an estimate can be produced within minutes. The developed model is:
where X3 is the project's pavement quantity (m²).

Cost as a Function of Project Size
A regression model that describes the total cost of a road construction project as a function of road width and road length is developed. The coefficient of determination r2 for the developed equation is 0.93. The regression statistics results for the developed model are shown in Table 13. The developed model is:
It can be seen that the model has a large negative constant (-196877.7), so for small projects the total cost estimated by the model will be unrealistic or may even be negative; e.g. for a road width of 5 m and a road length of 370 m, the estimated cost is 0. The model may give a realistic estimate when the road length is more than 1 km and the road width is more than 5 m.
Table 14 shows that the p-value of the road width variable is 0.8, which is higher than 0.05, meaning that road width is not significant enough to be included in the model. As a result, a model that uses road length as the only input variable is formulated. The regression statistics results for the developed model are shown in Table 15. The formulated equation is:
The developed model has a high r2 and a low p-value (less than 0.05), but for very small projects it gives unrealistic results; e.g. at a road length of 220 m, the estimated cost is 0. The model may give good estimates when the road length is greater than 600 m. Therefore, a model is developed using road length as the only independent variable with a zero intercept value; the results are shown in Table 16.
A model using variable interaction has also been developed; the resulting equation is:
The coefficient of determination r2 for the developed equation is 0.96. The regression statistics results for the developed model are shown in Table 17.
Then the total project cost can be calculated by summing the costs of the construction activities (model 10). The coefficient of determination r2 for the developed model is 0.92.
Total cost ($) = Earthworks cost + Base course works cost + Asphalt works cost + Furniture works cost (model 10)
where:
Earthworks cost = 1.56 X8 X9
Base works cost = 4.6 X8 X9
Asphalt works cost = 10 X8 X9
Furniture works cost = 4.58 X8 X9
It can be seen that model 10 is similar to model 9 (both use variable interaction), but model 10 can be used to estimate the cost of an individual construction activity (i.e. asphalt cost, base course cost, earthworks cost, or furniture cost). It also has better accuracy (MAPE) than model 9, as shown in Table 20.
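Model 10 is simple enough to evaluate by hand or in a few lines of code. The sketch below uses the four coefficients printed above; since the four terms share the same X8 X9 product, the model is equivalent to Total cost = 20.74 X8 X9. It is assumed here that X8 and X9 are the road length and road width in metres (Table 6 defines them, but it is not reproduced in this text), and the function name `model_10_costs` is hypothetical.

```python
# Coefficients of model 10 as printed in the text (US dollars per unit of X8*X9).
MODEL_10_COEFFS = {
    "earthworks": 1.56,
    "base course": 4.6,
    "asphalt": 10.0,
    "furniture": 4.58,
}


def model_10_costs(x8, x9):
    """Estimate activity costs and total cost from the interaction term X8*X9.

    Assumption: x8 and x9 are road length and road width in metres.
    """
    interaction = x8 * x9
    costs = {name: coeff * interaction for name, coeff in MODEL_10_COEFFS.items()}
    costs["total"] = sum(costs.values())  # summed before "total" is added
    return costs


if __name__ == "__main__":
    # Hypothetical project: 1000 m long, 6 m wide.
    for activity, cost in model_10_costs(1000, 6).items():
        print(f"{activity}: {cost:,.0f}")
```

For the hypothetical 1000 m by 6 m road, the activity estimates are about $9,360, $27,600, $60,000 and $27,480, giving a total of about $124,440.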

Testing accuracy of the developed models
The mean absolute percentage error (MAPE) is used to measure the accuracy of the developed models. The following formula is used to compute the MAPE (Lowe et al., 2006):
$$\mathrm{MAPE} = \frac{100}{n} \sum_{i=1}^{n} \left| \frac{A_i - F_i}{A_i} \right|$$
where $A_i$ is the actual value, $F_i$ is the forecast value, and $n$ is the number of fitted points.
Table 20 shows a summary of the regression models developed in the study. 10 models are developed to estimate the total cost of a road construction project in US dollars; 4 of them include bid quantities as independent variables (models 1 through 4), while the other 6 models include road length and road width as independent variables (models 5 through 10).
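The MAPE defined in this section can be computed with a short function. This is a minimal sketch using the standard definition (mean of absolute errors relative to the actual values, expressed as a percentage); the sample costs are made up for illustration and are not taken from the study's data.

```python
def mape(actual, forecast):
    """Mean absolute percentage error, in percent.

    actual   -- sequence of actual values A_i (must be non-zero)
    forecast -- sequence of forecast values F_i, same length as actual
    """
    if len(actual) != len(forecast) or len(actual) == 0:
        raise ValueError("actual and forecast must be non-empty and of equal length")
    total = sum(abs(a - f) / abs(a) for a, f in zip(actual, forecast))
    return 100.0 * total / len(actual)


if __name__ == "__main__":
    # Made-up actual and estimated project costs in US dollars.
    actual_costs = [100_000, 200_000, 400_000]
    estimated_costs = [90_000, 230_000, 400_000]
    print(round(mape(actual_costs, estimated_costs), 2))
```

In this made-up case the individual errors are 10%, 15% and 0%, so the MAPE is their mean, roughly 8.33%.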
It should be noted that in the very early stages the bill of quantity (BOQ) is not available, so the models using road width and length (models 5 through 10) are easier to apply and more suitable. Later, when the BOQ is available, the models based on BOQ (models 1 through 4) may be used.


Conclusion
This study aims at developing early cost estimating models for road construction projects using multiple regression techniques. The models were developed based on 131 sets of data collected in the West Bank in Palestine. Such models are very useful, especially because of their simplicity and because they can be handled with a calculator or a simple computer program. They are of particular benefit in estimating project cost at the early stages of a project, since the information needed can be extracted easily from sketches or the scope definition of the project.
It must be remembered that an estimated project cost is not an exact number but an opinion of probable cost. The accuracy and reliability of an estimate depend entirely on how well the project scope is defined and on the time and effort expended in preparing the estimate.
In this study, 10 regression models are developed; 4 of them include bid quantities as independent variables and 6 include road length and width. The coefficients of determination r2 for the developed models range from 0.92 to 0.98. This indicates that the relationship between the dependent and independent variables of the developed models is good and that the values predicted by the models fit the real-life data well. The mean absolute percentage error (MAPE) of the developed regression models ranges from 13.3% to 31%. The results compare favorably with past research, which has shown that estimate accuracy in the early stages of a project is between ±25% and ±50%. The findings reveal that the models that use bid quantities as independent variables are more accurate than those that use road length and road width as independent variables, but they require more information.

Figure 1 Research methodology

Table 1 Construction cost index value in the West Bank (PECDAR, 2009)
The construction cost index of 2008 obtained from the Palestinian Economic Council for Development and Reconstruction (PECDAR) is used to deflate the data. Table 1 shows the index values over the years 2004-2008; the base year is 2008 (index = 1).

Table 2 Projects distribution based on year of award in the collected data
2. Project size
Road construction projects in the West Bank are classified by PECDAR based on their cost, as shown in Table 3.

Table 3 Classification of road construction projects in the West Bank (PECDAR, 2009)
In this study, care was taken to have an approximately equal number of projects in each category. Table 4 shows the distribution of projects based on project cost.
Mahamid, I (2011) 'Early cost estimating for road construction projects using multiple regression techniques', Australasian Journal of Construction Economics and Building, 11 (4) 87-101

Table 4 Projects distribution based on project cost in the collected data
Moreover, care was taken to have an approximately equal number of projects in each cost category for each year of award. Table 5 shows the distribution of projects based on project cost for each year of award.

Table 6 Input variables description in the collected data set
*asphalt layer thickness is 5 cm, **ml: meter length

Table 7, where the p-values for all coefficients considered in the model are less than or equal to 0.05:

Table 7 Multiple regression results among total project cost and activities' quantities

Table 8. The developed model is:

Table 9 Regression models among cost of each major activity in road construction and its quantity
Table 10):

Table 11 Correlation among used input variables

Table 14 Multiple regression results among total project cost and road length and width when intercept = 0
Therefore, a trial was performed to develop a model with a zero intercept value; the results are shown in Table 14. The developed model is:

Table 16. The developed model is:

Table 17. The results show that the analysis of variance test confirmed the statistical significance of the model at a significance level of 0.05.

Table 17 Multiple regression results among total project cost and road size
Equations that describe the cost of the major road construction activities as a function of road length and width are developed. The best-fit models are achieved when variable interaction is used and the intercept value is 0. The results are shown in Table 18. The table shows that the r2 values are high, which indicates a good relationship between the dependent and independent variables.
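The effect of forcing a zero intercept on an interaction model can be sketched as follows. The example fits cost against the product of road length and width in two ways, with and without an intercept, using closed-form least-squares formulas. The five projects are synthetic and were chosen only so that the ordinary fit produces a negative intercept; the function names are assumptions, and the resulting coefficients are not those of the study.

```python
def fit_through_origin(z, y):
    """Least-squares slope b for y = b * z (intercept fixed at 0)."""
    return sum(zi * yi for zi, yi in zip(z, y)) / sum(zi * zi for zi in z)


def fit_with_intercept(z, y):
    """Least-squares (intercept, slope) for y = a + b * z."""
    n = len(z)
    mean_z = sum(z) / n
    mean_y = sum(y) / n
    sxy = sum((zi - mean_z) * (yi - mean_y) for zi, yi in zip(z, y))
    sxx = sum((zi - mean_z) ** 2 for zi in z)
    slope = sxy / sxx
    return mean_y - slope * mean_z, slope


# Synthetic projects: (road length in m, road width in m, total cost in $).
PROJECTS = [
    (400, 5, 30_000),
    (1000, 5, 95_000),
    (2000, 5, 200_000),
    (2500, 8, 420_000),
    (5000, 8, 850_000),
]

interaction = [length * width for length, width, _ in PROJECTS]
cost = [c for _, _, c in PROJECTS]

b_origin = fit_through_origin(interaction, cost)
a_free, b_free = fit_with_intercept(interaction, cost)

# Estimates for a very small hypothetical project: 100 m long, 5 m wide.
small = 100 * 5
estimate_origin = b_origin * small
estimate_free = a_free + b_free * small

if __name__ == "__main__":
    print(f"zero-intercept estimate: {estimate_origin:,.0f}")
    print(f"with-intercept estimate: {estimate_free:,.0f}")
```

On these synthetic data the fit with a free intercept yields a negative constant and therefore a negative cost for the very small project, whereas the zero-intercept fit always returns a non-negative estimate. This mirrors the reasoning given in the text for preferring zero-intercept models for small projects.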

Table 20 Mean absolute percentage error (MAPE) and r2 of the developed regression models
- It can be seen that the models that use bid quantities as independent variables are more accurate than those using road length and road width as independent variables.
- The table shows that for the models that use bid quantities as independent variables, as the number of work items involved in the model increases, the r2 value increases and the MAPE value decreases.
- The table shows that for the models that use road length and width as independent variables, the MAPE ranges from 18.8% to 31%. The model with the highest accuracy is the one that includes road length only with a zero intercept value (model 8), while the model with the lowest accuracy is the one that includes road length and width (model 5).
- The table shows that for the models that use BOQ as independent variables, the MAPE ranges from 13% to 19%. The model with the highest accuracy is the one that includes all construction activities (model 1), while the model with the lowest accuracy is the one that includes pavement quantity only (model 5).