Cumpston Sarjeant review steps 1 to 6

Steps from the Review of statistical methodology used in producing small business benchmarks by David Heath.

Last updated 17 March 2025

Step 1 – Identification of industries to benchmark

Note: this is from the Review of Statistical Methodology used in producing Small Business Benchmarks, which was undertaken by David Heath, Director of Cumpston Sarjeant Pty Ltd.

There are 233 business industry codes and associated titles that have been chosen as having higher risk relevant to the cash economy. In essence, the identification of this initial set of industries is a policy decision.

Clearly some of the 233 industries are not suitable for benchmarking due to their small population size. Others may be excluded for policy reasons including assessment of their risk rating. It is more appropriate for us to comment on the methodological approach employed, once the 102 industries are chosen by the ATO. These 102 industries are shown in Appendix B.

Step 2 – Identify the starting population of the selected business industry codes

Having identified the 102 industries that are to be benchmarked, steps 2 and 3 of the process involve the attempt to define and produce appropriate populations prior to the calculation of any ratios. We sought clarification of some aspects of the process in steps 2 and 3, as well as requesting the numbers at each stage for the 2010–11 financial year.

Initially, starting populations are derived by culling some of the available data. The starting population includes businesses that:

  • lodged their income tax return for the year to be benchmarked (as benchmarks for a particular year are derived from tax returns in that year)
  • are registered and have a current Australian business number (ABN)
  • are in one of the identified Business Industry Codes.

For the 2010–11 financial year there were 1,328,737 such entities.

The exclusions from the entire potential dataset must involve some judgment by the relevant staff of the ATO, but the exclusions at this point appear to be inherently sensible and reasonable.

The next stage excludes certain businesses, namely those that are:

  • currently insolvent (presumably these would not form a good benchmark)
  • deceased
  • not a company, partnership, trust or individual sole trader – I understand this excludes entities such as superannuation funds
  • in not-for-profit, government or large market segments
  • associated with a tax file number culled from the system (often for reasons of fraud).

Having made these exclusions the 1,328,737 entities reduce to 1,307,155, representing a reduction of 1.6%.

Again, these exclusions appear inherently sensible when considering the objective to reach a representative population prior to the calculation of any ratios.

The next set of filters (together with my comments) is to:

  • limit to business income turnover between $30,000 and $15,000,000. I note this excludes a large number, reducing the 1,307,155 businesses down to 596,925 (that is, by a further 54%)
  • exclude those with mixed business activities (the IGT report includes commentary on the issue of mixed businesses – I concur with the view that it is difficult to compare mixed businesses. Their financial behaviour will be difficult to capture in a process which has the grouping of like businesses at its heart).
  • exclude new businesses (so for the 2010–11 financial year, any businesses registered after July 2009 are excluded – presumably on the basis that set up costs involved in the establishment of a new business may distort the financial ratios. Further, newer businesses may have lower turnover in their initial stages).

The latter two exclusions (mixed businesses and new businesses) reduce the 596,925 businesses down to 492,441 (that is, by a further 17.5%).

All the exclusions are an attempt to derive a usable population for the calculation of ratios which will later be published as benchmarks. They are inherently sensible in attempting to gain relatively homogeneous groups. Other than their qualitative objective to remove records that may not be typical of the remaining groups, there is little to say regarding the statistical basis for the exclusions.
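The arithmetic behind the reductions quoted in Step 2 can be checked with a short sketch. The counts are those stated above for the 2010–11 financial year; the stage labels and the function name `percentage_reductions` are illustrative only and are not ATO terminology.

```python
# Entity counts quoted in Step 2 for the 2010-11 financial year.
# Stage labels are illustrative only, not ATO terminology.
STAGES = [
    ("starting population", 1_328_737),
    ("after entity-type and status exclusions", 1_307_155),
    ("after turnover filter ($30,000 to $15,000,000)", 596_925),
    ("after removing mixed and new businesses", 492_441),
]


def percentage_reductions(stages):
    """Return (label, percentage drop from the previous stage) pairs."""
    results = []
    for (_, before), (label, after) in zip(stages, stages[1:]):
        results.append((label, 100 * (before - after) / before))
    return results


for label, pct in percentage_reductions(STAGES):
    print(f"{label}: {pct:.1f}% reduction")
```

Each percentage is measured against the population remaining after the previous stage, which is how the figures in the text are expressed.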

Step 3 – Industry allocation – grouping of businesses into industry sub-groups

Having made the exclusions as described in Step 2, the next main step is to group businesses into sub-groups, being the 102 industry groups.

This is done on 2 bases (use of codes and keywords).

First, the ATO bases sub-groups on Business Industry Codes; these are a modified version of the ANZSIC codes (Australian and New Zealand Standard Industrial Classification). The ANZSIC codes represent a hierarchical four-digit coding. The ATO achieves further stratification of business type with the addition of a fifth digit. The IGT report provides a description of the ANZSIC and Business Industry Codes (at Appendix 4).

Beginning with the 5 digit Business Industry Codes, some industries are further divided into sub-groups. Examples are given in the Small Business Benchmarks document, including the division of Business Industry Code 32430 into 3 industry sub-groups (Carpet laying services, Tiling services and Timber Floor Sanding).

The division is on the basis that the sub-group divisions are a further attempt to arrive at similar or like groups for the purpose of calculating benchmark ratios. For the 3 sub-groups, this heterogeneity is demonstrated by the different published ratios for each sub-group. The division of the Business Industry Code is useful as it produces 3 distinct sub-groups rather than an amalgam of three separate business types which share a Business Industry Code.

Other divisions are made on the basis of ATO knowledge regarding the characteristics of different businesses. As with the culling of data, as described in step 2 above, the objective is to produce like groups. From a statistical point of view, this qualitative approach should improve the homogeneity in a quantitative sense of the groupings.

The second basis uses keywords in an attempt to improve the groupings. This is another method to classify businesses into like groups. The example given in the Small Business Benchmarks document is the classification of entities with the same ANZSIC code into sub-groupings of Restaurants and Cafes.

The identification of relevant words such as 'restaurant', 'coffee shop', or 'café' in a business description or name allows the division of those businesses which have the same business industry code. In addition, keywords can be used to classify businesses regardless of the business industry code.
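The keyword approach described above might look something like the following sketch. The keyword lists, the mapping of words to sub-groups, and the function name `classify_by_keyword` are assumptions made for illustration; the review does not give the ATO's actual keywords or matching rules.

```python
# Hypothetical keyword lists. The ATO's actual keywords and matching
# rules are not given in the review.
KEYWORDS = {
    "Restaurants": ("restaurant",),
    "Cafes": ("coffee shop", "café", "cafe"),
}


def classify_by_keyword(description):
    """Return the first sub-group whose keyword appears in the text, else None."""
    text = description.lower()
    for sub_group, words in KEYWORDS.items():
        if any(word in text for word in words):
            return sub_group
    return None


print(classify_by_keyword("Corner Coffee Shop"))
print(classify_by_keyword("Smith Plumbing"))
```

In this sketch a description matching more than one sub-group is assigned to the first one listed, and a description with no match returns `None`. How the ATO resolves either case is a matter of analyst judgment that the sketch does not attempt to reproduce.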

For the 2010–11 benchmark year, there were 492,441 entities following the completion of the culling described in step 2. After the keyword process and the resultant assignment to benchmark industries, there were 403,908 entities remaining (that is, a further 18% reduction).

Clearly, the selection of appropriate keywords and the division into particular sub-groups necessarily involve judgment on the part of the relevant staff of the ATO. However, I believe this is done in a sensible manner which assists in the production of benchmark groupings that contain sufficiently homogeneous groups of entities in terms of their essential business characteristics.

I understand that in the application of the benchmarks, and the identification of businesses for potential investigation, an equivalent process is followed. This ensures the business entity does indeed belong in the relevant industry and turnover range.

Step 4 – Calculation of the ratios

Section 2.4 of the Small Business Benchmarks (SBB) document describes the calculation of the benchmark ratios. The ratios are calculated on 2 bases: first, from the Income Tax Return, and second, from Activity Statements.

For the Income Tax return ratios, all ratios use turnover (revenue from goods and services excluding GST) as the denominator. The benchmark ratios are:

  • Total expenses to Turnover
  • Cost of Sales to Turnover
  • Labour to Turnover
  • Rent to Turnover
  • Motor Vehicle Expenses to Turnover.

For the Activity Statement ratios, all ratios use the denominator of Total Sales (including GST), aggregating across a complete financial year. The benchmark ratios are:

  • Non-capital purchases
  • GST-free sales.

Clearly some ratios are relevant to a lower proportion of businesses (for example, GST-free sales). In turn, as the steps below shall outline, publishing of ratios depends on there being sufficient observations for a business industry and turnover range. Accordingly, the ratios with lower numbers of observations are less likely to be published as benchmark ratios.
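As a rough illustration of the two sets of ratios, the sketch below divides each reported item by the relevant denominator. The field names (`turnover`, `total_sales`, and so on) are hypothetical labels chosen for readability, not the labels on the actual returns, and the integrity checks mentioned in the SBB document are not reproduced.

```python
# Field names are hypothetical labels, not the labels on actual returns.
TAX_RETURN_ITEMS = (
    "total_expenses",
    "cost_of_sales",
    "labour",
    "rent",
    "motor_vehicle_expenses",
)
ACTIVITY_STATEMENT_ITEMS = ("non_capital_purchases", "gst_free_sales")


def _ratios(record, denominator_key, items):
    """Divide each reported item by the denominator; skip unreported items."""
    denominator = record[denominator_key]
    if denominator <= 0:
        raise ValueError(f"{denominator_key} must be positive")
    return {
        f"{item}_to_{denominator_key}": record[item] / denominator
        for item in items
        if item in record
    }


def income_tax_return_ratios(record):
    """Ratios to turnover (revenue from goods and services, excluding GST)."""
    return _ratios(record, "turnover", TAX_RETURN_ITEMS)


def activity_statement_ratios(record):
    """Ratios to total sales (including GST), summed over the financial year."""
    return _ratios(record, "total_sales", ACTIVITY_STATEMENT_ITEMS)


example = {"turnover": 200_000, "total_expenses": 160_000, "cost_of_sales": 80_000}
print(income_tax_return_ratios(example))
```

Items a business does not report are simply left out of the result, which mirrors the point that some ratios apply to a smaller proportion of businesses.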

The SBB document provides several descriptions of logical steps taken to ensure data integrity. In order to gain a better understanding of the source of the financial ratios we requested, and received, pro-forma blank copies of the relevant Income tax returns and activity statements. This enabled us to better understand some of the adjustments and checks used by the ATO.

Step 5 – Calculate and remove the outliers

Outliers are those observations that are considered to have values that are significantly different from the majority of other observations in the dataset.

The method used to remove the outliers is the Mahalanobis distance technique. In broad terms, this technique measures distance from a central point in an n-dimensional space.

As outlined in step 4, above, the possible ratios calculated from both tax returns and activity statements for a particular industry are:

  • Cost of sales to turnover
  • Total expenses to turnover
  • Rent to turnover
  • Motor vehicle expenses to turnover
  • Non-capital purchases to total sales
  • GST-free sales to total sales.

In theory, each business’ ratios could be compared with the distribution of all ratios for an industry as a whole. A distance measure would consider the degree to which the combined ratios differ from those of the industry. If the distance measure is higher than some pre-determined level, that business’ ratios would all be excluded on the basis that in aggregate they are significantly different from those of the industry.

This is not the approach taken to removing outliers in the derivation of SBB. Rather than the exclusion of outliers occurring across all ratios combined for a business, each ratio is considered separately, with distance calculations being made for each individual ratio for a business.

So for a given ratio, say Total expenses to turnover, a distance is calculated for each business’ individual ratio. The ratio is excluded if the Mahalanobis distance is greater than two.
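A minimal sketch of a per-ratio distance calculation follows. It assumes the two dimensions are turnover and the ratio itself (the review refers later in this step to a two-dimensional calculation based on the relationship between the ratio and turnover) and that the central point and spread are the sample mean and sample covariance. The function names and sample data are hypothetical, and the ATO's actual implementation is not described in the review.

```python
def mahalanobis_2d(points):
    """Mahalanobis distance of each (turnover, ratio) point from the sample mean."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    # Sample variances and covariance (n - 1 in the denominator).
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    det = sxx * syy - sxy ** 2
    if det <= 0:
        raise ValueError("covariance matrix is not invertible")
    distances = []
    for x, y in points:
        dx, dy = x - mx, y - my
        # Quadratic form using the inverse of the 2x2 covariance matrix.
        d2 = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
        distances.append(d2 ** 0.5)
    return distances


def flag_outliers(points, threshold=2.0):
    """Indices of points whose distance exceeds the threshold."""
    return [i for i, d in enumerate(mahalanobis_2d(points)) if d > threshold]


# Hypothetical data: turnover in $'000 and Total expenses to turnover.
sample = [
    (100, 0.80), (120, 0.82), (140, 0.78), (160, 0.81), (180, 0.79),
    (200, 0.80), (220, 0.83), (240, 0.77), (260, 0.80), (180, 0.20),
]
print(flag_outliers(sample))
```

In this made-up sample only the last business, whose ratio sits far below the rest at a typical turnover, exceeds the threshold of two.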

I make the following observations.

It is possible to graph each ratio; for example, the ratio of total expenses to turnover could be graphed against turnover, that is, on a two-dimensional graph. If a line were fitted to this relationship, distance from this line would achieve a similar (though not equivalent) result. This approach would effectively remove outliers if their distance from the fitted line, measured in standard deviations, was greater than some pre-determined value.
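The fitted-line alternative mentioned above could be sketched as follows. This is not the method used for the benchmarks; it assumes an ordinary least squares line, a residual standard deviation with n − 2 degrees of freedom, and a cut-off of two standard deviations, all of which are choices made here for illustration. The function names and data are hypothetical.

```python
def fit_line(points):
    """Ordinary least squares fit of ratio against turnover: (slope, intercept)."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    slope = sxy / sxx
    return slope, my - slope * mx


def flag_by_residual(points, threshold=2.0):
    """Indices of points lying more than `threshold` residual SDs from the line."""
    slope, intercept = fit_line(points)
    residuals = [y - (intercept + slope * x) for x, y in points]
    n = len(points)
    # Residual standard deviation with n - 2 degrees of freedom.
    spread = (sum(r * r for r in residuals) / (n - 2)) ** 0.5
    return [i for i, r in enumerate(residuals) if abs(r) > threshold * spread]


# Hypothetical data: turnover in $'000 and Total expenses to turnover.
sample = [
    (100, 0.80), (120, 0.82), (140, 0.78), (160, 0.81), (180, 0.79),
    (200, 0.80), (220, 0.83), (240, 0.77), (260, 0.80), (180, 0.20),
]
print(flag_by_residual(sample))
```

Unlike a distance from a central point, this measure only looks at vertical distance from the line, so a business with an unusual turnover but a typical ratio would not be flagged.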

While the distance measures may be calculated without any judgment being involved, it is still necessary to apply some judgment in choosing the threshold level.

Given that the calculations use a two-dimensional measure of distance, a symmetrical distribution and a distance threshold of 2 would be expected to lead to the exclusion of 15–16% of observations as outliers. The data in Appendix B, which shows the numbers before and after the removal of outliers, reflects such an exclusion rate across all 102 industries.

As shall be seen later in step 7, further ratios are excluded from calculation when the ranges are calculated, with yet more exclusions prior to the publication of ratios. Accordingly, if this particular step were not performed, a portion of the outliers would be excluded at a later step. It is important to note that the exclusion of the outliers at this step is made on the basis of the relationship between the ratio and turnover; that is, an individual ratio will be excluded if it differs markedly from 'typical' ratios for that level of entity turnover. The exclusions in step 7 are made on a different basis.

While we have not received detailed data regarding the outliers excluded in this step, we have observed scatter plots of the Total Expenses to Turnover ratio in the IGT report, the SBB document, and for data we requested to assess steps 6 and 7 (below). In each case we can observe significant proportions of ratios at or close to 100%. Given the clusters near this value, it is unlikely such values would be considered outliers; rather it is expected that excluded outliers would tend to be the lower values for the Total Expenses to Turnover ratio.

It is stated in the SBB document that excluded outliers may be 'extreme cases, mistakes or not part of the population intended to be benchmarked'. It should be remembered that outliers could also reflect entities engaged in the type of cash economy avoidance activities that the benchmarking is meant to detect. While it is appropriate to exclude these outliers from the development of published benchmarks, it would still seem correct to then assess these individual ratios against the benchmarks. Where such ratios are significantly above the published ranges, this could indicate a need for further investigation.

Not surprisingly, as shown in the graph below it is those industries with fewer business entities that tend to have relatively more outliers excluded. Those industries with fewer entities will have greater relative variability in individual ratios.

Step 6 – Assign turnover ranges to benchmark industries

The process of division of each of the 102 industries into turnover ranges involves some judgment as to the bounds of each range. Each industry has either two or three ranges which may be used for the publication of ratios.

In Step 2, above, it was noted that, by definition, those businesses with turnover below $30,000 were excluded. This stage of the exclusion process eliminated a significant number of observations. However, examination of the published ranges shows the lowest range (for 2010) starting at $50,000, with many industries having a lowest range starting at $65,000. So the published ranges, whether an average or a band for a particular industry and turnover range, exclude even more data than indicated in Step 2 (above).

For the 102 industries, the distribution of the bottom end of the lowest range is as follows:

Start point of lowest range    Number of industries
$50,000                        36
$65,000                        57
$75,000                        4
$100,000                       4
$400,000                       1
Total                          102

From discussions with the ATO, I understand the decision regarding this 'floor' level of turnover for each benchmark industry is a matter of judgment, taking into consideration relevant aspects of each individual industry.

One aspect considered in the choice of the 'floor' level is the threshold for GST registration. This threshold is the point at which a business entity must register for GST, although some businesses below the threshold still register for GST. I understand that prior to 30 June 2007, the threshold for GST registration was $50,000. Since then it has been $75,000. The first published benchmarks considered both these thresholds.

In general, retail businesses are expected to have higher turnover than service-based businesses, where the turnover reflects a greater proportion of labour on the part of the provider. As such, retail businesses will tend to have a higher starting point (say, $65,000). Other industries, by their nature, typically have a higher turnover (for example, pubs), so have higher starting points.

For the top of the lowest range, and in turn the higher ranges, there is some judgment involved for the relevant analysts. While primarily looking at the Total Expenses to Turnover range, analysts look at scatter plots of the ratios by Turnover. I understand they are looking for ranges whereby the data within each range shall be 'alike' and distinct from the other ranges. Apart from 'eyeballing' the scatter plots, there are other considerations that influence the choice of range, including having sufficient entities in each range to increase statistical validity. It is also desirable to have homogeneity within each range (step 8).
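Once the boundaries have been chosen, assigning a business to a turnover range is mechanical. The sketch below shows one way this could be done. The boundary values, the range labels, the use of the $15,000,000 population limit from Step 2 as the top of the highest range, and the convention that a boundary value belongs to the higher range are all assumptions for illustration; the actual boundaries are set per industry by ATO analysts.

```python
from bisect import bisect_right

# Hypothetical boundaries for a single industry, in dollars.
FLOOR = 65_000                   # assumed start of the lowest range
BREAKPOINTS = [200_000, 600_000] # assumed internal boundaries
CEILING = 15_000_000             # upper turnover limit applied in Step 2
LABELS = ["low", "medium", "high"]


def turnover_range(turnover):
    """Return the range label for a turnover, or None if outside all ranges."""
    if turnover < FLOOR or turnover > CEILING:
        return None
    # bisect_right places a turnover equal to a breakpoint in the higher range.
    return LABELS[bisect_right(BREAKPOINTS, turnover)]


print(turnover_range(150_000))
print(turnover_range(50_000))
```

A business below the floor falls outside every published range, which is the effect noted earlier in this step: the published ranges exclude more data than the $30,000 filter in Step 2.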

In looking at this step, and the following step (7), we have three broad aims:

  • gain a better understanding of the assignment of turnover ranges and its statistical validity
  • test the establishment of turnover ranges
  • gain insight into the sensitivity of the published bands of ratios to the turnover ranges.

In order to achieve these objectives, we chose six relevant industries (two smaller, two medium, and two larger), and requested some underlying data from the ATO.

The requested data was provided for the following industries:

  • Entertainment media retailing (smaller)
  • Ice cream retailing (smaller)
  • Beauty services (medium)
  • Sports, camping and fishing retailing (medium)
  • Electrical services (larger)
  • Plumbing services (larger).

The initial data provided showed Sales (Turnover) for each relevant business as well as ratios based on tax return information, being the Total Expense ratio, and the Cost of Goods Sold ratio. We were also provided with relevant data for ratios from the Activity statement.

This was the extent of the information provided for the six industries, so no data was provided that could identify an individual entity. In order to further protect the confidentiality of the data, Sales were stratified into bands of $5,000, rather than the raw dollar amount being provided. Given the stratification of the data into bands of Sales is consistent with the classification of the data into turnover ranges, the lack of precise turnover data has no effect on our analysis.
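The banding of Sales into $5,000 steps can be illustrated with a short sketch. It assumes each band is half-open, including its lower bound and excluding its upper bound; the review does not say how band edges were treated, and the function name `sales_band` is hypothetical.

```python
BAND_WIDTH = 5_000  # band size stated in the review, in dollars


def sales_band(sales):
    """Return the (lower, upper) bounds of the $5,000 band containing sales.

    Assumes half-open bands: lower bound included, upper bound excluded.
    """
    lower = int(sales // BAND_WIDTH) * BAND_WIDTH
    return lower, lower + BAND_WIDTH


print(sales_band(67_300))
```

Under this convention, any turnover range whose boundaries are multiples of $5,000 can be determined from the band alone, which is consistent with the observation that the banding does not affect the analysis.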

 
