2. Description of the Dataset
SNZ's proposed functions for the LBD necessitate reasonably full coverage data. In general, such data is either held on SNZ's Business Frame (BF) or derived from administrative data held by other government departments. The core administrative data on the LBD currently consists of the Longitudinal Business Frame (LBF) with goods & services tax (GST) returns, financial accounts (IR10), and company income tax returns (IR4) provided by IRD; information on employers and employees aggregated to the firm level (sourced from IRD via LEED);3 and shipment-level merchandise export and import data provided by Customs. The nature of each of these datasets is briefly discussed below.
As its name suggests, the LBF is a by-product of SNZ's sampling frame (the BF) and contains longitudinal information (eg, industry, ownership type, and sector) on a wide population of firms.4 The quality of the LBF's representation of firm characteristics, and changes in those characteristics, is a function of the maintenance processes for the BF, the ability of respondents to answer survey questions, and the quality of supplementary sources used. GST data is used to help maintain the accuracy of the BF (particularly to track the births and deaths of firms) and, consequently, a significance threshold exists at the mandatory GST filing level, below which BF coverage is limited. Large economic units are surveyed either annually or triennially to maintain the accuracy of the data held. The LBD version of the LBF holds data from April 1999 to June 2007.
GST data is collected on a monthly, bi-monthly or six-monthly basis by IRD, depending on the size of the firm filing. GST data includes information on sales & purchases. SNZ manipulates this raw data to create the Business Activity Indicator (BAI) dataset (also included in the LBD). The primary manipulations applied to generate the BAI data are to temporally apportion the GST data down to a monthly frequency, apportion returns across GST group members, and apply limited imputation in cases where a single return appears to be missing. In the LBD, BAI data is available from April 1992 to May 2007.
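As an illustration only, temporal apportionment of a multi-month GST return down to a monthly frequency might be sketched as below. The even split across months and the function name `apportion_to_months` are assumptions made for this sketch; the actual apportionment rule used by SNZ to build the BAI is not described here.

```python
def apportion_to_months(end_year, end_month, n_months, amount):
    """Spread one return's amount evenly over the n_months ending in
    (end_year, end_month). Even weighting is an assumption of this sketch,
    not a documented BAI rule."""
    if n_months < 1:
        raise ValueError("n_months must be at least 1")
    share = amount / n_months
    monthly = {}
    year, month = end_year, end_month
    for _ in range(n_months):
        monthly[(year, month)] = share
        # Step back one calendar month, rolling over the year boundary.
        month -= 1
        if month == 0:
            year, month = year - 1, 12
    return monthly


# A bi-monthly return ending January 2005 is split across December and January.
example = apportion_to_months(2005, 1, 2, 1000.0)
```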
IR10 data is essentially a set of company accounts composed of a statement of financial performance and a statement of financial position. Consequently, this form contains information on sales (and other income) and purchases, as well as a detailed breakdown of expenditure including depreciation, research and development, and salaries & wages. Balance sheet items include the usual suspects: fixed assets (broken down into vehicles; plant & machinery; furniture & fittings; land & buildings; and other), liabilities (broken down into current & term), and shareholders' funds. IR10s are available for the 1998/99 to 2005/06 financial years.
Like IR10s, IR4 returns are available on the LBD for the 1998/99 to 2005/06 financial years. IR4s are declarations of taxable income for companies and, as such, include variables on overseas income, interest & dividends, and income from "business or rental activities". They also contain a binary foreign-ownership indicator.5
LEED data is constructed by SNZ from IRD tax data, notably Pay-As-You-Earn (PAYE) returns for employees. To protect the confidentiality of individuals, LEED variables available in the LBD dataset have been aggregated to the firm-level (allowing the data to be accessed through the Datalab). Variables available in this manner include counts of employers (on an annual basis) and employees (on a monthly basis) with matching data on income. Summary characteristics of individuals also include gender and banded age breakdowns, tenure distributions of employees, and summary measures of the dispersion of wages within the firm. Accessions and separations are summarised at the firm level, as are counts of contractors employed (with remuneration).
Customs data is linked to the LBF initially via probabilistic matching with subsequent manual matching for any remaining unmatched large-value Customs clients (Smart & Johnstone 2007).6 The dataset contains daily shipment-level information from January 1988 through to October 2007 covering goods (defined by the 10-digit harmonised system, HS10), countries of origin and destination, values, volumes, weights, currency of trade, port of entry/exit and mode of transportation.7
In addition, a number of SNZ sample surveys have been linked to the LBD, namely:
- Annual Enterprise Survey (AES) 1997-2006;
- R&D Survey biennially 1996-2006;
- Business Practices Survey (BPS) 2001;
- Innovation Survey 2003;
- Business Finance Survey (BFS) 2004; and
- Business Operations Survey (BOS) 2005-2006.
Being sample surveys, these data are relatively sparse in the LBD. Other than AES, these datasets are not used in the current paper, and interested readers can find detailed descriptions of the survey collections on SNZ's website. AES is SNZ's primary data source for the production of National Accounts, and as such is the benchmark dataset for estimation of value-added. The survey is full coverage for large firms with a stratified sample survey for smaller firms, and has industry-specific questions in order to accurately measure aggregate gross domestic product. In this paper we use AES postal responses to assess the accuracy of our value-added measure derived from tax sources.
Lists of firms that have received assistance from government agencies, together with information on the size and nature of the assistance, have also been probabilistically matched (on contact details) to the dataset to enable evaluation of these schemes.8
Some choices have to be made about the relevant population for the statistics produced in this paper. First, we choose our unit of observation as the enterprise (referred to as the firm throughout this paper). Much research in this area uses the plant (or geographic unit in SNZ's nomenclature) as the unit of observation. However, in New Zealand data most financial variables are only observed at the firm (or tax reporting) level, not at the individual plant (the main exception being LEED salaries & wages). To avoid the issues inherent in apportioning output to firms with multiple locations, this paper focuses on firm-level performance metrics. From a conceptual perspective the span-of-control covered by a firm may be more appropriate to the types of analysis expected of the LBD. For example, business performance surveys (such as BOS) are generally targeted at the firm using the logic that firm practices are expected to be set at this level of organisation.
Second, the time frame of longitudinal analysis involving all data sources is limited by the availability of LEED data. At the time the results in this paper were prepared, full data was only available for the six financial years from 1999/00 to 2004/05. An annual frequency is imposed on the data by the IR10, IR4 and working proprietor tax returns. All sub-annual data (Customs, BAI, LEED employee data) is annualised to each firm's financial year and then allocated to the "notional" 31st March year-end that has the greatest overlap with the financial year.9
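The greatest-overlap allocation can be illustrated with a short Python sketch. It assumes that a financial year is the twelve months ending on the firm's balance date, that overlap is measured in days, and that ties go to the earlier March year; these details, and the function names, are assumptions of the sketch rather than documented rules.

```python
from datetime import date, timedelta


def financial_year_start(fy_end):
    """First day of the twelve-month financial year ending on fy_end."""
    try:
        same_day_last_year = fy_end.replace(year=fy_end.year - 1)
    except ValueError:
        # fy_end is 29 February; fall back to 28 February of the prior year.
        same_day_last_year = fy_end.replace(year=fy_end.year - 1, day=28)
    return same_day_last_year + timedelta(days=1)


def overlap_days(a_start, a_end, b_start, b_end):
    """Number of days shared by two inclusive date ranges."""
    start, end = max(a_start, b_start), min(a_end, b_end)
    return max(0, (end - start).days + 1)


def notional_march_year(fy_end):
    """Return the year Y of the 1 April (Y-1) to 31 March (Y) window that
    overlaps most with the financial year ending on fy_end."""
    fy_start = financial_year_start(fy_end)

    def overlap(year):
        return overlap_days(fy_start, fy_end,
                            date(year - 1, 4, 1), date(year, 3, 31))

    # Only these two March years can hold the largest share of the financial
    # year; max() keeps the first candidate on a tie (assumed tie-break).
    return max((fy_end.year, fy_end.year + 1), key=overlap)
```

Under these assumptions, a firm with a 30 June 2004 balance date is allocated to the year ended 31 March 2004, while a firm with a 31 December 2004 balance date is allocated to the year ended 31 March 2005.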
Third, we have to define an in-scope firm. To simplify the discussion of data coverage and to increase the likely applicability of the performance metrics estimated, we include only "private-for-profit" firms,10 and additionally exclude households, ANZSIC Division M (Government Administration & Defence) and firms not located in New Zealand. For practical reasons, "firms" that have never reached the BF materiality threshold and, therefore, do not appear in the LBF are excluded from the analysis (as they are not currently assigned to industries). Similarly a small number of firms that are on the LBF, but have partial or no ANZSIC information, are dropped from the analysis.
Finally, we must determine criteria for whether we treat a firm as active in any particular year. SNZ's standard approach is to define populations using the dual criteria of "live" and "economically significant". The latter criterion relates to materiality, while the former assesses whether the business is in operation. Variables capturing these criteria are located on the BF (and LBF) which, in turn, makes use of IRD data to maintain the accuracy of the population characteristics. However, through the LBD we have access to a wider set of administrative data from which to assess business activity. Naturally, the use of this wider set of data increases the potential to observe active businesses. We define an "economically active" (ie, in-scope) firm as one where we observe output, purchases of inputs or factors of production, specifically: positive employee count or PAYE salaries & wages; positive BAI sales or purchases; and/or positive IR10 total income, total expenditure or total fixed assets. This sets the population much wider than a live & economically significant approach, primarily because the economically active rule does not have an explicit materiality threshold,11 and because the additional tax data suggests that some firms should be treated as active despite being ceased on the BF.
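The economically active rule can be expressed as a simple predicate on a firm-year record. The sketch below is illustrative only: the field names are hypothetical and do not correspond to actual LBD variable names, and missing values are treated as not positive.

```python
# Hypothetical field names grouped by source; not actual LBD variable names.
ACTIVITY_FIELDS = {
    "LEED": ("employee_count", "paye_salaries_wages"),
    "BAI": ("bai_sales", "bai_purchases"),
    "IR10": ("ir10_total_income", "ir10_total_expenditure",
             "ir10_total_fixed_assets"),
}


def is_economically_active(record):
    """True if any listed indicator is strictly positive in this firm-year.

    `record` is a dict for one firm in one year; absent or None values
    count as not positive.
    """
    for fields in ACTIVITY_FIELDS.values():
        for field in fields:
            value = record.get(field)
            if value is not None and value > 0:
                return True
    return False
```

Because any single positive indicator is sufficient, a firm with only BAI purchases, or only IR10 fixed assets, is counted as active even if it has no employees.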
Table 1 sets out the size of our population in each year, together with entry and exit rates defined by a firm being active in one year, but not in the relevant adjacent year. Even in this simple breakdown, there is much dynamism present with approximately a fifth of the population of firms either entering or exiting in a given year. Put another way, there are 687,573 distinct firms within the dataset with roughly two thirds of them active in any single year. Table 2 sets out the patterns of activity present in the data. A small proportion of the observed firm turnover is due to firms that enter and exit the population on an intermittent basis, and it might be reasonable to expect that some of these transitions are spurious.12 However, 95.9% of firms experience a single continuous spell of economic activity,13 with 39.0% of firms in the dataset continuously economically active over the full period. Overall, the general picture of firm dynamics is consistent with survival analyses previously published using more "traditional" population definitions (eg, MED, SNZ various years).
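To make the entry, exit and spell definitions concrete, the following Python sketch computes them from a mapping of firms to their active years. It assumes that both rates are expressed relative to the active population in the year in question, and that entry (exit) is undefined in the first (last) year of the panel; the denominators used in Table 1 may differ, and the function names are hypothetical.

```python
def entry_exit_rates(activity):
    """activity maps firm id -> set of years in which the firm is active.

    Returns {year: (entry_rate, exit_rate)}. A firm enters in year t if it
    is active in t but not in t-1, and exits in t if it is active in t but
    not in t+1. Rates are shares of firms active in t (an assumption).
    """
    all_years = set().union(*activity.values())
    if not all_years:
        return {}
    first, last = min(all_years), max(all_years)
    rates = {}
    for year in range(first, last + 1):
        active = [years for years in activity.values() if year in years]
        if not active:
            continue
        n = len(active)
        entry = None if year == first else sum(
            (year - 1) not in years for years in active) / n
        exit_rate = None if year == last else sum(
            (year + 1) not in years for years in active) / n
        rates[year] = (entry, exit_rate)
    return rates


def count_spells(active_years):
    """Number of continuous spells of activity in a set of years."""
    years = sorted(active_years)
    return sum(1 for i, y in enumerate(years)
               if i == 0 or y != years[i - 1] + 1)
```

A firm with `count_spells(...) == 1` has a single continuous spell of activity; values above one identify the intermittent patterns referred to above.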
Having set the population characteristics, it is necessary to discuss missing data. In this paper, we assume that missing employment (working proprietor) data implies zero employees (working proprietors) on the grounds that personal income tax non-compliance is likely to be negligible in the population of firms that meet the mandatory GST filing threshold. Similarly it is assumed that Customs data is comprehensive.14 For this exploratory analysis, we do not make any attempt to impute missing data in other datasets. Tables 3 & 4 set out coverage rates for each of our administrative datasets by firm size & industry respectively. Administrative data can be missing for a number of reasons, including:
- Filing is not mandatory. In terms of the potential for bias to be introduced into the analysis, two issues stand out from Tables 3 & 4: for BAI, missing data largely arises because of GST-exempt financial activities in the finance & insurance industry; and IR4s are company returns and are therefore not filed by other business types, explaining very low reporting rates in some industries;15
- Filing is mandatory, but a firm is non-compliant (non-compliance with GST reporting appears very low);
- Data is filed, but has to be discarded because it is of insufficient quality for statistical purposes. In the case of IR10s, a large number of missing observations exist because a returned form only contains zeros or fails simple internal consistency checks (eg, that totals "approximately" sum correctly);16
- One data source incorrectly implies a firm is economically active, thus giving the impression that other data should be present. For example, there is undercoverage of both BAI and IR10 data for entering and exiting firms, which may be reflective of incorrectly assessing the timing of entry and exit; or
- Links between IRD & BF firm identifiers are missed, partial or incorrectly apportioned across the enterprises that the filing covers. The rate at which this occurs is assumed to be low.
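As an illustration of the kind of screening described for IR10s in the list above, the sketch below rejects a return that contains only zeros or whose components do not approximately sum to the reported total. The 1% relative tolerance and the function name are assumptions of the sketch; the actual checks and thresholds applied are not specified here.

```python
def ir10_passes_checks(components, reported_total, rel_tol=0.01):
    """Simple quality screen for one block of an IR10-style return.

    Fails if every value is zero, or if the components differ from the
    reported total by more than rel_tol of that total. The tolerance is an
    assumed value, not a documented threshold.
    """
    values = list(components) + [reported_total]
    if all(v == 0 for v in values):
        return False
    # Floor the scale at 1.0 so a reported total of zero does not force
    # an exact match on tiny rounding differences.
    allowed = rel_tol * max(abs(reported_total), 1.0)
    return abs(sum(components) - reported_total) <= allowed
```

For example, components of 100 and 50 against a reported total of 150 pass, whereas the same components against a reported total of 200 fail.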