Quantitative Data

Overview

Statistical disclosure control (SDC), also known as statistical disclosure limitation (SDL) or disclosure avoidance, is a technique used in data-driven research to ensure that no person or organisation is identifiable from the results of an analysis of survey or administrative data, or in the release of microdata (information at the level of the individual, such as home address). 

SDC typically refers to ‘output SDC’, which aims to ensure that, for example, a published table or graph does not disclose confidential information about respondents. 

SDC can also describe protection methods applied to the data themselves: for example, removing names and addresses, limiting extreme values, or swapping problematic observations. This is sometimes referred to as ‘input SDC’, but more commonly anonymisation, de-identification or microdata protection.  
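
As a minimal sketch of input SDC on a toy microdata file (the column names, values and top-code limit are all illustrative, not taken from any real dataset):

    import pandas as pd

    # Toy microdata; names, addresses and the top-code limit are hypothetical.
    df = pd.DataFrame({"name": ["Ann", "Bob", "Cid"],
                       "address": ["1 High St", "2 Low Rd", "3 Mid Ln"],
                       "income": [25000, 31000, 950000]})

    df = df.drop(columns=["name", "address"])       # remove direct identifiers
    df["income"] = df["income"].clip(upper=100000)  # limit extreme values
    df.loc[[0, 1], "income"] = df.loc[[1, 0], "income"].values  # swap two observations
    print(df)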

Textbooks typically cover input SDC and tabular data protection (but not other parts of output SDC). This is because these two issues are of direct interest to the statistical agencies that supported the development of the field.  

For analytical environments, output rules developed for statistical agencies were generally used until data managers began arguing for output SDC rules designed specifically for research. 

Why output SDC matters

Many kinds of social, economic and health research use potentially sensitive data, such as Census data or health records. This information is usually given in confidence and, in the case of administrative data, not always for the purpose of research.  

Researchers are not typically interested in information about a single person or business, instead looking for trends among larger groups. However, the data relate to individual people and businesses. SDC ensures that these cannot be identified from published results, no matter how detailed. 

Even when the data have been properly anonymised, issues can arise: for example, a researcher may still be able to single out an individual through their analysis. SDC identifies such disclosure risks and ensures that the results of the analysis are altered to protect confidentiality.  

SDC therefore requires a balance between protecting confidentiality and ensuring that the results of the data analysis are still useful for statistical research.  

Defining rules for clearing output

Output SDC relies upon having a set of rules that can be followed by an output checker: for example, that a frequency table must have a minimum number of observations per cell, or that survival tables should be right-censored to remove extreme values.  
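
A minimal sketch of two such rules in Python (the threshold and censoring cap are illustrative, not taken from any SRE's rulebook):

    # Illustrative output-checking rules; the values 10 and 120 are hypothetical.
    MIN_CELL_COUNT = 10

    def unsafe_cells(table):
        """Return the cells of a frequency table below the minimum count."""
        return [cell for cell, n in table.items() if n < MIN_CELL_COUNT]

    def right_censor(times, cap=120):
        """Cap survival times so extreme values are not released."""
        return [min(t, cap) for t in times]

    table = {("male", "employed"): 250, ("female", "self-employed"): 3}
    print(unsafe_cells(table))          # [('female', 'self-employed')]
    print(right_censor([12, 34, 480]))  # [12, 34, 120]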

The value and drawbacks of rules for frequency and magnitude tables have been discussed extensively since the late 20th century. However, with growing awareness of the need for rules covering other types of analysis, a more structured approach is needed.  

Safe and unsafe statistics

Some statistical outputs, such as frequency tables, have a high level of inherent risk: differencing, low counts, class disclosure. They therefore need to be checked before publication or release to ensure that there is no meaningful risk.  
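
Differencing, for example, lets two individually ‘safe’ tables be combined to reveal one respondent. A hypothetical illustration:

    # Two frequency tables that differ by a single respondent (the head of
    # a department, say); subtracting them isolates that person's salary band.
    all_staff      = {"under_30k": 40, "over_30k": 25}
    excluding_head = {"under_30k": 40, "over_30k": 24}

    diff = {band: all_staff[band] - excluding_head[band] for band in all_staff}
    print(diff)  # {'under_30k': 0, 'over_30k': 1} -> the head earns over 30k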

These are referred to as ‘unsafe’ statistics.  

However, there are some statistics, such as the coefficients in modelling, that have no meaningful risk and therefore can be released with no further checks.  

These are referred to as ‘safe’ statistics. 

Statbarns

The safe/unsafe model is useful but limited to two simple categories. Within those categories, SDC guidelines largely consist of long lists of statistics and how to handle them. In 2023, the SACRO project reviewed the whole field to see whether a more useful classification scheme could be introduced. The result is the ‘statistical barn’, or statbarn. 

A statbarn is a classification of statistics for disclosure control purposes, where all the statistics in that class share the same characteristics as far as disclosure control is concerned.  

These characteristics are as follows (sketched in code after the list): 

  • Their mathematical form is similar 
  • They share the same risks  
  • They share the same responses to those risks 
  • Output checking rules are applicable to all 
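
One way to picture a statbarn is as a record of these shared properties; a minimal data-structure sketch (the example values are illustrative):

    from dataclasses import dataclass

    @dataclass
    class Statbarn:
        name: str         # members also share a similar mathematical form
        risks: list       # disclosure risks shared by every statistic in the class
        responses: list   # responses shared by every statistic in the class
        rules: list       # output checking rules applicable to all of them

    frequencies = Statbarn(
        name="Frequencies",
        risks=["low counts", "differencing", "class disclosure"],
        responses=["suppression", "aggregation"],
        rules=["minimum cell count"],
    )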

The statbarns defined

As of July 2025, 14 statbarns have been identified, with 12 fully described so far: 

  • Frequencies 
  • Statistical hypothesis tests 
  • Coefficients of association 
  • Position (median, IQR etc) 
  • Extreme values (min, max) 
  • Shape 
  • Linear aggregates 
  • Mode 
  • Non-linear concentration ratios 
  • Odds and risk ratios 
  • Survival tables 
  • Gini coefficients   

These cover almost all statistics, as well as most graph forms, since graphs can be converted into the appropriate statbarn (for example, a pie chart is another form of frequency table). 

For a summary of the characteristics of statbarns, and for a searchable list of statistics, see the statbarns page. 

The SACRO manual (https://zenodo.org/records/10054629) provides detailed guidance on what to look out for, and the rules to be followed for checking.  

Operating models

There are two main approaches to output SDC: principles-based and rules-based.  

In principles-based systems, disclosure control attempts to uphold a specific set of fundamental principles. For example: ‘no person should be identifiable in released microdata.’ 

Rules-based systems, in contrast, apply a specific set of rules that the person performing disclosure control follows. For example: ‘any frequency must be based on at least five observations’; outputs that meet the rules are presumed safe to release.  

In general, official statistics are rules-based whereas research environments are more likely to be principles-based.  

In research environments, the choice of output checking regime can have significant operational implications.  

Rules-based SDC

In rules-based SDC, a rigid set of rules is used to determine whether the results of data analysis can be released. The rules are applied consistently, which makes it obvious which kinds of outputs are acceptable. 
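
A minimal sketch of a rules-based decision (the two rules and their values are illustrative, including a hypothetical dominance rule):

    # Every rule must pass; any failure blocks release, with no discretion.
    RULES = [
        ("minimum cell count of 5", lambda out: min(out["counts"]) >= 5),
        ("no dominant unit", lambda out: out["largest_share"] <= 0.8),
    ]

    def release_decision(output):
        failures = [name for name, rule in RULES if not rule(output)]
        return ("refuse", failures) if failures else ("release", [])

    print(release_decision({"counts": [12, 7, 9], "largest_share": 0.4}))  # release
    print(release_decision({"counts": [12, 3, 9], "largest_share": 0.4}))  # refuse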

Rules-based systems are good for ensuring consistency across time, across data sources, and across production teams, which makes them appealing for statistical agencies.  

Rules-based systems also work well for remote job services that run researchers’ queries directly against microdata.

Principles-based SDC

In principles-based SDC, both the researcher and output checker are trained in SDC. 

They receive a set of rules, but these are ‘rules-of-thumb’ rather than the hard rules seen in rules-based SDC. This means that, in principle, any output may be approved or refused.  

The ‘rules-of-thumb’ are a starting point for the researcher. A researcher may request outputs which breach the ‘rules-of-thumb’ if they are: 

  1. Non-disclosive 
  2. Important 
  3. An exceptional request 

It is up to the researcher to prove that any ‘unsafe’ outputs are non-disclosive, but the checker has the final say.

Since there are no hard rules, this approach requires knowledge of disclosure risks and judgement from both the researcher and the checker. It requires training and an understanding of statistics and data analysis, although it has been argued that it can make the process more efficient than a rules-based model.  
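
A sketch of the difference in code form (the threshold and field names are illustrative): the rule of thumb is only a starting point, and the checker decides whether a justified exception stands.

    from dataclasses import dataclass

    @dataclass
    class OutputRequest:
        min_count: int            # smallest count underlying the output
        justification: str = ""   # researcher's case for an exception

    def principles_based_decision(req, threshold=10, checker_approves=False):
        if req.min_count >= threshold:
            return "release"
        if req.justification and checker_approves:
            return "release as exception"    # the checker has the final say
        return "refer back to researcher"

    req = OutputRequest(min_count=3, justification="aggregate of public data")
    print(principles_based_decision(req, checker_approves=True))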

In the UK, all major secure research environments (SREs) in social science and public health, except for Northern Ireland, are principles-based.  

Automated tools

Output checking is generally labour-intensive, as it requires analysts who can understand what they are looking at and make a judgement about whether to release an output.  

There is therefore considerable interest in automated checking. A Eurostat-commissioned report explored the options for automating output checking, which largely came down to two approaches: 

End-of-process review (EoPR) 

Training a computer to look at the output and understand what it shows. This has the advantage of requiring no additional training for the researcher. However, it can be difficult to explain to an automated system what it is looking at, so checking can be more time-consuming than checking the output manually. tauArgus and sdcTable are both EoPR tools. 

Within-process review (WPR) 

The output checking tool is called at the same time as the output is being generated and has access to the source data. Therefore there is no need to explain how the output has been created. The disadvantage of this approach is that it can slow down processing times and requires the analysis to include the necessary commands to run the output checking tool. The major advantage is that the tool does not need to be taught about the data.  
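
A minimal sketch of the idea, using pandas rather than any real tool's interface: because the check runs inside the analysis, it can inspect the underlying counts directly instead of being told about them.

    import pandas as pd

    def checked_crosstab(df, row, col, threshold=10):
        """Build a crosstab and suppress any cell based on too few records."""
        table = pd.crosstab(df[row], df[col])
        return table.mask(table < threshold)  # suppressed cells become NaN

    df = pd.DataFrame({"region": ["N", "N", "S", "S", "S"],
                       "status": ["employed"] * 4 + ["retired"]})
    print(checked_crosstab(df, "region", "status", threshold=2))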

tauArgus and sdcTable

tauArgus and sdcTable are fully automated, open-source EoPR tools for tabular data protection (frequency and magnitude tables). They are designed to work with multiple tables.  

Metadata needs to be set up describing the outputs and the control parameters. They provide the output checkers with extensive information on potential problems, including secondary disclosure across tables. They can also carry out corrective measures such as secondary suppression and controlled tabular rounding. They do not deal with non-tabular outputs.  
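
The reasoning behind secondary suppression can be shown on a one-dimensional table whose total is also published (a simplified sketch with hypothetical values; the real tools solve this across whole systems of linked tables):

    def suppress(cells, threshold=5):
        """Primary suppression of low cells, then secondary suppression so a
        single hidden cell cannot be recovered from the published total."""
        cells = dict(cells)
        for k, v in list(cells.items()):
            if v < threshold:
                cells[k] = None              # primary suppression
        hidden = [k for k, v in cells.items() if v is None]
        if len(hidden) == 1:                 # total minus visible cells would
            smallest = min((k for k, v in cells.items() if v is not None),
                           key=lambda k: cells[k])
            cells[smallest] = None           # reveal it: suppress a second cell
        return cells

    print(suppress({"A": 20, "B": 3, "C": 12}))  # B and C both suppressed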

Because the metadata must be rewritten for each table, these tools are poorly suited to research use. However, in official statistics, where the same tables are being repeatedly generated and where secondary differencing is considered a small problem, the investment in setting up the tools can be very cost-effective.  

SACRO 

SACRO (semi-autonomous checking of research outputs) is a WPR tool, originally commissioned by Eurostat in 2020 as a proof of concept to show that a general-purpose output checking tool could be developed.  

SACRO directly implements the statbarns model and is principles-based; hence it is semi-automatic, as it allows users to request exceptions and output checkers to override the automated recommendations.  
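
For Python users, SACRO is driven through its acro package, which mirrors familiar pandas calls; the snippet below follows the pattern in the package's documentation, but the exact interface should be checked against the SACRO manual linked above.

    import pandas as pd
    from acro import ACRO   # SACRO's Python package

    acro = ACRO()           # start a checking session
    df = pd.DataFrame({"year": [2020, 2020, 2021],
                       "grant_type": ["A", "B", "A"]})

    # The crosstab is checked as it is created, with full access to the data.
    safe_table = acro.crosstab(df["year"], df["grant_type"])
    acro.finalise("outputs")  # write the outputs and the checking records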

All UK social science secure facilities, and most UK public health secure facilities, are planning to adopt it. 
