EDASIDA seeks to transform data accessibility and security by using innovative data synthesis methods. One major challenge in data
discovery is the lack of high-quality teaching datasets, especially for data held in virtual research environments, where access
restrictions hinder the creation of such datasets. This limits discoverability for new users, who must complete an approval process
before accessing the data, which raises the barrier to entry.
In response to this challenge, our project proposes an innovative solution: the production of bespoke synthetic datasets tailored
for specific teaching purposes. These datasets will be based on previously cleared and published analyses, whether from the trainer's
own work or from third-party sources. Unlike generic synthetic datasets, these bespoke datasets are designed to replicate only the
specific analyses required. Our goal is to enable teachers to generate custom synthetic teaching datasets that resemble real data,
faithfully reproduce the desired outputs, and can be created without needing access to the original data.
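
As a minimal sketch of the idea, and not the project's actual method, the example below builds a synthetic dataset from the published summary of a single-predictor linear regression alone. The function names (`synthesise_for_regression`, `fit_line`) and all numeric values are hypothetical; the only assumption is that the published analysis reports the predictor's mean and standard deviation, the fitted intercept and slope, and the residual standard error.

```python
import random
import statistics


def fit_line(xs, ys):
    """Ordinary least squares for y = b0 + b1 * x; returns (b0, b1)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    b1 = sxy / sxx
    return my - b1 * mx, b1


def synthesise_for_regression(n, x_mean, x_sd, intercept, slope, resid_sd, seed=0):
    """Generate n synthetic (x, y) pairs that reproduce the published fit.

    Only published quantities are used; the original microdata is never read.
    """
    rng = random.Random(seed)
    x = [rng.gauss(0, 1) for _ in range(n)]
    e = [rng.gauss(0, 1) for _ in range(n)]

    # Standardise x so it has sample mean 0 and sample sd 1.
    mx, sx = statistics.fmean(x), statistics.stdev(x)
    x = [(v - mx) / sx for v in x]

    # Centre the noise and remove any component correlated with x,
    # so that it behaves exactly like an OLS residual.
    me = statistics.fmean(e)
    e = [v - me for v in e]
    proj = sum(a * b for a, b in zip(x, e)) / sum(a * a for a in x)
    e = [b - proj * a for a, b in zip(x, e)]

    # Rescale so the residual standard error (n - 2 degrees of freedom) matches.
    rss = sum(v * v for v in e)
    scale = resid_sd * ((n - 2) / rss) ** 0.5
    e = [v * scale for v in e]

    xs = [x_mean + x_sd * v for v in x]
    ys = [intercept + slope * xv + ev for xv, ev in zip(xs, e)]
    return xs, ys


# Hypothetical published values, for illustration only.
xs, ys = synthesise_for_regression(
    n=50, x_mean=45.0, x_sd=10.0, intercept=12.0, slope=0.3, resid_sd=4.0
)
b0, b1 = fit_line(xs, ys)
```

Refitting the model on the synthetic pairs recovers the published intercept and slope up to floating-point error, while the individual rows carry no information beyond the published summary. A dataset of this kind supports the one analysis it was built for and nothing else.
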
A second issue is that data services often struggle to assess the disclosure risk of their outputs, and usually rely on
some variation of output disclosure control rules. Because our methodology embeds analytical output into synthetic microdata,
standard disclosure risk assessment methods for microdata could potentially be used to formalise risk assessments for
statistical outputs. To investigate this secondary application of the new approach, we will also assess
whether the methodology can be inverted to provide a method for systematically assessing the disclosure risk of standard
analytical outputs from TREs.
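
To illustrate what a microdata risk measure applied to such a synthetic dataset might look like, the sketch below counts how many records are unique on a set of quasi-identifiers and reports the smallest group size (the k in k-anonymity). The text above does not say which measures the project will use, so this choice is an assumption; the function name `uniqueness_report`, the field names, and the records are all hypothetical.

```python
from collections import Counter


def uniqueness_report(records, quasi_identifiers):
    """Return (k, n_unique) for a list of dict records.

    k is the size of the smallest group of records sharing the same
    quasi-identifier values; n_unique is the number of records that are
    alone in their group.
    """
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    counts = Counter(keys)
    k = min(counts.values())
    n_unique = sum(1 for key in keys if counts[key] == 1)
    return k, n_unique


# Hypothetical synthetic microdata, for illustration only.
records = [
    {"age_band": "30-39", "region": "North", "income": 21},
    {"age_band": "30-39", "region": "North", "income": 25},
    {"age_band": "40-49", "region": "South", "income": 30},
]
k, n_unique = uniqueness_report(records, ["age_band", "region"])
```

Here the third record is the only one in its age band and region, so the smallest group has size 1 and one record is flagged as unique. If a synthetic dataset built from an analytical output shows this kind of uniqueness, that may indicate the output itself needs closer review.
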
The project has three objectives:
- Investigate the feasibility of generating teaching datasets tailored to restricted data access scenarios
- Develop a systematic approach to assessing the disclosure risk inherent in analytical outputs derived from restricted data sources
- Assess the feasibility of producing linked synthetic data from different sources (using the same methodology)