Project Description

Welcome to the website for the EDASIDA project, funded by the UKRI Future Data Services project and running from 1st July 2024 to 30th June 2025.

This site will host information and news about the project, together with links to code, papers and related materials.

EDASIDA seeks to transform data accessibility and security by using innovative data synthesis methods. One major challenge in data discovery is the lack of high-quality teaching datasets, especially for data stored in virtual research environments where access restrictions hinder their creation. This limits discoverability for new users, who must go through an approval process before accessing the data, creating additional barriers to entry.

In response to this challenge, our project proposes an innovative solution: the production of bespoke synthetic datasets tailored for specific teaching purposes. These datasets will be based on previously cleared and published analyses, whether from the trainer's own work or from third-party sources. Unlike generic synthetic datasets, these bespoke datasets are designed to replicate only the specific analyses required. Our goal is to enable teachers to generate custom synthetic teaching datasets that resemble real data, faithfully reproduce the desired outputs, and can be created without needing access to the original data.
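As a rough illustration of the idea only, and not the project's actual synthesis method, the sketch below starts from a published frequency table and builds synthetic records that reproduce it exactly, without touching any original data. The table, its counts, the variable names and the functions `synthesise_from_table` and `cross_tabulate` are all hypothetical and made up for this example.

```python
import random
from collections import Counter

# Hypothetical published output: a cleared cross-tabulation of age band
# by employment status. The counts are invented for illustration.
published_table = {
    ("16-34", "employed"): 120,
    ("16-34", "not employed"): 45,
    ("35-64", "employed"): 210,
    ("35-64", "not employed"): 60,
    ("65+", "employed"): 15,
    ("65+", "not employed"): 150,
}


def synthesise_from_table(table, seed=0):
    """Expand a table of counts into one synthetic record per counted unit."""
    rows = [
        {"age_band": age_band, "status": status}
        for (age_band, status), count in table.items()
        for _ in range(count)
    ]
    # Shuffle so that row order carries no information about the cells.
    random.Random(seed).shuffle(rows)
    return rows


def cross_tabulate(rows):
    """Re-run the target analysis (a two-way count) on the synthetic records."""
    return dict(Counter((row["age_band"], row["status"]) for row in rows))


synthetic_rows = synthesise_from_table(published_table)
reproduced_table = cross_tabulate(synthetic_rows)
```

Here `reproduced_table` equals `published_table`, so a learner running the cross-tabulation on the synthetic rows sees the published result. Richer outputs, such as regression coefficients, would need a search or optimisation step rather than a direct expansion like this one.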

A second issue is that data services often grapple with assessing the disclosure risk associated with their outputs, usually relying on some variation of output disclosure control rules. Because our methodology embeds analytical output into synthetic microdata, standard disclosure risk assessment methods for microdata could potentially be used to formalise risk assessments for statistical outputs. To investigate this secondary application, we will also assess whether the methodology can be reversed to provide a method for systematically assessing the disclosure risk of standard analytical outputs from trusted research environments (TREs).
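To make the idea of applying a microdata risk check concrete, the sketch below uses one simple and widely used measure: flagging combinations of key variables shared by fewer than k records. It is an assumption-laden illustration rather than the assessment method the project will adopt; the records, the threshold and the function `risky_combinations` are hypothetical.

```python
from collections import Counter

# Hypothetical synthetic microdata derived from an analytical output.
# Each record is a combination of key variables (age band, employment status).
records = [
    ("16-34", "employed"),
    ("16-34", "employed"),
    ("35-64", "employed"),
    ("35-64", "not employed"),
    ("35-64", "not employed"),
    ("65+", "employed"),
]


def risky_combinations(rows, k=2):
    """Return the key-variable combinations shared by fewer than k records."""
    counts = Counter(rows)
    return {combination: n for combination, n in counts.items() if n < k}


flagged = risky_combinations(records, k=2)
```

With k set to 2, `flagged` holds the two combinations that occur only once in the records. In this framing, an output whose embedded microdata contains many such rare combinations would be a candidate for closer review.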

The project has three objectives:

  1. Investigate the feasibility of generating teaching datasets tailored to restricted data access scenarios
  2. Develop a systematic approach to assessing the disclosure risk inherent in analytical outputs derived from restricted data sources
  3. Assess the feasibility of producing linked synthetic data from different sources (using the same methodology)

Project Outputs

The anticipated outputs of this project span a range of resources aimed at fostering knowledge dissemination, skill development, long-term impact, and collaboration between data providers, data services and the wider research community:

  • Open-Source Code: Publicly available code accompanied by comprehensive documentation and training materials, ensuring transparency and accessibility

  • Example Synthetic Datasets: Synthetic versions of various social datasets, hosted by relevant data services, offering tangible teaching resources

  • Beta-Level App: An app with the potential to allow trainers to produce their own teaching datasets

  • Academic Papers: Papers for publication in the privacy, machine learning, and optimization communities, contributing to academic discourse and establishing the project's credibility

The Team

We are Claire Little, Mark Elliot and Richard Allmendinger from the University of Manchester.

We have an advisory group consisting of the Office for National Statistics, Administrative Data Research UK and the UK Data Service.