What is metasyn?

Metasyn is a Python package for generating synthetic data with a focus on maintaining privacy. It is aimed at owners of sensitive datasets, such as public organisations, research groups, and individual researchers, who want to improve the accessibility, reproducibility and reusability of their data. The goal of metasyn is to make it easy for data owners to share the structure of their data, and an approximation of its content, with fewer privacy concerns.

[Figure: Metasyn is auditable]

Use cases

Below are a few possible use cases for metasyn:

  • A researcher cannot share their dataset because of privacy concerns, but shares a synthetic version for reproducibility.

  • A data provider wants to show a preview of their sensitive data using synthetic versions of their datasets.

  • A developer creates synthetic data before the real data is available, so they can write and test data analysis scripts.

Metasyn is explicitly not designed for creating highly accurate synthetic data in which all relationships between columns are reproduced.

Key features

  • Easy: Creating your first synthetic dataset shouldn’t take more than 15 minutes with our quickstart.

  • Fast: Creating synthetic data takes mere seconds for medium-sized (~1000 rows) datasets.

  • Safe and Understandable Synthetic Data: As little information as possible is retained from the original dataset, and you can inspect and understand exactly which information is used to create your synthetic data.

  • Flexible: You can adjust the synthetic data columns to your liking, create your own distributions, plugins and more.

  • Data Type Support: Metasyn supports generating synthetic data for a variety of common data types including categorical, string, integer, float, date, time, and datetime.

  • Integration with Faker: Metasyn integrates with the faker package, a Python library for generating fake data such as names and emails, which allows for more realistic synthetic data.

  • Structured String Detection: Metasyn identifies structured strings within your dataset, which can include formatted text, codes, identifiers, or any string that follows a specific pattern.

  • Handling Unique Values: Metasyn can identify and process variables with unique values or keys in the data, preserving their uniqueness in the synthetic dataset.

  • Multiple table support: Metasyn supports the synthesis of multiple tables at the same time. Primary/foreign key relations can be preserved so that the synthetic tables can be joined correctly.

How does metasyn work?

Metasyn generates synthetic data by fitting models to each column independently based on the column’s data type, then it uses those models to generate new synthetic data.

It starts with a tidy dataframe, meaning that each column has the correct data type (categorical, numeric, datetime, etc.) and missing data is represented as missing values. Next, metasyn models each column independently, an assumption known as marginal independence. For each column, metasyn considers multiple candidate distributions based on the column’s data type, optionally with additional privacy constraints.
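To make the consequence of modelling columns independently concrete, here is a minimal sketch in plain Python. It does not use metasyn or reproduce its implementation; the helper names `fit_column` and `sample_column` and the toy table are hypothetical, chosen only for illustration. Each column is reduced to its category frequencies and then resampled on its own.

```python
import random


def fit_column(values):
    """Hypothetical helper: keep only the category frequencies of one column."""
    counts = {}
    for value in values:
        counts[value] = counts.get(value, 0) + 1
    n = len(values)
    return {label: count / n for label, count in counts.items()}


def sample_column(model, n_rows, rng):
    """Hypothetical helper: draw n_rows values from the fitted frequencies."""
    labels = list(model)
    weights = [model[label] for label in labels]
    return rng.choices(labels, weights=weights, k=n_rows)


# Toy table (illustrative data): the two columns are perfectly linked.
original = {
    "fruit": ["apple", "apple", "banana", "banana"],
    "colour": ["red", "red", "yellow", "yellow"],
}

rng = random.Random(0)
# Every column gets its own model; no information about pairs is stored.
models = {name: fit_column(column) for name, column in original.items()}
synthetic = {name: sample_column(model, 1000, rng) for name, model in models.items()}

# Combinations that never occurred in the original can appear in the output.
pairs = set(zip(synthetic["fruit"], synthetic["colour"]))
print(models["fruit"])
print(sorted(pairs))
```

The per-column frequencies are preserved, but the link between fruit and colour is not, which is the trade-off that marginal independence implies.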

Candidate distributions by data type

Data type     Candidate distributions
Categorical   Categorical, Constant
Continuous    Uniform, Normal, LogNormal, TruncatedNormal, Exponential, Constant
Discrete      Poisson, Uniform, Normal, TruncatedNormal, Categorical, Constant
String        Regex, Categorical, Faker, FreeText, Constant
Date/time     Uniform, Constant

After fitting all candidate distributions to the data, metasyn selects the best-fitting one using the Bayesian information criterion (BIC), or a pseudo-BIC where the BIC is not applicable.
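As an illustration of BIC-based selection, the sketch below fits two candidates (uniform and normal) to a continuous column by maximum likelihood and keeps the one with the lowest BIC, using the standard definition BIC = k ln(n) - 2 ln(L). It is a simplified stand-in written in plain Python, not metasyn's own fitting code; the function names and the output dictionary layout are assumptions made for this example.

```python
import math
import random


def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion: k * ln(n) - 2 * ln(L). Lower is better."""
    return n_params * math.log(n_obs) - 2.0 * log_likelihood


def fit_uniform(values):
    """Maximum-likelihood continuous uniform fit (assumes the data is not constant)."""
    lower, upper = min(values), max(values)
    log_lik = -len(values) * math.log(upper - lower)
    return {"name": "uniform",
            "parameters": {"lower": lower, "upper": upper},
            "bic": bic(log_lik, 2, len(values))}


def fit_normal(values):
    """Maximum-likelihood normal fit (assumes non-zero variance)."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    # Log-likelihood of a normal distribution evaluated at its ML estimates.
    log_lik = -0.5 * n * (math.log(2.0 * math.pi * var) + 1.0)
    return {"name": "normal",
            "parameters": {"mean": mean, "sd": math.sqrt(var)},
            "bic": bic(log_lik, 2, n)}


def select_best(values):
    """Fit every candidate and return the one with the lowest BIC."""
    candidates = [fit(values) for fit in (fit_uniform, fit_normal)]
    return min(candidates, key=lambda candidate: candidate["bic"])


rng = random.Random(42)
bell_shaped = [rng.gauss(0.0, 1.0) for _ in range(500)]
evenly_spread = [float(i) for i in range(100)]

best_bell = select_best(bell_shaped)
best_flat = select_best(evenly_spread)
print(best_bell["name"], best_flat["name"])
```

Both candidates here have two parameters, so the comparison is decided by the likelihood alone; the penalty term matters once candidates with different numbers of parameters compete.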

Once the MetaFrame (the collection of fitted column models) is created, metasyn generates synthetic data by randomly sampling values from the fitted distributions. You can also save your MetaFrame to a file (in JSON or TOML format).

For more detail on how metasyn works, see our paper.

Additionally, metasyn supports generation of synthetic data across related tables. The above process is applied to each table independently, and the relationships between tables can be preserved by defining primary and foreign key information (subset, equal, or equal ordered). After synthesizing the tables, the primary key column is used to (re)generate the foreign key column according to the specified relationship.
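The idea of regenerating a foreign key column from a primary key column can be sketched as follows. This is plain Python, not metasyn's API, and it covers only one possible reading of the "subset" relationship: every foreign key value must occur among the primary keys. The table names, column names, and the helper `regenerate_foreign_key` are hypothetical.

```python
import random


def regenerate_foreign_key(primary_keys, n_rows, rng):
    """Hypothetical helper: draw each foreign key from the synthetic primary keys."""
    return [rng.choice(primary_keys) for _ in range(n_rows)]


rng = random.Random(0)

# Two tables that were synthesized independently (illustrative values).
patients = {"patient_id": [1, 2, 3, 4, 5]}
visits = {
    "patient_id": [901, 17, 42, 8, 300, 12, 77, 64],  # would not join with patients
    "cost": [10.0, 25.5, 8.0, 99.9, 12.0, 40.0, 5.5, 60.0],
}

# Overwrite the foreign key column so that every visit refers to an existing patient.
visits["patient_id"] = regenerate_foreign_key(
    patients["patient_id"], len(visits["cost"]), rng
)

print(visits["patient_id"])
```

After this step a join of the two synthetic tables on patient_id matches every visit to a patient, which is the property the key relationship is meant to preserve.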

A detailed explanation of how multiple table generation works can be found here.

MetaFrames and GMF files

[Figure: Metasyn pipeline]

One of the main distinguishing features of metasyn is the ability to save the model for your dataset in a standardized, human-understandable and machine-readable format: the Generative Metadata Format (GMF). The equivalent object in Python is metasyn.metaframe.MetaFrame.

Why do I need the MetaFrame or GMF file?

MetaFrames are a core part of the design philosophy of metasyn: you should inspect the model before releasing any potentially private information. While metasyn will generally provide safe synthetic output, this cannot be guaranteed; whether something is considered privacy sensitive depends on the context, which metasyn does not have access to. We also generally recommend saving the GMF file for reproducibility purposes.

The GMF format can be stored as a .json file (default) or as a .toml file.

An example of a GMF file:
{
  "gmf_version": "1.1",
  "n_rows": 5,
  "n_columns": 5,
  "provenance": {
      "created by": {
          "name": "metasyn",
          "version": "1.1.1.dev34+g59e65a26b.d20251010"
      },
      "creation time": "2025-10-10T10:53:38.821661"
  },
  "vars": [
      {
          "name": "ID",
          "type": "discrete",
          "dtype": "Int64",
          "prop_missing": 0.0,
          "distribution": {
              "name": "core.unique_key",
              "version": "1.0",
              "class_name": "UniqueKeyDistribution",
              "unique": true,
              "parameters": {
                  "lower": 1,
                  "consecutive": true
              }
          },
          "creation_method": {
              "created_by": "metasyn",
              "unique": true
          }
      },
      {
          "name": "fruits",
          "type": "categorical",
          "dtype": "Categorical",
          "prop_missing": 0.0,
          "distribution": {
              "name": "core.multinoulli",
              "version": "1.0",
              "class_name": "MultinoulliDistribution",
              "unique": false,
              "parameters": {
                  "labels": [
                      "apple",
                      "banana"
                  ],
                  "probs": [
                      0.4,
                      0.6
                  ]
              }
          },
          "creation_method": {
              "created_by": "metasyn"
          }
      },
      {
          "name": "B",
          "type": "discrete",
          "dtype": "Int64",
          "prop_missing": 0.0,
          "distribution": {
              "name": "core.uniform",
              "version": "1.0",
              "class_name": "DiscreteUniformDistribution",
              "unique": false,
              "parameters": {
                  "lower": 1,
                  "upper": 6
              }
          },
          "creation_method": {
              "created_by": "metasyn",
              "unique": false
          }
      },
      {
          "name": "cars",
          "type": "categorical",
          "dtype": "Categorical",
          "prop_missing": 0.0,
          "distribution": {
              "name": "core.multinoulli",
              "version": "1.0",
              "class_name": "MultinoulliDistribution",
              "unique": false,
              "parameters": {
                  "labels": [
                      "audi",
                      "beetle"
                  ],
                  "probs": [
                      0.2,
                      0.8
                  ]
              }
          },
          "creation_method": {
              "created_by": "metasyn"
          }
      },
      {
          "name": "optional",
          "type": "discrete",
          "dtype": "Int64",
          "prop_missing": 0.2,
          "distribution": {
              "name": "core.uniform",
              "version": "1.0",
              "class_name": "DiscreteUniformDistribution",
              "unique": false,
              "parameters": {
                  "lower": -30,
                  "upper": 301
              }
          },
          "creation_method": {
              "created_by": "metasyn"
          }
      }
  ]
}
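Because a GMF file is ordinary JSON, it can be inspected with standard tools before it is shared. The sketch below uses only Python's json module, not metasyn itself, and works on a shortened copy of the example above (two of the five variables, with some fields omitted). The `summarise` helper and the idea of listing category labels as "literal values" are assumptions made for illustration; they are not part of the GMF specification.

```python
import json

# Shortened copy of the example GMF file above (two of its five variables).
GMF_TEXT = """
{
  "gmf_version": "1.1",
  "n_rows": 5,
  "n_columns": 2,
  "vars": [
    {
      "name": "fruits",
      "type": "categorical",
      "dtype": "Categorical",
      "prop_missing": 0.0,
      "distribution": {
        "name": "core.multinoulli",
        "parameters": {"labels": ["apple", "banana"], "probs": [0.4, 0.6]}
      }
    },
    {
      "name": "B",
      "type": "discrete",
      "dtype": "Int64",
      "prop_missing": 0.0,
      "distribution": {
        "name": "core.uniform",
        "parameters": {"lower": 1, "upper": 6}
      }
    }
  ]
}
"""


def summarise(gmf):
    """Hypothetical helper: one summary row per variable in the GMF document."""
    rows = []
    for var in gmf["vars"]:
        dist = var["distribution"]
        rows.append({
            "column": var["name"],
            "type": var["type"],
            "distribution": dist["name"],
            "prop_missing": var["prop_missing"],
            # Category labels are stored verbatim, so they deserve a manual check.
            "literal_values": dist["parameters"].get("labels", []),
        })
    return rows


summary = summarise(json.loads(GMF_TEXT))
for row in summary:
    print(row)
```

For a file on disk, json.load on an open file handle would replace json.loads. A listing like this makes it easier to spot values copied directly from the original data, such as category labels, before the file is released.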