Improve your synthetic data

When you run metasyn on your dataframe, by default it attempts to find the best-fitting distribution for each column. This can be suboptimal: for example, metasyn cannot know that a column contains names of people. A column can also be too privacy-sensitive to fit with the default methods.

Metasyn provides two ways to improve the quality of your synthetic data: specifying additional information directly in Python, or providing a configuration file in the TOML format. We expect interactive users to work in Python directly; for programmatic use, the configuration file is the more appropriate interface (see also our Command-Line Interface).

Python

from metasyn import MetaFrame, VarSpec
from metasyn.distribution import FakerDistribution
from metasyncontrib.disclosure import DisclosurePrivacy

# Use the faker distribution to generate names for the "Name" column.
specs = [
   VarSpec(name="Name", distribution=FakerDistribution(faker_type="name")),
]

# df is the dataframe you want to create synthetic data for.
mf = MetaFrame.fit_dataframe(
   df,
   privacy=DisclosurePrivacy(),
   var_specs=specs,
)

Configuration file

MetaFrame.fit_dataframe(
   df,
   config="your_config_file.toml"
)

This refers to a configuration file called your_config_file.toml:

privacy = "disclosure"

[[var]]
name = "Name"
description = "Name of the unfortunate passenger of the Titanic."
distribution = {implements = "core.faker", parameters = {faker_type = "name"}}

More examples of metasyn configuration files are available on our GitHub page.

What is the TOML file format?

TOML files are plain text: they can be opened in any text editor and are both human- and machine-readable. You should be able to create your own TOML files from the examples on this page, but for more details refer to the TOML Documentation. Note that the TOML format is case sensitive.
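To make the structure and the case sensitivity concrete, the sketch below parses a shortened version of the configuration shown earlier with Python's standard TOML reader. It does not use metasyn itself; it only illustrates how any TOML parser sees the file. The `tomllib` module assumes Python 3.11 or later, with the third-party `tomli` package as a fallback.

```python
# Parse a small configuration with the standard-library TOML reader.
try:
    import tomllib
except ModuleNotFoundError:  # Python < 3.11: fall back to the tomli package
    import tomli as tomllib

config_text = """
privacy = "disclosure"

[[var]]
name = "Name"
distribution = {implements = "core.faker", parameters = {faker_type = "name"}}
"""

config = tomllib.loads(config_text)

# Each [[var]] block becomes one entry in a list of dictionaries,
# and inline tables such as `distribution` become nested dictionaries.
first_var = config["var"][0]
print(first_var["name"])                        # Name
print(first_var["distribution"]["implements"])  # core.faker

# Keys are case sensitive: "Privacy" is a different key from "privacy".
other = tomllib.loads('Privacy = "disclosure"')
print("privacy" in other)  # False
```

A key written with the wrong capitalization is therefore a different key as far as the parser is concerned, so it is worth copying key names exactly from the examples.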

The remainder of this page serves as a reference for the different options to improve synthetic data quality.

General specifications

Three general options can be set: privacy, n_rows, and plugins. In our Python interface, these are arguments to fit_dataframe(); in the configuration file, they are set at the top of the file.
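The position at the top of the file matters because of a general TOML rule: every key that follows a table header such as `[[var]]` belongs to that table. The following sketch shows this with the standard-library parser only (Python 3.11 or later assumed, `tomli` as a fallback). How metasyn itself reacts to a misplaced key is not shown here.

```python
try:
    import tomllib
except ModuleNotFoundError:  # Python < 3.11: fall back to the tomli package
    import tomli as tomllib

# General option placed at the top of the file: it is a top-level key.
top_first = tomllib.loads("""
n_rows = 100

[[var]]
name = "Name"
""")

# Same key placed after a [[var]] header: it now belongs to that variable.
top_last = tomllib.loads("""
[[var]]
name = "Name"
n_rows = 100
""")

print("n_rows" in top_first)  # True
print("n_rows" in top_last)   # False
print(top_last["var"][0])     # {'name': 'Name', 'n_rows': 100}
```

Keeping the general options above the first `[[var]]` block avoids this kind of silent misplacement.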

Privacy: `privacy`

Using privacy plug-ins (see Extensions), metasyn can increase the level of privacy protection. An example is disclosure privacy, which limits the influence of disclosive values, such as outliers, on the fitted distributions.

Python

from metasyn import MetaFrame
from metasyncontrib.disclosure import DisclosurePrivacy

MetaFrame.fit_dataframe(
   df,
   privacy=DisclosurePrivacy(partition_size=11)
)

Configuration file

privacy = "disclosure"
parameters = {partition_size = 11}

Number of rows: `n_rows`

By default, metasyn sets the number of rows to the number of rows in your dataframe. This can be disclosive or otherwise undesirable, in which case you can specify the number manually:

Python

MetaFrame.fit_dataframe(
   df,
   n_rows=100
)

Configuration file

n_rows = 100

Distribution registry: `plugins`

Extra distributions and fitters can be added using plugins. By default, all installed plugins are used. For reproducibility, it is a good idea to set the plugins explicitly, so that other people using your configuration file know which plugins were used. This can be done as follows:

Python

MetaFrame.fit_dataframe(
   df,
   plugins=["builtin", "disclosure"],
)

Configuration file

plugins = ["builtin", "disclosure"]
