Improve your synthetic data
When you run metasyn on your dataframe, by default it will attempt to find the best distribution for each of your columns. These defaults can be suboptimal: for example, metasyn cannot know that a column contains names of people. A column can also be too privacy-sensitive to fit with the default methods.
Metasyn provides two ways to improve the quality of your synthetic data: specifying additional information directly in Python, or providing a configuration file in the TOML format. We expect interactive users to work in Python directly, while the configuration file is the more appropriate interface for programmatic use (see also our Command-Line Interface).
from metasyn import MetaFrame, VarSpec
from metasyn.distribution import FakerDistribution
from metasyncontrib.disclosure import DisclosurePrivacy
specs = [
    VarSpec(name="Name", distribution=FakerDistribution(faker_type="name")),
]
mf = MetaFrame.fit_dataframe(
    df,
    privacy=DisclosurePrivacy(),
    var_specs=specs,
)
MetaFrame.fit_dataframe(
    df,
    config="your_config_file.toml"
)
This refers to a configuration file called your_config_file.toml:
privacy = "disclosure"
[[var]]
name = "Name"
description = "Name of the unfortunate passenger of the Titanic."
distribution = {implements = "core.faker", parameters = {faker_type = "name"}}
More examples for metasyn configuration files are available on our GitHub page.
What is the TOML file format?
The TOML file format is plain text: it can be opened in any text editor and is both human- and machine-readable. You should be able to create your own TOML files from the examples below, but for more details refer to the TOML Documentation. Note that TOML is case sensitive.
The remainder of this page serves as a reference for the different options to improve synthetic data quality.
General specifications
Three general options can be set: privacy, n_rows, and plugins.
In our Python interface, these are arguments to fit_dataframe(); in the configuration file, they are set at the top of the file.
Privacy: privacy
Using privacy plug-ins (see Extensions), metasyn can increase the level of privacy.
An example is disclosure privacy, which limits the influence of various disclosive values such as outliers on the fitted distributions.
from metasyncontrib.disclosure import DisclosurePrivacy
MetaFrame.fit_dataframe(
    df,
    privacy=DisclosurePrivacy(partition_size=11)
)
privacy = "disclosure"
parameters = {partition_size = 11}
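To build intuition for why partitioning can limit the influence of outliers, here is a minimal sketch of micro-aggregation, a general statistical disclosure control technique: the sorted values are split into groups of at least `partition_size` values, and each group is summarised by its mean. That the disclosure plugin works this way is an assumption made for illustration. The function name `micro_aggregate` and the data are hypothetical, and the plugin's actual implementation may differ.

```python
from statistics import mean


def micro_aggregate(values, partition_size=11):
    """Return the mean of each group of sorted values.

    Hypothetical helper for illustration only; not the plugin's code.
    Every group holds at least `partition_size` values (the last group
    absorbs the remainder), unless there are fewer values than that.
    """
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    n_groups = max(1, len(ordered) // partition_size)
    group_means = []
    for i in range(n_groups):
        start = i * partition_size
        # The last group runs to the end so that no value is dropped.
        stop = (i + 1) * partition_size if i < n_groups - 1 else len(ordered)
        group_means.append(mean(ordered[start:stop]))
    return group_means


# 22 ordinary values plus one extreme outlier.
fares = list(range(1, 23)) + [1000]
aggregated = micro_aggregate(fares, partition_size=11)
print(len(aggregated))         # 2
print(max(aggregated) < 1000)  # True: the outlier no longer appears directly
```

A distribution fitted to the group means instead of the raw values never sees the individual outlier, only its diluted effect on one group.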
Number of rows: n_rows
By default, metasyn will set the number of rows to the number of rows of your dataframe. If this is disclosive or otherwise undesirable, you can specify it manually:
MetaFrame.fit_dataframe(
    df,
    n_rows=100
)
n_rows = 100
Distribution registry: plugins
Extra distributions and fitters can be added using plugins. By default all installed plugins will be used. For reproducibility, it is a good idea to set the plugins explicitly, so that other people using your configuration file understand which plugins were used. This can be done as follows:
MetaFrame.fit_dataframe(
    df,
    plugins=["builtin", "disclosure"],
)
plugins = ["builtin", "disclosure"]
Column specifications
In addition to specifications that apply to all columns, you can also specify the behavior for individual columns. The most common use case is setting the distribution type and/or parameters.
# We suggest using the VarSpec object like so:
from metasyn import MetaFrame, VarSpec
from metasyn.distribution import RegexDistribution
specs = [
    VarSpec(
        name="Cabin",
        description="Cabin number of the passenger.",
        distribution=RegexDistribution("[A-F][0-9]{2,3}"),
        prop_missing=0.2,
    ),
    VarSpec(
        name=...,
        description=...,
        distribution=...,
    ),
    ...
]
MetaFrame.fit_dataframe(df, var_specs=specs)
# In this example you put the specifications in the TOML file.
MetaFrame.fit_dataframe(df, config="your_config_file.toml")
[[var]]
name = "Cabin"
description = "Cabin number of the passenger."
distribution = {implements = "core.regex", parameters = {regex_data = "[A-F][0-9]{2,3}"}}
prop_missing = 0.2
[[var]]
name = "Another column name"
description = "With descriptions."
# And more specifications for that column after this.
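Because TOML is case sensitive, a mistyped key such as `Prop_Missing` is a different key from `prop_missing`. The sketch below checks already-parsed `[[var]]` tables (plain dictionaries, as a TOML parser would produce) against the column options described on this page. The helper `unknown_var_keys` and the key set are assumptions for illustration; metasyn may accept other keys as well.

```python
# Column options described on this page (assumed here, not exhaustive).
KNOWN_VAR_KEYS = {
    "name", "description", "distribution", "prop_missing", "privacy", "unique",
}


def unknown_var_keys(var_tables):
    """Map each column name to the set of keys not described on this page."""
    problems = {}
    for table in var_tables:
        unknown = set(table) - KNOWN_VAR_KEYS
        if unknown:
            problems[table.get("name", "<unnamed>")] = unknown
    return problems


var_tables = [
    {"name": "Cabin", "Prop_Missing": 0.2},  # wrong capitalisation
    {"name": "Name", "description": "Name of the passenger."},
]
print(unknown_var_keys(var_tables))  # {'Cabin': {'Prop_Missing'}}
```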
Description: description
You can add a description of your column. This will not be used in the estimation phase of metasyn, but it will be present in the resulting GMF file so that others can more easily understand what is in the data.
specs = [ VarSpec(name="Cabin", description="Cabin number of the passenger.") ]
MetaFrame.fit_dataframe(df, var_specs=specs)
[[var]]
name = "Cabin"
description = "Cabin number of the passenger."
Missing values: prop_missing
By default, metasyn will estimate the proportion of missing values from the data, but you can override this with the prop_missing parameter (between 0 and 1, inclusive):
specs = [ VarSpec(name="Cabin", prop_missing=0.2) ]
MetaFrame.fit_dataframe(df, var_specs=specs)
[[var]]
name = "Cabin"
prop_missing = 0.2
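As an illustration of the quantity involved, the following sketch computes the proportion of missing values in a column and validates a manual override. It uses plain Python lists with `None` for missing values; both helper names are hypothetical, and this is not metasyn's own implementation.

```python
def estimate_prop_missing(values):
    """Proportion of entries that are missing (None) in a column."""
    if len(values) == 0:
        raise ValueError("cannot estimate from an empty column")
    return sum(value is None for value in values) / len(values)


def check_prop_missing(prop_missing):
    """Return the override as a float if it lies in [0, 1], else raise."""
    if not 0.0 <= prop_missing <= 1.0:
        raise ValueError(f"prop_missing must be between 0 and 1, got {prop_missing}")
    return float(prop_missing)


cabins = ["C85", None, "E46", None, None]
print(estimate_prop_missing(cabins))  # 0.6
print(check_prop_missing(0.2))        # 0.2
```

In this toy column the estimated proportion is 0.6; setting prop_missing to 0.2 would replace that estimate with the value you chose.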
Privacy: privacy
You can override the privacy level for specific columns:
from metasyncontrib.disclosure import DisclosurePrivacy
specs = [ VarSpec(name="Cabin", privacy=DisclosurePrivacy()) ]
MetaFrame.fit_dataframe(df, var_specs=specs)
[[var]]
name = "Cabin"
privacy = "disclosure"
Uniqueness: unique
Some distributions produce only unique values, without any repeats (see the distributions starting with Unique in the Distribution list). By default, metasyn will not select any unique distributions. An exception is metasyn.distribution.UniqueKeyDistribution, which can be selected if the values in the column are sequentially increasing. When the column represents a variable that is known to be unique (such as IDs or other key variables), this uniqueness can be enforced with:
specs = [ VarSpec(name="Cabin", unique=True) ]
MetaFrame.fit_dataframe(df, var_specs=specs)
[[var]]
name = "Cabin"
unique = true # Notice the lower case for TOML
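Before enforcing uniqueness, it can help to confirm that the original column really has no repeats. The sketch below is one simple way to do that with the standard library. The helper `repeated_values` is hypothetical, and ignoring missing values (`None`) is an assumption of this example.

```python
from collections import Counter


def repeated_values(values):
    """Return the sorted non-missing values that occur more than once."""
    counts = Counter(value for value in values if value is not None)
    return sorted(value for value, count in counts.items() if count > 1)


passenger_ids = [1, 2, 3, 4, 5]
cabins = ["C85", "E46", "C85", None, None]

print(repeated_values(passenger_ids))  # []
print(repeated_values(cabins))         # ['C85']
```

An empty result suggests the column is a reasonable candidate for unique=True; a non-empty one means the synthetic data would be unique where the real data is not.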
Distribution: distribution
You can specify the distribution for a column in two different ways: either specify only the type of distribution and let metasyn find the parameters or specify both the type and parameters of the distribution.
from metasyn.distribution import RegexDistribution
cabin_dist = RegexDistribution("[A-F][0-9]{2,3}")
specs = [ VarSpec(name="Cabin", distribution=cabin_dist) ]
MetaFrame.fit_dataframe(df, var_specs=specs)
[[var]]
name = "Cabin"
distribution = {implements = "core.regex", parameters = {regex_data = "[A-F][0-9]{2,3}"}}
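When you fix the parameters yourself, as with the regular expression above, it is worth checking how well they describe the real values. The following sketch uses Python's standard `re` module to compute the share of non-missing values that fully match the pattern. The helper `share_matching` and the example data are hypothetical, and metasyn's regex distribution may interpret patterns differently from `re`.

```python
import re

# The same pattern as in the Cabin example above.
CABIN_PATTERN = re.compile(r"[A-F][0-9]{2,3}")


def share_matching(values, pattern):
    """Share of non-missing values that match the pattern completely."""
    present = [value for value in values if value is not None]
    if not present:
        return 0.0
    matches = sum(pattern.fullmatch(value) is not None for value in present)
    return matches / len(present)


# "B5" has too few digits, "G123" starts with a letter outside A-F,
# and "A1234" has too many digits.
cabins = ["C85", "E46", "B5", None, "G123", "A1234"]
print(share_matching(cabins, CABIN_PATTERN))  # 0.4
```

A low share indicates that the chosen pattern would generate synthetic values that look different from most of the real ones.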
Ensure that the column type matches the type of the distribution. For example, if the column has string values, use a distribution that supports the string type. An overview of all distributions sorted by type can be found in the API