FAQ
Here, we’ve compiled answers to commonly asked questions about metasyn and its development. If you have any other questions, need further assistance, or want to discuss something related to metasyn, don’t hesitate to contact us directly. You can find our contact details on the contact page. We’re more than happy to assist you and provide any additional information you may need.
Can I use pandas DataFrames with metasyn?
Yes, you can use pandas DataFrames with metasyn. Note, however, that metasyn uses polars internally for consistent typing and handling of missing data. Supplying pandas DataFrames is supported, but the conversion can cause problems in some edge cases, so we recommend creating polars DataFrames where possible. The synthetic datasets generated by metasyn are always polars DataFrames. If needed, you can convert them back to pandas DataFrames with df_pandas = df_polars.to_pandas().
What is a MetaFrame?
A MetaFrame is a fitted model that describes the aggregate structure and characteristics of a dataset. It functions like (statistical) metadata for the dataset, providing information about the dataset without revealing the actual data itself. When metasyn is given a dataset (as a DataFrame), it generates this MetaFrame to capture certain key aspects of the data.
Key elements encapsulated in a MetaFrame include variable names, their data types, the proportion of missing values, and the parameters of the distributions that these variables follow in the dataset. This information is sufficient to understand the overall structure and attributes of the data, without divulging the exact data points.
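As a rough illustration of the kind of aggregate information described above, the following sketch summarises a single numeric column using only the Python standard library. It is not metasyn's implementation: the helper summarize_column, the dictionary keys, and the choice of a normal distribution are all assumptions made for this example.

```python
import statistics


def summarize_column(name, values):
    """Return aggregate metadata for one numeric column (hypothetical helper)."""
    present = [v for v in values if v is not None]
    return {
        "name": name,
        "type": "continuous",
        # Share of entries that are missing (None).
        "prop_missing": 1 - len(present) / len(values),
        # Assumption: describe the non-missing values with a normal distribution.
        "distribution": {
            "name": "normal",
            "mean": statistics.mean(present),
            "sd": statistics.stdev(present),
        },
    }


ages = [22.0, 38.0, None, 35.0, 54.0, None, 27.0, 14.0]
summary = summarize_column("Age", ages)
print(summary)
```

The resulting dictionary contains only the column name, its type, the proportion of missing values, and two distribution parameters. None of the individual values appear in it.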
When a MetaFrame is created from an input dataset, it can be saved for auditing or manual editing.
In the metasyn workflow, once you have a MetaFrame, metasyn can generate synthetic data that aligns with the MetaFrame. This synthetic data shares the structural and distributional characteristics (as defined in the MetaFrame) with the original data but does not contain any actual data points from the original dataset, thus preserving privacy.
The process of generating synthetic data solely from the MetaFrame ensures that this synthetic data is separate and independent from the original sensitive source data, thereby reducing privacy concerns while sharing or distributing this synthetic data.
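The idea of generating data from the metadata alone can be sketched in a few lines. The example below is a simplified stand-in, not metasyn's code: the age_meta dictionary and the synthesize_column helper are hypothetical, and a normal distribution is assumed for the column.

```python
import random

# Hypothetical metadata for one column, in the spirit of a MetaFrame entry.
age_meta = {"name": "Age", "prop_missing": 0.25, "mean": 31.7, "sd": 13.9}


def synthesize_column(meta, n, seed=None):
    """Draw n synthetic values using only the metadata, never the real data."""
    rng = random.Random(seed)
    values = []
    for _ in range(n):
        if rng.random() < meta["prop_missing"]:
            values.append(None)  # reproduce the share of missing values
        else:
            values.append(rng.gauss(meta["mean"], meta["sd"]))
    return values


synthetic_ages = synthesize_column(age_meta, n=1000, seed=1)
missing_share = synthetic_ages.count(None) / len(synthetic_ages)
print(f"share of missing values: {missing_share:.2f}")
```

Because the function only receives the metadata, the original records cannot leak into the output through it. Whatever the metadata itself discloses, however, is still carried over.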
I encountered the warning: “Metasyn detected that variable {x} is potentially unique.” What should I do?
This warning occurs when metasyn detects a column that appears to contain only unique values in the real dataset, but that was not specified as unique when fitting the MetaFrame. To address this, pass a list of VarSpec objects through the var_specs parameter and mark the column as unique. Here’s an example of how to do this (in this example PassengerId is the column with unique values):
from metasyn import MetaFrame, VarSpec

# Create a list of variable specifications and mark the column as unique:
var_specs = [
    VarSpec("PassengerId", unique=True),
]

# Call fit_dataframe(), passing the list as the `var_specs` argument:
mf = MetaFrame.fit_dataframe(df, var_specs=var_specs)
More information on how to use the optional parameters in the metasyn.MetaFrame.fit_dataframe() function can be found in Improve your synthetic data.
You can also set the uniqueness of a variable in the configuration file.
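If you want to see in advance which columns might trigger this warning, a simple check is whether any non-missing value repeats. The sketch below uses plain Python dictionaries and a hypothetical helper, potentially_unique. The rule metasyn actually uses to raise the warning may differ.

```python
def potentially_unique(columns):
    """Return names of columns in which no non-missing value repeats."""
    names = []
    for name, values in columns.items():
        present = [v for v in values if v is not None]
        if present and len(set(present)) == len(present):
            names.append(name)
    return names


data = {
    "PassengerId": [1, 2, 3, 4, 5],
    "Pclass": [3, 1, 3, 1, 3],
    "Cabin": [None, "C85", None, "C123", None],
}
candidates = potentially_unique(data)
print(candidates)
```

Note that Cabin is flagged here only because its two non-missing values happen to differ. A check like this can only suggest candidates, and you still decide which columns should be specified as unique.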
I found a bug/issue, where can I report it?
If you encounter any bugs or have identified an issue with metasyn, we encourage you to report it on our GitHub issue tracker. This allows us to track and address the problem efficiently. Alternatively, you can find out how to contact us through the details provided in our contact page.
I would like to contribute to the project, how do I get started?
That’s fantastic! We welcome and appreciate contributions from the community. To get started with contributing to metasyn, please refer to our detailed guide in the Developer guide section. It contains all the information you need to start contributing.
Why did you change the name from MetaSynth to metasyn?
The project was originally named MetaSynth. However, as we progressed, we discovered that an established audio synthesis program already existed under the same name. To avoid confusion between these two unrelated projects, we changed the name of our project to metasyn. The new name still reflects the package’s core goal of metadata-driven data synthesis. We also changed the styling of the name to all lowercase, to align with how the package is used in code (e.g. `import metasyn`).
It is important to note that despite the name change, metasyn as a project and the package’s functionality remain the same.
What is the classification of metasyn’s synthetic datasets?
The synthetic datasets generated by metasyn are classified as synthetically-augmented plausible datasets, according to the categorization of the Office for National Statistics (ONS).
ONS criteria for a Synthetically-augmented plausible dataset:
Preserve the format and record-level plausibility as detailed previously and replicate marginal (univariate) distributions where possible.
Constructed based on the real dataset, values are generated based on observed distributions (with added fuzziness and smoothing) but no attempt made to preserve relationships.
Missing value codes and their frequency is to be preserved.
Disclosure control evaluation is necessary case by case, special care to be taken with names and so on.
To be used for extended code testing, minimal analytical value, non-negligible disclosure risk.
Can I make the generation of synthetic data reproducible?
To some extent, yes. You can set the seed for the generation of synthetic data. In Python:
mf.synthesize(10, seed=1234)
Or from the command line:
metasyn synthesize gmf_file.json --preview --seed 1234
This should give the same results each time you run it on the same machine. However, we cannot guarantee reproducibility across different versions of Python, NumPy, or Faker. Different CPU architectures will also most likely produce different results.
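The principle behind seeding can be shown with Python's standard library random module, used here as a stand-in. This is not how metasyn draws its values internally, and the draw helper is hypothetical.

```python
import random


def draw(n, seed):
    """Draw n values from a generator initialised with the given seed."""
    rng = random.Random(seed)  # local generator, leaves global state untouched
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


first = draw(5, seed=1234)
second = draw(5, seed=1234)
other = draw(5, seed=4321)

print(first == second)  # same seed, same sequence of draws
print(first == other)   # a different seed gives a different sequence
```

Within one environment the same seed yields the same sequence. The caveats above still apply once the interpreter, library versions, or hardware change.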