Metasyn Documentation
Welcome to the metasyn documentation.
Metasyn is a Python package for generating synthetic tabular data with a focus on privacy. It is designed for owners of sensitive datasets who want to share approximations of their data so that others can perform exploratory analysis and testing without disclosing real values.
Metasyn has three main functionalities:
Estimation: Metasyn can create a MetaFrame from a dataset. A MetaFrame is metadata describing a table, augmented with statistical information on the columns. It captures individual distributions and features and enables the generation of synthetic data based on it.
Generation: Metasyn can generate synthetic data based on a MetaFrame. The synthetic data depends solely on the MetaFrame, thereby maintaining a critical separation between the original sensitive data and the generated synthetic data.
Serialization: Metasyn can export a MetaFrame into an easy-to-read Generative Metadata Format (GMF) file. This allows users to audit, understand, and modify their data generation model. These GMF files can also be imported back into Metasyn to generate synthetic data.
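The three steps above can be illustrated with a minimal, standard-library-only sketch. This is not metasyn's API or its GMF layout: the function names fit_metadata and synthesize, the metadata keys, and the choice of a normal distribution for numeric columns are all assumptions made for illustration. It only shows the idea that generation reads the metadata and never the original rows.

```python
import json
import random
import statistics


def fit_metadata(rows):
    """Estimate toy per-column metadata from a list of dict rows.

    Hypothetical stand-in for a MetaFrame; assumes every column has
    at least one non-missing value.
    """
    n_rows = len(rows)
    columns = []
    for name in rows[0]:
        values = [row[name] for row in rows if row[name] is not None]
        prop_missing = 1 - len(values) / n_rows
        if all(isinstance(v, (int, float)) for v in values):
            # Assumption: numeric columns are summarised as a normal distribution.
            dist = {
                "type": "normal",
                "mean": statistics.mean(values),
                "sd": statistics.pstdev(values),
            }
        else:
            labels = sorted(set(values))
            dist = {
                "type": "categorical",
                "labels": labels,
                "probs": [values.count(label) / len(values) for label in labels],
            }
        columns.append(
            {"name": name, "prop_missing": prop_missing, "distribution": dist}
        )
    return {"n_rows": n_rows, "columns": columns}


def synthesize(meta, n, seed=None):
    """Draw n synthetic rows using only the metadata, never the original rows."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        row = {}
        for col in meta["columns"]:
            dist = col["distribution"]
            if rng.random() < col["prop_missing"]:
                row[col["name"]] = None
            elif dist["type"] == "normal":
                row[col["name"]] = rng.gauss(dist["mean"], dist["sd"])
            else:
                row[col["name"]] = rng.choices(
                    dist["labels"], weights=dist["probs"]
                )[0]
        rows.append(row)
    return rows


# Estimation: summarise a small (made-up) sensitive table.
sensitive = [
    {"age": 34, "city": "Utrecht"},
    {"age": 51, "city": "Leiden"},
    {"age": None, "city": "Utrecht"},
    {"age": 45, "city": "Utrecht"},
]
meta = fit_metadata(sensitive)

# Serialization: the metadata is plain JSON, so it can be audited or edited.
serialized = json.dumps(meta, indent=2)
restored = json.loads(serialized)

# Generation: only the restored metadata is needed; a seed makes it repeatable.
synthetic = synthesize(restored, n=5, seed=42)
```

Because the synthetic rows are drawn from the restored metadata alone, anyone who receives the serialized file and uses the same seed can regenerate the same toy dataset without access to the sensitive table.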
Researchers and data owners can use metasyn to generate and share synthetic versions of their sensitive datasets, mitigating privacy concerns. Additionally, metasyn facilitates transparency and reproducibility by allowing the underlying MetaFrames to be exported and shared. Other researchers can use these to regenerate consistent synthetic datasets, validating published work without requiring sensitive data.
Key Features
MetaFrame Generation: Metasyn allows the creation of a MetaFrame from a dataset provided as a Polars or Pandas DataFrame. MetaFrames include key characteristics such as variable names, data types, percentage of missing values, and distribution parameters.
Exporting MetaFrames: Metasyn can export MetaFrames to GMF files and import them again. These are JSON files that follow the easy-to-read Generative Metadata Format (GMF).
Synthetic Data Generation: Metasyn allows for the generation of a Polars DataFrame with synthetic data that resembles the original data.
Distribution Fitting: Metasyn allows for manual and automatic distribution fitting.
Data Type Support: Metasyn supports generating synthetic data for a variety of common data types, including categorical, string, integer, float, date, time, and datetime.
Integration with Faker: Metasyn integrates with the faker package, a Python library for generating fake data such as names and emails, allowing for more realistic synthetic data.
Structured String Detection: Metasyn identifies structured strings within your dataset, which can include formatted text, codes, identifiers, or any string that follows a specific pattern.
Handling Unique Values: Metasyn can identify and process variables with unique values or keys in the data, preserving their uniqueness in the synthetic dataset.
Want to know more?
For more information on metasyn and its features, check out the About section.
Documentation Outline
This documentation is designed to help you easily navigate and find the information you need. It is organized into the following six sections:
How does metasyn work?
The How does metasyn work? section provides an overview of metasyn’s purpose and functionality.
User Guide
The User Guide contains detailed, step-by-step instructions, as well as best practices for using the package. If you’re new to metasyn, we recommend you start here!
API Reference
The API Reference is a technical reference for metasyn. Here, each function, class, and module is outlined in detail, giving you a comprehensive understanding of how the package works and how to use its various functionalities. If, for example, you’d like to discover which parameters can be used for which function, you can find that here.
Developer Guide
The Developer Guide provides resources for those interested in contributing to metasyn’s development. This section includes guidance on how to build upon the existing metasyn codebase, add new modules or functions, or even integrate other packages.
FAQ
The FAQ contains commonly asked questions and answers about metasyn.
About
The About section provides contact information and licensing details.