Metasyn Documentation

Welcome to the metasyn documentation.

Metasyn is a Python package for generating synthetic tabular data with a focus on privacy. It is designed for owners of sensitive datasets who want to share approximations of their data so that others can perform exploratory analysis and testing without disclosing real values.

Metasyn has three main functionalities:

  1. Estimation: Metasyn can create a MetaFrame from a dataset. A MetaFrame is metadata describing a table, augmented with statistical information on the columns. It captures the distribution and features of each column and enables the generation of synthetic data based on it.

  2. Generation: Metasyn can generate synthetic data based on a MetaFrame. The synthetic data produced solely depends on the MetaFrame, thereby maintaining a critical separation between the original sensitive data and the generated synthetic data.

  3. Serialization: Metasyn can export a MetaFrame into an easy-to-read Generative Metadata Format (GMF) file. This allows users to audit, understand, and modify their data generation model. These GMF files can also be imported back into Metasyn to generate synthetic data.
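The three steps above can be illustrated with a toy sketch that uses only the Python standard library. It is not metasyn's API or its GMF schema: the function names `fit_metadata` and `synthesize`, the metadata keys, and the choice of a normal distribution are all assumptions made for illustration. The point it shows is the separation described above, where generation reads only the serialized metadata and never the original rows.

```python
import json
import random
import statistics


def fit_metadata(rows):
    """Estimation (hypothetical helper): summarise each numeric column."""
    columns = {}
    for name in rows[0]:
        values = [row[name] for row in rows]
        present = [v for v in values if v is not None]
        columns[name] = {
            "prop_missing": 1 - len(present) / len(values),
            "mean": statistics.mean(present),
            "stdev": statistics.stdev(present),
        }
    return {"n_rows": len(rows), "columns": columns}


def synthesize(metadata, n, seed=None):
    """Generation (hypothetical helper): draw rows from the metadata alone."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        row = {}
        for name, spec in metadata["columns"].items():
            if rng.random() < spec["prop_missing"]:
                row[name] = None  # reproduce the share of missing values
            else:
                row[name] = rng.gauss(spec["mean"], spec["stdev"])
        rows.append(row)
    return rows


# Toy "sensitive" table with one numeric column and one missing value.
sensitive = [{"age": 34.0}, {"age": 51.0}, {"age": None}, {"age": 29.0}]

metadata = fit_metadata(sensitive)             # 1. Estimation
serialized = json.dumps(metadata, indent=2)    # 3. Serialization to readable text
restored = json.loads(serialized)              #    ...and import again
synthetic = synthesize(restored, n=5, seed=1)  # 2. Generation from metadata only
```

Because `synthesize` receives only `restored`, anyone holding the JSON text can regenerate data without access to `sensitive`. For the actual function names and file format, see the API Reference.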

Researchers and data owners can use metasyn to generate and share synthetic versions of their sensitive datasets, mitigating privacy concerns. Additionally, metasyn facilitates transparency and reproducibility by allowing the underlying MetaFrames to be exported and shared. Other researchers can use these to regenerate consistent synthetic datasets, validating published work without requiring sensitive data.

Key Features

  • MetaFrame Generation: Metasyn allows the creation of a MetaFrame from a dataset provided as a Polars or Pandas DataFrame. MetaFrames include key characteristics such as variable names, data types, percentage of missing values, and distribution parameters.

  • Exporting MetaFrames: Metasyn can export MetaFrames to GMF files and import them again. These are JSON files that follow the Generative Metadata Format (GMF), which is designed to be easy to read and understand.

  • Synthetic Data Generation: Metasyn allows for the generation of a Polars DataFrame with synthetic data that resembles the original data.

  • Distribution Fitting: Metasyn allows for manual and automatic distribution fitting.

  • Data Type Support: Metasyn supports generating synthetic data for a variety of common data types including categorical, string, integer, float, date, time, and datetime.

  • Integration with Faker: Metasyn integrates with the faker package, a Python library for generating fake data such as names and emails, allowing for more realistic synthetic data.

  • Structured String Detection: Metasyn identifies structured strings within your dataset, which can include formatted text, codes, identifiers, or any string that follows a specific pattern.

  • Handling Unique Values: Metasyn can identify and process variables with unique values or keys in the data, preserving their uniqueness in the synthetic dataset.
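To make the last two features more concrete, the following toy sketch infers a simple per-position pattern from fixed-length strings and then generates new, unique strings that follow it. This is an illustration under stated assumptions, not metasyn's detection method: the helpers `infer_pattern` and `generate_unique` are hypothetical, and the sketch only recognises digits, uppercase ASCII letters, and constant characters.

```python
import random
import string


def infer_pattern(values):
    """Hypothetical helper: classify each character position."""
    length = len(values[0])
    if any(len(v) != length for v in values):
        raise ValueError("this sketch only handles fixed-length strings")
    pattern = []
    for i in range(length):
        chars = {v[i] for v in values}
        if chars <= set(string.digits):
            pattern.append(("digit", None))
        elif chars <= set(string.ascii_uppercase):
            pattern.append(("letter", None))
        elif len(chars) == 1:
            pattern.append(("literal", chars.pop()))  # e.g. a separator
        else:
            raise ValueError(f"no simple pattern at position {i}")
    return pattern


def generate_unique(pattern, n, seed=None):
    """Hypothetical helper: draw n distinct strings matching the pattern."""
    capacity = 1
    for kind, _ in pattern:
        capacity *= {"digit": 10, "letter": 26}.get(kind, 1)
    if n > capacity:
        raise ValueError("pattern cannot produce that many unique values")
    rng = random.Random(seed)
    seen, result = set(), []
    while len(result) < n:
        candidate = "".join(
            rng.choice(string.digits) if kind == "digit"
            else rng.choice(string.ascii_uppercase) if kind == "letter"
            else literal
            for kind, literal in pattern
        )
        if candidate not in seen:  # reject duplicates to keep values unique
            seen.add(candidate)
            result.append(candidate)
    return result


codes = ["AB-1234", "CD-5678", "EF-9012"]  # made-up identifiers
pattern = infer_pattern(codes)
synthetic_codes = generate_unique(pattern, n=4, seed=0)
```

The generated values share the shape of the inputs (two letters, a hyphen, four digits) without being copied from them, and the duplicate check preserves uniqueness.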

Want to know more?

For more information on metasyn and its features, check out the About section.

Documentation Outline

This documentation is designed to help you easily navigate and find the information you need. It is organized into the following sections:

How does metasyn work?

The How does metasyn work? section provides an overview of metasyn’s purpose and functionality.

User Guide

The User Guide contains detailed, step-by-step instructions, as well as best practices for using the package. If you’re new to metasyn, we recommend you start here!

API Reference

The API Reference is a technical reference for metasyn. Here, each function, class, and module is outlined in detail, giving you a comprehensive understanding of how the package works and how to use its various functionalities. If, for example, you’d like to discover which parameters can be used for which function, you can find that here.

Developer Guide

The Developer Guide provides resources for those interested in contributing to metasyn’s development. This section includes guidance on how to build upon the existing metasyn codebase, add new modules, functions, or even integrate other packages.

FAQ

The FAQ contains commonly asked questions and answers about metasyn.

About

The About section provides contact information and licensing details.
