MetaFrame

class metasyn.metaframe.MetaFrame(meta_vars, n_rows=None, file_format=None, name='single_table')

Bases: object

Container for statistical metadata describing a dataset.

This class is used to fit a MetaFrame to a Polars DataFrame, serialize and save the MetaFrame to a file, read a MetaFrame from a file, and create a synthetic Polars DataFrame.

A MetaFrame represents a metadata frame, which is a structure that holds statistical metadata about a dataset. The data contained in a MetaFrame follows the Generative Metadata Format (GMF). The metadata is contained in a collection of MetaVar objects, with each MetaVar representing a column (variable).

A MetaFrame can easily be created using the fit_dataframe method, which takes a Polars DataFrame and fits a MetaFrame to it.

Parameters:
  • meta_vars (List[MetaVar]) – List of variables representing columns in a DataFrame.

  • n_rows (Optional[int]) – Number of rows in the original DataFrame.

  • privacy_package – Package that supplies the distributions.

  • file_format (Union[None, BaseFileInterface, dict[str, Any]])

  • name (str)

property n_columns: int

Number of columns of the original dataframe.

Type:

int

classmethod fit_dataframe(df, var_specs=None, plugins=None, privacy=None, n_rows=None, progress_bar=True, config=None, file_format=None, name='single_table')

Create a MetaFrame from a Polars (or pandas) DataFrame.

The DataFrame should already have the correct data types for each column, such as pl.Categorical (or the pandas equivalent).

Parameters:
  • df (Optional[DataFrame]) – Polars dataframe with the correct column dtypes.

  • var_specs (Optional[list[VarSpec]]) – Specifications for each column/variable. These specifications are supplied as a list of VarSpec instances (one for each column). Alternatively, the specifications can be entered as a path to a .toml file. For more information on this approach, see the MetaConfig class or the examples in the documentation. By default var_specs is None, which will use the default settings for each column.

  • plugins (Optional[list[str]]) – Plugins to use when fitting distributions to variables. Can be a list of strings or unspecified. This will overwrite the defaults if they were specified in the varspecs (but not the specifications per column).

  • privacy (Union[BasePrivacy, dict, None]) – Privacy level to use by default. This will overwrite the defaults if they were specified in the varspecs (but not the specifications per column).

  • n_rows (Optional[int]) – Number of rows registered in the MetaFrame. If left at None, it will use the number of rows in the input dataframe.

  • progress_bar (bool) – Whether to display a progress bar.

  • config (Union[Path, str, MetaConfig, None]) – A path or MetaConfig object that contains information about the variable specifications, defaults, etc. Variable specs in the config parameter will be overwritten by the var_specs parameter.

  • file_format (dict[str, Any] | BaseFileInterface | None)

  • name (str)

Returns:

Initialized metasyn metaframe.

Return type:

MetaFrame

classmethod from_config(meta_config)

Create a MetaFrame using a configuration, but without a DataFrame.

Parameters:

meta_config (MetaConfig) – Configuration to be used for creating the new MetaFrame.

Return type:

MetaFrame

Returns:

A created MetaFrame.

to_dict()

Create dictionary with the properties for recreation.

Return type:

Dict[str, Any]

classmethod from_dict(gmf_dict, table_name=None)
Parameters:
  • gmf_dict (dict)

  • table_name (str | None)

property file_format: dict[str, Any] | None
property descriptions: dict[str, str]

Return the descriptions of the columns.

save(fp, validate=True)

Serialize and save the MetaFrame to a JSON or TOML file, following the GMF format.

Optionally, validate the saved JSON file against the JSON schema(s) included in the package. TOML files cannot currently be validated against a schema.

Parameters:
  • fp (Union[Path, str, None]) – File to write the metaframe to.

  • validate (bool) – Validate the JSON file with a schema. If the file is a TOML file, then this will be ignored.

Return type:

None

classmethod load(fp, validate=True, table_name=None)

Read a MetaFrame from a JSON or TOML GMF file.

Optionally, validate the JSON file against the JSON schema(s) included in the package. TOML files cannot currently be validated against a schema.

Parameters:
  • fp (Union[Path, str]) – Path to read the data from.

  • validate (bool) – Validate the JSON file with a schema. If the file is a TOML file, then this will be ignored.

  • table_name (str | None)

Returns:

A restored MetaFrame from the file.

Return type:

MetaFrame

save_json(fp, validate=True)

Serialize and save the MetaFrame to a JSON file, following the GMF format.

Optionally, validate the saved JSON file against the JSON schema(s) included in the package.

Parameters:
  • fp (Union[Path, str, None]) – File to write the metaframe to.

  • validate (bool) – Validate the JSON file with a schema.

Return type:

None

classmethod load_json(fp, validate=True, table_name=None)

Read a MetaFrame from a JSON file.

Parameters:
  • fp (Union[Path, str, dict]) – Path to read the data from.

  • validate (bool) – Validate the JSON file with a schema.

  • table_name (str | None)

Returns:

A restored MetaFrame from the file.

Return type:

MetaFrame

to_json(fp, validate=True)

Deprecated export method; use MetaFrame.save_json instead.

Return type:

None

Parameters:
  • fp (Path | str)

  • validate (bool)

export(fp, validate=True)

Deprecated export method; use MetaFrame.save instead.

Return type:

None

Parameters:
  • fp (Path | str)

  • validate (bool)

classmethod from_json(fp, validate=True)

Deprecated import method; use MetaFrame.load_json instead.

Return type:

MetaFrame

Parameters:
  • fp (Path | str)

  • validate (bool)

save_toml(fp, validate=True)
classmethod load_toml(fp, validate=True, table_name=None)
Return type:

MetaFrame

Parameters:
  • fp (Path | str)

  • validate (bool)

  • table_name (str | None)

synthesize(n=None, seed=None, progress_bar=True)

Create a synthetic Polars dataframe.

Parameters:
  • n (Optional[int]) – Number of rows to generate. If None, the number of rows in the original dataframe is used.

  • seed (Optional[int]) – Seed value for the internal random number generator. Set this to ensure reproducibility.

  • progress_bar (bool) – Whether to display a progress bar.

Returns:

Dataframe with the synthetic data.

Return type:

polars.DataFrame

write_synthetic(file_name=None, n=None, seed=None, file_format=None, overwrite=False)

Write a synthetic dataset to a file.

By default, the synthetic dataset is written in the same format as the original file. For example, if the original CSV file used a comma as its separator, the synthetic data is written with the same separator. If the file format is not available (for instance with GMF files created by older versions of metasyn, or with custom file interfaces), you have to supply your own file interface.

Parameters:
  • file_name (Union[None, Path, str]) – File to write the synthetic data to. Defaults to None, in which case the filename of the original file is used, if available.

  • n (Optional[int]) – Number of rows to write to the synthetic file. Defaults to None, in which case the number of rows of the original dataset is used.

  • seed (Optional[int]) – Seed for creating the synthetic dataset. Defaults to None.

  • file_format (Union[None, dict, BaseFileInterface]) – File format that determines how the file will be written. This is a dictionary that can be created by a file interface with the metasyn.fileinterface.BaseFileInterface.to_dict() method. Example file interface classes are metasyn.fileinterface.CsvFileReader and metasyn.fileinterface.SavFileReader. Defaults to None, in which case the file interface stored in the GMF file is used; if none is stored, an error is raised.

  • overwrite (bool)

Raises:

ValueError – If the file format is None, and the MetaFrame object itself does not have a file format either.