MetaFrame
- class metasyn.metaframe.MetaFrame(meta_vars, n_rows=None, file_format=None, name='single_table')
Bases: object
Container for statistical metadata describing a dataset.
This class is used to fit a MetaFrame to a Polars DataFrame, serialize and save the MetaFrame to a file, read a MetaFrame from a file, and create a synthetic Polars DataFrame.
A MetaFrame represents a metadata frame, which is a structure that holds statistical metadata about a dataset. The data contained in a MetaFrame follows the Generative Metadata Format (GMF). The metadata is contained in a collection of MetaVar objects, with each MetaVar representing a column (variable).
A MetaFrame can easily be created using the fit_dataframe method, which takes a Polars DataFrame and fits a MetaFrame to it.
- Parameters:
meta_vars (List[MetaVar]) – List of variables representing columns in a DataFrame.
n_rows (Optional[int]) – Number of rows in the original DataFrame.
privacy_package – Package that supplies the distributions.
file_format (Union[None, BaseFileInterface, dict[str, Any]])
name (str)
- property n_columns: int
Number of columns of the original dataframe.
- Type:
int
- classmethod fit_dataframe(df, var_specs=None, plugins=None, privacy=None, n_rows=None, progress_bar=True, config=None, file_format=None, name='single_table')
Create a MetaFrame from a Polars (or pandas) DataFrame.
The DataFrame should already have the correct data types, such as pl.Categorical (or the pandas equivalent).
- Parameters:
df (Optional[DataFrame]) – Polars dataframe with the correct column dtypes.
var_specs (Optional[list[VarSpec]]) – Specifications for each column/variable. These specifications are supplied as a list of VarSpec instances (one for each column). Alternatively, the specifications can be entered as a path to a .toml file. For more information on this approach, see the MetaConfig class or the examples in the documentation. By default var_specs is None, which will use the default settings for each column.
plugins (Optional[list[str]]) – Plugins to use when fitting distributions to variables. Can be a list of strings or unspecified. This will overwrite the defaults if they were specified in the varspecs (but not the specifications per column).
privacy (Union[BasePrivacy, dict, None]) – Privacy level to use by default. This will overwrite the defaults if they were specified in the varspecs (but not the specifications per column).
n_rows (Optional[int]) – Number of rows registered in the MetaFrame. If left at None, it will use the number of rows in the input dataframe.
progress_bar (bool) – Whether to display a progress bar.
config (Union[Path, str, MetaConfig, None]) – A path or MetaConfig object that contains information about the variable specifications, defaults, etc. Variable specs in the config parameter will be overwritten by the var_specs parameter.
file_format (dict[str, Any] | BaseFileInterface | None)
name (str)
- Returns:
Initialized metasyn metaframe.
- Return type:
MetaFrame
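A minimal sketch of the fitting step described above, assuming polars and metasyn are installed. The column names and values are invented for illustration, and the imports are placed inside the function only so that the snippet can be loaded without those packages.

```python
def fit_example_metaframe():
    """Fit a MetaFrame to a small, made-up Polars DataFrame."""
    # Deferred imports: the snippet loads even if the packages are missing.
    import polars as pl
    from metasyn.metaframe import MetaFrame

    # Hypothetical example data; the dtypes should be correct before fitting.
    df = pl.DataFrame(
        {
            "age": [23, 35, 41, 29, 52, 38, 47, 31, 60, 26],
            "income": [2100.5, 3400.0, 2875.25, 3999.9, 5100.0,
                       2650.75, 4300.1, 3050.0, 4800.6, 2250.3],
        }
    )

    # progress_bar=False keeps the output quiet; all other settings are defaults.
    return MetaFrame.fit_dataframe(df, progress_bar=False)
```

Calling fit_example_metaframe() returns the fitted MetaFrame; its n_columns property should then report the two columns of the example data.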
- classmethod from_config(meta_config)
Create a MetaFrame using a configuration, but without a DataFrame.
- Parameters:
meta_config (MetaConfig) – Configuration to be used for creating the new MetaFrame.
- Returns:
A created MetaFrame.
- Return type:
MetaFrame
- to_dict()
Create dictionary with the properties for recreation.
- Return type:
Dict[str,Any]
- classmethod from_dict(gmf_dict, table_name=None)
- Parameters:
gmf_dict (dict)
table_name (str | None)
- property file_format: dict[str, Any] | None
- property descriptions: dict[str, str]
Return the descriptions of the columns.
- save(fp, validate=True)
Serialize and save the MetaFrame to a JSON or TOML file, following the GMF format.
Optionally, validate the saved JSON file against the JSON schema(s) included in the package. A TOML file currently cannot be validated against a schema.
- Parameters:
fp (Union[Path, str, None]) – File to write the metaframe to.
validate (bool) – Validate the JSON file with a schema. If the file is a TOML file, then this will be ignored.
- Return type:
None
- classmethod load(fp, validate=True, table_name=None)
Read a MetaFrame from a JSON or TOML GMF file.
Optionally, validate the JSON file against the JSON schema(s) included in the package. A TOML file currently cannot be validated against a schema.
- Parameters:
fp (Union[Path, str]) – Path to read the data from.
validate (bool) – Validate the JSON file with a schema. If the file is a TOML file, then this will be ignored.
table_name (str | None)
- Returns:
A restored MetaFrame from the file.
- Return type:
MetaFrame
- save_json(fp, validate=True)
Serialize and save the MetaFrame to a JSON file, following the GMF format.
Optionally, validate the saved JSON file against the JSON schema(s) included in the package.
- Parameters:
fp (Union[Path, str, None]) – File to write the metaframe to.
validate (bool) – Validate the JSON file with a schema.
- Return type:
None
- classmethod load_json(fp, validate=True, table_name=None)
Read a MetaFrame from a JSON file.
- Parameters:
fp (Union[Path, str, dict]) – Path to read the data from.
validate (bool) – Validate the JSON file with a schema.
table_name (str | None)
- Returns:
A restored MetaFrame from the file.
- Return type:
MetaFrame
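The save_json and load_json pair can be combined into a round trip. The helper below is a sketch, not part of metasyn: the function name and the temporary file name are made up, and it expects an already fitted MetaFrame as its argument.

```python
import tempfile
from pathlib import Path


def roundtrip_json(metaframe, file_name="example_gmf.json"):
    """Save a MetaFrame as GMF JSON in a temporary directory and load it back."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        gmf_path = Path(tmp_dir) / file_name
        # Write the metadata; validate=True checks it against the packaged schema(s).
        metaframe.save_json(gmf_path, validate=True)
        # load_json is a classmethod, so it is called on the class of the instance
        # (MetaFrame when a real MetaFrame is passed in).
        return type(metaframe).load_json(gmf_path, validate=True)
```

In everyday use, calling MetaFrame.load_json(path) directly is the more natural spelling; type(metaframe) is used here only to keep the helper free of imports.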
- to_json(fp, validate=True)
Deprecated method; use MetaFrame.save_json instead.
- Parameters:
fp (Path | str)
validate (bool)
- Return type:
None
- export(fp, validate=True)
Deprecated method; use MetaFrame.save instead.
- Parameters:
fp (Path | str)
validate (bool)
- Return type:
None
- classmethod from_json(fp, validate=True)
Deprecated method; use MetaFrame.load_json instead.
- Parameters:
fp (Path | str)
validate (bool)
- Return type:
MetaFrame
- save_toml(fp, validate=True)
- classmethod load_toml(fp, validate=True, table_name=None)
- Parameters:
fp (Path | str)
validate (bool)
table_name (str | None)
- Return type:
MetaFrame
- synthesize(n=None, seed=None, progress_bar=True)
Create a synthetic Polars dataframe.
- Parameters:
n (Optional[int]) – Number of rows to generate. If None, use the number of rows in the original dataframe.
seed (Optional[int]) – Seed value for the internal random number generator. Set this to ensure reproducibility.
progress_bar (bool) – Whether to display a progress bar.
- Returns:
Dataframe with the synthetic data.
- Return type:
polars.DataFrame
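The effect of the seed parameter can be checked by synthesizing twice with the same value. The helper below is an illustrative sketch with a made-up name; it expects an already fitted MetaFrame.

```python
def synthesize_reproducibly(metaframe, n=5, seed=1234):
    """Generate n synthetic rows twice, using the same seed both times."""
    # Passing the same seed is documented to make the generation reproducible.
    first = metaframe.synthesize(n=n, seed=seed, progress_bar=False)
    second = metaframe.synthesize(n=n, seed=seed, progress_bar=False)
    return first, second
```

With a real MetaFrame, both return values are Polars DataFrames with n rows each. Comparing them is one way to confirm that the seed has the intended effect.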
- write_synthetic(file_name=None, n=None, seed=None, file_format=None, overwrite=False)
Write a synthetic dataset to a file.
By default, this method tries to write a file in the same format as the original one. For example, if the original CSV file used a comma as separator, the synthetic data is written with the same separator. If the file format is not available (GMF files created with older versions of metasyn, or custom file interfaces), you will have to supply your own file interface.
- Parameters:
file_name (Union[None, Path, str]) – The filename to write the synthetic data to. By default None, in which case the filename of the original file is used if available.
n (Optional[int]) – Number of rows to write to the new synthetic file. By default None, in which case the number of rows of the original dataset is used.
seed (Optional[int]) – Seed for creating the synthetic dataset, by default None.
file_format (Union[None, dict, BaseFileInterface]) – File format that determines how the file will be written. This is a dictionary that can be created by a file interface with the metasyn.fileinterface.BaseFileInterface.to_dict() method. Example file interface classes are metasyn.fileinterface.CsvFileReader and metasyn.fileinterface.SavFileReader. By default file_format is None, in which case the file interface from the GMF file is used; if none is available, an error is raised.
overwrite (bool)
- Raises:
ValueError – If the file format is None and the MetaFrame object itself does not have a file format either.
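One way to avoid the ValueError above is to check the file_format property first. The sketch below uses a made-up helper name and, as a deliberate swap-in, falls back to Polars' own write_csv rather than a metasyn file interface when no format is stored.

```python
def write_synthetic_with_fallback(metaframe, out_path, n=None, seed=None):
    """Write synthetic data, using a plain CSV when no file format is stored."""
    if metaframe.file_format is None:
        # No stored file interface: write_synthetic would raise ValueError,
        # so synthesize in memory and let Polars write a default CSV instead.
        synthetic = metaframe.synthesize(n=n, seed=seed, progress_bar=False)
        synthetic.write_csv(out_path)
        return "polars"
    # A file format is available, so the original format settings are reused.
    metaframe.write_synthetic(out_path, n=n, seed=seed)
    return "stored"
```

The returned string only reports which branch was taken. The fallback writes with Polars' default CSV settings, so details such as the separator of the original file are not preserved.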