metasyn.varspec

Module for distribution and variable specifications.

Classes

DistributionSpec([name, unique, parameters, ...])

Specification that determines which distribution is selected.

VarDefaults([data_free, prop_missing, ...])

Dataclass for variable defaults.

VarSpec(name[, distribution, unique, ...])

Data class for storing the specifications for variables.

class metasyn.varspec.DistributionSpec(name=None, unique=None, parameters=None, fit_kwargs=<factory>, version=None, distribution=None)

Bases: object

Specification that determines which distribution is selected.

It has the following attributes:
  • name: Which distribution is chosen.

  • unique: Whether the distribution should be unique.

  • parameters: The parameters of the distribution as defined by name.

  • fit_kwargs: Fitting keyword arguments to be used while fitting the distribution.

  • version: Version of the distribution to fit.

Parameters:
  • name (str | None)

  • unique (bool | None)

  • parameters (dict | None)

  • fit_kwargs (dict)

  • version (str | None)

  • distribution (BaseDistribution | None)

classmethod parse(dist_spec, unique=None)

Create a DistributionSpec instance from a variety of inputs.

Parameters:
  • dist_spec (Union[dict, type[BaseDistribution], BaseDistribution, DistributionSpec, str, None]) – Specification for the distribution in several types.

  • unique (Optional[bool]) – Whether the distribution is unique. This is only taken into account if dist_spec is None or a string.

Return type:

DistributionSpec

Returns:

An instantiated version of dist_spec that has the DistributionSpec type.

Raises:

TypeError – If the input has the wrong type and cannot be parsed.
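The input handling described above can be pictured as dispatch on the type of dist_spec. The following is a minimal sketch under stated assumptions, not metasyn's implementation: SpecSketch is a hypothetical stand-in class, the parameter names "mean" and "sd" are illustrative, and the BaseDistribution branches are left out so the example runs with the standard library only.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SpecSketch:
    """Hypothetical stand-in for DistributionSpec (not metasyn code)."""

    name: Optional[str] = None
    unique: Optional[bool] = None
    parameters: Optional[dict] = None
    # A default factory gives every instance its own dictionary.
    fit_kwargs: dict = field(default_factory=dict)
    version: Optional[str] = None

    @classmethod
    def parse(cls, dist_spec, unique=None):
        # None or a string: `unique` is taken into account.
        if dist_spec is None:
            return cls(unique=unique)
        if isinstance(dist_spec, str):
            return cls(name=dist_spec, unique=unique)
        # A dictionary carries its own settings; `unique` is ignored here.
        if isinstance(dist_spec, dict):
            return cls(**dist_spec)
        # An existing specification is passed through unchanged.
        if isinstance(dist_spec, cls):
            return dist_spec
        raise TypeError(
            f"Cannot parse {type(dist_spec).__name__} as a specification."
        )


from_string = SpecSketch.parse("normal", unique=False)
from_mapping = SpecSketch.parse(
    {"name": "normal", "parameters": {"mean": 0, "sd": 1}}, unique=True
)
```

In this sketch, from_mapping.unique stays None because the unique argument is only used for None and string inputs, matching the parameter description above.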

property fully_specified: bool

Indicate whether the distribution is suitable for datafree creation.

Returns:

A flag that indicates whether a distribution can be generated from the values that are specified (not None).
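One way to read this property is as a check that nothing needed to build the distribution is still None. The sketch below is an assumption, not the library's logic: it supposes that a name together with parameters is enough, and it uses a hypothetical stand-in class called DataFreeSpecSketch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DataFreeSpecSketch:
    """Hypothetical stand-in; the real rule in metasyn may differ."""

    name: Optional[str] = None
    parameters: Optional[dict] = None

    @property
    def fully_specified(self) -> bool:
        # Assumption: both the name and the parameters must be given,
        # because without data there is nothing to fit them from.
        return self.name is not None and self.parameters is not None


name_only = DataFreeSpecSketch(name="normal")
complete = DataFreeSpecSketch(name="normal", parameters={"mean": 0, "sd": 1})
```

Under that assumption, name_only would still need data to estimate its parameters, while complete could be used without any data.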

get_creation_method(fitter)

Create a dictionary describing how the distribution was created.

Parameters:
  • privacy – Privacy object with which the dictionary is being created.

  • fitter (BaseFitter | None)

Return type:

dict

Returns:

Dictionary containing all the non-default settings for the creation method.

class metasyn.varspec.VarDefaults(data_free=False, prop_missing=None, distribution=None, privacy=None)

Bases: object

Dataclass for variable defaults.

Parameters:
  • data_free (bool) – Whether the variable is completely synthetic or is based on real data.

  • prop_missing (Optional[float]) – Proportion of missing values.

  • distribution (Optional[dict]) – Dictionary containing default distributions for each variable type.

  • privacy (Optional[BasePrivacy]) – Privacy to be used by default for estimating distributions.
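A plausible use of such defaults is to fill in settings that a variable leaves unset. The sketch below illustrates that idea with a hypothetical stand-in class (DefaultsSketch) and a hypothetical helper (resolve). It assumes that None on a variable means "use the default", which is an assumption rather than documented behavior.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DefaultsSketch:
    """Hypothetical stand-in mirroring the documented VarDefaults fields."""

    data_free: bool = False
    prop_missing: Optional[float] = None
    distribution: Optional[dict] = None
    privacy: Optional[object] = None


def resolve(value, default):
    """Return the per-variable value, or the default when it is None."""
    return default if value is None else value


defaults = DefaultsSketch(prop_missing=0.1)
# A variable that leaves data_free unset picks up the default.
data_free = resolve(None, defaults.data_free)
# An explicit per-variable value wins over the default.
prop_missing = resolve(0.25, defaults.prop_missing)
```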

class metasyn.varspec.VarSpec(name, distribution=None, unique=None, privacy=None, prop_missing=None, description=None, data_free=None, var_type=None)

Bases: object

Data class for storing the specifications for variables.

Parameters:
  • name (str) – Name of the variable/column.

  • distribution (Union[dict, type[BaseDistribution], BaseDistribution, DistributionSpec, str, None]) –

    Distribution to use for fitting/finding the distribution. Leave at None to allow metasyn to find the most suitable distribution automatically.

    >>> # Use normal distribution
    >>> distribution="normal"
    >>> # Use normal distribution with mean 0, standard deviation 1
    >>> distribution=NormalDistribution(0, 1)
    

  • unique – Whether to treat the column as unique (a key). This is only available for the integer and string datatypes. Setting a variable to unique ensures that the synthetic values generated for this variable are unique, which is useful for ID or primary key variables, for example. The parameter… is ignored when the distribution is set manually. For example, {"unique": True} sets the variable to be unique, and {"unique": False} forces it to be non-unique. If uniqueness is not specified, the variable is assumed to be non-unique, but metasyn gives a warning if it thinks the variable should be unique.

  • privacy (Optional[BasePrivacy]) – Set the privacy level for a variable, e.g.: DifferentialPrivacy(epsilon=10).

  • prop_missing (Optional[float]) – Proportion of missing values for a variable.

  • description (Optional[str]) – Set the description of a variable.

  • data_free (Optional[bool]) – Whether this variable/column is to be generated from scratch or from an existing column in the dataframe.

  • var_type (Optional[str]) – Manually set the variable type of the column (used mainly for data_free columns).

classmethod from_dict(var_dict)

Create a variable specification from a dictionary.

Parameters:

var_dict (dict) – Dictionary to parse the specification from.

Return type:

VarSpec

Returns:

A VarSpec instance.
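Building a specification from a dictionary can be sketched as mapping dictionary keys onto dataclass fields. The example below is illustrative only: VarSpecSketch is a hypothetical stand-in that copies the documented field names (privacy is omitted for brevity), and rejecting unknown keys with ValueError is an assumption rather than confirmed metasyn behavior.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class VarSpecSketch:
    """Hypothetical stand-in for VarSpec (not metasyn code)."""

    name: str
    distribution: Optional[object] = None
    unique: Optional[bool] = None
    prop_missing: Optional[float] = None
    description: Optional[str] = None
    data_free: Optional[bool] = None
    var_type: Optional[str] = None

    @classmethod
    def from_dict(cls, var_dict: dict) -> "VarSpecSketch":
        # Assumption: keys that do not match a field are reported as errors.
        known = {f.name for f in fields(cls)}
        unknown = set(var_dict) - known
        if unknown:
            raise ValueError(f"Unknown keys: {sorted(unknown)}")
        return cls(**var_dict)


spec = VarSpecSketch.from_dict(
    {"name": "patient_id", "unique": True, "var_type": "string", "data_free": True}
)
```

Keys that are absent from the dictionary keep their None defaults, so the resulting object only records what was explicitly requested.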