Distributions
This page is intended to provide an overview of how distributions are implemented and organized in metasyn. It should help you understand how to create new distributions, or modify existing ones.
For a detailed overview of classes, methods and attributes mentioned on this page, refer to the API reference. Clicking on object names will automatically take you to their API reference page.
Distribution subpackage
Classes used to represent and handle distributions are located in the distribution subpackage. This subpackage contains modules used to represent different types of distributions, as well as a base module that contains the base classes for all distributions.
Base submodule
The base module contains the BaseDistribution class, which is the base class for all distributions. It also contains the ScipyDistribution class, which is a specialized base class for distributions that are built on top of SciPy’s statistical distributions.
Additionally it contains the UniqueDistributionMixin class, which is a mixin class that can be used to make a distribution unique (i.e., one that does not contain duplicate values).
Finally it contains the metadist() decorator, which is used to set the attributes of a distribution.
BaseDistribution class
This is the base class providing the basic structure for all distributions. It is not intended to be used directly, but rather to be derived from when implementing a new distribution.
BaseDistribution has the following attributes:
| Attribute | Description | Example(s) |
|---|---|---|
| implements | A unique string identifier for the distribution type it implements. | "core.multinoulli", "core.regex" |
| var_type | The type of variable associated with the distribution. | "string", "continuous", ["categorical", "discrete", "string"] |
| provenance | Information about the source (core, plug-in, etc.) of the distribution. | |
| privacy | The privacy class or implementation associated with the distribution. | |
| unique | A boolean indicating whether the values in the distribution are unique. | True, False |
| version | The version of the distribution. | "1.0" |
Though they can be set manually, the intended way of setting these attributes is through the use of the metadist() decorator (which is covered further below).
Note on ‘implements’ attributes
The naming convention for the implements attribute is:
<prefix>.<distribution_name>
Distributions that are part of the core metasyn distribution registry list should use core as the prefix, e.g. core.multinoulli.
The BaseDistribution class has a series of abstract methods that must be implemented by derived classes. These are:
- _fit() to contain the fitting logic for the distribution. It does not need to handle N/A values.
- draw() to draw a new value from the distribution.
- _param_dict() to return a dictionary of the distribution’s parameters.
- _param_schema() to return a schema for the distribution’s parameters.
- default_distribution() to return a distribution with default parameters.
If subsequent draws from the distribution are not independent, it is recommended to implement draw_reset(). As the name suggests, this method is intended to reset the distribution’s drawing mechanism.
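As a minimal, standalone sketch of why draw_reset() matters, consider a toy distribution whose draws depend on the draws before them. The class below does not inherit from BaseDistribution, and its name and parameter (CounterSketch, start) are hypothetical; it only mirrors the draw() and draw_reset() method names described above.

```python
class CounterSketch:
    """Toy stand-in for a distribution with dependent draws (hypothetical)."""

    def __init__(self, start: int = 0):
        self.start = start
        self._next = start  # internal state shared between draws

    def draw(self) -> int:
        # Each draw depends on how many draws came before it.
        value = self._next
        self._next += 1
        return value

    def draw_reset(self) -> None:
        # Restore the drawing state so a new series starts from the beginning.
        self._next = self.start


dist = CounterSketch(start=10)
first_series = [dist.draw() for _ in range(3)]
dist.draw_reset()
second_series = [dist.draw() for _ in range(3)]
```

Without the reset, the second series would continue where the first one stopped instead of starting over.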
It is also recommended to implement information_criterion(). This class method is used to determine which distribution is selected during the fitting process for a series of values: among the distributions of the correct variable type, the one with the lowest information criterion is selected. For discrete and continuous distributions it is currently implemented as the Bayesian information criterion (BIC).
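For reference, the BIC is defined as k·ln(n) − 2·ln(L), where k is the number of fitted parameters, n the number of values, and L the maximized likelihood. The following standalone sketch computes it by hand for two candidate fits and picks the lower one. It is not metasyn's implementation: the helper names (bic, uniform_log_likelihood, normal_log_likelihood) and the sample values are assumptions made for illustration.

```python
import math


def bic(log_likelihood: float, n_params: int, n_values: int) -> float:
    """Bayesian information criterion: k * ln(n) - 2 * ln(L)."""
    return n_params * math.log(n_values) - 2.0 * log_likelihood


def uniform_log_likelihood(values, lower, upper):
    # Every value inside [lower, upper] has density 1 / (upper - lower).
    return -len(values) * math.log(upper - lower)


def normal_log_likelihood(values, mean, std):
    # Sum of the log densities of a normal distribution.
    return sum(
        -0.5 * math.log(2 * math.pi * std**2) - (v - mean) ** 2 / (2 * std**2)
        for v in values
    )


values = [0.1, 0.4, 0.5, 0.6, 0.9]  # made-up sample data
n = len(values)

# Candidate 1: uniform on [min, max] (2 parameters).
bic_uniform = bic(uniform_log_likelihood(values, min(values), max(values)), 2, n)

# Candidate 2: normal with maximum-likelihood mean and std (2 parameters).
mean = sum(values) / n
std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
bic_normal = bic(normal_log_likelihood(values, mean, std), 2, n)

# The candidate with the lowest information criterion is preferred.
candidates = [("uniform", bic_uniform), ("normal", bic_normal)]
best = min(candidates, key=lambda item: item[1])[0]
```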
Another optional method is draw_list(). Normally, metasyn draws values one at a time when synthesizing. When this is slow and drawing multiple values at once is faster, you can implement draw_list() to do so.
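The difference between single and batched draws can be sketched with a standalone toy class. Everything here is hypothetical (the CategoricalSketch class, its constructor, and the n_draw parameter name); the real signature is documented in the API reference. The sketch only shows the idea of replacing many draw() calls with one batched call.

```python
import random


class CategoricalSketch:
    """Toy stand-in (hypothetical) with a single draw() and a batched draw_list()."""

    def __init__(self, labels, probs, seed=0):
        self.labels = list(labels)
        self.probs = list(probs)
        self._rng = random.Random(seed)

    def draw(self):
        # Draw a single label according to the given weights.
        return self._rng.choices(self.labels, weights=self.probs, k=1)[0]

    def draw_list(self, n_draw: int) -> list:
        # One batched call instead of n_draw separate draw() calls.
        return self._rng.choices(self.labels, weights=self.probs, k=n_draw)


dist = CategoricalSketch(["a", "b", "c"], [0.5, 0.3, 0.2])
batch = dist.draw_list(1000)
```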
There are more methods, but this is a good starting point when implementing a new distribution.
For an overview of the rest of the methods and implementation details, refer to the BaseDistribution class.
Metadist decorator method
When implementing a new distribution (that inherits from BaseDistribution), the metadist() decorator is intended to be used to set its attributes.
To use the decorator, annotate a distribution class with @metadist, passing in the attributes of the target distribution as parameters.
For example, the following distributions use the decorator:
@metadist(implements="core.multinoulli", var_type=["categorical", "discrete", "string"])
class MultinoulliDistribution(BaseDistribution):
@metadist(implements="core.regex", var_type="string", unique=True)
class UniqueRegexDistribution(UniqueDistributionMixin, RegexDistribution):
@metadist(implements="core.uniform", var_type="date")
class UniformDateDistribution(BaseUniformDistribution):
The metadist() decorator is defined in the metasyn.distribution.base submodule. It is also directly accessible from the main metasyn package, which explicitly imports it.
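To illustrate the general pattern of a decorator that sets class attributes, here is a minimal standalone sketch. It is not metasyn's metadist() implementation: the metadist_sketch function, the ExampleDistribution class, and the "core.example" identifier are hypothetical.

```python
def metadist_sketch(**attributes):
    """Hypothetical class decorator that copies keyword arguments onto a class."""

    def decorate(cls):
        for name, value in attributes.items():
            setattr(cls, name, value)
        return cls

    return decorate


@metadist_sketch(implements="core.example", var_type="string", unique=True)
class ExampleDistribution:
    """Placeholder class; a real distribution would derive from BaseDistribution."""
```

After decoration, the attributes are available on the class itself, for example ExampleDistribution.implements.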
ScipyDistribution class
The ScipyDistribution is a specialized base class for distributions that are based on SciPy statistical distributions.
All the current discrete and continuous distributions are derived from this class.
UniqueDistributionMixin class
The UniqueDistributionMixin is a mixin class that can be combined with other distribution classes to create distributions that generate unique values.
For example, the unique variants of RegexDistribution and FakerDistribution are implemented using this mixin as follows:
@metadist(implements="core.regex", var_type="string", unique=True)
class UniqueRegexDistribution(UniqueDistributionMixin, RegexDistribution):
@metadist(implements="core.faker", var_type="string")
class UniqueFakerDistribution(UniqueDistributionMixin, FakerDistribution):
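One way a mixin can enforce uniqueness is to wrap the parent's draw() and retry until an unseen value appears. The standalone sketch below shows that idea under stated assumptions. It is not how UniqueDistributionMixin is implemented, and all names (UniqueSketchMixin, DigitSketch, UniqueDigitSketch) and the retry limit are hypothetical.

```python
import random


class UniqueSketchMixin:
    """Hypothetical mixin: retry the parent's draw() until a new value appears."""

    def draw_reset(self):
        # Forget all previously drawn values.
        self._seen = set()

    def draw(self):
        if not hasattr(self, "_seen"):
            self._seen = set()
        for _ in range(1000):  # arbitrary retry limit for this sketch
            value = super().draw()
            if value not in self._seen:
                self._seen.add(value)
                return value
        raise ValueError("Could not draw a new unique value.")


class DigitSketch:
    """Toy distribution (hypothetical) that draws a digit between 0 and 9."""

    def __init__(self, seed=0):
        self._rng = random.Random(seed)

    def draw(self):
        return self._rng.randint(0, 9)


class UniqueDigitSketch(UniqueSketchMixin, DigitSketch):
    """The mixin comes first so that its draw() wraps DigitSketch.draw()."""


dist = UniqueDigitSketch()
digits = [dist.draw() for _ in range(10)]
```

The mixin is listed first in the base classes, matching the order used in the examples above, so that its methods take precedence in the method resolution order.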
Other modules
The rest of the modules in the distribution subpackage contain the classes used to represent different types of distributions. A comprehensive overview of these modules, along with the distributions they implement, can be found on the API reference’s Distribution list page.
Creating a new distribution
The first step to creating a new distribution is to inherit from a distribution class. This can be a base class (e.g. BaseDistribution, ScipyDistribution), or an existing distribution.
The next step is to set the attributes of the distribution using the metadist() decorator. Refer to BaseDistribution for an overview of these attributes.
Important
It is possible to have different variations of the same distribution for various data types, as is the case with the core.uniform distributions in metasyn.
Then, implement the required methods (_fit(), draw(), _param_dict(), _param_schema(), default_distribution() and __init__), as well as any other applicable methods.
Finally the distribution has to be added to a distribution registry, so that it can be used by metasyn for fitting.
For example, let’s say we want to create a new distribution for unique continuous variables, to be a part of the core metasyn distribution registry. We could implement the distribution as follows:
import random
from typing import Optional

from metasyn.distribution.base import BaseDistribution, UniqueDistributionMixin, metadist


@metadist(implements="core.new_distribution", var_type="continuous", unique=True, version="1.0")
class NewDistribution(UniqueDistributionMixin, BaseDistribution):
    """New custom distribution."""

    def __init__(self, lower=0, upper=1):
        self.lower = lower
        self.upper = upper

    @classmethod
    def default_distribution(cls, var_type: Optional[str] = None):
        return cls(0, 1)  # default distribution with lower=0 and upper=1

    @classmethod
    def _param_schema(cls):
        # Schema describing the type of each parameter.
        return {
            "lower": {"type": "number"},
            "upper": {"type": "number"},
        }

    @classmethod
    def _fit(cls, values):
        # Fit by taking the observed range of the values.
        lower = min(values)
        upper = max(values)
        return cls(lower, upper)

    def draw(self):
        # Draw a single value uniformly between the bounds.
        return random.uniform(self.lower, self.upper)

    def _param_dict(self):
        return {"lower": self.lower, "upper": self.upper}
The new class is then added to the builtin_fitters list in the __init__ module.
Note that this is a bare-bones example and that the implementation of the distribution will vary depending on the type of distribution being implemented.
Creating a distribution plug-in
In case you want to create a new distribution as part of an add-on, as opposed to it being implemented in the core package, you can easily do so by following the available distribution plug-in template.
More information on creating plug-ins can be found in the Creating plug-ins section of the documentation.