metasyn.distribution.regex

Regex distributions and fitters.

Classes

`RegexDistribution`(regex_data)	Structured string distribution using regex.
`RegexFitter`(privacy)	Fitter for regex distribution.
`UniqueRegexDistribution`(regex_data)	Unique variant of the regex distribution.
`UniqueRegexFitter`(privacy)	Fitter for unique regex distribution.

class metasyn.distribution.regex.RegexDistribution(regex_data)

Bases: BaseDistribution

Structured string distribution using regex.

Main implementation details in the regexmodel package: https://github.com/sodascience/regexmodel

This distribution tries to create a regex that fits the variable. When the regex has multiple options (e.g. a|b), it also stores statistical information about the probability of each option. Only a subset of the syntax provided by the Python re package is supported. Currently implemented:

Parentheses with multiple options and no modifiers, e.g. ([a]|[b]|[c])
Square brackets without negation, e.g. [abc]
Ranges [A-Z], [a-z], [0-9], but not subranges (e.g. [0-3])
Repetition quantifiers (curly brackets) with a minimum and a maximum, e.g. [A-Z]{3,6}, but not [A-Z]{6}.
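To make the supported constructs concrete, here is a minimal standard-library sketch of what drawing from a pattern such as ([a]|[b]|[c])[0-9]{2,4} amounts to. It does not use metasyn or regexmodel; the function name draw_sketch and the option weights are hypothetical and chosen only for illustration.

```python
import random
import re
import string


def draw_sketch(rng: random.Random) -> str:
    """Draw one string shaped like ([a]|[b]|[c])[0-9]{2,4} (illustration only)."""
    # Weighted alternation: each option carries a probability.
    # These weights are invented for the example.
    prefix = rng.choices(["a", "b", "c"], weights=[0.5, 0.3, 0.2])[0]
    # Character range with a minimum and maximum repetition count.
    n_digits = rng.randint(2, 4)
    digits = "".join(rng.choice(string.digits) for _ in range(n_digits))
    return prefix + digits


rng = random.Random(0)
samples = [draw_sketch(rng) for _ in range(5)]

# Every sample matches the equivalent pattern in Python's re module.
assert all(re.fullmatch(r"([a]|[b]|[c])[0-9]{2,4}", s) for s in samples)
```

The point of the sketch is that the probabilities of the alternatives are part of the model, so options that are frequent in the data are also frequent in the draws.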

When fitting the RegexDistribution using the fit method, pay attention to the count_thres and method parameters. By default both are set dynamically to values that are reasonable for the input, but in some cases it is important to set them manually. The count_thres parameter sets the minimum number of times a regex element needs to be used. For example, if count_thres=2 and only one value starts with "a", then the regex will never start with "a". In effect, a higher value gives more privacy, less utility and a faster fit. The method parameter has a small effect on the accuracy of the regex and a larger effect on the worst-case fitting time. Set it to "accurate" for the best result and to "fast" for the fastest fit.
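The effect of count_thres on the first character can be illustrated with a small standard-library sketch. This is not the regexmodel fitting algorithm; the helper allowed_first_chars is hypothetical and only shows the thresholding idea described above.

```python
from collections import Counter


def allowed_first_chars(values, count_thres):
    """Return the first characters that occur at least count_thres times."""
    counts = Counter(v[0] for v in values if v)
    return sorted(ch for ch, n in counts.items() if n >= count_thres)


values = ["b12", "b34", "c56", "c78", "a90"]

print(allowed_first_chars(values, count_thres=1))  # ['a', 'b', 'c']
print(allowed_first_chars(values, count_thres=2))  # ['b', 'c']
```

With count_thres=2 the single value starting with "a" no longer contributes an option, which is why a higher threshold hides rare values at the cost of utility.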

Examples of data this distribution should work reasonably well for: email addresses, IDs, telephone numbers, IP addresses, etc.

Parameters:: regex_data (Union[str, dict, RegexModel]) – Valid inputs for the regex model are:
- str: String with a regex (that falls within the specifications, see above).
- dict: Serialized version of the regex model, as it comes from a JSON file.
- RegexModel: Initialized regex model.

Examples

>>> RegexDistribution(r"AB[0-9]{4}").draw()
"AB8123"
>>> RegexDistribution(r"(a|b|c)10").draw()
"b10"

name: core.regex

unique: False

version: 2.0

var_type: string

draw(): Draw a random element from the fitted distribution.

classmethod default_distribution(var_type=None)

Get a distribution with default parameters.

Return type:: BaseDistribution

information_criterion(values)

Get the BIC value for a particular set of values.

Parameters:: values (array_like) – Values to determine the BIC value of.
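For orientation, the Bayesian information criterion is conventionally defined as k·ln(n) − 2·ln(L), where k is the number of model parameters, n the number of values and L the maximized likelihood. The sketch below assumes this standard definition; the function bic is hypothetical, and how metasyn counts parameters or computes the likelihood for a regex is not shown here.

```python
import math


def bic(log_likelihood: float, n_params: int, n_values: int) -> float:
    """Standard BIC: n_params * ln(n_values) - 2 * log_likelihood."""
    return n_params * math.log(n_values) - 2.0 * log_likelihood


# Two made-up candidate models fitted to the same 100 values.
simple_model = bic(log_likelihood=-120.0, n_params=3, n_values=100)
complex_model = bic(log_likelihood=-118.0, n_params=8, n_values=100)

# A lower BIC is better: the small gain in likelihood does not
# justify the five extra parameters of the complex model.
assert simple_model < complex_model
```

Under this definition a lower value indicates a better trade-off between fit and model complexity.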

name: str = 'core.regex': The identifier for the implemented distribution

var_type: Union[str, Sequence[str]] = 'string': The variable type of the distribution

version: str = '2.0': Version of the implemented distribution

class metasyn.distribution.regex.RegexFitter(privacy)

Bases: BaseFitter

Fitter for regex distribution.

Parameters:: privacy (BasePrivacy)

dist_class: <class 'metasyn.distribution.regex.RegexDistribution'>

version: 2.0

var_type: string

privacy: none

distribution: alias of RegexDistribution

class metasyn.distribution.regex.UniqueRegexDistribution(regex_data)

Bases: UniqueDistributionMixin, RegexDistribution

Unique variant of the regex distribution.

See RegexDistribution for examples and an explanation of this distribution.

name: core.regex

unique: True

version: 2.0

var_type: string

name: str = 'core.regex': The identifier for the implemented distribution

unique: bool = True: Whether the distribution creates only unique values

var_type: Union[str, Sequence[str]] = 'string': The variable type of the distribution

class metasyn.distribution.regex.UniqueRegexFitter(privacy)

Bases: RegexFitter

Fitter for unique regex distribution.

Parameters:: privacy (BasePrivacy)

dist_class: <class 'metasyn.distribution.regex.UniqueRegexDistribution'>

version: 2.0

var_type: string

privacy: none

distribution: alias of UniqueRegexDistribution