metasyn.distribution.regex
Regex distributions and fitters.
Classes
RegexDistribution | Structured string distribution using regex.
RegexFitter | Fitter for regex distribution.
UniqueRegexDistribution | Unique variant of the regex distribution.
UniqueRegexFitter | Fitter for unique regex distribution.
- class metasyn.distribution.regex.RegexDistribution(regex_data)
Bases:
BaseDistribution
Structured string distribution using regex.
Main implementation details in the regexmodel package: https://github.com/sodascience/regexmodel
This distribution tries to create a regex that fits the variable. If the regex has multiple options (e.g. a|b), it also stores statistical information about the probability of each option. The supported syntax is only a subset of what the Python re package provides. What is currently implemented:
Parentheses with multiple options and no modifiers, e.g. ([a]|[b]|[c])
Square brackets without negation, e.g. [abc]
Ranges [A-Z], [a-z], [0-9], but not subranges (e.g. [0-3])
Repetition quantifiers (curly brackets) with a minimum and a maximum, e.g. [A-Z]{3,6}, but not [A-Z]{6}.
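To make the idea of a regex that carries probabilities more concrete, the following is a minimal sketch using only the Python standard library. It is not the regexmodel implementation: the helper names `draw_weighted_option` and `draw_repeated_class` and the weights 0.8/0.2 are hypothetical, chosen only to illustrate how weighted alternatives and a bounded repetition could be sampled.

```python
import random
import re
import string


def draw_weighted_option(options, weights, rng):
    """Pick one alternative, as for (a|b) with a probability per option."""
    return rng.choices(options, weights=weights, k=1)[0]


def draw_repeated_class(chars, min_len, max_len, rng):
    """Draw from a character class repeated between min_len and max_len times."""
    length = rng.randint(min_len, max_len)  # both bounds are inclusive
    return "".join(rng.choice(chars) for _ in range(length))


rng = random.Random(0)

# Roughly corresponds to (a|b)[0-9]{2,4} with made-up probabilities
# P(a) = 0.8 and P(b) = 0.2.
value = draw_weighted_option(["a", "b"], [0.8, 0.2], rng)
value += draw_repeated_class(string.digits, 2, 4, rng)

# The drawn string always matches the plain regex it was generated from.
print(value, re.fullmatch(r"(a|b)[0-9]{2,4}", value) is not None)
```

The probabilities are what distinguish such a model from an ordinary regex: the pattern alone says which strings are possible, while the weights say how often each branch should appear in synthetic data.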
When fitting the RegexDistribution using the fit method, pay attention to the count_thres and method parameters. By default both are set dynamically to values that are reasonable for the input, but in some cases it is important to set them manually. The count_thres parameter sets the minimum number of times a regex element needs to be used. For example, if count_thres=2 and only one value starts with "a", then the regex will never start with "a". In effect, a higher value gives more privacy, less utility and a faster fit. The method parameter has a small effect on the accuracy of the regex and a larger effect on the worst-case fitting time. Set it to "accurate" for the best result and to "fast" for the fastest fit.
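The effect of a count threshold can be illustrated with a small standard-library sketch. This is an assumption-laden simplification, not the fitting algorithm used by metasyn or regexmodel: the function `frequent_first_chars` is hypothetical and only looks at the first character of each value.

```python
from collections import Counter


def frequent_first_chars(values, count_thres):
    """Return the first characters that occur at least count_thres times."""
    counts = Counter(v[0] for v in values if v)
    return {char for char, n in counts.items() if n >= count_thres}


values = ["a1", "b1", "b2", "b3", "c1", "c2"]

# Only one value starts with "a", so with a threshold of 2 it is dropped.
kept = frequent_first_chars(values, count_thres=2)
print(sorted(kept))  # ['b', 'c']
```

Dropping rare elements is what provides the privacy benefit described above: a pattern that occurs only once in the data cannot leak into the fitted regex.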
Examples of data that this distribution should work reasonably well for are email addresses, IDs, telephone numbers, IP addresses, etc.
- Parameters:
regex_data (Union[str, dict, RegexModel]) – Valid inputs for the regex model are:
- str: String with a regex (that falls within the specifications, see above).
- dict: Serialized version of the regex model, as read from a JSON file.
- RegexModel: Initialized regex model.
Examples
>>> from metasyn.distribution.regex import RegexDistribution
>>> RegexDistribution(r"AB[0-9]{4}").draw()
"AB8123"
>>> RegexDistribution(r"(a|b|c)10").draw()
"b10"
- name
core.regex
- unique
False
- version
2.0
- var_type
string
- draw()
Draw a random element from the fitted distribution.
- classmethod default_distribution(var_type=None)
Get a distribution with default parameters.
- Return type:
BaseDistribution
- information_criterion(values)
Get the BIC value for a particular set of values.
- Parameters:
values (array_like) – Values to determine the BIC value of.
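The source does not show how the value is computed internally. As a reminder of what is being measured, the sketch below uses the standard definition of the Bayesian information criterion, BIC = k ln(n) - 2 ln(L), where k is the number of parameters, n the number of values and L the maximized likelihood. The function name `bic` and the example numbers are hypothetical.

```python
import math


def bic(log_likelihood, n_parameters, n_values):
    """Standard BIC: k * ln(n) - 2 * ln(L). Lower values indicate a better trade-off."""
    return n_parameters * math.log(n_values) - 2.0 * log_likelihood


# Two made-up candidate models for the same 100 values: the second fits
# slightly better (higher log-likelihood) but needs many more parameters.
simple_model = bic(log_likelihood=-230.0, n_parameters=2, n_values=100)
complex_model = bic(log_likelihood=-225.0, n_parameters=10, n_values=100)

print(simple_model < complex_model)  # True
```

In this example the small gain in likelihood does not justify the extra parameters, so the simpler model gets the lower BIC.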
- name: str = 'core.regex'
The identifier for the implemented distribution
- var_type: Union[str, Sequence[str]] = 'string'
The variable type of the distribution
- version: str = '2.0'
Version of the implemented distribution
- class metasyn.distribution.regex.RegexFitter(privacy)
Bases:
BaseFitter
Fitter for regex distribution.
- Parameters:
privacy (BasePrivacy)
- dist_class
<class 'metasyn.distribution.regex.RegexDistribution'>
- version
2.0
- var_type
string
- privacy
none
- distribution
alias of
RegexDistribution
- class metasyn.distribution.regex.UniqueRegexDistribution(regex_data)
Bases:
UniqueDistributionMixin, RegexDistribution
Unique variant of the regex distribution.
See RegexDistribution for examples and an explanation of this distribution.
- name
core.regex
- unique
True
- version
2.0
- var_type
string
- name: str = 'core.regex'
The identifier for the implemented distribution
- unique: bool = True
Whether the distribution creates only unique values
- var_type: Union[str, Sequence[str]] = 'string'
The variable type of the distribution
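One common way to guarantee unique draws is to remember the values already returned and redraw on a collision. The sketch below illustrates that idea with the standard library only; it is not taken from UniqueDistributionMixin, and the class `UniqueDrawer` and its `max_retries` parameter are hypothetical.

```python
import random


class UniqueDrawer:
    """Hypothetical wrapper that never returns the same value twice."""

    def __init__(self, draw_func, max_retries=1000):
        self._draw = draw_func
        self._seen = set()
        self._max_retries = max_retries

    def draw(self):
        # Redraw until an unseen value appears or the retry budget runs out.
        for _ in range(self._max_retries):
            value = self._draw()
            if value not in self._seen:
                self._seen.add(value)
                return value
        raise ValueError("Could not draw a new unique value.")


rng = random.Random(42)

# The underlying generator can only produce ten distinct strings: ID0 to ID9.
drawer = UniqueDrawer(lambda: "ID" + str(rng.randint(0, 9)))
ids = [drawer.draw() for _ in range(10)]
print(len(set(ids)))  # 10
```

A consequence of this design is that the number of unique values is bounded by what the underlying pattern can produce: an eleventh draw in this example raises an error because all ten possible strings are already used.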
- class metasyn.distribution.regex.UniqueRegexFitter(privacy)
Bases:
RegexFitter
Fitter for unique regex distribution.
- Parameters:
privacy (BasePrivacy)
- dist_class
<class 'metasyn.distribution.regex.UniqueRegexDistribution'>
- version
2.0
- var_type
string
- privacy
none
- distribution
alias of
UniqueRegexDistribution