metasyn.distribution.regex.RegexDistribution
- class metasyn.distribution.regex.RegexDistribution(regex_data)
Structured string distribution using regex.
The main implementation details are in the regexmodel package: https://github.com/sodascience/regexmodel
This distribution tries to create a regex that fits the variable. Where the regex has multiple options (e.g. a|b), it also stores statistical information about the probability of each option. Only a subset of what the Python re package provides is supported. What is currently implemented:
Parentheses with multiple options and no modifiers, e.g. ([a]|[b]|[c])
Square brackets without negation, e.g. [abc]
Ranges [A-Z], [a-z], [0-9], but not subranges (e.g. [0-3])
Repetition quantifiers (curly brackets) with a minimum and a maximum, e.g. [A-Z]{3,6}, but not [A-Z]{6}.
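As a rough illustration of how a regex with weighted options can act as a distribution, the sketch below draws strings shaped like (a|b|c)[0-9]{2,3} using only the standard library. It is not the regexmodel implementation: the OPTIONS weights, the draw_value helper and the uniform choice of repetition count are all assumptions made for this example.

```python
import random
import re

# Hypothetical stand-in for a fitted alternation such as (a|b|c):
# each option carries the fraction of training values that used it.
OPTIONS = {"a": 0.5, "b": 0.3, "c": 0.2}

# The shape every drawn value is expected to have.
PATTERN = re.compile(r"(a|b|c)[0-9]{2,3}")


def draw_value(rng: random.Random) -> str:
    """Draw one string: a weighted option followed by 2 or 3 digits."""
    prefix = rng.choices(list(OPTIONS), weights=list(OPTIONS.values()), k=1)[0]
    n_digits = rng.randint(2, 3)  # quantifier with a minimum and a maximum
    digits = "".join(rng.choice("0123456789") for _ in range(n_digits))
    return prefix + digits


rng = random.Random(0)  # fixed seed so repeated runs give the same draws
samples = [draw_value(rng) for _ in range(1000)]
print(samples[:5])
```

Options with a higher weight appear more often in the drawn values, which is the sense in which the regex carries statistical information.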
When fitting the RegexDistribution using the fit method, pay attention to the count_thres and method parameters. By default these are set dynamically to reasonable values for the input, but in some cases it can be important to set them manually. The count_thres parameter sets the minimum number of times a regex element needs to be used. For example, if count_thres=2 and there is only one value starting with “a”, then the regex will never start with “a”. In effect, a higher value gives more privacy, less utility and a faster fit. The method parameter has a small effect on the accuracy of the regex and a larger effect on the worst-case time needed for fitting. Set it to “accurate” for the best result, or to “fast” for the fastest fit.
Examples of data that this distribution should work reasonably well for are: email addresses, IDs, telephone numbers, IP addresses, etc.
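The effect of count_thres can be pictured with a small standard-library sketch. This is only a conceptual illustration of the thresholding idea described above, applied to the first character of each value; the allowed_first_chars helper is hypothetical and is not part of metasyn or regexmodel.

```python
from collections import Counter


def allowed_first_chars(values, count_thres):
    """Return the first characters that occur at least count_thres times."""
    counts = Counter(v[0] for v in values if v)
    return {ch for ch, n in counts.items() if n >= count_thres}


values = ["a123", "b456", "b789", "c000", "c111", "c222"]

# With a threshold of 1 every observed first character is kept.
print(sorted(allowed_first_chars(values, count_thres=1)))  # ['a', 'b', 'c']

# With a threshold of 2 the single value starting with "a" is dropped.
print(sorted(allowed_first_chars(values, count_thres=2)))  # ['b', 'c']
```

Raising the threshold removes rare elements, which is why it trades utility for privacy.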
- Parameters:
regex_data (Union[str, dict, RegexModel]) – Valid inputs for the regex model are:
- str: String with a regex (that falls within the specifications, see above).
- dict: Serialized version of the regex model, as it comes from a JSON file.
- RegexModel: Initialized regex model.
Examples
>>> RegexDistribution(r"AB[0-9]{4}").draw()
"AB8123"
>>> RegexDistribution(r"(a|b|c)10").draw()
"b10"
- name
core.regex
- unique
False
- version
2.0
- var_type
string
- __init__(regex_data)
- Parameters:
regex_data (str | dict | RegexModel)
Methods
__init__(regex_data)
default_distribution([var_type]) – Get a distribution with default parameters.
draw() – Draw a random element from the fitted distribution.
draw_list(n) – Draw a list of values from the distribution.
draw_reset() – Reset the drawing of elements to start again.
from_dict(dist_dict) – Create a distribution from a dictionary.
information_criterion(values) – Get the BIC value for a particular set of values.
matches_name(name) – Check whether the name matches the distribution.
provides_var_type(var_type)
schema() – Create sub-schema to validate GMF file.
to_dict() – Convert the distribution to a dictionary.
Attributes
name – The identifier for the implemented distribution
unique – Whether the distribution creates only unique values
var_type – The variable type of the distribution
version – Version of the implemented distribution