Multiframe
Multi dataframe functionality for metasyn.
- class metasyn.multiframe.RelationType(*values)
Bases:
Enum
Enumeration for the different relation types between columns.
There are four relation types, each with its own symbol: Subset (SUBSET OF), Equal (EQUALS), EqualOrdered (EQUAL ORDERED) and Infer (INFER FROM).
Subset means that the foreign column contains values from the primary column, but not all values from the primary column need to be present in the foreign column.
Equal means that all values in the primary column are present in the foreign column exactly once, but not necessarily in the same order.
EqualOrdered is the same as Equal, except that the values also appear in the same order.
Infer means that the correct relation type is not yet known and still has to be inferred.
- Subset = 'SUBSET OF'
- Equal = 'EQUALS'
- EqualOrdered = 'EQUAL ORDERED'
- Infer = 'INFER FROM'
- classmethod parse(symbol)
- Parameters:
symbol (str)
- Return type:
RelationType
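The relation types described above can be illustrated with a small, self-contained sketch. It does not use metasyn: `RelationKind` and `strictest_relation` are hypothetical names introduced here, and the check assumes the primary column holds unique, sortable values.

```python
from enum import Enum


class RelationKind(Enum):
    """Hypothetical stand-in that mirrors the symbols documented above."""

    Subset = "SUBSET OF"
    Equal = "EQUALS"
    EqualOrdered = "EQUAL ORDERED"


def strictest_relation(primary, foreign):
    """Return the strictest relation that holds, or None if none does.

    Assumption: the primary column holds unique, sortable values.
    """
    primary, foreign = list(primary), list(foreign)
    if foreign == primary:
        # Same values in the same order.
        return RelationKind.EqualOrdered
    if sorted(foreign) == sorted(primary):
        # Same values, each exactly once, in a different order.
        return RelationKind.Equal
    if set(foreign) <= set(primary):
        # Every foreign value exists in the primary column.
        return RelationKind.Subset
    return None


# Enum members can be looked up by their symbol.
assert RelationKind("EQUAL ORDERED") is RelationKind.EqualOrdered
```

Checking from strictest to loosest matters here, because any pair of columns that satisfies EqualOrdered also satisfies Equal and Subset.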
- class metasyn.multiframe.ColumnRelation(foreign_table, foreign_key, primary_table, primary_key, relation_type=RelationType.Infer)
Bases:
object
Specification of how two columns relate to each other for multiframe inference.
The easiest way to specify the relation between two columns is to use the ColumnRelation.parse() method.
- Parameters:
foreign_table (str)
foreign_key (str)
primary_table (str)
primary_key (str)
relation_type (RelationType)
- foreign_table: str
- foreign_key: str
- primary_table: str
- primary_key: str
- relation_type: RelationType = 'INFER FROM'
- classmethod parse(relation_str)
Parse a string to convert it into a column relation.
- Parameters:
relation_str (str) – String of the form primary_table[primary_column] {relation type} foreign_table[foreign_column]. See RelationType for the different relation types. Note that table and column names may contain spaces.
- Raises:
ValueError – If the relation string cannot be parsed.
- Return type:
ColumnRelation
- Returns:
An initialized column relation.
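To make the string format concrete, here is a minimal sketch of how such a relation string could be split with a regular expression. It is not metasyn's parser: `split_relation` and the neutral `left_*`/`right_*` group names are hypothetical, and which side is the primary or foreign table follows the form given above.

```python
import re

# table[column] {relation type} table[column]; names may contain spaces.
_RELATION_RE = re.compile(
    r"(?P<left_table>.+?)\[(?P<left_column>.+?)\]\s+"
    r"(?P<relation>SUBSET OF|EQUALS|EQUAL ORDERED|INFER FROM)\s+"
    r"(?P<right_table>.+?)\[(?P<right_column>.+?)\]"
)


def split_relation(relation_str):
    """Split a relation string into its five parts (hypothetical helper).

    Raises ValueError if the string does not match the expected form.
    """
    match = _RELATION_RE.fullmatch(relation_str.strip())
    if match is None:
        raise ValueError(f"Cannot parse relation string: {relation_str!r}")
    return match.groupdict()


parts = split_relation("customers[customer id] SUBSET OF order lines[customer]")
```

The table and column names in the example string are made up for illustration.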
- to_dict()
Convert the column relation to a dictionary.
Used mainly for serialization to JSON.
- Returns:
Dictionary containing the required information of the column relation.
- classmethod from_dict(col_dict)
Create ColumnRelation from a serialized dictionary.
Mainly used for deserializing from JSON files.
- Parameters:
col_dict (dict[str, Any]) – Dictionary containing the specifications of a column relation.
- Return type:
ColumnRelation
- Returns:
A newly initialized column relation.
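The to_dict()/from_dict() pair follows a common round-trip pattern, sketched below with a stand-in dataclass. `RelationRecord` is hypothetical; it only reuses the field names documented for ColumnRelation, and the dictionary keys metasyn actually writes may differ.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class RelationRecord:
    """Hypothetical stand-in with the fields documented for ColumnRelation."""

    foreign_table: str
    foreign_key: str
    primary_table: str
    primary_key: str
    relation_type: str = "INFER FROM"

    def to_dict(self):
        # Plain dictionary of strings, so it can be dumped to JSON directly.
        return asdict(self)

    @classmethod
    def from_dict(cls, col_dict):
        # Rebuild the record from a dictionary produced by to_dict().
        return cls(**col_dict)


record = RelationRecord("orders", "customer_id", "customers", "id", "SUBSET OF")
restored = RelationRecord.from_dict(json.loads(json.dumps(record.to_dict())))
```

The table and key names are illustrative only.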
- class metasyn.multiframe.MultiFrame(metaframes, relations, dataframes=None)
Bases:
object
Generation of multiple synthetic data frames.
This class implements the generation of multiple synthetic data frames with relations between columns.
Initialize the MultiFrame object.
- Parameters:
metaframes (dict) – A dictionary containing the metaframes to build the MultiFrame from. The keys identify the tables and can be freely chosen strings, for example the names of the tables or the files in which they are stored.
relations (list[ColumnRelation]) – A list of relations between columns, see ColumnRelation.
dataframes (Optional[dict[str, DataFrame]]) – Dataframes from which the metaframes were generated. By default None, in which case relations cannot be inferred from the data.
- synthesize(n=None)
Synthesize multiple tables.
- Parameters:
n (Optional[dict]) – Number of rows to synthesize. The number of rows is set individually for each table using a dictionary; for example, for table ‘x’ with 10 rows, use n = {'x': 10}.
- Return type:
dict[str, DataFrame]
- Returns:
A dictionary with the synthesized dataframes.
- Raises:
ValueError – When the combination of data frames does not have the right number of rows. For example, when a relation has the equal relation type, the columns in both tables must have the same number of rows.
ValueError – When one of the relations has a relation type that is unknown or RelationType.Infer.
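The two error conditions above can be sketched as a plain-Python check on the row-count dictionary. This is an illustration under assumptions, not metasyn's validation code: `check_row_counts` is a hypothetical helper, and relations are reduced to hypothetical `(foreign_table, primary_table, symbol)` tuples.

```python
def check_row_counts(n, relations):
    """Validate row counts against relations (hypothetical helper).

    n maps table names to row counts; relations is an iterable of
    (foreign_table, primary_table, symbol) tuples.
    """
    for foreign_table, primary_table, symbol in relations:
        if symbol == "INFER FROM":
            # An unresolved relation type cannot be synthesized.
            raise ValueError(
                f"Relation {foreign_table} -> {primary_table} is not inferred yet."
            )
        if symbol in ("EQUALS", "EQUAL ORDERED"):
            # Equal relations pair rows one-to-one, so the counts must match.
            if n[foreign_table] != n[primary_table]:
                raise ValueError(
                    f"Tables {foreign_table!r} and {primary_table!r} "
                    "need the same number of rows."
                )


# Passes: the subset relation places no constraint on the row counts.
check_row_counts({"orders": 50, "customers": 10},
                 [("orders", "customers", "SUBSET OF")])
```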
- save_json(fp=None, validate=True)
Save the MultiFrame object to a file.
- Parameters:
fp (Union[Path, str, None]) – File to save the metadata to. If left at None, the metadata is printed instead.
validate (bool)
- classmethod load_json(fp, validate=True)
Create a MultiFrame from a file with metadata.
- Parameters:
fp (Union[Path, str]) – File that contains the data to create the MultiFrame.
validate (bool)
- Return type:
MultiFrame
- Returns:
An initialized MultiFrame.
- save(fp)
Save the MultiFrame to a file.
- Parameters:
fp (Union[Path, str, None]) – File to save to.
- classmethod load(fp)
Load a MultiFrame from a GMF file.
- Parameters:
fp (Union[Path, str]) – GMF file to read.
- Return type:
MultiFrame
- Returns:
A multiframe read from the GMF file.
- classmethod fit_dataframes(dataframes, relations, extra_kwargs=None, **global_kwargs)
Fit multiple dataframes to create a MultiFrame.
- Parameters:
dataframes (dict[str, DataFrame]) – Dictionary of dataframes that contain the tables to be fitted. The keys in the dictionary are used for defining the relations between columns in different tables.
relations (list[ColumnRelation]) – Relations between different columns, where primary/foreign key relationships are defined.
extra_kwargs (Optional[dict[str, dict]]) – Extra keyword arguments to be supplied for fitting each of the individual dataframes. If supplied, this should be a dictionary of dictionaries, where the outer dictionary has keys that correspond to the keys of dataframes.
global_kwargs – Extra keyword arguments applied to all dataframes equally. These are overridden by the extra_kwargs keyword argument if supplied for individual dataframes.
- Return type:
MultiFrame
- Returns:
A fitted multiframe object, containing the metadata for all tables and their relationships.
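The precedence between global_kwargs and extra_kwargs described above can be sketched as a dictionary merge. `resolve_fit_kwargs` is a hypothetical helper, not part of metasyn, and the option names in the example are placeholders rather than real fitting arguments.

```python
def resolve_fit_kwargs(table_names, extra_kwargs=None, **global_kwargs):
    """Compute the keyword arguments each table would be fitted with.

    Hypothetical helper: per-table entries in extra_kwargs take precedence
    over the global keyword arguments.
    """
    extra_kwargs = extra_kwargs or {}
    return {
        # Later entries in the merge win, so the per-table values override.
        name: {**global_kwargs, **extra_kwargs.get(name, {})}
        for name in table_names
    }


resolved = resolve_fit_kwargs(
    ["orders", "customers"],
    extra_kwargs={"orders": {"option_a": "per-table"}},
    option_a="global",
    option_b=1,
)
```

Here `orders` ends up with the per-table value for `option_a`, while `customers` keeps the global one.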