Multiple table generation
From version 2.1 onwards, metasyn implements primary key/foreign key relations. You might have multiple tables where one column in one table references another column in another table. It might be important for the utility of the synthetic data that this relationship is also present in the synthetic data.
Consider, for example, a very simple set of two tables, passengers.csv and medical_data.csv, holding passengers and their medical data.
| PassengerId | Name | Sex | Age | Parch | Ticket | Fare | Cabin | Embarked | Birthday | Board time | Married since | all_NA |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | “Braund, Mr. Owen Harris” | “male” | 22 | 0 | “A/5 21171” | 7.25 | null | “S” | 1937-10-28 | 15:53:04 | 2022-08-05 04:43:34 | null |
| 2 | “Cumings, Mrs. John Bradley (Florence Briggs Thayer)” | “female” | 38 | 0 | “PC 17599” | 71.2833 | “C85” | “C” | null | 12:26:00 | 2022-08-07 01:56:33 | null |
| 3 | “Heikkinen, Miss. Laina” | “female” | 26 | 0 | “STON/O2. 3101282” | 7.925 | null | “S” | 1931-09-24 | 16:08:25 | 2022-08-04 20:27:37 | null |
| 4 | “Futrelle, Mrs. Jacques Heath (Lily May Peel)” | “female” | 35 | 0 | “113803” | 53.1 | “C123” | “S” | 1936-11-30 | null | 2022-08-07 07:05:55 | null |
| 5 | “Allen, Mr. William Henry” | “male” | 35 | 0 | “373450” | 8.05 | null | “S” | 1918-11-07 | 10:59:08 | 2022-08-02 15:13:34 | null |

| PassengerId | Medical condition |
|---|---|
| 3 | “Healthy” |
| 1 | “Fever” |
| 4 | “Unknown” |
In our analysis we might want to combine the two tables; for example, we might want to relate the medical condition
to the age of the passenger. For this you need to join the tables on PassengerId. This poses a problem if you generate
the synthetic tables independently of each other: the synthetic version of medical_data.csv might contain PassengerId
values that do not occur in the synthetic passengers.csv table, even though in the original data every PassengerId in
medical_data.csv also occurs in passengers.csv.
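The mismatch described above can be made concrete with a few lines of plain Python. This is only an illustrative sketch: `orphaned_keys` is a hypothetical helper and not part of metasyn, and the "independently synthesized" IDs are made-up values chosen to show the failure mode.

```python
def orphaned_keys(foreign_keys, primary_keys):
    """Return the foreign-key values that have no match in the primary table."""
    return sorted(set(foreign_keys) - set(primary_keys))


# PassengerId values taken from the two original tables shown above.
passengers_ids = [1, 2, 3, 4, 5]
medical_ids = [3, 1, 4]

# In the original data every medical record points at an existing passenger.
original_orphans = orphaned_keys(medical_ids, passengers_ids)
print(original_orphans)  # []

# Made-up result of synthesizing the medical table on its own: the IDs are
# plausible integers, but nothing ties them to the passengers table.
independent_medical_ids = [7, 2, 9]
synthetic_orphans = orphaned_keys(independent_medical_ids, passengers_ids)
print(synthetic_orphans)  # [7, 9]
```

Rows whose key ends up in the second list would silently disappear in an inner join of the two synthetic tables.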
Metasyn provides the metasyn.multiframe.MultiFrame class to synthesize multiple tables at once and to define
relations between columns across tables.
Column relations
Metasyn implements a few different kinds of relations between columns: subset (SUBSET OF), equal (EQUALS) and
equal_ordered (EQUAL ORDERED). There is one extra relation, infer (INFER FROM), which tells metasyn to attempt to
infer the relation automatically. There are two ways to define a relation between two columns: one using a string, the other
using the metasyn.multiframe.ColumnRelation class:
# Option 1: parse the relation from a string
from metasyn.multiframe import ColumnRelation
relation_str = "medical_data.csv[PassengerId] SUBSET OF passengers.csv[PassengerId]"
relation = ColumnRelation.parse(relation_str)

# Option 2: construct the relation explicitly
from metasyn.multiframe import ColumnRelation, RelationType
relation = ColumnRelation(primary_table="passengers.csv", primary_key="PassengerId",
                          foreign_table="medical_data.csv", foreign_key="PassengerId",
                          relation_type=RelationType.Subset)
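To give a feel for what the three relation types could mean for the data, here is a minimal sketch in plain Python. The semantics are an assumption based on the relation names, not metasyn's documented behaviour: subset is read as "every foreign value occurs in the primary column", equal as "the same values, in any order", and equal_ordered as "the same values in the same row order". `relation_holds` is a hypothetical helper and does not exist in metasyn.

```python
from collections import Counter


def relation_holds(relation_type, foreign_values, primary_values):
    """Check one assumed reading of each relation type on two columns."""
    if relation_type == "subset":
        # Every foreign value also occurs in the primary column.
        return set(foreign_values) <= set(primary_values)
    if relation_type == "equal":
        # Same values with the same multiplicities, in any order.
        return Counter(foreign_values) == Counter(primary_values)
    if relation_type == "equal_ordered":
        # Same values in the same row order.
        return list(foreign_values) == list(primary_values)
    raise ValueError(f"unknown relation type: {relation_type}")


# PassengerId columns from the example tables.
passengers_ids = [1, 2, 3, 4, 5]
medical_ids = [3, 1, 4]

print(relation_holds("subset", medical_ids, passengers_ids))   # True
print(relation_holds("equal", medical_ids, passengers_ids))    # False
print(relation_holds("equal", [3, 1, 2], [1, 2, 3]))           # True
print(relation_holds("equal_ordered", [3, 1, 2], [1, 2, 3]))   # False
```

Under this reading, the passengers and medical data tables above satisfy the subset relation but not the two equal variants, which matches the SUBSET OF relation used in the example.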
Multiframe
The metasyn.multiframe.MultiFrame class is the equivalent of the metasyn.metaframe.MetaFrame class for
multiple tables. It can be fitted from the dataframes with the metasyn.multiframe.MultiFrame.fit_dataframes() method, or created directly from already fitted metaframes.
# Option 1: fit the metaframes and relations directly from the dataframes
import polars as pl
from metasyn.multiframe import MultiFrame
dfs = {"a": pl.read_csv(...), "b": pl.read_csv(...)}
relations = ["b[passengerId] SUBSET OF a[ID]", "a[userId] <= b[userId]"]
multi_frame = MultiFrame.fit_dataframes(dfs, relations=relations, extra_kwargs={"a": {...}, "b": {...}})

# Option 2: build the MultiFrame from already fitted metaframes
import polars as pl
from metasyn.metaframe import MetaFrame
from metasyn.multiframe import MultiFrame
dfs = {"a": pl.read_csv(...), "b": pl.read_csv(...)}
mfs = {"a": MetaFrame.fit_dataframe(dfs["a"], ...), "b": MetaFrame.fit_dataframe(dfs["b"], ...)}
relations = ["b[passengerId] SUBSET OF a[ID]", "b[userId] EQUALS a[userId]"]
multi_frame = MultiFrame(mfs, relations=relations, dataframes=dfs)
Synthesizing multiple tables
Similar to the metasyn.metaframe.MetaFrame class, the MultiFrame class has a MultiFrame.synthesize() method to generate
synthetic dataframes. The tables can have different numbers of rows, which are set per table when generating the synthetic dataset.
multi_frame.synthesize(n={"a": 100, "b": 200})
Note
In contrast to single table synthesis, you might not be able to independently set the number of rows for each table. For example, if one table has an “equal” relationship with another table, the two tables should have the same size.
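The size constraint in the note can be checked before calling synthesize. The sketch below is a hypothetical pre-flight check written in plain Python; `check_row_counts` and the `equal_pairs` argument are illustrative names and not part of metasyn, and how metasyn itself reacts to mismatched sizes is not covered here.

```python
def check_row_counts(n, equal_pairs):
    """Raise ValueError if tables linked by an "equal" relation get different sizes.

    n: mapping of table name to the requested number of rows.
    equal_pairs: iterable of (table, table) pairs linked by an "equal" relation.
    """
    for left, right in equal_pairs:
        if n[left] != n[right]:
            raise ValueError(
                f"tables {left!r} and {right!r} must have the same number of rows, "
                f"got {n[left]} and {n[right]}"
            )
    return True


# Tables "a" and "b" are assumed to be linked by an "equal" relation.
equal_pairs = [("a", "b")]

print(check_row_counts({"a": 100, "b": 100}, equal_pairs))  # True

try:
    check_row_counts({"a": 100, "b": 200}, equal_pairs)
except ValueError as err:
    print(f"invalid request: {err}")
```

Tables that are only linked by a subset relation are not listed in `equal_pairs`, so their row counts remain free to differ.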