Quickstart
In this concise demonstration, you’ll learn the basic functionality of metasyn by generating synthetic data from a classic dataset about the titanic disaster.
Note
A more elaborate version of this page is also available as an interactive tutorial available on the Interactive tutorials page.
Importing libraries
The first step is to import the required Python libraries. For this example, we will use the Polars data frame library and metasyn.
import polars as pl # NB: pandas is also supported
import metasyn as ms
Loading the dataset
Next, we assume we have the Titanic dataset stored as a CSV file. To get the path to the demo file, we can use the built-in metasyn.demo_file() function.
csv_path = ms.demo_file("titanic")
When loading in the DataFrame, it is important to use the correct data types. For example, we need to tell Polars which columns are categorical, as it cannot automatically infer this. This is important, as metasyn can later read this information and use it to generate categorical values where necessary. For this, we create a dictionary with the column names that we want to specify as categorical. For this example, we specify which columns are dates / times as well.
data_types = {
"Sex": pl.Categorical,
"Embarked": pl.Categorical,
"Birthday": pl.Date,
"Board time": pl.Time,
"Married since": pl.DateTime,
}
To finish loading the dataset, we can either use the polars.read_csv() function or the metasyn
metasyn.read_csv() function. The advantage of the latter is that with it metasyn will remember the
specifics of the file format of your file (for example if your separator between columns was a , or ;)
and when writing the synthetic data file, this can be reused.
Note
Metasyn also supports reading an writing sav files (metasyn.read_sav()), excel files (metasyn.read_excel()), tsv files (metasyn.read_tsv()), .dta files (metasyn.read_dta()), and more.
For reading a CSV file, you should give the data_types dictionary as an argument.
# for more info, see pl.read_csv()
df, file_format = ms.read_csv(csv_path, schema_overrides=data_types)
This converts the CSV file into a DataFrame named df. Additionally, we have a file_format object
that remembers the name of your file, that it was a CSV file and more.
The dataset should now be loaded into the DataFrame, and we can verify this by inspecting the first 5 rows of the DataFrame using the df.head(5). This will output the following table:
PassengerId |
Name |
Sex |
Age |
Parch |
Ticket |
Fare |
Cabin |
Embarked |
Birthday |
Board time |
Married since |
all_NA |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 |
“Braund, Mr. Owen Harris” |
“male” |
22 |
0 |
“A/5 21171” |
7.25 |
null |
“S” |
1937-10-28 |
15:53:04 |
2022-08-05 04:43:34 |
null |
2 |
“Cumings, Mrs. John Bradley (Florence Briggs Thayer)” |
“female” |
38 |
0 |
“PC 17599” |
71.2833 |
“C85” |
“C” |
null |
12:26:00 |
2022-08-07 01:56:33 |
null |
3 |
“Heikkinen, Miss. Laina” |
“female” |
26 |
0 |
“STON/O2. 3101282” |
7.925 |
null |
“S” |
1931-09-24 |
16:08:25 |
2022-08-04 20:27:37 |
null |
4 |
“Futrelle, Mrs. Jacques Heath (Lily May Peel)” |
“female” |
35 |
0 |
“113803” |
53.1 |
“C123” |
“S” |
1936-11-30 |
null |
2022-08-07 07:05:55 |
null |
5 |
“Allen, Mr. William Henry” |
“male” |
35 |
0 |
“373450” |
8.05 |
null |
“S” |
1918-11-07 |
10:59:08 |
2022-08-02 15:13:34 |
null |
Generating the MetaFrame
With the DataFrame loaded, you can now generate a MetaFrame.
mf = ms.MetaFrame.fit_dataframe(df, file_format=file_format)
This creates a MetaFrame named mf. Note that you don’t have to supply the file_format argument, but adding it will enable metasyn to write the synthetic data in exactly the same format as the real data.
We can inspect the MetaFrame by printing it (print(mf)). This will produce the following output:
# Rows: 891
# Columns: 13
Column 1: "PassengerId"
- Variable Type: discrete
- Data Type: Int64
- Proportion of Missing Values: 0.0000
- Distribution:
- Type: core.uniform
- Provenance: builtin
- Parameters:
- lower: 1
- upper: 892
Column 2: "Name"
# ...
Saving and loading the MetaFrame
The MetaFrame can be saved to a GMF file for future use, to do so we use the save() method on the MetaFrame (which in our case is named mf), and pass in the desired filepath as a parameter. The following code saves the MetaFrame to a JSON file named “saved_metaframe.json”:
mf.save("saved_metaframe.json")
Inversely, we can load a MetaFrame from a JSON file using the load() method, passing in the filepath as a parameter. To load our previously saved MetaFrame, we use the following code:
mf = ms.MetaFrame.load("saved_metaframe.json")
Synthesizing the data
With the MetaFrame loaded, you can synthesize new data. To do so, we simply call the synthesize() method on the MetaFrame, and pass in the number of rows you would like to generate as a parameter. For example, to generate five rows of synthetic data we can use:
synthetic_data = mf.synthesize(5)
We can inspect our synthesized data by printing it (print(synthetic_data)). This will output a table similar to the following, but with different values as it is randomly generated:
PassengerId |
Name |
Sex |
Age |
Parch |
Ticket |
Fare |
Cabin |
Embarked |
Birthday |
Board time |
Married since |
all_NA |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
19 |
“Certain. Nearly.” |
“male” |
30 |
0 |
“11941” |
2.025903 |
null |
“S” |
1921-10-19 |
14:06:13 |
2022-08-03 15:51:11 |
null |
795 |
“Between. Nature.” |
“male” |
43 |
0 |
“2067” |
16.766045 |
null |
“S” |
1936-04-09 |
12:26:26 |
2022-07-27 15:15:46 |
null |
257 |
“Country. View. Evidence.” |
“male” |
44 |
0 |
“451553” |
3.687185 |
null |
“S” |
1930-10-18 |
11:58:39 |
null |
null |
575 |
“Scene. Reason. Low. Recent.” |
“male” |
34 |
1 |
“8659” |
25.834306 |
null |
“S” |
1914-06-14 |
15:43:05 |
2022-08-08 05:50:39 |
null |
495 |
“Morning. Nice. Large. Challenge.” |
“male” |
8 |
0 |
“9582” |
9.150979 |
“G01” |
“S” |
1914-06-23 |
12:16:21 |
2022-07-19 09:34:07 |
null |
Of course, it’s easy to see some flaws with the generated dataset, such as the names not making a lot of sense. The page on Improve your synthetic data shows how to improve the quality of the synthesized data, such as for example generating fake names using Faker.
Writing synthetic data
If you used metasyn.read_csv() to read in your file and added the file format to the metaframe, you can also write the synthetic data directly to a file:
mf.write_synthetic("some_file_name.csv")
Conclusion
Congratulations! You’ve successfully generated synthetic data using metasyn. The synthesized data is returned as a DataFrame, so you can inspect and manipulate it as you would with any DataFrame.