Quickstart

In this concise demonstration, you’ll learn the basic functionality of metasyn by generating synthetic data from a classic dataset about the Titanic disaster.

Note

A more elaborate version of this page is also available as an interactive tutorial on the Interactive tutorials page.

Importing libraries

The first step is to import the required Python libraries. For this example, we will use the Polars data frame library and metasyn.

import polars as pl # NB: pandas is also supported
import metasyn as ms

Loading the dataset

Next, we assume we have the Titanic dataset stored as a CSV file. To get the path to the demo file, we can use the built-in metasyn.demo_file() function.

csv_path = ms.demo_file("titanic")

When loading the DataFrame, it is important to use the correct data types. For example, we need to tell Polars which columns are categorical, as it cannot infer this automatically. Metasyn later reads this information and uses it to generate categorical values where necessary. To do so, we create a dictionary that maps column names to the data types we want them to have. In this example, we also specify which columns hold dates and times.

data_types = {
   "Sex": pl.Categorical,
   "Embarked": pl.Categorical,
   "Birthday": pl.Date,
   "Board time": pl.Time,
   "Married since": pl.Datetime,
}

To finish loading the dataset, we can use either the polars.read_csv() function or the metasyn.read_csv() function. The advantage of the latter is that metasyn remembers the specifics of your file format (for example, whether the column separator was a , or a ;), so that they can be reused when writing the synthetic data file.

Note

Metasyn also supports reading and writing SAV files (metasyn.read_sav()), Excel files (metasyn.read_excel()), TSV files (metasyn.read_tsv()), .dta files (metasyn.read_dta()), and more.

When reading a CSV file, pass the data_types dictionary as the schema_overrides argument.

# for more info, see pl.read_csv()
df, file_format = ms.read_csv(csv_path, schema_overrides=data_types)

This converts the CSV file into a DataFrame named df. Additionally, we get a file_format object that remembers the name of your file, the fact that it was a CSV file, and more.
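The idea of remembering a file format and reusing it when writing can be illustrated with Python's standard csv module. This is only a conceptual sketch and not how metasyn implements it: the semicolon-separated sample text is made up, and csv.Sniffer stands in for whatever metasyn records in file_format.

```python
import csv
import io

# Hypothetical input: a small semicolon-separated file.
raw = "PassengerId;Sex;Age\n1;male;22\n2;female;38\n"

# Detect the dialect (delimiter, quoting) from the text itself.
dialect = csv.Sniffer().sniff(raw, delimiters=",;\t")
rows = list(csv.reader(io.StringIO(raw), dialect))

# Write new rows with the same dialect, so the output keeps
# the separator that was used in the input.
out = io.StringIO()
writer = csv.writer(out, dialect)
writer.writerow(rows[0])               # reuse the original header
writer.writerow(["19", "male", "30"])  # a new, synthetic-looking row

print(repr(dialect.delimiter))
print(out.getvalue())
```

Because the dialect is captured once at read time, the writing step does not need to be told the separator again.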

The dataset should now be loaded into the DataFrame. We can verify this by inspecting the first 5 rows of the DataFrame using df.head(5). This will output the following table:

PassengerId	Name	Sex	Age	Ticket	Fare	Cabin	Embarked	Birthday	Board time	Married since	all_NA
1	“Braund, Mr. Owen Harris”	“male”	22	“A/5 21171”	7.25	null	“S”	1937-10-28	15:53:04	2022-08-05 04:43:34	null
2	“Cumings, Mrs. John Bradley (Florence Briggs Thayer)”	“female”	38	“PC 17599”	71.2833	“C85”	“C”	null	12:26:00	2022-08-07 01:56:33	null
3	“Heikkinen, Miss. Laina”	“female”	26	“STON/O2. 3101282”	7.925	null	“S”	1931-09-24	16:08:25	2022-08-04 20:27:37	null
4	“Futrelle, Mrs. Jacques Heath (Lily May Peel)”	“female”	35	“113803”	53.1	“C123”	“S”	1936-11-30	null	2022-08-07 07:05:55	null
5	“Allen, Mr. William Henry”	“male”	35	“373450”	8.05	null	“S”	1918-11-07	10:59:08	2022-08-02 15:13:34	null

Generating the MetaFrame

With the DataFrame loaded, you can now generate a MetaFrame.

mf = ms.MetaFrame.fit_dataframe(df, file_format=file_format)

This creates a MetaFrame named mf. Note that you don’t have to supply the file_format argument, but adding it will enable metasyn to write the synthetic data in exactly the same format as the real data.

We can inspect the MetaFrame by printing it (print(mf)). This will produce the following output:

# Rows: 891
# Columns: 13

Column 1: "PassengerId"
- Variable Type: discrete
- Data Type: Int64
- Proportion of Missing Values: 0.0000
- Distribution:
   - Type: core.uniform
   - Provenance: builtin
   - Parameters:
      - lower: 1
      - upper: 892

Column 2: "Name"
# ...

Saving and loading the MetaFrame

The MetaFrame can be saved to a GMF (Generative Metadata Format) file for future use. To do so, we call the save() method on the MetaFrame (which in our case is named mf) and pass in the desired file path as a parameter. The following code saves the MetaFrame to a JSON file named “saved_metaframe.json”:

mf.save("saved_metaframe.json")

Conversely, we can load a MetaFrame from a JSON file using the load() method, passing in the file path as a parameter. To load our previously saved MetaFrame, we use the following code:

mf = ms.MetaFrame.load("saved_metaframe.json")

Synthesizing the data

With the MetaFrame loaded, we can synthesize new data. To do so, we call the synthesize() method on the MetaFrame and pass in the number of rows we would like to generate. For example, to generate five rows of synthetic data, we can use:

synthetic_data = mf.synthesize(5)

We can inspect our synthesized data by printing it (print(synthetic_data)). This will output a table similar to the following, but with different values as it is randomly generated:

PassengerId	Name	Sex	Age	Parch	Ticket	Fare	Cabin	Embarked	Birthday	Board time	Married since	all_NA
19	“Certain. Nearly.”	“male”	30	0	“11941”	2.025903	null	“S”	1921-10-19	14:06:13	2022-08-03 15:51:11	null
795	“Between. Nature.”	“male”	43	0	“2067”	16.766045	null	“S”	1936-04-09	12:26:26	2022-07-27 15:15:46	null
257	“Country. View. Evidence.”	“male”	44	0	“451553”	3.687185	null	“S”	1930-10-18	11:58:39	null	null
575	“Scene. Reason. Low. Recent.”	“male”	34	1	“8659”	25.834306	null	“S”	1914-06-14	15:43:05	2022-08-08 05:50:39	null
495	“Morning. Nice. Large. Challenge.”	“male”	8	0	“9582”	9.150979	“G01”	“S”	1914-06-23	12:16:21	2022-07-19 09:34:07	null

Of course, it’s easy to see some flaws in the generated dataset, such as the names not making a lot of sense. The page on Improve your synthetic data shows how to improve the quality of the synthesized data, for example by generating fake names using Faker.

Writing synthetic data

If you used metasyn.read_csv() to read in your file and added the file format to the MetaFrame, you can also write the synthetic data directly to a file:

mf.write_synthetic("some_file_name.csv")

Conclusion

Congratulations! You’ve successfully generated synthetic data using metasyn. The synthesized data is returned as a DataFrame, so you can inspect and manipulate it as you would with any DataFrame.