metasyn.file

File interfaces to read datasets and write synthetic datasets.

Functions

file_interface_from_dict(file_format_dict)

Create a file interface from a dictionary.

fileinterface(*args)

Register a file interface so that it can be found by name.

get_file_interface_class(fp)

Get the file interface class from a filename.

read_csv(fp[, separator, eol_char, ...])

Create the file interface from a file.

read_dta(fp[, max_rows, chunk_size])

Read a Stata .dta file into metadata and a DataFrame.

read_excel(fp)

Read an Excel file and create a file interface from it.

read_file(fp[, name, arguments])

Attempt to create a file interface from a dataset.

read_sav(fp[, max_rows, chunk_size])

Create the file interface from a .sav or .zsav file.

read_tsv(*args, **kwargs)

Alias for read_csv().

write_csv(df, fp[, file_format, overwrite])

Write to a CSV file with the same file format.

write_dta(df, fp[, file_format, overwrite])

Write to a DTA file with the same file format.

write_excel(df, fp[, file_format, overwrite])

Write to an Excel file with the same file format.

write_sav(df, fp[, file_format, overwrite])

Write to a SAV file with the same file format.

write_tsv(*args, **kwargs)

Alias for write_csv().

metasyn.file.file_interface_from_dict(file_format_dict)

Create a file interface from a dictionary.

Parameters:

file_format_dict (dict) – Dictionary containing information to create the file interface.

Return type:

BaseFileInterface

metasyn.file.fileinterface(*args)

Register a file interface so that it can be found by name.

metasyn.file.get_file_interface_class(fp)

Get the file interface class from a filename.

Return type:

Type[BaseFileInterface]

Parameters:

fp (Path | str)
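
As a rough illustration of how registering interfaces by name and looking them up from a filename can fit together, the sketch below uses a plain dictionary keyed on file extension. Everything in it (register_interface, lookup_interface_class, the toy classes, and raising ValueError from the lookup) is hypothetical and only mirrors the behaviour described in this module; it is not metasyn's implementation.

```python
from pathlib import Path

# Hypothetical registry mapping an extension to an interface class.
# metasyn's real bookkeeping may look quite different.
_REGISTRY: dict = {}


def register_interface(cls: type) -> type:
    """Class decorator: make cls discoverable through each of its extensions."""
    for ext in cls.extensions:
        _REGISTRY[ext.lower()] = cls
    return cls


def lookup_interface_class(fp) -> type:
    """Return the class registered for the extension of fp."""
    suffix = Path(fp).suffix.lower()
    try:
        return _REGISTRY[suffix]
    except KeyError:
        # Assumption: an unknown extension is reported as a ValueError.
        raise ValueError(f"Unknown file extension '{suffix}' for '{fp}'.") from None


@register_interface
class ToyCsvInterface:
    name = "csv"
    extensions = [".csv", ".tsv"]


@register_interface
class ToySavInterface:
    name = "sav"
    extensions = [".sav", ".zsav"]


csv_cls = lookup_interface_class("data/survey.tsv")
sav_cls = lookup_interface_class(Path("survey.zsav"))
```

Keying the lookup on the lower-cased suffix is a design choice of this sketch, so that "SURVEY.CSV" and "survey.csv" resolve to the same class.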

metasyn.file.read_csv(fp, separator=None, eol_char='\n', quote_char='"', null_values=None, **kwargs)

Create the file interface from a file.

This function is a wrapper around polars.read_csv (https://docs.pola.rs/api/python/dev/reference/api/polars.read_csv.html) with different defaults for some of the keywords. Any extra keyword arguments are passed through to polars.

Parameters:
  • fp (Union[Path, str]) – File to be read.

  • separator (Optional[str]) – Separator for the csv file, by default None in which case the separator will be a “,” for .csv files and a “\t” for .tsv files.

  • eol_char (str) – End of line character, by default “\n”

  • quote_char (str) – Quotation character, by default '"'

  • null_values (Union[str, list[str], None]) – Values that will be replaced by nulls, by default None in which case the defaults of polars will be used [“”, “na”, “NA”, “N/A”, “Na”].

  • kwargs – Extra keyword arguments to be passed through to polars.

Return type:

tuple[DataFrame, CsvFileInterface]

Returns:

  • df – Data frame read from the files.

  • cls – CsvFileInterface instance containing information on how to write CSV files.
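
The separator default described above can be pictured with a small helper. This is only a sketch of the documented rule; resolve_separator is a hypothetical name, and falling back to "," for extensions other than .tsv is an assumption, since only .csv and .tsv are specified.

```python
from pathlib import Path
from typing import Optional, Union


def resolve_separator(fp: Union[Path, str], separator: Optional[str] = None) -> str:
    """Mimic the documented default: ',' for .csv files, a tab for .tsv files."""
    if separator is not None:
        return separator  # an explicit value always wins
    if Path(fp).suffix.lower() == ".tsv":
        return "\t"
    # Assumption: anything that is not .tsv is treated like .csv.
    return ","


csv_sep = resolve_separator("survey.csv")
tsv_sep = resolve_separator("survey.tsv")
custom_sep = resolve_separator("survey.csv", separator=";")
```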

metasyn.file.read_dta(fp, max_rows=None, chunk_size=None)

Read a Stata .dta file into metadata and a DataFrame.

Parameters:
  • fp (Union[Path, str]) – File to be read with .dta extension.

  • max_rows (Optional[int]) – Maximum number of rows to read in.

  • chunk_size (Optional[int]) – Perform row sampling with contiguous rows. Should be used in combination with the max_rows parameter, otherwise it is ignored.

Returns:

  • df – Polars dataframe with the converted columns.

metasyn.file.read_excel(fp)

Read an Excel file and create a file interface from it.

Parameters:

fp (Union[Path, str]) – Excel file to read.

Return type:

tuple[DataFrame, ExcelFileInterface]

Returns:

  • df – Polars dataframe representing the Excel dataset.

  • file_interface – An instance of the ExcelFileInterface used for writing Excel files.

metasyn.file.read_file(fp, name=None, arguments=None)

Attempt to create a file interface from a dataset.

Default options will be used to read in the file.

Parameters:
  • fp (Union[Path, str]) – Filename of the dataset to be read.

  • name (str | None)

  • arguments (dict | None)

Return type:

tuple[DataFrame, BaseFileInterface]

Returns:

  • df – Dataframe with the dataset.

  • file_interface – The file interface that has created the dataframe.

Raises:

ValueError – When the extension is unknown.

metasyn.file.read_sav(fp, max_rows=None, chunk_size=None)

Create the file interface from a .sav or .zsav file.

Parameters:
  • fp (Union[Path, str]) – File to read the dataframe and metadata from.

  • max_rows (Optional[int]) – Maximum number of rows to read in.

  • chunk_size (Optional[int]) – Perform row sampling with contiguous rows. Should be used in combination with the max_rows parameter, otherwise it is ignored.

Returns:

  • df – Polars dataframe with the converted columns.

metasyn.file.read_tsv(*args, **kwargs)

Alias for read_csv().

Return type:

tuple[DataFrame, CsvFileInterface]

metasyn.file.write_csv(df, fp, file_format=None, overwrite=False)

Write to a CSV file with the same file format.

Parameters:
  • df (DataFrame) – DataFrame to write to a file.

  • fp (Union[Path, str]) – File to write to.

  • file_format (Union[dict, BaseFileInterface, None]) – File format to use for writing the file, by default None meaning to use the defaults.

  • overwrite (bool) – Whether to overwrite the file if it exists, by default False

metasyn.file.write_dta(df, fp, file_format=None, overwrite=False)

Write to a DTA file with the same file format.

Parameters:
  • df (DataFrame) – DataFrame to write to a file.

  • fp (Union[Path, str]) – File to write to.

  • file_format (Union[dict, BaseFileInterface, None]) – File format to use for writing the file, by default None meaning to use the defaults.

  • overwrite (bool) – Whether to overwrite the file if it exists, by default False

metasyn.file.write_excel(df, fp, file_format=None, overwrite=False)

Write to an Excel file with the same file format.

Parameters:
  • df (DataFrame) – DataFrame to write to a file.

  • fp (Union[Path, str]) – File to write to.

  • file_format (Union[dict, BaseFileInterface, None]) – File format to use for writing the file, by default None meaning to use the defaults.

  • overwrite (bool) – Whether to overwrite the file if it exists, by default False

metasyn.file.write_sav(df, fp, file_format=None, overwrite=False)

Write to a SAV file with the same file format.

Parameters:
  • df (DataFrame) – DataFrame to write to a file.

  • fp (Union[Path, str]) – File to write to.

  • file_format (Union[dict, BaseFileInterface, None]) – File format to use for writing the file, by default None meaning to use the defaults.

  • overwrite (bool) – Whether to overwrite the file if it exists, by default False

metasyn.file.write_tsv(*args, **kwargs)

Alias for write_csv().

Classes

BaseFileInterface(metadata, file_name)

Abstract file interface class to derive specific implementations from.

CsvFileInterface(metadata, file_name)

File interface to read and write CSV files.

ExcelFileInterface(metadata, file_name)

File interface/writer for Microsoft Excel files.

ReadStatInterface(metadata, file_name)

Abstract class to make it easier to create pyreadstat file interfaces.

SavFileInterface(metadata, file_name)

File interface for .sav and .zsav files.

StataFileInterface(metadata, file_name)

File interface for .dta files.

class metasyn.file.BaseFileInterface(metadata, file_name)

Bases: ABC

Abstract file interface class to derive specific implementations from.

The file interface facilitates the reading and writing of dataset files. In particular, it can ensure that the output data are written in exactly the same format as the input data.

The implementation class should have at least two class attributes: a name for the implementation and extensions, which is a list of extensions to be associated with the implementation. For example [".csv", ".tsv"].

Initialize the file interface with metadata and the original file name.

Parameters:
  • metadata (dict[str, Any]) – A dictionary containing all the information such as metadata and file format directives. The structure of the metadata is determined by the implementation of the BaseFileInterface and can be empty.

  • file_name (str) – file name of the original dataset.

to_dict()

Convert the class instance to a dictionary.

Return type:

dict[str, Any]

Returns:

A dictionary containing all information to reconstruct the file interface.
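
To make the round trip between to_dict() and a function such as file_interface_from_dict() concrete, here is a minimal sketch with a stand-in class. The key names ("file_interface_name", "file_name", "metadata"), the ToyInterface class, and interface_from_dict are assumptions made for illustration; the dictionary layout that metasyn actually produces is not specified here.

```python
from typing import Any


class ToyInterface:
    """Stand-in for a concrete file interface; not metasyn's class."""

    name = "toy"  # class attribute naming the implementation

    def __init__(self, metadata: dict, file_name: str):
        self.metadata = metadata
        self.file_name = file_name

    def to_dict(self) -> dict:
        # Key names are assumptions made for this sketch.
        return {
            "file_interface_name": self.name,
            "file_name": self.file_name,
            "metadata": self.metadata,
        }


def interface_from_dict(d: dict) -> ToyInterface:
    """Rebuild a ToyInterface from the dictionary produced by to_dict()."""
    if d["file_interface_name"] != ToyInterface.name:
        raise ValueError(f"Unknown file interface '{d['file_interface_name']}'.")
    return ToyInterface(d["metadata"], d["file_name"])


original = ToyInterface({"separator": ";", "quote_char": '"'}, "survey.csv")
restored = interface_from_dict(original.to_dict())
```

Storing the implementation name next to the metadata is what lets a reader of the dictionary pick the right class again; that choice belongs to this sketch and may differ from the library.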

write_file(df, fp=None, overwrite=False)

Write the synthetic dataframe to a file.

Parameters:
  • df (DataFrame) – Dataframe to be written to a file.

  • fp (Union[None, Path, str]) – File to write the dataframe to, by default None in which case the file will be the same as the original filename in the current working directory.

  • overwrite (bool) – Allow overwriting of the file if it already exists, by default False.

Raises:

FileExistsError – If the file already exists and the overwrite argument is False.

check_filename(fp=None, overwrite=False)

Check whether the filename can be written to.

Parameters:
  • fp (Union[None, Path, str]) – File to check the filename for, by default None

  • overwrite (bool) – Whether overwriting is allowed, by default False

Return type:

Union[Path, str]

Returns:

filename which could be either the same or different from fp.

Raises:
  • FileExistsError – If the file already exists and overwrite=False

  • FileNotFoundError – If the parent directory of fp does not exist.
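
The checks listed for check_filename() can be sketched with pathlib alone. The helper below is hypothetical (check_writable and its original_name parameter are not part of metasyn), and the order in which the two errors are tested is an assumption; it only mirrors the documented behaviour, including the fallback to the original file name in the current working directory when fp is None.

```python
import tempfile
from pathlib import Path
from typing import Optional, Union


def check_writable(
    fp: Optional[Union[Path, str]],
    original_name: str,
    overwrite: bool = False,
) -> Path:
    """Return a path that is safe to write to, or raise.

    Hypothetical stand-in for the documented checks; not metasyn's code.
    """
    if fp is None:
        # Documented fallback: original file name, in the working directory.
        path = Path.cwd() / Path(original_name).name
    else:
        path = Path(fp)
    if not path.parent.exists():
        raise FileNotFoundError(f"Directory '{path.parent}' does not exist.")
    if path.exists() and not overwrite:
        raise FileExistsError(f"'{path}' exists; pass overwrite=True to replace it.")
    return path


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "synthetic.csv"
    first = check_writable(target, "survey.csv")  # file is absent: accepted
    target.write_text("a,b\n1,2\n")
    try:
        check_writable(target, "survey.csv")  # file now exists: refused
        refused = False
    except FileExistsError:
        refused = True
    forced = check_writable(target, "survey.csv", overwrite=True)
```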

abstractmethod classmethod default_interface(fp)

Create a default interface with the most likely settings for writing.

Parameters:

fp (Union[Path, str]) – File for writing to by default.

Returns:

An instantiated file interface with default settings.

abstractmethod classmethod read_file(fp)

Create a file interface from a path.

Parameters:

fp (Union[Path, str]) – Path to read the dataset from.

Returns:

An initialized file interface.

class metasyn.file.CsvFileInterface(metadata, file_name)

Bases: BaseFileInterface

File interface to read and write CSV files.

Initialize the file interface with metadata and the original file name.

Parameters:
  • metadata (dict[str, Any]) – A dictionary containing all the information such as metadata and file format directives. The structure of the metadata is determined by the implementation of the BaseFileInterface and can be empty.

  • file_name (str) – file name of the original dataset.

read_dataset(fp, **kwargs)

Read CSV file.

Parameters:
  • fp (Union[Path, str]) – File to be read with the file interface.

  • kwargs – Extra keyword arguments to be passed to polars.

classmethod read_file(fp, separator=None, eol_char='\n', quote_char='"', null_values=None, encoding='utf-8', **kwargs)

Read a csv file.

See read_csv() for more detail.

Parameters:
  • fp (Union[Path, str]) – File to be read.

  • separator (Optional[str]) – Separator, by default None

  • eol_char (str) – End of line character, by default “\n”

  • quote_char (str) – Quotation character, by default '"'

  • null_values (Union[None, str, list[str]]) – Null values, by default None

Returns:

  • df – Polars dataframe for the file.

  • file_interface – File interface that read the dataset.

classmethod default_interface(fp)

Create a default interface with the most likely settings for writing.

Parameters:

fp (Union[Path, str]) – File for writing to by default.

Returns:

An instantiated file interface with default settings.

class metasyn.file.ExcelFileInterface(metadata, file_name)

Bases: BaseFileInterface

File interface/writer for Microsoft Excel files.

Initialize the file interface with metadata and the original file name.

Parameters:
  • metadata (dict[str, Any]) – A dictionary containing all the information such as metadata and file format directives. The structure of the metadata is determined by the implementation of the BaseFileInterface and can be empty.

  • file_name (str) – file name of the original dataset.

classmethod read_file(fp, sheet_name=None)

Create a file interface from a path.

Parameters:
  • fp (Union[Path, str]) – Path to read the dataset from.

  • sheet_name (str | None)

Returns:

An initialized file interface.

classmethod default_interface(fp)

Create a default interface with the most likely settings for writing.

Parameters:

fp (Union[Path, str]) – File for writing to by default.

Returns:

An instantiated file interface with default settings.

class metasyn.file.ReadStatInterface(metadata, file_name)

Bases: BaseFileInterface, ABC

Abstract class to make it easier to create pyreadstat file interfaces.

Initialize the file interface with metadata and the original file name.

Parameters:
  • metadata (dict[str, Any]) – A dictionary containing all the information such as metadata and file format directives. The structure of the metadata is determined by the implementation of the BaseFileInterface and can be empty.

  • file_name (str) – file name of the original dataset.

read_dataset(fp)

Read the dataset without the metadata.

Parameters:

fp (Path | str)

classmethod read_file(fp, **kwargs)

Create the file interface from a .sav or .zsav file.

Parameters:

fp (Union[Path, str]) – File to read the dataframe and metadata from.

Returns:

  • df – Polars dataframe with the converted columns.

  • file_interface – An instance of the SavFileInterface with the appropriate metadata.

classmethod default_interface(fp)

Create a default interface with the most likely settings for writing.

Parameters:

fp (Union[str, Path]) – File for writing to by default.

Returns:

An instantiated file interface with default settings.

class metasyn.file.SavFileInterface(metadata, file_name)

Bases: ReadStatInterface

File interface for .sav and .zsav files.

Also stores the descriptions of the columns and makes sure that F.0 columns are converted to integers.

Initialize the file interface with metadata and the original file name.

Parameters:
  • metadata (dict[str, Any]) – A dictionary containing all the information such as metadata and file format directives. The structure of the metadata is determined by the implementation of the BaseFileInterface and can be empty.

  • file_name (str) – file name of the original dataset.

classmethod default_interface(fp)

Create a default interface with the most likely settings for writing.

Parameters:

fp (Union[str, Path]) – File for writing to by default.

Returns:

An instantiated file interface with default settings.

class metasyn.file.StataFileInterface(metadata, file_name)

Bases: ReadStatInterface

File interface for .dta files.

Initialize the file interface with metadata and the original file name.

Parameters:
  • metadata (dict[str, Any]) – A dictionary containing all the information such as metadata and file format directives. The structure of the metadata is determined by the implementation of the BaseFileInterface and can be empty.

  • file_name (str) – file name of the original dataset.