Overview
The aindo.anonymize library supports pandas.DataFrame and pandas.Series data structures for both inputs and outputs, depending on the method used (see the anonymize and anonymize_column methods).
This means that the data must be loaded as a pandas.DataFrame or pandas.Series before being processed.
The anonymization techniques then return the anonymized data in the same format, either pandas.DataFrame or pandas.Series, based on the method used.
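As a minimal sketch of getting data into the two supported shapes, the snippet below uses only pandas. The CSV content and column names are made up for illustration, and no aindo.anonymize call is shown.

```python
import io

import pandas as pd

# Hypothetical sample data; the column names are illustrative only.
csv_text = "name,age\nAlice,34\nBob,51\n"

# Whole-table input: load the data as a pandas.DataFrame.
df = pd.read_csv(io.StringIO(csv_text))

# Single-column input: selecting one column yields a pandas.Series.
ages = df["age"]

print(type(df).__name__)    # DataFrame
print(type(ages).__name__)  # Series
```

In practice the DataFrame would typically come from pd.read_csv on a file path, or from any other pandas reader.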
The library currently implements the following anonymization techniques:
- Data nulling: Replaces the original data with None or a custom constant value.
- Character masking: Replaces some or all characters with a constant symbol.
- Mocking: Generates realistic mock data for various fields, such as names, emails, and more.
- KeyHashing: Data values are hashed using a cryptographic key and then encoded using Base64.
- Swapping: Rearranges data by swapping values.
- Binning: Groups numerical values into discrete bins and replaces individual values with their corresponding bin ranges.
- Top/Bottom coding: Replaces values above or below certain thresholds with a capped value.
- Perturbation: Slightly modifies the values according to the specified perturbation intensity and replacement strategy.
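To make several of these techniques concrete, here is a rough sketch written with plain pandas and the Python standard library. It is not the aindo.anonymize implementation and does not use its API. The sample data, the masking rule, the thresholds, and the choice of HMAC-SHA256 for key hashing are all assumptions made for illustration. Mocking and perturbation are left out.

```python
import base64
import hashlib
import hmac

import pandas as pd

# Hypothetical sample data; column names are illustrative only.
df = pd.DataFrame(
    {
        "email": ["alice@example.com", "bob@example.com", "carol@example.com"],
        "age": [23, 47, 91],
        "income": [18_000, 52_000, 250_000],
    }
)

# Data nulling: replace every value with None.
nulled = pd.Series([None] * len(df), index=df.index, dtype="object")

# Character masking: keep the first character, mask the rest with "*".
masked = df["email"].map(lambda s: s[0] + "*" * (len(s) - 1))

# Key hashing (assumed here to be HMAC-SHA256): hash each value with a
# secret key, then Base64-encode the raw digest.
SECRET_KEY = b"demo-key-do-not-use"  # assumption: placeholder key for the sketch


def key_hash(value: str) -> str:
    digest = hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).digest()
    return base64.b64encode(digest).decode("ascii")


hashed = df["email"].map(key_hash)

# Binning: replace each age with the interval it falls into.
binned = pd.cut(df["age"], bins=[0, 30, 60, 120])

# Top/bottom coding: cap incomes below 20,000 and above 100,000.
capped = df["income"].clip(lower=20_000, upper=100_000)

# Swapping: shuffle the column so values are detached from their rows.
swapped = df["age"].sample(frac=1, random_state=0).reset_index(drop=True)
```

Each result has the same length as its input column, so it can be assigned back into the DataFrame in place of the original values.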
The library provides two usage approaches, catering to different use cases and development preferences:
- Explicit approach: this approach involves directly instantiating anonymization techniques using the library's Python classes (e.g., Binning for data binning). It is particularly effective when: (1) only a few techniques need to be applied, (2) the anonymization methods are predefined and can be hardcoded, and (3) development environment features, such as code completion and type checking, are used to improve productivity and maintainability. See a simple example here.
- Declarative approach: this approach allows users to define multiple anonymization techniques within a single configuration, which is then executed sequentially on the input data. The configuration can be instantiated from a Python dictionary using the Config.from_dict method and can be read from files in YAML, TOML or JSON formats. This approach is particularly advantageous when the anonymization workflow is dynamic and requires flexibility. See a simple example here.
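The difference between the two approaches can be sketched with toy stand-ins. Everything below is hypothetical: ToyBinning, ToyTopBottomCoding, run_config, and the dictionary layout are invented for this illustration and are not the library's classes, its Config.from_dict method, or its configuration schema. Only pandas is used.

```python
import pandas as pd


# --- Hypothetical stand-ins, NOT the aindo.anonymize classes ---
class ToyBinning:
    """Replaces numeric values with the label of the bin they fall into."""

    def __init__(self, bins: list) -> None:
        self.bins = bins

    def anonymize_column(self, col: pd.Series) -> pd.Series:
        return pd.cut(col, bins=self.bins).astype(str)


class ToyTopBottomCoding:
    """Caps values outside the [lower, upper] range."""

    def __init__(self, lower: float, upper: float) -> None:
        self.lower = lower
        self.upper = upper

    def anonymize_column(self, col: pd.Series) -> pd.Series:
        return col.clip(lower=self.lower, upper=self.upper)


# Registry mapping made-up technique names to the toy classes.
TECHNIQUES = {"binning": ToyBinning, "top_bottom_coding": ToyTopBottomCoding}

df = pd.DataFrame({"age": [23, 47, 91], "income": [18_000, 52_000, 250_000]})

# Explicit approach: instantiate a technique class directly and apply it.
explicit_ages = ToyBinning(bins=[0, 30, 60, 120]).anonymize_column(df["age"])

# Declarative approach: describe the steps as data, then run them in order.
# This dictionary layout is a made-up schema for the sketch.
config = {
    "steps": [
        {"column": "age", "type": "binning", "params": {"bins": [0, 30, 60, 120]}},
        {
            "column": "income",
            "type": "top_bottom_coding",
            "params": {"lower": 20_000, "upper": 100_000},
        },
    ]
}


def run_config(data: pd.DataFrame, cfg: dict) -> pd.DataFrame:
    """Apply each configured step sequentially to a copy of the data."""
    out = data.copy()
    for step in cfg["steps"]:
        technique = TECHNIQUES[step["type"]](**step["params"])
        out[step["column"]] = technique.anonymize_column(out[step["column"]])
    return out


declarative_df = run_config(df, config)
```

Because the declarative steps are plain data, the same dictionary could equally be produced by parsing a JSON, YAML, or TOML file, which is what makes this style convenient when the workflow changes at run time.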