Overview

The aindo.anonymize library accepts and returns pandas.DataFrame and pandas.Series objects; which of the two applies depends on the method used (see the anonymize and anonymize_column methods). Data must therefore be loaded as a pandas.DataFrame or pandas.Series before processing, and the anonymized result is returned in the same format.

It currently implements the following anonymization techniques:

  • Data nulling: Replaces the original data with None or a custom constant value.

  • Character masking: Replaces some or all characters with a constant symbol.

  • Mocking: Generates realistic mock data for various fields, such as names, emails, and more.

  • Key hashing: Hashes data values using a cryptographic key, then encodes the result using Base64.

  • Swapping: Rearranges data by swapping values.

  • Binning: Groups numerical values into discrete bins and replaces individual values with their corresponding bin ranges.

  • Top/Bottom coding: Replaces values above or below certain thresholds with a capped value.

  • Perturbation: Slightly modifies the values according to the specified perturbation intensity and replacement strategy.
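
To make a few of the techniques above concrete, here is a minimal sketch written with plain pandas and the Python standard library. It does not use the aindo.anonymize API and is not how the library implements these techniques; the sample data, the thresholds, the bin edges, the key, and the choice of HMAC-SHA256 as the keyed hash are all assumptions made for illustration.

```python
import base64
import hashlib
import hmac

import pandas as pd

# Illustrative sample data (made up for this sketch).
df = pd.DataFrame(
    {
        "name": ["Alice", "Bob", "Carla"],
        "email": ["alice@example.com", "bob@example.com", "carla@example.com"],
        "age": [23, 47, 68],
        "income": [18_000, 52_000, 310_000],
    }
)

# Data nulling: replace every value in the column with None.
nulled = pd.Series([None] * len(df), index=df.index, name="name", dtype="object")

# Character masking: keep the first character, mask the rest with "*".
masked = df["name"].map(lambda s: s[0] + "*" * (len(s) - 1))

# Key hashing: keyed hash (HMAC-SHA256 is an assumption here), then Base64.
SECRET_KEY = b"illustrative-key"  # hypothetical key, for demonstration only


def key_hash(value: str) -> str:
    digest = hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).digest()
    return base64.b64encode(digest).decode("ascii")


hashed = df["email"].map(key_hash)

# Binning: replace each age with the range it falls into.
binned = pd.cut(df["age"], bins=[0, 30, 60, 120])

# Top/bottom coding: cap values that fall outside the chosen thresholds.
capped = df["income"].clip(lower=20_000, upper=100_000)
```

Each result is a pandas.Series with the same index as the input column, so it can be assigned back to the DataFrame or inspected on its own.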

The library provides two usage approaches, catering to different use cases and development preferences:

  • Explicit approach: anonymization techniques are instantiated directly through the library's Python classes (e.g., Binning for data binning).
    It works best when (1) only a few techniques need to be applied, (2) the anonymization methods are known in advance and can be hardcoded, and (3) you want to benefit from development environment features such as code completion and type checking.
    See a simple example here.

  • Declarative approach: multiple anonymization techniques are defined within a single configuration, which is then executed sequentially on the input data. The configuration can be instantiated from a Python dictionary using the Config.from_dict method, and can be read from files in YAML, TOML or JSON formats.
    It is the better fit when the anonymization workflow is dynamic and needs to change without modifying code.
    See a simple example here.
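
The difference between the two approaches can be sketched in plain pandas. The example below is not the aindo.anonymize API: the configuration layout, the TECHNIQUES registry, the run function, and the two step functions are hypothetical names invented for this illustration. It only shows the general pattern of a configuration whose steps are executed sequentially on the input data.

```python
import pandas as pd


# Hypothetical step functions, written with plain pandas.
def top_bottom_code(col: pd.Series, lower: float, upper: float) -> pd.Series:
    """Cap values below `lower` or above `upper`."""
    return col.clip(lower=lower, upper=upper)


def mask(col: pd.Series, symbol: str = "*") -> pd.Series:
    """Replace every character of every value with `symbol`."""
    return col.astype(str).map(lambda s: symbol * len(s))


# Hypothetical registry mapping technique names to functions.
TECHNIQUES = {"top_bottom_coding": top_bottom_code, "masking": mask}

# Hypothetical configuration layout; the same dictionary could be
# loaded from a YAML, TOML or JSON file instead of being written inline.
config = {
    "steps": [
        {
            "column": "income",
            "technique": "top_bottom_coding",
            "params": {"lower": 20_000, "upper": 100_000},
        },
        {"column": "name", "technique": "masking", "params": {"symbol": "#"}},
    ]
}


def run(config: dict, df: pd.DataFrame) -> pd.DataFrame:
    """Apply each configured step in order, leaving the input untouched."""
    out = df.copy()
    for step in config["steps"]:
        func = TECHNIQUES[step["technique"]]
        out[step["column"]] = func(out[step["column"]], **step["params"])
    return out


df = pd.DataFrame({"name": ["Alice", "Bob"], "income": [18_000, 310_000]})
result = run(config, df)
```

Calling top_bottom_code(df["income"], lower=20_000, upper=100_000) directly corresponds to the explicit style: the technique is fixed in code. Routing the same call through config corresponds to the declarative style: the steps can be changed by editing data rather than code.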