Quick start

The following examples demonstrate how to anonymize specific columns of the UCI Adult dataset using the explicit and the declarative approaches.

The first step is to load the CSV file into a pandas.DataFrame:

Load UCI Adult dataset

import pandas as pd

dtypes = {
    "age": "int",
    "workclass": "category",
    "fnlwgt": "int",
    "education": "category",
    "education-num": "category",
    "marital-status": "category",
    "occupation": "category",
    "relationship": "category",
    "race": "category",
    "sex": "category",
    "capital-gain": "int",
    "capital-loss": "int",
    "hours-per-week": "int",
    "native-country": "category",
    "y": "category",
}
df = pd.read_csv("adult.data", names=list(dtypes.keys()), dtype=dtypes)

Explicit approach

In the explicit approach, the user directly instantiates techniques and applies them to specific columns.

from aindo.anonymize.techniques import Binning, TopBottomCodingCategorical # (1)!

# Replace education categories that account for 1% or less
# of the values with "OTHER"
anonymizer = TopBottomCodingCategorical(q=0.01, other_label="OTHER") # (2)!
print(anonymizer.anonymize_column(df.education)) # (3)!

# Group age values into discrete bins and replace each age
# with its corresponding bin range
anonymizer = Binning(bins=[17, 20, 30, 50, 70, 90])
print(anonymizer.anonymize_column(df.age))

(1) All anonymization techniques are imported from aindo.anonymize.techniques.
(2) Create an instance of the Top-Bottom Coding technique for categorical data.
(3) anonymizer.anonymize_column() applies the anonymization technique to a single column (provided as a pandas Series) and always returns a copy.
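
To make the effect of Top-Bottom Coding on categorical data concrete, here is a minimal plain-Python sketch of rare-category pooling. It is not the library's implementation: the helper name pool_rare_categories is hypothetical, and the rule (a category is replaced when its share of the values is less than or equal to q) is an assumption based on the comment in the example above.

```python
from collections import Counter


def pool_rare_categories(values, q, other_label="OTHER"):
    """Replace categories whose relative frequency is <= q with other_label.

    Hypothetical helper for illustration only; the real technique may
    handle ties, missing values, and thresholds differently.
    """
    counts = Counter(values)
    total = len(values)
    rare = {cat for cat, n in counts.items() if n / total <= q}
    # Build a new list so the input is left untouched
    return [other_label if v in rare else v for v in values]


education = ["HS-grad"] * 6 + ["Bachelors"] * 3 + ["Preschool"]
pooled = pool_rare_categories(education, q=0.1)
print(Counter(pooled))
```

Pooling rare values this way reduces the risk that an unusual category singles out an individual, at the cost of losing detail in the tail of the distribution.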

For a complete list of available techniques and their parameters, please refer to API reference - Techniques.
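
As a rough illustration of what binning does, the snippet below maps each value to the label of the interval that contains it, using the same edges as the Binning example above. The function bin_label is hypothetical, and the right-closed intervals (lo, hi] are an assumption borrowed from the default behaviour of pandas.cut; the library may use a different convention or label format.

```python
from bisect import bisect_left


def bin_label(value, edges):
    """Return the label of the right-closed interval (lo, hi] containing value.

    Returns None when the value falls outside all intervals. Illustrative only.
    """
    # Index of the first edge that is >= value
    i = bisect_left(edges, value)
    if i == 0 or i == len(edges):
        return None
    return f"({edges[i - 1]}, {edges[i]}]"


edges = [17, 20, 30, 50, 70, 90]
ages = [18, 25, 30, 64, 90]
print([bin_label(a, edges) for a in ages])
```

Under this assumed convention, a value equal to the lowest edge (17 here) falls outside every bin, so it is worth checking the edges against the actual range of the data.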

Declarative approach

In the declarative approach, you first create a YAML configuration file that lists the operations to perform, in the desired order. Together, these steps define a pipeline.

config.yml

# aindo-anonymize table anonymization configuration
steps:
  - method:
      type: top_bottom_coding_categorical # specifies the technique to be configured
      q: 0.01 # specific parameter for the Top-Bottom Coding technique
    columns: [education] # list of column names to which the technique will be applied
  - method:
      type: binning
      bins: [17, 20, 30, 50, 70, 90]
    columns: [age]

The configuration file can then be loaded and executed as a pipeline.

import yaml
from aindo.anonymize import AnonymizationPipeline, Config

# yaml.safe_load() parses YAML content, not a file path, so open the file first
with open("config.yml") as f:
    config = Config.from_dict(yaml.safe_load(f))
pipeline = AnonymizationPipeline(config=config)
print(pipeline.run(df))

Full documentation is available at API reference - Pipeline.
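
The idea behind a configuration-driven pipeline, where an ordered list of steps is applied to named columns, can be sketched in plain Python. Everything here is hypothetical (the REGISTRY mapping, run_steps, and the toy mask and clip_max techniques) and stands in for the library's Config and AnonymizationPipeline only conceptually; the dictionary mirrors the shape of config.yml above.

```python
# Toy techniques: each takes a list of values plus parameters and returns a new list
def mask(values, char="*"):
    return [char * len(str(v)) for v in values]


def clip_max(values, upper):
    return [min(v, upper) for v in values]


# Hypothetical registry mapping a step's "type" to the function implementing it
REGISTRY = {"mask": mask, "clip_max": clip_max}


def run_steps(table, config):
    """Apply each configured step, in order, to the listed columns of a copy."""
    result = {name: list(col) for name, col in table.items()}
    for step in config["steps"]:
        params = dict(step["method"])
        technique = REGISTRY[params.pop("type")]
        for column in step["columns"]:
            result[column] = technique(result[column], **params)
    return result


config = {
    "steps": [
        {"method": {"type": "clip_max", "upper": 90}, "columns": ["age"]},
        {"method": {"type": "mask"}, "columns": ["name"]},
    ]
}
table = {"name": ["Ann", "Bob"], "age": [34, 97]}
print(run_steps(table, config))
```

Keeping the step list as data rather than code means the same pipeline definition can be versioned, reviewed, and reused across datasets without touching the Python that runs it.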