Techniques

aindo-anonymize techniques are classes that derive from BaseTechnique and define specific parameters and logic for anonymization.

BaseTechnique

Bases: ABC

Abstract base class for all anonymization techniques.

anonymize abstractmethod

anonymize(dataframe: DataFrame) -> DataFrame

Applies the anonymization technique to the given data.

Parameters:

Name Type Description Default
dataframe DataFrame

The input data to be anonymized.

required

Returns:

Type Description
DataFrame

The anonymized version of the input data.

Single-column techniques are anonymization methods designed to operate on individual data columns. These techniques are implemented as classes that derive from BaseSingleColumnTechnique.

BaseSingleColumnTechnique

Bases: BaseTechnique, ABC

Abstract base class for anonymization techniques applied to a single column.

Subclasses should implement the anonymize_column method, which defines the logic for anonymizing a single column.

anonymize_column

anonymize_column(col: Series) -> Series

Applies the anonymization technique to a single column.

Parameters:

Name Type Description Default
col Series

The input data to be anonymized.

required

Returns:

Type Description
Series

The anonymized version of the input data.

anonymize

anonymize(dataframe: DataFrame) -> DataFrame

Applies the anonymization technique to a single-column dataframe.

This is analogous to calling anonymize_column() on a single Pandas Series. It is a convenience method shared across all types of anonymizers.

Parameters:

Name Type Description Default
dataframe DataFrame

The input data. Must have exactly one column.

required

Returns:

Type Description
DataFrame

The anonymized version of the input data.

Techniques

DataNulling

DataNulling(constant_value: Any = None)

Bases: BaseSingleColumnTechnique

Implements data nulling.

Data nulling replaces the original data with a None value (or a custom constant value).

Attributes:

Name Type Description
constant_value Any

The value that will replace the original data. Defaults to None.

CharacterMasking

CharacterMasking(
    mask_length: int = 1,
    symbol: AnyStr = "*",
    starting_direction: StartingDirection = "left",
)

Bases: BaseSingleColumnTechnique, Generic[AnyStr]

Implements character masking.

Character masking involves replacing, usually partially, the characters of a data value with a constant symbol. Full masking is achieved by setting mask_length=-1.

Attributes:

Name Type Description
starting_direction StartingDirection

The direction in which masking starts. Defaults to "left".

mask_length int

The number of characters to mask. Set to -1 to mask the entire value. Defaults to 1.

symbol AnyStr

The symbol used for masking. Defaults to "*".
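
The masking behaviour described above can be illustrated on a single string with a minimal, standard-library sketch. The helper name mask_value is hypothetical and is not part of aindo-anonymize; treating a mask_length longer than the value as full masking is an assumption, since the documentation does not cover that case.

```python
def mask_value(
    value: str,
    mask_length: int = 1,
    symbol: str = "*",
    starting_direction: str = "left",
) -> str:
    """Illustrative only: mask `mask_length` characters of `value` with `symbol`."""
    if mask_length < -1:
        raise ValueError("mask_length must be -1 (full masking) or non-negative")
    if starting_direction not in ("left", "right"):
        raise ValueError("starting_direction must be 'left' or 'right'")
    # Full masking. Assumption: a mask longer than the value also masks everything.
    if mask_length == -1 or mask_length >= len(value):
        return symbol * len(value)
    if starting_direction == "left":
        return symbol * mask_length + value[mask_length:]
    # Masking starts from the right-hand end of the value.
    return value[: len(value) - mask_length] + symbol * mask_length


print(mask_value("4111111111111111", mask_length=12))  # ************1111
print(mask_value("secret", mask_length=2, symbol="#", starting_direction="right"))  # secr##
print(mask_value("secret", mask_length=-1))  # ******
```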

Mocking

Mocking(
    data_generator: MockingGeneratorMethods,
    seed: SeedType = None,
    faker_kwargs: dict[str, Any] | None = None,
    faker_generator_kwargs: dict[str, Any] | None = None,
)

Bases: BaseSingleColumnTechnique

Implements mocking.

Mocking generates realistic mock data for various fields such as names, addresses, emails, and more. It leverages the faker library to produce customizable, locale-aware fake data.

Attributes:

Name Type Description
data_generator MockingGeneratorMethods

Faker's generator method ("fake") used to generate data (e.g., name, email).

seed SeedType

A seed used to initialize the NumPy Generator.

Parameters:

Name Type Description Default
data_generator MockingGeneratorMethods

Faker's generator method ("fake") used to generate data (e.g., name, email).

required
faker_kwargs dict[str, Any] | None

Additional arguments passed to the main Faker object (proxy class).

None
faker_generator_kwargs dict[str, Any] | None

Additional arguments passed to the Faker's generator method.

None

KeyHashing

KeyHashing(
    key: str,
    salt: str | None = None,
    hash_name: str = "sha256",
)

Bases: BaseSingleColumnTechnique

Implements key-based hashing.

Data values are hashed using HMAC with a cryptographic key and the chosen hashing algorithm (defaults to SHA-256). The resulting hash is then encoded using Base64. The de-identified values always have a uniform length.

Attributes:

Name Type Description
key str

The cryptographic key used for hashing.

salt str | None

An optional salt that can be added to the value before hashing. Defaults to None.

hash_name str

The hashing algorithm to use, compatible with hashlib.new(). Defaults to "sha256".
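
The HMAC-then-Base64 scheme described above can be sketched with Python's standard library. The helper name key_hash is hypothetical, and prepending the salt to the value is an assumption: the documentation only says the salt "can be added to the value before hashing", so the library may combine them differently.

```python
import base64
import hmac
from typing import Optional


def key_hash(
    value: str,
    key: str,
    salt: Optional[str] = None,
    hash_name: str = "sha256",
) -> str:
    """Illustrative only: HMAC the value with `key`, then Base64-encode the digest."""
    # Assumption: the salt is prepended to the value before hashing.
    message = value if salt is None else salt + value
    digest = hmac.new(key.encode("utf-8"), message.encode("utf-8"), hash_name).digest()
    return base64.b64encode(digest).decode("ascii")


short = key_hash("alice", key="my-secret-key")
long = key_hash("a much longer identifier value", key="my-secret-key")
# A SHA-256 digest is 32 bytes, so its Base64 encoding is always 44 characters.
print(len(short), len(long))  # 44 44
```

Because the digest size depends only on the hashing algorithm, every output has the same length regardless of the input, which matches the uniform-length property noted above.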

Swapping

Swapping(alpha: float, **kwargs: SeedT)

Bases: BaseSingleColumnTechnique, Seeder, AlphaProbability

Implements swapping.

Swapping rearranges data by shuffling values, ensuring that individual values remain present but are generally not in their original position. The process is controlled by the alpha parameter, representing the probability of a row being swapped with another.

Attributes:

Name Type Description
alpha float

The perturbation intensity, a value in the range [0, 1].
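
One possible reading of the swapping procedure is sketched below using only the standard library. The helper name swap_values is hypothetical, and the selection rule (each row is picked independently with probability alpha, then the picked values are shuffled among themselves) is an assumption rather than the library's confirmed algorithm.

```python
import random
from typing import Any, List, Optional, Sequence


def swap_values(values: Sequence[Any], alpha: float, seed: Optional[int] = None) -> List[Any]:
    """Illustrative only: shuffle a randomly selected subset of positions."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in the range [0, 1]")
    rng = random.Random(seed)
    # Assumption: each row is selected for swapping with probability alpha.
    selected = [i for i in range(len(values)) if rng.random() < alpha]
    shuffled = [values[i] for i in selected]
    rng.shuffle(shuffled)
    result = list(values)
    # Write the shuffled values back into the selected positions only.
    for position, value in zip(selected, shuffled):
        result[position] = value
    return result


original = ["a", "b", "c", "d", "e", "f"]
swapped = swap_values(original, alpha=0.5, seed=0)
# Every original value is still present; only positions may differ.
print(sorted(swapped) == sorted(original))  # True
```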

Binning

Binning(bins: int | Sequence[int] | Sequence[float])

Bases: BaseSingleColumnTechnique

Implements binning for numerical columns.

Binning works by grouping numerical values into discrete bins, allowing for data generalization by replacing individual values with their corresponding bin ranges.

Attributes:

Name Type Description
bins int | Sequence[int] | Sequence[float]

The bin edges or number of bins to use.

Examples:

An integer value for bins forms equal-width bins.

>>> ages = pd.Series([10, 15, 13, 12, 23, 25, 28, 59, 60])
>>> Binning(bins=3).anonymize_column(ages)
[(9.95, 26.667], (9.95, 26.667], (9.95, 26.667], ...
Categories (3, interval[float64, right]): [(9.95, 26.667] < (26.667, 43.333] < (43.333, 60.0]]

A list of ordered bin edges assigns each value to the interval it falls in.

>>> ages = pd.Series([10, 15, 13, 12, 23, 25, 28, 59, 60])
>>> Binning(bins=[0, 18, 35, 70]).anonymize_column(ages)
[(0, 18], (0, 18], (0, 18], (0, 18], (18, 35], ...
Categories (3, interval[int64, right]): [(0, 18] < (18, 35] < (35, 70]]

PerturbationNumerical

PerturbationNumerical(
    alpha: float,
    sampling_mode: SamplingMode = "uniform",
    perturbation_range: tuple[NumericsT, NumericsT]
    | None = None,
    **kwargs: SeedT,
)

Bases: BasePerturbation, Generic[NumericsT]

Implements perturbation for numerical columns.

Perturbation consists of modifying each value based on the specified perturbation intensity (alpha) and replacement strategy. It supports two modes of replacement: uniform sampling and distribution-preserving sampling.

Attributes:

Name Type Description
alpha float

The perturbation intensity, a value in the range [0, 1].
- alpha=0: No perturbation; values remain unchanged.
- alpha=1: Maximum perturbation; values are fully replaced according to the specified sampling mode.

sampling_mode SamplingMode

The strategy used to sample replacement values:
- uniform: Values are perturbed with random values uniformly sampled from the range [min, max].
- weighted: Values are perturbed in a way that preserves the original distribution.

perturbation_range tuple[NumericsT, NumericsT] | None

A tuple[min, max] within which random values are sampled. If not set, the range is automatically computed as the minimum and maximum of the input data.

PerturbationCategorical

PerturbationCategorical(
    alpha: float,
    sampling_mode: SamplingMode = "uniform",
    frequencies: dict[str, float] | None = None,
    **kwargs: SeedT,
)

Bases: BasePerturbation

Implements perturbation for categorical columns.

Perturbation consists of replacing values with randomized alternatives based on the specified sampling mode and perturbation intensity (alpha). It supports two modes of replacement: uniform sampling and distribution-preserving sampling.

Attributes:

Name Type Description
alpha float

The perturbation intensity, a value in the range [0, 1].
- alpha=0: No perturbation; values remain unchanged.
- alpha=1: Maximum perturbation; values are fully replaced according to the specified sampling mode.

sampling_mode SamplingMode

The strategy used to sample replacement values:
- uniform: Replaces values with others chosen uniformly at random.
- weighted: Replaces values based on their original distribution.

frequencies dict[str, float] | None

Optional mapping of unique values to their relative frequencies, used for weighted sampling mode. Automatically computed if not provided.
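
The two sampling modes can be compared in a small standard-library sketch. The helper name perturb_categories is hypothetical, and treating alpha as a per-value replacement probability is an assumption. It is consistent with the documented endpoints (alpha=0 leaves values unchanged, alpha=1 replaces all of them), but the library's exact behaviour between those endpoints is not stated here.

```python
import random
from collections import Counter
from typing import Dict, List, Optional, Sequence


def perturb_categories(
    values: Sequence[str],
    alpha: float,
    sampling_mode: str = "uniform",
    frequencies: Optional[Dict[str, float]] = None,
    seed: Optional[int] = None,
) -> List[str]:
    """Illustrative only: replace each value with probability alpha."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in the range [0, 1]")
    rng = random.Random(seed)
    if frequencies is None:
        # Relative frequencies computed from the input when not provided.
        counts = Counter(values)
        frequencies = {cat: n / len(values) for cat, n in counts.items()}
    categories = list(frequencies)
    if sampling_mode == "uniform":
        weights = None  # every category is equally likely
    elif sampling_mode == "weighted":
        weights = [frequencies[cat] for cat in categories]
    else:
        raise ValueError("sampling_mode must be 'uniform' or 'weighted'")
    # Assumption: alpha is the probability that a given value is replaced.
    return [
        rng.choices(categories, weights=weights, k=1)[0] if rng.random() < alpha else value
        for value in values
    ]


colors = ["red"] * 8 + ["blue"] * 2
print(perturb_categories(colors, alpha=0.0) == colors)  # True
perturbed = perturb_categories(colors, alpha=1.0, sampling_mode="weighted", seed=0)
print(set(perturbed) <= {"red", "blue"})  # True
```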

TopBottomCodingNumerical

TopBottomCodingNumerical(
    q: float | None = None,
    lower_value: float | None = None,
    upper_value: float | None = None,
)

Bases: BaseSingleColumnTechnique

Implements top/bottom coding for numerical columns.

This technique caps values above the (1 - q/2) quantile (top coding) and raises values below the (q/2) quantile (bottom coding). The threshold parameter q specifies the total proportion of extreme values to code (e.g., q=0.1 codes the top 5% and the bottom 5% of values).

Provide either the threshold q or the pair of quantile values (lower_value and upper_value), but not both. If lower_value and upper_value are used, they must be specified together.

Attributes:

Name Type Description
q float | None

Proportion controlling the extent of top/bottom coding, between 0 and 1.

lower_value float | None

Input data quantile value at q/2.

upper_value float | None

Input data quantile value at (1 - q/2).
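
As a rough illustration of the clipping logic, the sketch below computes the q/2 and (1 - q/2) quantiles and clamps values to them using only the standard library. The helper names quantile and top_bottom_code are hypothetical, and the linear-interpolation quantile is an assumption; the library may estimate quantiles differently.

```python
from typing import List, Optional, Sequence


def quantile(sorted_values: Sequence[float], p: float) -> float:
    """Linear-interpolation quantile of an already sorted, non-empty sequence."""
    position = p * (len(sorted_values) - 1)
    low = int(position)
    high = min(low + 1, len(sorted_values) - 1)
    fraction = position - low
    return sorted_values[low] + fraction * (sorted_values[high] - sorted_values[low])


def top_bottom_code(
    values: Sequence[float],
    q: Optional[float] = None,
    lower_value: Optional[float] = None,
    upper_value: Optional[float] = None,
) -> List[float]:
    """Illustrative only: clamp values to the [lower_value, upper_value] range."""
    if q is not None:
        if lower_value is not None or upper_value is not None:
            raise ValueError("provide either q or lower_value/upper_value, not both")
        ordered = sorted(values)
        lower_value = quantile(ordered, q / 2)
        upper_value = quantile(ordered, 1 - q / 2)
    elif lower_value is None or upper_value is None:
        raise ValueError("lower_value and upper_value must be specified together")
    # Raise values below the lower bound and cap values above the upper bound.
    return [min(max(v, lower_value), upper_value) for v in values]


print(top_bottom_code([1, 50, 100], lower_value=10, upper_value=90))  # [10, 50, 90]
```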

TopBottomCodingCategorical

TopBottomCodingCategorical(
    q: float | None = None,
    other_label: Any = "OTHER",
    rare_categories: list[Any] | None = None,
)

Bases: BaseSingleColumnTechnique

Implements top/bottom coding for categorical columns.

Categories that account for a proportion of the total data less than or equal to q are replaced with the other_label (e.g., q=0.01 corresponds to 1%).

Attributes:

Name Type Description
q float | None

A proportion controlling the extent of top/bottom coding, between 0 and 1.

other_label Any

The new category to replace rare categories with. Defaults to "OTHER".

rare_categories list[Any] | None

A list of rare categories to be replaced. This can be used instead of the q parameter to explicitly specify which categories should be replaced with other_label.
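
The rare-category replacement can be sketched with collections.Counter. The helper name code_rare_categories is hypothetical, and letting an explicit rare_categories list take precedence over q is an assumption made for this sketch only.

```python
from collections import Counter
from typing import Any, List, Optional, Sequence


def code_rare_categories(
    values: Sequence[Any],
    q: Optional[float] = None,
    other_label: Any = "OTHER",
    rare_categories: Optional[List[Any]] = None,
) -> List[Any]:
    """Illustrative only: replace rare categories with `other_label`."""
    if rare_categories is not None:
        # Assumption: an explicit list is used as-is and q is ignored.
        rare = set(rare_categories)
    elif q is not None:
        counts = Counter(values)
        total = len(values)
        # A category is rare when its relative frequency is at most q.
        rare = {cat for cat, n in counts.items() if n / total <= q}
    else:
        raise ValueError("either q or rare_categories must be provided")
    return [other_label if value in rare else value for value in values]


jobs = ["nurse"] * 7 + ["teacher"] * 2 + ["astronaut"]
coded = code_rare_categories(jobs, q=0.15)
print(coded[-1])  # OTHER
print(coded.count("teacher"))  # 2
```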

Types

StartingDirection module-attribute

StartingDirection = Literal['left', 'right']

SeedT module-attribute

SeedT = int | Generator | None

SamplingMode module-attribute

SamplingMode = Literal['uniform', 'weighted']

NumericsT module-attribute

NumericsT = TypeVar('NumericsT', int, float)

MockingGeneratorMethods module-attribute

MockingGeneratorMethods = Literal[
    "aba",
    "address",
    "administrative_unit",
    "am_pm",
    "android_platform_token",
    "ascii_company_email",
    "ascii_email",
    "ascii_free_email",
    "ascii_safe_email",
    "bank_country",
    "bban",
    "binary",
    "boolean",
    "bothify",
    "bs",
    "building_number",
    "catch_phrase",
    "century",
    "chrome",
    "city",
    "city_prefix",
    "city_suffix",
    "color",
    "color_name",
    "company",
    "company_email",
    "company_suffix",
    "coordinate",
    "country",
    "country_calling_code",
    "country_code",
    "credit_card_expire",
    "credit_card_full",
    "credit_card_number",
    "credit_card_provider",
    "credit_card_security_code",
    "cryptocurrency",
    "cryptocurrency_code",
    "cryptocurrency_name",
    "csv",
    "currency",
    "currency_code",
    "currency_name",
    "currency_symbol",
    "current_country",
    "current_country_code",
    "date",
    "date_between",
    "date_between_dates",
    "date_object",
    "date_of_birth",
    "date_this_century",
    "date_this_decade",
    "date_this_month",
    "date_this_year",
    "date_time",
    "date_time_ad",
    "date_time_between",
    "date_time_between_dates",
    "date_time_this_century",
    "date_time_this_decade",
    "date_time_this_month",
    "date_time_this_year",
    "day_of_month",
    "day_of_week",
    "dga",
    "domain_name",
    "domain_word",
    "dsv",
    "ean",
    "ean13",
    "ean8",
    "ein",
    "email",
    "emoji",
    "enum",
    "file_extension",
    "file_name",
    "file_path",
    "firefox",
    "first_name",
    "first_name_female",
    "first_name_male",
    "first_name_nonbinary",
    "fixed_width",
    "free_email",
    "free_email_domain",
    "future_date",
    "future_datetime",
    "hex_color",
    "hexify",
    "hostname",
    "http_method",
    "iana_id",
    "iban",
    "image",
    "image_url",
    "internet_explorer",
    "invalid_ssn",
    "ios_platform_token",
    "ipv4",
    "ipv4_network_class",
    "ipv4_private",
    "ipv4_public",
    "ipv6",
    "isbn10",
    "isbn13",
    "iso8601",
    "itin",
    "job",
    "json",
    "json_bytes",
    "language_code",
    "language_name",
    "last_name",
    "last_name_female",
    "last_name_male",
    "last_name_nonbinary",
    "latitude",
    "latlng",
    "lexify",
    "license_plate",
    "linux_platform_token",
    "linux_processor",
    "local_latlng",
    "locale",
    "localized_ean",
    "localized_ean13",
    "localized_ean8",
    "location_on_land",
    "longitude",
    "mac_address",
    "mac_platform_token",
    "mac_processor",
    "md5",
    "military_apo",
    "military_dpo",
    "military_ship",
    "military_state",
    "mime_type",
    "month",
    "month_name",
    "msisdn",
    "name",
    "name_female",
    "name_male",
    "name_nonbinary",
    "nic_handle",
    "nic_handles",
    "null_boolean",
    "numerify",
    "opera",
    "paragraph",
    "paragraphs",
    "password",
    "past_date",
    "past_datetime",
    "phone_number",
    "port_number",
    "postalcode",
    "postalcode_in_state",
    "postalcode_plus4",
    "postcode",
    "postcode_in_state",
    "prefix",
    "prefix_female",
    "prefix_male",
    "prefix_nonbinary",
    "pricetag",
    "profile",
    "providers",
    "psv",
    "pybool",
    "pydecimal",
    "pydict",
    "pyfloat",
    "pyint",
    "pyiterable",
    "pylist",
    "pyset",
    "pystr",
    "pystr_format",
    "pystruct",
    "pytimezone",
    "pytuple",
    "random_choices",
    "random_digit",
    "random_digit_not_null",
    "random_digit_not_null_or_empty",
    "random_digit_or_empty",
    "random_element",
    "random_elements",
    "random_int",
    "random_letter",
    "random_letters",
    "random_lowercase_letter",
    "random_number",
    "random_sample",
    "random_uppercase_letter",
    "randomize_nb_elements",
    "rgb_color",
    "rgb_css_color",
    "ripe_id",
    "safari",
    "safe_color_name",
    "safe_domain_name",
    "safe_email",
    "safe_hex_color",
    "secondary_address",
    "sentence",
    "sentences",
    "sha1",
    "sha256",
    "simple_profile",
    "slug",
    "ssn",
    "state",
    "state_abbr",
    "street_address",
    "street_name",
    "street_suffix",
    "suffix",
    "suffix_female",
    "suffix_male",
    "suffix_nonbinary",
    "swift",
    "swift11",
    "swift8",
    "tar",
    "text",
    "texts",
    "time",
    "time_delta",
    "time_object",
    "time_series",
    "timezone",
    "tld",
    "tsv",
    "unix_device",
    "unix_partition",
    "unix_time",
    "upc_a",
    "upc_e",
    "uri",
    "uri_extension",
    "uri_page",
    "uri_path",
    "url",
    "user_agent",
    "user_name",
    "uuid4",
    "windows_platform_token",
    "word",
    "words",
    "year",
    "zip",
    "zipcode",
    "zipcode_in_state",
    "zipcode_plus4",
]