Techniques

aindo-anonymize techniques are classes that derive from BaseTechnique and define specific parameters and logic for anonymization.

BaseTechnique

Bases: ABC

Abstract base class for all anonymization techniques.

anonymize `abstractmethod`

anonymize(dataframe: DataFrame) -> DataFrame

Applies the anonymization technique to the given data.

Parameters:

Name	Type	Description	Default
`dataframe`	`DataFrame`	The input data to be anonymized.	required

Returns:

Type	Description
`DataFrame`	The anonymized version of the input data.

Single-column techniques are anonymization methods designed to operate on individual data columns. These techniques are implemented as classes that derive from BaseSingleColumnTechnique.

BaseSingleColumnTechnique

Bases: BaseTechnique, ABC

Abstract base class for anonymization techniques applied to a single column.

Subclasses should implement the anonymize_column method, which defines the logic for anonymizing a single column.

anonymize_column

anonymize_column(col: Series) -> Series

Applies the anonymization technique to a single column.

Parameters:

Name	Type	Description	Default
`col`	`Series`	The input data to be anonymized.	required

Returns:

Type	Description
`Series`	The anonymized version of the input data.

anonymize

anonymize(dataframe: DataFrame) -> DataFrame

Applies the anonymization technique to a single-column dataframe.

This is analogous to calling anonymize_column() on a single Pandas Series. It is a convenience method shared across all types of anonymizers.

Parameters:

Name	Type	Description	Default
`dataframe`	`DataFrame`	The input data. Must have exactly one column.	required

Returns:

Type	Description
`DataFrame`	The anonymized version of the input data.

Techniques

Identity

Identity()

Bases: BaseTechnique

Identity technique.

Leaves the original data untouched. This special technique is particularly useful in a declarative approach (see documentation).

anonymize

anonymize(dataframe: DataFrame) -> DataFrame

Anonymize the input data using the identity technique.

Parameters:

Name	Type	Description	Default
`dataframe`	`DataFrame`	The input data to be anonymized.	required

Returns:

Type	Description
`DataFrame`	The anonymized version of the input data.

DataNulling

DataNulling(constant_value: Any = None)

Bases: BaseSingleColumnTechnique

Implements data nulling.

Data nulling replaces the original data with a constant value. Missing values (np.NaN, None, pd.NA, pd.NaT) are also replaced.

Attributes:

Name	Type	Description
`constant_value`	`Any`	The value that will replace the original data. Defaults to None.

CharacterMasking

CharacterMasking(
    mask_length: int = 1,
    symbol: AnyStr = "*",
    starting_direction: StartingDirection = "left",
)

Bases: BaseSingleColumnTechnique, Generic[AnyStr]

Implements character masking.

Character masking involves replacing, usually partially, the characters of a data value with a constant symbol. Full masking is achieved by setting mask_length=-1.

Attributes:

Name	Type	Description
`starting_direction`	`StartingDirection`	The direction in which masking starts. Defaults to "left".
`mask_length`	`int`	The number of characters to mask. Set to -1 to mask the entire value. Defaults to 1.
`symbol`	`AnyStr`	The symbol used for masking. Defaults to "*".
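The masking rule described above can be sketched for a single string in plain Python. This is an illustration only, not the library's implementation: the function name `mask_characters` is hypothetical, and the choice to fully mask values shorter than `mask_length` is an assumption.

```python
def mask_characters(
    value: str,
    mask_length: int = 1,
    symbol: str = "*",
    starting_direction: str = "left",
) -> str:
    """Replace characters of `value` with `symbol`, starting from one end.

    Hypothetical helper; it mirrors the documented parameters only.
    """
    if mask_length < -1:
        raise ValueError("mask_length must be -1 or a non-negative integer")
    # Assumption: a mask longer than the value masks the whole value.
    if mask_length == -1 or mask_length >= len(value):
        return symbol * len(value)
    if starting_direction == "left":
        return symbol * mask_length + value[mask_length:]
    if starting_direction == "right":
        return value[: len(value) - mask_length] + symbol * mask_length
    raise ValueError("starting_direction must be 'left' or 'right'")


print(mask_characters("4111111111111111", mask_length=12))  # ************1111
print(mask_characters("secret", 2, "#", "right"))           # secr##
print(mask_characters("secret", mask_length=-1))            # ******
```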

Mocking

Mocking(
    data_generator: MockingGeneratorMethods,
    seed: SeedType = None,
    faker_kwargs: dict[str, Any] | None = None,
    faker_generator_kwargs: dict[str, Any] | None = None,
)

Bases: BaseSingleColumnTechnique

Implements mocking.

Mocking generates realistic mock data for various fields such as names, addresses, emails, and more. It leverages the faker library to produce customizable, locale-aware fake data.

Missing values (np.NaN, None, pd.NA, pd.NaT) are replaced.

Attributes:

Name	Type	Description
`data_generator`	`MockingGeneratorMethods`	Faker's generator method ("fake") used to generate data (e.g., name, email).
`seed`	`SeedType`	A seed to initialize numpy `Generator`.

Initializes the Mocking technique.

Parameters:

Name	Type	Description	Default
`data_generator`	`MockingGeneratorMethods`	Faker's generator method ("fake") used to generate data (e.g., name, email).	required
`seed`	`SeedType`	A seed to initialize numpy `Generator`.	`None`
`faker_kwargs`	`dict[str, Any] \| None`	Additional arguments passed to the main Faker object (proxy class).	`None`
`faker_generator_kwargs`	`dict[str, Any] \| None`	Additional arguments passed to the Faker's generator method.	`None`

KeyHashing

KeyHashing(
    key: str,
    salt: str | None = None,
    hash_name: str = "sha256",
)

Bases: BaseSingleColumnTechnique

Implements key-based hashing.

Data values are hashed using HMAC with a cryptographic key and the chosen hashing algorithm (defaults to SHA-256). The resulting hash is then encoded using Base64. The de-identified values always have a uniform length.

Attributes:

Name	Type	Description
`key`	`str`	The cryptographic key used for hashing.
`salt`	`str \| None`	An optional salt that can be added to the value before hashing. Defaults to None.
`hash_name`	`str`	The hashing algorithm to use, compatible with `hashlib.new()`. Defaults to "sha256".

generate_salt `classmethod`

generate_salt() -> str

Generates a random salt.

Returns:

Name	Type	Description
`str`	`str`	A random salt.
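The HMAC-then-Base64 scheme described for KeyHashing can be sketched with the standard library alone. The function name `keyed_hash` is hypothetical, and the way the salt is combined with the value (simple prefixing here) is an assumption; the source only says the salt "can be added to the value before hashing".

```python
from __future__ import annotations

import base64
import hashlib
import hmac


def keyed_hash(
    value: str,
    key: str,
    salt: str | None = None,
    hash_name: str = "sha256",
) -> str:
    """Return the Base64-encoded HMAC of `value` under `key`."""
    # Fail early if the algorithm name is unknown to hashlib.
    hashlib.new(hash_name)
    # Assumption: the salt is prepended to the value before hashing.
    message = (salt or "") + value
    digest = hmac.new(key.encode("utf-8"), message.encode("utf-8"), hash_name).digest()
    return base64.b64encode(digest).decode("ascii")


short = keyed_hash("alice", key="my-secret-key")
long_ = keyed_hash("a much longer identifier than the first one", key="my-secret-key")
# With SHA-256 the digest is 32 bytes, so every output is 44 Base64 characters.
print(len(short), len(long_))
```

Because the output length depends only on the digest size, inputs of any length map to values of the same length, which is the uniform-length property noted above.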

Swapping

Swapping(alpha: float, **kwargs: SeedT)

Bases: BaseSingleColumnTechnique, Seeder, AlphaProbability

Implements swapping.

Swapping rearranges data by shuffling values, ensuring that individual values remain present but are generally not in their original position. The process is controlled by the alpha parameter, representing the probability of a row being swapped with another.

Attributes:

Name	Type	Description
`alpha`	`float`	The perturbation intensity, a value in the range [0, 1].
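One way to read the swapping rule is sketched below in plain Python: each position is selected with probability `alpha`, and the selected values are shuffled among themselves. The function name `swap_values` and this exact selection scheme are assumptions for illustration; the library's internal procedure may differ.

```python
from __future__ import annotations

import random


def swap_values(values: list, alpha: float, seed: int | None = None) -> list:
    """Shuffle a randomly chosen subset of positions, keeping all values present."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in the range [0, 1]")
    rng = random.Random(seed)
    # Select each position independently with probability alpha.
    picked = [i for i in range(len(values)) if rng.random() < alpha]
    # Permute the selected positions; unselected positions stay untouched.
    targets = picked[:]
    rng.shuffle(targets)
    result = list(values)
    for src, dst in zip(picked, targets):
        result[dst] = values[src]
    return result


ages = [23, 35, 41, 52, 67, 29]
swapped = swap_values(ages, alpha=0.5, seed=42)
# The multiset of values is preserved; only positions may change.
print(sorted(swapped) == sorted(ages))  # True
```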

Binning

Binning(bins: int | Sequence[int] | Sequence[float])

Bases: BaseSingleColumnTechnique

Implements binning for numerical columns.

Binning works by grouping numerical values into discrete bins, allowing for data generalization by replacing individual values with their corresponding bin ranges.

Attributes:

Name	Type	Description
`bins`	`int \| Sequence[int] \| Sequence[float]`	The bin edges or number of bins to use.

Examples:

An integer value for bins forms that many equal-width bins.

>>> ages = pd.Series([10, 15, 13, 12, 23, 25, 28, 59, 60])
>>> Binning(bins=3).anonymize_column(ages)
[(9.95, 26.667], (9.95, 26.667], (9.95, 26.667], ...
Categories (3, interval[float64, right]): [(9.95, 26.667] < (26.667, 43.333] < (43.333, 60.0]]

A list of ordered bin edges assigns each value to the interval that contains it.

>>> ages = pd.Series([10, 15, 13, 12, 23, 25, 28, 59, 60])
>>> Binning(bins=[0, 18, 35, 70]).anonymize_column(ages)
[(0, 18], (0, 18], (0, 18], (0, 18], (18, 35], ...
Categories (3, interval[int64, right]): [(0, 18] < (18, 35] < (35, 70]]

PerturbationNumerical

PerturbationNumerical(
    alpha: float,
    sampling_mode: SamplingMode = "uniform",
    perturbation_range: tuple[NumericsT, NumericsT] | None = None,
    **kwargs: SeedT,
)

Bases: BasePerturbation, Generic[NumericsT]

Implements perturbation for numerical columns.

Perturbation consists of modifying each value based on the specified perturbation intensity (alpha) and replacement strategy. It supports two modes of replacement: uniform sampling and distribution-preserving sampling.

Attributes:

Name	Type	Description
`alpha`	`float`	The perturbation intensity, a value in the range [0, 1]. - `alpha=0`: No perturbation; values remain unchanged. - `alpha=1`: Maximum perturbation; values are fully replaced according to the specified sampling mode.
`sampling_mode`	`SamplingMode`	The strategy used to sample replacement values: - `uniform`: Values are perturbed with random values uniformly sampled from the range [min, max]. - `weighted`: Values are perturbed in a way to keep the original distribution.
`perturbation_range`	`tuple[NumericsT, NumericsT] \| None`	A tuple[min, max] within which random values are sampled. If not set, the range is automatically computed as the minimum and maximum of the input data.

PerturbationCategorical

PerturbationCategorical(
    alpha: float,
    sampling_mode: SamplingMode = "uniform",
    frequencies: dict[str, float] | None = None,
    **kwargs: SeedT,
)

Bases: BasePerturbation

Implements perturbation for categorical columns.

Perturbation consists of replacing values with randomized alternatives based on the specified sampling mode and perturbation intensity (alpha). It supports two modes of replacement: uniform sampling and distribution-preserving sampling.

Attributes:

Name	Type	Description
`alpha`	`float`	The perturbation intensity, a value in the range [0, 1]. - `alpha=0`: No perturbation; values remain unchanged. - `alpha=1`: Maximum perturbation; values are fully replaced according to the specified sampling mode.
`sampling_mode`	`SamplingMode`	The strategy used to sample replacement values: - `uniform`: Replaces values with others chosen uniformly at random. - `weighted`: Replaces values based on their original distribution.
`frequencies`	`dict[str, float] \| None`	Optional mapping of unique values to their relative frequencies, used for weighted sampling mode. Automatically computed if not provided.
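The two sampling modes can be illustrated with a plain-Python sketch. It assumes that `alpha` acts as a per-value replacement probability, which is one plausible reading of "perturbation intensity" and not confirmed by the source; the function name `perturb_categories` is hypothetical.

```python
from __future__ import annotations

import random
from collections import Counter


def perturb_categories(
    values: list,
    alpha: float,
    sampling_mode: str = "uniform",
    frequencies: dict | None = None,
    seed: int | None = None,
) -> list:
    """Replace each value, with probability alpha, by a sampled category."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in the range [0, 1]")
    rng = random.Random(seed)
    if frequencies is None:
        # Relative frequencies are computed from the data when not provided.
        counts = Counter(values)
        frequencies = {cat: n / len(values) for cat, n in counts.items()}
    categories = list(frequencies)
    if sampling_mode == "uniform":
        weights = None  # every category is equally likely
    elif sampling_mode == "weighted":
        weights = [frequencies[cat] for cat in categories]
    else:
        raise ValueError("sampling_mode must be 'uniform' or 'weighted'")
    return [
        rng.choices(categories, weights=weights, k=1)[0] if rng.random() < alpha else v
        for v in values
    ]


colors = ["red", "red", "blue", "green"]
print(perturb_categories(colors, alpha=0.0))  # unchanged: ['red', 'red', 'blue', 'green']
```

Under this reading, `weighted` mode tends to keep the overall category distribution close to the original, while `uniform` mode flattens it as alpha grows.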

TopBottomCodingNumerical

TopBottomCodingNumerical(
    q: float | None = None,
    lower_value: float | None = None,
    upper_value: float | None = None,
)

Bases: BaseSingleColumnTechnique

Implements top/bottom coding for numerical columns.

This technique caps values above the (1 - q/2) quantile (top coding) and raises values below the (q/2) quantile (bottom coding). The threshold parameter q specifies the total proportion of extreme values to code (e.g., q=0.1 codes the lowest 5% and the highest 5% of values).

Provide either the threshold q or the pair of quantile values (lower_value and upper_value), but not both. If lower_value and upper_value are used, they must be specified together.

Attributes:

Name	Type	Description
`q`	`float \| None`	Proportion controlling the extent of top/bottom coding, between 0 and 1.
`lower_value`	`float \| None`	Input data quantile value at q/2.
`upper_value`	`float \| None`	Input data quantile value at (1 - q/2).
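The quantile arithmetic behind numerical top/bottom coding can be sketched in plain Python. The helper names `quantile` and `top_bottom_code` are hypothetical, and linear interpolation between order statistics is an assumption about how the quantiles are computed.

```python
from __future__ import annotations


def quantile(sorted_values: list, p: float) -> float:
    """Quantile of pre-sorted data using linear interpolation (assumed method)."""
    pos = p * (len(sorted_values) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    return sorted_values[lo] + (pos - lo) * (sorted_values[hi] - sorted_values[lo])


def top_bottom_code(
    values: list,
    q: float | None = None,
    lower_value: float | None = None,
    upper_value: float | None = None,
) -> list:
    """Clamp values to [lower_value, upper_value], derived from q if needed."""
    if (lower_value is None) != (upper_value is None):
        raise ValueError("lower_value and upper_value must be given together")
    if (q is None) == (lower_value is None):
        raise ValueError("provide either q or both bounds, but not both")
    if q is not None:
        ordered = sorted(values)
        lower_value = quantile(ordered, q / 2)      # bottom-coding threshold
        upper_value = quantile(ordered, 1 - q / 2)  # top-coding threshold
    return [min(max(v, lower_value), upper_value) for v in values]


incomes = list(range(101))  # 0, 1, ..., 100
coded = top_bottom_code(incomes, q=0.1)
# Values below the 5% quantile are raised and values above the 95% quantile are capped.
print(min(coded), max(coded))
print(top_bottom_code([1, 50, 100], lower_value=10, upper_value=90))  # [10, 50, 90]
```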

TopBottomCodingCategorical

TopBottomCodingCategorical(
    q: float | None = None,
    other_label: Any = "OTHER",
    rare_categories: list[Any] | None = None,
)

Bases: BaseSingleColumnTechnique

Implements top/bottom coding for categorical columns.

Categories that account for a proportion of the total data less than or equal to q are replaced with other_label (e.g., q=0.01 corresponds to 1%).

Attributes:

Name	Type	Description
`q`	`float \| None`	A proportion controlling the extent of top/bottom coding, between 0 and 1.
`other_label`	`Any`	The new category to replace rare categories with. Defaults to "OTHER".
`rare_categories`	`list[Any] \| None`	A list of rare categories to be replaced. This can be used instead of the `q` parameter to explicitly specify which categories should be replaced with `other_label`.
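A minimal plain-Python sketch of the rare-category rule follows. The function name `code_rare_categories` is hypothetical, and requiring exactly one of `q` or `rare_categories` is an assumption based on the description that the list can be used instead of `q`.

```python
from __future__ import annotations

from collections import Counter
from typing import Any


def code_rare_categories(
    values: list,
    q: float | None = None,
    other_label: Any = "OTHER",
    rare_categories: list | None = None,
) -> list:
    """Replace rare categories with `other_label`."""
    # Assumption: exactly one of q or rare_categories is supplied.
    if (q is None) == (rare_categories is None):
        raise ValueError("provide either q or rare_categories")
    if rare_categories is None:
        counts = Counter(values)
        total = len(values)
        # A category is rare when its share of the data is at most q.
        rare_categories = [cat for cat, n in counts.items() if n / total <= q]
    rare = set(rare_categories)
    return [other_label if v in rare else v for v in values]


jobs = ["nurse"] * 8 + ["astronaut", "falconer"]
print(code_rare_categories(jobs, q=0.1))
# The eight "nurse" entries are kept; "astronaut" and "falconer" (10% each) become "OTHER".
```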

Types

StartingDirection `module-attribute`

StartingDirection = Literal['left', 'right']

SeedT `module-attribute`

SeedT = int | Generator | None

SamplingMode `module-attribute`

SamplingMode = Literal['uniform', 'weighted']

NumericsT `module-attribute`

NumericsT = TypeVar('NumericsT', int, float)

MockingGeneratorMethods `module-attribute`

MockingGeneratorMethods = Literal[
    "aba",
    "address",
    "administrative_unit",
    "am_pm",
    "android_platform_token",
    "ascii_company_email",
    "ascii_email",
    "ascii_free_email",
    "ascii_safe_email",
    "bank_country",
    "bban",
    "binary",
    "boolean",
    "bothify",
    "bs",
    "building_number",
    "catch_phrase",
    "century",
    "chrome",
    "city",
    "city_prefix",
    "city_suffix",
    "color",
    "color_name",
    "company",
    "company_email",
    "company_suffix",
    "coordinate",
    "country",
    "country_calling_code",
    "country_code",
    "credit_card_expire",
    "credit_card_full",
    "credit_card_number",
    "credit_card_provider",
    "credit_card_security_code",
    "cryptocurrency",
    "cryptocurrency_code",
    "cryptocurrency_name",
    "csv",
    "currency",
    "currency_code",
    "currency_name",
    "currency_symbol",
    "current_country",
    "current_country_code",
    "date",
    "date_between",
    "date_between_dates",
    "date_object",
    "date_of_birth",
    "date_this_century",
    "date_this_decade",
    "date_this_month",
    "date_this_year",
    "date_time",
    "date_time_ad",
    "date_time_between",
    "date_time_between_dates",
    "date_time_this_century",
    "date_time_this_decade",
    "date_time_this_month",
    "date_time_this_year",
    "day_of_month",
    "day_of_week",
    "dga",
    "domain_name",
    "domain_word",
    "dsv",
    "ean",
    "ean13",
    "ean8",
    "ein",
    "email",
    "emoji",
    "enum",
    "file_extension",
    "file_name",
    "file_path",
    "firefox",
    "first_name",
    "first_name_female",
    "first_name_male",
    "first_name_nonbinary",
    "fixed_width",
    "free_email",
    "free_email_domain",
    "future_date",
    "future_datetime",
    "hex_color",
    "hexify",
    "hostname",
    "http_method",
    "iana_id",
    "iban",
    "image",
    "image_url",
    "internet_explorer",
    "invalid_ssn",
    "ios_platform_token",
    "ipv4",
    "ipv4_network_class",
    "ipv4_private",
    "ipv4_public",
    "ipv6",
    "isbn10",
    "isbn13",
    "iso8601",
    "itin",
    "job",
    "json",
    "json_bytes",
    "language_code",
    "language_name",
    "last_name",
    "last_name_female",
    "last_name_male",
    "last_name_nonbinary",
    "latitude",
    "latlng",
    "lexify",
    "license_plate",
    "linux_platform_token",
    "linux_processor",
    "local_latlng",
    "locale",
    "localized_ean",
    "localized_ean13",
    "localized_ean8",
    "location_on_land",
    "longitude",
    "mac_address",
    "mac_platform_token",
    "mac_processor",
    "md5",
    "military_apo",
    "military_dpo",
    "military_ship",
    "military_state",
    "mime_type",
    "month",
    "month_name",
    "msisdn",
    "name",
    "name_female",
    "name_male",
    "name_nonbinary",
    "nic_handle",
    "nic_handles",
    "null_boolean",
    "numerify",
    "opera",
    "paragraph",
    "paragraphs",
    "password",
    "past_date",
    "past_datetime",
    "phone_number",
    "port_number",
    "postalcode",
    "postalcode_in_state",
    "postalcode_plus4",
    "postcode",
    "postcode_in_state",
    "prefix",
    "prefix_female",
    "prefix_male",
    "prefix_nonbinary",
    "pricetag",
    "profile",
    "providers",
    "psv",
    "pybool",
    "pydecimal",
    "pydict",
    "pyfloat",
    "pyint",
    "pyiterable",
    "pylist",
    "pyset",
    "pystr",
    "pystr_format",
    "pystruct",
    "pytimezone",
    "pytuple",
    "random_choices",
    "random_digit",
    "random_digit_not_null",
    "random_digit_not_null_or_empty",
    "random_digit_or_empty",
    "random_element",
    "random_elements",
    "random_int",
    "random_letter",
    "random_letters",
    "random_lowercase_letter",
    "random_number",
    "random_sample",
    "random_uppercase_letter",
    "randomize_nb_elements",
    "rgb_color",
    "rgb_css_color",
    "ripe_id",
    "safari",
    "safe_color_name",
    "safe_domain_name",
    "safe_email",
    "safe_hex_color",
    "secondary_address",
    "sentence",
    "sentences",
    "sha1",
    "sha256",
    "simple_profile",
    "slug",
    "ssn",
    "state",
    "state_abbr",
    "street_address",
    "street_name",
    "street_suffix",
    "suffix",
    "suffix_female",
    "suffix_male",
    "suffix_nonbinary",
    "swift",
    "swift11",
    "swift8",
    "tar",
    "text",
    "texts",
    "time",
    "time_delta",
    "time_object",
    "time_series",
    "timezone",
    "tld",
    "tsv",
    "unix_device",
    "unix_partition",
    "unix_time",
    "upc_a",
    "upc_e",
    "uri",
    "uri_extension",
    "uri_page",
    "uri_path",
    "url",
    "user_agent",
    "user_name",
    "uuid4",
    "windows_platform_token",
    "word",
    "words",
    "year",
    "zip",
    "zipcode",
    "zipcode_in_state",
    "zipcode_plus4",
]

Lists all available generator methods from Faker (fake).
