Techniques
`aindo-anonymize` techniques are classes that derive from `BaseTechnique` and define specific parameters and logic for anonymization.
BaseTechnique
Single-column techniques are anonymization methods designed to operate on individual data columns.
These techniques are implemented as classes that derive from `BaseSingleColumnTechnique`.
BaseSingleColumnTechnique
Bases: BaseTechnique, ABC
Abstract base class for anonymization techniques applied to a single column.
Subclasses should implement the `anonymize_column` method, which defines the logic for anonymizing a single column.
anonymize_column
anonymize
Applies the anonymization technique to a single-column dataframe.
This is analogous to calling `anonymize_column()` on a single pandas Series.
It is a convenience method shared across all types of anonymizers.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
dataframe | DataFrame | The input data. Must have exactly one column. | required |
Returns:
Type | Description |
---|---|
DataFrame | The anonymized version of the input data. |
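The relationship between `anonymize` and `anonymize_column` described above can be sketched in plain Python. This is only an illustration of the pattern, not the library's code: a dict with one key stands in for a one-column DataFrame, and the names `SketchSingleColumnTechnique` and `Redact` are hypothetical.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List


class SketchSingleColumnTechnique(ABC):
    """Hypothetical stand-in for a single-column technique base class."""

    @abstractmethod
    def anonymize_column(self, column: List[Any]) -> List[Any]:
        """Return the anonymized version of one column."""

    def anonymize(self, table: Dict[str, List[Any]]) -> Dict[str, List[Any]]:
        # Mirror the documented contract: the input must have exactly one column.
        if len(table) != 1:
            raise ValueError("The input data must have exactly one column.")
        [(name, column)] = table.items()
        # Delegate the actual work to the column-level method.
        return {name: self.anonymize_column(column)}


class Redact(SketchSingleColumnTechnique):
    """Hypothetical subclass that replaces every value with a fixed label."""

    def anonymize_column(self, column: List[Any]) -> List[Any]:
        return ["REDACTED" for _ in column]


result = Redact().anonymize({"name": ["Ada", "Grace"]})
```

Keeping the logic in `anonymize_column` means a subclass only has to describe how one column is transformed, while the table-level wrapper stays shared.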
Techniques
DataNulling
DataNulling(constant_value: Any = None)
Bases: BaseSingleColumnTechnique
Implements data nulling.
Data nulling replaces the original data with a `None` value (or a custom constant value).
Attributes:
Name | Type | Description |
---|---|---|
constant_value | Any | The value that will replace the original data. Defaults to None. |
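As a rough illustration of the idea (not the library's implementation), data nulling on a plain Python list could look like the sketch below. The helper name `null_column` is hypothetical.

```python
from typing import Any, List


def null_column(values: List[Any], constant_value: Any = None) -> List[Any]:
    """Replace every value with the same constant (None by default)."""
    return [constant_value for _ in values]


nulled = null_column(["alice@example.com", "bob@example.com"])
labelled = null_column([1, 2, 3], constant_value="REMOVED")
```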
CharacterMasking
CharacterMasking(
    mask_length: int = 1,
    symbol: AnyStr = "*",
    starting_direction: StartingDirection = "left",
)
Bases: BaseSingleColumnTechnique, Generic[AnyStr]
Implements character masking.
Character masking replaces some or all of the characters of a data value with a constant symbol.
Full masking is achieved by setting `mask_length=-1`.
Attributes:
Name | Type | Description |
---|---|---|
starting_direction | StartingDirection | The direction in which masking starts. Default is "left". |
mask_length | int | The number of characters to mask. Set to -1 to mask the entire value. Defaults to 1. |
symbol | AnyStr | The symbol used for masking. Defaults to "*". |
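The masking behavior described above can be sketched for a single string as follows. This is a minimal illustration under assumptions: the helper `mask_value` is hypothetical, and `"right"` is assumed to be the other accepted direction, which the text above does not state.

```python
def mask_value(
    value: str,
    mask_length: int = 1,
    symbol: str = "*",
    starting_direction: str = "left",
) -> str:
    """Mask `mask_length` characters of `value`, starting from one end."""
    if mask_length < -1:
        raise ValueError("mask_length must be -1 or a non-negative integer.")
    # -1 means full masking; never mask more characters than the value has.
    n = len(value) if mask_length == -1 else min(mask_length, len(value))
    masked = symbol * n
    if starting_direction == "left":
        return masked + value[n:]
    if starting_direction == "right":  # assumed name for the opposite direction
        return value[: len(value) - n] + masked
    raise ValueError("starting_direction must be 'left' or 'right'.")


left_masked = mask_value("secret", mask_length=3)
right_masked = mask_value("secret", mask_length=2, symbol="#", starting_direction="right")
fully_masked = mask_value("secret", mask_length=-1)
```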
Mocking
Mocking(
    data_generator: MockingGeneratorMethods,
    seed: SeedType = None,
    faker_kwargs: dict[str, Any] | None = None,
    faker_generator_kwargs: dict[str, Any] | None = None,
)
Bases: BaseSingleColumnTechnique
Implements mocking.
Mocking generates realistic mock data for various fields such as names, addresses, emails, and more.
It leverages the `faker` library to produce customizable, locale-aware fake data.
Attributes:
Name | Type | Description |
---|---|---|
data_generator | MockingGeneratorMethods | Faker's generator method ("fake") used to generate data (e.g., name, email). |
seed | SeedType | A seed to initialize numpy |
Parameters:
Name | Type | Description | Default |
---|---|---|---|
data_generator | MockingGeneratorMethods | Faker's generator method ("fake") used to generate data (e.g., name, email). | required |
faker_kwargs | dict[str, Any] \| None | Additional arguments passed to the main Faker object (proxy class). | None |
faker_generator_kwargs | dict[str, Any] \| None | Additional arguments passed to the Faker's generator method. | None |
KeyHashing
Bases: BaseSingleColumnTechnique
Implements key-based hashing.
Data values are hashed using HMAC with a cryptographic key and the chosen hashing algorithm (defaults to SHA-256). The resulting hash is then encoded using Base64. The de-identified values always have a uniform length.
Attributes:
Name | Type | Description |
---|---|---|
key | str | The cryptographic key used for hashing. |
salt | str \| None | An optional salt that can be added to the value before hashing. Defaults to None. |
hash_name | str | The hashing algorithm to use, compatible with |
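The HMAC-then-Base64 scheme described above can be reproduced with the Python standard library. The sketch below is an illustration, not the library's code: the helper `key_hash` is hypothetical, and prepending the salt to the value is an assumption, since the text only says the salt is added before hashing.

```python
import base64
import hmac
from typing import Optional


def key_hash(
    value: str,
    key: str,
    salt: Optional[str] = None,
    hash_name: str = "sha256",
) -> str:
    """Return the Base64-encoded HMAC of `value` under `key`."""
    # Assumption: the salt is prepended to the value.
    message = value if salt is None else salt + value
    digest = hmac.new(key.encode("utf-8"), message.encode("utf-8"), hash_name).digest()
    return base64.b64encode(digest).decode("ascii")


short_hash = key_hash("alice", key="my-secret-key")
long_hash = key_hash("a much longer identifier than the first one", key="my-secret-key")
```

With SHA-256 the digest is 32 bytes, so every Base64 output is 44 characters long regardless of the input length, which is the uniform-length property mentioned above.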
Swapping
Bases: BaseSingleColumnTechnique, Seeder, AlphaProbability
Implements swapping.
Swapping rearranges data by shuffling values, ensuring that individual values remain present but are generally not in their original position.
The process is controlled by the `alpha` parameter, representing the probability of a row being swapped with another.
Attributes:
Name | Type | Description |
---|---|---|
alpha | float | The perturbation intensity, a value in the range [0, 1]. |
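One way to picture swapping is sketched below. It is an interpretation, not the library's algorithm: each row is selected with probability `alpha`, and the selected values are shuffled among themselves. The helper `swap_values` and its `seed` argument are hypothetical.

```python
import random
from typing import Any, List, Optional


def swap_values(values: List[Any], alpha: float, seed: Optional[int] = None) -> List[Any]:
    """Shuffle a randomly selected subset of the values among themselves."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in the range [0, 1].")
    rng = random.Random(seed)
    # Each row is selected for swapping with probability alpha.
    selected = [i for i in range(len(values)) if rng.random() < alpha]
    sources = selected[:]
    rng.shuffle(sources)
    result = list(values)
    # Move each selected value to another selected position.
    for target, source in zip(selected, sources):
        result[target] = values[source]
    return result


cities = ["Rome", "Milan", "Turin", "Naples", "Genoa", "Bari"]
swapped = swap_values(cities, alpha=0.5, seed=0)
```

Because values are only moved, never created or dropped, the set of values in the column is preserved. A random shuffle can leave some selected values in place, which matches the "generally not in their original position" wording.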
Binning
Bases: BaseSingleColumnTechnique
Implements binning for numerical columns.
Binning works by grouping numerical values into discrete bins, allowing for data generalization by replacing individual values with their corresponding bin ranges.
Attributes:
Name | Type | Description |
---|---|---|
bins | int \| Sequence[int] \| Sequence[float] | The bin edges or number of bins to use. |
Examples:
An integer `bins` value forms equal-width bins.
>>> ages = pd.Series([10, 15, 13, 12, 23, 25, 28, 59, 60])
>>> Binning(bins=3).anonymize_column(ages)
[(9.95, 26.667], (9.95, 26.667], (9.95, 26.667], ...
Categories (3, interval[float64, right]): [(9.95, 26.667] < (26.667, 43.333] < (43.333, 60.0]]
A list of ordered bin edges assigns each value to the interval it falls in.
>>> ages = pd.Series([10, 15, 13, 12, 23, 25, 28, 59, 60])
>>> Binning(bins=[0, 18, 35, 70]).anonymize_column(ages)
[(0, 18], (0, 18], (0, 18], (0, 18], (18, 35], ...
Categories (3, interval[int64, right]): [(0, 18] < (18, 35] < (35, 70]]
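The interval assignment shown in the examples can be reproduced with the standard library. The sketch below is only an illustration: the helpers `bin_values` and `equal_width_edges` are hypothetical, intervals are right-closed as in the printed output, and widening the first edge by 0.1% of the data range is an assumption chosen to match the `(9.95, 26.667]` interval above.

```python
from bisect import bisect_left
from typing import List, Optional, Sequence


def equal_width_edges(values: Sequence[float], n_bins: int) -> List[float]:
    """Build edges for n_bins equal-width bins spanning the data."""
    low, high = min(values), max(values)
    width = (high - low) / n_bins
    edges = [low + i * width for i in range(n_bins)] + [high]
    # Assumption: widen the first edge slightly so the minimum value
    # falls inside the first right-closed interval.
    edges[0] = low - 0.001 * (high - low)
    return edges


def bin_values(values: Sequence[float], edges: Sequence[float]) -> List[Optional[str]]:
    """Replace each value with the right-closed interval (a, b] containing it."""
    labels: List[Optional[str]] = []
    for value in values:
        # Index of the first edge >= value, so edges[i - 1] < value <= edges[i].
        i = bisect_left(edges, value)
        if i == 0 or i == len(edges):
            labels.append(None)  # value lies outside all intervals
        else:
            labels.append(f"({edges[i - 1]}, {edges[i]}]")
    return labels


ages = [10, 15, 13, 12, 23, 25, 28, 59, 60]
by_edges = bin_values(ages, [0, 18, 35, 70])
by_count = bin_values(ages, equal_width_edges(ages, 3))
```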
PerturbationNumerical
PerturbationNumerical(
    alpha: float,
    sampling_mode: SamplingMode = "uniform",
    perturbation_range: tuple[NumericsT, NumericsT] | None = None,
    **kwargs: SeedT,
)
Bases: BasePerturbation, Generic[NumericsT]
Implements perturbation for numerical columns.
Perturbation modifies each value according to the specified perturbation intensity (alpha) and replacement strategy. It supports two modes of replacement: uniform sampling and distribution-preserving sampling.
Attributes:
Name | Type | Description |
---|---|---|
alpha | float | The perturbation intensity, a value in the range [0, 1]. |
sampling_mode | SamplingMode | The strategy used to sample replacement values. |
perturbation_range | tuple[NumericsT, NumericsT] \| None | A tuple[min, max] within which random values are sampled. If not set, the range is automatically computed as the minimum and maximum of the input data. |
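A possible reading of uniform-mode numerical perturbation is sketched below. It rests on a strong assumption that the text above does not confirm: `alpha` is treated as the probability that a value is replaced by a draw from the perturbation range. The helper `perturb_numerical` and its `seed` argument are hypothetical, and the distribution-preserving mode is not covered.

```python
import random
from typing import List, Optional, Sequence, Tuple


def perturb_numerical(
    values: Sequence[float],
    alpha: float,
    perturbation_range: Optional[Tuple[float, float]] = None,
    seed: Optional[int] = None,
) -> List[float]:
    """Replace each value, with probability alpha, by a uniform random draw."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in the range [0, 1].")
    rng = random.Random(seed)
    if perturbation_range is None:
        # Default range: minimum and maximum of the input data.
        low, high = min(values), max(values)
    else:
        low, high = perturbation_range
    return [rng.uniform(low, high) if rng.random() < alpha else v for v in values]


salaries = [21000.0, 35000.0, 48000.0, 52000.0, 90000.0]
untouched = perturb_numerical(salaries, alpha=0.0, seed=0)
replaced = perturb_numerical(salaries, alpha=1.0, seed=0)
```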
PerturbationCategorical
PerturbationCategorical(
    alpha: float,
    sampling_mode: SamplingMode = "uniform",
    frequencies: dict[str, float] | None = None,
    **kwargs: SeedT,
)
Bases: BasePerturbation
Implements perturbation for categorical columns.
Perturbation consists of replacing values with randomized alternatives based on the specified sampling mode and perturbation intensity (alpha). It supports two modes of replacement: uniform sampling and distribution-preserving sampling.
Attributes:
Name | Type | Description |
---|---|---|
alpha | float | The perturbation intensity, a value in the range [0, 1]. |
sampling_mode | SamplingMode | The strategy used to sample replacement values. |
frequencies | dict[str, float] \| None | Optional mapping of unique values to their relative frequencies, used for weighted sampling mode. Automatically computed if not provided. |
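Categorical perturbation can be pictured with the sketch below. It is an interpretation rather than the library's code: `alpha` is assumed to be the probability that a value is replaced, `"uniform"` draws from the observed categories with equal probability, and the mode name `"weighted"` is inferred from the `frequencies` description. The helper `perturb_categorical` and its `seed` argument are hypothetical.

```python
import random
from collections import Counter
from typing import Dict, List, Optional, Sequence


def perturb_categorical(
    values: Sequence[str],
    alpha: float,
    sampling_mode: str = "uniform",
    frequencies: Optional[Dict[str, float]] = None,
    seed: Optional[int] = None,
) -> List[str]:
    """Replace each value, with probability alpha, by a sampled category."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in the range [0, 1].")
    if sampling_mode not in ("uniform", "weighted"):
        raise ValueError("sampling_mode must be 'uniform' or 'weighted'.")
    rng = random.Random(seed)
    if frequencies is None:
        # Compute relative frequencies from the data when none are given.
        counts = Counter(values)
        frequencies = {c: n / len(values) for c, n in counts.items()}
    categories = list(frequencies)
    # Weighted mode preserves the category distribution; uniform ignores it.
    weights = [frequencies[c] for c in categories] if sampling_mode == "weighted" else None
    return [
        rng.choices(categories, weights=weights, k=1)[0] if rng.random() < alpha else v
        for v in values
    ]


colors = ["red", "red", "red", "blue", "green"]
unchanged = perturb_categorical(colors, alpha=0.0, seed=0)
resampled = perturb_categorical(colors, alpha=1.0, sampling_mode="weighted", seed=0)
```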
TopBottomCodingNumerical
TopBottomCodingNumerical(
    q: float | None = None,
    lower_value: float | None = None,
    upper_value: float | None = None,
)
Bases: BaseSingleColumnTechnique
Implements top/bottom coding for numerical columns.
This technique caps values above the (1 - q/2) quantile (top coding) and raises values below the (q/2) quantile (bottom coding). The threshold parameter `q` specifies the total proportion of extreme values to code (e.g., `q=0.1` codes the lowest 5% and the highest 5% of values).
Either the threshold `q` or the pair of quantile values (`lower_value` and `upper_value`) must be provided, but not both. If `lower_value` and `upper_value` are used, they must be specified together.
Attributes:
Name | Type | Description |
---|---|---|
q | float \| None | Proportion controlling the extent of top/bottom coding, between 0 and 1. |
lower_value | float \| None | Input data quantile value at q/2. |
upper_value | float \| None | Input data quantile value at (1 - q/2). |
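The clipping rule above can be sketched with plain Python. This is an illustration under assumptions, not the library's implementation: the helpers `_quantile` and `top_bottom_code` are hypothetical, and quantiles are computed by linear interpolation between sorted values, which may differ from the method the library uses.

```python
from typing import List, Optional, Sequence


def _quantile(ordered: Sequence[float], p: float) -> float:
    """Quantile of already-sorted data using linear interpolation."""
    position = p * (len(ordered) - 1)
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])


def top_bottom_code(
    values: Sequence[float],
    q: Optional[float] = None,
    lower_value: Optional[float] = None,
    upper_value: Optional[float] = None,
) -> List[float]:
    """Clip values to the [q/2, 1 - q/2] quantile range or to explicit bounds."""
    has_bounds = lower_value is not None or upper_value is not None
    if q is not None and has_bounds:
        raise ValueError("Provide either q or both bounds, not both.")
    if q is None:
        if lower_value is None or upper_value is None:
            raise ValueError("lower_value and upper_value must be given together.")
    else:
        ordered = sorted(values)
        lower_value = _quantile(ordered, q / 2)
        upper_value = _quantile(ordered, 1 - q / 2)
    # Raise values below the lower bound and cap values above the upper bound.
    return [min(max(v, lower_value), upper_value) for v in values]


by_bounds = top_bottom_code([1, 5, 10, 50, 100], lower_value=5, upper_value=50)
by_quantile = top_bottom_code(list(range(101)), q=0.1)
```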
TopBottomCodingCategorical
TopBottomCodingCategorical(
    q: float | None = None,
    other_label: Any = "OTHER",
    rare_categories: list[Any] | None = None,
)
Bases: BaseSingleColumnTechnique
Implements top/bottom coding for categorical columns.
Categories that account for a proportion of the data less than or equal to `q` are replaced with the `other_label` (e.g., `q=0.01` corresponds to 1%).
Attributes:
Name | Type | Description |
---|---|---|
q | float \| None | A proportion controlling the extent of top/bottom coding, between 0 and 1. |
other_label | Any | The new category to replace rare categories with. Default is "OTHER". |
rare_categories | list[Any] \| None | A list of rare categories to be replaced. This can be used instead of the |
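Rare-category coding can be sketched as follows. The helper `code_rare_categories` is hypothetical, and requiring exactly one of `q` or `rare_categories` is an assumption made for this illustration.

```python
from collections import Counter
from typing import Any, List, Optional, Sequence


def code_rare_categories(
    values: Sequence[Any],
    q: Optional[float] = None,
    other_label: Any = "OTHER",
    rare_categories: Optional[Sequence[Any]] = None,
) -> List[Any]:
    """Replace rare categories with a single catch-all label."""
    # Assumption: exactly one of q or rare_categories is supplied.
    if (q is None) == (rare_categories is None):
        raise ValueError("Provide exactly one of q or rare_categories.")
    if rare_categories is None:
        counts = Counter(values)
        # A category is rare when its share of the data is at most q.
        rare = {c for c, n in counts.items() if n / len(values) <= q}
    else:
        rare = set(rare_categories)
    return [other_label if v in rare else v for v in values]


jobs = ["nurse"] * 8 + ["astronaut", "falconer"]
by_threshold = code_rare_categories(jobs, q=0.1)
by_list = code_rare_categories(jobs, rare_categories=["astronaut"])
```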
Types
MockingGeneratorMethods
module-attribute
MockingGeneratorMethods = Literal[
"aba",
"address",
"administrative_unit",
"am_pm",
"android_platform_token",
"ascii_company_email",
"ascii_email",
"ascii_free_email",
"ascii_safe_email",
"bank_country",
"bban",
"binary",
"boolean",
"bothify",
"bs",
"building_number",
"catch_phrase",
"century",
"chrome",
"city",
"city_prefix",
"city_suffix",
"color",
"color_name",
"company",
"company_email",
"company_suffix",
"coordinate",
"country",
"country_calling_code",
"country_code",
"credit_card_expire",
"credit_card_full",
"credit_card_number",
"credit_card_provider",
"credit_card_security_code",
"cryptocurrency",
"cryptocurrency_code",
"cryptocurrency_name",
"csv",
"currency",
"currency_code",
"currency_name",
"currency_symbol",
"current_country",
"current_country_code",
"date",
"date_between",
"date_between_dates",
"date_object",
"date_of_birth",
"date_this_century",
"date_this_decade",
"date_this_month",
"date_this_year",
"date_time",
"date_time_ad",
"date_time_between",
"date_time_between_dates",
"date_time_this_century",
"date_time_this_decade",
"date_time_this_month",
"date_time_this_year",
"day_of_month",
"day_of_week",
"dga",
"domain_name",
"domain_word",
"dsv",
"ean",
"ean13",
"ean8",
"ein",
"email",
"emoji",
"enum",
"file_extension",
"file_name",
"file_path",
"firefox",
"first_name",
"first_name_female",
"first_name_male",
"first_name_nonbinary",
"fixed_width",
"free_email",
"free_email_domain",
"future_date",
"future_datetime",
"hex_color",
"hexify",
"hostname",
"http_method",
"iana_id",
"iban",
"image",
"image_url",
"internet_explorer",
"invalid_ssn",
"ios_platform_token",
"ipv4",
"ipv4_network_class",
"ipv4_private",
"ipv4_public",
"ipv6",
"isbn10",
"isbn13",
"iso8601",
"itin",
"job",
"json",
"json_bytes",
"language_code",
"language_name",
"last_name",
"last_name_female",
"last_name_male",
"last_name_nonbinary",
"latitude",
"latlng",
"lexify",
"license_plate",
"linux_platform_token",
"linux_processor",
"local_latlng",
"locale",
"localized_ean",
"localized_ean13",
"localized_ean8",
"location_on_land",
"longitude",
"mac_address",
"mac_platform_token",
"mac_processor",
"md5",
"military_apo",
"military_dpo",
"military_ship",
"military_state",
"mime_type",
"month",
"month_name",
"msisdn",
"name",
"name_female",
"name_male",
"name_nonbinary",
"nic_handle",
"nic_handles",
"null_boolean",
"numerify",
"opera",
"paragraph",
"paragraphs",
"password",
"past_date",
"past_datetime",
"phone_number",
"port_number",
"postalcode",
"postalcode_in_state",
"postalcode_plus4",
"postcode",
"postcode_in_state",
"prefix",
"prefix_female",
"prefix_male",
"prefix_nonbinary",
"pricetag",
"profile",
"providers",
"psv",
"pybool",
"pydecimal",
"pydict",
"pyfloat",
"pyint",
"pyiterable",
"pylist",
"pyset",
"pystr",
"pystr_format",
"pystruct",
"pytimezone",
"pytuple",
"random_choices",
"random_digit",
"random_digit_not_null",
"random_digit_not_null_or_empty",
"random_digit_or_empty",
"random_element",
"random_elements",
"random_int",
"random_letter",
"random_letters",
"random_lowercase_letter",
"random_number",
"random_sample",
"random_uppercase_letter",
"randomize_nb_elements",
"rgb_color",
"rgb_css_color",
"ripe_id",
"safari",
"safe_color_name",
"safe_domain_name",
"safe_email",
"safe_hex_color",
"secondary_address",
"sentence",
"sentences",
"sha1",
"sha256",
"simple_profile",
"slug",
"ssn",
"state",
"state_abbr",
"street_address",
"street_name",
"street_suffix",
"suffix",
"suffix_female",
"suffix_male",
"suffix_nonbinary",
"swift",
"swift11",
"swift8",
"tar",
"text",
"texts",
"time",
"time_delta",
"time_object",
"time_series",
"timezone",
"tld",
"tsv",
"unix_device",
"unix_partition",
"unix_time",
"upc_a",
"upc_e",
"uri",
"uri_extension",
"uri_page",
"uri_path",
"url",
"user_agent",
"user_name",
"uuid4",
"windows_platform_token",
"word",
"words",
"year",
"zip",
"zipcode",
"zipcode_in_state",
"zipcode_plus4",
]
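Because the alias is a `Literal`, its allowed values can be inspected at runtime with `typing.get_args`. The sketch below uses a shortened, hypothetical alias `SketchGeneratorMethods` containing three of the names listed above, so that it runs without the library installed.

```python
from typing import Literal, get_args

# Hypothetical, abbreviated stand-in for the full alias shown above.
SketchGeneratorMethods = Literal["name", "email", "address"]


def is_supported(method: str) -> bool:
    """Check whether a generator method name is one of the allowed literals."""
    return method in get_args(SketchGeneratorMethods)


supported = is_supported("email")
unsupported = is_supported("favourite_pizza")
```

A static type checker performs the same check at analysis time when a string literal is passed where the alias is expected.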