Anonymization

Protect sensitive data by anonymizing it while maintaining its utility for analysis and development.

1: Anonymization Architecture
2: Prerequisites for Anonymization
3: Setting up Anonymization
4: Running the Anonymization samples
5: Using the Anonymization APIs
6: Uninstalling Anonymization

Anonymization helps organizations protect sensitive data by transforming it into anonymized data that retains its value for analysis and development. It leverages AI to perform this transformation while ensuring privacy and compliance.

1 - Anonymization Architecture

Architecture of the Anonymization feature.

Protegrity Anonymization processes datasets through generalization so that the risk of re-identification stays within tolerable thresholds. Anonymization reduces data utility, but Protegrity Anonymization optimizes this fundamental privacy-utility trade-off to preserve as much data quality as possible within the privacy goals.

Protegrity Anonymization leverages Kubernetes for data anonymization at scale and it provides instructions and support for deployment and usage on AWS EKS and Microsoft Azure AKS.

An overview of the communication is shown in the following figure.

Anonymization Components

Architecture

Protegrity Anonymization runs as several pods on Kubernetes. The Protegrity Anonymization Web Server processes requests and stores the data securely in an internal Database Server. A request is first received by the Nginx-Ingress component, which forwards it to the Anon-App. The Anon-App processes the request, submits the resulting tasks to the cluster, and stores the metadata about the job in the Anon-DB container. The scheduler schedules the tasks on the workers and handles the communication with them. The workers then read, write, and process the data that is stored in the Anon-Storage, the request stream, or the cloud storage. The Anon-Storage uses an S3 bucket for storing data. The workers run on random ports.

The user accesses Protegrity Anonymization using HTTPS over port 443. The user requests are directed to an Ingress Controller, and the controller in turn communicates with the required pods using the following ports:

8090: Ingress controller and the Protegrity Anonymization API Web Service
8786: Ingress controller
8100: Ingress controller and S3 bucket

Components

Protegrity Anonymization is composed of the following main components:

Protegrity Anonymization REST Server: This core component exposes a REST interface through which clients can interact with the Protegrity Anonymization service. It uses an in-memory task queue and stores anonymized datasets and their metadata on persistent storage. Protegrity Anonymization tasks are submitted to the queue and are handled in first-in, first-out (FIFO) order.

Note: Only one anonymization task is executed at a time in Protegrity Anonymization.

REST Client: The client connects to the Protegrity Anonymization REST Server using an API tool, such as Postman, to create, send, and receive the Protegrity Anonymization request. It also provides a Swagger interface detailing the APIs available. The Swagger interface can also be used as a REST client for raising API requests.
Python SDK: It is the Python programmatic interface used to communicate with the REST server.
Anon-Storage: It is used to read data from and write data to the storage. It uses the S3 bucket framework to perform file operations.
Anon-DB: It is a PostgreSQL database that is used to store metadata related to Protegrity Anonymization jobs.
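
The first-in, first-out handling described for the REST Server, with only one task running at a time, can be pictured with a small standard-library sketch. This is an illustration of the queueing behavior only and is not the server's actual implementation. The function name run_fifo and the sample tasks are hypothetical.

```python
import queue
import threading


def run_fifo(tasks):
    """Run callables one at a time, in submission order, on a single worker."""
    pending = queue.Queue()  # in-memory FIFO task queue
    results = []

    def worker():
        while True:
            task = pending.get()
            if task is None:  # sentinel: no more work
                break
            results.append(task())  # only one task executes at a time

    thread = threading.Thread(target=worker)
    thread.start()
    for task in tasks:
        pending.put(task)
    pending.put(None)
    thread.join()
    return results


# Hypothetical tasks standing in for anonymization requests.
order = run_fifo([lambda: "job-1", lambda: "job-2", lambda: "job-3"])
print(order)
```

Because a single worker drains the queue, a long-running task delays every task submitted after it.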

2 - Prerequisites for Anonymization

Prerequisites for the Anonymization feature.

Ensure that the following prerequisites are met before running these examples for Anonymization:

Docker CLI, Docker Compose, and Python are installed. For more information, refer to AI Developer Edition, Pre-requisites Guide.
For shell samples: Bash version greater than or equal to 5.1.8 and curl version greater than or equal to 7.76.1.
For notebook samples: JupyterLab version greater than or equal to 4.5.6.

3 - Setting up Anonymization

Installation instructions for the Anonymization feature.

Use the containers to set up the Anonymization feature required for anonymizing sensitive data.

Open a command prompt.
Navigate to the cloned repository location for protegrity-ai-developer-edition.
Run the following command to download and start the containers. The dependent containers are large, so downloading and deploying them might take time depending on the network connection.
```
cd anonymization
docker compose up -d
```
If your Docker installation provides only the standalone Compose binary, use the docker-compose up -d command instead.
Note: By default, images are obtained from ghcr.io. To obtain images from public.ecr.aws, navigate to the anonymization directory and copy the .env.example file to .env. Open the .env file and uncomment the REGISTRY=public.ecr.aws/protegrity-ai-developer-edition line. Save the file and run the docker compose up -d command to download and start the containers.
Verify that the containers started successfully.
```
docker compose logs
```
From the cloned repository location for protegrity-ai-developer-edition, install the dependencies required for working with the provided Jupyter notebooks.
```
pip install -r shared/requirements.txt
```

Install the Anonymization SDK package.

pip install protegrity-anonymization-sdk

4 - Running the Anonymization samples

Instructions for running the Anonymization samples.

The example scripts under the anonymization/ folder demonstrate the usage of Anonymization APIs. For more information about the Anonymization APIs, refer to the section Anonymization APIs.

Note: A dedicated anonymization/docker-compose.yml is provided to start the Anonymization services.

Open a command prompt.
Navigate to the directory where AI Developer Edition is cloned.
Run the following command to start Jupyter Lab.
```
jupyter lab
```
Copy the URL displayed and navigate to the site from a web browser. Ensure that localhost is replaced with the IP address of the system where the AI Developer Edition is set up.
In the left pane of the Jupyter Lab, navigate to anonymization/samples/python/sample-app-anonymization.
Open the anonymization.ipynb file.
Click the Play icon and follow the prompts in the Jupyter Lab.

5 - Using the Anonymization APIs

Listing the APIs for Anonymization.

client

Anonymization SDK Client.

Provides synchronous (AnonymizationClient) and asynchronous (AsyncAnonymizationClient) Python clients for the Anonymization API.

Public models, enums, and exceptions are re-exported here for backward compatibility so that from anonymization_sdk.client import X continues to work.

AnonymizationClient

class AnonymizationClient()

Synchronous client for the Anonymization API.

Arguments:

base_url - Base URL of the Anonymization API (default: http://localhost:8000)
timeout - Request timeout in seconds (default: 30)
headers - Additional headers to include in requests

__init__

def __init__(base_url: str = DEFAULT_BASE_URL,
             timeout: float = DEFAULT_TIMEOUT,
             headers: dict[str, str] | None = None,
             mlops_config: dict[str, Any] | None = None)

Initialize the Anonymization client.

Arguments:

base_url - Base URL of the Anonymization API
timeout - Request timeout in seconds
headers - Additional HTTP headers to include in requests
mlops_config - Default MLOps tracking configuration applied to every anonymize, auto_anonymize, apply_anon, and calculate_risk call. Can be overridden per-call by passing mlops_config explicitly.

close

def close() -> None

Close the HTTP client.

is_healthy

def is_healthy() -> bool

Check if the API is healthy and responding.

Returns:

True if the API is reachable and healthy, False otherwise.

get_health

def get_health() -> dict[str, Any]

Get detailed health information from the API.

Returns:

Dictionary with health status, version, and component states.

Raises:

APIError - If the API returns an error status.
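
A minimal sketch of how is_healthy() and get_health() might be combined before sending work to the API. The helper name wait_until_healthy and its attempts and delay parameters are hypothetical. The client is passed in as an argument so that the snippet stays self-contained. In practice it would be an AnonymizationClient imported from anonymization_sdk.client.

```python
import time


def wait_until_healthy(client, attempts=5, delay=1.0):
    """Return get_health() details once is_healthy() reports True.

    `client` is expected to expose is_healthy() -> bool and
    get_health() -> dict, as listed for AnonymizationClient.
    """
    for _ in range(attempts):
        if client.is_healthy():
            return client.get_health()
        time.sleep(delay)  # give the containers time to start
    raise RuntimeError("Anonymization API did not become healthy")


# Assumed usage, not executed here:
# from anonymization_sdk.client import AnonymizationClient
# client = AnonymizationClient(base_url="http://localhost:8000")
# print(wait_until_healthy(client))
```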

detect_qi

def detect_qi(data: DataInputType,
              *,
              mode: DetectionMode | str = DetectionMode.AUTO,
              sampling_method: SamplingMethod | str = SamplingMethod.FAST,
              cumulative_importance_threshold: float = 0.8,
              max_quasi_identifiers: int = 10,
              uniqueness_threshold: float = 0.95,
              known_identifiers: list[str] | None = None,
              known_sensitive: list[str] | None = None,
              ignore_columns: list[str] | None = None) -> DetectionResult

Detect quasi-identifiers in a dataset.

Arguments:

data - Inline records (List[Dict]), local file path / file:// URI, or cloud URI (s3://, gs://, azure://, etc.). Local paths are read and encoded automatically.
mode - Detection algorithm (“auto”, “ml”, “heuristic”).
sampling_method - Sampling strategy (“fast”, “full”, “adaptive”).
cumulative_importance_threshold - Stop adding QIs at this cumulative importance threshold (0.0–1.0, default 0.8).
max_quasi_identifiers - Maximum QIs to return (default 10).
uniqueness_threshold - Columns above this uniqueness ratio are flagged as direct identifiers (0.0–1.0, default 0.95).
known_identifiers - Columns you know are direct identifiers.
known_sensitive - Columns you know are sensitive.
ignore_columns - Columns to skip during detection.

Returns:

DetectionResult with quasi_identifiers, direct_identifiers, sensitive_attributes, attributes, and optional model_metrics.

Raises:

APIError - If the API returns an error.
ValidationError - If the request is invalid.
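
To make the uniqueness_threshold argument concrete, the following sketch computes a plain uniqueness ratio (distinct values divided by row count) for each column and flags the columns above the threshold. How the service actually computes this ratio is not documented here, so treat the formula as an assumption. The helper names and sample records are hypothetical.

```python
def uniqueness_ratios(records):
    """Map each column to (number of distinct values) / (number of rows)."""
    total = len(records)
    columns = records[0].keys()
    return {c: len({r[c] for r in records}) / total for c in columns}


def likely_direct_identifiers(records, uniqueness_threshold=0.95):
    """Columns whose uniqueness ratio is above the threshold."""
    ratios = uniqueness_ratios(records)
    return [c for c, ratio in ratios.items() if ratio > uniqueness_threshold]


records = [
    {"email": "a@example.com", "age": 34, "zipcode": "10001"},
    {"email": "b@example.com", "age": 34, "zipcode": "10001"},
    {"email": "c@example.com", "age": 51, "zipcode": "10002"},
    {"email": "d@example.com", "age": 51, "zipcode": "10002"},
]
print(likely_direct_identifiers(records))  # ['email']
```

Here every email value is distinct (ratio 1.0), while age and zipcode each have a ratio of 0.5 and would remain candidates for quasi-identifiers.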

generate_config

def generate_config(data: DataInputType,
                    *,
                    privacy_model: PrivacyModel
                    | str = PrivacyModel.K_ANONYMITY,
                    k: int = 5,
                    l: int | None = None,
                    t: float | None = None,
                    mode: DetectionMode | str = DetectionMode.AUTO,
                    **kwargs) -> AutoConfigResult

Generate anonymization configuration automatically.

Arguments:

data - Inline records (List[Dict]), local file path, or cloud URI.
privacy_model - Privacy model (“k-anonymity”, “l-diversity”, “t-closeness”).
k - K value (default 5).
l - L value for l-diversity.
t - T threshold for t-closeness.
mode - Detection algorithm (“auto”, “ml”, “heuristic”).
**kwargs - max_suppression, diversity_type, distance_metric, sampling_method.

Returns:

AutoConfigResult with detection results and a ready-to-use anonymize_request configuration dict.

calculate_risk

def calculate_risk(data: DataInputType,
                   quasi_identifiers: list[str] | None = None,
                   *,
                   risk_threshold: float = 0.2,
                   suppress_value: str = "*",
                   include_prosecutor: bool = True,
                   include_journalist: bool = True,
                   include_marketer: bool = True,
                   mlops_config: dict[str, Any] | None = None) -> RiskResult

Calculate re-identification risk metrics.

Arguments:

data - Inline records (List[Dict]), local file path, or cloud URI.
quasi_identifiers - QI column names to consider for risk.
risk_threshold - Records above this threshold are “at risk” (default 0.2).
suppress_value - Value marking suppressed records (default “*”).
include_prosecutor - Calculate prosecutor risk (default True).
include_journalist - Calculate journalist risk (default True).
include_marketer - Calculate marketer risk (default True).
mlops_config - MLOps config override.

Returns:

RiskResult with prosecutor, journalist, marketer risk models and k_anonymity, highest_risk_level, equivalence class statistics.
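
The relationship between equivalence classes, k-anonymity, and risk_threshold can be sketched with the textbook definition of prosecutor risk, where each record's risk is 1 divided by the size of its equivalence class. This is an illustration of the concept and not the formula used by the service. The function names and the returned keys are hypothetical.

```python
from collections import Counter


def equivalence_classes(records, quasi_identifiers):
    """Count records sharing the same combination of QI values."""
    return Counter(tuple(r[q] for q in quasi_identifiers) for r in records)


def prosecutor_risk(records, quasi_identifiers, risk_threshold=0.2):
    classes = equivalence_classes(records, quasi_identifiers)
    # Per-record risk: 1 / size of the record's equivalence class.
    risks = [1 / classes[tuple(r[q] for q in quasi_identifiers)] for r in records]
    return {
        "k_anonymity": min(classes.values()),  # smallest class size
        "highest_risk": max(risks),
        "records_at_risk": sum(risk > risk_threshold for risk in risks),
    }


# Six records in one class, two in another (already generalized values).
records = (
    [{"age": "30-39", "zipcode": "100**"}] * 6
    + [{"age": "50-59", "zipcode": "100**"}] * 2
)
summary = prosecutor_risk(records, ["age", "zipcode"])
print(summary)
```

With the default threshold of 0.2, only the two records in the class of size 2 (risk 0.5) count as at risk. The six records in the larger class each carry a risk of about 0.17.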

anonymize

def anonymize(data: DataInputType,
              *,
              privacy_model: PrivacyModel | str = PrivacyModel.K_ANONYMITY,
              k: int = 5,
              l: int | None = None,
              t: float | None = None,
              attributes: list[dict[str, Any]] | None = None,
              max_suppression: float = 0.0,
              output_uri: str | None = None,
              output_format: str = "csv",
              mlops_config: dict[str, Any] | None = None,
              **kwargs) -> AnonymizeResult

Anonymize data synchronously using the specified privacy model.

Arguments:

data - Inline records (List[Dict]), local file path / file:// URI, or cloud URI (s3://, gs://, azure://, etc.). Local paths are read and encoded automatically.
privacy_model - Privacy model (“k-anonymity”, “l-diversity”, “t-closeness”).
k - K value for k-anonymity (default 5).
l - L value for l-diversity.
t - T threshold for t-closeness (0.0–1.0).
attributes - Attribute configurations - list of dicts with name, type (“quasi_identifier”, “sensitive”, “identifier”, “insensitive”), and optional hierarchy.
max_suppression - Maximum fraction of records to suppress (0.0–1.0).
output_uri - Cloud URI to write results to instead of returning inline (e.g. "s3://bucket/output.csv"). When set, result_path is populated in the response instead of data.
output_format - Format for cloud output (“csv”, “parquet”, “json”).
mlops_config - MLOps tracking configuration.
**kwargs - diversity_type, distance_metric, use_lattice_search, etc.

Returns:

AnonymizeResult with data (inline), or result_path (cloud output), row_count, suppressed_count, and metrics.
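
A hedged sketch of an anonymize() call that builds the attributes list from the documented name and type keys. The column names, the helper name anonymize_patients, and the chosen k and max_suppression values are hypothetical. Reading data and suppressed_count as attributes of the result is an assumption based on the return description above. The client is passed in so that the snippet does not depend on a running service.

```python
def anonymize_patients(client, records):
    """Anonymize inline records with k-anonymity using explicit attributes."""
    attributes = [
        {"name": "name", "type": "identifier"},
        {"name": "age", "type": "quasi_identifier"},
        {"name": "zipcode", "type": "quasi_identifier"},
        {"name": "diagnosis", "type": "sensitive"},
    ]
    result = client.anonymize(
        records,
        privacy_model="k-anonymity",
        k=2,
        attributes=attributes,
        max_suppression=0.1,  # allow up to 10% of records to be suppressed
    )
    # Assumption: the result exposes `data` and `suppressed_count` attributes.
    return result.data, result.suppressed_count


# Assumed usage, not executed here:
# from anonymization_sdk.client import AnonymizationClient
# rows, suppressed = anonymize_patients(AnonymizationClient(), my_records)
```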

submit_job

def submit_job(data: DataInputType,
               *,
               privacy_model: PrivacyModel | str = PrivacyModel.K_ANONYMITY,
               k: int = 5,
               l: int | None = None,
               t: float | None = None,
               attributes: list[dict[str, Any]] | None = None,
               max_suppression: float = 0.0,
               **kwargs) -> JobResponse

Submit an anonymization job for asynchronous processing.

Arguments:

data - Inline records (List[Dict]), local file path, or cloud URI.
privacy_model - Privacy model (“k-anonymity”, “l-diversity”, “t-closeness”).
k - K value for k-anonymity (default 5).
l - L value for l-diversity.
t - T threshold for t-closeness.
attributes - Attribute configurations.
max_suppression - Maximum suppression rate (0.0–1.0).
**kwargs - Additional parameters (diversity_type, distance_metric).

Returns:

JobResponse with job_id, status, message, and created_at timestamp.

get_job_status

def get_job_status(job_id: str) -> JobStatusResponse

Get the status of an anonymization job.

Poll this method to track progress of jobs submitted via submit_job(). The response includes progress percentage, status, timestamps, and any error messages if the job failed.

Arguments:

job_id - Unique job identifier returned by submit_job()

Returns:

JobStatusResponse with:

job_id: Job identifier
status: Current status (pending, running, completed, failed, cancelled)
progress: Progress percentage (0-100)
message: Status message
created_at: Job creation timestamp
updated_at: Last update timestamp
completed_at: Completion timestamp (if completed)
result_path: Path to result file (if completed)
error: Error message (if failed)

Raises:

APIError - If job not found or API call fails

cancel_job

def cancel_job(job_id: str) -> None

Cancel a pending or running anonymization job.

Cancels a job that was submitted via submit_job(). Only jobs with status PENDING or RUNNING can be cancelled. Completed, failed, or already cancelled jobs cannot be cancelled.

Arguments:

job_id - Unique job identifier returned by submit_job()

Raises:

APIError - If job not found or cannot be cancelled

apply_anon

def apply_anon(job_id: str,
               data: DataInputType,
               *,
               mlops_config: dict[str, Any] | None = None) -> "ApplyResult"

Apply a saved anonymization solution to new data.

Re-uses the generalization levels computed during a prior anonymize() call identified by job_id. The lattice is not recomputed.

Arguments:

job_id - Solution identifier returned in AnonymizeResult.job_id.
data - Inline records (List[Dict]), local file path, or cloud URI.
mlops_config - Optional per-request MLOps tracking configuration.

Returns:

ApplyResult with anonymized data, row/suppressed counts, source_job_id, and privacy_model.

list_models

def list_models(*,
                model_type: str | None = None,
                all_metrics: bool = False) -> dict[str, Any]

List tracked anonymization models in Production.

Arguments:

model_type - Optional filter by privacy model type (e.g. “k-anonymity”).
all_metrics - If True, return all metrics instead of only the promotion metric.

Returns:

Raw response dict with ‘models’ list and ‘count’.

list_jobs

def list_jobs(*,
              status: JobStatus | str | None = None,
              limit: int = 100,
              offset: int = 0) -> "JobListResult"

List / browse all jobs with optional status filter and pagination.

Returns newest jobs first.

Arguments:

status - Optional filter (e.g. JobStatus.COMPLETED or “failed”)
limit - Page size (1-1000, default 100)
offset - Page offset (default 0)

Returns:

JobListResult with jobs list, total count, limit, and offset.

Raises:

APIError - If the API call fails.
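
The limit and offset arguments can be combined to walk through all jobs page by page. The following sketch assumes that the returned JobListResult exposes the page of jobs as a jobs attribute, which is inferred from the return description and not confirmed. The helper name iter_jobs is hypothetical.

```python
def iter_jobs(client, status=None, page_size=100):
    """Yield jobs page by page until a short page signals the end."""
    offset = 0
    while True:
        page = client.list_jobs(status=status, limit=page_size, offset=offset)
        jobs = list(page.jobs)  # assumption: the page exposes a `jobs` list
        yield from jobs
        if len(jobs) < page_size:
            break
        offset += page_size


# Assumed usage, not executed here:
# for job in iter_jobs(client, status="failed", page_size=50):
#     print(job)
```

Stopping on a short page avoids relying on the name of the total-count field. When the number of jobs is an exact multiple of page_size, this costs one extra request that returns an empty page.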

get_job_history

def get_job_history(job_id: str) -> list["JobHistoryEntry"]

Get the full state-transition audit trail for a job.

Each create/update call on the server appends an entry with the status, step, progress, and timestamp at that point.

Arguments:

job_id - Unique job identifier.

Returns:

List of JobHistoryEntry ordered by sequence.

Raises:

APIError - If job not found or API call fails.

wait_for_job

def wait_for_job(job_id: str,
                 *,
                 poll_interval: float = 2.0,
                 timeout: float = 600.0,
                 callback: Any | None = None) -> JobStatusResponse

Poll a job until it reaches a terminal state and return its status.

Arguments:

job_id - Unique job identifier returned by submit_job().
poll_interval - Seconds between status polls (default 2s).
timeout - Maximum seconds to wait (default 600s / 10 min).
callback - Optional callable (JobStatusResponse) -> None invoked after each poll.

Returns:

JobStatusResponse at the terminal state. The anonymization result (if completed) is available in status.context["result"].

Raises:

APIError - If the job ends in a failed state.
TimeoutError - If the job does not complete within timeout.
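
For readers who want to see what such polling involves, the following sketch builds a comparable loop on top of get_job_status(). It is not the SDK's implementation. It assumes that status.status compares equal to the lowercase strings listed for JobStatusResponse, and unlike wait_for_job() it returns failed jobs to the caller without raising an error. The function name poll_job is hypothetical.

```python
import time

TERMINAL_STATES = {"completed", "failed", "cancelled"}


def poll_job(client, job_id, poll_interval=2.0, timeout=600.0, callback=None):
    """Poll get_job_status() until the job reaches a terminal state."""
    deadline = time.monotonic() + timeout
    while True:
        status = client.get_job_status(job_id)
        if callback is not None:
            callback(status)  # invoked after each poll
        if status.status in TERMINAL_STATES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} is still {status.status} after {timeout}s")
        time.sleep(poll_interval)


# Assumed usage, not executed here:
# job = client.submit_job(records, k=5)
# final = poll_job(client, job.job_id, callback=lambda s: print(s.progress))
```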

auto_anonymize

def auto_anonymize(data: DataInputType,
                   *,
                   privacy_model: PrivacyModel
                   | str = PrivacyModel.K_ANONYMITY,
                   k: int = 5,
                   l: int | None = None,
                   t: float | None = None,
                   mode: DetectionMode | str = DetectionMode.AUTO,
                   mlops_config: dict[str, Any] | None = None,
                   **kwargs) -> AutoAnonymizeResult

Automatically detect QIs and anonymize in one step.

Arguments:

data - Inline records (List[Dict]), local file path, or cloud URI.
privacy_model - Privacy model (“k-anonymity”, “l-diversity”, “t-closeness”).
k - K value (default 5).
l - L value for l-diversity.
t - T threshold for t-closeness.
mode - Detection algorithm (“auto”, “ml”, “heuristic”).
mlops_config - MLOps tracking configuration.
**kwargs - max_suppression, sampling_method, use_lattice_search, etc.

Returns:

AutoAnonymizeResult with detection results and anonymized data.

validate

def validate(
        data: DataInputType,
        quasi_identifiers: list[str] | None = None,
        *,
        privacy_model: PrivacyModel | str = PrivacyModel.K_ANONYMITY,
        k: int = 5,
        l: int | None = None,
        t: float | None = None,
        sensitive_attributes: list[str] | None = None) -> ValidationResult

Validate that data meets privacy requirements.

Arguments:

data - Inline records (List[Dict]), local file path, or cloud URI.
quasi_identifiers - QI column names to check.
privacy_model - Privacy model to validate against.
k - Required k for k-anonymity (default 5).
l - Required l for l-diversity.
t - Required t for t-closeness.
sensitive_attributes - Sensitive columns (required for l-diversity/t-closeness).

Returns:

ValidationResult with is_valid, model_type, violations, statistics.

measure

def measure(original_data: DataInputType,
            anonymized_data: DataInputType,
            quasi_identifiers: list[str] | None = None) -> MetricsResult

Measure anonymization quality metrics.

Arguments:

original_data - Original dataset - inline records, local path, or cloud URI.
anonymized_data - Anonymized dataset - inline records, local path, or cloud URI.
quasi_identifiers - QI column names that were generalized.

Returns:

MetricsResult with information_loss and detailed metrics.

create_pattern

def create_pattern(name: str,
                   classification: str,
                   column_patterns: list[str],
                   *,
                   priority: int = 50,
                   value_patterns: list[str] | None = None,
                   min_match_ratio: float = 0.8,
                   description: str | None = None) -> Pattern

Create a custom detection pattern.

Patterns are used during QI detection to automatically classify columns based on their names and values. Custom patterns take precedence over built-in patterns.

Arguments:

name - Unique name for the pattern (e.g., ‘customer_id’)
classification - Classification type - one of:
- “DI”: Direct Identifier (e.g., SSN, email)
- “QI”: Quasi-Identifier (e.g., age, zipcode)
- “SI”: Sensitive Identifier (e.g., salary, diagnosis)
- “NSI”: Non-Sensitive Identifier (safe to publish)
column_patterns - List of column name patterns to match. Case-insensitive. Use ‘*’ as wildcard (e.g., [‘*_id’, ‘user*’])
priority - Priority level (1-1000, lower = checked first). Default: 50
value_patterns - Optional list of regex patterns for value validation
min_match_ratio - Minimum ratio of values that must match (0-1). Default: 0.8
description - Optional description of what this pattern detects

Returns:

Pattern object with assigned ID and metadata

Raises:

APIError - If creation fails (e.g., duplicate name)
ValidationError - If parameters are invalid
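
The interplay of column_patterns, value_patterns, and min_match_ratio can be illustrated locally. The sketch below assumes glob-style wildcard matching on column names and full-string regex matching on values. The actual matching rules of the service may differ, and the helper names are hypothetical.

```python
import fnmatch
import re


def column_matches(column, column_patterns):
    """Case-insensitive wildcard match of a column name against patterns."""
    return any(
        fnmatch.fnmatchcase(column.lower(), pattern.lower())
        for pattern in column_patterns
    )


def values_match(values, value_patterns, min_match_ratio=0.8):
    """True if enough values fully match at least one regex pattern."""
    if not values:
        return False
    hits = sum(
        any(re.fullmatch(pattern, str(value)) for pattern in value_patterns)
        for value in values
    )
    return hits / len(values) >= min_match_ratio


print(column_matches("Customer_ID", ["*_id", "user*"]))  # True
print(column_matches("age", ["*_id", "user*"]))          # False

ssn_like = [r"\d{3}-\d{2}-\d{4}"]
sample = ["123-45-6789", "987-65-4321", "n/a"]
print(values_match(sample, ssn_like))  # False: 2 of 3 values is below 0.8
```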

list_patterns

def list_patterns(classification: str | None = None) -> PatternListResult

List all custom detection patterns.

Arguments:

classification - Optional filter by classification (DI, QI, SI, NSI)

Returns:

PatternListResult containing list of patterns and total count

get_pattern

def get_pattern(pattern_id: str) -> Pattern

Get a specific pattern by ID.

Arguments:

pattern_id - The pattern ID to retrieve

Returns:

Pattern object

Raises:

APIError - If pattern not found (404)

update_pattern

def update_pattern(pattern_id: str,
                   *,
                   name: str | None = None,
                   classification: str | None = None,
                   column_patterns: list[str] | None = None,
                   priority: int | None = None,
                   value_patterns: list[str] | None = None,
                   min_match_ratio: float | None = None,
                   description: str | None = None) -> Pattern

Update an existing pattern.

Only provided fields will be updated; others remain unchanged.

Arguments:

pattern_id - The pattern ID to update
name - New name for the pattern
classification - New classification (DI, QI, SI, NSI)
column_patterns - New column name patterns
priority - New priority (1-1000)
value_patterns - New value regex patterns
min_match_ratio - New minimum match ratio (0-1)
description - New description

Returns:

Updated Pattern object

Raises:

APIError - If pattern not found or update fails
ValidationError - If parameters are invalid

delete_pattern

def delete_pattern(pattern_id: str) -> dict[str, Any]

Delete a pattern by ID.

Arguments:

pattern_id - The pattern ID to delete

Returns:

Dictionary with confirmation message

Raises:

APIError - If pattern not found (404)

delete_all_patterns

def delete_all_patterns() -> dict[str, Any]

Delete all custom patterns.

WARNING: This removes all customer-defined patterns. Built-in patterns from the YAML config are not affected.

Returns:

Dictionary with count of deleted patterns

reload_patterns

def reload_patterns() -> dict[str, Any]

Reload patterns from storage file.

Use this to sync after manual file edits.

Returns:

Dictionary with count of reloaded patterns

dp_compute

def dp_compute(data: DataInputType,
               *,
               mechanism: DPMechanismType | str = DPMechanismType.MEAN,
               column: str | None = None,
               columns: list[str] | None = None,
               group_by: str | None = None,
               epsilon: float = 1.0,
               delta: float = 0.0,
               noise_type: DPNoiseType | str = DPNoiseType.LAPLACE,
               bounds: tuple | None = None,
               bins: int | None = None,
               histogram_range: tuple | None = None,
               session_id: str | None = None,
               predicate: str | None = None,
               candidates: list | None = None,
               utility_scores: list[float] | None = None,
               sensitivity: float | None = None,
               epsilon_map: dict[str, float] | None = None,
               min_group_size: int | None = None) -> DPComputeResult

Compute a differentially private statistic on a data column.

Arguments:

data - Inline records (List[Dict]), local file path, or cloud URI.
mechanism - DP mechanism (“mean”, “sum”, “variance”, “histogram”, “count”, “exponential”).
column - Column name for single-column queries.
columns - Column names for multi-column queries.
group_by - Categorical column to group by.
epsilon - Privacy parameter epsilon (>0).
delta - Privacy parameter delta (>=0, <1).
noise_type - “laplace” or “gaussian”.
bounds - (lower, upper) clipping bounds. Required for mean/sum/variance.
bins - Number of histogram bins (histogram only).
histogram_range - (min, max) range for histogram bins.
session_id - Budget session ID for cumulative tracking.
predicate - Filter expression (e.g., “> 50”, “<= 100”).
candidates - Candidate outputs (exponential mechanism only).
utility_scores - Utility scores for candidates (exponential only).
sensitivity - Utility function sensitivity (exponential only).
epsilon_map - Per-column or per-group epsilon overrides.
min_group_size - Minimum rows per group (default 5).

Returns:

DPComputeResult with private_value (single) or results dict (multi/group).
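
To show why bounds are required for the mean, the following sketch applies the standard Laplace mechanism to a clipped mean. It assumes that the row count is public and that neighboring datasets differ in one record, which gives a sensitivity of (upper - lower) / n. The service's own calibration is not documented here and may differ. The function name dp_mean and the sample values are hypothetical.

```python
import random


def dp_mean(values, bounds, epsilon, rng=None):
    """Differentially private mean using Laplace noise on clipped values."""
    rng = rng or random.Random()
    lower, upper = bounds
    clipped = [min(max(v, lower), upper) for v in values]
    n = len(clipped)
    # One record can shift the clipped mean by at most (upper - lower) / n.
    sensitivity = (upper - lower) / n
    scale = sensitivity / epsilon
    # The difference of two i.i.d. exponentials with mean `scale`
    # follows a Laplace distribution with that scale.
    noise = rng.expovariate(1 / scale) - rng.expovariate(1 / scale)
    return sum(clipped) / n + noise


salaries = [10, 20, 30, 1000]  # the outlier is clipped to the upper bound
print(dp_mean(salaries, bounds=(0, 100), epsilon=1.0))
```

A smaller epsilon or wider bounds increases the noise scale, which is the privacy-utility trade-off the epsilon parameter controls.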

dp_stream_update

def dp_stream_update(session_id: str | None = None,
                     data: DataInputType | None = None,
                     *,
                     column: str | None = None,
                     columns: list[str] | None = None,
                     group_by: str | None = None,
                     mechanism: DPStreamMechanismType | str | None = None,
                     epsilon: float | None = None,
                     delta: float | None = None,
                     noise_type: DPNoiseType | str | None = None,
                     bounds: tuple | None = None,
                     get_result: bool = False,
                     window_size: int | None = None,
                     epsilon_map: dict[str, float] | None = None,
                     min_group_size: int | None = None,
                     budget_session_id: str | None = None) -> DPStreamResult

Feed data into a streaming DP session.

On the first call for a session_id, provide mechanism, epsilon, and bounds. Subsequent calls only need session_id, data, and column.

Arguments:

session_id - Unique session identifier.
data - Batch of records. Mutually exclusive with data_path.
data_path - Cloud/local URI for data batch.
column - Column name for single-column streaming.
columns - Column names for multi-column streaming.
group_by - Categorical column to group by.
mechanism - Streaming mechanism. Required on first call.
epsilon - Privacy epsilon. Required on first call.
delta - Privacy delta.
noise_type - Noise mechanism.
bounds - Clipping bounds. Required on first call (except for count).
get_result - If True, also return the current private result.
window_size - Window size for sliding/tumbling window mechanisms.
epsilon_map - Per-column or per-group epsilon overrides.
min_group_size - Minimum rows per group (default 5).
budget_session_id - Link to a budget session for automatic deduction.

Returns:

DPStreamResult with session status and optional results.

dp_stream_delete

def dp_stream_delete(session_id: str) -> None

Delete a streaming DP session.

Arguments:

session_id - Session to delete.

dp_stream_list_sessions

def dp_stream_list_sessions() -> list

List all active streaming DP sessions.

Returns:

List of dicts with session_id, mechanism, column, batches_processed, total_count.

dp_budget_create

def dp_budget_create(session_id: str,
                     epsilon_budget: float,
                     delta_budget: float = 0.0,
                     composition: str = "basic") -> DPBudgetStatus

Create a privacy budget session.

Arguments:

session_id - Unique session identifier.
epsilon_budget - Total epsilon budget.
delta_budget - Total delta budget.
composition - Composition mode (“basic” or “rdp”). RDP requires delta_budget > 0 and yields tighter privacy accounting.

Returns:

DPBudgetStatus with initial budget state.

dp_budget_status

def dp_budget_status(session_id: str) -> DPBudgetStatus

Get privacy budget status for a session.

Arguments:

session_id - Session to query.

Returns:

DPBudgetStatus with current spend and remaining budget.
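
The idea of a budget session under basic composition, where the epsilon and delta values of successive queries simply add up, can be sketched as follows. This mirrors the concept only. The class name BasicBudget, its attributes, and the error type are hypothetical and do not describe the service's data model.

```python
class BasicBudget:
    """Track cumulative (epsilon, delta) spend under basic composition."""

    def __init__(self, epsilon_budget, delta_budget=0.0):
        self.epsilon_budget = epsilon_budget
        self.delta_budget = delta_budget
        self.epsilon_spent = 0.0
        self.delta_spent = 0.0

    def spend(self, epsilon, delta=0.0):
        """Deduct one query's cost, refusing it if the budget would be exceeded."""
        new_epsilon = self.epsilon_spent + epsilon
        new_delta = self.delta_spent + delta
        tolerance = 1e-12  # guard against floating-point rounding
        if (new_epsilon > self.epsilon_budget + tolerance
                or new_delta > self.delta_budget + tolerance):
            raise ValueError("privacy budget exceeded")
        self.epsilon_spent, self.delta_spent = new_epsilon, new_delta

    @property
    def epsilon_remaining(self):
        return self.epsilon_budget - self.epsilon_spent


budget = BasicBudget(epsilon_budget=1.0)
budget.spend(0.4)
budget.spend(0.4)
print(round(budget.epsilon_remaining, 6))  # 0.2
```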

dp_budget_delete

def dp_budget_delete(session_id: str) -> None

Delete a privacy budget session.

Arguments:

session_id - Session to delete.

dp_advise_composition

def dp_advise_composition(epsilon_budget: float,
                          num_queries: int,
                          delta_budget: float = 0.0,
                          delta_per_query: float = 0.0) -> dict

Get composition advice for planned queries.

Returns optimal per-query epsilon under basic and RDP composition with a recommendation.

Arguments:

epsilon_budget - Total epsilon budget available.
num_queries - Number of planned queries.
delta_budget - Total delta budget (required for RDP comparison).
delta_per_query - Delta per query for Gaussian noise. 0 = Laplace.

Returns:

Dict with basic/rdp analysis, recommendation, and savings_pct.

audit_list

def audit_list(*,
               operation: str | None = None,
               status: str | None = None,
               limit: int = 50,
               offset: int = 0) -> list[AuditEntry]

List audit log entries.

Arguments:

operation - Filter by operation (dp_compute, anonymize_sync, …).
status - Filter by outcome (‘success’ or ‘error’).
limit - Max entries to return (1–500).
offset - Pagination offset.

Returns:

List of AuditEntry objects.

audit_get

def audit_get(entry_id: str) -> AuditEntry

Get a single audit entry.

Arguments:

entry_id - Audit entry ID.

Returns:

AuditEntry with full details.

Raises:

APIError - If entry not found (404).

AsyncAnonymizationClient

class AsyncAnonymizationClient()

Asynchronous client for the Anonymization API.

Same interface as AnonymizationClient but with async/await support.
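
A minimal sketch of driving the asynchronous client with asyncio, using only the is_healthy(), get_health(), and close() coroutines listed in this section. The helper name check_health is hypothetical, and the client is passed in so that the snippet stays self-contained.

```python
import asyncio


async def check_health(client):
    """Return health details if the API is healthy, else None; always close."""
    try:
        if not await client.is_healthy():
            return None
        return await client.get_health()
    finally:
        await client.close()  # release the underlying HTTP client


# Assumed usage, not executed here:
# from anonymization_sdk.client import AsyncAnonymizationClient
# print(asyncio.run(check_health(AsyncAnonymizationClient())))
```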

__init__

def __init__(base_url: str = DEFAULT_BASE_URL,
             timeout: float = DEFAULT_TIMEOUT,
             headers: dict[str, str] | None = None,
             mlops_config: dict[str, Any] | None = None)

Initialize the async Anonymization client.

Arguments:

base_url - Base URL of the Anonymization API
timeout - Request timeout in seconds
headers - Additional HTTP headers to include in requests
mlops_config - Default MLOps tracking configuration applied to every anonymize, auto_anonymize, apply_anon, and calculate_risk call. Can be overridden per-call.

close

async def close() -> None

Close the HTTP client.
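Since close() is a coroutine, it should be awaited even when a request fails. The following is a minimal sketch of that lifecycle using try/finally. StubAsyncClient is a hypothetical stand-in that mimics only is_healthy() and close(); the real AsyncAnonymizationClient would be constructed as described above and used the same way.

```python
import asyncio


class StubAsyncClient:
    """Hypothetical stand-in exposing only close() and is_healthy()."""

    def __init__(self) -> None:
        self.closed = False

    async def is_healthy(self) -> bool:
        return not self.closed

    async def close(self) -> None:
        self.closed = True


async def check_health(client) -> bool:
    """Use the client, then always release its HTTP resources."""
    try:
        return await client.is_healthy()
    finally:
        await client.close()


client = StubAsyncClient()
healthy = asyncio.run(check_health(client))
print(healthy, client.closed)  # True True
```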

is_healthy

async def is_healthy() -> bool

Check if the API is healthy and responding.

get_health

async def get_health() -> dict[str, Any]

Get detailed health information.

detect_qi

async def detect_qi(
        data: DataInputType,
        *,
        mode: DetectionMode | str = DetectionMode.AUTO,
        sampling_method: SamplingMethod | str = SamplingMethod.FAST,
        cumulative_importance_threshold: float = 0.8,
        max_quasi_identifiers: int = 10,
        uniqueness_threshold: float = 0.95,
        known_identifiers: list[str] | None = None,
        known_sensitive: list[str] | None = None,
        ignore_columns: list[str] | None = None) -> DetectionResult

Detect quasi-identifiers (async version).

Refer to synchronous detect_qi() for full documentation.

generate_config

async def generate_config(data: DataInputType,
                          *,
                          privacy_model: PrivacyModel
                          | str = PrivacyModel.K_ANONYMITY,
                          k: int = 5,
                          l: int | None = None,
                          t: float | None = None,
                          mode: DetectionMode | str = DetectionMode.AUTO,
                          **kwargs) -> AutoConfigResult

Generate anonymization configuration automatically (async version).

calculate_risk

async def calculate_risk(
        data: DataInputType,
        quasi_identifiers: list[str] | None = None,
        *,
        risk_threshold: float = 0.2,
        suppress_value: str = "*",
        include_prosecutor: bool = True,
        include_journalist: bool = True,
        include_marketer: bool = True,
        mlops_config: dict[str, Any] | None = None) -> RiskResult

Calculate re-identification risk metrics (async version).

anonymize

async def anonymize(data: DataInputType,
                    *,
                    privacy_model: PrivacyModel
                    | str = PrivacyModel.K_ANONYMITY,
                    k: int = 5,
                    l: int | None = None,
                    t: float | None = None,
                    attributes: list[dict[str, Any]] | None = None,
                    max_suppression: float = 0.0,
                    output_uri: str | None = None,
                    output_format: str = "csv",
                    mlops_config: dict[str, Any] | None = None,
                    **kwargs) -> AnonymizeResult

Anonymize data (async version). Refer to synchronous anonymize() for full documentation.

submit_job

async def submit_job(data: DataInputType,
                     *,
                     privacy_model: PrivacyModel
                     | str = PrivacyModel.K_ANONYMITY,
                     k: int = 5,
                     l: int | None = None,
                     t: float | None = None,
                     attributes: list[dict[str, Any]] | None = None,
                     max_suppression: float = 0.0,
                     **kwargs) -> JobResponse

Submit anonymization job (async version).

Refer to synchronous submit_job() for full documentation.

get_job_status

async def get_job_status(job_id: str) -> JobStatusResponse

Get job status (async version).

Refer to synchronous get_job_status() for full documentation.

cancel_job

async def cancel_job(job_id: str) -> None

Cancel job (async version). Refer to synchronous cancel_job() for full documentation.

apply_anon

async def apply_anon(
        job_id: str,
        data: DataInputType,
        *,
        mlops_config: dict[str, Any] | None = None) -> "ApplyResult"

Apply a saved anonymization (async version). Refer to synchronous apply_anon() for full documentation.

list_models

async def list_models(*,
                      model_type: str | None = None,
                      all_metrics: bool = False) -> dict[str, Any]

List tracked anonymization models (async).

Refer to synchronous list_models() for full docs.

list_jobs

async def list_jobs(*,
                    status: JobStatus | str | None = None,
                    limit: int = 100,
                    offset: int = 0) -> "JobListResult"

List jobs (async version). Refer to synchronous list_jobs() for full documentation.

get_job_history

async def get_job_history(job_id: str) -> list["JobHistoryEntry"]

Get job history (async version).

Refer to synchronous get_job_history() for full documentation.

wait_for_job

async def wait_for_job(job_id: str,
                       *,
                       poll_interval: float = 2.0,
                       timeout: float = 600.0,
                       callback: Any | None = None) -> JobStatusResponse

Async version of wait_for_job().

Refer to synchronous wait_for_job() for full documentation.

auto_anonymize

async def auto_anonymize(data: DataInputType,
                         *,
                         privacy_model: PrivacyModel
                         | str = PrivacyModel.K_ANONYMITY,
                         k: int = 5,
                         l: int | None = None,
                         t: float | None = None,
                         mode: DetectionMode | str = DetectionMode.AUTO,
                         mlops_config: dict[str, Any] | None = None,
                         **kwargs) -> AutoAnonymizeResult

Auto-detect and anonymize (async version).

Refer to synchronous auto_anonymize() for full docs.

validate

async def validate(
        data: DataInputType,
        quasi_identifiers: list[str] | None = None,
        *,
        privacy_model: PrivacyModel | str = PrivacyModel.K_ANONYMITY,
        k: int = 5,
        l: int | None = None,
        t: float | None = None,
        sensitive_attributes: list[str] | None = None) -> ValidationResult

Validate privacy requirements (async version).

measure

async def measure(original_data: DataInputType,
                  anonymized_data: DataInputType,
                  quasi_identifiers: list[str] | None = None) -> MetricsResult

Measure anonymization quality metrics (async version).

create_pattern

async def create_pattern(name: str,
                         classification: str,
                         column_patterns: list[str],
                         *,
                         priority: int = 50,
                         value_patterns: list[str] | None = None,
                         min_match_ratio: float = 0.8,
                         description: str | None = None) -> Pattern

Create a custom detection pattern (async version).

list_patterns

async def list_patterns(
        classification: str | None = None) -> PatternListResult

List all custom detection patterns (async version).

get_pattern

async def get_pattern(pattern_id: str) -> Pattern

Get a specific pattern by ID (async version).

update_pattern

async def update_pattern(pattern_id: str,
                         *,
                         name: str | None = None,
                         classification: str | None = None,
                         column_patterns: list[str] | None = None,
                         priority: int | None = None,
                         value_patterns: list[str] | None = None,
                         min_match_ratio: float | None = None,
                         description: str | None = None) -> Pattern

Update an existing pattern (async version).

delete_pattern

async def delete_pattern(pattern_id: str) -> dict[str, Any]

Delete a pattern by ID (async version).

delete_all_patterns

async def delete_all_patterns() -> dict[str, Any]

Delete all custom patterns (async version).

reload_patterns

async def reload_patterns() -> dict[str, Any]

Reload patterns from the storage file (async version).

audit_list

async def audit_list(*,
                     operation: str | None = None,
                     status: str | None = None,
                     limit: int = 50,
                     offset: int = 0) -> list[AuditEntry]

List audit log entries (async version).

audit_get

async def audit_get(entry_id: str) -> AuditEntry

Get a single audit entry (async version).

exceptions

Anonymization SDK Exceptions.

Custom exception hierarchy for the Anonymization SDK client library. All SDK exceptions inherit from AnonymizationClientError.

AnonymizationClientError

class AnonymizationClientError(Exception)

Base exception for all SDK errors.

ValidationError

class ValidationError(AnonymizationClientError)

Request validation failed (422 from server or client-side validation).

APIError

class APIError(AnonymizationClientError)

API returned an error response (4xx or 5xx status code).

AnonymizationConnectionError

class AnonymizationConnectionError(AnonymizationClientError)

Failed to connect to the API (network/timeout error).

TierRestrictionError

class TierRestrictionError(AnonymizationClientError)

Feature not available in the current server tier (403 from server).

The server returned a tier-restriction error indicating the requested feature requires a higher tier. Inspect the structured fields for details.

models

Anonymization SDK Response Models and Enums.

Contains all enums (PrivacyModel, DetectionMode, etc.) and response dataclasses (DetectionResult, RiskResult, AnonymizeResult, etc.) used by both the synchronous and asynchronous Anonymization clients.

PrivacyModel

class PrivacyModel(StrEnum)

Supported privacy models.

DetectionMode

class DetectionMode(StrEnum)

QI detection algorithm modes.

SamplingMethod

class SamplingMethod(StrEnum)

Sampling methods for detection.

RiskLevel

class RiskLevel(StrEnum)

Risk level classifications.

JobStatus

class JobStatus(StrEnum)

Job execution status.

AttributeClassification

@dataclass
class AttributeClassification()

Classification result for a single attribute.

ModelMetrics

@dataclass
class ModelMetrics()

ML model performance metrics.

DetectionResult

@dataclass
class DetectionResult()

Result of QI detection.

from_dict

@classmethod
def from_dict(cls, data: dict[str, Any]) -> "DetectionResult"

Create from API response dict.
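The same from_dict classmethod appears on every response dataclass in this module. The following is a minimal sketch of the pattern, not the SDK's code. ExampleResult and its fields are hypothetical, since the real field names are not listed in this section, and silently ignoring unknown keys is a design choice assumed here for illustration.

```python
from dataclasses import dataclass, field, fields
from typing import Any


@dataclass
class ExampleResult:
    """Hypothetical result type; field names are illustrative only."""
    job_id: str
    rows_processed: int = 0
    warnings: list[str] = field(default_factory=list)

    @classmethod
    def from_dict(cls, data: dict[str, Any]) -> "ExampleResult":
        # Keep only keys that match declared fields so that extra keys
        # in a response do not raise a TypeError in the constructor.
        known = {f.name for f in fields(cls)}
        return cls(**{k: v for k, v in data.items() if k in known})


result = ExampleResult.from_dict(
    {"job_id": "abc", "rows_processed": 3, "server_version": "x"}
)
print(result.job_id, result.rows_processed, result.warnings)  # abc 3 []
```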

ProsecutorRisk

@dataclass
class ProsecutorRisk(_BaseAttackerRisk)

Prosecutor risk model result.

JournalistRisk

@dataclass
class JournalistRisk(_BaseAttackerRisk)

Journalist risk model result.

MarketerRisk

@dataclass
class MarketerRisk()

Marketer risk model result.

from_dict

@classmethod
def from_dict(cls, data: dict[str, Any]) -> "MarketerRisk"

Create from API response dict.

RiskResult

@dataclass
class RiskResult()

Complete risk metrics result.

from_dict

@classmethod
def from_dict(cls, data: dict[str, Any]) -> "RiskResult"

Create from API response dict.

is_k_anonymous

def is_k_anonymous(k: int) -> bool

Check if data satisfies k-anonymity.
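As a reminder of what this check means: a dataset is k-anonymous when every combination of quasi-identifier values is shared by at least k records. The following is a minimal sketch of that definition on plain Python rows. It does not show how RiskResult computes the answer, and the function names and sample columns are hypothetical.

```python
from collections import Counter
from typing import Any, Mapping, Sequence


def smallest_class_size(rows: Sequence[Mapping[str, Any]],
                        quasi_identifiers: Sequence[str]) -> int:
    """Size of the smallest equivalence class over the quasi-identifiers."""
    classes = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return min(classes.values()) if classes else 0


def satisfies_k_anonymity(rows, quasi_identifiers, k: int) -> bool:
    """True when every equivalence class contains at least k records."""
    return smallest_class_size(rows, quasi_identifiers) >= k


# Hypothetical generalized records: two classes of size 2 and 3.
rows = [
    {"age": "30-39", "zip": "100**", "dx": "flu"},
    {"age": "30-39", "zip": "100**", "dx": "cold"},
    {"age": "40-49", "zip": "200**", "dx": "flu"},
    {"age": "40-49", "zip": "200**", "dx": "asthma"},
    {"age": "40-49", "zip": "200**", "dx": "flu"},
]

print(satisfies_k_anonymity(rows, ["age", "zip"], k=2))  # True
print(satisfies_k_anonymity(rows, ["age", "zip"], k=3))  # False
```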

MetricsResult

@dataclass
class MetricsResult()

Anonymization quality metrics.

from_dict

@classmethod
def from_dict(cls, data: dict[str, Any]) -> "MetricsResult"

Create from API response dict.

AnonymizeResult

@dataclass
class AnonymizeResult()

Result of anonymization operation.

result_path

Cloud storage URI of the result, if it was saved to cloud storage.

job_id

Solution identifier for apply_anon()

from_dict

@classmethod
def from_dict(cls, data: dict[str, Any]) -> "AnonymizeResult"

Create from API response dict.

ApplyResult

@dataclass
class ApplyResult()

Result of applying a saved anonymization solution to new data.

from_dict

@classmethod
def from_dict(cls, data: dict[str, Any]) -> "ApplyResult"

Create from API response dict.

ValidationResult

@dataclass
class ValidationResult()

Result of privacy validation.

from_dict

@classmethod
def from_dict(cls, data: dict[str, Any]) -> "ValidationResult"

Create from API response dict.

AutoConfigResult

@dataclass
class AutoConfigResult()

Result of auto-configuration generation.

from_dict

@classmethod
def from_dict(cls, data: dict[str, Any]) -> "AutoConfigResult"

Create from API response dict.

AutoAnonymizeResult

@dataclass
class AutoAnonymizeResult()

Result of combined detection + anonymization.

from_dict

@classmethod
def from_dict(cls, data: dict[str, Any]) -> "AutoAnonymizeResult"

Create from API response dict.

Pattern

@dataclass
class Pattern()

Detection pattern for automatic QI classification.

from_dict

@classmethod
def from_dict(cls, data: dict[str, Any]) -> "Pattern"

Create from API response dict.

PatternListResult

@dataclass
class PatternListResult()

Result of pattern list operation.

from_dict

@classmethod
def from_dict(cls, data: dict[str, Any]) -> "PatternListResult"

Create from API response dict.

JobResponse

@dataclass
class JobResponse()

Response for job submission.

from_dict

@classmethod
def from_dict(cls, data: dict[str, Any]) -> "JobResponse"

Create from API response dict.

JobStatusResponse

@dataclass
class JobStatusResponse()

Response for job status query.

from_dict

@classmethod
def from_dict(cls, data: dict[str, Any]) -> "JobStatusResponse"

Create from API response dict.

JobHistoryEntry

@dataclass
class JobHistoryEntry()

A single point-in-time snapshot from the job audit trail.

from_dict

@classmethod
def from_dict(cls, data: dict[str, Any]) -> "JobHistoryEntry"

Create from API response dict.

JobListResult

@dataclass
class JobListResult()

Paginated list of jobs.

from_dict

@classmethod
def from_dict(cls, data: dict[str, Any]) -> "JobListResult"

Create from API response dict.

DPMechanismType

class DPMechanismType(StrEnum)

Supported batch DP mechanisms.

DPStreamMechanismType

class DPStreamMechanismType(StrEnum)

Supported streaming DP mechanisms.

DPNoiseType

class DPNoiseType(StrEnum)

Supported noise mechanisms.

DPComputeResult

@dataclass
class DPComputeResult()

Result of a batch DP computation.

from_dict

@classmethod
def from_dict(cls, data: dict[str, Any]) -> "DPComputeResult"

Create from API response dict.

DPStreamResult

@dataclass
class DPStreamResult()

Result of a streaming DP update.

from_dict

@classmethod
def from_dict(cls, data: dict[str, Any]) -> "DPStreamResult"

Create from API response dict.

DPBudgetStatus

@dataclass
class DPBudgetStatus()

Privacy budget status for a session.

from_dict

@classmethod
def from_dict(cls, data: dict[str, Any]) -> "DPBudgetStatus"

Create from API response dict.

AuditEntry

@dataclass
class AuditEntry()

A single audit log entry.

from_dict

@classmethod
def from_dict(cls, data: dict[str, Any]) -> "AuditEntry"

Create from API response dict.

6 - Uninstalling Anonymization

Instructions for uninstalling the Anonymization feature.

Open a command prompt.
Navigate to the cloned repository location.
Navigate to the anonymization directory.
```
cd anonymization
```
Run the following command to remove the containers and images.
```
docker compose down --rmi all
```