Data Discovery

Identify sensitive data across your organization using AI-powered scanning and classification.

Data Discovery is a powerful feature that helps organizations identify and classify sensitive data across their entire data estate. By leveraging AI-powered scanning and classification, Data Discovery enables organizations to gain visibility into their data landscape, understand where sensitive data resides, and take appropriate actions to protect it.

The documentation here for Data Discovery covers its specific requirements and relationship with AI Developer Edition. For more information, refer to the complete body of the Data Discovery documentation.

1 - Data Discovery Architecture

Architecture of the Data Discovery feature.

This section summarizes the Text Classification service of Data Discovery. For more information, refer to the complete body of the Data Discovery documentation.

Overview

The Data Discovery Text Classification service detects Personally Identifiable Information (PII), Protected Health Information (PHI), and Payment Card Information (PCI) in plain text and free-text inputs. Unlike traditional tools built for structured data, it is designed for dynamic, unstructured content such as chatbot conversations, call transcripts, and Generative AI (GenAI) outputs.

Architecture

For more information about the general architecture and working of Data Discovery, refer to General architecture of Data Discovery.

2 - What's New

New features and enhancements of Data Discovery v2.0.0.

Data Discovery

  • Standardized v2 APIs for Classify for Text and Tabular data, and Transform.
  • New endpoints added for API docs, log level management, and version info.
  • Improved Context Provider and Pattern Provider AI models.
  • Updated Classify API default threshold to 0.7. The default threshold for v1.1 remains at 0.0 for compatibility.
  • Added usage metrics and per‑language accuracy metrics.
  • Extended PII detection to multiple Markdown dialects.

For more details, refer to What’s New in Data Discovery.
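The effect of the new default threshold can be illustrated with a minimal sketch. The `Detection` class and the sample scores below are hypothetical and are not the service's response schema. Treating a score equal to the threshold as a match is also an assumption.

```python
from dataclasses import dataclass

# Default thresholds as stated in the release notes above.
DEFAULT_THRESHOLD_V2 = 0.7
DEFAULT_THRESHOLD_V1_1 = 0.0


@dataclass(frozen=True)
class Detection:
    """Hypothetical shape of one classification result (not the real schema)."""
    label: str
    score: float


def filter_by_threshold(detections, threshold=DEFAULT_THRESHOLD_V2):
    """Keep detections whose score is at or above the threshold (assumed >=)."""
    return [d for d in detections if d.score >= threshold]


# Made-up scores, for illustration only.
sample = [
    Detection("PERSON", 0.95),
    Detection("PHONE_NUMBER", 0.55),
    Detection("SOCIAL_SECURITY_ID", 0.70),
]

kept_v2 = filter_by_threshold(sample)
kept_v1_1 = filter_by_threshold(sample, DEFAULT_THRESHOLD_V1_1)
print([d.label for d in kept_v2])
print([d.label for d in kept_v1_1])
```

With these sample scores, the 0.7 default drops the low-confidence PHONE_NUMBER result, while the v1.1 default of 0.0 keeps all three.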

Major Changes

  • Added Jupyter notebook examples:
    • data-discovery/samples/jupyter/sample-classification-jupyter-text.ipynb
    • data-discovery/samples/jupyter/sample-classification-jupyter-tabular.ipynb
    • data-discovery/samples/jupyter/sample-redaction-jupyter-text.ipynb

For more information on these examples, refer to Notebooks.

3 - Prerequisites for Data Discovery

Prerequisites for the Data Discovery feature.

Ensure that the following prerequisites are met before running these examples for Data Discovery:

  • Docker CLI, Docker Compose, and Python are installed. For more information, refer to AI Developer Edition, Pre-requisites Guide.
  • For shell samples: Bash version greater than or equal to 5.1.8 and curl version greater than or equal to 7.76.1.
  • For notebook samples: JupyterLab version greater than or equal to 4.5.6.
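The minimum versions listed above can be checked with a small helper. This is a sketch only: the version strings in `found` are illustrative stand-ins for the output of `bash --version`, `curl --version`, and `jupyter lab --version`, and the helper names are hypothetical.

```python
import re

# Minimum versions taken from the prerequisites above.
MINIMUMS = {"bash": "5.1.8", "curl": "7.76.1", "jupyterlab": "4.5.6"}


def parse_version(text):
    """Extract the first dotted numeric version in text as a tuple of ints."""
    match = re.search(r"\d+(?:\.\d+)+", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def meets_minimum(found_text, minimum):
    """Return True if the version found in found_text is >= minimum."""
    return parse_version(found_text) >= parse_version(minimum)


# Illustrative strings only; substitute the real output from your system.
found = {
    "bash": "GNU bash, version 5.2.15(1)-release (x86_64-pc-linux-gnu)",
    "curl": "curl 7.68.0 (x86_64-pc-linux-gnu)",
    "jupyterlab": "4.5.6",
}

report = {name: meets_minimum(found[name], MINIMUMS[name]) for name in MINIMUMS}
for name, ok in report.items():
    print(f"{name}: {'ok' if ok else 'too old'} (minimum {MINIMUMS[name]})")
```

Comparing tuples of integers avoids the pitfall of string comparison, where "7.9.0" would sort after "7.76.1".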

4 - Setting up Data Discovery

Installation instructions for the Data Discovery feature.

Use the containers to set up the Data Discovery components required for identifying sensitive data.

  1. Open a command prompt.

  2. Navigate to the cloned repository location for protegrity-ai-developer-edition.

  3. Run the following command to download and start the containers. The dependent container images are large. Depending on your network connection, they might take some time to download and deploy.

    cd data-discovery
    docker compose up -d
    

    Depending on your Docker installation, you might need to use the docker-compose up -d command instead. Ensure that you bring down the containers using docker compose down before switching between starting only the Data Discovery containers and starting both the Data Discovery and Semantic Guardrails containers.

    Note: By default images are obtained from ghcr.io. To obtain images from public.ecr.aws, navigate to the data-discovery directory and copy the .env.example file to .env. Open the .env file and uncomment the REGISTRY=public.ecr.aws/protegrity-ai-developer-edition line in the file. Save the file and run the docker compose up -d command to download and start the containers.

  4. Verify that the containers started successfully.

    docker compose logs
    
  5. Install the Python packages required to run the notebooks provided in the cloned protegrity-ai-developer-edition repository.

    pip install -r shared/requirements.txt
    
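The registry switch described in the note for step 3 can also be scripted. The following is a minimal sketch, assuming that the REGISTRY line in .env.example is commented out with a leading `#`. The function name and the stand-in file contents are hypothetical, and the demo runs against a temporary directory rather than the real data-discovery directory.

```python
import shutil
import tempfile
from pathlib import Path

# Registry line quoted in the note above.
REGISTRY_LINE = "REGISTRY=public.ecr.aws/protegrity-ai-developer-edition"


def enable_ecr_registry(directory):
    """Copy .env.example to .env if needed, then uncomment the REGISTRY line."""
    directory = Path(directory)
    env_path = directory / ".env"
    if not env_path.exists():
        shutil.copyfile(directory / ".env.example", env_path)

    changed = False
    lines = []
    for line in env_path.read_text().splitlines():
        # Assumption: the line is disabled with a leading '#'.
        is_commented = line.lstrip().startswith("#")
        if is_commented and line.lstrip("# \t") == REGISTRY_LINE:
            line = REGISTRY_LINE
            changed = True
        lines.append(line)
    env_path.write_text("\n".join(lines) + "\n")
    return changed


# Demo with a stand-in .env.example; the real file may contain other settings.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, ".env.example").write_text(
        "# Example settings\n# " + REGISTRY_LINE + "\n"
    )
    changed = enable_ecr_registry(tmp)
    result = Path(tmp, ".env").read_text()

print(changed)
print(result)
```

After running the function against the data-discovery directory, run docker compose up -d as described in the note.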
If containers from an earlier setup are still present, use the following steps instead:

  1. Open a command prompt.

  2. Navigate to the cloned repository location for protegrity-ai-developer-edition.

  3. If the step to stop containers was missed earlier, use the following command to stop and remove the AI Developer Edition containers.

    docker compose down --remove-orphans
    
  4. Delete the Docker network resources.

    docker network rm -f <network_name_or_id>
    

    For example,

    docker network rm -f protegrity-network
    
  5. Run the following command to download and start the containers. The dependent container images are large. Depending on your network connection, they might take some time to download and deploy.

    cd data-discovery
    docker compose up -d
    

    Depending on your Docker installation, you might need to use the docker-compose up -d command instead. Ensure that you bring down the containers using docker compose down before switching between starting only the Data Discovery containers and starting both the Data Discovery and Semantic Guardrails containers.

  6. Verify that the containers started successfully.

    docker compose logs
    
  7. Install the Python packages required to run the notebooks provided in the cloned protegrity-ai-developer-edition repository.

    pip install -r shared/requirements.txt
    
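Besides reading the container logs, a script can wait until the service port accepts connections before running the samples. The following is a rough sketch using only the Python standard library. The helper name is hypothetical, and the demo connects to a throwaway local listener because the host and port of the Data Discovery service are not stated here; take them from data-discovery/docker-compose.yml.

```python
import socket
import time


def wait_for_port(host, port, timeout=60.0, interval=1.0):
    """Return True once a TCP connection succeeds, or False after timeout."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)


# Self-check against a temporary local listener so the sketch runs anywhere.
with socket.socket() as server:
    server.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
    server.listen(1)
    demo_port = server.getsockname()[1]
    reachable = wait_for_port("127.0.0.1", demo_port, timeout=5.0)

print(reachable)
```

An open port only shows that a container is listening. It does not confirm that the models have finished loading, so docker compose logs remains the authoritative check.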

5 - Running the Data Discovery samples

Instructions for running the Data Discovery samples.

Use the information in this section to run the Data Discovery samples provided in the data-discovery/samples folder. These samples demonstrate how to use the Data Discovery API for classification and redaction of sensitive information in text and tabular data.

Running Data Discovery

The example scripts under the data-discovery/ folder demonstrate classification and redaction using the Data Discovery v2 API. For more information about the Data Discovery APIs, refer to the section Data Discovery APIs.

Note: A dedicated data-discovery/docker-compose.yml is provided to start only the Data Discovery service.

  1. Open a command prompt.

  2. Navigate to the directory where AI Developer Edition is cloned.

  3. Start the Data Discovery services. For setup instructions, refer to the docker compose setup page.

  4. Run any of the following example scripts. The paths are relative to the directory where AI Developer Edition is cloned:

    Classification - text input

    python data-discovery/samples/python/sample-classification-python-text.py
    bash data-discovery/samples/bash/sample-classification-bash-text.sh
    

    Classification - tabular (CSV) input

    python data-discovery/samples/python/sample-classification-python-tabular.py
    bash data-discovery/samples/bash/sample-classification-bash-tabular.sh
    

    Redaction

    python data-discovery/samples/python/sample-redaction-python.py
    bash data-discovery/samples/bash/sample-redaction-bash.sh
    
  5. View the output on the screen. It displays the classification labels or the redacted text returned by the Data Discovery service.
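For orientation, the general shape of a text classification call might resemble the sketch below. The URL is a placeholder, not the real endpoint, and the plain-text request body and JSON response are assumptions. Refer to the Data Discovery APIs section and the provided sample scripts for the actual endpoint, headers, and payload. The sketch uses the standard library urllib instead of requests so that it has no dependencies, and it only builds the request without sending it.

```python
import json
import urllib.request

# Placeholder only: replace with the real Classify endpoint from the API reference.
CLASSIFY_URL = "http://localhost:8080/classify-endpoint-placeholder"


def build_text_request(url, text):
    """Build (but do not send) a POST request carrying UTF-8 plain text."""
    return urllib.request.Request(
        url,
        data=text.encode("utf-8"),
        # Assumption: the service accepts a plain-text body.
        headers={"Content-Type": "text/plain; charset=utf-8"},
        method="POST",
    )


def send(request, timeout=30):
    """Send the request and decode the body, assuming a JSON response."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


request = build_text_request(CLASSIFY_URL, "Jane Doe's phone number is 555-0100.")
print(request.get_method(), request.full_url)
# Call send(request) only after pointing CLASSIFY_URL at a running service.
```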

Using Notebooks for Classifying and Redacting unstructured documents

The notebooks demonstrate how to use the Data Discovery API with Python’s requests library to classify and redact sensitive information in unstructured text and tabular data. They submit sample data containing sensitive information to a local Data Discovery service for classification. They also show how the Transform API replaces detected PII entities with standardized labels, for example, [PERSON] or [SOCIAL_SECURITY_ID].
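The label replacement can be pictured with a small client-side sketch. This is not how the Transform API is implemented, and the (start, end, label) tuples are a hypothetical stand-in for whatever the service returns. The sketch only illustrates the idea of swapping detected spans for bracketed labels.

```python
def redact(text, entities):
    """Replace each (start, end, label) span in text with "[label]"."""
    pieces = []
    cursor = 0
    for start, end, label in sorted(entities):
        if start < cursor:
            # Skip a span that overlaps one already replaced.
            continue
        pieces.append(text[cursor:start])
        pieces.append(f"[{label}]")
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)


text = "John Smith's SSN is 123-45-6789."
name, ssn = "John Smith", "123-45-6789"

# Offsets are computed here for the demo; a real response would supply them.
entities = [
    (text.index(name), text.index(name) + len(name), "PERSON"),
    (text.index(ssn), text.index(ssn) + len(ssn), "SOCIAL_SECURITY_ID"),
]

redacted = redact(text, entities)
print(redacted)  # [PERSON]'s SSN is [SOCIAL_SECURITY_ID].
```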

  1. Ensure that JupyterLab is installed on your system.

  2. Navigate to the directory where AI Developer Edition is cloned.

  3. Run the following command to start JupyterLab.

    jupyter lab
    
  4. Copy the URL displayed and navigate to the site from a web browser. Ensure that localhost is replaced with the IP address of the system where the AI Developer Edition is set up.

  5. Open one of the following examples:

    • data-discovery/samples/jupyter/sample-classification-jupyter-text.ipynb
    • data-discovery/samples/jupyter/sample-classification-jupyter-tabular.ipynb
    • data-discovery/samples/jupyter/sample-redaction-jupyter-text.ipynb
  6. Run all cells and review the results interactively.

6 - Using the Data Discovery APIs

The various APIs of Data Discovery.

Data Discovery provides three types of API endpoints:

  • Classify to identify, classify, and locate sensitive data.
  • Transform to identify, classify, and transform sensitive data.
  • Common APIs, the standard operational endpoints available on the service.

For more information about Data Discovery APIs, refer to the complete body of the Data Discovery documentation.

7 - Uninstalling Data Discovery

Instructions for uninstalling the Data Discovery feature.

  1. Open a command prompt.

  2. Navigate to the cloned repository location.

  3. Uninstall Semantic Guardrails if it is installed. For complete instructions, refer to Uninstalling Semantic Guardrails.

  4. Navigate to the data-discovery directory.

    cd data-discovery
    
  5. Run the following command to remove the containers and images.

    docker compose down --rmi all