Data Discovery
Identify sensitive data across your organization using AI-powered scanning and classification.
Data Discovery identifies and classifies sensitive data across an organization's entire data estate. Its AI-powered scanning and classification show where sensitive data resides so that you can take appropriate action to protect it.
This documentation covers the requirements specific to Data Discovery and its relationship with AI Developer Edition. For more information, refer to the complete body of the Data Discovery documentation.
1 - Data Discovery Architecture
Architecture of the Data Discovery feature.
Data Discovery is a powerful, developer-friendly feature. For more information, refer to the complete body of the Data Discovery documentation.
Overview
The Data Discovery Text Classification service detects Personally Identifiable Information (PII), Protected Health Information (PHI), and Payment Card Information (PCI) in plain-text and free-text inputs. Unlike traditional tools built for structured data, it is designed for dynamic, unstructured sources such as chatbot conversations, call transcripts, and Generative AI (GenAI) outputs.
Architecture
For more information about the general architecture and operation of Data Discovery, refer to General architecture of Data Discovery.
2 - What's New
New features and enhancements of Data Discovery v2.0.0.
Data Discovery
- Standardized v2 APIs for Classify (text and tabular data) and Transform.
- New endpoints added for API docs, log level management, and version info.
- Improved Context Provider and Pattern Provider AI models.
- Updated Classify API default threshold to 0.7. The default threshold for v1.1 remains at 0.0 for compatibility.
- Added usage metrics and per‑language accuracy metrics.
- Extended PII detection to multiple Markdown dialects.
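The effect of the new default threshold can be illustrated with a minimal sketch. The detection records, field names, and the inclusive comparison below are assumptions made for illustration; the actual response schema and threshold semantics are defined in the Data Discovery API documentation.

```python
from typing import Dict, List

# Hypothetical detections; the real response schema may differ.
DETECTIONS: List[Dict] = [
    {"label": "PERSON", "score": 0.93},
    {"label": "PHONE_NUMBER", "score": 0.42},
    {"label": "EMAIL_ADDRESS", "score": 0.70},
]


def filter_by_threshold(detections: List[Dict], threshold: float = 0.7) -> List[Dict]:
    """Keep detections whose score is at or above the threshold.

    ASSUMPTION: the comparison is inclusive; the service may treat the
    boundary differently.
    """
    return [d for d in detections if d["score"] >= threshold]


# With the v2 default (0.7), low-confidence detections are dropped.
v2_style = filter_by_threshold(DETECTIONS)
# With the v1.1 default (0.0), every detection is kept.
v1_style = filter_by_threshold(DETECTIONS, threshold=0.0)

print([d["label"] for d in v2_style])
print([d["label"] for d in v1_style])
```

Clients that relied on the v1.1 behavior of receiving every detection may need to pass an explicit threshold when they move to the v2 Classify API.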
For more details, refer to What’s New in Data Discovery.
Major Changes
- Added Jupyter notebook examples:
  - data-discovery/samples/jupyter/sample-classification-jupyter-text.ipynb
  - data-discovery/samples/jupyter/sample-classification-jupyter-tabular.ipynb
  - data-discovery/samples/jupyter/sample-redaction-jupyter-text.ipynb
For more information on these examples, refer to Notebooks.
3 - Prerequisites for Data Discovery
Prerequisites for the Data Discovery feature.
Ensure that the following prerequisites are met before running these examples for Data Discovery:
- Docker CLI, Docker Compose, and Python are installed. For more information, refer to AI Developer Edition, Pre-requisites Guide.
- For shell samples: Bash version greater than or equal to 5.1.8 and curl version greater than or equal to 7.76.1.
- For notebook samples: JupyterLab version greater than or equal to 4.5.6.
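The minimum versions listed above can be checked with a short script. This is a minimal sketch, not part of AI Developer Edition: the helper names are hypothetical, and it assumes that each tool prints a dotted version number when run with --version. The JupyterLab version can be checked separately with jupyter lab --version.

```python
import re
import shutil
import subprocess
from typing import Optional, Tuple

# Minimum versions taken from the prerequisites list.
MINIMUMS = {"bash": (5, 1, 8), "curl": (7, 76, 1)}


def parse_version(text: str) -> Optional[Tuple[int, ...]]:
    """Return the first dotted version number in text as a tuple of ints."""
    match = re.search(r"(\d+)\.(\d+)(?:\.(\d+))?", text)
    if match is None:
        return None
    # A missing patch component is treated as 0.
    return tuple(int(part) for part in match.groups(default="0"))


def meets_minimum(found: Tuple[int, ...], minimum: Tuple[int, ...]) -> bool:
    """Tuples compare element by element, so (5, 2, 0) >= (5, 1, 8)."""
    return found >= minimum


def check_tool(name: str, minimum: Tuple[int, ...]) -> str:
    """Run `<name> --version` and report whether the minimum is met."""
    path = shutil.which(name)
    if path is None:
        return f"{name}: not found on PATH"
    try:
        result = subprocess.run(
            [path, "--version"], capture_output=True, text=True, timeout=10
        )
    except (OSError, subprocess.SubprocessError) as error:
        return f"{name}: could not be run ({error})"
    found = parse_version(result.stdout)
    if found is None:
        return f"{name}: version could not be determined"
    status = "OK" if meets_minimum(found, minimum) else "too old"
    return f"{name}: {'.'.join(map(str, found))} ({status})"


for tool, minimum in MINIMUMS.items():
    print(check_tool(tool, minimum))
```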
4 - Setting up Data Discovery
Installation instructions for the Data Discovery feature.
Use the containers to set up the Data Discovery components required for identifying sensitive data.
Open a command prompt.
Navigate to the cloned repository location for protegrity-ai-developer-edition.
Run the following commands to download and start the containers. The dependent containers are large, so depending on your network connection they might take some time to download and deploy.
cd data-discovery
docker compose up -d
Depending on your Docker installation, you might need to use the docker-compose up -d command instead. Ensure that you bring down the containers using docker compose down before switching between starting only the Data Discovery containers and starting both the Data Discovery and Semantic Guardrails containers.
Note: By default, images are pulled from ghcr.io. To pull images from public.ecr.aws instead, navigate to the data-discovery directory and copy the .env.example file to .env. Open the .env file and uncomment the REGISTRY=public.ecr.aws/protegrity-ai-developer-edition line. Save the file and run the docker compose up -d command to download and start the containers.
Verify that the containers started successfully.
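One way to script this check is to parse the output of docker compose ps --all --format json. The sketch below is a minimal illustration, not part of the product: the container names in the sample are hypothetical, and it assumes that each entry carries Name and State fields, as recent Docker Compose releases emit.

```python
import json
from typing import Dict, List


def parse_compose_ps(output: str) -> List[Dict]:
    """Parse the output of `docker compose ps --format json`.

    Depending on the Compose version, the output is either a single JSON
    array or one JSON object per line; both layouts are handled here.
    """
    output = output.strip()
    if not output:
        return []
    if output.startswith("["):
        return json.loads(output)
    return [json.loads(line) for line in output.splitlines() if line.strip()]


def not_running(containers: List[Dict]) -> List[str]:
    """Return the names of containers whose state is not 'running'."""
    return [c.get("Name", "?") for c in containers if c.get("State") != "running"]


# Sample output with hypothetical container names.
SAMPLE = (
    '{"Name": "classification_service", "State": "running"}\n'
    '{"Name": "pattern_provider", "State": "exited"}'
)

problems = not_running(parse_compose_ps(SAMPLE))
print(problems or "all containers are running")
```

To check a live deployment, run the docker compose ps command from the data-discovery directory, for example with subprocess.run and capture_output=True, and pass its standard output to parse_compose_ps. The --all flag matters because exited containers are otherwise omitted from the listing.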
Set up Jupyter for the provided notebooks by installing the requirements from the cloned repository location for protegrity-ai-developer-edition.
pip install -r shared/requirements.txt
Open a command prompt.
Navigate to the cloned repository location for protegrity-ai-developer-edition.
If the containers were not stopped earlier, use the following command to stop and remove the AI Developer Edition containers.
docker compose down --remove-orphans
Delete the Docker network resources.
docker network rm -f <network_name_or_id>
For example,
docker network rm -f protegrity-network
Run the following commands to download and start the containers. The dependent containers are large, so depending on your network connection they might take some time to download and deploy.
cd data-discovery
docker compose up -d
Depending on your Docker installation, you might need to use the docker-compose up -d command instead. Ensure that you bring down the containers using docker compose down before switching between starting only the Data Discovery containers and starting both the Data Discovery and Semantic Guardrails containers.
Verify that the containers started successfully.
Set up Jupyter for the provided notebooks by installing the requirements from the cloned repository location for protegrity-ai-developer-edition.
pip install -r shared/requirements.txt
5 - Running the Data Discovery samples
Instructions for running the Data Discovery samples.
Use the information in this section to run the Data Discovery samples provided in the data-discovery/samples folder. These samples demonstrate how to use the Data Discovery API for classification and redaction of sensitive information in text and tabular data.
Running Data Discovery
The example scripts under the data-discovery/ folder demonstrate classification and redaction using the Data Discovery v2 API. For more information about the Data Discovery APIs, refer to the section Data Discovery APIs.
Note: A dedicated data-discovery/docker-compose.yml is provided to start only the Data Discovery service.
Open a command prompt.
Navigate to the directory where AI Developer Edition is cloned.
Launch the Data Discovery services. For setup instructions, refer to the docker compose setup page.
Run any of the example scripts in the data-discovery/ directory. The paths below are relative to the cloned repository location:
Classification - text input
python data-discovery/samples/python/sample-classification-python-text.py
bash data-discovery/samples/bash/sample-classification-bash-text.sh
Classification - tabular (CSV) input
python data-discovery/samples/python/sample-classification-python-tabular.py
bash data-discovery/samples/bash/sample-classification-bash-tabular.sh
Redaction
python data-discovery/samples/python/sample-redaction-python.py
bash data-discovery/samples/bash/sample-redaction-bash.sh
View the output on the screen. It shows the classification labels or the redacted text returned by the Data Discovery service.
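For orientation, the kind of request that the classification samples send can be sketched as follows. This is a hedged illustration only: the URL, port, path, and content type are placeholders assumed for the example, not values confirmed by this documentation, and the sketch uses the standard library rather than the requests package to stay dependency-free. Take the real endpoint and payload format from the Data Discovery APIs section or from the sample scripts.

```python
import urllib.request

# ASSUMPTION: host, port, and path are placeholders, not the documented
# endpoint. Replace them with the values from the Data Discovery API docs.
CLASSIFY_URL = "http://localhost:8050/pty/data-discovery/v2/classify"


def build_classify_request(text: str, url: str = CLASSIFY_URL) -> urllib.request.Request:
    """Build (but do not send) a POST request carrying plain text.

    ASSUMPTION: the service accepts a text/plain body; the real API may
    expect a different content type or a JSON envelope.
    """
    return urllib.request.Request(
        url,
        data=text.encode("utf-8"),
        headers={"Content-Type": "text/plain"},
        method="POST",
    )


request = build_classify_request("My name is Jane Doe and my phone number is 555-0100.")
print(request.get_method(), request.full_url)

# To send the request once the containers are running:
# with urllib.request.urlopen(request, timeout=30) as response:
#     print(response.read().decode("utf-8"))
```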
Using Notebooks for Classifying and Redacting unstructured documents
The notebooks demonstrate how to use the Data Discovery API with Python’s requests library to classify and redact sensitive information in unstructured text and tabular data. They submit sample data containing sensitive information to a local Data Discovery service for classification. They also show how the Transform API replaces detected PII entities with standardized labels, for example, [PERSON] or [SOCIAL_SECURITY_ID].
Ensure that Jupyter is installed on your system.
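The label replacement described above can be illustrated locally. The sketch below is not the Transform API implementation; it only shows the idea of substituting detected spans with bracketed labels, using hypothetical span offsets of the kind a classification step might return.

```python
from typing import List, Tuple

# (start, end, label) with an exclusive end offset; a hypothetical shape.
Span = Tuple[int, int, str]


def redact(text: str, spans: List[Span]) -> str:
    """Replace each span with its bracketed label.

    Spans are processed from right to left so that replacing one span
    does not shift the offsets of spans that come earlier in the text.
    Assumes the spans do not overlap.
    """
    for start, end, label in sorted(spans, reverse=True):
        text = text[:start] + f"[{label}]" + text[end:]
    return text


sample = "Jane Doe's SSN is 123-45-6789."
spans = [(0, 8, "PERSON"), (18, 29, "SOCIAL_SECURITY_ID")]
redacted = redact(sample, spans)
print(redacted)
```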
Navigate to the directory where AI Developer Edition is cloned.
Start JupyterLab, for example by running the jupyter lab command.
Copy the URL displayed and navigate to the site from a web browser. Ensure that localhost is replaced with the IP address of the system where the AI Developer Edition is set up.
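The host substitution can also be done programmatically. This is a small sketch under stated assumptions: the URL, token, and IP address below are made-up examples, and replace_host is a hypothetical helper name.

```python
from urllib.parse import urlsplit, urlunsplit


def replace_host(url: str, new_host: str) -> str:
    """Swap the host in a URL while keeping the port, path, and query."""
    parts = urlsplit(url)
    netloc = new_host if parts.port is None else f"{new_host}:{parts.port}"
    return urlunsplit(parts._replace(netloc=netloc))


# Example values only: the token and the IP address are placeholders.
jupyter_url = "http://localhost:8888/lab?token=abc123"
remote_url = replace_host(jupyter_url, "192.0.2.10")
print(remote_url)
```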
Open one of the examples:
data-discovery/samples/jupyter/sample-classification-jupyter-text.ipynb
data-discovery/samples/jupyter/sample-classification-jupyter-tabular.ipynb
data-discovery/samples/jupyter/sample-redaction-jupyter-text.ipynb
Run all cells to view the results interactively.
6 - Using the Data Discovery APIs
The various APIs of Data Discovery.
Data Discovery has three types of API endpoints:
- Classify to identify, classify, and locate sensitive data.
- Transform to identify, classify, and transform sensitive data.
- Common APIs, the standard operational endpoints available on the service.
For more information about Data Discovery APIs, refer to the complete body of the Data Discovery documentation.
7 - Uninstalling Data Discovery
Instructions for uninstalling the Data Discovery feature.
Open a command prompt.
Navigate to the cloned repository location.
Uninstall Semantic Guardrails if it is installed. For complete instructions, refer to Uninstalling Semantic Guardrails.
Navigate to the data-discovery directory.
Run the following command to remove the containers and images.
docker compose down --rmi all