Appendix
1 - Input Sanitization
The Classification service in Data Discovery includes a security feature that rejects unsanitized data. Data is considered unsanitized if it is malformed or non-normalized, or if it contains homoglyphs, hieroglyphs, mixed Unicode variants, or control characters. Such data is rejected for classification.
The following are a few examples of data that will be rejected:
- Ⅷ
- 𝓉𝑒𝓍𝓉
- Pep
Before invoking the Classification endpoint, ensure that the input text is normalized. Replace invalid characters with their corresponding normalized plaintext characters. If the input text contains any invalid character, the service returns status code 422 with the message Untrusted input.
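The normalization step can be sketched in Python with the standard unicodedata module. This is a minimal sketch, not the service's own validation logic: it assumes that NFKC normalization plus removal of control characters is enough for the input at hand, and the function name normalize_text is hypothetical. NFKC folds compatibility characters, such as Roman numeral symbols and mathematical alphanumeric letters, to plain characters, but it does not map cross-script homoglyphs (for example, Cyrillic letters that resemble Latin ones), so such input may still be rejected.

```python
import unicodedata


def normalize_text(text: str) -> str:
    """Return an NFKC-normalized copy of text with control characters removed.

    Hypothetical helper: the service's exact acceptance rules are not
    documented here, so treat this as a best-effort pre-processing step.
    """
    normalized = unicodedata.normalize("NFKC", text)
    # Drop characters in the Unicode "Cc" (control) category. Keeping
    # newline and tab is an assumption; remove them too if they are rejected.
    return "".join(
        ch for ch in normalized
        if unicodedata.category(ch) != "Cc" or ch in "\n\t"
    )


if __name__ == "__main__":
    print(normalize_text("\u2167"))    # Roman numeral eight as one code point
    print(normalize_text("te\x00xt"))  # embedded NUL control character
```

Cross-script homoglyphs need a separate mapping or a manual review, because no Unicode normalization form converts between scripts.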
For security purposes, the application rejects unsanitized data by default. It is recommended that this feature remains enabled. However, to override this feature, perform the following steps.
Navigate to the docker_compose directory.
Edit the docker-compose.yaml file.
Under the environment section of classification_service, append the security parameter as follows.
- SECURITY_SETTINGS={"ENABLE_ALL_SECURITY_CONTROLS":false}
Save the changes.
Run the docker compose down command to undeploy the application.
Run the docker compose up command to redeploy the application.
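The value of SECURITY_SETTINGS in the steps above is a JSON object, so a stray quote or a Python-style False could plausibly make it unparsable. As a rough pre-check before redeploying, the sketch below splits the KEY=value entry and confirms that the value is valid JSON. The helper name parse_env_entry is hypothetical, and how the service itself reads the variable is an assumption.

```python
import json


def parse_env_entry(entry: str) -> tuple[str, dict]:
    """Split a KEY=value environment entry and parse the value as JSON.

    Hypothetical helper for checking the entry before editing the
    compose file; it does not touch Docker or the service.
    """
    key, _, raw_value = entry.partition("=")
    value = json.loads(raw_value)  # raises json.JSONDecodeError if malformed
    if not isinstance(value, dict):
        raise ValueError(f"{key} must be a JSON object")
    return key, value


if __name__ == "__main__":
    entry = 'SECURITY_SETTINGS={"ENABLE_ALL_SECURITY_CONTROLS":false}'
    print(parse_env_entry(entry))
```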
2 - Working with the Data Discovery containers
Use Data Discovery by setting up and deploying the containers.
2.1 - Understanding the Docker Compose File
The following variables can be configured in the docker-compose.yml file.
| Variable | Description | Mandatory |
|---|---|---|
| networks:name | Specify the name of the Docker network. | No |
| services:environment | Specify the location for the logs in the logging_config parameter. | No |
| classification_service:ports | Specify the listening port for the classification service. By default, the port is set to 8580. | No |
2.2 - Deploying the Application
Ensure that the prerequisites are completed before deploying the application.
Run the following steps to deploy the Data Discovery application on Docker.
Open a command prompt.
Navigate to the AI Developer Edition package directory.
Run the command to start the containers. For example, the following command starts the Classification service container.
docker compose up -d
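After the containers start, a quick way to confirm that the Classification service is listening is to open a TCP connection to its port. The sketch below assumes the service is reachable on localhost and uses the default port 8580 listed in section 2.1; adjust both if your compose file says otherwise. The function name port_is_open is hypothetical, and a successful connection only shows that something is listening, not that the service is healthy.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolved hosts.
        return False


if __name__ == "__main__":
    # Assumed host and default port; adjust to match your deployment.
    print(port_is_open("localhost", 8580))
```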
3 - Supported Sensitive Entity Types
| Entity Name | Data Element | Description |
|---|---|---|
| ABA_ROUTING_NUMBER | number | Routing number used to identify financial institutions in the United States. |
| ACCOUNT_NAME | string | Name associated with a financial account. |
| ACCOUNT_NUMBER | number | Bank account number used to identify financial accounts. |
| AGE | number | Age information used to identify individuals. |
| AMOUNT | int | Specific amount of money, which can be linked to financial transactions. |
| AU_ABN | number | Australian Business Number used to identify businesses in Australia. |
| AU_ACN | number | Australian Company Number used to identify businesses in Australia. |
| AU_MEDICARE | number | Medicare number used to identify individuals for healthcare services in Australia. |
| AU_TFN | number | Tax File Number used to identify taxpayers in Australia. |
| BIC | number | Bank Identifier Code used to identify financial institutions. |
| BITCOIN_ADDRESS | address | Bitcoin wallet address used for digital transactions. |
| BUILDING | address | Building information used to identify specific locations. |
| CITY | city | City information used to identify geographic locations. |
| COMPANY_NAME | string | Name of a company used to identify businesses. |
| COUNTRY | string | Country information used to identify geographic locations. |
| COUNTY | string | County information used to identify geographic locations. |
| CREDIT_CARD | ccn | Credit card number used for financial transactions. |
| CREDIT_CARD_CVV | number | Card Verification Value used to secure credit card transactions. |
| CRYPTO | address | Cryptocurrency wallet address used for digital transactions. |
| CURRENCY | string | Currency information used in financial transactions. |
| CURRENCY_CODE | string | Code representing currency used in financial transactions. |
| CURRENCY_NAME | string | Name of currency used in financial transactions. |
| CURRENCY_SYMBOL | string | Symbol representing currency, sometimes linked to financial transactions. |
| DATE | datetime | Specific date that can be linked to personal activities. |
| DATE_OF_BIRTH | datetime | Date of birth used to identify individuals. |
| DATE_TIME | datetime | Specific date and time that can be linked to personal activities. |
| DRIVER_LICENSE | number | Driver’s license number used to identify individuals. |
| EMAIL_ADDRESS | | Email address used for communication and identification. |
| ES_NIE | nin | Foreigner Identification Number used to identify non-residents in Spain. |
| ES_NIF | nin | Tax Identification Number used to identify taxpayers in Spain. |
| ETHEREUM_ADDRESS | address | Ethereum wallet address used for digital transactions. |
| FI_PERSONAL_IDENTITY_CODE | nin | Personal identity code used to identify individuals in Finland. |
| GENDER | string | Gender information used to identify individuals. |
| GEO_CCORDINATE | address | Geographic coordinates used to identify specific locations. |
| IBAN_CODE | iban | International Bank Account Number used to identify bank accounts globally. |
| ID_CARD | number | Identity card number used to identify individuals. |
| IN_AADHAAR | nin | Unique identification number used to identify residents in India. |
| IN_PAN | number | Permanent Account Number used to identify taxpayers in India. |
| IN_PASSPORT | passport | Passport number used to identify individuals in India. |
| IN_VEHICLE_REGISTRATION | number | Vehicle registration number used to identify vehicles in India. |
| IN_VOTER | number | Voter ID number used to identify registered voters in India. |
| IP_ADDRESS | address | Internet Protocol address used to identify devices on a network. |
| IPV4 | address | IPv4 address used to identify devices on a network. |
| IPV6 | address | IPv6 address used to identify devices on a network. |
| IT_DRIVER_LICENSE | number | Driver’s license number used to identify individuals in Italy. |
| IT_FISCAL_CODE | nin | Fiscal code used to identify taxpayers in Italy. |
| IT_IDENTITY_CARD | number | Identity card number used to identify individuals in Italy. |
| IT_PASSPORT | passport | Passport number used to identify individuals in Italy. |
| LITECOIN_ADDRESS | address | Litecoin wallet address used for digital transactions. |
| LOCATION | address | Specific location or address that can be linked to an individual. |
| MAC | address | Media Access Control address used to identify devices on a network. |
| MEDICAL_LICENSE | number | License number used to identify medical professionals. |
| NRP | number | A person’s nationality, religious or political group. |
| ORGANIZATION | string | Name or identifier used to identify an organization. |
| PASSPORT | passport | Passport number used to identify individuals. |
| PASSWORD | string | Password used to secure access to personal accounts. |
| PERSON | string | Name or identifier used to identify an individual. |
| PHONE_NUMBER | phone | Number used to contact or identify an individual. |
| PIN | number | Personal Identification Number used to secure access to accounts. |
| PL_PESEL | nin | Personal Identification Number used to identify individuals in Poland. |
| SECONDARYADDRESS | address | Additional address information used to identify locations. |
| SG_NRIC_FIN | nin | National Registration Identity Card number used to identify residents in Singapore. |
| SG_UEN | number | Unique Entity Number used to identify businesses in Singapore. |
| SOCIAL_SECURITY_NUMBER | ssn | Social Security Number used to identify individuals. |
| STATE | string | State information used to identify geographic locations. |
| STREET | address | Street address used to identify specific locations. |
| TIME | datetime | Specific time that can be linked to personal activities. |
| TITLE | string | Title or honorific used to identify individuals. |
| UK_NHS | number | National Health Service number used to identify individuals for healthcare services in the United Kingdom. |
| URL | address | Web address that can sometimes contain personal information. |
| US_BANK_NUMBER | number | Bank account number used to identify financial accounts in the United States. |
| US_DRIVER_LICENSE | number | Driver’s license number used to identify individuals in the United States. |
| US_ITIN | number | Individual Taxpayer Identification Number used to identify taxpayers in the United States. |
| US_PASSPORT | passport | Passport number used to identify individuals in the United States. |
| US_SSN | ssn | Social Security Number used to identify individuals in the United States. |
| USERNAME | string | Username used to identify individuals in online systems. |
| ZIP_CODE | zipcode | Postal code used to identify specific geographic areas. |
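The table above pairs each entity type with a data element from the Data Security Policy section. One way to use that pairing in client code is a simple lookup, sketched below with a handful of rows copied from the table. The dictionary name, the helper data_element_for, and the choice to return None for unlisted entities are assumptions for illustration, not part of the product API.

```python
from typing import Optional

# A few rows copied from the table above (entity name -> data element).
# Extend this mapping with the remaining rows as needed.
ENTITY_TO_DATA_ELEMENT = {
    "CREDIT_CARD": "ccn",
    "DATE_OF_BIRTH": "datetime",
    "IBAN_CODE": "iban",
    "PHONE_NUMBER": "phone",
    "US_SSN": "ssn",
    "ZIP_CODE": "zipcode",
}


def data_element_for(entity_name: str) -> Optional[str]:
    """Return the data element for an entity type, or None if unlisted."""
    return ENTITY_TO_DATA_ELEMENT.get(entity_name.upper())


if __name__ == "__main__":
    print(data_element_for("CREDIT_CARD"))
    print(data_element_for("UNKNOWN_ENTITY"))
```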
4 - Data Security Policy
This section describes the Policy configuration used by the AI Developer Edition API Service.
The superuser has all permissions, that is, protect, unprotect, and reprotect operations. Users assigned the admin role will receive protected data when performing an unprotect operation, except in the case of the text data elements, which will return null. All other user roles will receive null as the output for any unprotect operation.
Policy Definition
Generic Data Elements
| Data Element | Method | Use Case | UTF Set | LP | PP | eIV | Admin P | Admin U | Finance P | Finance U | Marketing P | Marketing U | HR P | HR U |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| datetime | Tokenization | A date or datetime string. Formats accepted: YYYY/MM/DD HH:MM:SS and YYYY/MM/DD. Delimiters accepted: /, - (required). | N/A | N/A | N/A | No | ✓ | X | X | X | X | ✓ | X | X |
| datetime_yc | Tokenization | A date or datetime string. Formats accepted: YYYY/MM/DD HH:MM:SS and YYYY/MM/DD. Delimiters accepted: /, - (required). Leaves the year in the clear. | N/A | N/A | N/A | No | ✓ | X | X | X | X | ✓ | X | X |
| int | Tokenization | An integer string (4 bytes). | Numeric | No | No | Yes | ✓ | X | X | X | X | ✓ | X | X |
| number | Tokenization | A numeric string. May produce leading zeroes. | Numeric | Yes | No | Yes | ✓ | X | X | X | X | ✓ | X | X |
| string | Tokenization | An alphanumeric string. | Latin + Numeric | Yes | No | Yes | ✓ | X | X | X | X | ✓ | X | X |
| text | Encryption | A long string (e.g., a comment field) using any character set. Use hex or base64 encoding to utilize. | All | No | No | Yes | ✓ | X | X | X | X | ✓ | X | X |
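The datetime and datetime_yc rows accept only two layouts, with / or - as the delimiter, and the text row expects hex or base64 encoded input. A client-side pre-check might look like the sketch below. The helper names are hypothetical, and two points are assumptions: that a single value uses the same delimiter throughout, and that UTF-8 is the appropriate byte encoding before base64.

```python
import base64
from datetime import datetime

# The two documented layouts, each with "/" or "-" as the delimiter.
_ACCEPTED_FORMATS = (
    "%Y/%m/%d %H:%M:%S",
    "%Y/%m/%d",
    "%Y-%m-%d %H:%M:%S",
    "%Y-%m-%d",
)


def is_accepted_datetime(value: str) -> bool:
    """Return True if value parses with one of the documented layouts.

    Note: strptime also accepts months and days without a leading zero,
    so this check is slightly more lenient than a strict pattern match.
    """
    for fmt in _ACCEPTED_FORMATS:
        try:
            datetime.strptime(value, fmt)
            return True
        except ValueError:
            continue
    return False


def encode_text(value: str) -> str:
    """Base64-encode a string for use with the text data element."""
    return base64.b64encode(value.encode("utf-8")).decode("ascii")


if __name__ == "__main__":
    print(is_accepted_datetime("2024/01/31 12:30:00"))
    print(encode_text("hello"))
```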
PCI DSS Data Elements
| Data Element | Method | Use Case | UTF Set | LP | PP | eIV | Admin P | Admin U | Finance P | Finance U | Marketing P | Marketing U | HR P | HR U |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ccn | Tokenization | Credit card numbers. | Numeric | No | No | Yes | ✓ | X | X | ✓ | X | X | X | ✓ |
| ccn_bin | Tokenization | Credit card numbers. Leaves 8-digit BIN in the clear. | Numeric | No | No | Yes | ✓ | X | X | ✓ | X | X | X | ✓ |
| iban | Tokenization | IBAN numbers. Preserves the length, case, and position of the input characters but may create invalid IBAN codes. | Latin + Numeric | Yes | Yes | No | ✓ | X | X | ✓ | X | X | X | ✓ |
| iban_cc | Tokenization | IBAN numbers. Leaves letters in the clear. | Latin + Numeric | No | No | Yes | ✓ | X | X | ✓ | X | X | X | ✓ |
PII Data Elements
| Data Element | Method | Use Case | UTF Set | LP | PP | eIV | Admin P | Admin U | Finance P | Finance U | Marketing P | Marketing U | HR P | HR U |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| address | Tokenization | Street names | Latin + Numeric | Yes | No | Yes | ✓ | X | X | ✓ | X | X | X | ✓ |
| city | Tokenization | Town or city name | Latin | Yes | No | Yes | ✓ | X | X | ✓ | X | ✓ | X | ✓ |
| | Tokenization | Email address. Leaves the domain in the clear. | Latin + Numeric | Yes | No | Yes | ✓ | X | X | ✓ | X | ✓ | X | ✓ |
| nin | Tokenization | National Insurance Number. Preserves the length, case, and position of the input characters but may create invalid NIN codes. | Latin + Numeric | Yes | Yes | No | ✓ | X | X | X | X | X | X | X |
| name | Tokenization | Person's name | Latin | Yes | No | Yes | ✓ | X | X | ✓ | X | ✓ | X | ✓ |
| passport | Tokenization | Passport codes. Preserves the length, case, and position of the input characters but may create invalid passport numbers. | Latin + Numeric | Yes | Yes | No | ✓ | X | X | X | X | X | X | X |
| phone | Tokenization | Phone number. May produce leading zeroes. | Latin + Numeric | Yes | No | Yes | ✓ | X | X | X | X | X | X | X |
| postcode | Tokenization | Postal codes with digits and characters. Preserves the length, case, and position of the input characters but may create invalid postal codes. | Latin + Numeric | Yes | Yes | No | ✓ | X | X | ✓ | X | ✓ | X | ✓ |
| ssn | Tokenization | Social Security Number (US) | Latin + Numeric | Yes | No | Yes | ✓ | X | X | X | X | X | X | X |
| zipcode | Tokenization | Zip codes with digits only. May produce leading zeroes. | Numeric | Yes | No | Yes | ✓ | X | X | ✓ | X | ✓ | X | ✓ |
PII Data Elements
| Data Element | Method | Use Case | UTF Set | LP | PP | eIV | Admin P | Admin U | Finance P | Finance U | Marketing P | Marketing U | HR P | HR U |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| address_de | Tokenization | Street names (German) | Latin + German + Numeric | Yes | No | Yes | ✓ | X | X | ✓ | X | X | X | ✓ |
| address_fr | Tokenization | Street names (French) | Latin + French + Numeric | Yes | No | Yes | ✓ | X | X | ✓ | X | X | X | ✓ |
| city_de | Tokenization | Town or city name (German) | Latin + German | Yes | No | Yes | ✓ | X | X | ✓ | X | ✓ | X | ✓ |
| city_fr | Tokenization | Town or city name (French) | Latin + French | Yes | No | Yes | ✓ | X | X | ✓ | X | ✓ | X | ✓ |
| name_de | Tokenization | Person's name (German) | Latin + German | Yes | No | Yes | ✓ | X | X | ✓ | X | ✓ | X | ✓ |
| name_fr | Tokenization | Person's name (French) | Latin + French | Yes | No | Yes | ✓ | X | X | ✓ | X | ✓ | X | ✓ |
LEGEND
- eIV: External IV
- LP: Length Preservation
- PP: Position Preservation
- P: User group can protect data
- U: User group can unprotect data
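Read together with the legend, each row of the policy tables states, per role, whether protect (P) and unprotect (U) are allowed. The sketch below encodes two rows, ccn and datetime, as transcribed from the tables above, and looks up a permission. The data structure and the is_allowed helper are hypothetical and only illustrate how to read the tables; the policy itself is enforced by the API service.

```python
# Two rows transcribed from the policy tables (True = allowed, False = X).
# Column order per role is P (protect) then U (unprotect).
POLICY = {
    "ccn": {
        "Admin": {"P": True, "U": False},
        "Finance": {"P": False, "U": True},
        "Marketing": {"P": False, "U": False},
        "HR": {"P": False, "U": True},
    },
    "datetime": {
        "Admin": {"P": True, "U": False},
        "Finance": {"P": False, "U": False},
        "Marketing": {"P": False, "U": True},
        "HR": {"P": False, "U": False},
    },
}


def is_allowed(data_element: str, role: str, operation: str) -> bool:
    """Return True if role may perform operation ("P" or "U") on data_element.

    Unknown data elements, roles, or operations are treated as not allowed.
    """
    return POLICY.get(data_element, {}).get(role, {}).get(operation, False)


if __name__ == "__main__":
    print(is_allowed("ccn", "Finance", "U"))
    print(is_allowed("datetime", "HR", "P"))
```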
5 - Removing AI Developer Edition
Open a command prompt.
Navigate to the cloned repository location.
Run the following command to remove the containers and images.
docker compose down --rmi all
Run the following command to remove the Python module.
pip uninstall protegrity-developer-python
6 - Known Issues
Issue: SSL errors in the Data Discovery container
Description: The tldextract library tries to download the following Public Suffix List files:
- https://publicsuffix.org/list/public_suffix_list.dat
- https://raw.githubusercontent.com/publicsuffix/list/master/public_suffix_list.dat
When these lists cannot be downloaded, the default files included in the package are used, and no issue is observed in the classification.