1 - Input Sanitization

Rejecting unsanitized data.

The Classification service in Data Discovery includes a security feature that rejects unsanitized data. Data is considered unsanitized if it is malformed or non-normalized, or if it contains homoglyphs, hieroglyphs, mixed Unicode variants, or control characters. Such data is rejected for classification.

The following are a few examples of data that will be rejected:

  • 𝓉𝑒𝓍𝓉
  • Pep

Before invoking the Classification endpoint, ensure that the input text is normalized. Replace invalid characters with their corresponding normalized plaintext characters. If the input text contains any invalid character, the service returns a status code of 422 with the message Untrusted input.
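As a minimal sketch of the normalization step, the following Python snippet applies Unicode NFKC normalization and removes control and format characters using only the standard library. The function name sanitize_text is hypothetical, and the exact rules enforced by the Classification service are not listed in this section, so treat this as a starting point rather than a guarantee that the result will be accepted. Note that NFKC does not resolve cross-script homoglyphs, such as a Cyrillic letter that resembles a Latin one.

```python
import unicodedata


def sanitize_text(text: str) -> str:
    """Return an NFKC-normalized copy of text without control or format characters.

    Hypothetical helper: the service's own validation rules may be stricter.
    """
    # NFKC folds compatibility variants (for example, mathematical script
    # letters) into their plain equivalents.
    normalized = unicodedata.normalize("NFKC", text)
    kept = []
    for char in normalized:
        # Keep ordinary line breaks and tabs; drop other control (Cc) and
        # invisible format (Cf) characters such as zero-width spaces.
        if char in "\n\t" or unicodedata.category(char) not in ("Cc", "Cf"):
            kept.append(char)
    return "".join(kept)


# The styled sample from the list of rejected inputs, written with escapes.
styled = "\U0001d4c9\U0001d452\U0001d4cd\U0001d4c9"
print(sanitize_text(styled))
```

For the styled sample, the snippet prints the plain word text.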

For security purposes, the application rejects unsanitized data by default. It is recommended to keep this feature enabled. However, to disable it, perform the following steps.

  1. Navigate to the docker_compose directory.

  2. Edit the docker-compose.yaml file.

  3. Under the environment section of classification_service, append the security parameter as follows.

    - SECURITY_SETTINGS={"ENABLE_ALL_SECURITY_CONTROLS":false}

  4. Save the changes.

  5. Run the docker compose down command to undeploy the application.

  6. Run the docker compose up command to redeploy the application.
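The value of SECURITY_SETTINGS appears to be a JSON object, so the boolean is written in lowercase. As a small illustration, the following Python snippet builds the same string as the security parameter in step 3. The variable names are chosen only for this example.

```python
import json

# Python's False serializes to JSON's lowercase false.
settings = {"ENABLE_ALL_SECURITY_CONTROLS": False}

# Compact separators avoid spaces, matching the form used in the compose file.
value = json.dumps(settings, separators=(",", ":"))
env_entry = f"SECURITY_SETTINGS={value}"
print(env_entry)
```

Generating the value this way avoids hand-typing mistakes such as writing False with a capital letter, which is not valid JSON.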

2 - Working with the Data Discovery containers

Using the Data Discovery containers.

Use Data Discovery by setting up and deploying the containers.

2.1 - Understanding the Docker Compose File

Details of the configurable parameters in the docker-compose.yml file.

The following variables can be configured in the docker-compose.yml file.

| Variable | Description | Mandatory |
| --- | --- | --- |
| networks:name | Specify the name of the Docker network. | No |
| services:environment | Specify the location for the logs in the logging_config parameter. | No |
| classification_service:ports | Specify the listening port for the classification service. By default, the port is set to 8580. | No |

2.2 - Deploying the Application

Deploying the Data Discovery container.

Ensure that the prerequisites are completed before deploying the application.

Perform the following steps to deploy the Data Discovery application on Docker.

  1. Open a command prompt.

  2. Navigate to the AI Developer Edition package directory.

  3. Run the command to start the containers. For example, the following command starts the Classification service container.

    docker compose up -d
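After the containers start, it can be useful to confirm that the Classification service is listening. The following Python sketch checks whether a TCP port accepts connections. The helper name is_port_open is hypothetical, localhost is an assumption about where Docker is running, and 8580 is the default port mentioned in the compose file description.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # 8580 is the documented default port; localhost is an assumption.
    print(is_port_open("localhost", 8580))
```

A successful connection shows only that something is listening on the port. It does not confirm that the service is ready to classify data.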

3 - Supported Sensitive Entity Types

PII entities supported by Protegrity AI Developer Edition.
| Entity Name | Data Element | Description |
| --- | --- | --- |
| ABA_ROUTING_NUMBER | number | Routing number used to identify financial institutions in the United States. |
| ACCOUNT_NAME | string | Name associated with a financial account. |
| ACCOUNT_NUMBER | number | Bank account number used to identify financial accounts. |
| AGE | number | Age information used to identify individuals. |
| AMOUNT | int | Specific amount of money, which can be linked to financial transactions. |
| AU_ABN | number | Australian Business Number used to identify businesses in Australia. |
| AU_ACN | number | Australian Company Number used to identify businesses in Australia. |
| AU_MEDICARE | number | Medicare number used to identify individuals for healthcare services in Australia. |
| AU_TFN | number | Tax File Number used to identify taxpayers in Australia. |
| BIC | number | Bank Identifier Code used to identify financial institutions. |
| BITCOIN_ADDRESS | address | Bitcoin wallet address used for digital transactions. |
| BUILDING | address | Building information used to identify specific locations. |
| CITY | city | City information used to identify geographic locations. |
| COMPANY_NAME | string | Name of a company used to identify businesses. |
| COUNTRY | string | Country information used to identify geographic locations. |
| COUNTY | string | County information used to identify geographic locations. |
| CREDIT_CARD | ccn | Credit card number used for financial transactions. |
| CREDIT_CARD_CVV | number | Card Verification Value used to secure credit card transactions. |
| CRYPTO | address | Cryptocurrency wallet address used for digital transactions. |
| CURRENCY | string | Currency information used in financial transactions. |
| CURRENCY_CODE | string | Code representing currency used in financial transactions. |
| CURRENCY_NAME | string | Name of currency used in financial transactions. |
| CURRENCY_SYMBOL | string | Symbol representing currency, sometimes linked to financial transactions. |
| DATE | datetime | Specific date that can be linked to personal activities. |
| DATE_OF_BIRTH | datetime | Date of birth used to identify individuals. |
| DATE_TIME | datetime | Specific date and time that can be linked to personal activities. |
| DRIVER_LICENSE | number | Driver’s license number used to identify individuals. |
| EMAIL_ADDRESS | email | Email address used for communication and identification. |
| ES_NIE | nin | Foreigner Identification Number used to identify non-residents in Spain. |
| ES_NIF | nin | Tax Identification Number used to identify taxpayers in Spain. |
| ETHEREUM_ADDRESS | address | Ethereum wallet address used for digital transactions. |
| FI_PERSONAL_IDENTITY_CODE | nin | Personal identity code used to identify individuals in Finland. |
| GENDER | string | Gender information used to identify individuals. |
| GEO_CCORDINATE | address | Geographic coordinates used to identify specific locations. |
| IBAN_CODE | iban | International Bank Account Number used to identify bank accounts globally. |
| ID_CARD | number | Identity card number used to identify individuals. |
| IN_AADHAAR | nin | Unique identification number used to identify residents in India. |
| IN_PAN | number | Permanent Account Number used to identify taxpayers in India. |
| IN_PASSPORT | passport | Passport number used to identify individuals in India. |
| IN_VEHICLE_REGISTRATION | number | Vehicle registration number used to identify vehicles in India. |
| IN_VOTER | number | Voter ID number used to identify registered voters in India. |
| IP_ADDRESS | address | Internet Protocol address used to identify devices on a network. |
| IPV4 | address | IPv4 address used to identify devices on a network. |
| IPV6 | address | IPv6 address used to identify devices on a network. |
| IT_DRIVER_LICENSE | number | Driver’s license number used to identify individuals in Italy. |
| IT_FISCAL_CODE | nin | Fiscal code used to identify taxpayers in Italy. |
| IT_IDENTITY_CARD | number | Identity card number used to identify individuals in Italy. |
| IT_PASSPORT | passport | Passport number used to identify individuals in Italy. |
| LITECOIN_ADDRESS | address | Litecoin wallet address used for digital transactions. |
| LOCATION | address | Specific location or address that can be linked to an individual. |
| MAC | address | Media Access Control address used to identify devices on a network. |
| MEDICAL_LICENSE | number | License number used to identify medical professionals. |
| NRP | number | A person’s nationality, religious or political group. |
| ORGANIZATION | string | Name or identifier used to identify an organization. |
| PASSPORT | passport | Passport number used to identify individuals. |
| PASSWORD | string | Password used to secure access to personal accounts. |
| PERSON | string | Name or identifier used to identify an individual. |
| PHONE_NUMBER | phone | Number used to contact or identify an individual. |
| PIN | number | Personal Identification Number used to secure access to accounts. |
| PL_PESEL | nin | Personal Identification Number used to identify individuals in Poland. |
| SECONDARYADDRESS | address | Additional address information used to identify locations. |
| SG_NRIC_FIN | nin | National Registration Identity Card number used to identify residents in Singapore. |
| SG_UEN | number | Unique Entity Number used to identify businesses in Singapore. |
| SOCIAL_SECURITY_NUMBER | ssn | Social Security Number used to identify individuals. |
| STATE | string | State information used to identify geographic locations. |
| STREET | address | Street address used to identify specific locations. |
| TIME | datetime | Specific time that can be linked to personal activities. |
| TITLE | string | Title or honorific used to identify individuals. |
| UK_NHS | number | National Health Service number used to identify individuals for healthcare services in the United Kingdom. |
| URL | address | Web address that can sometimes contain personal information. |
| US_BANK_NUMBER | number | Bank account number used to identify financial accounts in the United States. |
| US_DRIVER_LICENSE | number | Driver’s license number used to identify individuals in the United States. |
| US_ITIN | number | Individual Taxpayer Identification Number used to identify taxpayers in the United States. |
| US_PASSPORT | passport | Passport number used to identify individuals in the United States. |
| US_SSN | ssn | Social Security Number used to identify individuals in the United States. |
| USERNAME | string | Username used to identify individuals in online systems. |
| ZIP_CODE | zipcode | Postal code used to identify specific geographic areas. |
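When processing classification results, it may help to look up the data element that corresponds to a detected entity. The following Python sketch copies a few rows from the table above into a dictionary. The names ENTITY_TO_DATA_ELEMENT and data_element_for are hypothetical and are not part of any Protegrity API.

```python
from typing import Optional

# Excerpt of the table above: entity name -> data element.
ENTITY_TO_DATA_ELEMENT = {
    "CREDIT_CARD": "ccn",
    "EMAIL_ADDRESS": "email",
    "IBAN_CODE": "iban",
    "PERSON": "string",
    "PHONE_NUMBER": "phone",
    "SOCIAL_SECURITY_NUMBER": "ssn",
    "US_SSN": "ssn",
    "ZIP_CODE": "zipcode",
}


def data_element_for(entity_name: str) -> Optional[str]:
    """Return the data element for an entity name, or None if it is not listed."""
    return ENTITY_TO_DATA_ELEMENT.get(entity_name)


print(data_element_for("EMAIL_ADDRESS"))
```

Returning None for unlisted names lets the caller decide how to handle entities that this excerpt does not cover.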

4 - Data Security Policy

Data Security Policy configuration.

This section describes the Policy configuration used by the AI Developer Edition API Service.

The superuser has all permissions, that is, protect, unprotect, and reprotect operations. Users assigned the admin role will receive protected data when performing an unprotect operation, except in the case of the text data elements, which will return null. All other user roles will receive null as the output for any unprotect operation.
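The unprotect behavior described above can be summarized in a short Python sketch. This is only a model of the stated rules, not the API Service itself: the function name unprotect_result and the lowercase role strings are assumptions made for illustration.

```python
from typing import Optional


def unprotect_result(role: str, data_element: str,
                     protected_value: str, clear_value: str) -> Optional[str]:
    """Model what an unprotect operation returns for a given role."""
    if role == "superuser":
        # The superuser may unprotect, so the clear value is returned.
        return clear_value
    if role == "admin":
        # Admin receives the protected data back, except for text, which is null.
        return None if data_element == "text" else protected_value
    # All other roles receive null for any unprotect operation.
    return None


print(unprotect_result("admin", "email", "tok3n@example.com", "jane@example.com"))
```

In this model, None stands for the null value mentioned in the policy description.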

Policy Definition

Generic Data Elements

| Data Element | Method | Use Case | UTF Set | LP | PP | eIV | Role (Admin, Finance, Marketing, HR; P and U each) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| datetime | Tokenization | A date or datetime string. Formats accepted: YYYY/MM/DD HH:MM:SS and YYYY/MM/DD. Delimiters accepted: /, - (required). | N/A | N/A | N/A | No |  |
| datetime_yc | Tokenization | A date or datetime string. Formats accepted: YYYY/MM/DD HH:MM:SS and YYYY/MM/DD. Delimiters accepted: /, - (required). Leaves the year in the clear. | N/A | N/A | N/A | No |  |
| int | Tokenization | An integer string (4 bytes). | Numeric | No | No | Yes |  |
| number | Tokenization | A numeric string. May produce leading zeroes. | Numeric | Yes | No | Yes |  |
| string | Tokenization | An alphanumeric string. | Latin + Numeric | Yes | No | Yes |  |
| text | Encryption | A long string (e.g., a comment field) using any character set. Use hex or base64 encoding to utilize. | All | No | No | Yes |  |
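The datetime and text rows above imply some preparation of the input. The following Python sketch checks a string against the listed datetime formats and base64-encodes a string for the text data element. Both helper names are hypothetical. The sketch assumes that one delimiter is used consistently within a date and that values are zero-padded, which the table does not state explicitly.

```python
import base64
import re
from datetime import datetime

# Four-digit year, then month and day with one consistent delimiter (/ or -),
# optionally followed by a time in HH:MM:SS.
_PATTERN = re.compile(r"\d{4}([/-])\d{2}\1\d{2}( \d{2}:\d{2}:\d{2})?")


def is_accepted_datetime(value: str) -> bool:
    """Return True if value matches YYYY/MM/DD or YYYY/MM/DD HH:MM:SS."""
    match = _PATTERN.fullmatch(value)
    if match is None:
        return False
    fmt = "%Y/%m/%d %H:%M:%S" if match.group(2) else "%Y/%m/%d"
    try:
        # strptime rejects impossible dates such as February 30.
        datetime.strptime(value.replace("-", "/"), fmt)
    except ValueError:
        return False
    return True


def encode_text(value: str) -> str:
    """Base64-encode a string of any character set for the text data element."""
    return base64.b64encode(value.encode("utf-8")).decode("ascii")


print(is_accepted_datetime("2024-01-31 23:59:59"))
print(encode_text("hello"))
```

Hex encoding would work similarly by using value.encode("utf-8").hex() in place of the base64 call.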

PCI DSS Data Elements

| Data Element | Method | Use Case | UTF Set | LP | PP | eIV | Role (Admin, Finance, Marketing, HR; P and U each) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| ccn | Tokenization | Credit card numbers. | Numeric | No | No | Yes | X |
| ccn_bin | Tokenization | Credit card numbers. Leaves 8-digit BIN in the clear. | Numeric | No | No | Yes | X |
| iban | Tokenization | IBAN numbers. Preserves the length, case, and position of the input characters but may create invalid IBAN codes. | Latin + Numeric | Yes | Yes | No | X |
| iban_cc | Tokenization | IBAN numbers. Leaves letters in the clear. | Latin + Numeric | No | No | Yes | X |
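To illustrate what leaving the 8-digit BIN in the clear means for ccn_bin, the following Python sketch splits a card number into the leading BIN and the remaining digits. It performs no tokenization. The helper name split_bin is hypothetical, and the sample number is a commonly used test value.

```python
from typing import Tuple


def split_bin(card_number: str, bin_length: int = 8) -> Tuple[str, str]:
    """Split a card number into its leading BIN and the remaining digits."""
    # Ignore common visual separators before checking the digits.
    digits = card_number.replace(" ", "").replace("-", "")
    if not digits.isdigit() or len(digits) <= bin_length:
        raise ValueError("expected a numeric card number longer than the BIN")
    return digits[:bin_length], digits[bin_length:]


bin_part, rest = split_bin("4111 1111 1111 1111")
print(bin_part, rest)
```

With ccn_bin, the first part would stay readable in the protected output, while the remainder is the portion subject to tokenization.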

PII Data Elements

| Data Element | Method | Use Case | UTF Set | LP | PP | eIV | Role (Admin, Finance, Marketing, HR; P and U each) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| address | Tokenization | Street names | Latin + Numeric | Yes | No | Yes |  |
| city | Tokenization | Town or city name | Latin | Yes | No | Yes |  |
| email | Tokenization | Email address. Leaves the domain in the clear. | Latin + Numeric | Yes | No | Yes |  |
| nin | Tokenization | National Insurance Number. Preserves the length, case, and position of the input characters but may create invalid NIN codes. | Latin + Numeric | Yes | Yes | No |  |
| name | Tokenization | Person's name | Latin | Yes | No | Yes |  |
| passport | Tokenization | Passport codes. Preserves the length, case, and position of the input characters but may create invalid passport numbers. | Latin + Numeric | Yes | Yes | No |  |
| phone | Tokenization | Phone number. May produce leading zeroes. | Latin + Numeric | Yes | No | Yes |  |
| postcode | Tokenization | Postal codes with digits and characters. Preserves the length, case, and position of the input characters but may create invalid post codes. | Latin + Numeric | Yes | Yes | No |  |
| ssn | Tokenization | Social Security Number (US) | Latin + Numeric | Yes | No | Yes |  |
| zipcode | Tokenization | Zip codes with digits only. May produce leading zeroes. | Numeric | Yes | No | Yes |  |

PII Data Elements

| Data Element | Method | Use Case | UTF Set | LP | PP | eIV | Role (Admin, Finance, Marketing, HR; P and U each) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| address_de | Tokenization | Street names (German) | Latin + German + Numeric | Yes | No | Yes |  |
| address_fr | Tokenization | Street names (French) | Latin + French + Numeric | Yes | No | Yes |  |
| city_de | Tokenization | Town or city name (German) | Latin + German | Yes | No | Yes |  |
| city_fr | Tokenization | Town or city name (French) | Latin + French | Yes | No | Yes |  |
| name_de | Tokenization | Person's name (German) | Latin + German | Yes | No | Yes |  |
| name_fr | Tokenization | Person's name (French) | Latin + French | Yes | No | Yes |  |

LEGEND

  • eIV: External IV
  • LP: Length Preservation
  • PP: Position Preservation
  • P: User group can protect data
  • U: User group can unprotect data

5 - Removing AI Developer Edition

Steps for removing the product.
  1. Open a command prompt.

  2. Navigate to the cloned repository location.

  3. Run the following command to remove the containers and images.

    docker compose down --rmi all
    
  4. Run the following command to remove the Python module.

    pip uninstall protegrity-developer-python
    

6 - Known Issues

Issues and workaround information.

Issue: SSL errors in the Data Discovery container

Description: The tldextract library tries to download the following Public Suffix List files:

When these lists cannot be downloaded, the default files included in the package are used, and no issue is observed in the classification.