Input Sanitization

Rejecting unsanitized data.

The Classification service in Data Discovery offers a security feature that rejects unsanitized data. Data that is malformed, non-normalized, containing homoglyphs, hieroglyphs, mixed Unicode variants, or control characters are considered as unsanitized data. These are rejected for classification.

The following are few examples of data that will be rejected:

  • 𝓉𝑒𝓍𝓉
  • Pep

Before invoking the Classification endpoint, ensure that the input text is normalized. Replace invalid characters by their corresponding normalized plaintext characters. If the input text contains any invalid character, a status code of 422 and a message Untrusted input is returned.

For security purposes, the application rejects unsanitized data by default. It is recommended that this feature remains enabled. However, to override this feature, perform the following steps.

  1. Navigate to the docker_compose directory.

  2. Edit the docker-compose.yaml file.

  3. Under the environment section of classification_service, append the security parameter as follows.

- SECURITY_SETTINGS={"ENABLE_ALL_SECURITY_CONTROLS":false}
  1. Save the changes.

  2. Run the docker compose down command to undeploy the application.

  3. Run the docker compose up command to redeploy the application.


Last modified : July 25, 2025