Using the Synthetic Data APIs

This section lists the APIs for Synthetic Data.

base

Base HTTP client for SDK communication

Provides low-level HTTP communication utilities shared across all SDK clients. Handles request construction, response parsing, error handling, and retries.

dataframe_to_base64
def dataframe_to_base64(df: pd.DataFrame) -> str

Convert DataFrame to base64-encoded CSV string for inline data transfer.

Arguments:

  • df - DataFrame to encode.

Returns:

  • str - Base64-encoded CSV string.

Examples:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
encoded = dataframe_to_base64(df)
print(encoded[:20])
# YSxiCjEsMwoyLDQK
```
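Going the other way can be useful when inspecting an inline payload. The following is a minimal sketch using only the standard library; `base64_to_rows` is a hypothetical helper name, not part of the SDK, and it assumes the payload is UTF-8 CSV with a header row.

```python
import base64
import csv
import io


def base64_to_rows(encoded: str) -> list[dict[str, str]]:
    """Decode a base64-encoded CSV string into a list of row dicts.

    Hypothetical helper for illustration; values stay as strings
    because the csv module does not infer column types.
    """
    csv_text = base64.b64decode(encoded).decode("utf-8")
    return list(csv.DictReader(io.StringIO(csv_text)))


# Encoded form of a two-row table with columns "a" and "b".
rows = base64_to_rows("YSxiCjEsMwoyLDQK")
print(rows)
# [{'a': '1', 'b': '3'}, {'a': '2', 'b': '4'}]
```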

BaseClient

class BaseClient()

Base HTTP client with common request/response logic.

Provides methods for making HTTP requests to the synthesis API with automatic retry, error handling, and response parsing.

Arguments:

  • config ClientConfig - Client configuration with endpoint and settings.

Examples:

```python
config = ClientConfig(endpoint="http://localhost:8000")
client = BaseClient(config)
response = client._request("POST", "/pty/syntheticdata/v2/synthesize", data=payload)
```
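The retry behaviour itself is not specified in this reference. As a rough illustration of what automatic retry typically looks like, the sketch below retries a callable with exponential backoff. `TransientError`, `call_with_retry`, and the default values are all assumptions made for this example and are not SDK names.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical error type standing in for a retryable failure."""


def call_with_retry(
    func: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 0.01,
) -> T:
    """Call func, retrying on TransientError with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except TransientError:
            if attempt == max_attempts:
                raise  # Out of attempts: surface the last error.
            # Delay doubles after each failed attempt.
            time.sleep(base_delay * 2 ** (attempt - 1))
    raise ValueError("max_attempts must be at least 1")


# Simulated request that fails twice before succeeding.
calls = {"count": 0}


def flaky_request() -> dict:
    calls["count"] += 1
    if calls["count"] < 3:
        raise TransientError("temporary failure")
    return {"status": "success"}


result = call_with_retry(flaky_request)
print(result, calls["count"])
# {'status': 'success'} 3
```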

__init__
def __init__(config: ClientConfig)

Initialize base client.

Arguments:

  • config - Client configuration.

client

Remote synthesizer clients for the Synthetic Data SDK

Provides client classes that mirror the local synthesizer interface but delegate computation to a remote REST API. Enables distributed synthesis workflows without local compute resources.

Classes:

  • SynthesisClient - Low-level API client for direct endpoint access.
  • RemoteVineCopula - Remote single-table vine copula synthesizer.
  • RemoteMultiTableVineCopula - Remote multi-table synthesizer.

Examples:

Single-table synthesis:

```python
from synthetic_data_sdk import RemoteVineCopula
import pandas as pd

# Initialize client
synth = RemoteVineCopula(
    endpoint="http://api.example.com:8000", categorical_cols=["city", "product"]
)

# Fit on training data
df = pd.read_csv("customers.csv")
synth.fit(df)

# Generate synthetic data
synthetic = synth.transform(n=10000)
synthetic.to_csv("synthetic_customers.csv", index=False)
```

Multi-table synthesis:

```python
from synthetic_data_sdk import RemoteMultiTableVineCopula

synth = RemoteMultiTableVineCopula(
    endpoint="http://api.example.com:8000",
    relationships=[
        ("customers", "customer_id", "orders", "customer_id"),
        ("orders", "order_id", "items", "order_id"),
    ],
    synthesizer_params={
        "customers": {"categorical_cols": ["city"]},
        "orders": {"categorical_cols": ["status"]},
    },
)

tables = {"customers": customers_df, "orders": orders_df, "items": items_df}
synth.fit(tables)
synthetic_tables = synth.transform(n=500)
```

Model persistence:

```python
# Fit and save on server
synth = RemoteVineCopula(endpoint="http://api.example.com:8000", model_version="prod_v2")
synth.fit(training_data)

# Later: load and use
synth = RemoteVineCopula(endpoint="http://api.example.com:8000", model_version="prod_v2")
synthetic = synth.transform(n=5000)  # No refitting needed
```

SynthesisClient

class SynthesisClient(BaseClient)

Low-level client for direct API interaction.

Provides a thin wrapper around the synthesis endpoint for applications that need fine-grained control over request payloads. Most users should use RemoteVineCopula or RemoteMultiTableVineCopula instead.

Arguments:

  • config ClientConfig - Client configuration.

Examples:

```python
from synthetic_data_sdk import SynthesisClient, ClientConfig

config = ClientConfig(endpoint="http://localhost:8000")
client = SynthesisClient(config)

# Manual request construction
response = client.synthesize(
    model_name="vine",
    action="fit_transform",
    training_data_path="data/customers.csv",  # a path, so not the base64 training_data argument
    n_samples=1000,
    parameters={"categorical_cols": ["city"]},
)
print(response["status"])
# success
```

synthesize
def synthesize(model_name: str,
               action: str,
               training_data: str | None = None,
               training_data_path: str | None = None,
               training_data_tables: dict[str, str] | None = None,
               n_samples: int | None = None,
               model_version: str | None = None,
               parameters: dict[str, Any] | None = None,
               output_uri: str | None = None,
               mlops_config: dict[str, Any] | None = None) -> dict[str, Any]

Send synthesis request to API.

Arguments:

  • model_name - Model type (‘vine’ or ‘vine_multitable’).
  • action - Action to perform (‘fit’, ‘transform’, ‘fit_transform’).
  • training_data - Base64-encoded CSV string for single-table inline data.
  • training_data_path - Cloud URI or local path to single-table training data CSV.
  • training_data_tables - Dict mapping table names to local paths/file:// URIs or cloud URIs for multi-table. All tables must use the same input kind; mixing raises ValueError.
  • n_samples - Number of synthetic samples to generate.
  • model_version - Version identifier for model persistence.
  • parameters - Model-specific parameters.
  • output_uri - Cloud URI to write synthetic data to (e.g. ‘s3://bucket/out.csv’). When omitted, synthetic data is returned inline in the response. Supported schemes: s3://, gs://, azure://, minio://.
  • mlops_config - Per-request MLOps tracking configuration. When provided, overrides the server’s default MLOps settings for this request only. All keys are optional and fall back to the server configuration when omitted. Useful for multi-tenant MLOps setups where each caller tracks to their own Postgres / artifact store. Accepted keys (all optional):
    • database_dsn: connection string for the MLOps DB.
    • storage_dsn: artifact storage URI (s3://, local://, gs://, azure://, minio://).
    • experiment_prefix: defaults to ‘synthetic-data’.

Example:

    {"database_dsn": "postgresql://user:pw@host:5432/mlops",
     "storage_dsn": "s3://key:secret@us-east-1/bucket/mlops/"}

Returns:

  • dict - API response with status, data, and metadata.

Raises:

  • SynthesisAPIError - If request fails.

Examples:

```python
response = client.synthesize(
    model_name="vine",
    action="fit_transform",
    training_data_path="s3://key:secret@region/bucket/data.csv",
    n_samples=1000,
    parameters={"categorical_cols": ["city"]},
    mlops_config={"database_dsn": "postgresql://user:pw@host:5432/mlops"},
)
data_synthesis = response["data"]
```
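Before sending a request, it can help to check that an output_uri uses one of the schemes listed for that argument. The sketch below is an assumption about how such a client-side check could look; `check_output_uri` is a hypothetical helper and the SDK may validate differently, or only on the server.

```python
from __future__ import annotations

from urllib.parse import urlparse

# Schemes listed for output_uri in the argument description.
SUPPORTED_OUTPUT_SCHEMES = {"s3", "gs", "azure", "minio"}


def check_output_uri(uri: str | None) -> str:
    """Return 'inline' when no URI is given, else the validated scheme."""
    if uri is None:
        return "inline"  # Synthetic data comes back in the response body.
    scheme = urlparse(uri).scheme
    if scheme not in SUPPORTED_OUTPUT_SCHEMES:
        raise ValueError(f"Unsupported output_uri scheme: {scheme!r}")
    return scheme


print(check_output_uri(None))
# inline
print(check_output_uri("s3://bucket/out.csv"))
# s3
```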

list_models
def list_models(model_type: str | None = None,
                all_metrics: bool = False) -> list[dict[str, Any]]

List model versions currently in Production.

Arguments:

  • model_type - Filter by algorithm class (e.g. "vine").
  • all_metrics - When False (default), metrics contains only the promotion metric. When True, all logged metrics are returned.

Returns:

List of dicts with model_name, model_type, model_version, semantic_version, stage, input_schema, metrics, registered_at.

synthesize_async
def synthesize_async(model_name: str,
                     action: str,
                     training_data: str | None = None,
                     training_data_path: str | None = None,
                     training_data_tables: dict[str, str] | None = None,
                     n_samples: int | None = None,
                     model_version: str | None = None,
                     parameters: dict[str, Any] | None = None,
                     output_uri: str | None = None) -> dict[str, Any]

Submit a synthesis request for background execution.

Returns immediately with a job_id that can be polled via get_job_status.

Arguments:

  • model_name - Model type (‘vine’ or ‘vine_multitable’).
  • action - Action to perform (‘fit’, ‘transform’, ‘fit_transform’).
  • training_data - Base64-encoded CSV string for single-table inline data.
  • training_data_path - Cloud URI or local path to single-table training data CSV.
  • training_data_tables - Dict mapping table names to local paths/file:// URIs or cloud URIs for multi-table. All tables must use the same input kind; mixing raises ValueError.
  • n_samples - Number of synthetic samples to generate.
  • model_version - Version identifier for model persistence.
  • parameters - Model-specific parameters.
  • output_uri - Cloud URI to write synthetic data to. None means inline response.

Returns:

  • dict - {"job_id": "...", "status": "queued", ...}

get_job_status
def get_job_status(job_id: str) -> dict[str, Any]

Get the current status of a job.

Arguments:

  • job_id - Unique job identifier returned by synthesize_async.

Returns:

  • dict - Job status including job_id, status, progress, step, message, error, synth_data_uri, timestamps.

list_jobs
def list_jobs(status: str | None = None,
              limit: int = 100,
              offset: int = 0) -> dict[str, Any]

List jobs with optional filtering and pagination.

Arguments:

  • status - Filter by status (pending, running, completed, failed, cancelled).
  • limit - Page size (1-1000).
  • offset - Page offset.

Returns:

  • dict - {"jobs": [...], "total": int, "limit": int, "offset": int}
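To walk through every job rather than a single page, limit and offset can be advanced in a loop. The sketch below assumes that offset counts records (not pages) and that the response has the shape shown above. `iter_all_jobs` and `fake_list_jobs` are hypothetical names, with the fake standing in for a real client's list_jobs so the example runs without a server.

```python
from __future__ import annotations

from typing import Any, Callable, Iterator


def iter_all_jobs(
    list_jobs: Callable[..., dict[str, Any]],
    status: str | None = None,
    page_size: int = 100,
) -> Iterator[dict[str, Any]]:
    """Yield every job by requesting successive pages."""
    offset = 0
    while True:
        page = list_jobs(status=status, limit=page_size, offset=offset)
        yield from page["jobs"]
        offset += page_size
        # Stop on an empty page or once the reported total is covered.
        if not page["jobs"] or offset >= page["total"]:
            break


# In-memory stand-in for a server with five completed jobs.
_FAKE_JOBS = [{"job_id": f"job-{i}", "status": "completed"} for i in range(5)]


def fake_list_jobs(status=None, limit=100, offset=0):
    matching = [j for j in _FAKE_JOBS if status is None or j["status"] == status]
    return {
        "jobs": matching[offset : offset + limit],
        "total": len(matching),
        "limit": limit,
        "offset": offset,
    }


job_ids = [job["job_id"] for job in iter_all_jobs(fake_list_jobs, page_size=2)]
print(job_ids)
# ['job-0', 'job-1', 'job-2', 'job-3', 'job-4']
```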

get_job_history
def get_job_history(job_id: str) -> list[dict[str, Any]]

Get the full state-transition audit trail for a job.

Arguments:

  • job_id - Unique job identifier.

Returns:

  • list - List of history entry dicts with sequence, status, progress, step, changed_at, etc.

delete_job
def delete_job(job_id: str) -> None

Delete a job record (or cancel if running).

Arguments:

  • job_id - Unique job identifier.

wait_for_job
def wait_for_job(job_id: str,
                 poll_interval: float = 2.0,
                 timeout: float = 600.0,
                 callback: Any | None = None) -> dict[str, Any]

Poll a job until it reaches a terminal state.

Arguments:

  • job_id - Unique job identifier.
  • poll_interval - Seconds between polls (default 2s).
  • timeout - Maximum seconds to wait (default 600s / 10 min).
  • callback - Optional callable (status_dict) -> None invoked on each poll.

Returns:

  • dict - Final job status.

Raises:

  • TimeoutError - If job doesn’t complete within timeout.
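The polling behaviour described here can be sketched as a simple loop. This is an illustration under stated assumptions, not the SDK's implementation: `poll_until_done` and `fake_get_status` are hypothetical names, and the set of terminal states is assumed from the status values listed for list_jobs.

```python
from __future__ import annotations

import time
from typing import Any, Callable

# Assumed terminal states, taken from the list_jobs status filter values.
TERMINAL_STATES = {"completed", "failed", "cancelled"}


def poll_until_done(
    get_status: Callable[[str], dict[str, Any]],
    job_id: str,
    poll_interval: float = 2.0,
    timeout: float = 600.0,
    callback: Callable[[dict[str, Any]], None] | None = None,
) -> dict[str, Any]:
    """Poll get_status until the job is terminal or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        status = get_status(job_id)
        if callback is not None:
            callback(status)  # Invoked on every poll, including the last.
        if status["status"] in TERMINAL_STATES:
            return status
        if time.monotonic() + poll_interval > deadline:
            raise TimeoutError(f"Job {job_id} did not finish within {timeout}s")
        time.sleep(poll_interval)


# Stand-in for get_job_status that advances through three states.
_states = iter(["pending", "running", "completed"])


def fake_get_status(job_id: str) -> dict[str, Any]:
    return {"job_id": job_id, "status": next(_states)}


seen: list[str] = []
final = poll_until_done(
    fake_get_status,
    "job-1",
    poll_interval=0.01,
    timeout=5.0,
    callback=lambda s: seen.append(s["status"]),
)
print(final["status"], seen)
# completed ['pending', 'running', 'completed']
```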

generate_conditional
def generate_conditional(real_data: str | pd.DataFrame,
                         model_name: str,
                         n_samples: int,
                         conditions: dict[str, Any] | None = None,
                         amplify_patterns: float | None = None,
                         inject_drift: dict[str, float] | None = None,
                         categorical_cols: list[str] | None = None,
                         random_state: int | None = None) -> dict[str, Any]

Generate synthetic data matching conditional scenarios.

Fits a synthesizer on real data and generates synthetic samples matching specific conditions (filters), with optional pattern amplification and distribution drift injection. Useful for scenario testing, edge case generation, and what-if analysis.

Arguments:

  • real_data - Path to CSV file or DataFrame containing training data.
  • model_name - Model type (‘vine’, ‘smote’, ‘tabdiff’, ‘tabulargan’).
  • n_samples - Number of synthetic samples to generate.
  • conditions - Dictionary of column conditions. Examples:
    • Exact match: {'fraud': 1, 'status': 'active'}
    • Comparison: {'age': '>65', 'income': '<=50000'}
    • Range: {'age': 'between(30,50)'}
    • Membership: {'city': 'in(NYC,LA,Chicago)'}
  • amplify_patterns - Multiplier for conditional pattern amplification (e.g., 1.5 for 50% increase).
  • inject_drift - Dictionary of column drift shifts. Examples:
    • {'income': -20000, 'age': -5}  # Recession scenario
    • {'credit_score': -50}  # Credit deterioration
  • categorical_cols - List of categorical column names for proper encoding.
  • random_state - Random seed for reproducibility.

Returns:

  • dict - Response containing:
    • success (bool): Whether generation succeeded.
    • n_samples (int): Number of samples generated.
    • synthetic_data (str): Base64-encoded CSV data.
    • conditions_applied (dict): Conditions that were applied.
    • drift_applied (dict): Drift shifts that were applied.
    • warnings (List[str]): Any warnings from generation.
    • metadata (dict): Model and column information.

Raises:

  • SynthesisAPIError - If request fails.

Examples:

Fraud scenario - generate high-risk fraud cases:

```python
from synthetic_data_sdk import SynthesisClient, ClientConfig
import pandas as pd

config = ClientConfig(endpoint="http://localhost:8000")
client = SynthesisClient(config)

response = client.generate_conditional(
    real_data="data/transactions.csv",
    model_name="vine",
    n_samples=1000,
    conditions={"fraud": 1, "age": ">65"},
    categorical_cols=["status", "fraud"],
)

# Decode synthetic data
import base64, io
decoded = base64.b64decode(response["synthetic_data"])
synthetic = pd.read_csv(io.StringIO(decoded.decode("utf-8")))
print(f"Generated {len(synthetic)} fraud cases")
```

Recession scenario - income and employment impact:

```python
response = client.generate_conditional(
    real_data=customer_df,  # Can pass DataFrame directly
    model_name="vine",
    n_samples=5000,
    conditions={"age": ">55"},  # Focus on older customers
    inject_drift={
        "income": -20000,  # $20k income decrease
        "credit_score": -50,  # 50-point credit drop
    },
    categorical_cols=["status", "region"],
)
print(response["drift_applied"])
# {'income': -20000, 'credit_score': -50}
```

Edge case generation - extreme values:

```python
response = client.generate_conditional(
    real_data="data/loans.csv",
    model_name="vine",
    n_samples=500,
    conditions={"loan_amount": ">100000", "credit_score": "<600"},
    amplify_patterns=2.0,  # 2x amplification for extreme patterns
    random_state=42,
)
```
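The four condition forms accepted by generate_conditional (exact match, comparison, range, membership) are evaluated on the server, and their exact semantics are not documented here. To make the syntax concrete, the following sketch interprets the same strings as row filters. `make_predicate` and `row_matches` are hypothetical helpers, and inclusive range bounds are an assumption.

```python
from __future__ import annotations

import operator
import re
from typing import Any, Callable

# Two-character operators come first so ">=" is not read as ">".
_COMPARATORS = {
    ">=": operator.ge,
    "<=": operator.le,
    ">": operator.gt,
    "<": operator.lt,
}


def make_predicate(condition: Any) -> Callable[[Any], bool]:
    """Turn one condition value into a function that tests a cell value."""
    if not isinstance(condition, str):
        return lambda value: value == condition  # Exact match, e.g. 1.
    text = condition.strip()
    match = re.fullmatch(r"between\((.+),(.+)\)", text)
    if match:
        low, high = float(match.group(1)), float(match.group(2))
        return lambda value: low <= value <= high  # Assumed inclusive.
    match = re.fullmatch(r"in\((.+)\)", text)
    if match:
        allowed = {item.strip() for item in match.group(1).split(",")}
        return lambda value: str(value) in allowed
    for symbol, compare in _COMPARATORS.items():
        if text.startswith(symbol):
            threshold = float(text[len(symbol):])
            return lambda value: compare(value, threshold)
    return lambda value: value == text  # Exact string match.


def row_matches(row: dict[str, Any], conditions: dict[str, Any]) -> bool:
    """Return True when the row satisfies every column condition."""
    return all(make_predicate(c)(row[col]) for col, c in conditions.items())


rows = [
    {"age": 70, "fraud": 1, "city": "NYC"},
    {"age": 40, "fraud": 1, "city": "LA"},
    {"age": 68, "fraud": 0, "city": "Boston"},
]
matched = [r for r in rows if row_matches(r, {"fraud": 1, "age": ">65"})]
print(matched)
# [{'age': 70, 'fraud': 1, 'city': 'NYC'}]
```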

_RemoteSingleTableSynthesizer

class _RemoteSingleTableSynthesizer()

Base class for remote single-table synthesizers.

Eliminates code duplication across RemoteVineCopula, RemoteTabDiff, RemoteSMOTE, and RemoteTabularGAN. Subclasses need only set _model_name and _version_prefix class attributes. Model-specific methods (e.g. transform_conditional) can be added in the subclass.

__init__
def __init__(endpoint: str | None = None,
             model_version: str | None = None,
             config: ClientConfig | None = None,
             mlops_config: dict[str, Any] | None = None,
             **parameters)

Initialize a remote single-table synthesizer.

Arguments:

  • endpoint - API endpoint URL. Not required if config is provided.
  • model_version - Version identifier for model persistence.
  • config - Advanced client configuration (timeouts, auth, etc.).
  • mlops_config - Per-request MLOps tracking configuration. When provided, overrides the server’s default MLOps settings for this request only. All keys are optional and fall back to the server configuration when omitted. Accepted keys: database_dsn, storage_dsn, experiment_prefix, auto_promote, promotion_metric, promotion_direction.
  • **parameters - Model-specific hyper-parameters forwarded to the synthesizer constructor.

list_models
def list_models(all_metrics: bool = False) -> list[dict[str, Any]]

List Production model versions for this model type.

Calls the shared models endpoint with model_type=<_model_name> and returns all versions currently in the Production stage.

Arguments:

  • all_metrics - When False (default), metrics contains only the promotion metric. When True, all logged metrics are returned.

Returns:

List of dicts with model_name, model_type, model_version, semantic_version, stage, input_schema, metrics, registered_at.

fit
def fit(df: pd.DataFrame | str | Path) -> "RemoteVineCopula"

Fit model on training data.

Uploads training data to the API and triggers model fitting. The fitted model is stored on the server using the configured model_version.

Arguments:

  • df - Training data as DataFrame, local file path, or cloud URI.

Returns:

Self (for method chaining).

Raises:

  • SynthesisAPIError - If fitting fails.

transform
def transform(n: int, **kwargs) -> pd.DataFrame

Generate synthetic data using a fitted model.

Arguments:

  • n - Number of synthetic samples to generate.

Returns:

Synthetic data with the same schema as the training data.

Raises:

  • RuntimeError - If model is not fitted.
  • SynthesisAPIError - If generation fails.

fit_transform
def fit_transform(df: pd.DataFrame, n: int, **kwargs) -> pd.DataFrame

Fit model and generate synthetic data in one call.

Arguments:

  • df - Training data.
  • n - Number of synthetic samples to generate.

Returns:

Synthetic data.

Raises:

  • SynthesisAPIError - If operation fails.

summary
def summary() -> dict[str, Any]

Get summary statistics from a fitted model.

Returns:

Model summary with statistics and metadata.

Raises:

  • RuntimeError - If model is not fitted.
  • SynthesisAPIError - If request fails.

evaluate
def evaluate(real_data: pd.DataFrame | str,
             synthetic_data: pd.DataFrame | str,
             categorical_cols: list[str] | None = None,
             target_col: str | None = None,
             task_type: str | None = None,
             eval_params: dict[str, Any] | None = None) -> dict[str, Any]

Evaluate synthetic data quality against real data.

Computes comprehensive quality metrics including univariate distributions, correlation preservation, mutual information, predictive performance, and privacy metrics.

Arguments:

  • real_data - Real training data.
  • synthetic_data - Synthetic data to evaluate.
  • categorical_cols - Categorical column names.
  • target_col - Target column for TSTR/TRTR evaluation.
  • task_type - 'classification' or 'regression' for TSTR.
  • eval_params - Additional FidelityEvaluator configuration.

Returns:

Evaluation metrics dictionary.

Raises:

  • SynthesisAPIError - If evaluation fails.

RemoteVineCopula

class RemoteVineCopula(_RemoteSingleTableSynthesizer)

Remote client for single-table vine copula synthesis.

Mirrors the interface of the local VineCopula class but delegates all computation to a remote REST API. Provides the same fit/transform workflow without requiring local compute resources.

In addition to the shared single-table methods (fit, transform, fit_transform, evaluate, summary), this class exposes transform_conditional for scenario-based generation.

Arguments:

  • endpoint - Base URL of the synthesis API.
  • model_version - Version identifier for model persistence.
  • config - Advanced client configuration.
  • storage_config - Artifact storage credentials/configuration.
  • **parameters - Model parameters (categorical_cols, vine_type, etc.).

Examples:

```python
synth = RemoteVineCopula(
    endpoint="http://localhost:8000",
    categorical_cols=["city", "product"],
    vine_type="cvine",
)
synth.fit(df)
synthetic = synth.transform(n=1000)
```

transform_conditional
def transform_conditional(df: pd.DataFrame,
                          n: int,
                          conditions: dict[str, Any] | None = None,
                          amplify_patterns: float | None = None,
                          inject_drift: dict[str, float] | None = None,
                          random_state: int | None = None) -> pd.DataFrame

Generate conditional synthetic data matching specific scenarios.

Fits a vine copula on the provided data and generates synthetic samples matching specified conditions, with optional pattern amplification and distribution drift.

Arguments:

  • df - Training data to fit the model on.
  • n - Number of synthetic samples to generate.
  • conditions - Column conditions (exact, comparison, range, membership).
  • amplify_patterns - Multiplier for conditional pattern amplification.
  • inject_drift - Column drift shifts (e.g. {'income': -20000}).
  • random_state - Random seed for reproducibility.

Returns:

Synthetic data matching specified conditions.

Raises:

  • SynthesisAPIError - If generation fails.

RemoteMultiTableVineCopula

class RemoteMultiTableVineCopula()

Remote client for multi-table vine copula synthesis.

Mirrors the interface of MultiTableVineCopula but delegates computation to a remote API. Preserves foreign key relationships across tables.

Arguments:

  • endpoint str - Base URL of the synthesis API.
  • relationships List[Tuple[str, str, str, str]] - Foreign key relationships.
  • model_version Optional[str] - Version identifier for model persistence.
  • config Optional[ClientConfig] - Advanced client configuration.
  • synthesizer_params Optional[Dict[str, Dict]] - Per-table parameters.

Examples:

Multi-table synthesis:

```python
from synthetic_data_sdk import RemoteMultiTableVineCopula

synth = RemoteMultiTableVineCopula(
    endpoint="http://localhost:8000",
    relationships=[("customers", "customer_id", "orders", "customer_id")],
    synthesizer_params={"customers": {"categorical_cols": ["city", "segment"]}},
)

tables = {"customers": customers_df, "orders": orders_df}
synth.fit(tables)
synthetic = synth.transform(n=100)
print(synthetic.keys())
# dict_keys(['customers', 'orders'])
```

__init__
def __init__(relationships: list[tuple[str, str, str, str]],
             endpoint: str | None = None,
             model_version: str | None = None,
             config: ClientConfig | None = None,
             synthesizer_params: dict[str, dict[str, Any]] | None = None,
             primary_keys: dict[str, str] | None = None,
             mlops_config: dict[str, Any] | None = None)

Initialize remote multi-table client.

Arguments:

  • relationships - List of (parent_table, parent_col, child_table, child_col).
  • endpoint - API endpoint URL. Not required if config is provided.
  • model_version - Version identifier for persistence.
  • config - Advanced client configuration. If provided, endpoint can be omitted.
  • synthesizer_params - Per-table parameters (categorical_cols, etc.).
  • primary_keys - Optional mapping of table name -> primary-key column name for tables that are not inferred automatically (e.g. leaf tables). Example: {"order_items": "item_id"}.
  • mlops_config - Per-request MLOps tracking configuration. When provided, overrides the server’s default MLOps settings for this request only. All keys are optional and fall back to the server configuration when omitted. Accepted keys: database_dsn, storage_dsn, experiment_prefix, auto_promote, promotion_metric, promotion_direction.

list_models
def list_models(all_metrics: bool = False) -> list[dict[str, Any]]

List Production model versions for multi-table vine copula.

Calls the shared models endpoint with model_type=vine_multitable and returns all versions currently in the Production stage.

Arguments:

  • all_metrics - When False (default), metrics contains only the promotion metric. When True, all logged metrics are returned.

Returns:

List of dicts with model_name, model_type, model_version, semantic_version, stage, input_schema, metrics, registered_at.

fit
def fit(
    tables: dict[str, pd.DataFrame] | dict[str, str] | dict[str, Path]
) -> "RemoteMultiTableVineCopula"

Fit multi-table model on training data.

Dual-mode data loading: the SDK detects the input type of each table and selects the matching loading mode:

  1. Dict of DataFrames → inline base64:
    • Each DataFrame is encoded as base64.
    • All tables are sent in the HTTP body as a dict.
    • The server decodes them using the explicit is_inline=True flag.
  2. Dict of local files → inline base64:
    • The SDK reads each file on the client side.
    • The files are converted to a dict of base64 strings.
    • The server decodes them using the explicit is_inline=True flag.
  3. Dict of cloud URIs → server load:
    • URIs are passed directly (no data transfer in the HTTP body).
    • The server loads each table from cloud storage.
    • The request uses the explicit is_inline=False flag.

Mixed modes are not supported; all tables must use the same mode.

Arguments:

  • tables - Training tables keyed by name. All tables must be the same type. Each value can be:
    • DataFrame (mode 1)
    • Local file path (mode 2)
    • Cloud URI with supported scheme (mode 3)

Returns:

  • RemoteMultiTableVineCopula - Self (for method chaining).

Raises:

  • SynthesisAPIError - If fitting fails.
  • ValueError - If tables use mixed modes (e.g., DataFrame + cloud URI).
  • FileNotFoundError - If any local file path doesn’t exist.

Examples:

```python
synth = RemoteMultiTableVineCopula(
    endpoint="http://localhost:8000",
    relationships=[("customers", "customer_id", "orders", "customer_id")],
)

# Mode 1: Dict of DataFrames (auto-detects → inline base64)
tables = {"customers": customers_df, "orders": orders_df}
synth.fit(tables)

# Mode 2: Dict of local files (SDK reads → inline base64)
tables = {"customers": "./data/customers.csv", "orders": "./data/orders.csv"}
synth.fit(tables)

# Mode 3: Dict of cloud URIs (passes URIs → server loads)
tables = {
    "customers": "s3://bucket/customers.csv",
    "orders": "s3://bucket/orders.csv",
}
synth.fit(tables)
```
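The mode detection and the mixed-mode ValueError described for fit can be pictured with the sketch below. It is an assumption about the general approach, not the SDK's code: `detect_kind` and `detect_mode` are hypothetical names, the cloud scheme list is borrowed from the output_uri description, and any value that is not a string or Path is treated as an in-memory table so the example runs without pandas.

```python
from __future__ import annotations

from pathlib import Path
from typing import Any

# Assumed cloud schemes, borrowed from the output_uri argument description.
CLOUD_SCHEMES = ("s3://", "gs://", "azure://", "minio://")


def detect_kind(value: Any) -> str:
    """Classify one table value as dataframe, local_file, or cloud_uri."""
    if isinstance(value, Path):
        return "local_file"
    if isinstance(value, str):
        return "cloud_uri" if value.startswith(CLOUD_SCHEMES) else "local_file"
    return "dataframe"  # Anything else is treated as an in-memory table.


def detect_mode(tables: dict[str, Any]) -> str:
    """Return the single input kind shared by all tables."""
    if not tables:
        raise ValueError("At least one table is required")
    kinds = {detect_kind(value) for value in tables.values()}
    if len(kinds) != 1:
        raise ValueError(f"Mixed input modes are not supported: {sorted(kinds)}")
    return kinds.pop()


print(detect_mode({"customers": "s3://bucket/customers.csv",
                   "orders": "s3://bucket/orders.csv"}))
# cloud_uri
print(detect_mode({"customers": "./data/customers.csv",
                   "orders": Path("./data/orders.csv")}))
# local_file
```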

transform
def transform(n: int, **kwargs) -> dict[str, pd.DataFrame]

Generate synthetic multi-table data.

Arguments:

  • n int - Number of parent table samples.
  • **kwargs - Additional generation parameters.

Returns:

Dict[str, pd.DataFrame]: Synthetic tables keyed by name.

Raises:

  • RuntimeError - If model not fitted.
  • SynthesisAPIError - If generation fails.

Examples:

```python
synth.fit(tables)
synthetic = synth.transform(n=100)
print(f"Customers: {len(synthetic['customers'])} rows")
print(f"Orders: {len(synthetic['orders'])} rows")
```

fit_transform
def fit_transform(tables: dict[str, pd.DataFrame], n: int,
                  **kwargs) -> dict[str, pd.DataFrame]

Fit model and generate synthetic data in one call.

Arguments:

  • tables Dict[str, pd.DataFrame] - Training tables.
  • n int - Number of parent table samples.
  • **kwargs - Additional parameters.

Returns:

Dict[str, pd.DataFrame]: Synthetic tables.

Raises:

  • SynthesisAPIError - If operation fails.

Examples:

```python
synth = RemoteMultiTableVineCopula(
    endpoint="http://localhost:8000",
    relationships=[("customers", "id", "orders", "customer_id")],
)
synthetic = synth.fit_transform(tables, n=100)
```

summary
def summary() -> dict[str, Any]

Get summary statistics from fitted model.

Returns:

  • dict - Model summary with statistics and metadata.

Raises:

  • RuntimeError - If model not fitted.
  • SynthesisAPIError - If request fails.

validate_relationships
def validate_relationships(tables: dict[str, pd.DataFrame]) -> dict[str, Any]

Validate foreign key relationships in multi-table data.

Arguments:

  • tables Dict[str, pd.DataFrame] - Tables to validate.

Returns:

  • dict - Validation results with ‘valid’ boolean and ‘violations’ list.

Raises:

  • RuntimeError - If model not fitted.
  • SynthesisAPIError - If request fails.

relational_score
def relational_score(real_tables: dict[str, pd.DataFrame],
                     synth_tables: dict[str, pd.DataFrame]) -> dict[str, Any]

Compute relational fidelity score comparing real and synthetic data.

Evaluates relational integrity metrics:

  • Foreign key violation rate
  • Cardinality preservation (child count distributions)
  • Join distribution similarity (cross-table correlations)
  • Overall composite relational score

Arguments:

  • real_tables Dict[str, pd.DataFrame] - Real multi-table data.
  • synth_tables Dict[str, pd.DataFrame] - Synthetic multi-table data.

Returns:

  • dict - Relational fidelity scores with the following structure:
    • 'fk_violation_rate' - float
    • 'fk_violations' - list
    • 'cardinality_preservation' - dict with:
      • 'mean_error' - float
      • 'max_error' - float
      • 'details' - list
    • 'join_distribution_similarity' - float
    • 'join_details' - list
    • 'overall_relational_score' - float (0-1)
    • 'interpretation' - str

Raises:

  • RuntimeError - If model not fitted.
  • SynthesisAPIError - If request fails.

Example:

```python
# After fitting and generating synthetic data
client = RemoteMultiTableVineCopula(...)
client.fit(real_tables)
synthetic = client.transform(n=1000)
scores = client.relational_score(real_tables, synthetic)
print(f"Overall score: {scores['overall_relational_score']:.2f}")
print(f"FK violations: {scores['fk_violation_rate']:.2%}")
print(f"Interpretation: {scores['interpretation']}")
```
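How the server computes each metric is not documented here. As a rough intuition for the foreign key violation rate, the sketch below measures the fraction of child rows whose key has no matching parent row. `fk_violation_rate` is a hypothetical standalone function, the rows are plain dicts rather than DataFrames, and the definition itself is an assumption.

```python
from typing import Any


def fk_violation_rate(
    parent_rows: list[dict[str, Any]],
    parent_col: str,
    child_rows: list[dict[str, Any]],
    child_col: str,
) -> float:
    """Fraction of child rows that reference a missing parent key."""
    if not child_rows:
        return 0.0  # No child rows means nothing can be violated.
    parent_keys = {row[parent_col] for row in parent_rows}
    violations = sum(1 for row in child_rows if row[child_col] not in parent_keys)
    return violations / len(child_rows)


customers = [{"customer_id": 1}, {"customer_id": 2}]
orders = [
    {"order_id": 10, "customer_id": 1},
    {"order_id": 11, "customer_id": 2},
    {"order_id": 12, "customer_id": 3},  # No customer 3: a violation.
    {"order_id": 13, "customer_id": 1},
]
rate = fk_violation_rate(customers, "customer_id", orders, "customer_id")
print(f"{rate:.2%}")
# 25.00%
```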

get_table_order
def get_table_order() -> list[str]

Get the topological order of tables for sampling.

Returns:

  • list - List of table names in sampling order.

Raises:

  • RuntimeError - If model not fitted.
  • SynthesisAPIError - If request fails.
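A topological order places every parent table before its children, so parent keys exist before child rows are sampled. The sketch below derives such an order locally from relationship tuples using the standard library's graphlib (Python 3.9+). `table_order` is a hypothetical helper, and the server's ordering may differ when several valid orders exist.

```python
from graphlib import TopologicalSorter


def table_order(relationships: list[tuple[str, str, str, str]]) -> list[str]:
    """Order tables so that each parent comes before its children."""
    sorter: TopologicalSorter = TopologicalSorter()
    for parent_table, _parent_col, child_table, _child_col in relationships:
        # The child depends on the parent, so the parent is a predecessor.
        sorter.add(child_table, parent_table)
    return list(sorter.static_order())


relationships = [
    ("customers", "customer_id", "orders", "customer_id"),
    ("orders", "order_id", "items", "order_id"),
]
order = table_order(relationships)
print(order)
# ['customers', 'orders', 'items']
```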

evaluate
def evaluate(real_tables: dict[str, pd.DataFrame],
             synthetic_tables: dict[str, pd.DataFrame],
             categorical_cols: dict[str, list[str]] | None = None,
             target_col: str | None = None,
             task_type: str | None = None,
             eval_params: dict[str, Any] | None = None) -> dict[str, Any]

Evaluate multi-table synthetic data quality.

Evaluates each table individually and returns per-table metrics.

Arguments:

  • real_tables Dict[str, pd.DataFrame] - Real training tables.
  • synthetic_tables Dict[str, pd.DataFrame] - Synthetic tables to evaluate.
  • categorical_cols Dict[str, List[str]], optional - Per-table categorical columns.
  • target_col str, optional - Target column for TSTR/TRTR.
  • task_type str, optional - ‘classification’ or ‘regression’.
  • eval_params dict, optional - FidelityEvaluator configuration.

Returns:

  • dict - Per-table evaluation metrics.

Raises:

  • SynthesisAPIError - If evaluation fails.

Examples:

```python
synth.fit(real_tables)
synthetic = synth.transform(n=100)

metrics = synth.evaluate(
    real_tables, synthetic, categorical_cols={"customers": ["city"]}
)
print(metrics["customers"]["correlation_error"])
```

RemoteTabDiff

class RemoteTabDiff(_RemoteSingleTableSynthesizer)

Remote client for TabDiff diffusion-based synthesis.

Mirrors the interface of the local TabDiff class but delegates all computation to a remote REST API. Ideal for GPU-intensive synthesis without local GPU resources.

Inherits fit, transform, fit_transform, evaluate, and summary from _RemoteSingleTableSynthesizer.

Arguments:

  • endpoint - Base URL of the synthesis API.
  • model_version - Version identifier for model persistence.
  • config - Advanced client configuration.
  • storage_config - Artifact storage credentials/configuration.
  • **parameters - Model parameters (categorical_cols, epochs, etc.).

Examples:

```python
synth = RemoteTabDiff(
    endpoint="http://localhost:8000",
    categorical_cols=["city", "product"],
    epochs=1000,
)
synth.fit(df)
synthetic = synth.transform(n=1000)
```

RemoteSMOTE

class RemoteSMOTE(_RemoteSingleTableSynthesizer)

Remote client for SMOTE-based synthesis.

Mirrors the interface of the local SMOTE class but delegates all computation to a remote REST API. Useful for oversampling minority classes in imbalanced datasets.

Inherits fit, transform, fit_transform, evaluate, and summary from _RemoteSingleTableSynthesizer.

Arguments:

  • endpoint - Base URL of the synthesis API.
  • model_version - Version identifier for model persistence.
  • config - Advanced client configuration.
  • storage_config - Artifact storage credentials/configuration.
  • **parameters - Model parameters (categorical_cols, k, noise_scale, etc.).

Examples:

```python
synth = RemoteSMOTE(
    endpoint="http://localhost:8000",
    categorical_cols=["class"],
    k=5,
)
synth.fit(df)
synthetic = synth.transform(n=1000)
```

RemoteTabularGAN

class RemoteTabularGAN(_RemoteSingleTableSynthesizer)

Remote client for TabularGAN-based synthesis.

Mirrors the interface of the local TabularGAN class but delegates all computation to a remote REST API. Uses CTABGAN architecture with mode-specific normalization for mixed continuous/categorical columns.

Inherits fit, transform, fit_transform, evaluate, and summary from _RemoteSingleTableSynthesizer.

Arguments:

  • endpoint - Base URL of the synthesis API.
  • model_version - Version identifier for model persistence.
  • config - Advanced client configuration.
  • storage_config - Artifact storage credentials/configuration.
  • **parameters - Model parameters (categorical_cols, epochs, etc.).

Examples:

```python
synth = RemoteTabularGAN(
    endpoint="http://localhost:8000",
    categorical_cols=["city", "product"],
    epochs=300,
)
synth.fit(df)
synthetic = synth.transform(n=1000)
```

PrivacyEvaluator

class PrivacyEvaluator()

Remote client for privacy attack evaluation.

Provides methods to evaluate privacy risks in synthetic data using membership inference attacks, sensitive attribute reconstruction, and linkage attack risk analysis.

Arguments:

  • endpoint str, optional - Base URL of the synthesis API.
  • config Optional[ClientConfig] - Advanced client configuration.

Attributes:

  • client SynthesisClient - Underlying HTTP client.

Examples:

Basic privacy evaluation:

```python
from synthetic_data_sdk import PrivacyEvaluator
import pandas as pd

evaluator = PrivacyEvaluator(endpoint="http://localhost:8000")

# Load datasets
train_real = pd.read_csv("data/train.csv")
test_real = pd.read_csv("data/test.csv")
synthetic = pd.read_csv("data/synthetic.csv")

# Evaluate privacy
results = evaluator.evaluate(
    train_real_data=train_real,
    test_real_data=test_real,
    synthetic_data=synthetic,
    sensitive_columns=["ssn", "salary", "diagnosis"],
)

print(f"Overall Risk: {results['overall_risk']}")
print(f"Successful Attacks: {results['summary']['successful_attacks']}")
```

Custom configuration:

```python
results = evaluator.evaluate(
    train_real_data=train_real,
    test_real_data=test_real,
    synthetic_data=synthetic,
    sensitive_columns=["income", "health_status"],
    k_values=[2, 5, 10, 20],
    config={"shadow_models": 10, "attack_model": "xgboost", "random_state": 42},
)
```

Access individual attack results:

```python
for attack in results["attacks"]:
    print(f"Attack: {attack['attack_type']}")
    print(f"Risk Level: {attack['risk_level']}")
    print(f"Metrics: {attack['metrics']}")
```

__init__
def __init__(endpoint: str | None = None, config: ClientConfig | None = None)

Initialize privacy evaluator client.

Arguments:

  • endpoint - API endpoint URL (e.g., ‘http://api.example.com:8000’). Not required if config is provided.
  • config - Advanced client configuration (timeout, retries, etc.). If provided, endpoint can be omitted.

evaluate
def evaluate(train_real_data: pd.DataFrame | str,
             test_real_data: pd.DataFrame | str,
             synthetic_data: pd.DataFrame | str,
             sensitive_columns: list[str] | None = None,
             k_values: list[int] | None = None,
             config: dict[str, Any] | None = None) -> dict[str, Any]

Evaluate privacy risks in synthetic data.

Executes membership inference attacks, sensitive attribute reconstruction, and linkage attack risk analysis to assess privacy preservation quality.

Arguments:

  • train_real_data - Real training data (DataFrame or path/URI to CSV). This is the data used to train the synthesizer.
  • test_real_data - Real test/holdout data (DataFrame or path/URI to CSV). This data was NOT used to train the synthesizer.
  • synthetic_data - Synthetic data (DataFrame or path/URI to CSV).
  • sensitive_columns - List of column names considered sensitive. If None, all columns are considered.
  • k_values - List of k values for k-anonymity analysis. Defaults to [2, 3, 5, 10].
  • config - Advanced configuration options:
    • shadow_models (int): Number of shadow models for MIA (default: 5)
    • attack_model (str): ML model for MIA (‘catboost’, ‘xgboost’, ‘rf’)
    • sarp_target_model (str): ML model for SARP (‘xgboost’, ‘catboost’)
    • optuna_trials (int): Number of Optuna optimization trials
    • random_state (int): Random seed for reproducibility

Returns:

  • dict - Privacy evaluation results containing:
    • request_id (str): Request identifier
    • status (str): ‘success’ or ‘error’
    • overall_risk (str): Overall risk level (‘low’, ‘medium’, ‘high’, ‘critical’)
    • attacks (List[dict]): Results for each attack type
    • summary (dict): High-level summary with:
      • total_attacks: Number of attacks executed
      • successful_attacks: Number of successful attacks
      • risk_distribution: Count by risk level
      • recommendations: List of recommended actions
    • metadata (dict): Execution details (time, dataset sizes, etc.)

Raises:

  • SynthesisAPIError - If evaluation fails.
  • ValidationError - If data schemas don’t match or inputs are invalid.
  • ConnectionError - If API is unreachable.

Notes:

Attack Types:

  1. Membership Inference Attack (MIA):
    • Determines if a record was in training data
    • Reports: precision, recall, AUC-ROC
    • High success rate indicates privacy risk
  2. Sensitive Attribute Reconstruction (SARP):
    • Attempts to predict sensitive attributes
    • Reports: accuracy, F1 score per sensitive column
    • High accuracy indicates information leakage
  3. Linkage Attack Risk:
    • Analyzes k-anonymity of synthetic data
    • Reports: violation percentage for each k value
    • High violations indicate re-identification risk
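To make the linkage metric more concrete, the sketch below counts the share of records whose quasi-identifier combination occurs fewer than k times. It is a generic k-anonymity check written for illustration, not the server's implementation; the helper name `k_anonymity_violations` and the list-of-dicts record layout are assumptions.

```python
from collections import Counter


def k_anonymity_violations(records, quasi_identifiers, k_values=(2, 3, 5, 10)):
    """Return {k: percent of records in equivalence classes smaller than k}.

    Hypothetical helper for illustration; the API's actual metric may differ.
    """
    if not records:
        return {k: 0.0 for k in k_values}
    # Group records by their combination of quasi-identifier values.
    keys = [tuple(rec[col] for col in quasi_identifiers) for rec in records]
    class_sizes = Counter(keys)
    result = {}
    for k in k_values:
        # A record is "at risk" when fewer than k records share its key.
        at_risk = sum(1 for key in keys if class_sizes[key] < k)
        result[k] = 100.0 * at_risk / len(records)
    return result


rows = [
    {"zipcode": "10001", "age": 34},
    {"zipcode": "10001", "age": 34},
    {"zipcode": "10001", "age": 34},
    {"zipcode": "94105", "age": 51},
]
violations = k_anonymity_violations(rows, ["zipcode", "age"], k_values=(2, 5))
```

With these four rows, one record is unique at k=2 and every record falls below k=5, which is why larger k values generally report more violations.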

Examples:

DataFrame inputs:

```python
evaluator = PrivacyEvaluator(endpoint="http://localhost:8000")

results = evaluator.evaluate(
    train_real_data=train_df,
    test_real_data=test_df,
    synthetic_data=synth_df,
    sensitive_columns=["ssn", "salary"],
)

print(f"Overall Risk: {results['overall_risk']}")
print(f"Attacks: {len(results['attacks'])}")
```

Path/URI inputs:

```python
results = evaluator.evaluate(
    train_real_data="s3://data/train.csv",
    test_real_data="s3://data/test.csv",
    synthetic_data="s3://data/synthetic.csv",
    sensitive_columns=["income", "diagnosis"],
)
```

Custom configuration:

```python
results = evaluator.evaluate(
    train_real_data=train_df,
    test_real_data=test_df,
    synthetic_data=synth_df,
    sensitive_columns=["health_status"],
    k_values=[2, 5, 10, 20],
    config={
        "shadow_models": 10,
        "attack_model": "xgboost",
        "sarp_target_model": "catboost",
        "optuna_trials": 50,
        "random_state": 42,
    },
)
```

Accessing results:

```python
# Overall assessment
print(f"Risk: {results['overall_risk']}")
print(f"Recommendations: {results['summary']['recommendations']}")

# Individual attacks
for attack in results["attacks"]:
    if attack["attack_type"] == "membership_inference":
        print(f"MIA AUC: {attack['metrics']['auc_roc']:.3f}")
        print(f"MIA Risk: {attack['risk_level']}")

# Linkage risk
for attack in results["attacks"]:
    if "linkage" in attack["attack_type"]:
        violations = attack["metrics"]["k_anonymity_violations"]
        for k, pct in violations.items():
            print(f"k={k}: {pct:.1f}% at risk")
```

CertificationClient

class CertificationClient()

Client for certifying synthetic data quality via remote API.

Provides a comprehensive certification score (0-100) with letter grade (A+ to F) by aggregating fidelity, privacy, utility, and completeness metrics.

Score Components:

  • Fidelity (40%): Statistical similarity to real data
  • Privacy (30%): Protection against attacks and memorization
  • Utility (20%): Usefulness for downstream tasks
  • Completeness (10%): Coverage and diversity
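One plausible reading of these percentages is a weighted average of four component scores, each on a 0-100 scale. The sketch below assumes exactly that; the function name `weighted_score`, the linear aggregation, and the check that weights sum to 1 are assumptions, since the text above states only the weights.

```python
def weighted_score(fidelity, privacy, utility, completeness,
                   weights=(0.40, 0.30, 0.20, 0.10)):
    """Combine four 0-100 component scores into one 0-100 score.

    Hypothetical helper: assumes a plain weighted average.
    """
    # Assumed sanity check; the API's own validation rules are not documented here.
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1.0")
    components = (fidelity, privacy, utility, completeness)
    return sum(w * c for w, c in zip(weights, components))


# 0.4 * 90 + 0.3 * 80 + 0.2 * 70 + 0.1 * 100, roughly 84
score = weighted_score(fidelity=90, privacy=80, utility=70, completeness=100)
```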

Grade Scale:

  • A+ (97-100): Production-ready, exceptional quality
  • A (93-96): Production-ready, excellent quality
  • A- (90-92): Production-ready, very good quality
  • B+ (87-89): Production-ready with minor concerns
  • B (83-86): Production-ready, acceptable quality
  • B- (80-82): Conditional production use
  • C+ (75-79): Development/testing only
  • C (70-74): Significant improvements needed
  • F (<60): Failure - do not use
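The scale above can be expressed as a lookup from score to letter grade. This is a minimal sketch, assuming each range's lower bound acts as a threshold (so a fractional score such as 96.5 maps to A); `score_to_grade` is a hypothetical helper, and scores in ranges the scale does not list return None rather than a guessed grade.

```python
# Lower bounds taken from the grade scale above, highest first.
GRADE_FLOORS = [
    (97, "A+"), (93, "A"), (90, "A-"),
    (87, "B+"), (83, "B"), (80, "B-"),
    (75, "C+"), (70, "C"),
]


def score_to_grade(score):
    """Map a 0-100 score to a letter grade, or None for unlisted ranges."""
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    for floor, grade in GRADE_FLOORS:
        if score >= floor:
            return grade
    if score < 60:
        return "F"
    return None  # 60-69 is not covered by the scale listed above
```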

Examples:

```python
from synthetic_data_sdk import CertificationClient
import pandas as pd

# Initialize client
cert = CertificationClient(endpoint="http://localhost:8000")

# Certify synthetic data (basic)
real = pd.read_csv("real.csv")
synthetic = pd.read_csv("synthetic.csv")

result = cert.certify(
    real_data=real,
    synthetic_data=synthetic,
    categorical_cols=["city", "gender"],
    target_col="income",
    task_type="regression",
)

print(f"Grade: {result['grade']}")
print(f"Score: {result['overall_score']:.1f}/100")
print(f"Risk: {result['risk_level']}")
print(f"Summary: {result['summary']}")

# Certify with privacy attacks
train_real = pd.read_csv("train_real.csv")
test_real = pd.read_csv("test_real.csv")

result = cert.certify(
    real_data=real,
    synthetic_data=synthetic,
    categorical_cols=["city", "gender"],
    target_col="income",
    task_type="regression",
    include_privacy_attacks=True,
    train_real_data=train_real,
    test_real_data=test_real,
    feature_cols=["age", "income", "education"],
    sensitive_col="medical_condition",
    quasi_identifiers=["zipcode", "age", "gender"],
)

print(f"Grade: {result['grade']}")
print("Recommendations:")
for rec in result["recommendations"]:
    print(f"  - {rec}")
```

__init__
def __init__(endpoint: str = "http://localhost:8000",
             api_key: str | None = None,
             config: ClientConfig | None = None)

Initialize certification client.

Arguments:

  • endpoint - Base URL of the synthesis API server (ignored if config is provided)
  • api_key - Optional API key for authentication (ignored if config is provided)
  • config - Optional ClientConfig object. If not provided, will create one from endpoint and api_key

certify
def certify(real_data: pd.DataFrame | str,
            synthetic_data: pd.DataFrame | str,
            categorical_cols: list[str] | None = None,
            target_col: str | None = None,
            task_type: str | None = None,
            include_privacy_attacks: bool = False,
            train_real_data: pd.DataFrame | str | None = None,
            test_real_data: pd.DataFrame | str | None = None,
            feature_cols: list[str] | None = None,
            sensitive_col: str | None = None,
            quasi_identifiers: list[str] | None = None,
            fidelity_weight: float = 0.40,
            privacy_weight: float = 0.30,
            utility_weight: float = 0.20,
            completeness_weight: float = 0.10) -> dict[str, Any]

Certify synthetic data quality with comprehensive scoring.

Arguments:

  • real_data - Real dataset (DataFrame or CSV path)
  • synthetic_data - Synthetic dataset (DataFrame or CSV path)
  • categorical_cols - List of categorical column names
  • target_col - Target column for utility evaluation (TSTR/TRTR)
  • task_type - Task type: ‘classification’ or ‘regression’
  • include_privacy_attacks - Run privacy attack evaluation (MIA, SARP, Linkage)
  • train_real_data - Training split of real data (required if include_privacy_attacks=True)
  • test_real_data - Test split of real data (required if include_privacy_attacks=True)
  • feature_cols - Feature columns for MIA (privacy attacks)
  • sensitive_col - Sensitive column for SARP (privacy attacks)
  • quasi_identifiers - Quasi-identifier columns for linkage analysis
  • fidelity_weight - Weight for fidelity component (default 40%)
  • privacy_weight - Weight for privacy component (default 30%)
  • utility_weight - Weight for utility component (default 20%)
  • completeness_weight - Weight for completeness component (default 10%)

Returns:

Dict with certification results:

  • overall_score: 0-100 certification score
  • grade: Letter grade (A+ to F)
  • risk_level: ‘low’, ‘medium’, ‘high’, or ‘critical’
  • breakdown: Detailed score components
  • recommendations: List of actionable recommendations
  • summary: Natural language summary
  • metadata: Certification metadata

Raises:

  • SynthesisAPIError - If certification fails

Example:

```python
cert = CertificationClient(endpoint="http://localhost:8000")

result = cert.certify(
    real_data="data/real.csv",
    synthetic_data="data/synthetic.csv",
    categorical_cols=["city"],
    target_col="income",
    task_type="regression",
)

print(f"Grade: {result['grade']} ({result['overall_score']:.1f}/100)")
print(f"Risk: {result['risk_level']}")
```

CausalEvaluator

class CausalEvaluator()

Remote client for causal fidelity evaluation.

Provides methods to evaluate whether synthetic data preserves causal relationships, decision boundaries, and fairness properties from the original real data.

Arguments:

  • endpoint str, optional - Base URL of the synthesis API.
  • config Optional[ClientConfig] - Advanced client configuration.

Attributes:

  • client SynthesisClient - Underlying HTTP client.

Examples:

Treatment effect evaluation:

```python
from synthetic_data_sdk import CausalEvaluator
import pandas as pd

evaluator = CausalEvaluator(endpoint="http://localhost:8000")

# Load datasets
real = pd.read_csv("data/real.csv")
synthetic = pd.read_csv("data/synthetic.csv")

# Evaluate treatment effect stability
results = evaluator.evaluate(
    real_data=real,
    synthetic_data=synthetic,
    treatment_col="received_treatment",
    outcome_col="recovery_time",
    covariates=["age", "severity"],
)

print(f"Overall Preserved: {results['overall_preserved']}")
print(f"Preservation Rate: {results['summary']['preservation_rate']:.1f}%")
```

Decision consistency evaluation:

```python
results = evaluator.evaluate(
    real_data=real,
    synthetic_data=synthetic,
    target_col="purchased",
    feature_cols=["age", "income", "score"],
    task_type="classification",
)

print(
    "Decision Agreement: "
    f"{results['evaluations'][0]['metrics']['decision_agreement']:.2%}"
)
```

Comprehensive evaluation:

```python
results = evaluator.evaluate(
    real_data=real,
    synthetic_data=synthetic,
    treatment_col="treatment",
    outcome_col="outcome",
    target_col="target",
    feature_cols=["age", "income"],
    task_type="classification",
    sensitive_attr="gender",
)

for evaluation in results["evaluations"]:
    print(f"{evaluation['evaluation_type']}: {evaluation['preserved']}")
```

__init__
def __init__(endpoint: str | None = None, config: ClientConfig | None = None)

Initialize causal evaluator client.

Arguments:

  • endpoint - API endpoint URL (e.g., ‘http://api.example.com:8000’). Not required if config is provided.
  • config - Advanced client configuration (timeout, retries, etc.). If provided, endpoint can be omitted.

evaluate
def evaluate(real_data: pd.DataFrame | str,
             synthetic_data: pd.DataFrame | str,
             treatment_col: str | None = None,
             outcome_col: str | None = None,
             covariates: list[str] | None = None,
             target_col: str | None = None,
             feature_cols: list[str] | None = None,
             task_type: str | None = None,
             sensitive_attr: str | None = None,
             config: dict[str, Any] | None = None) -> dict[str, Any]

Evaluate causal fidelity in synthetic data.

Executes treatment effect stability, decision consistency, and/or fairness shift analyses based on the parameters provided.

Arguments:

  • real_data - Real/original data (DataFrame or path/URI to CSV).
  • synthetic_data - Synthetic data (DataFrame or path/URI to CSV).
  • treatment_col - Column name for treatment indicator (binary 0/1). Required for treatment effect analysis.
  • outcome_col - Column name for outcome/response variable. Required for treatment effect analysis.
  • covariates - List of covariate columns for treatment effect adjustment. Optional for treatment effect analysis.
  • target_col - Target/label column name. Required for decision consistency and fairness analysis.
  • feature_cols - List of feature column names for modeling. Required for decision consistency analysis.
  • task_type - Machine learning task type (‘classification’ or ‘regression’). Required for decision consistency analysis.
  • sensitive_attr - Sensitive attribute column (e.g., ‘gender’, ‘race’). Required for fairness shift analysis.
  • config - Advanced configuration options:
    • ate_threshold (float): Threshold for treatment effect preservation
    • fairness_threshold (float): Threshold for fairness shift
    • test_size (float): Train/test split ratio
    • random_state (int): Random seed for reproducibility
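For intuition about what ate_threshold could be compared against, the sketch below computes an unadjusted Average Treatment Effect (the difference in mean outcome between treated and untreated rows) for two datasets, then the relative error between them. The helper names, the unadjusted estimator, and the `error <= threshold` rule are all assumptions for illustration; the API may use covariate adjustment or a different estimator.

```python
from statistics import mean


def naive_ate(rows, treatment_col, outcome_col):
    """Difference in mean outcome between treated (1) and control (0) rows."""
    treated = [r[outcome_col] for r in rows if r[treatment_col] == 1]
    control = [r[outcome_col] for r in rows if r[treatment_col] == 0]
    if not treated or not control:
        raise ValueError("need at least one treated and one control row")
    return mean(treated) - mean(control)


def ate_relative_error(ate_real, ate_synth):
    """Relative error of the synthetic ATE with respect to the real ATE."""
    if ate_real == 0:
        raise ValueError("relative error is undefined when the real ATE is 0")
    return abs(ate_synth - ate_real) / abs(ate_real)


real = [
    {"treatment": 1, "outcome": 10.0}, {"treatment": 1, "outcome": 12.0},
    {"treatment": 0, "outcome": 6.0}, {"treatment": 0, "outcome": 8.0},
]
synth = [
    {"treatment": 1, "outcome": 11.0}, {"treatment": 1, "outcome": 13.0},
    {"treatment": 0, "outcome": 7.0}, {"treatment": 0, "outcome": 8.0},
]
error = ate_relative_error(
    naive_ate(real, "treatment", "outcome"),
    naive_ate(synth, "treatment", "outcome"),
)
# Assumed rule: the effect counts as preserved when the error is within the threshold.
preserved = error <= 0.15
```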

Returns:

  • dict - Causal evaluation results containing:
    • request_id (str): Request identifier
    • status (str): ‘success’ or ‘error’
    • overall_preserved (bool): Whether all evaluations passed
    • evaluations (List[dict]): Results for each evaluation type
    • summary (dict): High-level summary with:
      • total_evaluations: Number of evaluations executed
      • preserved_evaluations: Number that passed
      • preservation_rate: Percentage preserved
      • recommendations: List of recommended actions
    • metadata (dict): Execution details (time, dataset sizes, config)

Raises:

  • SynthesisAPIError - If evaluation fails.
  • ValidationError - If data schemas don’t match or inputs are invalid.
  • ConnectionError - If API is unreachable.

Notes:

Evaluation Types:

  1. Treatment Effect Stability:
    • Compares Average Treatment Effect (ATE) between real and synthetic
    • Required: treatment_col, outcome_col
    • Optional: covariates for adjustment
    • Reports: ate_preserved, ate_relative_error
  2. Decision Consistency:
    • Compares decision boundaries of models trained on real vs synthetic
    • Required: target_col, feature_cols, task_type
    • Reports: decision_agreement, consistency_score
  3. Fairness Shift:
    • Measures changes in demographic parity
    • Required: sensitive_attr, target_col
    • Reports: fairness_preserved, fairness_shift

At least one evaluation type must be specified by providing the required parameters for that evaluation.
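The required-parameter rules above can be checked on the client before a request is sent. The following is a minimal sketch of such a pre-check; `planned_evaluations` and the returned labels are hypothetical names chosen for this example, not values the API is known to use.

```python
def planned_evaluations(treatment_col=None, outcome_col=None, target_col=None,
                        feature_cols=None, task_type=None, sensitive_attr=None):
    """Return the evaluation types whose required parameters are all present.

    Hypothetical client-side pre-check based on the requirements listed above.
    """
    planned = []
    if treatment_col and outcome_col:
        planned.append("treatment_effect")
    if target_col and feature_cols and task_type:
        planned.append("decision_consistency")
    if sensitive_attr and target_col:
        planned.append("fairness_shift")
    if not planned:
        raise ValueError("provide the required parameters for at least one evaluation")
    return planned


selected = planned_evaluations(target_col="hired", sensitive_attr="gender")
```

Here only the fairness shift requirements are met, so a decision consistency analysis would not be selected until feature_cols and task_type are also supplied.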

Examples:

Treatment effect only:

```python
evaluator = CausalEvaluator(endpoint="http://localhost:8000")

results = evaluator.evaluate(
    real_data=real_df,
    synthetic_data=synth_df,
    treatment_col="treatment",
    outcome_col="outcome",
    covariates=["age", "income"],
)

te_result = results["evaluations"][0]
print(f"ATE Preserved: {te_result['preserved']}")
print(f"ATE Real: {te_result['metrics']['ate_real']:.3f}")
print(f"ATE Synth: {te_result['metrics']['ate_synth']:.3f}")
```

Decision consistency only:

```python
results = evaluator.evaluate(
    real_data="s3://data/real.csv",
    synthetic_data="s3://data/synthetic.csv",
    target_col="purchased",
    feature_cols=["age", "income", "score"],
    task_type="classification",
)

dc_result = results["evaluations"][0]
print(f"Decision Agreement: {dc_result['metrics']['decision_agreement']:.2%}")
```

Fairness shift only:

```python
results = evaluator.evaluate(
    real_data=real_df,
    synthetic_data=synth_df,
    sensitive_attr="gender",
    target_col="hired",
)

fs_result = results["evaluations"][0]
print(f"Fairness Preserved: {fs_result['preserved']}")
```

Comprehensive evaluation (all three):

```python
results = evaluator.evaluate(
    real_data=real_df,
    synthetic_data=synth_df,
    treatment_col="treatment",
    outcome_col="outcome",
    covariates=["age", "income"],
    target_col="target",
    feature_cols=["age", "income", "score"],
    task_type="classification",
    sensitive_attr="gender",
    config={"ate_threshold": 0.15, "fairness_threshold": 0.1, "random_state": 42},
)

print(f"Total Evaluations: {results['summary']['total_evaluations']}")
print(f"Preservation Rate: {results['summary']['preservation_rate']:.1f}%")

for evaluation in results["evaluations"]:
    print(f"\n{evaluation['evaluation_type']}:")
    print(f"  Preserved: {evaluation['preserved']}")
    print(f"  {evaluation['interpretation']}")
```

Path/URI inputs:

```python
results = evaluator.evaluate(
    real_data="gs://bucket/real.parquet",
    synthetic_data="gs://bucket/synthetic.parquet",
    treatment_col="treatment",
    outcome_col="outcome",
)
```

config

Configuration for the Synthetic Data SDK

Manages client-side settings for API communication, including endpoint URLs, timeouts, retry policies, and authentication.


Supported Data URI Formats

Cloud data can be passed to any SDK method that accepts a str path. The server resolves URIs using pty-ai-artifact-storage-lib. Credentials are embedded directly in the URI; there is no separate storage_config field.

AWS S3

  • s3://bucket/path/to/file.csv (IAM / instance role)
  • s3://ACCESS_KEY:SECRET_KEY@REGION/bucket/prefix/ (explicit credentials)
  • s3://bucket/path?region=us-east-1&connect_timeout=5&read_timeout=600

Google Cloud Storage

  • gs://bucket/path/to/file.csv (Application Default Credentials)
  • gs://project@/bucket/prefix/?credentials=/path/to/sa.json

Azure Blob Storage

  • azure://container/blob.csv (DefaultAzureCredential)

Connection string and account URL are server-side settings only.

MinIO (S3-compatible)

  • minio://ACCESS_KEY:SECRET_KEY@HOST:PORT/bucket/prefix/
  • http://bucket.HOST:PORT/prefix (virtual-hosted style)

Local filesystem (SDK reads client-side and sends inline)

  • file:///abs/path/to/file.csv
  • /abs/path/to/file.csv
  • ./relative/path.csv

The SDK detects these, reads the file locally, and sends the data as inline base64. The server never receives a local path.
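The local-versus-cloud decision described above could look roughly like the sketch below. It is an illustration under stated assumptions, not the SDK's detection logic: `is_local_path` and `to_data_input` are hypothetical helpers, the scheme set is taken from the URI formats listed in this section, and the "inline" and "uri" keys follow the DataInput description in the constants module.

```python
import base64
from pathlib import Path
from urllib.parse import urlparse

# Schemes from the cloud URI formats listed in this section.
CLOUD_SCHEMES = {"s3", "gs", "azure", "minio", "http"}


def is_local_path(value):
    """Return True for file:// URIs and plain filesystem paths."""
    scheme = urlparse(value).scheme.lower()
    return scheme == "file" or scheme not in CLOUD_SCHEMES


def to_data_input(value):
    """Build an inline payload for local files, or pass cloud URIs through."""
    if not is_local_path(value):
        return {"uri": value}
    # Strip the file:// prefix so "file:///abs/x.csv" becomes "/abs/x.csv".
    path = value[len("file://"):] if value.startswith("file://") else value
    encoded = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return {"inline": encoded}
```

Under this sketch, `to_data_input("s3://bucket/train.csv")` returns the URI unchanged, while a local path is read from disk and never forwarded as a path.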

Multi-table (dict of URIs or local paths)

Pass a dict[str, str] mapping table names to any of the URI formats above when using multi-table synthesizers. All tables must use the same input kind: either all local paths/file:// URIs (SDK reads and sends inline) or all cloud URIs (passed to the server). Mixing the two raises ValueError.
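The same-kind rule can be sketched as a small validator. This is a hedged illustration only: `classify_tables` is a hypothetical name, and treating an empty scheme or file:// as local is an assumption based on the formats listed above.

```python
from urllib.parse import urlparse


def classify_tables(tables):
    """Return 'local' or 'cloud' for a table-name -> path/URI mapping.

    Raises ValueError when local paths and cloud URIs are mixed.
    """
    if not tables:
        raise ValueError("at least one table is required")
    kinds = {
        "local" if urlparse(uri).scheme in ("", "file") else "cloud"
        for uri in tables.values()
    }
    if len(kinds) > 1:
        raise ValueError("all tables must be local paths or all cloud URIs, not a mix")
    return kinds.pop()


kind = classify_tables({
    "customers": "s3://bucket/customers.csv",
    "orders": "s3://bucket/orders.csv",
})
```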

Timeout query parameters (all cloud backends)

| Parameter | Default | Range |
|---|---|---|
| connect_timeout | 10 s | 1 – 300 s |
| read_timeout | 300 s | 1 – 3600 s |
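A URI with timeout parameters can be assembled with the standard library. The sketch below validates values against the ranges in the table before appending them; `with_timeouts` is a hypothetical helper, and whether the server rejects out-of-range values in the same way is not stated in this document.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Allowed ranges in seconds, taken from the table above.
TIMEOUT_RANGES = {"connect_timeout": (1, 300), "read_timeout": (1, 3600)}


def with_timeouts(uri, **timeouts):
    """Return uri with validated timeout query parameters appended."""
    for name, value in timeouts.items():
        if name not in TIMEOUT_RANGES:
            raise ValueError(f"unknown timeout parameter: {name}")
        low, high = TIMEOUT_RANGES[name]
        if not low <= value <= high:
            raise ValueError(f"{name} must be between {low} and {high} seconds")
    parts = urlsplit(uri)
    # Keep any existing query parameters and add or override the timeouts.
    query = dict(parse_qsl(parts.query))
    query.update({name: str(value) for name, value in timeouts.items()})
    return urlunsplit(parts._replace(query=urlencode(query)))


uri = with_timeouts(
    "s3://bucket/path?region=us-east-1", connect_timeout=5, read_timeout=600
)
```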

Examples:

```python
from synthetic_data_sdk import ClientConfig, RemoteVineCopula

# Custom configuration
config = ClientConfig(endpoint="http://api.example.com:8000", timeout=60, max_retries=3)

synth = RemoteVineCopula(config=config)

# Pass S3 URI with embedded credentials
synth.fit("s3://AKID:SECRET@us-east-1/my-bucket/train.csv")

# Pass GCS URI (uses Application Default Credentials)
synth.fit("gs://my-bucket/data/train.csv")

# Multi-table with MinIO
from synthetic_data_sdk import RemoteMultiTableVineCopula
mt = RemoteMultiTableVineCopula(
    endpoint="http://api.example.com:8000",
    relationships=[("customers", "id", "orders", "customer_id")],
)
mt.fit(
    {
        "customers": "minio://key:secret@minio.example.com:9000/bucket/customers.csv",
        "orders": "minio://key:secret@minio.example.com:9000/bucket/orders.csv",
    }
)
```

ClientConfig

@dataclass
class ClientConfig()

Configuration for SDK clients.

Attributes:

  • endpoint str - Base URL of the synthesis API (e.g., ‘http://localhost:8000’).
  • timeout int - Request timeout in seconds. Default: 300 (5 minutes).
  • max_retries int - Maximum number of retry attempts for failed requests. Default: 3.
  • verify_ssl bool - Whether to verify SSL certificates. Default: True.
  • api_key Optional[str] - API key for authentication (if required). Default: None.
  • headers dict - Additional HTTP headers to include in all requests.

Examples:

Production configuration:

```python
config = ClientConfig(
    endpoint="https://api.example.com",
    timeout=120,
    max_retries=5,
    verify_ssl=True,
    api_key="your-api-key-here",
)
```

Development configuration:

```python
config = ClientConfig(
    endpoint="http://localhost:8000",
    timeout=60,
    verify_ssl=False,  # For self-signed certs
)
```

Custom headers:

```python
config = ClientConfig(
    endpoint="http://api.example.com:8000",
    headers={"X-Organization-ID": "org-123", "X-Environment": "staging"},
)
```

__post_init__
def __post_init__()

Validate and normalize configuration.

from_env
@classmethod
def from_env(cls) -> "ClientConfig"

Create configuration from environment variables.

Reads:

  • SYNTHESIS_ENDPOINT: API endpoint URL (may include path prefix)
  • SYNTHESIS_API_KEY: API key for authentication
  • SYNTHESIS_TIMEOUT: Request timeout (seconds)
  • SYNTHESIS_VERIFY_SSL: Whether to verify SSL (true/false)

Returns:

  • ClientConfig - Configuration instance.

Examples:

```python
import os

from synthetic_data_sdk import ClientConfig

os.environ["SYNTHESIS_ENDPOINT"] = "http://api.example.com:8000"
os.environ["SYNTHESIS_API_KEY"] = "sk-..."
config = ClientConfig.from_env()
print(config.endpoint)  # http://api.example.com:8000
```
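To show how the four variables might be parsed, here is a self-contained sketch that does not use the SDK. `EnvSettings` and `settings_from_env` are hypothetical stand-ins; the defaults (300 seconds, SSL verification on) come from the ClientConfig attribute list, while treating the endpoint as required, stripping a trailing slash, and accepting only "true" as truthy are assumptions.

```python
import os
from dataclasses import dataclass
from typing import Optional


@dataclass
class EnvSettings:
    """Hypothetical stand-in for ClientConfig, used only in this sketch."""
    endpoint: str
    api_key: Optional[str] = None
    timeout: int = 300
    verify_ssl: bool = True


def settings_from_env(environ=None):
    """Build EnvSettings from the SYNTHESIS_* variables listed above."""
    env = os.environ if environ is None else environ
    return EnvSettings(
        endpoint=env["SYNTHESIS_ENDPOINT"].rstrip("/"),
        api_key=env.get("SYNTHESIS_API_KEY"),
        timeout=int(env.get("SYNTHESIS_TIMEOUT", "300")),
        # Assumed parsing: only "true" (any case) enables verification.
        verify_ssl=env.get("SYNTHESIS_VERIFY_SSL", "true").strip().lower() == "true",
    )


settings = settings_from_env({
    "SYNTHESIS_ENDPOINT": "http://api.example.com:8000/",
    "SYNTHESIS_TIMEOUT": "60",
    "SYNTHESIS_VERIFY_SSL": "false",
})
```

Passing a plain dict instead of reading os.environ keeps the sketch easy to test without changing the process environment.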

constants

Constants for Synthetic Data SDK.

This module provides standardized constants for field names used in API requests, ensuring consistency and type safety across the SDK.

DataFieldNames

class DataFieldNames()

Standard field names for data parameters in API requests.

All data fields map to DataInput objects on the server side. Each DataInput is a single field that holds exactly one of: inline (base64 CSV), uri (cloud/file URI), inline_tables (multi-table base64 dict), or uri_tables (multi-table URI dict).

Examples:

```python
# Using constants for clarity
payload = {DataFieldNames.TRAINING: {"inline": base64_data}}

# Cloud URI
payload = {DataFieldNames.TRAINING: {"uri": "s3://bucket/train.csv"}}
```
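Because a DataInput holds exactly one of four keys, a payload can be sanity-checked before it is sent. The sketch below is a client-side illustration; `validate_data_input` is a hypothetical helper and the server's own validation may be stricter or report errors differently.

```python
DATA_INPUT_KEYS = ("inline", "uri", "inline_tables", "uri_tables")


def validate_data_input(data_input):
    """Return the single populated key, or raise ValueError."""
    unknown = set(data_input) - set(DATA_INPUT_KEYS)
    if unknown:
        raise ValueError(f"unknown keys: {sorted(unknown)}")
    present = [key for key in DATA_INPUT_KEYS if data_input.get(key) is not None]
    if len(present) != 1:
        raise ValueError(f"expected exactly one of {DATA_INPUT_KEYS}, got {present}")
    return present[0]


kind = validate_data_input({"uri": "s3://bucket/train.csv"})
```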

ModelNames

class ModelNames()

Standard model names supported by the API.

Examples:

```python
client = RemoteVineCopula(endpoint="http://localhost:8000")
assert client.model_name == ModelNames.VINE
```

ActionNames

class ActionNames()

Standard action names for synthesis operations.

Examples:

```python
payload = {"action": ActionNames.FIT_TRANSFORM}
```

exceptions

Exceptions for the Synthetic Data SDK

Custom exception hierarchy for distinguishing between different types of API errors, enabling granular error handling in client applications.

Exception Hierarchy:

SynthesisAPIError (base)
├── ConnectionError (network failures)
├── ValidationError (4xx client errors)
├── ServerError (5xx server errors)
└── TierRestrictionError (403, feature not available in the active tier)

Examples:

```python
# ConnectionError is imported from the SDK so the except clause below catches
# the SDK exception rather than Python's built-in ConnectionError.
from synthetic_data_sdk import (
    ConnectionError,
    RemoteVineCopula,
    ServerError,
    ValidationError,
)

try:
    synth = RemoteVineCopula(endpoint="http://localhost:8000")
    synth.fit(invalid_data)  # invalid_data: input that fails validation
except ValidationError as e:
    print(f"Invalid request: {e}")
    print("Fix your data and retry")
except ServerError as e:
    print(f"Server error: {e}")
    print(f"Contact support with request_id: {e.request_id}")
except ConnectionError as e:
    print(f"Network error: {e}")
    print("Check server availability")
```

SynthesisAPIError

class SynthesisAPIError(Exception)

Base exception for all Synthesis API errors.

Attributes:

  • message str - Human-readable error description.
  • status_code Optional[int] - HTTP status code if available.
  • request_id Optional[str] - Request ID for tracking/debugging.
  • response_body Optional[dict] - Full API response for detailed inspection.

Examples:

```python
try:
    synth.fit(data)
except SynthesisAPIError as e:
    logger.error(
        f"API error: {e.message}",
        extra={"request_id": e.request_id, "status_code": e.status_code},
    )
```

__init__
def __init__(message: str,
             status_code: int | None = None,
             request_id: str | None = None,
             response_body: dict | None = None)

Initialize API error.

Arguments:

  • message - Human-readable error description.
  • status_code - HTTP status code (e.g., 400, 500).
  • request_id - Unique request identifier from API response.
  • response_body - Full JSON response from API for debugging.

__str__
def __str__() -> str

Format error message with optional metadata.

ConnectionError

class ConnectionError(SynthesisAPIError)

Network or connection-related errors.

Raised when:

  • Server is unreachable
  • Network timeout
  • DNS resolution failure
  • SSL/TLS errors

Examples:

```python
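# Import the SDK's ConnectionError; otherwise the built-in one is used.
from synthetic_data_sdk import ConnectionError, RemoteVineCopula

try:
    synth = RemoteVineCopula(endpoint="http://nonexistent:8000")
    synth.fit(df)
except ConnectionError:
    print("Server unreachable - check endpoint URL and network")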
```

ValidationError

class ValidationError(SynthesisAPIError)

Request validation errors (HTTP 4xx).

Raised when:

  • Missing required fields
  • Invalid parameter values
  • Malformed request body
  • Unknown model name

Examples:

```python
try:
    synth = RemoteVineCopula(endpoint="http://localhost:8000")
    synth.transform(n=-10)  # Invalid n_samples
except ValidationError as e:
    print(f"Invalid request: {e}")
    # Fix the issue and retry
```

ServerError

class ServerError(SynthesisAPIError)

Server-side errors (HTTP 5xx).

Raised when:

  • Internal server error
  • Service temporarily unavailable
  • Synthesis operation failed

Examples:

```python
try:
    synth = RemoteVineCopula(endpoint="http://localhost:8000")
    synth.fit(df)
except ServerError as e:
    print("Server error - contact support")
    print(f"Request ID: {e.request_id}")
    # Implement retry logic with exponential backoff
```
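The retry comment in the example above can be fleshed out as follows. This is a minimal sketch of exponential backoff around any callable: `call_with_backoff` is a hypothetical helper, the local `ServerError` class is a stand-in so the block runs without the SDK, and the delays (1 s, 2 s, 4 s) are illustrative choices. Since ClientConfig already exposes max_retries, a wrapper like this would sit on top of whatever retries the client performs itself.

```python
import time


class ServerError(Exception):
    """Stand-in for the SDK's ServerError so the sketch runs on its own."""


def call_with_backoff(operation, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call operation(), retrying on ServerError with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except ServerError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # Double the wait after each failure: base_delay, 2x, 4x, ...
            sleep(base_delay * 2 ** attempt)
```

With the real SDK, usage might look like `call_with_backoff(lambda: synth.fit(df))`, importing ServerError from the SDK instead of defining it locally.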

TierRestrictionError

class TierRestrictionError(SynthesisAPIError)

Feature not available in the server’s active tier (HTTP 403).

Raised when the server returns 403 because the requested feature requires a higher product tier than the one currently deployed.

Attributes:

  • feature - Gated feature name (e.g. "tabdiff").
  • component - Component category (e.g. "models").
  • current_tier - Tier running on the server.
  • required_tier - Lowest tier that unlocks the feature.

Examples:

```python
try:
    synth = RemoteTabDiff(endpoint="http://localhost:8000")
    synth.fit(df)
except TierRestrictionError as e:
    print(f"Upgrade required: current={e.current_tier}, need={e.required_tier}")
```

Last modified : June 12, 2026