Ingestion API

For programmatic data ingestion, Iara Data provides two secure, high-performance mechanisms to ingest data directly into your warehouse:

  1. Webhook Inbound Push API — Event-driven, auto-healing schema evolution, and multi-table support.
  2. Standard Metrics API — Programmatic ingestion for metrics and structured logs.

⚡ Webhook Inbound Push API

The Inbound Webhook API allows external applications or webhooks (e.g., Stripe, Shopify, custom CRMs) to push raw JSON payloads directly into the platform.

Endpoint

POST /v1/ingest/push/{ingestToken}
  • Authentication: The ingestToken is a unique, secure token generated during connection setup.
  • Tenant Isolation: Must include the x-tenant-id header to route the request to the correct tenant sandbox.

Payload Formats

You can ingest data in three different structures, depending on your architecture:

1. Single Table Payload (Explicit Target)

Send a flat array of objects. The target table must be specified via the ?table=your_table_name query parameter or the x-target-table header.

[
{ "id": 1, "phone": "11999999999", "new_col_x": "value" },
{ "id": 2, "phone": "11988888888", "new_col_x": "other" }
]
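A minimal sketch of how a client might assemble this request in Python, using only the standard library. The host in BASE_URL, the helper name build_push_request, and the Content-Type header are assumptions for illustration; the path, the x-tenant-id header, and the table query parameter come from the description above. The request is only built here, not sent.

```python
import json
import urllib.request
from urllib.parse import quote, urlencode

BASE_URL = "https://api.example.com"  # assumption: placeholder host, replace with yours


def build_push_request(ingest_token, tenant_id, rows, table=None):
    """Build (but do not send) a POST to /v1/ingest/push/{ingestToken}."""
    url = f"{BASE_URL}/v1/ingest/push/{quote(ingest_token, safe='')}"
    if table is not None:
        # Single-table payloads name their target via ?table=...
        url += "?" + urlencode({"table": table})
    return urllib.request.Request(
        url,
        data=json.dumps(rows).encode("utf-8"),
        headers={
            "Content-Type": "application/json",  # assumption: JSON content type
            "x-tenant-id": tenant_id,
        },
        method="POST",
    )


rows = [
    {"id": 1, "phone": "11999999999", "new_col_x": "value"},
    {"id": 2, "phone": "11988888888", "new_col_x": "other"},
]
request = build_push_request("YOUR_INGEST_TOKEN", "YOUR_TENANT_ID", rows, table="your_table_name")
```

Passing the resulting object to urllib.request.urlopen would send it; the x-target-table header could be set instead of the query parameter.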

2. Envelope Payload (Explicit Target)

Send a single JSON object with a table field naming the target table and a rows array containing the records.

{
"table": "users_logs",
"rows": [
{ "id": 1, "email": "test@example.com", "action": "login" }
]
}

3. Hybrid Multi-Table Payload (Dynamic Ingestion)

To maximize performance and combine payloads, you can push data for multiple tables in a single request. The platform automatically routes each array to its corresponding physical table:

{
"test_tbl_orders": [
{ "order_id": 101, "total": 150.50, "status": "completed" }
],
"test_tbl_users": [
{ "user_id": 1, "phone": "11999999999", "name": "John Doe" }
]
}
Tip: Auto-Bootstrap. If a table in a multi-table payload is not yet registered in the platform, a pipeline job is automatically created on the fly with flexible ingestion settings.
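The three payload shapes can be told apart by their top-level structure. The sketch below is one plausible way to reduce all of them to a table-to-rows mapping; the function name split_by_table and the detection rules are assumptions, since the platform's actual routing logic is not documented here.

```python
def split_by_table(payload, default_table=None):
    """Return {table_name: rows} for any of the three payload shapes."""
    if isinstance(payload, list):
        # Shape 1: flat array, target comes from ?table= or x-target-table.
        if not default_table:
            raise ValueError("flat array payloads need an explicit target table")
        return {default_table: payload}
    if isinstance(payload, dict):
        # Shape 2: envelope with "table" and "rows".
        if isinstance(payload.get("table"), str) and isinstance(payload.get("rows"), list):
            return {payload["table"]: payload["rows"]}
        # Shape 3: every top-level key is a table name mapped to an array of rows.
        tables = {name: rows for name, rows in payload.items() if isinstance(rows, list)}
        if tables and len(tables) == len(payload):
            return tables
    raise ValueError("unrecognized payload shape")
```

Under these assumptions, a multi-table payload yields one entry per table, and each entry can then be handled like a single-table push.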


🧠 Self-Learning & Auto-Healing Schema

When new fields are pushed to a dataset, the Inbound API automatically resolves mismatches without breaking ingestion or failing jobs.

1. Zero-Cost Offline Pre-filtering

To optimize token usage and cost, the platform implements deterministic checks before invoking the AI model:

  1. Full Coverage (Bypass AI): If all officially registered columns are present in the payload, any additional fields are treated as brand new physical columns and evolved in the schema.
  2. Deterministic Typos (Bypass AI): If there are 1 or 2 new columns, the engine calculates the Levenshtein distance against empty/unfilled registered columns. If the distance is $\le 2$ and the types are compatible (e.g., destination is varchar/text), the engine maps the typo to the canonical column offline (e.g., phnoe $\rightarrow$ phone) and saves it as an alias.
  3. AI Schema Matcher fallback: If offline matches fail or there are $>2$ unresolved columns, the request goes to the AI Matcher.
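The three checks above can be sketched as a single decision function. This is an illustration under stated assumptions, not the platform's implementation: the names levenshtein and prefilter are hypothetical, type compatibility is modeled as a set of text-typed columns, and requiring exactly one candidate per typo is an added safeguard that the description does not mention.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def prefilter(payload_cols, registered_cols, text_cols):
    """Return (decision, alias_mapping) without calling any AI model."""
    registered = set(registered_cols)
    new_cols = [c for c in payload_cols if c not in registered]
    unfilled = [c for c in registered_cols if c not in payload_cols]
    if not new_cols:
        return ("no_change", {})
    if not unfilled:
        # Full coverage: extra fields become new physical columns.
        return ("add_columns", {})
    if len(new_cols) <= 2:
        mapping = {}
        for col in new_cols:
            candidates = [
                r for r in unfilled
                if r in text_cols and levenshtein(col, r) <= 2
            ]
            if len(candidates) != 1:  # assumption: ambiguous or no match falls through
                break
            mapping[col] = candidates[0]
        else:
            return ("alias", mapping)
    return ("ai_matcher", {})
```

For instance, a payload with columns id and phnoe against registered columns id and phone resolves offline to the alias phnoe to phone, while celular is too far from phone and is deferred to the AI Matcher.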

2. AI Schema Matcher

The system invokes the configured model to map semantic synonyms:

  • For example, numero_movel or celular are automatically identified as synonyms for phone.
  • The resolved name is saved as an alias in the database, and future payloads are normalized automatically.
  • Columns with no semantic match (e.g., birth_place, which the matcher maps to null) trigger a physical DDL ALTER TABLE ADD COLUMN in Trino.
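Once aliases are stored, normalizing a later payload amounts to a key rename. The sketch below assumes the aliases are available as a plain dictionary and that the canonical column wins when a row carries both names; both points are assumptions, as is the helper name normalize_row.

```python
def normalize_row(row, aliases):
    """Rename aliased keys to their canonical column names."""
    out = {}
    for key, value in row.items():
        canonical = aliases.get(key, key)
        # If the canonical name was already written, an alias must not overwrite it.
        if canonical in out and key != canonical:
            continue
        out[canonical] = value
    return out


# Hypothetical alias table built from earlier matcher results.
aliases = {"celular": "phone", "numero_movel": "phone"}
normalized = normalize_row({"id": 1, "celular": "11999999999"}, aliases)
```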

3. Asynchronous Catalog Enrichment & PII

Whenever a physical schema evolves:

  • The platform triggers an asynchronous background worker task via the integrated AI model.
  • It evaluates descriptions and tags, and automatically flags sensitive fields as PII (Personally Identifiable Information).
  • Auto-Hashing: Any column flagged as PII is automatically hashed using SHA-256 upon ingestion to protect data privacy.
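To illustrate the auto-hashing step, here is a minimal sketch using Python's hashlib. How the platform serializes values before hashing is not documented, so converting each value to a UTF-8 string is an assumption, as is the helper name hash_pii.

```python
import hashlib


def hash_pii(row, pii_columns):
    """Return a copy of row with PII columns replaced by SHA-256 hex digests."""
    out = dict(row)
    for col in pii_columns:
        if out.get(col) is not None:
            # Assumption: values are stringified and UTF-8 encoded before hashing.
            out[col] = hashlib.sha256(str(out[col]).encode("utf-8")).hexdigest()
    return out


hashed = hash_pii({"user_id": 1, "phone": "11999999999"}, pii_columns={"phone"})
```

Because SHA-256 is deterministic, equal inputs produce equal digests, so hashed columns can still be used for joins and deduplication.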

📊 Standard Metrics API

Use this API for direct structured inserts of system logs or time-series metrics.

Endpoint

POST /v1/ingest/metrics
  • Authentication: Include your API key in the Authorization: Bearer YOUR_API_KEY header.

Request Body

{
"table": "my_metrics",
"rows": [
{
"date": "2026-06-12",
"metric_name": "cpu_usage",
"value": 45.2,
"region": "us-east"
}
]
}
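A client-side sketch of the same request in Python. The host in BASE_URL, the helper name build_metrics_request, and the Content-Type header are assumptions; the path, the Bearer scheme, and the body layout follow the description above. The request is only constructed, not sent.

```python
import json
import urllib.request

BASE_URL = "https://api.example.com"  # assumption: placeholder host, replace with yours


def build_metrics_request(api_key, table, rows):
    """Build (but do not send) a POST to /v1/ingest/metrics."""
    body = {"table": table, "rows": rows}
    return urllib.request.Request(
        f"{BASE_URL}/v1/ingest/metrics",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",  # assumption: JSON content type
        },
        method="POST",
    )


metrics_request = build_metrics_request(
    "YOUR_API_KEY",
    "my_metrics",
    [{"date": "2026-06-12", "metric_name": "cpu_usage", "value": 45.2, "region": "us-east"}],
)
```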

⚙️ Limits, Idempotency & Security

Idempotency

Always include an Idempotency-Key header on POST requests. The system caches responses for 24 hours to prevent duplicate records in case of network retries.
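For the key to deduplicate retries, the same value has to be reused on every attempt of one logical request. The sketch below shows that pattern; the send callable is a hypothetical stand-in for whatever HTTP client you use, and using a UUID as the key format is an assumption.

```python
import uuid


def post_with_retries(send, body, max_attempts=3):
    """Call send(body, headers) up to max_attempts times with one Idempotency-Key."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    # Generated once, outside the loop, so every retry carries the same key.
    headers = {"Idempotency-Key": str(uuid.uuid4())}
    last_error = None
    for _ in range(max_attempts):
        try:
            return send(body, headers)
        except ConnectionError as exc:
            last_error = exc
    raise last_error
```

A fresh key should be generated for each new logical request; reusing one key across unrelated requests would presumably return the cached response of the first.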

Rate Limits

Rate limits are configured per connection (e.g., up to 300 requests/minute). Custom IP ranges (Allowed IPs) can be whitelisted to restrict webhook execution.
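As an illustration of a per-connection limit such as 300 requests/minute, here is a sliding-window limiter sketch. The algorithm the platform actually uses is not documented, so the class SlidingWindowLimiter and its behavior are assumptions; clients can use something similar to throttle themselves below the configured limit.

```python
from collections import deque


class SlidingWindowLimiter:
    """Allow at most `limit` requests in any window of `window_s` seconds."""

    def __init__(self, limit=300, window_s=60.0):
        self.limit = limit
        self.window_s = window_s
        self.hits = deque()  # timestamps of accepted requests, oldest first

    def allow(self, now):
        # Drop timestamps that have fallen out of the window.
        while self.hits and now - self.hits[0] >= self.window_s:
            self.hits.popleft()
        if len(self.hits) >= self.limit:
            return False
        self.hits.append(now)
        return True
```

In practice `now` would come from time.monotonic(); it is a parameter here so the behavior is easy to verify.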

DoS Protection

The platform routes all raw pushes to a highly isolated PG-backed queue (boitata_inbound_queue). The background pipeline worker processes these messages in micro-batches every 3 seconds, shielding the analytical engine (Trino/Iceberg) from concurrent request spikes.
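The micro-batching idea can be sketched with an in-memory deque standing in for the queue table. The function names, the max_batch cap, and the injectable sleep are assumptions for illustration; only the 3-second interval comes from the description above.

```python
import time
from collections import deque


def drain_batch(queue, max_batch):
    """Remove and return up to max_batch messages from the front of the queue."""
    batch = []
    while queue and len(batch) < max_batch:
        batch.append(queue.popleft())
    return batch


def run_worker(queue, process, ticks, interval_s=3.0, max_batch=500, sleep=time.sleep):
    """Every interval_s seconds, hand one micro-batch to process()."""
    for _ in range(ticks):
        batch = drain_batch(queue, max_batch)
        if batch:
            # The analytical engine sees one write per tick, not one per request.
            process(batch)
        sleep(interval_s)
```

However many requests arrive within an interval, the downstream engine receives at most one batch per tick, which is what absorbs the spikes.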