CBT

A ClickHouse-focused data transformation tool that provides fast, idempotent transformations with dependency management for building reliable data pipelines.

Overview

CBT simplifies building data transformation pipelines on ClickHouse by providing:

  • Fast Transformations: Optimized for ClickHouse with native query execution
  • Idempotent Operations: Safe to re-run transformations without side effects
  • Dependency Management: Automatic validation of upstream data availability
  • Position Tracking: Precise interval tracking for incremental transformations
  • Gap Detection: Automatically identifies and backfills missing data intervals
  • Scheduled Jobs: Support for both incremental and scheduled transformations

Key Features

Incremental Transformations

Process data in ordered intervals with precise position tracking:

  • Maintains exact boundaries for every processed interval
  • Supports gap detection and automatic backfilling
  • Validates dependency availability before processing
  • Well suited to event stream processing and time-series aggregations
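
The gap detection described above can be illustrated with a minimal Python sketch. It is not CBT's implementation: the half-open `(start, end)` interval representation and the `find_gaps` helper are assumptions made for illustration only.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # assumed half-open range [start, end)


def find_gaps(processed: List[Interval], lo: int, hi: int) -> List[Interval]:
    """Return the sub-ranges of [lo, hi) not covered by any processed interval."""
    gaps: List[Interval] = []
    cursor = lo  # everything before the cursor is covered or already reported
    for start, end in sorted(processed):
        if end <= cursor:
            continue  # interval lies entirely behind the cursor
        if start >= hi:
            break  # interval starts past the range of interest
        if start > cursor:
            gaps.append((cursor, start))  # uncovered stretch before this interval
        cursor = max(cursor, end)
        if cursor >= hi:
            break
    if cursor < hi:
        gaps.append((cursor, hi))  # uncovered tail of the range
    return gaps
```

Each returned gap is a candidate for a backfill task; overlapping processed intervals are tolerated because the cursor only ever moves forward.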

Scheduled Transformations

Execute transformations on a schedule without position tracking:

  • Ideal for reference data updates (exchange rates, user lists)
  • System health monitoring and report generation
  • Database maintenance tasks
  • Runs independently of data positions

Multi-Instance Architecture

CBT runs as a unified binary handling both coordination and task execution:

  • Multiple instances can run for high availability
  • Automatic task deduplication via Redis-backed queue
  • Tag-based worker filtering for specialized processing
  • Shared ClickHouse and Redis infrastructure
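
As a rough illustration of task deduplication and tag-based filtering, the sketch below uses an in-memory class as a stand-in for the Redis-backed queue. The `DedupQueue` class, the `model:start:end` task ID format, and the matching rule (a worker may claim a task if they share at least one tag, and untagged tasks go to any worker) are all assumptions, not CBT's actual behaviour.

```python
from typing import List, Optional, Set, Tuple


class DedupQueue:
    """In-memory stand-in for a shared queue; hypothetical, not CBT's API."""

    def __init__(self) -> None:
        self._pending: List[Tuple[str, Set[str]]] = []
        self._seen: Set[str] = set()

    def enqueue(self, model: str, start: int, end: int, tags: Set[str]) -> bool:
        """Add a task unless an identical one was already enqueued."""
        task_id = f"{model}:{start}:{end}"  # assumed ID format
        if task_id in self._seen:
            return False  # duplicate submitted by another instance
        self._seen.add(task_id)  # the ID stays recorded after the task is claimed
        self._pending.append((task_id, set(tags)))
        return True

    def claim(self, worker_tags: Set[str]) -> Optional[str]:
        """Hand out the oldest pending task this worker is allowed to run."""
        for i, (task_id, tags) in enumerate(self._pending):
            if not tags or tags & worker_tags:
                del self._pending[i]
                return task_id
        return None
```

With several instances enqueueing the same interval, only the first call succeeds, so each task is executed once regardless of how many coordinators are running.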

Architecture

         ┌───────────────┐
         │      CBT      │
         └───────┬───────┘
                 │
        ┌────────┴────────┐
        │                 │
        ▼                 ▼
┌──────────────┐  ┌──────────────┐
│    Redis     │  │  ClickHouse  │
│              │  │              │
│ • Task Queue │  │ • Data       │
│ • Scheduling │  │ • Admin      │
└──────────────┘  └──────────────┘

Use Cases

Data Pipeline Engineering

  • Build complex transformation pipelines with dependency management
  • Transform raw Ethereum data into analytics-ready tables
  • Aggregate metrics across multiple data sources with automatic validation

Analytics Platform

  • Power dashboards and reporting with transformed data
  • Maintain materialized views and aggregation tables
  • Ensure data consistency across dependent transformations

Real-time Processing

  • Process streaming data in ordered intervals
  • Handle late-arriving data with gap detection
  • Maintain data freshness with scheduled updates

How It Works

Model Definition

Models are defined in files that pair a YAML header (between --- delimiters) with a SQL body. Each model specifies:

  • Transformation type (incremental or scheduled)
  • Dependencies on external data sources or other transformations
  • Processing intervals and schedules
  • SQL transformation logic or external script execution

External Models

Define source data boundaries:

---
table: beacon_blocks
interval:
  type: slot
---
SELECT
    min(slot) as min,
    max(slot) as max
FROM ethereum.beacon_blocks
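
One way to picture how these boundaries feed dependency validation is the Python sketch below. It assumes an interval may be processed only when it lies inside the `(min, max)` range reported by every dependency; the `can_process` helper and that rule are illustrative assumptions, not CBT's documented logic.

```python
from typing import Dict, Tuple

Bounds = Tuple[int, int]  # (min, max) as returned by an external model's query


def can_process(start: int, end: int, dependencies: Dict[str, Bounds]) -> bool:
    """True when [start, end] lies inside the bounds of every dependency."""
    return all(lo <= start and end <= hi for lo, hi in dependencies.values())


def blocking(start: int, end: int, dependencies: Dict[str, Bounds]) -> list:
    """Names of the dependencies whose data does not yet cover [start, end]."""
    return sorted(
        name
        for name, (lo, hi) in dependencies.items()
        if not (lo <= start and end <= hi)
    )
```

For example, with `{"ethereum.beacon_blocks": (1000, 5000)}` an interval of 1000 to 4600 passes, while 4000 to 7600 is held back until the upstream table catches up.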

Transformation Models

Process data with dependency validation:

---
type: incremental
table: block_stats
interval:
  max: 3600
dependencies:
  - ethereum.beacon_blocks
schedules:
  forwardfill: "@every 5m"
  backfill: "@every 1h"
---
INSERT INTO analytics.block_stats
SELECT
    slot,
    COUNT(*) as block_count,
    AVG(gas_used) as avg_gas
FROM ethereum.beacon_blocks
WHERE slot BETWEEN {{ .bounds.start }} AND {{ .bounds.end }}
GROUP BY slot;
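
The `{{ .bounds.start }}` and `{{ .bounds.end }}` placeholders are filled in per interval before the query runs. The Python sketch below mimics that substitution with a regular expression so the idea can be tried in isolation; the `render` helper and the nested-dictionary context are assumptions, and it handles only simple dotted lookups rather than full template syntax.

```python
import re
from typing import Any, Dict

# Matches placeholders such as "{{ .bounds.start }}" and captures "bounds.start".
_PLACEHOLDER = re.compile(r"\{\{\s*\.([A-Za-z_][\w.]*)\s*\}\}")


def render(sql: str, context: Dict[str, Any]) -> str:
    """Replace each dotted placeholder with its value from a nested dict."""

    def lookup(match):
        value: Any = context
        for key in match.group(1).split("."):
            value = value[key]  # raises KeyError for an unknown placeholder
        return str(value)

    return _PLACEHOLDER.sub(lookup, sql)
```

Because the rendered query depends only on the interval bounds, re-running the same interval produces the same statement, which is what makes retries and backfills predictable.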

Integration with ethPandaOps Stack

CBT powers data transformation in the ethPandaOps infrastructure:

  • Xatu Data: Transforms raw Xatu network data into analytics tables
  • The Lab: Provides transformed data for visualization and analysis
  • Public Data: Powers the public datasets available to the community

Additional Features

Frontend UI

CBT includes a web-based frontend for:

  • Real-time visualization of transformation pipelines
  • Model dependency graphs
  • Transformation status monitoring
  • Interactive exploration of data models

REST API

Query model metadata and transformation state via the REST API:

  • List all models with filtering by type and database
  • Get detailed model information including dependencies
  • Query transformation status and progress
  • OpenAPI specification included

Resources

  • CBT API: Automatic REST API generator for ClickHouse databases managed with CBT
  • ClickHouse Proto Gen: Generate Protocol Buffer schemas from ClickHouse tables
  • Xatu: Collect Ethereum network data that can be transformed with CBT
  • The Lab: Visualize data transformed by CBT

Community

Need help or want to contribute?