CBT
CBT is a data transformation tool for ClickHouse. It provides fast, idempotent transformations with dependency management for building reliable data pipelines.
Overview
CBT simplifies building data transformation pipelines on ClickHouse by providing:
- Fast Transformations: Optimized for ClickHouse with native query execution
- Idempotent Operations: Safe to re-run transformations without side effects
- Dependency Management: Automatic validation of upstream data availability
- Position Tracking: Precise interval tracking for incremental transformations
- Gap Detection: Automatically identifies and backfills missing data intervals
- Scheduled Jobs: Support for both incremental and scheduled transformations
Key Features
Incremental Transformations
Process data in ordered intervals with precise position tracking:
- Maintains exact boundaries for every processed interval
- Supports gap detection and automatic backfilling
- Validates dependency availability before processing
- Well suited to event stream processing and time-series aggregations
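As a rough illustration of the gap detection described above, the sketch below finds the unprocessed ranges between recorded intervals. It is not CBT's implementation: the `find_gaps` name, the half-open `(start, end)` representation, and the in-memory list are assumptions made for the example.

```python
from typing import List, Tuple

# Assumption: an interval is a half-open position range [start, end).
Interval = Tuple[int, int]


def find_gaps(processed: List[Interval], lo: int, hi: int) -> List[Interval]:
    """Return the sub-ranges of [lo, hi) not covered by any processed interval."""
    gaps: List[Interval] = []
    cursor = lo
    for start, end in sorted(processed):
        if end <= cursor:
            continue  # interval ends before the cursor, nothing new covered
        if start >= hi:
            break  # interval starts past the range of interest
        if start > cursor:
            gaps.append((cursor, start))  # uncovered stretch before this interval
        cursor = max(cursor, end)
    if cursor < hi:
        gaps.append((cursor, hi))  # uncovered tail
    return gaps


processed = [(0, 100), (100, 200), (300, 400)]
print(find_gaps(processed, 0, 500))  # [(200, 300), (400, 500)]
```

Each returned range is a candidate for a backfill run.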
Scheduled Transformations
Execute transformations on a schedule without position tracking:
- Ideal for reference data updates (exchange rates, user lists)
- System health monitoring and report generation
- Database maintenance tasks
- Runs independently of data positions
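Schedules in the model example later in this document use expressions such as `@every 5m`. A minimal sketch of turning such an expression into a duration follows. `parse_every` is a hypothetical helper that handles only `s`, `m`, and `h` units; CBT's own scheduler may accept more forms.

```python
import re
from datetime import timedelta

# Assumption: only seconds, minutes and hours are supported in this sketch.
_UNITS = {"s": "seconds", "m": "minutes", "h": "hours"}


def parse_every(expr: str) -> timedelta:
    """Convert an '@every <duration>' expression into a timedelta."""
    prefix = "@every "
    if not expr.startswith(prefix):
        raise ValueError(f"unsupported schedule expression: {expr!r}")
    body = expr[len(prefix):].strip()
    if not re.fullmatch(r"(?:\d+[smh])+", body):
        raise ValueError(f"unsupported duration: {body!r}")
    total = timedelta()
    for amount, unit in re.findall(r"(\d+)([smh])", body):
        total += timedelta(**{_UNITS[unit]: int(amount)})
    return total


print(parse_every("@every 5m"))  # 0:05:00
```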
Multi-Instance Architecture
CBT runs as a unified binary handling both coordination and task execution:
- Multiple instances can run for high availability
- Automatic task deduplication via Redis-backed queue
- Tag-based worker filtering for specialized processing
- Shared ClickHouse and Redis infrastructure
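To make task deduplication and tag-based filtering concrete, here is a small in-memory sketch. It stands in for the Redis-backed queue and does not reflect CBT's actual data structures: the `Task` and `InMemoryQueue` names, the task key format, and the rule that a worker must hold every tag a task requires are all assumptions.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional, Set


@dataclass(frozen=True)
class Task:
    model: str
    start: int
    end: int
    tags: FrozenSet[str] = frozenset()

    @property
    def key(self) -> str:
        # Hypothetical uniqueness key: one task per model and interval.
        return f"{self.model}:{self.start}:{self.end}"


class InMemoryQueue:
    """Stand-in for a Redis-backed queue; a set plays the uniqueness check."""

    def __init__(self) -> None:
        self._seen: Set[str] = set()
        self._pending: List[Task] = []

    def enqueue(self, task: Task) -> bool:
        """Add the task unless an identical one was already enqueued."""
        if task.key in self._seen:
            return False
        self._seen.add(task.key)
        self._pending.append(task)
        return True

    def claim(self, worker_tags: Set[str]) -> Optional[Task]:
        """Hand out the first pending task whose tags the worker fully holds."""
        for i, task in enumerate(self._pending):
            if task.tags <= worker_tags:
                return self._pending.pop(i)
        return None


queue = InMemoryQueue()
task = Task("analytics.block_stats", 0, 3600, frozenset({"heavy"}))
print(queue.enqueue(task))       # True
print(queue.enqueue(task))       # False, duplicate is rejected
print(queue.claim({"light"}))    # None, worker lacks the "heavy" tag
print(queue.claim({"heavy"}) == task)  # True
```

With several instances sharing one queue, the uniqueness check is what keeps two coordinators from scheduling the same interval twice.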
Architecture
        ┌───────────────┐
        │      CBT      │
        └───────┬───────┘
                │
       ┌────────┴────────┐
       │                 │
       ▼                 ▼
┌──────────────┐  ┌──────────────┐
│    Redis     │  │  ClickHouse  │
│              │  │              │
│ • Task Queue │  │ • Data       │
│ • Scheduling │  │ • Admin      │
└──────────────┘  └──────────────┘
Use Cases
Data Pipeline Engineering
- Build complex transformation pipelines with dependency management
- Transform raw Ethereum data into analytics-ready tables
- Aggregate metrics across multiple data sources with automatic validation
Analytics Platform
- Power dashboards and reporting with transformed data
- Maintain materialized views and aggregation tables
- Ensure data consistency across dependent transformations
Real-time Processing
- Process streaming data in ordered intervals
- Handle late-arriving data with gap detection
- Maintain data freshness with scheduled updates
How It Works
Model Definition
Models are defined in YAML+SQL files that specify:
- Transformation type (incremental or scheduled)
- Dependencies on external data sources or other transformations
- Processing intervals and schedules
- SQL transformation logic or external script execution
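A model file of this shape has a YAML header between two `---` lines, followed by the SQL body. The sketch below shows one way to separate the two parts using only the standard library. `split_model` is a hypothetical helper, not part of CBT; the header string it returns could then be passed to a YAML parser such as PyYAML's `yaml.safe_load`.

```python
from typing import Tuple


def split_model(text: str) -> Tuple[str, str]:
    """Split a model file into (yaml_header, sql_body)."""
    lines = text.strip().splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("model file must start with a '---' line")
    # Find the second '---' line, which closes the header.
    close = None
    for i, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            close = i
            break
    if close is None:
        raise ValueError("missing closing '---' line")
    header = "\n".join(lines[1:close])
    sql = "\n".join(lines[close + 1:]).strip()
    return header, sql


model = """---
table: beacon_blocks
interval:
  type: slot
---
SELECT min(slot) as min, max(slot) as max
FROM ethereum.beacon_blocks
"""

header, sql = split_model(model)
print(header)
print(sql)
```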
External Models
Define source data boundaries:
---
table: beacon_blocks
interval:
  type: slot
---
SELECT
    min(slot) as min,
    max(slot) as max
FROM ethereum.beacon_blocks
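The `min` and `max` returned by an external model can be read as the range of positions currently available upstream. The sketch below shows one way a dependency check could use them before an interval is processed. The function name, the inclusive bounds, and the dictionary of bounds are assumptions for illustration, not CBT's API.

```python
from typing import Dict, List, Tuple

# Assumption: bounds are inclusive (min, max) positions per dependency.
Bounds = Tuple[int, int]


def unavailable_dependencies(
    start: int, end: int, dep_bounds: Dict[str, Bounds]
) -> List[str]:
    """Return the dependencies whose data does not fully cover [start, end]."""
    return [
        name
        for name, (lo, hi) in sorted(dep_bounds.items())
        if start < lo or end > hi
    ]


bounds = {"ethereum.beacon_blocks": (1000, 5000)}
print(unavailable_dependencies(1000, 2000, bounds))  # []
print(unavailable_dependencies(4000, 6000, bounds))  # ['ethereum.beacon_blocks']
```

An empty result means every dependency covers the interval, so the transformation can run.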
Transformation Models
Process data with dependency validation:
---
type: incremental
table: block_stats
interval:
  max: 3600
dependencies:
  - ethereum.beacon_blocks
schedules:
  forwardfill: "@every 5m"
  backfill: "@every 1h"
---
INSERT INTO analytics.block_stats
SELECT
    slot,
    COUNT(*) as block_count,
    AVG(gas_used) as avg_gas
FROM ethereum.beacon_blocks
WHERE slot BETWEEN {{ .bounds.start }} AND {{ .bounds.end }}
GROUP BY slot;
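The `{{ .bounds.start }}` and `{{ .bounds.end }}` placeholders are filled in for each interval before the query runs. The sketch below imitates that step in Python for simple dotted lookups only, and splits a range into pieces no larger than `interval.max`. Both `render` and `chunk` are hypothetical helpers, and reading `interval.max` as the largest interval size per run is an assumption.

```python
import re
from typing import Any, Dict, Iterator, Tuple

# Matches placeholders of the form {{ .a.b.c }}.
PLACEHOLDER = re.compile(r"\{\{\s*\.([A-Za-z_][\w.]*)\s*\}\}")


def render(sql: str, context: Dict[str, Any]) -> str:
    """Replace each dotted placeholder with its value from a nested dict."""
    def lookup(match: "re.Match[str]") -> str:
        value: Any = context
        for part in match.group(1).split("."):
            value = value[part]  # raises KeyError on unknown names
        return str(value)

    return PLACEHOLDER.sub(lookup, sql)


def chunk(start: int, end: int, max_size: int) -> Iterator[Tuple[int, int]]:
    """Yield consecutive (start, end) pieces no longer than max_size."""
    while start < end:
        stop = min(start + max_size, end)
        yield start, stop
        start = stop


where = "WHERE slot BETWEEN {{ .bounds.start }} AND {{ .bounds.end }}"
for lo, hi in chunk(0, 8000, 3600):
    print(render(where, {"bounds": {"start": lo, "end": hi}}))
```

Because each run works on a bounded range, re-running the same range produces the same rendered query, which is what makes idempotent re-processing practical.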
Integration with ethPandaOps Stack
CBT powers data transformation in the ethPandaOps infrastructure:
- Xatu Data: Transforms raw Xatu network data into analytics tables
- The Lab: Provides transformed data for visualization and analysis
- Public Data: Powers the public datasets available to the community
Additional Features
Frontend UI
CBT includes a web-based frontend for:
- Real-time visualization of transformation pipelines
- Model dependency graphs
- Transformation status monitoring
- Interactive exploration of data models
REST API
Query model metadata and transformation state via the REST API:
- List all models with filtering by type and database
- Get detailed model information including dependencies
- Query transformation status and progress
- OpenAPI specification included
Resources
Related Tools
- CBT API: Automatic REST API generator for ClickHouse databases managed with CBT
- ClickHouse Proto Gen: Generate Protocol Buffer schemas from ClickHouse tables
- Xatu: Collect Ethereum network data that can be transformed with CBT
- The Lab: Visualize data transformed by CBT
Community
Need help or want to contribute?
- Report issues on GitHub
- Join us on the Ethereum R&D Discord
- Check out related tools in the ethPandaOps ecosystem