# arc
**Repository Path**: jeffstone/arc
## Basic Information
- **Project Name**: arc
- **Description**: Time-series data warehouse built for speed. 2.42M records/sec on local NVMe. DuckDB + Parquet + Arrow + flexible storage (local/MinIO/S3). AGPL-3.0
- **Primary Language**: Unknown
- **License**: AGPL-3.0
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No
## Statistics
- **Stars**: 0
- **Forks**: 0
- **Created**: 2025-10-28
- **Last Updated**: 2025-10-28
## Categories & Tags
**Categories**: Uncategorized
**Tags**: None
## README
# Arc Core
High-performance time-series data warehouse built on DuckDB and Parquet with flexible storage options.
> **Alpha Release - Technical Preview**
> Arc Core is currently in active development and evolving rapidly. While the system is stable and functional, it is **not recommended for production workloads** at this time. We are continuously improving performance, adding features, and refining the API. Use in development and testing environments only.
## Features
- **High-Performance Ingestion**: MessagePack binary protocol (recommended), InfluxDB Line Protocol (drop-in replacement), JSON
- **VSCode Extension**: Full-featured database manager with query editor, notebooks, CSV import, and alerting - [Install Now](https://marketplace.visualstudio.com/items?itemName=basekick-labs.arc-db-manager)
- **Multi-Database Architecture**: Organize data by environment, tenant, or application with database namespaces - [Learn More](#multi-database-architecture)
- **Continuous Queries**: Manual downsampling and aggregation for materialized views (automatic scheduling in enterprise edition) - [Learn More](docs/CONTINUOUS_QUERIES.md)
- **Retention Policies**: Time-based data lifecycle management with manual execution (automatic scheduling in enterprise edition) - [Learn More](docs/RETENTION_POLICIES.md)
- **Write-Ahead Log (WAL)**: Optional durability feature for zero data loss (disabled by default) - [Learn More](docs/WAL.md)
- **Automatic File Compaction**: Merges small Parquet files into larger ones for 10-50x faster queries (enabled by default) - [Learn More](docs/COMPACTION.md)
- **Delete Operations**: GDPR-ready precise deletion with zero overhead on writes/queries using rewrite-based approach - [Learn More](docs/DELETE.md)
- **DuckDB Query Engine**: Fast analytical queries with SQL, cross-database joins, and advanced analytics
- **Flexible Storage Options**: Local filesystem (fastest), MinIO (distributed), AWS S3/R2 (cloud), or Google Cloud Storage
- **Data Import**: Import data from InfluxDB, TimescaleDB, HTTP endpoints
- **Query Caching**: Configurable result caching for improved performance
- **Apache Superset Integration**: Native dialect for BI dashboards with multi-database schema support
- **Docker Deployment**: Containerized deployment with health checks and monitoring
## Performance Benchmark
**Arc achieves 2.42M records/sec with columnar MessagePack format and authentication enabled!**
### Write Performance - Format Comparison
| Wire Format | Throughput | p50 Latency | p95 Latency | p99 Latency | Notes |
|-------------|------------|-------------|-------------|-------------|-------|
| **MessagePack Columnar** | **2.42M RPS** | **1.74ms** | **28.13ms** | **45.27ms** | Zero-copy passthrough + auth cache (RECOMMENDED) |
| **MessagePack Row** | **908K RPS** | **136.86ms** | **851.71ms** | **1542ms** | Legacy format with conversion overhead |
| **Line Protocol** | **240K RPS** | N/A | N/A | N/A | InfluxDB compatibility mode |
**Columnar Format Advantages:**
- **2.66x faster throughput** vs row format (2.42M vs 908K RPS)
- **78x lower p50 latency** (1.74ms vs 136.86ms)
- **30x lower p95 latency** (28.13ms vs 851.71ms)
- **34x lower p99 latency** (45.27ms vs 1542ms)
- **Near-zero authentication overhead** with 30s token cache
*Tested on Apple M3 Max (14 cores), native deployment, 400 workers*
*MessagePack columnar format with zero-copy Arrow passthrough*
### Authentication Performance
Arc includes built-in token-based authentication with minimal performance overhead thanks to intelligent caching:
| Configuration | Throughput | p50 Latency | p95 Latency | p99 Latency | Notes |
|--------------|-----------|-------------|-------------|-------------|-------|
| **Auth Disabled** | 2.42M RPS | 1.64ms | 27.27ms | 41.63ms | No security (not recommended) |
| **Auth + Cache (30s TTL)** | **2.42M RPS** | **1.74ms** | **28.13ms** | **45.27ms** | **Production recommended** |
| **Auth (no cache)** | 2.31M RPS | 6.36ms | 41.41ms | 63.31ms | 5ms SQLite lookup overhead |
**Key Insights:**
- **Token caching** eliminates the auth performance penalty (only +0.1ms p50 overhead vs. no auth)
- **30-second TTL** provides an excellent hit rate under sustained 2.4M RPS workloads
- **Security with speed**: Full authentication with near-zero performance impact
- **Configurable TTL**: Adjust cache duration via `AUTH_CACHE_TTL` (default: 30s)
**Cache Statistics:**
- **Hit rate**: 99.9%+ at sustained high throughput
- **Revocation delay**: Max 30 seconds (cache TTL)
- **Manual invalidation**: `POST /api/v1/auth/cache/invalidate` for immediate effect
- **Monitoring**: `GET /api/v1/auth/cache/stats` for cache performance metrics (example below)
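A minimal monitoring sketch in Python, using the two cache endpoints above (the exact JSON fields returned by the stats endpoint are not specified here, so the code just prints the raw payload):
```python
import os
import requests

BASE = "http://localhost:8000"
HEADERS = {"Authorization": f"Bearer {os.environ['ARC_TOKEN']}"}

# Inspect cache performance (hit rate, etc.) - prints the raw JSON payload
stats = requests.get(f"{BASE}/api/v1/auth/cache/stats", headers=HEADERS)
print(stats.json())

# Immediately revoke cached tokens, e.g. right after deleting a token,
# instead of waiting up to 30 seconds for the TTL to expire
resp = requests.post(f"{BASE}/api/v1/auth/cache/invalidate", headers=HEADERS)
print(f"Invalidate returned HTTP {resp.status_code}")
```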
### Storage Backend Performance
| Storage Backend | Throughput | Notes |
|----------------|------------|-------|
| **Local NVMe** | **2.42M RPS** | Direct filesystem (fastest) |
| **MinIO** | **~2.1M RPS** | S3-compatible object storage |
**Why is columnar format so much faster?**
1. **Zero conversion overhead** - No flatten tags/fields, no row→column conversion
2. **Better batching** - 1000 records in one columnar structure vs 1000 individual dicts
3. **Smaller wire payload** - Field names sent once instead of repeated per-record (see the sketch below)
4. **More efficient memory** - Arrays are more compact than list of dicts
5. **Less lock contention** - Fewer buffer operations per batch
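The payload-size point (3) is easy to verify client-side. This sketch packs the same three records both ways with `msgpack` (record shapes follow the ingestion examples later in this README); the gap widens as batches grow:
```python
import msgpack

# Row layout: field names repeat in every record
rows = {"batch": [
    {"m": "cpu", "t": 1700000000000 + i, "h": f"server{i:02d}",
     "fields": {"usage_idle": 95.0 - i, "usage_user": 3.2 + i}}
    for i in range(3)
]}

# Columnar layout: each field name appears exactly once
columnar = {"m": "cpu", "columns": {
    "time": [1700000000000 + i for i in range(3)],
    "host": [f"server{i:02d}" for i in range(3)],
    "usage_idle": [95.0 - i for i in range(3)],
    "usage_user": [3.2 + i for i in range(3)],
}}

print(len(msgpack.packb(rows)), "bytes (row)")
print(len(msgpack.packb(columnar)), "bytes (columnar)")
```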
**Optimal Configuration:**
- **Format:** MessagePack columnar (2.55x faster than row format)
- **Workers:** ~30x CPU cores for I/O-bound workloads (e.g., 14 cores = 400 workers)
- **Deployment:** Native mode (3.5x faster than Docker)
- **Storage:** Local filesystem for maximum performance, MinIO for distributed deployments
- **Protocol:** MessagePack binary columnar (`/api/v1/write/msgpack`)
- **Performance Stack:**
- `uvloop`: 2-4x faster event loop (Cython-based C implementation)
- `httptools`: 40% faster HTTP parser
- `orjson`: 20-50% faster JSON serialization (Rust + SIMD)
- **Optimizations:**
- Zero-copy columnar passthrough (no data transformation)
- Non-blocking flush operations (writes continue during I/O)
## Quick Start (Native - Recommended for Maximum Performance)
**Native deployment delivers 2.32M RPS vs 570K RPS in Docker (4.1x faster).**
```bash
# One-command start (auto-installs MinIO, auto-detects CPU cores)
./start.sh native
# Alternative: Manual setup
python3.11 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
cp .env.example .env
# Start MinIO natively (auto-configured by start.sh)
brew install minio/stable/minio minio/stable/mc # macOS
# OR download from https://min.io/download for Linux
# Start Arc (auto-detects optimal worker count: 3x CPU cores)
./start.sh native
```
Arc API will be available at `http://localhost:8000`
MinIO Console at `http://localhost:9001` (minioadmin/minioadmin)
## Quick Start (Docker)
```bash
# Start Arc Core with MinIO
docker-compose up -d
# Check status
docker-compose ps
# View logs
docker-compose logs -f arc-api
# Stop
docker-compose down
```
**Note:** Docker mode achieves ~570K RPS. For maximum performance (2.32M RPS with columnar format), use native deployment.
## Remote Deployment
Deploy Arc Core to a remote server:
```bash
# Docker deployment
./deploy.sh -h your-server.com -u ubuntu -m docker
# Native deployment
./deploy.sh -h your-server.com -u ubuntu -m native
```
## Configuration
Arc Core uses a centralized `arc.conf` configuration file (TOML format). This provides:
- Clean, organized configuration structure
- Environment variable overrides for Docker/production
- Production-ready defaults
- Comments and documentation inline
### Primary Configuration: arc.conf
Edit the `arc.conf` file for all settings:
```toml
# Server Configuration
[server]
host = "0.0.0.0"
port = 8000
workers = 8 # Adjust based on load: 4=light, 8=medium, 16=high
# Authentication
[auth]
enabled = true
default_token = "" # Leave empty to auto-generate
# Query Cache
[query_cache]
enabled = true
ttl_seconds = 60
# Storage Backend Configuration
[storage]
backend = "local" # Options: local, minio, s3, gcs, ceph
# Option 1: Local Filesystem (fastest, single-node)
[storage.local]
base_path = "./data/arc" # Or "/mnt/nvme/arc-data" for dedicated storage
database = "default"
# Option 2: MinIO (recommended for distributed deployments)
# [storage]
# backend = "minio"
# [storage.minio]
# endpoint = "http://minio:9000"
# access_key = "minioadmin"
# secret_key = "minioadmin123"
# bucket = "arc"
# database = "default"
# use_ssl = false
# Option 3: AWS S3 / Cloudflare R2
# [storage]
# backend = "s3"
# [storage.s3]
# bucket = "arc-data"
# database = "default"
# region = "us-east-1"
# access_key = "YOUR_ACCESS_KEY"
# secret_key = "YOUR_SECRET_KEY"
# Option 4: Google Cloud Storage
# [storage]
# backend = "gcs"
# [storage.gcs]
# bucket = "arc-data"
# database = "default"
# project_id = "my-project"
# credentials_file = "/path/to/service-account.json"
```
**Configuration Priority** (highest to lowest, illustrated below):
1. Environment variables (e.g., `ARC_WORKERS=16`)
2. `arc.conf` file
3. Built-in defaults
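As an illustration of that precedence (this is a sketch, not Arc's actual loader), a resolver can check the environment first, then `arc.conf`, then a default, using Python 3.11's stdlib `tomllib`:
```python
import os
import tomllib

DEFAULTS = {"server": {"workers": 8}}

with open("arc.conf", "rb") as f:
    conf = tomllib.load(f)

def setting(section: str, key: str, env_var: str):
    """Environment variable wins, then arc.conf, then the built-in default."""
    if env_var in os.environ:
        return os.environ[env_var]
    return conf.get(section, {}).get(key, DEFAULTS[section][key])

print("Effective workers:", int(setting("server", "workers", "ARC_WORKERS")))
```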
### Storage Backend Selection Guide
| Backend | Performance | Use Case | Pros | Cons |
|---------|-------------|----------|------|------|
| **Local** | Fastest (2.32M RPS) | Single-node, development, edge | Direct I/O, no overhead, simple setup | No distribution, single point of failure |
| **MinIO** | Fast (~2.0M RPS) | Distributed, multi-tenant | S3-compatible, scalable, cost-effective | Requires MinIO service, slight overhead |
| **AWS S3** | Cloud-native | Production, unlimited scale | Fully managed, 99.999999999% durability | Network latency, costs |
| **GCS** | Cloud-native | Google Cloud deployments | Integrated with GCP, global CDN | Network latency, costs |
**Recommendation:**
- **Development/Testing**: Local filesystem (`backend = "local"`)
- **Production (single-node)**: Local filesystem with NVMe storage
- **Production (distributed)**: MinIO or AWS S3/R2
- **Cloud deployments**: AWS S3, Cloudflare R2, or Google Cloud Storage
### Environment Variable Overrides
You can override any setting via environment variables:
```bash
# Server
ARC_HOST=0.0.0.0
ARC_PORT=8000
ARC_WORKERS=8
# Storage - Local Filesystem
STORAGE_BACKEND=local
STORAGE_LOCAL_BASE_PATH=/data/arc
STORAGE_LOCAL_DATABASE=default
# Storage - MinIO (alternative)
# STORAGE_BACKEND=minio
# MINIO_ENDPOINT=minio:9000
# MINIO_ACCESS_KEY=minioadmin
# MINIO_SECRET_KEY=minioadmin123
# MINIO_BUCKET=arc
# Cache
QUERY_CACHE_ENABLED=true
QUERY_CACHE_TTL=60
# Logging
LOG_LEVEL=INFO
```
**Legacy Support**: `.env` files are still supported for backward compatibility, but `arc.conf` is recommended.
## Getting Started
### VSCode Extension - The Easiest Way to Get Started
**Arc Database Manager** for VS Code provides a complete development toolkit with visual database exploration, query execution, and data management - no command line required!
#### Key Features:
- **Visual Connection Management** - Connect to multiple Arc servers with saved connections
- **SQL Query Editor** - IntelliSense auto-completion for tables, columns, and DuckDB functions
- **Arc Notebooks** - Mix SQL and Markdown in `.arcnb` files with parameterized queries
- **Schema Explorer** - Browse databases and tables with right-click context menus
- **CSV Import Wizard** - Import CSV files with auto-detection and batch processing
- **Alerting System** - Create alerts with desktop notifications
- **Auto-Visualizations** - Automatic chart generation for time-series data
- **Query History** - Automatic logging of all queries with saved favorites
- **Dark Mode** - Automatic theme adaptation
#### Quick Install:
1. Open VS Code
2. Search for **"Arc Database Manager"** in Extensions marketplace
3. Click Install
4. Connect to your Arc server and start querying!
**→ [Install from VS Code Marketplace](https://marketplace.visualstudio.com/items?itemName=basekick-labs.arc-db-manager)**
**→ [View Extension Documentation](https://github.com/basekick-labs/vscode-extension)**
---
### 1. Get Your Admin Token
After starting Arc Core, create an admin token for API access:
```bash
# Docker deployment
docker exec -it arc-api python3 -c "
from api.auth import AuthManager
auth = AuthManager(db_path='/data/arc.db')
token = auth.create_token('my-admin', description='Admin token')
print(f'Admin Token: {token}')
"
# Native deployment
cd /path/to/arc-core
source venv/bin/activate
python3 -c "
from api.auth import AuthManager
auth = AuthManager(db_path='./data/arc.db')
token = auth.create_token('my-admin', description='Admin token')
print(f'Admin Token: {token}')
"
```
Save this token - you'll need it for all API requests.
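A quick sanity check before wiring the token into clients - the verification endpoint from the API reference below returns 200 for a valid token:
```python
import os
import requests

resp = requests.get(
    "http://localhost:8000/api/v1/auth/verify",
    headers={"Authorization": f"Bearer {os.environ['ARC_TOKEN']}"},
)
print(resp.status_code, resp.text)  # expect 200 for a valid token
```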
### 2. API Endpoints
All endpoints require authentication via Bearer token:
```bash
# Set your token
export ARC_TOKEN="your-token-here"
```
#### Health Check
```bash
curl http://localhost:8000/health
```
#### Ingest Data (MessagePack - Columnar Format RECOMMENDED)
**Columnar MessagePack format is 2.55x faster than row format** with zero-copy passthrough to Arrow:
```python
import msgpack
import requests
from datetime import datetime
import os

# Get or create API token
token = os.getenv("ARC_TOKEN")
if not token:
    from api.auth import AuthManager
    auth = AuthManager(db_path='./data/arc.db')
    token = auth.create_token(name='my-app', description='My application')
    print(f"Created token: {token}")
    print(f"Save it: export ARC_TOKEN='{token}'")

# COLUMNAR FORMAT (RECOMMENDED - 2.55x faster)
# All data organized as columns (arrays), not rows
data = {
    "m": "cpu",              # measurement name
    "columns": {             # columnar data structure
        "time": [
            int(datetime.now().timestamp() * 1000),
            int(datetime.now().timestamp() * 1000) + 1000,
            int(datetime.now().timestamp() * 1000) + 2000
        ],
        "host": ["server01", "server02", "server03"],
        "region": ["us-east", "us-west", "eu-central"],
        "datacenter": ["aws", "gcp", "azure"],
        "usage_idle": [95.0, 85.0, 92.0],
        "usage_user": [3.2, 10.5, 5.8],
        "usage_system": [1.8, 4.5, 2.2]
    }
}

# Send columnar data (2.32M RPS throughput)
response = requests.post(
    "http://localhost:8000/api/v1/write/msgpack",
    headers={
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/msgpack",
        "x-arc-database": "default"  # Optional: specify database
    },
    data=msgpack.packb(data)
)

# Check response (returns 204 No Content on success)
if response.status_code == 204:
    print(f"Successfully wrote {len(data['columns']['time'])} records!")
else:
    print(f"Error {response.status_code}: {response.text}")
```
**High-throughput batch ingestion** (columnar format - 2.32M RPS):
```python
# Generate 10,000 records in columnar format
num_records = 10000
base_time = int(datetime.now().timestamp() * 1000)

data = {
    "m": "sensor_data",
    "columns": {
        "time": [base_time + i for i in range(num_records)],
        "sensor_id": [f"sensor_{i % 100}" for i in range(num_records)],
        "location": [f"zone_{i % 10}" for i in range(num_records)],
        "type": ["temperature"] * num_records,
        "temperature": [20 + (i % 10) for i in range(num_records)],
        "humidity": [60 + (i % 20) for i in range(num_records)],
        "pressure": [1013 + (i % 5) for i in range(num_records)]
    }
}

response = requests.post(
    "http://localhost:8000/api/v1/write/msgpack",
    headers={
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/msgpack"
    },
    data=msgpack.packb(data)
)

if response.status_code == 204:
    print("Wrote 10,000 records successfully!")
```
#### Row Format (Legacy - 2.55x slower, kept for compatibility)
**Only use row format if you cannot generate columnar data client-side:**
```python
# ROW FORMAT (LEGACY - 908K RPS, much slower)
# Each record is a separate dictionary
data = {
    "batch": [
        {
            "m": "cpu",
            "t": int(datetime.now().timestamp() * 1000),
            "h": "server01",
            "tags": {
                "region": "us-east",
                "dc": "aws"
            },
            "fields": {
                "usage_idle": 95.0,
                "usage_user": 3.2,
                "usage_system": 1.8
            }
        },
        {
            "m": "cpu",
            "t": int(datetime.now().timestamp() * 1000),
            "h": "server02",
            "tags": {
                "region": "us-west",
                "dc": "gcp"
            },
            "fields": {
                "usage_idle": 85.0,
                "usage_user": 10.5,
                "usage_system": 4.5
            }
        }
    ]
}

response = requests.post(
    "http://localhost:8000/api/v1/write/msgpack",
    headers={
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/msgpack"
    },
    data=msgpack.packb(data)
)
```
**Performance Warning**: Row format has 20-26x higher latency and 2.55x lower throughput than columnar format. Use columnar format whenever possible.
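If your producers already emit row-shaped records, you can pivot them to the columnar layout client-side before sending. A sketch (it assumes every row in the batch carries the same tags and fields; ragged batches would need null padding, which is not handled here):
```python
from collections import defaultdict

def rows_to_columnar(measurement: str, batch: list[dict]) -> dict:
    """Pivot legacy row dicts (shape as in the example above) into columns."""
    columns: dict[str, list] = defaultdict(list)
    for row in batch:
        columns["time"].append(row["t"])
        columns["host"].append(row["h"])
        for key, value in row.get("tags", {}).items():
            columns[key].append(value)
        for key, value in row.get("fields", {}).items():
            columns[key].append(value)
    return {"m": measurement, "columns": dict(columns)}

payload = rows_to_columnar("cpu", data["batch"])  # `data` from the row example
```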
#### Ingest Data (Line Protocol - InfluxDB Compatibility)
**For drop-in replacement of InfluxDB** - compatible with Telegraf and InfluxDB clients:
```bash
# InfluxDB 1.x compatible endpoint
curl -X POST "http://localhost:8000/api/v1/write?db=mydb" \
-H "Authorization: Bearer $ARC_TOKEN" \
-H "Content-Type: text/plain" \
--data-binary "cpu,host=server01 value=0.64 1633024800000000000"
# Multiple measurements
curl -X POST "http://localhost:8000/api/v1/write?db=metrics" \
-H "Authorization: Bearer $ARC_TOKEN" \
-H "Content-Type: text/plain" \
--data-binary "cpu,host=server01,region=us-west value=0.64 1633024800000000000
memory,host=server01,region=us-west used=8.2,total=16.0 1633024800000000000
disk,host=server01,region=us-west used=120.5,total=500.0 1633024800000000000"
```
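The same endpoint works from any HTTP client. For example, the equivalent of the curl calls above in Python:
```python
import os
import requests

lines = "\n".join([
    "cpu,host=server01,region=us-west value=0.64 1633024800000000000",
    "memory,host=server01,region=us-west used=8.2,total=16.0 1633024800000000000",
])

resp = requests.post(
    "http://localhost:8000/api/v1/write",
    params={"db": "metrics"},
    headers={
        "Authorization": f"Bearer {os.environ['ARC_TOKEN']}",
        "Content-Type": "text/plain",
    },
    data=lines,
)
resp.raise_for_status()
```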
**Telegraf configuration** (drop-in InfluxDB replacement):
```toml
[[outputs.influxdb]]
urls = ["http://localhost:8000"]
database = "telegraf"
skip_database_creation = true
# Authentication
username = "" # Leave empty
password = "$ARC_TOKEN" # Use your Arc token as password
# Or use HTTP headers
[outputs.influxdb.headers]
Authorization = "Bearer $ARC_TOKEN"
```
#### Query Data
**Basic query** (Python):
```python
import requests
import os

token = os.getenv("ARC_TOKEN")  # Your API token

# Simple query
response = requests.post(
    "http://localhost:8000/api/v1/query",
    headers={
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json"
    },
    json={
        "sql": "SELECT * FROM cpu WHERE host = 'server01' ORDER BY time DESC LIMIT 10",
        "format": "json"
    }
)

data = response.json()
print(f"Rows: {len(data['data'])}")
for row in data['data']:
    print(row)
```
**Using curl**:
```bash
curl -X POST http://localhost:8000/api/v1/query \
-H "Authorization: Bearer $ARC_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"sql": "SELECT * FROM cpu WHERE host = '\''server01'\'' LIMIT 10",
"format": "json"
}'
```
**Advanced queries with DuckDB SQL**:
```python
# Time-series aggregation
response = requests.post(
    "http://localhost:8000/api/v1/query",
    headers={"Authorization": f"Bearer {token}"},
    json={
        "sql": """
            SELECT
                time_bucket(INTERVAL '5 minutes', time) as bucket,
                host,
                AVG(usage_idle) as avg_idle,
                MAX(usage_user) as max_user
            FROM cpu
            WHERE time > now() - INTERVAL '1 hour'
            GROUP BY bucket, host
            ORDER BY bucket DESC
        """,
        "format": "json"
    }
)

# Window functions
response = requests.post(
    "http://localhost:8000/api/v1/query",
    headers={"Authorization": f"Bearer {token}"},
    json={
        "sql": """
            SELECT
                timestamp,
                host,
                usage_idle,
                AVG(usage_idle) OVER (
                    PARTITION BY host
                    ORDER BY timestamp
                    ROWS BETWEEN 5 PRECEDING AND CURRENT ROW
                ) as moving_avg
            FROM cpu
            ORDER BY timestamp DESC
            LIMIT 100
        """,
        "format": "json"
    }
)

# Join multiple measurements
response = requests.post(
    "http://localhost:8000/api/v1/query",
    headers={"Authorization": f"Bearer {token}"},
    json={
        "sql": """
            SELECT
                c.timestamp,
                c.host,
                c.usage_idle as cpu_idle,
                m.used_percent as mem_used
            FROM cpu c
            JOIN mem m ON c.timestamp = m.timestamp AND c.host = m.host
            WHERE c.timestamp > now() - INTERVAL '10 minutes'
            ORDER BY c.timestamp DESC
        """,
        "format": "json"
    }
)
```
### Apache Arrow Columnar Queries
Arc supports Apache Arrow format for zero-copy columnar data transfer, ideal for analytics workloads and data pipelines.
**Performance Benefits:**
- **7.36x faster** for large result sets (100K+ rows)
- **43% smaller payloads** compared to JSON
- **Zero-copy** for Pandas, Polars, and other Arrow-compatible tools
- **Columnar format** stays efficient from Parquet → DuckDB → Arrow → client
**Python Example with Pandas:**
```python
import requests
import pyarrow as pa
import pandas as pd

# Execute query and get Arrow format
response = requests.post(
    "http://localhost:8000/api/v1/query/arrow",
    headers={"Authorization": f"Bearer {token}"},
    json={
        "sql": """
            SELECT
                time_bucket(INTERVAL '1 hour', time) as hour,
                host,
                AVG(usage_idle) as avg_cpu_idle,
                COUNT(*) as sample_count
            FROM cpu
            WHERE time > now() - INTERVAL '24 hours'
            GROUP BY hour, host
            ORDER BY hour DESC
        """
    }
)

# Parse Arrow IPC stream
reader = pa.ipc.open_stream(response.content)
arrow_table = reader.read_all()

# Convert to Pandas DataFrame (zero-copy)
df = arrow_table.to_pandas()
print(f"Retrieved {len(df)} rows")
print(df.head())
```
**Polars Example (even faster):**
```python
import requests
import pyarrow as pa
import polars as pl

response = requests.post(
    "http://localhost:8000/api/v1/query/arrow",
    headers={"Authorization": f"Bearer {token}"},
    json={"sql": "SELECT * FROM cpu WHERE host = 'server01' LIMIT 100000"}
)

# Parse Arrow and convert to Polars (zero-copy)
reader = pa.ipc.open_stream(response.content)
arrow_table = reader.read_all()
df = pl.from_arrow(arrow_table)
print(df.describe())
```
**When to use Arrow format:**
- Large result sets (10K+ rows)
- Wide tables with many columns
- Data pipelines feeding into Pandas/Polars
- Analytics notebooks and dashboards
- ETL processes requiring columnar data
**When to use JSON format:**
- Small result sets (<1K rows)
- Simple API integrations
- Web dashboards
- Quick debugging and testing
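One pragmatic pattern is to choose the endpoint from the expected result size. A sketch using the two documented query endpoints (the threshold is the rule of thumb from the lists above, not an Arc constant):
```python
import pyarrow as pa
import requests

BASE = "http://localhost:8000/api/v1/query"

def query(sql: str, token: str, expect_large: bool = False):
    """Small results as JSON rows; large results as a zero-copy Arrow table."""
    headers = {"Authorization": f"Bearer {token}"}
    if expect_large:
        resp = requests.post(f"{BASE}/arrow", headers=headers, json={"sql": sql})
        return pa.ipc.open_stream(resp.content).read_all()
    resp = requests.post(BASE, headers=headers, json={"sql": sql, "format": "json"})
    return resp.json()["data"]
```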
## Multi-Database Architecture
Arc supports multiple databases (namespaces) within a single instance, allowing you to organize and isolate data by environment, tenant, or application.
### Storage Structure
Data is organized as: `{bucket}/{database}/{measurement}/{year}/{month}/{day}/{hour}/file.parquet`
```
arc/ # MinIO bucket
├── default/ # Default database
│ ├── cpu/2025/01/15/14/ # CPU metrics
│ ├── mem/2025/01/15/14/ # Memory metrics
│ └── disk/2025/01/15/14/ # Disk metrics
├── production/ # Production database
│ ├── cpu/2025/01/15/14/
│ └── mem/2025/01/15/14/
└── staging/ # Staging database
├── cpu/2025/01/15/14/
└── mem/2025/01/15/14/
```
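A small sketch of how a record's timestamp maps onto this layout (the partition prefix is what the template above defines; the naming of the Parquet files inside each hour directory is Arc's own and is not modeled here):
```python
from datetime import datetime, timezone

def partition_prefix(bucket: str, database: str, measurement: str, ts_ms: int) -> str:
    """Build the {bucket}/{database}/{measurement}/{Y}/{m}/{d}/{H}/ prefix."""
    t = datetime.fromtimestamp(ts_ms / 1000, tz=timezone.utc)
    return f"{bucket}/{database}/{measurement}/{t:%Y/%m/%d/%H}/"

print(partition_prefix("arc", "production", "cpu", 1736951400000))
# arc/production/cpu/2025/01/15/14/
```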
### Configuration
Configure the database in `arc.conf`:
```toml
[storage.minio]
endpoint = "http://localhost:9000"
access_key = "minioadmin"
secret_key = "minioadmin"
bucket = "arc"
database = "default" # Database namespace
```
Or via environment variable:
```bash
export MINIO_DATABASE="production"
```
### Writing to Specific Databases
**MessagePack Protocol (Columnar - Recommended):**
```python
import msgpack
import requests
from datetime import datetime

token = "your-token-here"

# Columnar format (2.55x faster)
data = {
    "m": "cpu",
    "columns": {
        "time": [int(datetime.now().timestamp() * 1000)],
        "host": ["server01"],
        "usage_idle": [95.0],
        "usage_user": [3.2],
        "usage_system": [1.8]
    }
}

# Write to production database
response = requests.post(
    "http://localhost:8000/api/v1/write/msgpack",
    headers={
        "x-api-key": token,
        "Content-Type": "application/msgpack",
        "x-arc-database": "production"  # Specify database
    },
    data=msgpack.packb(data)
)

# Write to staging database
response = requests.post(
    "http://localhost:8000/api/v1/write/msgpack",
    headers={
        "x-api-key": token,
        "Content-Type": "application/msgpack",
        "x-arc-database": "staging"  # Different database
    },
    data=msgpack.packb(data)
)
```
**Line Protocol:**
```bash
# Write to default database (uses configured database)
curl -X POST http://localhost:8000/api/v1/write/line-protocol \
-H "x-api-key: $ARC_TOKEN" \
-d 'cpu,host=server01 usage_idle=95.0'
# Write to specific database
curl -X POST http://localhost:8000/api/v1/write/line-protocol \
-H "x-api-key: $ARC_TOKEN" \
-H "x-arc-database: production" \
-d 'cpu,host=server01 usage_idle=95.0'
```
### Querying Across Databases
**Show Available Databases:**
```sql
SHOW DATABASES;
-- Output:
-- default
-- production
-- staging
```
**Show Tables in Current Database:**
```sql
SHOW TABLES;
-- Output:
-- database | table_name | storage_path | file_count | total_size_mb
-- default | cpu | s3://arc/default/cpu/ | 150 | 75.2
-- default | mem | s3://arc/default/mem/ | 120 | 52.1
```
**Query Specific Database:**
```python
# Query production database
response = requests.post(
    "http://localhost:8000/api/v1/query",
    headers={"Authorization": f"Bearer {token}"},
    json={
        "sql": "SELECT * FROM production.cpu WHERE timestamp > NOW() - INTERVAL 1 HOUR",
        "format": "json"
    }
)

# Query default database (no prefix needed)
response = requests.post(
    "http://localhost:8000/api/v1/query",
    headers={"Authorization": f"Bearer {token}"},
    json={
        "sql": "SELECT * FROM cpu WHERE timestamp > NOW() - INTERVAL 1 HOUR",
        "format": "json"
    }
)
```
**Cross-Database Queries:**
```python
# Compare production vs staging metrics
response = requests.post(
    "http://localhost:8000/api/v1/query",
    headers={"Authorization": f"Bearer {token}"},
    json={
        "sql": """
            SELECT
                p.timestamp,
                p.host,
                p.usage_idle as prod_cpu,
                s.usage_idle as staging_cpu,
                (p.usage_idle - s.usage_idle) as diff
            FROM production.cpu p
            JOIN staging.cpu s
                ON p.timestamp = s.timestamp
                AND p.host = s.host
            WHERE p.timestamp > NOW() - INTERVAL 1 HOUR
            ORDER BY p.timestamp DESC
            LIMIT 100
        """,
        "format": "json"
    }
)
```
### Use Cases
**Environment Separation:**
```toml
# Production instance
database = "production"
# Staging instance
database = "staging"
# Development instance
database = "dev"
```
**Multi-Tenant Architecture:**
```python
# Write tenant-specific data
headers = {
    "x-api-key": token,
    "x-arc-database": f"tenant_{tenant_id}"
}
```
**Data Lifecycle Management:**
```python
# Hot data (frequent queries)
database = "hot"
# Warm data (occasional queries)
database = "warm"
# Cold data (archival)
database = "cold"
```
### Apache Superset Integration
In Superset, Arc databases appear as **schemas**:
1. Install the Arc Superset dialect:
```bash
pip install arc-superset-dialect
```
2. Connect to Arc:
```
arc://your-token@localhost:8000/default
```
3. View databases as schemas in the Superset UI:
```
Schema: default
├── cpu
├── mem
└── disk
Schema: production
├── cpu
└── mem
Schema: staging
├── cpu
└── mem
```
For more details, see the [Multi-Database Migration Plan](DATABASE_MIGRATION_PLAN.md).
## Write-Ahead Log (WAL) - Durability Feature
Arc includes an optional Write-Ahead Log (WAL) for applications requiring **zero data loss** on system crashes. WAL is **disabled by default** to maximize throughput.
### When to Enable WAL
Enable WAL if you need:
- **Zero data loss** on crashes
- **Regulatory compliance** (finance, healthcare)
- **Guaranteed durability** for critical data
Keep WAL disabled if you:
- **Prioritize maximum throughput** (2.01M records/sec)
- **Can tolerate 0-5 seconds data loss** on rare crashes
- **Have upstream retry logic** (Kafka, message queues)
### Performance Impact
| Configuration | Throughput | Data Loss Risk |
|--------------|-----------|----------------|
| **WAL Disabled (default)** | 2.01M rec/s | 0-5 seconds |
| **WAL async** | 1.67M rec/s (-17%) | <1 second |
| **WAL fdatasync** | 1.63M rec/s (-19%) | Near-zero |
| **WAL fsync** | 1.67M rec/s (-17%) | Zero |
### Enable WAL
Edit `.env` file:
```bash
# Enable Write-Ahead Log for durability
WAL_ENABLED=true
WAL_SYNC_MODE=fdatasync # Recommended: balanced mode
WAL_DIR=./data/wal
WAL_MAX_SIZE_MB=100
WAL_MAX_AGE_SECONDS=3600
```
### Monitor WAL
Check WAL status via API:
```bash
# Get WAL status
curl http://localhost:8000/api/v1/wal/status

# Get detailed statistics
curl http://localhost:8000/api/v1/wal/stats

# List WAL files
curl http://localhost:8000/api/v1/wal/files

# Health check
curl http://localhost:8000/api/v1/wal/health

# Cleanup old recovered files
curl -X POST http://localhost:8000/api/v1/wal/cleanup
```
**For complete WAL documentation, see [docs/WAL.md](docs/WAL.md)**
## File Compaction - Query Optimization
Arc automatically **compacts small Parquet files into larger ones** to dramatically improve query performance. During high-throughput ingestion, Arc creates many small files (50-100MB). Compaction merges these into optimized 512MB files, reducing file count by 100x and improving query speed by 10-50x.
### Why Compaction Matters
**The Small File Problem:**
- High-throughput ingestion creates 100+ small files per hour
- DuckDB must open every file for queries → slow query performance
- Example: 1000 files × 5ms open time = 5 seconds just to start querying
**After Compaction:**
- **2,704 files → 3 files (901x reduction)** - Real production test results
- **80.4% size reduction** (3.7 GB → 724 MB with ZSTD)
- Query time: 5 seconds → 0.05 seconds (100x faster)
- Better compression (ZSTD vs Snappy during writes)
- Improved DuckDB parallel scanning
### How It Works
Compaction runs automatically on a schedule (default: every hour at :05):
1. **Scans** for completed hourly partitions (e.g., `2025/10/08/14/`)
2. **Locks** partition to prevent concurrent compaction
3. **Downloads** all small files for that partition
4. **Merges** using DuckDB into optimized 512MB files (sketched after this list)
5. **Uploads** compacted files with `.compacted` suffix
6. **Deletes** old small files from storage
7. **Cleanup** temp files and releases lock
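The merge in step 4 can be reproduced with plain DuckDB. A simplified sketch of a single partition's compaction pass, ignoring locking, the 512MB size targeting, and the upload/delete bookkeeping (paths are illustrative; `COMPRESSION_LEVEL` requires a reasonably recent DuckDB):
```python
import duckdb

# One completed hourly partition, local-filesystem backend assumed
partition = "./data/arc/default/cpu/2025/10/08/14"

# Read every small Parquet file in the partition and rewrite as one ZSTD file
# (written outside the partition first, as the upload-then-delete flow implies)
duckdb.sql(f"""
    COPY (SELECT * FROM read_parquet('{partition}/*.parquet'))
    TO './tmp/2025-10-08-14.compacted.parquet'
    (FORMAT PARQUET, COMPRESSION ZSTD, COMPRESSION_LEVEL 3)
""")
```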
### Configuration
Compaction is **enabled by default** in [arc.conf](arc.conf):
```toml
[compaction]
enabled = true
min_age_hours = 1 # Wait 1 hour before compacting (let hour complete)
min_files = 10 # Only compact if ≥10 files exist
target_file_size_mb = 512 # Target size for compacted files
schedule = "5 * * * *" # Cron schedule: every hour at :05
max_concurrent_jobs = 2 # Run 2 compactions in parallel
compression = "zstd" # Better compression than snappy
compression_level = 3 # Balance compression vs speed
```
### Monitoring Compaction
Check compaction status via API:
```bash
# Get current status
curl http://localhost:8000/api/v1/compaction/status

# Get detailed statistics
curl http://localhost:8000/api/v1/compaction/stats

# List eligible partitions
curl http://localhost:8000/api/v1/compaction/candidates

# Manually trigger compaction
curl -X POST http://localhost:8000/api/v1/compaction/trigger

# View active jobs
curl http://localhost:8000/api/v1/compaction/jobs

# View job history
curl http://localhost:8000/api/v1/compaction/history
```
### Reducing File Count at Source
**Best practice**: Cut the number of files generated at the source by letting the ingestion buffer accumulate more records before each file is written:
```toml
[ingestion]
buffer_size = 200000 # Up from 50,000 (4x fewer files)
buffer_age_seconds = 10 # Up from 5 (2x fewer files)
```
**Impact**:
- **Files generated**: 2,000/hour → 250/hour (8x reduction)
- **Compaction time**: 150s → 20s (7x faster)
- **Memory usage**: +300MB per worker (~12GB total on 42 workers)
- **Query freshness**: 5s → 10s delay
This is the **most effective optimization** - fewer files means faster compaction AND faster queries.
### When to Disable Compaction
Compaction should remain enabled for production, but you might disable it:
- **Testing**: When you want to see raw ingestion files
- **Low write volume**: If you write <10 files per hour
- **Development**: When iterating on ingestion code
To disable, edit [arc.conf](arc.conf):
```toml
[compaction]
enabled = false
```
**For complete compaction documentation, see [docs/COMPACTION.md](docs/COMPACTION.md)**
## Architecture Overview
Arc's architecture is optimized for high-throughput time-series ingestion with **MessagePack columnar format** as the recommended ingestion path, delivering 2.32M records/sec with zero-copy passthrough to Parquet.
```
┌─────────────────────────────────────────────────────────────┐
│ Client Applications │
│ (Python, Go, JavaScript, Telegraf, curl, etc.) │
└──────────────────┬──────────────────────────────────────────┘
│
│ HTTP/HTTPS
▼
┌─────────────────────────────────────────────────────────────┐
│ Arc API Layer (FastAPI) │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────────┐ │
│ │ MessagePack │ │ Line Protocol│ │ Query Engine │ │
│ │Columnar (REC)│ │ (Legacy) │ │ (DuckDB) │ │
│ │ 2.32M RPS │ │ 240K RPS │ │ │ │
│ └──────────────┘ └──────────────┘ └──────────────────┘ │
└──────────────────┬──────────────────────────────────────────┘
│
│ Write Pipeline
▼
┌─────────────────────────────────────────────────────────────┐
│ Buffering & Processing Layer │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ ArrowParquetBuffer (MessagePack Columnar) │ │
│ │ RECOMMENDED - Zero-copy passthrough │ │
│ │ - Client sends columnar data │ │
│ │ - Direct PyArrow RecordBatch → Parquet │ │
│ │ - No row→column conversion (2.55x faster) │ │
│ │ - Minimal memory overhead │ │
│ │ - Throughput: 2.32M RPS │ │
│ └──────────────────────────────────────────────────────┘ │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ ParquetBuffer (Line Protocol / MessagePack Row) │ │
│ │ LEGACY - For compatibility │ │
│ │ - Flattens tags/fields │ │
│ │ - Row→column conversion │ │
│ │ - Polars DataFrame → Parquet │ │
│ │ - Throughput: 240K-908K RPS │ │
│ └──────────────────────────────────────────────────────┘ │
└──────────────────┬──────────────────────────────────────────┘
│
│ Parquet Files (columnar format)
▼
┌─────────────────────────────────────────────────────────────┐
│ Storage Backend (Pluggable) │
│ ┌────────────────────────────────────────────────────────┐ │
│ │ Local NVMe (Fastest - 2.32M RPS) │ │
│ │ • Direct I/O, minimal overhead │ │
│ │ • Best for single-node, development, edge │ │
│ └────────────────────────────────────────────────────────┘ │
│ ┌────────────────────────────────────────────────────────┐ │
│ │ MinIO (Recommended for Production - ~2.0M RPS) │ │
│ │ • S3-compatible, distributed, scalable │ │
│ │ • High availability, erasure coding │ │
│ │ • Multi-tenant, object versioning │ │
│ └────────────────────────────────────────────────────────┘ │
│ │
│ Alternative backends: AWS S3/R2, Google Cloud Storage │
└─────────────────────────────────────────────────────────────┘
│
│ Query Path (Direct Parquet reads)
▼
┌─────────────────────────────────────────────────────────────┐
│ Query Engine (DuckDB) │
│ - Direct Parquet reads from object storage │
│ - Columnar execution engine │
│ - Query cache for common queries │
│ - Full SQL interface (Postgres-compatible) │
│ - Zero-copy aggregations on columnar data │
└─────────────────────────────────────────────────────────────┘
```
### Ingestion Flow (Columnar Format - Recommended)
1. **Client generates columnar data**: `{m: "cpu", columns: {time: [...], host: [...], val: [...]}}`
2. **MessagePack serialization**: Binary encoding (10-30% smaller than JSON)
3. **Arc receives columnar batch**: No parsing overhead, validates array lengths
4. **Zero-copy passthrough**: Direct PyArrow RecordBatch creation
5. **Buffering**: In-memory columnar batches (minimal overhead)
6. **Parquet writes**: Direct columnar → Parquet (no conversion; sketched below)
7. **Storage**: Write to local NVMe or MinIO (2.32M RPS sustained)
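Steps 4-6 boil down to handing the client's columnar dict straight to Arrow and writing Parquet. A condensed sketch of that path (Arc's real buffer adds batching, async flushes, and partitioned paths):
```python
import pyarrow as pa
import pyarrow.parquet as pq

# Columnar payload exactly as the client sent it (step 1)
columns = {
    "time": [1700000000000, 1700000001000],
    "host": ["server01", "server02"],
    "usage_idle": [95.0, 85.0],
}

table = pa.table(columns)             # step 4: arrays become Arrow columns as-is
pq.write_table(table, "cpu.parquet")  # step 6: columnar data lands in Parquet unchanged
```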
**Key Advantages:**
- **2.55x faster throughput** vs row format (2.32M vs 908K RPS)
- **20-26x lower latency** (p50: 6.75ms vs 136ms)
- **Zero conversion overhead** - No flatten, no row→column conversion
- **Better compression** - Field names sent once, not per-record
- **More efficient memory** - Arrays more compact than list of dicts
### Why MinIO?
Arc Core is designed with **MinIO as the primary storage backend** for several key reasons:
1. **Unlimited Scale**: Store petabytes of time-series data without hitting storage limits
2. **Cost-Effective**: Commodity hardware or cloud storage at fraction of traditional database costs
3. **Distributed Architecture**: Built-in replication and erasure coding for data durability
4. **S3 Compatibility**: Works with any S3-compatible storage (AWS S3, GCS, Wasabi, etc.)
5. **Performance**: Direct Parquet reads from object storage with DuckDB's efficient execution
6. **Separation of Compute & Storage**: Scale storage and compute independently
7. **Self-Hosted Option**: Run on your own infrastructure without cloud vendor lock-in
The MinIO + Parquet + DuckDB combination provides the perfect balance of cost, performance, and scalability for analytical time-series workloads.
## Performance
Arc Core has been benchmarked using [ClickBench](https://github.com/ClickHouse/ClickBench) - the industry-standard analytical database benchmark with 100M row dataset (14GB) and 43 analytical queries.
### ClickBench Results
**Hardware: AWS c6a.4xlarge** (16 vCPU AMD EPYC 7R13, 32GB RAM, 500GB gp2)
- **Cold Run Total**: 120.25s (sum of 43 queries, first execution with proper cache flushing)
- **Warm Run Total**: 35.70s (sum of 43 queries, best of 3 runs)
- **Cold/Warm Ratio**: 3.37x (proper cache flushing verification)
- **Storage**: 13.76 GB Parquet (Snappy compression)
- **Success Rate**: 43/43 queries (100%)
- **vs QuestDB**: 1.80x faster cold, 1.20x faster warm
- **vs TimescaleDB**: 9.39x faster cold, 12.39x faster warm
**Hardware: Apple M3 Max** (14 cores ARM, 36GB RAM)
- **Cold Run Total**: 22.64s (sum of 43 queries, first execution)
- **With Query Cache**: 16.87s (60s TTL caching enabled, 1.34x speedup)
- **Cache Hit Performance**: 3-20ms per query (sub-second for all cached queries)
- **Cache Hit Rate**: 51% of queries benefit from caching (22/43 queries)
- **Aggregate Performance**: ~4.4M rows/sec cold, ~5.9M rows/sec cached
- **Storage**: Local NVMe SSD
- **Success Rate**: 43/43 queries (100%)
- **Optimizations**: DuckDB pool (early connection release), async gzip decompression
### Key Performance Characteristics
- **Columnar Storage**: Parquet format with Snappy compression
- **Query Engine**: DuckDB with default settings (ClickBench compliant)
- **Result Caching**: 60s TTL for repeated queries (production mode)
- **End-to-End**: All timings include HTTP/JSON API overhead
### Fastest Queries (M3 Max)
| Query | Time | Description |
|-------|------|-------------|
| Q1 | 0.043s | Simple COUNT(*) aggregation |
| Q7 | 0.036s | MIN/MAX on date column |
| Q8 | 0.039s | GROUP BY with filter |
| Q20 | 0.047s | Point lookup by UserID |
| Q42 | 0.043s | Multi-column aggregation |
### Most Complex Queries
| Query | Time | Description |
|-------|------|-------------|
| Q29 | 8.09s | REGEXP_REPLACE with heavy string operations |
| Q19 | 1.69s | Timestamp conversion with GROUP BY |
| Q33 | 1.28s | Complex multi-column aggregations |
| Q23 | 1.10s | String matching with LIKE patterns |
**Benchmark Configuration**:
- Dataset: 100M rows, 14GB Parquet (ClickBench hits.parquet)
- Protocol: HTTP REST API with JSON responses
- Caching: Disabled for benchmark compliance
- Tuning: None (default DuckDB settings)
See full results and methodology at [ClickBench Results](https://benchmark.clickhouse.com) and [Arc's ClickBench repository](https://github.com/Basekick-Labs/ClickBench/tree/main/arc).
## Docker Services
The `docker-compose.yml` includes:
- **arc-api**: Main API server (port 8000)
- **minio**: S3-compatible storage (port 9000, console 9001)
- **minio-init**: Initializes MinIO buckets on startup
## Development
```bash
# Run with auto-reload
uvicorn api.main:app --reload --host 0.0.0.0 --port 8000
# Run tests (if available in parent repo)
pytest tests/
```
## Monitoring
Health check endpoint:
```bash
curl http://localhost:8000/health
```
Logs:
```bash
# Docker
docker-compose logs -f arc-api
# Native (systemd)
sudo journalctl -u arc-api -f
```
## API Reference
### Public Endpoints (No Authentication Required)
- `GET /` - API information
- `GET /health` - Service health check
- `GET /ready` - Readiness probe
- `GET /docs` - Swagger UI documentation
- `GET /redoc` - ReDoc documentation
- `GET /openapi.json` - OpenAPI specification
**Note**: All other endpoints require token authentication via the `Authorization: Bearer <token>` or `x-api-key` header.
### Data Ingestion
**MessagePack Binary Protocol** (Recommended - 2.66x faster):
- `POST /api/v1/write/msgpack` - Write data via MessagePack columnar format
- `GET /api/v1/write/msgpack/stats` - Get ingestion statistics
- `GET /api/v1/write/msgpack/spec` - Get protocol specification
**Line Protocol** (InfluxDB compatibility):
- `POST /api/v1/write` - InfluxDB 1.x compatible write
- `POST /api/v1/write/influxdb` - InfluxDB 2.x API format
- `POST /api/v1/write/line-protocol` - Line protocol endpoint
- `POST /api/v1/write/flush` - Force flush write buffer
- `GET /api/v1/write/health` - Write endpoint health check
- `GET /api/v1/write/stats` - Write statistics
### Query Endpoints
- `POST /api/v1/query` - Execute DuckDB SQL query (JSON response)
- `POST /api/v1/query/arrow` - Execute query (Apache Arrow IPC format)
- `POST /api/v1/query/estimate` - Estimate query cost
- `POST /api/v1/query/stream` - Stream large query results (CSV)
- `GET /api/v1/query/{measurement}` - Get measurement data
- `GET /api/v1/query/{measurement}/csv` - Export measurement as CSV
- `GET /api/v1/measurements` - List all measurements/tables
### Authentication & Security
- `GET /api/v1/auth/verify` - Verify token validity
- `GET /api/v1/auth/tokens` - List all tokens
- `POST /api/v1/auth/tokens` - Create new token
- `GET /api/v1/auth/tokens/{token_id}` - Get token details
- `PATCH /api/v1/auth/tokens/{token_id}` - Update token
- `DELETE /api/v1/auth/tokens/{token_id}` - Delete token
- `POST /api/v1/auth/tokens/{token_id}/rotate` - Rotate token (generate new)
- `GET /api/v1/auth/cache/stats` - Authentication cache statistics
- `POST /api/v1/auth/cache/invalidate` - Invalidate auth cache
### Monitoring & Metrics
- `GET /health` - Service health check
- `GET /ready` - Readiness probe
- `GET /api/v1/metrics` - Prometheus metrics
- `GET /api/v1/metrics/timeseries/{metric_type}` - Time-series metrics
- `GET /api/v1/metrics/endpoints` - Endpoint statistics
- `GET /api/v1/metrics/query-pool` - Query pool status
- `GET /api/v1/metrics/memory` - Memory profile
- `GET /api/v1/logs` - Application logs
### Connection Management
**Data Source Connections**:
- `GET /api/v1/connections/datasource` - List data source connections
- `POST /api/v1/connections/datasource` - Create connection
- `GET /api/v1/connections/datasource/{connection_id}` - Get connection details
**InfluxDB Connections**:
- `GET /api/v1/connections/influx` - List InfluxDB connections
- `POST /api/v1/connections/influx` - Create InfluxDB connection
- `GET /api/v1/connections/influx/{connection_id}` - Get connection details
**Storage Connections**:
- `GET /api/v1/connections/storage` - List storage backends
- `POST /api/v1/connections/storage` - Create storage connection
- `GET /api/v1/connections/storage/{connection_id}` - Get storage details
**HTTP/JSON Connections**:
- `GET /api/v1/connections/http_json` - List HTTP/JSON connections
- `POST /api/v1/connections/http_json` - Create HTTP/JSON connection
**Connection Operations**:
- `POST /api/v1/connections/{connection_type}/test` - Test connection
- `POST /api/v1/connections/{connection_type}/{connection_id}/activate` - Activate connection
- `DELETE /api/v1/connections/{connection_type}/{connection_id}` - Delete connection
**Setup**:
- `POST /api/v1/setup/default-connections` - Create default connections
### Retention Policies
- `GET /api/v1/retention` - List all retention policies
- `POST /api/v1/retention` - Create retention policy
- `GET /api/v1/retention/{id}` - Get policy details
- `PUT /api/v1/retention/{id}` - Update policy
- `DELETE /api/v1/retention/{id}` - Delete policy
- `POST /api/v1/retention/{id}/execute` - Execute policy (manual trigger with dry-run support)
- `GET /api/v1/retention/{id}/executions` - Get execution history
See [Retention Policies Documentation](docs/RETENTION_POLICIES.md) for complete guide.
### Continuous Queries
- `GET /api/v1/continuous_queries` - List all continuous queries
- `POST /api/v1/continuous_queries` - Create continuous query
- `GET /api/v1/continuous_queries/{id}` - Get query details
- `PUT /api/v1/continuous_queries/{id}` - Update query
- `DELETE /api/v1/continuous_queries/{id}` - Delete query
- `POST /api/v1/continuous_queries/{id}/execute` - Execute query manually (with dry-run support)
- `GET /api/v1/continuous_queries/{id}/executions` - Get execution history
**Use Cases:**
- **Downsampling**: Aggregate high-resolution data (10s → 1m → 1h → 1d retention tiers)
- **Materialized Views**: Pre-compute aggregations for faster dashboard queries
- **Summary Tables**: Create daily/hourly summaries for long-term analysis
- **Storage Optimization**: Reduce storage by aggregating old data
See [Continuous Queries Documentation](docs/CONTINUOUS_QUERIES.md) for complete guide with SQL examples.
### Export Jobs
- `GET /api/v1/jobs` - List all export jobs
- `POST /api/v1/jobs` - Create new export job
- `PUT /api/v1/jobs/{job_id}` - Update job configuration
- `DELETE /api/v1/jobs/{job_id}` - Delete job
- `GET /api/v1/jobs/{job_id}/executions` - Get job execution history
- `POST /api/v1/jobs/{job_id}/run` - Run job immediately
- `POST /api/v1/jobs/{job_id}/cancel` - Cancel running job
- `GET /api/v1/monitoring/jobs` - Monitor job status
### HTTP/JSON Export
- `GET /api/v1/http-json/connections` - List HTTP/JSON connections
- `POST /api/v1/http-json/connections` - Create HTTP/JSON connection
- `GET /api/v1/http-json/connections/{connection_id}` - Get connection details
- `POST /api/v1/http-json/connections/{connection_id}/test` - Test connection
- `POST /api/v1/http-json/connections/{connection_id}/activate` - Activate connection
- `POST /api/v1/http-json/connections/{connection_id}/discover-schema` - Discover schema
- `POST /api/v1/http-json/export` - Export data via HTTP
### Cache Management
- `GET /api/v1/cache/stats` - Cache statistics
- `GET /api/v1/cache/health` - Cache health status
- `POST /api/v1/cache/clear` - Clear query cache
### Compaction Management
- `GET /api/v1/compaction/status` - Current compaction status
- `GET /api/v1/compaction/stats` - Detailed statistics
- `GET /api/v1/compaction/candidates` - List eligible partitions
- `POST /api/v1/compaction/trigger` - Manually trigger compaction
- `GET /api/v1/compaction/jobs` - View active jobs
- `GET /api/v1/compaction/history` - View job history
### Write-Ahead Log (WAL)
- `GET /api/v1/wal/status` - WAL status and configuration
- `GET /api/v1/wal/stats` - WAL statistics
- `GET /api/v1/wal/health` - WAL health check
- `GET /api/v1/wal/files` - List WAL files
- `POST /api/v1/wal/cleanup` - Clean up old WAL files
- `GET /api/v1/wal/recovery/history` - Recovery history
### Avro Schema Registry
- `GET /api/v1/avro/schemas` - List all schemas
- `GET /api/v1/avro/schemas/{schema_id}` - Get schema by ID
- `GET /api/v1/avro/schemas/topic/{topic_name}` - Get schema by topic
### Delete Operations
**Note**: Disabled by default. Set `delete.enabled=true` in `arc.conf` to enable.
- `POST /api/v1/delete` - Delete data matching WHERE clause (supports dry-run)
- `GET /api/v1/delete/config` - Get delete configuration and limits
**Key Features**:
- **Zero overhead on writes/queries**: Deleted data physically removed via file rewrites
- **Precise deletion**: Delete any rows matching a SQL WHERE clause
- **GDPR compliance**: Remove specific user data permanently
- **Safety mechanisms**: Dry-run mode, confirmation thresholds, row limits
- **Use cases**: GDPR requests, error cleanup, decommissioning hosts/sensors
See [DELETE.md](docs/DELETE.md) for detailed documentation.
### Interactive API Documentation
Arc Core includes auto-generated API documentation:
- **Swagger UI**: `http://localhost:8000/docs`
- **ReDoc**: `http://localhost:8000/redoc`
- **OpenAPI JSON**: `http://localhost:8000/openapi.json`
## Integrations
### Apache Superset - Interactive Dashboards
Create interactive dashboards and visualizations for your Arc data using Apache Superset:
**Quick Start:**
```bash
# Install the Arc dialect
pip install arc-superset-dialect
# Or use Docker with Arc pre-configured
git clone https://github.com/basekick-labs/arc-superset-dialect.git
cd arc-superset-dialect
docker build -t superset-arc .
docker run -d -p 8088:8088 superset-arc
```
**Connect to Arc:**
1. Access Superset at `http://localhost:8088` (admin/admin)
2. Add database connection: `arc://YOUR_API_KEY@localhost:8000/default`
3. Start building dashboards with SQL queries
**Example Dashboard Queries:**
```sql
-- Time-series CPU usage
SELECT
time_bucket(INTERVAL '5 minutes', timestamp) as time,
host,
AVG(usage_idle) as avg_idle
FROM cpu
WHERE timestamp > NOW() - INTERVAL 6 HOUR
GROUP BY time, host
ORDER BY time DESC;
-- Correlate CPU and Memory
SELECT c.timestamp, c.host, c.usage_idle, m.used_percent
FROM cpu c
JOIN mem m ON c.timestamp = m.timestamp AND c.host = m.host
WHERE c.timestamp > NOW() - INTERVAL 1 HOUR
LIMIT 1000;
```
**Learn More:**
- [Arc Superset Dialect Repository](https://github.com/basekick-labs/arc-superset-dialect)
- [Integration Guide](https://github.com/basekick-labs/arc-superset-dialect#readme)
- [PyPI Package](https://pypi.org/project/arc-superset-dialect/) (once published)
## Roadmap
Arc Core is under active development. Current focus areas:
- **Performance Optimization**: Further improvements to ingestion and query performance
- **API Stability**: Finalizing core API contracts
- **Enhanced Monitoring**: Additional metrics and observability features
- **Documentation**: Expanded guides and tutorials
- **Production Hardening**: Testing and validation for production use cases
We welcome feedback and feature requests as we work toward a stable 1.0 release.
## License
Arc Core is licensed under the **GNU Affero General Public License v3.0 (AGPL-3.0)**.
This means:
- **Free to use** - Use Arc Core for any purpose
- **Free to modify** - Modify the source code as needed
- **Free to distribute** - Share your modifications with others
- **Share modifications** - If you modify Arc and run it as a service, you must share your changes under AGPL-3.0
### Why AGPL?
AGPL-3.0 ensures that improvements to Arc benefit the entire community, even when run as a cloud service. This prevents the "SaaS loophole" where companies could take the code, improve it, and keep changes proprietary.
### Commercial Licensing
For organizations that require:
- Proprietary modifications without disclosure
- Commercial support and SLAs
- Enterprise features and managed services
Please contact us at: **enterprise[at]basekick[dot]net**
We offer dual licensing and commercial support options.
## Support
- **Discord Community**: [Join our Discord](https://discord.gg/nxnWfUxsdm) - Get help, share feedback, and connect with other Arc users
- **GitHub Issues**: [Report bugs and request features](https://github.com/basekick-labs/arc-core/issues)
- **Enterprise Support**: enterprise[at]basekick[dot]net
- **General Inquiries**: support[at]basekick[dot]net
## Disclaimer
**Arc Core is provided "as-is" in alpha state.** While we use it extensively for development and testing, it is not yet production-ready. Features and APIs may change without notice. Always back up your data and test thoroughly in non-production environments before considering any production deployment.