ClickHouse Knowledge Base | ClickHouse Docs

A Python client working example for connecting to ClickHouse Cloud Service

September 10, 2024 · 2 min read

This is a step by step example on how to start using Python with ClickHouse Cloud service.

Note

Keep in mind that Python versions and libraries dependencies are constantly evolving. Make also sure to use the latest supported versions of both the driver and Python environment when trying this.

At the time of writing this article, we're using the clickhouse-connect driver version 0.5.23 and python 3.11.2 respectively.

Steps

Check the Python version:

$  python -V
Python 3.11.2

We'll assemble the project in a folder called ch-python:

$ mkdir ch-python
$ cd ch-python

Create a dependencies file named requirements.txt with:

clickhouse-connect==0.5.23

Create a python source file named main.py:

import clickhouse_connect
import sys
import json

CLICKHOUSE_CLOUD_HOSTNAME = 'HOSTNAME.clickhouse.cloud'
CLICKHOUSE_CLOUD_USER = 'default'
CLICKHOUSE_CLOUD_PASSWORD = 'YOUR_SECRET_PASSWORD'

client = clickhouse_connect.get_client(
    host=CLICKHOUSE_CLOUD_HOSTNAME, port=8443, username=CLICKHOUSE_CLOUD_USER, password=CLICKHOUSE_CLOUD_PASSWORD)

print("connected to " + CLICKHOUSE_CLOUD_HOSTNAME + "\n")
client.command(
    'CREATE TABLE IF NOT EXISTS new_table (key UInt32, value String, metric Float64) ENGINE MergeTree ORDER BY key')

print("table new_table created or exists already!\n")

row1 = [1000, 'String Value 1000', 5.233]
row2 = [2000, 'String Value 2000', -107.04]
data = [row1, row2]
client.insert('new_table', data, column_names=['key', 'value', 'metric'])

print("written 2 rows to table new_table\n")

QUERY = "SELECT max(key), avg(metric) FROM new_table"

result = client.query(QUERY)

sys.stdout.write("query: ["+QUERY + "] returns:\n\n")
print(result.result_rows)

Create the virtual environment:

chpython$ python -m venv venv

Load the virtual environment:

chpython$ source venv/bin/activate

Once loaded, your terminal prompt should be prefixed with (venv), install dependencies:

(venv) ➜  chpython$ pip install -r requirements.txt
Collecting certifi
  Using cached certifi-2023.5.7-py3-none-any.whl (156 kB)
Collecting urllib3>=1.26
  Using cached urllib3-2.0.2-py3-none-any.whl (123 kB)
Collecting pytz
  Using cached pytz-2023.3-py2.py3-none-any.whl (502 kB)
Collecting zstandard
  Using cached zstandard-0.21.0-cp311-cp311-macosx_11_0_arm64.whl (364 kB)
Collecting lz4
  Using cached lz4-4.3.2-cp311-cp311-macosx_11_0_arm64.whl (212 kB)
Installing collected packages: pytz, zstandard, urllib3, lz4, certifi, clickhouse-connect
Successfully installed certifi-2023.5.7 clickhouse-connect-0.5.23 lz4-4.3.2 pytz-2023.3 urllib3-2.0.2 zstandard-0.21.0

Launch the code!

(venv) chpython$ venv/bin/python main.py

connected to HOSTNAME.clickhouse.cloud

table new_table created or exists already!

written 2 rows to table new_table

query: [SELECT max(key), avg(metric) FROM new_table] returns:

[(2000, -50.9035)]

Tip

If using an older Python version (e.g. 3.9.6) you might be getting an ImportError related to urllib3 library. In that case either upgrade your Python environment to a newer version or pin the urllib3 version to 1.26.15 in your requirements.txt file.

Terraform example on how to use Cloud API

September 10, 2024 · 5 min read

How can I use API to manage clusters on ClickHouse Cloud?

Answer

We will use Terraform to configure our infra and ClickHouse Provider

Steps:

1). Create an API Key on Cloud. Follow the docs here - https://clickhouse.com/docs/en/cloud/manage/openapi

Save the creds locally.

2). Install Terraform using - https://developer.hashicorp.com/terraform/tutorials/aws-get-started/install-cli

You can use Homebrew package manager if you're on Mac.

3). Create a directory anywhere you like:

mkdir test
➜  test pwd
/Users/jaijhala/Desktop/terraform/test

4). Create 2 files: main.tf and secret.tfvars

Copy the following:

main.tf file would be:

terraform {
 required_providers {
   clickhouse = {
     source = "ClickHouse/clickhouse"
     version = "0.0.2"
   }
 }
}

variable "organization_id" {
  type = string
}

variable "token_key" {
  type = string
}

variable "token_secret" {
  type = string
}

provider clickhouse {
  environment   = "production"
  organization_id = var.organization_id
  token_key     = var.token_key
  token_secret  = var.token_secret
}


variable "service_password" {
  type = string
  sensitive   = true
}

resource "clickhouse_service" "service123" {
  name          = "jai-terraform"
  cloud_provider = "aws"
  region        = "us-east-2"
  tier          = "development"
  idle_scaling   = true
  password  = var.service_password
  ip_access = [
    {
        source      = "0.0.0.0/0"
        description = "Anywhere"
    }
  ]
}

output "CLICKHOUSE_HOST" {
  value = clickhouse_service.service123.endpoints.0.host
}

You can replace your own parameters like service name, region etc.. in the resources section above.

secret.tfvars is where you'll put all the API Key related info that you downloaded earlier. The idea behind this file is that all your secret credentials will be hidden from the main config file.

It would be something like (replace these parameters):

organization_id = "e957a5f7-4qe3-4b05-ad5a-d02b2dcd0593"
token_key = "QWhhkMeytqQruTeKg"
token_secret = "4b1dNmjWdLUno9lXxmKvSUcPP62jvn7irkuZPbY"
service_password = "password123!"

5). Run terraform init from this directory

Expected output:

Initializing the backend...

Initializing provider plugins...
- Finding clickhouse/clickhouse versions matching "0.0.2"...
- Installing clickhouse/clickhouse v0.0.2...
- Installed clickhouse/clickhouse v0.0.2 (self-signed, key ID D7089EE5C6A92ED1)

Partner and community providers are signed by their developers.
If you'd like to know more about provider signing, you can read about it here:
https://www.terraform.io/docs/cli/plugins/signing.html

Terraform has created a lock file .terraform.lock.hcl to record the provider
selections it made above. Include this file in your version control repository
so that Terraform can guarantee to make the same selections by default when
you run "terraform init" in the future.

Terraform has been successfully initialized!

You may now begin working with Terraform. Try running "terraform plan" to see
any changes that are required for your infrastructure. All Terraform commands
should now work.

If you ever set or change modules or backend configuration for Terraform,
rerun this command to reinitialize your working directory. If you forget, other
commands will detect it and remind you to do so if necessary.

6). Run terraform apply -var-file=secret.tfvars command.

Something like:

➜  test terraform apply -var-file=secret.tfvars

Terraform used the selected providers to generate the following execution plan. Resource actions are indicated with
the following symbols:
  + create

Terraform will perform the following actions:

  # clickhouse_service.service123 will be created
  + resource "clickhouse_service" "service123" {
      + cloud_provider = "aws"
      + endpoints      = (known after apply)
      + id             = (known after apply)
      + idle_scaling   = true
      + ip_access      = [
          + {
              + description = "Anywhere"
              + source      = "0.0.0.0/0"
            },
        ]
      + last_updated   = (known after apply)
      + name           = "jai-terraform"
      + password       = (sensitive value)
      + region         = "us-east-2"
      + tier           = "development"
    }

Plan: 1 to add, 0 to change, 0 to destroy.

Changes to Outputs:
  + CLICKHOUSE_HOST = (known after apply)

Do you want to perform these actions?
  Terraform will perform the actions described above.
  Only 'yes' will be accepted to approve.

  Enter a value: yes

Type yes and hit enter

Side note: Notice it says password = (sensitive value) above. This is because we set sensitive = true for the password in the main.tf file.

7). It will take a couple of mins to create the service but eventually it should come up like:

  Enter a value: yes

clickhouse_service.service123: Creating...
clickhouse_service.service123: Still creating... [10s elapsed]
clickhouse_service.service123: Still creating... [20s elapsed]
clickhouse_service.service123: Still creating... [30s elapsed]
clickhouse_service.service123: Still creating... [40s elapsed]
clickhouse_service.service123: Still creating... [50s elapsed]
clickhouse_service.service123: Still creating... [1m0s elapsed]
clickhouse_service.service123: Still creating... [1m10s elapsed]
clickhouse_service.service123: Still creating... [1m20s elapsed]
clickhouse_service.service123: Still creating... [1m30s elapsed]
clickhouse_service.service123: Still creating... [1m40s elapsed]
clickhouse_service.service123: Creation complete after 1m41s [id=aa8d8d63-1878-4600-8470-630715af38ed]

Apply complete! Resources: 1 added, 0 changed, 0 destroyed.

Outputs:

CLICKHOUSE_HOST = "h3ljlaqez6.us-east-2.aws.clickhouse.cloud"
➜  test

8). Check Cloud Console, you should be able to see the service created.

9). To clean up/destroy the service again, run terraform destroy -var-file=secret.tfvars

Something like:

Terraform used the selected providers to generate the following execution plan. Resource actions are indicated with
the following symbols:
  - destroy

Terraform will perform the following actions:

  # clickhouse_service.service123 will be destroyed
  - resource "clickhouse_service" "service123" {
      - cloud_provider = "aws" -> null
      - ............

Plan: 0 to add, 0 to change, 1 to destroy.

Changes to Outputs:
  - CLICKHOUSE_HOST = "h3ljlaqez6.us-east-2.aws.clickhouse.cloud" -> null

Do you really want to destroy all resources?
  Terraform will destroy all your managed infrastructure, as shown above.
  There is no undo. Only 'yes' will be accepted to confirm.

  Enter a value:

Type yes and hit enter

10).

clickhouse_service.service123: Destroying... [id=aa8d8d63-1878-4600-8470-630715af38ed]
clickhouse_service.service123: Still destroying... [id=aa8d8d63-1878-4600-8470-630715af38ed, 10s elapsed]
clickhouse_service.service123: Still destroying... [id=aa8d8d63-1878-4600-8470-630715af38ed, 20s elapsed]
clickhouse_service.service123: Destruction complete after 27s

Destroy complete! Resources: 1 destroyed.

And it should be gone from the Cloud Console.

More details about the Cloud API can be found here - https://clickhouse.com/docs/en/cloud/manage/api/api-overview

How to connect to CH Cloud using SSH Keys

July 25, 2024 · 2 min read

Question

How can I connect to CH Cloud service using SSH Key Authentication?

Answer

1) Use ssh-keygen to create the keypair. Example:

➜  new ssh-keygen \
-t ed25519 \
> -f /Users/testuser/.ssh/ch_key
Generating public/private ed25519 key pair.
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /Users/testuser/.ssh/ch_key
Your public key has been saved in /Users/testuser/.ssh/ch_key.pub
.....

2) Use the public key (ch_key.pub in above example) to create the USER.

clickhouse-cloud :) CREATE USER abcuser IDENTIFIED WITH ssh_key BY KEY 'AAAABBBcdE1lZDI1NTE5AAAAIISdl4CrGM8mckXBUXLjL3ef9XwnycDWEvBPu3toB40m' TYPE 'ssh-ed25519';

CREATE USER abcuser IDENTIFIED WITH ssh_key BY KEY AAAABBBcdE1lZDI1NTE5AAAAIISdl4CrGM8mckXBUXLjL3ef9XwnycDWEvBPu3toB40m TYPE `ssh-ed25519`

Query id: 34c6aad6-5f88-4c80-af7a-7d37c91ba7d5

Ok.

3) Run SHOW users to confirm the user creation.

4) Grant default_role to the user (optional).

clickhouse-cloud :) grant default_role to abcuser;

GRANT default_role TO abcuser

Query id: 4a054003-220a-4dea-8e8d-eb1f08ee7b10

Ok.

0 rows in set. Elapsed: 0.137 sec.

5) Use the private key now to authenticate against the service.

➜  new ./clickhouse client --host myhost.us-central1.gcp.clickhouse.cloud --secure --user abcuser --ssh-key-file '/Users/testuser/.ssh/ch_key'
ClickHouse client version 23.12.1.863 (official build).
Enter your private key passphrase (leave empty for no passphrase):
Connecting to myhost.us-central1.gcp.clickhouse.cloud:9440 as user abcuser.
Connected to ClickHouse server version 23.9.2.

clickhouse-cloud :) select currentUser();

SELECT currentUser()

Query id: d4b6bb60-ef45-47d3-8740-db9f2941dcd2

┌─currentUser()─┐
│ abcuser       │
└───────────────┘

1 row in set. Elapsed: 0.001 sec.

clickhouse-cloud :)

Mapping of metrics used in system.dashboards to Prometheus metrics in system.custom_metrics

July 23, 2024 · 3 min read

The table below provides the mapping for the metrics used in system.dashboards to Prometheus metrics in system.custom_metrics.
This is useful for customers who want to monitor for the same metrics found in system.dashboards.

Mapping table for metrics in system.dashboards to Prometheus metrics in system.custom_metrics

Dashboard	Title	Prometheus Metric Name (system.custom_metrics)
Overview	Queries/second	ClickHouseProfileEvents_Query
Overview	CPU Usage (cores)	ClickHouseProfileEvents_OSCPUVirtualTimeMicroseconds
Overview	Queries Running	ClickHouseMetrics_Query
Overview	Merges Running	ClickHouseMetrics_Merge
Overview	Selected Bytes/second	ClickHouseProfileEvents_SelectedBytes
Overview	IO Wait	ClickHouseProfileEvents_OSIOWaitMicroseconds
Overview	CPU Wait	ClickHouseProfileEvents_OSCPUWaitMicroseconds
Overview	OS CPU Usage (Userspace)	ClickHouseAsyncMetrics_OSUserTimeNormalized
Overview	OS CPU Usage (Kernel)	ClickHouseAsyncMetrics_OSSystemTimeNormalized
Overview	Read From Disk	ClickHouseProfileEvents_OSReadBytes
Overview	Read From Filesystem	ClickHouseProfileEvents_OSReadChars
Overview	Memory (tracked)	ClickHouseMetrics_MemoryTracking
Overview	Load Average (15 minutes)	ClickHouseAsyncMetrics_LoadAverage15
Overview	Selected Rows/second	ClickHouseProfileEvents_SelectedRows
Overview	Inserted Rows/second	ClickHouseProfileEvents_InsertedRows
Overview	Total MergeTree Parts	ClickHouseAsyncMetrics_TotalPartsOfMergeTreeTables
Overview	Max Parts For Partition	ClickHouseAsyncMetrics_MaxPartCountForPartition
Cloud overview	Queries/second	ClickHouseProfileEvents_Query
Cloud overview	CPU Usage (cores)	ClickHouseProfileEvents_OSCPUVirtualTimeMicroseconds
Cloud overview	Queries Running	ClickHouseMetrics_Query
Cloud overview	Merges Running	ClickHouseMetrics_Merge
Cloud overview	Selected Bytes/second	ClickHouseProfileEvents_SelectedBytes
Cloud overview	IO Wait (local fs)	ClickHouseProfileEvents_OSIOWaitMicroseconds
Cloud overview	S3 read wait	ClickHouseProfileEvents_ReadBufferFromS3Microseconds
Cloud overview	S3 read errors/sec	ProfileEvent_ReadBufferFromS3RequestsErrors
Cloud overview	CPU Wait	ClickHouseProfileEvents_OSCPUWaitMicroseconds
Cloud overview	OS CPU Usage (Userspace, normalized)	ClickHouseAsyncMetrics_OSUserTimeNormalized
Cloud overview	OS CPU Usage (Kernel, normalized)	ClickHouseAsyncMetrics_OSSystemTimeNormalized
Cloud overview	Read From Disk (bytes/sec)	ClickHouseProfileEvents_OSReadBytes
Cloud overview	Read From Filesystem (bytes/sec)	ClickHouseProfileEvents_OSReadChars
Cloud overview	Memory (tracked, bytes)	ClickHouseMetrics_MemoryTracking
Cloud overview	Load Average (15 minutes)	ClickHouseAsyncMetrics_LoadAverage15
Cloud overview	Selected Rows/sec	ClickHouseProfileEvents_SelectedRows
Cloud overview	Inserted Rows/sec	ClickHouseProfileEvents_InsertedRows
Cloud overview	Total MergeTree Parts	ClickHouseAsyncMetrics_TotalPartsOfMergeTreeTables
Cloud overview	Max Parts For Partition	ClickHouseAsyncMetrics_MaxPartCountForPartition
Cloud overview	Read From S3 (bytes/sec)	ClickHouseProfileEvents_ReadBufferFromS3Bytes
Cloud overview	Filesystem Cache Size	ClickHouseMetrics_FilesystemCacheSize
Cloud overview	Disk S3 write req/sec	ClickHouseProfileEvents_DiskS3PutObject + ClickHouseProfileEvents_DiskS3UploadPart + ClickHouseProfileEvents_DiskS3CreateMultipartUpload + ClickHouseProfileEvents_DiskS3CompleteMultipartUpload
Cloud overview	Disk S3 read req/sec	ClickHouseProfileEvents_DiskS3GetObject + ClickHouseProfileEvents_DiskS3HeadObject + ClickHouseProfileEvents_DiskS3ListObjects
Cloud overview	FS cache hit rate	ClickHouseProfileEvents_CachedReadBufferReadFromCacheBytes / (ClickHouseProfileEvents_CachedReadBufferReadFromCacheBytes + ClickHouseProfileEvents_CachedReadBufferReadFromSourceBytes)
Cloud overview	Page cache hit rate	greatest(0, (sum(ClickHouseProfileEvents_OSReadChars) - sum(ClickHouseProfileEvents_OSReadBytes)) / (sum(ClickHouseProfileEvents_OSReadChars) + sum(ClickHouseProfileEvents_ReadBufferFromS3Bytes)))
Cloud overview	Network receive bytes/sec	ClickHouseProfileEvents_NetworkReceiveBytes
Cloud overview	Network send bytes/sec	ClickHouseProfileEvents_NetworkSendBytes

ODBC authentication failed error when using the Power BI ClickHouse connector

July 10, 2024 · One min read

Question

When trying to connect from PowerBI to ClickHouse using the connector, you receive the following authentication error:

We encountered an error while trying to connect.
Details: "ODBC: ERROR [HY000] HTTP status code: 403
Received error:
Code: 516. DB::Exception: default: Authentication failed: password is incorrect, or there is no user with such name.
If you have installed ClickHouse and forgot password you can reset it in the configuration file.
The password for default user is typically located at /etc/clickhouse-server/users.d/default-password.xml and deleting this file will reset the password.
See also /etc/clickhouse-server/users.ml on the server where
ClickHouse is installed.

powerbi_error

Answer

Check the password being used to see if the password contains a tilde ~.

The recommendation is to use a dedicated user for the connection and set the password manually. If using ClickHouse Cloud and the admin level of permissions with the default user is needed, then create a new user and and assign the default_role.

For more information:
https://clickhouse.com/docs/en/operations/access-rights#user-account-management
https://clickhouse.com/docs/en/cloud/security/cloud-access-management#database-roles

How to create a table that can query multiple remote clusters

July 3, 2024 · 2 min read

Question

How do I create a table that can query other clusters or instances?

Answer

Below is a simple example to test functionality.

In this example, ClickHouse Cloud is use but the example will work when using self-hosted clusters also. The targets will need to change to the urls/hosts/dns of a target node or load balancer.

In cluster A:

./clickhouse client --host clusterA.us-west-2.aws.clickhouse.cloud --secure --password 'Password123!'

Create the database:

create database db1;

Create the table:

 CREATE TABLE db1.table1_remote1
(
    `id` UInt32,
    `timestamp_column` DateTime,
    `string_column` String
)
ENGINE = MergeTree()
ORDER BY id;

Insert some sample rows:

insert into db1.table1_remote1
values
(1, '2023-09-29 00:01:00', 'a'),
(2, '2023-09-29 00:02:00', 'b'),
(3, '2023-09-29 00:03:00', 'c');

In cluster B:

./clickhouse client --host clusterB.us-east-2.aws.clickhouse.cloud --secure --password 'Password123!'

Create the database:

create database db1;

Create the table:

CREATE TABLE db1.table1_remote2
(
    `id` UInt32,
    `timestamp_column` DateTime,
    `string_column` String
)
ENGINE = MergeTree()
ORDER BY id;

Insert sample rows:

insert into db1.table1_remote1
values
(4, '2023-09-29 00:04:00', 'x'),
(5, '2023-09-29 00:05:00', 'y'),
(6, '2023-09-29 00:06:00', 'z');

In Cluster C:
*this cluster will be used to gather the data from the other two clusters, however, can also be used as a source.

./clickhouse client --host clusterC.us-west-2.aws.clickhouse.cloud --secure --password 'Password123!'

Create the database:

create database db1;

Create the remote tables with remoteSecure() to connect to the other clusters.
Definition for remote cluster A table:

CREATE TABLE db1.table1_remote1_main
(
    `id` UInt32,
    `timestamp_column` DateTime,
    `string_column` String
) AS remoteSecure('clusterA.us-west-2.aws.clickhouse.cloud:9440', 'db1.table1_remote1', 'default', 'Password123!');

Definition for remote cluster B table:

CREATE TABLE db1.table1_remote2_main
(
    `id` UInt32,
    `timestamp_column` DateTime,
    `string_column` String
) AS remoteSecure('clusterB.us-east-2.aws.clickhouse.cloud:9440', 'db1.table1_remote2', 'default', 'Password123!')

Create the merge table to be used to gather results:

create table db1.table1_merge_remote
(
  id UInt32,
  timestamp_column DateTime,
  string_column String
)
engine = Merge('db1', 'table.\_main');

Test the results:

clickhouse-cloud :) select * from db1.table1_merge_remote;

SELECT *
FROM db1.table1_merge_remote

Query id: 46b6e741-bbd1-47ed-b40e-69ddb6e0c364

┌─id─┬────timestamp_column─┬─string_column─┐
│  1 │ 2023-09-29 00:01:00 │ a             │
│  2 │ 2023-09-29 00:02:00 │ b             │
│  3 │ 2023-09-29 00:03:00 │ c             │
└────┴─────────────────────┴───────────────┘
┌─id─┬────timestamp_column─┬─string_column─┐
│  4 │ 2023-09-29 00:04:00 │ x             │
│  5 │ 2023-09-29 00:05:00 │ y             │
│  6 │ 2023-09-29 00:06:00 │ z             │
└────┴─────────────────────┴───────────────┘

6 rows in set. Elapsed: 0.275 sec. 

For more info:
https://clickhouse.com/docs/en/sql-reference/table-functions/remote
https://clickhouse.com/docs/en/engines/table-engines/special/merge

Tips and tricks on optimizing basic data types in ClickHouse

July 2, 2024 · 4 min read

Question

What data types should I use in ClickHouse to optimize my queries for speed and storage?

Answer

Many times when using an automated conversion from another system or trying to choose a data type, users will often choose the "more is better" or "choose what's easier" or "choose the most generic" approaches. This will likely work for small datasets in the millions, maybe even billions of rows. It may not be noticeable and is acceptable for those type of sets where users' queries difference is small in their use-cases.

It will not be acceptable, however, as the data grows and becomes more noticeable.

The difference between a query taking 50ms and 500ms may be okay for most use-cases, for example in a webUI, but one is 10x slower than the other, even though for a front-end user, is not very noticeable.

Example initial table:

timestamp Datetime64(9),
group_id Int64,
vendor_id String,
product_id String,
category1 Int64,
code_name String,
paid_status String,
country_code String,
description String,
price Float64,
attributes Map(String, String)

Sample data:

3456, 0123456789, bd6087b7-6026-4974-9122-bc99faae5d84, "2024-03-01 01:00:01.000", 98, "bear", paid", "us", "corvette model car", 123.45, {"color" : "blue", "size" : "S"}
156, 0000012345, bd6087b7-6026-4974-9122-bc99faae5d84, "2024-03-01 01:00:02:123", 45, "tiger", "not paid", "uk", "electric car", 53432.10, {"color" : "red", "model" : "X"} 
...

Below are some recommendations where this data could be optimized:

timestamp : DateTime64(9)
Unless needing scientific precision, a value of 9 precision (nanoseconds), is unlikely necessary. Possibly could be for display or ordering, but usually not in queries for searching, primary keys, etc.

Recommendation:
For PK, order by: DateTime
For display or ordering: add additional column - i.e. timestamp_microseconds : DateTime64(6)

group_id : Int64
This appears to be an integer, select the smallest integer type that will fit the max number for the column required. From this sample dataset and name of the column, it is unlikely to need a quintillion values, probably an Int16 would work where it could have up to 16k values.

Recommendation: Int16

vendor_id : String
This column looks like to be a number but has leading zeros, likely important to keep formatting. Also appears to be only a certain number of chars.

Recommendation: FixedString(10)

product_id : String
This one is alphanumeric so intuitively would be a string, however, it is also a UUID.

Recommendation: UUID

category1 : Int64
The values are small, probably not very many categories and not looking to grow very much or limited. Less than 255

Recommendation: UInt8

code_name : String
This field looks like it may have only a limited number of strings that would be used. For this kind of situation where the number of string values might be in the hundreds or thousands, low cardinality fields help.

Recommendation: LowCardinality(String)

paid_status : String
There is a string value of "paid" or "not_paid". For situations where there may be only two values, use a boolean.

Recommendation: Bool

country_code : String
Sometimes there are columns which meet multiple optimizations. In this example, there are only a certain number of country codes and they are all two character identifiers.

Recommendation: LowCardinality(FixedString(2))

price : Float64
Floats are not recommended when there is a fixed precision that is known, especially for financial data and calculations. Best is to use Decimal types for the precision necessary. For this use-case, it is likely that the price of the item is not over 999,999.00

Recommendation: Decimal(10,2)

attributes : map
Often there might be a table with dynamic attributes in maps. Searching for keys or values is usually slower. There’s a couple of ways that the maps can be made faster. If there are keys that will be present in most records, best to place those in a separate column as low cardinality and those that will be sparse in another column high cardinality. From there, it will be more efficient to create skip indexes although it may increase the complexity of the queries.

Recommendation: lc_attributes: Map(String, String), hc_attributes: Map(String, String).

Depending on the queries, the options that can also be used to create a skip index and/or extract the attributes are:
Using Array Join to extract into columns using a Materialized View: https://clickhouse.com/docs/knowledgebase/using-array-join-to-extract-and-query-attributes
Using a skip index for the keys: https://clickhouse.com/docs/knowledgebase/improve-map-performance

How to use array join to extract and query varying attributes using map keys and values

June 21, 2024 · 4 min read

Question

If I have varying attributes in a column using map types, how can I extract them and use them in queries?

Answer

This is a basic example of extracting keys and values from a variable attributes field. This method will create seemingly duplicates from each row in the source/raw table. Due to the keys and values being extracted, however, they can be put into the Primary Key or a secondary with an index, such as a bloom filter.

In this example, we basically have a source that creates a metrics table, it has multiple attributes that can apply in an attributes field that has maps. If there are attributes that will always be present for records, it is better to pull those out into their own columns and populate.

You should be able to just copy and paste to see what the outputs would be and what the materialized view does in this instance.

Create a sample database:

create database db1;

Create the initial table that will have the rows and attributes:

create table db1.table1_metric_map
(
  id UInt32,
  timestamp DateTime,
  metric_name String,
  metric_value Int32,
  attributes Map(String, String)
)
engine = MergeTree()
order by timestamp;

Insert sample rows into the table. The sample size is intentionally small so that when the materialized view is created, you can see how the rows are multiplied for each attribute.

insert into db1.table1_metric_map
VALUES
(1, '2023-09-20 00:01:00', 'ABC', 10, {'env':'prod','app':'app1','server':'server1'}),
(2, '2023-09-20 00:01:00', 'ABC', 20,{'env':'prod','app':'app2','server':'server1','dc':'dc1'}),
(3, '2023-09-20 00:01:00', 'ABC', 30,{'env':'qa','app':'app1','server':'server1'}),
(4, '2023-09-20 00:01:00', 'ABC', 40,{'env':'qa','app':'app2','server':'server1','dc':'dc1'}),
(5, '2023-09-20 00:01:00', 'DEF', 50,{'env':'prod','app':'app1','server':'server2'}),
(6, '2023-09-20 00:01:00', 'DEF', 60, {'env':'prod','app':'app2','server':'server1'}),
(7, '2023-09-20 00:01:00', 'DEF', 70,{'env':'qa','app':'app1','server':'server1'}),
(8, '2023-09-20 00:01:00', 'DEF', 80,{'env':'qa','app':'app2','server':'server1'}),
(9, '2023-09-20 00:02:00', 'ABC', 90,{'env':'prod','app':'app1','server':'server1'}),
(10, '2023-09-20 00:02:00', 'ABC', 100,{'env':'prod','app':'app1','server':'server2'}),
(11, '2023-09-20 00:02:00', 'ABC', 110,{'env':'qa','app':'app1','server':'server1'}),
(12, '2023-09-20 00:02:00', 'ABC', 120,{'env':'qa','app':'app1','server':'server1'}),
(13, '2023-09-20 00:02:00', 'DEF', 130,{'env':'prod','app':'app1','server':'server1'}),
(14, '2023-09-20 00:02:00', 'DEF', 140,{'env':'prod','app':'app2','server':'server1','dc':'dc1'}),
(15, '2023-09-20 00:02:00', 'DEF', 150,{'env':'qa','app':'app1','server':'server2'}),
(16, '2023-09-20 00:02:00', 'DEF', 160,{'env':'qa','app':'app1','server':'server1','dc':'dc1'}),
(17, '2023-09-20 00:03:00', 'ABC', 170,{'env':'prod','app':'app1','server':'server1'}),
(18, '2023-09-20 00:03:00', 'ABC', 180,{'env':'prod','app':'app1','server':'server1'}),
(19, '2023-09-20 00:03:00', 'ABC', 190,{'env':'qa','app':'app1','server':'server1'}),
(20, '2023-09-20 00:03:00', 'ABC', 200,{'env':'qa','app':'app1','server':'server2'}),
(21, '2023-09-20 00:03:00', 'DEF', 210,{'env':'prod','app':'app1','server':'server1'}),
(22, '2023-09-20 00:03:00', 'DEF', 220,{'env':'prod','app':'app1','server':'server1'}),
(23, '2023-09-20 00:03:00', 'DEF', 230,{'env':'qa','app':'app1','server':'server1'}),
(24, '2023-09-20 00:03:00', 'DEF', 240,{'env':'qa','app':'app1','server':'server1'});

We can then create a materialized view with array join so that it can extract the map attributes onto keys and values columns. For demonstration, in the example below, it uses an implicit table (with the POPULATE command, and backing table like .inner.{uuid}... ). The recommended best practice, however, is to use an explicit table where you wouldd define the table first, then create a materialized view on top with the TO command instead.

CREATE MATERIALIZED VIEW db1.table1_metric_map_mv
ORDER BY id
POPULATE AS
select 
  *, 
  attributes.keys as attribute_keys, 
  attributes.values as attribute_values
from db1.table1_metric_map
array join attributes
where notEmpty(attributes.keys);

The new table will have more rows and will have the keys extracted, like this:

SELECT *
FROM db1.table1_metric_map_mv
LIMIT 5

Query id: b7384381-53af-4e3e-bc54-871f61c033a6

┌─id─┬───────────timestamp─┬─metric_name─┬─metric_value─┬─attributes───────────┬─attribute_keys─┬─attribute_values─┐
│  1 │ 2023-09-20 00:01:00 │ ABC         │           10 │ ('env','prod')       │ env            │ prod             │
│  1 │ 2023-09-20 00:01:00 │ ABC         │           10 │ ('app','app1')       │ app            │ app1             │
│  1 │ 2023-09-20 00:01:00 │ ABC         │           10 │ ('server','server1') │ server         │ server1          │
│  2 │ 2023-09-20 00:01:00 │ ABC         │           20 │ ('env','prod')       │ env            │ prod             │
│  2 │ 2023-09-20 00:01:00 │ ABC         │           20 │ ('app','app2')       │ app            │ app2             │
└────┴─────────────────────┴─────────────┴──────────────┴──────────────────────┴────────────────┴──────────────────┘

From here, in order to query for your rows that need certain attributes, you would do something like this:

SELECT
    t1_app.id AS id,
    timestamp,
    metric_name,
    metric_value
FROM
(
    SELECT *
    FROM db1.table1_metric_map_mv
    WHERE (attribute_keys = 'app') AND (attribute_values = 'app1') AND (metric_name = 'ABC')
) AS t1_app
INNER JOIN
(
    SELECT *
    FROM db1.table1_metric_map_mv
    WHERE (attribute_keys = 'server') AND (attribute_values = 'server1')
) AS t2_server ON t1_app.id = t2_server.id

Query id: 72ce7f19-b02a-4b6e-81e7-a955f257436d

┌─id─┬───────────timestamp─┬─metric_name─┬─metric_value─┐
│  1 │ 2023-09-20 00:01:00 │ ABC         │           10 │
│  3 │ 2023-09-20 00:01:00 │ ABC         │           30 │
│  9 │ 2023-09-20 00:02:00 │ ABC         │           90 │
│ 11 │ 2023-09-20 00:02:00 │ ABC         │          110 │
│ 12 │ 2023-09-20 00:02:00 │ ABC         │          120 │
│ 17 │ 2023-09-20 00:03:00 │ ABC         │          170 │
│ 18 │ 2023-09-20 00:03:00 │ ABC         │          180 │
│ 19 │ 2023-09-20 00:03:00 │ ABC         │          190 │
└────┴─────────────────────┴─────────────┴──────────────┘

How to set up ClickHouse on Docker with ODBC to connect to a Microsoft SQL Server (MSSQL) database

May 29, 2024 · 3 min read

Question

How do I set up ClickHouse with a Docker image to connect to Microsoft SQL Server?

Answer

Notes on this example

Uses the ClickHouse Docker Ubuntu image
Uses the FreeTDS Driver
Uses MSSQL Server 2012R2
Windows hostname for this example is MARSDB2.marsnet2.local at IP: 192.168.1.133 (update with your hostname and/or IP)
MSSQL Instance name MARSDB2
MSSQL Login and datbase users are sql_user

Example setup in MSSQL for testing

Database and table created in MSSQL:

Screenshot 2024-01-01 at 8 25 50 PM

MSSQL Login User, sql_user:

Screenshot 2024-01-01 at 8 27 11 PM

Database membership roles for sql_user:

Screenshot 2024-01-01 at 8 27 35 PM

Database User with Login:

Screenshot 2024-01-01 at 8 35 34 PM

Configuring ClickHouse with ODBC

Create a working directory:

mkdir ch-odbc-mssql
cd ch-odbc-mssql

Create an odbc.ini file:

vim odbc.ini

Add the following entries to update the name of the DSN and IP:

[marsdb2_mssql]
Driver = FreeTDS
Server = 192.168.1.133

Create an odbcinst.ini file:

vim odbcinst.ini

Add the following entries (trace is optional but helps with debugging):

[ODBC]
Trace = Yes
TraceFile = /tmp/odbc.log

[FreeTDS]
Description = FreeTDS
Driver = /usr/lib/aarch64-linux-gnu/odbc/libtdsodbc.so
Setup = /usr/lib/x86_64-linux-gnu/odbc/libtdsS.so
UsageCount = 1

Configure a Dockerfile to download the image and add the TDS and required ODBC libraries

Create the Dockerfile:

vim Dockerfile

Add the contents of the Dockerfile:

FROM clickhouse/clickhouse-server:23.10

# Install the ODBC driver

RUN apt-get update && apt-get install -y --no-install-recommends unixodbc \
    && apt-get install -y freetds-bin freetds-common freetds-dev libct4 libsybdb5 \
    && apt-get install tdsodbc

Build the new docker image:

docker build . -t marsnet/clickhouse-odbc:23.10

Create a docker-compose.yml file:

vim docker-compose.yml

Add the following contents to the YAML:

version: '3.7'
services:
  clickhouse:
    image: marsnet/clickhouse-odbc:23.10
    container_name: clickhouse-odbc
    hostname: clickhouse-host
    ports:
      - "9000:9000"
      - "8123:8123"
      - "9009:9009"
    volumes:
      - ./odbc.ini:/etc/odbc.ini
      - ./odbcinst.ini:/etc/odbcinst.ini
    restart: always
    ulimits:
      memlock:
        soft: -1
        hard: -1
      nofile:
        soft: 262144
        hard: 262144
    deploy:
      resources:
        limits:
          memory: 4g

Start the container:

docker compose up --detach

After you start the container, you should see something like this:

ch-odbc-mssql % docker compose up --detach
[+] Running 1/1
 ✔ Container clickhouse-odbc  Started

Check to ensure the container is running:

ch-odbc-mssql % docker ps
CONTAINER ID   IMAGE                           COMMAND            CREATED          STATUS              PORTS                                                                    NAMES
87a400b803ce   marsnet/clickhouse-odbc:23.10   "/entrypoint.sh"   57 minutes ago   Up About a minute   0.0.0.0:8123->8123/tcp, 0.0.0.0:9000->9000/tcp, 0.0.0.0:9009->9009/tcp   clickhouse-odbc

Test ODBC connection

./clickhouse client

Test the SELECT using the odbc table function to the remote MSSQL Database table:

clickhouse-host :) SELECT * from odbc('DSN=marsdb2_mssql;port=1433;Uid=sql_user;Pwd=ClickHouse123;Database=db1', 'table1');

SELECT *
FROM odbc('DSN=marsdb2_mssql;port=1433;Uid=sql_user;Pwd=ClickHouse123;Database=db1', 'table1')

Query id: 23494da2-6e12-4ade-95fa-372a0420cac1

┌─id─┬─column1─┐
│  1 │ abc     │
│  2 │ def     │
│  3 │ ghi     │
└────┴─────────┘

3 rows in set. Elapsed: 0.188 sec. 

You can also create a remote table using the odbc table engine:

CREATE TABLE table1_odbc_mssql
(
    `id` Int32,
    `column1` String
)
ENGINE = ODBC('DSN=marsdb2_mssql;port=1433;Uid=sql_user;Pwd=ClickHouse123;Database=db1', 'dbo', 'table1')

Use a SELECT query to test the new remote table:

clickhouse-host :) select * from table1_odbc_mssql;

SELECT *
FROM table1_odbc_mssql

Query id: 94724368-485d-4364-ae58-a435a225c37d

┌─id─┬─column1─┐
│  1 │ abc     │
│  2 │ def     │
│  3 │ ghi     │
└────┴─────────┘

3 rows in set. Elapsed: 0.218 sec. 

For more information, please see:

Simple example flow for extracting JSON data using a landing table with a Materialized View

May 20, 2024 · 2 min read

Question

How do I work with JSON message using a source or landing table to extract with a Materialized View?
How do I work with JSON without the experimental JSON Object?

Answer

A common pattern to work with JSON data is to send the data to a landing table and use JSONExtract functions to pull the data onto a new table using a Materialized View trigger. This is normally done in the following flow and pattern:

source data --> MergeTree table --> Materialized View (with base table) --> application/client

The landing table should have a raw string field where you would store the raw json. It should also have one to two other fields that can be used for management of that table so that it could be partitioned and trimmed as the data ages.

*some integrations can add fields to the original data for example if using the ClickHouse Kafka Connector Sink.

Simplified example below:

create the example database

create database db1;

create a landing table where your raw json will be inserted:

create table db1.table2_json_raw
(
    id Int32,
    timestamp DateTime,
    raw String
)
engine = MergeTree()
order by timestamp;

create the base table for the materialized view

create table db1.table2_json_mv_base
(
 id Int32,
 timestamp DateTime,
 raw_string String,
 custId Int8,
 custName String
)
engine = MergeTree()
order by timestamp;

create the materialized view to the base table

create materialized view db1.table2_json_mv to db1.table2_json_mv_base
AS SELECT
 id,
 timestamp,
 raw as raw_string,
 simpleJSONExtractRaw(raw, 'customerId') as custId,
 simpleJSONExtractRaw(raw, 'customerName') as custName
 FROM
db1.table2_json_raw;

insert some sample rows

 insert into db1.table2_json_raw
 values
 (1, '2024-05-16 00:00:00', '{"customerId":1, "customerName":"ABC"}'),
 (2, '2024-05-16 00:00:01', '{"customerId":2, "customerName":"XYZ"}');

view the results from the extraction and the materialized view that would be used in the queries

clickhouse-cloud :) select * from db1.table2_json_mv;

SELECT *
FROM db1.table2_json_mv

Query id: 12655fd3-567a-4dfb-9ef7-abc4b11ad044

┌─id─┬───────────timestamp─┬─raw_string─────────────────────────────┬─custId─┬─custName─┐
│  1 │ 2024-05-16 00:00:00 │ {"customerId":1, "customerName":"ABC"} │ 1      │ "ABC"    │
│  2 │ 2024-05-16 00:00:01 │ {"customerId":2, "customerName":"XYZ"} │ 2      │ "XYZ"    │
└────┴─────────────────────┴────────────────────────────────────────┴────────┴──────────┘

Additional Reference links:
Materialized Views: https://clickhouse.com/docs/en/guides/developer/cascading-materialized-views
Working with JSON: https://clickhouse.com/docs/en/integrations/data-formats/json#other-approaches
JSON functions: https://clickhouse.com/docs/en/sql-reference/functions/json-functions

Steps​

How can I use API to manage clusters on ClickHouse Cloud?​

Answer​

Question​

Answer​

Mapping table for metrics in system.dashboards to Prometheus metrics in system.custom_metrics​

Question​

Answer​

Question​

Answer​

Question​

Answer​

Question​

Answer​

Question​

Answer​

Example setup in MSSQL for testing​

Configuring ClickHouse with ODBC​

Configure a Dockerfile to download the image and add the TDS and required ODBC libraries​

Test ODBC connection​

Question​

Answer​

Steps

How can I use API to manage clusters on ClickHouse Cloud?

Answer

Question

Answer

Mapping table for metrics in system.dashboards to Prometheus metrics in system.custom_metrics

Question

Answer

Question

Answer

Question

Answer

Question

Answer

Question

Answer

Example setup in MSSQL for testing

Configuring ClickHouse with ODBC

Configure a Dockerfile to download the image and add the TDS and required ODBC libraries

Test ODBC connection

Question

Answer