跳至主要內容

Character Set and Collations

Klustron大约 3 分钟

Character Set and Collations

Klustron supports all character sets and collations supported by its storage nodes --- kunlun-mysql or kunlun-mariadb. Collation is an attribute of a character column/field or a character literal, comparisons of character values will always be done using methods of its collation, unless a collation is explicitly specified. Charsets and encoding specified in CREATE DATABASE statement is NOT effective in Klustron for user tables.

To get the full list of charsets and collations that can be used in Klustron, execute 'SHOW COLLATION' or SELECT * FROM pg_collation where collprovider='m'.

When creating a table, one could optionally specify a CHARSET and/or a COLLATE clause for each character column (i.e. varchar(N), char(N), text, longtext, mediumtext) or for all such columns that don't have a CHARSET/COLLATE clause at once at table level, in MySQL syntax, in ways explained below.

If CHARSET clause is specified along with a COLLATE clause for a column or a table, the CHARSET setting isn't effective, and only the COLLATE clause is effective; If CHARSET is specified but a COLLATE clause is not specified for a column or a table, it's equivalent to specifying the charset's default collation for the column or table; The complete list of supported charsets and their default collations is listed at the end of this document, identical to that of community MySQL-8.0.

Using COLLATE Clauses by Examples

create table books(
id serial primary key,
name varchar(256) COLLATE utf8mb4_0900_as_cs NOT NULL, -- explicit collation setting
isbn varchar(64) not null,  -- implicit collation setting, will use table level setting if specified, or session level settings (storage_table_collate) otherwise
authors varchar(256) COLLATE utf8_general_ci not null, -- explicit collation setting
pub_date date);

Then we can see the definition with 'SHOW CREATE TABLE' statement as below. Note that the 'isbn' column has a collation of utf8_general_ci, inherited from the session level storage_table_collate setting, as listed below.

mysql> show create table books;
+-----------+-----------------------------------------------------------------------------+
| Table | Create Table                                                                    |
+-----------+-----------------------------------------------------------------------------+
| books | CREATE TABLE books (
 id integer DEFAULT "nextval"('books_id_seq'::regclass) NOT NULL,
 name character varying(256) COLLATE "utf8mb4_0900_as_cs" NOT NULL,
 isbn character varying(64) COLLATE "utf8_general_ci" NOT NULL,
 authors character varying(256) COLLATE "utf8_general_ci" NOT NULL,
 pub_date date,
 CONSTRAINT books_pkey PRIMARY KEY (id)
) WITH (engine=innodb, shard='2') |
+-----------+-----------------------------------------------------------------------------+
1 row in set (0.01 sec)

While creating a table, the storage_table_charset and storage_table_collate config variables are used as default charset and collation for the table's character columns which has no table level or column level charset&collation clauses. And in this case, if storage_table_collate is set, it is used as collation for the columns with neither CHARSET nor COLLATE clauses; otherwise if storage_table_charset is set, its default collation is used for this purpose.

And the storage_set_names_charset and storage_set_names_collate settings are session variables in connections of current client session(in current computing node) to all storage nodes, set with 'SET NAMES charset COLLAT collation' statement. You seldom need to modify the two variables, if you do, modify them in config files and they will be effective only in sessions started afterwards.

mysql> select name, setting from pg_settings where name like 'storage%';
+---------------------------+------------------+
| name                      | setting          |
+---------------------------+------------------+
| storage_set_names_charset | utf8mb4          |
| storage_set_names_collate | utf8mb4_0900_bin |
| storage_table_charset     | utf8             |
| storage_table_collate     | utf8_general_ci  |
+---------------------------+------------------+
4 rows in set (0.12 sec)

In below example, when we set storage_table_collate before creating table 'publisher' whose columns 'name' and 'address' has no explicit collation setting but the 'publisher' table is given a collation, the two columns' collations will be assigned with the their table's collation. And at the same time, column 'postcode' has its explicitly specified collation 'latin1_general_ci', which is kept as its collation.

mysql> set storage_table_charset='utf8mb4';
Query OK, 0 rows affected (0.01 sec)

mysql> set storage_table_collate='utf8mb4_0900_as_cs';
Query OK, 0 rows affected (0.01 sec)


mysql> create table publisher(id serial primary key, name varchar(128), address varchar(256), postcode varchar(16) COLLATE latin1_general_ci) COLLATE utf8_general_ci;
Query OK, 0 rows affected (0.09 sec)

mysql> show create table publisher;
+-----------+-----------------------------------------------------------------------------+
| Table     | Create Table                                                                |
+-----------+-----------------------------------------------------------------------------+
| publisher | CREATE TABLE publisher (
 id integer DEFAULT "nextval"('publisher_id_seq'::regclass) NOT NULL,
 name character varying(128) COLLATE "utf8_general_ci",
 address character varying(256) COLLATE "utf8_general_ci",
 postcode character varying(16) COLLATE "latin1_general_ci",
 CONSTRAINT publisher_pkey PRIMARY KEY (id)
) WITH (engine=innodb, shard='2') |
+-----------+-----------------------------------------------------------------------------+
1 row in set (0.01 sec)

And if neither table nor column level charset/collation specified, and no valid value set to storage_table_charset or storage_table_collate, then when creating a table, its character columns will be given the default collation of the character type, currently being utf8mb4_0900_bin.

This query lists the default collation of each character type.

mysql> select typname, t2.collname from pg_type t1 join pg_collation t2  on t1.typcollation=t2.oid where typcategory='S' and typtype='b' and typarray!=0;
+---------+------------------+
| typname | collname         |
+---------+------------------+
| varchar | utf8mb4_0900_bin |
| bpchar  | utf8mb4_0900_bin |
| text    | utf8mb4_0900_bin |
+---------+------------------+

Supported Charsets and Their Default Collations

Below query lists the charsets and their default collations supported by Klustron.

mysql> select  CHARACTER_SET_NAME, COLLATION_NAME from information_schema.collations where  IS_DEFAULT = 'YES' ;
+--------------------+---------------------+
| CHARACTER_SET_NAME | COLLATION_NAME      |
+--------------------+---------------------+
| armscii8           | armscii8_general_ci |
| ascii              | ascii_general_ci    |
| big5               | big5_chinese_ci     |
| binary             | binary              |
| cp1250             | cp1250_general_ci   |
| cp1251             | cp1251_general_ci   |
| cp1256             | cp1256_general_ci   |
| cp1257             | cp1257_general_ci   |
| cp850              | cp850_general_ci    |
| cp852              | cp852_general_ci    |
| cp866              | cp866_general_ci    |
| cp932              | cp932_japanese_ci   |
| dec8               | dec8_swedish_ci     |
| eucjpms            | eucjpms_japanese_ci |
| euckr              | euckr_korean_ci     |
| gb18030            | gb18030_chinese_ci  |
| gb2312             | gb2312_chinese_ci   |
| gbk                | gbk_chinese_ci      |
| geostd8            | geostd8_general_ci  |
| greek              | greek_general_ci    |
| hebrew             | hebrew_general_ci   |
| hp8                | hp8_english_ci      |
| keybcs2            | keybcs2_general_ci  |
| koi8r              | koi8r_general_ci    |
| koi8u              | koi8u_general_ci    |
| latin1             | latin1_swedish_ci   |
| latin2             | latin2_general_ci   |
| latin5             | latin5_turkish_ci   |
| latin7             | latin7_general_ci   |
| macce              | macce_general_ci    |
| macroman           | macroman_general_ci |
| sjis               | sjis_japanese_ci    |
| swe7               | swe7_swedish_ci     |
| tis620             | tis620_thai_ci      |
| ucs2               | ucs2_general_ci     |
| ujis               | ujis_japanese_ci    |
| utf16              | utf16_general_ci    |
| utf16le            | utf16le_general_ci  |
| utf32              | utf32_general_ci    |
| utf8               | utf8_general_ci     |
| utf8mb3            | utf8mb3_general_ci  |
| utf8mb4            | utf8mb4_0900_ai_ci  |
+--------------------+---------------------+