Sequences v5

Many applications require that unique surrogate ids be assigned to database entries. Often the database SEQUENCE object is used to produce these. In PostgreSQL, these can be either:

  • A manually created sequence using the CREATE SEQUENCE command and retrieved by calling the nextval() function
  • serial and bigserial columns or, alternatively, GENERATED BY DEFAULT AS IDENTITY columns

However, standard sequences in PostgreSQL aren't multi-node aware and produce values that are unique only on the local node. This is important because unique ids generated by such sequences cause conflict and data loss by means of discarded INSERT actions in multi-master replication.

Permissions Required

PGD global sequences are on by default. This means that for any user to use sequences, the user must have at least the bdr_application role assigned to them.

PGD global sequences

For this reason, PGD provides an application-transparent way to generate unique ids using sequences on bigint or bigserial datatypes across the whole PGD group, called global sequences.

PGD global sequences provide an easy way for applications to use the database to generate unique synthetic keys in an asynchronous distributed system that works for mostbut not necessarily allcases.

Using PGD global sequences allows you to avoid the problems with insert conflicts. If you define a PRIMARY KEY or UNIQUE constraint on a column that's using a global sequence, no node can ever get the same value as any other node. When PGD synchronizes inserts between the nodes, they can never conflict.

PGD global sequences extend PostgreSQL sequences, so they are crash-safe. To use them, you must be granted the bdr_application role.

There are various possible algorithms for global sequences:

  • SnowflakeId sequences
  • Globally allocated range sequences

SnowflakeId sequences generate values using an algorithm that doesn't require inter-node communication at any point. It's faster and more robust and has the useful property of recording the timestamp when the values were created.

SnowflakeId sequences have the restriction that they work only for 64-bit BIGINT datatypes and produce values up to 19 digits long, which might be too long for use in some host language datatypes, such as Javascript Integer types. Globally allocated sequences allocate a local range of values that can be replenished as needed by inter-node consensus, making them suitable for either BIGINT or INTEGER sequences.

You can create a global sequence using the bdr.alter_sequence_set_kind() function. This function takes a standard PostgreSQL sequence and marks it as a PGD global sequence. It can also convert the sequence back to the standard PostgreSQL sequence.

PGD also provides the configuration variable bdr.default_sequence_kind, which determines the kind of sequence to create when the CREATE SEQUENCE command is executed or when a serial, bigserial, or GENERATED BY DEFAULT AS IDENTITY column is created. Valid settings are:

  • local, meaning that newly created sequences are the standard PostgreSQL (local) sequences.
  • galloc, which always creates globally allocated range sequences.
  • snowflakeid, which creates global sequences for BIGINT sequences that consist of time, nodeid, and counter components. You can't use it with INTEGER sequences (so you can use it for bigserial but not for serial).
  • timeshard, which is the older version of SnowflakeId sequence and is provided for backward compatibility only. The SnowflakeId is preferred.
  • distributed (the default), which is a special value that you can use only for bdr.default_sequence_kind. It selects snowflakeid for int8 sequences (that is, bigserial) and galloc sequence for int4 (that is, serial) and int2 sequences.

The bdr.sequences view shows information about individual sequence kinds.

currval() and lastval() work correctly for all types of global sequence.

SnowflakeId sequences

The ids generated by SnowflakeId sequences are loosely time ordered so you can use them to get the approximate order of data insertion, like standard PostgreSQL sequences. Values generated within the same millisecond might be out of order, even on one node. The property of loose time ordering means they're suitable for use as range partition keys.

SnowflakeId sequences work on one or more nodes and don't require any inter-node communication after the node-join process completes. So you can continue to use them even if there's the risk of extended network partitions. They aren't affected by replication lag or inter-node latency.

SnowflakeId sequences generate unique ids in a different way from standard sequences. The algorithm uses three components for a sequence number. The first component of the sequence is a timestamp at the time of sequence number generation. The second component of the sequence number is the unique id assigned to each PGD node, which ensures that the ids from different nodes are always different. The third component is the number generated by the local sequence.

While adding a unique node id to the sequence number is enough to ensure there are no conflicts, you also want to keep another useful property of sequences. The ordering of the sequence numbers roughly corresponds to the order in which data was inserted into the table. Putting the timestamp first ensures this.

A few limitations and caveats apply to SnowflakeId sequences.

SnowflakeId sequences are 64 bits wide and need a bigint or bigserial. Values generated are up to 19 digits long. There's no practical 32-bit integer version, so you can't use it with serial sequences. Use globally allocated range sequences instead.

For SnowflakeId, there's a limit of 4096 sequence values generated per millisecond on any given node (about 4 million sequence values per second). In case the sequence value generation wraps around within a given millisecond, the SnowflakeId sequence waits until the next millisecond and gets a fresh value for that millisecond.

Since SnowflakeId sequences encode timestamps into the sequence value, you can generate new sequence values only within the given time frame (depending on the system clock). The oldest timestamp that you can use is 2016-10-07, which is the epoch time for the SnowflakeId. The values wrap to negative values in the year 2086 and completely run out of numbers by 2156.

Since timestamp is an important part of a SnowflakeId sequence, there's additional protection from generating sequences with a timestamp older than the latest one used in the lifetime of a Postgres process (but not between Postgres restarts).

The INCREMENT option on a sequence used as input for SnowflakeId sequences is effectively ignored. This might be relevant for applications that do sequence ID caching, like many object-relational mapper (ORM) tools, notably Hibernate. Because the sequence is time based, this has little practical effect since the sequence advances to a new noncolliding value by the time the application can do anything with the cached values.

Similarly, you might change the START, MINVALUE, MAXVALUE, and CACHE settings on the underlying sequence, but there's no benefit to doing so. The sequence's low 14 bits are used and the rest is discarded, so the value-range limits don't affect the function's result. For the same reason, setval() isn't useful for SnowflakeId sequences.

Timeshard sequences

Timeshard sequences are provided for backward compatibility with existing installations but aren't recommended for new application use. We recommend using the SnowflakeId sequence instead.

Timeshard is very similar to SnowflakeId but has different limits and fewer protections and slower performance.

The differences between timeshard and SnowflakeId are as following:

  • Timeshard can generate up to 16384 per millisecond (about 16 million per second), which is more than SnowflakeId. However, there's no protection against wraparound within a given millisecond. Schemas using the timeshard sequence must protect the use of the UNIQUE constraint when using timeshard values for given column.
  • The timestamp component of timeshard sequence runs out of values in the year 2050 and, if used in combination with bigint, the values wrap to negative numbers in the year 2033. This means that sequences generated after 2033 have negative values. This is a considerably shorter time span than SnowflakeId and is the main reason why SnowflakeId is preferred.
  • Timeshard sequences require occasional disk writes (similar to standard local sequences). SnowflakeIds are calculated in memory so the SnowflakeId sequences are in general a little faster than timeshard sequences.

Globally allocated range sequences

The globally allocated range (or galloc) sequences allocate ranges (chunks) of values to each node. When the local range is used up, a new range is allocated globally by consensus amongst the other nodes. This uses the key space efficiently but requires that the local node be connected to a majority of the nodes in the cluster for the sequence generator to progress when the currently assigned local range is used up.

Unlike SnowflakeId sequences, galloc sequences support all sequence data types provided by PostgreSQL: smallint, integer, and bigint. This means that you can use galloc sequences in environments where 64-bit sequences are problematic. Examples include using integers in javascript, since that supports only 53-bit values, or when the sequence is displayed on output with limited space.

The range assigned by each voting is currently predetermined based on the datatype the sequence is using:

  • smallint 1 000 numbers
  • integer 1 000 000 numbers
  • bigint 1 000 000 000 numbers

Each node allocates two chunks of seq_chunk_size, one for the current use plus a reserved chunk for future usage, so the values generated from any one node increase monotonically. However, viewed globally, the values generated aren't ordered at all. This might cause a loss of performance due to the effects on b-tree indexes and typically means that generated values aren't useful as range-partition keys.

The main downside of the galloc sequences is that, once the assigned range is used up, the sequence generator has to ask for consensus about the next range for the local node that requires inter-node communication. This might lead to delays or operational issues if the majority of the PGD group isn't accessible. (This might be avoided in later releases.)

The CACHE, START, MINVALUE, and MAXVALUE options work correctly with galloc sequences. However, you need to set them before transforming the sequence to the galloc kind. The INCREMENT BY option also works correctly. However, you can't assign an increment value that's equal to or more than the above ranges assigned for each sequence datatype. setval() doesn't reset the global state for galloc sequences; don't use it.

A few limitations apply to galloc sequences. PGD tracks galloc sequences in a special PGD catalog bdr.sequence_alloc. This catalog is required to track the currently allocated chunks for the galloc sequences. The sequence name and namespace is stored in this catalog. Since the sequence chunk allocation is managed by Raft, whereas any changes to the sequence name/namespace is managed by the replication stream, PGD currently doesn't support renaming galloc sequences or moving them to another namespace or renaming the namespace that contains a galloc sequence. Be mindful of this limitation while designing application schema.

Converting a local sequence to a galloc sequence

Before transforming a local sequence to galloc, you need to take care of several prerequisites.

1. Verify that sequence and column data type match

Check that the sequence's data type matches the data type of the column with which it will be used. For example, you can create a bigint sequence and assign an integer column's default to the nextval() returned by that sequence. With galloc sequences, which for bigint are allocated in blocks of 1 000 000 000, this quickly results in the values returned by nextval() exceeding the int4 range if more than two nodes are in use.

The following example shows what can happen:

CREATE SEQUENCE int8_seq;

SELECT sequencename, data_type FROM pg_sequences;
 sequencename | data_type
--------------+-----------
 int8_seq     | bigint
(1 row)

CREATE TABLE seqtest (id INT NOT NULL PRIMARY KEY);

ALTER SEQUENCE int8_seq OWNED BY seqtest.id;

SELECT bdr.alter_sequence_set_kind('public.int8_seq'::regclass, 'galloc', 1);
 alter_sequence_set_kind
-------------------------

(1 row)

ALTER TABLE seqtest ALTER COLUMN id SET DEFAULT nextval('int8_seq'::regclass);

After executing INSERT INTO seqtest VALUES(DEFAULT) on two nodes, the table contains the following values:

SELECT * FROM seqtest;
     id
------------
          2
 2000000002
(2 rows)

However, attempting the same operation on a third node fails with an integer out of range error, as the sequence generated the value 4000000002.

Tip

You can retrieve the current data type of a sequence from the PostgreSQL pg_sequences view. You can modify the data type of a sequence with ALTER SEQUENCE ... AS ..., for example, ALTER SEQUENCE public.sequence AS integer, as long as its current value doesn't exceed the maximum value of the new data type.

2. Set a new start value for the sequence

When the sequence kind is altered to galloc, it's rewritten and restarts from the defined start value of the local sequence. If this happens on an existing sequence in a production database, you need to query the current value and then set the start value appropriately. To help with this use case, PGD allows users to pass a starting value with the function bdr.alter_sequence_set_kind(). If you're already using offset and you have writes from multiple nodes, you need to check what's the greatest used value and restart the sequence to at least the next value.

-- determine highest sequence value across all nodes
SELECT max((x->'response'->'command_tuples'->0->>'nextval')::bigint)
    FROM json_array_elements(
        bdr.run_on_all_nodes(
            E'SELECT nextval(\'public.sequence\');'
            )::jsonb AS x;

-- turn into a galloc sequence
SELECT bdr.alter_sequence_set_kind('public.sequence'::regclass, 'galloc', $MAX + $MARGIN);

Since users can't lock a sequence, you must leave a $MARGIN value to allow operations to continue while the max() value is queried.

The bdr.sequence_alloc table gives information on the chunk size and the ranges allocated around the whole cluster.

In this example, the sequence starts at 333, and the cluster has two nodes. The number of allocation is 4, which is 2 per node, and the chunk size is 1000000 that's related to an integer sequence.

SELECT * FROM bdr.sequence_alloc
    WHERE seqid = 'public.categories_category_seq'::regclass;
          seqid          | seq_chunk_size | seq_allocated_up_to | seq_nallocs |       seq_last_alloc
-------------------------+----------------+---------------------+-------------+-----------------------------
 categories_category_seq |        1000000 |             4000333 |           4 | 2020-05-21 20:02:15.957835+00
(1 row)

To see the ranges currently assigned to a given sequence on each node, use these queries:

  • Node Node1 is using range from 333 to 2000333.
SELECT last_value AS range_start, log_cnt AS range_end
    FROM categories_category_seq WHERE ctid = '(0,2)'; -- first range
 range_start | range_end
-------------+-----------
         334 |   1000333
(1 row)

SELECT last_value AS range_start, log_cnt AS range_end
    FROM categories_category_seq WHERE ctid = '(0,3)'; -- second range
 range_start | range_end
-------------+-----------
     1000334 |   2000333
(1 row)
  • Node Node2 is using range from 2000004 to 4000003.
SELECT last_value AS range_start, log_cnt AS range_end
    FROM categories_category_seq WHERE ctid = '(0,2)'; -- first range
 range_start | range_end
-------------+-----------
     2000334 |   3000333
(1 row)

SELECT last_value AS range_start, log_cnt AS range_end
    FROM categories_category_seq WHERE ctid = '(0,3)'; -- second range
 range_start | range_end
-------------+-----------
     3000334 |   4000333
NOTE

You can't combine it to a single query (like WHERE ctid IN ('(0,2)', '(0,3)')), as that still shows only the first range.

When a node finishes a chunk, it asks a consensus for a new one and gets the first available. In the example, it's from 4000334 to 5000333. This is the new reserved chunk and starts to consume the old reserved chunk.

UUIDs, KSUUIDs, and other approaches

You can generate globally unique ids in other ways without using the global sequences that can be used with PGD. For example:

  • UUIDs and their PGD variant, KSUUIDs
  • Local sequences with a different offset per node (i.e., manual)
  • An externally coordinated natural key

PGD applications can't use other methods safely: counter-table-based approaches relying on SELECT ... FOR UPDATE, UPDATE ... RETURNING ... or similar for sequence generation doesn't work correctly in PGD because PGD doesn't take row locks between nodes. The same values are generated on more than one node. For the same reason, the usual strategies for "gapless" sequence generation don't work with PGD. In most cases, the application coordinates generating sequences that must be gapless from some external source using two-phase commit. Or it generates them only on one node in the PGD group.

UUIDs

UUID keys instead avoid sequences entirely and use 128-bit universal unique identifiers. These are random or pseudorandom values that are so large that it's nearly impossible for the same value to be generated twice. There's no need for nodes to have continuous communication when using UUID keys.

In the unlikely event of a collision, conflict detection chooses the newer of the two inserted records to retain. Conflict logging, if enabled, records such an event. However, it's exceptionally unlikely to ever occur, since collisions become practically likely only after about 2^64 keys are generated.

The main downside of UUID keys is that they're somewhat inefficient in terms of space and the network. They consume more space not only as a primary key but also where referenced in foreign keys and when transmitted on the wire. Also, not all applications cope well with UUID keys.

KSUUIDs

PGD provides functions for working with a K-Sortable variant of UUID data, known as KSUUID, which generates values that can be stored using the PostgreSQL standard UUID data type. A KSUUID value is similar to UUIDv1 in that it stores both timestamp and random data, following the UUID standard. The difference is that KSUUID is K-Sortable, meaning that it's weakly sortable by timestamp. This makes it more useful as a database key, as it produces more compact btree indexes, which improves the effectiveness of search, and allows natural time-sorting of result data. Unlike UUIDv1, KSUUID values don't include the MAC of the computer on which they were generated, so there are no security concerns from using them.

We now recommend KSUUID v2 in all cases. You can directly sort values generated with regular comparison operators.

There are two versions of KSUUID in PGD: v1 and v2. The legacy KSUUID v1 is deprecated but is kept to support existing installations. Don't use it for new installations. The internal contents of v1 and v2 aren't compatible. As such, the functions to manipulate them also aren't compatible. The v2 of KSUUID also no longer stores the UUID version number.

See KSUUID v2 functions and KSUUID v1 functions in the PGD reference.

Step and offset sequences

In offset-step sequences, a normal PostgreSQL sequence is used on each node. Each sequence increments by the same amount and starts at differing offsets. For example, with step 1000, node1's sequence generates 1001, 2001, 3001, and so on. node2's sequence generates 1002, 2002, 3002, and so on. This scheme works well even if the nodes can't communicate for extended periods. However, the designer must specify a maximum number of nodes when establishing the schema, and it requires per-node configuration. Mistakes can easily lead to overlapping sequences.

It's relatively simple to configure this approach with PGD by creating the desired sequence on one node, like this:

CREATE TABLE some_table (
    generated_value bigint primary key
);

CREATE SEQUENCE some_seq INCREMENT 1000 OWNED BY some_table.generated_value;

ALTER TABLE some_table ALTER COLUMN generated_value SET DEFAULT nextval('some_seq');

Then, on each node calling setval(), give each node a different offset starting value, for example:

-- On node 1
SELECT setval('some_seq', 1);

-- On node 2
SELECT setval('some_seq', 2);

 -- ... etc

Be sure to allow a large enough INCREMENT to leave room for all the nodes you might ever want to add, since changing it in the future is difficult and disruptive.

If you use bigint values, there's no practical concern about key exhaustion, even if you use offsets of 10000 or more. It would take hundreds of years, with hundreds of machines, doing millions of inserts per second, to have any chance of approaching exhaustion.

PGD doesn't currently offer any automation for configuring the per-node offsets on such step/offset sequences.

Composite keys

A variant on step/offset sequences is to use a composite key composed of PRIMARY KEY (node_number, generated_value), where the node number is usually obtained from a function that returns a different number on each node. You can create such a function by temporarily disabling DDL replication and creating a constant SQL function. Alternatively, you can use a one-row table that isn't part of a replication set to store a different value in each node.

See also