Conflict-free replicated data types v5
Conflict-free replicated data types (CRDT) support merging values from concurrently modified rows instead of discarding one of the rows as traditional resolution does.
Permissions Required
PGD CRDT require usage access to CRDT types, therefore a user must have at least the bdr_application role assigned to them.
Each CRDT type is implemented as a separate PostgreSQL data type with an extra callback added to the bdr.crdt_handlers
catalog. The merge process happens inside the PGD writer on the apply side without any user
action needed.
CRDTs require the table to have column-level conflict resolution enabled, as described in CLCD.
The only action you need to take is to use a particular data type in CREATE/ALTER TABLE rather than standard built-in data types such as integer. For example, consider the following table with one regular integer counter and a single row:
Suppose you issue the following SQL on two nodes at same time:
After both updates are applied, you can see the resulting values using this query:
This code shows that you lost one of the increments due to the update_if_newer
conflict resolver. If you use the CRDT counter data type instead, the result looks like this:
Again issue the following SQL on two nodes at same time, and then wait for the changes to be applied:
This example shows that CRDTs correctly allow accumulator columns to work, even in the face of asynchronous concurrent updates that otherwise conflict.
The crdt_gcounter
type is an example of state-based CRDT types that work only with reflexive UPDATE SQL, such as x = x + 1
, as the example shows.
The bdr.crdt_raw_value
configuration option determines whether queries return the current value or the full internal state of the CRDT type. By default, only the current numeric value is returned. When set to true
, queries return representation of the full state. You can use the special hash operator
(#
) to request only the current numeric value without using the special operator (the default behavior). If the full state is dumped using bdr.crdt_raw_value = on
, then the value can reload only with bdr.crdt_raw_value = on
.
Note
The bdr.crdt_raw_value
applies formatting only of data returned to clients, that is, simple column references in the select list. Any column references in other parts of the query (such as WHERE
clause or even expressions in the select list) might still require use of the #
operator.
Another class of CRDT data types is referred to as delta CRDT types. These are a special subclass of operation-based CRDTs.
With delta CRDTs, any update to a value is compared to the previous value on the same node. Then a change is applied as a delta on all other nodes.
Suppose you issue the following SQL on two nodes at same time:
After both updates are applied, you can see the resulting values using this query:
With a regular integer
column, the result is 2
. But when you update the row with a delta CRDT counter, you start with the OLD row version, make a NEW row version, and send both to the remote node. There, compare them with the version found there (e.g., the LOCAL version). Standard CRDTs merge the NEW and the LOCAL version, while delta CRDTs compare the OLD and NEW versions and apply the delta
to the LOCAL version.
The CRDT types are installed as part of bdr
into the bdr
schema. For convenience, the basic operators (+
, #
and !
) and a number of common aggregate functions (min
, max
, sum
, and avg
) are created in pg_catalog
. Thus they are available without having to tweak search_path
.
An important question is how query planning and optimization works with these new data types. CRDT types are handled transparently. Both ANALYZE
and the optimizer work, so estimation and query planning works fine without having to do anything else.
State-based and operation-based CRDTs
Following the notation from [1], both operation-based and state-based CRDTs are implemented.
Operation-based CRDT types (CmCRDT)
The implementation of operation-based types is trivial because the operation isn't transferred explicitly but computed from the old and new row received from the remote node.
Currently, these operation-based CRDTs are implemented:
crdt_delta_counter
—bigint
counter (increments/decrements)crdt_delta_sum
—numeric
sum (increments/decrements)
These types leverage existing data types with a little bit of code to compute the delta. For example, crdt_delta_counter
is a domain on a bigint
.
This approach is possible only for types for which the method for computing the delta is known, but the result is simple and cheap (both in terms of space and CPU) and has a couple of added benefits. For example, it can leverage operators/syntax for the underlying data type.
The main disadvantage is that you can't reset this value reliably in an asynchronous and concurrent environment.
Note
Implementing more complicated operation-based types by creating custom data types is possible, storing the state and the last operation. (Every change is decoded and transferred, so multiple operations aren't needed). But at that point, the main benefits (simplicity, reuse of existing data types) are lost without gaining any advantage compared to state-based types (for example, still no capability to reset) except for the space requirements. (A per-node state isn't needed.)
State-based CRDT types (CvCRDT)
State-based types require a more complex internal state and so can't use the regular data types directly the way operation-based types do.
Currently, four state-based CRDTs are implemented:
crdt_gcounter
—bigint
counter (increment-only)crdt_gsum
—numeric
sum/counter (increment-only)crdt_pncounter
—bigint
counter (increments/decrements)crdt_pnsum
—numeric
sum/counter (increments/decrements)
The internal state typically includes per-node information, increasing the on-disk size but allowing added benefits. The need to implement custom data types implies more code (in/out functions and operators).
The advantage is the ability to reliably reset the values, a somewhat self-healing nature in the presence of lost changes (which doesn't happen in a cluster that operates properly), and the ability to receive changes from other than source nodes.
Consider, for example, that a value is modified on node A, and the change gets replicated to B but not C due to network issue between A and C. If B modifies the value and this change gets replicated to C, it includes even the original change from A. With operation-based CRDTs, node C doesn't receive the change until the A-C network connection starts working again.
The main disadvantages of CvCRDTs are higher costs in terms of disk space and CPU usage. A bit of information for each node is needed, including nodes that were already removed from the cluster. The complex nature of the state (serialized into varlena types) means increased CPU use.
Disk-space requirements
An important consideration is the overhead associated with CRDT types, particularly the on-disk size.
For operation-based types, this is trivial because the types are merely domains on top of other types. They have the same disk space requirements no matter how many nodes are there:
crdt_delta_counter
— Same asbigint
(8 bytes)crdt_delta_sum
— Same asnumeric
(variable, depending on precision and scale)
There's no dependency on the number of nodes because operation-based CRDT types don't store any per-node information.
For state-based types, the situation is more complicated. All the types are variable length (stored essentially as a bytea
column) and consist of a header and a certain amount of per-node information for each node that modified the value.
For the bigint
variants, formulas computing approximate size are:
crdt_gcounter
—32B (header) + N * 12B (per-node)
crdt_pncounter
-—48B (header) + N * 20B (per-node)
N
denotes the number of nodes that modified this value.
For the numeric
variants, there's no exact formula because both the header and per-node parts include numeric
variable-length values. To give you an idea of how many such values you need to keep:
crdt_gsum
- fixed:
20B (header) + N * 4B (per-node)
- variable:
(2 + N)
numeric
values
- fixed:
crdt_pnsum
- fixed:
20B (header) + N * 4B (per-node)
- variable:
(4 + 2 * N)
numeric
values
- fixed:
Note
It doesn't matter how many nodes are in the cluster if the values are never updated on multiple nodes. It also doesn't matter whether the updates were concurrent (causing a conflict).
In addition, it doesn't matter how many of those nodes were already removed from the cluster. There's no way to compact the state yet.
CRDT types versus conflicts handling
As tables can contain both CRDT and non-CRDT columns (most columns are expected to be non-CRDT), you need to do both the regular conflict resolution and CRDT merge.
The conflict resolution happens first and is responsible for deciding the tuple to keep (applytuple) and the one to discard. The merge phase happens next, merging data for CRDT columns from the discarded tuple into the applytuple.
Note
This handling makes CRDT types somewhat more expensive compared to plain conflict resolution because the merge needs to happen every time. This is the case even when the conflict resolution can use one of the fast paths (such as those modified in the current transaction).
CRDT types versus conflict reporting
By default, detected conflicts are individually reported. Without CRDT types, this makes sense because the conflict resolution essentially throws away half of the available information (local or remote row, depending on configuration). This presents a data loss.
CRDT types allow both parts of the information to be combined without throwing anything away, eliminating the data loss issue. This approach makes the conflict reporting unnecessary.
For this reason, conflict reporting is skipped when the conflict can be fully resolved by CRDT merge. Each column must meet at least one of these two conditions:
- The values in local and remote tuple are the same (NULL or equal).
- It uses a CRDT data type and so can be merged.
Note
Conflict reporting is also skipped when there are no CRDT columns but all values in local/remote tuples are equal.
Resetting CRDT values
Resetting CRDT values is possible but requires special handling. The asynchronous nature of the cluster means that different nodes might see the reset operation at different places in the change stream no matter how it's implemented. Different nodes might also initiate a reset concurrently, that is, before observing the reset from the other node.
In other words, to make the reset operation behave correctly, it needs to be commutative with respect to the regular operations. Many naive ways to reset a value that might work well on a single-node fail for this reason.
For example, the simplest approach to resetting a value might be:
With state-based CRDTs this doesn't work. It throws away the state for the other nodes but only locally. It's added back by merge functions on remote nodes, causing diverging values and eventually receiving it back due to changes on the other nodes.
With operation-based CRDTs, this might seem to work because the update is interpreted as a subtraction of -cnt
. But it works only in the absence of concurrent resets. Once two nodes attempt to do a reset at the same time, the delta is applied twice, getting a negative value (which isn't expected from a reset).
It might also seem that you can use DELETE + INSERT
as a reset, but this approach has a couple of weaknesses, too. If the row is reinserted with the same key, it's not guaranteed that all nodes see it at the same position in the stream of operations with respect to changes from other nodes. PGD specifically discourages reusing the same primary key value since it can lead to data anomalies in concurrent cases.
State-based CRDT types can reliably handle resets using a special !
operator like this:
"Reliably" means the values don't have the two issues of multiple concurrent resets and divergence.
Operation-based CRDT types can be reset reliably only using Eager Replication, since this avoids multiple concurrent resets. You can also use Eager Replication to set either kind of CRDT to a specific value.
Implemented CRDT data types
Currently, six CRDT data types are implemented:
- Grow-only counter and sum
- Positive-negative counter and sum
- Delta counter and sum
The counters and sums behave mostly the same, except that the counter types are integer based (bigint
), while the sum types are decimal-based (numeric
).
Additional CRDT types, described at [1], might be implemented later.
You can list the currently implemented CRDT data types with the following query:
grow-only counter (crdt_gcounter
)
Supports only increments with nonnegative values (
value + int
andcounter + bigint
operators).You can obtain the current value of the counter either using
#
operator or by casting it tobigint
.Isn't compatible with simple assignments like
counter = value
(which is common pattern when the new value is computed somewhere in the application).Allows simple reset of the counter using the
!
operator (counter = !counter
).You can inspect the internal state using
crdt_gcounter_to_text
.
grow-only sum (crdt_gsum
)
Supports only increments with nonnegative values (
sum + numeric
).You can obtain the current value of the sum either by using the
#
operator or by casting it tonumeric
.Isn't compatible with simple assignments like
sum = value
, which is the common pattern when the new value is computed somewhere in the application.Allows simple reset of the sum using the
!
operator (sum = !sum
).Can inspect internal state using
crdt_gsum_to_text
.
positive-negative counter (crdt_pncounter
)
Supports increments with both positive and negative values (through
counter + int
andcounter + bigint
operators).You can obtain the current value of the counter either by using the
#
operator or by casting tobigint
.Isn't compatible with simple assignments like
counter = value
, which is the common pattern when the new value is computed somewhere in the application.Allows simple reset of the counter using the
!
operator (counter = !counter
).You can inspect the internal state using
crdt_pncounter_to_text
.
positive-negative sum (crdt_pnsum
)
Supports increments with both positive and negative values through
sum + numeric
.You can obtain the current value of the sum either by using then
#
operator or by casting tonumeric
.Isn't compatible with simple assignments like
sum = value
, which is the common pattern when the new value is computed somewhere in the application.Allows simple reset of the sum using the
!
operator (sum = !sum
).You can inspect the internal state using
crdt_pnsum_to_text
.
delta counter (crdt_delta_counter
)
Is defined a
bigint
domain, so works exactly like abigint
column.Supports increments with both positive and negative values.
Is compatible with simple assignments like
counter = value
, which is common when the new value is computed somewhere in the application.There's no simple way to reset the value reliably.
delta sum (crdt_delta_sum
)
Is defined as a
numeric
domain so works exactly like anumeric
column.Supports increments with both positive and negative values.
Is compatible with simple assignments like
sum = value
, which is common when the new value is computed somewhere in the application.There's no simple way to reset the value reliably.
[1] https://en.wikipedia.org/wiki/Conflict-free_replicated_data_type