Cassandra offers operators the ability to configure compression on a per-table basis. Compression reduces the size of data on disk by compressing each SSTable in user-configurable chunks (sized by the chunk_length_in_kb option). Because Cassandra SSTables are immutable, the CPU cost of compressing is only paid when the SSTable is written - subsequent updates to data will land in different SSTables, so Cassandra will not need to decompress, overwrite, and recompress data when UPDATE commands are issued. On reads, Cassandra will locate the relevant compressed chunks on disk, decompress each full chunk, and then proceed with the remainder of the read path (merging data from disks and memtables, read repair, and so on).
Compression is configured on a per-table basis as an optional argument to CREATE TABLE or ALTER TABLE. By default, four options are relevant:

- class specifies the compression class - Cassandra provides four classes (LZ4Compressor, SnappyCompressor, DeflateCompressor and ZstdCompressor). The default is LZ4Compressor.
- chunk_length_in_kb specifies the number of kilobytes of data per compression chunk. The default is 64KB.
- crc_check_chance determines how likely Cassandra is to verify the checksum on each compression chunk during reads. The default is 1.0.
- compression_level is only applicable for ZstdCompressor and accepts values between -131072 and 22.

Users can set compression using the following syntax:
CREATE TABLE keyspace.table (id int PRIMARY KEY) WITH compression = {'class': 'LZ4Compressor'};
Or
ALTER TABLE keyspace.table WITH compression = {'class': 'SnappyCompressor', 'chunk_length_in_kb': 128, 'crc_check_chance': 0.5};
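As a further illustration, a table could opt into ZstdCompressor with an explicit compression level - a sketch reusing the placeholder keyspace.table name; the level of 3 (Zstd's usual default) is chosen here purely for illustration:

ALTER TABLE keyspace.table WITH compression = {'class': 'ZstdCompressor', 'compression_level': 3};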
Once enabled, compression can be disabled with ALTER TABLE by setting enabled to false:
ALTER TABLE keyspace.table WITH compression = {'enabled':'false'};
Operators should be aware, however, that changing compression is not immediate. Data is compressed when an SSTable is written, and because SSTables are immutable, a change issued via ALTER TABLE only affects SSTables written afterwards - the existing SSTables will not be modified until they are compacted. If an operator needs compression changes to take effect immediately, the operator can trigger an SSTable rewrite using nodetool scrub or nodetool upgradesstables -a, both of which will rebuild the SSTables on disk, re-compressing the data in the process.
Compression’s primary benefit is that it reduces the amount of data written to disk. Not only does the reduced size save in storage requirements, it often increases read and write throughput, as the CPU time spent compressing data is typically less than the time it would take to read or write the larger volume of uncompressed data from disk.
Compression is most useful in tables comprising many rows, where the rows are similar in nature. Tables containing similar text columns (such as repeated JSON blobs) often compress very well.
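As a brief sketch of such a schema (the keyspace, table, and column names here are illustrative, not from this document):

CREATE TABLE keyspace.events (
    id int PRIMARY KEY,
    payload text  -- repeated, structurally similar JSON blobs compress well
) WITH compression = {'class': 'ZstdCompressor'};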
Operators should also weigh the tradeoff between chunk_length_in_kb and compression ratios: larger chunks generally achieve better compression ratios, but every read must decompress a full chunk, so smaller chunks favor read latency at the cost of ratio. The compression path also checksums data to ensure correctness - users can set crc_check_chance (a float from 0.0 to 1.0) to allow Cassandra to probabilistically validate chunks on read to verify bits on disk are not corrupt.
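For example, an operator prioritizing read latency might shrink the chunk size and sample checksums on only half of reads - a sketch reusing the placeholder keyspace.table name, with illustrative values:

ALTER TABLE keyspace.table WITH compression = {'class': 'LZ4Compressor', 'chunk_length_in_kb': 16, 'crc_check_chance': 0.5};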
Advanced users can provide their own compression class by implementing the interface at org.apache.cassandra.io.compress.ICompressor.
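Assuming such an implementation has been built and placed on the server's classpath (the class name below is hypothetical), it can be selected by its fully-qualified name:

ALTER TABLE keyspace.table WITH compression = {'class': 'com.example.MyCompressor'};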