
Parquet


Parquet is a binary columnar format optimized for compact storage on disk.

The Apache Parquet specification is available on GitHub.

Pages

Columns can be divided into pages (similar to Apache Arrow record batches) so that part of a column, covering a range of rows, can be read without reading the entire file.
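The idea of reading only the pages that overlap a row range can be sketched in a few lines. This is a toy in-memory model, not Parquet's on-disk layout: the helper names `split_into_pages` and `read_rows` and the page size of 25 rows are assumptions made for illustration only.

```python
from bisect import bisect_right


def split_into_pages(values, page_size):
    """Split one column's values into pages of at most page_size rows.

    Returns a list of (first_row_index, page_values) tuples.
    """
    return [(i, values[i:i + page_size]) for i in range(0, len(values), page_size)]


def read_rows(pages, start, stop):
    """Return the values for rows [start, stop) and the number of pages touched."""
    if start >= stop or not pages:
        return [], 0
    first_rows = [first for first, _ in pages]
    # Locate the page that contains row `start`.
    idx = max(bisect_right(first_rows, start) - 1, 0)
    out, pages_read = [], 0
    while idx < len(pages) and pages[idx][0] < stop:
        first, page = pages[idx]
        pages_read += 1
        lo = max(start - first, 0)
        hi = min(stop - first, len(page))
        out.extend(page[lo:hi])
        idx += 1
    return out, pages_read


column = list(range(100, 200))        # 100 rows of a single column
pages = split_into_pages(column, 25)  # 4 pages of 25 rows each
rows, pages_read = read_rows(pages, 30, 60)
```

Here only two of the four pages are touched to serve rows 30 to 59; the other pages are never inspected.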

Alternatives

In contrast to Arrow, which is designed to minimize serialization and deserialization, Parquet is optimized for storage on disk.

Compression

Since Parquet is designed so that individual columns can be read without decompressing the whole file, compression is applied per column chunk.

A wide range of compression codecs is supported. The internal Parquet compression formats are:

Type  Read  Write
UNCOMPRESSED
GZIP
SNAPPY
BROTLI  No
LZO
LZ4
LZ4_RAW
ZSTD
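To illustrate why per-column-chunk compression matters, the following sketch compresses two columns independently and then reads one of them back without touching the other. It is a toy model under stated assumptions: `zlib` stands in for a codec such as GZIP, the byte layout is not Parquet's, and `read_column` is a hypothetical helper.

```python
import struct
import zlib

# Two columns of a small table.
ids = list(range(1000))
names = ["alice", "bob", "carol", "dave"] * 250

# Serialize each column on its own (toy encoding, not Parquet's format).
id_bytes = struct.pack(f"<{len(ids)}i", *ids)
name_bytes = "\n".join(names).encode("utf-8")

# Compress each chunk separately so each can be decompressed independently.
chunks = {
    "id": zlib.compress(id_bytes),
    "name": zlib.compress(name_bytes),
}


def read_column(chunks, name):
    """Decompress a single column chunk; the other chunks are left untouched."""
    return zlib.decompress(chunks[name])


decoded_names = read_column(chunks, "name").decode("utf-8").split("\n")
```

Because each chunk is compressed separately, a different codec could also be chosen for each column.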

Encoding

Some encodings are intended to improve the subsequent compression of a column by organizing the data so that it is less random.

The following Parquet encodings are supported:

Encoding                 Read  Write  Types
PLAIN                                 All
PLAIN_DICTIONARY                      All
RLE_DICTIONARY                        All
DELTA_BINARY_PACKED                   INT32, INT64, INT_8, INT_16, INT_32, INT_64, UINT_8, UINT_16, UINT_32, UINT_64, TIME_MILLIS, TIME_MICROS, TIMESTAMP_MILLIS, TIMESTAMP_MICROS
DELTA_BYTE_ARRAY                      BYTE_ARRAY, UTF8
DELTA_LENGTH_BYTE_ARRAY               BYTE_ARRAY, UTF8
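As a rough illustration of how an encoding can make data "less random" before compression, the sketch below applies a simple delta encoding to evenly spaced timestamps and compares the compressed sizes. This is a simplified stand-in, not the actual DELTA_BINARY_PACKED layout (which also uses blocks and bit-packing); `delta_encode`, `delta_decode` and `pack` are hypothetical helpers, and `zlib` is used only as a convenient general-purpose compressor.

```python
import struct
import zlib


def delta_encode(values):
    """Keep the first value, then store each value as the difference from its predecessor."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


def delta_decode(deltas):
    """Invert delta_encode with a running sum."""
    out, total = [], 0
    for d in deltas:
        total += d
        out.append(total)
    return out


def pack(values):
    """Serialize as little-endian 64-bit integers."""
    return struct.pack(f"<{len(values)}q", *values)


# One timestamp per second, in milliseconds.
timestamps = [1_700_000_000_000 + i * 1000 for i in range(1000)]
deltas = delta_encode(timestamps)

plain_size = len(zlib.compress(pack(timestamps)))
delta_size = len(zlib.compress(pack(deltas)))
```

After delta encoding almost every stored value is the same small number, so the compressor has far less to do and `delta_size` comes out smaller than `plain_size`.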

Repetition

There are three repetition types in Parquet:

Repetition  Supported
REQUIRED
OPTIONAL
REPEATED

Record Shredding

The OPTIONAL and REPEATED repetition types allow flexible, nested, JSON-like data to be stored in table cells.

The algorithm for flattening such nested records into columns is referred to as record shredding.
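A minimal sketch of the idea, restricted to the simplest case of a single top-level REPEATED int32 field, is shown below. Each value is stored with a repetition level (0 starts a new record, 1 continues the current list) and a definition level (0 means the list is empty, 1 means a value is present). The function names `shred` and `assemble` are assumptions for illustration; the general algorithm handles arbitrary nesting and is more involved.

```python
def shred(records):
    """Flatten records (lists of ints) into repetition levels, definition levels and values."""
    rep, defn, vals = [], [], []
    for record in records:
        if not record:
            # Empty list: one placeholder entry with definition level 0.
            # Parquet stores only the levels for such entries; None is kept
            # here so that the three lists stay parallel.
            rep.append(0)
            defn.append(0)
            vals.append(None)
            continue
        for i, value in enumerate(record):
            rep.append(0 if i == 0 else 1)  # 0 = first value of a new record
            defn.append(1)                  # 1 = value is defined
            vals.append(value)
    return rep, defn, vals


def assemble(rep, defn, vals):
    """Rebuild the original records from the three parallel lists."""
    records = []
    for r, d, v in zip(rep, defn, vals):
        if r == 0:
            records.append([])
        if d == 1:
            records[-1].append(v)
    return records


records = [[1, 2, 3], [], [4]]
rep, defn, vals = shred(records)
restored = assemble(rep, defn, vals)
```

The levels are what let a flat column distinguish an empty list from a missing record boundary, so the round trip restores the original nesting.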

Types

TBA - This table is not complete

Name               Type  Supported
bool               BOOLEAN
int32              INT32
int64              INT64
int96              INT96
float              FLOAT
double             DOUBLE
bytearray          BYTE_ARRAY
FixedLenByteArray  FIXED_LEN_BYTE_ARRAY, length=10
utf8               BYTE_ARRAY, convertedtype=UTF8, encoding=PLAIN_DICTIONARY
int_8              INT32, convertedtype=INT_8
int_16             INT32, convertedtype=INT_16
int_32             INT32, convertedtype=INT_32
int_64             INT64, convertedtype=INT_64
uint_8             INT32, convertedtype=UINT_8
uint_16            INT32, convertedtype=UINT_16
uint_32            INT32, convertedtype=UINT_32
uint_64            INT64, convertedtype=UINT_64
date               INT32, convertedtype=DATE
date2              INT32, convertedtype=DATE, logicaltype=DATE
timemillis         INT32, convertedtype=TIME_MILLIS
timemillis2        INT32, logicaltype=TIME, logicaltype.isadjustedtoutc=true, logicaltype.unit=MILLIS
timemicros         INT64, convertedtype=TIME_MICROS
timemicros2        INT64, logicaltype=TIME, logicaltype.isadjustedtoutc=false, logicaltype.unit=MICROS
timestampmillis    INT64, convertedtype=TIMESTAMP_MILLIS
timestampmillis2   INT64, logicaltype=TIMESTAMP, logicaltype.isadjustedtoutc=true, logicaltype.unit=MILLIS
timestampmicros    INT64, convertedtype=TIMESTAMP_MICROS
timestampmicros2   INT64, logicaltype=TIMESTAMP, logicaltype.isadjustedtoutc=false, logicaltype.unit=MICROS
interval           BYTE_ARRAY, convertedtype=INTERVAL
decimal1           INT32, convertedtype=DECIMAL, scale=2, precision=9
decimal2           INT64, convertedtype=DECIMAL, scale=2, precision=18
decimal3           FIXED_LEN_BYTE_ARRAY, convertedtype=DECIMAL, scale=2, precision=10, length=12
decimal4           BYTE_ARRAY, convertedtype=DECIMAL, scale=2, precision=20
decimal5           INT32, logicaltype=DECIMAL, logicaltype.precision=10, logicaltype.scale=2
parquetmap         MAP, convertedtype=MAP, keytype=BYTE_ARRAY, keyconvertedtype=UTF8, valuetype=INT32
list               MAP, convertedtype=LIST, valuetype=BYTE_ARRAY, valueconvertedtype=UTF8
repeated           INT32, repetitiontype=REPEATED
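Many rows in the table pair a physical type with a converted or logical type that says how to interpret the stored integer. The sketch below shows that interpretation for three of them: DECIMAL (an unscaled integer plus a scale), DATE (days since the Unix epoch) and TIMESTAMP_MILLIS (milliseconds since the Unix epoch, assumed here to be UTC-adjusted). The function names are hypothetical and are not taken from any particular Parquet library.

```python
from datetime import date, datetime, timedelta, timezone
from decimal import Decimal

EPOCH_DATE = date(1970, 1, 1)
EPOCH_UTC = datetime(1970, 1, 1, tzinfo=timezone.utc)


def decode_decimal(unscaled, scale):
    """DECIMAL: the stored integer is the value multiplied by 10**scale."""
    return Decimal(unscaled).scaleb(-scale)


def decode_date(days):
    """DATE: the stored INT32 counts days since 1970-01-01."""
    return EPOCH_DATE + timedelta(days=days)


def decode_timestamp_millis(millis):
    """TIMESTAMP_MILLIS: the stored INT64 counts milliseconds since the epoch (UTC assumed)."""
    return EPOCH_UTC + timedelta(milliseconds=millis)


price = decode_decimal(12345, 2)          # e.g. a decimal1-style column with scale=2
day = decode_date(18993)
instant = decode_timestamp_millis(1_000)
```

With scale=2, the stored integer 12345 decodes to 123.45, which is why the same INT32 physical type can back several very different logical types.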

Format Structure

Figure: Parquet file format.