Next Previous Up Contents
Next: Feather
Up: Output Formats
Previous: IPAC

4.1.2.7 Parquet

Parquet is a columnar format developed within the Apache project. Data is compressed on disk and read into memory before use.

At present, only very limited metadata is written. Parquet does not seem(?) to have any standard format for per-column metadata, so the only information written about each column apart from its datatype is its name.

Parquet support is currently somewhat experimental.

Note:

The parquet I/O handlers require large external libraries, which are not always bundled with the library/application software because of their size. In some configurations, parquet support may not be present, and attempts to read or write parquet files will result in a message like:
   Parquet-mr libraries not available
If you can supply the relevant libaries on the classpath at runtime, the parquet support will work. At time of writing, the required libraries are included in the topcat-extra.jar monolithic jar file (though not topcat-full.jar), and are included if you have the topcat-all.dmg file. They can also be found in the starjava github repository (https://github.com/Starlink/starjava/tree/master/parquet/src/lib or you can acquire them from the Parquet MR package. These arrangements may be revised in future releases, for instance if parquet usage becomes more mainstream. The required dependencies are a minimal subset of those required by the Parquet MR submodule parquet-cli, in particular the files aircompressor-0.21.jar commons-collections-3.2.2.jar commons-configuration2-2.1.1.jar commons-lang3-3.9.jar failureaccess-1.0.1.jar guava-27.0.1-jre.jar hadoop-auth-3.2.3.jar hadoop-common-3.2.3.jar hadoop-mapreduce-client-core-3.2.3.jar htrace-core4-4.1.0-incubating.jar parquet-cli-1.13.1.jar parquet-column-1.13.1.jar parquet-common-1.13.1.jar parquet-encoding-1.13.1.jar parquet-format-structures-1.13.1.jar parquet-hadoop-1.13.1.jar parquet-jackson-1.13.1.jar slf4j-api-1.7.22.jar slf4j-nop-1.7.22.jar snappy-java-1.1.8.3.jar stax2-api-4.2.1.jar woodstox-core-5.3.0.jar.
These libraries support some, but not all, of the compression formats defined for parquet, currently uncompressed, gzip, snappy and lz4_raw. Supplying more of the parquet-mr dependencies at runtime would extend this list.

The handler behaviour may be modified by specifying one or more comma-separated name=value configuration options in parentheses after the handler name, e.g. "parquet(compression=gzip,groupArray=false)". The following options are available:

compression = uncompressed|snappy|gzip|lz4_raw
Configures the type of compression used for output. Supported values are probably uncompressed, snappy, gzip and lz4_raw. Others may be available if the relevant codecs are on the classpath at runtime. If no value is specified, the parquet-mr library default is used, which is probably uncompressed. (Default: null)
groupArray = true|false
Controls the low-level detail of how array-valued columns are written. For an array-valued int32 column named IVAL, groupArray=false will write it as "repeated int32 IVAL" while groupArray=true will write it as "optional group IVAL (LIST) { repeated group list { optional int32 item} }". I don't know why you'd want to do it the latter way, but some other parquet writers seem to do that by default, so there must be some good reason. (Default: false)
usedict = true|false|null
Determines whether dictionary encoding is used for output. This will work well to compress the output for columns with a small number of distinct values. Even when this setting is true, dictionary encoding is abandoned once many values have been encountered (the dictionary gets too big). If no value is specified, the parquet-mr library default is used, which is probably true. (Default: null)

If no output format is explicitly chosen, writing to a filename with the extension ".parquet" or ".parq" (case insensitive) will select parquet format for output.


Next Previous Up Contents
Next: Feather
Up: Output Formats
Previous: IPAC

TOPCAT - Tool for OPerations on Catalogues And Tables
Starlink User Note253
TOPCAT web page: http://www.starlink.ac.uk/topcat/
Author email: m.b.taylor@bristol.ac.uk
Mailing list: topcat-user@jiscmail.ac.uk