Next: HAPI
Up: Input Formats
Previous: MRT

4.1.1.11 Parquet

Parquet is a columnar format developed within the Apache project. Data is compressed on disk and read into memory before use.

This input handler will read columns representing scalars, strings and one-dimensional arrays of the same. It is not capable of reading multi-dimensional arrays, more complex nested data structures, or some more exotic data types like 96-bit integers. If such columns are encountered in an input file, a warning will be emitted through the logging system and the column will not appear in the read table. Support may be introduced for some additional types if there is demand.

At present, only very limited metadata is read. Parquet does not appear to have any standard format for per-column metadata, so the only information read about each column, apart from its datatype, is its name.

Depending on the way that the table is accessed, the reader tries to take advantage of the column and row block structure of parquet files to read the data in parallel where possible.
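
As a rough illustration of the idea (not the STIL implementation), the following Python sketch models a file as a list of row groups, each holding one chunk per column, and decodes the chunks on a thread pool. The names ROW_GROUPS, read_chunk and read_columns_parallel are hypothetical and invented for this example; a real reader would decode compressed pages from disk rather than copy in-memory lists.

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in for a parquet file: a list of row groups (row blocks), each
# mapping a column name to that column's chunk of values.
ROW_GROUPS = [
    {"ra": [10.0, 11.0], "dec": [-1.0, -2.0]},
    {"ra": [12.0], "dec": [-3.0]},
]


def read_chunk(group_index, column):
    """Hypothetical decoder for one column chunk of one row group."""
    return list(ROW_GROUPS[group_index][column])


def read_columns_parallel(columns, n_thread=4):
    """Read the requested columns, one task per (row group, column) chunk."""
    # Tasks are ordered by column, then by row group, so that the chunks
    # of each column can be concatenated in row order afterwards.
    tasks = [(g, c) for c in columns for g in range(len(ROW_GROUPS))]
    with ThreadPoolExecutor(max_workers=n_thread) as pool:
        # pool.map returns results in the same order as the tasks.
        chunks = list(pool.map(lambda task: read_chunk(*task), tasks))
    table = {c: [] for c in columns}
    for (_, column), chunk in zip(tasks, chunks):
        table[column].extend(chunk)
    return table
```

Because each chunk is independent of the others, the work divides naturally between threads; how much this helps in practice depends on how the table is accessed.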

Parquet support is currently somewhat experimental.

Note:

The parquet I/O handlers require large external libraries, which are not always bundled with the library/application software because of their size. In some configurations, parquet support may not be present, and attempts to read or write parquet files will result in a message like:
   Parquet-mr libraries not available
If you can supply the relevant libraries on the classpath at runtime, parquet support will work. At time of writing, the required libraries are included in the topcat-extra.jar monolithic jar file; they can also be found in the starjava github repository (https://github.com/Starlink/starjava/tree/master/parquet/src/lib, use parquet-mr-stil.jar and its dependencies), or you can acquire them from the Parquet MR package. These arrangements may be revised in future releases, for instance if parquet usage becomes more mainstream.

The required dependencies are those of the Parquet MR submodule parquet-cli, in particular the files:
   parquet-cli-1.11.1.jar
   parquet-column-1.11.1.jar
   parquet-common-1.11.1.jar
   parquet-encoding-1.11.1.jar
   parquet-format-structures-1.11.1.jar
   parquet-hadoop-1.11.1-noshaded.jar
   parquet-jackson-1.11.1.jar
   commons-collections-3.2.2.jar
   commons-configuration-1.6.jar
   commons-lang-2.6.jar
   failureaccess-1.0.1.jar
   guava-27.0.1-jre.jar
   hadoop-auth-2.7.3.jar
   hadoop-common-2.7.3.jar
   log4j-1.2.17.jar
   slf4j-api-1.7.22.jar
   slf4j-log4j12-1.7.22.jar
   snappy-java-1.1.7.3.jar

The handler behaviour may be modified by specifying one or more comma-separated name=value configuration options in parentheses after the handler name, e.g. "parquet(cachecols=true,nThread=4)". The following options are available:

cachecols = true|false|null
Forces whether to read all the column data at table load time. If true, then when the table is loaded, all data is read by column into local scratch disk files, which is generally the fastest way to ingest all the data. If false, the table rows are read as required, and possibly cached using the normal STIL mechanisms. If null (the default), the decision is taken automatically based on available information.
nThread = <int>
Sets the number of read threads used for concurrently reading table columns if the columns are cached at load time - see the cachecols option. If the value is <=0 (the default), a value is chosen based on the number of apparently available processors.
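
To make the option syntax concrete, here is a minimal Python sketch of how a specification such as "parquet(cachecols=true,nThread=4)" breaks down into a handler name and option values, with the documented defaults applied. It is not the parser used by the software: parse_handler_spec and interpret_options are hypothetical names, option names are matched exactly as written above (an assumption), and os.cpu_count() merely stands in for "the number of apparently available processors".

```python
import os
import re


def parse_handler_spec(spec):
    """Split 'name(key=value,...)' into (name, dict of option strings)."""
    match = re.fullmatch(r"\s*([^()\s]+)\s*(?:\((.*)\))?\s*", spec)
    if match is None:
        raise ValueError(f"Malformed handler specification: {spec!r}")
    name, body = match.group(1), match.group(2)
    options = {}
    if body:
        for item in body.split(","):
            key, sep, value = item.partition("=")
            if not sep:
                raise ValueError(f"Expected name=value, got: {item!r}")
            options[key.strip()] = value.strip()
    return name, options


def interpret_options(options):
    """Apply the documented defaults to the cachecols and nThread options."""
    # cachecols is tri-state: true, false, or null (decide automatically).
    tri_state = {"true": True, "false": False, "null": None}
    cachecols = tri_state[options.get("cachecols", "null").lower()]
    # nThread <= 0 (the default) means: choose from the processor count.
    n_thread = int(options.get("nThread", "0"))
    if n_thread <= 0:
        n_thread = os.cpu_count() or 1
    return cachecols, n_thread
```

With this sketch, parse_handler_spec("parquet(cachecols=true,nThread=4)") yields the name "parquet" and the two options as strings, and interpret_options turns them into a boolean and a thread count.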

This format can be automatically identified by its content, so you do not need to specify the format explicitly when reading parquet tables, regardless of the filename.
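
Content-based identification is possible because standard Parquet files begin and end with the four-byte magic number PAR1. The Python sketch below checks for that signature; it is an illustration only, and whether the handler inspects the start of the file, the end, or both is an assumption not stated here. The name looks_like_parquet is hypothetical.

```python
import os

PARQUET_MAGIC = b"PAR1"
# Smallest plausible file: leading magic, 4-byte footer length, trailing magic.
MIN_SIZE = 12


def looks_like_parquet(path):
    """Return True if the file carries the PAR1 signature at both ends."""
    if os.path.getsize(path) < MIN_SIZE:
        return False
    with open(path, "rb") as f:
        head = f.read(len(PARQUET_MAGIC))
        f.seek(-len(PARQUET_MAGIC), os.SEEK_END)
        tail = f.read(len(PARQUET_MAGIC))
    return head == PARQUET_MAGIC and tail == PARQUET_MAGIC
```

The check depends only on the file's bytes, so a parquet table saved under a name such as "table.dat" is still recognised, while a text file renamed to "table.parquet" is not.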



TOPCAT - Tool for OPerations on Catalogues And Tables
Starlink User Note 253
TOPCAT web page: http://www.starlink.ac.uk/topcat/
Author email: m.b.taylor@bristol.ac.uk
Mailing list: topcat-user@jiscmail.ac.uk