parquet
Parquet is a columnar format developed within the Apache project. Data is compressed on disk and read into memory before use. The file format is described at https://github.com/apache/parquet-format. This software is written with reference to version 2.10.0 of the format.
The parquet file format itself defines only rather limited semantic metadata, so that there is no standard way to record column units, descriptions, UCDs etc. By default, additional metadata is written in the form of a DATA-less VOTable attached to the file footer, as described by the VOParquet convention. This additional metadata can then be retrieved by other VOParquet-aware software.
Note:
The parquet I/O handlers require large external libraries, which are not always bundled with the library/application software because of their size. In some configurations, parquet support may not be present, and attempts to read or write parquet files will result in a message like:
   Parquet-mr libraries not available
If you can supply the relevant libraries on the classpath at runtime, the parquet support will work. At time of writing, the required libraries are included in the topcat-extra.jar monolithic jar file (though not topcat-full.jar), and are included if you have the topcat-all.dmg file. They can also be found in the starjava github repository (https://github.com/Starlink/starjava/tree/master/parquet/src/lib), or you can acquire them from the Parquet MR package. These arrangements may be revised in future releases, for instance if parquet usage becomes more mainstream. The required dependencies are a minimal subset of those required by the Parquet MR submodule parquet-cli at version 1.13.1, in particular the files:
   aircompressor-0.21.jar
   commons-collections-3.2.2.jar
   commons-configuration2-2.1.1.jar
   commons-lang3-3.9.jar
   failureaccess-1.0.1.jar
   guava-27.0.1-jre.jar
   hadoop-auth-3.2.3.jar
   hadoop-common-3.2.3.jar
   hadoop-mapreduce-client-core-3.2.3.jar
   htrace-core4-4.1.0-incubating.jar
   parquet-cli-1.13.1.jar
   parquet-column-1.13.1.jar
   parquet-common-1.13.1.jar
   parquet-encoding-1.13.1.jar
   parquet-format-structures-1.13.1.jar
   parquet-hadoop-1.13.1.jar
   parquet-jackson-1.13.1.jar
   slf4j-api-1.7.22.jar
   slf4j-nop-1.7.22.jar
   snappy-java-1.1.8.3.jar
   stax2-api-4.2.1.jar
   woodstox-core-5.3.0.jar
   zstd-jni-1.5.0-1.jar
These libraries support some, but not all, of the compression formats defined for parquet, currently uncompressed, gzip, snappy, zstd and lz4_raw. Supplying more of the parquet-mr dependencies at runtime would extend this list. Unlike the rest of TOPCAT/STILTS/STIL, which is written in pure java, some of these libraries (currently the snappy and zstd compression codecs) contain native code, which means they may not work on all architectures. At time of writing all common architectures are covered, but there is the possibility of failure with a java.lang.UnsatisfiedLinkError on other platforms if attempting to read/write files that use those compression algorithms.
The handler behaviour may be modified by specifying
one or more comma-separated name=value configuration options
in parentheses after the handler name, e.g.
"parquet(votmeta=false,compression=gzip)".
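As an illustration of the name=value syntax, the following Python sketch assembles such a handler specification string from keyword arguments. The function name handler_spec is hypothetical and not part of STIL/STILTS. The sketch does no escaping, so it assumes that option values contain no commas or parentheses.

```python
def handler_spec(name, **options):
    """Return a handler spec such as 'parquet(votmeta=false,compression=gzip)'.

    Hypothetical helper for illustration only; not part of any library.
    """
    parts = []
    for key, value in options.items():
        if value is None:
            continue  # leave the option unset so the handler default applies
        if isinstance(value, bool):
            value = "true" if value else "false"  # lower-case true/false as in the option docs
        parts.append(f"{key}={value}")
    # With no options set, the bare handler name is returned.
    return f"{name}({','.join(parts)})" if parts else name


print(handler_spec("parquet", votmeta=False, compression="gzip"))
# prints: parquet(votmeta=false,compression=gzip)
```

The resulting string can be supplied wherever an output format name is accepted.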
The following options are available:
votmeta = true|false
If true, rich metadata for the table is written
in the form of a DATA-less VOTable,
stored in the parquet key-value metadata under the key
IVOA.VOTable-Parquet.content,
according to the
VOParquet convention (version 1.0).
This enables items such as units, UCDs and column descriptions,
which would otherwise be lost in the serialization,
to be stored in the output parquet file.
This information can then be recovered by parquet readers
that understand this convention.
(Default: true)
compression = uncompressed|snappy|zstd|gzip|lz4_raw
Configures the type of compression used for output.
Supported values should include
uncompressed, snappy,
zstd, gzip and lz4_raw.
Others may be available if the relevant codecs are on the
classpath at runtime.
If no value is specified, the parquet-mr library default
is used, which is probably uncompressed.
(Default: null)
kvmap = key1:value1;key2:value2;...
Specifies additional entries to be written to the
key=value metadata list of the output parquet file.
Each entry is given in the form
<key>:<value>
and entries are separated with a semicolon,
so for instance you could write
"kvmap=author:Messier;year:1774".
This will overwrite any map entries that would otherwise
have been written.
If a value starts with the at sign ("@")
it is interpreted as giving the name of a file
whose contents will be used instead of the literal value.
Specifying an empty entry will ensure it is not written
into the key=value list.
The following output format specification would write
parquet output including VOParquet metadata from
a manually prepared VOTable file meta.vot:
parquet(votmeta=false,kvmap=IVOA.VOTable-Parquet.version:1.0;IVOA.VOTable-Parquet.content:@meta.vot)
usedict = true|false|null
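Determines whether dictionary encoding is used for output.
If no value is specified, the parquet-mr library default
is used, which is probably true.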
(Default: null)
groupArray = true|false
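Controls how array-valued columns are declared in the parquet schema.
For an array-valued int32 column named IVAL,
groupArray=false will write it as
"repeated int32 IVAL"
while groupArray=true will write it as
"optional group IVAL (LIST) {repeated group list
{optional int32 element}}".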
Although setting it false may be slightly more
efficient, the default is true,
since if any of the columns have array values that either
may be null or may have elements which are null,
groupArray-style declarations for all columns are required
by the Parquet file format:
"A repeated field that is neither contained by a LIST- or MAP-annotated group nor annotated by LIST or MAP should be interpreted as a required list of required elements where the element type is the type of the field. Implementations should use either LIST and MAP annotations or unannotated repeated fields, but not both. When using the annotations, no unannotated repeated types are allowed."
If this option is set false and an attempt is made to write
null arrays or arrays with null values, writing will fail.
(Default: true)
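The kvmap value syntax described above can be illustrated with a short Python sketch. It is only an illustration of the documented rules, not the handler's actual implementation. The function name parse_kvmap and its base_dir parameter are hypothetical. The sketch also assumes that "an empty entry" means a key followed by an empty value, and that referenced files are UTF-8 text.

```python
from pathlib import Path


def parse_kvmap(text, base_dir="."):
    """Parse 'key1:value1;key2:value2' into (entries, suppressed_keys).

    Illustrative only: the name, the return shape and the UTF-8 assumption
    are not taken from the STIL implementation.
    """
    entries = {}
    suppressed = set()
    for item in text.split(";"):
        if not item:
            continue  # ignore stray semicolons
        # Split at the first colon only, so values may themselves contain colons.
        key, _, value = item.partition(":")
        if value == "":
            # Assumed meaning of an empty entry: do not write this key at all.
            suppressed.add(key)
            entries.pop(key, None)
        elif value.startswith("@"):
            # "@name" means: use the contents of the named file as the value.
            entries[key] = (Path(base_dir) / value[1:]).read_text(encoding="utf-8")
            suppressed.discard(key)
        else:
            entries[key] = value
            suppressed.discard(key)
    return entries, suppressed


print(parse_kvmap("author:Messier;year:1774")[0])
# prints: {'author': 'Messier', 'year': '1774'}
```

With the meta.vot example given for kvmap, the entry for IVOA.VOTable-Parquet.content would hold the text of meta.vot, and the entry for IVOA.VOTable-Parquet.version would hold the literal string 1.0.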
If no output format is explicitly chosen,
writing to a filename with
the extension ".parquet" or ".parq" (case insensitive)
will select parquet format for output.
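The extension rule above amounts to a case-insensitive suffix test, sketched here in Python. The function name selects_parquet is hypothetical, and the sketch covers only this one rule, not the rest of the format selection logic.

```python
def selects_parquet(filename):
    """Return True if the filename ends in .parquet or .parq, ignoring case."""
    return filename.lower().endswith((".parquet", ".parq"))


print(selects_parquet("catalogue.PARQ"))
# prints: True
```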