In many cases tables are stored in some sort of unstructured plain
text format, with cells separated by spaces or some other delimiters.
There is a wide variety of such formats depending on what delimiters
are used, how columns are identified, whether blank values are permitted
and so on. It is impossible to cope with them all, but the ASCII handler
attempts to make a good guess about how to interpret a given ASCII file as
a table, which in many cases is successful. In particular, if you just
have columns of numbers separated by something that looks like spaces,
you should be just fine.
Here are the detailed rules for how the ASCII-format tables are
interpreted:
- Bytes in the file are interpreted as ASCII characters
- Each table row is represented by a single line of text
- Lines are terminated by one or more contiguous line termination
characters: line feed (0x0A) or carriage return (0x0D)
- Within a line, fields are separated by one or more whitespace
characters: space (" ") or tab (0x09)
- A field is either an unquoted sequence of non-whitespace characters,
or a sequence of non-newline characters between matching
single (') or double (") quote characters -
spaces are therefore allowed in quoted fields
- Within a quoted field, whitespace characters are permitted and are
treated literally
- Within a quoted field, any character preceded by a backslash character
("\") is treated literally. This allows quote characters to appear
within a quoted string.
- An empty quoted string (two adjacent quotes)
or the string "null" (unquoted) represents
the null value
- All data lines must contain the same number of fields (this is the
number of columns in the table)
- The data type of a column is guessed according to the fields that
appear in the table. If all the fields in one column can be parsed
as integers (or null values), then that column will turn into an
integer-type column. The types that are tried, in order of
preference, are:
Boolean, Short, Integer, Long, Float, Double, String
- Some special values are permitted for floating point columns:
NaN for not-a-number, which is treated the same as a
null value for most purposes, and Infinity or inf
for infinity (with or without a preceding +/- sign).
These values are matched case-insensitively.
- Empty lines are ignored
- Anything after a hash character "#" (except one in a quoted string)
on a line is ignored as far as table data goes;
any line which starts with a "!" is also ignored.
However, lines which start with a "#" or "!" at the start of the table
(before any data lines) will be interpreted as metadata as follows:
- The last "#"/"!"-starting line before the first data line may
contain the column names. If it has the same number of fields as
there are columns in the table, each field will be taken to be
the title of the corresponding column. Otherwise, it will be
taken as a normal comment line.
- Any comment lines before the first data line not covered by the
above will be concatenated to form the "description" parameter
of the table.
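As an illustration of the field-splitting rules above, here is a minimal Python sketch. It is not the handler's own code: the function name split_fields is invented for this example, and cases the rules do not spell out (such as a quote character in the middle of an unquoted field) are handled in an arbitrary way.

```python
def split_fields(line):
    """Split one line of text into fields; None marks a null value."""
    if line.startswith("!"):
        return []                       # "!" lines carry no table data
    fields = []
    i, n = 0, len(line)
    while i < n:
        ch = line[i]
        if ch in " \t":
            i += 1                      # whitespace separates fields
        elif ch == "#":
            break                       # unquoted hash: ignore the rest
        elif ch in "'\"":
            quote, buf = ch, []
            i += 1
            while i < n and line[i] != quote:
                if line[i] == "\\" and i + 1 < n:
                    i += 1              # backslash: next character is literal
                buf.append(line[i])
                i += 1
            if i >= n:
                raise ValueError("unterminated quoted field")
            i += 1                      # step over the closing quote
            text = "".join(buf)
            fields.append(text if text else None)   # "" or '' is null
        else:
            start = i
            while i < n and line[i] not in " \t#":
                i += 1
            token = line[start:i]
            fields.append(None if token == "null" else token)
    return fields


print(split_fields("7 \"queen ant\" 'Ma\\'am' 6 2e-3"))
```

A line that yields an empty list (blank, comment-only or "!"-prefixed) would simply be skipped by the caller.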
If the list of rules above looks frightening, don't worry:
in many cases the handler ought
to make sense of a table without you having to read the small print.
Here is an example of a suitable ASCII-format table:
#
# Here is a list of some animals.
#
# RECNO  SPECIES      NAME             LEGS  HEIGHT/m
  1      pig          "Pigling Bland"  4     0.8
  2      cow          Daisy            4     2
  3      goldfish     Dobbin           ""    0.05
  4      ant          ""               6     0.001
  5      ant          ""               6     0.001
  6      ant          ''               6     0.001
  7      "queen ant"  'Ma\'am'         6     2e-3
  8      human        "Mark"           2     1.8
In this case it will identify the following columns:
Name      Type
----      ----
RECNO     Short
SPECIES   String
NAME      String
LEGS      Short
HEIGHT/m  Float
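The column types in this listing follow from the type-guessing rule: the first type in the preference order that accepts every non-null field wins. The Python sketch below illustrates that idea only. All names in it are invented, and two details are assumptions of this sketch rather than documented behaviour: which tokens count as Boolean (here just "true" and "false"), and how Float is told apart from Double (here, at most 6 significant digits and a magnitude within single-precision range).

```python
import re

INT_RE = re.compile(r"[+-]?\d+")
FLOAT_RE = re.compile(r"[+-]?(\d+\.?\d*|\.\d+)([eE][+-]?\d+)?")
SPECIAL_RE = re.compile(r"nan|[+-]?(infinity|inf)", re.IGNORECASE)


def is_boolean(tok):
    # ASSUMPTION: only these two spellings are treated as Boolean here.
    return tok.lower() in ("true", "false")


def int_within(bits):
    """Return a checker for signed integers that fit in the given width."""
    limit = 2 ** (bits - 1)
    return lambda tok: bool(INT_RE.fullmatch(tok)) and -limit <= int(tok) < limit


def is_double(tok):
    return bool(FLOAT_RE.fullmatch(tok) or SPECIAL_RE.fullmatch(tok))


def is_float(tok):
    # ASSUMPTION: the single/double precision split is a guess of this sketch.
    if SPECIAL_RE.fullmatch(tok):
        return True
    if not FLOAT_RE.fullmatch(tok):
        return False
    mantissa = re.split(r"[eE]", tok)[0]
    digits = re.sub(r"\D", "", mantissa).lstrip("0")
    return len(digits) <= 6 and abs(float(tok)) <= 3.4e38


# Candidate types in the order of preference given in the rules.
CANDIDATES = [
    ("Boolean", is_boolean),
    ("Short", int_within(16)),
    ("Integer", int_within(32)),
    ("Long", int_within(64)),
    ("Float", is_float),
    ("Double", is_double),
    ("String", lambda tok: True),
]


def guess_column_type(values):
    """values: the fields of one column, with None for null."""
    present = [v for v in values if v is not None]
    for name, accepts in CANDIDATES:
        if all(accepts(v) for v in present):
            return name
    return "String"


print(guess_column_type(["4", "4", None, "6", "6", "6", "6", "2"]))
print(guess_column_type(["0.8", "2", "0.05", "0.001", "2e-3", "1.8"]))
```

Note that a column containing only nulls trivially matches the first candidate in this sketch; the rules above do not say what the handler does in that case.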
It will also use the text "Here is a list of some animals"
as the Description parameter of the table.
Without any of the comment lines, it would still interpret the table,
but the columns would be given the names col1..col5.
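The way the leading comment lines turn into column names and a description can be sketched in Python as follows. This is an illustration under stated assumptions, not the handler's implementation: the function name read_preamble is invented, the header line is split on plain whitespace, and joining the remaining non-empty comment lines with a single space is a guess (the rules only say that they are concatenated).

```python
def read_preamble(comment_lines, n_columns):
    """comment_lines: the "#"/"!" lines that precede the first data line."""
    # Drop the leading marker character and surrounding whitespace.
    texts = [line[1:].strip() for line in comment_lines]

    names = None
    if texts:
        candidate = texts[-1].split()
        if len(candidate) == n_columns:
            names = candidate           # last comment line supplies the names
            texts = texts[:-1]
    if names is None:
        # No usable header line: fall back to col1, col2, ...
        names = [f"col{i}" for i in range(1, n_columns + 1)]

    # ASSUMPTION: remaining non-empty comment text is joined with spaces.
    description = " ".join(text for text in texts if text)
    return names, description


names, description = read_preamble(
    ["#", "# Here is a list of some animals.", "#",
     "# RECNO SPECIES NAME LEGS HEIGHT/m"],
    5,
)
print(names)
print(description)
```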
The handler behaviour may be modified by specifying
one or more comma-separated name=value configuration options
in parentheses after the handler name, e.g.
"ascii(maxSample=5000)".
The following options are available:
- maxSample = <int>
Controls how many rows of the input file are sampled
to determine column datatypes.
When reading ASCII files, since no type information is present
in the input file, the handler has to look at the column data
to see what type of value appears to be present
in each column, before even starting to read the data in.
By default it goes through the whole table when doing this,
which can be time-consuming for large tables.
If this value is set, it limits the number of rows
that are sampled in this data characterisation pass,
which can reduce read time substantially.
However, if values near the end of the table differ
in apparent type from those near the start,
it can also result in getting the datatypes wrong.
(Default: 0)
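The trade-off described for maxSample can be seen in a small Python sketch. It uses a deliberately simplified type guess (integer, floating or string only) and invented names; treating a value of 0 as "sample every row" is inferred from the default described above rather than taken from the handler's code.

```python
def parses_as(kind, token):
    """True if the Python constructor `kind` accepts the token."""
    try:
        kind(token)
        return True
    except ValueError:
        return False


def guess_kind(tokens, max_sample=0):
    """Guess a column's kind from at most max_sample leading rows."""
    # ASSUMPTION: 0 (or less) means "look at every row".
    sample = tokens if max_sample <= 0 else tokens[:max_sample]
    for name, kind in (("integer", int), ("floating", float)):
        if all(parses_as(kind, tok) for tok in sample):
            return name
    return "string"


column = ["1", "2", "3", "4.5", "n/a"]
print(guess_kind(column))                 # whole column examined
print(guess_kind(column, max_sample=3))   # only the first three rows examined
```

With the limit set to 3 the guess is made without ever seeing the last two rows, which is exactly the situation in which the datatype can come out wrong.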
This format cannot be automatically identified
by its content, so in general it is necessary
to specify that a table is in ASCII format when reading it.
However, if the input file has the extension ".txt" (case insensitive)
an attempt will be made to read it using this format.
TOPCAT - Tool for OPerations on Catalogues And Tables
Starlink User Note 253
TOPCAT web page:
http://www.starlink.ac.uk/topcat/
Author email:
m.b.taylor@bristol.ac.uk
Mailing list:
topcat-user@jiscmail.ac.uk