Parsing oscilloscope binary data

Posted on

Problem

I wrote a simply parser for Tektronix’s internal binary file format .isf. The code is supposed to read in the file and return the header data as well as the actual oszi-data.

Since I mostly code by myself, I would like your opinion on whether my code follows common practice and whether it can be improved.

import numpy as np
import os.path

def read_curve(binaryfile):
    """
    Reads one tektronix .isf file and returns a dictionary containing
    all tags as keys. The actual data is stored in the key "data".
    """
    postfixs = [".isf", ".ISF"]
    if os.path.splitext(binaryfile)[-1] not in postfixs:
        raise ValueError("File type unkown.")

    with open(binaryfile, 'rb') as bfile:

        # read header
        header = {}
        current = 0

        while True:

            current, name = _read_chunk(bfile, " ")

            if name != ":CURVE":
                current, value = _read_chunk(bfile, ";")

                assert name not in header
                header[name] = value
            else:
                # ":CURVE" is the last tag of the header, followed by
                # a '#' and a 7 digit number.
                header[name] = bfile.read(8)
                current = bfile.tell()
                break

        assert header["ENCDG"] == "BINARY"

        # read data as numpy array
        header["data"] = _read_data(bfile, current, header)

    return header


def _read_chunk(headerfile, delimiter):
    """
    Reads one chunk of header data. Based on delimiter, this may be a tag
    (ended by " ") or the value of a tag (ended by ";").
    """
    chunk = []
    while True:
        c = headerfile.read(1)
        if c != delimiter:
            chunk.append(c)
        else:
            return headerfile.tell(), "".join(chunk)


def _read_data(bfile, position, header):
    """
    Reads in the binary data as numpy array.
    Apparently, there are only 1d-signals stored in .isf files, so a 1d-array
    is read.
    Returns a 2d-array with timepoints and datapoints aligned.
    """
    # determine the datatype from header tags
    datatype = ">" if header["BYT_OR"] == "MSB" else "<"
    if header["BN_FMT"] == "RI":
        datatype += "i"
    else:
        datatype += "u"
    datatype += header[":WFMPRE:BYT_NR"]

    bfile.seek(position)
    data = np.fromfile(bfile, datatype)
    assert data.size == int(header["NR_PT"])

    # calculate true values
    data = data * float(header["YMULT"]) + float(header["YZERO"])

    # create timepoints
    t = np.arange(data.size)
    t = t * float(header["XINCR"]) + float(header["XZERO"])

    # create single array
    res = np.concatenate((t[None, :], data[None, :]), axis=0)

    return res

Edit: I posted my revised code here: Parsing oscilloscope data, follow up

Solution

Disclaimer: I haven’t touch a tektronix device for years.

  • _read_chunk fails if WFID contains a semicolon.
  • Don’t read data blindly till EOF. The header promises that many samples, but they might be trailing data.
  • The 7 digit number following a :CURVE # is not necessarily 7 digit. You must parse it as well (it is a single byte of count, aka X, followed by count digit number, aka YYY, which represents the byte length of the data).

    I would also assert that YYY equals to NR_PT * BYT_NR

This seems pretty nice. I have just some nitpicks about this part:

postfixs = [".isf", ".ISF"]
if os.path.splitext(binaryfile)[-1] not in postfixs:
    raise ValueError("File type unkown.")

Nitpicks:

  • I generally don’t like repeating the same thing in lowercase and uppercase versions. I prefer to use lowercase, and convert untrusted input to lowercase. However, that will make this check less strict: iSF and ISf also become valid extensions, which you might not want.

  • I’d spell out “postfixs” as “postfixes”, or better yet, use the more common term “extensions”

  • Instead of a list of allowed extensions, I’d use a set. Although in this example you surely won’t be hitting performance issues, it’s a matter of principle and habit

So I’d write this as:

extensions = set([".isf"])
if os.path.splitext(binaryfile)[-1].lower() not in extensions:
    raise ValueError("File type unknown.")

As for the rest, this seems pretty clean to me. I especially like the assert statements documenting your assumptions.

_read_chunk() does not need to return the filepointer position as you never use it. After leaving the main loop you (have to) read the filepointer explicitely. Thus, current is oblivious.

Speaking of naming, ‘current’ can be a lot of things…’currentposition’ or the like would be self-documenting. Likewise, a ‘bfile’ and a ‘binaryfile’ might be an ‘isf_file’; it isn’t strictly binary after all.

The guidelines for python style (PEP8) point out that blank lines be used sparsely. I would omit them especially at the beginning of a code block like a loop as to visually underline the block.

_read_chunk() makes me feel uncomfortable – you know it’s text data but resort to byte-wise reading and searching (for the delims). Built-in string search is much cleaner and easier to follow. You could read in the first 1 MB or so and parse all string data up to the ‘#’ (if this is unique). Its string position gives you the starting point for reading the binary data then. Anyway, very concise and clean programming style, a pleasure to read!

Edit:
According to this Tektronix doc you need to parse the next few chars after the ‘:CURVE’ token like this (same what vnp hinted at):

else:
    # ":CURVE" is the last tag of the header, followed by
    # a '#', a single ASCII length char and a variable-length ASCII digit number.
    skip = int(bfile.read(1))
    assert 1 <= skip <= 9
    numrecs  = int(bfile.read(skip))
    assert numrecs == header['NR_PT'] * header['BYT_NR']
    current = bfile.tell()
    break

Leave a Reply

Your email address will not be published. Required fields are marked *