Problem
I wrote a simply parser for Tektronix’s internal binary file format .isf
. The code is supposed to read in the file and return the header data as well as the actual oszi-data.
Since I mostly code by myself, I would like your opinion on whether my code follows common practice and whether it can be improved.
import numpy as np
import os.path
def read_curve(binaryfile):
"""
Reads one tektronix .isf file and returns a dictionary containing
all tags as keys. The actual data is stored in the key "data".
"""
postfixs = [".isf", ".ISF"]
if os.path.splitext(binaryfile)[-1] not in postfixs:
raise ValueError("File type unkown.")
with open(binaryfile, 'rb') as bfile:
# read header
header = {}
current = 0
while True:
current, name = _read_chunk(bfile, " ")
if name != ":CURVE":
current, value = _read_chunk(bfile, ";")
assert name not in header
header[name] = value
else:
# ":CURVE" is the last tag of the header, followed by
# a '#' and a 7 digit number.
header[name] = bfile.read(8)
current = bfile.tell()
break
assert header["ENCDG"] == "BINARY"
# read data as numpy array
header["data"] = _read_data(bfile, current, header)
return header
def _read_chunk(headerfile, delimiter):
"""
Reads one chunk of header data. Based on delimiter, this may be a tag
(ended by " ") or the value of a tag (ended by ";").
"""
chunk = []
while True:
c = headerfile.read(1)
if c != delimiter:
chunk.append(c)
else:
return headerfile.tell(), "".join(chunk)
def _read_data(bfile, position, header):
"""
Reads in the binary data as numpy array.
Apparently, there are only 1d-signals stored in .isf files, so a 1d-array
is read.
Returns a 2d-array with timepoints and datapoints aligned.
"""
# determine the datatype from header tags
datatype = ">" if header["BYT_OR"] == "MSB" else "<"
if header["BN_FMT"] == "RI":
datatype += "i"
else:
datatype += "u"
datatype += header[":WFMPRE:BYT_NR"]
bfile.seek(position)
data = np.fromfile(bfile, datatype)
assert data.size == int(header["NR_PT"])
# calculate true values
data = data * float(header["YMULT"]) + float(header["YZERO"])
# create timepoints
t = np.arange(data.size)
t = t * float(header["XINCR"]) + float(header["XZERO"])
# create single array
res = np.concatenate((t[None, :], data[None, :]), axis=0)
return res
Edit: I posted my revised code here: Parsing oscilloscope data, follow up
Solution
Disclaimer: I haven’t touch a tektronix device for years.
_read_chunk
fails if WFID contains a semicolon.- Don’t read data blindly till EOF. The header promises that many samples, but they might be trailing data.
-
The
7 digit number
following a:CURVE #
is not necessarily 7 digit. You must parse it as well (it is a single byte ofcount
, akaX
, followed bycount
digit number, akaYYY
, which represents the byte length of the data).I would also assert that
YYY
equals toNR_PT * BYT_NR
This seems pretty nice. I have just some nitpicks about this part:
postfixs = [".isf", ".ISF"]
if os.path.splitext(binaryfile)[-1] not in postfixs:
raise ValueError("File type unkown.")
Nitpicks:
-
I generally don’t like repeating the same thing in lowercase and uppercase versions. I prefer to use lowercase, and convert untrusted input to lowercase. However, that will make this check less strict:
iSF
andISf
also become valid extensions, which you might not want. -
I’d spell out “postfixs” as “postfixes”, or better yet, use the more common term “extensions”
-
Instead of a
list
of allowed extensions, I’d use aset
. Although in this example you surely won’t be hitting performance issues, it’s a matter of principle and habit
So I’d write this as:
extensions = set([".isf"])
if os.path.splitext(binaryfile)[-1].lower() not in extensions:
raise ValueError("File type unknown.")
As for the rest, this seems pretty clean to me. I especially like the assert
statements documenting your assumptions.
_read_chunk()
does not need to return the filepointer position as you never use it. After leaving the main loop you (have to) read the filepointer explicitely. Thus, current
is oblivious.
Speaking of naming, ‘current’ can be a lot of things…’currentposition’ or the like would be self-documenting. Likewise, a ‘bfile’ and a ‘binaryfile’ might be an ‘isf_file’; it isn’t strictly binary after all.
The guidelines for python style (PEP8) point out that blank lines be used sparsely. I would omit them especially at the beginning of a code block like a loop as to visually underline the block.
_read_chunk()
makes me feel uncomfortable – you know it’s text data but resort to byte-wise reading and searching (for the delims). Built-in string search is much cleaner and easier to follow. You could read in the first 1 MB or so and parse all string data up to the ‘#’ (if this is unique). Its string position gives you the starting point for reading the binary data then. Anyway, very concise and clean programming style, a pleasure to read!
Edit:
According to this Tektronix doc you need to parse the next few chars after the ‘:CURVE’ token like this (same what vnp hinted at):
else:
# ":CURVE" is the last tag of the header, followed by
# a '#', a single ASCII length char and a variable-length ASCII digit number.
skip = int(bfile.read(1))
assert 1 <= skip <= 9
numrecs = int(bfile.read(skip))
assert numrecs == header['NR_PT'] * header['BYT_NR']
current = bfile.tell()
break