Problem
I’m reading a file with the following data,
char peer0_3[] = { /* Packet 647 */
0x02, 0x00, 0x04, 0x00, 0x11, 0x01, 0x06, 0x1b,
0x04, 0x01, 0x31, 0x0a, 0x32, 0x30, 0x31, 0x39,
0x2d, 0x30, 0x36, 0x2d, 0x31, 0x30, 0x0a, 0x32,
0x30, 0x31, 0x39, 0x2d, 0x30, 0x36, 0x2d, 0x31,
0x30, 0x01, 0x30 };
with the following code, to get this ‘C’ text in a python list
f=open(thisFile,'r')
contents=f.read()
tcontents=re.sub('/\*.*\*/','',contents).replace('\n', '').strip() #suppress the comments
xcontents = re.findall("{(.+?)};", tcontents, re.S) #find the arrays
frames=[]
for item in xcontents:
    splitData =item.split(',')
    buff=bytes()
    for item in splitData:
        packed = struct.pack("B", int(item,0) )
        buff+=packed
    frames.append(buff)
That’s working fine, but I was wondering if there was not a smarter and more compact method?
Solution
I think this can be done a lot more concisely. You need a function to recognize bytes until the end of the array, one to convert them to integers and one to pull it all together:
import re

def get_bytes(f):
    # Yield every hex literal until the line that closes the array.
    for line in f:
        # [0-9a-fA-F] rather than \d, so that 0x1b, 0x0a and 0x2d are matched too
        yield from re.findall(r'0x[0-9a-fA-F]{2}', line)
        if line.rstrip().endswith(";"):
            break

def convert_str_to_bytes(s):
    return int(s, 0).to_bytes(1, 'big')

def convert_arrays(file_name):
    with open(file_name) as f:
        while True:
            arr = b''.join(map(convert_str_to_bytes, get_bytes(f)))
            if arr:
                yield arr
            else:
                # No more literals found: end of file reached.
                return

if __name__ == "__main__":
    print(list(convert_arrays('c_like_code.txt')))
    # [b'\x02\x00\x04\x00\x11\x01\x06\x1b\x04\x011\n2019-06-10\n2019-06-10\x010']
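A note on the pattern used to pick out the byte literals: it has to accept the hex digits a-f as well as 0-9, otherwise values such as 0x1b, 0x0a and 0x2d are silently dropped. The sketch below compares a digits-only pattern with a hex-aware one on the second line of the sample data; the names DIGITS_ONLY and HEX_AWARE are made up for the illustration.

```python
import re

# Hypothetical names, for illustration only.
DIGITS_ONLY = re.compile(r'0x\d{2}')
HEX_AWARE = re.compile(r'0x[0-9a-fA-F]{2}')

# Second line of the array from the question.
line = "0x04, 0x01, 0x31, 0x0a, 0x32, 0x30, 0x31, 0x39,"

digits_only = DIGITS_ONLY.findall(line)
hex_aware = HEX_AWARE.findall(line)

print(digits_only)  # 0x0a is not matched by the digits-only pattern
print(hex_aware)    # all eight literals are matched
```

A missing byte does not raise any error, so comparing the length of the result with the number of literals in the source is a cheap sanity check.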
- You open your file but never close it; please use the with statement to manage your file's lifetime automatically;
- You don’t need struct to convert an int to a single byte: you can use the int.to_bytes method instead;
- frames = []; for item in xcontents: frames.append(…transform…(item)) cries for a list-comprehension instead; same for buff = bytes(); for item in splitData: buff += …change…(item): use b''.join;
- You shouldn’t need to search for delimiters and comments using your regex since all you are interested in are the hexadecimal numbers (r'0x[0-9a-fA-F]{2}'), but this would require a little bit of preprocessing to extract “logical lines” from the C code;
- You shouldn’t need to read the whole file at once. Instead, reading it line by line and processing only the handful of lines corresponding to a single expression at a time would help get rid of the “search for delimiters” regex.
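The conversion-related points above can be sketched on an in-memory string. This is a minimal illustration rather than a drop-in replacement: the sample text and the names SAMPLE and frames are assumptions, and it keeps a whole-text regex for the array bodies only to stay short.

```python
import re
import struct

# Hypothetical sample with two arrays, standing in for the file contents.
SAMPLE = """
char a[] = { /* Packet 1 */
0x02, 0x00, 0x1b };
char b[] = {
0x31, 0x0a };
"""

# Three equivalent ways to turn one int (0-255) into a single byte.
n = 0x1b
assert struct.pack("B", n) == n.to_bytes(1, 'big') == bytes([n])

# One bytes object per array: a list comprehension around b''.join.
frames = [
    b''.join(int(h, 16).to_bytes(1, 'big')
             for h in re.findall(r'0x[0-9a-fA-F]{2}', body))
    for body in re.findall(r'\{(.*?)\};', SAMPLE, re.S)
]
print(frames)
```

Because only the hex literals are extracted from each body, the comment inside the first array needs no special handling here.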
Proposed improvements:
import re

# [0-9a-fA-F] rather than \d, so that bytes such as 0x1b are matched too
HEX_BYTE = re.compile(r'0x[0-9a-fA-F]{2}')

def find_array_line(file_object):
    """
    Yield a block of lines from an opened file, stopping
    when the last line ends with a semicolon.
    """
    for line in file_object:
        line = line.strip()
        yield line
        # In Python 3.8 you can merge the previous 2 lines into
        # yield (line := line.strip())
        if line.endswith(';'):
            break

def read_array_line(filename):
    """
    Yield blocks of lines in the file named filename as a
    single string each.
    """
    with open(filename) as source_file:
        while True:
            line = ''.join(find_array_line(source_file))
            if not line:
                break
            yield line
            # In Python 3.8 you can write the loop
            # while line := ''.join(find_array_line(source_file)):
            #     yield line

def convert_arrays(filename):
    """
    Consider each block of lines in the file named filename
    as an array and yield them as a single bytes object each.
    """
    for line in read_array_line(filename):
        yield b''.join(
            int(match.group(), 0).to_bytes(1, 'big')
            for match in HEX_BYTE.finditer(line))

if __name__ == '__main__':
    print(list(convert_arrays('c_like_code.txt')))
A more maintainable way would be to look up a parsing package like PyParsing. Otherwise, don’t forget to put spaces around operators:
- from buff+=packed to buff += packed, and
- from splitData =item.split(',') to splitData = item.split(',').
You can also read files as
with open(thisFile) as f:
    contents = f.read()
Not specifying any mode assumes read mode ('r').
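To make the point about the with statement concrete, here is a small sketch showing that the file is closed once the block exits and that omitting the mode gives text read mode. The temporary file and its contents are placeholders for the illustration.

```python
import os
import tempfile

# Placeholder file so the sketch is self-contained.
fd, path = tempfile.mkstemp(suffix='.txt')
with os.fdopen(fd, 'w') as tmp:
    tmp.write("char a[] = { 0x01 };\n")

with open(path) as f:      # no mode given
    mode = f.mode          # the default text read mode
    contents = f.read()

closed_after = f.closed    # the with block has closed the file
os.remove(path)

print(mode, closed_after, contents.strip())
```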
Since someone else already mentioned pyparsing, here is an annotated parser for your C code:
c_source = """
char peer0_3[] = { /* Packet 647 */
0x02, 0x00, 0x04, 0x00, 0x11, 0x01, 0x06, 0x1b,
0x04, 0x01, 0x31, 0x0a, 0x32, 0x30, 0x31, 0x39,
0x2d, 0x30, 0x36, 0x2d, 0x31, 0x30, 0x0a, 0x32,
0x30, 0x31, 0x39, 0x2d, 0x30, 0x36, 0x2d, 0x31,
0x30, 0x01, 0x30 };
"""
import pyparsing as pp
ppc = pp.pyparsing_common
# normally string literals are added to the parsed output, but here anything that is just
# added to the parser as a string, we will want suppressed
pp.ParserElement.inlineLiteralsUsing(pp.Suppress)
# pyparsing already includes a definition for a hex_integer, including parse-time
# conversion to int
hexnum = "0x" + ppc.hex_integer
# pyparsing also defines a helper for elements that are in a delimited list (with ','
# as the default delimiter)
hexnumlist = pp.delimitedList(hexnum)
# build up a parser, and add names for the significant parts, so we can get at them easily
# post-parsing
# pyparsing will skip over whitespace that may appear between any of these expressions
decl_expr = ("char"
             + ppc.identifier("name")
             + "[]" + "=" + "{"
             + hexnumlist("bytes")
             + "}" + ";")
# ignore pesky comments, which can show up anywhere
decl_expr.ignore(pp.cStyleComment)
# try it out
result = decl_expr.parseString(c_source)
print(result.dump())
print(result.name)
print(result.bytes)
Prints
['peer0_3', 2, 0, 4, 0, 17, 1, 6, 27, 4, 1, 49, 10, 50, 48, 49, 57, 45, 48, 54, 45, 49, 48, 10, 50, 48, 49, 57, 45, 48, 54, 45, 49, 48, 1, 48]
- bytes: [2, 0, 4, 0, 17, 1, 6, 27, 4, 1, 49, 10, 50, 48, 49, 57, 45, 48, 54, 45, 49, 48, 10, 50, 48, 49, 57, 45, 48, 54, 45, 49, 48, 1, 48]
- name: 'peer0_3'
peer0_3
[2, 0, 4, 0, 17, 1, 6, 27, 4, 1, 49, 10, 50, 48, 49, 57, 45, 48, 54, 45, 49, 48, 10, 50, 48, 49, 57, 45, 48, 54, 45, 49, 48, 1, 48]
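The parser returns the values as a list of ints, whereas the code in the question builds bytes objects. A minimal sketch of that last step, assuming the list printed above; it is copied here as a literal so the snippet runs without pyparsing installed, and the names values and frame are made up for the illustration.

```python
# Values copied from the dump above; with the parser in scope they would
# presumably come from result.bytes.asList() instead.
values = [2, 0, 4, 0, 17, 1, 6, 27, 4, 1, 49, 10, 50, 48, 49, 57, 45,
          48, 54, 45, 49, 48, 10, 50, 48, 49, 57, 45, 48, 54, 45, 49,
          48, 1, 48]

# bytes() accepts an iterable of ints, each of which must be in range(256).
frame = bytes(values)
print(frame)
# b'\x02\x00\x04\x00\x11\x01\x06\x1b\x04\x011\n2019-06-10\n2019-06-10\x010'
```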