# Parsing C-like code to extract info

Problem

I’m reading a file that contains the following data:

``````
char peer0_3[] = { /* Packet 647 */
0x02, 0x00, 0x04, 0x00, 0x11, 0x01, 0x06, 0x1b,
0x04, 0x01, 0x31, 0x0a, 0x32, 0x30, 0x31, 0x39,
0x2d, 0x30, 0x36, 0x2d, 0x31, 0x30, 0x0a, 0x32,
0x30, 0x31, 0x39, 0x2d, 0x30, 0x36, 0x2d, 0x31,
0x30, 0x01, 0x30 };
``````

and I use the following code to turn this C text into a Python list of `bytes` objects:

``````
import re
import struct

f=open(thisFile,'r')
tcontents = f.read()
xcontents = re.findall("{(.+?)};", tcontents, re.S)                       #find the arrays
frames=[]
for item in xcontents:
    splitData =item.split(',')
    buff=bytes()
    for item in splitData:
        packed =  struct.pack("B", int(item,0) )
        buff+=packed
    frames.append(buff)
``````

That works fine, but I was wondering whether there is a smarter, more compact way to do it.

Solution

I think this can be done a lot more concisely. You need a function to recognize bytes until the end of the array, one to convert them to integers and one to pull it all together:

``````
import re

def get_bytes(f):
    # Yield every hex byte literal until the line that closes the array.
    for line in f:
        yield from re.findall(r'0x[0-9a-fA-F]{2}', line)
        if line.rstrip().endswith(";"):
            break

def convert_str_to_bytes(s):
    return int(s, 0).to_bytes(1, 'big')

def convert_arrays(file_name):
    with open(file_name) as f:
        while True:
            arr = b''.join(map(convert_str_to_bytes, get_bytes(f)))
            if arr:
                yield arr
            else:
                return

if __name__ == "__main__":
    print(list(convert_arrays('c_like_code.txt')))
    # [b'\x02\x00\x04\x00\x11\x01\x06\x1b\x04\x011\n2019-06-10\n2019-06-10\x010']
``````
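One detail worth checking in `get_bytes` is the character class that follows `0x`. A digits-only class would silently skip bytes such as `0x1b` or `0x2d`, because `b` and `d` are hexadecimal digits but not decimal ones. The sketch below compares the two patterns on the first line of the sample data; the names `DECIMAL_ONLY` and `HEX_PATTERN` are illustrative and not part of the answer above.

```python
import re

# Illustrative names: two candidate patterns for one byte literal.
DECIMAL_ONLY = re.compile(r"0x\d{2}")          # accepts digits 0-9 only
HEX_PATTERN = re.compile(r"0x[0-9a-fA-F]{2}")  # accepts the full hex range

line = "0x02, 0x00, 0x04, 0x00, 0x11, 0x01, 0x06, 0x1b,"

decimal_hits = DECIMAL_ONLY.findall(line)
hex_hits = HEX_PATTERN.findall(line)

print(len(decimal_hits))  # 7: 0x1b is skipped
print(len(hex_hits))      # 8: every byte on the line is found
```

A mismatch like this is easy to miss because no exception is raised; the resulting `bytes` object is simply shorter than the array in the source.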

• You open your file but never close it; please use the `with` statement to manage your file’s lifetime automatically;
• You don’t need `struct` to convert an `int` to a single byte: you can use the `int.to_bytes` method instead;
• `frames = []; for item in xcontents: frames.append(…transform…(item))` cries out for a list comprehension instead; the same goes for `buff = bytes(); for item in splitData: buff += …change…(item)`: use `b''.join`;
• You shouldn’t need to search for delimiters and comments using your regex since all you are interested in are the hexadecimal numbers (`r'0x[0-9a-fA-F]{2}'`), but this would require a little bit of preprocessing to extract “logical lines” from the C code;
• You shouldn’t need to read the whole file at once. Instead, reading it line by line and processing only the handful of lines corresponding to a single expression at a time would help get rid of the “search for delimiters” regex.
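As a first step, the first three points can be applied to the original code without changing how the file is read. The following is only a minimal sketch under that assumption: it still loads the whole file and still uses a delimiter regex. The names `ARRAY_BODY`, `HEX_PATTERN`, `parse_frames` and `load_frames` are made up for illustration.

```python
import re

# Illustrative names, not taken from the original code.
ARRAY_BODY = re.compile(r"\{(.+?)\};", re.S)      # text between '{' and '};'
HEX_PATTERN = re.compile(r"0x[0-9a-fA-F]{1,2}")   # one byte written in hex

def parse_frames(text):
    """Return one bytes object per C array initializer found in text."""
    return [
        b"".join(int(tok, 0).to_bytes(1, "big")
                 for tok in HEX_PATTERN.findall(body))
        for body in ARRAY_BODY.findall(text)
    ]

def load_frames(path):
    """Read the file inside a with block so it is closed automatically."""
    with open(path) as source_file:
        return parse_frames(source_file.read())

sample = """char peer0_3[] = { /* Packet 647 */
0x02, 0x00, 0x04, 0x00 };
char peer1_0[] = { 0x1b, 0x0a };
"""
print(parse_frames(sample))  # [b'\x02\x00\x04\x00', b'\x1b\n']
```

Searching for the hex literals inside each array body, rather than splitting on commas, means the `/* Packet 647 */` comment never reaches `int()`.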

Proposed improvements:

``````
import re

HEX_BYTE = re.compile(r'0x[0-9a-fA-F]{2}')

def find_array_line(file_object):
    """
    Yield a block of lines from an opened file, stopping
    when the last line ends with a semicolon.
    """
    for line in file_object:
        line = line.strip()
        yield line
        # In Python 3.8+ you can merge the previous 2 lines into
        # yield (line := line.strip())
        if line.endswith(';'):
            break

def read_array_lines(filename):
    """
    Yield blocks of lines in the file named filename as a
    single string each.
    """
    with open(filename) as source_file:
        while True:
            line = ''.join(find_array_line(source_file))
            if not line:
                break
            yield line
            # In Python 3.8+ you can write the loop
            # while line := ''.join(find_array_line(source_file)):
            #     yield line

def convert_arrays(filename):
    """
    Consider each block of lines in the file named filename
    as an array and yield them as a single bytes object each.
    """
    for line in read_array_lines(filename):
        yield b''.join(
            int(match.group(), 0).to_bytes(1, 'big')
            for match in HEX_BYTE.finditer(line))

if __name__ == '__main__':
    print(list(convert_arrays('c_like_code.txt')))
``````

A more maintainable approach would be to use a parsing package such as pyparsing. Otherwise, don’t forget to put spaces around operators:

• from `buff+=packed` to `buff += packed` and
• from `splitData =item.split(',')` to `splitData = item.split(',')`.

You can also read files as

``````
with open(thisFile) as f:
    tcontents = f.read()
``````

Not specifying any mode assumes read mode (`'r'`).
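To check the default mode quickly, the sketch below writes a small throwaway file and opens it with and without an explicit mode. The temporary file and the variable names are assumptions made only for this illustration.

```python
import os
import tempfile

# Create a small throwaway file so the example is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("0x02, 0x00\n")
    tmp_path = tmp.name

with open(tmp_path) as f:        # no mode given
    default_mode = f.mode
    contents = f.read()

with open(tmp_path, "r") as f:   # mode spelled out
    explicit_mode = f.mode

os.remove(tmp_path)
print(default_mode, explicit_mode)  # r r
```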

Since someone else already mentioned pyparsing, here is an annotated parser for your C code:

``````
c_source = """
char peer0_3[] = { /* Packet 647 */
0x02, 0x00, 0x04, 0x00, 0x11, 0x01, 0x06, 0x1b,
0x04, 0x01, 0x31, 0x0a, 0x32, 0x30, 0x31, 0x39,
0x2d, 0x30, 0x36, 0x2d, 0x31, 0x30, 0x0a, 0x32,
0x30, 0x31, 0x39, 0x2d, 0x30, 0x36, 0x2d, 0x31,
0x30, 0x01, 0x30 };
"""

import pyparsing as pp
ppc = pp.pyparsing_common

# normally string literals are added to the parsed output, but here anything that is just
# added to the parser as a string, we will want suppressed
pp.ParserElement.inlineLiteralsUsing(pp.Suppress)

# pyparsing already includes a definition for a hex_integer, including parse-time
# conversion to int
hexnum = "0x" + ppc.hex_integer

# pyparsing also defines a helper for elements that are in a delimited list (with ','
# as the default delimiter)
hexnumlist = pp.delimitedList(hexnum)

# build up a parser, and add names for the significant parts, so we can get at them easily
# post-parsing
# pyparsing will skip over whitespace that may appear between any of these expressions
decl_expr = ("char"
             + ppc.identifier("name")
             + "[]" + "=" + "{"
             + hexnumlist("bytes")
             + "}" + ";")

# ignore pesky comments, which can show up anywhere
decl_expr.ignore(pp.cStyleComment)

# try it out
result = decl_expr.parseString(c_source)
print(result.dump())
print(result.name)
print(result.bytes)
``````

Prints

``````
['peer0_3', 2, 0, 4, 0, 17, 1, 6, 27, 4, 1, 49, 10, 50, 48, 49, 57, 45, 48, 54, 45, 49, 48, 10, 50, 48, 49, 57, 45, 48, 54, 45, 49, 48, 1, 48]
- bytes: [2, 0, 4, 0, 17, 1, 6, 27, 4, 1, 49, 10, 50, 48, 49, 57, 45, 48, 54, 45, 49, 48, 10, 50, 48, 49, 57, 45, 48, 54, 45, 49, 48, 1, 48]
- name: 'peer0_3'
peer0_3
[2, 0, 4, 0, 17, 1, 6, 27, 4, 1, 49, 10, 50, 48, 49, 57, 45, 48, 54, 45, 49, 48, 10, 50, 48, 49, 57, 45, 48, 54, 45, 49, 48, 1, 48]
``````
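The parser returns the values as a list of Python ints, whereas the question asked for `bytes` objects. Assuming every value is in the range 0 to 255, the built-in `bytes` constructor accepts such a list directly. The sketch below uses a literal copy of the printed list so that it runs without pyparsing; `parsed_bytes` and `frame` are illustrative names. With the parser in scope, something like `bytes(list(result.bytes))` should give the same value.

```python
# Values copied from the parser output printed above.
parsed_bytes = [2, 0, 4, 0, 17, 1, 6, 27, 4, 1, 49, 10, 50, 48, 49, 57,
                45, 48, 54, 45, 49, 48, 10, 50, 48, 49, 57, 45, 48, 54,
                45, 49, 48, 1, 48]

# bytes() accepts any iterable of ints in range(256).
frame = bytes(parsed_bytes)

print(len(frame))    # 35
print(frame[12:22])  # b'2019-06-10'
```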