Parsing C-like code to extract info

Problem

I’m reading a file with the following data,

char peer0_3[] = { /* Packet 647 */
0x02, 0x00, 0x04, 0x00, 0x11, 0x01, 0x06, 0x1b, 
0x04, 0x01, 0x31, 0x0a, 0x32, 0x30, 0x31, 0x39, 
0x2d, 0x30, 0x36, 0x2d, 0x31, 0x30, 0x0a, 0x32, 
0x30, 0x31, 0x39, 0x2d, 0x30, 0x36, 0x2d, 0x31, 
0x30, 0x01, 0x30 };

I use the following code to turn this C text into a Python list of bytes objects:

f=open(thisFile,'r')
contents=f.read()
tcontents=re.sub('/\*.*\*/','',contents).replace('\n', '').strip()      #suppress the comments
xcontents = re.findall("{(.+?)};", tcontents, re.S)                       #find the arrays
frames=[]
for item in xcontents:
    splitData =item.split(',')
    buff=bytes()
    for item in splitData:
        packed =  struct.pack("B", int(item,0) )
        buff+=packed
    frames.append(buff)

That’s working fine, but I was wondering whether there is a smarter and more compact way to do it.

Solution

I think this can be done a lot more concisely. You need a function to recognize bytes until the end of the array, one to convert them to integers and one to pull it all together:

import re

def get_bytes(f):
    for line in f:
        # "0x" followed by two hexadecimal digits (0-9, a-f)
        yield from re.findall(r'0x[0-9a-fA-F]{2}', line)
        if line.rstrip().endswith(";"):
            break

def convert_str_to_bytes(s):
    return int(s, 0).to_bytes(1, 'big')

def convert_arrays(file_name):
    with open(file_name) as f:
        while True:
            arr = b''.join(map(convert_str_to_bytes, get_bytes(f)))
            if arr:
                yield arr
            else:
                return

if __name__ == "__main__":
    print(list(convert_arrays('c_like_code.txt')))
    # [b'\x02\x00\x04\x00\x11\x01\x06\x1b\x04\x011\n2019-06-10\n2019-06-10\x010']

  • You open your file but never close it; please use the with statement to manage your file's lifetime automatically;
  • You don’t need struct to convert an int to a single byte: you can use the int.to_bytes method instead;
  • frames = []; for item in xcontents: frames.append(…transform…(item)) cries out for a list comprehension instead; the same goes for buff = bytes(); for item in splitData: buff += …change…(item): use b''.join;
  • You shouldn’t need to search for delimiters and comments using your regex, since all you are interested in are the hexadecimal numbers (r'0x[0-9a-fA-F]{2}'), but this would require a little bit of preprocessing to extract “logical lines” from the C code;
  • You shouldn’t need to read the whole file at once. Instead, reading it line by line and processing only the handful of lines that make up a single expression at a time would help get rid of the “search for delimiters” regex.
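The points about struct, b''.join and the regex can be checked in isolation with a minimal sketch. The sample text and the names SAMPLE, HEX_BYTE_PATTERN and to_frame are illustrative assumptions, not part of the original code. Note that the pattern has to accept the digits a-f as well as 0-9, otherwise bytes such as 0x1b are silently skipped.

```python
import re
import struct

# Illustrative sample: one array in the same shape as the input file.
SAMPLE = """char peer0_3[] = { /* Packet 647 */
0x02, 0x00, 0x1b, 0x0a,
0x2d, 0x30 };
"""

# "0x" followed by two hexadecimal digits; the class must include a-f.
HEX_BYTE_PATTERN = re.compile(r'0x[0-9a-fA-F]{2}')


def to_frame(text):
    """Join every hex literal found in text into one bytes object."""
    return b''.join(
        int(literal, 16).to_bytes(1, 'big')
        for literal in HEX_BYTE_PATTERN.findall(text))


frame = to_frame(SAMPLE)
print(frame)  # b'\x02\x00\x1b\n-0'

# int.to_bytes produces the same single byte as struct.pack("B", ...).
print(struct.pack("B", 0x1b) == (0x1b).to_bytes(1, 'big'))  # True
```

The comment in the sample is never stripped: it simply contains nothing that looks like a hex literal, so the pattern ignores it.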

Proposed improvements:

import re


HEX_BYTE = re.compile(r'0x[0-9a-fA-F]{2}')


def find_array_line(file_object):
    """
    Yield a block of lines from an opened file, stopping
    when the last line ends with a semicolon.
    """
    for line in file_object:
        line = line.strip()
        yield line
        # In Python 3.8 you can merge the previous 2 lines into
        # yield (line := line.strip())
        if line.endswith(';'):
            break


def read_array_line(filename):
    """
    Yield blocks of lines in the file named filename as a
    single string each.
    """
    with open(filename) as source_file:
        while True:
            line = ''.join(find_array_line(source_file))
            if not line:
                break
            yield line
        # In Python 3.8 you can write the loop
        # while line := ''.join(find_array_line(source_file)):
        #     yield line


def convert_arrays(filename):
    """
    Consider each block of lines in the file named filename
    as an array and yield them as a single bytes object each.
    """
    for line in read_array_line(filename):
        yield b''.join(
                int(match.group(), 0).to_bytes(1, 'big')
                for match in HEX_BYTE.finditer(line))


if __name__ == '__main__':
    print(list(convert_arrays('c_like_code.txt')))
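The Python 3.8 variants mentioned in the comments can be tried without a file on disk. This is only a sketch under a few assumptions: it needs Python 3.8 or later, it reads from an in-memory io.StringIO instead of a real file, and the names read_blocks, one_block and the two sample arrays are made up for the illustration.

```python
import io
import re

HEX_BYTE = re.compile(r'0x[0-9a-fA-F]{2}')


def read_blocks(file_object):
    """Yield each run of lines, up to one ending in ';', as one string."""
    def one_block():
        for line in file_object:
            # Assignment expression: strip, rebind and yield in one step.
            yield (line := line.strip())
            if line.endswith(';'):
                break

    # The loop stops as soon as a block comes back empty (end of input).
    while block := ''.join(one_block()):
        yield block


# Hypothetical input with two arrays, standing in for the real file.
source = io.StringIO(
    "char a[] = {\n0x01, 0x1b };\n"
    "char b[] = {\n0x0a };\n")

frames = [bytes(int(literal, 16) for literal in HEX_BYTE.findall(block))
          for block in read_blocks(source)]
print(frames)  # [b'\x01\x1b', b'\n']
```

Passing an iterable of integers to bytes() is an alternative to joining one-byte objects; both give the same result here.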

A more maintainable way would be to look into a parsing package such as PyParsing. Otherwise, don’t forget to put spaces around operators:

  • from buff+=packed to buff += packed and
  • from splitData =item.split(',') to splitData = item.split(',').

You can also read files as

with open(thisFile) as f:
    contents = f.read()

Not specifying any mode assumes read mode ('r').
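A short sketch of what the with form buys you. The temporary file and its contents are placeholders chosen so the example is self-contained; they are not taken from the original post. The file object is closed as soon as the block exits, even if an exception is raised inside it.

```python
import os
import tempfile

# Create a throwaway file so the example does not depend on existing data.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write('0x02, 0x00 };\n')
    path = tmp.name

with open(path) as f:      # no mode given, so the file is opened for reading
    contents = f.read()
    mode = f.mode

print(mode)                # r
print(f.closed)            # True: closed when the with block ended
os.remove(path)            # clean up the placeholder file
```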

Since someone else already mentioned pyparsing, here is an annotated parser for your C code:

c_source = """
char peer0_3[] = { /* Packet 647 */
0x02, 0x00, 0x04, 0x00, 0x11, 0x01, 0x06, 0x1b,
0x04, 0x01, 0x31, 0x0a, 0x32, 0x30, 0x31, 0x39,
0x2d, 0x30, 0x36, 0x2d, 0x31, 0x30, 0x0a, 0x32,
0x30, 0x31, 0x39, 0x2d, 0x30, 0x36, 0x2d, 0x31,
0x30, 0x01, 0x30 };
"""

import pyparsing as pp
ppc = pp.pyparsing_common

# normally string literals are added to the parsed output, but here anything that is just
# added to the parser as a string, we will want suppressed
pp.ParserElement.inlineLiteralsUsing(pp.Suppress)

# pyparsing already includes a definition for a hex_integer, including parse-time
# conversion to int
hexnum = "0x" + ppc.hex_integer

# pyparsing also defines a helper for elements that are in a delimited list (with ',' 
# as the default delimiter)
hexnumlist = pp.delimitedList(hexnum)

# build up a parser, and add names for the significant parts, so we can get at them easily
# post-parsing
# pyparsing will skip over whitespace that may appear between any of these expressions
decl_expr = ("char"
             + ppc.identifier("name")
             + "[]" + "=" + "{" 
             + hexnumlist("bytes") 
             + "}" + ";")

# ignore pesky comments, which can show up anywhere
decl_expr.ignore(pp.cStyleComment)

# try it out
result = decl_expr.parseString(c_source)
print(result.dump())
print(result.name)
print(result.bytes)

Prints

['peer0_3', 2, 0, 4, 0, 17, 1, 6, 27, 4, 1, 49, 10, 50, 48, 49, 57, 45, 48, 54, 45, 49, 48, 10, 50, 48, 49, 57, 45, 48, 54, 45, 49, 48, 1, 48]
- bytes: [2, 0, 4, 0, 17, 1, 6, 27, 4, 1, 49, 10, 50, 48, 49, 57, 45, 48, 54, 45, 49, 48, 10, 50, 48, 49, 57, 45, 48, 54, 45, 49, 48, 1, 48]
- name: 'peer0_3'
peer0_3
[2, 0, 4, 0, 17, 1, 6, 27, 4, 1, 49, 10, 50, 48, 49, 57, 45, 48, 54, 45, 49, 48, 10, 50, 48, 49, 57, 45, 48, 54, 45, 49, 48, 1, 48]
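If the goal is still one bytes object per array, the parsed integers can be passed straight to the bytes constructor. A small sketch, with the values copied from the output above into a hypothetical variable parsed_bytes so that it runs without pyparsing installed; with the parser itself, result.bytes.asList() should give an equivalent list.

```python
# Values copied from the "bytes" entry of the parse result.
parsed_bytes = [2, 0, 4, 0, 17, 1, 6, 27, 4, 1, 49, 10, 50, 48, 49, 57,
                45, 48, 54, 45, 49, 48, 10, 50, 48, 49, 57, 45, 48, 54,
                45, 49, 48, 1, 48]

frame = bytes(parsed_bytes)   # every value must be in range(256)
print(len(frame))             # 35
print(frame[10:])             # b'1\n2019-06-10\n2019-06-10\x010'
```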
