Problem
I wrote a simple Python 3 program that takes data from either a file or standard input and XOR-encrypts or decrypts it. By default the output is base64-encoded, but the --raw flag disables that. It works as intended except in raw mode, where an extra line and some stray data are appended to the output when decrypting XOR data.
#!/usr/bin/env python3
from itertools import cycle
import argparse
import base64
import re


def xor(data, key):
    return ''.join(chr(ord(str(a)) ^ ord(str(b))) for (a, b) in zip(data, cycle(key)))


# check if a string is base64 encoded.
def is_base64(s):
    pattern = re.compile("^([A-Za-z0-9+/]{4})*([A-Za-z0-9+/]{4}|[A-Za-z0-9+/]{3}=|[A-Za-z0-9+/]{2}==)$")
    if not s or len(s) < 1:
        return False
    else:
        return pattern.match(s)


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('-i', '--infile', type=argparse.FileType('r'), default='-', dest='data', help='Data to encrypt'
                        'or decrypt')
    parser.add_argument('-r', '--raw', dest='raw', default=False, action='store_true', help='Do not use base64 encoding')
    parser.add_argument('-k', '--key', dest='key', help='Key to encrypt with', required=True)
    args = parser.parse_args()
    data = args.data.read()
    key = args.key
    raw = args.raw
    if raw:
        ret = xor(data, key)
        print(str(ret))
    else:
        if is_base64(data):
            # print('is base64')
            decoded = base64.b64decode(data).decode()
            ret = xor(decoded, key)
            print(ret)
        else:
            # print('is not base64')
            ret = xor(data, key)
            encoded = base64.b64encode(bytes(ret, "utf-8"))
            print(encoded.decode())
When running without the --raw flag, everything performs as intended:
$ echo lol|./xor.py -k 123
XV1fOw==
$ echo lol|./xor.py -k 123 |./xor.py -k 123
lol
However, if I disable base64, something rather odd happens. It's easier to demonstrate than it is to explain:
$ echo lol|./xor.py -k 123 -r |./xor.py -k 123 -r
lol
8
Does anyone know why I am seeing the character 8 appended to the output of XOR-decrypted data? I have a C program called xorpipe that I use for this exact purpose, and it does not suffer from this bug. I wanted to rewrite it in Python.
I am also looking for other constructive criticism, suggestions, or reviews. In particular, I would like argparse to be able to determine whether the supplied input is a file, a string, or data piped from standard input. This is easy to accomplish in bash or C, but I am not sure how best to do it in Python.
Solution
Does anyone know why I am seeing the character 8 appended to the output of xor decrypted data?
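The statement echo lol pipes lol\n to Python, so the first pass XORs the trailing newline with the key character 1, producing a ; character. print() then appends a newline of its own to the ciphertext, and on the second pass that extra newline is XORed with the key character 2, which yields the 8. Unfortunately echo -n doesn't work here, because the stray newline comes from print() rather than from echo, but adding .strip() to the input data in the Python script fixes this issue.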
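The arithmetic can be checked directly. The following is a minimal sketch that reuses the question's xor logic on plain strings and simulates the newline that print() appends after each pass; the variable names are illustrative and not part of the original script.

```python
from itertools import cycle


def xor(data, key):
    # XOR each character of data with the repeating key.
    return ''.join(chr(ord(a) ^ ord(b)) for a, b in zip(data, cycle(key)))


key = '123'

# `echo lol` sends 'lol\n'; print() then appends another newline.
first_pass = xor('lol\n', key) + '\n'

# The second pass XORs that extra newline with the key character '2'.
second_pass = xor(first_pass, key)

print(repr(first_pass))   # ']]_;\n'
print(repr(second_pass))  # 'lol\n8'

# With .strip() applied to the input of both passes, the round trip is clean.
stripped_first = xor('lol\n'.strip(), key) + '\n'
round_trip = xor(stripped_first.strip(), key)
print(repr(round_trip))   # 'lol'
```

Note that .strip() also removes any leading or trailing whitespace that legitimately belongs to the data, so it suits text input like this example rather than arbitrary binary data.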
PEP 8 is Python's official style guide, which specifies how line breaks and code should be laid out: https://www.python.org/dev/peps/pep-0008/ . The guide is very long; you can use autopep8 to auto-format the code.
if raw:
    ret = xor(data, key)
    print(str(ret))
else:
    if is_base64(data):
        # print('is base64')
        decoded = base64.b64decode(data).decode()
        ret = xor(decoded, key)
        print(ret)
    else:
        # print('is not base64')
        ret = xor(data, key)
        encoded = base64.b64encode(bytes(ret, "utf-8"))
        print(encoded.decode())
I would simplify the nested if-statements. One option is to add a return after print(str(ret)) (which requires moving this logic into a function), so that the is_base64 check can be unindented. Another is to set a variable called decoded to the final string to print, then print it once at the end of the if/elif chain.
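As a rough sketch of the early-return variant: process is a hypothetical helper name that does not appear in the original script, and xor and is_base64 are lightly condensed copies of the question's functions so the example runs on its own.

```python
import base64
import re
from itertools import cycle

# Same pattern as the question's is_base64, compiled once at import time.
BASE64_RE = re.compile(
    "^([A-Za-z0-9+/]{4})*([A-Za-z0-9+/]{4}|[A-Za-z0-9+/]{3}=|[A-Za-z0-9+/]{2}==)$")


def xor(data, key):
    # XOR each character of data with the repeating key.
    return ''.join(chr(ord(a) ^ ord(b)) for a, b in zip(data, cycle(key)))


def is_base64(s):
    return bool(s) and BASE64_RE.match(s) is not None


def process(data, key, raw=False):
    # Hypothetical helper: returns the text to print, so each branch can
    # return early instead of nesting if/else blocks.
    if raw:
        return xor(data, key)
    if is_base64(data):
        return xor(base64.b64decode(data).decode(), key)
    return base64.b64encode(xor(data, key).encode("utf-8")).decode()


print(process('lol\n', '123'))  # XV1fOw==
```

The script's main block would then shrink to a single print(process(data, key, raw)) call, which keeps the output in one place.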
is_base64(s) could just run base64.b64decode(s).decode() and return False if any exception is thrown during decoding, instead of using the regex.
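A minimal sketch of that idea follows. Two details are assumptions added here rather than part of the suggestion: validate=True, so that characters outside the base64 alphabet raise an error instead of being silently discarded, and .strip(), so that a trailing newline from a pipe does not cause a rejection.

```python
import base64


def is_base64(s):
    """Return True if s decodes as base64 into valid UTF-8 text."""
    if not s:
        return False
    try:
        # binascii.Error (bad alphabet or padding) and UnicodeDecodeError
        # are both subclasses of ValueError, so one except clause suffices.
        base64.b64decode(s.strip(), validate=True).decode()
    except ValueError:
        return False
    return True


print(is_base64('XV1fOw==\n'))  # True
print(is_base64('lol\n'))       # False
```

Like the regex, this is only a heuristic: plaintext that happens to be valid base64 and decodes to valid UTF-8 is still treated as encoded input.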
I would remove the commented-out code such as # print('is base64').