Problem
Would this function be sufficient enough to remove all malicious code and foreign characters from a string?
//Clean the string
$out = ltrim($do);
$out = rtrim($out);
$out = preg_replace('/[^(x20-x7F)]*/','', strip_tags($out));
This data is not going into a SQL database, so I dont have to worry about sql injection attempts. Is there any way to improve my code and make it more efficient?
This function would clean any user inputted data (forms, and ?), and then save it do a database. This would be used in a global sanitizing function.
Solution
That one is quite simple (for me at least) since there is a very general answer 🙂
NO
There is no way you can ever really safely “repair” user input.
“Please provide a list of everything you shouldn’t to with a hammer”
is way harder than
“list all appropriate uses of a hammer”.
You might forget one or two but no harm done there if you go back and add them.
It might sound harsh but something will always byte you and if it’s only EVAL(UNHEX(ASD23426363))
or something like that. (Sql example even so you did say it not sql but whatever).
Of course you might want to do things like strip out tags out of input for html context anyways so that a hole in your other code is not as easily exploited and less damage will be done but i shouldn’t be your only defense.
Someone that said Filter Input, Escape Output
way better than i could. Terry Chay
In short: Whatever you are doing there, find the appropriate escaping function and use it.
Shell context ? escapeshellarg
Html context ? htmlspecialchars
Database context ? Use prepared statements and never worry about sql injection again
Little edit:
I know what i said and the blog post contradict a little but thats fine with me. ‘In practice’ will always differ from general advice and sometimes you want to do everything you can 😉
I’m doing a copy-pasta from My Answer to a similar question on SO:
Always remember, Filter In, Escape Out for all user supplied (or untrusted) input.
When reading user supplied data, filter it to known values. DO NOT BLACKLIST! Always always always always whitelist what you are expecting to get. If you’re expecting a hex number, validate it with a regex like: ^[a-f0-9]+$
. Figure out what you expect, and filter towards that. Do none of your filenames have anything but alpha, numeric and .
? Then filter to ^[a-z0-9.]+$
. But don’t start thinking blacklisting against things. It won’t work.
When using user-data, escape it properly for the use at hand. If it’s going in a database, either bind it as a parameterized query, or escape it with the database’s escape function. If you’re calling a shell command, escape it with escapeshellarg()
. If you’re using it in a regex pattern, escape it with preg_quote()
. There are more than that, but you get the idea.
When outputting user data, escape it properly for the format you’re outputting it as. If you’re outputting it to HTML or XML, use htmlspecialchars()
. If you’re outputting to raw headers for some reason, escape any linebreaks (str_replace(array("r", "n"), array('r', 'n'), $string)
). Etc, etc, etc.
But always filter using a white-list, and always escape using the correct method for the context. Otherwise there’s a significant chance you’ll miss something…