How to clean data
Back in the day, nucleocide.net was hacked. Why? I was putting user data directly into SQL commands without checking it first. This is bad. I'm not going to get into a big discussion about the various kinds of injection, but I will mention the functions that I use to prevent them.
These functions use regular expressions to check the data. What is a RegEx? It is basically a pattern-matching language. You can compare a string to a RegEx to see whether it is valid, or you can strip invalid characters out of it.
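To make the two uses concrete, here is a minimal sketch of "compare" versus "strip". It assumes PHP's PCRE functions (preg_match and preg_replace); the sample input and variable names are made up for illustration.

```php
<?php
// Hypothetical sample input, not from the article.
$input = "user_42; DROP TABLE";

// Compare: is the whole string made up only of letters, digits and underscores?
// \z anchors the match at the very end of the string.
$isValid = preg_match('/^[a-zA-Z0-9_]+\z/', $input) === 1;

// Strip: delete every character that is not in the allowed set.
$stripped = preg_replace('/[^a-zA-Z0-9_]/', '', $input);

var_dump($isValid);   // bool(false)
echo $stripped, "\n"; // user_42DROPTABLE
```

Which one you want depends on the field: comparing lets you reject the input and tell the user, while stripping quietly repairs it.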
Here is my RegEx function to make sure a string only contains letters, numbers, and the underscore character:
function nukeAlphaNum($value) {
    return preg_replace("/[^a-zA-Z0-9_]/", "", $value);
}
Simple, huh? You could even put it all on one line if you wanted. The function takes one argument, a string, and returns another string that contains only lowercase letters (a-z), uppercase letters (A-Z), numbers (0-9), and the underscore (_). preg_replace takes three arguments: the RegEx (wrapped in delimiters, here slashes), the string to replace each match with (in this case nothing), and the string that it is sifting through (the one we send to the function). The older ereg_replace did the same job without the delimiters, but it was deprecated in PHP 5.3 and removed in PHP 7.
Here are some others that I use:
function nukeAlpha($value) {
    // PCRE version: the pattern sits between / delimiters.
    return preg_replace("/[^a-zA-Z]/", "", $value);
}
function nukeHex($value) {
    return preg_replace("/[^0-9a-fA-F]/", "", $value);
}
function nukeNum($value) {
    return preg_replace("/[^0-9]/", "", $value);
}
These are all pretty self-explanatory. There is one drawback: RegEx isn't the most processor-friendly tool. On small sites like the ones I run it is fine, but on larger sites with a ton of users and a shared hosting plan, it might get out of hand.
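If the RegEx cost ever becomes a concern, one possible alternative is PHP's ctype functions, which check character classes without a pattern engine. This is only a sketch under assumptions: the helper names are hypothetical, the functions validate rather than strip, and whether they are measurably faster on a given site is something to benchmark rather than take on faith.

```php
<?php
// Hypothetical helpers; the names are illustrative and not from the article.
// Each ctype_* call returns true only if the string is non-empty and every
// character belongs to the class (assuming the default "C" locale).

function isAllLetters(string $value): bool {
    return ctype_alpha($value);   // a-z and A-Z only
}

function isAllHex(string $value): bool {
    return ctype_xdigit($value);  // 0-9, a-f, A-F only
}

function isAllDigits(string $value): bool {
    return ctype_digit($value);   // 0-9 only
}

var_dump(isAllDigits("12345"));   // bool(true)
var_dump(isAllDigits("12a"));     // bool(false)
var_dump(isAllHex("DEADbeef"));   // bool(true)
```

Note the difference in behavior: these reject a bad string outright instead of returning a cleaned-up copy, so they fit places where you would rather refuse the input than repair it.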
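There is one more type of RegEx function that I use. Instead of replacing anything, it only tests for a match, using preg_match with the i (case-insensitive) modifier, which takes the place of the old eregi function. Here is an example of what I use to validate an email address:
function nukeValidEmail($value) {
    // i ignores case; D makes $ match only at the very end of the string.
    $pattern = '/^[_a-z0-9-]+(\.[_a-z0-9-]+)*@[a-z0-9-]+'
             . '(\.[a-z0-9-]+)*(\.[a-z]{2,3})$/iD';
    return preg_match($pattern, $value) === 1;
}
Pretty, isn't it? (I split the pattern across two lines to fit in this layout.) Instead of removing bad characters, this function returns true if the email is valid and false if it is not (duh). Different applications call for different measures. This RegEx is too complex to explain fully here, but note that the final {2,3} accepts only top-level domains of two or three letters, so an address ending in .info is rejected.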
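For anyone who wants to read the email pattern piece by piece, here is a sketch that spells it out using PCRE's extended (x) mode, which ignores whitespace inside the pattern and allows # comments. The function name is hypothetical, and the translation to preg_match with the i, x and D modifiers is an assumption on top of the pattern shown above.

```php
<?php
// Hypothetical helper name; same pattern as above, one piece per line.
function explainedEmailCheck(string $value): bool {
    $pattern = '/^
        [_a-z0-9-]+          # local part: a run of letters, digits, _ or -
        (\.[_a-z0-9-]+)*     # optionally more runs, each preceded by a dot
        @                    # the at sign
        [a-z0-9-]+           # first label of the domain
        (\.[a-z0-9-]+)*      # optional further labels, each preceded by a dot
        (\.[a-z]{2,3})       # final label: a dot plus 2 or 3 letters
    $/ixD';
    return preg_match($pattern, $value) === 1;
}

var_dump(explainedEmailCheck("first.last@mail.example.org")); // bool(true)
var_dump(explainedEmailCheck("no-at-sign"));                  // bool(false)
```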
This last function is what I use to validate a website. I allow blank strings, http://, and a full website. I allow the first two options because not everyone that creates an account on my various sites has a website:
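function nukeValidWebsite($value) {
    // # is the delimiter here so the slashes need no escaping; i ignores case.
    if (preg_match("#^(http|ftp|https)://[-A-Za-z0-9._/]+#i", $value))
        return true;
    else if (empty($value) || $value == "http://")
        return true;
    else
        return false;
}
Keep in mind that the pattern is anchored only at the start, so it checks how the string begins, not everything that follows. Have fun and stay clean! If you would like more information on regular expressions, check out Regular-Expressions.info.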
If you would like more information about securing your site from injections, check out this article from Penguicon 2006 (I attended, and the author, Flavio daCosta, hooked me up with this presentation).



