Tokenizing a string in PHP

I decided today to resume development of my BayesSpam plugin for SquirrelMail. My first priority was speeding up the parsing of messages.

My first step was a simple one. I was doing this:

while (preg_match('/([a-zA-Z][a-zA-Z-_\']{0,44})[,."\')?!:;\/&]{0,5}([ \t\n\r]|$)/',$string,$matches)) {
    $string = preg_replace('/([a-zA-Z][a-zA-Z-_\']{0,44})[,."\')?!:;\/&]{0,5}([ \t\n\r]|$)/',' ',$string,1);
    if (isset($matches[1]) && $matches[1] && strlen($matches[1]) >= 3)
        $return[] = $token_type.': '.$matches[1];
}

I replaced it with this:

$token = strtok($string, " \r\n\t,.\"()?!:;/&");
while ($token !== false) {
    if (strlen($token) >= 3 && strlen($token) < 45) {
        $return[] = $token_type.': '.$token;
    }
    $token = strtok(" \r\n\t,.\"()?!:;/&");
}

In my benchmarks the strtok version takes roughly 50% as long as the original, which is a significant speedup.
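
For anyone who wants to try the comparison themselves, here is a rough sketch of a timing harness. It wraps the two loops above in functions and times them on the same input. The function names (tokenize_regex, tokenize_strtok, time_it), the sample text, and the round count are made up for illustration and are not taken from the plugin, so the ratio you see will depend on your input and PHP version.

```php
<?php
// Hypothetical wrappers around the two loops shown above.
// All names, the sample text, and the round count are assumptions.

function tokenize_regex(string $string, string $token_type): array
{
    $pattern = '/([a-zA-Z][a-zA-Z-_\']{0,44})[,."\')?!:;\/&]{0,5}([ \t\n\r]|$)/';
    $return = [];
    // Each pass finds the first word, then blanks it out of the string.
    while (preg_match($pattern, $string, $matches)) {
        $string = preg_replace($pattern, ' ', $string, 1);
        if (strlen($matches[1]) >= 3) {
            $return[] = $token_type . ': ' . $matches[1];
        }
    }
    return $return;
}

function tokenize_strtok(string $string, string $token_type): array
{
    $delimiters = " \r\n\t,.\"()?!:;/&";
    $return = [];
    // strtok() keeps its position internally, so the string is walked once.
    $token = strtok($string, $delimiters);
    while ($token !== false) {
        if (strlen($token) >= 3 && strlen($token) < 45) {
            $return[] = $token_type . ': ' . $token;
        }
        $token = strtok($delimiters);
    }
    return $return;
}

// Runs $fn on $text $rounds times and returns the elapsed seconds.
function time_it(callable $fn, string $text, int $rounds): float
{
    $start = microtime(true);
    for ($i = 0; $i < $rounds; $i++) {
        $fn($text, 'body');
    }
    return microtime(true) - $start;
}

// Made-up sample input, repeated to get a message-sized string.
$text = str_repeat("Hello, world! This is a (simple) test.\n", 50);

$regex_time  = time_it('tokenize_regex', $text, 20);
$strtok_time = time_it('tokenize_strtok', $text, 20);

printf("regex:  %.4f s\nstrtok: %.4f s\n", $regex_time, $strtok_time);
```

The two versions return the same tokens for this sample, but they are not strictly equivalent: the regex requires a token to start with a letter, while the strtok version accepts anything between delimiters.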
