There are two problems with using simple regular expressions to identify and link different email messages and web pages. First, you have to come up with a good signature pattern. If you are starting out with a single email message, for example, then you need to define a number of patterns and try them out to see which, if any, match similar related messages without producing false positives. This is a process of trial and error that can be quite time-consuming.
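The trial-and-error step can be sketched in a few lines of Python. This is only an illustration: the candidate patterns, the sample messages, and the score helper are all invented for the example, and a real test would run against a much larger collection of mail.

```python
import re

# Hypothetical candidate signatures drawn from one known message.
candidates = [
    r"verify your current information",
    r"access to Buy or Sell on eBay",
    r"your account",
]

# Invented test data: messages believed to be related to the original,
# and unrelated messages that should never match.
related = [
    "we couldn't verify your current information, so your access "
    "to Buy or Sell on eBay will be restricted.",
    "We could not verify your current information. Please update your account.",
]
unrelated = [
    "Reminder: your account statement is ready to download.",
    "Lunch on Friday?",
]

def score(pattern, related, unrelated):
    """Return (related messages matched, unrelated messages matched)."""
    rx = re.compile(pattern, re.IGNORECASE)
    hits = sum(1 for msg in related if rx.search(msg))
    false_positives = sum(1 for msg in unrelated if rx.search(msg))
    return hits, false_positives

for pattern in candidates:
    hits, false_positives = score(pattern, related, unrelated)
    print(f"{pattern!r}: {hits} related, {false_positives} false positives")
```

With this sample data, the first pattern matches both related messages and nothing else, the second misses one related message, and the third is too general and picks up an unrelated message.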
On top of that, you have to deal with the variations that spammers introduce into similar messages in order to circumvent antispam filters. In many ways, these filters are trying to do the same job as you are. They want to find unique patterns that mark a message as being spam so they can divert it from your Inbox. The spammers know this, and they know a lot about the methods used by these filters. In order for their spam to keep flowing, they continually introduce variation into their messages in the hope that it will disrupt whatever patterns are being scanned for.
These variations may take the form of random words added to the end of a message, spelling changes made to recognizable words, and message headers that change from one batch of mail to the next.
Consider the following very similar blocks of text taken from two phishing emails that targeted eBay users. In order to get around spam filters, the author has inserted three words (and, the, then) into the second version and changed the capitalization of two other words (Has, If).
During our regular verification of accounts we couldn't verify your current information, either your information [has] changed or it is incomplete . [If] the account is not updated to current information within 5 days, your access to Buy or Sell on eBay will be restricted.
During our regular [and] verification of [the] accounts we couldn't verify your current information, either your information [Has] changed or it is incomplete . [if] the account is not updated to current information within 5 days [then], your access to Buy or Sell on eBay will be restricted.
If the words were not highlighted, you would hardly notice the difference, but the changes are enough to prevent certain types of filter from working. Similarly, any one of them might be enough to disrupt a match to a signature pattern that happened to span it.
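A short Python sketch shows the effect on signature patterns. The two strings are the blocks of text quoted above, with the spacing around the changed words filled in as an assumption. The three patterns and the matches helper are hypothetical, chosen only to show a pattern that spans an inserted word, one that spans a capitalization change, and one that avoids every change.

```python
import re

# The two versions of the phishing text (spacing is an assumption).
first = (
    "During our regular verification of accounts we couldn't verify your "
    "current information, either your information has changed or it is "
    "incomplete . If the account is not updated to current information "
    "within 5 days, your access to Buy or Sell on eBay will be restricted."
)
second = (
    "During our regular and verification of the accounts we couldn't verify "
    "your current information, either your information Has changed or it is "
    "incomplete . if the account is not updated to current information "
    "within 5 days then, your access to Buy or Sell on eBay will be restricted."
)

# Hypothetical signature patterns, all taken from the first version.
signatures = {
    "spans an inserted word": r"regular verification of accounts",
    "spans a case change": r"information has changed",
    "avoids every change": r"couldn't verify your current information",
}

def matches(pattern):
    """Return (matches first version, matches second version)."""
    return (
        re.search(pattern, first) is not None,
        re.search(pattern, second) is not None,
    )

for label, pattern in signatures.items():
    in_first, in_second = matches(pattern)
    print(f"{label}: first={in_first}, second={in_second}")
```

All three patterns match the first version, but only the last one still matches the second. Case-insensitive matching would rescue the second pattern, but nothing so simple recovers the first, where an extra word has been inserted.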