Concept Parsing Spam Filter

From Computer Tyme Support Wiki

Revision as of 16:32, 18 May 2016 by Marc (Talk | contribs)

This is a spam filtering trick I'm using. It isn't part of SpamAssassin (SA), but it could easily be adapted to SA.

Rather than just scanning for regex strings, it's useful to have a way to tell what things the message is talking about and reduce those to a single token that represents a concept. Then the concepts can be combined to produce rules.

For example, take your typical 419 scam. It will generally have these kinds of characteristics:

dear stranger
i need your information
offers lots of money
dying of something
worships god
bank account
transfer money
reply to me
trust me
africa
united nations
western union

So the idea is to reduce the message to a string of characteristics and then combine those characteristics into rules, where each characteristic by itself is harmless. Here's my format for extracting characteristics.

I create files whose lines are individual regular expressions. The name of the file becomes the name of the token. All these files live in a single directory, and all of them are read and processed. The first line of each file is the number of matching lines required to trigger the token. That way, if there are a few false matches, it's no problem. Here's an example:

File name: lose-weight

3
attractive
breakthrough
burn fat
clinically proven
diet
every woman
exercise
extra weight
fat burn
flab
flabby
formula
jiggly
just days
lose weight
medication
metabolism
natural weight
obesity
overweight
slim
thick legs
tighten up
tummy
weight
weight (rapidly|fast)
weight loss
weight reducing

The first line is the number 3 which means that 3 lines have to match to trigger the token. If 3 lines match then the word "lose-weight" is printed to the output stream.
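As a rough illustration of the threshold idea, here is a minimal Python sketch. It is not the PHP used on this page. The shortened pattern list, the sample messages, and the function name `concept_fires` are all made up for the example.

```python
import re

# Hypothetical file content, shortened from the lose-weight example.
CONCEPT_FILE = """3
burn fat
diet
lose weight
metabolism
weight (rapidly|fast)
"""

def concept_fires(file_text, message):
    """Return True when enough pattern lines match the message."""
    lines = file_text.splitlines()
    threshold = int(lines[0])  # first line: number of matching lines required
    patterns = [ln.strip() for ln in lines[1:] if ln.strip()]
    # Count pattern lines that match anywhere in the message, ignoring case.
    hits = sum(1 for p in patterns if re.search(p, message, re.IGNORECASE))
    return threshold > 0 and hits >= threshold

spam = "Lose weight fast! This diet will burn fat in just days."
ham = "My diet this week was mostly pizza."

print(concept_fires(CONCEPT_FILE, spam))  # four lines match, threshold is 3
print(concept_fires(CONCEPT_FILE, ham))   # only "diet" matches
```

A single stray word such as "diet" in a normal message doesn't trigger the token, which is the point of the count on the first line.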

This next file is named "stranger":

1
my name is
(dear|attn|attention) .{0,10}(friend|stock|IT |Internet|candidate|sirs?|madam|partner|investor|bel$
introdic(e|ing) (myself|ourselves)
(hi|hello) (dear|friend)
i am (a|an)\b
(i am|i'm) (mr.|ms.|mrs)
greetings
my dearest
introduc(e|ing) myself
hi there
\bhi,
hello[,!]
greeting
good day
contacting you
my dear one
contact(ed|ing) you

As you can see, the concept I'm extracting is that they don't know me. Here's my "lots-of-money" file:

the sum of
(billion|million|thousand) .{0,20}(dollars|pounds|euros|usd)
,000 (usd|euros)
gold
this money
(\d\d0,000|\d,\d00.00)
(united state(s)?|us|american) dollars
pounds sterling
british pounds
us(d)?\$
huge amount of money


Some result strings I get from what I have so far:

accountant email-adr friend https investor law lotsofmoney maillist mailto phone-num trust
cialis click css deals drugs email-adr html http maillist mailto optout phone-num regards
click contact css details doitnow email-adr http https marketing optout phone-num price privacy remote-img
claim click css dear email-adr guarantee html http mailto phone-num privacy reply2me security

The code that does this is very simple. I coded it in PHP, but it would be trivial to convert to Perl. Here's the entire program:

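<?php

// Read the whole message from stdin and lowercase it once.
$message = file_get_contents('php://stdin');
$message = strtolower($message);

// Every file in the directory is one concept; its name is the token.
$dir = scandir('/etc/exim/control/content');
foreach ($dir as $file) {
  // Crude filter: skips "." and ".." (and any other name of 1-2 characters).
  if (strlen($file) > 2) {
     Scan($file);
  }
}

function Scan ($file) {
  global $message;

  $reg = file('/etc/exim/control/content/'.$file);
  $count = 0;
  // First line: number of pattern lines that must match to trigger the token.
  $trigger = intval($reg[0]);
  $reg[0] = '';
  foreach ($reg as $regline) {
     $regline = trim($regline);
     // "/" is the regex delimiter, so a pattern containing "/" must escape it.
     if (($regline) and (preg_match('/'.$regline.'/i',$message,$matches))) {
        $count++;
     }
  }
  // Print the token name when enough lines matched.
  if (($trigger > 0 ) and ($count >= $trigger)) {
     echo "$file ";
  }
}

?>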

You could write combination rules or feed these tokens into Bayes to make it self-scoring. I'm throwing them into the AI system I developed a few months ago, which does all the combining and scoring for me. But I think Bayes should have a similar effect.
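As one possible shape for hand-written combination rules, here is a small Python sketch. It is an assumption for illustration, not the AI system mentioned above. The rule names and the token sets in `RULES` are hypothetical, and the two token strings are taken from the sample results on this page.

```python
# Hypothetical rules: each rule lists the tokens that must ALL be present.
RULES = {
    "advance-fee": {"lotsofmoney", "trust", "investor"},
    "pill-spam": {"cialis", "drugs", "click"},
}

def matching_rules(token_string, rules=RULES):
    """Return the sorted names of rules whose required tokens all appear."""
    tokens = set(token_string.split())
    return sorted(name for name, required in rules.items() if required <= tokens)

first = ("accountant email-adr friend https investor law lotsofmoney "
         "maillist mailto phone-num trust")
second = ("cialis click css deals drugs email-adr html http maillist "
          "mailto optout phone-num regards")

print(matching_rules(first))   # ['advance-fee']
print(matching_rules(second))  # ['pill-spam']
```

Each token on its own (say, "click" or "trust") fires nothing here. Only the combination does, which matches the idea that the individual characteristics are harmless.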

Just sharing this in case anyone finds it useful.

Enjoy,

-- Marc Perkel - Sales/Support support@junkemailfilter.com http://www.junkemailfilter.com Junk Email Filter dot com 415-992-3400
