Spam detection (split across several posts, because it is long)

2021.04.05

python


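There is something I want to say here.

Most spam-detection implementations you come across
are fine as far as the classifier software itself goes.
What they leave out, however, is how to build it into an actual mail client (as an add-on).
After all, nobody runs spam detection on mail that has first been converted to plain text.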
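
As a rough illustration of the step those implementations skip, here is a minimal sketch of pulling the text body out of a raw message with Python's standard `email` package. The function name `extract_text` and the sample message are assumptions made up for this example; how a particular mail client hands the raw message to an add-on is not covered here.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage


def extract_text(raw_bytes):
    """Return the text/plain body of a raw message, or '' if it has none."""
    msg = message_from_bytes(raw_bytes, policy=policy.default)
    # get_body() walks multipart messages and returns the preferred part
    body = msg.get_body(preferencelist=("plain",))
    return body.get_content() if body is not None else ""


# Hypothetical sample message, built here only so the sketch is self-contained
sample = EmailMessage()
sample["Subject"] = "無料です"
sample["From"] = "sender@example.com"
sample["To"] = "me@example.com"
sample.set_content("あなたは選ばれました。無料です。")

raw = sample.as_bytes()  # what a mail server would actually deliver
text = extract_text(raw)
print(text)
```

The returned string is the kind of input that a classifier trained on `.txt` files expects, so a function like this would sit between the mail client and the classifier.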

To me, this part is simply

"??? = cutting corners"!!!

Granted, to check this yourself you would probably need to set up
a home server plus mail server (name server included) on CentOS, Fedora, or similar.

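Here is a bonus.
A certain DM arrived in my inbox.
This is what it said:

① You have been selected, with a probability of 0.4%

If I had really been selected, they would state the reason.

② It is free

What do you gain by offering it for free? (What is in it for you?)

③ You can become a billionaire (this is your chance)

This is the kind of wording that hooks people blinded by greed.
The world is made up of those who exploit and those who are exploited.
If I became a billionaire, someone else would end up poor.
Besides, whoever sells that know-how would hardly give it away for free, would they!?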
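
The three phrases from that DM can be turned into a very crude keyword check. The sketch below is only an illustration: the phrase list uses the original Japanese wording of the message, and the names `SUSPICIOUS_PHRASES` and `flag_phrases` are made up for this example. It is not the frequency-based method used in the code further down.

```python
# Phrases taken from the DM described above, in their original Japanese
# wording, each paired with an English note on why it is suspicious.
SUSPICIOUS_PHRASES = {
    "選ばれました": "claims you were selected without giving a reason",
    "無料": "offers something for free",
    "億万長者": "promises you will become a billionaire",
}


def flag_phrases(text):
    """Return the notes for every suspicious phrase that occurs in text."""
    return [note for phrase, note in SUSPICIOUS_PHRASES.items() if phrase in text]


dm = "貴方は0.4%の確率で選ばれました。無料です。億万長者になれます。"
hits = flag_phrases(dm)
for note in hits:
    print(note)
```

A fixed list like this is easy to evade, which is one reason to learn word frequencies from real examples instead.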

Code
# Walk through every text file and build the word database
import glob
import os
import pickle

import MeCab
import numpy as np

# Output file name
savefile = "./ok-spam.pickle"
# Load MeCab
tagger = MeCab.Tagger()
# Prepare the shared state
word_dic = {"__id": 0}  # word dictionary; "__id" holds the next unused ID
files = []  # one entry per file read: its label and its list of word IDs


# Read every text file in the given directory
def read_files(dir_path, label):
    # Get the list of text files
    for path in glob.glob(os.path.join(dir_path, "*.txt")):
        read_file(path, label)


# Read a single file
def read_file(filename, label):
    # Read the file contents as UTF-8
    with open(filename, "rt", encoding="utf-8") as f:
        text = f.read()
    files.append({"label": label, "words": text_to_ids(text)})


# Convert text into a list of word IDs
def text_to_ids(text):
    # Run morphological analysis
    word_s = tagger.parse(text)
    words = []
    # Register each word in the dictionary
    for line in word_s.split("\n"):
        if line == "EOS" or line == "":
            continue
        word, _, feature = line.partition("\t")
        params = feature.split(",")
        if len(params) < 2:
            continue
        hinsi = params[0]  # part of speech
        hinsi2 = params[1]  # part-of-speech subcategory
        # Base form of the word. This assumes IPAdic-style output, where the
        # base form is the 7th field; fall back to the surface form when the
        # field is missing or "*" (unknown word).
        org = params[6] if len(params) > 6 and params[6] != "*" else word
        # Discard particles, auxiliary verbs, symbols, and numbers
        if hinsi not in ["名詞", "動詞", "形容詞"]:
            continue
        if hinsi == "名詞" and hinsi2 == "数":
            continue
        # Convert the word to its ID
        words.append(word_to_id(org))
    return words


# Convert a word to its ID
def word_to_id(word):
    # Is the word already in the dictionary?
    if word not in word_dic:
        # Not registered yet, so assign a new ID
        word_id = word_dic["__id"]
        word_dic["__id"] += 1
        word_dic[word] = word_id
    else:
        # Return the existing word ID
        word_id = word_dic[word]
    return word_id


# Build the word-frequency data for all files
def make_freq_data_allfiles():
    y = []
    x = []
    for f in files:
        y.append(f["label"])
        x.append(make_freq_data(f["words"]))
    return y, x


def make_freq_data(words):
    # Count how many times each word occurs
    dat = np.zeros(word_dic["__id"], "float")
    for w in words:
        dat[w] += 1
    # Convert counts to relative frequencies (an empty file stays all zeros)
    if len(words) > 0:
        dat = dat / len(words)
    return dat


# Build the training database from the file lists
if __name__ == "__main__":
    read_files("ok", 0)
    read_files("spam", 1)
    y, x = make_freq_data_allfiles()
    # Save the data to a file
    with open(savefile, "wb") as fp:
        pickle.dump([y, x, word_dic], fp)
    print("ok")
ok
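
To see what the script computes without installing MeCab, here is a small standalone sketch of the same steps: filter by part of speech, map base forms to IDs, and turn counts into frequencies. `SAMPLE` is a hand-written string that imitates MeCab's IPAdic-style output and is an assumption for illustration; the helper names `parse_to_ids` and `to_freq` are also made up here, and the sketch uses `len(word_dic)` as the next ID instead of the `"__id"` counter.

```python
# Hand-written imitation of MeCab output (IPAdic-style layout, assumed):
# surface form, a tab, then comma-separated features; field 7 is the base form.
SAMPLE = (
    "無料\t名詞,一般,*,*,*,*,無料,ムリョウ,ムリョー\n"
    "です\t助動詞,*,*,*,特殊・デス,基本形,です,デス,デス\n"
    "無料\t名詞,一般,*,*,*,*,無料,ムリョウ,ムリョー\n"
    "で\t助詞,格助詞,一般,*,*,*,で,デ,デ\n"
    "提供\t名詞,サ変接続,*,*,*,*,提供,テイキョウ,テイキョー\n"
    "する\t動詞,自立,*,*,サ変・スル,基本形,する,スル,スル\n"
    "EOS\n"
)

KEEP = ("名詞", "動詞", "形容詞")  # nouns, verbs, adjectives


def parse_to_ids(mecab_output, word_dic):
    """Keep content words and map each base form to an integer ID."""
    ids = []
    for line in mecab_output.split("\n"):
        if line == "EOS" or line == "":
            continue
        _surface, features = line.split("\t")
        params = features.split(",")
        if params[0] not in KEEP:
            continue
        if params[0] == "名詞" and params[1] == "数":  # skip numerals
            continue
        base = params[6]
        if base not in word_dic:
            word_dic[base] = len(word_dic)  # next unused ID
        ids.append(word_dic[base])
    return ids


def to_freq(ids, vocab_size):
    """Turn a list of word IDs into relative frequencies per ID."""
    counts = [0.0] * vocab_size
    for i in ids:
        counts[i] += 1
    return [c / len(ids) for c in counts] if ids else counts


word_dic = {}
ids = parse_to_ids(SAMPLE, word_dic)
freq = to_freq(ids, len(word_dic))
print(ids)   # [0, 0, 1, 2]
print(freq)  # [0.5, 0.25, 0.25]
```

The particle and the auxiliary verb are dropped, the two occurrences of 無料 share ID 0, and the frequencies sum to 1. Each file in the real script becomes one such vector, paired with its label (0 for ok, 1 for spam).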