Overview

Detecting advertising and sensitive words has always been a vexing problem, and simply maintaining a list of sensitive words does not solve it. Ban one word today and a new variant appears tomorrow; the supply is more endless than the descendants of the Foolish Old Man in the fable.

Sensitive-word matching works reasonably well semantically in some cases, but it is not suited to high-concurrency, high-QPS services. A while ago I came across spam detection based on the Bayesian classification algorithm. The accuracy of this "learning" approach depends on the accuracy of the prior probabilities, and the banned-word list the company has compiled over a long time is a very good source of training data. As the Bayesian training data grows, classification becomes more and more accurate, and afterwards we only need to add entries to the banned-word files to keep detection convenient and accurate.
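
To make the idea concrete, here is a minimal sketch of naive Bayes classification over word lists. It is a toy written for illustration only: the naiveBayes type, its methods, the add-one smoothing, and the sample words are all my assumptions, not the API or internals of the library used later in this post.

```go
package main

import (
	"fmt"
	"math"
)

// naiveBayes is a toy multinomial naive Bayes model (hypothetical type,
// not the API of any library mentioned in this post).
type naiveBayes struct {
	wordCounts map[string]map[string]int // class -> word -> count
	totalWords map[string]int            // class -> number of words learned
	docCounts  map[string]int            // class -> number of learn calls
	vocab      map[string]struct{}       // every distinct word seen
	totalDocs  int
}

func newNaiveBayes() *naiveBayes {
	return &naiveBayes{
		wordCounts: map[string]map[string]int{},
		totalWords: map[string]int{},
		docCounts:  map[string]int{},
		vocab:      map[string]struct{}{},
	}
}

// learn records one training example (a bag of words) for a class.
func (nb *naiveBayes) learn(class string, words ...string) {
	if nb.wordCounts[class] == nil {
		nb.wordCounts[class] = map[string]int{}
	}
	for _, w := range words {
		nb.wordCounts[class][w]++
		nb.totalWords[class]++
		nb.vocab[w] = struct{}{}
	}
	nb.docCounts[class]++
	nb.totalDocs++
}

// classify returns the class with the highest log posterior:
// log P(class) + sum of log P(word | class), with add-one smoothing
// so that unseen words do not zero out a class.
func (nb *naiveBayes) classify(words ...string) string {
	best, bestScore := "", math.Inf(-1)
	for class, docs := range nb.docCounts {
		score := math.Log(float64(docs) / float64(nb.totalDocs))
		for _, w := range words {
			num := float64(nb.wordCounts[class][w] + 1)
			den := float64(nb.totalWords[class] + len(nb.vocab))
			score += math.Log(num / den)
		}
		if score > bestScore {
			best, bestScore = class, score
		}
	}
	return best
}

func main() {
	nb := newNaiveBayes()
	nb.learn("normal", "hello", "thanks", "meeting", "report")
	nb.learn("forbidden", "casino", "lottery", "jackpot", "prize")

	fmt.Println(nb.classify("win", "the", "jackpot", "prize"))
	fmt.Println(nb.classify("thanks", "for", "the", "report"))
}
```

The more words each class has learned, the better the per-class word frequencies are estimated, which is why a long-maintained banned-word list is valuable as training data.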

PHP cannot make good use of memory for Bayesian classification. Each request is handled in its own process, independently of every other request, so every request would have to rebuild the Bayesian training data set from scratch. The resulting inefficiency is easy to predict, so I did not plan to implement it in PHP.

Go has a reputation for being fast, so I used it. The question then is how to make a detection service written in Go work as a backend for PHP. Interprocess communication generally takes one of the following forms:

  • http
  • rpc
  • unix domain socket
  • pipe

After reading the article at blog.csdn.net/lengyue…, I decided to use a Unix domain socket. After all, Nginx and PHP-FPM commonly communicate this way, so the efficiency should not be bad.
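
As a quick illustration of the transport itself, the sketch below does one request/response round trip over a Unix domain socket using only the Go standard library. The function names (roundTrip, demo), the "echo: " reply, and the temporary socket path are assumptions made for the example; the real service is shown in the implementation section.

```go
package main

import (
	"fmt"
	"log"
	"net"
	"os"
	"path/filepath"
)

// roundTrip starts a one-shot echo server on a Unix domain socket at path,
// sends msg from a client, and returns the server's reply.
func roundTrip(path, msg string) (string, error) {
	ln, err := net.Listen("unix", path)
	if err != nil {
		return "", err
	}
	// Closing a Unix listener created by net.Listen also removes the socket file.
	defer ln.Close()

	// Server side: accept one connection, read the request, reply once.
	go func() {
		conn, err := ln.Accept()
		if err != nil {
			return
		}
		defer conn.Close()
		buf := make([]byte, 1024)
		n, err := conn.Read(buf)
		if err != nil {
			return
		}
		conn.Write(append([]byte("echo: "), buf[:n]...))
	}()

	// Client side: connect, send the message, read the reply.
	conn, err := net.Dial("unix", path)
	if err != nil {
		return "", err
	}
	defer conn.Close()
	if _, err := conn.Write([]byte(msg)); err != nil {
		return "", err
	}
	reply := make([]byte, 1024)
	n, err := conn.Read(reply)
	if err != nil {
		return "", err
	}
	return string(reply[:n]), nil
}

// demo runs roundTrip against a socket file in a fresh temporary directory.
func demo() (string, error) {
	dir, err := os.MkdirTemp("", "sock")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)
	return roundTrip(filepath.Join(dir, "demo.sock"), "hello")
}

func main() {
	reply, err := demo()
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(reply)
}
```

Both ends address the socket by a file path rather than a host and port, which is why this form only works between processes on the same machine.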

Implementation

Golang backend

package main

import (
	"bufio"
	"fmt"
	"io"
	"log"
	"net"
	"os"
	"strings"

	nbclassifier "github.com/ajph/nbclassifier-go"
	"github.com/yanyiwu/gojieba"
)

const SPAM_CHECK_SOCKET_FILE = "/tmp/spamcheck.sock"

// A simple Bayesian classification service implemented in Go.

// getWords reads a word list from filepath, one entry per line.
func getWords(filepath string) []string {
	file, err := os.Open(filepath)
	if err != nil {
		log.Fatal(err)
	}
	defer file.Close()

	reader := bufio.NewReader(file)
	ret := []string{}
	for {
		line, err := reader.ReadString('\n')
		line = strings.TrimSpace(line)
		if line != "" {
			fmt.Println("processing words: " + line)
			ret = append(ret, line)
		}
		if err != nil {
			if err != io.EOF {
				log.Fatal(err)
			}
			break // end of file
		}
	}
	return ret
}

// learn trains one class per word file and saves the model to disk.
func learn() {
	m := nbclassifier.New()

	m.NewClass("normal")
	normalwords := getWords("normalwords.txt")
	//fmt.Println(normalwords)
	m.Learn("normal", normalwords...)
	//m.Learn("normal", "a", "need")

	m.NewClass("forbidden")
	forbiddenwords := getWords("forbiddenwords.txt")
	//fmt.Println(forbiddenwords)
	m.Learn("forbidden", forbiddenwords...)
	//m.Learn("forbidden", "design", "banner", "picture", "logo", "clip art", "ad", "clipart", "hairstyles", "drawing", "rendering", "diagram", "poster", "изображение")

	m.NewClass("terror")
	terrorwords := getWords("terrorwords.txt")
	//fmt.Println(terrorwords)
	m.Learn("terror", terrorwords...)
	//m.Learn("terror", "...", "0", "1", "2", "3", "4", "5", "6", "7", "8", "9", "...", "image", "pinterest", ".c", "ltd.", "vector", "quote", "video", "search", "?", "click", "psd", "ai", "print", "file", "related", "download", "submit", "view", "buy", "how", "maker", "online", "on", "by")

	m.SaveToFile("materiel.json")
}

// reloadModel loads the trained model back from disk.
func reloadModel() *nbclassifier.Model {
	model, err := nbclassifier.LoadFromFile("materiel.json")
	if err != nil {
		log.Fatal(err)
	}
	//fmt.Println(model.Classes[0].Items[0])
	//fmt.Println(model.Classes[1])
	//fmt.Println(model.Classes[2])
	return model
}

// match segments content with jieba and returns the name of the matching class.
func match(model *nbclassifier.Model, content string) string {
	// Segment the text into words.
	jieba := gojieba.NewJieba()
	defer jieba.Free()
	words := jieba.Cut(content, true)

	result := "normal"
	cls, unsure, err := model.Classify(words...)
	if err != nil {
		return result
	}
	fmt.Println("Detection classifies as: " + cls.Id)
	// Only trust the classifier when it is sure; otherwise fall back to "normal".
	if !unsure {
		result = cls.Id
		fmt.Println(cls, unsure)
	}
	return result
}

func run() {
	// Remove a stale socket file from a previous run; otherwise Listen fails.
	os.Remove(SPAM_CHECK_SOCKET_FILE)
	socket, err := net.Listen("unix", SPAM_CHECK_SOCKET_FILE)
	if err != nil {
		log.Fatal(err)
	}
	defer os.Remove(SPAM_CHECK_SOCKET_FILE)

	// Train once at startup and keep the model in memory for every request.
	learn()
	model := reloadModel()

	for {
		client, err := socket.Accept()
		if err != nil {
			log.Println(err)
			continue
		}
		buf := make([]byte, 1024)
		datalength, _ := client.Read(buf)
		data := buf[:datalength]
		fmt.Println("client msg:" + string(data))

		checkret := match(model, string(data))
		fmt.Println("check result: " + checkret)

		response := []byte("")
		if len(checkret) > 0 {
			response = []byte(checkret)
		}
		_, _ = client.Write(response)
		client.Close()
	}
}

func main() {
	// Start the socket and the check service.
	run()
	// fmt.Println(reloadModel())
}

PHP front end

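<?php
// Text to check.
$msg = "you lie, you fart, you fool";
$SOCKET_FILE = "/tmp/spamcheck.sock";

// Connect to the Go service over the Unix domain socket.
$socket = socket_create(AF_UNIX, SOCK_STREAM, 0);
if ($socket === false || !socket_connect($socket, $SOCKET_FILE)) {
    die("connect failed: " . socket_strerror(socket_last_error()) . "\n");
}

// Send the text, then read back the class name chosen by the classifier.
socket_send($socket, $msg, strlen($msg), 0);
$response = socket_read($socket, 1024);
socket_close($socket);
var_dump($response);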

Test

Start the service

Source code

Github.com/guoruibiao….

Conclusion

So far, the socket approach still has some limitations. It currently works only on a single machine, and there is room for further optimization. I'll leave it here for now and follow up later.
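
One possible direction for that follow-up, sketched under my own assumptions: if the accept loop is written against the net.Listener interface, the same code can serve a Unix domain socket or a TCP port, and each connection can be handled in its own goroutine. The names serve, check, demo, and stubClassify are hypothetical, and stubClassify is only a placeholder for the Bayesian match step, not part of the service in this post.

```go
package main

import (
	"fmt"
	"log"
	"net"
	"strings"
)

// stubClassify is a placeholder for the Bayesian match step (illustration only).
func stubClassify(text string) string {
	if strings.Contains(text, "casino") {
		return "forbidden"
	}
	return "normal"
}

// serve answers requests from any net.Listener ("unix" or "tcp"),
// handling each connection in its own goroutine.
func serve(ln net.Listener, classify func(string) string) {
	for {
		conn, err := ln.Accept()
		if err != nil {
			return // listener was closed
		}
		go func(c net.Conn) {
			defer c.Close()
			buf := make([]byte, 1024)
			n, err := c.Read(buf)
			if err != nil {
				return
			}
			c.Write([]byte(classify(string(buf[:n]))))
		}(conn)
	}
}

// check sends one message to the service and returns its verdict.
func check(network, addr, msg string) (string, error) {
	conn, err := net.Dial(network, addr)
	if err != nil {
		return "", err
	}
	defer conn.Close()
	if _, err := conn.Write([]byte(msg)); err != nil {
		return "", err
	}
	buf := make([]byte, 1024)
	n, err := conn.Read(buf)
	if err != nil {
		return "", err
	}
	return string(buf[:n]), nil
}

// demo starts the service on a loopback TCP port picked by the OS
// and queries it with two messages.
func demo() ([]string, error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return nil, err
	}
	defer ln.Close()
	go serve(ln, stubClassify)

	var verdicts []string
	for _, msg := range []string{"visit our casino", "see you at lunch"} {
		v, err := check("tcp", ln.Addr().String(), msg)
		if err != nil {
			return nil, err
		}
		verdicts = append(verdicts, v)
	}
	return verdicts, nil
}

func main() {
	verdicts, err := demo()
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(verdicts)
}
```

Switching the first argument of net.Listen between "unix" and "tcp" is the only change on the Go side; the PHP client would need the matching socket family.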