System overview

  • Channel monitoring aims to collect information related to an App by crawling various channels such as application markets, network disks, forums and post bars (Tieba), distinguish genuine copies from pirated ones by analyzing the collected information, and issue statistical analysis reports that help App developers monitor piracy across application markets, posts, forums and network disks.

  • Here are the specific function points:

  1. The user uploads the APP to the channel monitoring system through the back end.
  2. After obtaining the uploaded APP, the back end processes it to extract the application signature information, application name, application package name, and application file structure (a small sketch of this step follows after this list).
  3. Based on the obtained application information, Internet channels are crawled using web crawler technology.
  4. Real-time crawling of 428 domestic application markets, network disks, 70 developer forums and developer communities, 78 client-security-related Tieba (post bars) and other channels. (in progress...)
  5. The crawled data is stored uniformly in the database.
  6. One-click analysis of suspected pirated applications to discover malicious code injected into them and resources that have been modified. (in progress...)
  7. Support the removal of suspected pirated applications. (Follow-up plan)
  8. Generate reports summarizing the channel data monitored, the suspected pirated applications discovered, and the takedown process and results for those applications.
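
As a rough illustration of point 2: an APK is just a ZIP archive, so the file structure and a signing-certificate fingerprint can already be pulled out with the Python standard library. This is only a sketch; the real backend would also parse AndroidManifest.xml (for example with aapt or androguard) to get the package name and application name, and sample.apk is a placeholder path.

import hashlib
import zipfile

def describe_apk(apk_path):
    """Collect the file structure and a signing-certificate fingerprint of an APK."""
    info = {"files": [], "cert_sha1": None}
    with zipfile.ZipFile(apk_path) as apk:
        # The application file structure is simply the list of entries in the archive.
        info["files"] = apk.namelist()
        # The signing certificate lives under META-INF/*.RSA (or *.DSA / *.EC);
        # its hash can serve as a signature fingerprint for later comparison.
        for name in info["files"]:
            if name.startswith("META-INF/") and name.upper().endswith((".RSA", ".DSA", ".EC")):
                info["cert_sha1"] = hashlib.sha1(apk.read(name)).hexdigest()
                break
    return info

if __name__ == "__main__":
    print(describe_apk("sample.apk"))  # placeholder path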

The functionality is still being improved step by step...

  • Attached is a rendering:


    rendering

Mind mapping


Channel monitoring mind map

The part circled in red above is the BBS crawler system I want to introduce today.

Implementation of the BBS crawler system

The BBS crawler system is built on PySpider, an open source crawler system on GitHub. It has more than 10,000 stars and more than 2,600 forks, so it is a relatively high-quality open source project. I'll start with a brief overview of PySpider's features and architectural design:

An introduction to PySpider


The introduction of pyspider

The screenshot above is from the documentation on the PySpider website. PySpider is written in Python. It supports a visual interface for online editing and debugging of crawler scripts, real-time monitoring of crawler tasks, distributed deployment, and common databases such as MySQL, SQLite, Elasticsearch and MongoDB for result storage. Its distributed deployment is built on message middleware; Redis, RabbitMQ and other brokers can be used as the message queue. In addition, it supports retrying crawler tasks, rate limiting through a token bucket, configurable task priorities, expiration times for crawler tasks, and so on. Many of these features are things any general-purpose crawler system should have. If we wrote a crawler system from scratch we would have to solve all of these problems ourselves, which is time-consuming and laborious; it is better to stand on the shoulders of giants and adapt the existing wheel to our own track.
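
The token-bucket rate limiting mentioned above is worth a quick illustration. The following is a minimal, self-contained Python sketch of the idea, not PySpider's actual implementation:

import time

class TokenBucket:
    """Allow at most `rate` requests per second, with bursts up to `capacity`."""

    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.time()

    def allow(self):
        now = time.time()
        # Refill tokens according to the elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(rate=2, capacity=5)  # at most ~2 fetches per second on average
for url in ["http://example.com/page/%d" % i for i in range(10)]:
    while not bucket.allow():
        time.sleep(0.1)  # wait until a token is available
    print("fetch", url)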

Pyspider architecture design

  • Pyspider crawlers can be divided into the following core components:
  1. Fetcher – fetches Internet resources by URL and downloads the HTML content. The Fetcher is implemented with asynchronous I/O and supports a large number of concurrent fetches. The bottleneck is mainly I/O overhead and IP resources: if a single IP crawls too fast, it easily trips the forum's anti-crawler system and gets blocked. An IP proxy pool can be used to work around the per-IP limit; if no proxy pool is available, PySpider's rate-limiting mechanism can be used to stay below the anti-crawler threshold. Supports multi-node deployment.

  2. Processor – runs the crawler script we wrote, for example extracting links in a page, extracting page-turning links, extracting detailed information from a page, and so on. This part mainly consumes CPU resources. Supports multi-node deployment.

  3. WebUI – a visual interface that supports online editing and debugging of crawler scripts as well as real-time online monitoring of crawler tasks; crawler tasks can be started, stopped, deleted, and rate-limited from the page. Supports multi-node deployment.

  4. Scheduler – the task scheduling component. Each URL corresponds to a task; the Scheduler is responsible for distributing tasks, enqueueing new URLs, and coordinating the other components through message queues. Only one node can be deployed.

  5. ResultWorker – the component that writes out results. It supports custom result handling; for example, we implemented a custom result writer backed by RDS (a sketch follows below). The default output is SQLite, which can be exported to Excel, JSON, etc.
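
As mentioned in point 5, result writing can be customized. A minimal sketch, assuming PySpider's on_result hook in the handler and a hypothetical save_to_rds helper that wraps our RDS/MySQL client (not shown here):

from pyspider.libs.base_handler import *

class Handler(BaseHandler):
    def detail_page(self, response):
        return {"url": response.url, "title": response.doc('title').text()}

    def on_result(self, result):
        # Called for every result; write it to our own storage instead of
        # relying only on the default SQLite resultdb.
        if result:
            save_to_rds(result)  # hypothetical helper wrapping the RDS/MySQL client
        super(Handler, self).on_result(result)  # keep the default behaviour as well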

  • Architecture diagram


    Pyspider architecture

  • The PySpider webui looks like this


webui

  • The script we’re going to write
from pyspider.libs.base_handler import *


class Handler(BaseHandler):
    # Global settings; configure corresponding countermeasures here when anti-crawling is an issue (such as proxy, UA, Cookies, render, ...)
    crawl_config = {}

    @every(minutes=24 * 60)
    def on_start(self):
        # seed urls
        self.crawl('http://scrapy.org/', callback=self.index_page)

    @config(age=10 * 24 * 60 * 60)
    def index_page(self, response):
        for each in response.doc('a[href^="http"]').items():
            self.crawl(each.attr.href, callback=self.detail_page)

    def detail_page(self, response):
        # result
        return {
            "url": response.url,
            "title": response.doc('title').text(),
        }

For more documentation on Pyspider, see the Pyspider documentation

Application of Pyspider in BBS crawler system


The flow chart

The steps of the flowchart are described as follows:

  • Step 1-1: After the user writes, debugs and saves the script on the WebUI, clicking Run starts the crawler script. The WebUI invokes the Scheduler's new_task method via XML-RPC to create a crawler task. (We wrapped PySpider's script-creation process with Java, so that a crawler script is created automatically from the APK uploaded by the user.)

  • Step 2: After the Scheduler receives the crawler task (described as a JSON string) over XML-RPC, it runs some bookkeeping such as updating the project status, priority and timestamps. Once this logic finishes, the URL is sent to the Fetcher via the send_task method.

  • Steps 3 and 4: The Fetcher fetches the HTML page using asynchronous I/O and sends the result to the Processor.

  • Step 5: After receiving the HTML page from the Fetcher, the Processor calls the user's index_page method, which extracts the hyperlinks in the page and registers the detail_page callback for them. The Processor sends the parsed result to the ResultWorker (via a queue based on the NCR list implementation).

  • Steps 6 and 7: The ResultWorker component retrieves the result from NCR and writes it into the RDS database.

  • Step 1-2: index_page in the Processor extracts both the detail information of the page and the page-turning links, and sends the page-turning links back to the Scheduler, forming a closed loop (see the sketch below).
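
To make steps 1-2 and 5 concrete, here is a minimal sketch of a crawl script whose index_page extracts both the detail links and the page-turning link and hands the latter back through self.crawl, which is what closes the loop with the Scheduler. The CSS selectors are placeholders; a real forum needs its own.

from pyspider.libs.base_handler import *

class ForumHandler(BaseHandler):
    def index_page(self, response):
        # Detail pages: the individual posts listed on the current board page.
        for each in response.doc('a.post-title').items():  # placeholder selector
            self.crawl(each.attr.href, callback=self.detail_page)
        # Page-turning link: sending it back to the Scheduler forms the closed loop.
        next_page = response.doc('a.next-page').attr.href  # placeholder selector
        if next_page:
            self.crawl(next_page, callback=self.index_page)

    def detail_page(self, response):
        # Detail information extracted from one post.
        return {"url": response.url, "title": response.doc('title').text()}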

Deployment structure diagram


Deployment diagram

  • The Fetcher is deployed on the extranet to prevent anti-crawler systems from tracing our source IPs. Seven Fetcher nodes are deployed online. The Fetcher is mainly responsible for HTTP access, fetching HTML pages with asynchronous I/O.
  • The Processor is deployed on the intranet; two nodes are deployed online.
  • The Scheduler is deployed on the intranet; one node is deployed online, on the same machine as the WebUI.
  • The ResultWorker is deployed on the intranet; two nodes are deployed.
  • The WebUI is deployed on the intranet; only two nodes are deployed, and HTTP Auth is used to restrict access to the pages.
  • The Fetcher, Processor, Scheduler and ResultWorker all communicate with each other through Redis queues (based on lists). The Scheduler is the control center and is responsible for distributing crawler tasks. One crawled URL is one task, and the task object contains a task_id, which by default is the MD5 of the URL and is used for URL de-duplication. Each task has a default priority; priorities can be customized for the index_page and detail_page methods (for example via @config(priority=...)). By customizing priorities we can make the page traversal depth-first or breadth-first (see the sketch below).
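
The last two points, MD5-based task de-duplication and priority control, look roughly like this in a crawl script. This is a sketch assuming PySpider's documented get_taskid hook and @config(priority=...) option; higher priority values are scheduled earlier, so giving detail pages the higher priority pushes the crawl toward depth-first, while giving listing pages the higher priority pushes it toward breadth-first.

from pyspider.libs.base_handler import *
from pyspider.libs.utils import md5string

class Handler(BaseHandler):
    def get_taskid(self, task):
        # The default behaviour made explicit: one task per URL, identified by
        # the MD5 of the URL, which is what de-duplicates repeated URLs.
        return md5string(task['url'])

    @config(priority=1)
    def index_page(self, response):
        # Listing pages: lower priority, expanded later.
        for each in response.doc('a[href^="http"]').items():
            self.crawl(each.attr.href, callback=self.detail_page)

    @config(priority=9)
    def detail_page(self, response):
        # Detail pages: higher priority, fetched before more listings are expanded.
        return {"url": response.url, "title": response.doc('title').text()}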

The BBS piracy comparison process


Automatic login and reply

The BBS crawler is written in Python, while the logic that runs after an HTML page has been crawled is written in Java. The results of the PySpider crawler are inserted into the database, and a Java scheduled task reads the database records for subsequent processing.

  1. Register the automatic login and reply module in the service registry. Automatic login and reply is implemented with HtmlUnit or WebDriver + headless Chrome. The captcha recognition service can recognize Chinese, English, digits and question-style captchas; sliding captchas are not supported, and at present a Chrome plug-in is used to manually copy the cookie content to work around them. After a successful login, the cookie content is saved for subsequent automatic replies. Because cookies expire, their validity is checked periodically and they are refreshed in time.

  2. A Java scheduled task starts reading the crawler result records produced by PySpider.

  3. Call the automatic login and reply module based on the instanceName field in the database. The instanceName field ensures that each crawled result page is matched with the cookie of its own BBS, so that the automatic reply works correctly. After a successful automatic reply, the replied page is inserted into RDS for subsequent processing.

  4. Read the replied pages from RDS and traverse each page's nodes, attributes and so on looking for download links. This traversal is fairly complex and requires handling all kinds of exceptions and irregularities. Extracted links may be net-disk links, attachment links and so on. Most forum attachments are Baidu web disk short links, and many require an extraction code; we have automated this process. The Baidu web disk link and the corresponding extraction code are extracted from the post page, and then NodeJS + PhantomJS is used to obtain the real download address. Because Baidu web disk throttles download speed, our downloader also supports resumable (breakpoint) downloads and has higher fault tolerance.

  5. Classify the download links. If a link is a Baidu web disk link, call the Baidu web disk helper classes to obtain the real download address of the file in the web disk; single and batch downloads are supported. If it is already an actual download address, download it directly.

  6. Determine genuine vs. pirated using the comparison algorithm (a toy sketch follows after this list).

  7. Insert the comparison results into the database and then run the statistical analysis.
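
Our actual comparison algorithm is out of scope for this article, but as a toy illustration of step 6, the following sketch compares a downloaded APK against the genuine one by per-file hashes, flagging added files (possible injected code) and modified files (possible modified resources). The paths and the idea of a simple similarity ratio are placeholders, not the real rules.

import hashlib
import zipfile

def file_hashes(apk_path):
    """Map every entry in the APK (a ZIP archive) to the SHA-1 of its content."""
    with zipfile.ZipFile(apk_path) as apk:
        return {name: hashlib.sha1(apk.read(name)).hexdigest() for name in apk.namelist()}

def compare(genuine_apk, suspect_apk):
    genuine, suspect = file_hashes(genuine_apk), file_hashes(suspect_apk)
    same = sum(1 for name, digest in suspect.items() if genuine.get(name) == digest)
    added = [name for name in suspect if name not in genuine]        # possibly injected code/resources
    modified = [name for name in suspect
                if name in genuine and genuine[name] != suspect[name]]  # possibly modified resources
    similarity = same / float(len(genuine) or 1)
    return {"similarity": similarity, "added": added, "modified": modified}

if __name__ == "__main__":
    print(compare("genuine.apk", "suspect.apk"))  # placeholder paths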

Extract the real download address of Baidu web disk


Extract the web disk link

  • After the Java scheduled task obtains the replied posts, it parses and traverses the HTML element nodes and contents, extracting the web disk links and the corresponding extraction codes. Once the link and extraction code are available, PhantomJS executes JS on the page to collect the information needed to resolve the web disk link; this may involve captcha recognition, because Baidu web disk limits how many times the real download address can be extracted from the same link within a short period (after more than three extractions a captcha must be entered). After all the required information is obtained (such as token, logid, etc.), Java sends an AJAX request with these parameters to obtain the real download address (an invocation sketch follows after the script below). The downloaded file may be a compressed package or a single APK file, depending on whether the share contains multiple files or a single file.

  • The JS script that extracts the Baidu web disk information:

var page = require('webpage').create(), stepIndex = 0, loadInProgress = false;
var fs = require('fs');
var system = require('system');

page.viewportSize = {
  width: 480,
  height: 800
};

// Command-line arguments
var args = system.args;
var codeVal = (args[2] === 'null') ? "" : args[2], loadUrl = args[1];
console.log('codeVal=' + codeVal + '; loadUrl=' + loadUrl);

// Parameters for running the script directly (debugging)
// var codeVal = "nd54";
// var loadUrl = "https://pan.baidu.com/s/1sltxlYP";

// phantom.setProxy('116.62.112.142','16816', 'http', 'jingxuan2046', 'p4ke0xy1');
// Configuration
page.settings.userAgent = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.78 Safari/537.36';
page.settings.resourceTimeout = 5000;

// Callback functions
page.onConsoleMessage = function (msg) {
  console.log(msg);
};

page.onResourceRequested = function (request) {
  //console.log('Request Faild:' + JSON.stringify(request, undefined, 4));
};

page.onError = function (msg, trace) {
  var msgStack = ['PHANTOM ERROR: ' + msg];
  if (trace && trace.length) {
    msgStack.push('TRACE:');
    trace.forEach(function (t) {
      msgStack.push(
          ' -> ' + (t.file || t.sourceURL) + ': ' + t.line + (t.function
              ? ' (in function ' + t.function + ')' : ''));
    });
  }
  console.error(msgStack.join('\n'));
  // phantom.exit(1);
};

// Variable definitions
var baiDuObj;
var steps = [
  function () {
    page.clearCookies(); // Clear cookies before each request

    if (loadUrl === null || loadUrl.length == 0) {
      console.error('loadUrl must not be an empty string!');
      phantom.exit();
      return;
    }

    // Render the page
    console.log("Loading page... loadUrl= " + loadUrl);

    // Load the page once first to obtain the necessary cookies, then navigate to the share URL; otherwise a 403 is returned
    page.open(loadUrl, function (status) {
      console.log('status=' + status);
      setTimeout(function () {
        page.evaluate(function (loadUrl) {
          // console.log(document.cookie)
          window.location.href = loadUrl;
        }, loadUrl);
      }, 500)
    });
  },

  function () {
    // page.render('step1.png');

    var currentUrl = page.url;

    console.log("currentUrl=" + currentUrl);
    console.log("codeVal=" + codeVal);

    if (currentUrl === null || currentUrl === "") {
      console.log('The current URL is empty, exiting the script...');
      phantom.exit(1);
      return;
    }

    // If the extraction code is empty, there is nothing more to enter
    if (codeVal === null || codeVal.length == 0 || codeVal === "") {
      console.log('This share does not require an extraction code...');
      return;
    }

    // Automatically enter the extraction code
    page.evaluate(function (codeVal) {
      // If the requested page has no accessCode element, do not continue
      var accessCodeEle = document.getElementsByTagName('input').item(0);
      console.log(accessCodeEle);
      if (accessCodeEle === null) {
        console.info("The page has no accessCode element... " + accessCodeEle);
      } else {
        accessCodeEle.value = codeVal;
        var element = document.getElementsByClassName('g-button').item(0);
        console.log(element);
        var event = document.createEvent("MouseEvents");
        event.initMouseEvent(
            "click", // 事件类型
            true,
            true,
            window,
            1,
            0, 0, 0, 0, // 事件的坐标
            false, // Ctrl键标识
            false, // Alt键标识
            false, // Shift键标识
            false, // Meta键标识
            0, // Mouse左键
            element); // 目标元素

        element.dispatchEvent(event);

        // Click to submit the extraction code, then jump to the download page
        element.click();
      }
    }, codeVal);
  },
  function () {
    // page.render('step2.png');

    page.includeJs('https://cdn.bootcss.com/jquery/1.12.4/jquery.min.js',
        function () {
          baiDuObj = page.evaluate(function () {
            var yunData = window.yunData;
            var cookies = document.cookie;
            var panAPIUrl = location.protocol + "//" + location.host + "/api/";
            var shareListUrl = location.protocol + "//" + location.host
                + "/share/list";

            // Variable definitions
            var sign, timestamp, logid, bdstoken, channel, shareType, clienttype, encrypt, primaryid, uk, product, web, app_id, extra, shareid, is_single_share;
            var fileList = [], fidList = [];
            var vcode; // captcha

            // Initialize parameters
            function initParams() {
              shareType = getShareType();
              sign = yunData.SIGN;
              timestamp = yunData.TIMESTAMP;
              bdstoken = yunData.MYBDSTOKEN;
              channel = 'chunlei';
              clienttype = 0;
              web = 1;
              app_id = 250528;
              logid = getLogID();
              encrypt = 0;
              product = 'share';
              primaryid = yunData.SHARE_ID;
              uk = yunData.SHARE_UK;
              shareid = yunData.SHARE_ID;
              is_single_share = isSingleShare();

              if (shareType == 'secret') {
                extra = getExtra();
              }

              if (is_single_share) {
                var obj = {};
                if (yunData.CATEGORY == 2) {
                  obj.filename = yunData.FILENAME;
                  obj.path = yunData.PATH;
                  obj.fs_id = yunData.FS_ID;
                  obj.isdir = 0;
                } else {
                  obj.filename = yunData.FILEINFO[0].server_filename;
                  obj.path = yunData.FILEINFO[0].path;
                  obj.fs_id = yunData.FILEINFO[0].fs_id;
                  obj.isdir = yunData.FILEINFO[0].isdir;
                }
                fidList.push(obj.fs_id);
                fileList.push(obj);
              } else {
                fileList = getFileList();
                $.each(fileList, function (index, element) {
                  fidList.push(element.fs_id);
                });
              }
            }

            // Determine the share type (public or secret)
            function getShareType() {
              return yunData.SHARE_PUBLIC === 1 ? 'public' : 'secret';
            }

            // Determine whether this is a single-file share or a folder / multi-file share
            function isSingleShare() {
              return yunData.getContext === undefined;
            }

            // Get a cookie value by name
            function getCookie(e) {
              var o, t;
              var n = document, c = decodeURI;
              return n.cookie.length > 0 && (o = n.cookie.indexOf(e + "="), -1
              != o) ? (o = o + e.length + 1, t = n.cookie.indexOf(";", o), -1
              == t && (t = n.cookie.length), c(n.cookie.substring(o, t))) : "";
            }

            // Private shares require sekey
            function getExtra() {
              var seKey = decodeURIComponent(getCookie('BDCLND'));
              return '{' + '"sekey":"' + seKey + '"' + "}";
            }

            function base64Encode(t) {
              var a, r, e, n, i, s, o = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
              for (e = t.length, r = 0, a = ""; e > r;) {
                if (n = 255 & t.charCodeAt(r++), r == e) {
                  a += o.charAt(n >> 2);
                  a += o.charAt((3 & n) << 4);
                  a += "==";
                  break;
                }
                if (i = t.charCodeAt(r++), r == e) {
                  a += o.charAt(n >> 2);
                  a += o.charAt((3 & n) << 4 | (240 & i) >> 4);
                  a += o.charAt((15 & i) << 2);
                  a += "=";
                  break;
                }
                s = t.charCodeAt(r++);
                a += o.charAt(n >> 2);
                a += o.charAt((3 & n) << 4 | (240 & i) >> 4);
                a += o.charAt((15 & i) << 2 | (192 & s) >> 6);
                a += o.charAt(63 & s);
              }
              return a;
            }

            // Get the logid
            function getLogID() {
              var name = "BAIDUID";
              var u = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/~!@#¥%……&";
              var d = /[\uD800-\uDBFF][\uDC00-\uDFFFF]|[^\x00-\x7F]/g;
              var f = String.fromCharCode;

              function l(e) {
                if (e.length < 2) {
                  var n = e.charCodeAt(0);
                  return 128 > n ? e : 2048 > n ? f(192 | n >>> 6) + f(
                      128 | 63 & n) : f(224 | n >>> 12 & 15) + f(
                      128 | n >>> 6 & 63) + f(128 | 63 & n);
                }
                var n = 65536 + 1024 * (e.charCodeAt(0) - 55296)
                    + (e.charCodeAt(1) - 56320);
                return f(240 | n >>> 18 & 7) + f(128 | n >>> 12 & 63) + f(
                        128 | n >>> 6 & 63) + f(128 | 63 & n);
              }

              function g(e) {
                return (e + "" + Math.random()).replace(d, l);
              }

              function m(e) {
                var n = [0, 2, 1][e.length % 3];
                var t = e.charCodeAt(0) << 16 | (e.length > 1 ? e.charCodeAt(1)
                        : 0) << 8 | (e.length > 2 ? e.charCodeAt(2) : 0);
                var o = [u.charAt(t >>> 18), u.charAt(t >>> 12 & 63),
                  n >= 2 ? "=" : u.charAt(t >>> 6 & 63),
                  n >= 1 ? "=" : u.charAt(63 & t)];
                return o.join("");
              }

              function h(e) {
                return e.replace(/[\s\S]{1,3}/g, m);
              }

              function p() {
                return h(g((new Date()).getTime()));
              }

              function w(e, n) {
                return n ? p(String(e)).replace(/[+\/]/g, function (e) {
                  return "+" == e ? "-" : "_";
                }).replace(/=/g, "") : p(String(e));
              }

              return w(getCookie(name));
            }

            // Get the current directory
            function getPath() {
              var hash = location.hash;
              var regx = /(^|&|\/)path=([^&]*)(&|$)/i;
              var result = hash.match(regx);
              return decodeURIComponent(result[2]);
            }

            // Get the category being shown, i.e. the `type` in the address bar
            function getCategory() {
              var hash = location.hash;
              var regx = /(^|&|\/)type=([^&]*)(&|$)/i;
              var result = hash.match(regx);
              return decodeURIComponent(result[2]);
            }

            function getSearchKey() {
              var hash = location.hash;
              var regx = /(^|&|\/)key=([^&]*)(&|$)/i;
              var result = hash.match(regx);
              return decodeURIComponent(result[2]);
            }

            // Get the current page (list or category)
            function getCurrentPage() {
              var hash = location.hash;
              return decodeURIComponent(
                  hash.substring(hash.indexOf('#') + 1, hash.indexOf('/')));
            }

            // Get the file info list
            function getFileList() {
              var result = [];
              if (getPath() == '/') {
                result = yunData.FILEINFO;
              } else {
                logid = getLogID();
                var params = {
                  uk: uk,
                  shareid: shareid,
                  order: 'other',
                  desc: 1,
                  showempty: 0,
                  web: web,
                  dir: getPath(),
                  t: Math.random(),
                  bdstoken: bdstoken,
                  channel: channel,
                  clienttype: clienttype,
                  app_id: app_id,
                  logid: logid
                };
                $.ajax({
                  url: shareListUrl,
                  method: 'GET',
                  async: false,
                  data: params,
                  success: function (response) {
                    if (response.errno === 0) {
                      result = response.list;
                    }
                  }
                });
              }
              return result;
            }

            // Build the fid_list parameter used when downloading
            function getFidList(list) {
              var retList = null;
              if (list.length === 0) {
                return null;
              }

              var fileidlist = [];
              $.each(list, function (index, element) {
                fileidlist.push(element.fs_id);
              });
              retList = '[' + fileidlist + ']';
              return retList;
            }

            // Initialize
            initParams();

            // console.log('fileList=---------' + fileList);
            // console.log('fidList=---------' + getFidList(fileList))

            var retObj = {
              'sign': sign,
              'timestamp': timestamp,
              'logid': logid,
              'bdstoken': bdstoken,
              'channel': channel,
              'shareType': shareType,
              'clienttype': clienttype,
              'encrypt': encrypt,
              'primaryid': primaryid,
              'uk': uk,
              'product': product,
              'web': web,
              'app_id': app_id,
              'extra': extra,
              'shareid': shareid,
              'fid_list': getFidList(fileList), // ids of the files to download
              'file_list': fileList,
              'panAPIUrl': panAPIUrl,
              'single_share': is_single_share,
              'cookies': cookies
            };

            return retObj;
          });

          console.log("data=" + JSON.stringify(baiDuObj));
        });
  },
  function () {
  }
];

// main started
setInterval(function () {
  if (!loadInProgress && typeof steps[stepIndex] == "function") {

    console.log(
        '                                                                                               ');
    console.log(
        '===============================================================================================');
    console.log('                                    step ' + (stepIndex + 1)
        + '                               ');
    console.log(
        '===============================================================================================');
    console.log(
        '                                                                                               ');

    steps[stepIndex]();
    stepIndex++;
  }

  if (typeof steps[stepIndex] != "function") {
    console.log("Completed!");
    console.log('FinalOutPut: codeVal=' + codeVal + "; loadUrl=" + loadUrl
        + "; result=" + JSON.stringify(baiDuObj));
    phantom.exit();
  }
}, 5000);

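For completeness, here is a rough sketch of how a script like the one above can be driven from the backend and its "FinalOutPut:" line parsed back into structured data. Our production code does this from Java; the Python below is purely illustrative, and the script file name extract_pan.js and the phantomjs binary being on PATH are assumptions.

import json
import subprocess

def extract_pan_info(share_url, access_code=None):
    """Run the PhantomJS script above and parse the FinalOutPut line it prints."""
    # The script takes loadUrl as the first argument and the extraction code as the
    # second, with the literal string 'null' meaning no extraction code.
    cmd = ["phantomjs", "extract_pan.js", share_url, access_code or "null"]  # script name is a placeholder
    output = subprocess.check_output(cmd, timeout=120).decode("utf-8", "ignore")
    for line in output.splitlines():
        if line.startswith("FinalOutPut:"):
            # The script prints: FinalOutPut: codeVal=...; loadUrl=...; result={...}
            # json.loads will raise if the script failed to build the result object.
            return json.loads(line.split("result=", 1)[1])
    return None

if __name__ == "__main__":
    info = extract_pan_info("https://pan.baidu.com/s/1sltxlYP", "nd54")
    if info:
        print(info.get("sign"), info.get("timestamp"), info.get("fid_list"))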

The above is a rough walkthrough of the whole BBS crawling process. Due to limited space, some parts may be ambiguous or unclear; discussion is welcome. My technical level is also limited, so criticism and corrections of any mistakes or shortcomings are welcome.