Pholcus

Pholcus (Ghost Spider) is a heavyweight crawler written in pure Go, supporting distributed deployment and high concurrency. It targets Internet data collection and gives users with some Go or JS programming background a powerful crawler tool that lets them focus solely on rule customization.

It supports three operating modes (standalone, server, and client) and three operating interfaces (Web, GUI, and command line). Rules are simple and flexible, tasks run in concurrent batches, output options are rich (MySQL, MongoDB, Kafka, CSV, Excel, etc.), and a large number of demos are shared. It also supports both horizontal and vertical crawling modes, along with advanced features such as simulated login and task pause and cancellation.

  • Official QQ group: Go Big data 42731170

The crawler principle

 

 

Framework features

  1. Provides a fully featured, heavyweight crawler tool for users with some Go or JS programming background, who only need to focus on rule customization;

  2. Supports three operating modes: standalone, server, and client;

  3. The operating interface (GUI (Windows), Web, or Cmd) can be selected via parameters;

  4. Supports state control, such as pause, resume, and stop;

  5. The collection volume can be controlled;

  6. The number of concurrent coroutines can be controlled;

  7. Supports concurrent execution of multiple collection tasks;

  8. Supports proxy IP lists with a controllable change frequency;

  9. Supports random pauses during collection to simulate human behavior;

  10. Provides user-defined configuration input interfaces according to the rules;

  11. Offers five output modes (MySQL, MongoDB, Kafka, CSV, and Excel) as well as download of the original files;

  12. Supports batch output with a controllable number of records per batch;

  13. Supports both static Go and dynamic JS collection rules, supports horizontal and vertical crawl modes, and ships with a large number of demos;

  14. Persists records of successful requests for automatic deduplication;

  15. Serializes failed requests and supports deserialization for automatic reload processing;

  16. Uses the surfer high-concurrency downloader, supporting GET/POST/HEAD methods and the HTTP/HTTPS protocols; offers two modes, a fixed UserAgent with automatic cookie saving and a large pool of random UserAgents with cookies disabled, to closely simulate browser behavior and enable features such as simulated login;

  17. In server/client mode, uses the Teleport high-concurrency socket API framework for full-duplex communication, with JSON as the internal data transmission format.
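To make the batch-output feature (item 12 above) concrete, here is a minimal sketch in Go. It is illustrative only: the function name `splitBatches` and the record type are assumptions, not Pholcus's actual API; the point is simply that records are flushed in chunks whose size the user controls.

```go
package main

import "fmt"

// splitBatches groups records into batches of at most n items each,
// illustrating output whose per-batch count is controllable.
func splitBatches(records []string, n int) [][]string {
	var batches [][]string
	for len(records) > 0 {
		end := n
		if len(records) < n {
			end = len(records)
		}
		batches = append(batches, records[:end])
		records = records[end:]
	}
	return batches
}

func main() {
	for i, b := range splitBatches([]string{"r1", "r2", "r3", "r4", "r5"}, 2) {
		fmt.Printf("batch %d: %v\n", i, b) // batches of sizes 2, 2, 1
	}
}
```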

 

Go Version Requirements

≥ Go 1.6

 

Download and install

go get -u -v github.com/henrylee2cn/pholcus

Note: Pholcus maintains a publicly available spider rule library at github.com/henrylee2cn…

 

Create a project

package main

import (
    "github.com/henrylee2cn/pholcus/exec"
    _ "github.com/henrylee2cn/pholcus_lib" // this is the publicly maintained spider rule library
    // _ "pholcus_lib_pte" // you can also add your own rule library
)

func main() {
    // Set the default operating interface at runtime and start running.
    // Before running the software, you can set the -a_ui parameter to "web", "gui" or "cmd"
    exec.DefaultRun("web")
}

 

Compile operation

Normal compilation method

cd {{replace with your gopath}}/src/github.com/henrylee2cn/pholcus
go install    (or: go build)

Compilation method that hides the CMD window on Windows

cd {{replace with your gopath}}/src/github.com/henrylee2cn/pholcus
go install -ldflags="-H windowsgui"    (or: go build -ldflags="-H windowsgui")

View optional parameters:

pholcus -h

 

A screenshot of the Web UI:

 

A screenshot of the mode selection screen of the GUI version:

 

An example of setting Cmd run-time parameters:

$ pholcus -a_ui=cmd -a_mode=0 -c_spider=3,8 -a_outtype=csv -a_thread=20 -a_dockercap=5000 -a_pause=300
-a_proxyminute=0 -a_keyins="<pholcus><golang>" -a_limit=10 -a_success=true -a_failure=true
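Flags like those above follow Go's standard flag syntax. As a hedged sketch (the flag names mirror the parameters shown, but this is not Pholcus's actual source), parsing such parameters could look like:

```go
package main

import (
	"flag"
	"fmt"
)

// Config holds a few of the run-time parameters shown above.
type Config struct {
	UI        string // -a_ui: web, gui or cmd
	Mode      int    // -a_mode: run mode
	Thread    int    // -a_thread: concurrent coroutines
	DockerCap int    // -a_dockercap: records per output batch
}

// parseArgs parses command-line style arguments into a Config.
func parseArgs(args []string) (*Config, error) {
	cfg := &Config{}
	fs := flag.NewFlagSet("pholcus", flag.ContinueOnError)
	fs.StringVar(&cfg.UI, "a_ui", "web", "operating interface")
	fs.IntVar(&cfg.Mode, "a_mode", 0, "run mode")
	fs.IntVar(&cfg.Thread, "a_thread", 20, "concurrent coroutines")
	fs.IntVar(&cfg.DockerCap, "a_dockercap", 5000, "batch size")
	if err := fs.Parse(args); err != nil {
		return nil, err
	}
	return cfg, nil
}

func main() {
	cfg, _ := parseArgs([]string{"-a_ui=cmd", "-a_thread=20"})
	fmt.Printf("%+v\n", *cfg)
}
```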

 

*Note:* To use the proxy IP feature on Mac, be sure to run with root privileges; otherwise the proxies cannot be pinged!

 

Runtime directory files

├─ pholcus              software executable
│
├─ pholcus_pkg          runtime file directory
│  ├─ config.ini        configuration file
│  ├─ proxy.lib         proxy IP list file
│  ├─ spiders           dynamic rules directory
│  │  └─ xxx.pholcus.html  dynamic rule file
│  ├─ phantomjs         program file
│  ├─ text_out          text data output directory
│  ├─ file_out          file output directory
│  ├─ logs              log directory
│  ├─ history           history records directory
│  └─ cache             temporary cache directory

 

Dynamic rule Example

Features: rules are loaded dynamically without recompiling the software; they are easy to write and can be added freely, which suits lightweight collection projects. File: xxx.pholcus.html

<Spider>
    <Name>HTML dynamic rule example</Name>
    <Description>HTML dynamic rule example [Auto Page] [http://xxx.xxx.xxx]</Description>
    <Pausetime>300</Pausetime>
    <EnableLimit>false</EnableLimit>
    <EnableCookie>true</EnableCookie>
    <EnableKeyin>false</EnableKeyin>
    <NotDefaultField>false</NotDefaultField>
    <Namespace>
        <Script></Script>
    </Namespace>
    <SubNamespace>
        <Script></Script>
    </SubNamespace>
    <Root>
        <Script param="ctx">
        console.log("Root");
        ctx.JsAddQueue({
            Url: "http://xxx.xxx.xxx",
            Rule: "login page"
        });
        </Script>
    </Root>
    <Rule name="login page">
        <AidFunc>
            <Script param="ctx,aid">
            </Script>
        </AidFunc>
        <ParseFunc>
            <Script param="ctx">
            console.log(ctx.GetRuleName());
            ctx.JsAddQueue({
                Url: "http://xxx.xxx.xxx",
                Rule: "after login",
                Method: "POST",
                PostData: "[email protected]&password=44444444&login_btn=login_btn&submit=login_btn"
            });
            </Script>
        </ParseFunc>
    </Rule>
    <Rule name="after login">
        <ParseFunc>
            <Script param="ctx">
            console.log(ctx.GetRuleName());
            ctx.Output({
                "all": ctx.GetText()
            });
            ctx.JsAddQueue({
                Url: "http://accounts.xxx.xxx/member",
                Rule: "personal center",
                Header: {
                    "Referer": [ctx.GetUrl()]
                }
            });
            </Script>
        </ParseFunc>
    </Rule>
    <Rule name="personal center">
        <ParseFunc>
            <Script param="ctx">
            console.log("personal center: " + ctx.GetRuleName());
            ctx.Output({
                "all": ctx.GetText()
            });
            </Script>
        </ParseFunc>
    </Rule>
</Spider>

Static rule Example

Features: compiled together with the software; more customizable and more efficient, which suits heavyweight collection projects. File: xxx.go

func init() {
    Spider{
        Name:        "static rule example",
        Description: "static rule example [Auto Page] [http://xxx.xxx.xxx]",
        // Pausetime: 300,
        // Limit:     LIMIT,
        // Keyin:     KEYIN,
        EnableCookie:    true,
        NotDefaultField: false,
        Namespace:       nil,
        SubNamespace:    nil,
        RuleTree: &RuleTree{
            Root: func(ctx *Context) {
                ctx.AddQueue(&request.Request{Url: "http://xxx.xxx.xxx", Rule: "login page"})
            },
            Trunk: map[string]*Rule{
                "login page": {
                    ParseFunc: func(ctx *Context) {
                        ctx.AddQueue(&request.Request{
                            Url:      "http://xxx.xxx.xxx",
                            Rule:     "after login",
                            Method:   "POST",
                            PostData: "[email protected]&password=123456&login_btn=login_btn&submit=login_btn",
                        })
                    },
                },
                "after login": {
                    ParseFunc: func(ctx *Context) {
                        ctx.Output(map[string]interface{}{
                            "all": ctx.GetText(),
                        })
                        ctx.AddQueue(&request.Request{
                            Url:    "http://accounts.xxx.xxx/member",
                            Rule:   "personal center",
                            Header: http.Header{"Referer": []string{ctx.GetUrl()}},
                        })
                    },
                },
                "personal center": {
                    ParseFunc: func(ctx *Context) {
                        ctx.Output(map[string]interface{}{
                            "all": ctx.GetText(),
                        })
                    },
                },
            },
        },
    }.Register()
}

 

The proxy IP

  • Proxy IP addresses are written to the /pholcus_pkg/proxy.lib file in the following format, one IP per line:
http://183.141.168.95:3128
https://60.13.146.92:8088
http://59.59.4.22:8090
https://180.119.78.78:8090
https://222.178.56.73:8118
http://115.228.57.254:3128
http://49.84.106.160:9000
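A proxy list in this format is trivial to load. The sketch below (the function name `parseProxyLib` is an assumption for illustration, not the library's API) reads one proxy URL per line and skips blank lines:

```go
package main

import (
	"fmt"
	"strings"
)

// parseProxyLib extracts proxy URLs from the contents of a proxy.lib
// style file: one address per line, blank lines ignored.
func parseProxyLib(data string) []string {
	var proxies []string
	for _, line := range strings.Split(data, "\n") {
		line = strings.TrimSpace(line)
		if line == "" {
			continue
		}
		proxies = append(proxies, line)
	}
	return proxies
}

func main() {
	lib := "http://183.141.168.95:3128\n\nhttps://60.13.146.92:8088\n"
	fmt.Println(parseProxyLib(lib)) // two proxies, blank line skipped
}
```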
  • Select the proxy IP change frequency in the UI, or set the -a_proxyminute parameter on the command line;

  • *Note:* To use the proxy IP feature on Mac, be sure to run with root privileges; otherwise the proxies cannot be pinged!

 

FAQ

Are duplicate URLs automatically de-duplicated in the request queue?

URLs are de-duplicated by default, but deduplication can be bypassed for a given request by setting Request.Reloadable = true.
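The de-duplication rule above can be sketched as follows. This is a minimal illustration of the idea, not Pholcus's actual implementation; the `dedup` type is hypothetical, while the `Reloadable` field mirrors the documented behavior:

```go
package main

import "fmt"

// request mirrors the relevant fields: a URL plus the Reloadable flag
// that bypasses de-duplication.
type request struct {
	Url        string
	Reloadable bool
}

// dedup remembers URLs that were already queued.
type dedup struct{ seen map[string]bool }

func newDedup() *dedup { return &dedup{seen: make(map[string]bool)} }

// allow reports whether the request may enter the queue: duplicates are
// rejected unless the request is marked Reloadable.
func (d *dedup) allow(r request) bool {
	if r.Reloadable {
		return true
	}
	if d.seen[r.Url] {
		return false
	}
	d.seen[r.Url] = true
	return true
}

func main() {
	d := newDedup()
	fmt.Println(d.allow(request{Url: "http://a"}))                   // true: first time
	fmt.Println(d.allow(request{Url: "http://a"}))                   // false: duplicate
	fmt.Println(d.allow(request{Url: "http://a", Reloadable: true})) // true: bypasses dedup
}
```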

Does the framework have a mechanism to determine whether the page content a URL points to has been updated?

The framework does not directly support detecting page content updates, but users can implement such checks themselves in their custom rules.

Is request success determined by the status code in the response header?

No. Success is determined not by the status code, but by whether the server returned a response stream. This means a 404 page also counts as a success.

Is there a re-request mechanism after a failed request?

Yes. If a URL still fails after the specified number of download attempts, the request is appended to a special queue similar in nature to defer. After the current task completes, that queue is automatically merged into the download queue for another attempt. Requests that still fail are saved to the failure history. The next time this crawler rule runs, you can choose to inherit the historical failure records, and the failed requests are again added to the deferred queue, repeating the cycle.

 

Third-party dependencies

"github.com/henrylee2cn/teleport"
"golang.org/x/net/html/charset"
"gopkg.in/mgo.v2"
"github.com/robertkrimen/otto"
"github.com/Shopify/sarama"
"github.com/go-sql-driver/mysql"
"github.com/lxn/walk"
"github.com/elazarl/go-bindata-assetfs"
"github.com/henrylee2cn/pholcus_lib" // this is the publicly maintained spider rule library

(Thanks to the above open source projects for their support!)

Open source licenses

Apache License v2