In real projects, non-critical and non-urgent data is often scraped locally by a web crawler to save cost, but the source site usually rate-limits access. Many programmers work around the rate limit with free, multi-channel proxies found on the internet. Because they are free, these proxies are not very stable, so every project ends up spending a lot of time and logic on proxy selection and failure retries, which ultimately complicates the application code. This article uses a multi-level proxy instead: the first-level proxy handles all of these problems, and users simply point their client at the first-level proxy.

Basic idea: develop a proxy-of-proxies module that shields the application layer from the problems above. The following summarizes the experience gained and the lessons learned.

1. Forward proxy forwarding principle

Understanding the difference between forward and reverse proxies is the key to coding this quickly.

With a forward proxy, the client knows the real target server and asks the proxy to reach it. With a reverse proxy, the client does not know the real target server and treats the proxy server as if it were the target.
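
To make the distinction concrete, here is a minimal sketch using only the standard library. The helper names (newBackend, forwardHandler, viaForwardProxy, viaReverseProxy) are illustrative and not part of the project code, and the forward proxy handles plain GET requests only. In the forward case the client asks for the backend URL and is merely configured to go through the proxy; in the reverse case the client only ever sees the proxy URL.

```go
package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
	"net/http/httptest"
	"net/http/httputil"
	"net/url"
)

// newBackend starts a stand-in for the real target server.
func newBackend() *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, "hello from backend")
	}))
}

// forwardHandler is a GET-only forward proxy: the client names the real
// target in an absolute URL, and the proxy fetches it on the client's behalf.
func forwardHandler(w http.ResponseWriter, r *http.Request) {
	out, err := http.NewRequest(http.MethodGet, r.URL.String(), nil)
	if err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	resp, err := (&http.Transport{}).RoundTrip(out)
	if err != nil {
		http.Error(w, err.Error(), http.StatusBadGateway)
		return
	}
	defer resp.Body.Close()
	w.WriteHeader(resp.StatusCode)
	io.Copy(w, resp.Body)
}

// get fetches rawURL with the given client and returns the body as a string.
func get(client *http.Client, rawURL string) (string, error) {
	resp, err := client.Get(rawURL)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	return string(body), err
}

// viaForwardProxy: the client requests the backend URL and is configured
// to send that request through the proxy.
func viaForwardProxy() (string, error) {
	backend := newBackend()
	defer backend.Close()
	proxy := httptest.NewServer(http.HandlerFunc(forwardHandler))
	defer proxy.Close()

	proxyURL, err := url.Parse(proxy.URL)
	if err != nil {
		return "", err
	}
	client := &http.Client{Transport: &http.Transport{Proxy: http.ProxyURL(proxyURL)}}
	return get(client, backend.URL)
}

// viaReverseProxy: the client requests the proxy URL and never learns
// the backend address.
func viaReverseProxy() (string, error) {
	backend := newBackend()
	defer backend.Close()
	backendURL, err := url.Parse(backend.URL)
	if err != nil {
		return "", err
	}
	proxy := httptest.NewServer(httputil.NewSingleHostReverseProxy(backendURL))
	defer proxy.Close()
	return get(&http.Client{}, proxy.URL)
}

func main() {
	fwd, err := viaForwardProxy()
	if err != nil {
		log.Fatal(err)
	}
	rev, err := viaReverseProxy()
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("forward:", fwd)
	fmt.Println("reverse:", rev)
}
```

Both paths return the same body; what differs is which URL the client has to know.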

2. TCP Socket programming in Go language

Transport-layer programming in Go is very simple: a single line of code. The following creates a port listener that accepts incoming transport-layer connections from the network.

net.Listen("tcp", ":7856")
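
As a rough illustration of what happens after net.Listen, the sketch below accepts one connection and echoes back the first line it receives. The function names (serveOnce, echoRoundTrip) are made up for this example, and it listens on an ephemeral loopback port rather than 7856 so it can run anywhere.

```go
package main

import (
	"bufio"
	"fmt"
	"log"
	"net"
)

// serveOnce accepts a single connection and echoes back the first line it receives.
func serveOnce(l net.Listener) {
	conn, err := l.Accept()
	if err != nil {
		log.Println(err)
		return
	}
	defer conn.Close()
	line, err := bufio.NewReader(conn).ReadString('\n')
	if err != nil {
		log.Println(err)
		return
	}
	fmt.Fprint(conn, line)
}

// echoRoundTrip listens on an ephemeral loopback port, sends msg
// (which must end in '\n') and returns the echoed reply.
func echoRoundTrip(msg string) (string, error) {
	l, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return "", err
	}
	defer l.Close()
	go serveOnce(l)

	conn, err := net.Dial("tcp", l.Addr().String())
	if err != nil {
		return "", err
	}
	defer conn.Close()
	if _, err := fmt.Fprint(conn, msg); err != nil {
		return "", err
	}
	return bufio.NewReader(conn).ReadString('\n')
}

func main() {
	reply, err := echoRoundTrip("ping\n")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Print(reply)
}
```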

3. TCP protocol parsing

proxyUrl points to the listener address created above, in the format http://ip:port.

proxy, _ := url.Parse(proxyUrl)
tr := &http.Transport{
	Proxy: http.ProxyURL(proxy),
	// InsecureSkipVerify disables certificate verification; use it for testing only
	TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
	DialContext: (&net.Dialer{
		Timeout:   30 * time.Second,
		KeepAlive: 30 * time.Second,
		DualStack: true,
	}).DialContext,
}

client := &http.Client{
	Transport: tr,
	Timeout:   10 * time.Second,
}
......
client.Do(req)

client.Do initiates the HTTP request. The process has two steps:

1. Establish a TCP connection with the proxy server.
2. Send an HTTP CONNECT request (see HTTP protocol parsing).

The packets captured in Wireshark are as follows:

X TCP 78 50707 → 7856 [SYN] Seq=0 Win=65535 Len=0 MSS=1460 WS=64 TSval=104140878 TSecr=0 SACK_PERM=1
887 47.187710 X.X.X.X 192.168.0.105 TCP 74 7856 → 50707 [SYN, ACK] Seq=0 Ack=1 WinLen=0 MSS=1412 SACK_PERM=1 TSval=822200229 TSecr=104140878 WS=128
888 47.187803 192.168.0.105 X.X.X.X TCP 66 50707 → 7856 [ACK] Seq=1 Ack=1 Win=131584 Len=0 TSval=104140896 TSecr=822200229

At this point the TCP channel is open (the roadbed has been laid; what kind of road gets built on it is decided next).

4. HTTP protocol parsing

With the TCP/IP transport-layer channel open, what follows are the HTTP application-layer setup packets (the roadbed was laid in the previous step; now the road is paved so trucks can drive on it).

889 47.188292 192.168.0.105 120.24.69.155 HTTP  159 CONNECT w.mmm920.com:443 HTTP/1.1  
893 53.023565 120.24.69.155 192.168.0.105 HTTP  105 HTTP/1.1 200 Connection established 

HTTP proxy server logic

Establishing the connection between the client and the HTTP proxy takes only two steps:

  • The client sends a CONNECT w.XXXX.com:443 HTTP/1.1 request
  • The server recognizes the CONNECT request and returns the HTTP/1.1 200 Connection established\r\n\r\n packet

At this point, the connection between the client and the HTTP proxy is established successfully.

if method == "CONNECT" {
	fmt.Fprint(client, "HTTP/1.1 200 Connection established\r\n\r\n")
} else {
	log.Println("server write", method) // Other methods
	server.Write(b[:n])
}
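
The two-step exchange above can be tried without any network by running both sides over an in-memory pipe. This is only a sketch: serveConnect and handshake are hypothetical helper names, example.com:443 is a placeholder target, and the proxy side does no more than read the request head and answer 200.

```go
package main

import (
	"bufio"
	"fmt"
	"log"
	"net"
	"strings"
)

// serveConnect plays the proxy side: it reads the request head, checks that
// the method is CONNECT and answers with the 200 status line.
// It returns the target named in the request line.
func serveConnect(conn net.Conn) (string, error) {
	br := bufio.NewReader(conn)
	line, err := br.ReadString('\n')
	if err != nil {
		return "", err
	}
	parts := strings.Fields(line) // e.g. ["CONNECT", "host:443", "HTTP/1.1"]
	if len(parts) != 3 || parts[0] != "CONNECT" {
		return "", fmt.Errorf("unexpected request line %q", line)
	}
	for { // skip the remaining header lines up to the blank line
		h, err := br.ReadString('\n')
		if err != nil {
			return "", err
		}
		if h == "\r\n" {
			break
		}
	}
	_, err = fmt.Fprint(conn, "HTTP/1.1 200 Connection established\r\n\r\n")
	return parts[1], err
}

// handshake runs a client and the proxy side over an in-memory pipe.
// It returns the target seen by the proxy and the status line seen by the client.
func handshake(target string) (string, string, error) {
	client, proxy := net.Pipe()
	defer client.Close()

	type result struct {
		target string
		err    error
	}
	done := make(chan result, 1)
	go func() {
		defer proxy.Close()
		t, err := serveConnect(proxy)
		done <- result{t, err}
	}()

	// Step 1: the client sends the CONNECT request.
	if _, err := fmt.Fprintf(client, "CONNECT %s HTTP/1.1\r\nHost: %s\r\n\r\n", target, target); err != nil {
		return "", "", err
	}
	// Step 2: the client reads the proxy's status line.
	status, err := bufio.NewReader(client).ReadString('\n')
	if err != nil {
		return "", "", err
	}
	res := <-done
	return res.target, strings.TrimSpace(status), res.err
}

func main() {
	seen, status, err := handshake("example.com:443")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("proxy saw target:", seen)
	fmt.Println("client saw status:", status)
}
```

After the 200 line, neither side interprets the bytes any further; the connection is just a pipe for whatever the client sends next (typically a TLS handshake).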

How the proxy server connects to other proxies on the network:

	// remoteProxyAddr is the host:port of the remote (second-level) proxy
	c, err := net.DialTimeout("tcp", remoteProxyAddr, time.Second*5)
	req, err := http.NewRequest(http.MethodConnect, reqURL.String(), nil)
	req.Write(c)
	resp, err := http.ReadResponse(bufio.NewReader(c), req)
	if resp.StatusCode != 200 {
		err = fmt.Errorf("Connect server using proxy error, StatusCode [%d]", resp.StatusCode)
		return nil, err
	}

The HTTP CONNECT method is used to check whether the remote proxy server is currently available and to obtain the net.Conn tunnel c.
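
The same idea can be exercised end to end against a stand-in upstream proxy. In the sketch below, dialViaProxy, fakeProxyOnce and tryTunnel are hypothetical names; the fake proxy accepts only one hard-coded target and echoes a single line through the tunnel. One caveat worth noting: bytes the proxy sends immediately after its 200 response may be left in the bufio.Reader used to parse that response, which is harmless here because the client speaks first.

```go
package main

import (
	"bufio"
	"fmt"
	"log"
	"net"
	"net/http"
	"time"
)

// dialViaProxy opens a TCP connection to proxyAddr and asks it, via HTTP
// CONNECT, to tunnel to target ("host:port"). The connection is closed on any failure.
func dialViaProxy(proxyAddr, target string, timeout time.Duration) (net.Conn, error) {
	c, err := net.DialTimeout("tcp", proxyAddr, timeout)
	if err != nil {
		return nil, err
	}
	req, err := http.NewRequest(http.MethodConnect, "http://"+target, nil)
	if err != nil {
		c.Close()
		return nil, err
	}
	if err := req.Write(c); err != nil {
		c.Close()
		return nil, err
	}
	resp, err := http.ReadResponse(bufio.NewReader(c), req)
	if err != nil {
		c.Close()
		return nil, err
	}
	resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		c.Close()
		return nil, fmt.Errorf("proxy %s refused CONNECT to %s: %s", proxyAddr, target, resp.Status)
	}
	return c, nil
}

// fakeProxyOnce stands in for a second-level proxy. It serves one connection:
// CONNECT to the allowed target gets a 200 and a one-line echo, anything else a 503.
func fakeProxyOnce(l net.Listener, allowed string) {
	conn, err := l.Accept()
	if err != nil {
		return
	}
	defer conn.Close()
	br := bufio.NewReader(conn)
	req, err := http.ReadRequest(br)
	if err != nil {
		return
	}
	if req.Method != http.MethodConnect || req.Host != allowed {
		fmt.Fprint(conn, "HTTP/1.1 503 Service Unavailable\r\nContent-Length: 0\r\n\r\n")
		return
	}
	fmt.Fprint(conn, "HTTP/1.1 200 Connection established\r\n\r\n")
	if line, err := br.ReadString('\n'); err == nil {
		fmt.Fprint(conn, line) // echo one line through the tunnel
	}
}

// tryTunnel dials target through a fresh fake proxy and sends one line through the tunnel.
func tryTunnel(target string) (string, error) {
	l, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return "", err
	}
	defer l.Close()
	go fakeProxyOnce(l, "example.com:443")

	c, err := dialViaProxy(l.Addr().String(), target, 5*time.Second)
	if err != nil {
		return "", err
	}
	defer c.Close()
	if _, err := fmt.Fprint(c, "ping\n"); err != nil {
		return "", err
	}
	return bufio.NewReader(c).ReadString('\n')
}

func main() {
	reply, err := tryTunnel("example.com:443")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Print("tunnel reply: ", reply)
	_, err = tryTunnel("blocked.example:443")
	fmt.Println("refused target:", err != nil)
}
```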

5. Research on TCP connection pool

Advanced topic

TCP connection pooling is worth considering for the transport-layer channels between the first-level and second-level proxy servers. Because the second-level proxies are free proxies on the internet, establishing a connection is expensive and unreliable, so once a connection has been established it should be reused wherever possible. There are also risks to weigh, such as maintaining the connection pool and the load placed on the remote proxy servers.
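
One possible shape for such a pool is sketched below; it is not part of the project code. The connPool type, its methods and the key format are assumptions made for illustration, and the dial function is injected so the example can run on an in-memory pipe. Note that a connection which has already completed a CONNECT handshake is bound to one target, so the key would need to cover both the proxy address and the target.

```go
package main

import (
	"fmt"
	"net"
	"sync"
)

// connPool keeps a bounded number of idle connections per key.
type connPool struct {
	mu      sync.Mutex
	idle    map[string][]net.Conn
	maxIdle int
	dial    func(key string) (net.Conn, error)
}

func newConnPool(maxIdle int, dial func(key string) (net.Conn, error)) *connPool {
	return &connPool{idle: make(map[string][]net.Conn), maxIdle: maxIdle, dial: dial}
}

// Get returns an idle connection for key if one exists, otherwise it dials a new one.
func (p *connPool) Get(key string) (net.Conn, error) {
	p.mu.Lock()
	if conns := p.idle[key]; len(conns) > 0 {
		c := conns[len(conns)-1]
		p.idle[key] = conns[:len(conns)-1]
		p.mu.Unlock()
		return c, nil
	}
	p.mu.Unlock()
	return p.dial(key)
}

// Put hands a connection back for reuse, or closes it if the key already
// holds maxIdle idle connections.
func (p *connPool) Put(key string, c net.Conn) {
	p.mu.Lock()
	defer p.mu.Unlock()
	if len(p.idle[key]) >= p.maxIdle {
		c.Close()
		return
	}
	p.idle[key] = append(p.idle[key], c)
}

// demoReuse gets, returns and gets a connection again, and reports how many
// times the pool actually had to dial.
func demoReuse() int {
	dials := 0
	pool := newConnPool(2, func(key string) (net.Conn, error) {
		dials++
		local, _ := net.Pipe() // in-memory stand-in for a real TCP dial
		return local, nil
	})
	const key = "proxy-a|example.com:443" // hypothetical key: proxy address plus target

	c1, _ := pool.Get(key)
	pool.Put(key, c1)
	c2, _ := pool.Get(key) // served from the idle list, no second dial
	c2.Close()
	return dials
}

func main() {
	fmt.Println("dials:", demoReuse())
}
```

The sketch does no liveness checking, so with unstable free proxies a pooled connection may already be dead when it is handed out; a real pool would need idle timeouts or a health check, which is part of the maintenance cost mentioned above.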

6. Using Wireshark

Wireshark is a great tool for troubleshooting and for learning to analyze the TCP/IP protocols.

7. Go code

The code is very simple: about 200 lines implement the multi-level proxy, and it gives a clear view of how TCP/IP and HTTP connections are established.

The logic of refreshProxyAddr is omitted because it involves confidential information. refreshProxyAddr updates the proxy IP address pool. For testing, you can set a few addresses manually in the form proxyUrls["http://x.x.x.x:3128"] = ""
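
For local testing, the pool could be seeded and sampled along the following lines. This is a sketch under assumptions: seedProxies and pickProxy are hypothetical helpers, and the addresses come from the 192.0.2.0/24 documentation range rather than real proxies. Go leaves map iteration order unspecified, so the sketch draws an explicit random index instead of taking the first key of a range loop.

```go
package main

import (
	"fmt"
	"math/rand"
	"sync"
)

var (
	mu        sync.Mutex
	proxyUrls = make(map[string]string)
)

// seedProxies replaces the pool with a hand-written list, for local testing.
func seedProxies(addrs ...string) {
	tmp := make(map[string]string, len(addrs))
	for _, a := range addrs {
		tmp[a] = ""
	}
	mu.Lock()
	proxyUrls = tmp
	mu.Unlock()
}

// pickProxy returns a randomly chosen proxy address, or false if the pool is empty.
func pickProxy() (string, bool) {
	mu.Lock()
	defer mu.Unlock()
	if len(proxyUrls) == 0 {
		return "", false
	}
	keys := make([]string, 0, len(proxyUrls))
	for k := range proxyUrls {
		keys = append(keys, k)
	}
	return keys[rand.Intn(len(keys))], true
}

func main() {
	seedProxies("http://192.0.2.10:3128", "http://192.0.2.11:3128")
	addr, ok := pickProxy()
	fmt.Println(addr, ok)
}
```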

package main

import (
	"bufio"
	"bytes"
	"fmt"
	"io"
	"log"
	"net"
	"net/http"
	"net/url"
	"os"
	"runtime/debug"
	"strings"
	"sync"
	"time"

	"github.com/robfig/cron"
)

var proxyUrls map[string]string = make(map[string]string)
var choiseURL string
var mu sync.Mutex
var connHold map[string]net.Conn = make(map[string]net.Conn) //map[proxy url] TCP connection

func init() {
	log.SetFlags(log.LstdFlags | log.Lshortfile)
	refreshProxyAddr()

	cronTask := cron.New()
	cronTask.AddFunc("@every 1h", func() {
		mu.Lock()
		defer mu.Unlock()
		refreshProxyAddr()
	})
	cronTask.Start()
}

func main() {
	l, err := net.Listen("tcp", ":7856")
	if err != nil {
		log.Panic(err)
	}

	for {
		client, err := l.Accept()
		if err != nil {
			log.Panic(err)
		}
		go handle(client)
	}
}

func handle(client net.Conn) {
	defer func() {
		if err := recover(); err != nil {
			log.Println(err)
			debug.PrintStack()
		}
	}()
	if client == nil {
		return
	}
	log.Println("client tcp tunnel connection:", client.LocalAddr().String(), "->", client.RemoteAddr().String())
	// client.SetDeadline(time.Now().Add(time.Duration(10) * time.Second))
	defer client.Close()

	var b [1024]byte
	n, err := client.Read(b[:]) // Read the first chunk of application-layer data (up to 1024 bytes)
	if err != nil || bytes.IndexByte(b[:n], '\n') == -1 {
		log.Println(err) // A bare transport-layer connection carries no application-layer content, e.g. net.Dial()
		return
	}
	var method, host, address string
	// Parse the request line, e.g. "CONNECT host:443 HTTP/1.1"
	fmt.Sscanf(string(b[:bytes.IndexByte(b[:n], '\n')]), "%s%s", &method, &host)
	log.Println(method, host)
	hostPortURL, err := url.Parse(host)
	if err != nil {
		log.Println(err)
		return
	}

	if hostPortURL.Opaque == "443" { // HTTPS access: "host:443" parses as Scheme=host, Opaque=443
		address = hostPortURL.Scheme + ":443"
	} else { // HTTP access
		if strings.Index(hostPortURL.Host, ":") == -1 { // host does not contain a port; default to 80
			address = hostPortURL.Host + ":80"
		} else {
			address = hostPortURL.Host
		}
	}

	server, err := Dial("tcp", address)
	if err != nil {
		log.Println(err)
		return
	}
	// After data is forwarded at the application layer, close the channel at the transport layer
	defer server.Close()
	log.Println("server tcp tunnel connection:", server.LocalAddr().String(), "->", server.RemoteAddr().String())
	// server.SetDeadline(time.Now().Add(time.Duration(10) * time.Second))

	if method == "CONNECT" {
		fmt.Fprint(client, "HTTP/1.1 200 Connection established\r\n\r\n")
	} else {
		log.Println("server write", method) // Other methods
		server.Write(b[:n])
	}

	// Forward in both directions
	go func() {
		io.Copy(server, client)
	}()
	io.Copy(client, server) // Blocking forward
}

// refreshProxyAddr refreshes the proxy IP address pool
func refreshProxyAddr() {
	var proxyUrlsTmp map[string]string = make(map[string]string)
	// Logic for fetching proxy IP addresses (omitted)
	proxyUrls = proxyUrlsTmp // For testing, set the proxy IP addresses manually here
}

// DialSimple establishes a connection with the second-level proxy server by writing the raw request bytes directly
func DialSimple(network, addr string) (net.Conn, error) {
	var proxyAddr string
	mu.Lock()
	for proxyAddr = range proxyUrls { // Map iteration order is unspecified, so this picks an arbitrary proxy address
		break
	}
	mu.Unlock()
	c, err := func() (net.Conn, error) {
		u, _ := url.Parse(proxyAddr)
		log.Println("Proxy host", u.Host)
		// Dial and create client connection.
		c, err := net.DialTimeout("tcp", u.Host, time.Second*5)
		if err != nil {
			log.Println(err)
			return nil, err
		}
		// w.xxxx.com:443 stands in for the actual address
		_, err = c.Write([]byte("CONNECT w.xxxx.com:443 HTTP/1.1\r\nHost: w.xxxx.com:443\r\nUser-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.3\r\n\r\n"))
		if err != nil {
			panic(err)
		}
		// A double-quoted string is needed here so that \r\n is sent as CRLF
		c.Write([]byte("GET www.baidu.com HTTP/1.1\r\n\r\n"))
		io.Copy(os.Stdout, c) // Dump the reply for debugging; blocks until the connection is closed
		return c, err
	}()
	return c, err
}

// Dial sets up a transmission channel to addr through a second-level proxy
func Dial(network, addr string) (net.Conn, error) {
	var proxyAddr string
	mu.Lock() // proxyUrls is replaced by the hourly refresh task, so read it under the lock
	for proxyAddr = range proxyUrls { // Map iteration order is unspecified, so this picks an arbitrary proxy address
		break
	}
	mu.Unlock()
	// Establish a transport layer channel to the proxy server
	c, err := func() (net.Conn, error) {
		u, _ := url.Parse(proxyAddr)
		log.Println("Proxy address", u.Host)
		// Dial and create client connection.
		c, err := net.DialTimeout("tcp", u.Host, time.Second*5)
		if err != nil {
			return nil, err
		}

		reqURL, err := url.Parse("http://" + addr)
		if err != nil {
			return nil, err
		}
		req, err := http.NewRequest(http.MethodConnect, reqURL.String(), nil)
		if err != nil {
			return nil, err
		}
		req.Close = false
		req.Header.Set("User-Agent", "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.3")

		err = req.Write(c)
		if err != nil {
			return nil, err
		}

		resp, err := http.ReadResponse(bufio.NewReader(c), req)
		if err != nil {
			return nil, err
		}
		defer resp.Body.Close()

		log.Println(resp.StatusCode, resp.Status, resp.Proto, resp.Header)
		if resp.StatusCode != 200 {
			err = fmt.Errorf("Connect server using proxy error, StatusCode [%d]", resp.StatusCode)
			return nil, err
		}
		return c, err
	}()
	if c == nil || err != nil { // The proxy is not working
		log.Println("Proxy exception:", c, err)
		log.Println("Local direct forward:", c, err)
		return net.Dial(network, addr)
	}
	log.Println("Proxy normal, tunnel message", c.LocalAddr().String(), "->", c.RemoteAddr().String())
	return c, err
}