Crawler Proxy IP Pools and Tunnel Proxies (2022.05.24)

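Contents

  • Crawler Proxy IP Pools and Tunnel Proxies
    • 1. Proxy IP pool
      • 1.1 Overview
      • 1.2 Implementation
      • 1.3 Test
    • 2. Tunnel proxy
      • 2.1 Overview
      • 2.2 Implementation
        • 2.2.1 Directory layout
        • 2.2.2 Configuration files
        • 2.2.3 openresty
      • 2.3 Test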

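Day-to-day development occasionally calls for scraping data from web pages, and a proxy IP pool is commonly used to hide the machine's real IP. This article builds a tunnel proxy on top of openresty and a proxy IP pool, which is more convenient to use than the pool alone.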

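1. Proxy IP pool

1.1 Overview

A proxy IP pool maintains a queue of usable proxy IPs in a database. A typical implementation works as follows:

  1. Periodically fetch lists of proxy IPs from free or paid proxy sites.
  2. Store the proxy IPs in Redis as a Hash.
  3. Periodically check each proxy IP and remove the ones that no longer work.
  4. Expose an API for managing the proxy IP pool.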
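
The four steps above can be sketched roughly as follows. This is only an illustration, not the implementation of any particular project: a plain dict stands in for the Redis Hash, and the names ProxyPool, fetch, and check are hypothetical ones introduced here.

```python
import random
from typing import Callable, Dict, Iterable, Optional


class ProxyPool:
    """In-memory stand-in for a proxy pool; a dict plays the role of the Redis Hash."""

    def __init__(self, fetch: Callable[[], Iterable[str]], check: Callable[[str], bool]) -> None:
        self._fetch = fetch  # hypothetical: returns "ip:port" strings from a proxy site
        self._check = check  # hypothetical: returns True if the proxy still works
        self._store: Dict[str, dict] = {}  # field = "ip:port", value = metadata

    def refresh(self) -> int:
        """Steps 1 and 2: pull a fresh list and store new entries keyed by "ip:port"."""
        added = 0
        for proxy in self._fetch():
            if proxy not in self._store:
                self._store[proxy] = {}
                added += 1
        return added

    def validate(self) -> int:
        """Step 3: drop every proxy that fails the availability check."""
        dead = [proxy for proxy in self._store if not self._check(proxy)]
        for proxy in dead:
            del self._store[proxy]
        return len(dead)

    def get(self) -> Optional[str]:
        """Step 4: what a "get one proxy" API endpoint would return."""
        return random.choice(list(self._store)) if self._store else None

    def count(self) -> int:
        return len(self._store)


# Made-up data: only the proxy on port 8080 passes the check.
pool = ProxyPool(
    fetch=lambda: ["10.0.0.1:8080", "10.0.0.2:3128"],
    check=lambda proxy: proxy.endswith(":8080"),
)
pool.refresh()
pool.validate()
print(pool.get())  # 10.0.0.1:8080
```

In a real deployment refresh and validate would run on a timer, and the dict would be replaced by Redis so that other processes can read the same pool.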

1.2 Implementation

Here I use the open-source project jhao104/proxy_pool; see its documentation for how to set it up.

1.3 Test

import requests
from retrying import retry

# Address of the proxy pool API used in this article.
POOL_API = "http://192.168.0.121:5010"


def get_proxy_ip() -> str:
    """Ask the pool for one proxy and return it as an HTTP proxy URL."""
    resp = requests.get(url=f"{POOL_API}/get", timeout=5)
    resp.raise_for_status()
    return f"http://{resp.json()['proxy']}"


# Retry up to 5 times; every attempt fetches a fresh proxy from the pool.
@retry(stop_max_attempt_number=5)
def proxy_test() -> None:
    resp = requests.get(url="http://httpbin.org/get", proxies={"http": get_proxy_ip()}, timeout=5)
    resp.raise_for_status()
    # httpbin echoes the caller's IP in "origin"; it should be the proxy's IP.
    print(f"origin: {resp.json()['origin']}")


if __name__ == "__main__":
    try:
        proxy_test()
    except Exception as e:
        print(f"Error: {e}.")
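
The test above retries with a fresh proxy but never tells the pool which proxy failed. One possible refinement is sketched below. It assumes the pool service exposes a /delete/?proxy=... endpoint next to /get, as the proxy_pool README describes; check the paths against the version you deploy. The names http_get and fetch_via_pool are hypothetical, and the standard library urllib is used instead of requests so the sketch has no third-party dependencies.

```python
import json
import urllib.parse
import urllib.request
from typing import Callable, Optional

POOL_API = "http://192.168.0.121:5010"  # pool address used in this article


def http_get(url: str, proxy: Optional[str] = None, timeout: float = 5.0) -> str:
    """GET a URL, optionally through an HTTP proxy, and return the body as text."""
    handlers = [urllib.request.ProxyHandler({"http": proxy})] if proxy else []
    opener = urllib.request.build_opener(*handlers)
    with opener.open(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def fetch_via_pool(target: str, get: Callable[..., str] = http_get, attempts: int = 5) -> str:
    """Try up to `attempts` proxies and report each failing one back to the pool."""
    last_error: Optional[Exception] = None
    for _ in range(attempts):
        proxy = json.loads(get(f"{POOL_API}/get"))["proxy"]
        try:
            return get(target, proxy=f"http://{proxy}")
        except Exception as exc:  # timeouts, refused connections, HTTP errors
            last_error = exc
            # Assumed endpoint: ask the pool to drop the proxy that just failed.
            get(f"{POOL_API}/delete/?proxy={urllib.parse.quote(proxy)}")
    raise RuntimeError(f"all {attempts} attempts failed") from last_error


if __name__ == "__main__":
    try:
        print(fetch_via_pool("http://httpbin.org/get"))
    except Exception as e:
        print(f"Error: {e}.")
```

Passing the HTTP function in as a parameter keeps the retry logic testable without a running pool.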

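2. Tunnel proxy

2.1 Overview

A proxy IP pool hides the machine's real IP, but having to call the API for a new proxy IP before every request is inconvenient. A tunnel proxy removes that step: it exposes a single, fixed proxy address and internally forwards each request through a different proxy IP.

2.2 Implementation

Here I build the tunnel proxy with openresty on top of the proxy IP pool set up above.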

2.2.1 Directory layout

openresty
├── conf.d
│   └── tunnel-proxy.stream
├── docker.sh
└── nginx.conf

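2.2.2 Configuration files

  1. nginx.conf is openresty's main configuration file. The main change is the added stream block, which includes the stream configuration files. Its full content: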

    # nginx.conf  --  docker-openresty
    #
    # This file is installed to:
    #   `/usr/local/openresty/nginx/conf/nginx.conf`
    # and is the file loaded by nginx at startup,
    # unless the user specifies otherwise.
    #
    # It tracks the upstream OpenResty's `nginx.conf`, but removes the `server`
    # section and adds this directive:
    #     `include /etc/nginx/conf.d/*.conf;`
    #
    # The `docker-openresty` file `nginx.vh.default.conf` is copied to
    # `/etc/nginx/conf.d/default.conf`.  It contains the `server` section
    # of the upstream `nginx.conf`.
    #
    # See https://github.com/openresty/docker-openresty/blob/master/README.md#nginx-config-files
    #
    
    #user  nobody;
    #worker_processes 1;
    
    # Enables the use of JIT for regular expressions to speed-up their processing.
    pcre_jit on;
    
    
    
    #error_log  logs/error.log;
    #error_log  logs/error.log  notice;
    #error_log  logs/error.log  info;
    
    #pid        logs/nginx.pid;
    
    
    events {
        worker_connections  1024;
    }
    
    
    http {
        include       mime.types;
        default_type  application/octet-stream;
    
        # Enables or disables the use of underscores in client request header fields.
        # When the use of underscores is disabled, request header fields whose names contain underscores are marked as invalid and become subject to the ignore_invalid_headers directive.
        # underscores_in_headers off;
    
        #log_format  main  '$remote_addr - $remote_user [$time_local] "$request" '
        #                  '$status $body_bytes_sent "$http_referer" '
        #                  '"$http_user_agent" "$http_x_forwarded_for"';
    
        #access_log  logs/access.log  main;
    
            # Log in JSON Format
            # log_format nginxlog_json escape=json '{ "timestamp": "$time_iso8601", '
            # '"remote_addr": "$remote_addr", '
            #  '"body_bytes_sent": $body_bytes_sent, '
            #  '"request_time": $request_time, '
            #  '"response_status": $status, '
            #  '"request": "$request", '
            #  '"request_method": "$request_method", '
            #  '"host": "$host",'
            #  '"upstream_addr": "$upstream_addr",'
            #  '"http_x_forwarded_for": "$http_x_forwarded_for",'
            #  '"http_referrer": "$http_referer", '
            #  '"http_user_agent": "$http_user_agent", '
            #  '"http_version": "$server_protocol", '
            #  '"nginx_access": true }';
            # access_log /dev/stdout nginxlog_json;
    
        # See Move default writable paths to a dedicated directory (#119)
        # https://github.com/openresty/docker-openresty/issues/119
        client_body_temp_path /var/run/openresty/nginx-client-body;
        proxy_temp_path       /var/run/openresty/nginx-proxy;
        fastcgi_temp_path     /var/run/openresty/nginx-fastcgi;
        uwsgi_temp_path       /var/run/openresty/nginx-uwsgi;
        scgi_temp_path        /var/run/openresty/nginx-scgi;
    
        sendfile        on;
        #tcp_nopush     on;
    
        #keepalive_timeout  0;
        keepalive_timeout  65;
    
        #gzip  on;
    
        include /etc/nginx/conf.d/*.conf;
    
        # Don't reveal OpenResty version to clients.
        # server_tokens off;
    }
    
    stream {
        log_format proxy '$remote_addr [$time_local] '
                         '$protocol $status $bytes_sent $bytes_received '
                         '$session_time "$upstream_addr" '
                         '"$upstream_bytes_sent" "$upstream_bytes_received" "$upstream_connect_time"';
        access_log /usr/local/openresty/nginx/logs/access.log proxy;
        error_log /usr/local/openresty/nginx/logs/error.log notice;
        open_log_file_cache off;
    
        include /etc/nginx/conf.d/*.stream;
    }
    
  2. tunnel-proxy.stream configures the tunnel proxy: it queries Redis for a proxy IP and forwards the request through that proxy IP to the target address. Its full content:

    # tunnel-proxy.stream
    
    upstream backend {
        # Placeholder address; the real peer is set per connection below.
        server 0.0.0.0:9870;
    
        balancer_by_lua_block {
            local balancer = require "ngx.balancer"
            -- Proxy host and port chosen in preread_by_lua_block.
            local host = ngx.ctx.proxy_host
            local port = ngx.ctx.proxy_port
    
            local success, msg = balancer.set_current_peer(host, port)
            if not success then
                ngx.log(ngx.ERR, "Failed to set the peer. Error: ", msg, ".")
            end
        }
    }
    
    server {
        # Port the tunnel proxy listens on
        listen 9870;
        listen [::]:9870;
    
        proxy_connect_timeout 10s;
        proxy_timeout 10s;
        proxy_pass backend;
    
        preread_by_lua_block {
            local redis = require("resty.redis")
            local redis_instance = redis:new()
            redis_instance:set_timeout(3000)
    
            -- Redis host
            local rhost = "192.168.0.121"
            -- Redis port
            local rport = 6379
            -- Redis database index
            local database = 0
            -- Name of the Redis Hash that holds the proxies
            local rkey = "use_proxy"
            local success, msg = redis_instance:connect(rhost, rport)
            if not success then
                ngx.log(ngx.ERR, "Failed to connect to redis. Error: ", msg, ".")
                return
            end
    
            redis_instance:select(database)
            local proxys, msg = redis_instance:hkeys(rkey)
            -- Stop on a Redis error or when the Hash holds no proxies.
            if not proxys or #proxys == 0 then
                ngx.log(ngx.ERR, "Proxys num error. Error: ", msg, ".")
                return redis_instance:close()
            end
    
            math.randomseed(tostring(ngx.now()):reverse():sub(1, 6))
            -- Each Hash field is "ip:port"; pick one at random and split it.
            local proxy = proxys[math.random(#proxys)]
            local colon_index = string.find(proxy, ":")
            local proxy_ip = string.sub(proxy, 1, colon_index - 1)
            local proxy_port = string.sub(proxy, colon_index + 1)
            ngx.log(ngx.NOTICE, "Proxy: ", proxy, ", ip: ", proxy_ip, ", port: ", proxy_port, ".");
            -- Hand the choice over to balancer_by_lua_block.
            ngx.ctx.proxy_host = proxy_ip
            ngx.ctx.proxy_port = proxy_port
            redis_instance:close()
        }
    }
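
The Lua code above assumes that every field name in the use_proxy Hash has the form ip:port. A small check like the following, run against the list of field names, can catch malformed entries before the tunnel picks them. It is a sketch in Python rather than Lua; split_proxy is a hypothetical helper, and like the Lua code it splits on the first colon, so it only suits IPv4 addresses or host names.

```python
from typing import Tuple


def split_proxy(entry: str) -> Tuple[str, int]:
    """Split "host:port" on the first colon and validate both parts."""
    host, sep, port = entry.partition(":")
    if not sep or not host:
        raise ValueError(f"expected 'host:port', got {entry!r}")
    if not (port.isascii() and port.isdigit()) or not 0 < int(port) < 65536:
        raise ValueError(f"invalid port in {entry!r}")
    return host, int(port)


# Made-up entries: one valid, one without a port, one with a non-numeric port.
for entry in ["10.0.0.1:8080", "10.0.0.2", "10.0.0.3:http"]:
    try:
        print(entry, "->", split_proxy(entry))
    except ValueError as exc:
        print(entry, "-> rejected:", exc)
```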
    

2.2.3 openresty

openresty is started with docker. For convenience I saved the docker command in a shell script (docker.sh) with the following content:

docker run --name openresty -itd --restart always \
-p 9870:9870 \
-v "$PWD/nginx.conf":/usr/local/openresty/nginx/conf/nginx.conf \
-v "$PWD/conf.d":/etc/nginx/conf.d \
-e LANG=C.UTF-8 \
-e TZ=Asia/Shanghai \
--log-driver json-file \
--log-opt max-size=1g \
--log-opt max-file=3 \
openresty/openresty:alpine

Run bash docker.sh to start openresty. The tunnel proxy is now set up.
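
Before sending real traffic, it can help to confirm that the container is actually listening on the published port. The following is a minimal sketch using only the standard library; port_open is a hypothetical helper, and the host is the address used in this article. A successful connection only shows that openresty accepts connections on 9870, not that an upstream proxy is available.

```python
import socket


def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Tunnel address used in this article; adjust to your own host.
    print("tunnel reachable:", port_open("192.168.0.121", 9870))
```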

2.3 Test

import requests
from retrying import retry

# Every request goes to the same tunnel address; openresty picks the upstream proxy.
proxies = {
    "http": "http://192.168.0.121:9870"
}


# Retry up to 5 times; the tunnel may pick a dead proxy on some attempts.
@retry(stop_max_attempt_number=5)
def proxy_test() -> None:
    resp = requests.get(url="http://httpbin.org/get", proxies=proxies, timeout=5)
    resp.raise_for_status()
    # httpbin echoes the caller's IP in "origin"; it should be a proxy's IP.
    print(f"origin: {resp.json()['origin']}")


if __name__ == "__main__":
    try:
        proxy_test()
    except Exception as e:
        print(f"Error: {e}.")
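
A single successful request does not show that the exit IP actually changes between requests. One way to check is to repeat the request and tally the origin values, as in the sketch below. sample_origins is a hypothetical helper; it takes the request as a callable, so it can wrap the request from the test above or be fed canned values.

```python
from collections import Counter
from typing import Callable


def sample_origins(fetch_origin: Callable[[], str], rounds: int = 10) -> Counter:
    """Call fetch_origin repeatedly and count how often each exit IP shows up."""
    seen: Counter = Counter()
    for _ in range(rounds):
        try:
            seen[fetch_origin()] += 1
        except Exception:
            seen["<failed>"] += 1  # dead upstream proxies are expected now and then
    return seen


# Canned values in place of real requests: two distinct exit IPs over three rounds.
canned = iter(["1.1.1.1", "2.2.2.2", "1.1.1.1"])
print(sample_origins(lambda: next(canned), rounds=3))
```

With the tunnel running, fetch_origin would be a function that performs the httpbin request through the tunnel address and returns the origin field of the JSON response. Seeing more than one key in the result indicates that requests are leaving through different proxies.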

Source: https://www.cnblogs.com/xiaoQQya/p/16305232.html