Lesson 24: Internet Programming

Learning Objectives

1. Networking and HTTP Basics

1.1 Basic Concepts

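IP address and port: the address of a host on the network and the communication endpoint of a process on that host.
Protocols: TCP (connection-oriented, reliable), UDP (connectionless, best-effort), HTTP (an application-layer protocol that traditionally runs over TCP).
Client-server model: the client initiates a request; the server processes it and returns a response.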
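
To make the address-plus-port idea concrete, here is a minimal sketch that splits a URL into the host and port a client would connect to. The helper name `endpoint` and the `DEFAULT_PORTS` table are illustrative assumptions, not part of the lesson's own code.

```python
from urllib.parse import urlsplit

# Well-known default ports, used when the URL does not name one explicitly.
DEFAULT_PORTS = {"http": 80, "https": 443}

def endpoint(url):
    """Return the (host, port) pair a client would connect to for `url`."""
    parts = urlsplit(url)
    port = parts.port or DEFAULT_PORTS[parts.scheme]
    return parts.hostname, port

print(endpoint("https://example.com/index.html"))
print(endpoint("http://127.0.0.1:5000/echo"))
```

The first call falls back to the scheme's default port; the second uses the port written in the URL.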

1.2 HTTP Essentials

2. Making HTTP Requests with requests

2.1 GET: Parameters, Headers, and Timeouts

import requests

url = "https://httpbin.org/get"
params = {"q": "python", "page": 1}
headers = {"User-Agent": "CS106A-Client/1.0"}

r = requests.get(url, params=params, headers=headers, timeout=5)
r.raise_for_status()
print(r.status_code)
print(r.url)
print(r.json())
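
To see roughly what the `params` dictionary turns into in the request URL, the sketch below builds the query string by hand with the standard library. It assumes simple string and integer values like the ones above; it illustrates the encoding and is not a replacement for passing `params` to requests.

```python
from urllib.parse import urlencode

url = "https://httpbin.org/get"
params = {"q": "python", "page": 1}

# Build the query string by hand to see what ends up in the request URL.
full_url = f"{url}?{urlencode(params)}"
print(full_url)  # https://httpbin.org/get?q=python&page=1
```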

2.2 POST: Forms and JSON

import requests

# Form submission (sent as application/x-www-form-urlencoded)
r1 = requests.post("https://httpbin.org/post", data={"name": "Alice"}, timeout=5)
print(r1.json()["form"])  # {'name': 'Alice'}

# JSON submission (sent as application/json)
r2 = requests.post("https://httpbin.org/post", json={"name": "Bob"}, timeout=5)
print(r2.json()["json"])  # {'name': 'Bob'}

2.3 Sessions, Retries, and Error Handling

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()
retry = Retry(total=3, backoff_factor=0.5, status_forcelist=[429, 500, 502, 503, 504])
session.mount("https://", HTTPAdapter(max_retries=retry))

try:
    resp = session.get("https://httpbin.org/status/503", timeout=5)
    resp.raise_for_status()
except requests.exceptions.RequestException as e:
    print("request error:", e)

Prefer raise_for_status(), a sensible timeout, and a retry policy; avoid unbounded retries.
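
The "bounded retries" advice can also be sketched without any HTTP library. The helper `retry_call` below is hypothetical (it is not part of requests or urllib3); it assumes that only `ConnectionError` and `TimeoutError` are worth retrying and that the wait doubles after each failure.

```python
import time

def retry_call(func, attempts=3, backoff=0.5,
               retry_on=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Call func() up to `attempts` times, doubling the wait after each failure."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == attempts:
                raise  # give up: the retry budget is spent
            sleep(backoff * 2 ** (attempt - 1))

# Usage with a stand-in for a flaky request that succeeds on the third try.
calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"

waits = []
result = retry_call(flaky, sleep=waits.append)  # record waits instead of sleeping
print(result, waits)  # ok [0.5, 1.0]
```

Passing `sleep` as a parameter keeps the example fast and makes the backoff schedule easy to inspect.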

3. Web Scraping and Parsing

3.1 Basic Scraping Workflow

import requests
from bs4 import BeautifulSoup

resp = requests.get("https://example.com", headers={"User-Agent": "CS106A"}, timeout=5)
resp.raise_for_status()
soup = BeautifulSoup(resp.text, "html.parser")
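# Extract the title and all links
title_tag = soup.find("title")
title = title_tag.get_text(strip=True) if title_tag else ""  # a page may lack <title>
links = [a.get("href") for a in soup.find_all("a") if a.get("href")]
print(title)
print(links[:10])

Scraping tips: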

4. Socket Programming Basics

4.1 A Simple TCP Client

import socket

with socket.create_connection(("example.com", 80), timeout=5) as s:
    req = (
        "GET / HTTP/1.1\r\n"
        "Host: example.com\r\n"
        "Connection: close\r\n\r\n"
    )
    s.sendall(req.encode("ascii"))
    data = b""
    while True:
        chunk = s.recv(4096)
        if not chunk:
            break
        data += chunk
print(data[:200])
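
The bytes collected above are a raw HTTP response: a status line, header lines, a blank line, then the body. The sketch below shows one way to pull those pieces apart. The function name `parse_http_response` is an assumption for illustration, and it deliberately ignores chunked transfer encoding and repeated headers.

```python
def parse_http_response(raw):
    """Split raw response bytes into (status_code, headers, body)."""
    head, _, body = raw.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")
    # The status line looks like: HTTP/1.1 200 OK
    status_code = int(lines[0].split(" ", 2)[1])
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return status_code, headers, body

# A canned response, so the example runs without a network connection.
sample = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Type: text/html\r\n"
    b"Content-Length: 5\r\n"
    b"\r\n"
    b"hello"
)
status, headers, body = parse_http_response(sample)
print(status, headers["content-type"], body)
```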

4.2 A Simple TCP Echo Server

import socket

def run_echo_server(host="127.0.0.1", port=5000):
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as srv:
        srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        srv.bind((host, port))
        srv.listen()
        print(f"listening on {host}:{port}")
        while True:
            conn, addr = srv.accept()
            with conn:
                print("connected:", addr)
                while True:
                    buf = conn.recv(1024)
                    if not buf:
                        break
                    conn.sendall(buf)

# run_echo_server()

4.3 Using socketserver (More Concise)

import socketserver

class EchoHandler(socketserver.BaseRequestHandler):
    def handle(self):
        while True:
            data = self.request.recv(1024)
            if not data:
                break
            self.request.sendall(data)

# with socketserver.TCPServer(("127.0.0.1", 5001), EchoHandler) as server:
#     server.serve_forever()

UDP is commonly used for real-time or broadcast scenarios (delivery is not guaranteed); TCP suits reliable transport.
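
For contrast with the TCP examples, here is a minimal UDP echo sketch that runs entirely on the loopback interface. Both ends live in one script for brevity, which is an assumption of this example; a real server and client would be separate processes.

```python
import socket

with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as srv, \
     socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as cli:
    # Server side: bind a UDP socket; port 0 lets the OS pick a free port.
    srv.bind(("127.0.0.1", 0))
    srv.settimeout(2)
    cli.settimeout(2)
    server_addr = srv.getsockname()

    # Client side: no connect/accept handshake, just send a datagram.
    cli.sendto(b"ping", server_addr)

    # The server reads one datagram and echoes it back to the sender.
    data, client_addr = srv.recvfrom(1024)
    srv.sendto(data, client_addr)

    reply, _ = cli.recvfrom(1024)

print(reply)
```

The timeouts matter here: a lost datagram is never retransmitted by UDP, so without them `recvfrom` could block forever.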

5. Asynchronous I/O (Optional)

5.1 Batch Requests with asyncio + aiohttp

import asyncio
import aiohttp

async def fetch(session, url):
    # aiohttp expects a ClientTimeout object; a bare number is deprecated
    async with session.get(url, timeout=aiohttp.ClientTimeout(total=5)) as resp:
        resp.raise_for_status()
        return await resp.text()

async def main():
    urls = ["https://example.com" for _ in range(5)]
    async with aiohttp.ClientSession() as session:
        texts = await asyncio.gather(*[fetch(session, u) for u in urls])
        print(len(texts))

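# asyncio.run(main())

Asynchronous I/O suits large numbers of concurrent, I/O-bound tasks; pay attention to timeouts, concurrency limits, and error handling.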
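
The three cautions above (timeouts, concurrency limits, error handling) can be sketched with the standard library alone. In this example `fake_fetch` is a stand-in for a real network call, and the limit of 2 is an arbitrary assumption.

```python
import asyncio

async def fake_fetch(i):
    """Stand-in for a network request; item 3 fails on purpose."""
    await asyncio.sleep(0.01)
    if i == 3:
        raise RuntimeError("simulated failure")
    return i * 2

async def bounded(sem, i):
    async with sem:  # at most `limit` coroutines pass this point at once
        return await asyncio.wait_for(fake_fetch(i), timeout=1)

async def run_all(n=6, limit=2):
    sem = asyncio.Semaphore(limit)
    # return_exceptions=True collects failures as values instead of raising the first one.
    return await asyncio.gather(
        *(bounded(sem, i) for i in range(n)), return_exceptions=True
    )

results = asyncio.run(run_all())
ok = [r for r in results if not isinstance(r, Exception)]
print(ok)  # [0, 2, 4, 8, 10]
```

With aiohttp, the same semaphore would wrap the `session.get(...)` call inside `fetch`.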

6. Practical Applications

6.1 File Download (Streaming Writes and the Idea Behind Resumable Downloads)

import requests

url = "https://speed.hetzner.de/100MB.bin"
with requests.get(url, stream=True, timeout=10) as r:
    r.raise_for_status()
    with open("file.bin", "wb") as f:
        for chunk in r.iter_content(chunk_size=8192):
            if chunk:
                f.write(chunk)
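
The streaming code above always starts from byte zero. One common approach to resuming is to ask the server for only the missing bytes with a `Range` header. The sketch below uses the standard library's `urllib.request` rather than requests so that it has no extra dependencies; the same header can be passed to `requests.get(..., headers=...)`. The helper names are assumptions, and the case where the file is already complete (the server may answer 416) is not handled.

```python
import os
import urllib.request

def range_header(path):
    """Return a Range header asking for the bytes missing from `path`."""
    size = os.path.getsize(path) if os.path.exists(path) else 0
    return {"Range": f"bytes={size}-"} if size else {}

def resume_download(url, path, chunk_size=8192):
    """Continue a partial download, or start over if the server ignores Range."""
    req = urllib.request.Request(url, headers=range_header(path))
    with urllib.request.urlopen(req, timeout=10) as resp:
        # 206 Partial Content: the server honored the range, so append.
        # 200 OK: the server sent the whole file, so overwrite.
        mode = "ab" if resp.status == 206 else "wb"
        with open(path, mode) as f:
            for chunk in iter(lambda: resp.read(chunk_size), b""):
                f.write(chunk)

# resume_download("https://speed.hetzner.de/100MB.bin", "file.bin")
```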

6.2 Calling an Open API (Examples: Weather, Translation, etc.)

import requests

API = "https://httpbin.org/get"
params = {"city": "Beijing"}
resp = requests.get(API, params=params, timeout=5)
resp.raise_for_status()
print(resp.json())

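6.3 A Simple Chat Room (Extending the Echo Service)

# Idea: the server keeps a set of connections and broadcasts each received message to all of them
# Can be implemented with select, asyncio, or threading; only the outline is given here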
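
One way to flesh out that outline is with asyncio streams. This is a minimal sketch, not a complete chat server: it assumes line-based messages, skips the sender when broadcasting, and uses port 5002 as an arbitrary choice.

```python
import asyncio

clients = set()  # StreamWriter objects for all connected clients

async def broadcast(message, sender=None):
    """Send `message` to every connected client except the sender."""
    for writer in list(clients):
        if writer is sender:
            continue
        try:
            writer.write(message)
            await writer.drain()
        except ConnectionError:
            clients.discard(writer)  # drop clients that have gone away

async def handle_client(reader, writer):
    clients.add(writer)
    try:
        while True:
            line = await reader.readline()
            if not line:  # empty bytes means the client disconnected
                break
            await broadcast(line, sender=writer)
    finally:
        clients.discard(writer)
        writer.close()

async def main(host="127.0.0.1", port=5002):
    server = await asyncio.start_server(handle_client, host, port)
    async with server:
        await server.serve_forever()

# asyncio.run(main())
```

Because every handler runs on the same event loop, the shared `clients` set needs no lock; a threading-based version would need one.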

7. Common Errors and Debugging

import requests
try:
    r = requests.get("https://expired.badssl.com/", timeout=5)
    r.raise_for_status()
except requests.exceptions.SSLError as e:
    print("SSL error:", e)

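8. Programming Exercises and Challenges

Exercise 1: Basics

Exercise 2: Intermediate

Exercise 3: Hands-on Project

9. Comprehensive Assignment and Project

Assignment 24: Integrated Internet Programming Application

Task 1: Web Scraping and Parsing

Scrape several pages from a designated site, parse their titles and links, respect robots.txt, set a User-Agent, rate-limit your requests, and export the results.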
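
As a starting point for the robots.txt and rate-limiting requirements, the sketch below uses the standard library's `urllib.robotparser`. The robots.txt content is a made-up example parsed from a string so that no request is needed; the helper names `allowed` and `polite_delay` are assumptions.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt content (hypothetical), parsed from a string.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

def allowed(url, user_agent="CS106A"):
    """Return True if robots.txt permits `user_agent` to fetch `url`."""
    return rp.can_fetch(user_agent, url)

def polite_delay(user_agent="CS106A", default=1.0):
    """Seconds to wait between requests: the site's Crawl-delay, else a default."""
    delay = rp.crawl_delay(user_agent)
    return default if delay is None else delay

print(allowed("https://example.com/index.html"))
print(allowed("https://example.com/private/data"))
print(polite_delay())
```

Against a live site, `rp.set_url(".../robots.txt")` followed by `rp.read()` would replace the `parse` call, and the crawl loop would call `time.sleep(polite_delay())` between requests.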

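Task 2: Open API Client

Build an API client wrapper that supports authentication, retries, timeouts, pagination, and error handling, and that outputs a statistics report.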
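
Pagination is often the least familiar item on that list, so here is one possible shape for it. The sketch assumes a page-numbered API where an empty page signals the end; `iter_pages` and `fake_fetch_page` are hypothetical names, and the fake data stands in for a real call such as `session.get(API, params={"page": page})`.

```python
def iter_pages(fetch_page, max_pages=100):
    """Yield items page by page until an empty page or the page cap is reached."""
    for page in range(1, max_pages + 1):
        items = fetch_page(page)
        if not items:
            return
        yield from items

# Stand-in for a real API call; page 3 and beyond are empty.
FAKE_DATA = {1: ["a", "b"], 2: ["c"]}

def fake_fetch_page(page):
    return FAKE_DATA.get(page, [])

print(list(iter_pages(fake_fetch_page)))  # ['a', 'b', 'c']
```

The `max_pages` cap guards against an API that never returns an empty page.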

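Task 3: Network Service

Implement a reliable echo or chat-room server that supports concurrent connections, logging, and recovery from exceptions.

Study tips: prefer official APIs; treat network errors as the normal case; log what happens and add timeouts and retries; follow each site's terms and crawler etiquette.
Discussion questions:
Further reading: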