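Lecture 24: Internet Programming
1. Networking and HTTP Basics
1.1 Basic Concepts
IP/port: the address of a host on the network, and the communication endpoint of a process on that host.
Protocols: TCP (connection-oriented, reliable), UDP (connectionless, best-effort), HTTP (an application-layer protocol, carried over TCP in HTTP/1.1 and HTTP/2).
Client-server model: the client initiates a request; the server processes it and returns a response.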
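1.2 HTTP Essentials
- Common methods: GET (retrieve), POST (submit), PUT (replace), DELETE (remove)
- URLs and query parameters: ?q=python&page=1
- Request/response headers: User-Agent, Content-Type, Authorization
- Status codes: 200 OK, 301/302 redirect, 401 Unauthorized, 403 Forbidden, 404 Not Found, 429 Too Many Requests (rate limited), 5xx server error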
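To make the list above concrete, here is a minimal sketch that uses only the standard library to build and parse a query string and to look up the standard reason phrase for a few status codes. The host example.com and the /search path are placeholders chosen for illustration.

```python
from http import HTTPStatus
from urllib.parse import parse_qs, urlencode, urlsplit

# Build a URL with a query string (example.com/search is a placeholder).
query = urlencode({"q": "python", "page": 1})
url = "https://example.com/search?" + query
print(url)  # https://example.com/search?q=python&page=1

# Parse the query string back into a dict; every value is a list of strings.
parsed = parse_qs(urlsplit(url).query)
print(parsed)  # {'q': ['python'], 'page': ['1']}

# Map numeric status codes to their standard reason phrases.
for code in (200, 301, 404, 429):
    print(code, HTTPStatus(code).phrase)
```

Note that query values always come back as strings, so a numeric parameter such as page has to be converted explicitly.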
2. Making HTTP Requests with requests
2.1 GET: Parameters, Headers, and Timeouts
import requests
url = "https://httpbin.org/get"
params = {"q": "python", "page": 1}
headers = {"User-Agent": "CS106A-Client/1.0"}
r = requests.get(url, params=params, headers=headers, timeout=5)
r.raise_for_status()
print(r.status_code)
print(r.url)
print(r.json())
2.2 POST: Forms and JSON
import requests
# Form submission (sent as application/x-www-form-urlencoded)
r1 = requests.post("https://httpbin.org/post", data={"name": "Alice"}, timeout=5)
print(r1.json()["form"])  # {'name': 'Alice'}
# JSON submission (sent as application/json)
r2 = requests.post("https://httpbin.org/post", json={"name": "Bob"}, timeout=5)
print(r2.json()["json"])  # {'name': 'Bob'}
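The difference between the two submission styles is mostly in how the request body is encoded. The sketch below uses only the standard library to show roughly what each body looks like on the wire; the payload is made up for illustration, and requests handles this encoding itself when data= or json= is passed.

```python
import json
from urllib.parse import urlencode

payload = {"name": "Alice", "lang": "python"}

# Form body: key=value pairs joined with "&"
form_body = urlencode(payload)
print(form_body)  # name=Alice&lang=python

# JSON body: a serialized JSON object
json_body = json.dumps(payload)
print(json_body)  # {"name": "Alice", "lang": "python"}
```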
2.3 Sessions, Retries, and Error Handling
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
session = requests.Session()
# Retry up to 3 times on rate limiting and transient server errors, backing off between attempts
retry = Retry(total=3, backoff_factor=0.5, status_forcelist=[429, 500, 502, 503, 504])
session.mount("https://", HTTPAdapter(max_retries=retry))
try:
    resp = session.get("https://httpbin.org/status/503", timeout=5)
    resp.raise_for_status()
except requests.exceptions.RequestException as e:
    print("request error:", e)
Use raise_for_status(), a sensible timeout, and a bounded retry policy; never retry indefinitely.
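To illustrate what a bounded retry with backoff does, here is a hand-rolled sketch. The names retry_call and flaky are hypothetical helpers written for this example, and the doubling delay is a simplification rather than urllib3's exact formula. The sleep function is injectable so the demo runs instantly.

```python
import time

def retry_call(func, attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call func() up to `attempts` times, doubling the delay after each failure."""
    for attempt in range(attempts):
        try:
            return func()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: re-raise the last error
            sleep(base_delay * (2 ** attempt))

# Hypothetical flaky operation: fails twice, then succeeds.
calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"

delays = []
result = retry_call(flaky, sleep=delays.append)  # record delays instead of sleeping
print(result, delays)  # ok [0.5, 1.0]
```

Because the attempt count is fixed, the loop always terminates, which is the property the advice above asks for.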
3. Web Scraping and Parsing
3.1 Basic Scraping Workflow
import requests
from bs4 import BeautifulSoup
resp = requests.get("https://example.com", headers={"User-Agent": "CS106A"}, timeout=5)
resp.raise_for_status()
soup = BeautifulSoup(resp.text, "html.parser")
# Extract the title and all links (a page may have no <title> tag)
title_tag = soup.find("title")
title = title_tag.get_text(strip=True) if title_tag else ""
links = [a.get("href") for a in soup.find_all("a") if a.get("href")]
print(title)
print(links[:10])
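Scraping guidelines:
- Respect robots.txt and the site's terms of use; do not scrape at high frequency
- Set a User-Agent, sleep a reasonable interval between requests, and retry on failure
- Prefer an official API when one exists; study the page structure before parsing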
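One way to follow the robots.txt guideline is the standard library's urllib.robotparser. The sketch below parses a made-up robots.txt from a string so it runs offline; against a real site you would typically call set_url() and read() instead.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt, for illustration only.
robots_txt = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

allowed = rp.can_fetch("CS106A", "https://example.com/index.html")
blocked = rp.can_fetch("CS106A", "https://example.com/private/data")
delay = rp.crawl_delay("CS106A")

print(allowed)  # True
print(blocked)  # False
print(delay)    # 2
```

A scraper could check can_fetch() before every request and pass the crawl delay, when one is given, to time.sleep() between requests.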
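4. Socket Programming Basics
4.1 A Simple TCP Client
import socket
with socket.create_connection(("example.com", 80), timeout=5) as s:
    # Minimal HTTP/1.1 request; "Connection: close" asks the server to close the socket when done
    req = (
        "GET / HTTP/1.1\r\n"
        "Host: example.com\r\n"
        "Connection: close\r\n\r\n"
    )
    s.sendall(req.encode("ascii"))
    data = b""
    while True:
        chunk = s.recv(4096)
        if not chunk:  # empty bytes: the server closed the connection
            break
        data += chunk
print(data[:200])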
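The bytes collected by the client are a raw HTTP response: a status line, headers, a blank line, then the body. The sketch below splits such a response by hand. The sample bytes are written out for illustration so the code runs offline, and it ignores details a real parser must handle, such as chunked transfer encoding and repeated headers.

```python
# A hand-written sample response; a real one would come from the recv loop.
raw = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Type: text/html; charset=UTF-8\r\n"
    b"Content-Length: 14\r\n"
    b"\r\n"
    b"<h1>Hello</h1>"
)

# Headers and body are separated by the first blank line (\r\n\r\n).
head, _, body = raw.partition(b"\r\n\r\n")
status_line, *header_lines = head.decode("iso-8859-1").split("\r\n")

# Status line: version, numeric code, reason phrase.
version, status_code, reason = status_line.split(" ", 2)

# Header names are case-insensitive, so store them lowercased.
headers = {}
for line in header_lines:
    name, _, value = line.partition(":")
    headers[name.strip().lower()] = value.strip()

print(version, int(status_code), reason)  # HTTP/1.1 200 OK
print(headers["content-type"])            # text/html; charset=UTF-8
print(body)                               # b'<h1>Hello</h1>'
```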
4.2 A Simple TCP Echo Server
import socket
def run_echo_server(host="127.0.0.1", port=5000):
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as srv:
        # Allow a quick restart on the same port
        srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        srv.bind((host, port))
        srv.listen()
        print(f"listening on {host}:{port}")
        while True:
            conn, addr = srv.accept()  # serves one client at a time
            with conn:
                print("connected:", addr)
                while True:
                    buf = conn.recv(1024)
                    if not buf:  # client closed the connection
                        break
                    conn.sendall(buf)
# run_echo_server()
4.3 Using socketserver (More Concise)
import socketserver
class EchoHandler(socketserver.BaseRequestHandler):
    def handle(self):
        # self.request is the connected client socket
        while True:
            data = self.request.recv(1024)
            if not data:
                break
            self.request.sendall(data)
# with socketserver.TCPServer(("127.0.0.1", 5001), EchoHandler) as server:
#     server.serve_forever()
UDP is common in real-time and broadcast scenarios (delivery is not guaranteed); TCP suits transfers that must be reliable.
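For contrast with the TCP examples, here is a minimal UDP sketch on the loopback interface. There is no connect/accept handshake: each sendto() is an independent datagram. Binding to port 0 lets the operating system pick a free port. On loopback the datagram should arrive, but over a real network it could be lost, which is why the receiver sets a timeout.

```python
import socket

with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as receiver, \
     socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sender:
    # Receiver: bind to an OS-chosen free port on the loopback interface.
    receiver.bind(("127.0.0.1", 0))
    receiver.settimeout(2)  # do not wait forever if the datagram is lost
    addr = receiver.getsockname()

    # Sender: no connection setup, just address each datagram explicitly.
    sender.sendto(b"ping", addr)

    # recvfrom returns the payload and the sender's (host, port).
    received, peer = receiver.recvfrom(1024)

print(received, peer[0])
```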
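5. Asynchronous I/O (Optional)
5.1 Batch Requests with asyncio + aiohttp
import asyncio
import aiohttp
async def fetch(session, url):
    # aiohttp expects a ClientTimeout object rather than a bare number
    timeout = aiohttp.ClientTimeout(total=5)
    async with session.get(url, timeout=timeout) as resp:
        resp.raise_for_status()
        return await resp.text()
async def main():
    urls = ["https://example.com" for _ in range(5)]
    async with aiohttp.ClientSession() as session:
        # Run all requests concurrently; results come back in input order
        texts = await asyncio.gather(*[fetch(session, u) for u in urls])
    print(len(texts))
# asyncio.run(main())
Asynchronous I/O suits large numbers of concurrent, I/O-bound tasks; pay attention to timeouts, concurrency limits, and error handling.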
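The three concerns above (timeouts, concurrency limits, error handling) can be sketched with the standard library alone. In this example fake_fetch is a hypothetical stand-in for a network call, with asyncio.sleep simulating latency, so the code runs without aiohttp or network access.

```python
import asyncio

async def fake_fetch(i, sem, stats):
    """Stand-in for a network request; asyncio.sleep simulates I/O latency."""
    async with sem:  # at most `limit` coroutines are inside this block at once
        stats["active"] += 1
        stats["peak"] = max(stats["peak"], stats["active"])
        try:
            # wait_for raises a timeout error if the call takes too long
            await asyncio.wait_for(asyncio.sleep(0.01), timeout=1)
            return i * 2
        finally:
            stats["active"] -= 1

async def run_all(n=10, limit=3):
    sem = asyncio.Semaphore(limit)
    stats = {"active": 0, "peak": 0}
    # return_exceptions=True puts errors in the result list
    # instead of raising the first one
    results = await asyncio.gather(
        *[fake_fetch(i, sem, stats) for i in range(n)],
        return_exceptions=True,
    )
    return results, stats["peak"]

results, peak = asyncio.run(run_all())
print(results, peak)
```

The same semaphore pattern applies to real requests: wrap the session.get call in the async with sem block so that only a few connections are open at a time.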
6. Practical Examples
6.1 File Download (Streaming Writes, Notes on Resuming)
import requests
url = "https://speed.hetzner.de/100MB.bin"
# stream=True avoids loading the whole file into memory
with requests.get(url, stream=True, timeout=10) as r:
    r.raise_for_status()
    with open("file.bin", "wb") as f:
        for chunk in r.iter_content(chunk_size=8192):
            if chunk:
                f.write(chunk)
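The download code above always starts from the beginning. One common approach to resuming is an HTTP Range request: ask the server only for the bytes not yet on disk. The helper resume_headers below is a hypothetical name for this sketch; it only builds the header and runs offline against a temporary file.

```python
import os
import tempfile

def resume_headers(path):
    """Return a Range header requesting the bytes not yet on disk."""
    size = os.path.getsize(path) if os.path.exists(path) else 0
    return {"Range": f"bytes={size}-"} if size else {}

# Demo: a small local file stands in for a partial download.
with tempfile.TemporaryDirectory() as tmp:
    partial = os.path.join(tmp, "file.bin")
    with open(partial, "wb") as f:
        f.write(b"x" * 1024)
    have = resume_headers(partial)
    fresh = resume_headers(os.path.join(tmp, "missing.bin"))

print(have)   # {'Range': 'bytes=1024-'}
print(fresh)  # {}
```

To use it, pass the result as headers= to requests.get and open the file in "ab" mode. A server that supports ranges answers 206 Partial Content; if it answers 200, it is sending the whole file and the download should restart from scratch in "wb" mode.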
6.2 Calling an Open API (e.g., Weather or Translation)
import requests
API = "https://httpbin.org/get"
params = {"city": "Beijing"}
resp = requests.get(API, params=params, timeout=5)
resp.raise_for_status()
print(resp.json())