作业 9：Bajillion 搜索引擎

任务概览

本作业带你用 Python 从零实现一个简易搜索引擎。你将：

热身：编写一个基于列表的函数（后续可复用）。
Part A：为一组文本文件构建倒排索引。
Part B：基于索引实现搜索功能。
可选扩展：用类把搜索引擎做成本地 Web 服务。
伦理反思：回答两道关于搜索引擎伦理的问题。

热身：共同元素（Sandcastle）

在 common_elements.py 中实现：

def common(list1, list2):
    """返回出现在两个列表中的所有元素（去重），不修改原列表。"""
    result = []
    seen = set()
    set2 = set(list2)
    for x in list1:
        if x in set2 and x not in seen:
            result.append(x)
            seen.add(x)
    return result

示例：

>>> common(['a', 'b', 'c'], ['c', 'a', 'z'])
['a', 'c']
>>> common(['a', 'a', 'b'], ['x', 'a', 'a'])
['a']

Part A：构建倒排索引

在 searchengine.py 中实现：

def create_index(filenames, index, file_titles):
    """
    - filenames: 字符串列表，待索引的文件路径
    - index: dict[str, list[str]]，术语 → 文件名 列表（就地构建）
    - file_titles: dict[str, str]，文件名 → 标题（就地构建）
    术语规则：
      * 以空格/换行分割
      * 全部小写
      * 仅去除首尾标点（string.punctuation）；中间标点保留
      * 去除空字符串
    首行作为标题存入 file_titles，同时首行中的术语也要加入索引。
    """
    import os, string
    for filename in filenames:
        with open(filename, encoding='utf-8') as f:
            # 标题
            title = f.readline().rstrip('\n')
            file_titles[filename] = title

            def add_terms(line: str):
                for raw in line.split():
                    term = raw.lower().strip(string.punctuation)
                    if term:
                        index.setdefault(term, [])
                        if not index[term] or index[term][-1] != filename:
                            # 防止同一文件重复插入
                            index[term].append(filename)

            # 标题行也参与索引
            add_terms(title)
            # 余下行
            for line in f:
                add_terms(line)

运行（小数据集）：

python3 searchengine.py small

程序会打印索引与 file_titles 的内容以便检查。

Part B：搜索功能

在 searchengine.py 中实现：

def search(index, query):
    """返回包含 query 所有术语的文件列表（交集搜索）。"""
    terms = query.split()
    if not terms:
        return []
    # 起始为第一个术语的倒排列表
    posting = index.get(terms[0], [])
    for t in terms[1:]:
        posting = common(posting, index.get(t, []))  # 复用热身函数
        if not posting:
            break
    return posting

交互式搜索模式：

python3 searchengine.py small -s

程序先构建索引，再循环读取查询并输出：每条结果展示标题与文件名。

文件标题字典

每个文件首行作为标题，存入 file_titles[filename] = title。注意：首行同样需要参与索引（常见遗漏）。

BBC News 数据集

当小数据集工作良好后，运行：

python3 searchengine.py bbcnews
python3 searchengine.py bbcnews -s

输出较大，主要用于稳定性验证。

伦理问题（提交到 ethics.txt）

问题 1：广告驱动的排序偏见为何有伦理风险？你建议怎样从高层机制上缓解？

问题 2：在“偏好友商/多元视角/高质量/相关性/透明度”等目标中，你认为哪些最重要？如何在 Bajillion 中落实？

提交文件

必交：common_elements.py、searchengine.py、ethics.txt
若做扩展：extension_server.py、extension.py 及依赖文件

（可选）Web 扩展：Bajillion Online!

在 extension_server.py 中用类实现一个本地服务器：

运行：python3 extension_server.py，浏览器访问 http://localhost:8000
处理 /search?query=... 请求，返回形如 [{"title": "..."}, ...] 的 JSON 字符串。
可用 json.dumps(collection, indent=2) 将结果列表转成字符串。
合理调用：create_index、search、textfiles_in_dir。
可扩展字段：url、snippet。

实现要点速览

strip 标点：term = raw.lower().strip(string.punctuation)
去空术语：空字符串不入索引
去重复文件：同一术语的同一文件仅出现一次
交集搜索：逐术语求 posting list 的交集

背景与简介

本作业将综合运用你的 Python 技能来构建一个你几乎每天都会用到的应用——搜索引擎！最初的 Google 在 1998 年由两位斯坦福的博士生 Larry Page 与 Sergey Brin 创建，最早版本也有相当一部分使用 Python 编写。感兴趣的话可以阅读他们最初的论文（链接），以及曾经托管在斯坦福的早期地址 google.stanford.edu（现已跳转）。

你将亲手实现一个精简版的文本搜索引擎 “Bajillion”。名字来自一个表示“超大数量”的词，寓意“这个季度学到的东西多到不行”。核心分为两部分：首先构建“倒排索引”，其次基于索引实现检索。最后还有可选扩展（将搜索引擎做成本地 Web 服务），以及两道与搜索伦理有关的思考题。

术语、索引与具体示例

术语（term）可以理解为“词/名字/数字”等可被检索的单位；我们用一个 dict[str, list[str]] 表示倒排索引：键为术语，值为包含该术语的“文件名列表”（posting list）。术语规则：

分隔：以空格或换行分隔。
小写化：全部转换为小写。
去首尾标点：仅去掉开头和结尾的标点（string.punctuation）；中间的标点保留。
过滤空术语：去掉为空的字符串（纯标点被剥离后即为空）。

例如文件 doc1.txt：

*We* are 100,000
STRONG! $$

应解析为术语：we、are、100,000、strong（$$ 被忽略）。索引结果（初始为空）应为：

{
    'we': ['doc1.txt'],
    'are': ['doc1.txt'],
    '100,000': ['doc1.txt'],
    'strong': ['doc1.txt']
}

再处理 doc2.txt：

Strong, you are!
--Yoda--

合并后索引应为：

{
    'we' : ['doc1.txt'],
    'are' : ['doc1.txt', 'doc2.txt'],
    '100,000' : ['doc1.txt'],
    'strong' : ['doc1.txt', 'doc2.txt'],
    'you': ['doc2.txt'],
    'yoda': ['doc2.txt']
}

small 数据集的搜索示例

在 small 数据集上，假设索引构建正确，下列查询应返回对应结果（顺序可能不同，但集合应一致）：

示例 1：search(index, 'apple') → ['small/1.txt', 'small/2.txt', 'small/3.txt']
示例 2：search(index, 'ball') → ['small/1.txt', 'small/3.txt']
示例 3：search(index, 'lizard') → ['small/3.txt']
示例 4：search(index, 'apple ball') → ['small/1.txt', 'small/3.txt']
示例 5：search(index, 'dog ball') → ['small/1.txt']
示例 6：search(index, 'dog ball hamster') → []
示例 7：search(index, 'nope') → []