2.4.3 The Rule Object

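The Rule object is a core component of the Bricks framework, used to define complex data-extraction rules. It covers a range of extraction scenarios, including conditional extraction, data transformation, and default values.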

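Rule class overview

The Rule class lives at bricks.lib.extractors.Rule. It encapsulates the extraction logic so that complex extraction rules stay simple to declare and use.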

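Constructor parameters

| Parameter | Type | Description | Default |
| --- | --- | --- | --- |
| exprs | Optional[Union[str, Callable]] | Matching rule; either a string or a function | "" |
| condition | Optional[Callable] | Condition function; no matching is performed when the condition is not met | None |
| pre_script | Optional[Callable] | Pre-processing script | None |
| post_script | Optional[Callable] | Post-processing script | None |
| is_array | Optional[bool] | Whether the match result is an array | None |
| acquire | bool | Whether the field is required | True |
| options | Optional[dict] | Other parameters, used together with the engine | None |
| const | Optional[Any] | Static value; skips rule matching and returns this value directly | ... |
| default | Optional[Any] | Default value, returned when nothing matches | None |
| engine | Optional[Union[Type["Extractor"], str]] | Extraction engine | None |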
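
The table does not say how these parameters interact at run time. The following is a minimal pure-Python sketch of one plausible evaluation order, inferred only from the parameter descriptions and the examples in this section. `SketchRule`, `dig`, and `_UNSET` are hypothetical names invented for illustration; this is not Bricks' implementation, and the real order of steps may differ.

```python
from typing import Any, Callable, Optional

_UNSET = object()  # hypothetical sentinel meaning "no const was given"


def dig(obj: Any, path: str) -> Any:
    """Hypothetical dotted-path lookup standing in for a real extraction engine."""
    for key in path.split("."):
        if not isinstance(obj, dict) or key not in obj:
            return None
        obj = obj[key]
    return obj


class SketchRule:
    """Illustrative stand-in for Rule; the evaluation order is an assumption."""

    def __init__(
        self,
        exprs: str = "",
        condition: Optional[Callable[[Any], bool]] = None,
        pre_script: Optional[Callable[[Any], Any]] = None,
        post_script: Optional[Callable[[Any], Any]] = None,
        const: Any = _UNSET,
        default: Any = None,
    ) -> None:
        self.exprs = exprs
        self.condition = condition
        self.pre_script = pre_script
        self.post_script = post_script
        self.const = const
        self.default = default

    def apply(self, obj: Any) -> Any:
        # Assumed step 1: a failed condition short-circuits to the default.
        if self.condition is not None and not self.condition(obj):
            return self.default
        if self.const is not _UNSET:
            # Assumed step 2: const skips matching entirely.
            result = self.const
        else:
            # Assumed step 3: preprocess the input, then match.
            if self.pre_script is not None:
                obj = self.pre_script(obj)
            result = dig(obj, self.exprs)
            if result is None:
                return self.default
        # Assumed step 4: post-process whatever was produced.
        if self.post_script is not None:
            result = self.post_script(result)
        return result


user = {"user": {"name": " 张三 ", "status": "active"}}
rule = SketchRule(
    exprs="user.name",
    condition=lambda o: dig(o, "user.status") == "active",
    post_script=str.strip,
    default="n/a",
)
print(rule.apply(user))                              # 张三
print(rule.apply({"user": {"status": "inactive"}}))  # n/a
print(SketchRule(const="fixed").apply(user))         # fixed
```

Treating a failed condition and a missing match the same way (both fall back to `default`) is a design choice of this sketch, chosen to stay consistent with the examples in this section.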

Basic usage

Simple rule

from bricks.lib.extractors import Rule, JsonExtractor

data = {"name": "张三", "age": 25, "city": "北京"}

# Create a simple rule that reads the "name" field
name_rule = Rule(exprs="name", engine=JsonExtractor)
result = name_rule.apply(data)
print(result)  # "张三"

Using a static value

from bricks.lib.extractors import Rule

# const skips rule matching and returns the given value directly
static_rule = Rule(const="fixed value")
result = static_rule.apply({"any": "data"})
print(result)  # "fixed value"

Setting a default value

from bricks.lib.extractors import Rule, JsonExtractor

data = {"name": "张三"}

# Fall back to a default value when nothing matches
age_rule = Rule(exprs="age", default=18, engine=JsonExtractor)
result = age_rule.apply(data)
print(result)  # 18 (the "age" field does not exist)

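Advanced features

Conditional extraction

The condition parameter takes a predicate; extraction runs only when the condition is met:

from bricks.lib.extractors import Rule, JsonExtractor

data = {"user": {"name": "张三", "age": 25, "status": "active"}}

# Extract the name only when the user's status is "active"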
name_rule = Rule(
    exprs="user.name",
    condition=lambda obj: JsonExtractor.extract(obj, "user.status") == "active",
    engine=JsonExtractor
)
 
result = name_rule.apply(data)
print(result)  # "张三"
 
# When the status is not "active"
inactive_data = {"user": {"name": "李四", "age": 30, "status": "inactive"}}
result = name_rule.apply(inactive_data)
print(result)  # None (condition not met)
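
A condition that compares an extracted value against a number can fail in a different way: if the field is absent and the lookup yields None, an expression such as `None >= 18` raises TypeError in Python 3. The sketch below shows one way to build None-safe predicates. `dig` and `when` are hypothetical helpers written for this illustration (with `dig` standing in for an extractor call); they are not part of Bricks.

```python
from typing import Any, Callable


def dig(obj: Any, path: str) -> Any:
    """Hypothetical dotted-path lookup; returns None when any key is missing."""
    for key in path.split("."):
        if not isinstance(obj, dict) or key not in obj:
            return None
        obj = obj[key]
    return obj


def when(path: str, predicate: Callable[[Any], bool]) -> Callable[[Any], bool]:
    """Build a condition that is False, rather than an error, for missing fields."""

    def condition(obj: Any) -> bool:
        value = dig(obj, path)
        return value is not None and bool(predicate(value))

    return condition


is_adult = when("user.age", lambda age: age >= 18)

print(is_adult({"user": {"age": 25}}))  # True
print(is_adult({"user": {"age": 17}}))  # False
print(is_adult({"user": {}}))           # False (no TypeError)
```

A predicate built this way could be passed wherever a condition callable is expected, since it takes the input object and returns a bool.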

Pre- and post-processing

Use pre_script and post_script to process data before and after extraction, respectively:

from bricks.lib.extractors import Rule, JsonExtractor
 
data = {"user": {"full_name": "  张三  ", "age": 25}}
 
# pre_script cleans the input; post_script formats the result
name_rule = Rule(
    exprs="user.full_name",
    pre_script=lambda obj: obj,  # hook for preprocessing the input data (a no-op here)
    post_script=lambda result: result.strip().upper() if result else result,
    engine=JsonExtractor
)
 
result = name_rule.apply(data)
print(result)  # "张三" (whitespace stripped; upper() leaves CJK characters unchanged)
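
When a post_script grows beyond one expression, it can be easier to read as a chain of small steps. The sketch below is plain Python with no Bricks dependency; `pipeline` is a hypothetical helper, and the assumption is only that post_script accepts any callable taking the extracted value, as the lambda above does.

```python
from typing import Any, Callable


def pipeline(*steps: Callable[[Any], Any]) -> Callable[[Any], Any]:
    """Compose steps left to right, passing None through untouched."""

    def run(value: Any) -> Any:
        for step in steps:
            if value is None:
                # Nothing was extracted, so skip the remaining steps.
                return None
            value = step(value)
        return value

    return run


clean_name = pipeline(str.strip, str.title)

print(clean_name("  alice smith  "))  # Alice Smith
print(clean_name(None))               # None
```

The None check replaces the `if result else result` guard in the lambda, so each step can assume it receives a real value.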

Array handling

The is_array parameter controls whether the result is returned as a single value or as an array:

from bricks.lib.extractors import Rule, JsonExtractor
 
data = {"items": [{"name": "item1"}, {"name": "item2"}]}
 
# Force a single value (the first match)
first_rule = Rule(
    exprs="items[*].name",
    is_array=False,
    engine=JsonExtractor
)
result = first_rule.apply(data)
print(result)  # "item1"
 
# Force an array
array_rule = Rule(
    exprs="items[0].name",
    is_array=True,
    engine=JsonExtractor
)
result = array_rule.apply(data)
print(result)  # ["item1"]
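
The two examples above suggest a simple coercion: unwrap a list to its first element when is_array is False, and wrap a scalar in a list when it is True. The sketch below expresses that reading in plain Python. `normalize` is a hypothetical function, and both the None handling and the empty-list handling are assumptions; Bricks may behave differently in those edge cases.

```python
from typing import Any, Optional


def normalize(result: Any, is_array: Optional[bool] = None) -> Any:
    """Coerce a match result to a scalar or a list, mirroring the examples above."""
    if result is None or is_array is None:
        # Assumption: with no preference, or no match, the result is left as-is.
        return result
    if is_array:
        return result if isinstance(result, list) else [result]
    if isinstance(result, list):
        # Assumption: an empty list collapses to None.
        return result[0] if result else None
    return result


print(normalize(["item1", "item2"], is_array=False))  # item1
print(normalize("item1", is_array=True))              # ['item1']
print(normalize(["item1", "item2"]))                  # ['item1', 'item2']
```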

The Group object

A Group combines multiple Rules with OR logic: it tries each rule in turn and returns the first successful result.

Basic usage

from bricks.lib.extractors import Rule, Group, JsonExtractor
 
data = {"title": "Title text"}

# Create several rules
rule1 = Rule(exprs="name", engine=JsonExtractor)  # fails: data has no "name" field
rule2 = Rule(exprs="title", engine=JsonExtractor)  # succeeds

# Combine the rules
group = Group([rule1, rule2])
result = group.apply(data)
print(result)  # "Title text"

Using the OR operator

from bricks.lib.extractors import Rule, JsonExtractor

data = {"title": "Title text"}

# Combine rules with the | operator
combined_rule = Rule(exprs="name", engine=JsonExtractor) | Rule(exprs="title", engine=JsonExtractor)
result = combined_rule.apply(data)
print(result)  # "Title text"
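
For readers curious how a `|` operator can produce first-match fallback, the sketch below shows the general Python technique using `__or__`. `KeyRule` and `FirstMatch` are hypothetical classes written for this illustration, and treating "success" as "not None" is an assumption; none of this is taken from the Bricks source.

```python
from typing import Any, List


class FirstMatch:
    """Hypothetical stand-in for Group: tries rules in order, returns the first hit."""

    def __init__(self, rules: List[Any]) -> None:
        self.rules = list(rules)

    def __or__(self, other: Any) -> "FirstMatch":
        # Chaining a | b | c keeps one flat list instead of nesting groups.
        return FirstMatch(self.rules + [other])

    def apply(self, obj: Any) -> Any:
        for rule in self.rules:
            result = rule.apply(obj)
            if result is not None:  # assumption: None means "no match"
                return result
        return None


class KeyRule:
    """Hypothetical minimal rule that reads one top-level key from a dict."""

    def __init__(self, key: str) -> None:
        self.key = key

    def __or__(self, other: Any) -> FirstMatch:
        return FirstMatch([self, other])

    def apply(self, obj: dict) -> Any:
        return obj.get(self.key)


combined = KeyRule("name") | KeyRule("title") | KeyRule("heading")

print(combined.apply({"title": "Title text"}))  # Title text
print(combined.apply({"heading": "H1"}))        # H1
print(combined.apply({}))                       # None
```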

Practical examples

Complex data extraction

from bricks.lib.extractors import Rule, JsonExtractor

# Simulated e-commerce product data
product_data = {
    "product": {
        "id": "12345",
        "name": "Smartphone",
        "price": {"current": 2999, "original": 3999},
        "specs": {
            "brand": "华为",
            "model": "P50",
            "storage": "128GB"
        },
        "availability": True,
        "reviews": [
            {"rating": 5, "comment": "Works great"},
            {"rating": 4, "comment": "Good value for money"}
        ]
    }
}
 
# Define the extraction rules, including computed fields
rules = {
    "product_id": Rule(exprs="product.id", engine=JsonExtractor),
    "product_name": Rule(exprs="product.name", engine=JsonExtractor),
    "current_price": Rule(exprs="product.price.current", engine=JsonExtractor),
    "brand": Rule(exprs="product.specs.brand", engine=JsonExtractor),
    "is_available": Rule(exprs="product.availability", engine=JsonExtractor),
    "avg_rating": Rule(
        exprs="avg(product.reviews[*].rating)",
        engine=JsonExtractor,
        post_script=lambda x: round(x, 1) if x else 0
    ),
    "discount": Rule(
        exprs="product.price",
        engine=JsonExtractor,
        post_script=lambda price: round((price["original"] - price["current"]) / price["original"] * 100, 1) if price else 0
    ),
    "full_name": Rule(
        const=None,
        post_script=lambda _: f"{JsonExtractor.extract(product_data, 'product.specs.brand')} {JsonExtractor.extract(product_data, 'product.name')}"
    )
}
 
# Apply each rule to the product data
results = {}
for key, rule in rules.items():
    results[key] = rule.apply(product_data)
 
print(results)
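
As a sanity check on the two computed fields, the same arithmetic can be reproduced in plain Python on the sample values. This sketch does not call Bricks; it assumes that the `avg(...)` expression yields the arithmetic mean of the ratings and that the discount rule receives the price dict unchanged.

```python
# Sample values copied from product_data above.
ratings = [5, 4]
price = {"current": 2999, "original": 3999}

# Same formula as the avg_rating post_script: round the mean to one decimal.
mean_rating = sum(ratings) / len(ratings)
avg_rating = round(mean_rating, 1) if mean_rating else 0

# Same formula as the discount post_script: percentage off the original price.
discount = round((price["original"] - price["current"]) / price["original"] * 100, 1)

print(avg_rating)  # 4.5
print(discount)    # 25.0
```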

Conditional data extraction

from bricks.lib.extractors import Rule, JsonExtractor
 
user_data = {
    "user": {
        "name": "张三",
        "age": 17,
        "email": "zhangsan@example.com",
        "is_verified": True
    }
}
 
# Extract the email only for adult users
email_rule = Rule(
    exprs="user.email",
    condition=lambda obj: JsonExtractor.extract(obj, "user.age") >= 18,
    default="underage user",
    engine=JsonExtractor
)

result = email_rule.apply(user_data)
print(result)  # "underage user" (age is below 18)

# Change the age
user_data["user"]["age"] = 25
result = email_rule.apply(user_data)
print(result)  # "zhangsan@example.com"

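Best practices

  1. Use defaults sensibly: give fields that may be missing a reasonable default value
  2. Make good use of conditions: express complex business logic through the condition parameter
  3. Preprocess data: use pre_script to clean and normalize the input
  4. Post-process results: use post_script to format and convert extracted values
  5. Combine rules: use Group or the | operator for fault tolerance
  6. Mind performance: for simple extractions, calling the extractor directly may be more efficient

The strength of the Rule object is its flexibility and composability: it handles a wide range of complex extraction scenarios and is an important tool for building robust crawlers.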