Trading Fish The blog of Hector Castro

Validating Data in Python with Cerberus

This year was my first time participating in Advent of Code, and I’m glad I did: solving one of the challenges introduced me to Cerberus, an excellent data validation library for Python.

What’s in a valid passport

Below are some excerpts from the challenge, along with its field-level validation rules:

You arrive at the airport only to realize that you grabbed your North Pole Credentials instead of your passport. While these documents are extremely similar, North Pole Credentials aren’t issued by a country and therefore aren’t actually valid documentation for travel in most of the world.

It seems like you’re not the only one having problems, though; a very long line has formed for the automatic passport scanners, and the delay could upset your travel itinerary.

The line is moving more quickly now, but you overhear airport security talking about how passports with invalid data are getting through. Better add some data validation, quick!

You can continue to ignore the cid field, but each other field has strict rules about what values are valid for automatic validation:

  • byr (Birth Year) - four digits; at least 1920 and at most 2002.
  • iyr (Issue Year) - four digits; at least 2010 and at most 2020.
  • eyr (Expiration Year) - four digits; at least 2020 and at most 2030.
  • hgt (Height) - a number followed by either cm or in:
    • If cm, the number must be at least 150 and at most 193.
    • If in, the number must be at least 59 and at most 76.
  • hcl (Hair Color) - a # followed by exactly six characters 0-9 or a-f.
  • ecl (Eye Color) - exactly one of: amb blu brn gry grn hzl oth.
  • pid (Passport ID) - a nine-digit number, including leading zeroes.
  • cid (Country ID) - ignored, missing or not.

Your job is to count the passports where all required fields are both present and valid according to the above rules.

For completeness, here are some invalid passports (delimited by \n\n):

eyr:1972 cid:100
hcl:#18171d ecl:amb hgt:170 pid:186cm iyr:2018 byr:1926

iyr:2019
hcl:#602927 eyr:1967 hgt:170cm
ecl:grn pid:012533040 byr:1946

hcl:dab227 iyr:2012
ecl:brn hgt:182cm pid:021572410 eyr:2020 byr:1992 cid:277

And, some valid passports:

pid:087499704 hgt:74in ecl:grn iyr:2012 eyr:2030 byr:1980
hcl:#623a2f

eyr:2029 ecl:blu cid:129 byr:1989
iyr:2014 pid:896056539 hcl:#a97842 hgt:165cm

hcl:#888785
hgt:164cm byr:2001 iyr:2015 cid:88
pid:545766238 ecl:hzl
eyr:2022

Most of the validation rules look straightforward in isolation, but less so when you think about composing them all together.
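
To make the composition problem concrete, here is a minimal hand-rolled sketch of the same rules in plain Python, with no validation library. The helper names (year_between, valid_height, is_valid_passport) are made up for this illustration and are not part of the challenge or of Cerberus. The two sample passports come from the lists above.

```python
import re

EYE_COLORS = {"amb", "blu", "brn", "gry", "grn", "hzl", "oth"}
REQUIRED_FIELDS = ("byr", "iyr", "eyr", "hgt", "hcl", "ecl", "pid")


def year_between(value: str, low: int, high: int) -> bool:
    """Return True for a four-digit year within [low, high]."""
    return re.fullmatch(r"[0-9]{4}", value) is not None and low <= int(value) <= high


def valid_height(value: str) -> bool:
    """Return True for a height such as 170cm or 65in within the allowed range."""
    match = re.fullmatch(r"([0-9]+)(cm|in)", value)
    if match is None:
        return False
    number, unit = int(match.group(1)), match.group(2)
    low, high = (150, 193) if unit == "cm" else (59, 76)
    return low <= number <= high


def is_valid_passport(passport: dict) -> bool:
    """Apply every field rule; cid is deliberately ignored."""
    if any(field not in passport for field in REQUIRED_FIELDS):
        return False
    return (
        year_between(passport["byr"], 1920, 2002)
        and year_between(passport["iyr"], 2010, 2020)
        and year_between(passport["eyr"], 2020, 2030)
        and valid_height(passport["hgt"])
        and re.fullmatch(r"#[0-9a-f]{6}", passport["hcl"]) is not None
        and passport["ecl"] in EYE_COLORS
        and re.fullmatch(r"[0-9]{9}", passport["pid"]) is not None
    )


valid = {
    "pid": "087499704", "hgt": "74in", "ecl": "grn", "iyr": "2012",
    "eyr": "2030", "byr": "1980", "hcl": "#623a2f",
}
invalid = {
    "iyr": "2019", "hcl": "#602927", "eyr": "1967", "hgt": "170cm",
    "ecl": "grn", "pid": "012533040", "byr": "1946",
}
print(is_valid_passport(valid), is_valid_passport(invalid))
```

Each rule is simple on its own, but the presence checks, range checks, and pattern checks end up interleaved in one function, and a failure says nothing about which field was at fault. That is the kind of code a declarative schema can replace.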

Validating passports with Cerberus

Step one involved getting familiar with Cerberus validation rules. The library supports rules like the following:

  • contains - This rule validates that a container object contains all of the defined items.
>>> from cerberus import Validator
>>> v = Validator()
>>> document = {"states": ["peace", "love", "inity"]}
>>> schema = {"states": {"contains": "peace"}}
>>> v.validate(document, schema)
True
  • regex - The validation will fail if the field’s value does not match the provided regular expression.
>>> schema = {
...     "email": {
...        "type": "string",
...        "regex": r"^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$"
...     }
... }
>>> document = {"email": "john@example.com"}
>>> v.validate(document, schema)
True
  • required - If True the field is mandatory. Validation will fail when it is missing.
>>> v.schema = {"name": {"required": True, "type": "string"}, "age": {"type": "integer"}}
>>> document = {"age": 10}
>>> v.validate(document)
False

Step two involved converting the passports into Cerberus documents. This was mostly an exercise in parsing uniquely assembled text into Python dictionaries.

import re

# batch_file is assumed to be an open text file containing the puzzle input.
# Split the batch file records by double newline.
for record in batch_file.read().split("\n\n"):
    # Split the fields within a record on any run of spaces or newlines.
    record_field_list = [
        tuple(field.split(":")) for field in re.split(r"\s+", record.strip())
    ]

That leaves record_field_list looking like:

>>> record_field_list
[('ecl', 'gry'),
 ('pid', '860033327'),
 ('eyr', '2020'),
 ('hcl', '#fffffd'),
 ('byr', '1937'),
 ('iyr', '2017'),
 ('cid', '147'),
 ('hgt', '183cm')]

From there, dict converts the list of tuples into a proper Cerberus document:

>>> document = dict(record_field_list)
>>> document
{'byr': '1937',
 'cid': '147',
 'ecl': 'gry',
 'eyr': '2020',
 'hcl': '#fffffd',
 'hgt': '183cm',
 'iyr': '2017',
 'pid': '860033327'}
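
The parsing steps above can be folded into one small function. This is a sketch, not the exact code used for the challenge: parse_batch and SAMPLE are hypothetical names, and it relies on str.split() with no arguments, which splits on any run of whitespace, in place of the regular expression. The sample text is two of the valid passports listed earlier.

```python
from typing import Dict, List


def parse_batch(text: str) -> List[Dict[str, str]]:
    """Parse blank-line-delimited records of key:value fields into dicts."""
    documents = []
    for record in text.strip().split("\n\n"):
        # Fields may be separated by spaces or newlines; split() handles both.
        fields = (field.split(":", 1) for field in record.split())
        documents.append(dict(fields))
    return documents


SAMPLE = """\
eyr:2029 ecl:blu cid:129 byr:1989
iyr:2014 pid:896056539 hcl:#a97842 hgt:165cm

hcl:#888785
hgt:164cm byr:2001 iyr:2015 cid:88
pid:545766238 ecl:hzl
eyr:2022
"""

documents = parse_batch(SAMPLE)
print(len(documents))
print(documents[1]["hgt"])
```

Returning a list of plain dictionaries keeps the parsing step independent of the validation step, so each document can be handed to a validator as is.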

Putting it all together

Equipped with a better understanding of what’s possible with Cerberus, and a list of Python dictionaries representing passports, I put together the schema below to enforce the challenge’s passport validation rules. Only one of the rules (hgt) required a custom function (compare_hgt_with_units).

from typing import Callable


# Provide a custom field validation function for a height with units.
# It is defined before SCHEMA because SCHEMA refers to it by name.
def compare_hgt_with_units(field: str, value: str, error: Callable[..., None]) -> None:
    if value.endswith("cm"):
        if not (150 <= int(value.rstrip("cm")) <= 193):
            error(field, "out of range")
    elif value.endswith("in"):
        if not (59 <= int(value.rstrip("in")) <= 76):
            error(field, "out of range")
    else:
        error(field, "missing units")


SCHEMA = {
    # Field values are strings, so min and max compare lexicographically.
    # The four-digit regex keeps that equivalent to a numeric comparison.
    "byr": {"regex": "[0-9]{4}", "min": "1920", "max": "2002"},
    "iyr": {"regex": "[0-9]{4}", "min": "2010", "max": "2020"},
    "eyr": {"regex": "[0-9]{4}", "min": "2020", "max": "2030"},
    "hgt": {
        "anyof": [
            {"allof": [{"regex": "[0-9]+cm"}, {"check_with": compare_hgt_with_units}]},
            {"allof": [{"regex": "[0-9]+in"}, {"check_with": compare_hgt_with_units}]},
        ]
    },
    "hcl": {"regex": "#[0-9a-f]{6}"},
    "ecl": {"allowed": ["amb", "blu", "brn", "gry", "grn", "hzl", "oth"]},
    "pid": {"regex": "[0-9]{9}"},
    "cid": {"required": False},
}
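
A check_with function is just a callable that receives the field name, the value, and an error callback, so it can be exercised without Cerberus at all. The sketch below is a defensive variant of the height check, not the version used in the schema. It slices off the two-character unit instead of calling rstrip (which strips a set of characters, not a suffix), and it reports a non-numeric height where int() would otherwise raise ValueError. collect_errors is a hypothetical helper standing in for the error callback that Cerberus would normally supply.

```python
from typing import Callable, List, Tuple


def compare_hgt_with_units(field: str, value: str, error: Callable[..., None]) -> None:
    """Report an error unless value is an in-range number followed by cm or in."""
    number, unit = value[:-2], value[-2:]
    if unit not in ("cm", "in"):
        error(field, "missing units")
    elif not (number.isascii() and number.isdigit()):
        error(field, "not a number")
    elif unit == "cm" and not 150 <= int(number) <= 193:
        error(field, "out of range")
    elif unit == "in" and not 59 <= int(number) <= 76:
        error(field, "out of range")


def collect_errors(value: str) -> List[Tuple[str, str]]:
    """Run the check with a recording callback and return what it reported."""
    reported: List[Tuple[str, str]] = []
    compare_hgt_with_units(
        "hgt", value, lambda field, message: reported.append((field, message))
    )
    return reported


for sample in ("183cm", "74in", "170", "200cm", "abccm"):
    print(sample, collect_errors(sample))
```

Testing the callback this way keeps the range logic verifiable on its own, separate from how the schema wires it in.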

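With a schema in place, all that’s left to do is instantiate a Validator and validate each document. Passing require_all=True makes every field in the schema mandatory unless it sets required to False, which is why cid is the only field marked that way:

>>> from cerberus import Validator
>>> v = Validator(SCHEMA, require_all=True)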
>>> v.validate(document)
True

Thanks, Cerberus!