Siffer: Soil Type Code Parser

What is the Siffer field?

Every polygon in the Estonian soil map is assigned one or more siffer (šiffer) codes — alphanumeric labels that identify the dominant soil type(s) within that mapping unit. A single polygon can carry up to four siffer codes, semicolon-separated, listed in order of decreasing dominance:

Ko;D;LPe;LP

This example encodes four soil types:

Code	Estonian name	Approximate international equivalent
`Ko`	Korestikmaa	Skeletic Leptosol / shallow rocky soil
`D`	Deluviaalmuld	Colluvic Regosol / slope deposit soil
`LPe`	Leetjas-paepealne erosioonimuld	Eroded Albeluvisol on limestone
`LP`	Leetjas-paepealne muld	Albeluvisol on limestone

The siffer vocabulary comprises several thousand codes defined in the national soil classification system. The valid codes for the current dataset are maintained in updated_uniq_jan25_2026.csv.
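As a minimal sketch of how a clean siffer string breaks down into ordered codes, the snippet below splits on semicolons and checks each code against a vocabulary. The helper names `split_siffers` and `unknown_codes` are hypothetical, and `VALID_CODES` is a four-entry stand-in for the reference list in the CSV, whose column layout is not described here.

```python
# Hypothetical stand-in for the reference vocabulary; the real list
# lives in the CSV file named above and is far larger.
VALID_CODES = {"Ko", "D", "LPe", "LP"}


def split_siffers(raw: str, max_codes: int = 4) -> list[str]:
    """Split a clean, semicolon-separated siffer string into ordered codes."""
    codes = [part.strip() for part in raw.split(";") if part.strip()]
    if len(codes) > max_codes:
        raise ValueError(f"expected at most {max_codes} codes, got {len(codes)}")
    return codes


def unknown_codes(codes: list[str], vocabulary: set[str] = VALID_CODES) -> list[str]:
    """Return the codes that are missing from the vocabulary, in input order."""
    return [code for code in codes if code not in vocabulary]


codes = split_siffers("Ko;D;LPe;LP")
print(codes)                 # ['Ko', 'D', 'LPe', 'LP']
print(unknown_codes(codes))  # []
```

The list order carries the dominance ranking, so the first element is always the dominant soil type.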

Why the raw data needs repair

The soil map was digitised from analogue sheets by many different operators over several decades. This produced a wide range of encoding artefacts:

Mixed delimiters: commas, spaces, colons, or dashes appear where semicolons should be used (e.g. Ko Ko LP for Ko;Ko;LP, or Ko,LP for Ko;LP).

Erosion-degree annotations: numeric erosion-intensity classes are appended to soil codes where they do not belong (e.g. E1, E(1;2), and C3 as variants of the base codes E and C).

OCR and transcription errors: character swaps, merged tokens, and stray brackets introduced when analogue text was scanned or retyped.

Legacy abbreviations: older mapping rounds used slightly different code spellings that are no longer part of the current standard.

The parser applies a five-step repair workflow before grammar validation:

1. Whole-string lookup in ~700 curated full-match replacements
2. Erosion-degree stripping (E-type, C-type, and parenthesised numerics)
3. Colon-separated numeric pair removal (last resort)
4. Delimiter normalisation (all separators → ;)
5. Per-token character-level lookup (~200 entries)

Only after these steps is the string validated against the formal Arpeggio grammar.
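The five repair steps can be sketched roughly as follows. This is an illustration under stated assumptions, not the parser's actual code: the function name `repair`, the two lookup dictionaries with their entries, and every regular expression are hypothetical, and the sketch assumes that stripping an erosion degree means reducing the token to its base code (E1 becomes E).

```python
import re

# Hypothetical one-entry stand-ins for the curated lookup tables
# (~700 full-match and ~200 per-token entries in the real workflow).
FULL_MATCH = {"Ko.LP": "Ko;LP"}
TOKEN_FIXES = {"K0": "Ko"}


def repair(raw: str) -> str:
    """Apply the five repair steps in order and return a ';'-separated string."""
    s = raw.strip()
    # 1. Whole-string lookup.
    s = FULL_MATCH.get(s, s)
    # 2. Erosion-degree stripping (assumed forms): E1 -> E, C3 -> C, E(1;2) -> E.
    s = re.sub(r"\b([EC])(?:\(\d(?:;\d)*\)|\d)", r"\1", s)
    # 3. Remove colon-separated numeric pairs such as "1:2".
    s = re.sub(r"\b\d+:\d+\b", "", s)
    # 4. Normalise every run of separators to a single semicolon.
    s = re.sub(r"[;,:\s-]+", ";", s).strip(";")
    # 5. Per-token lookup for known misspellings.
    return ";".join(TOKEN_FIXES.get(tok, tok) for tok in s.split(";") if tok)


print(repair("Ko Ko LP"))   # Ko;Ko;LP
print(repair("E(1;2),Ko"))  # E;Ko
print(repair("K0-LP"))      # Ko;LP
```

The order matters in this sketch: erosion annotations such as E(1;2) contain a semicolon, so they have to be stripped before delimiter normalisation would otherwise split them into separate tokens.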

Output fields

The parser returns 7 columns per soil polygon row:

Field	Type	Description
`siffer_1`	str	First (dominant) soil-type code, standardised. Empty if absent.
`siffer_2`	str	Second soil-type code. Empty if absent.
`siffer_3`	str	Third soil-type code. Empty if absent.
`siffer_4`	str	Fourth soil-type code. Empty if absent.
`n_siffers`	int	Number of soil-type codes found in this polygon (0–4).
`parse_ok_s`	bool	`True` if all codes were recognised by the grammar. Used in the map viewer error-review style together with `parse_ok_l` and `parse_ok_h`.
`parse_error`	str	Description of what could not be parsed. Empty on success.

Empty vs absent

siffer_1 through siffer_4 are populated sequentially. If a polygon has only two soil types, siffer_1 and siffer_2 carry the codes, and siffer_3 and siffer_4 are empty strings.
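The sequential filling rule can be illustrated with a short sketch. The helper `to_row` is hypothetical, and deriving `parse_ok_s` from an empty error message is an assumption made only for this example.

```python
def to_row(codes: list[str], error: str = "") -> dict:
    """Build the seven output fields from an ordered list of codes."""
    # Pad with empty strings so that all four siffer columns always exist.
    padded = (list(codes) + [""] * 4)[:4]
    row = {f"siffer_{i}": code for i, code in enumerate(padded, start=1)}
    row["n_siffers"] = sum(1 for code in padded if code)
    # Assumption: success is equivalent to an empty error message.
    row["parse_ok_s"] = error == ""
    row["parse_error"] = error
    return row


print(to_row(["Ko", "LP"]))
```

For a polygon with two soil types, `siffer_3` and `siffer_4` come out as empty strings and `n_siffers` is 2.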

Worked example

Raw field value: "Ko Ko LP"

Step	Result
Full-match lookup	no change
Erosion stripping	no change
Delimiter normalisation	`"Ko;Ko;LP"`
Grammar parse	siffer_1=`Ko`, siffer_2=`Ko`, siffer_3=`LP`
Deduplication (pipeline)	siffer_1=`Ko`, siffer_2=`LP`

Output: siffer_1="Ko", siffer_2="LP", n_siffers=2, parse_ok_s=True
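The deduplication step in the table needs to keep the dominance order intact. One way to do that, shown here as a sketch with a hypothetical helper name rather than the pipeline's own code, relies on Python dictionaries preserving insertion order:

```python
def dedupe(codes: list[str]) -> list[str]:
    """Drop repeated codes while keeping the first occurrence of each."""
    return list(dict.fromkeys(codes))


print(dedupe(["Ko", "Ko", "LP"]))  # ['Ko', 'LP']
```

A plain `set` would also remove duplicates, but it would not guarantee that the dominant code stays first.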

Parse coverage

Across the full dataset (~800 k polygon rows), the siffer repair workflow resolves the large majority of non-standard entries. Rows where parse_ok_s=False contain codes absent from the current reference vocabulary or artefacts that could not be resolved; these are flagged for manual review.
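A coverage figure and a manual-review queue could be derived from the output fields along these lines. The rows below are toy values invented for the illustration, not dataset results, and the helper names and the error message text are hypothetical.

```python
from collections import Counter

# Toy rows for illustration only; they are not taken from the dataset.
rows = [
    {"parse_ok_s": True, "parse_error": ""},
    {"parse_ok_s": True, "parse_error": ""},
    {"parse_ok_s": True, "parse_error": ""},
    {"parse_ok_s": False, "parse_error": "unknown code"},
]


def coverage(rows: list[dict]) -> float:
    """Share of rows whose siffer field parsed successfully."""
    if not rows:
        return 0.0
    return sum(1 for r in rows if r["parse_ok_s"]) / len(rows)


def review_queue(rows: list[dict]) -> list[dict]:
    """Rows that failed to parse and need manual review."""
    return [r for r in rows if not r["parse_ok_s"]]


print(f"{coverage(rows):.0%}")  # 75%
print(Counter(r["parse_error"] for r in review_queue(rows)))
```

Grouping the failures by `parse_error` shows which problems recur often enough to justify a new lookup entry.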