Siffer: Soil Type Code Parser
What is the Siffer field?
Every polygon in the Estonian soil map is assigned one or more siffer (šiffer) codes — alphanumeric labels that identify the dominant soil type(s) within that mapping unit. A single polygon can carry up to four siffer codes, semicolon-separated, listed in order of decreasing dominance:
Ko;D;LPe;LP
This example encodes four soil types:
| Code | Estonian name | Approximate international equivalent |
|---|---|---|
| Ko | Korestikmaa | Skeletic Leptosol / shallow rocky soil |
| D | Deluviaalmuld | Colluvic Regosol / slope deposit soil |
| LPe | Leetjas-paepealne erosioonimuld | Eroded Albeluvisol on limestone |
| LP | Leetjas-paepealne muld | Albeluvisol on limestone |
The siffer vocabulary comprises several thousand codes defined in the national soil
classification system. The valid codes for the current dataset are maintained in
updated_uniq_jan25_2026.csv.
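As a minimal sketch, a clean siffer string can be split on semicolons and each code checked against the reference vocabulary. The names `split_siffer`, `unknown_codes`, and the small in-memory `VALID_CODES` set are illustrative assumptions; in practice the set would be loaded from updated_uniq_jan25_2026.csv, whose column layout is not described here.

```python
# Hypothetical stand-in for the vocabulary kept in updated_uniq_jan25_2026.csv.
VALID_CODES = {"Ko", "D", "LPe", "LP"}

MAX_CODES = 4  # a polygon carries at most four siffer codes


def split_siffer(raw: str) -> list[str]:
    """Split a clean, semicolon-separated siffer string into its codes."""
    codes = [part.strip() for part in raw.split(";") if part.strip()]
    if len(codes) > MAX_CODES:
        raise ValueError(f"expected at most {MAX_CODES} codes, got {len(codes)}")
    return codes


def unknown_codes(codes: list[str]) -> list[str]:
    """Return the codes that are missing from the reference vocabulary."""
    return [code for code in codes if code not in VALID_CODES]


codes = split_siffer("Ko;D;LPe;LP")
print(codes)                 # ['Ko', 'D', 'LPe', 'LP']
print(unknown_codes(codes))  # []
```

The list order is preserved, so the first element is always the dominant soil type.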
Why the raw data needs repair
The soil map was digitised from analogue sheets by many different operators over several decades. This produced a wide range of encoding artefacts:
- Mixed delimiters: commas, spaces, colons, or dashes appear where semicolons should be used (e.g. Ko Ko LP or Ko,LP instead of Ko;LP).
- Erosion-degree annotations: numeric erosion-intensity classes appended to soil codes where they do not belong (e.g. E1, E(1;2), C3 variants of the base codes E and C).
- OCR and transcription errors: character swaps, merged tokens, and stray brackets that arose when analogue text was scanned or retyped.
- Legacy abbreviations: older mapping rounds used slightly different code spellings that are no longer part of the current standard.
The parser applies a multi-step repair workflow before the grammar validation:
- Whole-string lookup in ~700 curated full-match replacements
- Erosion-degree stripping (E-type, C-type, and parenthesised numerics)
- Colon-separated numeric pair removal (last resort)
- Delimiter normalisation (all separators → ;)
- Per-token character-level lookup (~200 entries)
Only after these steps is the string validated against the formal Arpeggio grammar.
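The repair steps listed above could be sketched roughly as follows. This is not the parser's actual code: the lookup tables `FULL_MATCH` and `TOKEN_FIXES` contain made-up entries, the regular expressions are simplified guesses at the rules, and the function name `repair` is hypothetical. Grammar validation with Arpeggio is left out.

```python
import re

# Hypothetical miniature lookup tables; the real ones hold ~700 and ~200 entries.
FULL_MATCH = {"Ko/LP": "Ko;LP"}
TOKEN_FIXES = {"K0": "Ko"}

# Erosion-degree suffixes on the base codes E and C: E1, C3, E(1;2) -> E, C, E.
EROSION = re.compile(r"\b([EC])(?:\d+|\(\d+(?:;\d+)*\))")
# Colon-separated numeric pairs such as "1:2".
NUMERIC_PAIR = re.compile(r"\b\d+:\d+\b")
# Any run of commas, whitespace, colons, dashes or semicolons is one separator.
DELIMS = re.compile(r"[,\s:;\-]+")


def repair(raw: str) -> str:
    """Return a semicolon-separated string ready for grammar validation."""
    text = raw.strip()
    # 1. Whole-string lookup in the curated full-match replacements.
    text = FULL_MATCH.get(text, text)
    # 2. Erosion-degree stripping (keeps the base letter, drops the class).
    text = EROSION.sub(r"\1", text)
    # 3. Numeric pair removal. Applied unconditionally in this sketch,
    #    although the workflow describes it as a last resort.
    text = NUMERIC_PAIR.sub("", text)
    # 4. Delimiter normalisation: split on any separator, drop empty tokens.
    tokens = [token for token in DELIMS.split(text) if token]
    # 5. Per-token lookup for character-level fixes.
    tokens = [TOKEN_FIXES.get(token, token) for token in tokens]
    return ";".join(tokens)


print(repair("Ko Ko LP"))   # Ko;Ko;LP
print(repair("E(1;2);LP"))  # E;LP
print(repair("K0,LP"))      # Ko;LP
```

Erosion stripping runs before delimiter normalisation in this sketch so that the semicolon inside E(1;2) is removed together with the parentheses instead of being treated as a separator.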
Output fields
The parser returns 7 columns per soil polygon row:
| Field | Type | Description |
|---|---|---|
| siffer_1 | str | First (dominant) soil-type code, standardised. Empty if absent. |
| siffer_2 | str | Second soil-type code. Empty if absent. |
| siffer_3 | str | Third soil-type code. Empty if absent. |
| siffer_4 | str | Fourth soil-type code. Empty if absent. |
| n_siffers | int | Number of soil-type codes found in this polygon (0–4). |
| parse_ok_s | bool | True if all codes were recognised by the grammar. Used in the map viewer error-review style together with parse_ok_l and parse_ok_h. |
| parse_error | str | Description of what could not be parsed. Empty on success. |
Empty vs absent
siffer_1 through siffer_4 are populated sequentially. If a polygon has
only two soil types, siffer_1 and siffer_2 carry the codes and siffer_3,
siffer_4 are empty strings.
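One way to picture the sequential filling is the sketch below, which pads a list of codes to four slots and builds the seven output columns. The helper name `to_fields` is hypothetical, and deriving parse_ok_s from an empty parse_error is an assumption made for illustration only.

```python
def to_fields(codes: list[str], error: str = "") -> dict:
    """Spread up to four codes over the seven output columns."""
    if len(codes) > 4:
        raise ValueError("a polygon carries at most four siffer codes")
    # Pad with empty strings so unused slots are "" rather than missing.
    padded = codes + [""] * (4 - len(codes))
    return {
        "siffer_1": padded[0],
        "siffer_2": padded[1],
        "siffer_3": padded[2],
        "siffer_4": padded[3],
        "n_siffers": len(codes),
        # Assumption: the row counts as parsed when no error was recorded.
        "parse_ok_s": error == "",
        "parse_error": error,
    }


row = to_fields(["Ko", "LP"])
print(row["siffer_2"], repr(row["siffer_3"]), row["n_siffers"])  # LP '' 2
```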
Worked example
Raw field value: "Ko Ko LP"
| Step | Result |
|---|---|
| Full-match lookup | no change |
| Erosion stripping | no change |
| Delimiter normalisation | "Ko;Ko;LP" |
| Grammar parse | siffer_1=Ko, siffer_2=Ko, siffer_3=LP |
| Deduplication (pipeline) | siffer_1=Ko, siffer_2=LP |
Output: siffer_1="Ko", siffer_2="LP", n_siffers=2, parse_ok_s=True
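The deduplication step in the table can be illustrated with an order-preserving filter. The function name `dedupe` is hypothetical, and the pipeline may implement this step differently.

```python
def dedupe(codes: list[str]) -> list[str]:
    """Drop repeated codes while keeping first-seen (dominance) order."""
    # dict keys keep insertion order, so the first occurrence of each code wins.
    return list(dict.fromkeys(codes))


print(dedupe(["Ko", "Ko", "LP"]))  # ['Ko', 'LP']
```

Keeping the first occurrence matters because the position of a code encodes its dominance within the polygon.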
Parse coverage
Across the full dataset (~800 k polygon rows), the siffer repair workflow resolves
the large majority of non-standard entries. Rows where parse_ok_s=False contain
codes absent from the current reference vocabulary or artefacts that could not be
resolved; these are flagged for manual review.
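For a quick look at coverage, the share of successfully parsed rows and the most frequent failure messages can be tallied from the output rows. The sketch below assumes each row is a dict carrying the parse_ok_s and parse_error fields from the output table; the function name `summarise`, the sample rows, and the error message text are invented for illustration and are not real dataset figures.

```python
from collections import Counter


def summarise(rows: list[dict]) -> tuple[float, list[tuple[str, int]]]:
    """Return the share of parsed rows and failure reasons by frequency."""
    total = len(rows)
    failed = [row for row in rows if not row["parse_ok_s"]]
    ok_share = (total - len(failed)) / total if total else 0.0
    reasons = Counter(row["parse_error"] for row in failed)
    return ok_share, reasons.most_common()


# Invented sample rows, only to show the shape of the result.
sample = [
    {"parse_ok_s": True, "parse_error": ""},
    {"parse_ok_s": True, "parse_error": ""},
    {"parse_ok_s": True, "parse_error": ""},
    {"parse_ok_s": False, "parse_error": "unknown code: Xq"},
]

share, reasons = summarise(sample)
print(f"{share:.0%}")  # 75%
print(reasons)         # [('unknown code: Xq', 1)]
```

Grouping failures by message gives reviewers a short list of recurring problems to check first.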