Siffer — Soil Type Code Parser

What is the Siffer field?

Every polygon in the Estonian soil map is assigned one or more siffer (šiffer) codes — alphanumeric labels that identify the dominant soil type(s) within that mapping unit. A single polygon can carry up to four siffer codes, semicolon-separated, listed in order of decreasing dominance:

Ko;D;LPe;LP

This example encodes four soil types:

| Code | Estonian name | Approximate international equivalent |
| --- | --- | --- |
| Ko | Korestikmaa | Skeletic Leptosol / shallow rocky soil |
| D | Deluviaalmuld | Colluvic Regosol / slope deposit soil |
| LPe | Leetjas-paepealne erosioonimuld | Eroded Albeluvisol on limestone |
| LP | Leetjas-paepealne muld | Albeluvisol on limestone |

The siffer vocabulary comprises several thousand codes defined in the national soil classification system. The valid codes for the current dataset are maintained in updated_uniq_jan25_2026.csv.
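
As a minimal sketch of the format described above, the snippet below splits an already clean, semicolon-separated siffer string into its codes, most dominant first. The function name `split_siffer` is hypothetical and is not taken from the parser itself; the four-code limit follows the description above.

```python
def split_siffer(raw: str, max_codes: int = 4) -> list[str]:
    """Split a clean, semicolon-separated siffer string into codes.

    Codes are returned in their original order, i.e. most dominant first.
    Hypothetical helper for illustration only.
    """
    codes = [part.strip() for part in raw.split(";") if part.strip()]
    if len(codes) > max_codes:
        raise ValueError(f"expected at most {max_codes} codes, got {len(codes)}")
    return codes


codes = split_siffer("Ko;D;LPe;LP")
print(codes)  # ['Ko', 'D', 'LPe', 'LP']
```

This only handles well-formed input; raw map values usually need repair first.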


Why the raw data needs repair

The soil map was digitised from analogue sheets by many different operators over several decades. This produced a wide range of encoding artefacts:

Mixed delimiters: commas, spaces, colons, or dashes appear where semicolons should be used (e.g. Ko Ko LP instead of Ko;Ko;LP, or Ko,LP instead of Ko;LP).

Erosion-degree annotations: numeric erosion-intensity classes are appended to soil codes where they do not belong (e.g. E1, E(1;2), and C3 as variants of the base codes E and C).

OCR and transcription errors: character swaps, merged tokens, and stray brackets that arose when analogue text was scanned or retyped.

Legacy abbreviations: older mapping rounds used slightly different code spellings that are no longer part of the current standard.

The parser applies a multi-step repair workflow before the grammar validation:

  1. Whole-string lookup in ~700 curated full-match replacements
  2. Erosion-degree stripping (E-type, C-type, and parenthesised numerics)
  3. Colon-separated numeric pair removal (last resort)
  4. Delimiter normalisation (all separators → ;)
  5. Per-token character-level lookup (~200 entries)

Only after these steps is the string validated against the formal Arpeggio grammar.
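
The five repair steps can be sketched roughly as follows. This is an illustration under stated assumptions, not the parser's actual code: the lookup tables `FULL_MATCH` and `TOKEN_FIXES` are tiny hypothetical stand-ins for the curated tables, and the regular expressions are simplified guesses at the patterns described above.

```python
import re

# Hypothetical stand-ins for the curated lookup tables (assumptions).
FULL_MATCH = {"Ko, LP?": "Ko;LP"}   # step 1: whole-string replacements
TOKEN_FIXES = {"K0": "Ko"}          # step 5: per-token corrections

# Simplified patterns, assumed for illustration only.
PAREN_NUMERIC = re.compile(r"\(\s*\d+(?:\s*;\s*\d+)*\s*\)")  # e.g. (1;2)
EROSION_SUFFIX = re.compile(r"\b([EC])\d+\b")                # e.g. E1, C3
NUMERIC_PAIR = re.compile(r"\b\d+:\d+\b")                    # e.g. 1:2
DELIMITERS = re.compile(r"[,:;\s\-]+")                       # any separator run


def repair(raw: str) -> str:
    """Return a semicolon-separated string ready for grammar validation."""
    s = raw.strip()
    # 1. Whole-string lookup.
    s = FULL_MATCH.get(s, s)
    # 2. Erosion-degree stripping. This runs before delimiter
    #    normalisation because "(1;2)" itself contains a semicolon.
    s = PAREN_NUMERIC.sub("", s)
    s = EROSION_SUFFIX.sub(r"\1", s)
    # 3. Colon-separated numeric pair removal.
    s = NUMERIC_PAIR.sub("", s)
    # 4. Delimiter normalisation: split on any separator run.
    tokens = [t for t in DELIMITERS.split(s) if t]
    # 5. Per-token lookup.
    tokens = [TOKEN_FIXES.get(t, t) for t in tokens]
    return ";".join(tokens)


print(repair("Ko Ko LP"))    # Ko;Ko;LP
print(repair("E(1;2), LP"))  # E;LP
```

The ordering matters in this sketch: stripping annotations first keeps their internal punctuation from being mistaken for code separators.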


Output fields

The parser returns 7 columns per soil polygon row:

| Field | Type | Description |
| --- | --- | --- |
| siffer_1 | str | First (dominant) soil-type code, standardised. Empty if absent. |
| siffer_2 | str | Second soil-type code. Empty if absent. |
| siffer_3 | str | Third soil-type code. Empty if absent. |
| siffer_4 | str | Fourth soil-type code. Empty if absent. |
| n_siffers | int | Number of soil-type codes found in this polygon (0–4). |
| parse_ok_s | bool | True if all codes were recognised by the grammar. Used in the map viewer error-review style together with parse_ok_l and parse_ok_h. |
| parse_error | str | Description of what could not be parsed. Empty on success. |

Empty vs absent

siffer_1 through siffer_4 are populated sequentially. If a polygon has only two soil types, siffer_1 and siffer_2 carry the codes, and siffer_3 and siffer_4 are empty strings.
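
A small sketch of how the seven output fields might be filled from a list of parsed codes, padding unused slots with empty strings. The helper name `to_row` is hypothetical, and deriving parse_ok_s from an empty parse_error is an assumption made for this illustration.

```python
def to_row(codes: list[str], parse_error: str = "") -> dict:
    """Build one output row; unused siffer slots become empty strings.

    Hypothetical helper: the real parser may derive parse_ok_s differently.
    """
    if len(codes) > 4:
        raise ValueError("a polygon carries at most four siffer codes")
    padded = codes + [""] * (4 - len(codes))
    return {
        "siffer_1": padded[0],
        "siffer_2": padded[1],
        "siffer_3": padded[2],
        "siffer_4": padded[3],
        "n_siffers": len(codes),
        "parse_ok_s": parse_error == "",  # assumption: no error means success
        "parse_error": parse_error,
    }


row = to_row(["Ko", "LP"])
print(row["siffer_3"] == "", row["n_siffers"])  # True 2
```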


Worked example

Raw field value: "Ko Ko LP"

| Step | Result |
| --- | --- |
| Full-match lookup | no change |
| Erosion stripping | no change |
| Delimiter normalisation | "Ko;Ko;LP" |
| Grammar parse | siffer_1=Ko, siffer_2=Ko, siffer_3=LP |
| Deduplication (pipeline) | siffer_1=Ko, siffer_2=LP |

Output: siffer_1="Ko", siffer_2="LP", n_siffers=2, parse_ok_s=True
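
The deduplication step in this example can be sketched as an order-preserving removal of repeated codes. Whether the pipeline removes all repeats or only adjacent ones is not stated here, so the version below, with the hypothetical name `dedupe`, assumes all repeats are dropped and the first occurrence is kept.

```python
def dedupe(codes: list[str]) -> list[str]:
    """Drop repeated codes while keeping first-occurrence order.

    dict preserves insertion order, so dominance ranking is retained.
    """
    return list(dict.fromkeys(codes))


parsed = ["Ko", "Ko", "LP"]   # result of the grammar parse step above
unique = dedupe(parsed)
print(unique, len(unique))    # ['Ko', 'LP'] 2
```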


Parse coverage

Across the full dataset (~800,000 polygon rows), the siffer repair workflow resolves the large majority of non-standard entries. Rows where parse_ok_s=False contain codes absent from the current reference vocabulary or artefacts that could not be resolved; these rows are flagged for manual review.
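
One way to inspect coverage is to compute the share of rows with parse_ok_s=True and tally the parse_error messages of the rest. The sketch below uses four made-up rows purely to show the mechanics; the error text is invented and the numbers say nothing about the real dataset.

```python
from collections import Counter


def summarise(rows: list[dict]) -> tuple[float, Counter]:
    """Return (share of rows parsed OK, tally of error messages)."""
    total = len(rows)
    failed = [r for r in rows if not r["parse_ok_s"]]
    coverage = (total - len(failed)) / total if total else 0.0
    reasons = Counter(r["parse_error"] for r in failed)
    return coverage, reasons


# Made-up sample rows for illustration only.
rows = [
    {"parse_ok_s": True, "parse_error": ""},
    {"parse_ok_s": True, "parse_error": ""},
    {"parse_ok_s": True, "parse_error": ""},
    {"parse_ok_s": False, "parse_error": "unknown code: Xq"},
]
coverage, reasons = summarise(rows)
print(f"{coverage:.0%} parsed")  # 75% parsed
print(reasons.most_common(1))    # [('unknown code: Xq', 1)]
```

Sorting the tally by frequency can help decide which failures to review by hand first.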