API Reference

Generated automatically from the Python source-code docstrings. After changing the source code, run pixi run -e docs docs-build to update this reference.


Huumus parser

soil_lib.huumus_parser

Regex-based parser for the Huumus (organic horizon) field of the Estonian soil map.

The Huumus field describes the organic surface horizon of a soil profile. Up to four space-separated sub-formulas (Sif1-Sif4) may appear; each corresponds to one dominant soil unit within the mapping polygon.

A sub-formula may carry two slash-separated variants::

left  = metsakõdu (forest litter) reading
right = põllumaa huumus (agricultural humus) reading

When only one part is present it applies to both land-use types. Within each part, individual organic layers are joined by + or ;.

Notation reference

th[depth] Toorhuumus (raw/mor humus). Always > 10 cm; strongly acidic, poorly decomposed; typical of conifer forests. E.g. th15 (15 cm thick).

t[degree][depth] Turvas (peat). Decomposition degree encoded as Unicode subscripts ₁₂₃ (U+2081-U+2083) or plain ASCII digits 1/2/3: 1 = weakly, 2 = moderately, 3 = strongly decomposed. E.g. t₂20.

h[depth] Huumus (mineral humus horizon, mull). Rare in the dataset (~4 rows).

[depth][degree] or [degree][depth] Metsakõdu (forest litter). Decomposition degree 1-3; thickness ≤ 10 cm. Subscript notation (₁₂₃) unambiguously means kõdu. E.g. 5₂ (5 cm, degree 2).

0 No organic matter present.

Depth notation: single value (e.g. th15) or range N-M cm (th15-25). Represents thickness of the horizon, not depth-to-top. Brackets (...) mark uncertain or locally variable layers — stripped before parsing.
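The unit notation above can be illustrated with a small regex classifier. This is a minimal sketch under stated assumptions, not the module's implementation: classify_unit, its return tuple, and the "kodu" label are hypothetical, only Unicode-subscript degrees are handled (ASCII degree digits are ambiguous without further rules), and brackets are simply stripped.

```python
import re

# Map subscript degree digits to plain ASCII digits.
_SUB = str.maketrans("₁₂₃", "123")
_DEPTH = r"(?P<d>[0-9]+(?:-[0-9]+)?)"

# Order matters: "th" must be tried before "t" and "h".
_PATTERNS = [
    ("th", re.compile(rf"^th{_DEPTH}?$")),
    ("t", re.compile(rf"^t(?P<deg>[₁₂₃])?{_DEPTH}?$")),
    ("h", re.compile(rf"^h{_DEPTH}?$")),
    ("kodu", re.compile(rf"^{_DEPTH}(?P<deg>[₁₂₃])$")),
    ("kodu", re.compile(rf"^(?P<deg>[₁₂₃]){_DEPTH}$")),
]


def classify_unit(token):
    """Return (kind, degree, min_cm, max_cm), or None if unrecognised."""
    token = token.strip().strip("()")  # uncertainty brackets are dropped
    if token == "0":
        return ("0", None, None, None)
    for kind, pattern in _PATTERNS:
        match = pattern.match(token)
        if match is None:
            continue
        groups = match.groupdict()
        deg = groups.get("deg")
        degree = int(deg.translate(_SUB)) if deg else None
        depth = groups.get("d")
        if depth:
            low, _, high = depth.partition("-")
            dmin, dmax = float(low), float(high or low)
        else:
            dmin = dmax = None
        return (kind, degree, dmin, dmax)
    return None
```

A single value such as th15 yields identical minimum and maximum thickness, while th15-25 yields the two range ends.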

Formal encoding structure::

siffer1 [siffer2 [siffer3 [siffer4]]]         (space-separated)
Each siffer : [forest_part /] agri_part        (slash split)
Each part   : unit [+ unit ...]
Each unit   : th[d] | t[deg][d] | h[d] | 0 | [d][deg]

Main entry point

:func:parse_huumus_formula — Dask-safe, returns a 24-column pd.Series. See :data:_EMPTY_PARSE for the full output schema with field descriptions.

Copyright (C) 2018-2026 Alexander Kmoch, University of Tartu. MIT License.

parse_huumus_formula(text)

Parse a single Huumus field value into a fixed-schema pd.Series.

Implements the Estonian soil-map Huumus encoding rules:

  • Up to four space-separated siffer sub-formulas (Sif1 … Sif4).
  • A bare / (no surrounding spaces) separates upper and lower depth layers in the soil profile (left = upper layer, right = lower layer).
  • A / with a space on each side (space-slash-space) is the legacy paper-map convention for agricultural-vs-forest-land separation within one polygon; it is detected during preprocessing and flagged as h_is_agri_forest.
  • Within each side, + or ; separates stacked sub-layers (litter layers or multiple humus types).
  • Individual units are classified as toorhuumus (th), turvas/peat (t), huumus (h), no-organic (0), or metsakõdu/litter (digit/subscript).

Output columns per unit (×4):

  • h_phu_type / _min / _max — primary humus from the right (lower-layer) side of / (or whole expression if no /).
  • h_lhu_type / _min / _max — secondary humus from the left (upper-layer) side, if present.
  • h_o1_deg / _min / _max through h_o3_* — litter (kõdu) layers from the left side, in bottom-to-top order: O1 = bottom (most decomposed, deepest), O3 = top (freshest, shallowest).
  • h_has_depth_split — True when a bare / is present.
  • h_is_agri_forest — True when the space-slash-space pattern is detected.
  • h_depth_total — sum of phu + lhu depth midpoints; values > 40 cm are flagged for agri/forest review.
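The h_depth_total rule in the list above can be sketched as follows. This is an illustration only: the helper names are hypothetical, the review threshold is assumed to be "strictly greater than 40 cm", and a side with no recorded depth is assumed to contribute nothing to the sum.

```python
def midpoint(dmin, dmax):
    """Midpoint of a thickness range in cm; None when no depth was recorded."""
    if dmin is None or dmax is None:
        return None
    return (dmin + dmax) / 2.0


def depth_total(phu, lhu, threshold_cm=40.0):
    """Sum the phu and lhu midpoints and flag totals above the threshold.

    phu and lhu are (min_cm, max_cm) tuples; either may be (None, None).
    """
    mids = [m for m in (midpoint(*phu), midpoint(*lhu)) if m is not None]
    total = sum(mids) if mids else None
    needs_review = total is not None and total > threshold_cm
    return total, needs_review
```

For example, a primary humus of 15-25 cm plus a secondary layer of 30 cm gives a total of 50 cm, which this sketch would flag.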

Designed as a drop-in for Series.apply() inside a Dask map_partitions call. The function is a plain module-level function with no closures over non-picklable objects, so it is safe for processes=True Dask workers.

Parameters

text : str | None
    Raw Huumus string, e.g. 'th15/h5 t₂20' (toorhuumus 15 cm / humus 5 cm upper/lower; peat degree-2 20 cm).

Returns

pd.Series
    ~96 fixed keys. See _EMPTY_PARSE and the module docstring for the full schema and field descriptions.


Siffer fixer

soil_lib.siffer_fixer

Pre-processing and grammar-based parsing of Estonian soil-type codes (siffers).

A siffer is the alphanumeric soil-type code assigned to each mapping unit in the Estonian national soil map (mullastikukaart). Each polygon carries up to four siffer symbols, semicolon-separated — e.g. Ko;D;LPe;LP encodes four dominant soil types (Korestikmaa, Deluviaalmuld, Leetjas-paepealne erosioonimuld, Leetjas-paepealne muld).

Why repair is needed

The analogue soil map was digitised over decades by multiple operators, producing a wide range of data-entry artefacts:

  • Mixed delimiters — commas, spaces, colons, or dashes used instead of ;
  • Erosion-degree annotations — numeric suffixes like E1, E(1;2), C3 appended to soil codes where none should appear
  • OCR and transcription errors — letter swaps, merged tokens, stray brackets
  • Legacy abbreviations — obsolete code variants from earlier mapping rounds

This module implements a multi-step repair workflow that normalises raw siffer strings before handing them to the Arpeggio grammar validator.

Typical workflow::

from arpeggio import ParserPython

from soil_lib.siffer_fixer import (
    siffer_repair_rules_lookup_full_match, test_erodee, test_erodee_E,
    test_erodee_C, test_erodee_koolon, single_errors_replace, special_keep,
    make_siffer_root_parser, quick_parse_dask, read_one_column_csv,
)

# Load the siffer vocabulary and build the grammar-based parser.
siffer_list = read_one_column_csv("updated_uniq_jan25_2026.csv", skip_header=True)[1]
root = make_siffer_root_parser(siffer_list)
parser = ParserPython(root, debug=False)

# Repair one raw value step by step before parsing.
raw = "Ko Ko LP"
s = siffer_repair_rules_lookup_full_match.get(raw, raw)
s = test_erodee(s)
s = test_erodee_E(s)
s = test_erodee_C(s)
s = test_erodee_koolon(s)
s = single_errors_replace.get(s, s)
if s not in special_keep:
    # Normalise alternative delimiters to ";".
    s = s.replace(":", ";").replace(" ", ";").replace(",", ";")
result = quick_parse_dask(parser, s)  # pd.Series with 7 columns

Exports

Lookup tables (dicts):

  • siffer_repair_rules_lookup_full_match — whole-string replacements for known multi-token error patterns (~700 entries, curated by domain scientists).
  • single_errors_replace — per-token character-level replacements (~200 entries).
  • special_keep — verbatim pass-through labels that are administrative map categories (e.g. "Veeala, asustus või määramata"), not soil-type codes.

Functions:

  • :func:test_erodee, :func:test_erodee_E, :func:test_erodee_C, :func:test_erodee_koolon — strip numeric erosion/depth-class annotations.
  • :func:make_siffer_root_parser — build an Arpeggio parser from a vocabulary.
  • :func:quick_parse_dask — parse one normalised string; returns a 7-column pd.Series.

read_one_column_csv(file_path, string_quote='"', skip_header=True)

Read a single-column CSV file into a list of string values.

Pure-Python, no external dependencies — safe to call before the main environment is fully initialised.

Parameters:

file_path
    Path to the CSV file (UTF-8 encoded). Required.

string_quote
    Quote character to strip from each value. Default: '"'.

skip_header
    If True, skip the first line and return (None, values). If False, read the first line as the header name and return (header_name, values). Default: True.

Returns:

tuple[str | None, list[str]]
    (header, values), where header is the column name (or None when skip_header is True) and values is the list of stripped, dequoted, non-empty row values.

Example

>>> _, siffers = read_one_column_csv("updated_uniq_jan25_2026.csv")
>>> len(siffers)
2847

make_siffer_root_parser(known_siffers)

Build an Arpeggio PEG grammar factory for a given siffer vocabulary.

Constructs a closure (a root parser function) that Arpeggio's ParserPython can compile into a parser. The grammar recognises three structural forms:

  • pair_symbol: Ko(LPe) (dominant + minor in brackets)
  • semicolon_separated: Ko;D;LPe;LP (up to 4 codes, semicolon-delimited)
  • single_symbol: Ko (single code)

The vocabulary is embedded as a regex alternation sorted longest-first to prevent partial matches (e.g. LPe must be tried before LP).
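The effect of longest-first ordering can be shown with the standard re module. This is a minimal sketch of the idea only; build_alternation is a hypothetical helper, and how the real factory embeds the pattern into the Arpeggio grammar is not shown here.

```python
import re


def build_alternation(codes):
    """Join codes into a regex alternation, longest codes first."""
    ordered = sorted(set(codes), key=lambda code: (-len(code), code))
    return "|".join(re.escape(code) for code in ordered)


vocabulary = ["LP", "Ko", "LPe", "D"]
longest_first = re.compile(build_alternation(vocabulary))
naive = re.compile("|".join(re.escape(code) for code in vocabulary))

# Regex alternation takes the first branch that matches, so order decides.
full = longest_first.match("LPe").group(0)
partial = naive.match("LPe").group(0)
```

With the unsorted vocabulary, LP wins and the trailing e is left unconsumed; with longest-first ordering the whole code LPe is matched.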

.. warning:: Arpeggio parsers carry mutable internal state and are not thread-safe. Call this factory and instantiate ParserPython inside each Dask worker function (i.e. inside map_partitions lambdas or the per_partition_* helper). Never share a parser across tasks.

Parameters:

known_siffers
    Iterable of valid siffer code strings (loaded from updated_uniq_jan25_2026.csv via :func:read_one_column_csv). Required.

Returns:

Callable
    Root grammar function suitable for ParserPython(root, debug=False).

Example

>>> root = make_siffer_root_parser(siffer_list)
>>> parser = ParserPython(root, debug=False)
>>> result = quick_parse_dask(parser, "Ko;LP")

quick_parse_dask(siffer_parser, text)

Parse a single siffer string and return a pd.Series suitable for dask apply.

Extracts up to 4 siffer symbols from the parse tree. Returns a fixed-schema Series so that Series.apply() expands into a well-typed DataFrame.

Columns returned

  • siffer_1 .. siffer_4: individual soil-type codes (empty string if absent)
  • n_siffers: number of symbols actually found (int)
  • parse_ok: True on success, False on any error (bool)
  • parse_error: error message string, empty on success
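The fixed seven-column schema listed above can be sketched with a plain dict. This is an illustration under assumptions: to_fixed_schema is a hypothetical helper, it returns a dict rather than a pd.Series, and it takes an already extracted list of symbols instead of walking a parse tree.

```python
def to_fixed_schema(symbols, error=""):
    """Build the seven-key row for up to four siffer symbols."""
    symbols = list(symbols)[:4]
    row = {
        f"siffer_{i + 1}": (symbols[i] if i < len(symbols) else "")
        for i in range(4)
    }
    row["n_siffers"] = len(symbols)
    row["parse_ok"] = error == ""
    row["parse_error"] = error
    return row
```

Because every call returns the same keys in the same order, applying such a function row by row always expands into the same columns, including for rows that failed to parse.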

test_erodee(siffer)

Strip bare numeric erosion-degree annotations enclosed in parentheses.

Removes patterns like (1), (1;2), (1,2), (1:2), (1-2) that appear inside siffer strings as OCR/data-entry artefacts from depth-class or erosion-class annotations in the original analogue map. The parenthesised numbers are not part of the soil-type code and must be removed before parsing.

Parameters:

siffer
    Raw siffer string possibly containing parenthesised numeric tokens. Required.

Returns:

str
    Siffer string with all matched parenthesised numeric patterns removed.

Example

>>> test_erodee("Ko(2)I")
'KoI'
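One way to express the behaviour described for this step is a single regular expression. The following is a sketch, not the library's code: strip_parenthesised_numbers and its pattern are assumptions that cover only the forms listed above, namely (1), (1;2), (1,2), (1:2) and (1-2).

```python
import re

# "(" digits, optionally more digit groups joined by ; , : or -, then ")".
_PAREN_NUMBERS = re.compile(r"\([0-9]+(?:[;,:-][0-9]+)*\)")


def strip_parenthesised_numbers(siffer):
    """Remove parenthesised numeric annotations; leave other brackets alone."""
    return _PAREN_NUMBERS.sub("", siffer)
```

Bracketed soil codes such as Ko(LPe) contain letters, so this pattern leaves them untouched.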

test_erodee_E(siffer)

Collapse numeric erosion-degree suffixes attached to the E soil-code prefix.

The Erosioonmuld (E) code is sometimes followed by a numeric erosion-intensity class (e.g. E1, E2, E(1-2), E1;2). These annotations are not part of the standardised siffer vocabulary and must be stripped to leave just E.

Parameters:

siffer
    Raw siffer string possibly containing E-prefixed numeric variants. Required.

Returns:

str
    Siffer string with all E-numeric suffixes collapsed to bare E.

Example

>>> test_erodee_E("EI;E2")
'EI;E'
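The collapsing behaviour can be sketched with a parametrised regex that takes the code prefix as an argument, so the same helper also illustrates the C variant. This is an assumption-laden sketch: collapse_numeric_suffix is hypothetical, and it does not check token boundaries, so any occurrence of the prefix followed by digits is collapsed.

```python
import re


def collapse_numeric_suffix(siffer, prefix):
    """Collapse e.g. E2, E(1-2) or E1;2 to the bare prefix."""
    numbers = r"[0-9]+(?:[;,:-][0-9]+)*"
    pattern = re.compile(rf"{re.escape(prefix)}(?:\({numbers}\)|{numbers})")
    return pattern.sub(prefix, siffer)
```

A code such as EI is left alone because the prefix is followed by a letter, not a digit or a parenthesised number.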

test_erodee_C(siffer)

Collapse numeric qualifiers attached to the C soil-code prefix.

Analogous to :func:test_erodee_E but for C-type codes (e.g. C1, C(1), C1;2). Strips the numeric qualifier to leave bare C.

Parameters:

siffer
    Raw siffer string possibly containing C-prefixed numeric variants. Required.

Returns:

str
    Siffer string with all C-numeric suffixes collapsed to bare C.

test_erodee_koolon(siffer)

Remove bare colon-separated numeric pairs (last-resort cleanup step).

Strips patterns like 1:2, 3:4 that remain after previous repair steps. These are legacy depth-class or erosion-class annotations using a colon delimiter. Applied last because it is the most aggressive and seldom triggers.

Parameters:

siffer
    Partially repaired siffer string. Required.

Returns:

str
    Siffer string with any remaining N:M numeric pairs removed.


Loimis grammar

soil_lib.LoimisGrammarV2

Arpeggio-based grammar and 7-step parsing pipeline for the Loimis (soil texture) field.

The Loimis field of the Estonian soil map encodes the particle-size composition of the soil profile across up to four depth layers in a compact string notation.

Notation reference

Fine earth (peenes): l (liiv/sand), pl (peenliiv/fine sand), sl (saviliiv/clayey sand), ls (liivsavi/sandy clay), s (savi/clay), tsl (tolmjas saviliiv/dusty clayey sand), tls (tolmjas liivsavi/dusty sandy clay), dk (liivakivirähk/sandstone grit).

Rock skeleton (kores): r (rähk/gravel), v (veeris/cobbles), k (kruus/coarse gravel), kb (killustik/rubble), p (paas/limestone bedrock), lu (lupjakivi/calcareous material), ck (slates/schist debris).

Organic (turfs): th (toorhuumus/raw humus), t₁₂₃ (turvas/peat, decomposition degree 1-3).

Amplitude subscripts ₁-₅: abundance/intensity modifier on kores or peenes (e.g. r₂ = moderate gravel content, ls₁ = weak sandy clay fraction).

Depth ranges: N-M cm (e.g. 30-70); + prefix means deeper than.

Carbonate (karbonaat): + immediately before a component code indicates calcareous/carbonate material (e.g. +ls).

Layer separator: / separates depth layers (e.g. l40-70/ls₂30/+ls₂).
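To make the notation concrete, the sketch below splits a Loimis string into layers and decodes single-constituent tokens with one regex. It is a deliberate simplification and not the Arpeggio grammar: parse_layer and parse_loimis are hypothetical names, layers that combine several constituents are not handled, and a peat decomposition subscript would be reported in the same slot as an amplitude.

```python
import re

_SUB = str.maketrans("₁₂₃₄₅", "12345")
_TOKEN = re.compile(
    r"^(?P<karb>\+)?"                          # leading + = carbonate
    r"(?P<code>[a-z]+)"                        # component code, e.g. l, ls, r
    r"(?P<amp>[₁-₅])?"                         # subscript modifier
    r"(?P<depth>\+?[0-9]+(?:-[0-9]+)?)?$"      # N, N-M, or +N in cm
)


def parse_layer(token):
    """Decode one simple layer token; return None if it does not fit."""
    match = _TOKEN.match(token)
    if match is None:
        return None
    amp = match.group("amp")
    return {
        "karbonaat": match.group("karb") is not None,
        "code": match.group("code"),
        "amp": int(amp.translate(_SUB)) if amp else None,
        "depth": match.group("depth"),
    }


def parse_loimis(text):
    """Split on the layer separator and decode each layer."""
    return [parse_layer(token) for token in text.split("/")]
```

Applied to the example l40-70/ls₂30/+ls₂, this yields three layers, with only the last one marked as calcareous.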

Grammar variants

new_grammar() builds 10 Arpeggio parsers covering different constituent orderings (sp1-sp6) and amplitude-enabled kores variants (sp1a-sp4a).

Dask safety

Arpeggio parsers carry mutable state and must not be shared across tasks. Always call new_grammar() at the start of each Dask worker function.
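The rule can be pictured with a generic per-partition function that builds its own parser objects before doing any work. Everything in this sketch is a stand-in: StatefulParser and make_parsers are hypothetical substitutes for the Arpeggio parsers and for new_grammar(), and no Dask API is used.

```python
class StatefulParser:
    """Hypothetical stand-in for a parser that mutates internal state."""

    def __init__(self):
        self.position = 0

    def parse(self, text):
        self.position = 0
        layers = []
        for layer in text.split("/"):
            self.position += len(layer) + 1  # state changes while parsing
            layers.append(layer)
        return layers


def make_parsers():
    """Hypothetical stand-in for new_grammar(): always returns fresh objects."""
    return {"sp1": StatefulParser()}


def process_partition(rows):
    """Build the parsers inside the task, then use them only locally."""
    parsers = make_parsers()
    return [parsers["sp1"].parse(row) for row in rows]
```

A function shaped like process_partition is what would be handed to a per-partition call, so that no parser object ever crosses a task boundary.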

7-step parsing pipeline
  1. :func:split_and_cut_dask_sharp — normalise and split by layer delimiters
  2. :func:test_brackets_dask — validate and fix bracket nesting
  3. :func:parse_test_dask_multiple — try all grammar variants; normalise each layer token
  4. :func:parse_reconstituate_dask — rebuild clean texture string
  5. loimis_grammar_product_dask_multiple (LoimisVisitor) — apply visitor
  6. test_layer_depths_dask (LoimisVisitor) — extract layer depths
  7. set_texture_values_dask (LoimisVisitor) — extract texture fractions

new_grammar()

split_and_cut_dask_sharp(l_str, updated_texture_error_lookup)

Pipeline step 1: normalise and split a raw loimis string by layer delimiters.

Takes the raw Loimis1 field value, applies the error-correction lookup, discards any secondary siffer annotations after ;, and splits the remaining string by / (layer separator). The layers are rejoined with the internal || delimiter so that subsequent pipeline steps can work on a plain string rather than a list.

Parameters:

l_str : str
    Raw loimis string, e.g. "l40-70/ls₂30/+ls₂", or None. Required.

updated_texture_error_lookup
    Dict mapping known malformed loimis tokens to their corrected canonical forms (from :mod:.LoimisLookups). Required.

Returns:

str
    ||-delimited layer string (e.g. "l40-70||ls₂30||+ls₂"), or "no_info" for null/non-string input.
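The behaviour described for step 1 can be approximated in a few lines. This is a sketch, not the function's source: split_layers is a hypothetical name, and the error lookup is assumed to be applied to the whole string with dict.get, which may differ from how the real lookup is applied.

```python
def split_layers(l_str, error_lookup):
    """Normalise a raw loimis value into a ||-delimited layer string."""
    if not isinstance(l_str, str):
        return "no_info"  # null or non-string input
    fixed = error_lookup.get(l_str, l_str)  # assumption: whole-string lookup
    primary = fixed.split(";", 1)[0]  # drop anything after the first ";"
    return "||".join(part.strip() for part in primary.split("/"))
```

Joining with || keeps the result a plain string, so the later steps can pass it through Series.apply without handling lists.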

test_brackets_dask(l_str)

Pipeline step 2: validate and fix bracket nesting within each layer token.

Iterates over ||-delimited layer tokens. For each token, consolidates repeated subscript characters and resolves parenthesised numeric depth annotations (e.g. r₂30(20) → r₂30) using the bracket-matcher regex patterns defined at module level.

Parameters:

l_str : str
    ||-delimited string produced by :func:split_and_cut_dask_sharp. Required.

Returns:

str
    ||-delimited string with bracket artefacts removed from each token.

parse_test_dask_multiple(t_str, parsers_d)

Pipeline step 3: normalise each layer token using the best-matching grammar.

Tries all grammar variants in parsers_d against each ||-delimited layer token via :func:consolidate_loimis_multiple_p. The first grammar that succeeds is used to produce a canonical normalised string. Tokens that match no grammar are replaced with "no_info".

Parameters:

t_str : str
    ||-delimited string from :func:test_brackets_dask. Required.

parsers_d : Dict
    Dict of {name: ParserPython} from :func:new_grammar. Required.

Returns:

pd.Series
    Two-element series (normalised_str, error_count), where normalised_str is the ||-delimited result and error_count counts tokens that could not be parsed by any grammar variant.

parse_reconstituate_dask(t_str)

Pipeline step 4: reassemble normalised layer tokens into a slash-separated string.

Splits the ||-delimited result from step 3, counts how many layers contain "no_info" (unparseable), and rejoins with / to produce the final reconstituted loimis string that the visitor (step 5) will parse.

Parameters:

t_str : str
    ||-delimited normalised string from :func:parse_test_dask_multiple. Required.

Returns:

pd.Series
    Two-element series (loimis_reconst, has_no_info), where loimis_reconst is the /-separated string ready for the visitor and has_no_info is the count of unparseable layer tokens (0 = all parsed).
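Step 4 amounts to a split, a count, and a join. The sketch below mirrors that description with a hypothetical reconstitute helper that returns a plain tuple instead of a pd.Series.

```python
def reconstitute(t_str):
    """Return (slash-separated string, number of unparseable layers)."""
    layers = t_str.split("||")
    n_no_info = sum(1 for layer in layers if "no_info" in layer)
    return "/".join(layers), n_no_info
```

A count of 0 means every layer was normalised in step 3; any other value marks the row for partial or failed parsing downstream.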


Loimis visitor

soil_lib.LoimisVisitor

Arpeggio parse-tree visitors and texture extraction functions for the Loimis pipeline.

This module contains two PTNodeVisitor implementations that walk the Arpeggio parse tree produced by the grammars in :mod:.LoimisGrammarV2 and extract structured soil data, plus the module-level functions that apply them and derive the final tabular output columns.

Visitor classes

LpVisitorV2 (production) Walks the parse tree and assembles a nested dict describing all soil constituents within the loimis string. Output structure::

    {
        "type": "loimis",
        "count": <int>,          # number of soil layers
        "soilparts": [           # one entry per "/" layer
            {
                "constituents": [
                    {
                        "type":      "kores" | "peenes" | "turfs",
                        "code":      <str>,   # e.g. "r", "ls", "th"
                        "karbonaat": <bool>,  # True = calcareous material
                        "amp":       <int> | False,   # amplitude 1-5
                        "depth":     <dict> | False   # see visit_depth_range
                    },
                    ...
                ],
                "count": <int>
            },
            ...
        ]
    }
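The nested structure above is straightforward to traverse. As an illustration, the sketch below derives a per-layer carbonate flag (1 if any constituent is calcareous) from such a dict. carbonate_flags is a hypothetical helper and the sample dict is hand-written for the string l40-70/ls₂30/+ls₂, with depth left as False for brevity.

```python
def carbonate_flags(grammar, n_layers=4):
    """Return one 0/1 flag per layer slot, padded to n_layers."""
    flags = [0] * n_layers
    for i, part in enumerate(grammar.get("soilparts", [])[:n_layers]):
        if any(c.get("karbonaat") for c in part.get("constituents", [])):
            flags[i] = 1
    return flags


def constituent(code, karbonaat=False, amp=False):
    """Build one constituent entry in the documented shape."""
    return {"type": "peenes", "code": code, "karbonaat": karbonaat,
            "amp": amp, "depth": False}


sample = {
    "type": "loimis",
    "count": 3,
    "soilparts": [
        {"constituents": [constituent("l")], "count": 1},
        {"constituents": [constituent("ls", amp=2)], "count": 1},
        {"constituents": [constituent("ls", karbonaat=True, amp=2)], "count": 1},
    ],
}
flags = carbonate_flags(sample)
```

Only the third layer carries the + prefix, so only the third flag is set; the fourth slot stays 0 because the profile has three layers.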

LpVisitor (legacy, kept for reference)

Depth dict format (from visit_depth_range)::

{
    "range":       <bool>,    # True = N-M range, False = single value
    "from":        <float>,   # upper depth boundary in mm (cm × 10)
    "to":          <float>,   # lower depth boundary in mm (cm × 10)
    "deeper_than": <bool>     # True when '+' prefix present
}

Module-level output functions

:func:loimis_grammar_product_dask_multiple — step 5 of the pipeline; applies LpVisitorV2 to produce the loimis_grammar dict and parse_info string.

:func:test_layer_depths_dask — step 6; extracts a 6-element positional pd.Series: [nlayers, ZMX, Z1, Z2, Z3, Z4].

:func:set_texture_values_dask — step 7; extracts a 33-element positional pd.Series (indices 0-32): EST_TXT1-4, EST_CRS1-4, LXTYPE1-4, CLAY1-4, SILT1-4, SAND1-4, ROCK1-4, loimis_search, KARB1-4.

loimis_grammar_product_dask_multiple(input_expr, parsers_d)

Pipeline step 5: apply the LpVisitorV2 to build the structured loimis dict.

Iterates over /-separated layer expressions in input_expr, finds the matching grammar variant for each, and applies the visitor to produce a structured loimis_grammar dict describing all soil constituents.

Parameters:

input_expr : str
    Reconstituted loimis string from step 4 (parse_reconstituate_dask), e.g. "l40-70/ls₂30", or None / "no_info" for empty entries. Required.

parsers_d : Dict
    Dict of {name: ParserPython} from new_grammar() (must be freshly created per Dask task, not shared). Required.

Returns:

pd.Series
    Three-element series with keys loimis_grammar (nested dict describing all soil constituents), parse_info (status string: "successful", "empty_loimis", "parse_error", or "partial_no_info"), and parse_ok_l (bool, True when all layers parsed successfully).

test_layer_depths_dask(grammar_loimis)

Pipeline step 6: extract layer count and depth boundaries from the grammar dict.

Traverses the loimis_grammar structure produced by step 5 and reads the depth information encoded in constituent depth dicts to determine how many distinct soil layers are present and where each layer boundary lies.

Parameters:

grammar_loimis : Dict[str, Any]
    The loimis_grammar dict from loimis_grammar_product_dask_multiple. Required.

Returns:

pd.Series
    Six-element positional series (access by integer index):

    ==== =========== =================================================
    Idx  Field name  Description
    ==== =========== =================================================
    0    nlayers     Number of distinct texture layers (int)
    1    ZMX         Maximum soil depth / total profile depth (cm)
    2    Z1          Lower boundary of layer 1 (cm)
    3    Z2          Lower boundary of layer 2 (cm), 0 if absent
    4    Z3          Lower boundary of layer 3 (cm), 0 if absent
    5    Z4          Lower boundary of layer 4 (cm), 0 if absent
    ==== =========== =================================================

set_texture_values_dask(grammar_loimis)

Pipeline step 7: extract granulometric texture fractions for each soil layer.

Looks up each constituent code in texture_rules (from LoimisLookups) to derive the clay, silt, sand, and rock-fragment percentages, and maps the Estonian texture code to an international (WRB) texture class.

Parameters:

grammar_loimis : Dict[str, Any]
    The loimis_grammar dict from step 5. Required.

Returns:

pd.Series
    33-element positional series (access by integer index):

    ====== ============= ===================================================
    Idx    Field name    Description
    ====== ============= ===================================================
    0-3    EST_TXT1-4    Estonian texture code per layer (str)
    4-7    EST_CRS1-4    Coarse fraction type code per layer (str)
    8-11   LXTYPE1-4     International texture class per layer (str), e.g. "SAND", "LOAM", "CLAY", "PEAT", "GRAVELS"
    12-15  CLAY1-4       Clay fraction per layer (%, float or NaN)
    16-19  SILT1-4       Silt fraction per layer (%, float or NaN)
    20-23  SAND1-4       Sand fraction per layer (%, float or NaN)
    24-27  ROCK1-4       Rock/skeleton fraction per layer (%, float or NaN)
    28     loimis_search Raw search-parameter dict (for diagnostics)
    29-32  KARB1-4       Carbonate flag per layer: 1 if any constituent is calcareous (+ prefix), 0 otherwise (int)
    ====== ============= ===================================================

    Note: Clay/silt/sand fractions are NaN for gravel/rock-dominated layers where LXTYPE = "GRAVELS" and ROCK carries the content.
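Because the result is positional, callers may find it convenient to attach names to the 33 slots. The sketch below rebuilds the field order from the index table of this function. FIELD_NAMES, FIELD_INDEX and label_row are hypothetical helpers and not part of the module.

```python
# Seven groups of four per-layer fields occupy indices 0-27.
_PREFIXES = ["EST_TXT", "EST_CRS", "LXTYPE", "CLAY", "SILT", "SAND", "ROCK"]

FIELD_NAMES = [f"{prefix}{layer}" for prefix in _PREFIXES for layer in range(1, 5)]
FIELD_NAMES.append("loimis_search")                        # index 28
FIELD_NAMES += [f"KARB{layer}" for layer in range(1, 5)]   # indices 29-32

FIELD_INDEX = {name: idx for idx, name in enumerate(FIELD_NAMES)}


def label_row(values):
    """Pair a 33-element positional result with its field names."""
    values = list(values)
    if len(values) != len(FIELD_NAMES):
        raise ValueError(f"expected {len(FIELD_NAMES)} values, got {len(values)}")
    return dict(zip(FIELD_NAMES, values))
```

For instance, FIELD_INDEX["CLAY1"] gives the position of the layer-1 clay fraction, which avoids hard-coding the integer in downstream code.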