API Reference¶
Generated automatically from the docstrings in the Python source code. After changing the source code, run pixi run -e docs docs-build to update this reference.
Huumus parser¶
soil_lib.huumus_parser
¶
Regex-based parser for the Huumus (organic horizon) field of the Estonian soil map.
The Huumus field describes the organic surface horizon of a soil profile. Up to four space-separated sub-formulas (Sif1-Sif4) may appear; each corresponds to one dominant soil unit within the mapping polygon.
A sub-formula may carry two slash-separated variants::
left = metsakõdu (forest litter) reading
right = põllumaa huumus (agricultural humus) reading
When only one part is present it applies to both land-use types.
Within each part, individual organic layers are joined by + or ;.
Notation reference¶
th[depth]
Toorhuumus (raw/mor humus). Always > 10 cm; strongly acidic, poorly
decomposed; typical of conifer forests. E.g. th15 (15 cm thick).
t[degree][depth]
Turvas (peat). Decomposition degree encoded as Unicode subscripts
₁₂₃ (U+2081-U+2083) or plain ASCII digits 1/2/3:
1 = weakly, 2 = moderately, 3 = strongly decomposed. E.g. t₂20.
h[depth]
Huumus (mineral humus horizon, mull). Rare in the dataset (~4 rows).
[depth][degree] or [degree][depth]
Metsakõdu (forest litter). Decomposition degree 1-3; thickness ≤ 10 cm.
Subscript notation (₁₂₃) unambiguously means kõdu. E.g. 5₂ (5 cm,
degree 2).
0
No organic matter present.
Depth notation: single value (e.g. th15) or range N-M cm (th15-25).
Represents thickness of the horizon, not depth-to-top.
Brackets (...) mark uncertain or locally variable layers — stripped before
parsing.
Formal encoding structure::
siffer1 [siffer2 [siffer3 [siffer4]]] (space-separated)
Each siffer : [forest_part /] agri_part (slash split)
Each part : unit [+ unit ...]
Each unit : th[d] | t[deg][d] | h[d] | 0 | [d][deg]
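The unit notation above can be sketched with a handful of regular expressions. This is an illustrative sketch only, not the module's implementation: `classify_unit`, `UNIT_PATTERNS`, and the returned dict keys are hypothetical names, and the sketch assumes decomposition degrees are written as Unicode subscripts (plain ASCII degrees are ambiguous with depth digits and are not handled here).

```python
import re

# Unicode subscript digits used for decomposition degrees.
SUBSCRIPT_DEGREE = {"₁": 1, "₂": 2, "₃": 3}

# Thickness: a single value ("15") or a range ("15-25").
DEPTH = r"(?P<dmin>\d+)(?:-(?P<dmax>\d+))?"

# fullmatch() is used below, so the patterns cannot shadow one another.
UNIT_PATTERNS = [
    ("th", re.compile(rf"th(?:{DEPTH})?")),
    ("t", re.compile(rf"t(?P<deg>[₁₂₃])?(?:{DEPTH})?")),
    ("h", re.compile(rf"h(?:{DEPTH})?")),
    ("kodu", re.compile(rf"{DEPTH}(?P<deg>[₁₂₃])")),   # [depth][degree]
    ("kodu", re.compile(rf"(?P<deg>[₁₂₃]){DEPTH}")),   # [degree][depth]
]


def classify_unit(token):
    """Classify one organic-layer unit; return None if nothing matches."""
    token = token.strip().strip("()")  # brackets only mark uncertainty
    if token == "0":
        return {"type": "none", "deg": None, "min": None, "max": None}
    for unit_type, pattern in UNIT_PATTERNS:
        match = pattern.fullmatch(token)
        if match:
            groups = match.groupdict()
            dmin = int(groups["dmin"]) if groups["dmin"] else None
            dmax = int(groups["dmax"]) if groups["dmax"] else dmin
            return {
                "type": unit_type,
                "deg": SUBSCRIPT_DEGREE.get(groups.get("deg")),
                "min": dmin,
                "max": dmax,
            }
    return None
```

A single thickness is reported with `min == max`, which keeps the output shape identical for single values and ranges.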
Main entry point¶
:func:parse_huumus_formula — Dask-safe, returns a 24-column pd.Series.
See :data:_EMPTY_PARSE for the full output schema with field descriptions.
Copyright (C) 2018-2026 Alexander Kmoch, University of Tartu. MIT License.
parse_huumus_formula(text)
¶
Parse a single Huumus field value into a fixed-schema pd.Series.
Implements the Estonian soil-map Huumus encoding rules:
- Up to four space-separated siffer sub-formulas (Sif1 … Sif4).
- / separates upper and lower depth layers in the soil profile (left = upper layer, right = lower layer).
- " / " (space-slash-space) is the legacy paper-map convention for agricultural-vs-forest-land separation within one polygon; it is detected by preprocessing and flagged as h_is_agri_forest.
- Within each side, + or ; separates stacked sub-layers (litter layers or multiple humus types).
- Individual units are classified as toorhuumus (th), turvas/peat (t), huumus (h), no-organic (0), or metsakõdu/litter (digit/subscript).
Output columns per unit (×4):
- h_phu_type/_min/_max: primary humus from the right (lower-layer) side of / (or the whole expression if no / is present).
- h_lhu_type/_min/_max: secondary humus from the left (upper-layer) side, if present.
- h_o1_deg/_min/_max through h_o3_*: litter (kõdu) layers from the left side, in bottom-to-top order: O1 = bottom (most decomposed, deepest), O3 = top (freshest, shallowest).
- h_has_depth_split: True when / is present.
- h_is_agri_forest: True when the " / " pattern is detected.
- h_depth_total: sum of phu + lhu depth midpoints; values > 40 cm are flagged for agri/forest review.
Designed as a drop-in for Series.apply() inside a Dask
map_partitions call. The function is a plain module-level function
with no closures over non-picklable objects, so it is safe for
processes=True Dask workers.
Parameters¶
text : str | None
Raw Huumus string, e.g. 'th15/h5 t₂20'
(toorhuumus 15 cm / humus 5 cm upper/lower; peat degree-2 20 cm).
Returns¶
pd.Series
~96 fixed keys — see _EMPTY_PARSE and the module docstring for
the full schema and field descriptions.
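The splitting rules described for this function can be illustrated in a few lines. This is a sketch under stated assumptions rather than the real parser: `split_huumus` is a hypothetical helper, and it only shows the order of operations (detect the spaced slash first, then split on whitespace, then on `/`, then on `+` or `;`).

```python
import re


def split_huumus(text):
    """Split a Huumus value into sub-formulas, sides, and stacked units."""
    # The spaced slash must be detected before splitting on whitespace,
    # otherwise " / " would be read as a sub-formula boundary.
    is_agri_forest = " / " in text
    normalised = text.replace(" / ", "/")
    siffers = normalised.split()[:4]  # at most four sub-formulas
    parsed = [
        [[unit for unit in re.split(r"[+;]", side) if unit]
         for side in siffer.split("/")]
        for siffer in siffers
    ]
    return is_agri_forest, parsed
```

For the documented input `'th15/h5 t₂20'` this yields two sub-formulas, the first with two sides and the second with one.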
Siffer fixer¶
soil_lib.siffer_fixer
¶
Pre-processing and grammar-based parsing of Estonian soil-type codes (siffers).
A siffer is the alphanumeric soil-type code assigned to each mapping unit in
the Estonian national soil map (mullastikukaart). Each polygon carries up to
four siffer symbols, semicolon-separated — e.g. Ko;D;LPe;LP encodes four
dominant soil types (Korestikmaa, Deluviaalmuld, Leetjas-paepealne erosioonimuld,
Leetjas-paepealne muld).
Why repair is needed¶
The analogue soil map was digitised over decades by multiple operators, producing a wide range of data-entry artefacts:
- Mixed delimiters: commas, spaces, colons, or dashes used instead of ;
- Erosion-degree annotations: numeric suffixes like E1, E(1;2), C3 appended to soil codes where none should appear
- OCR and transcription errors: letter swaps, merged tokens, stray brackets
- Legacy abbreviations: obsolete code variants from earlier mapping rounds
This module implements a multi-step repair workflow that normalises raw siffer strings before handing them to the Arpeggio grammar validator.
Typical workflow::
    from arpeggio import ParserPython

    from soil_lib.siffer_fixer import (
        siffer_repair_rules_lookup_full_match, test_erodee, test_erodee_E,
        test_erodee_C, test_erodee_koolon, single_errors_replace, special_keep,
        make_siffer_root_parser, quick_parse_dask, read_one_column_csv,
    )

    # Load the known siffer vocabulary and build the grammar from it.
    siffer_list = read_one_column_csv("updated_uniq_jan25_2026.csv", skip_header=True)[1]
    root = make_siffer_root_parser(siffer_list)
    parser = ParserPython(root, debug=False)

    raw = "Ko Ko LP"
    # Whole-string replacements for known error patterns.
    s = siffer_repair_rules_lookup_full_match.get(raw, raw)
    # Strip erosion-degree and other numeric annotations.
    s = test_erodee(s)
    s = test_erodee_E(s)
    s = test_erodee_C(s)
    s = test_erodee_koolon(s)
    s = single_errors_replace.get(s, s)
    # Normalise delimiters unless the value is an administrative label.
    if s not in special_keep:
        s = s.replace(":", ";").replace(" ", ";").replace(",", ";")
    result = quick_parse_dask(parser, s)  # pd.Series with 7 columns
Exports¶
Lookup tables (dicts):
siffer_repair_rules_lookup_full_match: whole-string replacements for known multi-token error patterns (~700 entries, curated by domain scientists).
single_errors_replace: per-token character-level replacements (~200 entries).
special_keep: verbatim pass-through labels that are administrative map categories (e.g. "Veeala, asustus või määramata"), not soil-type codes.
read_one_column_csv(file_path, string_quote='"', skip_header=True)
¶
Read a single-column CSV file into a list of string values.
Pure-Python, no external dependencies — safe to call before the main environment is fully initialised.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| file_path | | Path to the CSV file (UTF-8 encoded). | required |
| string_quote | | Quote character to strip from each value (default '"'). | '"' |
| skip_header | | Whether the first row is treated as a header row. | True |
Returns:
| Type | Description |
|---|---|
| tuple[str \| None, list[str]] | The column name (or None) and the list of stripped, dequoted non-empty row values. |
Example
_, siffers = read_one_column_csv("updated_uniq_jan25_2026.csv")
len(siffers)  # 2847
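A dependency-free reader of this kind might look roughly like the following. This is a minimal sketch, not the library's code: `read_single_column` is a hypothetical name, and treating the first row as the column name when `skip_header` is true is an assumption based on the documented return type. The demo file contents are made up.

```python
import os
import tempfile


def read_single_column(path, quote='"', skip_header=True):
    """Return (header or None, values) for a one-column UTF-8 text file."""
    header = None
    values = []
    with open(path, encoding="utf-8") as handle:
        for index, line in enumerate(handle):
            value = line.strip().strip(quote)
            if index == 0 and skip_header:
                header = value  # assumption: the first row names the column
                continue
            if value:  # drop empty rows
                values.append(value)
    return header, values


# Usage on a throwaway file (contents are invented for illustration).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "siffers.csv")
    with open(demo_path, "w", encoding="utf-8") as out:
        out.write('siffer\n"Ko"\n\n"LPe"\n')
    demo_header, demo_values = read_single_column(demo_path)
```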
make_siffer_root_parser(known_siffers)
¶
Build an Arpeggio PEG grammar factory for a given siffer vocabulary.
Constructs a closure (a root parser function) that Arpeggio's
ParserPython can compile into a parser. The grammar recognises three
structural forms:
- pair_symbol: Ko(LPe) (dominant + minor in brackets)
- semicolon_separated: Ko;D;LPe;LP (up to 4 codes, semicolon-delimited)
- single_symbol: Ko (single code)
The vocabulary is embedded as a regex alternation sorted longest-first to
prevent partial matches (e.g. LPe must be tried before LP).
.. warning::
Arpeggio parsers carry mutable internal state and are not thread-safe.
Call this factory and instantiate ParserPython inside each Dask
worker function (i.e. inside map_partitions lambdas or the
per_partition_* helper). Never share a parser across tasks.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| known_siffers | | Iterable of valid siffer code strings (loaded from …). | required |
Returns:
| Type | Description |
|---|---|
| Callable | Root grammar function suitable for ParserPython. |
Example
root = make_siffer_root_parser(siffer_list)
parser = ParserPython(root, debug=False)
result = quick_parse_dask(parser, "Ko;LP")
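The longest-first ordering matters because ordered alternation stops at the first alternative that matches. The sketch below demonstrates the effect with Python's `re` module instead of Arpeggio; `build_vocab_pattern` and the four-code vocabulary are hypothetical and only illustrate the sorting idea.

```python
import re


def build_vocab_pattern(known_siffers):
    """Compile an alternation that tries longer codes before shorter ones."""
    ordered = sorted(set(known_siffers), key=len, reverse=True)
    return re.compile("|".join(re.escape(code) for code in ordered))


vocab = ["LP", "Ko", "LPe", "D"]
pattern = build_vocab_pattern(vocab)

# Same vocabulary without sorting: "LP" is tried first and wins.
naive = re.compile("|".join(re.escape(code) for code in vocab))

sorted_hit = pattern.match("LPe").group()
naive_hit = naive.match("LPe").group()
```

With the unsorted alternation the trailing `e` is left unconsumed, which is exactly the partial match the sorted vocabulary avoids.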
quick_parse_dask(siffer_parser, text)
¶
Parse a single siffer string and return a pd.Series suitable for dask apply.
Extracts up to 4 siffer symbols from the parse tree. Returns a fixed-schema Series so that Series.apply() expands into a well-typed DataFrame.
Columns returned
- siffer_1 .. siffer_4: individual soil-type codes (empty string if absent)
- n_siffers: number of symbols actually found (int)
- parse_ok: True on success, False on any error (bool)
- parse_error: error message string, empty on success
test_erodee(siffer)
¶
Strip bare numeric erosion-degree annotations enclosed in parentheses.
Removes patterns like (1), (1;2), (1,2), (1:2), (1-2)
that appear inside siffer strings as OCR/data-entry artefacts from depth-class
or erosion-class annotations in the original analogue map. The parenthesised
numbers are not part of the soil-type code and must be removed before parsing.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| siffer | | Raw siffer string possibly containing parenthesised numeric tokens. | required |
Returns:
| Type | Description |
|---|---|
| str | Siffer string with all matched parenthesised numeric patterns removed. |
Example
test_erodee("Ko(2)I")  # 'KoI'
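One way to express the patterns listed for this step is a single regular expression. The sketch below is an assumption about how such a rule could be written, not the module's actual pattern; `strip_parenthesised_numbers` is a hypothetical name.

```python
import re

# "(1)", "(1;2)", "(1,2)", "(1:2)", "(1-2)": digits joined by one delimiter.
PAREN_NUMERIC = re.compile(r"\(\d+(?:[;,:\-]\d+)*\)")


def strip_parenthesised_numbers(siffer):
    """Remove parenthesised numeric annotations, keeping bracketed codes."""
    return PAREN_NUMERIC.sub("", siffer)
```

Because the pattern only accepts digits and delimiters between the parentheses, a bracketed soil code such as `Ko(LPe)` passes through unchanged.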
test_erodee_E(siffer)
¶
Collapse numeric erosion-degree suffixes attached to the E soil-code prefix.
The Erosioonmuld (E) code is sometimes followed by a numeric erosion-intensity
class (e.g. E1, E2, E(1-2), E1;2). These annotations are
not part of the standardised siffer vocabulary and must be stripped to leave
just E.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| siffer | | Raw siffer string possibly containing numeric suffixes on E codes. | required |
Returns:
| Type | Description |
|---|---|
| str | Siffer string with all such suffixes collapsed to bare E. |
Example
test_erodee_E("EI;E2")  # 'EI;E'
test_erodee_C(siffer)
¶
Collapse numeric qualifiers attached to the C soil-code prefix.
Analogous to :func:test_erodee_E but for C-type codes (e.g. C1,
C(1), C1;2). Strips the numeric qualifier to leave bare C.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| siffer | | Raw siffer string possibly containing numeric qualifiers on C codes. | required |
Returns:
| Type | Description |
|---|---|
| str | Siffer string with all such qualifiers collapsed to bare C. |
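Since the E and C steps are described as analogous, both can be sketched with one parameterised helper. This is a hypothetical illustration (`collapse_numeric_qualifier` is not part of the documented API), and it assumes the prefix letter must start a token, so that codes such as `EI` or `LPe` are not touched.

```python
import re


def collapse_numeric_qualifier(siffer, prefix):
    """Strip numeric qualifiers that follow a token-initial prefix letter."""
    numbers = r"\d+(?:[;,:\-]\d+)*"
    pattern = re.compile(
        # Lookbehind: the prefix must not be preceded by another letter.
        rf"(?<![A-Za-z]){re.escape(prefix)}(?:\({numbers}\)|{numbers})"
    )
    return pattern.sub(prefix, siffer)
```

Note that this sketch treats `E1;2` as one annotated code, following the examples in the description; a real rule would need to decide how to tell that apart from two semicolon-separated tokens.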
test_erodee_koolon(siffer)
¶
Remove bare colon-separated numeric pairs (last-resort cleanup step).
Strips patterns like 1:2, 3:4 that remain after previous repair steps.
These are legacy depth-class or erosion-class annotations using a colon
delimiter. Applied last because it is the most aggressive and seldom triggers.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| siffer | | Partially repaired siffer string. | required |
Returns:
| Type | Description |
|---|---|
| str | Siffer string with any remaining colon-separated numeric pairs removed. |
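A sketch of this cleanup step, under the assumption that a plain `N:M` pattern is sufficient (`strip_colon_pairs` is a hypothetical name). It also shows why the step has to run before colons are rewritten to semicolons in the typical workflow.

```python
import re

COLON_PAIR = re.compile(r"\d+:\d+")


def strip_colon_pairs(siffer):
    """Remove leftover N:M numeric pairs from a partially repaired string."""
    return COLON_PAIR.sub("", siffer)


cleaned = strip_colon_pairs("Ko1:2;LP")
# If ':' were rewritten to ';' first, the pair would no longer be recognisable.
too_late = strip_colon_pairs("Ko1:2;LP".replace(":", ";"))
```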
Loimis grammar¶
soil_lib.LoimisGrammarV2
¶
Arpeggio-based grammar and 7-step parsing pipeline for the Loimis (soil texture) field.
The Loimis field of the Estonian soil map encodes the particle-size composition of the soil profile across up to four depth layers in a compact string notation.
Notation reference¶
Fine earth (peenes): l (liiv/sand), pl (peenliiv/fine sand),
sl (saviliiv/clayey sand), ls (liivsavi/sandy clay), s (savi/clay),
tsl (tolmjas saviliiv/dusty clayey sand), tls (tolmjas liivsavi/dusty
sandy clay), dk (liivakivirähk/sandstone grit).
Rock skeleton (kores): r (rähk/gravel), v (veeris/cobbles),
k (kruus/coarse gravel), kb (killustik/rubble), p (paas/limestone
bedrock), lu (lupjakivi/calcareous material), ck (slates/schist debris).
Organic (turfs): th (toorhuumus/raw humus), t₁₂₃ (turvas/peat,
decomposition degree 1-3).
Amplitude subscripts ₁-₅: abundance/intensity modifier on kores or peenes
(e.g. r₂ = moderate gravel content, ls₁ = weak sandy clay fraction).
Depth ranges: N-M cm (e.g. 30-70); + prefix means deeper than.
Carbonate (karbonaat): + immediately before a component code indicates
calcareous/carbonate material (e.g. +ls).
Layer separator: / separates depth layers (e.g. l40-70/ls₂30/+ls₂).
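To make the notation concrete, the sketch below tokenises the example string from the layer-separator entry. It is a simplified illustration, not the Arpeggio grammar: `tokenize_loimis` is hypothetical, it assumes exactly one component per layer, and it does not distinguish peat decomposition degrees from amplitude subscripts.

```python
import re

COMPONENT = re.compile(
    r"(?P<karb>\+)?"             # leading '+' marks carbonate material
    r"(?P<code>[a-z]+)"          # texture, skeleton, or organic code
    r"(?P<amp>[₁-₅])?"           # amplitude subscript 1-5
    r"(?P<depth>\d+(?:-\d+)?)?"  # thickness or N-M range in cm
)


def tokenize_loimis(text):
    """Split on '/' and describe each single-component layer."""
    layers = []
    for layer in text.split("/"):
        match = COMPONENT.fullmatch(layer)
        if match is None:
            layers.append(None)  # layer does not fit this simplified form
            continue
        amp = match.group("amp")
        layers.append({
            "karbonaat": match.group("karb") is not None,
            "code": match.group("code"),
            # Subscript digits are U+2081..U+2085, so subtracting U+2080 gives 1..5.
            "amp": ord(amp) - 0x2080 if amp else None,
            "depth": match.group("depth"),
        })
    return layers
```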
Grammar variants¶
new_grammar() builds 10 Arpeggio parsers covering different constituent
orderings (sp1-sp6) and amplitude-enabled kores variants (sp1a-sp4a).
Dask safety¶
Arpeggio parsers carry mutable state and must not be shared across tasks.
Always call new_grammar() at the start of each Dask worker function.
7-step parsing pipeline¶
1. split_and_cut_dask_sharp: normalise and split by layer delimiters
2. test_brackets_dask: validate and fix bracket nesting
3. parse_test_dask_multiple: try all grammar variants; normalise
4. parse_reconstituate_dask: rebuild clean texture string
5. loimis_grammar_product_dask_multiple (LoimisVisitor): apply visitor
6. test_layer_depths_dask (LoimisVisitor): extract layer depths
7. set_texture_values_dask (LoimisVisitor): extract texture fractions
new_grammar()
¶
split_and_cut_dask_sharp(l_str, updated_texture_error_lookup)
¶
Pipeline step 1: normalise and split a raw loimis string by layer delimiters.
Takes the raw Loimis1 field value, applies the error-correction lookup,
discards any secondary siffer annotations after ;, and splits the
remaining string by / (layer separator). The layers are rejoined with
the internal || delimiter so that subsequent pipeline steps can work on
a plain string rather than a list.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| l_str | str | Raw loimis string (the Loimis1 field value). | required |
| updated_texture_error_lookup | | Dict mapping known malformed loimis tokens to their corrected canonical forms (from :mod: …). | required |
Returns:
| Type | Description |
|---|---|
| str | The layers rejoined with the internal \|\| delimiter, or … |
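The description of step 1 translates into a very small function. The following is a sketch under assumptions: `split_layers` is a hypothetical stand-in, and the error lookup is applied to the whole string, which the documentation does not confirm.

```python
def split_layers(raw, error_lookup):
    """Correct, trim, and split a raw Loimis value into '||'-joined layers."""
    corrected = error_lookup.get(raw, raw)  # assumption: whole-string lookup
    primary = corrected.split(";", 1)[0]    # discard annotations after ';'
    layers = [part.strip() for part in primary.split("/")]
    return "||".join(layers)
```

Joining with `||` keeps the intermediate value a plain string, so each later step can be applied with `Series.apply()` without handling list-valued cells.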
test_brackets_dask(l_str)
¶
Pipeline step 2: validate and fix bracket nesting within each layer token.
Iterates over ||-delimited layer tokens. For each token, consolidates
repeated subscript characters and resolves parenthesised numeric depth
annotations (e.g. r₂30(20) → r₂30) using the bracket-matcher
regex patterns defined at module level.
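Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| l_str | str | \|\|-delimited layer string from step 1. | required |
Returns:
| Type | Description |
|---|---|
| str | \|\|-delimited layer string with repeated subscripts consolidated and parenthesised depth annotations resolved. |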
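The two clean-ups named for step 2 can be sketched as follows. The patterns here are assumptions chosen to reproduce the documented `r₂30(20)` → `r₂30` example; they are not the module-level bracket matchers, and `tidy_layer_tokens` is a hypothetical name.

```python
import re

# A subscript digit repeated one or more times, e.g. "₁₁".
REPEATED_SUBSCRIPT = re.compile(r"([₁-₅])\1+")
# A parenthesised number directly after a depth digit, e.g. "30(20)".
DEPTH_IN_PARENS = re.compile(r"(?<=\d)\(\d+\)")


def tidy_layer_tokens(joined):
    """Apply both clean-ups to every '||'-delimited layer token."""
    cleaned = []
    for token in joined.split("||"):
        token = REPEATED_SUBSCRIPT.sub(r"\1", token)
        token = DEPTH_IN_PARENS.sub("", token)
        cleaned.append(token)
    return "||".join(cleaned)
```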
parse_test_dask_multiple(t_str, parsers_d)
¶
Pipeline step 3: normalise each layer token using the best-matching grammar.
Tries all grammar variants in parsers_d against each ||-delimited layer
token via :func:consolidate_loimis_multiple_p. The first grammar that
succeeds is used to produce a canonical normalised string. Tokens that match
no grammar are replaced with "no_info".
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| t_str | str | \|\|-delimited layer string from step 2. | required |
| parsers_d | Dict | Dict of grammar-variant parsers (see new_grammar()). | required |
Returns:
| Type | Description |
|---|---|
| Series | pd.Series: two-element series. normalised_str is the \|\|-delimited string of normalised layer tokens; the second element reports the tokens that could not be parsed by any grammar variant. |
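The first-match-wins strategy of step 3 can be shown without Arpeggio by using compiled regular expressions as stand-ins for grammar variants. Everything here is illustrative: `VARIANTS`, `normalise_tokens`, and the tiny code vocabulary are hypothetical, and the sketch passes matched tokens through unchanged instead of producing a canonical form.

```python
import re

# Stand-ins for grammar variants, tried in insertion order.
VARIANTS = {
    "peenes_depth": re.compile(r"(?:pl|sl|ls|l|s)\d+(?:-\d+)?"),
    "peenes_only": re.compile(r"(?:pl|sl|ls|l|s)"),
}


def normalise_tokens(joined, variants):
    """Keep tokens matched by any variant; replace the rest with 'no_info'."""
    normalised, failed = [], []
    for token in joined.split("||"):
        for pattern in variants.values():
            if pattern.fullmatch(token):
                normalised.append(token)
                break  # first matching variant wins
        else:
            normalised.append("no_info")
            failed.append(token)
    return "||".join(normalised), failed
```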
parse_reconstituate_dask(t_str)
¶
Pipeline step 4: reassemble normalised layer tokens into a slash-separated string.
Splits the ||-delimited result from step 3, counts how many layers
contain "no_info" (unparseable), and rejoins with / to produce the
final reconstituted loimis string that the visitor (step 5) will parse.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| t_str | str | \|\|-delimited result from step 3. | required |
Returns:
| Type | Description |
|---|---|
| Series | pd.Series: two-element series. loimis_reconst is the reconstituted /-separated loimis string; has_no_info is the count of unparseable layer tokens (0 = all parsed). |
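Step 4 amounts to a split, a count, and a join. A minimal sketch, assuming a plain tuple instead of the documented pd.Series (`reconstitute` is a hypothetical name):

```python
def reconstitute(joined):
    """Rejoin '||'-delimited tokens with '/' and count unparseable layers."""
    tokens = joined.split("||")
    has_no_info = sum(1 for token in tokens if "no_info" in token)
    return "/".join(tokens), has_no_info
```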
Loimis visitor¶
soil_lib.LoimisVisitor
¶
Arpeggio parse-tree visitors and texture extraction functions for the Loimis pipeline.
This module contains two PTNodeVisitor implementations that walk the
Arpeggio parse tree produced by the grammars in :mod:.LoimisGrammarV2 and
extract structured soil data, plus the module-level functions that apply them
and derive the final tabular output columns.
Visitor classes¶
LpVisitorV2 (production)
Walks the parse tree and assembles a nested dict describing all soil
constituents within the loimis string. Output structure::
{
"type": "loimis",
"count": <int>, # number of soil layers
"soilparts": [ # one entry per "/" layer
{
"constituents": [
{
"type": "kores" | "peenes" | "turfs",
"code": <str>, # e.g. "r", "ls", "th"
"karbonaat": <bool>, # True = calcareous material
"amp": <int> | False, # amplitude 1-5
"depth": <dict> | False # see visit_depth_range
},
...
],
"count": <int>
},
...
]
}
LpVisitor (legacy, kept for reference)
Depth dict format (from visit_depth_range)::
{
"range": <bool>, # True = N-M range, False = single value
"from": <float>, # upper depth boundary in mm (cm × 10)
"to": <float>, # lower depth boundary in mm (cm × 10)
"deeper_than": <bool> # True when '+' prefix present
}
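The depth dict can be illustrated by building one from a depth token. This is a sketch, not the visitor's code: `depth_dict` is hypothetical, and setting `to` equal to `from` for a single value is an assumption, since the documentation only specifies the range case.

```python
import re

DEPTH_RANGE = re.compile(r"(?P<plus>\+)?(?P<start>\d+)(?:-(?P<end>\d+))?")


def depth_dict(text):
    """Build a depth dict (values in mm) from 'N', 'N-M', or '+N'."""
    match = DEPTH_RANGE.fullmatch(text)
    if match is None:
        return False  # mirrors the `"depth": <dict> | False` convention
    start = float(match.group("start")) * 10  # cm -> mm
    end_text = match.group("end")
    # Assumption: a single value uses the same number for both boundaries.
    end = float(end_text) * 10 if end_text else start
    return {
        "range": end_text is not None,
        "from": start,
        "to": end,
        "deeper_than": match.group("plus") is not None,
    }
```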
Module-level output functions¶
:func:loimis_grammar_product_dask_multiple — step 5 of the pipeline; applies
LpVisitorV2 to produce the loimis_grammar dict and parse_info string.
:func:test_layer_depths_dask — step 6; extracts a 6-element positional
pd.Series: [nlayers, ZMX, Z1, Z2, Z3, Z4].
:func:set_texture_values_dask — step 7; extracts a 33-element positional
pd.Series (indices 0-32):
EST_TXT1-4, EST_CRS1-4, LXTYPE1-4, CLAY1-4, SILT1-4,
SAND1-4, ROCK1-4, loimis_search, KARB1-4.
loimis_grammar_product_dask_multiple(input_expr, parsers_d)
¶
Pipeline step 5: apply the LpVisitorV2 to build the structured loimis dict.
Iterates over /-separated layer expressions in input_expr, finds the
matching grammar variant for each, and applies the visitor to produce a
structured loimis_grammar dict describing all soil constituents.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| input_expr | str | Reconstituted loimis string from step 4 (loimis_reconst). | required |
| parsers_d | Dict | Dict of grammar-variant parsers (see new_grammar()). | required |
Returns:
| Type | Description |
|---|---|
| Series | pd.Series: three-element series whose keys include loimis_grammar (dict describing all soil constituents) and parse_info (string); the remaining entry reports on successful parsing. |
test_layer_depths_dask(grammar_loimis)
¶
Pipeline step 6: extract layer count and depth boundaries from the grammar dict.
Traverses the loimis_grammar structure produced by step 5 and reads the
depth information encoded in constituent depth dicts to determine how many
distinct soil layers are present and where each layer boundary lies.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| grammar_loimis | Dict[str, Any] | The loimis_grammar dict produced by step 5. | required |
Returns:
pd.Series: six-element positional series (access by integer index):
| Idx | Field name | Description |
|---|---|---|
| 0 | nlayers | Number of distinct texture layers (int) |
| 1 | ZMX | Maximum soil depth / total profile depth (cm) |
| 2 | Z1 | Lower boundary of layer 1 (cm) |
| 3 | Z2 | Lower boundary of layer 2 (cm), 0 if absent |
| 4 | Z3 | Lower boundary of layer 3 (cm), 0 if absent |
| 5 | Z4 | Lower boundary of layer 4 (cm), 0 if absent |
set_texture_values_dask(grammar_loimis)
¶
Pipeline step 7: extract granulometric texture fractions for each soil layer.
Looks up each constituent code in texture_rules (from LoimisLookups) to
derive the clay, silt, sand, and rock-fragment percentages, and maps the
Estonian texture code to an international (WRB) texture class.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| grammar_loimis | Dict[str, Any] | The loimis_grammar dict produced by step 5. | required |
Returns:
pd.Series: 33-element positional series (access by integer index):
| Idx | Field name | Description |
|---|---|---|
| 0-3 | EST_TXT1-4 | Estonian texture code per layer (str) |
| 4-7 | EST_CRS1-4 | Coarse fraction type code per layer (str) |
| 8-11 | LXTYPE1-4 | International texture class per layer (str) |
| 12-15 | CLAY1-4 | Clay fraction per layer (%, float or NaN) |
| 16-19 | SILT1-4 | Silt fraction per layer (%, float or NaN) |
| 20-23 | SAND1-4 | Sand fraction per layer (%, float or NaN) |
| 24-27 | ROCK1-4 | Rock/skeleton fraction per layer (%, float or NaN) |
| 28 | loimis_search | Raw search-parameter dict (for diagnostics) |
| 29-32 | KARB1-4 | Carbonate flag per layer: 1 if any constituent is calcareous (karbonaat is True) |
Note: clay/silt/sand fractions are … for layers where ….
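Because the step 7 result is positional, it can help to attach the field names from the table above before further processing. A small sketch with hypothetical helper names (`texture_column_names`, `label_texture_values`), operating on any 33-item sequence:

```python
def texture_column_names():
    """Return the 33 field names in positional order (indices 0-32)."""
    names = []
    for stem in ("EST_TXT", "EST_CRS", "LXTYPE", "CLAY", "SILT", "SAND", "ROCK"):
        names.extend(f"{stem}{layer}" for layer in range(1, 5))
    names.append("loimis_search")
    names.extend(f"KARB{layer}" for layer in range(1, 5))
    return names


def label_texture_values(values):
    """Pair positional values with their field names."""
    names = texture_column_names()
    if len(values) != len(names):
        raise ValueError(f"expected {len(names)} values, got {len(values)}")
    return dict(zip(names, values))
```

The length check guards against silently mislabelling columns if the positional layout ever changes.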