Curate datasets of any format

Our previous guide explained how to validate, standardize, and annotate DataFrame and AnnData objects. In this guide, we’ll walk through the basic API that lets you work with data of any format.

How do I validate based on a public ontology?

LaminDB makes it easy to validate categorical variables based on registries that inherit from CanCurate.

CanCurate methods validate against the registries in your LaminDB instance. In Manage biological registries, you’ll see how to extend standard validation to validation against public references using a ReferenceTable ontology object: public = Record.public(). By default, from_values() considers a match in a public reference a validated value for any bionty entity.

# !pip install 'lamindb[bionty,zarr]'
!lamin init --storage ./test-curate-any --modules bionty
 initialized lamindb: testuser1/test-curate-any
import lamindb as ln
import bionty as bt
import zarr
import numpy as np

data = zarr.create(
    (10,),
    dtype=[("value", "f8"), ("gene", "U15"), ("disease", "U16")],
    store="data.zarr",
)
data["gene"] = [
    "ENSG00000139618",
    "ENSG00000141510",
    "ENSG00000133703",
    "ENSG00000157764",
    "ENSG00000171862",
    "ENSG00000091831",
    "ENSG00000141736",
    "ENSG00000133056",
    "ENSG00000146648",
    "ENSG00000118523",
]
data["disease"] = np.random.default_rng().choice(["MONDO:0004975", "MONDO:0004980"], 10)
 connected lamindb: testuser1/test-curate-any

Define validation criteria

Entities that don’t have a dedicated registry (“are not typed”) can be validated & registered using ULabel:

criteria = {
    "disease": bt.Disease.ontology_id,
    "project": ln.ULabel.name,
    "gene": bt.Gene.ensembl_gene_id,
}

Validate and standardize metadata

validate() validates passed values against reference values in a registry. It returns a boolean vector indicating whether a value has an exact match in the reference values.

bt.Disease.validate(data["disease"], field=bt.Disease.ontology_id)
! Your Disease registry is empty, consider populating it first!
   → use `.import_source()` to import records from a source, e.g. a public ontology
array([False, False, False, False, False, False, False, False, False,
       False])
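The all-False result above follows directly from the exact-match rule: an empty registry contains nothing to match against. The following is a minimal pure-Python sketch of that rule, not LaminDB's implementation; `validate_exact` is a hypothetical helper and the plain lists stand in for a registry's reference values.

```python
def validate_exact(values, reference):
    """Return one boolean per value: True only for an exact match in `reference`."""
    reference = set(reference)
    return [value in reference for value in values]


values = ["MONDO:0004975", "MONDO:0004980", "MONDO:0004975"]

# An empty reference (like an empty registry) cannot validate anything.
before = validate_exact(values, reference=[])

# Once the reference holds the terms, the same values validate.
after = validate_exact(values, reference=["MONDO:0004975", "MONDO:0004980"])

# Matching is exact in this sketch, so a differently cased value does not validate.
lowercase = validate_exact(["mondo:0004975"], reference=["MONDO:0004975"])

print(before, after, lowercase)
```

Because the check is a plain membership test, duplicates in the input simply yield repeated booleans, one per passed value.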

When validation fails, you can call inspect() to figure out what to do.

inspect() applies the same definition of validation as validate(), but returns a richer InspectResult object. Most importantly, it logs recommended curation steps that would render the data validated.

Note: you can use standardize() to standardize synonyms.

bt.Disease.inspect(data["disease"], field=bt.Disease.ontology_id);
! received 2 unique terms, 8 empty/duplicated terms are ignored
! 2 unique terms (100.00%) are not validated for ontology_id: 'MONDO:0004975', 'MONDO:0004980'
   detected 2 Disease terms in Bionty for ontology_id: 'MONDO:0004980', 'MONDO:0004975'
→  add records from Bionty to your Disease registry via .from_values()
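To make the log above easier to read, here is a rough sketch of the bookkeeping it reports: deduplicate the input, split unique terms into validated and non-validated, and check whether the non-validated ones exist in a public reference. `InspectSummary` and `inspect_values` are hypothetical names for illustration only; they do not reproduce LaminDB's InspectResult.

```python
from dataclasses import dataclass


@dataclass
class InspectSummary:
    """Hypothetical stand-in; not lamindb's InspectResult."""

    validated: list
    non_validated: list
    n_ignored: int
    suggestions: list


def inspect_values(values, registry_values, public_values):
    # Drop empty values and duplicates, keeping first-seen order.
    unique = list(dict.fromkeys(v for v in values if v))
    n_ignored = len(values) - len(unique)
    registry_values = set(registry_values)
    public_values = set(public_values)
    validated = [v for v in unique if v in registry_values]
    non_validated = [v for v in unique if v not in registry_values]
    # Non-validated terms that a public reference knows about can be imported.
    in_public = [v for v in non_validated if v in public_values]
    suggestions = []
    if in_public:
        suggestions.append(f"add terms found in the public reference: {in_public}")
    return InspectSummary(validated, non_validated, n_ignored, suggestions)


values = ["MONDO:0004975", "MONDO:0004980"] * 5  # 10 values, 2 unique
public = ["MONDO:0004975", "MONDO:0004980"]
summary = inspect_values(values, registry_values=[], public_values=public)
print(summary.n_ignored, summary.non_validated, summary.suggestions)
```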

Let’s follow the suggestion and register the new labels.

Bulk-creating records with from_values() only returns validated records:

Note: Terms that validate against a public reference are also created by .from_values(); see Manage biological registries for details.

diseases = bt.Disease.from_values(data["disease"], field=bt.Disease.ontology_id)
ln.save(diseases)

Repeat the process for more labels:

projects = ln.ULabel.from_values(
    ["Project A", "Project B"],
    field=ln.ULabel.name,
    create=True,  # create non-existing labels rather than attempting to load them from the database
)
ln.save(projects)
genes = bt.Gene.from_values(data["gene"], field=bt.Gene.ensembl_gene_id)
ln.save(genes)
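The three calls above rely on different paths to a record: a value may already be registered, may be found in a public reference, or may be created on request with create=True. The sketch below illustrates that decision order under simplifying assumptions; `from_values_sketch` is a hypothetical function, and the exact precedence inside LaminDB may differ.

```python
def from_values_sketch(values, registered, public=(), create=False):
    """Return (value, origin) pairs; values with no origin are dropped."""
    records = []
    for value in dict.fromkeys(values):  # deduplicate, keep order
        if value in registered:
            records.append((value, "loaded"))
        elif value in public:
            records.append((value, "from public reference"))
        elif create:
            records.append((value, "created"))
        # Otherwise the value is not validated and yields no record.
    return records


# Free-text labels have no public reference, so they need create=True.
projects = from_values_sketch(["Project A", "Project B"], registered=set(), create=True)
dropped = from_values_sketch(["Project A", "Project B"], registered=set())

# Ontology terms are found in the (stand-in) public reference.
diseases = from_values_sketch(
    ["MONDO:0004975", "MONDO:0004980", "MONDO:0004975"],
    registered=set(),
    public={"MONDO:0004975", "MONDO:0004980"},
)
print(projects, dropped, diseases)
```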

Annotate and save dataset with validated metadata

Register the dataset as an artifact:

artifact = ln.Artifact("data.zarr", description="a zarr object").save()
! no run & transform got linked, call `ln.track()` & re-run

Link the artifact to validated labels. You could directly do this, e.g., via artifact.ulabels.add(projects) or artifact.diseases.add(diseases).

However, you often want to track the features that measured the labels. Hence, let’s try to associate our labels with features:

from lamindb.core.exceptions import ValidationError

try:
    artifact.features.add_values({"project": projects, "disease": diseases})
except ValidationError as e:
    print(e)
! cannot infer feature type of: [ULabel(uid='f2wY9NA4', name='Project A', is_concept=False, created_by_id=1, space_id=1, created_at=2025-01-12 14:04:15 UTC), ULabel(uid='1sZTQ1Z1', name='Project B', is_concept=False, created_by_id=1, space_id=1, created_at=2025-01-12 14:04:15 UTC)], returning '?
! cannot infer feature type of: [Disease(uid='4F2HPJ3w', name='Alzheimer disease', ontology_id='MONDO:0004975', synonyms='Alzheimer dementia|Alzheimer disease|Alzheimers disease|Alzheimers dementia|Alzheimer's disease|presenile and senile dementia|Alzheimer's dementia|AD', description='A Progressive, Neurodegenerative Disease Characterized By Loss Of Function And Death Of Nerve Cells In Several Areas Of The Brain Leading To Loss Of Cognitive Function Such As Memory And Language.', created_by_id=1, space_id=1, source_id=50, created_at=2025-01-12 14:04:15 UTC), Disease(uid='4JmTj6Sn', name='atopic eczema', ontology_id='MONDO:0004980', synonyms='allergic form of dermatitis|Atopic dermatitis|Besnier's prurigo|Atopic neurodermatitis|atopic eczema|eczema|eczematous dermatitis|allergic dermatitis', description='A Chronic Inflammatory Genetically Determined Disease Of The Skin Marked By Increased Ability To Form Reagin (Ige), With Increased Susceptibility To Allergic Rhinitis And Asthma, And Hereditary Disposition To A Lowered Threshold For Pruritus. It Is Manifested By Lichenification, Excoriation, And Crusting, Mainly On The Flexural Surfaces Of The Elbow And Knee. In Infants It Is Known As Infantile Eczema.', created_by_id=1, space_id=1, source_id=50, created_at=2025-01-12 14:04:15 UTC)], returning '?
These keys could not be validated: ['project', 'disease']
Here is how to create a feature:

  ln.Feature(name='project', dtype='?').save()
  ln.Feature(name='disease', dtype='?').save()

This errored because we hadn’t yet registered the features. After copying the suggested calls from the error message and replacing the placeholder dtype '?' with a categorical dtype, things work out:

ln.Feature(name="project", dtype="cat[ULabel]").save()
ln.Feature(name="disease", dtype="cat[bionty.Disease]").save()
artifact.features.add_values({"project": projects, "disease": diseases})
artifact.features
Artifact .zarr
└── Linked features
    └── disease                     cat[bionty.Disease]        Alzheimer disease, atopic eczema         
        project                     cat[ULabel]                Project A, Project B                     
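The failure and the fix above come down to a lookup: each key passed to add_values() must correspond to a registered feature. A minimal sketch of that check follows, using a plain dict as a stand-in for the Feature registry; the `ValidationError` class and `add_values` function here are hypothetical local definitions, not the LaminDB ones.

```python
class ValidationError(Exception):
    """Hypothetical stand-in for the exception raised in the guide."""


def add_values(values, registered_features):
    """Link values to features, refusing keys that are not registered."""
    missing = [key for key in values if key not in registered_features]
    if missing:
        raise ValidationError(f"These keys could not be validated: {missing}")
    # Pair each value list with the dtype of its registered feature.
    return {key: (registered_features[key], values[key]) for key in values}


registered_features = {}  # feature name -> dtype string
values = {
    "project": ["Project A", "Project B"],
    "disease": ["Alzheimer disease", "atopic eczema"],
}

try:
    add_values(values, registered_features)
except ValidationError as error:
    message = str(error)

# Registering the features with categorical dtypes resolves the error.
registered_features["project"] = "cat[ULabel]"
registered_features["disease"] = "cat[bionty.Disease]"
linked = add_values(values, registered_features)
print(message)
print(linked)
```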

Since genes are the measurements, we register them as features:

feature_set = ln.FeatureSet(genes).save()
artifact.features.add_feature_set(feature_set, slot="genes")
artifact.describe()
Artifact .zarr
├── General
│   ├── .uid = '9YiM7lLyNJSRgKVq0000'
│   ├── .size = 972
│   ├── .hash = 'caT6VbBsVDRScv-n-iG_SQ'
│   ├── .n_files = 2
│   ├── .path = /home/runner/work/lamindb/lamindb/docs/test-curate-any/.lamindb/9YiM7lLyNJSRgKVq.zarr
│   ├── .created_by = testuser1 (Test User1)
│   └── .created_at = 2025-01-12 14:04:18
├── Dataset features/.feature_sets
│   └── genes10                  [bionty.Gene]                                                       
│       BRCA2                       num
│       TP53                        num
│       KRAS                        num
│       BRAF                        num
│       PTEN                        num
│       ESR1                        num
│       ERBB2                       num
│       PIK3C2B                     num
│       EGFR                        num
│       CCN2                        num
├── Linked features
│   └── disease                     cat[bionty.Disease]        Alzheimer disease, atopic eczema
│       project                     cat[ULabel]                Project A, Project B
└── Labels
    └── .diseases                   bionty.Disease             Alzheimer disease, atopic eczema         
        .ulabels                    ULabel                     Project A, Project B                     
# clean up test instance
!lamin delete --force test-curate-any
!rm -r data.zarr
╭─ Error ──────────────────────────────────────────────────────────────────────╮
 Storage '/home/runner/work/lamindb/lamindb/docs/test-curate-any/.lamindb'    
 contains 2 objects - delete them prior to deleting the instance              
╰──────────────────────────────────────────────────────────────────────────────╯