Curate DataFrames and AnnDatas

Curating a dataset with LaminDB means three things:

  1. Validate: ensure the dataset meets predefined validation criteria

  2. Standardize: transform the dataset so that it meets validation criteria, e.g., by fixing typos or replacing ad hoc identifiers with standard ones

  3. Annotate: link the dataset against validated metadata so that it becomes queryable
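
To make the third step more concrete, annotation can be pictured as building an index from validated labels to the datasets that contain them. The sketch below uses plain Python only; `validated_labels`, `datasets`, and `label_index` are hypothetical stand-ins and do not reflect how LaminDB stores these links internally.

```python
from collections import defaultdict

# Hypothetical in-memory stand-ins for a label registry and two stored datasets
validated_labels = {"astrocyte", "oligodendrocyte"}
datasets = {
    "dataset_a": ["astrocyte", "oligodendrocyte"],
    "dataset_b": ["astrocyte"],
}

# Annotate: link each dataset to the validated labels it contains
label_index = defaultdict(set)
for dataset_key, labels in datasets.items():
    for label in labels:
        if label in validated_labels:
            label_index[label].add(dataset_key)

# Query: which datasets contain oligodendrocytes?
matches = sorted(label_index["oligodendrocyte"])
print(matches)  # ['dataset_a']
```

Because only validated labels enter the index, a query by label never has to account for typos or synonyms.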

If a dataset passes validation, curating it takes two lines of code:

curator = ln.Curator.from_df(df, ...)  # create a Curator and pass criteria in "..."
curator.save_artifact()                # validates the content of the dataset and saves it as an annotated artifact

Beyond having valid content, the curated dataset is now queryable via metadata identifiers found in the dataset because they have been validated & linked against LaminDB registries.

Beyond validating metadata identifiers, LaminDB also validates data types and dataset schema.
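
As a rough illustration of what a schema check involves, the following sketch compares a DataFrame against expected column names and dtypes using plain pandas. The `expected` mapping and the `check_schema` helper are hypothetical and are not part of the LaminDB API.

```python
import pandas as pd
from pandas.api.types import is_float_dtype, is_string_dtype

# Hypothetical expected schema: column name -> function that tests the dtype
expected = {"temperature": is_float_dtype, "donor": is_string_dtype}

frame = pd.DataFrame(
    {"temperature": [37.2, 36.3, 38.2], "donor": ["D0001", "D0002", "D0003"]}
)


def check_schema(frame, expected):
    """Return a list of human-readable schema problems (empty if there are none)."""
    problems = []
    for column, dtype_check in expected.items():
        if column not in frame.columns:
            problems.append(f"missing column: {column}")
        elif not dtype_check(frame[column]):
            problems.append(f"unexpected dtype for {column}: {frame[column].dtype}")
    return problems


print(check_schema(frame, expected))
print(check_schema(frame.drop(columns="donor"), expected))
```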

How does validation in LaminDB compare to validation in pandera?

Like LaminDB, pandera validates the dataset schema (i.e., column names and dtypes).

Unlike LaminDB, pandera only supports DataFrame-like datasets and cannot annotate them, so it cannot make datasets queryable.

However, it offers an API for range-checks, both for numerical and string-like data. If you need such checks, you can combine LaminDB and pandera-based validation.

import pandas as pd
import pandera as pa

# data to validate
df = pd.DataFrame({
    "column1": [1, 4, 0, 10, 9],
    "column2": [-1.3, -1.4, -2.9, -10.1, -20.4],
    "column3": ["value_1", "value_2", "value_3", "value_2", "value_1"],
})

# define schema
schema = pa.DataFrameSchema({
    "column1": pa.Column(int, checks=pa.Check.le(10)),
    "column2": pa.Column(float, checks=pa.Check.lt(-1.2)),
    "column3": pa.Column(str, checks=[
        pa.Check.str_startswith("value_"),
        # define custom checks as functions that take a Series as input and
        # output a boolean or a boolean Series
        pa.Check(lambda s: s.str.split("_", expand=True).shape[1] == 2)
    ]),
})

validated_df = schema(df)  # this corresponds to curator.validate() in LaminDB
print(validated_df)

# !pip install 'lamindb[bionty]'
!lamin init --storage ./test-curate --modules bionty
 initialized lamindb: testuser1/test-curate

Curate a DataFrame

Let’s start with a DataFrame that we’d like to validate.

import lamindb as ln
import bionty as bt
import pandas as pd


df = pd.DataFrame(
    {
        "temperature": [37.2, 36.3, 38.2],
        "cell_type": [
            "cerebral pyramidal neuron",
            "astrocytic glia",
            "oligodendrocyte",
        ],
        "assay_ontology_id": ["EFO:0008913", "EFO:0008913", "EFO:0008913"],
        "donor": ["D0001", "D0002", "D0003"],
    },
    index=["obs1", "obs2", "obs3"],
)
df
 connected lamindb: testuser1/test-curate
      temperature                  cell_type assay_ontology_id  donor
obs1         37.2  cerebral pyramidal neuron       EFO:0008913  D0001
obs2         36.3            astrocytic glia       EFO:0008913  D0002
obs3         38.2            oligodendrocyte       EFO:0008913  D0003

Define validation criteria and create a Curator object.

# in the dictionary, each key is a column name of the dataframe, and each value
# is a registry field onto which values are mapped
categoricals = {
    "cell_type": bt.CellType.name,
    "assay_ontology_id": bt.ExperimentalFactor.ontology_id,
    "donor": ln.ULabel.name,
}

# pass validation criteria
curate = ln.Curator.from_df(df, categoricals=categoricals)
 added 3 records with Feature.name for "columns": 'cell_type', 'assay_ontology_id', 'donor'

The validate() method checks our data against the defined criteria. It identifies which values are already validated (exist in our registries) and which are potentially problematic (do not yet exist in our registries).

curate.validate()
 saving validated records of 'cell_type'
 added 2 records from public with CellType.name for "cell_type": 'oligodendrocyte', 'astrocyte'
 saving validated records of 'assay_ontology_id'
 added 1 record from public with ExperimentalFactor.ontology_id for "assay_ontology_id": 'EFO:0008913'
 mapping "cell_type" on CellType.name
!   2 terms are not validated: 'cerebral pyramidal neuron', 'astrocytic glia'
    1 synonym found: "astrocytic glia" → "astrocyte"
    → curate synonyms via .standardize("cell_type")    for remaining terms:
    → fix typos, remove non-existent values, or save terms via .add_new_from("cell_type")
 "assay_ontology_id" is validated against ExperimentalFactor.ontology_id
 mapping "donor" on ULabel.name
!   3 terms are not validated: 'D0001', 'D0002', 'D0003'
    → fix typos, remove non-existent values, or save terms via .add_new_from("donor")
False
# check the non-validated terms
curate.non_validated
{'cell_type': ['cerebral pyramidal neuron', 'astrocytic glia'],
 'donor': ['D0001', 'D0002', 'D0003']}
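
The split between validated and non-validated values can be pictured as a set-membership test per column. The sketch below mimics the shape of the `non_validated` dictionary with plain pandas; the `registries` dictionary and the `find_non_validated` helper are hypothetical, and LaminDB's own lookup against database registries and public ontologies is more involved.

```python
import pandas as pd

# Hypothetical registry contents; in LaminDB these live in the database
registries = {
    "cell_type": {"astrocyte", "oligodendrocyte"},
    "donor": set(),
}

df_example = pd.DataFrame(
    {
        "cell_type": ["cerebral pyramidal neuron", "astrocytic glia", "oligodendrocyte"],
        "donor": ["D0001", "D0002", "D0003"],
    }
)


def find_non_validated(frame, registries):
    """Return, per column, the unique values missing from that column's registry."""
    result = {}
    for column, known in registries.items():
        missing = [value for value in frame[column].unique() if value not in known]
        if missing:
            result[column] = missing
    return result


non_validated = find_non_validated(df_example, registries)
print(non_validated)
```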

For cell_type, we saw that “cerebral pyramidal neuron” and “astrocytic glia” are not validated.

First, let’s standardize the synonym “astrocytic glia” as suggested:

curate.standardize("cell_type")
 standardized 1 synonym in "cell_type": "astrocytic glia" → "astrocyte"
# now we have only one non-validated term left
curate.non_validated
{'cell_type': ['cerebral pyramidal neuron'],
 'donor': ['D0001', 'D0002', 'D0003']}
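
Conceptually, standardizing a column amounts to replacing known synonyms with their canonical names and leaving everything else untouched. Here is a minimal pandas sketch of that idea, assuming a hypothetical `synonyms` table; in LaminDB the synonyms come from the registry rather than a hand-written dictionary.

```python
import pandas as pd

# Hypothetical synonym table mapping ad hoc names to canonical names
synonyms = {"astrocytic glia": "astrocyte"}

cell_type_column = pd.Series(
    ["cerebral pyramidal neuron", "astrocytic glia", "oligodendrocyte"],
    index=["obs1", "obs2", "obs3"],
)

# Replace known synonyms; values without a synonym entry are left unchanged
standardized = cell_type_column.replace(synonyms)
print(standardized.tolist())
# ['cerebral pyramidal neuron', 'astrocyte', 'oligodendrocyte']
```

Note that “cerebral pyramidal neuron” is unaffected because it is not a known synonym, which is why it still needs to be fixed by hand.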

For “cerebral pyramidal neuron”, let’s understand which cell type in the public ontology might be the actual match.

# to check the correct spelling of categories, pass `public=True` to get a lookup object from public ontologies
# use `lookup = curate.lookup()` to get a lookup object of existing records in your instance
lookup = curate.lookup(public=True)
lookup
Lookup objects from the public:
 .cell_type
 .assay_ontology_id
 .donor
 .columns
 
Example:
    → categories = curator.lookup()["cell_type"]
    → categories.alveolar_type_1_fibroblast_cell

To look up public ontologies, use .lookup(public=True)
# here is an example for the "cell_type" column
cell_types = lookup["cell_type"]
cell_types.cerebral_cortex_pyramidal_neuron
CellType(ontology_id='CL:4023111', name='cerebral cortex pyramidal neuron', definition='A Pyramidal Neuron With Soma Located In The Cerebral Cortex.', synonyms=None, parents=array(['CL:0010012', 'CL:0000598'], dtype=object))
# fix the cell type
df.cell_type = df.cell_type.replace(
    {"cerebral pyramidal neuron": cell_types.cerebral_cortex_pyramidal_neuron.name}
)
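
A lookup object of this kind is convenient because term names become attributes that an editor can auto-complete. As a rough sketch of the idea, the hypothetical `make_lookup` helper below turns names into attribute keys by replacing non-identifier characters with underscores. It returns plain strings, whereas the lookup above returns full records, and it is not how LaminDB implements lookups.

```python
import re
from types import SimpleNamespace


def make_lookup(names):
    """Expose each name as an attribute, replacing non-identifier characters with "_"."""
    attributes = {}
    for name in names:
        key = re.sub(r"\W+", "_", name.strip().lower()).strip("_")
        attributes[key] = name
    return SimpleNamespace(**attributes)


# Hypothetical subset of ontology term names
term_lookup = make_lookup(["cerebral cortex pyramidal neuron", "astrocyte"])
print(term_lookup.cerebral_cortex_pyramidal_neuron)
# cerebral cortex pyramidal neuron
```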

For donor, we want to add the new donors “D0001”, “D0002”, and “D0003”:

# this adds donors that were _not_ validated
curate.add_new_from("donor")
 added 3 records with ULabel.name for "donor": 'D0001', 'D0002', 'D0003'
# validate again
curate.validate()
 saving validated records of 'cell_type'
 added 1 record from public with CellType.name for "cell_type": 'cerebral cortex pyramidal neuron'
 "cell_type" is validated against CellType.name
 "assay_ontology_id" is validated against ExperimentalFactor.ontology_id
 "donor" is validated against ULabel.name
True

Save a curated artifact.

artifact = curate.save_artifact(description="My curated dataframe")
! no run & transform got linked, call `ln.track()` & re-run
! run input wasn't tracked, call `ln.track()` and re-run
! 1 unique term (25.00%) is not validated for name: 'temperature'
! did not create Feature record for 1 non-validated name: 'temperature'
artifact.describe(print_types=True)
Artifact .parquet/DataFrame
├── General
│   ├── .uid = 'UYDyM4TXw654yUmM0000'
│   ├── .size = 3786
│   ├── .hash = 'LZCfO2VdCCz0bzQ2cGpDEw'
│   ├── .path = /home/runner/work/lamindb/lamindb/docs/test-curate/.lamindb/UYDyM4TXw654yUmM0000.parquet
│   ├── .created_by = testuser1 (Test User1)
│   └── .created_at = 2025-01-12 14:03:25
├── Dataset features/.feature_sets
│   └── columns • 3                 [Feature]
│       assay_ontology_id           cat[bionty.ExperimentalF…  single-cell RNA sequencing
│       cell_type                   cat[bionty.CellType]       astrocyte, cerebral cortex pyramidal neu…
│       donor                       cat[ULabel]                D0001, D0002, D0003
└── Labels
    └── .cell_types                 bionty.CellType            oligodendrocyte, astrocyte, cerebral cor…
        .experimental_factors       bionty.ExperimentalFactor  single-cell RNA sequencing               
        .ulabels                    ULabel                     D0001, D0002, D0003                      

Curate an AnnData

Here we additionally specify, via the var_index argument, the registry field against which to validate var.index.

import anndata as ad

X = pd.DataFrame(
    {
        "ENSG00000081059": [1, 2, 3],
        "ENSG00000276977": [4, 5, 6],
        "ENSG00000198851": [7, 8, 9],
        "ENSG00000010610": [10, 11, 12],
        "ENSG00000153563": [13, 14, 15],
        "ENSGcorrupted": [16, 17, 18],
    },
    index=df.index,  # because we already curated the dataframe above, it will validate
)
adata = ad.AnnData(X=X, obs=df)
adata
AnnData object with n_obs × n_vars = 3 × 6
    obs: 'temperature', 'cell_type', 'assay_ontology_id', 'donor'
curate = ln.Curator.from_anndata(
    adata,
    var_index=bt.Gene.ensembl_gene_id,  # validate var.index against Gene.ensembl_gene_id
    categoricals=categoricals,
    organism="human",
)
curate.validate()
 saving validated records of 'var_index'
 added 5 records from public with Gene.ensembl_gene_id for "var_index": 'ENSG00000081059', 'ENSG00000276977', 'ENSG00000198851', 'ENSG00000010610', 'ENSG00000153563'
 mapping "var_index" on Gene.ensembl_gene_id
!   1 term is not validated: 'ENSGcorrupted'
    → fix typos, remove non-existent values, or save terms via .add_new_from_var_index()
 "cell_type" is validated against CellType.name
 "assay_ontology_id" is validated against ExperimentalFactor.ontology_id
 "donor" is validated against ULabel.name
False

Non-validated terms can be accessed via:

curate.non_validated
{'var_index': ['ENSGcorrupted']}

Subset the AnnData to validated genes only:

adata_validated = adata[
    :, ~adata.var.index.isin(curate.non_validated["var_index"])
].copy()
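
The subsetting above relies on a boolean mask over the variable index. The same pattern can be tried on a plain pandas DataFrame, where the columns play the role of `adata.var.index`; the names `matrix` and `non_validated_ids` are made up for this sketch.

```python
import pandas as pd

# Small stand-in for the expression matrix: columns play the role of adata.var.index
matrix = pd.DataFrame(
    {
        "ENSG00000081059": [1, 2, 3],
        "ENSGcorrupted": [16, 17, 18],
        "ENSG00000198851": [7, 8, 9],
    },
    index=["obs1", "obs2", "obs3"],
)
non_validated_ids = ["ENSGcorrupted"]

# True for columns to keep, False for columns listed as non-validated
keep = ~matrix.columns.isin(non_validated_ids)
matrix_validated = matrix.loc[:, keep].copy()
print(list(matrix_validated.columns))
# ['ENSG00000081059', 'ENSG00000198851']
```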

Now let’s validate the subsetted object:

curate = ln.Curator.from_anndata(
    adata_validated,
    var_index=bt.Gene.ensembl_gene_id,  # validate var.index against Gene.ensembl_gene_id
    categoricals=categoricals,
    organism="human",
)
curate.validate()
 "var_index" is validated against Gene.ensembl_gene_id
 "cell_type" is validated against CellType.name
 "assay_ontology_id" is validated against ExperimentalFactor.ontology_id
 "donor" is validated against ULabel.name
True

The validated object can then be saved as an Artifact:

artifact = curate.save_artifact(description="test AnnData")
! no run & transform got linked, call `ln.track()` & re-run
! run input wasn't tracked, call `ln.track()` and re-run
!    1 unique term (25.00%) is not validated for name: 'temperature'
!    did not create Feature record for 1 non-validated name: 'temperature'

The saved artifact has been annotated with validated features and labels:

artifact.describe()
Artifact .h5ad/AnnData
├── General
│   ├── .uid = 'ys5M8yYpzDX9NYi90000'
│   ├── .size = 20336
│   ├── .hash = '8z6kAdTVBaDIDuA6aivzNg'
│   ├── .n_observations = 3
│   ├── .path = /home/runner/work/lamindb/lamindb/docs/test-curate/.lamindb/ys5M8yYpzDX9NYi90000.h5ad
│   ├── .created_by = testuser1 (Test User1)
│   └── .created_at = 2025-01-12 14:03:31
├── Dataset features/.feature_sets
│   ├── obs • 3                     [Feature]
│   │   assay_ontology_id           cat[bionty.ExperimentalF…  single-cell RNA sequencing
│   │   cell_type                   cat[bionty.CellType]       astrocyte, cerebral cortex pyramidal neu…
│   │   donor                       cat[ULabel]                D0001, D0002, D0003
│   └── var • 5                     [bionty.Gene]
│       TCF7                        int
│       PDCD1                       int
│       CD3E                        int
│       CD4                         int
│       CD8A                        int
└── Labels
    └── .cell_types                 bionty.CellType            oligodendrocyte, astrocyte, cerebral cor…
        .experimental_factors       bionty.ExperimentalFactor  single-cell RNA sequencing               
        .ulabels                    ULabel                     D0001, D0002, D0003                      

We’ve walked through validating, standardizing, and annotating datasets in these key steps:

  1. Defining validation criteria

  2. Validating data against existing registries

  3. Adding new validated entries to registries

  4. Annotating artifacts with validated metadata

By following these steps, you can ensure your data is standardized and well-curated.

If you have datasets that aren’t DataFrame-like or AnnData-like, read: Curate datasets of any format.