lamindb.Artifact
- class lamindb.Artifact(data: UPathStr, type: ArtifactKind | None = None, key: str | None = None, description: str | None = None, revises: Artifact | None = None, run: Run | None = None)
Bases: Record, IsVersioned, TracksRun, TracksUpdates
Datasets & models stored as files, folders, or arrays.
Artifacts manage data in local or remote storage.
Some artifacts are array-like, e.g., when stored as .parquet, .h5ad, .zarr, or .tiledb.
- Parameters:
data (UPathStr) – A path to a local or remote folder or file.
type (Literal["dataset", "model"] | None, default: None) – The artifact type.
key (str | None, default: None) – A path-like key to reference the artifact in default storage, e.g., "myfolder/myfile.fcs". Artifacts with the same key form a revision family.
description (str | None, default: None) – A description.
revises (Artifact | None, default: None) – Previous version of the artifact. Triggers a revision.
run (Run | None, default: None) – The run that creates the artifact.
Typical storage formats & their API accessors
Arrays:
- Table: .csv, .tsv, .parquet, .ipc ⟷ DataFrame, pyarrow.Table
- Annotated matrix: .h5ad, .h5mu, .zrad ⟷ AnnData, MuData
- Generic array: HDF5 group, zarr group, TileDB store ⟷ HDF5, zarr, TileDB loaders
Non-arrays:
- Image: .jpg, .png ⟷ np.ndarray, …
- Fastq: .fastq ⟷ /
- VCF: .vcf ⟷ /
- QC: .html ⟷ /
You’ll find these values in the suffix & accessor fields.
LaminDB makes some default choices (e.g., serialize a DataFrame as a .parquet file).
See also
Storage
Storage locations for artifacts.
Collection
Collections of artifacts.
from_df()
Create an artifact from a DataFrame.
from_anndata()
Create an artifact from an AnnData.
Examples
Create an artifact from a file path and pass description:
>>> artifact = ln.Artifact("s3://my_bucket/my_folder/my_file.csv", description="My file")
>>> artifact = ln.Artifact("./my_local_file.jpg", description="My image")
You can also pass key to create a virtual filepath hierarchy:
>>> artifact = ln.Artifact("./my_local_file.jpg", key="example_datasets/dataset1.jpg")
What works for files also works for folders:
>>> artifact = ln.Artifact("s3://my_bucket/my_folder", description="My folder")
>>> artifact = ln.Artifact("./my_local_folder", description="My local folder")
>>> artifact = ln.Artifact("./my_local_folder", key="project1/my_target_folder")
Why does the API look this way?
It’s inspired by APIs building on AWS S3.
Both boto3 and quilt select a bucket (akin to default storage in LaminDB) and define a target path through a key argument.
In boto3:
# signature: S3.Bucket.upload_file(filepath, key)
import boto3
s3 = boto3.resource('s3')
bucket = s3.Bucket('mybucket')
bucket.upload_file('/tmp/hello.txt', 'hello.txt')
In quilt3:
# signature: quilt3.Bucket.put_file(key, filepath)
import quilt3
bucket = quilt3.Bucket('mybucket')
bucket.put_file('hello.txt', '/tmp/hello.txt')
Make a new version of an artifact by saving new content under the same key:
>>> artifact = ln.Artifact.from_df(df, key="example_datasets/dataset1.parquet").save()
>>> artifact_v2 = ln.Artifact.from_df(df_updated, key="example_datasets/dataset1.parquet").save()
Alternatively, if you don’t want to provide a value for key, you can use revises:
>>> artifact = ln.Artifact.from_df(df, description="My dataframe").save()
>>> artifact_v2 = ln.Artifact.from_df(df_updated, revises=artifact).save()
Attributes
- features: FeatureManager
Feature manager.
Features denote dataset dimensions, i.e., the variables that measure labels & numbers.
Annotate with features & values:
artifact.features.add_values({
    "species": organism,  # here, organism is an Organism record
    "scientist": ['Barbara McClintock', 'Edgar Anderson'],
    "temperature": 27.6,
    "study": "Candidate marker study"
})
Query for features & values:
ln.Artifact.features.filter(scientist="Barbara McClintock")
Features may or may not be part of the artifact content in storage. For instance, the Curator flow validates the columns of a DataFrame-like artifact and annotates it with features corresponding to these columns. artifact.features.add_values, by contrast, does not validate the content of the artifact.
- property labels: LabelManager
Label manager.
To annotate with labels, you typically use the registry-specific accessors, for instance ulabels:
candidate_marker_study = ln.ULabel(name="Candidate marker study").save()
artifact.ulabels.add(candidate_marker_study)
Similarly, you query based on these accessors:
ln.Artifact.filter(ulabels__name="Candidate marker study").all()
Unlike the registry-specific accessors, the .labels accessor provides a way of associating labels with features:
study = ln.Feature(name="study", dtype="cat").save()
artifact.labels.add(candidate_marker_study, feature=study)
Note that the above is equivalent to:
artifact.features.add_values({"study": candidate_marker_study})
- property n_objects: int
- params: ParamManager
Param manager.
Example:
artifact.params.add_values({
    "hidden_size": 32,
    "bottleneck_size": 16,
    "batch_size": 32,
    "preprocess_params": {
        "normalization_type": "cool",
        "subset_highlyvariable": True,
    },
})
- property path: Path | UPath
Path.
File in cloud storage, here AWS S3:
>>> artifact = ln.Artifact("s3://my-bucket/my-file.csv").save()
>>> artifact.path
S3Path('s3://my-bucket/my-file.csv')
File in local storage:
>>> ln.Artifact("./myfile.csv", description="myfile").save()
>>> artifact = ln.Artifact.get(description="myfile")
>>> artifact.path
PosixPath('/home/runner/work/lamindb/lamindb/docs/guide/mydata/myfile.csv')
- property stem_uid: str
Universal id characterizing the version family.
The full uid of a record is obtained by concatenating the stem uid and version information:
stem_uid = random_base62(n_char)  # a random base62 sequence of length 12 (transform) or 16 (artifact, collection)
version_uid = "0000"  # an auto-incrementing 4-digit base62 number
uid = f"{stem_uid}{version_uid}"  # concatenate the stem_uid & version_uid
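The pseudocode above can be made runnable as a minimal sketch. It is not LaminDB's implementation: the alphabet ordering (digits, uppercase, lowercase) and the helper increment_base62 are assumptions introduced here for illustration.

```python
import secrets
import string

# Assumed alphabet ordering; the library's actual ordering may differ.
BASE62 = string.digits + string.ascii_uppercase + string.ascii_lowercase


def random_base62(n_char: int) -> str:
    """Return a random base62 string of length n_char."""
    return "".join(secrets.choice(BASE62) for _ in range(n_char))


def increment_base62(value: str) -> str:
    """Add 1 to a fixed-width base62 number (hypothetical helper)."""
    number = 0
    for char in value:
        number = number * 62 + BASE62.index(char)
    number += 1
    digits = []
    for _ in range(len(value)):
        number, remainder = divmod(number, 62)
        digits.append(BASE62[remainder])
    if number:
        raise OverflowError("version counter exceeds its fixed width")
    return "".join(reversed(digits))


stem_uid = random_base62(16)  # 16 characters for artifacts and collections
uid_v1 = f"{stem_uid}0000"  # first version of the family
uid_v2 = f"{stem_uid}{increment_base62('0000')}"  # next version, same stem
```

Both uids share their first 16 characters, which is what makes the stem usable as a family identifier.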
Simple fields
- uid: str
A universal random id.
- key: str | None
A (virtual) relative file path within the artifact’s storage location.
Setting a key is useful to automatically group artifacts into a version family.
LaminDB defaults to a virtual file path to make renaming of data in object storage easy.
If you register existing files in a storage location, the key equals the actual filepath on the underlying filesystem or object store.
- description: str | None
A description.
LaminDB doesn’t require you to pass a key, you can
- suffix: str
Path suffix or empty string if no canonical suffix exists.
This is either a file suffix (".csv", ".h5ad", etc.) or the empty string "".
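As a rough illustration of the two cases, the sketch below derives a suffix from a key with pathlib. The function infer_suffix is hypothetical and only returns the final suffix; how LaminDB decides which suffixes are canonical is not shown here.

```python
from pathlib import PurePosixPath


def infer_suffix(key: str) -> str:
    """Return the final file suffix of a key, or "" if there is none."""
    return PurePosixPath(key).suffix


file_suffix = infer_suffix("example_datasets/dataset1.parquet")  # ".parquet"
folder_suffix = infer_suffix("project1/my_target_folder")  # ""
```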
- kind: Literal['dataset', 'model'] | None
ArtifactKind (default None).
- otype: str | None
Default Python object type, e.g., DataFrame, AnnData.
- size: int | None
Size in bytes.
Examples: 1KB is 1e3 bytes, 1MB is 1e6, 1GB is 1e9, 1TB is 1e12 etc.
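The decimal units above (powers of 1000, not 1024) can be applied with a small formatter. This is a sketch; format_size is a hypothetical helper, not part of the LaminDB API.

```python
def format_size(n_bytes: int) -> str:
    """Format a byte count using decimal (powers of 1000) units."""
    units = ["B", "KB", "MB", "GB", "TB"]
    value = float(n_bytes)
    for unit in units[:-1]:
        if value < 1000:
            return f"{value:.1f} {unit}"
        value /= 1000
    return f"{value:.1f} {units[-1]}"


print(format_size(1_000))  # 1.0 KB
print(format_size(1_500_000))  # 1.5 MB
```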
- hash: str | None
Hash or pseudo-hash of artifact content.
Useful to ascertain integrity and avoid duplication.
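To illustrate how a content hash helps avoid duplication, here is a minimal sketch that compares files by digest. The choice of SHA-256 and the helper content_hash are assumptions for this example; the hashing scheme LaminDB actually uses is not described on this page.

```python
import hashlib
import tempfile
from pathlib import Path


def content_hash(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    first = Path(tmp, "a.csv")
    copy = Path(tmp, "b.csv")
    other = Path(tmp, "c.csv")
    first.write_text("x,y\n1,2\n")
    copy.write_text("x,y\n1,2\n")
    other.write_text("x,y\n3,4\n")
    # Identical bytes give identical digests, regardless of file name.
    same_content = content_hash(first) == content_hash(copy)
    different_content = content_hash(first) != content_hash(other)
```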
- n_files: int | None
Number of files for folder-like artifacts, None for file-like artifacts.
Note that some arrays are also stored as folders, e.g., .zarr or .tiledbsoma.
Changed in version 1.0: Renamed from n_objects to n_files.
- n_observations: int | None
Number of observations.
Typically, this denotes the first array dimension.
- version: str | None
Version (default None).
Defines version of a family of records characterized by the same stem_uid.
Consider using semantic versioning with Python versioning.
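One reason to treat version strings as structured values is that plain string sorting orders them incorrectly. The sketch below assumes purely numeric dotted versions; parse_version is a hypothetical helper and does not handle pre-release or other non-numeric tags.

```python
def parse_version(version: str) -> tuple[int, ...]:
    """Split a dotted numeric version such as "1.10.0" into integers."""
    return tuple(int(part) for part in version.split("."))


versions = ["1.9", "1.10", "1.2"]
lexicographic = sorted(versions)  # ['1.10', '1.2', '1.9']
numeric = sorted(versions, key=parse_version)  # ['1.2', '1.9', '1.10']
```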
- is_latest: bool
Boolean flag that indicates whether a record is the latest in its version family.
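The interplay of key, version family, and is_latest can be pictured with a toy in-memory model. ToyArtifact and ToyRegistry are hypothetical classes written for this sketch; they only mimic the behavior described on this page and are not how LaminDB stores records.

```python
from dataclasses import dataclass


@dataclass
class ToyArtifact:
    key: str
    content: str
    version_number: int = 1
    is_latest: bool = True


class ToyRegistry:
    """Keeps every saved record, grouped into families by key."""

    def __init__(self) -> None:
        self.records: list[ToyArtifact] = []

    def save(self, key: str, content: str) -> ToyArtifact:
        family = [record for record in self.records if record.key == key]
        for record in family:
            record.is_latest = False  # older versions lose the flag
        new = ToyArtifact(key, content, version_number=len(family) + 1)
        self.records.append(new)
        return new


registry = ToyRegistry()
v1 = registry.save("example_datasets/dataset1.parquet", "original rows")
v2 = registry.save("example_datasets/dataset1.parquet", "updated rows")
```

After the second save, both records remain in the registry, but only v2 carries is_latest.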
- created_at: datetime
Time of creation of record.
- updated_at: datetime
Time of last update to record.
- aux: dict[str, Any] | None
Auxiliary field for dictionary-like metadata.
Relational fields
- space: Space
The space in which the record lives.
- feature_sets: FeatureSet
The feature sets measured in the artifact.
- collections: Collection
The collections that this artifact is part of.
Class methods
- classmethod df(include=None, features=False, limit=100)
Convert to pd.DataFrame.
By default, shows all direct fields, except updated_at.
Use arguments include or features to include other data.
- Parameters:
include (str | list[str] | None, default: None) – Related fields to include as columns. Takes strings of form "ulabels__name", "cell_types__name", etc. or a list of such strings.
features (bool | list[str], default: False) – If True, map all features of the Feature registry onto the resulting DataFrame. Only available for Artifact.
limit (int, default: 100) – Maximum number of rows to display from a Pandas DataFrame. Defaults to 100 to reduce database load.
- Return type:
DataFrame
Examples
Include the name of the creator in the DataFrame:
>>> ln.ULabel.df(include="created_by__name")
Include display of features for Artifact:
>>> df = ln.Artifact.df(features=True)
>>> ln.view(df)  # visualize with type annotations
Only include select features:
>>> df = ln.Artifact.df(features=["cell_type_by_expert", "cell_type_by_model"])
- classmethod filter(*queries, **expressions)
Query records.
- Parameters:
queries – One or multiple Q objects.
expressions – Fields and values passed as Django query expressions.
- Return type:
QuerySet
- Returns:
A QuerySet.
See also
Guide: Query & search registries
Django documentation: Queries
Examples
>>> ln.ULabel(name="my label").save()
>>> ln.ULabel.filter(name__startswith="my").df()
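To show how a keyword such as name__startswith is read as a field name plus a lookup, here is a toy matcher over plain dictionaries. It is a sketch only: the function matches is hypothetical, it supports just three lookups, and it ignores relation traversal (e.g., ulabels__name), which real Django expressions also express with double underscores.

```python
def matches(record: dict, **expressions) -> bool:
    """Check a record against field__lookup=value expressions."""
    for expression, expected in expressions.items():
        field, _, lookup = expression.partition("__")
        value = record[field]
        if lookup in ("", "exact"):
            ok = value == expected
        elif lookup == "startswith":
            ok = value.startswith(expected)
        elif lookup == "contains":
            ok = expected in value
        else:
            raise ValueError(f"unsupported lookup: {lookup}")
        if not ok:
            return False
    return True


records = [{"name": "my label"}, {"name": "other label"}]
selected = [r for r in records if matches(r, name__startswith="my")]
```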
- classmethod from_anndata(adata, key=None, description=None, run=None, revises=None, **kwargs)
Create from AnnData, validate & link features.
- Parameters:
adata (AnnData | UPathStr) – An AnnData object or a path to an AnnData-like file.
key (str | None, default: None) – A relative path within default storage, e.g., "myfolder/myfile.h5ad".
description (str | None, default: None) – A description.
revises (Artifact | None, default: None) – An old version of the artifact.
run (Run | None, default: None) – The run that creates the artifact.
- Return type:
Artifact
See also
Collection()
Track collections.
Feature
Track features.
Examples
>>> import bionty as bt
>>> bt.settings.organism = "human"
>>> adata = ln.core.datasets.anndata_with_obs()
>>> artifact = ln.Artifact.from_anndata(adata, description="mini anndata with obs")
>>> artifact.save()
- classmethod from_df(df, key=None, description=None, run=None, revises=None, **kwargs)
Create from DataFrame, validate & link features.
- Parameters:
df (DataFrame) – A DataFrame object.
key (str | None, default: None) – A relative path within default storage, e.g., "myfolder/myfile.parquet".
description (str | None, default: None) – A description.
revises (Artifact | None, default: None) – An old version of the artifact.
run (Run | None, default: None) – The run that creates the artifact.
- Return type:
Artifact
See also
Collection()
Track collections.
Feature
Track features.
Examples
>>> df = ln.core.datasets.df_iris_in_meter_batch1()
>>> df.head()
   sepal_length  sepal_width  petal_length  petal_width  iris_organism_code
0         0.051        0.035         0.014        0.002                   0
1         0.049        0.030         0.014        0.002                   0
2         0.047        0.032         0.013        0.002                   0
3         0.046        0.031         0.015        0.002                   0
4         0.050        0.036         0.014        0.002                   0
>>> artifact = ln.Artifact.from_df(df, description="Iris flower collection batch1")
>>> artifact.save()
- classmethod from_dir(path, key=None, *, run=None)
Create a list of artifact objects from a directory.
Hint
If you have a high number of files (several 100k) and don’t want to track them individually, create a single Artifact via Artifact(path) for them. See, e.g., RxRx: cell imaging.
- Parameters:
path (lamindb.core.types.UPathStr) – Source path of folder.
key (str | None, default: None) – Key for storage destination. If None and directory is in a registered location, the inferred key will reflect the relative position. If None and directory is outside of a registered storage location, the inferred key defaults to path.name.
run (Run | None, default: None) – A Run object.
- Return type:
list[Artifact]
Examples
>>> dir_path = ln.core.datasets.generate_cell_ranger_files("sample_001", ln.settings.storage)
>>> artifacts = ln.Artifact.from_dir(dir_path)
>>> ln.save(artifacts)
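The two key-inference rules described for key=None can be sketched with pathlib. The function infer_key and the example paths are hypothetical; the sketch assumes a single storage root and POSIX-style paths, and is not LaminDB's own logic.

```python
from pathlib import PurePosixPath


def infer_key(path: str, storage_root: str) -> str:
    """Infer a key for a directory when no key is passed."""
    directory = PurePosixPath(path)
    root = PurePosixPath(storage_root)
    try:
        # Inside the storage location: the key mirrors the relative position.
        return directory.relative_to(root).as_posix()
    except ValueError:
        # Outside the storage location: fall back to the directory name.
        return directory.name


inside = infer_key("/data/storage/project1/sample_001", "/data/storage")
outside = infer_key("/tmp/sample_001", "/data/storage")
```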
- classmethod from_mudata(mdata, key=None, description=None, run=None, revises=None, **kwargs)
Create from MuData, validate & link features.
- Parameters:
mdata (MuData) – A MuData object.
key (str | None, default: None) – A relative path within default storage, e.g., "myfolder/myfile.h5mu".
description (str | None, default: None) – A description.
revises (Artifact | None, default: None) – An old version of the artifact.
run (Run | None, default: None) – The run that creates the artifact.
- Return type:
Artifact
See also
Collection()
Track collections.
Feature
Track features.
Examples
>>> import bionty as bt
>>> bt.settings.organism = "human"
>>> mdata = ln.core.datasets.mudata_papalexi21_subset()
>>> artifact = ln.Artifact.from_mudata(mdata, description="a mudata object")
>>> artifact.save()
- classmethod get(idlike=None, **expressions)
Get a single record.
- Parameters:
idlike (int | str | None, default: None) – Either a uid stub, uid or an integer id.
expressions – Fields and values passed as Django query expressions.
- Return type:
- Returns:
A record.
- Raises:
lamindb.core.exceptions.DoesNotExist – In case no matching record is found.
See also
Guide: Query & search registries
Django documentation: Queries
Examples
>>> ulabel = ln.ULabel.get("FvtpPJLJ")
>>> ulabel = ln.ULabel.get(name="my-label")
- classmethod lookup(field=None, return_field=None)
Return an auto-complete object for a field.
- Parameters:
field (str | DeferredAttribute | None, default: None) – The field to look up the values for. Defaults to first string field.
return_field (str | DeferredAttribute | None, default: None) – The field to return. If None, returns the whole record.
- Return type:
NamedTuple
- Returns:
A NamedTuple of lookup information of the field values with a dictionary converter.
Examples
>>> import bionty as bt
>>> bt.settings.organism = "human"
>>> bt.Gene.from_source(symbol="ADGB-DT").save()
>>> lookup = bt.Gene.lookup()
>>> lookup.adgb_dt
>>> lookup_dict = lookup.dict()
>>> lookup_dict['ADGB-DT']
>>> lookup_by_ensembl_id = bt.Gene.lookup(field="ensembl_gene_id")
>>> lookup_by_ensembl_id.ensg00000002745
>>> lookup_return_symbols = bt.Gene.lookup(field="ensembl_gene_id", return_field="symbol")
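The idea of an auto-complete object with a dictionary converter can be sketched with a namedtuple. The helper make_lookup is hypothetical; it maps plain strings rather than records, returns the dictionary separately instead of through a .dict() method, and does not handle values that collide after normalization.

```python
import re
from collections import namedtuple


def make_lookup(values: list[str]):
    """Build an attribute-style lookup plus a plain dict of the values."""

    def to_identifier(value: str) -> str:
        # Lowercase and replace characters that are invalid in identifiers.
        return re.sub(r"\W", "_", value).lower()

    names = [to_identifier(value) for value in values]
    Lookup = namedtuple("Lookup", names)
    return Lookup(*values), dict(zip(values, values))


lookup, lookup_dict = make_lookup(["ADGB-DT", "TP53"])
print(lookup.adgb_dt)  # ADGB-DT
print(lookup_dict["ADGB-DT"])  # ADGB-DT
```

Attribute access gives tab completion in interactive sessions, while the dictionary keeps the original spelling available as a key.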
- classmethod search(string, *, field=None, limit=20, case_sensitive=False)
Search.
- Parameters:
string (str) – The input string to match against the field ontology values.
field (str | DeferredAttribute | None, default: None) – The field or fields to search. Search all string fields by default.
limit (int | None, default: 20) – Maximum number of top results to return.
case_sensitive (bool, default: False) – Whether the match is case sensitive.
- Return type:
QuerySet
- Returns:
A sorted DataFrame of search results with a score in column score. If return_queryset is True, a QuerySet.
Examples
>>> ulabels = ln.ULabel.from_values(["ULabel1", "ULabel2", "ULabel3"], field="name")
>>> ln.save(ulabels)
>>> ln.ULabel.search("ULabel2")
- classmethod using(instance)
Use a non-default LaminDB instance.
- Parameters:
instance (str | None) – An instance identifier of form "account_handle/instance_name".
- Return type:
QuerySet
Examples
>>> ln.ULabel.using("account_handle/instance_name").search("ULabel7", field="name")
              uid  score
name
ULabel7  g7Hk9b2v  100.0
ULabel5  t4Jm6s0q   75.0
ULabel6  r2Xw8p1z   75.0
Methods
- async adelete(using=None, keep_parents=False)
- async arefresh_from_db(using=None, fields=None, from_queryset=None)
- async asave(*args, force_insert=False, force_update=False, using=None, update_fields=None)
- cache(is_run_input=None)
Download cloud artifact to local cache.
Follows syncing logic: only caches an artifact if it’s outdated in the local cache.
Returns a path to a locally cached on-disk object (say a .jpg file).
- Return type:
Path
Examples
Sync file from cloud and return the local path of the cache:
>>> artifact.cache()
PosixPath('/home/runner/work/Caches/lamindb/lamindb-ci/lndb-storage/pbmc68k.h5ad')
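The "only if outdated" rule can be illustrated with a small local sketch that compares modification times. It stands in for a cloud download with a file copy; sync_to_cache is a hypothetical helper, and the actual criteria LaminDB uses to decide whether a cached copy is outdated are an assumption here.

```python
import os
import shutil
import tempfile
from pathlib import Path


def sync_to_cache(source: Path, cache_dir: Path) -> tuple[Path, bool]:
    """Copy source into cache_dir only if the cached copy is missing or older."""
    cached = cache_dir / source.name
    outdated = (
        not cached.exists()
        or cached.stat().st_mtime_ns < source.stat().st_mtime_ns
    )
    if outdated:
        shutil.copy2(source, cached)  # copy2 preserves the modification time
    return cached, outdated


with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp, "remote", "image.jpg")
    cache_dir = Path(tmp, "cache")
    source.parent.mkdir()
    cache_dir.mkdir()
    source.write_bytes(b"v1")
    _, copied_first = sync_to_cache(source, cache_dir)  # cache was empty
    _, copied_again = sync_to_cache(source, cache_dir)  # cache is current
    source.write_bytes(b"v2")
    stat = source.stat()
    # Push the source's modification time forward so it is clearly newer.
    os.utime(source, ns=(stat.st_atime_ns, stat.st_mtime_ns + 10_000_000_000))
    cached, copied_after_edit = sync_to_cache(source, cache_dir)
    cached_bytes = cached.read_bytes()
```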
- clean()
Hook for doing any extra model-wide validation after clean() has been called on every field by self.clean_fields. Any ValidationError raised by this method will not be associated with a particular field; it will have a special-case association with the field defined by NON_FIELD_ERRORS.
- clean_fields(exclude=None)
Clean all fields and raise a ValidationError containing a dict of all validation errors if any occur.
- date_error_message(lookup_type, field_name, unique_for)
- delete(permanent=None, storage=None, using_key=None)
Trash or permanently delete.
A first call to .delete() puts an artifact into the trash (sets _branch_code to -1). A second call permanently deletes the artifact.
FAQ: Storage FAQ
- Parameters:
permanent (bool | None, default: None) – Permanently delete the artifact (skip trash).
storage (bool | None, default: None) – Indicate whether you want to delete the artifact in storage.
- Return type:
None
Examples
For an Artifact object artifact, call:
>>> artifact.delete()
- describe(print_types=False)
Describe relations of record.
Examples
>>> artifact.describe()
- get_constraints()
- get_deferred_fields()
Return a set containing names of deferred fields for this instance.
- load(is_run_input=None, **kwargs)
Cache and load into memory.
See all loaders.
- Return type:
Any
Examples
Load a DataFrame-like artifact:
>>> artifact.load().head()
   sepal_length  sepal_width  petal_length  petal_width  iris_organism_code
0         0.051        0.035         0.014        0.002                   0
1         0.049        0.030         0.014        0.002                   0
2         0.047        0.032         0.013        0.002                   0
3         0.046        0.031         0.015        0.002                   0
4         0.050        0.036         0.014        0.002                   0
Load an AnnData-like artifact:
>>> artifact.load()
AnnData object with n_obs × n_vars = 70 × 765
Fall back to cache() if no in-memory representation is configured:
>>> artifact.load()
PosixPath('/home/runner/work/lamindb/lamindb/docs/guide/mydata/.lamindb/jb7BY5UJoQVGMUOKiLcn.jpg')
- open(mode='r', is_run_input=None)
Return a cloud-backed data object.
Works for AnnData (.h5ad and .zarr), generic hdf5 and zarr, tiledbsoma objects (.tiledbsoma), and pyarrow-compatible formats.
- Parameters:
mode (str, default: 'r') – Use "w" (write mode) only for tiledbsoma stores; for everything else, use "r" (read-only mode).
- Return type:
AnnDataAccessor | BackedAccessor | SOMACollection | SOMAExperiment | PyArrowDataset
Notes
For more info, see tutorial: Slice arrays.
Examples
Read AnnData in backed mode from cloud:
>>> artifact = ln.Artifact.get(key="lndb-storage/pbmc68k.h5ad")
>>> artifact.open()
AnnDataAccessor object with n_obs × n_vars = 70 × 765
    constructed for the AnnData object pbmc68k.h5ad
    ...
- prepare_database_save(field)
- refresh_from_db(using=None, fields=None, from_queryset=None)
Reload field values from the database.
By default, the reloading happens from the database this instance was loaded from, or by the read router if this instance wasn’t loaded from any database. The using parameter will override the default.
Fields can be used to specify which fields to reload. The fields should be an iterable of field attnames. If fields is None, then all non-deferred fields are reloaded.
When accessing deferred fields of an instance, the deferred loading of the field will call this method.
- replace(data, run=None, format=None)
Replace artifact content.
- Parameters:
data (lamindb.core.types.UPathStr) – A file path.
run (Run | None, default: None) – The run that created the artifact gets auto-linked if ln.track() was called.
- Return type:
None
Examples
Say we made a change to the content of an artifact, e.g., edited the image paradisi05_laminopathic_nuclei.jpg.
This is how we replace the old file in storage with the new file:
>>> artifact.replace("paradisi05_laminopathic_nuclei.jpg")
>>> artifact.save()
Note that this neither changes the storage key nor the filename.
However, it will update the suffix if it changes.
- restore()
Restore from trash.
- Return type:
None
Examples
For any Artifact object artifact, call:
>>> artifact.restore()
- save(upload=None, **kwargs)
Save to database & storage.
- Parameters:
upload (bool | None, default: None) – Trigger upload to cloud storage in instances with hybrid storage mode.
- Return type:
Examples
>>> artifact = ln.Artifact("./myfile.csv", description="myfile")
>>> artifact.save()
- save_base(raw=False, force_insert=False, force_update=False, using=None, update_fields=None)
Handle the parts of saving which should be done only once per save, yet need to be done in raw saves, too. This includes some sanity checks and signal sending.
The ‘raw’ argument is telling save_base not to save any parent models and not to do any changes to the values before save. This is used by fixture loading.
- serializable_value(field_name)
Return the value of the field name for this instance. If the field is a foreign key, return the id value instead of the object. If there’s no Field object with this name on the model, return the model attribute’s value.
Used to serialize a field’s value (in the serializer, or form output, for example). Normally, you would just access the attribute directly and not use this method.
- unique_error_message(model_class, unique_check)
- validate_constraints(exclude=None)
- validate_unique(exclude=None)
Check unique constraints on the model and raise ValidationError if any failed.
- view_lineage(with_children=True)
Graph of data flow.
- Return type:
None
Notes
For more info, see use cases: Data lineage.
Examples
>>> collection.view_lineage()
>>> artifact.view_lineage()