What happens if I save the same artifacts & records twice?
LaminDB’s operations are idempotent in the sense defined in this document.
This allows you to re-run a notebook or script without erroring or duplicating data. Similar behavior holds for human data entry.
Summary
Metadata records
If you try to create any metadata record (Record) and search_names is True (the default):
LaminDB warns you if a record with a similar name exists and displays a table of similar existing records. You can then decide whether to save a new record to the database or query an existing one from the table.
If a name already has an exact match in a registry, LaminDB returns the existing record instead of creating a new one. For versioned entities, the version must also be passed.
If you set search_names to False, you'll directly populate the DB.
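The lookup behavior summarized above can be sketched with a toy in-memory registry. This is not LaminDB's implementation: `ToyRegistry`, its `create` and `save` methods, and the `difflib` similarity cutoff of 0.6 are assumptions made purely for illustration.

```python
import difflib
from itertools import count


class ToyRegistry:
    """Hypothetical in-memory stand-in for a metadata registry."""

    def __init__(self, search_names=True):
        self.search_names = search_names
        self.records = []
        self._ids = count(1)

    def create(self, name):
        """Return (record, similar_names) without writing anything."""
        similar = []
        if self.search_names:
            for record in self.records:
                if record["name"] == name:
                    # Exact match: hand back the existing record.
                    return record, []
            names = [r["name"] for r in self.records]
            # Assumed similarity measure; the real search may differ.
            similar = difflib.get_close_matches(name, names, n=5, cutoff=0.6)
        return {"id": None, "name": name}, similar

    def save(self, record):
        """Persist a record; saving an already saved record is a no-op."""
        if record["id"] is None:
            record["id"] = next(self._ids)
            self.records.append(record)
        return record


registry = ToyRegistry()
first = registry.save(registry.create("My project 1")[0])
candidate, similar = registry.create("My project 1a")  # similar name found
again, _ = registry.create("My project 1")  # exact match is returned
registry.save(again)  # does not add a second entry
```

In this sketch, idempotency comes from two places: `create` returns the stored record on an exact name match, and `save` skips records that already carry an id.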
Data: artifacts & collections
If you try to create an Artifact object from the same content, the outcome depends on artifact_if_hash_exists:
you'll get an existing object, if creation.artifact_if_hash_exists = "warn_return_existing" (the default)
you'll get an error, if creation.artifact_if_hash_exists = "error"
you'll get a warning and a new object, if creation.artifact_if_hash_exists = "warn_create_new"
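The three policies can be illustrated with a small stand-in store that deduplicates by content hash. Everything here is hypothetical: `ToyArtifactStore` is not a LaminDB class, it uses SHA-256 rather than whatever hash LaminDB computes, and for brevity `create` also persists the artifact. Only the three policy names and the `FileExistsError` are taken from the text and examples in this document.

```python
import hashlib


class ToyArtifactStore:
    """Hypothetical store that deduplicates artifacts by content hash."""

    POLICIES = ("warn_return_existing", "error", "warn_create_new")

    def __init__(self, if_hash_exists="warn_return_existing"):
        if if_hash_exists not in self.POLICIES:
            raise ValueError(f"unknown policy: {if_hash_exists!r}")
        self.if_hash_exists = if_hash_exists
        self.artifacts = []

    def create(self, data: bytes, description: str) -> dict:
        # SHA-256 is an assumption chosen for this sketch.
        digest = hashlib.sha256(data).hexdigest()
        existing = next((a for a in self.artifacts if a["hash"] == digest), None)
        if existing is not None:
            if self.if_hash_exists == "warn_return_existing":
                print("returning existing artifact with same hash")
                return existing
            if self.if_hash_exists == "error":
                raise FileExistsError("artifact with same hash exists")
            print("creating new artifact despite existing hash")
        artifact = {
            "id": len(self.artifacts) + 1,
            "hash": digest,
            "description": description,
        }
        self.artifacts.append(artifact)
        return artifact


store = ToyArtifactStore()
a1 = store.create(b"same bytes", "My fcs artifact")
a2 = store.create(b"same bytes", "My fcs artifact")  # existing object

store.if_hash_exists = "error"
try:
    store.create(b"same bytes", "My new fcs artifact")
    raised = False
except FileExistsError:
    raised = True

store.if_hash_exists = "warn_create_new"
a4 = store.create(b"same bytes", "My new fcs artifact")  # second entry
```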
Examples
# !pip install 'lamindb[jupyter]'
!lamin init --storage ./test-idempotency
→ initialized lamindb: testuser1/test-idempotency
import lamindb as ln
import pytest
ln.track("ANW20Fr4eZgM0000")
→ connected lamindb: testuser1/test-idempotency
→ created Transform('ANW20Fr4eZgM0000'), started new Run('vWCYgoNX...') at 2025-01-12 14:03:21 UTC
→ notebook imports: lamindb==1.0a2 pytest==8.3.4
Metadata records
assert ln.settings.creation.search_names
Let us add a first record to the ULabel registry:
label = ln.ULabel(name="My project 1")
label.save()
ULabel(uid='fDrwrO9b', name='My project 1', is_concept=False, created_by_id=1, run_id=1, space_id=1, created_at=2025-01-12 14:03:23 UTC)
If we create a new record with a similar name, we automatically get search results that indicate whether we are at risk of duplicating an entry:
label = ln.ULabel(name="My project 1a")
! record with similar name exists! did you mean to load it?
id | uid | name | is_concept | description | reference | reference_type | space_id | run_id | created_at | created_by_id | aux | _branch_code
---|---|---|---|---|---|---|---|---|---|---|---|---
1 | fDrwrO9b | My project 1 | False | None | None | None | 1 | 1 | 2025-01-12 14:03:23.285828+00:00 | 1 | None | 1
label.save()
ULabel(uid='JVsVWdd6', name='My project 1a', is_concept=False, created_by_id=1, run_id=1, space_id=1, created_at=2025-01-12 14:03:23 UTC)
If the name exactly matches an existing record, we get the existing object:
label = ln.ULabel(name="My project 1")
→ returning existing ULabel record with same name: 'My project 1'
If we save it again, it will not create a new entry in the registry:
label.save()
ULabel(uid='fDrwrO9b', name='My project 1', is_concept=False, created_by_id=1, run_id=1, space_id=1, created_at=2025-01-12 14:03:23 UTC)
Now, if we create a third record, we’ll get two alternatives:
label = ln.ULabel(name="My project 1b")
! records with similar names exist! did you mean to load one of them?
id | uid | name | is_concept | description | reference | reference_type | space_id | run_id | created_at | created_by_id | aux | _branch_code
---|---|---|---|---|---|---|---|---|---|---|---|---
1 | fDrwrO9b | My project 1 | False | None | None | None | 1 | 1 | 2025-01-12 14:03:23.285828+00:00 | 1 | None | 1
2 | JVsVWdd6 | My project 1a | False | None | None | None | 1 | 1 | 2025-01-12 14:03:23.350163+00:00 | 1 | None | 1
If we prefer not to perform a search, e.g. for performance reasons or because the logging is too noisy, we can switch it off:
ln.settings.creation.search_names = False
label = ln.ULabel(name="My project 1c")
In this walkthrough, switch it back on:
ln.settings.creation.search_names = True
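Toggling a setting off and back on by hand is easy to forget when a cell fails in between. A minimal sketch of a context manager that restores the previous value is shown here; `temporary_setting` is a hypothetical helper, not part of LaminDB, and the `SimpleNamespace` object only stands in for `ln.settings.creation`.

```python
from contextlib import contextmanager
from types import SimpleNamespace


@contextmanager
def temporary_setting(settings, name, value):
    """Set settings.<name> to value and restore the old value on exit."""
    previous = getattr(settings, name)
    setattr(settings, name, value)
    try:
        yield settings
    finally:
        # Runs even if the body raises, so the setting is always restored.
        setattr(settings, name, previous)


# Stand-in for ln.settings.creation (an assumption for this sketch).
creation = SimpleNamespace(search_names=True)

with temporary_setting(creation, "search_names", False):
    inside = creation.search_names  # search is off inside the block

after = creation.search_names  # back to the previous value
```

With a real instance, one would presumably pass `ln.settings.creation` as the first argument; that usage is not tested here.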
Data: artifacts and collections
Warn upon trying to re-ingest an existing artifact
assert ln.settings.creation.artifact_if_hash_exists == "warn_return_existing"
filepath = ln.core.datasets.file_fcs()
Create an Artifact:
artifact = ln.Artifact(filepath, description="My fcs artifact").save()
assert artifact.hash == "KCEXRahJ-Ui9Y6nksQ8z1A"
assert artifact.run == ln.context.run
assert len(artifact._previous_runs.all()) == 0
Create an Artifact from the same path:
artifact2 = ln.Artifact(filepath, description="My fcs artifact")
→ returning existing artifact with same hash: Artifact(uid='MIaivNFW4jAb6IWS0000', is_latest=True, description='My fcs artifact', suffix='.fcs', size=6785467, hash='KCEXRahJ-Ui9Y6nksQ8z1A', space_id=1, storage_id=1, run_id=1, created_by_id=1, created_at=2025-01-12 14:03:24 UTC)
It gives us the existing object:
assert artifact.id == artifact2.id
assert artifact.run == artifact2.run
assert len(artifact._previous_runs.all()) == 0
If you save it again, nothing will happen (the operation is idempotent):
artifact2.save()
Artifact(uid='MIaivNFW4jAb6IWS0000', is_latest=True, description='My fcs artifact', suffix='.fcs', size=6785467, hash='KCEXRahJ-Ui9Y6nksQ8z1A', space_id=1, storage_id=1, run_id=1, created_by_id=1, created_at=2025-01-12 14:03:24 UTC)
The following cell shows how this interplays with data lineage: re-creating the artifact in a new run returns the same record, links it to the new run, and keeps the earlier run in its history.
ln.context.track(new_run=True)
artifact3 = ln.Artifact(filepath, description="My fcs artifact")
assert artifact3.id == artifact2.id
assert artifact3.run != artifact2.run
assert artifact3._previous_runs.first() == artifact2.run
→ loaded Transform('ANW20Fr4eZgM0000'), started new Run('rCsQykrD...') at 2025-01-12 14:03:24 UTC
→ notebook imports: lamindb==1.0a2 pytest==8.3.4
→ returning existing artifact with same hash: Artifact(uid='MIaivNFW4jAb6IWS0000', is_latest=True, description='My fcs artifact', suffix='.fcs', size=6785467, hash='KCEXRahJ-Ui9Y6nksQ8z1A', space_id=1, storage_id=1, run_id=1, created_by_id=1, created_at=2025-01-12 14:03:24 UTC)
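The lineage behavior checked by the assertions above can be modeled in a few lines. This is a toy model under stated assumptions, not LaminDB code: `reuse_artifact` is a hypothetical function, artifacts are plain dicts, and runs are plain strings.

```python
def reuse_artifact(artifact, current_run):
    """Attach an existing artifact to the current run, keeping its history."""
    if artifact["run"] != current_run:
        # Remember the run that produced the artifact earlier.
        artifact["previous_runs"].append(artifact["run"])
        artifact["run"] = current_run
    return artifact


artifact = {"id": 1, "run": "run-1", "previous_runs": []}

# Same run: nothing changes, matching the empty _previous_runs above.
reuse_artifact(artifact, "run-1")
same_run_history = list(artifact["previous_runs"])

# New run: same artifact, new run, old run moved into the history.
reuse_artifact(artifact, "run-2")
reuse_artifact(artifact, "run-2")  # repeating the call changes nothing
```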
Error upon trying to re-ingest an existing artifact
ln.settings.creation.artifact_if_hash_exists = "error"
In this case, you won't be able to create an object from the same content:
with pytest.raises(FileExistsError):
    artifact3 = ln.Artifact(filepath, description="My new fcs artifact")
Warn and create a new artifact
Lastly, let us discuss the following setting:
ln.settings.creation.artifact_if_hash_exists = "warn_create_new"
In this case, you’ll create a new object:
artifact4 = ln.Artifact(filepath, description="My new fcs artifact").save()
! creating new Artifact object despite existing artifact with same hash: Artifact(uid='MIaivNFW4jAb6IWS0000', is_latest=True, description='My fcs artifact', suffix='.fcs', size=6785467, hash='KCEXRahJ-Ui9Y6nksQ8z1A', space_id=1, storage_id=1, run_id=1, created_by_id=1, created_at=2025-01-12 14:03:24 UTC)
You can verify that it’s a new entry by comparing the ids:
assert artifact4.id != artifact.id
ln.Artifact.filter(hash="KCEXRahJ-Ui9Y6nksQ8z1A").df()
id | uid | key | description | suffix | kind | otype | size | hash | n_files | n_observations | _hash_type | _key_is_virtual | _curator | _overwrite_versions | space_id | storage_id | version | is_latest | run_id | created_at | created_by_id | aux | _branch_code
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
1 | MIaivNFW4jAb6IWS0000 | None | My fcs artifact | .fcs | None | None | 6785467 | KCEXRahJ-Ui9Y6nksQ8z1A | None | None | md5 | True | None | False | 1 | 1 | None | True | 1 | 2025-01-12 14:03:24.297876+00:00 | 1 | None | 1
2 | hqXaOVetFrxkw8Fk0000 | None | My new fcs artifact | .fcs | None | None | 6785467 | KCEXRahJ-Ui9Y6nksQ8z1A | None | None | md5 | True | None | False | 1 | 1 | None | True | 2 | 2025-01-12 14:03:25.857644+00:00 | 1 | None | 1
assert len(ln.Artifact.filter(hash="KCEXRahJ-Ui9Y6nksQ8z1A").all()) == 2
!rm -rf ./test-idempotency
!lamin delete --force test-idempotency
• deleting instance testuser1/test-idempotency