
scRNA-seq

You’ll learn how to manage a growing number of scRNA-seq datasets as a single queryable & batch-iterable collection.

Along the way, you’ll see how to create reports, leverage data lineage, and query individual datasets.

If you’re only interested in using a large curated scRNA-seq collection, see the CELLxGENE Census guide.

Here, you will:

  1. create an Artifact from an AnnData object and seed a growing Collection with it (scrna1/6, current page)

  2. append a new dataset and create a new version of this collection (scrna2/6)

  3. query & inspect artifacts by metadata individually (scrna3/6)

  4. load the joint collection and save analytical results (scrna4/6)

  5. iterate over the collection and train a model (scrna5/6)

  6. discuss converting a collection to a single TileDB SOMA store of the same data (scrna6/6)

# !pip install 'lamindb[jupyter,aws,bionty]' 
!lamin init --storage ./test-scrna --schema bionty
import lamindb as ln
import bionty as bt

ln.settings.transform.stem_uid = "Nv48yAceNSh8"
ln.settings.transform.version = "1"
ln.track()
💡 connected lamindb: testuser1/test-scrna
💡 notebook imports: bionty==0.47.1 lamindb==0.75.0
💡 saved: Transform(uid='Nv48yAceNSh85zKv', version='1', name='scRNA-seq', key='scrna', type='notebook', created_by_id=1, updated_at='2024-08-06 09:37:31 UTC')
💡 saved: Run(uid='NkcV4OdudGMpNQOU2qmA', transform_id=1, created_by_id=1)
Run(uid='NkcV4OdudGMpNQOU2qmA', started_at='2024-08-06 09:37:31 UTC', is_consecutive=True, transform_id=1, created_by_id=1)

Populate metadata registries based on an artifact

Let us look at the standardized data of Conde et al., Science (2022), available from CELLxGENE. anndata_human_immune_cells() loads a subsampled version:

adata = ln.core.datasets.anndata_human_immune_cells()
adata
AnnData object with n_obs × n_vars = 1648 × 36503
    obs: 'donor', 'tissue', 'cell_type', 'assay'
    var: 'feature_is_filtered', 'feature_reference', 'feature_biotype'
    uns: 'default_embedding'
    obsm: 'X_umap'

Let’s curate this AnnData object:

curate = ln.Curate.from_anndata(
    adata, 
    var_index=bt.Gene.ensembl_gene_id, 
    categoricals={
        adata.obs.donor.name: ln.ULabel.name, 
        adata.obs.tissue.name: bt.Tissue.name, 
        adata.obs.cell_type.name: bt.CellType.name, 
        adata.obs.assay.name: bt.ExperimentalFactor.name
    }, 
    organism="human",
)
✅ added 4 records with Feature.name for columns: 'donor', 'tissue', 'cell_type', 'assay'
✅ added 36283 records from public with Gene.ensembl_gene_id for var_index: 'ENSG00000243485', 'ENSG00000237613', 'ENSG00000186092', 'ENSG00000238009', 'ENSG00000239945', 'ENSG00000239906', 'ENSG00000241860', 'ENSG00000241599', 'ENSG00000286448', 'ENSG00000236601', 'ENSG00000284733', 'ENSG00000235146', 'ENSG00000284662', 'ENSG00000229905', 'ENSG00000237491', 'ENSG00000177757', 'ENSG00000228794', 'ENSG00000225880', 'ENSG00000230368', 'ENSG00000272438', ...
220 non-validated categories are not saved in Gene.ensembl_gene_id: ['ENSG00000230699', 'ENSG00000241180', 'ENSG00000226849', 'ENSG00000272482', 'ENSG00000264443', 'ENSG00000242396', 'ENSG00000237352', 'ENSG00000269933', 'ENSG00000286863', 'ENSG00000285808', 'ENSG00000261737', 'ENSG00000230427', 'ENSG00000226822', 'ENSG00000273373', 'ENSG00000259834', 'ENSG00000224167', 'ENSG00000256374', 'ENSG00000234283', 'ENSG00000263464', 'ENSG00000203812', 'ENSG00000272196', 'ENSG00000237975', 'ENSG00000235736', 'ENSG00000272880', 'ENSG00000227925', 'ENSG00000238042', 'ENSG00000237845', 'ENSG00000270188', 'ENSG00000287116', 'ENSG00000236856', 'ENSG00000226277', 'ENSG00000237133', 'ENSG00000224739', 'ENSG00000230525', 'ENSG00000227902', 'ENSG00000237327', 'ENSG00000285155', 'ENSG00000232411', 'ENSG00000239467', 'ENSG00000225205', 'ENSG00000272551', 'ENSG00000280374', 'ENSG00000226747', 'ENSG00000272519', 'ENSG00000236886', 'ENSG00000229352', 'ENSG00000286601', 'ENSG00000227021', 'ENSG00000259855', 'ENSG00000233143', 'ENSG00000228135', 'ENSG00000273301', 'ENSG00000237940', 'ENSG00000271870', 'ENSG00000237838', 'ENSG00000286996', 'ENSG00000223797', 'ENSG00000233509', 'ENSG00000269028', 'ENSG00000239462', 'ENSG00000286699', 'ENSG00000273370', 'ENSG00000261490', 'ENSG00000251679', 'ENSG00000249988', 'ENSG00000272567', 'ENSG00000270394', 'ENSG00000249381', 'ENSG00000272370', 'ENSG00000272354', 'ENSG00000251044', 'ENSG00000248371', 'ENSG00000251613', 'ENSG00000272040', 'ENSG00000182230', 'ENSG00000249684', 'ENSG00000233937', 'ENSG00000248103', 'ENSG00000204092', 'ENSG00000261068', 'ENSG00000236740', 'ENSG00000236996', 'ENSG00000232295', 'ENSG00000271734', 'ENSG00000236673', 'ENSG00000227220', 'ENSG00000236166', 'ENSG00000112096', 'ENSG00000285162', 'ENSG00000228434', 'ENSG00000229881', 'ENSG00000286228', 'ENSG00000237513', 'ENSG00000285106', 'ENSG00000226380', 'ENSG00000270672', 'ENSG00000225932', 'ENSG00000244693', 'ENSG00000283504', 'ENSG00000283648', 'ENSG00000268955', 
'ENSG00000272267', 'ENSG00000255495', 'ENSG00000253381', 'ENSG00000254143', 'ENSG00000253878', 'ENSG00000259820', 'ENSG00000226403', 'ENSG00000229611', 'ENSG00000233776', 'ENSG00000269900', 'ENSG00000283886', 'ENSG00000261534', 'ENSG00000237548', 'ENSG00000239665', 'ENSG00000256892', 'ENSG00000249860', 'ENSG00000271409', 'ENSG00000224745', 'ENSG00000261438', 'ENSG00000231575', 'ENSG00000260461', 'ENSG00000234134', 'ENSG00000255823', 'ENSG00000248671', 'ENSG00000254740', 'ENSG00000254561', 'ENSG00000282080', 'ENSG00000256427', 'ENSG00000286911', 'ENSG00000287577', 'ENSG00000246331', 'ENSG00000287388', 'ENSG00000276814', 'ENSG00000271259', 'ENSG00000287622', 'ENSG00000255945', 'ENSG00000261650', 'ENSG00000256542', 'ENSG00000230641', 'ENSG00000275294', 'ENSG00000236094', 'ENSG00000237585', 'ENSG00000223458', 'ENSG00000261666', 'ENSG00000280710', 'ENSG00000203441', 'ENSG00000230156', 'ENSG00000275216', 'ENSG00000215271', 'ENSG00000286931', 'ENSG00000258414', 'ENSG00000258808', 'ENSG00000277050', 'ENSG00000273888', 'ENSG00000258777', 'ENSG00000258301', 'ENSG00000258861', 'ENSG00000259444', 'ENSG00000260780', 'ENSG00000244952', 'ENSG00000259730', 'ENSG00000258631', 'ENSG00000258831', 'ENSG00000273923', 'ENSG00000259664', 'ENSG00000259582', 'ENSG00000261720', 'ENSG00000277010', 'ENSG00000260182', 'ENSG00000262668', 'ENSG00000232196', 'ENSG00000260060', 'ENSG00000260141', 'ENSG00000261439', 'ENSG00000260923', 'ENSG00000215067', 'ENSG00000263316', 'ENSG00000262089', 'ENSG00000273388', 'ENSG00000264067', 'ENSG00000272736', 'ENSG00000214970', 'ENSG00000263388', 'ENSG00000262292', 'ENSG00000256618', 'ENSG00000221995', 'ENSG00000226377', 'ENSG00000273576', 'ENSG00000267637', 'ENSG00000283517', 'ENSG00000282965', 'ENSG00000286603', 'ENSG00000265717', 'ENSG00000278107', 'ENSG00000273733', 'ENSG00000273837', 'ENSG00000286949', 'ENSG00000256222', 'ENSG00000280095', 'ENSG00000278927', 'ENSG00000278955', 'ENSG00000224247', 'ENSG00000272948', 'ENSG00000233213', 'ENSG00000277352', 
'ENSG00000239446', 'ENSG00000231566', 'ENSG00000256045', 'ENSG00000228906', 'ENSG00000228139', 'ENSG00000261773', 'ENSG00000237563', 'ENSG00000228890', 'ENSG00000226362', 'ENSG00000278198', 'ENSG00000273496', 'ENSG00000277666', 'ENSG00000278782', 'ENSG00000277761']!
      → to lookup categories, use lookup().var_index
      → to save, run add_new_from_var_index
curate.add_new_from_var_index()
✅ added 220 records with Gene.ensembl_gene_id for var_index: 'ENSG00000230699', 'ENSG00000241180', 'ENSG00000226849', 'ENSG00000272482', 'ENSG00000264443', 'ENSG00000242396', 'ENSG00000237352', 'ENSG00000269933', 'ENSG00000286863', 'ENSG00000285808', 'ENSG00000261737', 'ENSG00000230427', 'ENSG00000226822', 'ENSG00000273373', 'ENSG00000259834', 'ENSG00000224167', 'ENSG00000256374', 'ENSG00000234283', 'ENSG00000263464', 'ENSG00000203812', ...
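
As a quick sanity check on the two log messages above, the 36283 genes found in the public reference plus the 220 newly added ones should account for all 36503 variables of the AnnData object. The snippet below only redoes that arithmetic with the numbers printed in the logs; the variable names are introduced here for illustration and are not part of lamindb.

```python
# Counts copied from the logs and the AnnData summary above.
n_vars = 36503          # n_vars of the AnnData object
n_from_public = 36283   # records added from the public Gene reference
n_added_new = 220       # records added via add_new_from_var_index()

# Every variable should now be covered by a Gene record.
n_registered = n_from_public + n_added_new
all_covered = n_registered == n_vars

# Share of genes that the public reference alone would have covered.
public_coverage = n_from_public / n_vars

print(f"registered: {n_registered}, all covered: {all_covered}")
print(f"covered by public reference alone: {public_coverage:.2%}")
```

This agrees with the message printed when the artifact is saved, which reports 36503 terms (100.00%) as validated for ensembl_gene_id.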
curate.validate()
✅ var_index is validated against Gene.ensembl_gene_id
💡 mapping donor on ULabel.name
12 terms are not validated: 'D496', '621B', 'A29', 'A36', 'A35', '637C', 'A52', 'A37', 'D503', '640C', 'A31', '582C'
      → save terms via .add_new_from('donor')
💡 mapping tissue on Tissue.name
❗    found 17 terms validated terms: ['blood', 'thoracic lymph node', 'spleen', 'lung', 'mesenteric lymph node', 'lamina propria', 'liver', 'jejunal epithelium', 'omentum', 'bone marrow', 'ileum', 'caecum', 'thymus', 'skeletal muscle tissue', 'duodenum', 'sigmoid colon', 'transverse colon']
      → save terms via .add_validated_from('tissue')
✅ tissue is validated against Tissue.name
💡 mapping cell_type on CellType.name
❗    found 31 terms validated terms: ['classical monocyte', 'T follicular helper cell', 'memory B cell', 'alveolar macrophage', 'naive thymus-derived CD4-positive, alpha-beta T cell', 'effector memory CD8-positive, alpha-beta T cell, terminally differentiated', 'alpha-beta T cell', 'CD4-positive helper T cell', 'naive thymus-derived CD8-positive, alpha-beta T cell', 'macrophage', 'mucosal invariant T cell', 'group 3 innate lymphoid cell', 'naive B cell', 'CD16-negative, CD56-bright natural killer cell, human', 'plasma cell', 'CD8-positive, alpha-beta memory T cell', 'CD16-positive, CD56-dim natural killer cell, human', 'gamma-delta T cell', 'conventional dendritic cell', 'CD8-positive, alpha-beta memory T cell, CD45RO-positive', 'effector memory CD4-positive, alpha-beta T cell', 'non-classical monocyte', 'mast cell', 'regulatory T cell', 'progenitor cell', 'dendritic cell, human', 'plasmablast', 'plasmacytoid dendritic cell', 'lymphocyte', 'germinal center B cell', 'megakaryocyte']
      → save terms via .add_validated_from('cell_type')
1 terms is not validated: 'animal cell'
      → save terms via .add_new_from('cell_type')
💡 mapping assay on ExperimentalFactor.name
❗    found 3 terms validated terms: ["10x 3' v3", "10x 5' v2", "10x 5' v1"]
      → save terms via .add_validated_from('assay')
✅ assay is validated against ExperimentalFactor.name
False
curate.add_new_from("donor")
curate.add_new_from("cell_type")
✅ added 12 records with ULabel.name for donor: 'D496', '621B', 'A29', 'A36', 'A35', '637C', 'A52', 'A37', 'D503', '640C', 'A31', '582C'
✅ added 31 records from public with CellType.name for cell_type: 'classical monocyte', 'T follicular helper cell', 'memory B cell', 'alveolar macrophage', 'naive thymus-derived CD4-positive, alpha-beta T cell', 'effector memory CD8-positive, alpha-beta T cell, terminally differentiated', 'alpha-beta T cell', 'CD4-positive helper T cell', 'naive thymus-derived CD8-positive, alpha-beta T cell', 'macrophage', 'mucosal invariant T cell', 'group 3 innate lymphoid cell', 'naive B cell', 'CD16-negative, CD56-bright natural killer cell, human', 'plasma cell', 'CD8-positive, alpha-beta memory T cell', 'CD16-positive, CD56-dim natural killer cell, human', 'gamma-delta T cell', 'conventional dendritic cell', 'CD8-positive, alpha-beta memory T cell, CD45RO-positive', ...
✅ added 1 record with CellType.name for cell_type: 'animal cell'
curate.add_validated_from("all")
💡 saving labels for 'donor'
💡 saving labels for 'tissue'
✅ added 17 records from public with Tissue.name for tissue: 'blood', 'thoracic lymph node', 'spleen', 'lung', 'mesenteric lymph node', 'lamina propria', 'liver', 'jejunal epithelium', 'omentum', 'bone marrow', 'ileum', 'caecum', 'thymus', 'skeletal muscle tissue', 'duodenum', 'sigmoid colon', 'transverse colon'
💡 saving labels for 'cell_type'
💡 saving labels for 'assay'
✅ added 3 records from public with ExperimentalFactor.name for assay: '10x 3' v3', '10x 5' v2', '10x 5' v1'
curate.validate()
✅ var_index is validated against Gene.ensembl_gene_id
✅ donor is validated against ULabel.name
✅ tissue is validated against Tissue.name
✅ cell_type is validated against CellType.name
✅ assay is validated against ExperimentalFactor.name
True
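
Conceptually, the validate / add / validate cycle above can be thought of as a set-membership check per column: each distinct value is either found in a registry or it is not, and validation passes once nothing is left over. The following is a minimal, library-free sketch of that idea under this assumption. It is not how lamindb implements validation, and `split_validated` and the toy registry are hypothetical names used only for illustration.

```python
def split_validated(values, registry):
    """Partition distinct values into (validated, non_validated).

    First-seen order is preserved so the result is easy to read.
    """
    distinct = list(dict.fromkeys(values))  # de-duplicate, keep order
    validated = [v for v in distinct if v in registry]
    non_validated = [v for v in distinct if v not in registry]
    return validated, non_validated


# Toy registry standing in for CellType.name (an assumption for this sketch).
cell_type_registry = {"macrophage", "plasma cell", "mast cell"}
observed = ["macrophage", "animal cell", "mast cell", "macrophage"]

# First pass: one term is unknown, so validation fails.
validated, non_validated = split_validated(observed, cell_type_registry)
first_pass_ok = not non_validated

# Registering the unknown terms plays the role of add_new_from(...) here.
cell_type_registry.update(non_validated)

# Second pass: nothing is left over, so validation succeeds.
_, remaining = split_validated(observed, cell_type_registry)
second_pass_ok = not remaining

print(validated, non_validated, first_pass_ok, second_pass_ok)
```

In the real run, the same pattern shows up as `validate()` returning False while 'animal cell' and the donor IDs are unknown, and True after they have been added.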

When we create an Artifact object from an AnnData, we automatically curate it with validated features and labels:

artifact = curate.save_artifact(description="Human immune cells from Conde22")
💡 path content will be copied to default storage upon `save()` with key `None` ('.lamindb/ehMgrxyXJneYzGOxT0AX.h5ad')
✅ storing artifact 'ehMgrxyXJneYzGOxT0AX' at '/home/runner/work/lamin-usecases/lamin-usecases/docs/test-scrna/.lamindb/ehMgrxyXJneYzGOxT0AX.h5ad'
💡 parsing feature names of X stored in slot 'var'
36503 terms (100.00%) are validated for ensembl_gene_id
✅    linked: FeatureSet(uid='qo6FPQxG54TI8LHkCL9S', n=36503, dtype='float', registry='bionty.Gene', hash='xtVNbbhs3ty63qs-rwKZHA', created_by_id=1, run_id=1)
💡 parsing feature names of slot 'obs'
4 terms (100.00%) are validated for name
✅    linked: FeatureSet(uid='186MRWfRKTxX7RR0WX3k', n=4, registry='Feature', hash='waAn35V9qDIfQHnc-jdtPQ', created_by_id=1, run_id=1)
✅ saved 2 feature sets for slots: 'var','obs'
artifact.describe()
Artifact(uid='ehMgrxyXJneYzGOxT0AX', description='Human immune cells from Conde22', suffix='.h5ad', type='dataset', _accessor='AnnData', size=57612943, hash='9sXda5E7BYiVoDOQkTC0KB', _hash_type='sha1-fl', n_observations=1648, visibility=1, _key_is_virtual=True, updated_at='2024-08-06 09:38:47 UTC')
  Provenance
    .created_by = 'testuser1'
    .storage = '/home/runner/work/lamin-usecases/lamin-usecases/docs/test-scrna'
    .transform = 'scRNA-seq'
    .run = '2024-08-06 09:37:31 UTC'
  Labels
    .tissues = 'blood', 'thoracic lymph node', 'spleen', 'lung', 'mesenteric lymph node', 'lamina propria', 'liver', 'jejunal epithelium', 'omentum', 'bone marrow', ...
    .cell_types = 'classical monocyte', 'T follicular helper cell', 'memory B cell', 'alveolar macrophage', 'naive thymus-derived CD4-positive, alpha-beta T cell', 'effector memory CD8-positive, alpha-beta T cell, terminally differentiated', 'alpha-beta T cell', 'CD4-positive helper T cell', 'naive thymus-derived CD8-positive, alpha-beta T cell', 'macrophage', ...
    .experimental_factors = '10x 3' v3', '10x 5' v2', '10x 5' v1'
    .ulabels = 'D496', '621B', 'A29', 'A36', 'A35', '637C', 'A52', 'A37', 'D503', '640C', ...
  Features
    'assay' = '10x 3' v3', '10x 5' v2', '10x 5' v1'
    'cell_type' = 'classical monocyte', 'T follicular helper cell', 'memory B cell', 'alveolar macrophage', 'naive thymus-derived CD4-positive, alpha-beta T cell', 'effector memory CD8-positive, alpha-beta T cell, terminally differentiated', 'alpha-beta T cell', 'CD4-positive helper T cell', 'naive thymus-derived CD8-positive, alpha-beta T cell', 'macrophage', ...
    'donor' = 'D496', '621B', 'A29', 'A36', 'A35', '637C', 'A52', 'A37', 'D503', '640C', ...
    'tissue' = 'blood', 'thoracic lymph node', 'spleen', 'lung', 'mesenteric lymph node', 'lamina propria', 'liver', 'jejunal epithelium', 'omentum', 'bone marrow', ...
  Feature sets
    'var' = 'MIR1302-2HG', 'FAM138A', 'OR4F5', 'None', 'OR4F29', 'OR4F16', 'LINC01409', 'FAM87B', 'LINC01128', 'LINC00115', 'FAM41C'
    'obs' = 'donor', 'tissue', 'cell_type', 'assay'

Seed a collection

Let’s create a first version of a collection that will encompass many h5ad files when more data is ingested.

Note

To see the result of the incremental growth, take a look at the CELLxGENE Census guide for an instance with ~1k h5ads and ~50 million cells.

collection = ln.Collection(
    artifact, name="My versioned scRNA-seq collection", version="1"
)
collection.save()
Collection(uid='8mjQ7FUidtGoWswBDFli', version='1', name='My versioned scRNA-seq collection', hash='exJtsBYH53iiebYH-Qx0sw', visibility=1, created_by_id=1, transform_id=1, run_id=1, updated_at='2024-08-06 09:38:48 UTC')

In version 1, the collection contains a single artifact, so the two hold the same data. They are nonetheless tracked independently, and each can be queried through its own registry:

collection.describe()
Collection(uid='8mjQ7FUidtGoWswBDFli', version='1', name='My versioned scRNA-seq collection', hash='exJtsBYH53iiebYH-Qx0sw', visibility=1, updated_at='2024-08-06 09:38:48 UTC')
  Provenance
    .created_by = 'testuser1'
    .transform = 'scRNA-seq'
    .run = '2024-08-06 09:37:31 UTC'
  Feature sets
    'var' = 'MIR1302-2HG', 'FAM138A', 'OR4F5', 'None', 'OR4F29', 'OR4F16', 'LINC01409', 'FAM87B', 'LINC01128', 'LINC00115', 'FAM41C'
    'obs' = 'donor', 'tissue', 'cell_type', 'assay'

Access the underlying artifacts like so:

collection.ordered_artifacts
<QuerySet [Artifact(uid='ehMgrxyXJneYzGOxT0AX', description='Human immune cells from Conde22', suffix='.h5ad', type='dataset', _accessor='AnnData', size=57612943, hash='9sXda5E7BYiVoDOQkTC0KB', _hash_type='sha1-fl', n_observations=1648, visibility=1, _key_is_virtual=True, created_by_id=1, storage_id=1, transform_id=1, run_id=1, updated_at='2024-08-06 09:38:47 UTC')]>
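
To make the idea of a growing, versioned collection more concrete, here is a small library-free sketch. It does not use or mirror lamindb's internals: `ToyArtifact`, `ToyCollection`, `with_added`, and the second dataset are all hypothetical and exist only for illustration. The sketch assumes that each version is an immutable, ordered group of artifacts and that adding data produces a new version rather than changing the old one.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ToyArtifact:
    """Hypothetical stand-in for one stored dataset."""
    uid: str
    n_observations: int


@dataclass(frozen=True)
class ToyCollection:
    """Hypothetical stand-in for one immutable version of a collection."""
    name: str
    version: str
    artifacts: Tuple[ToyArtifact, ...]

    @property
    def n_observations(self) -> int:
        # Total number of cells across all member artifacts.
        return sum(a.n_observations for a in self.artifacts)

    def with_added(self, artifact: ToyArtifact, version: str) -> "ToyCollection":
        # Build a new version instead of mutating the existing one.
        return ToyCollection(self.name, version, self.artifacts + (artifact,))


# Version 1 wraps exactly the one artifact saved above.
artifact_1 = ToyArtifact(uid="ehMgrxyXJneYzGOxT0AX", n_observations=1648)
v1 = ToyCollection("My versioned scRNA-seq collection", "1", (artifact_1,))

# A made-up second dataset, purely for illustration.
artifact_2 = ToyArtifact(uid="hypothetical-uid", n_observations=100)
v2 = v1.with_added(artifact_2, version="2")

print(v1.version, len(v1.artifacts), v1.n_observations)
print(v2.version, len(v2.artifacts), v2.n_observations)
```

Keeping versions immutable in this toy model means version 1 still refers to exactly one artifact after version 2 exists, which is the property that makes earlier versions reproducible.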

See data lineage:

collection.view_lineage()
[Figure: data lineage graph of the collection]