
Standardize and append a batch of data

Here, we’ll learn

  • how to standardize a less well curated collection

  • how to append it to the growing versioned collection

import lamindb as ln
import bionty as bt

ln.settings.transform.stem_uid = "ManDYgmftZ8C"
ln.settings.transform.version = "1"
ln.track()
💡 connected lamindb: testuser1/test-scrna
💡 notebook imports: bionty==0.47.1 lamindb==0.75.0
💡 saved: Transform(uid='ManDYgmftZ8C5zKv', version='1', name='Standardize and append a batch of data', key='scrna2', type='notebook', created_by_id=1, updated_at='2024-08-06 09:38:53 UTC')
💡 saved: Run(uid='D3bvFMVmAMN4ieqqzTqy', transform_id=2, created_by_id=1)
Run(uid='D3bvFMVmAMN4ieqqzTqy', started_at='2024-08-06 09:38:53 UTC', is_consecutive=True, transform_id=2, created_by_id=1)

Let’s now consider a less well curated dataset:

adata = ln.core.datasets.anndata_pbmc68k_reduced()
adata
AnnData object with n_obs × n_vars = 70 × 765
    obs: 'cell_type', 'n_genes', 'percent_mito', 'louvain'
    var: 'n_counts', 'highly_variable'
    uns: 'louvain', 'louvain_colors', 'neighbors', 'pca'
    obsm: 'X_pca', 'X_umap'
    varm: 'PCs'
    obsp: 'connectivities', 'distances'

We are still working with human data, and can globally set an organism:

bt.settings.organism = "human"
curate = ln.Curate.from_anndata(adata, var_index=bt.Gene.symbol, categoricals={adata.obs.cell_type.name: bt.CellType.name})
3 non-validated categories are not saved in Feature.name: ['percent_mito', 'n_genes', 'louvain']!
      → to lookup categories, use lookup().columns
      → to save, run add_new_from_columns
✅ added 5 records from public with Gene.symbol for var_index: 'GPX1', 'SOD2', 'RN7SL1', 'SNORD3B-2', 'IGLL5'
11 non-validated categories are not saved in Gene.symbol: ['RP11-782C8.1', 'RP11-277L2.3', 'RP11-156E8.1', 'RP3-467N11.1', 'RP11-390E23.6', 'RP11-489E7.4', 'RP11-291B21.2', 'RP11-620J15.3', 'TMBIM4-1', 'AC084018.1', 'CTD-3138B18.5']!
      → to lookup categories, use lookup().var_index
      → to save, run add_new_from_var_index

Standardize & validate genes

Let’s convert gene symbols to Ensembl IDs via standardize(). Note that this mapping is non-unique: when a symbol matches several IDs, the first match is kept because the keep parameter of .standardize() defaults to "first":

adata.var["ensembl_gene_id"] = bt.Gene.standardize(
    adata.var.index,
    field=bt.Gene.symbol,
    return_field=bt.Gene.ensembl_gene_id,
)
# keep the symbols as a column and use ensembl_gene_id as the index
adata.var.index.name = "symbol"
adata.var = adata.var.reset_index().set_index("ensembl_gene_id")

# we only want to save data with validated genes
validated = bt.Gene.validate(adata.var.index, bt.Gene.ensembl_gene_id, mute=True)
adata_validated = adata[:, validated].copy()
💡 standardized 754/765 terms
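
The "first match is kept" behavior can be pictured with a small plain-Python sketch. The symbols and IDs below are invented for illustration, and `standardize_first` is a hypothetical helper, not part of bionty. It only mimics the idea of keeping the first hit of a non-unique mapping and then masking out entries that received no ID. In the actual workflow, validation is a separate check against the registry.

```python
# Toy reference table of (symbol, id) pairs; one symbol maps to two ids.
# All values are made up for illustration.
REFERENCE = [
    ("GENE_A", "ID_0001"),
    ("GENE_B", "ID_0002"),
    ("GENE_B", "ID_0003"),  # second match for GENE_B, dropped by "first"
    ("GENE_C", "ID_0004"),
]


def standardize_first(symbols, reference):
    """Map each symbol to the first matching id, or None if there is no match."""
    first_hit = {}
    for symbol, gene_id in reference:
        # setdefault keeps the id seen first and ignores later duplicates
        first_hit.setdefault(symbol, gene_id)
    return [first_hit.get(symbol) for symbol in symbols]


symbols = ["GENE_B", "UNKNOWN_1", "GENE_A"]
ids = standardize_first(symbols, REFERENCE)

# Boolean mask of entries that received an id, analogous to a validation mask
validated = [gene_id is not None for gene_id in ids]
kept = [gene_id for gene_id, ok in zip(ids, validated) if ok]

print(ids)
print(validated)
print(kept)
```

The mask plays the same role as `validated` in the cell above: it selects the columns that are safe to save.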

Here, we’ll also subset .raw to the validated genes and align its index with the Ensembl IDs:

adata_validated.raw = adata.raw[:, validated].to_adata()
adata_validated.raw.var.index = adata_validated.var.index
curate = ln.Curate.from_anndata(adata_validated, var_index=bt.Gene.ensembl_gene_id, categoricals={"cell_type": bt.CellType.name})
3 non-validated categories are not saved in Feature.name: ['percent_mito', 'n_genes', 'louvain']!
      → to lookup categories, use lookup().columns
      → to save, run add_new_from_columns
curate.validate()
✅ var_index is validated against Gene.ensembl_gene_id
💡 mapping cell_type on CellType.name
9 terms are not validated: 'Dendritic cells', 'CD19+ B', 'CD4+/CD45RO+ Memory', 'CD8+ Cytotoxic T', 'CD4+/CD25 T Reg', 'CD14+ Monocytes', 'CD56+ NK', 'CD8+/CD45RA+ Naive Cytotoxic', 'CD34+'
      → save terms via .add_new_from('cell_type')
False

Standardize & validate cell types

Since none of the cell types are validated, let us search for each cell type name in the public ontology and add the name found in the AnnData object as a synonym of the top match.

bionty = bt.CellType.public()  # access the public ontology through bionty
name_mapper = {}
for name in adata_validated.obs.cell_type.unique():
    # search the public ontology and use the ontology id of the top match
    ontology_id = bionty.search(name).iloc[0].ontology_id
    # create a record by loading the top match from bionty
    record = bt.CellType.from_source(ontology_id=ontology_id)
    name_mapper[name] = record.name  # map the original name to standardized name
    record.save()
    record.add_synonym(name)
✅ created 1 CellType record from Bionty matching ontology_id: 'CL:0001087'
✅ created 1 CellType record from Bionty matching ontology_id: 'CL:0000910'
✅ created 1 CellType record from Bionty matching ontology_id: 'CL:0000911'
✅ created 1 CellType record from Bionty matching ontology_id: 'CL:0000919'
✅ created 1 CellType record from Bionty matching ontology_id: 'CL:0000795'
✅ created 1 CellType record from Bionty matching ontology_id: 'CL:0002397'
✅ created 1 CellType record from Bionty matching ontology_id: 'CL:0001054'
✅ created 1 CellType record from Bionty matching ontology_id: 'CL:0002101'
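
To make the search-and-map loop easier to follow, here is a minimal sketch in plain Python. It does not use bionty: `TOY_ONTOLOGY`, `tokens`, and `search_top_match` are hypothetical stand-ins, the IDs are placeholders rather than real Cell Ontology entries, and the token-overlap score is an assumption chosen for simplicity, not the ranking that the real search uses.

```python
# Toy ontology: id -> canonical name. Ids and names are placeholders,
# not real Cell Ontology entries.
TOY_ONTOLOGY = {
    "TOY:0001": "B cell",
    "TOY:0002": "dendritic cell",
    "TOY:0003": "monocyte",
}


def tokens(text):
    """Lowercase, split on whitespace, and crudely strip a plural 's'."""
    result = set()
    for tok in text.lower().split():
        if len(tok) > 1 and tok.endswith("s"):
            tok = tok[:-1]
        result.add(tok)
    return result


def search_top_match(query, ontology):
    """Return the id whose name shares the most tokens with the query."""
    query_tokens = tokens(query)
    return max(ontology, key=lambda oid: len(query_tokens & tokens(ontology[oid])))


name_mapper = {}
synonyms = {oid: [] for oid in TOY_ONTOLOGY}
for name in ["Dendritic cells", "CD19+ B", "CD14+ Monocytes"]:
    top_id = search_top_match(name, TOY_ONTOLOGY)
    name_mapper[name] = TOY_ONTOLOGY[top_id]  # original -> standardized name
    synonyms[top_id].append(name)  # remember the original as a synonym

print(name_mapper)
print(synonyms)
```

Because the top hit is accepted without review, it is worth inspecting the resulting mapping before relying on it.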

We can now standardize cell type names using the search-based mapper:

adata_validated.obs.cell_type = adata_validated.obs.cell_type.map(name_mapper)
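
When a pandas Series is mapped with a dict, values without an entry in the dict become missing values, so it can be worth checking that the mapper covers every label before applying it. A minimal sketch in plain Python, with illustrative labels and a hypothetical helper name `find_unmapped`:

```python
def find_unmapped(values, mapper):
    """Return the sorted distinct values that have no entry in the mapper."""
    return sorted(set(values) - set(mapper))


# Illustrative labels and mapper, not the tutorial's actual data
labels = ["CD19+ B", "CD56+ NK", "CD19+ B", "CD34+"]
mapper = {"CD19+ B": "B cell", "CD56+ NK": "natural killer cell"}

unmapped = find_unmapped(labels, mapper)
print(unmapped)  # labels that would turn into missing values after mapping

# Apply the mapping only to covered labels; keep the others unchanged
standardized = [mapper.get(label, label) for label in labels]
print(standardized)
```

An empty `unmapped` list means every label has a standardized name.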

Now, all cell types are validated:

curate.validate()
✅ var_index is validated against Gene.ensembl_gene_id
✅ cell_type is validated against CellType.name
True

Register

artifact = curate.save_artifact(description="10x reference adata")
💡 path content will be copied to default storage upon `save()` with key `None` ('.lamindb/PVQPdii2M86mw4vu5rzD.h5ad')
✅ storing artifact 'PVQPdii2M86mw4vu5rzD' at '/home/runner/work/lamin-usecases/lamin-usecases/docs/test-scrna/.lamindb/PVQPdii2M86mw4vu5rzD.h5ad'
💡 parsing feature names of X stored in slot 'var'
754 terms (100.00%) are validated for ensembl_gene_id
✅    linked: FeatureSet(uid='vsV0boaVVPopVNknEwNW', n=754, dtype='float', registry='bionty.Gene', hash='j8QkIeLBgJwsscY4vVPx1A', created_by_id=1, run_id=2)
💡 parsing feature names of slot 'obs'
1 term (25.00%) is validated for name
3 terms (75.00%) are not validated for name: n_genes, percent_mito, louvain
✅    linked: FeatureSet(uid='63npO4FhKufA5LHNeroV', n=1, registry='Feature', hash='PILtg5IzUbU0cgyGU-yzhw', created_by_id=1, run_id=2)
✅ saved 2 feature sets for slots: 'var','obs'
artifact.view_lineage()
_images/6340d5520c792ecb1fb6aaea08a506b99f61c11c5943687eb8b9b0bd524d19fb.svg

Append the dataset to the collection

Query the previous collection:

collection_v1 = ln.Collection.filter(
    name="My versioned scRNA-seq collection", version="1"
).one()

Create a new version of the collection by sharding it across the new artifact and the artifact underlying version 1 of the collection:

collection_v2 = ln.Collection(
    [artifact, collection_v1.ordered_artifacts.first()],
    is_new_version_of=collection_v1,
)
collection_v2.save()
💡 adding collection ids [1] as inputs for run 2, adding parent transform 1
💡 adding artifact ids [1] as inputs for run 2, adding parent transform 1
✅ saved 1 feature set for slot: 'var'
Collection(uid='8mjQ7FUidtGoWswBgKvj', version='2', name='My versioned scRNA-seq collection', hash='Umjxg4HR1wkZqKROsyz1sw', visibility=1, created_by_id=1, transform_id=2, run_id=2, updated_at='2024-08-06 09:39:22 UTC')
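
Conceptually, the new version is a new record that points back to its predecessor and lists both artifacts. The dataclass below is a toy model of that idea only. `ToyCollection` and its fields are hypothetical and do not reflect how lamindb stores collections.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ToyCollection:
    name: str
    version: int
    artifacts: tuple
    previous: Optional["ToyCollection"] = None

    def new_version(self, *new_artifacts):
        """Create the next version: new artifacts first, then the existing ones."""
        return ToyCollection(
            name=self.name,
            version=self.version + 1,
            artifacts=tuple(new_artifacts) + self.artifacts,
            previous=self,
        )


v1 = ToyCollection("My versioned scRNA-seq collection", 1, ("artifact_1",))
v2 = v1.new_version("artifact_2")

print(v2.version, v2.artifacts)
print(v1.artifacts)  # version 1 is left untouched
```

Keeping version 1 immutable means that anything referring to it still sees the same artifacts after version 2 is created.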

Version 2 of the collection covers significantly more conditions.

collection_v2.describe()
Collection(uid='8mjQ7FUidtGoWswBgKvj', version='2', name='My versioned scRNA-seq collection', hash='Umjxg4HR1wkZqKROsyz1sw', visibility=1, updated_at='2024-08-06 09:39:22 UTC')
  Provenance
    .created_by = 'testuser1'
    .transform = 'Standardize and append a batch of data'
    .run = '2024-08-06 09:38:53 UTC'
  Feature sets
    'obs' = 'donor', 'tissue', 'cell_type', 'assay'
    'var' = 'MIR1302-2HG', 'FAM138A', 'OR4F5', 'None', 'OR4F29', 'OR4F16', 'LINC01409', 'FAM87B', 'LINC01128', 'LINC00115', 'FAM41C'

View data lineage:

collection_v2.view_lineage()
_images/7d1758759dcb4baf73586934d6b06dbf0b7dc5cfb5edc88c4a5632bed59a73c2.svg