Breaking out of clusters: To cluster or not to cluster your scRNA-seq data

A new framework for cluster free differential expression analysis

Apr 15, 2023

Dear Community,

Tools for data analysis are often the most utilised and heavily relied on when it comes to single-cell omics. We as researchers tend to rely on a core set of tools to analyse and interpret complex biological data often. Once in a while we are presented with a new tool that would provide us with an opportunity for reinterpretation of our data, I am inclined always to try it out and see how different are my results and interpretation when I am utilizing a new tool from my “tool belt“.

In today’s summary, we will go through a new tool that has been proposed as an alternative or rather a complement to standard gene expression analysis. I hope you enjoy this read and as always sharing is caring for the community. So,

As always, take action to subscribe now. Hit that button and share it with your friends and colleagues. Leave a comment as it helps with the algorithms :)

Crowds of cells in a niche. Generated using Blue Willow

As a young kid getting sick was often a weird feeling but yet an exciting one. To be able to stay home and not attend my dreaded math lessons and yet miss the biology classes where I was gently introduced to the world of plants, animals and eventually diseases, was much of my mental routine during sick periods. When it comes to diseases, especially of infectious origins, a person originating from the subtropical region has encountered them more frequently than in several parts of the world. Diseases are such fascinating and complex processes, but during childhood, I was often shown a reductionist view of diseases. You seem to have caught a bug so take a few days off to rest and all will be good.

As the years progressed, I found myself reading about the minutiae of the disease process. During my courses in oncology and infection biology, I came to understand the complex dynamic processes of a disease, with various cell types among various tissues and organs playing one of the most beautiful orchestrated dance performances that you will ever see. Cells that are challenged by pathogens or the ones which are hyperactive, survival-endowed, scrappy, fecund, inventive copies of their predecessors (aka Cancer) undergo sets of molecular changes within themselves and their surroundings. Understanding these changes has been the subject of inquiry for generations of scientists all around the world. What started as a tedious and observational-based study of the differences in morphology and gross descriptive characteristics of the cells and tissues has now evolved into cell-specific analysis employing advanced techniques such as single-cell omics to employ an unbiased approach to cell phenotypes in diseases.

Performing comparative analysis is a scientific practice where often two groups such as disease and healthy states provide us with the necessary information to see what processes within the cells have been perturbed by the disease. For most studies this case-control design with several replicates allows us to have powerful insights that can be leveraged to design drugs to fight the disease condition. These types of studies hold true to single-cell omics where one measures the changes in one of the several layers of information within the cell. In a standard workflow commonly used to identify these changes, especially at an RNA level, comparative methods such as differential abundance or expression.

Differential abundance (DA) tests whether different cell types change their relative abundance between conditions.
Differential expression (DE) identifies which genes that are expressed at different levels between conditions.

Both of the above methods have been successfully used in bulk and single-cell level to give us all valuable insights across disease spectra. The current single-cell methods for such tests require that the cells are grouped into homogenous and similar cell types. These methods can help us understand DE between conditions and are not that sensitive when it comes to local transcriptional structure (ie what are the changes per gene within a cluster).

MiloDE: A sensitive cluster-free differential expression testing method.

In the current pre-print authors extend Milo which is a cluster-free differential abundance testing framework where cell abundances are estimated using kNN graph representation of scRNAseq data. We will try to go through the core principles of the framework and look into a few use cases proposed by the authors.

The method

The method uses a graph-based approach to leverage overlapping neighbourhoods in the scRNA-seq data. The Nearest neighbour graph (NNg) is a directed graph defined for a set of points in metric space. The NNg has a vertex for each point and a directed edge from point A to point B whenever B is closer or the nearest neighbour to A. A very simple way to put it is Imagine London, Newyork, Paris and Colombo being points in space. If you start mapping using kNN from London, the nearest neighbour would be Paris and you would connect these two points followed by connecting Paris to Newyork and consequently Newyork to Colombo. A clear explanation in the context of biology can be viewed in the video below.

Although a widely used approach for the detection of cell types kNNs have their limitations. One limitation is the sensitivity to the local density of the data. On the dependence of k there is a chance that rare cell types can get connected and therefore get reported as similar cell types. To improve overall sensitivity, the authors propose a 2nd order kNN graph in which a standard kNN graph is amended with additional edges between 2 cells that share at least one neighbouring cell. Going back to our cities analogy with a few more cities, the 2nd-order kNN graph can be approximated as below. As you can see there are some more local connections available than a 1st-order kNN graph (Keep in mind that this is an approximation of the concept).

The construction of a graph to recapitulate the distances between single cells the algorithm takes in SingleCellExperiment object containing count matrix for all the samples combined. The matrix should consist of entries that can identify sample IDs and conditions to be tested. It also utilises reducedDim slot that stores low-dimensional representations of single-cell datasets.

Consequently, Graph construction is performed using kNN graphs to represent transcriptional relationships between cells and assign neighbourhoods.

1st-order kNN graph: standard kNN graph
2nd-order kNN graph: 1st-order graph augmented with edges between cells if they share at least a cell within their neighbours. Neighbourhoods with centre cell c are defined as all cells that are connected with it by the edges.

The construction of the 1st order graph uses a user-defined k and the graph generated is used to select index cells. This is iterated post indexing and selection to construct the 2nd-order kNN. This eventually generates a cell x neighbourhood matrix with boolean identifiers of cell inclusion in the neighbourhood (ie TRUE/FALSE for a particular neighbourhood based on cell c).

An optional step recommended by the authors is Graph refinement wherein they identify neighbourhoods and discard any cells with unassigned neighbourhoods. They frame this as a Set cover problem and implement greedy algorithm. This is followed by DE testing, where four crucial steps occur.

Selection of neighbourhoods to be tested:
- In this step, they select neighbourhoods that show signs of shifts in expression (perturbed) and discard neighbourhoods that have no shifts of expression.
- To identify perturbed neighbourhoods they adapted Augur to build a Random forest classifier that predicts the neighbourhood and returns AUC (learn more about AUC here)
Selection of genes to be tested:
- By employing edgeR:filterByExpr they determine which genes have sufficient counts to be considered for DE testing.
- The procedure is performed for each neighbourhood and likely results in different sets of genes being tested within each neighbourhood.
DE testing within the neighbourhood.
- Utilizing the quasi-likelihood testing from edgeR they filter out all samples and aggregate counts to create pseudo-bulks that mimic bulk RNA-seq data followed by genes selected above to create library sizes in order to correct biases.
- Covariates from the experimental design are then integrated and Negative binomial dispersions is estimated as a function of gene abundance.
- They use a generalised linear model to perform DE testing and the output generated has tested genes and estimated logFC, p-value and FDR metrics that are commonly used to plot and visualize changes.
Multiple testing corrections across neighbourhoods.
- They use a spatial correction approach for each gene and correct calculated p-values across the tested neighbourhoods.
- By leveraging the graph representation used to assign cells to neighbourhoods and applying the Benjamini-Hochberg correction approach they are able to perform multiple testing

So you might be wondering, what this does and how it is useful for your own data. To this extent, the authors have given us some use cases. They created a chimeric mouse embryo dataset, in which tdTomato+ (a marker that can be used to visualize cells like GFP) mouse embryonic cells containing a Tal1 (T-cell acute lymphocytic leukaemia protein 1, plays various roles such as erythrocyte differentiation and haematopoiesis) knockout.

Using miloDE to investigate if it can reveal additional information on the extensive transcriptional changes upon perturbation. They calculated the number of DE genes as well as DE genes that are highly specific to their neighbourhoods and consequently regrouping neighbourhoods by cell type identity.

They show that:
- Endothelial and haematoendothelial progenitors are enriched to be perturbed.
- Other cell types show a considerably lower degree of perturbation.
- Absence of Tal1 results in transcriptional changes that were expected in these two cell types and to a much lesser degree in others.
- They also systematically assess how haematopoiesis is disrupted in cells contributing to the blood lineage by performing various statistical analyses.
- These analyses led to the identification of 6 co-regulated modules, containing genes that are differentially expressed in coherent sub-regions of the manifold.
- They picked two modules which were complementary to one another. The scored metrics where calculated the correlation between each gene and distance to blood progenitors and suggested that :
  - Module 1: Contains genes down-regulated in Tal1 negative cells
  - Module 2: Contains genes that are up-regulated in the Tal1 negative cells and are heavily enriched for genes associated with cardiomyogenesis. This is consistent with previous studies which specify that the cell fate moves towards cardiac cell type when there is lack of Tal1

Modules of genes expressed and few examples of genes within the dataset

Another supporting sub-experiment they perform to validate the utility of MiloDE is the reconstruction of a meaningful timeline of macrophage activation during Idiopathic pulmonary fibrosis (IPF).

IPF is a debilitating disease and the main topic of my PhD studies. Being a disease of less frequency it is often overlooked by research with few labs over the world looking into understanding the mechanisms. IPF also happens to be a complex disease that involves various cell types in both the lung and immune environments. At its core, the soft tissue of the lung turns into fibrotic tissue leading to complications in breathing. Identifying drugs that can be repurposed for treatment is my PhD project, and It is not short of a monumental task.

The authors make a good point by stressing that DE approaches applied to a specific cell type will return a single logFC value for a single gene and it is often not possible to reconstruct the changes in a continuous pattern, which is often the ground truth scenario. Spp1 also known as Osteopontin plays a diverse role in physiology and often in IPF macrophages tend to express Spp1 reducing macrophage plasticity. Also, Spp1-producing macrophages occur in healthy patients at a lower abundance making it a difficult metric to be used as a DE gene marker for patients.

The authors propose that a cluster-free approach is necessary to study subtle transcriptional changes that arise in these populations of cells during disease progression and not only as an endpoint marker. For the analysis itself, they use cells from healthy and IPF donors and extract the cells annotated as macrophage population and assess the Differential Abundances(DA) across neighbourhoods within the macrophage manifold and suggest that LogFC of the DA can be used as a proxy for disease progression.

Consequently, by applying miloDE and characterizing different DE patterns using the Louvain clustering to group genes, they identify 4 gene sets. the observations indicate neighbourhoods strongly associated with each gene set were enriched for specific stages of disease progression. By utilizing logFC-DA to systematically asses the relevance of each gene set along the phenotypic progression of fibrotic macrophages they calculate the average fraction of significant genes and average logFC.

This study resulted in identified four gene sets associated with disease stages. The first two contain DE at the early to mid stages of the disease and the last two at the late stages of the disease. this clear stratification can eventually help classify the disease progression and the genes could be theoretically used as markers for the disease.

IPF-1 gene set: Significantly enriched for inflammatory response genes.
IPF-2 gene set: Associated with initial defensive response against pathogens that are one of the drivers of IPF progression.
IPF-3 gene set: Associated with hallmarks of an advanced stage of fibrosis with markers pertaining to extracellular matrix deposition.
IPF-4 gene set” contains genes marking late-stage fibrosis and processes like wound healing and collagen fibril organization.

Macrophage-specific gene sets for different points of IPF progression.

Code availability:

https://github.com/MarioniLab/miloDE

Key takeaway points:

Allows for an in-depth characterisation of various DE patterns across the whole dataset and within individual cell types.
Leverages a large number of replicates in order to enable sensitive and specific DE detection.
Using 2nd-order kNN graph representation provides an improved estimation for local graph density and mitigates some pitfalls of approximation of gene regulation.
There are some pitfalls and the authors acknowledge this (please read the article).