Intro to Spatial Analysis

In this notebook, we will be going through some basic spatial analyses. The goal is to provide you with an intuition of the logic behind each of these functions. Try not to get bogged down in understanding every single line of the code, but focus more on the overall reasoning behind what is being done. The code in this notebook is not optimized for large datasets; instead, it consists of simplified versions of more complex functions, to make it clearer what is actually being done.

If you would like to run similar analyses on your own multiplexed imaging datasets, please see our lab’s pipeline here: https://github.com/angelolab/ark-analysis. There are also other toolkits for spatial analysis, including Squidpy and MCMICRO.

import os
import numpy as np
import pandas as pd
import skimage.io as io
from skimage.segmentation import find_boundaries
import matplotlib.pyplot as plt
import matplotlib.cm as cm
from matplotlib import colors
from mpl_toolkits.axes_grid1 import make_axes_locatable
from scipy.spatial.distance import cdist
from scipy import stats
import random
from sklearn.cluster import KMeans
import seaborn as sns

1. Inspect your data

Mantis Viewer or napari can be useful for visualizing your data, but it’s always a good idea to be able to open, view, and manipulate your data in Python, as this gives you more flexibility during analysis.

%matplotlib notebook

# Directory where the data lives
data_dir = "example_data"

# Example image
ex_fov = "fov1"

# Look at a few markers
markers = ["CD45","Collagen1"]

# Set-up plots
plt.rcParams['figure.figsize'] = [10, 5]
fig, ax = plt.subplots(1,len(markers), sharex=True, sharey=True)
for i,mark in enumerate(markers):
    im_array = np.array(io.imread(os.path.join(data_dir, ex_fov, "image_data", mark+".tiff")))
    ax[i].imshow(im_array, origin="lower", cmap='gray', vmax=np.quantile(im_array,0.99))
    ax[i].set_title(mark)
    ax[i].axis('off')
plt.tight_layout()

We have already segmented these images to identify the location of single cells in the image using Mesmer. If you are interested in applying Mesmer to your own data, you can see the notebook here.

We can inspect the output of Mesmer here. In the segmentation output, each cell has a unique label. For example, all pixels with the value of 1 belong to the same cell, all pixels with the value of 2 belong to another cell, etc.

# Read in segmentation mask
seg_path = os.path.join(data_dir, ex_fov, "masks", "segmentation_whole_cell.tiff")
seg_array = np.array(io.imread(seg_path)).squeeze() #squeeze changes dimensions from (1,2048,2048) to (2048,2048)

# Set up plot
fig, ax = plt.subplots(figsize=[5,5])
plt.imshow(seg_array)
plt.axis('off')
plt.tight_layout()
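
As a quick check, we can count how many cells were segmented and pull out the mask of a single cell. This is a minimal sketch, assuming seg_array has been loaded as above.

# Count segmented cells: every nonzero value in the mask is a unique cell label
cell_labels = np.unique(seg_array)
num_cells = len(cell_labels) - 1 # subtract 1 to exclude the background label 0
print("Number of cells in this FOV:", num_cells)

# Boolean mask of a single cell, e.g. the cell with label 1
single_cell_mask = seg_array == 1
print("Number of pixels in cell 1:", np.sum(single_cell_mask))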

Once we have segmented our data, we can generate cell tables where each row corresponds to one cell that was identified in the image. We have already identified the phenotype of each cell using Pixie. If you are interested in running Pixie on your own data, see the notebook here.

The ‘label’ column in the table corresponds to the cell IDs (pixel values in the segmentation output), the ‘centroid-0’ and ‘centroid-1’ columns correspond to the center point of each cell, and the ‘cell_cluster’ column is the cell phenotype that we determined using Pixie.

# Read in cell table
cell_table_path = os.path.join(data_dir, "cell_table.csv")
cell_table = pd.read_csv(cell_table_path)

# Subset for only the example fov we're looking at here
fov_cell_table = cell_table.loc[cell_table['fov'] == ex_fov]
fov_cell_table
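
As a sanity check, we can confirm that the ‘label’ column matches the pixel values in the segmentation mask. This is a small sketch that assumes ‘centroid-0’ is the row coordinate and ‘centroid-1’ is the column coordinate (the skimage regionprops convention); note that the centroid of a very irregular, non-convex cell can fall outside the cell itself.

# Check the first few cells: the pixel at each centroid should usually carry
# the cell's own label
for _, cell in fov_cell_table.head(5).iterrows():
    row, col = int(cell['centroid-0']), int(cell['centroid-1'])
    print("label:", cell['label'], "-> label at centroid pixel:", seg_array[row, col])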

We can then create a cell phenotype map, where each cell is colored according to its cell phenotype.

# Define colors we want for each cell type
all_colors = {}
all_colors['APC'] = '#4E79A7'
all_colors['B'] = '#A0CBE8'
all_colors['Cancer'] = '#F28E2B'
all_colors['Cancer_EMT'] = '#FFBE7D'
all_colors['Cancer_Other'] = '#59A14F'
all_colors['CD4T'] = '#8CD17D'
all_colors['CD8T'] = '#B6992D'
all_colors['Endothelium'] = '#F1CE63'
all_colors['Fibroblast'] = '#499894'
all_colors['Immune_Other'] = '#86BCB6'
all_colors['M1_Mac'] = '#E15759'
all_colors['M2_Mac'] = '#FF9D9A'
all_colors['Mac_Other'] = '#79706E'
all_colors['Mast'] = '#D4A6C8'
all_colors['Monocyte'] = '#D37295'
all_colors['Neutrophil'] = '#FABFD2'
all_colors['NK'] = '#B07AA1'
all_colors['Other'] = '#BAB0AC'
all_colors['Stroma'] = '#9D7660'
all_colors['T_Other'] = '#D7B5A6'
all_colors['Treg'] = '#FFFF99'

# Create table matching each color to a unique ID
colors_list = [(key,value) for key,value in all_colors.items()]
all_colors_df = pd.DataFrame(colors_list, columns=['cell_cluster','color'])
all_colors_df['pheno_id'] = all_colors_df.index + 1

# Make color map for plotting
mycols = all_colors_df['color'].tolist()
mycols.insert(0,'#000000') # add black for empty slide, will have id 0
mycols.append('#FFFFFF') # add white for cell borders, will have id max_n+1
colmap = colors.ListedColormap(mycols)
max_n = np.max(all_colors_df['pheno_id'])
bounds = [i-0.5 for i in np.linspace(0,max_n+2, max_n+3)]
norm = colors.BoundaryNorm(bounds, colmap.N)

# Define function for making cell phenotype map (CPM)
def create_cpm(fov_name, cell_table, all_colors_df, seg_array, fig, ax):
    # Subset cell table for this FOV
    one_fov_cell_table = cell_table.loc[cell_table['fov'] == fov_name]
    # Merge with the color table to get the phenotype ID for each cell
    one_fov_cell_table = one_fov_cell_table.merge(all_colors_df, how='left')
    # Make dictionary mapping each cell to its phenotype id
    fov_cell_dict = dict(zip(one_fov_cell_table['label'], one_fov_cell_table['pheno_id']))
    # Add 0 for empty slide
    fov_cell_dict[0] = 0
    
    # Make new image where each pixel corresponds to its phenotype id
    # Use numpy's 'vectorize' to apply the dictionary lookup to every pixel
    cpm_array = np.vectorize(fov_cell_dict.get)(seg_array)

    # Find the borders of cells
    predicted_contour_mask = find_boundaries(seg_array, connectivity=1, mode='inner').astype(np.uint8)
    # Color these borders white
    cpm_array[predicted_contour_mask > 0] = max_n+1

    # Plot
    cpm_image = colmap(norm(cpm_array))
    ax.imshow(cpm_image)
    
    return

# Create CPM for example FOV
fig, ax = plt.subplots(figsize=[8,8])
create_cpm(ex_fov, cell_table, all_colors_df, seg_array, fig, ax)
plt.axis('off')

# Add colorbar to image
divider = make_axes_locatable(fig.gca())
cax = divider.append_axes(position="right", size="5%", pad="3%")
cbar = fig.colorbar(cm.ScalarMappable(norm=norm, cmap=colmap),
                    cax=cax, orientation="vertical", use_gridspec=True, pad=0.1,
                    shrink=0.9, drawedges=True)
cbar_labels = all_colors_df['cell_cluster'].to_list()
cbar_labels.insert(0,'Empty') # add black for empty slide, will have id 0
cbar_labels.append('Cell border') # add white for cell borders, will have id max_n+1

cbar.ax.set_yticks(
    ticks=np.arange(len(cbar_labels)),
    labels=cbar_labels
)
cbar.minorticks_off()

plt.tight_layout()

2. Quantifying cell populations

There are many methods for cell enumeration, including counting the number of cells of each cell type, calculating cell frequency by dividing by the total number of cells, and calculating cell density by dividing by the total tissue area. While the first approach is the most straightforward, simple cell counts depend on the amount of tissue present in the image. Cell frequencies normalize the number of each cell type to the total number of cells in the image, such that all cell types sum to 1. As a result, cell frequencies are not confounded by differences in the size of the section. However, one drawback of cell frequencies is that they can obscure the reason that the amount of one cell type differs between samples. For example, consider a simple comparison of tissue resident macrophages in healthy and inflamed tissue. Even if the number of macrophages is the same in both images, the influx of infiltrating immune cells in the inflamed tissue increases the total cell count, so the macrophage frequency drops, making it seem like tissue resident macrophages are decreasing in the latter relative to the former. In reality, however, the absolute count of tissue resident macrophages is not changing; they are just outnumbered by infiltrating immune cells. Only considering cell frequencies in such a scenario would lead to the incorrect conclusion that macrophages are decreasing in the inflamed state. One solution is to also examine the density of each cell population by dividing by the total tissue area.
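
To make the macrophage example concrete, here is a toy calculation with made-up numbers (purely illustrative, not from this dataset):

# Hypothetical counts: the macrophage count is identical in both tissues,
# but infiltrating immune cells inflate the total cell count when inflamed
toy = pd.DataFrame({
    'tissue': ['healthy', 'inflamed'],
    'macrophages': [100, 100],
    'total_cells': [1000, 5000],
    'tissue_area_um2': [250000, 250000],
})
toy['mac_frequency'] = toy['macrophages'] / toy['total_cells'] # drops from 0.10 to 0.02
toy['mac_density'] = toy['macrophages'] / toy['tissue_area_um2'] # unchanged
toy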

Here, we are showing an example of two FOVs that have different amounts of immune infiltrate, ECM content, and total tissue area (as determined using the slide background mask). We have generated masks of the ECM (using the composite signal of ECM markers) and empty slide (for MIBI, we can determine empty slide by measuring gold signal, since we use gold-sputtered slides). Using the empty slide mask, we can estimate tissue area (by taking the inverse). In the ECM and empty slide masks shown here, white indicates ECM or empty slide, respectively.

# Example images
fov1 = "fov1"
fov2 = "fov2"

# Set-up plots
plt.rcParams['figure.figsize'] = [10,6]
fig, ax = plt.subplots(2,3)

# Look at cell phenotype maps, ECM mask, and gold mask
for i,fov in enumerate([fov1,fov2]):
    # CPMs
    seg_path = os.path.join(data_dir,fov,"masks","segmentation_whole_cell.tiff")
    seg_array = np.array(io.imread(seg_path)).squeeze()
    create_cpm(fov, cell_table, all_colors_df, seg_array, fig, ax[i,0])
    ax[i,0].set_title("Cell phenotype map")
    ax[i,0].set_ylabel(fov)
    ax[i,0].set_xticks([])
    ax[i,0].set_yticks([])
    
    # ECM
    ecm_path = os.path.join(data_dir,fov,"masks","total_ecm.tiff")
    ecm_array = np.array(io.imread(ecm_path))
    ax[i,1].imshow(ecm_array, cmap='gray')
    ax[i,1].axis('off')
    ax[i,1].set_title("ECM")
    
    # Empty slide
    empty_slide_path = os.path.join(data_dir,fov,"masks","empty_slide.tiff")
    empty_slide_array = np.array(io.imread(empty_slide_path))
    ax[i,2].imshow(empty_slide_array, cmap='gray')
    ax[i,2].axis('off')
    ax[i,2].set_title("Empty slide")

plt.tight_layout()

Simply count the number of each cell type

cell_table_keep = cell_table.loc[cell_table['fov'].isin([fov1,fov2])]
count_cells = cell_table_keep.groupby('fov')['cell_cluster'].value_counts().reset_index(name='count')
count_cells

Calculate frequency (divided by total number of cells)

# Get total number of cells per FOV
total_counts = cell_table_keep.groupby('fov').size().to_frame('total_cells')
total_counts = total_counts.reset_index()

# Calculate frequency
count_cells = count_cells.merge(total_counts, on='fov')
count_cells['frequency'] = count_cells['count'] / count_cells['total_cells']
count_cells

Calculate density (divided by tissue area)

# Calculate tissue area (inverse of empty slide)
def get_tissue_area(fov_name):
    empty_slide_path = os.path.join(data_dir,fov_name,"masks","empty_slide.tiff")
    empty_slide_array = np.array(io.imread(empty_slide_path))
    # Get total pixels that belong to empty slide
    empty_slide_px = np.sum(empty_slide_array) #pixel value is 1 if it is empty slide
    # Get number of pixels that belong to tissue (total image - empty slide)
    total_size = empty_slide_array.shape
    tissue_area_px = (total_size[0]*total_size[1]) - empty_slide_px
    # These images are 2048px x 2048px, imaged at 800um x 800um
    # Therefore, 1 pixel is 800um/2048px = 0.39 um on a side, so 1 pixel covers ~0.39^2 = 0.15 um^2
    return (fov_name, tissue_area_px*0.15) # tissue area in um^2

# Calculate for all FOVs
all_tissue_px = [get_tissue_area(fov) for fov in [fov1,fov2]]
all_tissue_df = pd.DataFrame(all_tissue_px, columns=['fov','area'])

# Calculate density
count_cells = count_cells.merge(all_tissue_df, on='fov')
count_cells['density'] = count_cells['count'] / count_cells['area']
count_cells

Compare

cell_type = 'CD4T'
print(cell_type)
for fov in [fov1, fov2]:
    fov_row = count_cells.loc[(count_cells['fov'] == fov) & (count_cells['cell_cluster'] == cell_type)]
    print('FOV:', fov,
          ', total_counts:', fov_row['count'].values[0],
          ', frequency:', fov_row['frequency'].values[0],
          ', density:', fov_row['density'].values[0])

Additional exercises

  1. Change the “cell_type” in the code block above and evaluate how the quantification method is different for different cell types.
  2. Create a stacked bar plot showing the cell type composition of each fov (x axis should be fov1 and fov2, y axis is cell type composition). Try making this plot with counts, frequency, and density (one possible starting point is sketched below).
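
One possible starting point for exercise 2 (a sketch; it reuses the count_cells table and the all_colors dictionary from above):

# Pivot the counts table so rows are FOVs and columns are cell types, then
# use pandas' built-in stacked bar plot
# Swap 'count' for 'frequency' or 'density' to compare the three metrics
plot_dat = count_cells.pivot(index='fov', columns='cell_cluster', values='count').fillna(0)
plot_dat.plot(kind='bar', stacked=True, figsize=(6,5),
              color=[all_colors.get(c, '#BAB0AC') for c in plot_dat.columns])
plt.ylabel('count')
plt.legend(bbox_to_anchor=(1.02, 1), loc='upper left', fontsize='small')
plt.tight_layout()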

3. Cell-cell enrichment (global)

The goal of pairwise enrichment analysis is to assess whether any two given cell populations colocalize with each other. For example, this approach could be used to determine if T cells preferentially colocalize with tumor cells. In asking such questions, we recommend taking this one step further: Do two cell populations colocalize with each other more often than would be expected by chance? Depending on the frequency of the cell populations and native tissue structure, it may not be possible to determine whether some pairwise relationships are truly preferential or random. The goal of this approach is to minimize potential confounding effects specific to each image that might not be related to the biological question of interest. For example, in a tissue section equally composed of only two cell populations, pairwise enrichment is likely to occur simply by random chance.

With this in mind, we can assess the statistical significance of pairwise enrichment by comparing how often two cell populations are found in close proximity to each other compared with a null distribution. To construct a null distribution, we use a bootstrapping approach by selectively randomizing the location of the cell populations of interest, calculating pairwise enrichment, and repeating this process a large number of times (>100 times is typical). In the assessment of preferential enrichment for two cell populations defined by their expression of marker A or B, the simplest way to generate a null distribution is to randomize the location of one cell population across all cells in the image while keeping the location of the other cell population fixed. For both the original and randomized images, cells positive for A or B are colocalized if the distance between them is less than or equal to a user-defined value. The number of A–B interactions in the original image is then compared with the null distribution to determine statistical significance.

# Example image
ex_fov = "fov2"

# Determine cell types to look at
pheno1 = "Cancer"
pheno2 = "CD8T"

# Threshold to determine if two cells are "close"
dist_thresh = 50

# Number of bootstraps for generating null distribution
bootstrap_n = 100

Calculate the distance between all cells in the FOV

# Subset cell table for only cells in this FOV
fov_cell_table = cell_table.loc[cell_table['fov'] == ex_fov].reset_index(drop=True)
# Make list of all cell centroids
all_centroids = list(zip(fov_cell_table['centroid-0'],fov_cell_table['centroid-1']))
# Get distance between all cells
dist_mat = cdist(all_centroids, all_centroids, 'euclidean')
# Print dimensions of distance matrix
print("Dimensions of dist_mat: ", dist_mat.shape)

dist_mat

This distance matrix has dimensions of total number of cells x total number of cells. Each value in the matrix is the distance between a pair of cells. The row and column indices of the distance matrix correspond to the indices of the cell table, so the first row of the distance matrix contains the distances between the first cell in the cell table and all other cells in the image. After calculating the distance between all cells, we can subset the matrix for our cell types of choice. We can then count the number of close contacts between two cell types.
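
For example, to look up the distance between two specific cells by their segmentation labels (a short sketch; the labels used here are arbitrary examples):

# The positional index of each cell in fov_cell_table (after reset_index)
# matches its row/column in dist_mat
label_a, label_b = 1, 2 # example labels, assuming both exist in this FOV
idx_a = fov_cell_table.index[fov_cell_table['label'] == label_a][0]
idx_b = fov_cell_table.index[fov_cell_table['label'] == label_b][0]
print("Distance between cells", label_a, "and", label_b, ":", dist_mat[idx_a, idx_b])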

Count number of close contacts between pheno1 and pheno2

# Get index of cells belonging to pheno1 and pheno2
pheno1_idx = fov_cell_table[fov_cell_table['cell_cluster'] == pheno1].index.to_list()
pheno2_idx = fov_cell_table[fov_cell_table['cell_cluster'] == pheno2].index.to_list()

# Only keep pheno1 cells in x-axis of distance matrix
pheno1_dist_mat = dist_mat[pheno1_idx,:]
# Binarize the distance matrix for distances that are within the defined threshold
bin_mask = (pheno1_dist_mat < dist_thresh) & (pheno1_dist_mat > 0)
# Change true/false to 1/0
pheno1_dist_mat_bin = bin_mask*1

# Subset this distance matrix for pheno2 cells in y-axis of distance matrix
true_dist_mat_bin = pheno1_dist_mat_bin[:,pheno2_idx]
# Inspect the shape of this matrix, should be number of cells of pheno1 x number of cells of pheno2
# Each element indicates whether a pheno1 cell and a pheno2 cell are within the distance threshold (1) or not (0)
print("Shape of subsetted distance matrix: ", true_dist_mat_bin.shape)

# For each pheno1 cell, count number of "close" contacts with pheno2 cells
true_close_contacts = np.sum(true_dist_mat_bin, axis=1)
# Take the average across all pheno1 cells
true_close_contacts_mean = np.mean(true_close_contacts)
print("Average number of close contacts between pheno1 and pheno2 cells: ", true_close_contacts_mean)

Above, we have determined the average number of close contacts between pheno1 and pheno2 cells in this image. To determine whether this number is higher or lower than expected by chance, we can compare it against a null distribution, generated by randomly permuting cell labels. We keep the pheno1 cells fixed, then randomize the location of pheno2 cells in the image. For each randomization, we calculate the number of close contacts. We repeat this many times to generate the null distribution.

Generate null distribution by bootstrapping

# Get all possible cell indices (total pool of available cells to randomize)
all_idx = fov_cell_table.index.to_list()
# Remove cells that are of pheno1 from this pool (since they are held constant in this randomization)
all_idx = [x for x in all_idx if x not in pheno1_idx]
# Get total number of cells that are pheno2
num_pheno2 = len(pheno2_idx)

# Randomly sample all cells to be labeled as pheno2 (bootstrapping)
all_bootstrap = []
for _ in range(bootstrap_n):
    # Select num_pheno2 random indices, representing the randomly selected cells
    random_pheno2_idx = random.sample(all_idx, num_pheno2)
    # Subset the distance matrix to only keep these randomly selected cells
    keep_dist_mat_bin = pheno1_dist_mat_bin[:,random_pheno2_idx]
    # Find the total number of close contacts between pheno1 cells and randomly selected cells
    close_contacts = np.sum(keep_dist_mat_bin, axis=1)
    # Take the mean across all cells of pheno1
    close_contacts_mean = np.mean(close_contacts)
    # Add this value to the list of all bootstraps
    all_bootstrap.append(close_contacts_mean)

Compare null distribution to actual number of close contacts

fig, ax = plt.subplots(figsize=(5,3))
# Blue histogram is null distribution
ax.hist(all_bootstrap, density=True,  bins=10, alpha=0.5)
# Red line is actual number of close contacts
plt.axvline(x=true_close_contacts_mean, color='red', linestyle='--', linewidth=2)
plt.show()
# Calculate statistics of null distribution
muhat, sigmahat = stats.norm.fit(all_bootstrap)
# Calculate z score based on distribution
z = (true_close_contacts_mean - muhat) / sigmahat
print("z-score: ", z)

Additional exercises

  1. Flip “pheno1” and “pheno2” - is this calculation symmetric? Conceptually, should it be symmetric?
  2. Play around with the distance threshold for determining “close” cells.
  3. Play around with “pheno1” and “pheno2” and evaluate these metrics for various pairs of cells.
  4. Pick a few different pheno1-pheno2 pairs. For one FOV, loop through this pairwise enrichment function for these pairs of cells. Store the output in a table (hint: column names could be “pheno1”, “pheno2”, “z-score”; one possible scaffold is sketched after this list).
  5. Repeat this calculation for the second FOV.
  6. Compare z-scores between fov1 and fov2.
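
One possible scaffold for exercises 4-6 (a sketch; the function name, column names, and phenotype pairs are just suggestions):

# Wrap the steps above into a single function that returns a z-score for one
# phenotype pair in one FOV (assumes both phenotypes are present in the FOV)
def pairwise_enrichment(fov_name, pheno1, pheno2, dist_thresh=50, bootstrap_n=100):
    # Subset the cell table and compute the distance matrix, as above
    fov_table = cell_table.loc[cell_table['fov'] == fov_name].reset_index(drop=True)
    centroids = list(zip(fov_table['centroid-0'], fov_table['centroid-1']))
    dists = cdist(centroids, centroids, 'euclidean')
    pheno1_idx = fov_table[fov_table['cell_cluster'] == pheno1].index.to_list()
    pheno2_idx = fov_table[fov_table['cell_cluster'] == pheno2].index.to_list()
    # Binarized "close" matrix for pheno1 rows
    pheno1_bin = ((dists[pheno1_idx,:] < dist_thresh) & (dists[pheno1_idx,:] > 0))*1
    true_mean = np.mean(np.sum(pheno1_bin[:,pheno2_idx], axis=1))
    # Null distribution: randomize which cells are labeled as pheno2
    pool = [x for x in fov_table.index.to_list() if x not in pheno1_idx]
    null_means = []
    for _ in range(bootstrap_n):
        rand_idx = random.sample(pool, len(pheno2_idx))
        null_means.append(np.mean(np.sum(pheno1_bin[:,rand_idx], axis=1)))
    muhat, sigmahat = stats.norm.fit(null_means)
    return (true_mean - muhat) / sigmahat

# Loop over a few pairs and collect the z-scores in a table
pairs = [("Cancer", "CD8T"), ("Cancer", "CD4T"), ("B", "CD4T")]
results = pd.DataFrame(
    [(p1, p2, pairwise_enrichment("fov1", p1, p2)) for p1, p2 in pairs],
    columns=["pheno1", "pheno2", "z-score"])
results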

4. Cell-cell enrichment (context dependent)

This randomization strategy is robust for mitigating confounding effects attributable to the frequency of each cell population. However, this approach does not control for biases inherent to tissue structure. Consequently, interactions between two proteins may appear to be spatially enriched as a result of tissue structure or cell-specific expression, as markers expressed exclusively by certain cells will be heavily influenced by the location of those cells. For example, two immune cell populations that are known to preferentially localize in germinal centers are biased to be enriched with each other. While one possibility is that this relationship is indeed preferential, another possibility is that spatial enrichment is merely a result of both cell types being restricted to a smaller histological compartment.

To address this possibility, null distributions can also be generated in a context-dependent manner. In context-dependent spatial enrichment analysis, randomizations are restricted to occur only within a given cellular compartment or cell subset. In the germinal center example, randomizations can be restricted to occur only within the germinal center. As a result, spurious enrichments that might occur between noninteracting cell populations that happen to colocalize to germinal centers are accounted for in the null distribution.

For this dataset, we have generated masks for the cancer core, cancer border, stroma core, and stroma border regions (using composite channels representative of each region).

Look at cancer and stroma masks

cancer_core = io.imread(os.path.join(data_dir, ex_fov, "masks", "cancer_core.tiff"))
cancer_border = io.imread(os.path.join(data_dir, ex_fov, "masks", "cancer_border.tiff"))
stroma_core = io.imread(os.path.join(data_dir, ex_fov, "masks", "stroma_core.tiff"))
stroma_border = io.imread(os.path.join(data_dir, ex_fov, "masks", "stroma_border.tiff"))

fig, ax = plt.subplots(1,4,figsize=[12,4])
ax[0].imshow(cancer_core, cmap='gray')
ax[0].axis('off')
ax[0].set_title("Cancer core")
ax[1].imshow(cancer_border, cmap='gray')
ax[1].axis('off')
ax[1].set_title("Cancer border")
ax[2].imshow(stroma_core, cmap='gray')
ax[2].axis('off')
ax[2].set_title("Stroma core")
ax[3].imshow(stroma_border, cmap='gray')
ax[3].axis('off')
ax[3].set_title("Stroma border")
plt.tight_layout()
plt.show()

Repeat the calculation for the number of close contacts, but subset for cells within the cancer core only

# Get segmentation mask
seg_path = os.path.join(data_dir, ex_fov, "masks", "segmentation_whole_cell.tiff")
seg_array = np.array(io.imread(seg_path)).squeeze()

# Make new segmentation mask where everything outside of the cancer core is 0
masked_seg_array = np.copy(seg_array)
masked_seg_array[cancer_core == 0] = 0
# Only keep cell labels that are within the cancer core
masked_cell_labels = np.unique(masked_seg_array)
masked_cell_labels = [x for x in masked_cell_labels if x != 0] #remove 0 (background, i.e. not a cell)
# Subset cell table for only the cells within the cancer core
fov_cell_table_masked = fov_cell_table[fov_cell_table['label'].isin(masked_cell_labels)]
# Get index of cells belonging to pheno1 and pheno2, but this time, only keep labels that are in the cancer core mask
pheno1_idx = fov_cell_table_masked[fov_cell_table_masked['cell_cluster'] == pheno1].index.to_list()
pheno2_idx = fov_cell_table_masked[fov_cell_table_masked['cell_cluster'] == pheno2].index.to_list()

# Only keep pheno1 cells in x-axis of distance matrix
pheno1_dist_mat = dist_mat[pheno1_idx,:]
# Binarize the distance matrix for distances that are within the defined threshold
bin_mask = (pheno1_dist_mat < dist_thresh) & (pheno1_dist_mat > 0)
# Change true/false to 1/0
pheno1_dist_mat_bin = bin_mask*1

# Subset this distance matrix for pheno2 cells in y-axis of distance matrix
true_dist_mat_bin = pheno1_dist_mat_bin[:,pheno2_idx]

# For each pheno1 cell, count number of "close" contacts with pheno2 cells
true_close_contacts = np.sum(true_dist_mat_bin, axis=1)
true_close_contacts_mean = np.mean(true_close_contacts)
print("Average number of close contacts between pheno1 and pheno2 cells: ", true_close_contacts_mean)

Generate null distribution, only randomizing within cancer core

# Get all possible cell indices (total pool of available cells to randomize)
all_idx = fov_cell_table_masked.index.to_list()
# Remove cells that are of pheno1 from this pool (since they are held constant in this randomization)
all_idx = [x for x in all_idx if x not in pheno1_idx]
# Get total number of cells that are pheno2
num_pheno2 = len(pheno2_idx)

# Randomly sample all cells to be labeled as pheno2 (bootstrapping)
all_bootstrap = []
for _ in range(bootstrap_n):
    # Select num_pheno2 random indices, representing the randomly selected cells
    random_pheno2_idx = random.sample(all_idx, num_pheno2)
    # Subset the distance matrix to only keep these randomly selected cells
    keep_dist_mat_bin = pheno1_dist_mat_bin[:,random_pheno2_idx]
    # Find the total number of close contacts between pheno1 cells and randomly selected cells
    close_contacts = np.sum(keep_dist_mat_bin, axis=1)
    # Take the mean across all cells of pheno1
    close_contacts_mean = np.mean(close_contacts)
    # Add this value to the list of all bootstraps
    all_bootstrap.append(close_contacts_mean)

Compare null distribution to actual number of close contacts

fig, ax = plt.subplots(figsize=(5,3))
# Blue histogram is null distribution
ax.hist(all_bootstrap, density=True,  bins=10, alpha=0.5)
# Red line is actual number of close contacts
plt.axvline(x=true_close_contacts_mean, color='red', linestyle='--', linewidth=2)
plt.show()
# Calculate statistics of null distribution
muhat, sigmahat = stats.norm.fit(all_bootstrap)
# Calculate z score based on distribution
z = (true_close_contacts_mean - muhat) / sigmahat
print("z-score: ", z)

Additional exercises

  1. Play with different regions and phenotype pairs.

5. Cellular microenvironments

Cell phenotypes are regulated in part by their local cellular niche, or microenvironment, which is defined as a collection of cell lineages found to be spatially colocalized and conserved across the data set. We can employ several methods to identify these microenvironments. The first step in this process is to quantify the local distribution of cell lineages around each cell in the data set. Here, we are enumerating the number of cells in each lineage within a set pixel radius from the index cell. Various machine learning methods can then be used to identify distinct cellular microenvironments that occur repeatedly across the data set. Here, we are using k-means clustering.

# FOVs to include in this analysis
all_fovs = ["fov1", "fov2"]

# Set distance for the neighborhood around the cell
nh_dist = 50

# Set the number of "expected" microenvironments
k = 3

# Set a seed for reproducibility
seed = 2024

# Get list of all cell types in our data
all_cell_types = cell_table['cell_cluster'].unique()

Find neighbors for each cell

# Define function to count the neighbors of each cell (where each cell is one row in the distance matrix)
def one_cell_nh(row):
    # Get index of neighbors (defined as "close" in the distance matrix)
    # In the distance matrix ('row' is just one row in the distance matrix), 1 indicates that the two cells were "close"
    neighbor_idx = np.where(row == 1)[0]
    # Initialize output as a dictionary, start with count of 0 for every cell type
    nh = {key:0 for key in all_cell_types}
    # Convert these indices to cell types
    if len(neighbor_idx) > 0:
        nh_cells = fov_cell_table.loc[neighbor_idx]['cell_cluster'].values
        # Count the number of each cell type
        unique_cell_type, counts = np.unique(nh_cells, return_counts=True)
        # Store the data in the dictionary
        for i,one_cell_type in enumerate(unique_cell_type):
            nh[one_cell_type] = counts[i]
    return nh

# Iterate through all fovs to get neighbors of all cells
all_nh_list = [] #store information for each fov in a list
for fov in all_fovs:
    # Subset cell table for only cells in this FOV
    fov_cell_table = cell_table.loc[cell_table['fov'] == fov].reset_index(drop=True)
    # Make list of all cell centroids
    all_centroids = list(zip(fov_cell_table['centroid-0'],fov_cell_table['centroid-1']))
    # Get distance between all cells
    dist_mat = cdist(all_centroids, all_centroids, 'euclidean')

    # Binarize distance matrix (1 if two cells are "close")
    dist_mat_bin = dist_mat < nh_dist
    dist_mat_bin = dist_mat_bin*1

    # Remove itself as its own neighbor
    dist_mat_bin[dist_mat == 0] = 0
    
    # Apply function to every cell in the data
    nh_dicts = np.apply_along_axis(one_cell_nh, axis=1, arr=dist_mat_bin)
    # Turn into dataframe
    nh_df = pd.DataFrame(nh_dicts.tolist())
    # Get total number of neighbors
    nh_df['total_neighbors'] = nh_df.sum(axis=1)
    # Combine with data from cell table
    nh_df = pd.merge(fov_cell_table, nh_df, left_index=True, right_index=True)
    all_nh_list.append(nh_df)

# Turn data into dataframe
all_nh_df = pd.concat(all_nh_list)
all_nh_df

k-means clustering

When finding the neighborhood, you can either use the number of cells in the neighborhood of each cell, or the frequency (divided by the total number of cells). The interpretation of each of these is slightly different. We encourage you to try both ways (you can comment/uncomment the line of code that calculates frequency below). By clustering on the neighborhoods, we can identify distinct microenvironments.

# If you want to try clustering using frequency (i.e., dividing by the total number of cells), uncomment this line
# all_nh_df[all_cell_types] = all_nh_df[all_cell_types].divide(all_nh_df['total_neighbors'], axis=0)

# Remove cells that have no neighbors
all_nh_df_zeros_removed = all_nh_df.loc[all_nh_df['total_neighbors'] > 0].copy()

# Only keep columns we want for clustering
kmeans_input = all_nh_df_zeros_removed[all_cell_types]

# Cluster
cluster_fit = KMeans(n_clusters=k, random_state=seed, n_init=10).fit(kmeans_input)
# Add 1 to labels to avoid cluster number 0 (because output of kmeans is 0-indexed)
cluster_labels = ["ME"+str(x) for x in cluster_fit.labels_+1]

# Add cluster labels to cell table
all_nh_df_zeros_removed['me_cluster'] = cluster_labels

# Merge back with the full cell table; cells with no neighbors get the label "Unassigned"
all_nh_df = all_nh_df.merge(all_nh_df_zeros_removed, how="left")
all_nh_df['me_cluster'] = all_nh_df['me_cluster'].fillna("Unassigned")
all_nh_df

Visualize microenvironments

We can visualize the number of each cell type that is assigned to each microenvironment.

# Count the number of each cell type assigned to each microenvironment
num_cell_types_dat_long = all_nh_df_zeros_removed.groupby(['me_cluster', 'cell_cluster']).size().reset_index(name='counts')
# Reformat this table
num_cell_types_dat = num_cell_types_dat_long.pivot(index='me_cluster', columns='cell_cluster', values='counts')
num_cell_types_dat.fillna(0, inplace=True)

# Make heatmap
heatmap = sns.clustermap(
    num_cell_types_dat.apply(stats.zscore, axis=1), # Apply a z-score for better visualization
    cmap = "vlag",
    center = 0,
    vmin = -3,
    vmax = 3
)

Visualize cell phenotype maps, colored by microenvironment

# Define colors we want for each microenvironment (if you use a larger k, add colors here; you need one color per ME plus one for "Unassigned")
all_colors = {}
all_colors['ME1'] = '#4E79A7'
all_colors['ME2'] = '#59A14F'
all_colors['ME3'] = '#D37295'
all_colors['Unassigned'] = '#79706E'

# Create table matching each color to a unique ID
colors_list = [(key,value) for key,value in all_colors.items()]
all_colors_df = pd.DataFrame(colors_list, columns=['me_cluster','color'])
all_colors_df['pheno_id'] = all_colors_df.index + 1

# Make color map for plotting
mycols = all_colors_df['color'].tolist()
mycols.insert(0,'#000000') # add black for empty slide, will have id 0
mycols.append('#FFFFFF') # add white for cell borders, will have id max_n+1
colmap = colors.ListedColormap(mycols)
max_n = np.max(all_colors_df['pheno_id'])
bounds = [i-0.5 for i in np.linspace(0,max_n+2, max_n+3)]
norm = colors.BoundaryNorm(bounds, colmap.N)
# Create microenvironment masks
for fov in all_fovs:
    # Get segmentation array
    seg_path = os.path.join(data_dir, fov, "masks", "segmentation_whole_cell.tiff")
    seg_array = io.imread(seg_path).squeeze()
    
    # Make CPM using the function we created above
    fig, ax = plt.subplots(figsize=[8,8])
    create_cpm(fov, all_nh_df, all_colors_df, seg_array, fig, ax)
    plt.axis('off')
    ax.set_title(fov)

    # Add colorbar
    divider = make_axes_locatable(fig.gca())
    cax = divider.append_axes(position="right", size="5%", pad="3%")
    cbar = fig.colorbar(cm.ScalarMappable(norm=norm, cmap=colmap),
                        cax=cax, orientation="vertical", use_gridspec=True, pad=0.1,
                        shrink=0.9, drawedges=True)
    cbar_labels = all_colors_df['me_cluster'].to_list()
    cbar_labels.insert(0,'Empty') # add black for empty slide, will have id 0
    cbar_labels.append('Cell border') # add white for cell borders, will have id max_n+1

    cbar.ax.set_yticks(
        ticks=np.arange(len(cbar_labels)),
        labels=cbar_labels
    )
    cbar.minorticks_off()

    plt.tight_layout()

Additional exercises

  1. Compare generating neighborhoods using counts vs. frequency.
  2. Change the k used for k-means clustering. How do the results change?
  3. Inertia and silhouette score are two metrics that can help choose the best k. Try different k’s and extract the inertia value (see https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html; inertia_ is an attribute of the fitted k-means object). Plot k on the x axis and inertia on the y axis. What trend do you see? (A starting sketch is below.)
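
A minimal sketch for exercise 3 (the range of k values is arbitrary; this reuses kmeans_input and seed from above):

# Fit k-means for a range of k values and record the inertia of each fit
inertias = []
k_values = range(2, 11)
for one_k in k_values:
    fit = KMeans(n_clusters=one_k, random_state=seed, n_init=10).fit(kmeans_input)
    inertias.append(fit.inertia_)

# Plot k vs. inertia and look for an "elbow" where the curve flattens
fig, ax = plt.subplots(figsize=(5,3))
ax.plot(list(k_values), inertias, marker='o')
ax.set_xlabel('k')
ax.set_ylabel('inertia')
plt.tight_layout()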