Intro to Python

Importing packages

You can import packages using the “import” statement (analogous to ‘library()’ in R).

import os

You can also assign “nicknames” (aliases) to packages when importing them, using the “as” keyword.

import numpy as np
import pandas as pd
import seaborn as sns

If you only need part of a package, you can import just the module (sub-package) that contains the functions you need.

import matplotlib.pyplot as plt
import skimage.io as io
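
You can also import individual names from a module with “from ... import ...”. A minimal sketch using the standard-library math module (chosen here only for illustration; it is not used elsewhere in this tutorial):

```python
# Import two specific names from the standard-library math module
from math import pi, sqrt

print(sqrt(16))        # Prints "4.0"
print(round(pi, 2))    # Prints "3.14"
```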

Basic data structures

A Python list is an ordered, resizable sequence, similar to an array:

xs = [3, 1, 2]   # Create a list
print(xs)

One difference between R and Python is that Python uses 0-indexing, meaning that the first element of the list is accessed using 0.

print(xs[0])
print(xs[-1])     # Negative indices count from the end of the list; prints "2"

In addition to accessing list elements one at a time, Python provides concise syntax to access sublists; this is known as slicing.

nums = list(range(5))    # range is a built-in that generates a sequence of integers; list() converts it to a list
print(nums)         # Prints "[0, 1, 2, 3, 4]"
print(nums[2:4])    # Get a slice from index 2 to 4 (exclusive); prints "[2, 3]"
print(nums[2:])     # Get a slice from index 2 to the end; prints "[2, 3, 4]"
print(nums[:2])     # Get a slice from the start to index 2 (exclusive); prints "[0, 1]"
print(nums[:])      # Get a slice of the whole list; prints "[0, 1, 2, 3, 4]"
print(nums[:-1])    # Slice indices can be negative; prints "[0, 1, 2, 3]"
nums[2:4] = [8, 9] # Assign a new sublist to a slice
print(nums)         # Prints "[0, 1, 8, 9, 4]"
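
Slices also accept an optional third value, the step. The sketch below uses a separate list (named digits here purely for illustration) so that nums is left unchanged:

```python
digits = list(range(10))   # [0, 1, ..., 9]
print(digits[::2])         # Every second element; prints "[0, 2, 4, 6, 8]"
print(digits[1:8:3])       # From index 1 to 8 (exclusive) in steps of 3; prints "[1, 4, 7]"
print(digits[::-1])        # A negative step reverses the list; prints "[9, 8, 7, 6, 5, 4, 3, 2, 1, 0]"
```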

You can loop over the elements of a list like this:

animals = ['cat', 'dog', 'monkey']
for animal in animals:
    print(animal)

If you want access to the index of each element within the body of a loop, use the built-in enumerate function.

animals = ['cat', 'dog', 'monkey']
for idx, animal in enumerate(animals):
    print('#{}: {}'.format(idx + 1, animal))  # str.format fills each '{}' with the corresponding argument, in order
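
As a possible alternative to adding 1 by hand, enumerate accepts a “start” argument, and an f-string (available in Python 3.6 and later) can replace the call to format. A short sketch of the same loop:

```python
animals = ['cat', 'dog', 'monkey']
for idx, animal in enumerate(animals, start=1):   # start=1 makes the counter begin at 1
    print(f'#{idx}: {animal}')                    # Prints "#1: cat", then "#2: dog", then "#3: monkey"
```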

When programming, frequently we want to perform operations on every element of a list. As a simple example, consider the following code that computes square numbers:

nums = [0, 1, 2, 3, 4]
squares = []
for x in nums:
    squares.append(x ** 2)
print(squares)

You can make this code a lot simpler using a list comprehension:

nums = [0, 1, 2, 3, 4]
squares = [x ** 2 for x in nums]
print(squares)

List comprehensions can also contain conditions. Here, “%” is the modulo operator, which returns the remainder of a division (e.g., 6 % 3 = 0 and 7 % 3 = 1):

nums = [0, 1, 2, 3, 4]
even_squares = [x ** 2 for x in nums if x % 2 == 0]
print(even_squares)

Another useful data structure is a dictionary. A dictionary stores (key, value) pairs. You can use it like this:

d = {'cat': 'cute', 'dog': 'furry'}  # Create a new dictionary with some data in 'key':'value' pairs
print(d['cat'])       # Get an entry from a dictionary; prints "cute"
print('cat' in d)     # Check if a dictionary has a given key; prints "True"
d['fish'] = 'wet'    # Create an entry in a dictionary
print(d['fish'])      # Access an entry in a dictionary, prints "wet"
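
Accessing a key that does not exist with square brackets raises a KeyError. The sketch below shows two related operations that are not covered above: the get method, which returns a default value instead of raising an error, and del, which removes an entry.

```python
d = {'cat': 'cute', 'dog': 'furry'}
print(d.get('monkey', 'N/A'))   # Key is missing, so the default is returned; prints "N/A"
print(d.get('cat', 'N/A'))      # Key exists; prints "cute"
del d['dog']                    # Remove an entry from the dictionary
print(len(d))                   # Prints "1"
```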

You can iterate over the entries in a dictionary:

d = {'person': 2, 'cat': 4, 'spider': 8}
for animal, legs in d.items(): # Here, we are deconstructing each key-value pair into variables called 'animal' and 'legs'
    print('A {} has {} legs'.format(animal, legs))
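
Dictionaries can also be built with a comprehension, in the same way as lists. A minimal sketch (the variable name even_num_to_square is only illustrative):

```python
nums = [0, 1, 2, 3, 4]
# Map each even number to its square
even_num_to_square = {x: x ** 2 for x in nums if x % 2 == 0}
print(even_num_to_square)   # Prints "{0: 0, 2: 4, 4: 16}"
```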

Defining functions

Python functions are defined using the def keyword. For example:

def sign(x):
    if x > 0:
        return 'positive'
    elif x < 0:
        return 'negative'
    else:
        return 'zero'

for x in [-1, 0, 1]:
    print(sign(x))  # Prints "negative", then "zero", then "positive"
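
Functions can also take optional arguments with default values. The function below (hello, a hypothetical example) is a small sketch of this:

```python
def hello(name, loud=False):
    # 'loud' is an optional argument; it is False unless the caller sets it
    if loud:
        return 'HELLO, {}!'.format(name.upper())
    return 'Hello, {}'.format(name)

print(hello('Bob'))               # Prints "Hello, Bob"
print(hello('Fred', loud=True))   # Prints "HELLO, FRED!"
```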

NumPy

NumPy is the core library for scientific computing in Python. It provides a high-performance multidimensional array object and tools for working with these arrays.

import numpy as np

We can initialize NumPy arrays from nested Python lists, and access elements using square brackets:

a = np.array([[1,2,3],[4,5,6]])
print(a)
print(a.shape)
print(a[0, 0], a[0, 1], a[1, 0])
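
Arrays support slicing along each dimension, and arithmetic is applied element by element. A brief sketch using the same array:

```python
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])
print(a[:, 1])         # Second column; prints "[2 5]"
print(a[0, :])         # First row; prints "[1 2 3]"
print(a * 2)           # Elementwise multiplication; the rows become [2 4 6] and [8 10 12]
print(a.sum(axis=0))   # Sum down each column; prints "[5 7 9]"
print(a.mean())        # Mean of all elements; prints "3.5"
```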

Boolean array indexing lets you pick out arbitrary elements of an array. Frequently this type of indexing is used to select the elements of an array that satisfy some condition. Here is an example:

a = np.array([[1,2], [3, 4], [5, 6]])

bool_idx = (a > 2)  # Find the elements of a that are bigger than 2;
                    # this returns a numpy array of Booleans of the same
                    # shape as a, where each slot of bool_idx tells
                    # whether that element of a is > 2.

print(bool_idx)
# We use boolean array indexing to construct a rank 1 array
# consisting of the elements of a corresponding to the True values
# of bool_idx
print(a[bool_idx])

# We can do all of the above in a single concise statement:
print(a[a > 2])
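
Conditions can be combined with “&” (and) and “|” (or), and a Boolean index can also appear on the left-hand side of an assignment. A short sketch (the names mask and b are illustrative):

```python
import numpy as np

a = np.array([[1, 2], [3, 4], [5, 6]])
mask = (a > 2) & (a % 2 == 0)   # Elementwise "and"; each condition needs its own parentheses
print(a[mask])                  # Prints "[4 6]"
print(a[(a < 2) | (a > 5)])     # Elementwise "or"; prints "[1 6]"

b = a.copy()                    # Work on a copy so that a is left unchanged
b[b > 2] = 0                    # Set every element greater than 2 to 0
print(b)                        # Only the values 1 and 2 remain; the rest are 0
```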

NumPy is a powerful library that underlies many machine learning packages in Python. To learn more, check out the documentation: https://numpy.org/doc/stable/user/quickstart.html.

SciPy is a collection of mathematical algorithms built on top of NumPy. For more tutorials and information, see https://scipy.org/.

Pandas

Pandas is a powerful library for working with tabular data (similar to data frames in R).

import pandas as pd

Here is a simple example:

data = {
  "calories": [420, 380, 390],
  "duration": [50, 40, 45]
}

# Load data into a DataFrame object:
df = pd.DataFrame(data)
df

To select a specific row by its index label, use “loc”:

print(df.loc[0])

Add a column to the data frame:

df['group'] = ['group1','group1','group2']
df

Subset a data frame:

# Only keep rows in group 1
df.loc[df['group'] == 'group1']
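
Several conditions can be combined when subsetting, using the same “&” and “|” operators as in NumPy. The sketch below rebuilds the example table so that it runs on its own; the 400-calorie threshold is an arbitrary value chosen for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "calories": [420, 380, 390],
    "duration": [50, 40, 45],
    "group": ['group1', 'group1', 'group2'],
})

# Keep rows in group 1 that also have more than 400 calories
subset = df.loc[(df['group'] == 'group1') & (df['calories'] > 400)]
print(subset)

# Select a subset of columns by passing a list of column names
print(df[['calories', 'group']])
```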

Count the number of rows in each group:

df.groupby('group').size()

You can also perform operations on groups of rows. For example, here we compute the mean calories in each group.

df.groupby('group')['calories'].mean()
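
Several summaries can be computed per group in a single call with the agg method. In the sketch below, the output column names (mean_calories, max_duration, n_rows) are arbitrary labels chosen for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "calories": [420, 380, 390],
    "duration": [50, 40, 45],
    "group": ['group1', 'group1', 'group2'],
})

# Each keyword is a new column: (source column, aggregation to apply)
summary = df.groupby('group').agg(
    mean_calories=('calories', 'mean'),
    max_duration=('duration', 'max'),
    n_rows=('calories', 'count'),
)
print(summary)
```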

Pandas data frames have an index (the numbers on the left-hand side). When you subset a table, each row keeps its original index label. For example, when we subset for group2 only, that row still has index 2.

df.loc[df['group'] == 'group2']

If we want to reset the index, we can use the reset_index method. By default, the old index is kept as a new column named “index”; pass drop=True to discard it:

df.loc[df['group'] == 'group2'].reset_index()
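
The sketch below compares the index before and after resetting, with and without the drop=True argument. It rebuilds the example table so that it runs on its own.

```python
import pandas as pd

df = pd.DataFrame({
    "calories": [420, 380, 390],
    "duration": [50, 40, 45],
    "group": ['group1', 'group1', 'group2'],
})

group2 = df.loc[df['group'] == 'group2']
print(group2.index.tolist())                          # Prints "[2]"
print(group2.reset_index().columns.tolist())          # Prints "['index', 'calories', 'duration', 'group']"
print(group2.reset_index(drop=True).index.tolist())   # Prints "[0]"
```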

We can merge two data frames using the merge method:

new_data = {
  "group": ['group1', 'group2'],
  "new_col": [100, 200]
}

# Load data into a DataFrame object:
new_df = pd.DataFrame(new_data)
new_df
df.merge(new_df, on='group') # merge the two tables using the 'group' column
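
By default, merge keeps only rows whose key appears in both tables (an inner join). The “how” argument changes this. The sketch below uses a hypothetical lookup table that has no entry for group2, to show the difference between an inner and a left join:

```python
import pandas as pd

df = pd.DataFrame({
    "calories": [420, 380, 390],
    "group": ['group1', 'group1', 'group2'],
})
# Hypothetical lookup table: no row for group2, and an extra row for group3
lookup = pd.DataFrame({
    "group": ['group1', 'group3'],
    "new_col": [100, 300],
})

inner = df.merge(lookup, on='group')              # Default how='inner': keys present in both tables
left = df.merge(lookup, on='group', how='left')   # Keep every row of df; unmatched rows get NaN

print(len(inner))                     # Prints "2"
print(len(left))                      # Prints "3"
print(left['new_col'].isna().sum())   # Prints "1"
```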

To read data directly from a file and load it as a pandas DataFrame:

pd.read_csv("example_data/cell_table.csv")
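
After loading a table, it is often useful to check its first rows, shape, and column types. The sketch below reads from an in-memory string instead of a file so that it runs on its own; the column names (cell_id, area) are made up for illustration and are not taken from cell_table.csv.

```python
from io import StringIO

import pandas as pd

# Hypothetical CSV text standing in for a file on disk
csv_text = """cell_id,area
1,120.5
2,98.0
3,143.2
"""

table = pd.read_csv(StringIO(csv_text))
print(table.head())    # First rows of the table
print(table.shape)     # (number of rows, number of columns); prints "(3, 2)"
print(table.dtypes)    # Data type of each column
```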

Pandas has a lot more functionality. For more tutorials and information, see https://pandas.pydata.org/docs/index.html.

Matplotlib and Seaborn

Matplotlib is a plotting library. In this section, we will give a brief introduction to the matplotlib.pyplot module.

import matplotlib.pyplot as plt

Here is a simple example:

# Compute the x and y coordinates for points on a sine curve
x = np.arange(0, 3 * np.pi, 0.1)
y = np.sin(x)

# Plot the points using matplotlib
plt.plot(x, y)
plt.show()

You can plot different things in the same figure using the subplot function. Here is an example:

# Compute the x and y coordinates for points on sine and cosine curves
x = np.arange(0, 3 * np.pi, 0.1)
y_sin = np.sin(x)
y_cos = np.cos(x)

# Set up a subplot grid that has height 2 and width 1,
# and set the first such subplot as active.
plt.subplot(2, 1, 1)

# Make the first plot
plt.plot(x, y_sin)
plt.title('Sine')

# Set the second subplot as active, and make the second plot.
plt.subplot(2, 1, 2)
plt.plot(x, y_cos)
plt.title('Cosine')

# Show the figure.
plt.show()
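
Matplotlib also offers an object-oriented interface, in which plt.subplots returns a figure and an array of axes that you draw on directly. The following sketch reproduces the sine and cosine figure in that style; the figure size and shared x axis are choices made for this example.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.arange(0, 3 * np.pi, 0.1)

# Create a figure with a 2 x 1 grid of axes that share the x axis
fig, axes = plt.subplots(2, 1, sharex=True, figsize=(6, 6))

# Draw on each axes object directly instead of switching the active subplot
axes[0].plot(x, np.sin(x))
axes[0].set_title('Sine')

axes[1].plot(x, np.cos(x))
axes[1].set_title('Cosine')
axes[1].set_xlabel('x')

fig.tight_layout()
```

In a notebook the figure is displayed automatically; when running as a script, call plt.show() afterwards.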

Seaborn is another popular plotting package, built on top of Matplotlib, that can generate “prettier” plots (similar in spirit to ‘ggplot2’ in R).

import seaborn as sns
# Load an example dataset
tips = sns.load_dataset("tips")

# Create a visualization
sns.relplot(
    data=tips,
    x="total_bill", y="tip", col="time",
    hue="smoker", style="smoker", size="size",
)
plt.show()

For more documentation on these two packages, see https://matplotlib.org/3.5.3/api/_as_gen/matplotlib.pyplot.html and https://seaborn.pydata.org/tutorial/introduction.html.

Images

There are a few different packages that you can use for opening images. scikit-image (imported as skimage) is a popular one.

import skimage.io as io

Here, we are loading in an image.

example_image = io.imread("example_data/fov1/image_data/CD45.tiff")

We can see that the image is just an array of numbers.

example_image

We can inspect the shape of the array (the shape of the image).

example_image.shape

We can also display the image.

fig = plt.figure(figsize=(8,8))
plt.imshow(example_image, origin="lower", cmap='gray', vmax=np.quantile(example_image,0.99))
plt.axis('off')
plt.tight_layout()
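
The “vmax” argument sets the pixel value that is mapped to the brightest color; anything above it is displayed at that same brightness. Setting it to the 99th percentile keeps a few very bright pixels from making the rest of the image look black. The sketch below illustrates the idea on a synthetic array (not the CD45 image), treating vmax, for display purposes, roughly like clipping:

```python
import numpy as np

# Synthetic 100 x 100 "image": mostly dim pixels plus a few very bright ones
rng = np.random.default_rng(0)
image = rng.integers(0, 50, size=(100, 100)).astype(float)
image[:5, :5] = 5000.0            # 25 very bright pixels

vmax = np.quantile(image, 0.99)   # 99th percentile of the pixel values
print(image.max())                # Prints "5000.0"
print(vmax < image.max())         # Prints "True"

# Values above vmax are shown as if they were equal to vmax
clipped = np.minimum(image, vmax)
print(clipped.max() == vmax)      # Prints "True"
```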

For more documentation on scikit-image, see https://scikit-image.org/.

Additional exercises

  1. Read in the cell table at “example_data/cell_table.csv”. How many unique types of cells are there? How many unique FOVs?
  2. Filter the cell table for cells in FOV2. How many CD4 T cells are there?
  3. Make a dictionary mapping each cell ID to its cell type.
  4. Make a bar graph showing the number of cells of each cell type in FOV1.
  5. Read in some example images in the “example_data” folder and play with the “vmax” parameter to see how the image changes.