HINT
# We can use the modulus operator "%%". It yields the remainder when the first operand is divided by the second.
15 %% 3
#> [1] 0
14 %% 3
#> [1] 2
RStudio is an integrated development environment (IDE) for R. It provides a user-friendly interface for coding, debugging, and data analysis. We use RStudio for its convenience and powerful features.
The working directory is the folder where R looks for files. It’s important to set it correctly for consistency across Windows and Mac.
# Get the current working directory
current_dir <- getwd()
print(paste("Current directory:", current_dir))
# List files in the current directory
files <- list.files()
print("Files in the current directory:")
print(files)
# Change the current directory
new_dir <- "path/to/new/directory"
setwd(new_dir)
print(paste("Changed directory to:", new_dir))
In R, objects are created to store data. Variables are used to name and reference these objects. The environment is the current workspace where objects and variables are stored.
Gene expression data from single-cell RNA sequencing (scRNA-seq) experiments is typically represented as a matrix, where rows correspond to genes and columns correspond to cells. Each cell contains the expression level of a gene, quantified as counts or normalized values. In R, there are several matrix data types commonly used for storing and manipulating gene expression data:
Matrix (matrix
): The basic matrix data type in R. It is a two-dimensional array with elements of the same data type.
Data Frame (data.frame
): A special type of matrix where columns can contain different data types (e.g., numeric, character, factor). Data frames are commonly used for storing tabular data, including gene expression matrices with additional metadata.
Sparse Matrix (Matrix
package): A matrix format optimized for storing large, sparse datasets with many zero values. It conserves memory and speeds up computation for large-scale analyses.
Matrix operations
element <- mat[1, 2]
print(element)
# Calculate row sums
row_sums <- rowSums(mat)
print(row_sums)
# Calculate column sums
col_sums <- colSums(mat)
print(col_sums)
# Create another matrix
mat2 <- matrix(rnorm(20), nrow = 5, ncol = 4)
print(mat2)
# Perform matrix multiplication
mat_mult <- mat %*% mat2
print(mat_mult)
# Transpose the matrix
mat_transpose <- t(mat)
print(mat_transpose)
# Select the first two rows
first_two_rows <- mat[1:2, ]
print(first_two_rows)
# Select the first three columns
first_three_cols <- mat[, 1:3]
print(first_three_cols)
# Logical variable
logical_var <- TRUE
print(logical_var) # Output: TRUE
# Logical operations
result <- logical_var & FALSE
print(result) # Output: FALSE
# Logical operations
a <- TRUE
b <- FALSE
# AND operation
result_and <- a & b
print(result_and) # Output: FALSE
# OR operation
result_or <- a | b
print(result_or) # Output: TRUE
# NOT operation
result_not <- !a
print(result_not) # Output: FALSE
# Comparison operators
x <- 5
y <- 10
# Greater than
result_gt <- x > y
print(result_gt) # Output: FALSE
# Less than
result_lt <- x < y
print(result_lt) # Output: TRUE
# Equal to
result_eq <- x == y
print(result_eq) # Output: FALSE
# Not equal to
result_neq <- x != y
print(result_neq) # Output: TRUE
while
A while loop is a control flow statement that allows code to be executed repeatedly based on a given Boolean condition. The loop executes the loop body as long as the condition remains true. When the condition becomes false, the loop terminates.
for
A for loop is a control flow statement used in many programming languages to repeat a block of code multiple times. It is particularly useful for iterating over sequences (like lists, arrays, or strings) and executing a piece of code for each element in the sequence.
apply
The apply() function in R is a powerful tool for applying a function to the rows or columns of a matrix or data frame. It is particularly useful for performing operations across a dataset without needing to write explicit loops. The syntax for apply() is:
apply(X, margin, function, ...)
# X: This is the array or matrix on which you want to apply the function.
# margin: A value that specifies whether to apply the function over rows (1), columns (2), or both (c(1, 2)).
# function: The function you want to apply to each row or column.
To calculate the sum of each row in a matrix:
# Create a matrix
my_matrix <- matrix(1:9, nrow=3)
# Apply sum function across rows
row_sums <- apply(my_matrix, 1, sum)
print(row_sums)
To find the mean of each column in a data frame:
# Create a data frame
df <- data.frame(a = c(1, 2, 3), b = c(4, 5, 6))
# Apply mean function across columns
column_means <- apply(df, 2, mean)
print(column_means)
sapply
and lappy
lapply()
returns a list, regardless of the output of each application of the function.sapply()
attempts to simplify the result into a vector or matrix if possible. If simplification is not possible, it returns a list similar to lapply()
.Suppose you have a list of numerical vectors and you want to compute the sum of each vector. Here’s how you could use lapply():
# Define a list of vectors
num_list <- list(c(1, 2, 3), c(4, 5), c(6, 7, 8, 9))
# Use lapply to apply the sum function
list_sums <- lapply(num_list, sum)
print(list_sums)
Using the same list of numerical vectors, if you use sapply() to compute the sum, the function will try to simplify the output into a vector:
When to Use Each
lapply()
: When you need the robustness of a list output, especially when dealing with heterogeneous data or when the function can return variable lengths or types.sapply()
: When you are working with homogeneous data and prefer a simplified output such as a vector or matrix, assuming the lengths and types are consistent across elements.The stringr
package in R simplifies these tasks with easy-to-use functions that can handle typical string operations.
Finding specific sequences or motifs within biological sequences is a common task.
Modifying sequences by replacing specific nucleotides or amino acids.
Extracting parts of sequences, such as cutting out genes or regions of interest.
Determining the length of sequences.
Converting uppercase to lowercase, or vice versa.
Splitting sequences into arrays, useful for reading fasta files or analyzing codons.
codons <- str_sub(sequence, seq(1, str_length(sequence), by = 3), seq(3, str_length(sequence), by = 3))
print(codons)
Bonus Challenge: What if our sequence length wasn’t a multiple of three? For an example of how to approach this situation please check out the Bonus challenge.
Counting occurrences of specific nucleotides or amino acids.
A package is a collection of R functions, data, and compiled code. Packages provide a way to organize and share code, making it easier for users to access specific tools and functions.
Tidyverse is a collection of R packages designed for data science. It includes packages like ggplot2
for data visualization and dplyr
for data manipulation.
Tidyverse is a collection of R packages designed for data science, developed with a focus on simplicity, consistency, and ease of use. One of the key components of Tidyverse is the concept of tidy data frames.
A tidyverse data frame is a rectangular table-like structure where:
Each variable forms a column: In a tidy data frame, each variable is placed in its own column. This makes it easy to work with the dataset as each variable is explicitly represented.
Each observation forms a row: Each row corresponds to a distinct observation, entity, or case. This makes it straightforward to perform operations on individual observations.
Each type of observational unit forms a table: Different types of data should be stored in separate tables, with relationships between tables represented using unique identifiers.
Ease of Analysis: Tidy data frames make it easier to perform data manipulation, visualization, and analysis using Tidyverse packages such as dplyr
, ggplot2
, and tidyr
.
Readability and Interpretability: Tidy data frames have a consistent structure, making it easier for others to understand and interpret your data.
Efficiency: Tidy data frames facilitate efficient and concise code, reducing the complexity of data manipulation tasks.
dplyr: Provides a grammar of data manipulation for data frames, enabling easy filtering, selecting, mutating, summarizing, and arranging of data.
tidyr: Offers tools for reshaping and tidying messy datasets, such as gather()
and spread()
functions for converting between wide and long formats.
ggplot2: Allows for the creation of sophisticated data visualizations using a layered grammar of graphics.
library(ggplot2)
library(tidyr)
library(dplyr)
# Example dataset
data <- data.frame(
ID = 1:3,
Name = c("Alice", "Bob", "Charlie"),
Math = c(90, 85, 95),
Science = c(88, 92, 89)
)
# Original dataset
print("Original dataset:")
print(data)
# Tidy the dataset using gather() function from tidyr
tidy_data <- gather(data, Subject, Score, -ID, -Name)
# Tidied dataset
print("Tidied dataset:")
print(tidy_data)
Selecting columns allows you to choose specific columns from your dataset. It helps you focus on the variables of interest and ignore the rest.
Filtering rows allows you to subset your dataset based on specific conditions. It helps you extract only the rows that meet certain criteria.
Summarizing data involves calculating summary statistics or aggregating data to get a concise overview of your dataset. It helps you understand the overall characteristics of your data.
Arranging rows involves sorting your dataset based on the values of one or more columns. It helps you reorder your data for better visualization or analysis.
The mutate() function in the dplyr package is essential for transforming data frames in R. It allows you to add new columns to a data frame or modify existing ones, using existing data. mutate() is part of the tidyverse, a collection of R packages designed for data science that makes data manipulation, exploration, and visualization easy and intuitive.
Calculating the GC content of DNA sequences.
Transcription of DNA sequences into RNA sequences involves replacing thymine (T) with uracil (U).
Identifying potential neoantigens by finding motifs associated with high mutation frequencies or specific mutation patterns.
# Sample DNA sequences
sequences <- tibble(
id = c("seq1", "seq2", "seq3"),
dna = c("ATGCGCATC", "GCGTACGTAGT", "ATATATATAT")
)
# Assume a simple motif that might indicate a neoantigen
motif = "ATG"
# Annotate sequences with potential neoantigen presence
sequences <- sequences %>%
mutate(
start_position = str_locate(dna, motif)[,1],
is_neoantigen_candidate = ifelse(start_position > 0 & str_count(dna, "[GC]") / str_length(dna) > 0.5, "Yes", "No")
)
print(sequences)
Functions (function
) in R perform specific tasks. They take input (arguments), perform operations, and return output.
function_name <- function(argument1, argument2, ...) {
# Function body
# Perform operations
# Return a value (optional)
}
Here, we define a function and call it!
Let’s bring together concepts by writing the function analyze_tcr_data
.
The function will:
library(dplyr)
library(stringr)
# Define the function
analyze_tcr_data <- function(tcr_tibble, length_threshold) {
# Validate input
if (!inherits(tcr_tibble, "tbl_df")) {
stop("Input must be a tibble.")
}
# Filter TCR sequences longer than the threshold and sort by clonality
filtered_and_sorted <- tcr_tibble %>%
filter(str_length(tcr_sequence) > length_threshold) %>%
arrange(desc(clonality))
# Add a column to indicate the presence of a specific motif (e.g., 'CASS')
enhanced_tcr_tibble <- filtered_and_sorted %>%
mutate(has_motif = if_else(str_detect(tcr_sequence, "CASS"), "Yes", "No"))
# Return the transformed tibble
return(enhanced_tcr_tibble)
}
# Example usage
# Assuming a tibble with TCR sequences and clonality metrics
tcr_data <- tibble(
tcr_sequence = c("CASSLGGTDTQYF", "CASSLGDETQYF", "CASSLG", "CASSEGTDTQYF"),
clonality = c(0.25, 0.15, 0.05, 0.55)
)
# Apply the function with a length threshold of 10
result_data <- analyze_tcr_data(tcr_data, 10)
print(result_data)
Explanation
filter()
to retain only TCR sequences longer than the specified length_threshold.arrange()
to sort the data by clonality in descending order.mutate()
combined with str_detect()
from the stringr package to add a column that indicates whether each TCR sequence contains the motif “CASS”.Reading files is a crucial part of data analysis. R provides functions for reading various file formats.
# Load the iris dataset (saved as iris in base R)
prac_data <- iris
# What variables are measured in the iris dataset?
# Hint: What do colnames(), str(), and summary() tell you about the data frame
# CODE YOUR ANSWER HERE
# How many sets of measurements were taken in the iris dataset?
# Hint: What is the difference between nrow(), ncol(), dim(), and length()
# CODE YOUR ANSWER HERE
# What are the three unique types of species analyzed in the iris dataset?
# Hint: How can unique() and table() be used to understand iris$Species
# CODE YOUR ANSWER HERE
# Filter the iris dataset to only those that are of the Species setosa
# Save this dataframe as the variable "setosa"
# CODE YOUR ANSWER HERE
# What is the average sepal width of irises that are of the Species setosa?
# Hint: Summary values of numeric vectors can be evaluated using min(), median(), max(), and summary()
# CODE YOUR ANSWER HERE
# Add a new column to setosa called "short_petal" that indicates TRUE if the
# Petal.Length is less than 5 and FALSE if the Petal.Length is at least 5
# CODE YOUR ANSWER HERE
# Create a function called "score" that checks whether the value of a numeric vector
# (input) is greater than the mean of all of the values, and returns a character vector
# with "high" for the values greater than the mean and "low" for the values
# less than the mean.
# Create a new column for setosa called "sepal_length_score" that uses "high" and
# "low" to indicate which values are greater than or less than the mean.
# Hint: How can you use the function you create
# Advanced: What does the following block of code do
iris %>%
group_by(Species) %>%
count()
# Advanced: Do you have a *.csv, *.tsv, or *.txt file of your own that you
# can add to your project?
# Hint: There is an "Upload" option in the Files pane where you can add a file
# that is present on your own computer. You can then read the file into your
# environment.
If our sequence isn’t length three, this code will throw an error because the function expects a sequence that is a multiple of 3:
library(stringr)
sequence <- "ATGCGTACGTTGAC" # Length 14
# sequence <- "ATGCGTACGTTGACA" # Length 15
codons <- str_sub(sequence, seq(1, str_length(sequence), by = 3), seq(3, str_length(sequence), by = 3))
#> Error in `str_sub()`:
#> ! Can't recycle `string` (size 5) to match `end` (size 4).
#> Run `rlang::last_trace()` to see where the error occurred.
Can you try to code up a solution to this problem?
Here are a few options for how to solve this (only peek after trying yourself).
sequence <- "ATGCGTACGTTGAC" # Length 14
# sequence <- "ATGCGTACGTTGACA" # Length 15
# Option 1: Check for length 3 and make sure before running the code, otherwise print a warning.
sequence_length <- nchar(sequence) # Get sequence length
if (sequence_length %% 3 != 0) {
print(paste("Error! The sequence is length:", sequence_length))
codons <- "ERROR"
} else {
codons <- str_sub(string = sequence,
start = seq(1, str_length(sequence), by = 3),
end = seq(3, str_length(sequence), by = 3))
}
print(codons)
# Option 2: Add N to the sequence to make in a multiple of 3, and then any seqeuence will run properly.
if (sequence_length %% 3 != 0) {
leftover <- 3 - nchar(sequence) %% 3
leftover_n <- paste(rep("N", leftover), collapse = "")
sequence <- paste0(sequence, leftover_n)
}
codons <- str_sub(sequence, seq(1, str_length(sequence), by = 3), seq(3, str_length(sequence), by = 3))
print(codons)
The slides for this lecture are available here