Loading... # GEO ## 数据结构 GEOquery 数据结构大致分为两类。第一种是GDS, GPL和GSM,他们的操作和数据类型差不多;第二种是GSE,GSE数据是由GSM和GPL整合而成。 GPL,GEOPlatform (GPL) 芯片平台 GSM,GEO Sample (GSM) 样本ID号 GSE,GEO Series (GSE) study的ID号 GDS,GEO Dataset (GDS) 数据集的ID号 一篇文章可以有一个或者多个GSE数据集,一个GSE里面可以有一个或者多个GSM样本。多个研究的GSM样本可以根据研究目的整合为一个GDS,而每个数据集都有着自己对应的芯片平台,就是GPL。 如果是芯片数据,那么就需要看GPL平台里面关于每个探针对应的注释信息, 如果是高通量测序数据,一般要同步进入该GSE对应的SRA里面去下载sra数据,然后转为fastq格式数据再进行处理。 ### GDS, GSM 和 GPL 这些数据类组成 - 一个元数据头(与the SOFT 格式的头差不多) - 一个GEODataTable(数据)。 可以使用show()查看这些数据类。 ```R library("GEOquery") gsm <- getGEO(filename=system.file("GSMxxxxxx.txt.gz",package="GEOquery")) # 查看gsm metadata head(Meta(gsm)) # 查看 gsm data Table(gsm)[1:5,] # 查看GSM数据列的含义 Columns(gsm) ``` ### GSE类 GSE类组成: - 一个元数据头 - GPLList,列表,包含GPL信息 - GSMList,列表,包含GSM信息 ```R gse <- getGEO("GSE226791",GSEMatrix=FALSE) head(Meta(gse)) # names of all the GSM objects contained in the GSE names(GSMList(gse)) # and get the first GSM object on the list GSMList(gse)[[1]] # and the names of the GPLs represented names(GPLList(gse)) ``` ## mRNA测序数据组成 [单细胞3' mRNA测序数据与CPM、RPKM及FPKM之间的联系 - 简书 (jianshu.com)](https://www.jianshu.com/p/caf078f4cef1) ### transcript 转录本 ### LncRNA 长链非编码RNA ### CPM/RPM **R**eads/**C**ounts of exon model **P**er **M**illion mapped reads,当然也有文章里面简单称之为:**c**ounts **p**er **m**illion。(Shulman and Elkon, 2019)其具体计算方法为: $$ \dfrac{read/counts\ mapped\ to\ a\ gene A*10^6}{total\ reads\ mapped\ to\ all\ genes} $$ 每100万个reads中有多少个是来源于基因A的。那么CPM到底实现了什么样的目的呢?设想一下,如果我们在单细胞测序时,由于我们技术上的偏差,使得两个细胞测序深度差异比较大,但实际上,这两个细胞本身转录出来的mRNA的量可能本身差异并不大,这个时候如果我们将两个细胞的reads数统一限制到100万,这样就能够很好的解决由技术导致的偏差,CPM就是干了这件事 ### RPKM RPKM全称是**R**eads **P**er **K**ilobase per **M**illion mapped reads,其表示的含义是**每千个碱基的转录在每百万个转录中读取的reads**,好拗口,翻译一下,RPKM的含义就是假设我们现在拿到了100万个reads,其中一个基因每1000个外显子碱基转录出来的reads数。(Mortazavi et al., 2008) 计算方法为: $$ \dfrac{numReads}{total\,\,numReads(M)*exon\,\,lengh(KB)} $$ 我们在进行表达量的比较时,如果我们一个read记为一次表达,注意,这里我们进行的是全长测序,比如smart2-seq,全长测序最终我们是要将cDNA打断成小片段测序的,那这个时候,如果一个基因本身就很长,那它测得的reads数相对来说就会更多,但是不是一定它的表达量就多呢?实际上并不一定,比如一个外显子共长10000个碱基的基因转录出一个mRNA最终可能会检测到400个reads,而一个外显子共长200个碱基的基因转录处一个mRNA可能最终只会检测到1个read,但实际上它们都只转录出一个mRNA,所以我们单纯用reads总数就会出错,RPKM就很好的避免了这样的错误,“每1000个外显子碱基”很好的排除了基因长度的影响,“每100万个reads”很好的排除了测序深度的影响 ### FPKM FPKM全称**F**ragments **P**er **K**ilobase of exon model per **M**illion mapped fragments,其表示的含义是每千个碱基的转录在每百万个片段中读取的片段数,首先就说,FPKM和RPKM的目的是一样的,都是为了消除基因长度和测序深度的影响。那么FPKM和RPKM的区别究竟在哪儿呢?答案就在reads和fragment的区别上,**RPKM适用于单端测序,FPKM适用于双端测序**,reads就是指的测序得到的一个序列,而fragment指的是被双端测序的那个核苷酸片段,一般来说双端测序,两个reads能够确定一个fragment,当然也存在一个read质量不高被剔除的情况,总之,理论上是reads总数在fragment总数的1~2倍。FPKM的计算方式与RPKM是一样的,只不过用fragment替换掉了reads。 ### single cell RNA 3'-end sequencing 我们在单细胞3' RNA测序中,很重要的工作就是衡量不同基因的表达情况,但是我们是用CPM还是RPKM还是FPKM呢?**实际上,你只能并且只会用到CPM**,试想一下,3'端测序难道不是测到一个read就是相当于一个mRNA吗?这个数量和基因长度有关系吗?显然没有关系,基因再长,3'端测序终究只能对每个mRNA测到一个read,所以在这个里面根本不存在基因长度的影响,只存在测序深度的影响 ## Converting to BioConductor ExpressionSets and limma MALists GEO datasets与limma 数据结构MAList 和Biobase数据结构 ExpressionSet比较相似。可以相互转换: - GDS2MA 和 GDS2eSet ### Getting GSE Series Matrix files as an ExpressionSet GEO Series是一套实验数据的集合,有SOFT,MINiML格式文件,以及一个 Series Matrix File(s)文本。Series Matrix File是tab-delimited text,`getGEO`函数可以解析,解析结果就是ExpressionSets。 ## 分析流程 ```R # 数据读取与清洗 -------------------------------------------------------------------- library(tidyverse) library(GEOquery) # 读取GSE42872的数据,其中32周动脉粥样硬化小鼠的数据使用GPL1261创建 data_set_name <- "GSE10000_GPL1261_" data_path <- "F:/vmshare/ubuntu/LuoJY/MR/batchscript/smr/GSE10000/" data <- read_delim("smr/GSE10000/GSE10000-GPL1261_series_matrix.txt", delim = "\t", escape_double = FALSE, comment = "!", trim_ws = TRUE) #eSet <- getGEO("GSE10000", # destdir = './smr', #下载在当前目录 # getGPL = F) #平台信息不要 ## Found 2 file(s),选择第一个 #gpl <- eSet[[1]]@annotation # 为GPL1261 # 搜索对应平台http://www.bio-info-trainee.com/1399.html # 即GPL1261 -> mouse430a2 #BiocManager::install("mouse430a2.db") library("mouse430a2.db") ls("package:mouse430a2.db") ids <- toTable(mouse430a2SYMBOL) # 查看多少数据存在对应的symbol table(data$ID_REF %in% ids$probe_id) # 去除没有对应symbol的data data <- data[data$ID_REF %in% ids$probe_id,] # 使用match函数把ids里的探针顺序改一下,使ids里探针顺序和表达矩阵的顺序完全一样。 #ids <- ids[match(rownames(data),ids$probe_id),] data <- left_join(data,ids,by = c('ID_REF' = 'probe_id')) # 根据平均值去除重复映射到同一个symbol的probe_id data <- data %>% mutate(mean_value = rowMeans(dplyr::select(.,GSM44658:GSM250943), na.rm = TRUE)) %>% group_by(symbol) %>% arrange(desc(mean_value)) %>% distinct(symbol, .keep_all = TRUE) %>% arrange(symbol) %>% dplyr::select(c(symbol,GSM44658:GSM250943)) # data #rownames(data) <- data$symbol # GSM44658 wt-Aorta-32-weeks-1 # GSM44659 wt-Aorta-32-weeks-3 # GSM44660 apoE-Aorta-32-weeks-1 # GSM44661 apoE-Aorta-32-weeks-2 # GSM44662 apoE-Aorta-32-weeks-3 # GSM44663 wt-Aorta-32-weeks-2 data <- data %>% ungroup() %>% dplyr::select(symbol,GSM44658,GSM44663,GSM44659,GSM44660,GSM44661,GSM44662) group_list<- c("wt","wt","wt","apoe","apoe","apoe") data%>% dplyr::filter(symbol == 'Gapdh') data%>% dplyr::filter(symbol == 'Actb') data_long <- data %>% gather(key = "sample", value = "value", -symbol) %>% mutate(type = ifelse(sample %in% c("GSM44658", "GSM44663", "GSM44659"), "wt", "apoe"), type_sample = factor(paste(type, sample))) # 绘制小提琴图 plot_qc_mean <- ggplot(data_long, aes(x = type_sample, y = value, fill = type)) + geom_violin() + geom_text(data = data_long %>% filter(symbol %in% c("Actb", "Gapdh")), aes(label = symbol), position = position_nudge(y = -1.5), size = 3,color = "red") + theme_minimal() + labs(x = "Sample", y = "Value", fill = "Type") plot_qc_mean <- plot_qc_mean + stat_summary(fun="mean",geom="crossbar", width = 0.6,linewidth = 0.2 ,color = "black") #options(repr.plot.width =15, repr.plot.height =9) #pdf(paste0(data_path,data_set_name,"qc_mean.pdf")) #p #dev.off() ggsave(paste0(data_path,data_set_name,"qc_mean.pdf"),plot=plot_qc_mean,width = 9,height = 9,dpi = 600) # 聚类 -------------------------------------------------------------------- library(ggdendro) data_hclust <- dplyr::select(data,-symbol) colnames(data_hclust) <- c("wt GSM44658", "wt GSM44663", "wt GSM44659", "apoe GSM44660", "apoe GSM44661", "apoe GSM44662") data_dist <- dist(t(data_hclust)) #nodePar <- list(lab.cex = 0.6, pch = c(NA, 19) ,cex = 0.7, col = "blue") hc <- hclust(data_dist) # 转换为 ggplot2 图形 ggdendro_data <- dendro_data(hc) plot_qc_hclust <- ggplot() + geom_segment(data = ggdendro_data$segments, aes(x = x, y = y, xend = xend, yend = yend), color = "blue", # 线条颜色 linewidth = 0.5) + # 线条宽度 geom_text(data = ggdendro_data$labels, aes(x = x, y = y, label = label), hjust = 1, vjust = 1, size = 3, # 标签字体大小 color = "red") + # 标签颜色 coord_flip() + theme_minimal() + theme(axis.text.y = element_text(size = 8), # y轴标签字体大小 axis.title.y = element_blank()) # 移除y轴标题 ggsave(paste0(data_path,data_set_name,"qc_hclust.pdf"),plot=plot_qc_hclust,scale = 0.6,width = 20,height = 9,dpi = 600) # PCA -------------------------------------------------------------------- library(factoextra) data_pca <- data %>% dplyr::select(-symbol) %>% prcomp(center = TRUE, scale = TRUE) data_pcaplot <- get_pca(data_pca) pdf(paste0(data_path,data_set_name,"qc_pca.pdf")) fviz_pca_biplot(data_pca, label = "var") fviz_pca_var(data_pca) fviz_screeplot(data_pca,addlabels = TRUE, choice = "eigenvalue") fviz_screeplot(data_pca,addlabels = TRUE, choice = "variance") # #p dev.off() library(ggfortify) pdf(paste0(data_path,data_set_name,"qc_pca1.pdf")) data_pca2<-data.frame(t(dplyr::select(data,-symbol))) data_pca2$group <- c("wt", "wt", "wt", "apoe", "apoe", "apoe") autoplot(prcomp(data_pca2[,1:(ncol(data_pca2)-1)] ), data=data_pca2,colour = 'group') dev.off() # 差异统计 -------------------------------------------------------------------- library(limma) #表达矩阵(data) #分组矩阵(design) #差异比较矩阵(contrast.matrix) data_limma <- data.frame(dplyr::select(data,-symbol)) rownames(data_limma) <- data$symbol # 做分组矩阵 design <- model.matrix(~0+factor(group_list)) colnames(design) <- levels(factor(group_list)) rownames(design) <- colnames(dplyr::select(data,-symbol)) design # 差异比较矩阵 contrast.matrix<-makeContrasts(paste0(c("wt","apoe"),collapse = "-"),levels = design) contrast.matrix #得到的分组矩阵 ##step1 fit <- lmFit(data_limma,design) ##step2 fit2 <- contrasts.fit(fit, contrast.matrix) ##这一步很重要,大家可以自行看看效果 fit2 <- eBayes(fit2) ## default no trend !!! ##eBayes() with trend=TRUE ##step3 tempOutput <- topTable(fit2, coef=1, n=Inf) nrDEG <- na.omit(tempOutput) #write.csv(nrDEG2,"limma_notrend.results.csv",quote = F) head(nrDEG) ##筛选差异排名前25绘制热图,验证差异是否明显 library(pheatmap) colnames(data_limma) <- unlist(lapply(colnames(data_limma),function(x){paste(ifelse (x %in% c("GSM44658", "GSM44663", "GSM44659"), "wt", "apoe"),x)})) choose_gene <- head(rownames(nrDEG),25) choose_matrix <- data_limma[choose_gene,] choose_matrix <- t(scale(t(choose_matrix))) pdf(paste0(data_path,data_set_name,"qc_heatmap.pdf")) pheatmap(choose_matrix) dev.off() # 火山图 -------------------------------------------------------------------- #colnames(nrDEG) #plot(nrDEG$logFC,-log10(nrDEG$P.Value)) DEG <- nrDEG logFC_cutoff <- with(DEG,mean(abs( logFC)) + 2*sd(abs( logFC)) ) DEG$change <- as.factor(ifelse(DEG$P.Value < 0.05 & abs(DEG$logFC) > logFC_cutoff, ifelse(DEG$logFC > logFC_cutoff ,'UP','DOWN'),'NOT')) this_tile <- paste0('Cutoff for logFC is ',round(logFC_cutoff,3), '\nThe number of up gene is ',nrow(DEG[DEG$change =='UP',]) , '\nThe number of down gene is ',nrow(DEG[DEG$change =='DOWN',]) ) this_tile head(DEG) g <- ggplot(data=DEG, aes(x=logFC, y=-log10(P.Value), color=change)) + geom_point(alpha=0.4, size=1.75) + theme_set(theme_set(theme_bw(base_size=20)))+ xlab("log2 fold change") + ylab("-log10 p-value") + ggtitle( this_tile ) + theme(plot.title = element_text(size=15,hjust = 0.5))+ scale_colour_manual(values = c('blue','black','red')) ## corresponding to the levels(res$change) ggsave(paste0(data_path,data_set_name,"qc_volcano.pdf"),plot=g,scale = 0.6,width = 12,height = 12,dpi = 600) data_up_filter <- filter(DEG,change == "UP") data_down_filter <- filter(DEG,change == "DOWN") write.csv(data_up_filter, paste0(data_path,data_set_name,"qc_volcano_up_gene.csv"), row.names = TRUE) write.csv(data_down_filter, paste0(data_path,data_set_name,"qc_volcano_down_gene.csv"), row.names = TRUE) # KEGG pathway analysis -------------------------------------------------------------------- #BiocManager::install("clusterProfiler") # 查询Mus musculus的基因数据库 #library(AnnotationHub) #hub <- AnnotationHub() #query(hub, 'Mus musculus') library(clusterProfiler) library(org.Mm.eg.db) data_pathway <- mutate(DEG,symbol = rownames(DEG)) s2e <- bitr(unique(data_pathway$symbol), fromType = "SYMBOL", #ID转换核心函数bitr toType = c( "ENTREZID"), OrgDb = org.Mm.eg.db) #head(s2e) #head(deg) data_pathway <- inner_join(data_pathway,s2e,by=c("symbol"="SYMBOL")) gene_up <- data_pathway[data_pathway$change == 'UP','ENTREZID'] gene_down <- data_pathway[data_pathway$change == 'DOWN','ENTREZID'] gene_diff <- c(gene_up,gene_down) gene_all <- data_pathway[,'ENTREZID'] kk.up <- enrichKEGG(gene = gene_up, organism = 'mmu', universe = gene_all, pvalueCutoff = 0.9, qvalueCutoff =0.9) head(kk.up)[,1:6] dim(kk.up) kk.down <- enrichKEGG(gene = gene_down, organism = 'mmu', universe = gene_all, pvalueCutoff = 0.9, qvalueCutoff =0.9) head(kk.down)[,1:6] dim(kk.down) kk.diff <- enrichKEGG(gene = gene_diff, organism = 'mmu', pvalueCutoff = 0.05) head(kk.diff)[,1:6] class(kk.diff) #提取出数据框 kegg_diff_dt <- kk.diff@result #根据pvalue来选,用于可视化 down_kegg <- kk.down@result %>% filter(pvalue<0.05) %>% mutate(group=-1) up_kegg <- kk.up@result %>% filter(pvalue<0.05) %>% mutate(group=1) kegg_plot <- function(up_kegg,down_kegg){ dat=rbind(up_kegg,down_kegg) colnames(dat) dat$pvalue = -log10(dat$pvalue) dat$pvalue=dat$pvalue*dat$group dat=dat[order(dat$pvalue,decreasing = F),] g_kegg<- ggplot(dat, aes(x=reorder(Description,order(pvalue, decreasing = F)), y=pvalue, fill=group)) + geom_bar(stat="identity") + scale_fill_gradient(low="blue",high="red",guide = FALSE) + scale_x_discrete(name ="Pathway names") + scale_y_continuous(name ="log10P-value") + coord_flip() + theme_bw()+theme(plot.title = element_text(hjust = 0.5))+ ggtitle("Pathway Enrichment") } g_kegg <- kegg_plot(up_kegg,down_kegg) ggsave(paste0(data_path,data_set_name,"kegg.pdf"),plot=g_kegg,scale = 1,width = 12,height = 12,dpi = 600) # 导出基因列表和对应富集功能 gene_up_list <- data_pathway[data_pathway$change %in% c('UP'), 'ENTREZID'] gene_down_list <- data_pathway[data_pathway$change %in% c( 'DOWN'), 'ENTREZID'] gene_up_enrichment_results <- kk.up@result[, c("ID", "Description")] gene_down_enrichment_results <- kk.down@result[, c("ID", "Description")] write.csv(gene_up_list, paste0(data_path,data_set_name,"kegg_gene_up_list.csv"), row.names = TRUE) write.csv(gene_up_enrichment_results, paste0(data_path,data_set_name,"kegg_gene_up_enrichment_results.csv"), row.names = FALSE) write.csv(gene_down_list, paste0(data_path,data_set_name,"kegg_gene_down_list.csv"), row.names = FALSE) write.csv(gene_down_enrichment_results, paste0(data_path,data_set_name,"kegggene_down_enrichment_results.csv"), row.names = FALSE) # GSEA -------------------------------------------------------------------- data_gsea <- data_pathway data(geneList, package="DOSE") head(geneList) length(geneList) names(geneList) boxplot(geneList) boxplot(data_gsea$logFC) geneList <- data_gsea$logFC names(geneList) <- data_gsea$ENTREZID geneList <- sort(geneList,decreasing = T) kk_gse <- gseKEGG(geneList = geneList, organism = 'mmu', #nPerm = 1000, minGSSize = 120, pvalueCutoff = 0.9, verbose = FALSE) head(kk_gse)[,1:6] gseaplot(kk_gse, geneSetID = rownames(kk_gse[1,])) down_kegg<-kk_gse[kk_gse$pvalue<0.05 & kk_gse$enrichmentScore < 0,];down_kegg$group=-1 up_kegg<-kk_gse[kk_gse$pvalue<0.05 & kk_gse$enrichmentScore > 0,];up_kegg$group=1 gse_kegg <- kegg_plot(up_kegg,down_kegg) ggsave(paste0(data_path,data_set_name,"gsea.pdf"),plot=gse_kegg,scale = 1,width = 12,height = 12,dpi = 600) #print(gse_kegg) #ggsave(gse_kegg,filename ='kegg_up_down_gsea.png') # GO database analysis -------------------------------------------------------------------- #go富集分析--耗费时间灰常长,很正常 library(clusterProfiler) #输入数据 data_go <- data_pathway gene_go_up <- data_go[data_go$change == 'UP','ENTREZID'] gene_go_down <- data_go[data_go$change == 'DOWN','ENTREZID'] gene_go_diff <- c(gene_go_up,gene_go_down) head(data_go) #**GO分析三大块** #细胞组分 ego_CC <- enrichGO(gene = gene_go_diff, OrgDb= org.Mm.eg.db, ont = "CC", pAdjustMethod = "BH", minGSSize = 1, pvalueCutoff = 0.01, qvalueCutoff = 0.01, readable = TRUE) #生物过程 ego_BP <- enrichGO(gene = gene_go_diff, OrgDb= org.Mm.eg.db, ont = "BP", pAdjustMethod = "BH", minGSSize = 1, pvalueCutoff = 0.01, qvalueCutoff = 0.01, readable = TRUE) #分子功能: ego_MF <- enrichGO(gene = gene_go_diff, OrgDb= org.Mm.eg.db, ont = "MF", pAdjustMethod = "BH", minGSSize = 1, pvalueCutoff = 0.01, qvalueCutoff = 0.01, readable = TRUE) #作图 pdf(paste0(data_path,data_set_name,"go.pdf"), width = 14, height = 8) #第一种,条带图,按p从小到大排的 barplot(ego_CC, showCategory=20,title="EnrichmentGO_CC",label_format = 50) barplot(ego_BP, showCategory=20,title="EnrichmentGO_BP", label_format = 50) barplot(ego_MF, showCategory=20,title="EnrichmentGO_MF",label_format = 50) #如果运行了没出图,就dev.new() #第二种,点图,按富集数从大到小的 dotplot(ego_CC,title="EnrichmentGO_CC_dot",label_format = 50) dotplot(ego_BP,title="EnrichmentGO_BP_dot",label_format = 50) dotplot(ego_MF,title="EnrichmentGO_MF_dot",label_format = 50) dev.off() ``` 最后修改:2024 年 02 月 02 日 © 允许规范转载 赞 如果觉得我的文章对你有用,请随意赞赏