【NGS 次世代基因體資料科學】使用bioinfokit繪製火山圖Volcano Plot

這裡介紹如何用現成的套件快速繪製火山圖(Volcano plot)。在做基因體量化分析的下游,常見需要製作差異分析,找出哪些基因在不同條件下有顯著差異。火山圖是這類型問題的標準化圖表,是散點圖的一種特殊形式。其橫座標通常是兩條件差異倍率的對數(log2 Fold change),而縱座標則是顯著性值或校正後的顯顯著性值的負對數(-log10 p-val)。每個點就是檢驗標的,若是像RNA-Seq等常見的分析就會是一個個的基因。
這裡介紹使用Python的套件bioinfokit來繪製火山圖,首先我們需要下載範例資料以及安裝bioinfokit :
pip install bioinfokit
wget https://zenodo.org/record/2529117/files/limma-voom_luminalpregnant-luminallactate
wget https://zenodo.org/record/2529117/files/volcano_genes
稍微看一下資料的情況
head limma-voom_luminalpregnant-luminallactate
輸出:
ENTREZID SYMBOL GENENAME logFC AveExpr t P.Value adj.P.Val
12992 Csn1s2b casein alpha s2-like B -8.603611114762 3.56295004142591 -43.7964980711435 3.83064977005569e-15 6.05395889659601e-11
13358 Slc25a1 solute carrier family 25 (mitochondrial carrier, citrate transporter), member 1 -4.12417532129173 5.77969894403042 -29.9078492752674 1.75859473379618e-13 1.38964155864574e-09
11941 Atp2b2 ATPase, Ca++ transporting, plasma membrane 2 -7.38698638678659 1.28214314739647 -27.8194991746381 4.83636254056037e-13 2.43279979019347e-09
20531 Slc34a2 solute carrier family 34 (sodium phosphate), member 2 -4.17781242057656 4.27862903538987 -27.0727230566646 6.15742796809282e-13 2.43279979019347e-09
100705 Acacb acetyl-Coenzyme A carboxylase beta -4.3143199499725 4.4409137064501 -25.2235657685746 1.4999774593805e-12 4.74112875360987e-09
13645 Egf epidermal growth factor -5.36266382024579 0.73590465313728 -24.5993025952199 2.11624444834827e-12 5.57418787694935e-09
230810 Slc30a2 solute carrier family 30 (zinc transporter), member 2 -3.20311813582619 2.69581147606106 -23.80427759327 3.02466813499713e-12 6.82883645792781e-09
68801 Elovl5 ELOVL family member 5, elongation of long chain fatty acids (yeast) -2.86330403668687 6.45520453127796 -22.3535751591657 6.5987442593808e-12 1.30358192844068e-08
19659 Rbp1 retinol binding protein 1, cellular 5.44304402150002 6.10703318183881 21.0523576693295 1.47914323855668e-11 2.36474559807046e-08
limma-voom_luminalpregnant-luminallactate是一個tsv格式的下游分析結果。縱向是基因,橫向則是每個基因的變化量(這個案例是logFC),以及P值和校正後P值(adj.P.Val)
再看一下volcano_genes
head volcano_genes
GeneID
Mcl1
Hbegf
Tgfb2
Cxcl16
Csf1
Pdgfb
Edn1
Lif
Kitl
它是等一下要標註上名稱的候選基因名稱的列表
接著我們使用pandas把tsv讀成dataframe,並傳入visuz.GeneExpression.volcano這個函式就可以了
from bioinfokit import analys, visuz
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv("limma-voom_luminalpregnant-luminallactat", sep='\t')
with open('volcano_genes') as f:
genes = f.read().splitlines()
df.dropna(inplace=True)
visuz.GeneExpression.volcano(df=df, lfc='logFC', pv='adj.P.Val', lfc_thr=(1, 1), sign_line=True, plotlegend=True, legendpos='upper right', legendanchor=(
1.46, 1), dotsize=0.1, gfont=7, xlm=(-10.0, 10.0, 1), pv_thr=(0.05, 0.05), genenames=tuple(genes), geneid="SYMBOL")
執行之後會在程式得得目錄下輸出圖檔

以上就是繪製火山圖的介紹。