【NGS 次世代基因體資料科學】基礎教學 05 Burrows-Wheeler Aligner BWA 的使用介紹

Ming Han YangApr 2, 2017

BWA (Burrows-Wheeler Aligner) 是常見的基因體序列比對程式，關於DNA序列的研究會使用它來把定序回來的原始資料(FASTQ)比對回參考基因組上，其中類似的比對軟體還有Bowtie，Bowtie2，SOAPaligner等，商業方案則更多會考慮用illumina 的 DRAGEN Bio-IT Platform。這裡就介紹怎麼用BWA來做比對，它是做研究比較常用的比對軟體。

Table of Contents

BWA的編譯和安裝

這邊可以直接參考官方的安裝指令（需有make）：

git clone https://github.com/lh3/bwa.git
cd bwa; make

不過由於採用的0.7.17版本跟我的GCC編譯器版本會有衝突，我這邊按照這裡的指示去修改rle.h的程式碼，並重新整理成shell script 如下：

ShellScript

#!/bin/bash
# download & compile bwa
wget https://github.com/lh3/bwa/archive/refs/tags/v0.7.17.tar.gz
tar -xzf v0.7.17.tar.gz
cd bwa-0.7.17
sed -i 's/const\suint8_t\srle_auxtab/extern const uint8_t rle_auxtab/g' rle.h
# bug fix see also
#https://github.com/lh3/bwa/commit/2a1ae7b6f34a96ea25be007ac9d91e57e9d32284

make 
mv bwa ../
cd ../
rm -r bwa-0.7.17
rm v0.7.17.tar.gz
./bwa

執行之後如果成功編譯完成會顯示：

Program: bwa (alignment via Burrows-Wheeler transformation)
Version: 0.7.17-r1188
Contact: Heng Li <lh3@sanger.ac.uk>


Usage:   bwa <command> [options]


Command: index         index sequences in the FASTA format
        mem           BWA-MEM algorithm
        fastmap       identify super-maximal exact matches
        pemerge       merge overlapping paired ends (EXPERIMENTAL)
        aln           gapped/ungapped alignment
        samse         generate alignment (single ended)
        sampe         generate alignment (paired ended)
        bwasw         BWA-SW for long queries


        shm           manage indices in shared memory
        fa2pac        convert FASTA to PAC format
        pac2bwt       generate BWT from PAC
        pac2bwtgen    alternative algorithm for generating BWT這裡就介紹怎麼用BWA來做比對，
        bwtupdate     update .bwt to the new format
        bwt2sa        generate SA from BWT and Occ


Note: To use BWA, you need to first index the genome with `bwa index'.
     There are three alignment algorithms in BWA: `mem', `bwasw', and
     `aln/samse/sampe'. If you are not sure which to use, try `bwa mem'
     first. Please `man ./bwa.1' for the manual.

BWA建立參考基因組的索引

常見已組裝好的參考基因組，例如hg19，mm10可以從UCSC 基因組網站https://hgdownload.soe.ucsc.edu/downloads.html 下載，解壓縮合併之後就可以用bwa index 這個指令去產生BWA專用的索引檔，後續除非使用其他版本的參考基因組，否則原則上這個步驟只需要做完一次即可。這裡示範建立ucsc hg19索引的shel script：

ShellScript

#!/bin/bash
# download ucsc hg19
mkdir bwa_hg19
cd bwa_hg19
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/chromFa.tar.gz
tar zvfx chromFa.tar.gz
cat *.fa > hg19.fa
rm chr*.fa
../bwa index -p hg19_bwa_idx -a bwtsw hg19.fa

除了UCSU的版本之外，可以下載參考基因組的地方還有NCBI，ENSEMBL以及Gencode。如果不是已經有參考基因組的物種，就需要進行從頭組裝(De novo assembly)

利用BWA 進行比對NGS資料

一般而言，下機後拿到的就是FASTQ格式的資料，或是published dataset 的raw data 下載回來，經過FastQC做完QC之後，就可進行Mapping的流程。這裡考慮處理資料所需的時間，使用以下較小的測資，用git下載：

ShellScript

git clone https://github.com/hartwigmedical/testdata.git

BWA mapping之後的輸出是SAM（Sequence Alignment Map）格式，BWA預設是輸出到標準輸出裝置，可以直接用Linux的管道(pipline)把資料轉送到samtools 透過指令view和sort得到未經處理壓縮過的BAM（Binary Alignment Map）格式。由於需要samtools來做壓縮的工作，因此還需要安裝好samtools：

sudo apt-get update
sudo apt-get install samtools

安裝好之後，就可以按照上述流程撰寫指令腳本。以下是整理好的fastq to bam的shell script 腳本範例：

ShellScript

#!/bin/bash
# bwa with samtools fastq to bam

bwa_path="."
ref_path="./bwa_hg19/hg19_bwa_idx"
raw_fq_R1='./testdata/100k_reads_hiseq/TESTX/TESTX_H7YRLADXX_S1_L001_R1_001.fastq.gz'
raw_fq_R2='./testdata/100k_reads_hiseq/TESTX/TESTX_H7YRLADXX_S1_L001_R2_001.fastq.gz'
thread='16'
mkdir ./tmp
$bwa_path/bwa mem -t $thread $ref_path $raw_fq_R1 $raw_fq_R2 2>mapping.log |
   samtools view -@ $thread -q 30 -bS - |
   samtools sort -@ $thread -T ./tmp >out.bam
samtools index out.bam
samtools flagstat out.bam >mapping.stat
rm -r ./tmp

這一步一般而言需要一段時間來跑，根據FASTQ的大小越大需要越久，到這一步驟走完從FASTQ到Raw BAM的流程，到這邊就會得到原始的比對結果就可以進行下一步定量或是定性(變異)的分析。

專案網址

https://github.com/whuang022ai/NGS_Analysis
0 forks.
0 stars.
0 open issues.

Recent commits:

Update README.md, GitHub
Update README.md, GitHub
Update README.md, GitHub
add zip data, whuang022ai
add readme, whuang022ai

Ming Han Yang

A passionate bioinformatician focuses on the next generation of medical science and biotechnology.

喜歡這樣的教學創作的話，歡迎小額贊助給予支持🙏

More about

【NGS 次世代基因體資料科學】基礎教學 05 Burrows-Wheeler Aligner BWA 的使用介紹

BWA的編譯和安裝

BWA建立參考基因組的索引

利用BWA 進行比對NGS資料

專案網址

Leave a Reply Cancel reply

Recent Posts

Recent Comments

Archives

Categories

Recent Posts

Recent Comments

BWA的編譯和安裝

BWA建立參考基因組的索引

利用BWA 進行比對NGS資料

專案網址

Related Posts

【NGS with Data Science】Gene2vec distributed representation of genes pipeline reproduction

【NGS with Data Science】Use bioinfokit to make a volcano plot

【NGS 次世代基因體資料科學】t-SNE簡介

Leave a Reply Cancel reply

Recent Posts

Recent Comments

Archives

Categories

Recent Posts

Recent Comments