Information

Is it possible to re-align SAM/BAM files?


I am carrying out some metagenomic comparisons using a whole genome sequencing approach. I would like to filter reads out of my files sequentially: align to a first reference genome, extract the mapped reads from the generated SAM/BAM file with samtools view -F4, regenerate fastq files, and align those to another reference genome. This decreases the size of the input and speeds up the procedure. Using BWA, though, I get a file synchronization error when doing the second alignment:

[mem_sam_pe] paired reads have different names: "HISEQ:90:C3UNJACXX:1:1107:16388:61141", "HISEQ:90:C3UNJACXX:3:1303:2008:58095"
[mem_sam_pe] paired reads have different names: "HWI-ST1437:64:C3UM1ACXX:4:1214:3833:62731", "HWI-ST1437:64:C3UM1ACXX:4:1202:18519:45906"

This is weird, since the fastq files were generated from the filtered BAM file; samtools should have kept the two files in sync. My question is: is it possible to realign a BAM file, maybe not with BWA, but at least with another tool?

Thank you.

The procedure I carried out is:

# align
bwa mem -R <read_group> <reference_1> <reads_R1.fq> <reads_R2.fq> -o <aln.sam>
# convert
samtools view -Sb <aln.sam> > <aln.bam>
# sort
samtools sort <aln.bam> -o <aln.sorted.bam>
# select mapped
samtools view -h -F 4 <aln.sorted.bam> -o <mapped.bam>
# convert to fastq
samtools fastq -1 <mapped_R1.fq> -2 <mapped_R2.fq> <mapped.bam>
# re-align
bwa mem -R <read_group> <reference_2> <mapped_R1.fq> <mapped_R2.fq> | samtools sort -o <aln2.sorted.bam>

Looks like I solved it by using Picard: I extracted the aligned reads but without sorting, then used:

java -jar picard.jar SamToFastq I=<mapped.bam> FASTQ=<mapped_R1.fq> SECOND_END_FASTQ=<mapped_R2.fq>
bwa mem -R <read_group> <reference_2> <mapped_R1.fq> <mapped_R2.fq>

From samtools flagstat the output looks fine.


Is it possible to re-align SAM/BAM files? - Biology


I extracted a specific region from a BAM file generated by aligning fastq reads (Illumina/whole genome sequencing) to hg38, converted it to the corresponding fastq reads of that region with Picard (SamToFastq), and tried to realign them to another (my own) reference sequence with bwa mem. But it seems that bwa mem cannot recognize the data as paired-end, since it frequently says "skip orientation FF as there are not enough pairs" during the mapping; the same happened for the other orientations (FR, RF, and RR), even though there are 1,259,352 paired reads. Subsequently, the properly paired percentage is very low. Could you please kindly help me figure out how to solve the problem?

Edit and update: I also found that during the conversion of bam to fastq with SamToFastq (Picard), it returned: SAM validation error: ERROR: Found 13548 unpaired mates. I think it may not be a real problem, actually; it seems that Picard just considers paired reads when creating the fastq files. So the real problem is with bwa mem, yes? Please kindly share your ideas with me.

Here are the commands I used:

Show us the headers of the files you extracted from the original alignment. They have likely lost the read origination information.

While trivial, this error has come up in the past when people try to use the same file twice (e.g. R1/R1 instead of R1/R2) when doing the second alignment. Something to check on.

Sorry, the file is on the cluster and I can't transfer it. Could you please tell me where the read orientation information is in the BAM header and what I should look for? I did the alignment with two fastq files, R1/R2.

What does zcat filename_R1.fq.gz | head -4 and zcat filename_R2.fq.gz | head -4 show? If your files are uncompressed then replace zcat with cat in the command.

The read names in the two fastq files generated by SamToFastq are the same except for /1 and /2. Also, the read counts of the two fastq files are the same. So the problem is somewhere else. In addition, someone says "bwa expects the first read of file one to be the mate of the first read of file 2. That assumption obviously does not hold throughout the whole file, because sorting by coordinates and throwing away reads ruins that" in this post. What do you think?

You should post the exact commands you used to extract the reads from the bam file and subsequently to map the fastq files again.

You should also do your best to provide the information requested when someone asks for additional information.

I added the used commands to the original post.

Either the original alignment or your conversion with SamToFastq eliminated the identifiers that show which of the reads is from R1 and which is from R2. Since you assure us that the reads in R1/R2 are in sync, you can fix that issue by using reformat.sh from the BBMap suite:

Add one of the two options below; either should work.

Here is the output of head -4 of the two fastq files (sorry if the resolution is not high enough). As you can see, the read names in both fastq files are the same and carry /1 and /2 respectively, so the read names and their identifiers (/1 and /2) are correct. Is it possible that the read names or the related identifiers are wrong elsewhere in the fastq files? Do you still suggest using repair.sh?

My apologies. I meant to say reformat.sh instead of repair.sh (fixed now).

BTW: We don't see your fastq examples. You can copy and paste the lines (don't post as an image) and then format using the code button.


Your reads do have /1 and /2. So it is possible that there just are not enough reads for bwa to do that estimate.

Yes, the reads have /1 and /2 based on the head of the two fastq files. There are 1,259,352 paired reads; isn't that enough for bwa mem, in your opinion? I also tested bowtie2, but the overall alignment rate was very bad (about 5.4%). With bwa mem, the mapping percentage is about 95%, but the properly paired rate is very low (about 10%). Please kindly share any suggestions and ideas.

Are you getting that error/warning with a) the original data AND b) the extracted reads (~13K) for a specific region? I thought it was only the latter and thus there may not be enough data.

Why are you cherry-picking data like this, BTW? Is your own reference just for this smaller region? If it is for the whole genome, why can't you use the full data to align?

/1 and /2 have not been used for a while. Is this old data?

Thank you for following the issue. Regarding your questions: 1) I get the error/warning with the extracted reads; there are more than 1M PE reads (not ~13K) in the extracted bam file, as I mentioned in my previous comment. Assuming bwa mem may predict the wrong insert size distribution, I used CollectInsertSizeMetrics (from Picard) to estimate the insert size distribution in the extracted bam file; however, I'm not sure how these numbers should be fed to bwa mem via the -I flag. Could you please kindly show me an example?

2) Yes, my own reference covers just this small region. Actually, this region is HLA and I would like to visualize the reported alleles from HLA typing (of whole genome sequencing data) in IGV. So I extracted this region, converted it to fastq files, and tried to re-align to HLA genomic sequences as the reference. It would be great if you could share any idea/suggestion on this issue.
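For reference, bwa mem's -I option takes the insert size mean, standard deviation, max and min as a comma-separated list; only the mean is required, and the rest are inferred if omitted. A minimal sketch with hypothetical numbers taken from CollectInsertSizeMetrics output and placeholder file names:

# MEAN_INSERT_SIZE=350 and STANDARD_DEVIATION=50 are hypothetical Picard values
bwa mem -I 350,50 hla_ref.fa extracted_R1.fq extracted_R2.fq > realigned.sam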

When you re-align these, are you not supposed to re-align to the HLA FASTA sequences from, e.g., IMGT?

Hi Kevin, if not re-aligning to the HLA fasta sequences, please kindly tell me which reference sequence I should realign to? I'm getting more confused.

I think that I showed you (in some other thread) a published work by my colleagues at UCL (?) - in that, there are a few steps to follow.

Yes, you mean LOHHLA, I think. However, at the moment I would like to focus just on HLA allele visualization (the HLA typing from whole genome sequencing was done by another person). After solving the current problem, I will certainly test LOHHLA, but it requires NovoAlign, which is not free! So, for now, just to visualize the typed HLA alleles in IGV, I realigned the extracted fastq reads to the HLA fasta sequences as reference with bwa mem and got a very low properly-paired rate. Considering that the reads in the two generated fastq files are paired, based on the head output I showed in previous comments in this post, I think bwa mem may be wrongly estimating the insert size distribution. That's why I calculated it with Picard, but I am not sure how to feed it to bwa mem via the -I option. Kevin, could you please kindly help me solve the current issue?

I did try to replicate that pipeline myself (LOHHLA), but ended up becoming too busy. I actually had a meeting with the developer (Dr. McGranahan) but cannot recall the finer details of it, at this stage. Yes, NovoAlign was a strange choice, and remains so. However, I believe that older versions of it are free?

HLA typing has been an obscure and strange area in bioinformatics these past few years. My first attempt at it was in 2015 on Roche 454 data and I did everything manually - it was excruciating.

Can you take a look at the suggestion by Wouter, perhaps? - see below.

Thanks. As I found, there are several programs for HLA typing, so there is no need to work manually these days, but I prefer to work manually on one sample for better learning; please kindly share a well-documented guide/tutorial if you know one. The tool suggested by Wouter is another HLA typing program; as I told you, at the moment I have to focus just on HLA allele visualization.



Estimating paired-end read insert length from SAM/BAM files

I wrote a single Python script to estimate the paired-end read insert length (or fragment length) from read mapping information (i.e., SAM/BAM files). The algorithm is simple: check the TLEN field in the SAM format, throw out paired-end reads whose mates are too far away, and use the rest to estimate the mean and variance of the insert length.
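For readers who want the gist before opening the script, here is a minimal sketch of that algorithm; this is not the script itself, and the 10 kb cutoff is an arbitrary example threshold:

#!/usr/bin/env python3
# Sketch: estimate insert size from SAM records arriving on stdin.
# Keeps one record per mapped pair (the mate with positive TLEN) and
# discards pairs whose span exceeds an arbitrary 10 kb cutoff.
import statistics
import sys

spans = []
for line in sys.stdin:
    if line.startswith('@'):                      # skip SAM header lines
        continue
    fields = line.rstrip('\n').split('\t')
    flag, tlen = int(fields[1]), int(fields[8])   # FLAG and TLEN columns
    if flag & 0x1 and not flag & 0x4 and 0 < tlen < 10000:
        spans.append(tlen)

if spans:
    print('Read span: mean %.1f, STD=%.1f'
          % (statistics.mean(spans), statistics.pstdev(spans)))
else:
    print('No qualified paired-end reads found. Are they single-end reads?')

Run it as: samtools view test.bam | head -n 1000000 | python3 sketch.py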

This script is also able to provide a detailed distribution of read length and read span for your convenience. Please refer to the detailed usage below.

This script is distributed in GitHub now.

Note : According to the SAM definition, the read span "equals the number of bases from the leftmost mapped base to the rightmost mapped base". This span is the distance between two reads in a paired-end read PLUS 2 times read length. Read span is different from the "mate-inner-distance" in Tophat (-r option), which measures only the distance between two reads in a paired-end read.

47 comments:

Hi Wei, I am using your script to calculate the insert size but it is saying this: [email protected]:

/Desktop$ samtools view bowtie_out.nameSorted.PropMapPairsForRSEM.bam | head -n 1000000 | python getinsertsize.py
1M.
Read length: mean 100.0, STD=0.0
Possible read length and their counts:
<100: 1000000>
No qualified paired-end reads found. Are they single-end reads?

Can you please tell me how to check whether my reads are single-end or paired-end?

Probably these reads are single-end reads. You can check a few lines of your BAM file, and look at the 7th and 8th fields of each line. If they are single-end reads, you will see marks like "* 0", which means there is no corresponding pair information.

I checked; my reads are paired-end. Actually, I built a de novo transcriptome and mapped the reads back to the transcriptome to make the BAM file. So will this script work in this case too, or is it written specifically for reference-based assemblies?

I also get a message saying:
No qualified paired-end reads found. Are they single-end reads?

but my reads are paired-end. I can see the pairing with igv. Do you have any suggestion?

It's possible that your bam is in an older format version. Can you paste a few lines of your bam file? Use "samtools view test.bam | head -n 10" to print the first 10 lines of your bam records (test.bam).

Hello. Thanks for sharing the script! BTW, can I ask why you used the PNEXT field (8th column in a SAM file) rather than the TLEN field (9th column in a SAM file)? From my understanding, TLEN indicates the inferred insert size.

In most cases you can use either way to calculate the insert size.

Thanks for the script Wei.

I had to modify it a little to get it to work, and I think this is the same problem the commenters above me were probably having. The NH:i: field is an optional field, and many short-read mappers don't actually include it, including some popular ones.

Since your script required the optional field to be present and equal to 1 to count that read, all the reads were getting skipped (giving that message: no qualified paired-end reads). I commented out those lines pertaining to this NH:i: field, and then the script ran successfully and counted the reads.

Can you please post the code here. I am still getting the error.

Just comment out lines 63 to 66 and it should be fine with the "NH:i" issues. I've also updated the script, so you can try the newest one.

I get the following error:

$samtools view 10081.marked.realigned.recal.bam | head -n 1000000 | python getinsertsize.py
File "getinsertsize.py", line 45
print(str(nline/1000000)+'M. ',file=sys.stderr)
^
SyntaxError: invalid syntax

Does anyone know what the problem is?

what's your python version? It works for 2.7, but I'm not sure if it works under 2.6 or lower.

Okay I updated my version and I got the following output:

1.0M.
Read length: mean 101.0, STD=0.0
Possible read length and their counts:
<101: 951678>
Read span: mean 0.0, STD=0.0

I am certain these are paired-end reads; why do I get 0 span?

Here is some of my bam file too:

HWI-1KL153:29:D1N72ACXX:8:1209:12276:12090 1187 chrM 9 60 101M = 200 292 GTCTATCACCCTATTAACCACTCACGGGAGCTCTCCATGCATTTGGTATTTTCGTCTGGGGGGTGTGCACGCGATAGCATTGCGAGACGCTGGAGCCGGAG [base qualities garbled in extraction] X0:i:1 X1:i:0 BD:Z:[garbled] MD:Z:101 RG:Z:Exome_10081 XG:i:0 BI:Z:[garbled] AM:i:37 NM:i:0 SM:i:37 XM:i:0 XO:i:0 MQ:i:60 XT:A:U

Do you mind sending me the first 10000 lines of your bam file (use head -n 10000)? I can take a look at what happens. You can zip the text file and send to li.david.wei AT gmail.com. Thanks!

Hi David, I've sent it to you; please let me know if you got it.

I've looked at your data. The problem is there are too many reads with insert size 0, which confuses the program and filters out all the reads. This was actually a bug in my program, and I've updated the source code; it should work well now. Thanks & give it a try!

PS: you need to run the program like "samtools view [ BAM file ] | head -n 1000000 | getinsertsize.py -" now instead of "samtools view [ BAM file ] | head -n 1000000 | getinsertsize.py".

it seems to work well thank you for your very quick responses!

For my one sample I have the following

Read span: mean 364.3316085392387, STD=123.087158684682

The STD seems to be fairly high, but I'm not exactly sure. Does anyone have experience with what the expected STD should be?

Probably using more reads will give a better estimate of the span. Also, it seems that many of your first 10M reads are from chrM, which may not be a good choice. You could try to use reads from other chromosomes.

I tried using the whole bam and I got pretty much the same number (~100 million reads). So this seems to be a real result.

I was also wondering, can the script be modified so that the span-distribution file is output with read counts in 10bp bins? I am new to Python and help would be greatly appreciated.

You can probably load the file into R and check the histogram. That should give you an intuitive overview of the span distribution.

Sorry, the arrow should be pointing at the equals sign in "file=sys.stderr".

Hi! and thank you very much for your script.

I would suggest improving its performance by getting rid of '.keys()' at lines 61 and 76:
if readlen in plrdlen.keys() would become if readlen in plrdlen.

By the way, could you define exactly what you mean by read span?

Thanks for your reply. read span is the distance between two reads in a paired-end read plus 2* read length. It is equal to the insert size of the fragment.

Great script!! Many thanks from Malaysia.

Wondering if you would know how to integrate the read span (assuming this read span refers to the insert length) with a tophat2 alignment?

More specifically, one of the tophat2 parameters is "-r" (also known as mate-inner-distance) http://tophat.cbcb.umd.edu/manual.html

So if, for example, the read span value from your script is 200, would you know what to use for the tophat2 -r parameter?

Would you need to consider the read length of the sample (for example 100bp) to determine a correct tophat2 -r value?

the tophat2 -r value is the read span minus 2*read length, so it's read length dependent. Since it is the distance between two reads in a paired-end read, it's called "mate-inner-distance".

Hmmmmmmm, following up from the question:

Including the -r parameter is straightforward thanks to your clarification, and I have now included the (read span - 2*read length) value in TopHat. To add a layer of complication: if I also wish to include the standard deviation value in the mapping parameters, I cannot just use the value from your script, am I correct?

For example the output from your script
read span: 242, SD: 90

Therefore the -r value would be: 42 (242 - 2*100). However, the same can't be done with the SD; we can't just subtract a value (if my understanding is correct).

The question I am asking: would it be possible to get an adjusted SD value based on the modified read-span length? I am thinking of simply subtracting 200 from each read span to get a new SD estimate.

I have been staring at the script to see where there would be a place to insert the read length*2 value and get a new, correct SD, but I can't pinpoint where.

Hi, the SD is the same (90 in your sample). This is because for a random variable x and a constant a, SD(x)=SD(x-a). The read span comes directly from the definition of the field[8], which already includes the read length.
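For a concrete illustration of this point: both numbers can be passed to TopHat2 directly, since it accepts the inner distance (-r) and its standard deviation (--mate-std-dev). A sketch using the example values from this thread, with placeholder index and file names:

# read span 242, read length 100  =>  -r = 242 - 2*100 = 42; the SD stays 90
tophat2 -r 42 --mate-std-dev 90 genome_index reads_R1.fq reads_R2.fq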

Thanks for the insight Wei!

I am afraid I still don't fully understand. I think I need to dive deeper into statistics as well as the SAM fields :)

I still fail to see why, if we adjust the read span (by subtracting 200), the SD will remain the same?

Would an alternate workflow be:
1) Run your script first, get an estimate for the read span
2) Subtract the value of readspan - 2*read-length from field[8]
3) Rerun an adjusted script and get a more accurate estimate for the SD

Or is it just I need to read more on basic statistics and SD :)

Thanks, just used this, very useful!

thank you for this post.
I have a question: why, with different subsamples of the same bam, can I get so many differences in read span?
For example :

1) samtools view accepted_hits_chr22.bam | head -n 100000 | python getinsertsize.py -
Read length: mean 101.0, STD=0.0
Read span: mean 236.150649351, STD=88.6332394609
2) samtools view accepted_hits_chr22.bam | head -n 400000 | python getinsertsize.py -
Read length: mean 101.0, STD=0.0
Read span: mean 221.92447156, STD=79.0142612548
3) samtools view accepted_hits_chr22.bam | head -n 800000 | python getinsertsize.py -
Read length: mean 101.0, STD=0.0
Read span: mean 3383.2321526, STD=3460.01998406
4) samtools view accepted_hits_chr22.bam | head -n 1000000 | python getinsertsize.py -
Read length: mean 101.0, STD=0.0
Read span: mean 3111.18687862, STD=3442.69889313

So this means there are some really long paired-end reads that disrupt the calculation of the read span. You can just use the results of the top 100k reads.


I'm trying your script on SAM format. I gave the following command:

python getinsertsize.py xyz.sam | -

It showed an error at line 16: no module named argparse.

Does the above command need any modification, or what is the command to run your code on SAM format? I tried the view option in samtools but it's for bam format, not for sam. Please can you guide me on how I can run it for SAM format to calculate the mean, read length, and SD from a SAM alignment file as input?

Hi, this means there's no argparse module in your Python installation. What Python version did you use? Python 2.7 should have no problem running this script.

OK thanks, I'll update Python. My Python version is 2.6.6. Thanks for the reply.

I tried on the updated Python 2.7 now. It ran completely, but at the end, instead of printing the output, it showed the following error:

close failed in file object destructor:
sys.excepthook is missing
lost sys.stderr

Could you suggest why this error is appearing?

Thanks a lot Wei Li, now it's working perfectly.

Hi Wei,
Thanks a lot for this. Would you know how to adapt the script so one could select or exclude specific reads based on the span length? E.g., say you wanted only read pairs with span <500.
thanks!
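One way to do that without modifying the script is to filter the alignments themselves on the TLEN field; a hedged one-liner, where the file names are placeholders and the 500 cutoff follows the example above:

# keep header lines plus records whose absolute TLEN is between 1 and 499
samtools view -h in.bam \
  | awk 'BEGIN{FS=OFS="\t"} /^@/ {print; next} {t=$9<0?-$9:$9; if (t>0 && t<500) print}' \
  | samtools view -b - > span_lt500.bam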

I'm mapping some illumina reads to a closely related species using BWA and getting some odd results. My mean read span is non-zero but is less than the mean read length. Any idea what could cause that? I'm fairly new to this and don't know python.

Read length: mean 94.7053421448, STD=15.8237819615
Read span: mean 74.962962963, STD=48.8084808847

Probably your bam file includes a mixture of reads with different lengths. This may complicate the read span calculation.

I have exactly the same issue.


Any improvement for reads of uneven length?
I have trouble with different (MP/PE) libraries, but the insert sizes are all ~150bp! Not sure if that's due to the uneven read lengths.

Thank you for a good working tool.
I got this output: Read length of paired-end reads: mean 296.22, STD=22.51
Read span: mean 484.54, STD=108.36
Does it mean my insert size is 484 - 296 = 188?


How to retrieve one or more biological sequences?

One of the most frequent operations I perform during the day is the search for and retrieval of biological sequences of interest. So I think it is worth talking about how this can be done, especially for novices, so that you can "play" with some retrieved genes or proteins. I remember that at the beginning of my bioinformatics studies I enjoyed retrieving gene sequences and was delighted to observe the multitude of letters (representing the nitrogenous bases) that followed one another within the sequence. In short, obtaining a biological sequence of interest is a basic skill for a bioinformatician, and in this article I will explain how to do it.

Let’s start with a simple assumption. The biological sequences of interest can be retrieved by searching in generic databases or databases dedicated to the species in question or by using specific commands launched from the terminal to obtain the desired sequence from previously downloaded files. But let’s do some practical examples right away to better understand how to do it.

RETRIEVE SEQUENCES FROM DATABASE

Regarding the recovery of one or more sequences from a database, two important premises must be made:

  1. There are different types of databases, more or less specialized in the recovery of certain sequences, and it is possible to find on the web specific databases for some species or families such as the sequence database relating to the Solanaceae family, i.e. Sol Genomics Network.
  2. To speed up and optimize the database search it is necessary to use some Boolean operators such as:
    • AND, allows you to connect two or more search strings. Its logical meaning can be understood as the intersection of the queries that we are interested in searching in a combined way.
    • OR, allows you to retrieve results that match any of several terms when more than one element is sought at a time; logically, it is the union of the queries.
    • NOT, allows you to search for all the elements that do not correspond to a certain query. It therefore has a function opposite to that of the AND operator.

But let’s do some examples:

When you want to search for something in a database, you need to know in which database it is easier to find a particular sequence. Let me explain better. If you want to search for a specific protein, it is useful to query the UniProt database, which is managed by a private consortium, or the Protein database managed by NCBI; it would certainly make no sense to search for that protein in the Gene database. Right? Or again, if your intention is to download the tomato genome, it is easier to find it in the Genome database rather than in the Assembly database, where we mainly find information and statistics relating to the assembly of that genome. In short, the take-home message regarding the search for sequences or data in databases is:

Search for what you need in the right database to be more successful in your research.

In fact, no one would ever go out and buy a steak from the greengrocer and an apple from the butcher … Do you understand what I mean?

RETRIEVE SEQUENCES FROM FILE USING COMMANDS EXECUTED ON THE TERMINAL

In some cases we have a file containing various information and sequences, but we want to extract only some of them. Here we are helped by commands that, when executed on the terminal, allow us to easily reach our goal.

Let’s pose some ideal situations to better understand how we can do the above.

  1. We download the Arabidopsis thaliana genome from the Phytozome database and want to recover from it the sequence "TGTAGGGATGAAGTCTTTCTTCGTTGT". How can we do it? The grep command can be used, but keep in mind that grep extracts only the sequence given as a query and the row in which it is found, as in the sketch below.
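A minimal sketch, assuming the downloaded genome is a FASTA file named Athaliana.fa (the file name is a placeholder):

grep "TGTAGGGATGAAGTCTTTCTTCGTTGT" Athaliana.fa    # prints only the matching line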

In the video below I have shown how to use these three tools:

Regarding grep there would be much more to say but I prefer not to do it now because I would move away from the true goal of this article so I will just promise to talk about it in depth in a future article.

  1. Now let’s consider another possible situation. Let’s say we always have the Arabidopsis thaliana genome file and want to extrapolate the entire chromosome 3 from it. For do this it is useful to use the samtools faidx tool. Watch the video below to understand how:
  3. The situation presented in point 2 is very frequent in the work of a bioinformatician, but how can we extract more than one sequence from a given genome? Let's say we have the tomato gene set and want to extract from it all the genes that are involved in some way in the synthesis and activity of rubisco. In the video below I show you how:
  1. Let’s take another example of a situation that can be quite common. Given the Solanum tuberosum genome, find the scaffold headers 1 to 3 and then extrapolate the sequences. To do this it is convenient to use regular expressions, see how in the video below:

Also in this case the use of regular expressions should be deepened so, as in the case of grep, I will try to talk about it better in a dedicated article in the future.

  5. Another very common situation for a bioinformatician. We have on our hands a vcf file (variant call format), that is, a file that shows in a matrix, consisting of rows and columns, the various information (especially positional) relating to the polymorphisms found within a given genome. It can sometimes be useful to extract from this file only specific information, and this can be done with the awk command (I will also talk about this in a separate article so as not to bore you too much) or through the use of our dear regular expressions. In the video below I show an example for awk; a command sketch follows point 6.
  6. Finally, it is useful to say that with samtools faidx it is also possible to extract a sequence from a fasta file using its coordinates, and therefore its position in the considered genome, given that this tool indexes the genome before searching for the sequence.
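Sketches for points 5 and 6; the column numbers follow the VCF specification, while file names and coordinates are placeholders:

# point 5: print chromosome, position, REF and ALT for variants with QUAL > 30
awk 'BEGIN{FS=OFS="\t"} !/^#/ && $6+0 > 30 {print $1, $2, $4, $5}' variants.vcf

# point 6: extract a sub-sequence by coordinates from an indexed FASTA
samtools faidx Stuberosum.fa chr3:10000-12000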

Here we are at the end of this article. As always, I hope I have managed to explain complex concepts in the simplest way possible. Let me know with a comment what you think of the article or if there is something you feel should be added or corrected.



A FASTQ file normally uses four lines per sequence.

  • Line 1 begins with a '@' character and is followed by a sequence identifier and an optional description (like a FASTA title line).
  • Line 2 is the raw sequence letters.
  • Line 3 begins with a '+' character and is optionally followed by the same sequence identifier (and any description) again.
  • Line 4 encodes the quality values for the sequence in Line 2, and must contain the same number of symbols as letters in the sequence.

A FASTQ file containing a single sequence might look like this:
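An illustrative record (SEQ_ID is a placeholder identifier; the quality string must contain exactly one symbol per base):

@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65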

The byte representing quality runs from 0x21 (lowest quality; '!' in ASCII) to 0x7e (highest quality; '~' in ASCII). Here are the quality value characters in left-to-right increasing order of quality (ASCII):

!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~

The original Sanger FASTQ files also allowed the sequence and quality strings to be wrapped (split over multiple lines), but this is generally discouraged as it can make parsing complicated due to the unfortunate choice of "@" and "+" as markers (these characters can also occur in the quality string).

Illumina sequence identifiers

Sequences from the Illumina software use a systematic identifier:

@HWUSI-EAS100R:6:73:941:1973#0/1

HWUSI-EAS100R   the unique instrument name
6               flowcell lane
73              tile number within the flowcell lane
941             'x'-coordinate of the cluster within the tile
1973            'y'-coordinate of the cluster within the tile
#0              index number for a multiplexed sample (0 for no indexing)
/1              the member of a pair, /1 or /2 (paired-end or mate-pair reads only)

Versions of the Illumina pipeline since 1.4 appear to use #NNNNNN instead of #0 for the multiplex ID, where NNNNNN is the sequence of the multiplex tag.

With Casava 1.8 the format of the '@' line has changed:

@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG

EAS139    the unique instrument name
136       the run id
FC706VJ   the flowcell id
2         flowcell lane
2104      tile number within the flowcell lane
15343     'x'-coordinate of the cluster within the tile
197393    'y'-coordinate of the cluster within the tile
1         the member of a pair, 1 or 2 (paired-end or mate-pair reads only)
Y         Y if the read is filtered (did not pass), N otherwise
18        0 when none of the control bits are on, otherwise it is an even number
ATCACG    index sequence

Note that more recent versions of Illumina software output a sample number (as taken from the sample sheet) in place of an index sequence. For example, the following header might appear in the first sample of a batch:

@EAS139:136:FC706VJ:2:2104:15343:197393 1:N:18:1

NCBI Sequence Read Archive

FASTQ files from the INSDC Sequence Read Archive often include a description, e.g.:

@SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36

In this example there is an NCBI-assigned identifier, and the description holds the original identifier from Solexa/Illumina (as described above) plus the read length. Sequencing was performed in paired-end mode (~500bp insert size), see SRR001666. The default output format of fastq-dump produces entire spots, containing any technical reads and typically single or paired-end biological reads.

Modern usage of FASTQ almost always involves splitting the spot into its biological reads, as described in submitter-provided metadata:

When present in the archive, fastq-dump can attempt to restore read names to original format. NCBI does not store original read names by default:

In the example above, the original read names were used rather than the accessioned read name. NCBI accessions runs and the reads they contain. Original read names, assigned by sequencers, are able to function as locally unique identifiers of a read, and convey exactly as much information as a serial number. The ids above were algorithmically assigned based upon run information and geometric coordinates. Early SRA loaders parsed these ids and stored their decomposed components internally. NCBI stopped recording read names because they are frequently modified from the vendors' original format in order to associate some additional information meaningful to a particular processing pipeline, and this caused name format violations that resulted in a high number of rejected submissions. Without a clear schema for read names, their function remains that of a unique read id, conveying the same amount of information as a read serial number. See various SRA Toolkit issues for details and discussions.

Also note that fastq-dump converts this FASTQ data from the original Solexa/Illumina encoding to the Sanger standard (see encodings below). This is because the SRA serves as a repository for NGS information, rather than format. The various *-dump tools are capable of producing data in several formats from the same source. The requirements for doing so have been dictated by users over several years, with the majority of early demand coming from the 1000 Genomes Project.

Quality

A quality value Q is an integer mapping of p (i.e., the probability that the corresponding base call is incorrect). Two different equations have been in use. The first is the standard Sanger variant to assess the reliability of a base call, otherwise known as the Phred quality score:

Q_sanger = -10 log10(p)

The Solexa pipeline (i.e., the software delivered with the Illumina Genome Analyzer) earlier used a different mapping, encoding the odds p/(1-p) instead of the probability p:

Q_solexa = -10 log10( p / (1 - p) )

Although both mappings are asymptotically identical at higher quality values, they differ at lower quality levels (i.e., approximately p > 0.05, or equivalently, Q < 13).

At times there has been disagreement about which mapping Illumina actually uses. The user guide (Appendix B, page 122) for version 1.4 of the Illumina pipeline states that: "The scores are defined as Q=10*log10(p/(1-p)) [sic], where p is the probability of a base call corresponding to the base in question". [2] In retrospect, this entry in the manual appears to have been an error. The user guide (What's New, page 5) for version 1.5 of the Illumina pipeline lists this description instead: "Important Changes in Pipeline v1.3 [sic]. The quality scoring scheme has changed to the Phred [i.e., Sanger] scoring scheme, encoded as an ASCII character by adding 64 to the Phred value. A Phred score of a base is: Q_phred = -10 log10(e), where e is the estimated probability of a base being wrong." [3]

Encoding

  • Sanger format can encode a Phred quality score from 0 to 93 using ASCII 33 to 126 (although in raw read data the Phred quality score rarely exceeds 60, higher scores are possible in assemblies or read maps). Also used in SAM format. [4] Coming to the end of February 2011, Illumina's newest version (1.8) of their pipeline CASAVA will directly produce fastq in Sanger format, according to the announcement on seqanswers.com forum. [5]
  • PacBio HiFi reads, which are typically stored in SAM/BAM format, use the Sanger convention: Phred quality scores from 0 to 93 are encoded using ASCII 33 to 126. Raw PacBio subreads use the same convention but typically assign a placeholder base quality (Q0) to all bases in the read. [6]
  • Solexa/Illumina 1.0 format can encode a Solexa/Illumina quality score from -5 to 62 using ASCII 59 to 126 (although in raw read data Solexa scores from -5 to 40 only are expected)
  • Starting with Illumina 1.3 and before Illumina 1.8, the format encoded a Phred quality score from 0 to 62 using ASCII 64 to 126 (although in raw read data Phred scores from 0 to 40 only are expected).
  • Starting in Illumina 1.5 and before Illumina 1.8, the Phred scores 0 to 2 have a slightly different meaning. The values 0 and 1 are no longer used and the value 2, encoded by ASCII 66 "B", is used also at the end of reads as a Read Segment Quality Control Indicator. [7] The Illumina manual [8] (page 30) states the following: If a read ends with a segment of mostly low quality (Q15 or below), then all of the quality values in the segment are replaced with a value of 2 (encoded as the letter B in Illumina's text-based encoding of quality scores). This Q2 indicator does not predict a specific error rate, but rather indicates that a specific final portion of the read should not be used in further analyses. Also, the quality score encoded as "B" letter may occur internally within reads at least as late as pipeline version 1.6, as shown in the following example:

An alternative interpretation of this ASCII encoding has been proposed. [9] Also, in Illumina runs using PhiX controls, the character 'B' was observed to represent an "unknown quality score". The error rate of 'B' reads was roughly 3 phred scores lower than the mean observed score of a given run.

  • Starting in Illumina 1.8, the quality scores have basically returned to the use of the Sanger format (Phred+33).

For raw reads, the range of scores will depend on the technology and the base caller used, but will typically be up to 41 for recent Illumina chemistry. Since the maximum observed quality score was previously only 40, various scripts and tools break when they encounter data with quality values larger than 40. For processed reads, scores may be even higher. For example, quality values of 45 are observed in reads from Illumina's Long Read Sequencing Service (previously Moleculo).

Color space

For SOLiD data, the sequence is in color space, except the first position. The quality values are those of the Sanger format. Alignment tools differ in their preferred version of the quality values: some include a quality score (set to 0, i.e. '!') for the leading nucleotide, others do not. The sequence read archive includes this quality score.

Simulation

FASTQ read simulation has been approached by several tools. [10] [11] A comparison of those tools can be seen here. [12]

Compression

General compressors

General-purpose tools such as Gzip and bzip2 regard FASTQ as a plain text file and result in suboptimal compression ratios. NCBI's Sequence Read Archive encodes metadata using the LZ-77 scheme. General FASTQ compressors typically compress distinct fields (read names, sequences, comments, and quality scores) in a FASTQ file separately; these include Genozip, [13] DSRC and DSRC2, FQC, LFQC, Fqzcomp, and Slimfastq.

Reads

Having a reference genome around is convenient because then, instead of storing the nucleotide sequences themselves, one can just align the reads to the reference genome and store the positions (pointers) and mismatches; the pointers can then be sorted according to their order in the reference sequence and encoded, e.g., with run-length encoding. When the coverage or the repeat content of the sequenced genome is high, this leads to a high compression ratio. Unlike the SAM/BAM formats, FASTQ files do not specify a reference genome. Alignment-based FASTQ compressors support the use of either a user-provided or de novo assembled reference: LW-FQZip uses a provided reference genome, and Quip, Leon, k-Path and KIC perform de novo assembly using a de Bruijn graph-based approach. Genozip [13] can optionally use a reference if the user provides one, which may be a single- or multi-species reference file.

Explicit read mapping and de novo assembly are typically slow. Reordering-based FASTQ compressors first cluster reads that share long substrings and then independently compress reads in each cluster after reordering them or assembling them into longer contigs, achieving perhaps the best trade-off between running time and compression rate. SCALCE is the first such tool, followed by Orcom and Mince. BEETL uses a generalized Burrows–Wheeler transform for reordering reads, and HARC achieves better performance with hash-based reordering. AssemblTrie instead assembles reads into reference trees with as few total symbols as possible in the reference. [14] [15]

Benchmarks for these tools are available in [16].

Quality values

Quality values account for about half of the required disk space in the FASTQ format (before compression), and therefore the compression of the quality values can significantly reduce storage requirements and speed up analysis and transmission of sequencing data. Both lossless and lossy compression have recently been considered in the literature. For example, the algorithm QualComp [17] performs lossy compression with a rate (number of bits per quality value) specified by the user. Based on rate-distortion theory results, it allocates the number of bits so as to minimize the MSE (mean squared error) between the original (uncompressed) and the reconstructed (after compression) quality values. Other algorithms for compression of quality values include SCALCE [18] and Fastqz. [19] Both are lossless compression algorithms that provide an optional controlled lossy transformation approach. For example, SCALCE reduces the alphabet size based on the observation that "neighboring" quality values are similar in general. For a benchmark, see [20].

As of the HiSeq 2500 Illumina gives the option to output qualities that have been coarse grained into quality bins. The binned scores are computed directly from the empirical quality score table, which is itself tied to the hardware, software and chemistry that were used during the sequencing experiment. [21]

Genozip [13] uses its DomQual algorithm to compress binned quality scores, such as those generated by Illumina or by Genozip's own --optimize option which generates bins similar to Illumina.

Encryption

Genozip [13] encrypts FASTQ files (as well as other genomic formats), by applying the standard AES encryption at its most secure level of 256 bits (--password option).

Cryfa [22] uses AES encryption and enables compacting the data in addition to encrypting it. It can also handle FASTA files.

There is no standard file extension for a FASTQ file, but .fq and .fastq are commonly used.


What is Numerical Taxonomy? How is it useful?

Classification of biological species is one of the important concerns when studying taxonomic and/or evolutionary relationships among various species. Classification based on only one or a few characters is known as "monothetic", while classification based on multiple characters is known as "polythetic".

It is obviously much more difficult to classify organisms on the basis of multiple characters rather than a few. The traditional approaches of taxonomists are tedious, and the arrival of computer techniques in the field of biology has made the task easier for them.

Numerical taxonomy is a system of grouping species by numerical methods based on their character states. It was first initiated by Peter H. A. Sneath et al.

Before going further I would like to clarify the difference between two common terms, namely "Classification" and "Identification". When organisms are grouped on the basis of like properties, it is called classification; when additional unidentified objects are subsequently allocated to those groups, it is known as identification. The purpose of taxonomy is to group the objects to be classified into natural taxa. Conventional taxonomists equate taxonomic relationships with evolutionary relationships, but numerical taxonomists defined them as three kinds:

  • Phenetic: based on overall similarity.
  • Cladistic: based on a common line of descents.
  • Chronistic: temporal relation among various evolutionary branches.

How is classification done by Numerical Taxonomy?

The objects to be classified are known as Operational Taxonomic Units (OTUs). They may be species, genera, families, higher-ranking taxonomic groups, etc. The characters are numerically recorded either in the form of appropriate numbers or may be coded in such a way that the differences among them are proportional to their dissimilarity. Let's say a character called 'hairiness of leaf' may be recorded as:

  • hairless = 0
  • sparsely haired = 1
  • regularly haired = 2
  • densely haired = 3

Fig.1 OTUs (black dots) represented in a multidimensional space.

Such a numerical system implies that the dissimilarity between densely haired and hairless is 3 times that between sparsely haired and hairless.

The other method of implementing numerical taxonomy is to represent the characters by only two states, i.e., 0 for the absence and 1 for the presence of a particular character. This method is usually employed in the field of microbiology. After that, all the characters and taxonomic units are arranged in a data matrix, and the similarity between all possible pairs of OTUs is computed from the considered characters. The similarity (more precisely, dissimilarity) is the distance between OTUs represented in a multidimensional space, where the characters can be visualized as the coordinates: objects that are very similar are plotted close to each other, those which are dissimilar are plotted farther apart, and the straight-line distances between them are computed. The similarity among the OTUs is displayed in a 'similarity matrix' with a colour scheme in which the dark-shaded areas are highly similar. This matrix is then rearranged to obtain clusters of similar OTUs. The results of numerical taxonomy are generally represented in the form of phenograms.
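As a toy illustration of the two-state method, the following sketch (the OTU names and character states are made up) computes a simple mismatch distance between all pairs of OTUs; clustering and phenogram construction would build on such a matrix:

# Toy example: pairwise distance between OTUs from 0/1 character states.
otus = {
    'OTU_A': [0, 1, 1, 0, 1],
    'OTU_B': [0, 1, 0, 0, 1],
    'OTU_C': [1, 0, 0, 1, 0],
}

def mismatch_distance(a, b):
    # fraction of characters in which the two OTUs differ
    return sum(x != y for x, y in zip(a, b)) / float(len(a))

names = sorted(otus)
for i, m in enumerate(names):
    for n in names[i + 1:]:
        print(m, n, mismatch_distance(otus[m], otus[n]))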

An exhaustive list of references for this article is available from the author on personal request; for more details write to [email protected]


Our 2020/2021 Highlights document

Over the past year we have provided new insights into human, parasite and microbe evolution, cellular development, and the mutational processes that lead to cancer. We supported the UK's national COVID-19 genomic surveillance efforts and delivered reference genomes for a wide range of UK species as part of the Darwin Tree of Life project.

Download our 2020/21 Highlights document

Watch Again: Dr Roser Vento-Tormo's online seminar and Q&A

Mapping tissues in vivo and in vitro one cell at a time:

Tissue microenvironment shapes cellular identity and function; understanding how this happens in vivo can help us engineer novel in vitro models.

Dr Roser Vento-Tormo’s research is focused on the adaptation of immune cells in tissues and their function in steady state and inflammation. Her team uses genomics, spatial transcriptomics and bioinformatics tools to reconstruct the microenvironment that will shape immune cellular identity and function. Watch her seminar and Q&A at the link below:



Run stringtie from the command line like this:
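For example (the file names are placeholders; -o names the output GTF and -p sets the number of threads):

stringtie alns.sorted.bam -o transcripts.gtf -p 8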
The main input of the program is a BAM file with RNA-Seq read mappings which must be sorted by their genomic location (for example the accepted_hits.bam file produced by TopHat or the output of HISAT2 after sorting and converting it using samtools as explained below).

Input files
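For example, the SAM output of an aligner can be sorted and converted to BAM in one step (the file names are placeholders chosen to match the text below):

samtools sort -@ 8 -o alns.sorted.bam alns.sam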

The file resulting from the above command (alns.sorted.bam) can be used as input for StringTie.

Every spliced read alignment (i.e. an alignment across at least one junction) in the input SAM file must contain the tag XS to indicate the genomic strand that produced the RNA from which the read was sequenced. Alignments produced by TopHat and HISAT2 (when run with --dta option) already include this tag, but if you use a different read mapper you should check that this XS tag is included for spliced alignments.

NOTE: be sure to run HISAT2 with the --dta option for alignment, or your results will suffer.

As an option, a reference annotation file in GTF/GFF3 format can be provided to StringTie. In this case, StringTie will prefer to use these "known" genes from the annotation file, and for the ones that are expressed it will compute coverage, TPM and FPKM values. It will also produce additional transcripts to account for RNA-seq data that aren't covered by (or explained by) the annotation. Note that if option -e is not used the reference transcripts need to be fully covered by reads in order to be included in StringTie's output. In that case, other transcripts assembled from the data by StringTie and not present in the reference file will be printed as well.

NOTE: we highly recommend that you provide annotation if you are analyzing a genome that is well-annotated, such as human, mouse, or other model organisms.

Output files

  1. Stringtie's main output is a GTF file containing the assembled transcripts
  2. Gene abundances in tab-delimited format
  3. Fully covered transcripts that match the reference annotation, in GTF format
  4. Files (tables) required as input to Ballgown, which uses them to estimate differential expression
  5. In merge mode, a merged GTF file from a set of GTF files

Description of each column's values:

  • seqname: Denotes the chromosome, contig, or scaffold for this transcript. Here the assembled transcript is on chromosome X.
  • source: The source of the GTF file. Since this example was produced by StringTie, this column simply shows 'StringTie'.
  • feature: Feature type (e.g., exon, transcript, mRNA, 5'UTR).
  • start: Start position of the feature (exon, transcript, etc), using a 1-based index.
  • end: End position of the feature, using a 1-based index.
  • score: A confidence score for the assembled transcript. Currently this field is not used, and StringTie reports a constant value of 1000 if the transcript has a connection to a read alignment bundle.
  • strand: If the transcript resides on the forward strand, '+'. If the transcript resides on the reverse strand, '-'.
  • frame: Frame or phase of CDS features. StringTie does not use this field and simply records a ".".
  • attributes: A semicolon-separated list of tag-value pairs, providing additional information about each feature. Depending on whether an instance is a transcript or an exon and on whether the transcript matches the reference annotation file provided by the user, the content of the attributes field will differ. The following list describes the possible attributes shown in this column:
    • gene_id: A unique identifier for a single gene and its child transcript and exons based on the alignments' file name.
    • transcript_id: A unique identifier for a single transcript and its child exons based on the alignments' file name.
    • exon_number: A unique identifier for a single exon, starting from 1, within a given transcript.
    • reference_id: The transcript_id in the reference annotation (optional) that the instance matched.
    • ref_gene_id: The gene_id in the reference annotation (optional) that the instance matched.
    • ref_gene_name: The gene_name in the reference annotation (optional) that the instance matched.
    • cov: The average per-base coverage for the transcript or exon.
    • FPKM: Fragments per kilobase of transcript per million read pairs. This is the number of pairs of reads aligning to this feature, normalized by the total number of fragments sequenced (in millions) and the length of the transcript (in kilobases).
    • TPM: Transcripts per million. This is the number of transcripts from this particular gene normalized first by gene length, and then by sequencing depth (in millions) in the sample. A detailed explanation and a comparison of TPM and FPKM can be found here, and TPM was defined by B. Li and C. Dewey here.
    2. Gene abundances in tab-delimited format

    • Column 1 / Gene ID: The gene identifier comes from the reference annotation provided with the -G option. If no reference is provided, this field is replaced with the name prefix for output transcripts (-l).
    • Column 2 / Gene Name: This field contains the gene name in the reference annotation provided with the -G option. If no reference is provided this field is populated with '-'.
    • Column 3 / Reference: Name of the reference sequence that was used in the alignment of the reads. Equivalent to the 3rd column in the .SAM alignment.
    • Column 4 / Strand: '+' denotes that the gene is on the forward strand, '-' for the reverse strand.
    • Column 5 / Start: Start position of the gene (1-based index).
    • Column 6 / End: End position of the gene (1-based index).
    • Column 7 / Coverage: Per-base coverage of the gene.
    • Column 8 / FPKM: normalized expression level in FPKM units (see previous section).
    • Column 9 / TPM: normalized expression level in TPM units (see previous section).

    3. Fully covered transcripts matching the reference annotation transcripts (in GTF format)

    If StringTie is run with the -C <cov_refs.gtf> option (requires -G <reference_annotation> ), it returns a file with all the transcripts in the reference annotation that are fully covered, end to end, by reads. The output format is a GTF file as described above. Each line of the GTF corresponds to a gene or transcript in the reference annotation.

    4. Ballgown Input Table Files

    If StringTie is run with the -B option, it returns a Ballgown input table file, which contains coverage data for all transcripts. The output table files are placed in the same directory as the main GTF output. These tables have these specific names: (1) e2t.ctab, (2) e_data.ctab, (3) i2t.ctab, (4) i_data.ctab, and (5) t_data.ctab. A detailed description of each of these five required inputs to Ballgown can be found on the Ballgown site, at this link.

    5. Merge mode: Merged GTF

    If StringTie is run with the --merge option, it takes as input a list of GTF/GFF files and merges/assembles these transcripts into a non-redundant set of transcripts. This step creates a uniform set of transcripts for all samples to facilitate the downstream calculation of differentially expressed levels for all transcripts among the different experimental conditions. Output is a merged GTF file with all merged gene models, but without any numeric results on coverage, FPKM, and TPM. Then, with this merged GTF, StringTie can re-estimate abundances by running it again with the -e option on the original set of alignment files, as illustrated in the figure below.

    Evaluating transcript assemblies

    A simple way of getting more information about the transcripts assembled by StringTie (summary of gene and transcript counts, novel vs. known etc.), or even performing basic tracking of assembled isoforms across multiple RNA-Seq experiments, is to use the gffcompare program. Basic usage information and download options for this program can be found on the GFF utilities page.

    Differential expression analysis

    1. for each RNA-Seq sample, map the reads to the genome with HISAT2 using the --dta option. It is highly recommended to use the reference annotation information when mapping the reads, which can be either embedded in the genome index (built with the --ss and --exon options, see HISAT2 manual), or provided separately at run time (using the --known-splicesite-infile option of HISAT2). The SAM output of each HISAT2 run must be sorted and converted to BAM using samtools as explained above.
    2. for each RNA-Seq sample, run StringTie to assemble the read alignments obtained in the previous step; it is recommended to run StringTie with the -G option if the reference annotation is available.
    3. run StringTie with --merge in order to generate a non-redundant set of transcripts observed in any of the RNA-Seq samples assembled previously. The stringtie --merge mode takes as input a list of all the assembled transcripts files (in GTF format) previously obtained for each sample, as well as a reference annotation file (-G option) if available.
    4. for each RNA-Seq sample, run StringTie using the -B/-b and -e options in order to estimate transcript abundances and generate read coverage tables for Ballgown. The -e option is not required but recommended for this run in order to produce more accurate abundance estimations of the input transcripts. Each StringTie run in this step will take as input the sorted read alignments (BAM file) obtained in step 1 for the corresponding sample and the -G option with the merged transcripts (GTF file) generated by stringtie --merge in step 3. Please note that this is the only case where the -G option is not used with a reference annotation, but with the global, merged set of transcripts as observed across all samples. (This step is the equivalent of the Tablemaker step described in the original Ballgown pipeline.)
    5. Ballgown can now be used to load the coverage tables generated in the previous step and perform various statistical analyses for differential expression, generate plots etc.
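    A hedged end-to-end sketch of steps 1-4 for one sample (the sample names, index names, and thread counts are placeholders):

    # 1) align and sort
    hisat2 --dta -x genome_index -1 s1_R1.fq -2 s1_R2.fq | samtools sort -o s1.bam
    # 2) per-sample assembly, guided by the reference annotation
    stringtie s1.bam -G ref_annotation.gtf -o s1.gtf -p 8
    # 3) merge the per-sample assemblies into one non-redundant transcript set
    stringtie --merge -G ref_annotation.gtf -o merged.gtf s1.gtf s2.gtf
    # 4) re-estimate abundances against the merged set and write Ballgown tables
    stringtie s1.bam -e -B -G merged.gtf -o ballgown/s1/s1.gtf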

    An alternate, faster differential expression analysis workflow can be pursued if there is no interest in novel isoforms (i.e. assembled transcripts present in the samples but missing from the reference annotation), or if only a well known set of transcripts of interest are targeted by the analysis. This simplified protocol has only 3 steps (depicted below) as it bypasses the individual assembly of each RNA-Seq sample and the "transcript merge" step. This simplified workflow attempts to directly estimate and analyze the expression of a known set of transcripts as given in the reference annotation file.

The R package IsoformSwitchAnalyzeR can be used to assign gene names to transcripts assembled by StringTie, which can be particularly helpful in cases where StringTie could not perform this assignment unambiguously.
The importIsoformExpression() and importRdata() functions of the package can be used to import the expression and annotation data into R. During this import the package will attempt to clean up and recover isoform annotations where possible. From the resulting switchAnalyzeRlist object, IsoformSwitchAnalyzeR can detect isoform switches along with predicted functional consequences. The extractGeneExpression() function can be used to get a gene expression (read count) matrix for analysis with other tools.
    More information and code examples can be found here.

    Using StringTie with DESeq2 and edgeR

    DESeq2 and edgeR are two popular Bioconductor packages for analyzing differential expression, which take as input a matrix of read counts mapped to particular genomic features (e.g., genes). We provide a Python script (prepDE.py, or the Python 3 version: prepDE.py3 ) that can be used to extract this read count information directly from the files generated by StringTie (run with the -e parameter).

prepDE.py derives hypothetical read counts for each transcript from the coverage values estimated by StringTie, using this simple formula: reads_per_transcript = coverage * transcript_len / read_len. For example, a 1,500 bp transcript with an average coverage of 20 and 75 bp reads yields 20 * 1500 / 75 = 400 counts.

• one option is to provide a path to a directory containing all sample sub-directories, with the same structure as the ballgown directory created in preparation for Ballgown in the StringTie protocol paper. By default (no -i option), the script will look in the current directory for all sub-directories containing .gtf files (see the sketch after this list).
    • Alternatively, one can provide a text file listing sample IDs and their respective paths (sample_lst.txt).
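A minimal sketch of both invocation styles (sample names and paths are hypothetical):

# Option 1: rely on the default layout, e.g.
#   ./sample1/sample1.gtf
#   ./sample2/sample2.gtf
python prepDE.py

# Option 2: list sample IDs and GTF paths (one per line, whitespace-separated)
# in sample_lst.txt, e.g. "sample1 /path/to/sample1.gtf", then:
python prepDE.py -i sample_lst.txt -g gene_count_matrix.csv -t transcript_count_matrix.csv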

    Usage: prepDE.py [options]
    generates two CSV files containing the count matrices for genes and transcripts, using the coverage values found in the output of stringtie -e

    Options:
-h, --help            show this help message and exit
-i INPUT, --input=INPUT, --in=INPUT
                      a folder containing all sample sub-directories, or a text file with sample ID and path to its GTF file on each line [default: . ]
-g G                  where to output the gene count matrix [default: gene_count_matrix.csv]
-t T                  where to output the transcript count matrix [default: transcript_count_matrix.csv]
-l LENGTH, --length=LENGTH
                      the average read length [default: 75]
-p PATTERN, --pattern=PATTERN
                      a regular expression that selects the sample subdirectories
-c, --cluster         whether to cluster genes that overlap with different gene IDs, ignoring ones with geneID pattern (see below)
-s STRING, --string=STRING
                      if a different prefix is used for geneIDs assigned by StringTie [default: MSTRG]
-k KEY, --key=KEY     if clustering, what prefix to use for geneIDs assigned by this script [default: prepG]
--legend=LEGEND       if clustering, where to output the legend file mapping transcripts to assigned geneIDs [default: legend.csv]

    These count matrices (CSV files) can then be imported into R for use by DESeq2 and edgeR (using the DESeqDataSetFromMatrix and DGEList functions, respectively).

    Protocol: Using StringTie with DESeq2

Given a list of GTFs that were re-estimated after merging, users can follow the protocol below to use DESeq2 for differential expression analysis. To generate the GTFs from raw reads, follow the StringTie protocol paper (up to the Ballgown step).


    Variant calling with GATK

Different variant callers disagree a great deal, both for single nucleotide polymorphisms (SNPs) and particularly for insertions and deletions (indels). Of the various methods available (samtools, VarScan, FreeBayes, ReadXplorer etc.), GATK, from the Broad Institute, is the best. The HaplotypeCaller module, which performs local de novo assembly around indels, has recently been updated to support non-diploid organisms, so it is used here, although for bacterial genomes at least there is little difference between HaplotypeCaller and UnifiedGenotyper.

Following conversations at a few conferences and meetings now, most recently the brilliant 2nd ISCB Bioinformatics Symposium at TGAC, I've realised that others have had the same issues I had when implementing GATK. These mostly arise when preparing your data beforehand, so here is a brief run-through of exactly how to call variants from microbial genome short-read data using GATK, which I hope is useful!

You'll need to install samtools, Picard tools, GATK, bwa and, optionally, vcffilter (part of vcflib) for this workflow. Picard tools and GATK are simply .jar files, so they pose no problem, and you probably already have bwa installed; if not, installation is well documented!

This workflow begins with short-read (fastq) files and a fasta reference. First the reference is prepared and a sequence dictionary created; then the short reads are aligned to the reference with read group information attached, the resulting sequence alignment map (SAM) file is sorted and converted to binary alignment map (BAM) format, duplicates are marked, the BAM file is indexed, indel targets are identified, indels are realigned and variants are called. Simple!

For simplicity, an example set of commands is provided here, where the reference is reference.fasta and the short reads are R1.fastq.gz and R2.fastq.gz. You will need to enter the paths and versions of the software being used at each step, along with your actual file names. Ploidy is set to 1.

See Updated GATK pipeline to HaplotypeCaller + gVCF for an implementation of the more recent gVCF workflow. Otherwise, HaplotypeCaller alone (below) will work just fine.


    #Index reference
    bwa index reference.fasta

#Create fasta index (reference.fasta.fai), required by GATK
samtools faidx reference.fasta

#Create sequence dictionary
java -jar /bin/picard-tools-1.8.5/CreateSequenceDictionary.jar REFERENCE=reference.fasta OUTPUT=reference.dict

#Align reads and assign read group
bwa mem -R "@RG\tID:FLOWCELL1.LANE1\tPL:ILLUMINA\tLB:test\tSM:PA01" reference.fasta R1.fastq.gz R2.fastq.gz > aln.sam

#Sort sam file and convert to bam
java -jar /bin/picard-tools-1.8.5/SortSam.jar I=aln.sam O=sorted.bam SORT_ORDER=coordinate

#Mark duplicates
java -jar /bin/picard-tools-1.8.5/MarkDuplicates.jar I=sorted.bam O=dedup.bam METRICS_FILE=metrics.txt

#Index bam file (the BAM is already coordinate-sorted; GATK needs the index)
samtools index dedup.bam

#Create realignment targets
java -jar /bin/GATK3.3/GenomeAnalysisTK.jar -T RealignerTargetCreator -R reference.fasta -I dedup.bam -o targetintervals.list

#Indel realignment
java -jar /bin/GATK3.3/GenomeAnalysisTK.jar -T IndelRealigner -R reference.fasta -I dedup.bam -targetIntervals targetintervals.list -o realigned.bam

#Call variants (HaplotypeCaller)
java -jar /bin/GATK3.3/GenomeAnalysisTK.jar -T HaplotypeCaller -R reference.fasta -I realigned.bam -ploidy 1 -stand_call_conf 30 -stand_emit_conf 10 -o raw.vcf

    The resulting vcf file will contain your variant calls!

    Then you can optionally filter the variants:

#Filter variants
/bin/vcflib/bin/vcffilter -f 'DP > 9' -f 'QUAL > 10' raw.vcf > filtered.vcf

    Or first split the raw.vcf file into SNPs and indels:

#Extract SNPs
java -jar /bin/GATK3.3/GenomeAnalysisTK.jar -T SelectVariants -R reference.fasta -V raw.vcf -selectType SNP -o snps.vcf

#Extract indels
java -jar /bin/GATK3.3/GenomeAnalysisTK.jar -T SelectVariants -R reference.fasta -V raw.vcf -selectType INDEL -o indels.vcf

    I also have a neat perl wrapper to automate this workflow over many short read files and would be happy to make this available if people are interested. Please do comment with any questions or issues and I’ll do my best to resolve them!

