Naive code to calculate N50 in Perl

22 April 2012 | Research, SLab, Today`s Script

A critical stage in genome sequencing is the assembly of shotgun reads, or piecing together fragments randomly extracted from the sample, to form a set of contiguous sequences (contigs) representing the DNA in the sample [1]. N50 is one of the quality parametres of genome assembly. N50 indicates the scaffold/contigs length such that 50% of the de novo assembled sequences lies in scaffolds of this size or larger [2].
–––––––––––––––––––––––––––––––––––––––––––––––––––––––––––
Given a set of sequences of varying lengths, the N50 length is defined as the length N for which half of all bases in the sequences are in a sequence of length L < N. In other words, N50 is the contig length such that using equal or longer contigs produces half the bases of the genome.
The algorithm to calculate N50

  1. Sum the total length of the sequence of contigs
  2. Sort descending the sequence of contigs based on its length
  3. Calculate the sum of the sorted contig so that the sum is more than or same with half of the total length. It is called N50 length, while the N50 value is the length of contig until get the N50 length.

–––––––––––––––––––––––––––––––––––––––––––––––––––––––––––
This is a naive code in perl to calculate the N50 value. The input is in fasta format, also completed with the assumption that the contigs consist of multiple lines.

#/usr/bin/perl -w
use strict;

my @arr;
my $length;
my $totalLength;
my $seq;
my $file;

$file = $ARGV[0];
open (IN, $file) or die “myfile can’t open -$!”;
$length=0;

while (my $line = ) {

if ($line =~/^>(\w+)/ ) {

if($length!=0) {

$totalLength+=$length;
push(@arr,$length);

}

$length=0;

}

else {

chomp $line;## chomp removes newlines
$seq = $line;
$length+=length($seq);

}

}

$totalLength+=$length;
push(@arr,$length);

close IN;
my @sort = sort {$b <=> $a} @arr;
my $n50;
foreach my $val(@sort){

$n50+=$val;

if($n50 >= $totalLength/2){

print “====================\n”;
print “$file\n”;
print “N50 length is $n50 and N50 value is: $val\n”;
print “Total Length is: $totalLength\n”;
last;

}

}

–––––––––––––––––––––––––––––––––––––––––––––––––––––––––––
References
[1] Zerbino, D.R., Birney, E., 2008, Velvet: Algorithms for de novo short read assembly using de Bruijn graphs, Genome Research, vol.28, pp.821-829
[2]Namiki, T., Hachiya, T., Tanaka, H., Sakakibara, Y., 2011, MetaVelvet : An extension of Velvet assembler to de novo metagenome assembly from short sequence reads , 2011 ACM Conference on Bioinformatics, Computational Biology and Biomedicine, Chicago, IL, USA


Leave a Reply