James Bonfield of the Sanger Institute was the overall winner. Read full details on the Pistoia Blog.
The volume of next-generation sequencing data is a big problem. Data volumes are growing rapidly as sequencing technology improves. Individual runs are providing many more reads than before, and decreasing run times mean that more data can today be generated by a single machine in one day than a single machine could have produced in the whole of 2005.
Storing millions of reads and their quality scores uncompressed is impractical, yet current compression technologies are becoming inadequate. There is a need for a novel method of compressing sequence reads and their quality scores in a way that preserves 100% of the information whilst achieving much-improved linear (or, even better, non-linear) compression ratios.
The Pistoia Alliance, in the interests of promoting pre-competitive collaboration, is putting forward a prize fund of US$15,000 for the best novel open-source NGS compression algorithm submitted before the closing date of 15 March 2012.
Follow us on Twitter: @SeqSqueeze
Times are in seconds; memory is average total use in Kbytes. Both were measured using the time command with the %e and %K options. All valid entries are shown (valid = ran successfully and produced at least one correct, fully matching result).
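A measurement of this kind could be reproduced roughly as follows. This is a minimal sketch, not the competition's actual harness. It assumes GNU time is installed at `/usr/bin/time` (the shell builtin `time` does not accept format strings), and the helper names `build_timed_command` and `parse_time_report` are hypothetical. On some platforms `%K` may not be populated and can report 0.

```python
import shlex

# Assumption: GNU time lives at this path on the test machine.
GNU_TIME = "/usr/bin/time"


def build_timed_command(command, report_path):
    """Wrap `command` so GNU time writes '%e %K' to `report_path`.

    %e = elapsed wall-clock seconds, %K = average total memory use in Kbytes.
    """
    return [GNU_TIME, "-f", "%e %K", "-o", report_path] + shlex.split(command)


def parse_time_report(text):
    """Parse a '%e %K' report into (seconds, kbytes).

    Only the last line is read, because GNU time may print a
    'Command exited with non-zero status' line before the report.
    """
    last_line = text.strip().splitlines()[-1]
    elapsed, avg_kbytes = last_line.split()
    return float(elapsed), int(avg_kbytes)


# Hypothetical usage (the compressor name and file names are placeholders):
#   import subprocess
#   subprocess.run(build_timed_command("my_compressor reads.fastq", "time.txt"))
#   seconds, kbytes = parse_time_report(open("time.txt").read())
```

Writing the report to a separate file with `-o` keeps it apart from anything the measured program prints to standard error.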
| Name, institute & entry ID (link to source) | Compress. ratio | Compress. time (s) | Compress. memory (Kbytes) | Decompress. time (s) | Decompress. memory (Kbytes) | Header mismatches | Sequence mismatches | Quality mismatches |
|---|---|---|---|---|---|---|---|---|
| Matt Mahoney, Dell Inc.(96) | 0.0287 | 905.79 | 5293280 | 394.63 | 5294704 | 0 | 0 | 24033620 |
| Armando J. Pinho, IEETA / Universidade de Aveiro(78) | 0.0536 | 3035.17 | 5200 | 3010.02 | 5200 | 16629547 | 16629548 | 16629548 |
| Armando J. Pinho, IEETA / Universidade de Aveiro(79) | 0.0536 | 3018.11 | 5200 | 3026.36 | 5200 | 16628724 | 16628724 | 16628724 |
| Armando J. Pinho, IEETA / Universidade de Aveiro(69) | 0.0546 | 3404.17 | 68986880 | 3389.73 | 68984480 | 16512460 | 16512461 | 16512461 |
| James Bonfield, Sanger Institute(101) | 0.1141 | 3280.24 | 13235808 | 325.25 | 4381552 | 0 | 0 | 0 |
| James Bonfield, Sanger Institute(104) | 0.1142 | 10299.99 | 13272736 | 342.79 | 4381552 | 0 | 0 | 0 |
| James Bonfield, Wellcome Trust Sanger Institute(17) | 0.1154 | 208.42 | 932976 | 293.05 | 932880 | 0 | 0 | 24030818 |
| James Bonfield, Sanger Institute(97) | 0.1162 | 3276.83 | 13240864 | 318.46 | 4186912 | 0 | 0 | 0 |
| Matt Mahoney, Dell Inc.(99) | 0.1166 | 1218.88 | 5398224 | 769.26 | 5399808 | 0 | 0 | 0 |
| Matt Mahoney, Dell Inc.(90) | 0.1183 | 1062.8 | 5463456 | 610.63 | 5464128 | 0 | 0 | 0 |
| James Bonfield, Sanger Institute(86) | 0.1245 | 11585.89 | 13239776 | 334.89 | 1635408 | 0 | 0 | 0 |
| Matt Mahoney, Dell Inc.(85) | 0.1254 | 930.87 | 5716624 | 653.64 | 5756576 | 0 | 0 | 0 |
| Matt Mahoney, Dell Inc.(82) | 0.1255 | 924.38 | 5738432 | 660.85 | 5769200 | 0 | 0 | 0 |
| James Bonfield, Sanger Institute(62) | 0.1273 | 10476.97 | 13231488 | 335.6 | 2063344 | 0 | 0 | 0 |
| James Bonfield, Sanger Institute(60) | 0.1273 | 22063.57 | 13516544 | 123.86 | 4198048 | 23834062 | 23834062 | 23834062 |
| Davide Cittaro, Center for Translational Genomics and Bioinformatics, San Raffaele Scientific Institute(34) | 0.1323 | 3194.87 | 228512 | 5398.99 | 193584 | 0 | 0 | 24032852 |
| Davide Cittaro, Center for Translational Genomics and Bioinformatics, San Raffaele Scientific Institute(32) | 0.1328 | 5020.04 | 123504 | 6250.11 | 124320 | 1319193 | 0 | 24032852 |
| Armando J. Pinho, IEETA / Universidade de Aveiro(108) | 0.1693 | 11820.43 | 21999248 | 11245.79 | 21998992 | 0 | 0 | 0 |
| Armando J. Pinho, IEETA / Universidade de Aveiro(102) | 0.1695 | 8107.2 | 17505408 | 8272.33 | 17505184 | 0 | 0 | 0 |
| Armando J. Pinho, IEETA / Universidade de Aveiro(98) | 0.1695 | 8123.18 | 33889264 | 8091.45 | 33889056 | 0 | 0 | 0 |
| Armando J. Pinho, IEETA / Universidade de Aveiro(100) | 0.1696 | 7814.66 | 17112192 | 7903.5 | 17112000 | 0 | 0 | 0 |
| Armando J. Pinho, IEETA / Universidade de Aveiro(103) | 0.1698 | 4819.01 | 16980272 | 4887.3 | 16980448 | 0 | 0 | 0 |
| James Bonfield, Sanger Institute(72) | 0.1709 | 525.41 | 22240640 | 600.29 | 22231504 | 0 | 0 | 0 |
| Daniel Jones, University of Washington(105) | 0.1718 | 306.67 | 2747120 | 333.76 | 1631600 | 0 | 0 | 0 |
| Daniel Jones, University of Washington(92) | 0.172 | 244.89 | 2472832 | 284.79 | 1638800 | 0 | 0 | 0 |
| Armando J. Pinho, IEETA / Universidade de Aveiro(80) | 0.1726 | 8184.44 | 8725168 | 8282.31 | 8724928 | 0 | 0 | 0 |
| James Bonfield, Sanger Institute(52) | 0.1727 | 429.94 | 21285824 | 483.35 | 21284640 | 0 | 0 | 0 |
| Daniel Jones, University of Washington(64) | 0.1729 | 392.08 | 2097280 | 484.24 | 1543568 | 0 | 0 | 0 |
| James Bonfield, Sanger Institute(66) | 0.1729 | 362.7 | 1937056 | 434.2 | 1935472 | 0 | 0 | 0 |
| Armando J. Pinho, IEETA / Universidade de Aveiro(71) | 0.173 | 8509.41 | 8725168 | 8573.91 | 8724944 | 0 | 0 | 0 |
| Daniel Jones, University of Washington(47) | 0.1743 | 250 | 1699328 | 495.06 | 1699104 | 0 | 0 | 0 |
| Daniel Jones, University of Washington(91) | 0.1743 | 154.58 | 1581120 | 291.66 | 1580960 | 0 | 0 | 0 |
| James Bonfield, Sanger Institute(36) | 0.1744 | 347.64 | 1877504 | 472.03 | 1877456 | 0 | 0 | 0 |
| Daniel Jones, University of Washington(44) | 0.1748 | 270.42 | 2080960 | 1.28 | 2206160 | 24040369 | 24040699 | 24040699 |
| Daniel Jones, University of Washington(45) | 0.1748 | 272.66 | 2015696 | 491.61 | 2121696 | 0 | 0 | 0 |
| Armando J. Pinho, IEETA / Universidade de Aveiro(87) | 0.175 | 4949.94 | 4325680 | 5101.93 | 4325440 | 0 | 0 | 0 |
| Seth Hillbrand, Columbia University(30) | 0.1754 | 741.79 | 3685616 | 726.04 | 3621744 | 0 | 0 | 0 |
| Matt Mahoney, Dell Inc.(70) | 0.1756 | 1149.7 | 5790848 | 1275.28 | 5937152 | 0 | 0 | 0 |
| Daniel Jones, University of Washington(38) | 0.1758 | 234.57 | 1866304 | 448.68 | 1958336 | 0 | 0 | 0 |
| James Bonfield, Wellcome Trust Sanger Institute(16) | 0.1762 | 228.55 | 933248 | 328.48 | 933200 | 0 | 0 | 0 |
| Seth Hillbrand, Columbia University(25) | 0.1767 | 792.24 | 1545808 | 739.96 | 1514176 | 0 | 0 | 0 |
| Matt Mahoney, Dell Inc.(67) | 0.177 | 1065.51 | 5863568 | 1203.09 | 5933024 | 0 | 0 | 0 |
| James Bonfield, Sanger Institute(83) | 0.177 | 112.69 | 244736 | 142.05 | 236032 | 0 | 0 | 0 |
| Daniel Jones, University of Washington(37) | 0.1774 | 207.7 | 838592 | 396.21 | 889680 | 0 | 0 | 0 |
| James Bonfield, Sanger Institute(51) | 0.1775 | 128.25 | 334800 | 150.26 | 333616 | 0 | 0 | 0 |
| Seth Hillbrand, Columbia University(24) | 0.1779 | 596.33 | 1423024 | 544.76 | 1391296 | 0 | 0 | 0 |
| Seth Hillbrand, Columbia University(18) | 0.1781 | 796.83 | 3750080 | 851.56 | 3718256 | 0 | 0 | 0 |
| Daniel Jones, University of Washington(29) | 0.1793 | 203.58 | 1063712 | 571.36 | 1276576 | 0 | 0 | 0 |
| James Bonfield, Sanger Institute(61) | 0.1797 | 109.9 | 75664 | 156.98 | 74480 | 0 | 0 | 0 |
| Seth Hillbrand, Columbia University(14) | 0.1797 | 859.58 | 1686176 | 827.45 | 1659056 | 0 | 0 | 0 |
| Yongwook Choi, J. Craig Venter Institute(77) | 0.1803 | 1445.22 | 2348160 | 1591.64 | 2348160 | 0 | 0 | 0 |
| James Bonfield, Sanger Institute(35) | 0.1803 | 164.52 | 98768 | 219.13 | 98704 | 0 | 0 | 0 |
| Seth Hillbrand, Columbia University(13) | 0.1813 | 794.66 | 2744944 | 765.04 | 2690816 | 0 | 0 | 0 |
| James Bonfield, Wellcome Trust Sanger Institute(15) | 0.1818 | 137.56 | 149888 | 187.31 | 149840 | 0 | 0 | 0 |
| Seth Hillbrand, Columbia University(12) | 0.1824 | 518.82 | 10897392 | 474.45 | 10639488 | 0 | 0 | 0 |
| Daniel Jones, University of Washington(23) | 0.1841 | 274.99 | 1010736 | 395.09 | 1147136 | 0 | 0 | 0 |
| James Bonfield, Sanger Institute(33) | 0.1847 | 135.95 | 35760 | 189.27 | 35680 | 0 | 0 | 0 |
| James Bonfield, Sanger Institute (+denizens of encode.ru)(8) | 0.1878 | 141.2 | 423360 | 164.19 | 411936 | 0 | 0 | 0 |
| Armando J. Pinho, IEETA / Universidade de Aveiro(93) | 0.1884 | 2322.17 | 68736 | 2394.99 | 68480 | 0 | 0 | 0 |
| Ibrahim Numanagic, Simon Fraser Uni(107) | 0.1993 | 691.24 | 25973392 | 414.91 | 6067872 | 153675 | 0 | 0 |
| James Bonfield, Wellcome Trust Sanger Institute(7) | 0.2206 | 188.38 | 36512 | 100.91 | 16016 | 0 | 0 | 0 |
| Markus Krisch, i-novation.de(109) | 0.2241 | 449.62 | 24651072 | 454.91 | 5526784 | 0 | 0 | 0 |
| Inbal Landsberg, Or Peled, Barak Yacov, Yonatan Amir, Dan Benjamin and Ron Reiter, Interdisciplinary Center (IDC) Herzliya(95) | 0.2289 | 5565.23 | 6594288 | 3850.37 | 5973312 | 198799 | 2403 | 2403 |
| Ron Reiter, Interdisciplinary Center (IDC) Herzliya(9) | 0.24 | 738.27 | 27664 | 351.21 | 16032 | 0 | 0 | 0 |
| Ryan Braganza, Intersect Australia(39) | 0.2563 | 6753.88 | 14272 | 296.78 | 54900656 | 23834062 | 23834062 | 23834062 |
| Competition Baseline, SequenceSqueeze(6) | 0.3007 | 1020.97 | 5200 | 104.5 | 5008 | 0 | 0 | 0 |
| Ryan Braganza, Intersect Australia(28) | 0.301 | 1299.66 | 15040 | 331.02 | 13472 | 0 | 0 | 0 |
| Ryan Braganza, Intersect Australia(43) | 0.3072 | 1595.34 | 16144 | 493.9 | 53856 | 0 | 0 | 0 |
| Ryan Braganza, Intersect Australia(40) | 0.3072 | 1555.38 | 18960 | 1120.29 | 59433232 | 23834062 | 23834062 | 23834062 |
| Ryan Braganza, Intersect Australia(41) | 0.3072 | 1540.58 | 17152 | 1118 | 59433792 | 23834062 | 23834062 | 23834062 |
| Ryan Braganza, Intersect Australia(42) | 0.3072 | 1550.46 | 18976 | 997.72 | 17120 | 23834062 | 23834062 | 23834062 |
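
The rows above can be read programmatically, for example to separate fully lossless entries (zero header, sequence and quality mismatches) from the rest. The sketch below is illustrative only and is not the competition's judging procedure. The names `Entry`, `parse_row` and `best_lossless` are hypothetical, and it assumes that a lower compression ratio is better, which is inferred from the table's ascending sort order rather than stated in the text.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    name: str
    ratio: float
    compress_s: float
    compress_kb: int
    decompress_s: float
    decompress_kb: int
    header_mismatches: int
    sequence_mismatches: int
    quality_mismatches: int

    @property
    def lossless(self):
        # True only when headers, sequences and qualities all round-trip.
        return (self.header_mismatches == 0
                and self.sequence_mismatches == 0
                and self.quality_mismatches == 0)


def parse_row(line):
    """Turn one '| name | ratio | ... |' row into an Entry."""
    cells = [c.strip() for c in line.strip().strip("|").split("|")]
    ratio, c_s, c_kb, d_s, d_kb, hdr, seq, qual = cells[1:9]
    return Entry(cells[0], float(ratio), float(c_s), int(c_kb),
                 float(d_s), int(d_kb), int(hdr), int(seq), int(qual))


def best_lossless(entries):
    """Assumption: a lower ratio means better compression."""
    return min((e for e in entries if e.lossless), key=lambda e: e.ratio)


# Three rows copied verbatim from the table above.
ROWS = [
    "| Matt Mahoney, Dell Inc.(96) | 0.0287 | 905.79 | 5293280 | 394.63 | 5294704 | 0 | 0 | 24033620 |",
    "| James Bonfield, Sanger Institute(101) | 0.1141 | 3280.24 | 13235808 | 325.25 | 4381552 | 0 | 0 | 0 |",
    "| Competition Baseline, SequenceSqueeze(6) | 0.3007 | 1020.97 | 5200 | 104.5 | 5008 | 0 | 0 | 0 |",
]

entries = [parse_row(row) for row in ROWS]
print(best_lossless(entries).name)
```

Filtering on the mismatch columns before comparing ratios matters here: the entry with the smallest ratio in the sample has quality mismatches, so it did not preserve all of the information.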
Yingrui Li is the Duty Operation Officer of the Science & Technology Department of BGI-Shenzhen. His research focuses on genomics and related bioinformatics analysis, and he played a major part in building BGI's bioinformatics analysis platform for next-generation sequencing. He has organized or taken part in a number of national projects, including the "Yan Huang Project", the "1000 Genomes Project", "Yan Huang Whole Genome Methylation" and the "Cancer Genome Project", among others. He has authored 28 high-impact papers, 11 of them as first or co-first author. Of those 28, 16 were published in Nature (including the Nature series) and Science.
One of the founders of the Pistoia Alliance and president of the Alliance since 2009, Dr. Nick Lynch works with the board on strategy and leads the operational team that focuses on project delivery and daily maintenance of the Alliance.
Guy Coates leads the Informatics Systems Group at the Wellcome Trust Sanger Institute. His team is responsible for designing and maintaining the high-performance compute (HPC) environment that supports the Institute's high-throughput DNA sequencing platforms. As well as traditional HPC, his group also has an interest in promoting cloud and grid technologies to enable investigators to manage and share their data effectively.
Tim Fennell is the Assistant Director for Sequencing Pipeline Informatics at the Broad Institute. His team developed and maintains the Java language bindings for the popular SAM and BAM file formats, as well as the Picard suite of tools for working with next-generation sequencing data. In his role at the Broad, Tim and his team are responsible for all aspects of production pipeline development and operations for the Broad's fleet of next-generation sequencing instruments, which are capable of producing several terabases of sequence per day.