Volume 12   Issue 1   Year 2017
Nazipova N.N., Isaev E.A., Kornilov V.V., Pervukhin D.V., Morozova A.A., Gorbunov A.A., Ustinin M.N.

Big Data in Bioinformatics

Mathematical Biology & Bioinformatics. 2017;12(1):102-119.

doi: 10.17537/2017.12.102.



  1. Manyika J., Chui M., Brown B., Bughin J., Dobbs R., Roxburgh C., Byers A.H. Big Data: The Next Frontier for Innovation, Competition, and Productivity. San Francisco: McKinsey Global Institute; 2011. (accessed 17 February 2017).
  2. Jacobs A. The Pathologies of Big Data. Communications of the ACM. 2009;52(8). doi: 10.1145/1536616.1536632
  3. What’s New in Gartner’s Hype Cycle for Emerging Technologies, 2015. Gartner. (accessed 17 February 2017).
  4. Chui M., Löffler M., Roberts R. The Internet of Things. McKinsey Quarterly. 2010. (accessed 17 February 2017).
  5. Hogeweg P. The Roots of Bioinformatics in Theoretical Biology. PLOS Computational Biology. 2011;7(3). Article No. e1002021. doi: 10.1371/journal.pcbi.1002021
  6. Winkler H. Verbreitung und Ursache der Parthenogenesis im Pflanzen- und Tierreiche. Jena: Verlag Fischer; 1920. doi: 10.5962/bhl.title.1460
  7. Baker M. The ’Omes Puzzle. Nature. 2013;494:416-419. doi: 10.1038/494416a
  8. Ohashi H., Hasegawa M., Wakimoto K., Miyamoto-Sato E. Next-generation technologies for multiomics approaches including interactome sequencing. BioMed Research International. 2015;2015. Article No. 104209.
  9. International Human Genome Sequencing Consortium. Initial sequencing and analysis of the human genome. Nature. 2001;409:860-921.
  10. Venter J.C., Adams M.D., Myers E.W., Li P.W., Mural R.J., Sutton G.G., Smith H.O., Yandell M., Evans C.A., Holt R.A., et al. The sequence of the human genome. Science. 2001;291(5507):1304-1351. doi: 10.1126/science.1058040
  11. Buermans H.P.J., den Dunnen J.T. Next generation sequencing technology. Advances and applications. BBA – Molecular Basis of Disease. 2014;1842(10):1932-1941. doi: 10.1016/j.bbadis.2014.06.015
  12. Bioinforx Inc. Next Generation Sequencing Software. (accessed 17 February 2017).
  13. BaseSpace Sequence Hub. (accessed 17 February 2017).
  14. CLCBio. (accessed 17 February 2017).
  15. DNASTAR Lasergene. (accessed 17 February 2017).
  16. Kearse M., Moir R., Wilson A., Stones-Havas S., Cheung M., Sturrock S., Buxton S., Cooper A., Markowitz S., Duran C., et al. Geneious Basic: an integrated and extendable desktop software platform for the organization and analysis of sequence data. Bioinformatics. 2012;28(12):1647-1649. doi: 10.1093/bioinformatics/bts199
  17. Giardine B., Riemer C., Hardison R.C., Burhans R., Elnitski L., Shah P., Zhang Y., Blankenberg D., Albert I., Taylor J., et al. Galaxy: a platform for interactive large-scale genome analysis. Genome Res. 2005;15(10):1451-1455. doi: 10.1101/gr.4086505
  18. Goecks J., Nekrutenko A., Taylor J., Galaxy Team. Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences. Genome Biol. 2010;11(8). Article No. R86. doi: 10.1186/gb-2010-11-8-r86
  19. Madduri R.K., Sulakhe D., Lacinski L., Liu B., Rodriguez A., Chard K., Dave U.J., Foster I.T. Experiences Building Globus Genomics: A Next-Generation Sequencing Analysis Service using Galaxy, Globus, and Amazon Web Services. Concurr. Comput. 2014;26(13):2266-2279. doi: 10.1002/cpe.3274
  20. Wattam A.R., Abraham D., Dalay O., Disz T.L., Driscoll T., Gabbard J.L., Gillespie J.J., Gough R., Hix D., Kenyon R., et al. PATRIC, the bacterial bioinformatics database and analysis resource. Nucleic Acids Res. 2014;42:D581-D591. doi: 10.1093/nar/gkt1099
  21. Golosova O., Henderson R., Vaskin Y., Gabrielian A., Grekhov G., Nagarajan V., Oler A.J., Quinones M., Hurt D., Fursov M., Huyen Y. Unipro UGENE NGS pipelines and components for variant calling, RNA-seq and ChIP-seq data analyses. PeerJ. 2014;2. Article No. e644. doi: 10.7717/peerj.644
  22. Okonechnikov K., Golosova O., Fursov M., UGENE Team. Unipro UGENE: a unified bioinformatics toolkit. Bioinformatics. 2012;28(8):1166-1167. doi: 10.1093/bioinformatics/bts091
  23. Jagla B., Wiswedel B., Coppee J.-Y. Extending KNIME for next-generation sequencing data analysis. Bioinformatics. 2011;27(20):2907-2909. doi: 10.1093/bioinformatics/btr478
  24. Warr W.A. Scientific workflow systems: Pipeline Pilot and KNIME. Journal of Computer-Aided Molecular Design. 2012;26(7):801-804. doi: 10.1007/s10822-012-9577-7
  25. Oinn T., Addis M., Ferris J., Marvin D., Senger M., Greenwood M., Carver T., Glover K., Pocock M.R., Wipat A., Li P. Taverna: a tool for the composition and enactment of bioinformatics workflows. Bioinformatics. 2004;20(17):3045-3054. doi: 10.1093/bioinformatics/bth361
  26. Barnett D.W., Garrison E.K., Quinlan A.R., Stromberg M.P., Marth G.T. BamTools: a C++ API and toolkit for analyzing and managing BAM files. Bioinformatics. 2011;27(12):1691-1692. doi: 10.1093/bioinformatics/btr174
  27. Li H., Handsaker B., Wysoker A., Fennell T., Ruan J., Homer N., Marth G., Abecasis G., Durbin R., 1000 Genome Project Data Processing Subgroup. The Sequence Alignment/Map format and SAMtools. Bioinformatics. 2009;25(16):2078-2079. doi: 10.1093/bioinformatics/btp352
  28. Nordell Markovits A., Joly Beauparlant C., Toupin D., Wang S., Droit A., Gevry N. NGS++: a library for rapid prototyping of epigenomics software tools. Bioinformatics. 2013;29(15):1893-1894. doi: 10.1093/bioinformatics/btt312
  29. Plieskatt J., Rinaldi G., Brindley P.J., Jia X., Potriquet J., Bethony J., Mulvenna J. Bioclojure: a functional library for the manipulation of biological sequences. Bioinformatics. 2014;30(17):2537-2539. doi: 10.1093/bioinformatics/btu311
  30. libStatGen. (accessed 17 February 2017).
  31. Pitt W.R., Williams M.A., Steven M., Sweeney B., Bleasby A.J., Moss D.S. The Bioinformatics Template Library – generic components for biocomputing. Bioinformatics. 2001;17(8):729-737. doi: 10.1093/bioinformatics/17.8.729
  32. Stajich J.E., Block D., Boulez K., Brenner S.E., Chervitz S.A., Dagdigian C., Fuellen G., Gilbert J.G., Korf I., Lapp H., et al. The Bioperl toolkit: Perl modules for the life sciences. Genome Res. 2002;12(10):1611-1618. doi: 10.1101/gr.361602
  33. Goto N., Prins P., Nakao M., Bonnal R., Aerts J., Katayama T. BioRuby: bioinformatics software for the Ruby programming language. Bioinformatics. 2010;26(20):2617-2619. doi: 10.1093/bioinformatics/btq475
  34. Holland R.C., Down T.A., Pocock M., Prlic A., Huen D., James K., Foisy S., Drager A., Yates A., Heuer M., et al. BioJava: an open-source framework for bioinformatics. Bioinformatics. 2008;24(18):2096-2097. doi: 10.1093/bioinformatics/btn397
  35. Cock P.J., Antao T., Chang J.T., Chapman B.A., Cox C.J., Dalke A., Friedberg I., Hamelryck T., Kauff F., Wilczynski B., et al. Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics. 2009;25(11):1422-1423. doi: 10.1093/bioinformatics/btp163
  36. Open Bioinformatics Foundation. (accessed 17 February 2017).
  37. Huber W., Carey V.J., Gentleman R., Anders S., Carlson M., Carvalho B.S., Bravo H.C., Davis S., Gatto L., Girke T., et al. Orchestrating high-throughput genomic analysis with Bioconductor. Nat. Methods. 2015;12(2):115-121. doi: 10.1038/nmeth.3252
  38. Gentleman R.C., Carey V.J., Bates D.M., Bolstad B., Dettling M., Dudoit S., Ellis B., Gautier L., Ge Y., Gentry J., et al. Bioconductor: open software development for computational biology and bioinformatics. Genome Biol. 2004;5(10). Article No. R80. doi: 10.1186/gb-2004-5-10-r80
  39. Milicchio F., Rose R., Bian J., Min J., Prosperi M. Visual programming for next-generation data analytics. BioData Mining. 2016;9. Article No. 16. doi: 10.1186/s13040-016-0095-3
  40. Bernstein F.C., Koetzle T.F., Williams G.J., Meyer E.F.Jr., Brice M.D., Rodgers J.R., Kennard O., Shimanouchi T., Tasumi M. The Protein Data Bank: a computer-based archival file for macromolecular structures. J. Mol. Biol. 1977;112(3):535-542. doi: 10.1016/S0022-2836(77)80200-3
  41. Bourne P.E., Berman H.M., McMahon B., Watenpaugh K.D., Westbrook J.D., Fitzgerald P.M.D. Macromolecular crystallographic information file. Methods in Enzymology. 1997;277:571-590. doi: 10.1016/S0076-6879(97)77032-0
  42. Galperin M.Y., Fernández-Suárez X.M., Rigden D.J. The 24th annual Nucleic Acids Research database issue: a look back and upcoming changes. Nucleic Acids Res. 2017;45:D1-D11. doi: 10.1093/nar/gkw1188
  43. Benson D., Lipman D.J., Ostell J. GenBank. Nucleic Acids Res. 1994;22:3441-3444. doi: 10.1093/nar/22.17.3441
  44. Rice C.M., Fuchs R., Higgins D.G., Stoehr P.J., Cameron G.N. The EMBL Data Library. Nucleic Acids Res. 1993;21:2967-2971. doi: 10.1093/nar/21.13.2967
  45. Tateno Y., Gojobori T. DNA Data Bank of Japan in the age of information biology. Nucleic Acids Res. 1997;25(1):14-17. doi: 10.1093/nar/25.1.14
  46. de Brevern A.G., Meyniel J.-P., Fairhead C., Neuvéglise C., Malpertuy A. Trends in IT Innovation to Build a Next Generation Bioinformatics Solution to Manage and Analyse Biological Big Data Produced by NGS Technologies. BioMed Research International. 2015;2015. Article No. 904541. doi: 10.1155/2015/904541
  47. Lith A., Mattsson J. Investigating Storage Solutions for Large Data. A comparison of well performing and scalable data storage solutions for real time extraction and batch insertion of data: Master of Science Thesis. 2010. (accessed 17 February 2017).
  48. Svensson J. Relational vs. graph databases: Which to use and when? SD Times. 2016. (accessed 17 February 2017).
  49. Have C.T., Jensen L.J. Are graph databases ready for bioinformatics? Bioinformatics. 2013;29(24):3107-3108. doi: 10.1093/bioinformatics/btt549
  50. Taylor R.C. An overview of the Hadoop/MapReduce/HBase framework and its current applications in bioinformatics. BMC Bioinformatics. 2010;11. Article No. S1. doi: 10.1186/1471-2105-11-S12-S1
  51. Chang F., Dean J., Ghemawat S., Hsieh W.C., Wallach D.A., Burrows M., Chandra T., Fikes A., Gruber R.E. Bigtable: A Distributed Storage System for Structured Data. In: The 7th Symposium on Operating System Design and Implementation. Seattle, WA: USENIX Association; 2006. 14 p. (accessed 17 February 2017).
  52. Shen L., Shao N., Liu X., Nestler E. Ngs.plot: quick mining and visualization of next-generation sequencing data by integrating genomic databases. BMC Genomics. 2014;15(1). Article No. 284. doi: 10.1186/1471-2164-15-284
  53. Robinson J.T., Thorvaldsdóttir H., Winckler W., Guttman M., Lander E.S., Getz G., Mesirov J.P. Integrative genomics viewer. Nature Biotechnology. 2011;29(1):24-26. doi: 10.1038/nbt.1754
  54. Toedling J., Ciaudo C., Voinnet O., Heard E., Barillot E. Girafe – an R/Bioconductor package for functional exploration of aligned next-generation sequencing reads. Bioinformatics. 2010;26(22):2902-2903. doi: 10.1093/bioinformatics/btq531
  55. Nolan D., Lang D.T. Interactive and animated scalable vector graphics and R data displays. Journal of Statistical Software. 2012;46(1):1-88. doi: 10.18637/jss.v046.i01
  56. TIBCO Spotfire Homepage. (accessed 17 February 2017).
  57. Wexler J., Thompson W., Aponte K. Time Is Precious, So Are Your Models. SAS provides solutions to streamline deployment. In: SAS Global Forum 2013. Paper No. 086-2013. (accessed 17 February 2017).
  58. Tanenbaum A.S., van Steen M. Raspredelennye sistemy. Printsipy i paradigmy. Saint Petersburg; 2003. 877 p. (Translation of: Tanenbaum A.S., van Steen M. Distributed Systems: Principles and Paradigms. Prentice Hall; 2002).
  59. Dean J., Ghemawat S. MapReduce: simplified data processing on large clusters. Commun. ACM. 2008;51(1):107-113. doi: 10.1145/1327452.1327492
  60. White T. Hadoop: The Definitive Guide. O’Reilly Media, Inc.; 2015. 756 p.
  61. The Apache Software Foundation Home page. (accessed 17 February 2017).
  62. IBM z Systems – z13s. (accessed 17 February 2017).
  63. Rustici G., Kolesnikov N., Brandizi M., Burdett T., Dylag M., Emam I., Farne A., Hastings E., Ison J., Keays M., et al. ArrayExpress update – trends in database growth and links to data analysis tools. Nucleic Acids Res. 2013;41:D987-D990. doi: 10.1093/nar/gks1174
  64. Greene A.C., Giffin K.A., Greene C.S., Moore J.H. Adapting bioinformatics curricula for big data. Briefings in Bioinformatics. 2016;17(1):43-50. doi: 10.1093/bib/bbv018
  65. Margolis R., Derr L., Dunn M., Huerta M., Larkin J., Sheehan J., Guyer M., Green E.D. The National Institutes of Health’s Big Data to Knowledge (BD2K) initiative: capitalizing on biomedical big data. J. Am. Med. Inform. Assoc. 2014;21:957-958. doi: 10.1136/amiajnl-2014-002974
  66. Luo J., Wu M., Gopukumar D., Zhao Y. Big Data Application in Biomedical Research and Health Care: A Literature Review. Biomed. Inform. Insights. 2016;8:1-10. doi: 10.4137/BII.S31559


Original article (published in Russian): Math. Biol. Bioinf. doi: 10.17537/2017.12.102
Translation into English: Math. Biol. Bioinf. doi: 10.17537/2018.13.t1

