1 Computational biology
Biologists are confronted by a tidal wave of information. Unfortunately, few of them know how to swim.
Economist, June 24, 1999.
1.1 Biological applications of computers
Biology is becoming an information-intensive and systems-oriented discipline.
The best known example of information intensive biology is the human genome project. The human genome, printed out, would fill thousands of volumes. There is no way to make sense of it without computational tools. Here are a few of the ways computer programs help.
Computers store and retrieve sequence information. You can pick some gene, say BRCA2, one of the breast cancer susceptibility genes, and quickly get its DNA sequence from any internet connected computer. This alone represents a large investment in technology and an important tool for scientists.
Genes, the parts of the DNA that encode proteins, actually make up a small fraction of the DNA. Programs help distinguish genes from the much larger volume of noncoding DNA. There are systematic statistical differences between the genes and the noncoding DNA, and computer programs can learn those differences and attempt to classify unknown segments of DNA.
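To make the idea of a statistical difference concrete, here is a minimal sketch that computes one very simple statistic of a DNA segment: the fraction of its bases that are G or C. The choice of statistic, the class name GcContent, and the sample sequence are all assumptions made for illustration. Real gene-finding programs rely on much richer statistics than this.

```java
// Hypothetical example class, not part of any gene-finding package.
class GcContent {

    // Returns the fraction of bases in the segment that are G or C, ignoring case.
    static double gcFraction(String dna) {
        if (dna.isEmpty()) {
            throw new IllegalArgumentException("sequence must not be empty");
        }
        int gc = 0;
        for (int i = 0; i < dna.length(); i++) {
            char base = Character.toUpperCase(dna.charAt(i));
            if (base == 'G' || base == 'C') {
                gc++;
            }
        }
        return (double) gc / dna.length();
    }

    public static void main(String[] args) {
        // Made-up segment, for illustration only.
        String segment = "ATGGCGCCATTA";
        System.out.println("GC fraction: " + gcFraction(segment));
    }
}
```

A classifier could compare statistics like this one across segments already known to be coding or noncoding, and then use the differences to guess the status of an unknown segment.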
Biologists reconstruct evolutionary trees by measuring the similarity of genes across species. They use computer programs to calculate similarity.
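One of the simplest similarity measures is percent identity: the percentage of positions at which two sequences carry the same letter. The sketch below assumes the two sequences have already been aligned and have equal length, which is a big simplification, since producing the alignment is the hard part. The class name and the sequences are made up for illustration.

```java
// Hypothetical example class: percent identity between two pre-aligned sequences.
class PercentIdentity {

    // Returns the percentage of positions at which the two sequences agree.
    // Assumes the sequences are already aligned and have the same length.
    static double percentIdentity(String a, String b) {
        if (a.length() != b.length()) {
            throw new IllegalArgumentException("sequences must have equal length");
        }
        if (a.isEmpty()) {
            throw new IllegalArgumentException("sequences must not be empty");
        }
        int matches = 0;
        for (int i = 0; i < a.length(); i++) {
            if (a.charAt(i) == b.charAt(i)) {
                matches++;
            }
        }
        return 100.0 * matches / a.length();
    }

    public static void main(String[] args) {
        // Made-up sequences that differ at one position out of ten.
        String first = "ATGGCCATTG";
        String second = "ATGGCTATTG";
        System.out.println(percentIdentity(first, second) + "% identical");
    }
}
```

Production tools use more sophisticated scoring that allows for insertions, deletions and the differing likelihood of particular substitutions, but the basic idea of counting agreements is the same.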
Biologists want to know which parts of the genome are preserved across many species and which parts are changing, and computer programs help search for this information. A highly preserved section may indicate a critical biological function. The human gene KCNJ3, which codes for a protein used in a potassium channel critical for neural signalling and other functions, has a 96% match with the comparable gene in the coelacanth, a primitive fish. Rapidly changing genes are also interesting. Human genes relating to digestion and brain structure have been evolving rapidly in recent history.
Computational tools are very important in other areas of biology also. For example, the three dimensional structure of a protein or RNA molecule is critical to its function. Predicting the three dimensional structure from the sequence information is an important unsolved problem in biology. Many groups are working on computer programs to help make these predictions.
Scientists use computer programs to simulate neural systems. A simulation tests our understanding of how a single cell works. If we write a computer program modeling a neuron, and the program predicts the same behavior under a specific set of conditions that the bench scientist observes in the lab, then in some sense we can claim to understand it. This line of work goes back to Hodgkin and Huxley, who used pencils and paper and desk calculators to work out a computational model of the giant axon of the squid. Scientists are working on programs to simulate large systems of neurons to try to understand how the properties of brains are built out of the properties of neurons.
Systems biology is another area where computers help biologists. Cells have very complex regulatory networks for their genes and proteins. An external signal may activate a protein that binds to a gene regulator so that another protein is synthesized, and so on and on. Systems biology studies these interlocking systems, in effect studying how individual molecular interactions lead to the behavior of the cell as a whole.
Ecology and population biology often use computer simulations to test understanding of natural systems, and to predict their future course. Many of these systems are too big for lab experiments. Computer models also have great practical value, for example in predicting how different actions may affect wildlife populations.
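As a taste of what such a simulation looks like, here is a minimal sketch of discrete logistic growth, a standard textbook model in which a population grows quickly when small and levels off as it approaches the carrying capacity of its environment. The model choice, the class name and every parameter value below are assumptions picked for illustration, not measurements of a real population.

```java
// Hypothetical example class: discrete logistic population growth.
class LogisticGrowth {

    // Advances the population by one time step.
    // growthRate is the per-step growth rate, capacity the carrying capacity.
    static double step(double population, double growthRate, double capacity) {
        return population + growthRate * population * (1.0 - population / capacity);
    }

    // Runs the model for the given number of steps and returns the final population.
    static double simulate(double initial, double growthRate, double capacity, int steps) {
        double population = initial;
        for (int t = 0; t < steps; t++) {
            population = step(population, growthRate, capacity);
        }
        return population;
    }

    public static void main(String[] args) {
        // Illustrative values only.
        double population = 10.0;
        double growthRate = 0.5;
        double capacity = 1000.0;
        for (int year = 0; year <= 20; year++) {
            System.out.printf("year %2d: %.1f%n", year, population);
            population = step(population, growthRate, capacity);
        }
    }
}
```

Changing growthRate or capacity and rerunning the program is a cheap way to ask "what if" questions that would be impractical to test in the field.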
Many lab instruments include dedicated computers which help run them, interpret the results and communicate them to the scientist or another computer. These are called embedded systems.
This is a sampling of the most important areas where computers are helping biologists. These are the kinds of problems this book prepares you to tackle.
1.2 How to use this book
Programming is something you learn by doing. This book includes many exercises and projects, and to master the subject you should spend more time doing those than you do reading the book.
Most chapters in this book have a scientific topic and a programming topic or topics. The chapter title describes the scientific subject and the subtitle the programming subjects.
This book has sample code at https://github.com/pgarst/JavaBio. It includes all the code used in this book in downloadable form, which you may want to use as a starting point for your own projects. It also includes solutions for many of the exercises and projects, but you should always attempt your own solution before peeking.
You should read this book at the computer for the most part. On almost every page you will want to check the online documentation or try a small example of a new feature of Java.
Programming involves many picky details, but avoid getting lost in them. The essence of programming isn’t getting all the semicolons in the right places. It lies in feeling out the structure of problems and finding robust, usable ways to solve them.
You can learn Java from this book without any particular background, but a solid high school level science background will help. Each chapter includes brief notes on the science used in the chapter, with pointers to more detailed information. There are occasional advanced notes which put some topic in context for those with a somewhat deeper background.
This book should be excellent background for more advanced work in computer science or computational biology. Compared to other Java books this one puts more emphasis on topics useful in scientific and technical computing, such as numerical methods, simulations, random number facilities and available repositories of scientific software.
1.3 Java and other languages
A program is a list of instructions telling the computer precisely what to do, written in a computer language. Java is one among many languages you can use to write that list of instructions. Each language has very specific rules and structures.
Here is one statement in Java:
System.out.println("Darwin rules!");
Here is the same instruction in Python:
print("Darwin rules!")
In Perl:
print "Darwin rules!\n";
In C++ (pronounced “c plus plus”):
cout << "Darwin rules!" << endl;
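A single Java statement cannot run by itself. It has to sit inside a method, which in turn sits inside a class. Here is a minimal sketch of a complete program built around the Java statement; the class name DarwinRules is made up for illustration.

```java
// Hypothetical class name; any legal Java class name would do.
class DarwinRules {

    // Java starts running a program at the main method.
    public static void main(String[] args) {
        System.out.println("Darwin rules!");
    }
}
```

If you save this in a file named DarwinRules.java, a recent Java installation should be able to run it from the command line with java DarwinRules.java.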
Why bother with all these languages? Why doesn't everybody just use one language? Most people would agree that there are more computer languages than are strictly necessary, but like animals they evolve and multiply. Dennis Ritchie designed a programming language called C at Bell Labs in 1972, which was a successor to a programming language called B. It was well designed and became widely used. Starting in 1979 Bjarne Stroustrup at Bell Labs thought about C, what worked well and what could be improved, and C++ was born. In 1995 James Gosling at Sun Microsystems did this again. He thought about C++ and how it could be improved, and developed Java.
The process still continues. Scala is a more powerful successor language to Java which is generating a lot of buzz. In ten years people in your position might be reading Scala for Bioinformatics, or might be reading about some other language.
Java programs are platform independent: they run on Windows, Macs and Linux machines, and others as well. After chapter 2, where we walk you through installing the tools you will need to program in Java, the rest of the book works equally well for different platforms with only very minor customizations. The other major computer languages used for biology have followed this example, but for earlier languages making programs available on different machines could be a huge pain in the neck.
Different languages have different strengths. Some produce faster programs; some are easier to use to quickly throw a small program together; some include resources for building big software systems. There are special purpose languages designed for writing programs that might run on thousands of computers at once.
There are three languages in wide use in biology: Java, Python and Perl. Perl was adopted early on, and so there is a lot of biological material in that language, but both Java and Python are better languages, and I’d recommend using one of them if you have a choice. Your choice might depend on what people in your lab use, or the language of some application you want to work with.
Biologists also use many other languages, albeit less commonly.
Let's list some of the good and bad points of Java relative to Python. The relative merits of computer languages can be a religious issue, and some people will disagree with these claims.
- Java is more widely used in software engineering than Python, so it would be a good choice if you are thinking of studying more computer science.
- In many cases Java programs run significantly faster than...