The Power of Computers in Biology

Worksheet

Task A: Identify the mysterious “Nucleotide Sequence R”

We provide a DNA sequence without any information on its function – Sequence R.

1. Open a new tab in your Web browser.

2. Go to Sequence R.

Copy the whole of it.

Keep this browser tab or window open.

To discover the role of this mysterious sequence, we will search for proteins in a database that show high similarity to a translation of Sequence R.

To do this we will use a sequence search engine, BLAST, to search sequences in the database at the National Center for Biotechnology Information (NCBI).

1. Open BLAST.

2. Click “blastx”.

3. In the box labelled “Enter Query sequence”, paste Sequence R.

4. Click “BLAST”.

BLASTX uses the genetic code to translate Sequence R in all six reading frames, then compares the translations with every protein in the sequence database.

The BLASTX search may take a few minutes, during which time “Status” on the Web page is “Searching”.

Results will be shown in a long Web page. Scroll down to see a table of results. The best-matching sequence from the database is listed first.

Question 1

Sequence R has an excellent match to a known protein, indicated by its low E-value and high percentage identity. We assume Sequence R codes for this protein.

What is the name of this protein?

Hint: To find out more about this protein, click the link under “Accession”. Then you will see the protein name near the top of the page, in bold.

Question 2

In which organism is this protein found?

Hint: Look for the line beginning “SOURCE”.

Question 3

What is the biological role of the protein?

Hint: Do a Web search for the name of the protein (from your answer to Question 1).

Task B: Search for a match to Sequence R in the human genome

We will now perform another BLAST search to see if Sequence R has a match in the human genome.

This time, we will use BLASTN, which compares a DNA query with a DNA sequence database.

1. Copy Sequence R again from our Web page.

2. Go back to the main NCBI BLAST Web page. (You can just search for “NCBI BLAST”.)

3. Under “BLAST Genomes”, click “Human”.

4. Click “blastn” at the top of the page, and paste in Sequence R.

5. Under “Program selection” > “Optimize for”, select “Somewhat similar sequences (blastn)”.

6. Click “BLAST”.

Question 4

On which human chromosome has the best match been found?

Click on the first result in the “Description” column of the table.

This will display fragments of Sequence R (the query sequence) aligned against human genomic DNA found by the BLASTN search (the subject sequences).

Look at the BLASTN results carefully.

In your results, you will see a good match between Sequence R and the human genomic DNA. However, the mouse (Sequence R) and human sequences differ, due to mutations.

Figure 1 shows evidence of substitution and frameshift mutations in BLASTN results.

Question 5

In the alignment between Sequence R and the human genomic DNA, can you see evidence of a substitution mutation?

If so: sketch the region that includes the substitution.

Question 6

Can you see evidence of an insertion or deletion mutation?

If so: sketch the region that includes the insertion or deletion.

A frameshift mutation is an insertion or deletion which disrupts the reading frame of a protein-coding sequence. From this point on, any protein sequence would be scrambled.

A frameshift mutation is strong evidence that the DNA no longer codes for a functional protein.

Question 7

Do you think the human genome includes a functional version of Sequence R?

Explain your answer.

Question 8

Sequence R comes from the house mouse and codes for L-gulonolactone oxidase, an enzyme that synthesizes vitamin C.

Vitamin C is vital for both humans and mice.

Does your answer to Question 7 tell us anything about how the diet of humans might differ from the diet of mice?

Task C: Using BLAST at the command line

We typically interact with a computer using a graphical user interface (GUI). This involves using a mouse or a touch-screen, perhaps with some typing.

An alternative is the command line. This involves commands typed into a terminal.

The command line may be more difficult to use at first. However, it has some advantages. We can easily keep a copy of the commands we ran, making it easier to document our actions.

In bioinformatics, both GUI and command line are used every day. For scientific computing, we often use the command line on a Linux operating system.

Introduction to the command line: creating, saving, and reading files

Follow our on-screen instructions to connect to the Linux server.

To access the command line, click the black rectangular icon near the top left.

First, we will use the nano text editor to create a file.

To start editing a new file called test.txt, type the following text, then press ENTER

nano test.txt

You now see something like a basic word processor. Type some text, for example “hello world”.

hello world

To exit nano and save your file:

Press CTRL-X

nano will then ask:

"Save modified buffer (ANSWERING "No" WILL DESTROY CHANGES)?"

Press y to save your file.

nano will then ask for the file name, suggesting the name you gave earlier (test.txt)

Press ENTER

nano should now close, bringing you back to the command prompt. A $ sign is shown, meaning the computer is waiting for your next command.

To find the file you created, we will now list the files on the Linux server. Type the following and press ENTER

ls

You should be able to see the name of the file you created, test.txt, and others. You can display the contents of a text file by using the cat command. Type the text below and press ENTER

cat test.txt
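
A small text file can also be created without an editor. The lines below are a minimal sketch of this: they write "hello world" into a new file, list the folder, and display the file. The file name hello.txt and the temporary folder are made up for this illustration, so that test.txt is left untouched.

```shell
# Work in a temporary folder so that no existing file is overwritten.
workdir=$(mktemp -d)

# ">" sends the output of echo into a file instead of onto the screen.
echo "hello world" > "$workdir/hello.txt"

# List the files in the folder, then display the contents of the new file.
ls "$workdir"
cat "$workdir/hello.txt"
```

Because every step is a typed command, the same lines can be saved and run again later.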

Sequence data in text files

DNA sequences are stored in text files. The DNA sequence we used in Tasks A and B is stored in the file sequence_r.fa.

To display the contents of sequence_r.fa, type the following text and then press ENTER

cat sequence_r.fa

Working at the command line allows us to work with very large data files. The file chromosome8.fa contains the DNA sequence of human Chromosome 8.

To display the sequence of Chromosome 8, type the following text and press ENTER

cat chromosome8.fa

This DNA sequence is very long. If it takes too long to display the whole sequence, you can stop the process by pressing CTRL-C.
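
For a very large file, it is often more useful to look at a small part of it, or to count what it contains, than to print all of it. The sketch below shows three ways of doing this. It assumes the file is in FASTA format (a header line beginning with ">", followed by lines of sequence), and it uses a tiny made-up file, toy.fa, rather than real data.

```shell
# Build a tiny FASTA file to practise on (made-up sequence, not real data).
workdir=$(mktemp -d)
cat > "$workdir/toy.fa" <<'EOF'
>toy_sequence example header
ATGGCCTGACCA
GGCTTAACGT
EOF

# Show only the first 2 lines instead of the whole file.
head -n 2 "$workdir/toy.fa"

# Count the header lines (they start with ">"), i.e. the number of sequences.
grep -c '^>' "$workdir/toy.fa"

# Count the bases: drop header lines, remove line breaks, count characters.
grep -v '^>' "$workdir/toy.fa" | tr -d '\n' | wc -c
```

The same commands should work on chromosome8.fa if you replace the file name. The counting steps may take a little while on a file of that size.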

Running BLAST at the command line

An advantage of working at the command line is that we can combine commands into a short program (script), and run them all at once.
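
As a rough illustration of combining commands, the sketch below defines a small function that runs blastn once for each genome file it is given. It assumes the NCBI BLAST+ program blastn is installed. The function name blast_each_genome and the names of the output files are made up for this example. The option "-task blastn" corresponds to the "Somewhat similar sequences (blastn)" setting used on the Web page.

```shell
# Run blastn for one query sequence against each genome FASTA file given.
# Writes one alignment file per genome: <genome-name>_blast_alignment.txt
# (function name and output file names are hypothetical)
blast_each_genome() {
    local query="$1"
    shift

    # Stop early if the BLAST+ program blastn is not installed.
    if ! command -v blastn >/dev/null 2>&1; then
        echo "blastn not found: NCBI BLAST+ must be installed" >&2
        return 1
    fi

    local genome name
    for genome in "$@"; do
        if [ ! -f "$genome" ]; then
            echo "skipping $genome: file not found" >&2
            continue
        fi
        # Strip the folder and the .fa ending to build the output name.
        name=$(basename "$genome" .fa)
        blastn -task blastn -query "$query" -subject "$genome" \
               -out "${name}_blast_alignment.txt"
        echo "wrote ${name}_blast_alignment.txt"
    done
}

# Example call, using file names from this worksheet:
# blast_each_genome sequence_r.fa chromosome8.fa
```

Adding more genome files to the call would repeat the search for each of them without any extra typing.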

We provide a script, mutation_checker.sh, which allows you to search for GULO, the gene encoding L-gulonolactone oxidase, in animal genomes.

Type the following text (all in one line), then press ENTER

bash mutation_checker.sh

Then select the animal(s) you would like to analyse from the numbered list. For example, to analyse human and dog, type:

2 4

and press ENTER

mutation_checker.sh will run BLAST to compare Sequence R to the animal(s) you choose.

Then, mutation_checker.sh checks if the BLAST alignment contains any insertions or deletions (indels).

If there are no indels, we can be reasonably confident that the GULO gene codes for a functional enzyme, and that this species can produce its own vitamin C.

If there are indels, it is likely that there has been a frameshift mutation, that this species does not have a functional GULO gene, and that it cannot produce its own vitamin C.
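
The contents of mutation_checker.sh are not shown in this worksheet. The following is only a sketch of how an indel check of this kind could work. It assumes the alignment is in BLAST's default pairwise text format, where aligned lines begin with "Query" or "Sbjct" and a gap is written as "-". The function name count_gaps and the toy alignment are made up for illustration.

```shell
# Count gap characters ("-") in the aligned sequences of a BLAST pairwise
# alignment. Aligned lines are assumed to look like:
#   Query  1   ATGGCC-TGACCA  12
# so the sequence is the third whitespace-separated field.
count_gaps() {
    # $1 = alignment file, $2 = line label: "Query" or "Sbjct"
    awk -v label="$2" '$1 == label { n += gsub(/-/, "", $3) } END { print n + 0 }' "$1"
}

# Toy alignment (made up for illustration; not real BLAST output).
workdir=$(mktemp -d)
cat > "$workdir/toy_alignment.txt" <<'EOF'
Query  1   ATGGCC-TGACCA  12
           |||||| ||| ||
Sbjct  50  ATGGCCATGA-CA  61
EOF

query_gaps=$(count_gaps "$workdir/toy_alignment.txt" Query)
sbjct_gaps=$(count_gaps "$workdir/toy_alignment.txt" Sbjct)

# Any gap in either sequence is reported as an indel.
if [ $((query_gaps + sbjct_gaps)) -gt 0 ]; then
    echo "indel(s) found: $query_gaps gap(s) in query, $sbjct_gaps in subject"
else
    echo "no indels found"
fi
```

This sketch only counts gaps. It does not check whether the length of an indel is a multiple of three, so it cannot tell by itself whether the reading frame is disrupted.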

Remember

Compared to running several BLAST searches online, using the script (mutation_checker.sh) has saved time.

However, we must always check the results produced by a computer.

To inspect our results we will use the cat command.

For example, the dog has two copies of the GULO gene. mutation_checker.sh informs us that gene 1 likely has an indel and that gene 2 is intact.

To inspect the BLAST alignment for dog gene 1, use the following command (remember to change the file name to match the animal(s) you are interested in).

cat dog_1_blast_alignment.txt

Question 9

Which animals did you analyse?

Do you think these animals can produce their own vitamin C?

Did any of the results surprise you?

Daniel Barker, Heleen Plaisier, Laura CE Campbell, Stevie A Bain, Richard Fitzpatrick and Chenxi Zhang

4273pi Bioinformatics Education Project

Copyright and related rights waived via CC0 1.0 Public Domain Dedication.

Version 3.1 (extended)