Applied Python 2


Retrieving and sending information over the web

Getting a web page

A lot of biological information lives on web sites. Most of it can, with the right tools, be persuaded to jump neatly into your Python programs 🙂

A quick reminder from the introductory course: we can get the contents of a web page quite easily using Python’s built-in urllib module:

import urllib.request
f = urllib.request.urlopen('http://nematodes.org')
result = f.read()
print(result)

In some cases, this is all we need to do; the website builder has made it very easy for us to get the information we need – for example, the NCBI Eutils service:

import urllib.request

# get a FASTA sequence for accession number 12345 
f = urllib.request.urlopen('http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=12345&rettype=fasta')

# read the result and turn it back into a text string
result = f.read().decode('utf-8')
print(result)
>gi|12345|emb|CAA44029.1| psaI [Triticum aestivum]
MTDLNLPSIFVPLVGLVFPAIAMTSLFLYVQKNKIV
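
If you need to build this kind of query URL from variables (for example, a different accession number each time), the urllib.parse module from the standard library can assemble the parameter string for you. Here is a minimal sketch using the same efetch parameters as above:

import urllib.request
import urllib.parse

# build the query string from a dict of parameters
params = urllib.parse.urlencode({
    'db' : 'nucleotide',
    'id' : '12345',
    'rettype' : 'fasta'
})
url = 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?' + params

# fetch and decode the result as before
result = urllib.request.urlopen(url).read().decode('utf-8')
print(result)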

In most cases, the website will not be quite so co-operative. Here is the web page that gives the status of a bunch of genome projects from the 959 Nematode Genomes initiative:

http://www.nematodes.org/nematodegenomes/index.php/Strains_with_Data

This is clearly structured data, which would be amenable to manipulation in a Python script. The problem is extracting the useful bits of data from the result we get when we
try to read the page:

import urllib.request
f = urllib.request.urlopen('http://www.nematodes.org/nematodegenomes/index.php/Strains_with_Data')
result = f.read()
print(result)

If we look inside the result, we can pick out the useful bits of information – so they are definitely in there! The trouble is that (unlike the eutils example)
the useful information is mixed up with a load of other stuff (mostly HTML structure which tells a web browser how the page should be displayed).

We need a tool that can parse this web page (just like we parse a FASTA file or a BLAST output file) and let us extract specific bits of information. We might be tempted to use
regular expressions, but they will not work reliably for HTML – we need a ‘proper’ parser that reads the elements one by one. Parsing HTML should be quite a simple task; after all, the format is
extremely well documented – take a look here for the specification. The trouble is that very few web pages follow the specification exactly!
Web browsers have been designed to work around this fact, so you don’t notice HTML errors when you are browsing, but it’s easy to test a web page against the specification (this
process is called ‘validation’). Here is a link to the validation results for the page we are looking at.
You can see that even this short page breaks the rules for HTML in 21 places, so we can’t rely too much on the specification.

The upshot of all this is that we will be using an existing library for HTML parsing which has been designed to cope with deviations from the spec (just like web browsers do). The library
is called Beautiful Soup and it’s not part of the standard Python distribution, so you will have to install it. If you are on Linux then you can
just run apt-get install python-beautifulsoup4. If you are on a Mac or Windows then download and unzip the package and run python3 setup.py install from inside the folder.
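On most systems you should also be able to install it with pip by running pip install beautifulsoup4 (or pip3 install beautifulsoup4, depending on how your Python is set up).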

Beautiful Soup has wonderful documentation which should be a model for all open-source projects 🙂

Here’s how we get started:

# import the module
from bs4 import BeautifulSoup
import urllib.request

# get the page data as normal
page = urllib.request.urlopen('http://www.nematodes.org/nematodegenomes/index.php/Strains_with_Data').read()

# use the BeautifulSoup function. Its input is the page data, and the output is an object that we can use
soup = BeautifulSoup(page)

# there are many ways of extracting the bits of information we need
# we can find all elements of a particular type
# how many tables are there?
print(len(soup.find_all('table')))
# how many table rows are there?
print(len(soup.find_all('tr')))
# how many links (a elements) are there?
print(len(soup.find_all('a')))

# once we have an element, or list of elements, we can get various bits of information

for link in soup.find_all('a'):
    print("n")
    # we can get the text of an element
    print(link.string)
    
    # or we can ask for a particular attribute
    print(link.get('href'))
 
# check the documentation for more

Here’s a more useful example; we will try to print the data URLs for all species whose genome status is ‘complete’:

# here is a concrete example
# for each species whose status is 'complete'
# get a list of all the URLs with information and print them

from bs4 import BeautifulSoup
import urllib.request

page = urllib.request.urlopen('http://www.nematodes.org/nematodegenomes/index.php/Strains_with_Data').read()
soup = BeautifulSoup(page)

# iterate through a list of all the table rows in the page
for row in soup.find_all('tr'):
    # get all table cells for a single row
    columns = row.find_all('td')
    td_count = len(columns) 
    # rows that contain useful information always have 6 cells
    if td_count == 6:
        # unpack the row into individual cells
        strain, species, status, contact, where, url = columns

        if status.string == 'complete':
            print(species.string)
            # the url cell can contain multiple a elements, so we will print out each one
            for anchor in url.find_all('a'):
                print('\t' + anchor.get('href'))

Notice how we drill down progressively deeper into the HTML structure to find the information we are looking for:
– First we look at all the table rows
– Then we look inside each row at the individual columns.
– Then we look inside the column which contains the status.
– Then we look at all the links inside the column that contains the url.
– Finally we look inside each link to get the href (the address that the link points to).

This is often a good strategy for getting information out of web pages.

Exercise

1. The ‘taxonomy’ section of the NCBI website lets us look at an overview of the data for a given taxonomic group. Each node in the NCBI taxonomic tree has a unique number – a taxid. For example,
arthropods are number 6656. Elephants are number 9779. The URL for a given taxonomic group looks like this:

http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&id=XXXXX&lvl=3&lin=f&keep=1&srchmode=1&unlock

where XXXXX is the number of the group you want to look at. Here’s a link to the overview page for arthropods:

http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&id=6656&lvl=3&lin=f&keep=1&srchmode=1&unlock

and here’s the page for elephants:

http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&id=9779&lvl=3&lin=f&keep=1&srchmode=1&unlock

The table at the top-left of the page shows the number of various types of records belonging to that group. Write a function with one parameter – a taxid – which returns
a three-element dict showing the number of nucleotide, protein and genome sequences available for that taxid. For example, if you run

print(get_counts(6656))

you should get the result:

{'protein': '1,631,010', 'nucleotide': '5,181,232', 'genome': '485'}

These numbers will probably be different by the time you try this exercise, as new sequence data are being added to GenBank all the time.

Your function will have to:

  • construct the url
  • fetch the page
  • look for all the table rows that have 3 columns
  • look for the table row where the content of the first column is ‘Nucleotide’ (be careful with spelling and case!)
  • get the value of the second column and store it in the dict
  • do the same for ‘Protein’ and ‘Genome’
  • return the dict
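
Here is one possible skeleton for get_counts(), following the steps above. Treat it as a starting sketch rather than a finished answer – the exact layout of the NCBI page may have changed since this was written, so check the output carefully:

from bs4 import BeautifulSoup
import urllib.request

def get_counts(taxid):
    # construct the url by dropping the taxid into the address shown above
    url = ('http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?'
           'mode=Info&id=' + str(taxid) + '&lvl=3&lin=f&keep=1&srchmode=1&unlock')
    # fetch and parse the page
    page = urllib.request.urlopen(url).read()
    soup = BeautifulSoup(page)

    counts = {}
    # look at all the table rows that have exactly 3 columns
    for row in soup.find_all('tr'):
        columns = row.find_all('td')
        if len(columns) == 3:
            record_type = columns[0].get_text(strip=True)
            # the count we want is the text of the second column
            if record_type == 'Nucleotide':
                counts['nucleotide'] = columns[1].get_text(strip=True)
            elif record_type == 'Protein':
                counts['protein'] = columns[1].get_text(strip=True)
            elif record_type == 'Genome':
                counts['genome'] = columns[1].get_text(strip=True)
    return counts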

Programs that extract information this way (‘screen scrapers’) are fragile! If there is something wrong with your internet connection, or the server is down, or the
url of the page changes, or the server address changes, etc, etc, then they will stop working.
They will also stop working if the layout of the web page changes over time; even if the new page layout looks exactly the same in the browser! If you need to use a screen-scraping program
that was written some time ago, then don’t be surprised if you have to tweak it to get it working again.


Email

Sending email in Python is a little bit tricky, for reasons to do with authentication. There is a built-in module called smtplib that handles most of the work for us.

Sending email is an example of an action that can fail for many reasons, few of which have anything to do with the program itself. For example, the email address could be non-existent, or the SMTP server could be down. To handle
this, Python uses a try .. except .. finally structure. If an error occurs inside the try block, then the code in the except block is run (for example, to print an error message). The code in the finally block is always run, regardless of whether there was an error or not. This is a good place to carry out important actions that must always happen – here, closing the connection to the mail server.
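
As a reminder of how the structure behaves on its own, here is a tiny sketch (the division by zero is just a stand-in for any operation that might fail):

try:
    result = 1 / 0                       # this raises an error
    print(result)                        # so this line is never reached
except ZeroDivisionError as exc:
    print("something went wrong: " + str(exc))
finally:
    print("this always runs, error or not")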

To make it easier to construct the message string, use triple quotes to allow you to split strings over multiple lines.

# this module handles the job of connecting to the mail server
from smtplib import SMTP_SSL as SMTP
# this module helps us to construct email objects and set their properties
from email.mime.text import MIMEText
# we need sys for sys.exit() in the error handler below
import sys

# this is the staffmail server address
# if you are on sms rather than staffmail, then use the next line

#SMTPserver = 'smtp.sms.ed.ac.uk'
SMTPserver = 'smtp.staffmail.ed.ac.uk'

# your email address goes here
sender =     "martin.jones@ed.ac.uk"

# this is the address to send to
destination = "somebody@example.com"

# your username and password (the ones you use for webmail/ease etc) go here
USERNAME = "XXXXXXXXX"
PASSWORD = "XXXXXXXXX"

# this is the message text - use three quotes so that we can use multiple lines
content="""
Hello world
"""

# this is the subject
subject="Advanced Python!"

# this is something new; all the try/except/finally stuff takes care of opening/closing connections
try:
    # create a new message object and store it in the variable msg
    msg = MIMEText(content, 'plain')
    # set the subject and the from address
    msg['Subject']= subject
    msg['From'] = sender
    # connect to the server, set the debug level so that we can see some messages, and login
    conn = SMTP(SMTPserver)
    conn.set_debuglevel(1)
    conn.login(USERNAME, PASSWORD)
    # now do the actual send
    try:
        conn.sendmail(sender, destination, msg.as_string())
    finally:
        conn.close()
# if there's an error then it will show up here
except Exception as exc:
    sys.exit( "mail failed; %s" % str(exc) ) # give an error message

Warning!

Do not try to send a lot of emails in a short space of time. What we are actually doing here is logging on to your email server and sending emails on your behalf.
If you try to do something odd, like send a few thousand emails in the space of a few minutes:

for i in range(0,10000):
    # email your best friend a link to a picture of a cat

The system administrator will not get the joke 🙂


Exercises

2. Copy and paste this script into a .py file on your computer and fill in the correct mail server address, username and password. Caution: because the script now contains
your username and password in plain text, be careful about accidentally sharing it! Run the script and check that the email is sent correctly. A useful website to use for testing is
mailinator.com. You can send email to any address @mailinator.com and it will accept them and let you view the inbox for that address.

3. Modify your get_counts script so that instead of printing the counts to the screen, it emails them to you. The easiest way to structure the code is by putting all the
email-related stuff in a separate function:

  • get_counts() – input: a taxid; output: a dict of counts; job: fetch the web page, extract the counts and add them to the dict
  • send_email() – input: a dict of counts and an email address; output: nothing; job: send an email to the specified address with a list of counts
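
Here is a rough sketch of what send_email() might look like, reusing the smtplib code from the email section above (the server address, sender address, username and password are placeholders that you will need to fill in, just as before):

from smtplib import SMTP_SSL as SMTP
from email.mime.text import MIMEText
import sys

# fill these in exactly as in the email script above
SMTPserver = 'smtp.staffmail.ed.ac.uk'
sender = 'your.address@ed.ac.uk'
USERNAME = 'XXXXXXXXX'
PASSWORD = 'XXXXXXXXX'

def send_email(counts, destination):
    # build the message body from the dict of counts
    content = "Latest sequence counts:\n"
    for record_type, count in counts.items():
        content = content + record_type + " : " + count + "\n"

    msg = MIMEText(content, 'plain')
    msg['Subject'] = 'Sequence counts'
    msg['From'] = sender

    try:
        conn = SMTP(SMTPserver)
        conn.login(USERNAME, PASSWORD)
        try:
            conn.sendmail(sender, destination, msg.as_string())
        finally:
            conn.close()
    except Exception as exc:
        sys.exit("mail failed; %s" % str(exc))

You could then call it with something like send_email(get_counts(6656), 'somebody@example.com').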

It would probably be useful to have this script run once per week, so that you get an email every Monday giving you the latest counts. On Linux/Mac, you can do this
using the cron tool. On Windows, take a look at Task Scheduler.

4. Modify your script so that instead of emailing the counts for a single taxid, it takes a list of taxids (after all, you might be interested in more than one taxonomic group!)
and sends a single email with a summary of the counts for all taxids. To make the email easier to read, refer to each group by its scientific name rather than its taxid
(Hint: you can get the scientific name from the page itself – there should be a single ‘h2’ element that contains the name). You can add the name to the dict of counts, so it looks like this:

  {'protein': '1,631,010', 'nucleotide': '5,181,232', 'genome': '485', 'name': 'Arthropoda'}

5. Now, imagine what will happen when you get back to your lab and show this script to your colleagues. They will be insanely jealous and will want to get emails themselves.
Prepare for this by modifying your script so that it reads a list of email addresses and taxids from an external file. The file will look like this:

martin.jones@ed.ac.uk 9779,6843,57294,42241
somebody.else@ed.ac.uk 123,456,78910
anotherone@gmail.com 6597,2155,12456,21544,32115,15774,21665,98784,21554

Each row has the email address of a person, followed by a space, followed by a comma-separated list of taxids. The data above is made up, so it won’t work – instead,
create a test file to use as the input for your script. Once you have it working, try to put together a file for your whole bench. You will need to ask your neighbours for their
email addresses and their favourite taxonomic groups.
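
The parsing step for this file is straightforward; here is a minimal sketch, assuming the file is saved as users.txt in the format shown above:

# read each user's email address and their list of taxids
for line in open("users.txt"):
    line = line.rstrip("\n")
    # skip any blank lines
    if line == "":
        continue
    # the address and the taxid list are separated by a single space
    address, taxid_string = line.split(" ")
    # the taxids themselves are separated by commas
    taxids = taxid_string.split(",")
    print(address, taxids)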

Remember to sleep() for a few seconds between (a) fetching the pages and (b) sending the emails.

6. An even more useful version of the script would only email you when a significant number of new sequences have been added for the taxonomic groups of interest. To do this you’ll have to invent a way
of storing the numbers of sequences between runs of the script – probably a new text file. Your script will now have to get the sequence counts for each group, compare them to the counts from the
last email sent, and only send a new email if there are more than 100 new sequences. Think carefully about the logic behind whether or not to send an email. You only want to send an email if there are
more than 100 new sequences since the last email (which may or may not be the same as the last time the script was run). You’ll have to remember to update the sequence count file whenever an email is sent.
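
One simple way to store the counts between runs is a small text file with one line per taxid. Here is a minimal sketch of reading and writing such a file (the file name old_counts.txt and the layout are just suggestions, and remember that the counts scraped from the page contain commas, which you will need to remove before comparing numbers):

# read the stored counts into a dict of taxid -> count
old_counts = {}
for line in open("old_counts.txt"):
    taxid, count = line.rstrip("\n").split(" ")
    old_counts[taxid] = int(count)

# the counts scraped from the web page are strings with commas in them,
# so turn them into plain integers before comparing
# e.g. int('5,181,232'.replace(',', '')) gives 5181232

# whenever an email is sent, write the updated counts back to the file
output = open("old_counts.txt", "w")
for taxid, count in old_counts.items():
    output.write(str(taxid) + " " + str(count) + "\n")
output.close()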

7. Modify your script so that each user can set a per-taxonomic-group threshold for the minimum number of new sequences required to trigger an email. For example, a given user might want to get an email whenever 100 new sequences have been added for elephants, or when 10,000 new sequences have been added for arthropods.

You will need to alter the format of the user-info file to store this information. Also, let each user set a maximum number of days between emails – i.e. for a given taxonomic group, we want to send at least one email per month, even if no new sequences have been added. You’ll need to store the date at which the last email was sent! (warning: dealing with times and dates in programs is far, far trickier than you think! Take a look at this article if you don’t believe me. I think that the easiest way to deal with this is to use time.time() from the time module, which returns the number of seconds since the start of the current UNIX epoch, and convert days to seconds inside your program).
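
For example, a minimal sketch of the days-to-seconds check might look like this (last_sent and max_days are hypothetical names – in your script they would come from your stored data and the user-info file):

import time

last_sent = 1355000000    # the time.time() value stored when the last email was sent
max_days = 30             # the maximum number of days between emails for this group

# convert days to seconds, then compare with the time elapsed since the last email
max_seconds = max_days * 24 * 60 * 60
if time.time() - last_sent > max_seconds:
    print("time to send another email")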
