Very Applied Methods Workshop, April 25th, 2014
Department of Political Science, Stanford University
Author: Rebecca Weiss
But I will give you links to resources that will cover the above. And I'll make statements on these when appropriate.
- Scientific computing (numpy, scipy, scikits)
- Data analysis (pandas, statsmodels)
- Web frameworks (django, flask)
- Natural language processing (nltk, gensim)
- Image processing (PIL, scikit-image)
- Machine learning (scikit-learn)
- HTML/XML parsing (lxml, BeautifulSoup)

The Python standard library is also very comprehensive for most general computing needs.
from IPython.display import HTML
HTML("<iframe src=https://docs.python.org/2.7/library/ width=100% height=400></iframe>")
You'll see these Python features mentioned in other tutorials: basic types (strings! ints!), collections (sets! dicts! lists!), and logic (for loops! list comprehensions!). However, for most of you, this is more than you need to know.

We're going to try and cover a lot of ground by teaching through application: a simple demo on how to use Python to extract structured data from web pages.
from IPython.display import Image
Image('http://robotix.in/blog/wp-content/uploads/2011/10/python-vs-java-726367-copy.jpg')
First, you need to install Python. There are lots of tutorials on this:
Follow one of these guides.
My opinion: writing Python on OSX and Windows is not ideal (OSX is a little easier if you have brew or MacPorts installed, but it's still not great). If you want to get serious about your development, consider running a Virtual Machine (VMWare or VirtualBox).
When you are "installing Python," you are giving your computer access to the Python interpreter.
This is what allows you to write source code (human-readable language expressed in certain syntax) and convert it into executable software through a process called (shockingly) interpreting.
R is also an interpreted language:
"The core of R is an interpreted computer language which allows branching and looping as well as modular programming using functions."
The previous guides will install the standard Python environment (the standard library, docs, and a Python interpreter).
There are other interpreters (Cython, Jython, IronPython)...we can't really go into those now (see here for more discussion).
If you intend to use Python for analysis, just get Enthought Python.
Enthought is a Python distribution. For our purposes, that means that it installs Python with the most common 3rd party analysis modules used in scientific computing.
If you install the Enthought Python Distribution and choose it as your default Python environment, all your calls to python will go to the installation of Python that comes with EPD.
rweiss$ less ~/.profile
# Added by Canopy installer on 2013-05-09
source /Users/rweiss/Library/Enthought/Canopy_64bit/User/bin/activate
FYI: For most Unix-based shells, a .profile file (or something similar) is automatically executed (sourced) every time you open a shell. In other systems, this can also be a ~/.bashrc, a ~/.bash_profile, a ~/.zshrc, and others; it depends on what kind of shell you're running. I have some slides on the shell and useful shell utilities, and Stanford offers a good practical online short course on the shell.
If you want to check what python you're running, type which python. This will tell you where your computer is sending calls to python. (You can do this for any executable added to your path, such as java if it is installed.)
rweiss$ which python
/Users/rweiss/Library/Enthought/Canopy_64bit/User/bin/python
This is an example in OSX. You can see that my computer is calling python from the Canopy directory.
What if I don't want to use EPD and I want to go back to regular Python?
rweiss$ deactivate
rweiss$ which python
/usr/bin/python
rweiss$ source /Users/rweiss/Library/Enthought/Canopy_64bit/User/bin/activate # or just source ~/.profile
(Canopy 64bit) rweiss$ which python
/Users/rweiss/Library/Enthought/Canopy_64bit/User/bin/python
deactivate is actually a virtualenv command. We'll get back to that in a few slides.
Installing the EPD means you won't have to manually handle all the dependencies for many popular 3rd party modules, like scipy and lxml. You don't have to use it, but if you aren't comfortable installing software from the command line, handling paths, or compiling dependencies from source code, just use EPD.
Installing Enthought means you get Canopy for free.
But installing Enthought doesn't mean you must use Canopy.
Canopy is just an IDE. There are lots of IDEs. You can still use the Enthought distribution and use whatever text editor you want (I use Sublime and VIM).
If you are not comfortable with the command line and you want a single piece of software where you write all your code and the interpreter in the same environment (think RStudio), you should consider using Canopy for now.
Using the interpreter prompt
After you have followed a tutorial that teaches you how to install Python, typically the next step is to start Python from the command line (OSX Terminal):
rweiss$ which python
/usr/bin/python
rweiss$ python
Python 2.7.2 (default, Oct 11 2012, 20:14:37)
[GCC 4.2.1 Compatible Apple Clang 4.0 (tags/Apple/clang-418.0.60)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>>
The prompt with the >>> is interactive. It's like the console in R. You can type in your commands and the code is immediately interpreted and printed.
Don't use python. Use IPython. It comes installed with EPD.
rweiss$ ipython
Python 2.7.3 | 64-bit | (default, Jun 14 2013, 18:17:36)
Type "copyright", "credits" or "license" for more information.
IPython 2.0.0 -- An enhanced Interactive Python.
? -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help -> Python's own help system.
object? -> Details about 'object', use 'object??' for extra details.
In [1]:
Just like with other languages, you can write your source code as a standalone script (ending with .py) and pass it to the interpreter as a command-line argument:
rweiss$ ipython
Python 2.7.6 | 64-bit | (default, Jan 29 2014, 17:09:48)
Type "copyright", "credits" or "license" for more information.
IPython 2.0.0 -- An enhanced Interactive Python.
? -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help -> Python's own help system.
object? -> Details about 'object', use 'object??' for extra details.
In [1]: print 'hello world!'
hello world!
%%bash
echo print \"'hello world!'\" > test.py
python test.py
IPython is a third-party library on top of vanilla Python. Most third-party libraries are also called modules or packages.
If you have installed EPD, you already have IPython plus a bunch of popular modules. If not, you have to install third-party libraries by hand on the command line.
The most common solution is to use a package manager or install from source.
If you are using vanilla Python, use pip or easy_install. Choosing between these two can be controversial. I prefer pip.
Installation instructions:
Both install the library from the central, universally accessible repository PyPI (Python Package Index). This is roughly how this process works:

1. A developer packages the library (e.g. as an .egg) and uploads it to PyPI.
2. You download and install it with pip or easy_install, e.g.: rweiss$ pip install requests
3. The package is installed somewhere python can call it. Now you can import requests (or whatever you downloaded).

If you are using Enthought, it comes with its own package manager, Enstaller. You can install packages using either the Canopy GUI or the command line (enpkg).

Enthought maintains their own repository, but they can also draw from PyPI.
If you are feeling adventurous, you can download the source from a third-party website directly and install it yourself (i.e. python setup.py install).
Unless you understand your OS and feel comfortable with the command line, you will probably run into path and dependency problems. You will need a better understanding of file permissions and the command line.
Last bit of advice for working with Python: consider learning how to use virtualenv.
Installation instructions are here.
rweiss$ pwd
/Users/rweiss/Documents/VAM-Python
rweiss$ virtualenv env
New python executable in env/bin/python
Installing setuptools, pip...done.
rweiss$ ls env/
.Python bin/ include/ lib/
rweiss$ ls env/bin/
activate activate.fish easy_install pip pip2.7 python2
activate.csh activate_this.py easy_install-2.7 pip2 python python2.7
rweiss$ ls env/lib/python2.7/site-packages/
_markerlib easy_install.pyc pip-1.5.4.dist-info pkg_resources.pyc setuptools-2.2.dist-info
easy_install.py pip pkg_resources.py setuptools
rweiss$ source env/bin/activate
rweiss$ which python
/Users/rweiss/Documents/VAM-Python/env/bin/python
(env)rweiss$ env/bin/pip install requests
Downloading/unpacking requests
Downloading requests-2.2.1-py2.py3-none-any.whl (625kB): 625kB downloaded
Installing collected packages: requests
Successfully installed requests
Cleaning up...
(env)rweiss$ ls env/lib/python2.7/site-packages/
_markerlib pip pkg_resources.pyc setuptools
easy_install.py pip-1.5.4.dist-info requests setuptools-2.2.dist-info
easy_install.pyc pkg_resources.py requests-2.2.1.dist-info
deactivate?
(env)rweiss$ deactivate
rweiss$ which python
/usr/bin/python
EPD is not just a distribution of Python. It also creates an isolated Python environment.
In Python, whitespace matters.
for i in xrange(5):
    print i

The same loop with the body unindented raises an IndentationError:

for i in xrange(5):
print i
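You can watch the interpreter enforce this rule without crashing a script: compile() parses source without running it, and an unindented loop body fails with an IndentationError. (A small sketch; written with range and print() so it runs under Python 2 or 3.)

```python
# A properly indented loop body compiles fine.
good = "for i in range(5):\n    total = i\n"
compile(good, "<example>", "exec")

# The same loop with the body flush against the margin does not.
bad = "for i in range(5):\ntotal = i\n"
try:
    compile(bad, "<example>", "exec")
    result = "compiled"
except IndentationError:
    result = "IndentationError"

print(result)  # IndentationError
```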
In Python, some names are reserved and you MUST NOT use them as variable names:
>>> import keyword
>>> keyword.iskeyword('str')
True
>>> keyword.kwlist
import __builtin__
>>> dir(__builtin__)
You can use these names as variable names, Python won't stop you...but you'll be sorry!
print unicode('will this work')
unicode = 'will this work?'
print unicode
unicode('will this work?')
Python errors are pretty clear. This raises a TypeError. I tried to use a str object as a function.

That's because I overwrote the namespace of the built-in unicode() function with a string.
del unicode
print unicode('did that fix it?')
Here's a list of built-in errors and exceptions.
R is 1-indexed. Most programming languages, including Python, employ zero-based indexing. That means the first element starts at 0.
test_string = 'zero'
print 'Zeroth element: ' + test_string[0]
print 'First element: ' + test_string[1]
Now we'll review a couple of basic data types and functions you can expect to use in Python:

- strings and string member methods (and how to learn more)
- set, list, and dict objects
- list comprehensions
alphabet = 'abcdefghijklmnopqrstuvwxyz'
print alphabet # old-fashioned way of using print()
print(alphabet) # the new convention
A string is denoted by delimiters: quote characters. Python conventionally uses single quotes (''), but you can also use double quotes ("").
Variable assignment occurs through the use of the =
operator.
numbers = '0123456789
EOL is actually a special character in most environments (you have probably seen it as \n). It means "end of line."

Here, Python encountered the EOL character without first encountering a matching closing delimiter. Python expected a delimiter before the end of the line and didn't find one, so this raised a SyntaxError.

FYI, the string literal is the value of a string. Here it is 0123456789.
String types in Python are similar to an array of characters. You can refer to a character by its array index using slice notation: [start:end:step]
print alphabet[0]
print alphabet[0:15]
print alphabet[0:15:2]
print alphabet[:]
Python's slice notation also accepts negative integers:
print alphabet[-1:]
print alphabet[:-1]
print alphabet[:-1:2]
What if you pass a float?
print alphabet[0.2:]
There are lots of built-in functions in Python that are helpful.
We'll review a few very common ones.
How do you determine object type?
Call the type() function on the object.
print 'What type of object is "alphabet"?'
print type(alphabet)
How do you find out an object's attributes?

Call dir() on the object.

print 'What are the attributes (AKA member methods and data attributes) for this type of object?'
print dir(alphabet)
An overview of some string methods.
print alphabet.split
print alphabet.split()
print alphabet.lower()
You can also call some functions (not all!) using function notation:
print len(alphabet)
Don't worry about functions that start or end with "__" for now. It's complicated.
You can change the type of object through type conversion.
print list(alphabet)
print len(alphabet)
alphabet and list(alphabet) aren't precisely the same object. Can you explain?
The simplest way to understand a list comprehension is as a one-line for loop.
If you want to perform an element-wise operation on a list and get a list back (very common in Python), use a list comprehension.
print [i for i in list(alphabet)[:5]] # AKA List comprehension (note that strings are iterables!)
print [i.upper() for i in alphabet[::3]]
The string library

You don't have to reinvent the wheel. The string module is useful if you work with strings a lot.
import string
print string.ascii_lowercase
print string.digits
This is a very Pythonic way of joining string objects that are contained in a list.
The general idea: all strings have a join()
method, and you are joining a list of strings by another string.
numbers = list(string.digits)[0:5:2]
print '\t'.join([i for i in numbers])
print '\n'.join([i for i in numbers])
print 'SPACE'.join([i for i in numbers])
First, let's convert each of those string elements to an int.
We'll do this using in-place modification because lists are mutable types in Python.
We'll also introduce the enumerate()
built-in function.
print [type(x) for x in numbers]
for i, x in enumerate(numbers):
print 'The value at index {index} is {value}'.format(index=i, value=x)
numbers[i] = int(numbers[i])
print [type(x) for x in numbers]
# Addition
for i in numbers:
print 'Add 1 = {val}'.format(val=int(i) + 1)
# Multiplication
for i in numbers:
print 'Times 2 = {val}'.format(val=int(i) * 2)
# Exponentiation
for i in numbers:
print 'Squared = {val}'.format(val=int(i) ** 2)
You can use enumerate()
on any iterable object.
Here are some common code patterns to determine if an object is iterable.
# http://stackoverflow.com/a/1952655
# Duck typing (AKA 'Beg forgiveness')
try:
iterator = iter(alphabet)
except TypeError:
print 'Duck typing: Not iterable.'
else:
print 'Duck typing: Iterable.'
# Type checking (AKA 'Ask permission')
import collections #Also in the Python standard library
if isinstance(alphabet, collections.Iterable):
print 'Type-checking: Iterable.'
else:
print 'Type-checking: Not iterable.'
Sets are a data type that consists of a collection of unique items. Sets are highly optimized for lookups.
Unlike lists, they are not sorted in order of insertion.
If you've taken CS 106B, a set is a hashset.
import random # Comes with the standard library
foo = set() # note: there is no literal for an empty set ({} creates a dict)
bar = list() # can also say []
for i in xrange(4):
val = random.randint(0, 1000)
foo.add(val)
bar.append(val)
print 'Set values are: \t' + str(foo)
print 'List values are: \t' + str(bar)
# Note that xrange takes (start, stop, step); with one argument it runs from 0 up to, but excluding, stop
foo.clear() # empty the set
bar[:] = list() # empty the list
for i in xrange(10):
val = random.randint(0, 1000)
foo.add(val)
bar.append(val)
print 'Set values are: \t' + str(foo)
print 'List values are: \t' + str(bar)
print '10 iterations: {setlength} elements in the set'.format(setlength=len(foo))
print '10 iterations: {listlength} elements in the list'.format(listlength=len(bar))
foo.clear()
bar[:] = list()
for i in xrange(100):
val = random.randint(0, 1000)
foo.add(val)
bar.append(val)
print '100 iterations: {setlength} elements in the set'.format(setlength=len(foo))
print '100 iterations: {listlength} elements in the list'.format(listlength=len(bar))
foo.clear()
bar[:] = []
for i in xrange(1000):
val = random.randint(0, 1000)
foo.add(val)
bar.append(val)
print '1000 iterations: {setlength} elements in the set'.format(setlength=len(foo))
print '1000 iterations: {listlength} elements in the list'.format(listlength=len(bar))
foo.clear()
bar[:] = []
for i in xrange(10000):
val = random.randint(0, 1000)
foo.add(val)
bar.append(val)
print '10000 iterations: {setlength} elements in the set'.format(setlength=len(foo))
print '10000 iterations: {listlength} elements in the list'.format(listlength=len(bar))
print "Sets have the following methods: " + str([thing for thing in dir(foo) if not thing.startswith("__")])
print "Lists have the following methods: " + str([thing for thing in dir(bar) if not thing.startswith("__")])
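What that optimization buys you is fast membership testing: the in operator on a set does a hash lookup, while on a list it scans element by element. A minimal sketch of the semantics (timings omitted):

```python
foo = set([2, 4, 6, 8])
bar = [2, 4, 6, 8]

# Both collections answer `in`, but the set answers via a hash
# lookup while the list checks each element in turn.
in_set = 4 in foo
in_list = 4 in bar
missing = 5 in foo

print(in_set)   # True
print(in_list)  # True
print(missing)  # False
```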
Dictionaries
A dictionary is a key-value mapping. If you have taken CS 106B, dicts are like a hashmap or hashtable object.
baz = {}
baz['a'] = 1
baz['b'] = 2
baz
a and b are keys. 1 is the value for a, and 2 is the value for b.

Note that I had to declare the dictionary baz first.
What kind of key-value mappings can a dictionary have?
baz['c'] = alphabet
baz['d'] = numbers
baz['e'] = list(numbers)
baz['f'] = {'a': 1, 'b': 2}
baz[alphabet] = '1'
baz
What kind of key-value mappings don't work?
baz[{'a': 1, 'b':2}] = 0
baz
baz[list(numbers)] = numbers
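The rule behind these failures is hashability: dictionary keys must be hashable, which immutable types (strings, numbers, tuples) are and mutable containers (lists, dicts, sets) are not. A sketch with a hypothetical is_hashable() helper:

```python
def is_hashable(obj):
    """Return True if obj could serve as a dictionary key."""
    try:
        hash(obj)  # dict keys are stored by their hash value
        return True
    except TypeError:
        return False

print(is_hashable('a'))       # True: strings are immutable
print(is_hashable((1, 2)))    # True: so are tuples of hashables
print(is_hashable([1, 2]))    # False: lists can change under you
print(is_hashable({'a': 1}))  # False: dicts can too
```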
You can always learn more about Python and code patterns, but you have to start small.
We're going to use two third-party libraries (requests and BeautifulSoup) and the techniques and code patterns you just learned to programmatically scrape a list of pages.
from IPython.display import HTML
import requests # 3rd party library
presidency_platforms_url = 'http://www.presidency.ucsb.edu/platforms.php'
HTML("<iframe src=" + presidency_platforms_url + " width=100% height=400px></iframe>")
Request/Response model
Request/response is a messaging protocol.
It is the underlying architectural model for the Hypertext Transfer Protocol, which is the agreed-upon standard for the way the Web works.
The very general, grossly oversimplified idea:
Servers sit around waiting to respond to requests. If a server doesn't respond, something is wrong.
How do I know that my request was issued successfully?
import requests
r = requests.get(presidency_platforms_url)
print 'Server response status code = ' + r.status_code
print type(r.status_code)
print 'Server response status code = ' + str(r.status_code)
print 'Server response status code = %i' % r.status_code
print 'Server response status code = {statuscode}'.format(statuscode=r.status_code)
What's a status code?
The Web only works because everybody agreed to honor HTTP.
All HTTP clients (e.g. a web browser) must recognize status codes.
Generally:

- 2xx: success
- 3xx: redirection
- 4xx: client error
- 5xx: server error

If you write a script to automate scraping, check for status code = 200. Otherwise, you might get junk data!
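That guard is easy to write as a helper. A sketch with a hypothetical check_response() classifier; the ranges follow the standard HTTP families (2xx success, 3xx redirect, 4xx client error, 5xx server error):

```python
def check_response(status_code):
    """Classify an HTTP status code into its standard family."""
    if 200 <= status_code < 300:
        return 'success'
    elif 300 <= status_code < 400:
        return 'redirect'
    elif 400 <= status_code < 500:
        return 'client error'
    elif 500 <= status_code < 600:
        return 'server error'
    return 'unknown'

# Only scrape the body when the request actually succeeded.
print(check_response(200))  # success
print(check_response(404))  # client error
print(check_response(503))  # server error
```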
What else comes in a server response?
You can check the response headers to find out more information about the server you've hit.
import pprint # A standard library module that helps to pretty print output.
print r.encoding
headers = r.headers
pprint.pprint(headers.items()) # Prints a list of tuples
FYI, the server sees a lot of information from you as well!
r = requests.get('http://httpbin.org/user-agent') # Website that allows you to test for HTTP behaviors
r.text
Note that a string prefaced with u' means it's a Unicode string. Unicode relates to character encoding. Encoding is tricky.
I also created a small history lesson on character encoding.
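The one-paragraph version: a unicode string is a sequence of characters, and an encoding like UTF-8 is a recipe for turning those characters into bytes. A round-trip sketch (the \xe9 escape spells é so the source stays plain ASCII):

```python
text = u'caf\xe9'                 # the four-character string 'café'
encoded = text.encode('utf8')     # characters -> bytes
decoded = encoded.decode('utf8')  # bytes -> characters

print(len(text))        # 4 characters
print(len(encoded))     # 5 bytes; the é needs two bytes in UTF-8
print(decoded == text)  # True: the round trip is lossless
```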
Your Response object has a lot of helpful attributes that come in handy with web scraping:
print 'Requests.get() returns a {} object.'.format(type(r))
for attr in dir(r):
if attr.startswith('__') or attr.startswith('_'):
pass
else:
print attr
print r.text[:1000] # Truncated for example
print len(r.text) # in characters
You don't have to work solely with the raw response r.text. You can also get the response back as JSON with r.json().

JSON can be preferable because then you can work with Python dictionaries.
r = requests.get('http://httpbin.org/ip')
print r.json()
import json
print r.json()['origin']
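You can see the same mechanics without a network call: the standard library's json module converts between JSON text and Python dictionaries. A sketch with a made-up payload shaped like the httpbin response above:

```python
import json

# Made-up payload shaped like httpbin.org/ip's response.
payload = '{"origin": "93.184.216.34"}'

data = json.loads(payload)  # JSON text -> Python dict
print(data['origin'])       # 93.184.216.34

# json.dumps() goes the other way: dict -> JSON text.
round_trip = json.loads(json.dumps(data))
print(round_trip == data)   # True
```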
This is probably more appropriately called screen scraping. I'll get more into that at the end.
BeautifulSoup and lxml

Two widely used libraries for parsing HTML are BeautifulSoup and lxml (which supports XPath and CSS Selectors).

Both of these can be easily installed if you use the Enthought Python Distribution. We're going to do an example with BeautifulSoup.
from bs4 import BeautifulSoup
r = requests.get(presidency_platforms_url)
soup = BeautifulSoup(r.text)
print type(soup)
print soup.prettify()[0:1000]
Scraping content
The general idea: if an HTML file renders in your browser and you can see it on your screen, it probably has some structure in it.
You don't want the raw HTML. You want the content that is rendered on your browser screen.
The goal is to use the HTML to retrieve the relevant content.
print soup.title
print soup.meta
print soup.a
print soup.p
Beautiful Soup provides some functions that are helpful for working with HTML. They are essentially wrappers to retrieve very common HTML elements.
Image('http://www.openbookproject.net/tutorials/getdown/css/images/lesson4/HTMLDOMTree.png')
Most modern browsers have a parser that reads in the HTML document, parses it into a DOM structure, and renders the DOM structure.
Much like HTTP, the DOM is an agreed-upon standard.
The DOM is much more than what I've described, but we don't have the time to go into it.
Image('http://www.cs.toronto.edu/~shiva/cscb07/img/dom/treeStructure.png')
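You can watch a parser build that traversal with the standard library alone. A sketch using html.parser (the module is called HTMLParser on Python 2) that records every start tag as the parser walks the document:

```python
from html.parser import HTMLParser  # Python 2: from HTMLParser import HTMLParser

class TagLogger(HTMLParser):
    """Record each element the parser opens, in document order."""
    def __init__(self):
        HTMLParser.__init__(self)
        self.tags_seen = []

    def handle_starttag(self, tag, attrs):
        self.tags_seen.append(tag)

example_html = '<html><body><h1>Title</h1><p>A <em>word</em></p></body></html>'
parser = TagLogger()
parser.feed(example_html)

# The order of start tags matches a depth-first walk of the DOM tree.
print(parser.tags_seen)  # ['html', 'body', 'h1', 'p', 'em']
```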
example_html = '<html>\n<body>\n<h1>Title</h1>\n<p>A <em>word</em> </p>\n</body>\n</html>'
example_soup = BeautifulSoup(example_html)
print example_soup.p
print example_soup.p.get_text()
print example_soup.em.get_text()
There are many ways we could do this. First thing to do is examine the page source.
print r.text[0:1000]
Now let's extract every link URL on the page.
all_links = []
for link in soup.findAll('a'):
all_links.append(link.get('href'))
print 'All link hrefs in a list from a for loop: %s' % len(all_links)
all_links_comprehension = [link.get('href') for link in soup.findAll('a')]
print 'All link hrefs in a list from a list comprehension: %s' % len(all_links_comprehension)
BeautifulSoup has a .get_text() method that extracts the text attribute from every tag.
print soup.get_text()[:1000]
Remember, that's every tag. If you're lucky, you'll be able select only the relevant tags that you care about and extract their text with this method.
all_links[40:60]
for link in all_links[40:60]:
print link.split('/')
for link in all_links[40:60]:
print 'href #' + str(all_links.index(link)) + ' = ' + link.split('/')[-1]
valid_links = []
for link in all_links:
final_url_element = link.split('/')[-1]
if final_url_element.startswith('index.php?'):
valid_links.append(link)
print 'There are {} valid links.'.format(len(valid_links))
valid_links[:10]
from datetime import datetime # Another standard library module.
for link in valid_links[:10]: # Limited for demonstration. Also check out import time; time.sleep()
r = requests.get(link)
print '{time}\t{link}\t{status}'.format(time=datetime.isoformat(datetime.now()), link=link, status=r.status_code)
However, that previous example will print to stdout (that's what print() does).
If you execute this as a script at the command line, it would be better to have this write to a file, so let's open up a text file and write the output to that:
import os
request_log_file = open('presidency_platforms_scraping.log', 'w')
print type(request_log_file) # What kind of object does open() create?
request_log_file.write('Timestamp\tURL\tStatus Code\n')
for link in valid_links[:10]:
r = requests.get(link)
request_event_string = '{time}\t{link}\t{status}\n'.format(time=datetime.isoformat(datetime.now()), link=link, status=r.status_code)
request_log_file.write(request_event_string) # Note I had to add the line ending "\n" above
request_log_file.close() # Make sure you close the file
print os.listdir(os.getcwd())
If you don't want to bother handling files, you could keep printing to stdout and redirect to a text file, assuming you are comfortable with the shell!
%%bash
# This is IPython cell magic
head presidency_platforms_scraping.log
We're lucky; these are fairly simple pages. All of the relevant text we want appears to be in the text attributes of the <p> elements on each platform's page.

It also doesn't look like it's dynamically generated by Javascript or forms, so we can just rip it right out of the page.
Let's extract all the text from each page and save each one to a simple .txt file.
r = requests.get('http://www.presidency.ucsb.edu/ws/index.php?pid=101962')
soup = BeautifulSoup(r.text)
print soup.a
print soup.p
print soup.p.get_text()
all_p_tags = soup.findAll('p')
print type(all_p_tags[0])
What attributes do bs4.element.Tag objects have?

[attr for attr in dir(all_p_tags[0]) if not attr.startswith('__') and not attr.startswith('_')]
Now let's glue together all of these pieces.
We'll introduce two new functions: os.makedirs() and os.path.join().

Also one last code pattern: opening and closing files using with().
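Here is the with pattern in isolation before it shows up in the full script: the file closes automatically when the block ends (even if an exception is raised inside it), so there is no close() call to forget. The filename here is just a throwaway for the demo:

```python
import os

# Writing inside a with block; no explicit close() needed.
with open('with_demo.txt', 'w') as outfile:
    outfile.write('Timestamp\tURL\tStatus Code\n')

print(outfile.closed)  # True: already closed on exiting the block

# Reading it back the same way.
with open('with_demo.txt') as infile:
    contents = infile.read()
print(contents.strip())

os.remove('with_demo.txt')  # clean up the throwaway file
```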
if not os.path.exists('output'):
os.makedirs('output')
request_log_file = open('output/presidency_platforms_scraping.log', 'w')
request_log_file.write('Timestamp\tURL\tStatus Code\n')
print 'Starting scraping.'
for link in valid_links:
r = requests.get(link)
request_event_string = '{time}\t{link}\t{status}\n'.format(time=datetime.isoformat(datetime.now()), link=link, status=r.status_code)
request_log_file.write(request_event_string) # Note I had to add the line ending "\n" above
soup = BeautifulSoup(r.text)
all_p_tags = soup.findAll('p')
filename = link.split('/')[-1].split('=')[-1] + '.txt'
filename_path = os.path.join('output', filename)
with open(filename_path, 'w') as scraped_text_outfile:
text_as_list_of_strings = [p.get_text() for p in all_p_tags]
scraped_text = ''.join(text_as_list_of_strings)
scraped_text_outfile.write(scraped_text.encode('utf8'))
# Encoding is hard! Why did I do this? Hint: check "import sys; sys.stdout.encoding"
request_log_file.close()
print 'Finished scraping.'
Let's examine the results
print os.listdir('output')
%%bash
echo "Number of files in output/"
ls output/ | wc -l
cd output
echo "Example text file:"
head -c 1000 101962.txt
We barely scratched the surface. Here is a list of useful Python tutorials to help develop a deeper understanding:
from IPython.display import Image
# import antigravity
Image('http://imgs.xkcd.com/comics/python.png')
Here are all the module versions that I used in this tutorial:
%%bash
pip freeze
%%bash
pip freeze > requirements.txt
import this
from __future__ import braces
# http://legacy.python.org/dev/peps/pep-0401/
from __future__ import barry_as_FLUFL