Introduction to Python

Very Applied Methods Workshop, April 25th, 2014

Department of Political Science, Stanford University

Author: Rebecca Weiss

What are we covering in this presentation?

  • What is Python? Source code, interpreter, etc.
  • How do I get going with Python? What are available tools?
  • Why should I learn Python? Where is it useful?

What are we not covering in this presentation?

  • Syntax tedium
  • Extensive coverage of data types and collections
  • Software engineering (e.g. code patterns, object-oriented programming)

This is a 90 minute presentation. We don't have the time.

But I will give you links to resources that will cover the above. And I'll make statements on these when appropriate.

What is Python?

  • Python is a general purpose programming language with enormous 3rd-party support for a wide array of uses.
    • Numerical and scientific computing (e.g. numpy, scipy, scikits)
    • Data analysis (e.g. pandas, statsmodels)
    • Web application programming (e.g. django, flask)
    • Natural language processing and topic modeling (e.g. nltk, gensim)
    • Image processing (e.g. PIL, scikit-image)
    • Machine learning (scikit-learn)
    • Scraping (e.g. lxml, BeautifulSoup)

The Python standard library is also very comprehensive for most general computing needs.

In [1]:
from IPython.display import HTML
HTML("<iframe src=https://docs.python.org/2.7/library/ width=100% height=400></iframe>")
Out[1]:

You'll see many other Python features mentioned in other tutorials.

However, for most of you, this is more than you need to know.

  • Because Python is a general-purpose language, you can easily get lost in tutorials!
    • Many assume prior knowledge of computing
    • Most focus only on the basics
      • data types (strings! ints!) and collections (sets! dicts! lists!), logic (for loops! list comprehensions!)
    • What about useful tools and conventional practices?
    • Or how to solve analysis problems?

We're going to try to cover a lot of ground by teaching through application: a simple demo of how to use Python to extract structured data from web pages

  • Review basic Python and programming concepts
  • Introduce web technology concepts (server-client architecture, response-request model, the DOM)
  • Demonstrate useful tools and modules to use Python for analysis purposes.

CS 106A?

In [2]:
from IPython.display import Image
Image('http://robotix.in/blog/wp-content/uploads/2011/10/python-vs-java-726367-copy.jpg')
Out[2]:

Background information and vocabulary

Installing Python

First, you need to install Python. There are lots of tutorials on this:

Follow one of these guides.

My opinion: writing Python on OSX and Windows is not ideal (OSX is a little easier if you have brew or MacPorts installed, but it's still not great). If you want to get serious about your development, consider running a Virtual Machine (VMWare or VirtualBox).

What exactly did "installing Python" do?

When you are "installing Python," you are giving your computer access to the Python interpreter.

This is what allows you to write source code (human-readable language expressed in certain syntax) and convert it into executable software through a process called (shockingly) interpreting.

R is also an interpreted language: "The core of R is an interpreted computer language which allows branching and looping as well as modular programming using functions."

Enthought Python Distribution: a "batteries-included" Python

The previous guides will install the standard Python environment (the standard library, docs, and a Python interpreter).

There are other interpreters (e.g. PyPy, Jython, IronPython)...we can't really go into those now (see here for more discussion).

If you intend to use Python for analysis, just get Enthought Python.

Enthought is a Python distribution. For our purposes, that means that it installs Python with the most common 3rd party analysis modules used in scientific computing.

If you install the Enthought Python Distribution and choose it as your default Python environment, all your calls to python will go to the installation of Python that comes with EPD.

rweiss$ less ~/.profile
# Added by Canopy installer on 2013-05-09
source /Users/rweiss/Library/Enthought/Canopy_64bit/User/bin/activate

FYI: For most Unix-based shells, a .profile file (or something similar) is automatically executed (sourced) every time you open a shell. On other systems, this can also be a ~/.bashrc, a ~/.bash_profile, a ~/.zshrc, and others; it depends on what kind of shell you're running. I have some slides on the shell and useful shell utilities, and Stanford offers a good practical online short course on the shell.

If you want to check what python you're running, type which python. This will tell you where your computer is sending calls to python. (You can do this for any executable added to your path, such as java if it is installed).

rweiss$ which python
/Users/rweiss/Library/Enthought/Canopy_64bit/User/bin/python

This is an example in OSX. You can see that my computer is calling python from the Canopy directory.

What if I don't want to use EPD and I want to go back to regular Python?

rweiss$ deactivate
rweiss$ which python
/usr/bin/python
rweiss$ source /Users/rweiss/Library/Enthought/Canopy_64bit/User/bin/activate # or just source ~/.profile
(Canopy 64bit) rweiss$ which python
/Users/rweiss/Library/Enthought/Canopy_64bit/User/bin/python

deactivate is actually a virtualenv command. We'll get back to that in a few slides.

Installing the EPD means you won't have to manually handle all the dependencies for many popular 3rd party modules, like scipy and lxml. You don't have to use it, but if you aren't comfortable installing software from the command line, handling paths, or compiling dependencies from source code, just use EPD.

Integrated Development Environments

Installing Enthought means you get Canopy for free.

But installing Enthought doesn't mean you must use Canopy.

Canopy is just an IDE. There are lots of IDEs. You can still use the Enthought distribution and use whatever text editor you want (I use Sublime and VIM).

If you are not comfortable with the command line and you want a single piece of software where you write all your code and the interpreter in the same environment (think RStudio), you should consider using Canopy for now.

Using the interpreter prompt

After you have followed a tutorial that teaches you how to install Python, typically the next step is to start Python from the command line (OSX Terminal):

rweiss$ which python
/usr/bin/python
rweiss$ python
Python 2.7.2 (default, Oct 11 2012, 20:14:37) 
[GCC 4.2.1 Compatible Apple Clang 4.0 (tags/Apple/clang-418.0.60)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>>

The prompt with the >>> is interactive. It's like the console in R. You can type in your commands and the code is immediately interpreted and printed.

Don't use python. Use IPython. It comes installed with EPD.

rweiss$ ipython
Python 2.7.3 | 64-bit | (default, Jun 14 2013, 18:17:36) 
Type "copyright", "credits" or "license" for more information.

IPython 2.0.0 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.

In [1]:

Using the interpreter prompt versus running a script

Just like with other languages, you can write your source code as a standalone script (ending with .py) and pass it to the interpreter as a command-line argument:

rweiss$ ipython
Python 2.7.6 | 64-bit | (default, Jan 29 2014, 17:09:48) 
Type "copyright", "credits" or "license" for more information.

IPython 2.0.0 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.

In [1]: print 'hello world!'
hello world!
In [3]:
%%bash
echo print \"'hello world!'\" > test.py
python test.py
hello world!

Package management

Package managers

IPython is a third-party library on top of vanilla Python. Third-party libraries are also commonly called modules or packages.

If you have installed EPD, you already have IPython plus a bunch of popular modules. If not, you have to install third-party libraries by hand on the command line.

The most common solution is to use a package manager or install from source.

If you are using vanilla Python, use pip or easy_install. Choosing between these two can be controversial. I prefer pip.

Installation instructions:

  1. pip
  2. easy_install

Both install libraries from PyPI (the Python Package Index), the central, universally accessible repository. Roughly, the process works like this:

  1. A developer writes a library, creates a distributable package (an .egg) and uploads to PyPI.
  2. You ask for the library via pip or easy_install, e.g.:
    rweiss$ pip install requests
    
  3. The package manager handles installing the library; it downloads the bundled file and extracts the code to the right paths so that Python can find it.
  4. In your script or in the shell, import requests (or whatever you downloaded); see the sketch below.
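For example (a hypothetical session, assuming the pip install above succeeded), you can sanity-check the install right away:

import requests  # only works once pip/easy_install has placed the library on your path
print requests.__version__  # e.g. '2.2.1'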

If you are using Enthought, it comes with its own package manager Enstaller. You can install packages using either:

  1. the Canopy interface (though you will probably not need to, as most libraries for data are already installed).
  2. Enstaller from the command line (enpkg).

Enthought maintains their own repository but they can also draw from PyPI as well.

If you are feeling adventurous, you can download the source from a third-party website directly and install it yourself (e.g. python setup.py install)

Unless you understand your OS and feel comfortable with the command line, you will probably run into path and dependency problems. You will need a better understanding of file permissions and the command line.

Virtual environments

One last bit of advice for working with Python: consider learning how to use virtualenv.

Installation instructions are here.

rweiss$ pwd
/Users/rweiss/Documents/VAM-Python

rweiss$ virtualenv env
New python executable in env/bin/python
Installing setuptools, pip...done.     

rweiss$ ls env/
.Python  bin/     include/ lib/     

rweiss$ ls env/bin/
activate          activate.fish     easy_install      pip               pip2.7            python2           
activate.csh      activate_this.py  easy_install-2.7  pip2              python            python2.7         

rweiss$ ls env/lib/python2.7/site-packages/
_markerlib            easy_install.pyc        pip-1.5.4.dist-info        pkg_resources.pyc        setuptools-2.2.dist-info
easy_install.py            pip                pkg_resources.py        setuptools
rweiss$ source env/bin/activate

rweiss$ which python
/Users/rweiss/Documents/VAM-Python/env/bin/python

(env)rweiss$ env/bin/pip install requests
Downloading/unpacking requests
  Downloading requests-2.2.1-py2.py3-none-any.whl (625kB): 625kB downloaded
Installing collected packages: requests
Successfully installed requests
Cleaning up...

(env)rweiss$ ls env/lib/python2.7/site-packages/
_markerlib            pip                pkg_resources.pyc        setuptools
easy_install.py            pip-1.5.4.dist-info        requests            setuptools-2.2.dist-info
easy_install.pyc        pkg_resources.py        requests-2.2.1.dist-info

Remember deactivate?

(env)rweiss$ deactivate
rweiss$ which python
/usr/bin/python

EPD is not just a distribution of Python. It also creates an isolated Python environment.

Some syntax tedium

In Python, whitespace matters.

In [4]:
for i in xrange(5):
print i
  File "<ipython-input-4-83d360549157>", line 2
    print i
        ^
IndentationError: expected an indented block
In [5]:
for i in xrange(5):
    print i
0
1
2
3
4

In Python, some names are reserved and you MUST NOT use them as variable names:

In [6]:
>>> import keyword
>>> keyword.iskeyword('print')
True
>>> keyword.kwlist
Out[6]:
['and',
 'as',
 'assert',
 'break',
 'class',
 'continue',
 'def',
 'del',
 'elif',
 'else',
 'except',
 'exec',
 'finally',
 'for',
 'from',
 'global',
 'if',
 'import',
 'in',
 'is',
 'lambda',
 'not',
 'or',
 'pass',
 'print',
 'raise',
 'return',
 'try',
 'while',
 'with',
 'yield']
In [7]:
import __builtin__
>>> dir(__builtin__)
Out[7]:
['ArithmeticError',
 'AssertionError',
 'AttributeError',
 'BaseException',
 'BufferError',
 'BytesWarning',
 'DeprecationWarning',
 'EOFError',
 'Ellipsis',
 'EnvironmentError',
 'Exception',
 'False',
 'FloatingPointError',
 'FutureWarning',
 'GeneratorExit',
 'IOError',
 'ImportError',
 'ImportWarning',
 'IndentationError',
 'IndexError',
 'KeyError',
 'KeyboardInterrupt',
 'LookupError',
 'MemoryError',
 'NameError',
 'None',
 'NotImplemented',
 'NotImplementedError',
 'OSError',
 'OverflowError',
 'PendingDeprecationWarning',
 'ReferenceError',
 'RuntimeError',
 'RuntimeWarning',
 'StandardError',
 'StopIteration',
 'SyntaxError',
 'SyntaxWarning',
 'SystemError',
 'SystemExit',
 'TabError',
 'True',
 'TypeError',
 'UnboundLocalError',
 'UnicodeDecodeError',
 'UnicodeEncodeError',
 'UnicodeError',
 'UnicodeTranslateError',
 'UnicodeWarning',
 'UserWarning',
 'ValueError',
 'Warning',
 'ZeroDivisionError',
 '__IPYTHON__',
 '__IPYTHON__active',
 '__debug__',
 '__doc__',
 '__import__',
 '__name__',
 '__package__',
 'abs',
 'all',
 'any',
 'apply',
 'basestring',
 'bin',
 'bool',
 'buffer',
 'bytearray',
 'bytes',
 'callable',
 'chr',
 'classmethod',
 'cmp',
 'coerce',
 'compile',
 'complex',
 'copyright',
 'credits',
 'delattr',
 'dict',
 'dir',
 'divmod',
 'dreload',
 'enumerate',
 'eval',
 'execfile',
 'file',
 'filter',
 'float',
 'format',
 'frozenset',
 'get_ipython',
 'getattr',
 'globals',
 'hasattr',
 'hash',
 'help',
 'hex',
 'id',
 'input',
 'int',
 'intern',
 'isinstance',
 'issubclass',
 'iter',
 'len',
 'license',
 'list',
 'locals',
 'long',
 'map',
 'max',
 'memoryview',
 'min',
 'next',
 'object',
 'oct',
 'open',
 'ord',
 'pow',
 'print',
 'property',
 'range',
 'raw_input',
 'reduce',
 'reload',
 'repr',
 'reversed',
 'round',
 'set',
 'setattr',
 'slice',
 'sorted',
 'staticmethod',
 'str',
 'sum',
 'super',
 'tuple',
 'type',
 'unichr',
 'unicode',
 'vars',
 'xrange',
 'zip']

You can use the built-in names (though not the keywords!) as variable names; Python won't stop you...but you'll be sorry!

In [8]:
print unicode('will this work')
will this work

In [9]:
unicode = 'will this work?'
print unicode
unicode('will this work?')
will this work?

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-9-df26d9d9e02d> in <module>()
      1 unicode = 'will this work?'
      2 print unicode
----> 3 unicode('will this work?')

TypeError: 'str' object is not callable

Python errors are pretty clear. This raises a TypeError. I tried to use a str object as a function.

That's because I shadowed the built-in unicode() function by binding the name unicode to a string.

In [10]:
del unicode
print unicode('did that fix it?')
did that fix it?

Here's a list of built-in errors and exceptions.

Python is 0-indexed

R is 1-indexed. Most programming languages employ zero-based indexes.

That means the first element starts with 0.

This is not an arbitrary decision.

In [11]:
test_string = 'zero'
print 'Zeroth element: ' + test_string[0]
print 'First element: ' + test_string[1]
Zeroth element: z
First element: e

Data types and collections in Python

Now we'll review a couple of basic data types and functions you can expect to use in Python.

  1. Useful built-in functions
  2. strings and string member methods (and how to learn more)
  3. set, list, and dict objects
  4. list comprehensions
In [12]:
alphabet = 'abcdefghijklmnopqrstuvwxyz'
print alphabet # old-fashioned way of using print()
print(alphabet) # the new convention
abcdefghijklmnopqrstuvwxyz
abcdefghijklmnopqrstuvwxyz

A string is denoted by delimiters: quote characters. Python conventionally uses single quotes (''), but double quotes ("") work too.

Variable assignment occurs through the use of the = operator.
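A minimal sketch of the point about delimiters: the quote style is not part of the string's value.

single = 'quotes'
double = "quotes"
print single == double  # True: the delimiter is not part of the value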

In [13]:
numbers = '0123456789
  File "<ipython-input-13-72568ec75aba>", line 1
    numbers = '0123456789
                        ^
SyntaxError: EOL while scanning string literal

EOL is actually a special character in most environments (you have probably seen it as \n); it means "end of line." Here, Python reached the end of the line without finding the matching closing delimiter, so it raised a SyntaxError.

FYI, the string literal is the value of a string. Here it is 0123456789
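To sketch the fix: close the string on the same line, and use the escape sequence \n if you want a line break inside the value.

numbers = '0123456789'            # matching closing delimiter: no SyntaxError
two_lines = 'line one\nline two'  # \n embeds a real line break in the string
print two_lines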

String types in Python are similar to an array of characters. You can refer to a character by its array index using slice notation: [start:end:step]

In [14]:
print alphabet[0]
print alphabet[0:15]
print alphabet[0:15:2]
print alphabet[:]
a
abcdefghijklmno
acegikmo
abcdefghijklmnopqrstuvwxyz

Python's slice notation also accepts negative integers:

In [15]:
print alphabet[-1:]
print alphabet[:-1]
print alphabet[:-1:2]
z
abcdefghijklmnopqrstuvwxy
acegikmoqsuwy

What if you pass a float?

In [16]:
print alphabet[0.2:]
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-16-ae6a5b5f59da> in <module>()
----> 1 print alphabet[0.2:]

TypeError: slice indices must be integers or None or have an __index__ method

Useful commands in Python

There are lots of built-in functions in Python that are helpful.

We'll review a few very common ones.

How do you determine object type?

Call the type() function on the object.

In [17]:
print 'What type of object is "alphabet"?'
print type(alphabet)
What type of object is "alphabet"?
<type 'str'>

How do you find out an object's attributes?

  1. Read the documentation!
  2. If you don't know, call dir() on the object
In [18]:
print 'What are the attributes (AKA member methods and data attributes) for this type of object?'
print dir(alphabet)
What are the attributes (AKA member methods and data attributes) for this type of object?
['__add__', '__class__', '__contains__', '__delattr__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__getnewargs__', '__getslice__', '__gt__', '__hash__', '__init__', '__le__', '__len__', '__lt__', '__mod__', '__mul__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__rmod__', '__rmul__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '_formatter_field_name_split', '_formatter_parser', 'capitalize', 'center', 'count', 'decode', 'encode', 'endswith', 'expandtabs', 'find', 'format', 'index', 'isalnum', 'isalpha', 'isdigit', 'islower', 'isspace', 'istitle', 'isupper', 'join', 'ljust', 'lower', 'lstrip', 'partition', 'replace', 'rfind', 'rindex', 'rjust', 'rpartition', 'rsplit', 'rstrip', 'split', 'splitlines', 'startswith', 'strip', 'swapcase', 'title', 'translate', 'upper', 'zfill']

An overview of some string methods.

In [19]:
print alphabet.split
print alphabet.split()
print alphabet.lower()
<built-in method split of str object at 0x10055dfb0>
['abcdefghijklmnopqrstuvwxyz']
abcdefghijklmnopqrstuvwxyz

Some operations are built-in functions rather than methods, so you call them with function notation:

In [20]:
print len(alphabet)
26

Don't worry about functions that start or end with "__" for now. It's complicated.

You can change the type of object through type conversion.

In [21]:
print list(alphabet)
print len(alphabet)
['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
26

alphabet and list(alphabet) aren't the same object. Can you explain?

List comprehensions

The simplest way to understand a list comprehension is as a one-line for loop.

If you want to perform an element-wise operation on a list and get a list back (very common in Python), use a list comprehension.
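To make the equivalence concrete, here's a small sketch; the comprehension and the loop build the same list.

squares = [x ** 2 for x in xrange(5)]  # the comprehension...

squares_loop = []                      # ...does the same work as this loop
for x in xrange(5):
    squares_loop.append(x ** 2)

print squares == squares_loop  # True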

In [22]:
print [i for i in list(alphabet)[:5]] # AKA List comprehension (note that strings are iterables!)
print [i.upper() for i in alphabet[::3]]
['a', 'b', 'c', 'd', 'e']
['A', 'D', 'G', 'J', 'M', 'P', 'S', 'V', 'Y']

The string library

You don't have to actually reinvent the wheel. The string module is useful if you work with strings a lot.

In [23]:
import string
print string.ascii_lowercase
print string.digits
abcdefghijklmnopqrstuvwxyz
0123456789

Combining list comprehensions with string concatenation

This is a very Pythonic way of joining string objects that are contained in a list.

The general idea: all strings have a join() method, and you use a separator string to join together a list of strings.

In [24]:
numbers = list(string.digits)[0:5:2]

print '\t'.join([i for i in numbers])
print '\n'.join([i for i in numbers])
print 'SPACE'.join([i for i in numbers])
0	2	4
0
2
4
0SPACE2SPACE4

Basic mathematical operations

First, let's convert each of those string elements to an int.

We'll do this using in-place modification because lists are mutable types in Python.

We'll also introduce the enumerate() built-in function.

In [25]:
print [type(x) for x in numbers]
for i, x in enumerate(numbers):
    print 'The value at index {index} is {value}'.format(index=i, value=x)
    numbers[i] = int(numbers[i])
print [type(x) for x in numbers]
[<type 'str'>, <type 'str'>, <type 'str'>]
The value at index 0 is 0
The value at index 1 is 2
The value at index 2 is 4
[<type 'int'>, <type 'int'>, <type 'int'>]

In [26]:
# Addition
for i in numbers:
    print 'Add 1 = {val}'.format(val=int(i) + 1)
    
# Multiplication    
for i in numbers:
    print 'Times 2 = {val}'.format(val=int(i) * 2)
    
# Exponentiation
for i in numbers:
    print 'Squared = {val}'.format(val=int(i) ** 2)
Add 1 = 1
Add 1 = 3
Add 1 = 5
Times 2 = 0
Times 2 = 4
Times 2 = 8
Squared = 0
Squared = 4
Squared = 16

You can use enumerate() on any iterable object.

Here are some common code patterns to determine if an object is iterable.

In [27]:
# http://stackoverflow.com/a/1952655
# Duck typing (AKA 'Beg forgiveness')
try:
    iterator = iter(alphabet)
except TypeError:
    print 'Duck typing: Not iterable.'   
else:
    print 'Duck typing: Iterable.'

# Type checking (AKA 'Ask permission') 
import collections #Also in the Python standard library

if isinstance(alphabet, collections.Iterable):
    print 'Type-checking: Iterable.'
else:
    print 'Type-checking: Not iterable.'    
Duck typing: Iterable.
Type-checking: Iterable.

Sets versus lists

Sets are a data type consisting of a collection of unique items. Sets are highly optimized for membership lookups.

Unlike lists, they do not preserve insertion order.

If you've taken CS 106B, a set is a hashset.
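A rough sketch of the lookup difference, using the standard library's timeit module (the exact timings will vary by machine, but the set should win by a wide margin):

import timeit

setup = 'big_set = set(xrange(100000)); big_list = list(xrange(100000))'
print timeit.timeit('99999 in big_set', setup=setup, number=1000)   # hash lookup: fast
print timeit.timeit('99999 in big_list', setup=setup, number=1000)  # linear scan: slow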

In [28]:
import random # Comes with the standard library
foo = set() # note: {} creates an empty dict, not an empty set
bar = list() # can also say []

for i in xrange(4): 
    val = random.randint(0, 1000)
    foo.add(val)
    bar.append(val)

print 'Set values are: \t' + str(foo)
print 'List values are: \t' + str(bar)
# Note that xrange takes (start, stop[, step]); with one argument it counts from 0 up to (but not including) stop
Set values are: 	set([32, 968, 211, 841])
List values are: 	[211, 32, 968, 841]

In [29]:
foo.clear() # empty the set
bar[:] = list() # empty the list

for i in xrange(10):
    val = random.randint(0, 1000)
    foo.add(val)
    bar.append(val)

print 'Set values are: \t' + str(foo)
print 'List values are: \t' + str(bar)

print '10 iterations: {setlength} elements in the set'.format(setlength=len(foo))
print '10 iterations: {listlength} elements in the list'.format(listlength=len(bar))
Set values are: 	set([769, 976, 293, 327, 647, 941, 944, 242, 659, 212])
List values are: 	[293, 647, 327, 769, 659, 944, 212, 976, 242, 941]
10 iterations: 10 elements in the set
10 iterations: 10 elements in the list

In [30]:
foo.clear()
bar[:] = list()

for i in xrange(100):
    val = random.randint(0, 1000)
    foo.add(val)
    bar.append(val)
    
print '100 iterations: {setlength} elements in the set'.format(setlength=len(foo))
print '100 iterations: {listlength} elements in the list'.format(listlength=len(bar))
100 iterations: 96 elements in the set
100 iterations: 100 elements in the list

In [31]:
foo.clear()
bar[:] = []

for i in xrange(1000):
    val = random.randint(0, 1000)
    foo.add(val)
    bar.append(val)
    
print '1000 iterations: {setlength} elements in the set'.format(setlength=len(foo))
print '1000 iterations: {listlength} elements in the list'.format(listlength=len(bar))
1000 iterations: 625 elements in the set
1000 iterations: 1000 elements in the list

In [32]:
foo.clear()
bar[:] = []

for i in xrange(10000):
    val = random.randint(0, 1000)
    foo.add(val)
    bar.append(val)
    
print '10000 iterations: {setlength} elements in the set'.format(setlength=len(foo))
print '10000 iterations: {listlength} elements in the list'.format(listlength=len(bar))
10000 iterations: 1001 elements in the set
10000 iterations: 10000 elements in the list

In [33]:
print "Sets have the following methods: " + str([thing for thing in dir(foo) if not thing.startswith("__")])
Sets have the following methods: ['add', 'clear', 'copy', 'difference', 'difference_update', 'discard', 'intersection', 'intersection_update', 'isdisjoint', 'issubset', 'issuperset', 'pop', 'remove', 'symmetric_difference', 'symmetric_difference_update', 'union', 'update']

In [34]:
print "Lists have the following methods: " + str([thing for thing in dir(bar) if not thing.startswith("__")])
Lists have the following methods: ['append', 'count', 'extend', 'index', 'insert', 'pop', 'remove', 'reverse', 'sort']

Dictionaries

A dictionary is a key-value mapping. If you have taken CS 106B, dicts are like a hashmap or hashtable object.

In [35]:
baz = {}
baz['a'] = 1
baz['b'] = 2
baz
Out[35]:
{'a': 1, 'b': 2}

a and b are keys.

1 is the value for a and 2 is the value for b.

Note that I had to create the dictionary baz first.
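Some common dict operations, sketched with the baz we just built:

print baz['a']         # lookup by key
print baz.get('z', 0)  # .get() returns a default instead of raising a KeyError
for key, value in baz.items():
    print key, value   # iterate over key-value pairs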

What kind of key-value mappings can a dictionary have?

In [36]:
baz['c'] = alphabet
baz['d'] = numbers
baz['e'] = list(numbers)
baz['f'] = {'a': 1, 'b': 2}
baz[alphabet] = '1'
baz
Out[36]:
{'a': 1,
 'abcdefghijklmnopqrstuvwxyz': '1',
 'b': 2,
 'c': 'abcdefghijklmnopqrstuvwxyz',
 'd': [0, 2, 4],
 'e': [0, 2, 4],
 'f': {'a': 1, 'b': 2}}

What kind of key-value mappings don't work?

In [37]:
baz[{'a': 1, 'b':2}] = 0
baz
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-37-b77e201d3ce1> in <module>()
----> 1 baz[{'a': 1, 'b':2}] = 0
      2 baz

TypeError: unhashable type: 'dict'
In [38]:
baz[list(numbers)] = numbers
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-38-a90e71148a56> in <module>()
----> 1 baz[list(numbers)] = numbers

TypeError: unhashable type: 'list'
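The rule: keys must be hashable, which in practice means immutable. A quick sketch of the workaround, using the numbers list from earlier: convert the list to a tuple, which is immutable and therefore a legal key.

baz[tuple(numbers)] = numbers  # tuples are immutable, hence hashable
print baz[(0, 2, 4)]           # [0, 2, 4]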

Python in application

Retrieving all Political Party Platforms using a Python script

You can always learn more about Python and code patterns, but you have to start small.

We're going to use two third-party libraries (requests and BeautifulSoup) and the techniques and code patterns you just learned to programmatically scrape a list of pages.

In [39]:
from IPython.display import HTML
import requests # 3rd party library

presidency_platforms_url = 'http://www.presidency.ucsb.edu/platforms.php'

HTML("<iframe src=" + presidency_platforms_url + " width=100% height=400px></iframe>")
Out[39]:

Request/Response model

Request/response is a messaging protocol.

It is the underlying architectural model for the Hypertext Transfer Protocol (HTTP), which is the agreed-upon standard for the way the Web works.

The very general, grossly oversimplified idea:

  1. Clients (like you!) issue requests to servers
  2. Servers issue responses if they receive a request

Servers sit around waiting to respond to requests. If a server doesn't respond, something is wrong.

How do I know that my request was issued successfully?

In [40]:
import requests

r = requests.get(presidency_platforms_url)

print 'Server response status code = ' + r.status_code
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-40-8bcb880d99a2> in <module>()
      3 r = requests.get(presidency_platforms_url)
      4 
----> 5 print 'Server response status code = ' + r.status_code

TypeError: cannot concatenate 'str' and 'int' objects
In []:
print type(r.status_code)
print 'Server response status code = ' + str(r.status_code)
print 'Server response status code = %i' % r.status_code
print 'Server response status code = {statuscode}'.format(statuscode=r.status_code)

What's a status code?

The Web only works because everybody agreed to honor HTTP.

All HTTP clients (e.g. a web browser) must recognize status codes.

Generally:

  • 2XX is good
  • 4XX and 5XX are bad

If you write a script to automate scraping, check for status code = 200. Otherwise, you might get junk data!
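A minimal sketch of that check in a scraping script (reusing the presidency_platforms_url defined earlier):

import requests

r = requests.get(presidency_platforms_url)
if r.status_code == 200:
    html = r.text  # safe to parse
else:
    print 'Bad response: {code}'.format(code=r.status_code)
    # alternatively, r.raise_for_status() raises an exception on 4XX/5XX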

What else comes in a server response?

You can check the response headers to find out more information about the server you've hit.

In [41]:
import pprint # A standard library module that helps to pretty print output.
print r.encoding
headers = r.headers
pprint.pprint(headers.items()) # Prints a list of tuples
ISO-8859-1
[('content-length', '4241'),
 ('content-encoding', 'gzip'),
 ('vary', 'Accept-Encoding'),
 ('server', 'Apache'),
 ('date', 'Fri, 25 Apr 2014 20:46:40 GMT'),
 ('content-type', 'text/html')]

FYI, the server sees a lot of information from you as well!

In [42]:
r = requests.get('http://httpbin.org/user-agent') # Website that allows you to test for HTTP behaviors
r.text 
Out[42]:
u'{\n  "user-agent": "python-requests/2.2.1 CPython/2.7.6 Darwin/12.5.0"\n}'

Note that a string prefaced with u' means it's a Unicode string. Unicode relates to character encoding. Encoding is tricky.

I also created a small history lesson on character encoding.

Your Response object has a lot of helpful attributes that come in handy with web scraping:

In [43]:
print 'Requests.get() returns a {} object.'.format(type(r))
for attr in dir(r):
    if attr.startswith('__') or attr.startswith('_'):
        pass
    else:
        print attr
Requests.get() returns a <class 'requests.models.Response'> object.
apparent_encoding
close
connection
content
cookies
elapsed
encoding
headers
history
iter_content
iter_lines
json
links
ok
raise_for_status
raw
reason
request
status_code
text
url

In [44]:
print r.text[:1000] # Truncated for example
print len(r.text) # in characters
{
  "user-agent": "python-requests/2.2.1 CPython/2.7.6 Darwin/12.5.0"
}
71

You don't have to work solely with the raw response r.text. You can also get the response back as JSON with r.json().

JSON can be preferable because then you can work with Python dictionaries.

In [45]:
r = requests.get('http://httpbin.org/ip')
print r.json()
{u'origin': u'171.65.238.4'}

In [46]:
import json
print r.json()['origin']
171.65.238.4

Extracting information from structured content ("Page scraping")

This is probably more appropriately called screen scraping. I'll get more into that at the end.

3rd party libraries: BeautifulSoup and lxml

BeautifulSoup

  • Simple, beginner-friendly API
  • Pure Python, so it is easy to install
  • Very forgiving of malformed HTML ("tag soup")

lxml

  • More powerful parsing capabilities: XPath, CSS Selectors
  • Has C dependencies (can be hard to install if you don't feel comfortable building software from source)
  • Can work with more than HTML (e.g. XML).

Both of these can be easily installed if you use the Enthought Python Distribution. We're going to do an example with BeautifulSoup.

In [47]:
from bs4 import BeautifulSoup
r = requests.get(presidency_platforms_url)
soup = BeautifulSoup(r.text)
print type(soup)
<class 'bs4.BeautifulSoup'>

In [48]:
print soup.prettify()[0:1000]
<html>
 <head>
  <title>
   Political Party Platforms
  </title>
  <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
  <meta content="President of the United States, presidency, American Presidency, American President, Public Papers of the Presidents, State of the Union Address, Inaugural Address, Presidents, American Presidents, George W. Bush, Bill Clinton, George Bush, Ronald Reagan, Jimmy Carter, Gerald Ford, Richard Nixon, Lyndon Johnson, John F. Kennedy. John Kennedy, Dwight Eisenhower, Harry Truman, FDR, Franklin Roosevelt, Presidential Elections, Presidential Rhetoric" name="keywords"/>
  <meta content="The American Presidency Project contains the most comprehensive collection of resources pertaining to the study of the President of the United States.  Compiled by John Woolley and Gerhard Peters" name="description"/>
  <link href="http://www.presidency.ucsb.edu/styles/main.css" rel="stylesheet" type="text/css"/>
  <!-- BEGIN Tynt Script -->
  <!-- <script typ

Scraping content

The general idea: if an HTML file renders in your browser and you can see it on your screen, it probably has some structure in it.

You don't want the raw HTML. You want the content that is rendered on your browser screen.

The goal is to use the HTML to retrieve the relevant content.

In [49]:
print soup.title
print soup.meta
print soup.a
print soup.p
<title>Political Party Platforms</title>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<a href="../index.php"><img alt="Home" border="0" height="29" src="http://www.presidency.ucsb.edu/images/l1.gif" width="26"/></a>
<p><span class="datatitle">Political Party Platforms of Parties Receiving Electoral Votes: </span><span class="datadates">1840 - 2012</span></p>

What are these functions?

Beautiful Soup provides some attributes that are helpful for working with HTML. They are essentially shortcuts that retrieve the first occurrence of very common HTML elements.

In [50]:
Image('http://www.openbookproject.net/tutorials/getdown/css/images/lesson4/HTMLDOMTree.png')
Out[50]:

The HTML Document versus the DOM

Most modern browsers have a parser that reads in the HTML document, parses it into a DOM structure, and renders the DOM structure.

Much like HTTP, the DOM is an agreed-upon standard.

The DOM is much more than what I've described, but we don't have the time to go into it.

In [51]:
Image('http://www.cs.toronto.edu/~shiva/cscb07/img/dom/treeStructure.png')
Out[51]:
In [52]:
example_html = '<html>\n<body>\n<h1>Title</h1>\n<p>A <em>word</em> </p>\n</body>\n</html>'
example_soup = BeautifulSoup(example_html)
print example_soup.p
print example_soup.p.get_text()
print example_soup.em.get_text()
<p>A <em>word</em> </p>
A word 
word

There are many ways we could do this. The first thing to do is to examine the page source.

In [53]:
print r.text[0:1000]
<html>
<head>
<title>Political Party Platforms</title>
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=windows-1251">
<meta name="keywords" content="President of the United States, presidency, American Presidency, American President, Public Papers of the Presidents, State of the Union Address, Inaugural Address, Presidents, American Presidents, George W. Bush, Bill Clinton, George Bush, Ronald Reagan, Jimmy Carter, Gerald Ford, Richard Nixon, Lyndon Johnson, John F. Kennedy. John Kennedy, Dwight Eisenhower, Harry Truman, FDR, Franklin Roosevelt, Presidential Elections, Presidential Rhetoric">
<meta name="description" content="The American Presidency Project contains the most comprehensive collection of resources pertaining to the study of the President of the United States.  Compiled by John Woolley and Gerhard Peters">
<link href="http://www.presidency.ucsb.edu/styles/main.css" rel="stylesheet" type="text/css">
<!-- BEGIN Tynt Script -->
<!-- <script type="text/jav

Now let's extract every link URL on the page.

In [54]:
all_links = []
for link in soup.findAll('a'):
    all_links.append(link.get('href'))
print 'All link hrefs in a list from a for loop: %s' % len(all_links)

all_links_comprehension = [link.get('href') for link in soup.findAll('a')]

print 'All link hrefs in a list from a list comprehension: %s' % len(all_links_comprehension)
All link hrefs in a list from a for loop: 142
All link hrefs in a list from a list comprehension: 142

BeautifulSoup has a .get_text() method that extracts the text attribute from every tag.

In [55]:
print soup.get_text()[:1000]
Political Party Platforms
<!--
function MM_jumpMenu(targ,selObj,restore){ //v3.0
 eval(targ+".location='"+selObj.options[selObj.selectedIndex].value+"'");
 if (restore) selObj.selectedIndex=0;
}
//-->

[...dozens of blank lines from whitespace-only nodes omitted...]

Document Archive


• Public Papers of the Presidents


• State of the Union
          Addresses & Messages


• Inaugural Addresses


• Weekly  Addresses


• Fireside Chats


• News Conferences


• Executive Orders


• Proclamations


• Signing Statements


• Press Briefings 


• Statements of
           Administration Policy


• Economic Report of the President


• Debates


• Convention Speeches


• Party Platforms


• 2012 Election Documents 


• 2008 Election Documents 


• 2004 Election Documents 


• 1960 Election Documents 


• 2009 Transition


• 2001 Transition


Data Archive 


Data Index


Media Archive


Audio/Video Index


Elections


Election Index


Florida 2000


Links


Presidential Libraries






Political Pa

Remember, that's every tag. If you're lucky, you'll be able to select only the relevant tags that you care about and extract their text with this method.
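For example, this page stores its section titles in <span class="datatitle"> elements (visible in the soup.p output earlier), so a more targeted extraction might look like this sketch:

for span in soup.findAll('span', attrs={'class': 'datatitle'}):
    print span.get_text()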

In [56]:
all_links[40:60]
Out[56]:
['http://www.presidency.ucsb.edu/ws/index.php?pid=78283',
 'http://www.presidency.ucsb.edu/papers_pdf/78283.pdf',
 'http://www.presidency.ucsb.edu/ws/index.php?pid=29613',
 'http://www.presidency.ucsb.edu/papers_pdf/29613.pdf',
 'http://www.presidency.ucsb.edu/ws/index.php?pid=29612',
 'http://www.presidency.ucsb.edu/ws/index.php?pid=29611',
 'http://www.presidency.ucsb.edu/ws/index.php?pid=29610',
 'http://www.presidency.ucsb.edu/ws/index.php?pid=29609',
 'http://www.presidency.ucsb.edu/ws/index.php?pid=29608',
 'http://www.presidency.ucsb.edu/ws/index.php?pid=29607',
 'http://www.presidency.ucsb.edu/ws/index.php?pid=29606',
 'http://www.presidency.ucsb.edu/ws/index.php?pid=29605',
 'http://www.presidency.ucsb.edu/ws/index.php?pid=29604',
 'http://www.presidency.ucsb.edu/ws/index.php?pid=29603',
 'http://www.presidency.ucsb.edu/ws/index.php?pid=29602',
 'http://www.presidency.ucsb.edu/ws/index.php?pid=29601',
 'http://www.presidency.ucsb.edu/ws/index.php?pid=29600',
 'http://www.presidency.ucsb.edu/ws/index.php?pid=29599',
 'http://www.presidency.ucsb.edu/ws/index.php?pid=29598',
 'http://www.presidency.ucsb.edu/ws/index.php?pid=29597']
In [57]:
for link in all_links[40:60]:
    print link.split('/')    
['http:', '', 'www.presidency.ucsb.edu', 'ws', 'index.php?pid=78283']
['http:', '', 'www.presidency.ucsb.edu', 'papers_pdf', '78283.pdf']
['http:', '', 'www.presidency.ucsb.edu', 'ws', 'index.php?pid=29613']
['http:', '', 'www.presidency.ucsb.edu', 'papers_pdf', '29613.pdf']
['http:', '', 'www.presidency.ucsb.edu', 'ws', 'index.php?pid=29612']
['http:', '', 'www.presidency.ucsb.edu', 'ws', 'index.php?pid=29611']
['http:', '', 'www.presidency.ucsb.edu', 'ws', 'index.php?pid=29610']
['http:', '', 'www.presidency.ucsb.edu', 'ws', 'index.php?pid=29609']
['http:', '', 'www.presidency.ucsb.edu', 'ws', 'index.php?pid=29608']
['http:', '', 'www.presidency.ucsb.edu', 'ws', 'index.php?pid=29607']
['http:', '', 'www.presidency.ucsb.edu', 'ws', 'index.php?pid=29606']
['http:', '', 'www.presidency.ucsb.edu', 'ws', 'index.php?pid=29605']
['http:', '', 'www.presidency.ucsb.edu', 'ws', 'index.php?pid=29604']
['http:', '', 'www.presidency.ucsb.edu', 'ws', 'index.php?pid=29603']
['http:', '', 'www.presidency.ucsb.edu', 'ws', 'index.php?pid=29602']
['http:', '', 'www.presidency.ucsb.edu', 'ws', 'index.php?pid=29601']
['http:', '', 'www.presidency.ucsb.edu', 'ws', 'index.php?pid=29600']
['http:', '', 'www.presidency.ucsb.edu', 'ws', 'index.php?pid=29599']
['http:', '', 'www.presidency.ucsb.edu', 'ws', 'index.php?pid=29598']
['http:', '', 'www.presidency.ucsb.edu', 'ws', 'index.php?pid=29597']

In [58]:
for link in all_links[40:60]:
    print 'href #' + str(all_links.index(link)) + ' = '  + link.split('/')[-1]
href #40 = index.php?pid=78283
href #41 = 78283.pdf
href #42 = index.php?pid=29613
href #43 = 29613.pdf
href #44 = index.php?pid=29612
href #45 = index.php?pid=29611
href #46 = index.php?pid=29610
href #47 = index.php?pid=29609
href #48 = index.php?pid=29608
href #49 = index.php?pid=29607
href #50 = index.php?pid=29606
href #51 = index.php?pid=29605
href #52 = index.php?pid=29604
href #53 = index.php?pid=29603
href #54 = index.php?pid=29602
href #55 = index.php?pid=29601
href #56 = index.php?pid=29600
href #57 = index.php?pid=29599
href #58 = index.php?pid=29598
href #59 = index.php?pid=29597

In [59]:
valid_links = []
for link in all_links:
    final_url_element = link.split('/')[-1]
    if final_url_element.startswith('index.php?'):
        valid_links.append(link)

print 'There are {} valid links.'.format(len(valid_links))
valid_links[:10]
There are 96 valid links.

Out[59]:
['http://www.presidency.ucsb.edu/ws/index.php?pid=101962',
 'http://www.presidency.ucsb.edu/ws/index.php?pid=78283',
 'http://www.presidency.ucsb.edu/ws/index.php?pid=29613',
 'http://www.presidency.ucsb.edu/ws/index.php?pid=29612',
 'http://www.presidency.ucsb.edu/ws/index.php?pid=29611',
 'http://www.presidency.ucsb.edu/ws/index.php?pid=29610',
 'http://www.presidency.ucsb.edu/ws/index.php?pid=29609',
 'http://www.presidency.ucsb.edu/ws/index.php?pid=29608',
 'http://www.presidency.ucsb.edu/ws/index.php?pid=29607',
 'http://www.presidency.ucsb.edu/ws/index.php?pid=29606']
In [60]:
from datetime import datetime # Another standard library module.
for link in valid_links[:10]: # Limited for demonstration.  Also check out import time; time.sleep()
    r = requests.get(link)
    print '{time}\t{link}\t{status}'.format(time=datetime.isoformat(datetime.now()), link=link, status=r.status_code)
2014-04-25T13:46:56.508045	http://www.presidency.ucsb.edu/ws/index.php?pid=101962	200
2014-04-25T13:46:56.614769	http://www.presidency.ucsb.edu/ws/index.php?pid=78283	200
2014-04-25T13:46:56.703689	http://www.presidency.ucsb.edu/ws/index.php?pid=29613	200
2014-04-25T13:46:56.831947	http://www.presidency.ucsb.edu/ws/index.php?pid=29612	200
2014-04-25T13:46:56.939320	http://www.presidency.ucsb.edu/ws/index.php?pid=29611	200
2014-04-25T13:46:56.994097	http://www.presidency.ucsb.edu/ws/index.php?pid=29610	200
2014-04-25T13:46:57.046153	http://www.presidency.ucsb.edu/ws/index.php?pid=29609	200
2014-04-25T13:46:57.150035	http://www.presidency.ucsb.edu/ws/index.php?pid=29608	200
2014-04-25T13:46:57.269501	http://www.presidency.ucsb.edu/ws/index.php?pid=29607	200
2014-04-25T13:46:57.381718	http://www.presidency.ucsb.edu/ws/index.php?pid=29606	200

However, that previous example will print to stdout (that's what print() does).

If you execute this as a script at the command line, it would be better to have this write to a file, so let's open up a text file and write the output to that:

In [61]:
import os
request_log_file = open('presidency_platforms_scraping.log', 'w')
print type(request_log_file) # What kind of object does open() create?
request_log_file.write('Timestamp\tURL\tStatus Code\n')
for link in valid_links[:10]:
    r = requests.get(link)
    request_event_string = '{time}\t{link}\t{status}\n'.format(time=datetime.isoformat(datetime.now()), link=link, status=r.status_code)
    request_log_file.write(request_event_string) # Note I had to add the line ending "\n" above
request_log_file.close() # Make sure you close the file
print os.listdir(os.getcwd())
<type 'file'>
['.custom.css.swp', '.git', '.gitignore', '.ipynb_checkpoints', '.VAM.slides.html.swp', 'custom.css', 'env', 'index.html', 'LICENSE', 'output', 'presidency_platforms_scraping.log', 'README.md', 'requirements.txt', 'reveal.js', 'scraping_example.py', 'test.py', 'VAM.html', 'VAM.ipynb', 'VAM.slides.html', 'VAM_files']

If you don't want to bother handling files, you could keep printing to stdout and redirect to a text file (e.g. python scraping_example.py > scrape.log), assuming you are comfortable with the shell!

In [62]:
%%bash 
# This is IPython cell magic
head presidency_platforms_scraping.log
Timestamp	URL	Status Code
2014-04-25T13:46:58.039135	http://www.presidency.ucsb.edu/ws/index.php?pid=101962	200
2014-04-25T13:46:58.150156	http://www.presidency.ucsb.edu/ws/index.php?pid=78283	200
2014-04-25T13:46:58.231671	http://www.presidency.ucsb.edu/ws/index.php?pid=29613	200
2014-04-25T13:46:58.341762	http://www.presidency.ucsb.edu/ws/index.php?pid=29612	200
2014-04-25T13:46:58.430370	http://www.presidency.ucsb.edu/ws/index.php?pid=29611	200
2014-04-25T13:46:58.497058	http://www.presidency.ucsb.edu/ws/index.php?pid=29610	200
2014-04-25T13:46:58.556787	http://www.presidency.ucsb.edu/ws/index.php?pid=29609	200
2014-04-25T13:46:58.679715	http://www.presidency.ucsb.edu/ws/index.php?pid=29608	200
2014-04-25T13:46:58.804531	http://www.presidency.ucsb.edu/ws/index.php?pid=29607	200

Scraping content

We're lucky; these are fairly simple pages. All of the relevant text we want appears to live in the text of the <p> elements on each platform's page.

It also doesn't look like the content is dynamically generated by JavaScript or forms, so we can just rip it right out of the page.

Let's extract all the text from each page and save each one to a simple .txt file.

In [64]:
r = requests.get('http://www.presidency.ucsb.edu/ws/index.php?pid=101962')
soup = BeautifulSoup(r.text)

print soup.a
print soup.p
print soup.p.get_text()
<a href="../index.php"><img alt="Home" border="0" height="29" src="http://www.presidency.ucsb.edu/images/l1.gif" width="26"/></a>
<p>Four years ago, Democrats, independents, and many Republicans came together as Americans to move our country forward. We were in the midst of the greatest economic crisis since the Great Depression, the previous administration had put two wars on our nation's credit card, and the American Dream had slipped out of reach for too many. </p>
Four years ago, Democrats, independents, and many Republicans came together as Americans to move our country forward. We were in the midst of the greatest economic crisis since the Great Depression, the previous administration had put two wars on our nation's credit card, and the American Dream had slipped out of reach for too many. 

In [65]:
all_p_tags = soup.findAll('p')
print type(all_p_tags[0])
<class 'bs4.element.Tag'>

What other attributes do bs4.element.Tag objects have?

In [66]:
[attr for attr in dir(all_p_tags[0]) if not attr.startswith('__') and not attr.startswith('_')]
Out[66]:
['HTML_FORMATTERS',
 'XML_FORMATTERS',
 'append',
 'attribselect_re',
 'attrs',
 'can_be_empty_element',
 'childGenerator',
 'children',
 'clear',
 'contents',
 'decode',
 'decode_contents',
 'decompose',
 'descendants',
 'encode',
 'encode_contents',
 'extract',
 'fetchNextSiblings',
 'fetchParents',
 'fetchPrevious',
 'fetchPreviousSiblings',
 'find',
 'findAll',
 'findAllNext',
 'findAllPrevious',
 'findChild',
 'findChildren',
 'findNext',
 'findNextSibling',
 'findNextSiblings',
 'findParent',
 'findParents',
 'findPrevious',
 'findPreviousSibling',
 'findPreviousSiblings',
 'find_all',
 'find_all_next',
 'find_all_previous',
 'find_next',
 'find_next_sibling',
 'find_next_siblings',
 'find_parent',
 'find_parents',
 'find_previous',
 'find_previous_sibling',
 'find_previous_siblings',
 'format_string',
 'get',
 'getText',
 'get_text',
 'has_attr',
 'has_key',
 'hidden',
 'index',
 'insert',
 'insert_after',
 'insert_before',
 'isSelfClosing',
 'is_empty_element',
 'name',
 'namespace',
 'next',
 'nextGenerator',
 'nextSibling',
 'nextSiblingGenerator',
 'next_element',
 'next_elements',
 'next_sibling',
 'next_siblings',
 'parent',
 'parentGenerator',
 'parents',
 'parserClass',
 'parser_class',
 'prefix',
 'prettify',
 'previous',
 'previousGenerator',
 'previousSibling',
 'previousSiblingGenerator',
 'previous_element',
 'previous_elements',
 'previous_sibling',
 'previous_siblings',
 'recursiveChildGenerator',
 'renderContents',
 'replaceWith',
 'replaceWithChildren',
 'replace_with',
 'replace_with_children',
 'select',
 'setup',
 'string',
 'strings',
 'stripped_strings',
 'tag_name_re',
 'text',
 'unwrap',
 'wrap']

Scraping each page and saving extracted text to a new file.

Now let's glue together all of these pieces.

We'll introduce two new functions: os.makedirs() and os.path.join()

Also one last code pattern: opening and closing files using the with statement.
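A minimal sketch of all three together (the hypothetical filename is just for illustration; the output directory is guaranteed to exist by the os.makedirs() call in the code below):

import os

path = os.path.join('output', 'example.txt')  # builds the path with the right separator for your OS
with open(path, 'w') as f:                    # with guarantees the file is closed...
    f.write('hello\n')                        # ...even if an error occurs while writing
# f is closed here; no explicit f.close() needed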

In [67]:
if not os.path.exists('output'):
    os.makedirs('output')

request_log_file = open('output/presidency_platforms_scraping.log', 'w')
request_log_file.write('Timestamp\tURL\tStatus Code\n')

print 'Starting scraping.'
for link in valid_links:
    r = requests.get(link)
    request_event_string = '{time}\t{link}\t{status}\n'.format(time=datetime.isoformat(datetime.now()), link=link, status=r.status_code)
    request_log_file.write(request_event_string) # Note I had to add the line ending "\n" above
    
    soup = BeautifulSoup(r.text)
    all_p_tags = soup.findAll('p')
    filename = link.split('/')[-1].split('=')[-1] + '.txt'
    filename_path = os.path.join('output', filename)
    
    with open(filename_path, 'w') as scraped_text_outfile:
        text_as_list_of_strings = [p.get_text() for p in all_p_tags]
        scraped_text = ''.join(text_as_list_of_strings)
        scraped_text_outfile.write(scraped_text.encode('utf8')) 
        # Encoding is hard!  Why did I do this?  Hint: check "import sys; sys.stdout.encoding"

request_log_file.close()
print 'Finished scraping.'
Starting scraping.
Finished scraping.

Let's examine the results

In [68]:
print os.listdir('output')
['101961.txt', '101962.txt', '25835.txt', '25836.txt', '25837.txt', '25838.txt', '25839.txt', '25840.txt', '25841.txt', '25842.txt', '25843.txt', '25844.txt', '25845.txt', '25846.txt', '25847.txt', '25848.txt', '25849.txt', '25850.txt', '25851.txt', '25852.txt', '25855.txt', '25856.txt', '25857.txt', '29570.txt', '29571.txt', '29572.txt', '29573.txt', '29574.txt', '29575.txt', '29576.txt', '29577.txt', '29578.txt', '29579.txt', '29580.txt', '29581.txt', '29582.txt', '29583.txt', '29584.txt', '29585.txt', '29586.txt', '29587.txt', '29588.txt', '29589.txt', '29590.txt', '29591.txt', '29592.txt', '29593.txt', '29594.txt', '29595.txt', '29596.txt', '29597.txt', '29598.txt', '29599.txt', '29600.txt', '29601.txt', '29602.txt', '29603.txt', '29604.txt', '29605.txt', '29606.txt', '29607.txt', '29608.txt', '29609.txt', '29610.txt', '29611.txt', '29612.txt', '29613.txt', '29614.txt', '29615.txt', '29616.txt', '29617.txt', '29618.txt', '29619.txt', '29620.txt', '29621.txt', '29622.txt', '29623.txt', '29624.txt', '29625.txt', '29626.txt', '29627.txt', '29628.txt', '29629.txt', '29630.txt', '29631.txt', '29632.txt', '29633.txt', '29634.txt', '29635.txt', '29636.txt', '29637.txt', '29638.txt', '29639.txt', '29640.txt', '78283.txt', '78545.txt', 'presidency_platforms_scraping.log']

In [69]:
%%bash
echo "Number of files in output/"
ls output/ | wc -l

cd output
echo "Example text file:"
head -c 1000 101962.txt
Number of files in output/
      97
Example text file:
Four years ago, Democrats, independents, and many Republicans came together as Americans to move our country forward. We were in the midst of the greatest economic crisis since the Great Depression, the previous administration had put two wars on our nation's credit card, and the American Dream had slipped out of reach for too many. Today, our economy is growing again, al-Qaeda is weaker than at any point since 9/11, and our manufacturing sector is growing for the first time in more than a decade. But there is more we need to do, and so we come together again to continue what we started. We gather to reclaim the basic bargain that built the largest middle class and the most prosperous nation on Earth - the simple principle that in America, hard work should pay off, responsibility should be rewarded, and each one of us should be able to go as far as our talent and drive take us. This election is not simply a choice between two candidates or two political parties, but between two fundame
In [70]:
from IPython.display import Image
# import antigravity
Image('http://imgs.xkcd.com/comics/python.png')
Out[70]:

Appendix

Here are all the module versions that I used in this tutorial:

In [71]:
%%bash
pip freeze
Canopy==1.3.0.dev6305
CanopyTraining==0.9.dev1687
Cython==0.19.2
Examples==7.3
GDAL==1.10.0
Jinja2==2.7.1
MDP==3.3
MKL==10.3
MarkupSafe==0.18
Meta==0.4.2.dev
PIL==1.1.7
PyOpenGL==3.0.1
PySide==1.2.1
PyYAML==3.10
Pycluster==1.50
Pygments==1.6.0
PythonDoc==2.7.3
Qt==4.8.5
Reportlab==2.5
SQLAlchemy==0.8.3
Shapely==1.2.17
Sphinx==1.2.2
Twisted==12.0.0
VTK==5.10.1
appinst==2.1.2
apptools==4.2.1
atom==0.3.8
basemap==1.0.7
beautifulsoup4==4.3.1
biopython==1.62.0
bitarray==0.8.0
blist==1.3.4
blockcanvas==4.0.3
boto==2.19.0
bsdiff4==1.1.1
casuarius==1.1
chaco==4.4.1
cloud==2.4.6
codetools==4.2.0
configobj==4.7.2
coverage==3.7.1
curl==7.25.0
distribute==0.6.49
doclinks==7.3
docutils==0.11
enable==4.3.0
enaml==0.9.4
encore==0.5.1
enstaller==4.6.4
envisage==4.4.0
epydoc==3.0.1
esky==0.9.2.dev473
ets==4.4.1
etsdevtools==4.0.2
etsproxy==0.1.2
expat==2.0.1
faulthandler==2.0
feedparser==5.1.3
foolscap==0.6.3
freetype==2.4.4
fwrap==0.1.1
graphcanvas==4.0.2
grin==1.2.1
h5py==2.2.1
hdf5==1.8.11
html5lib==0.95
idle==2.7.3
ipython==2.0.0
jsonpickle==0.4.0
kernmagic==0.2.0
keyring==3.7.0
kiwisolver==0.1.2
lib-netcdf4==4.3.0
libgdal==1.10.1
libjpeg==7.0
libpng==1.2.40
libxml2==2.7.8
libxslt==1.1.26
llvmmath==0.1.1
llvmpy==0.12.1
lxml==3.2.3
matplotlib==1.3.1
mayavi==4.3.1
mock==1.0.1
netCDF4==1.0.7
networkx==1.8.1
nose==1.3.0
numba==0.13.0
numexpr==2.2.2
numpy==1.8.0
openpyxl==1.8.5
pandas==0.13.1
paramiko==1.10.1
patsy==0.2.0
pep8==1.4.6
ply==3.4
psutil==1.2.1
pyOpenSSL==0.13.1
pyasn1==0.1.7
pycrypto==2.6.1
pydot==1.0.28
pyface==4.4.0
pyfits==3.0.6
pyflakes==0.7.3
pygarrayimage==0.0.7
pyglet==1.1.4
pyhdf==0.8.3
pyparsing==1.5.6
pyproj==1.9.3
pyserial==2.6
pytables==2.4.0
python-dateutil==2.2.0
pytz==2013.8.0
pyzmq==14.1.1
readline==6.2.1
requests==2.2.1
rsa==3.1.2
scikit-learn==0.14.1
scikits.image==0.9.3
scikits.timeseries==0.91.3
scimath==4.1.2
scipy==0.13.3
scons==2.0.1
shiboken==1.2.1
simpy==3.0.2
six==1.4.1
statsmodels==0.5.0
supplement==0.5dev.dev202
swig==1.3.40
sympy==0.7.3
tornado==3.1.1
traits==4.4.0
traitsui==4.4.0
wsgiref==0.1.2
wxPython==2.9.2.4
xlrd==0.9.2
xlwt==0.7.5
zeromq==3.2.4
zope.interface==4.1.1

In [72]:
%%bash
pip freeze > requirements.txt

Python Easter eggs

In [73]:
import this
The Zen of Python, by Tim Peters

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!

In [74]:
from __future__ import braces
  File "<ipython-input-74-2aebb3fc8ecf>", line 1
    from __future__ import braces
SyntaxError: not a chance
In [75]:
# http://legacy.python.org/dev/peps/pep-0401/
from __future__ import barry_as_FLUFL 
  File "<ipython-input-75-83678a90e3e9>", line 2
    from __future__ import barry_as_FLUFL
SyntaxError: future feature barry_as_FLUFL is not defined