Web Scraping with BeautifulSoup




Notebook Creator: Salah Ahmed

Objectives

This notebook introduces the following:

  • jupyter notebooks
  • requests library
  • BeautifulSoup library
  • list-comprehension
  • pandas.DataFrame object

Jupyter notebooks

Notebooks are great for exploring code and ideas.

They provide an interactive environment with tab completion for objects in code, meaning you can explore libraries and methods in a way not possible in a terminal or in some IDEs.

Some keyboard shortcuts to help you in your development with Jupyter (press Esc first to enter command mode for the single-key shortcuts):

key            function
h              shows the help menu
shift + enter  runs the current notebook cell
?              after an object (function, method, or class instance) shows its signature
a              creates a cell above the current cell
b              creates a cell below the current cell
x              deletes the current cell

example

import os
os?
os.<tab>
# shows functions in "os" module

Lists

In [6]:
py_list = [1, 3, 4, 5.0]
py_list.append(40)
print(py_list)
[1, 3, 4, 5.0, 40]

list comprehension

  • create lists based on other lists
In [9]:
[i for i in py_list]
Out[9]:
[1, 3, 4, 5.0, 40]
In [10]:
[i * 3 for i in py_list]
Out[10]:
[3, 9, 12, 15.0, 120]
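Comprehensions can also filter with an `if` clause. A small sketch, reusing the same `py_list` as above:

```python
py_list = [1, 3, 4, 5.0, 40]

# keep only the elements greater than 3
big = [i for i in py_list if i > 3]
print(big)  # [4, 5.0, 40]
```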

Exercise

  1. Create a list of the even numbers from 1 to 10
  2. Using a list comprehension, return a new list in which every element of the old list is multiplied by 542
In [58]:
# your code here

Dictionaries

Dictionaries store values by key.

In [11]:
ages = {'alex': 23, 'mike': 25}
print(ages['alex'])
23
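Dictionaries are mutable: assigning to a key adds or updates an entry. A small sketch extending the `ages` dictionary above:

```python
ages = {'alex': 23, 'mike': 25}

# assigning to a new key adds an entry;
# assigning to an existing key overwrites it
ages['jane'] = 30
ages['alex'] = 24
print(sorted(ages.keys()))  # ['alex', 'jane', 'mike']
print(ages['alex'])         # 24
```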

Exercise

  1. Make a dictionary with the following key,value mapping
1 -> bill
2 -> jane
3 -> mary

name your dictionary "students"

In [12]:
# your code here

HTTP

The two most important HTTP methods are GET and POST; check out the other methods as well.

GET requests

GET requests retrieve data from a URL; you can add parameters to your query to get the page you want!

Parameters are like dictionaries!

The key is the query, the value is whatever you're requesting!

The format is:

sitename.com/whatever/path?key=value

The important things to note here are the "?" and the "key=value".

Example:

YouTube searches for videos using the key "search_query"

  1. site: https://youtube.com
  2. query: "cats"
  3. https://www.youtube.com/results?search_query=cats
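The `?key=value` string above doesn't have to be written by hand; the standard library's `urllib.parse.urlencode` builds (and escapes) the query string from a dictionary. A minimal sketch:

```python
from urllib.parse import urlencode

# a dict of query parameters -> "key=value" string
params = {'search_query': 'cats'}
query = urlencode(params)
url = 'https://www.youtube.com/results?' + query
print(url)  # https://www.youtube.com/results?search_query=cats
```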

Requests

requests is a Python library for handling HTTP requests.

It is extremely simple to use and very intuitive.

  • GET requests use the 'get' method: requests.get
  • to pass in query parameters, pass a dictionary as the second argument!
In [15]:
import requests
In [5]:
requests.get?
In [16]:
params = {'search_query': 'cats'}
In [17]:
url = 'https://youtube.com/results'
In [18]:
r = requests.get(url, params)
In [19]:
print(r.url)
https://www.youtube.com/results?search_query=cats
In [20]:
# html of the page (first 500 characters)
print(r.content[:500])
b'  <!DOCTYPE html><html lang="en" data-cast-api-enabled="true"><head><style name="www-roboto" >@font-face{font-family:\'Roboto\';font-style:italic;font-weight:500;src:local(\'Roboto Medium Italic\'),local(\'Roboto-MediumItalic\'),url(//fonts.gstatic.com/s/roboto/v18/OLffGBTaF0XFOW1gnuHF0Z0EAVxt0G0biEntp43Qt6E.ttf)format(\'truetype\');}@font-face{font-family:\'Roboto\';font-style:italic;font-weight:400;src:local(\'Roboto Italic\'),local(\'Roboto-Italic\'),url(//fonts.gstatic.com/s/roboto/v18/W4wDsBUluyw0tK3tykh'

Exercise

  1. send get request to twitter searching for term "dude"
    • hint: twitter uses the key "q" for search
  2. return the html of the page
In [31]:
site = 'https://twitter.com'
# your code here

HTML

HTML is the skeleton of websites. You can view a website's HTML through the developer console. To open the developer console in Chrome:

os             key
linux/windows  Ctrl + Shift + J
osx            Cmd + Opt + J

BeautifulSoup

BeautifulSoup is a library for parsing HTML and XML.

important methods:

  • find
    • finds first tag, class, id, or other selector types matching query
    • finding 'p' tag
      • soup.find('p')
      • soup.find(id='myid')
      • soup.find(class_='myclass')
  • find_all
    • find all matching tags and return them in a list
      • soup.find_all(class_='myclass')
  • text
    • returns text of html tags
In [32]:
from bs4 import BeautifulSoup

steps:

  1. initialize beautifulsoup with html
  2. search html
In [35]:
# init
soup = BeautifulSoup('<div class="title"><p> Pie </p></div><div class="title"><p> xyz </p></div>',
                    'html5lib')
In [38]:
print(soup.text)
print(soup.find('p'))
print(soup.find(class_='title'))
print(soup.find_all(class_='title'))
print([i.text for i in soup.find_all(class_='title')])
 Pie  xyz 
<p> Pie </p>
<div class="title"><p> Pie </p></div>
[<div class="title"><p> Pie </p></div>, <div class="title"><p> xyz </p></div>]
[u' Pie ', u' xyz ']
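One gotcha worth knowing: `find` returns `None` when nothing matches, so guard before calling `.text`. A sketch using Python's built-in `html.parser` backend (so no extra parser install is assumed):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<div class="title"><p> Pie </p></div>', 'html.parser')

tag = soup.find('h1')  # there is no <h1> in this html
if tag is None:
    print('no match')
else:
    print(tag.text)
```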
In [41]:
url = 'https://nbconvert.readthedocs.io/en/latest/usage.html'
r = requests.get(url)
soup = BeautifulSoup(r.content, 'html5lib')
print(soup.find('h1').text)
Using as a command line tool¶

Exercise:

  1. return all "h3" tags from within each element whose class is "section" at the previous url
    • hint: to get html tag by class use "class_" like we did previously
    • hint: find_all is needed
    • hint: list comprehension
In [48]:
# your code here

pandas.DataFrame

DataFrame from the pandas module transforms a list or dictionary into a table.

In turn, we can export the table to a csv file.

csv

csv stands for comma-separated values and is a common format for storing tabular data, similar to Excel files.

the format is

row1col1,row1col2
row2col1,row2col2

which is equivalent to the table

row1col1 row1col2
row2col1 row2col2
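The format above can also be parsed without pandas, using the standard library's csv module — a minimal sketch:

```python
import csv
import io

text = 'row1col1,row1col2\nrow2col1,row2col2\n'

# csv.reader yields one list of strings per row
rows = list(csv.reader(io.StringIO(text)))
print(rows)  # [['row1col1', 'row1col2'], ['row2col1', 'row2col2']]
```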
In [2]:
from pandas import DataFrame
In [5]:
data = [1, 2, 3, 4]
df = DataFrame(data, columns=['x'])
df
Out[5]:
x
0 1
1 2
2 3
3 4
In [8]:
path = 'test.csv'
df.to_csv(path, index=False)
%cat test.csv
x
1
2
3
4
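DataFrame also accepts a list of dictionaries, one per row, with the keys becoming column names — handy when each scraped item is collected as a dict. A sketch with made-up rows:

```python
from pandas import DataFrame

# one dict per row; keys become column names
rows = [{'name': 'bill', 'age': 20}, {'name': 'jane', 'age': 22}]
df = DataFrame(rows)
print(sorted(df.columns.tolist()))  # ['age', 'name']
print(len(df))                      # 2
```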