Web Scraping with BeautifulSoup




Notebook Creator: Salah Ahmed

Objectives

This notebook introduces the following:

  • jupyter notebooks
  • requests library
  • BeautifulSoup library
  • list-comprehension
  • pandas.DataFrame object

Jupyter notebooks

Notebooks are great for exploring code and ideas.

They provide an interactive environment with tab completion for objects in code, meaning you can explore libraries and methods in a way not possible in a terminal or in some IDEs.

Some keyboard shortcuts to help you in your development with Jupyter (press Esc first to enter command mode for the single-key shortcuts):

key            function
h              shows the help menu
shift + enter  runs the current notebook cell
?              after an object (function, method, or class instance) shows its signature
a              creates a cell above the current cell
b              creates a cell below the current cell
x              deletes the current cell

example

import os
os?
os.<tab>
# shows functions in "os" module

Lists

In [6]:
py_list = [1, 3, 4, 5.0]
py_list.append(40)
print(py_list)
[1, 3, 4, 5.0, 40]

list comprehension

  • create lists based on other lists
In [9]:
[i for i in py_list]
Out[9]:
[1, 3, 4, 5.0, 40]
In [10]:
[i * 3 for i in py_list]
Out[10]:
[3, 9, 12, 15.0, 120]
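Comprehensions can also filter with an `if` clause. A small sketch, reusing the same `py_list` as above:

```python
py_list = [1, 3, 4, 5.0, 40]

# keep only the elements greater than 3
big = [i for i in py_list if i > 3]
print(big)  # [4, 5.0, 40]
```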

Exercise

  1. Create a list of the even numbers from 1 to 10
  2. Using a list comprehension, return a new list in which every element of the old list is multiplied by 542
In [58]:
# your code here

Dictionaries

Dictionaries store values by key.

In [11]:
ages = {'alex': 23, 'mike': 25}
print(ages['alex'])
23
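Dictionaries are mutable: assigning to a key adds or updates an entry. A small sketch extending the `ages` dictionary above:

```python
ages = {'alex': 23, 'mike': 25}

# assigning to a new key adds an entry;
# assigning to an existing key overwrites it
ages['jane'] = 30
ages['alex'] = 24
print(sorted(ages.keys()))  # ['alex', 'jane', 'mike']
print(ages['alex'])         # 24
```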

Exercise

  1. Make a dictionary with the following key,value mapping
1 -> bill
2 -> jane
3 -> mary

name your dictionary "students"

In [12]:
# your code here

HTTP

The two most important HTTP methods are GET and POST; check out the other methods as well.

GET requests

GET requests retrieve data from a URL; you can add parameters to your query to get the page you want!

Parameters are like dictionaries!

The key is the query, the value is whatever you're requesting!

The format is:

sitename.com/whatever/path?key=value

The important things to note here are the "?" and the "key=value".

Example:

YouTube searches for videos using the key "search_query"

  1. site: https://youtube.com
  2. query: "cats"
  3. https://www.youtube.com/results?search_query=cats
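The `?key=value` string above doesn't have to be written by hand; the standard library's `urllib.parse.urlencode` builds (and escapes) the query string from a dictionary. A minimal sketch:

```python
from urllib.parse import urlencode

# a dict of query parameters -> "key=value" string
params = {'search_query': 'cats'}
query = urlencode(params)
url = 'https://www.youtube.com/results?' + query
print(url)  # https://www.youtube.com/results?search_query=cats
```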

Requests

requests is a Python library for handling HTTP requests.

It is extremely simple to use and very intuitive.

  • GET requests use the 'get' method: requests.get
  • to pass in query parameters, pass a dictionary as the second argument!
In [15]:
import requests
In [5]:
requests.get?
In [16]:
params = {'search_query': 'cats'}
In [17]:
url = 'https://youtube.com/results'
In [18]:
r = requests.get(url, params)
In [19]:
print(r.url)
https://www.youtube.com/results?search_query=cats
In [20]:
# html of the page (first 500 characters)
print(r.content[:500])
b'  <!DOCTYPE html><html lang="en" data-cast-api-enabled="true"><head><style name="www-roboto" >@font-face{font-family:\'Roboto\';font-style:italic;font-weight:500;src:local(\'Roboto Medium Italic\'),local(\'Roboto-MediumItalic\'),url(//fonts.gstatic.com/s/roboto/v18/OLffGBTaF0XFOW1gnuHF0Z0EAVxt0G0biEntp43Qt6E.ttf)format(\'truetype\');}@font-face{font-family:\'Roboto\';font-style:italic;font-weight:400;src:local(\'Roboto Italic\'),local(\'Roboto-Italic\'),url(//fonts.gstatic.com/s/roboto/v18/W4wDsBUluyw0tK3tykh'

Exercise

  1. send get request to twitter searching for term "dude"
    • hint: twitter uses the key "q" for search
  2. return the html of the page
In [31]:
site = 'https://twitter.com'
# your code here

HTML

HTML is the skeleton of websites. You can view a website's HTML through the developer console. To open the developer console in Chrome:

os             key
linux/windows  Ctrl + Shift + J
osx            Cmd + Opt + J

BeautifulSoup

BeautifulSoup is a library for parsing HTML and XML.

important methods:

  • find
    • finds first tag, class, id, or other selector types matching query
    • finding 'p' tag
      • soup.find('p')
      • soup.find(id='myid')
      • soup.find(class_='myclass')
  • find_all
    • find all matching tags and return them in a list
      • soup.find_all(class_='myclass')
  • text
    • returns text of html tags
In [32]:
from bs4 import BeautifulSoup

steps:

  1. initialize beautifulsoup with html
  2. search html
In [35]:
# init
soup = BeautifulSoup('<div class="title"><p> Pie </p></div><div class="title"><p> xyz </p></div>',
                    'html5lib')
In [38]:
print(soup.text)
print(soup.find('p'))
print(soup.find(class_='title'))
print(soup.find_all(class_='title'))
print([i.text for i in soup.find_all(class_='title')])
 Pie  xyz 
<p> Pie </p>
<div class="title"><p> Pie </p></div>
[<div class="title"><p> Pie </p></div>, <div class="title"><p> xyz </p></div>]
[u' Pie ', u' xyz ']
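One gotcha worth knowing: `find` returns `None` when nothing matches, so guard before calling `.text`. A sketch using Python's built-in `html.parser` backend (so no extra parser install is assumed):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<div class="title"><p> Pie </p></div>', 'html.parser')

tag = soup.find('h1')  # there is no <h1> in this html
if tag is None:
    print('no match')
else:
    print(tag.text)
```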
In [41]:
url = 'https://nbconvert.readthedocs.io/en/latest/usage.html'
r = requests.get(url)
soup = BeautifulSoup(r.content, 'html5lib')
print(soup.find('h1').text)
Using as a command line tool¶

Exercise:

  1. return all "h3" tags from within each element whose class is "section" at the previous url
    • hint: to get html tag by class use "class_" like we did previously
    • hint: find_all is needed
    • hint: list comprehension
In [48]:
# your code here

pandas.DataFrame

DataFrame from the pandas module transforms a list or dictionary into a table.

In turn, we can export the table to a csv file.

csv

csv stands for comma-separated values and is a common format for storing tabular data, similar to Excel files.

the format is

row1col1,row1col2
row2col1,row2col2

which is equivalent to the table

row1col1 row1col2
row2col1 row2col2
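The format above can also be parsed without pandas, using the standard library's csv module — a minimal sketch:

```python
import csv
import io

text = 'row1col1,row1col2\nrow2col1,row2col2\n'

# csv.reader yields one list of strings per row
rows = list(csv.reader(io.StringIO(text)))
print(rows)  # [['row1col1', 'row1col2'], ['row2col1', 'row2col2']]
```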
In [2]:
from pandas import DataFrame
In [5]:
data = [1, 2, 3, 4]
df = DataFrame(data, columns=['x'])
df
Out[5]:
x
0 1
1 2
2 3
3 4
In [8]:
path = 'test.csv'
df.to_csv(path, index=False)
%cat test.csv
x
1
2
3
4
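DataFrame also accepts a list of dictionaries, one per row, with the keys becoming column names — handy when each scraped item is collected as a dict. A sketch with made-up rows:

```python
from pandas import DataFrame

# one dict per row; keys become column names
rows = [{'name': 'bill', 'age': 20}, {'name': 'jane', 'age': 22}]
df = DataFrame(rows)
print(sorted(df.columns.tolist()))  # ['age', 'name']
print(len(df))                      # 2
```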