
Extract Text from a Webpage

We presented this tool and notebook as part of our workshop on Computational Approaches to Fight Human Trafficking.

We use requests and BeautifulSoup to perform the following steps:

Get HTML
Extract the title
Extract the body text, dropping script content and HTML tags
Replace multiple newlines with a single newline
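
The four steps above can also be bundled into one function and tried on an inline HTML string, which avoids a network call while experimenting. This is only a sketch: the name extract_title_and_text and the sample markup are hypothetical, and it assumes Python's built-in 'html.parser' backend rather than 'lxml' so that no extra parser needs to be installed.

```python
import re

from bs4 import BeautifulSoup


def extract_title_and_text(html):
    """Return (title, body_text) for an HTML string or bytes."""
    # 'html.parser' ships with Python; swap in 'lxml' if it is installed
    soup = BeautifulSoup(html, 'html.parser')
    title_tag = soup.find('title')
    title = title_tag.get_text().strip() if title_tag else ''
    body = soup.find('body')
    if body is None:
        return title, ''
    # Drop script elements so their source does not leak into the text
    for tag in body.find_all('script'):
        tag.decompose()
    # Extract text without tags, one text node per line
    text = body.get_text(separator='\n').strip()
    # Collapse runs of newlines into a single newline
    return title, re.sub(r'\n+', '\n', text)


# Hypothetical sample page used only for illustration
sample_html = (
    '<html><head><title>Example</title></head>'
    '<body><h1>Hello</h1><script>var x = 1;</script>'
    '<p>World</p></body></html>')
title, text = extract_title_and_text(sample_html)
print('title = ' + title)
print(text)
```

Returning an empty string when the title or body is missing keeps the function usable on malformed pages, where soup.find returns None.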

In [ ]:

# CrossCompute
url = 'https://www.unodc.org'
target_folder = '/tmp'

In [ ]:

import requests

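# Set a timeout and raise on HTTP error codes instead of parsing an error page
response = requests.get(url, timeout=30)
response.raise_for_status()
html = response.content
html[:200]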

In [ ]:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
title = soup.find('title')
print('title = ' + title.text)

In [ ]:

body = soup.find('body')
# Remove script content
for x in body.find_all('script'):
    x.decompose()
# Extract text without tags
text = body.get_text(separator='\n').strip()
print(text[:70])

In [ ]:

import re

# Replace multiple newlines with a single newline
pattern = re.compile(r'\n+')
text = pattern.sub('\n', text)
print(text[:89])
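
The pattern above only merges newlines that are directly adjacent. Lines that contain nothing but spaces or tabs, which may appear when text nodes are joined with a newline separator, would still survive as blank-looking lines. One possible alternative, sketched here with a hypothetical helper named collapse_blank_lines, is to strip each line and keep only the non-empty ones.

```python
def collapse_blank_lines(text):
    """Strip each line and drop lines that are empty after stripping."""
    lines = (line.strip() for line in text.splitlines())
    return '\n'.join(line for line in lines if line)


# Hypothetical sample with whitespace-only lines between entries
sample = 'Home\n   \n\n\tAbout  \n \nContact'
cleaned = collapse_blank_lines(sample)
print(cleaned)
```

Note that this variant also removes leading and trailing whitespace inside each line, so it changes slightly more than the regular expression does.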

In [ ]:

target_path = target_folder + '/raw.txt'
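# Use a context manager so the file is flushed and closed;
# UTF-8 is an assumed encoding for the saved text
with open(target_path, 'wt', encoding='utf-8') as target_file:
    target_file.write(text)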
print('body_text_path = ' + target_path)


Build a Human Trafficking Dataset from Court Cases and News Articles 20171214
