Build a Human Trafficking Dataset from Court Cases and News Articles 20171214




Notebook Creator: Roy Hyunjin Han

Extract Text from a Webpage

We presented this tool and notebook as part of our workshop on Computational Approaches to Fight Human Trafficking.

We use requests and BeautifulSoup to perform the following steps:

  1. Get HTML
  2. Extract the title
  3. Extract the body text, dropping script content and HTML tags
  4. Replace multiple newlines with a single newline
In [ ]:
# CrossCompute
url = 'https://www.unodc.org'
target_folder = '/tmp'
In [ ]:
import requests

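# Set a timeout so that the request cannot hang indefinitely
response = requests.get(url, timeout=10)
# Raise an exception if the server returned an error status
response.raise_for_status()
html = response.content
html[:200]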
In [ ]:
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
title = soup.find('title')
print('title = ' + title.text)
In [ ]:
body = soup.find('body')
# Remove script content
for x in body.find_all('script'):
    x.decompose()
# Extract text without tags
text = body.get_text(separator='\n').strip()
print(text[:70])
In [ ]:
import re

# Replace multiple newlines with a single newline
pattern = re.compile(r'\n+')
text = pattern.sub('\n', text)
print(text[:89])
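The pattern `\n+` only merges consecutive newline characters, so a line that contains nothing but spaces or tabs still survives as a blank-looking line. If that matters for your pages, one possible variant is sketched below. The helper name `collapse_blank_lines` and the sample string are illustrative assumptions, not part of the original notebook.

```python
import re


def collapse_blank_lines(text):
    """Collapse runs of empty or whitespace-only lines into a single newline."""
    # \n\s*\n matches two or more newlines together with any whitespace
    # between them, so lines holding only spaces or tabs are removed as well
    return re.sub(r'\n\s*\n', '\n', text)


# Hypothetical sample text with empty and whitespace-only lines
sample = 'Title\n\n  \n\t\nFirst paragraph\n\n\nSecond paragraph'
print(collapse_blank_lines(sample))
```

Indentation at the start of a non-blank line is preserved, because the match always ends on a newline character.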
In [ ]:
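target_path = target_folder + '/raw.txt'
# Use a context manager so that the file is flushed and closed;
# write as UTF-8 so that non-ASCII characters in the page are preserved
with open(target_path, 'wt', encoding='utf-8') as target_file:
    target_file.write(text)
print('body_text_path = ' + target_path)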
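To reuse the extraction steps on many court case or news pages, they can be wrapped in a single function. The following is a minimal sketch, not the notebook's own code: the function name `extract_title_and_text` and the sample HTML are assumptions, and it defaults to Python's built-in `html.parser` so that it runs without `lxml` installed.

```python
import re

from bs4 import BeautifulSoup


def extract_title_and_text(html, parser='html.parser'):
    """Return (title, body_text) for an HTML document."""
    soup = BeautifulSoup(html, parser)
    # Extract the title, tolerating pages that have none
    title_tag = soup.find('title')
    title = title_tag.get_text().strip() if title_tag else ''
    body = soup.find('body')
    if body is None:
        return title, ''
    # Remove script content
    for tag in body.find_all('script'):
        tag.decompose()
    # Extract text without tags
    text = body.get_text(separator='\n').strip()
    # Replace multiple newlines with a single newline
    text = re.sub(r'\n+', '\n', text)
    return title, text


# Hypothetical sample page, used here instead of a live request
sample_html = (
    '<html><head><title>Sample</title></head>'
    '<body><h1>Heading</h1><script>var x = 1;</script>'
    '<p>First paragraph.</p></body></html>')
title, text = extract_title_and_text(sample_html)
print('title = ' + title)
print(text)
```

Passing `response.content` from `requests.get` as `html` applies the same steps to a live page.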