Here's an HTML document I'll be using as a running example. It's part of a story from "Alice in Wonderland":
html_doc ="""<html><head><title>The Dormouse's story</title></head><body><p class="title"><b>The Dormouse's story</b></p><p class="story">Once upon a time there were three little sisters; and their names were<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;and they lived at the bottom of a well.</p><p class="story">...</p>"""
Running the "three sisters" document through Beautiful Soup gives us a BeautifulSoup object, which represents the document as a nested data structure:
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, 'html.parser')

print(soup.prettify())
# <html>
#  <head>
#   <title>
#    The Dormouse's story
#   </title>
#  </head>
#  <body>
#   <p class="title">
#    <b>
#     The Dormouse's story
#    </b>
#   </p>
#   <p class="story">
#    Once upon a time there were three little sisters; and their names were
#    <a class="sister" href="http://example.com/elsie" id="link1">
#     Elsie
#    </a>
#    ,
#    <a class="sister" href="http://example.com/lacie" id="link2">
#     Lacie
#    </a>
#    and
#    <a class="sister" href="http://example.com/tillie" id="link3">
#     Tillie
#    </a>
#    ; and they lived at the bottom of a well.
#   </p>
#   <p class="story">
#    ...
#   </p>
#  </body>
# </html>
Here are some simple ways to navigate that data structure:
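For instance, tag names can be used as attributes to grab the first matching element, and find() searches the whole tree. A few illustrative one-liners against the "three sisters" soup (not an exhaustive list; outputs shown as comments):

print(soup.title)
# <title>The Dormouse's story</title>

print(soup.title.string)
# The Dormouse's story

print(soup.p['class'])
# ['title']

print(soup.a)
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

print(soup.find(id="link3"))
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>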
One common task is extracting all the URLs found within a page's <a> tags:
for link in soup.find_all('a'):
    print(link.get('href'))
# http://example.com/elsie
# http://example.com/lacie
# http://example.com/tillie
Another common task is extracting all the text from a page:
print(soup.get_text())
# The Dormouse's story
#
# The Dormouse's story
#
# Once upon a time there were three little sisters; and their names were
# Elsie,
# Lacie and
# Tillie;
# and they lived at the bottom of a well.
#
# ...
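get_text() also takes an optional separator string and a strip flag, which helps when the bare newlines above are too noisy. A minimal sketch (output truncated):

# Join every piece of text with a single space, stripping surrounding whitespace first.
print(soup.get_text(" ", strip=True))
# The Dormouse's story The Dormouse's story Once upon a time there were three little sisters; ...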
Example
The following program scrapes the 247ctf.com scoreboard:
import requests
from bs4 import BeautifulSoup

# Fetch the scoreboard page and parse it.
page = requests.get('https://247ctf.com/scoreboard')
soup = BeautifulSoup(page.content, 'html.parser')

# The scoreboard is an HTML table; each row of its <tbody> is one team.
table = soup.find('table')
table_body = table.find('tbody')
rows = table_body.find_all('tr')

for row in rows:
    print('------------------------------------------------------')
    # Collect the text of every cell in the row.
    cols = [x.text.strip() for x in row.find_all('td')]
    print(f"{cols[2]} is in {cols[0]} place with {cols[4]}.")
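Note that the column indices are tied to the scoreboard's markup at the time of writing: the print format implies cols[0] holds the rank, cols[2] the team name, and cols[4] the score. If 247ctf.com changes its table layout (or drops the <tbody> wrapper), those assumptions will need revisiting.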