Sometimes, the data you need isn’t available in a clean CSV file or through an API. It’s simply displayed on a website. Web scraping is the process of automatically downloading the HTML code of a web page and extracting useful information from it.
This is an incredibly powerful tool for a business analyst, allowing you to gather competitive intelligence, track news sentiment, collect product prices, and much more.
If you do not have internet access, you can still explore the examples in this lesson. The folder includes a curated `presidents.csv` containing a snapshot of key columns (number, name, party, term dates, and vice presidents) for every U.S. president through Joe Biden. The exercise scripts look for this local file first, so you can experiment with parsing and analysis even when the Wikipedia page is unavailable. When a connection is available, you can re-run the scraper to refresh the dataset, which regenerates `presidents.json`. Git ignores these generated JSON files so your repository stays clean.
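If you want to explore the snapshot before touching any scraping code, you can load it directly with pandas. This is a minimal sketch; the exact column names are an assumption (the mock dataset in `presidents.py` uses `number`, `president`, `party`, `term_start`, `term_end`, and `vice_president`), so adjust the references to match the headers in your copy of the file.

```python
from pathlib import Path

import pandas as pd

# Load the curated offline snapshot that ships with this lesson's folder.
csv_path = Path("Day_30_Web_Scraping") / "presidents.csv"
presidents = pd.read_csv(csv_path)

# Quick sanity checks: dimensions, column names, and the first few rows.
print(presidents.shape)
print(list(presidents.columns))
print(presidents.head())

# Example question (column name assumed -- rename to match your file's headers):
if "party" in presidents.columns:
    print(presidents["party"].value_counts())
```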
**A VERY IMPORTANT NOTE ON ETHICS AND LEGALITY:**

- `robots.txt`: Always check a website's `robots.txt` file (e.g., `https://example.com/robots.txt`) to see which parts of the site you are allowed to scrape. Respect the rules.
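A quick way to perform that check from Python is the standard library's `urllib.robotparser` module. The sketch below uses the example domain from above; swap in the site and paths you actually plan to scrape.

```python
from urllib.robotparser import RobotFileParser

# Point the parser at the site's robots.txt file and download it.
parser = RobotFileParser("https://example.com/robots.txt")
parser.read()

# can_fetch() reports whether a given user agent may request a given URL.
user_agent = "*"
for url in ("https://example.com/", "https://example.com/private/"):
    allowed = parser.can_fetch(user_agent, url)
    print(f"{url} -> {'allowed' if allowed else 'disallowed'}")
```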
We will use two main libraries for web scraping:

- `requests`: A simple and elegant library for making HTTP requests to download web pages.
- `BeautifulSoup`: A library for parsing HTML and XML documents. It creates a parse tree from the page's source code that you can use to extract data.

The basic workflow is:

1. **Inspect the page:** Identify the HTML tags (e.g., `<h1>`, `<p>`, `<table>`, `<div>`) that contain the data you need. Look for unique `id` or `class` attributes on those tags.
2. **Download the page:** Use the `requests.get(url)` function to download the page's HTML content.
3. **Parse the HTML:** Pass the content to the `BeautifulSoup` constructor to create a parsable object.
4. **Find the data:** Use methods like `find()` and `find_all()` to locate the specific HTML tags containing your data.
5. **Extract the text:** Use the `.get_text()` method to extract the clean text from them.
```python
import requests
from bs4 import BeautifulSoup

url = 'http://example.com'  # A simple example page
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')

# Find the first <h1> tag
header = soup.find('h1').get_text()

# Find all <p> (paragraph) tags
paragraphs = soup.find_all('p')
first_paragraph_text = paragraphs[0].get_text()
```
Profiling helps you spot whether networking or HTML parsing is the bottleneck. Two helper commands wire into the shared profiler:
```bash
python Day_30_Web_Scraping/profile_web_scraping.py --mode cprofile
python Day_30_Web_Scraping/profile_web_scraping.py --mode timeit --local-html Day_30_Web_Scraping/books_sample.html --repeat 5 --number 3
```
The `cProfile` output shows that almost all time is spent inside `requests.Session.get`: network I/O dominates the runtime, so batching requests or caching responses offers the biggest win. For deterministic timing, use the saved `books_sample.html` page (refresh it with `curl http://books.toscrape.com/ -o Day_30_Web_Scraping/books_sample.html`). Parsing that local file takes ~0.03 seconds per iteration across five repeats, letting you focus on BeautifulSoup performance without hitting the network. Reusing a single `requests.Session` and avoiding repeated downloads can dramatically cut the cost when scraping multiple pages.
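Here is a minimal sketch of that idea, assuming you want to pull titles from a few catalogue pages on books.toscrape.com (the URLs follow the site's pagination pattern, but treat them as illustrative):

```python
import requests
from bs4 import BeautifulSoup

# One Session reuses the underlying connection across requests, which is
# noticeably cheaper than calling requests.get() once per page.
session = requests.Session()
session.headers.update({"User-Agent": "Mozilla/5.0"})

# Illustrative page URLs -- books.toscrape.com paginates its catalogue like this.
urls = [f"http://books.toscrape.com/catalogue/page-{n}.html" for n in range(1, 4)]

all_titles = []
for url in urls:
    response = session.get(url, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.content, "html.parser")
    # Each book's title lives in the 'title' attribute of the <a> inside <h3>.
    all_titles.extend(a["title"] for a in soup.select("h3 > a"))

print(f"Collected {len(all_titles)} titles across {len(urls)} pages")
```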
For these exercises, we will scrape the website `http://books.toscrape.com/`, a site specifically designed for scraping practice.

1. **Scrape Book Titles:** Download the page at `http://books.toscrape.com/` and extract every book title (they live in `<h3>` tags, within an `<a>` tag).
2. **Scrape Book Prices:** Extract every book price (they live in `<p>` tags with the class `price_color`).
3. **Create a DataFrame:** Combine the titles and prices into a pandas DataFrame and preview it with `.head()`.
🎉 Great job! Web scraping is a powerful skill that opens up a vast new source of data for your analyses. While it can be complex, mastering the basics of `requests` and `BeautifulSoup` is a huge step forward.
Run this lesson’s code interactively in your browser:
!!! tip “About JupyterLite” JupyterLite runs entirely in your browser using WebAssembly. No installation or server required! Note: First launch may take a moment to load.
???+ example “presidents.py” View on GitHub
```python title="presidents.py"
"""Day 30: Web Scraping Presidents Data.
This script prefers locally curated data so learners can work offline, falls back
to a lightweight mock dataset, and only reaches out to Wikipedia when it needs to
refresh the snapshot.
"""
import csv
import json
from pathlib import Path
from typing import Optional
import pandas as pd
import requests
MOCK_PRESIDENTS_DATA = [
{
"number": "1",
"president": "George Washington",
"term_start": "1789-04-30",
"term_end": "1797-03-04",
"party": "Independent",
"vice_president": "John Adams",
},
{
"number": "2",
"president": "John Adams",
"term_start": "1797-03-04",
"term_end": "1801-03-04",
"party": "Federalist",
"vice_president": "Thomas Jefferson",
},
{
"number": "3",
"president": "Thomas Jefferson",
"term_start": "1801-03-04",
"term_end": "1809-03-04",
"party": "Democratic-Republican",
"vice_president": "Aaron Burr; George Clinton",
},
{
"number": "4",
"president": "James Madison",
"term_start": "1809-03-04",
"term_end": "1817-03-04",
"party": "Democratic-Republican",
"vice_president": "George Clinton; Elbridge Gerry",
},
{
"number": "5",
"president": "James Monroe",
"term_start": "1817-03-04",
"term_end": "1825-03-04",
"party": "Democratic-Republican",
"vice_president": "Daniel D. Tompkins",
},
{
"number": "6",
"president": "John Quincy Adams",
"term_start": "1825-03-04",
"term_end": "1829-03-04",
"party": "Democratic-Republican",
"vice_president": "John C. Calhoun",
},
{
"number": "7",
"president": "Andrew Jackson",
"term_start": "1829-03-04",
"term_end": "1837-03-04",
"party": "Democratic",
"vice_president": "John C. Calhoun; Martin Van Buren",
},
{
"number": "8",
"president": "Martin Van Buren",
"term_start": "1837-03-04",
"term_end": "1841-03-04",
"party": "Democratic",
"vice_president": "Richard Mentor Johnson",
},
{
"number": "9",
"president": "William Henry Harrison",
"term_start": "1841-03-04",
"term_end": "1841-04-04",
"party": "Whig",
"vice_president": "John Tyler",
},
{
"number": "10",
"president": "John Tyler",
"term_start": "1841-04-04",
"term_end": "1845-03-04",
"party": "Whig (expelled)",
"vice_president": "None",
},
]
def scrape_presidents_data(url: str) -> Optional[pd.DataFrame]:
"""
Scrapes the list of US presidents from a Wikipedia page.
Args:
url (str): The URL of the Wikipedia page.
Returns:
        Optional[pd.DataFrame]: A DataFrame containing the presidents' data, or None if scraping fails.
"""
try:
print(f"🌐 Connecting to {url}...")
headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
}
response = requests.get(url, headers=headers, timeout=30)
response.raise_for_status()
print("📊 Parsing HTML tables...")
# Use response.text instead of response.content for pd.read_html
tables = pd.read_html(response.text)
if not tables:
print("❌ No tables found on the page")
return None
print(f"✅ Found {len(tables)} tables on the page")
# The first table is typically the one we want
presidents_df = tables[0].copy()
print(f"📋 Original data shape: {presidents_df.shape}")
# Clean up the data
# Drop the last row if it appears to be a footnote (common in Wikipedia tables)
if len(presidents_df) > 1:
presidents_df = presidents_df.iloc[:-1].copy()
# Replace empty strings with NaN
presidents_df.replace("", float("NaN"), inplace=True)
presidents_df.replace(
"—", float("NaN"), inplace=True
) # Common dash used in Wikipedia
# Drop columns that are all NaN
presidents_df.dropna(how="all", axis=1, inplace=True)
print(f"📋 Cleaned data shape: {presidents_df.shape}")
print(f"📋 Columns: {list(presidents_df.columns)}")
return presidents_df
except requests.exceptions.Timeout:
print("❌ Request timed out. The server might be slow.")
return None
except requests.exceptions.ConnectionError:
print("❌ Connection error. Please check your internet connection.")
return None
except requests.exceptions.HTTPError as e:
print(f"❌ HTTP error occurred: {e}")
return None
except requests.exceptions.RequestException as e:
print(f"❌ Error downloading the page: {e}")
return None
except (IndexError, KeyError) as e:
print(f"❌ Error parsing the table: {e}")
print("💡 The website structure may have changed.")
return None
except Exception as e:
print(f"❌ Unexpected error occurred: {e}")
return None
def convert_csv_to_json(csv_file_path: str, json_file_path: str) -> None:
"""
Converts a CSV file to a JSON file.
The JSON file will have the first column of the CSV as keys.
Args:
csv_file_path (str): The path to the input CSV file.
json_file_path (str): The path to the output JSON file.
"""
data = {}
try:
print(f"📄 Converting {csv_file_path} to {json_file_path}...")
with open(csv_file_path, encoding="utf-8") as csv_file:
csv_reader = csv.DictReader(csv_file)
# Check if fieldnames exist and get the first column
if not csv_reader.fieldnames:
print("❌ No column headers found in CSV file")
return
key_column = csv_reader.fieldnames[0]
print(f"📊 Using '{key_column}' as the key column")
row_count = 0
for row in csv_reader:
key = row[key_column]
if key: # Only add rows with non-empty keys
data[key] = row
row_count += 1
print(f"✅ Processed {row_count} records")
with open(json_file_path, "w", encoding="utf-8") as json_file:
json.dump(data, json_file, indent=4, ensure_ascii=False)
print(f"✅ Successfully converted to JSON: {json_file_path}")
except FileNotFoundError:
print(f"❌ Error: The file {csv_file_path} was not found.")
except PermissionError:
print("❌ Error: Permission denied when accessing files.")
except Exception as e:
print(f"❌ An error occurred during conversion: {e}")
def save_mock_json(json_file_path: Path) -> None:
"""Persist the lightweight mock dataset to a JSON file."""
print("📄 Local CSV not found. Using built-in mock dataset.")
data = {item["number"]: item for item in MOCK_PRESIDENTS_DATA}
with json_file_path.open("w", encoding="utf-8") as json_file:
json.dump(data, json_file, indent=4, ensure_ascii=False)
print(f"✅ Mock JSON file created: {json_file_path}")
def main():
"""
Main function to scrape, process, and save the presidents' data.
"""
print("🕸️ Day 30: Web Scraping Presidents Data")
print("🏛️ Scraping US Presidents data from Wikipedia")
print("=" * 50)
# URL of the Wikipedia page with the list of US presidents
presidents_url = (
"https://en.wikipedia.org/wiki/List_of_presidents_of_the_United_States"
)
# Define file paths relative to the script's location
# This makes the script more portable
base_dir = Path(__file__).resolve().parent
csv_path = base_dir / "presidents.csv"
json_path = base_dir / "presidents.json"
# The JSON outputs are ignored by git so learners can regenerate them
# locally without creating untracked files.
print(f"📁 Output files will be saved to: {base_dir}")
# Scrape the data
if csv_path.exists():
print("📄 Found curated CSV. Converting to JSON without scraping.")
convert_csv_to_json(str(csv_path), str(json_path))
return
# If the curated CSV is missing, provide learners with the mock dataset first.
save_mock_json(json_path)
presidents_df = scrape_presidents_data(presidents_url)
if presidents_df is not None:
try:
# Save the DataFrame to a temporary CSV file, without the index
temp_csv_path = base_dir / "presidents_download.csv"
print("💾 Saving scraped data to a temporary CSV...")
presidents_df.to_csv(temp_csv_path, index=False, encoding="utf-8")
print(f"✅ CSV file created: {temp_csv_path}")
# Convert the CSV to JSON
convert_csv_to_json(str(temp_csv_path), str(json_path))
# Remove the temporary CSV file
try:
temp_csv_path.unlink()
print("🧹 Removed temporary CSV file")
except OSError as e:
print(
f"⚠️ Warning: Could not remove temporary file {temp_csv_path}: {e}"
)
# Verify the JSON file was created and show some info
if json_path.exists():
file_size = json_path.stat().st_size
print(f"\n🎉 Success! Created '{json_path}' ({file_size:,} bytes)")
# Show a preview of the data
try:
with json_path.open("r", encoding="utf-8") as f:
data = json.load(f)
print(f"📊 Total presidents in dataset: {len(data)}")
if data:
first_key = next(iter(data))
print(
f"📋 Sample entry keys: {list(data[first_key].keys())[:5]}..."
)
except Exception as e:
print(f"⚠️ Could not preview JSON data: {e}")
else:
print("❌ JSON file was not created successfully")
except Exception as e:
print(f"❌ Error in main processing: {e}")
else:
print("❌ Failed to scrape presidents data.")
print("💡 This could be due to:")
print(" • Network connectivity issues")
print(" • Wikipedia page structure changes")
print(" • Rate limiting on Wikipedia")
print(" • Server blocking the request")
print(" • Temporary website unavailability")
print("📦 Continuing with the mock dataset so you can keep practicing offline.")
if __name__ == "__main__":
main()
```
???+ example “profile_web_scraping.py” View on GitHub
```python title="profile_web_scraping.py"
"""Profile the book scraping workflow used in the web scraping lesson."""
from __future__ import annotations
import argparse
import sys
from pathlib import Path
from typing import Callable
try:
from mypackage.profiling import print_report, profile_callable
except ImportError:
PROJECT_ROOT = Path(__file__).resolve().parents[1]
if str(PROJECT_ROOT) not in sys.path:
sys.path.append(str(PROJECT_ROOT))
from mypackage.profiling import print_report, profile_callable
try: # pragma: no cover - runtime guard for direct execution
from .web_scraping import URL, process_book_data, scrape_books
except ImportError: # pragma: no cover - allows ``python profile_web_scraping.py``
CURRENT_DIR = Path(__file__).resolve().parent
if str(CURRENT_DIR) not in sys.path:
sys.path.append(str(CURRENT_DIR))
from web_scraping import URL, process_book_data, scrape_books # type: ignore
def build_pipeline(url: str, html_path: Path | None) -> Callable[[], None]:
"""Return a callable that performs the scraping workflow."""
if html_path is not None:
html_bytes = html_path.read_bytes()
def pipeline() -> None:
process_book_data(html_bytes)
else:
def pipeline() -> None:
scrape_books(url)
return pipeline
def main() -> None:
parser = argparse.ArgumentParser(description=__doc__)
parser.add_argument(
"--mode",
choices=("cprofile", "timeit"),
default="cprofile",
help="Profiling backend to use (default: cprofile)",
)
parser.add_argument(
"--url",
default=URL,
help="Target URL to scrape when not using --local-html",
)
parser.add_argument(
"--local-html",
type=Path,
help="Optional path to a saved HTML page for offline profiling",
)
parser.add_argument(
"--repeat",
type=int,
default=5,
help="Number of timing repeats when --mode=timeit",
)
parser.add_argument(
"--number",
type=int,
default=1,
help="Number of calls per repeat when --mode=timeit",
)
args = parser.parse_args()
if args.mode == "timeit" and args.local_html is None:
raise SystemExit(
"--mode=timeit requires --local-html to avoid repeated network calls"
)
pipeline = build_pipeline(url=args.url, html_path=args.local_html)
profile_report, timing_report = profile_callable(
pipeline,
mode=args.mode,
repeat=args.repeat,
number=args.number,
)
print_report(profile_report=profile_report, timing_report=timing_report)
if __name__ == "__main__":
main()
```
???+ example “solutions.py” View on GitHub
```python title="solutions.py"
"""
Day 30: Solutions to Exercises
"""
import bs4 # Add this import for type checking
import pandas as pd
import requests
from bs4 import BeautifulSoup
URL = "http://books.toscrape.com/"
headers = {"User-Agent": "Mozilla/5.0"}
try:
response = requests.get(URL, headers=headers)
response.raise_for_status()
soup = BeautifulSoup(response.content, "html.parser")
print("Successfully connected to books.toscrape.com")
except requests.exceptions.RequestException as e:
print(f"Could not connect to the website: {e}")
soup = None
if soup:
# --- Exercise 1: Scrape Book Titles ---
print("\n--- Solution to Exercise 1 ---")
# Find all <h3> tags, then find the <a> tag inside, and get its 'title' attribute
title_tags = soup.find_all("h3")
titles = []
for tag in title_tags:
if isinstance(tag, bs4.element.Tag):
a_tag = tag.find("a")
if isinstance(a_tag, bs4.element.Tag) and "title" in a_tag.attrs:
titles.append(a_tag.get("title"))
print("Found the following titles:")
# Print first 5 for brevity
for title in titles[:5]:
print(f"- {title}")
print("-" * 20)
# --- Exercise 2: Scrape Book Prices ---
print("\n--- Solution to Exercise 2 ---")
# Find all <p> tags that have the class 'price_color'
price_tags = soup.find_all("p", class_="price_color")
prices = [tag.get_text() for tag in price_tags]
print("Found the following prices:")
# Print first 5 for brevity
for price in prices[:5]:
print(f"- {price}")
print("-" * 20)
# --- Exercise 3: Create a DataFrame ---
print("\n--- Solution to Exercise 3 ---")
# Check if we have the same number of titles and prices
if len(titles) == len(prices):
book_df = pd.DataFrame({"Title": titles, "Price": prices})
print("Created DataFrame from scraped data:")
print(book_df.head())
else:
print(
"Mismatch between number of titles and prices found. Cannot create DataFrame."
)
print("-" * 20)
else:
print("\nSkipping exercises as the website could not be scraped.")
```
???+ example “web_scraping.py” View on GitHub
```python title="web_scraping.py"
"""
Day 30: Web Scraping in Practice
This script demonstrates the fundamentals of web scraping by
extracting book titles and prices from a practice website.
This educational example shows how to:
- Make HTTP requests with proper headers
- Parse HTML content with BeautifulSoup
- Handle errors gracefully
- Extract and clean data
- Structure data in pandas DataFrame
Author: 50 Days of Python Course
Purpose: Educational example for MBA students
"""
import time
from typing import Any, Dict, Optional, Tuple
import bs4
import pandas as pd
import requests
from bs4 import BeautifulSoup
# The URL of the website we want to scrape
# This site is specifically designed for scraping practice.
URL = "http://books.toscrape.com/"
class ScrapingError(Exception):
"""Custom exception for scraping errors."""
def scrape_books(
url: str, session: Optional[requests.Session] = None
) -> Tuple[pd.DataFrame, pd.DataFrame, Dict[str, Any]]:
"""
Scrape book data from the given URL.
Args:
url (str): The URL to scrape
Returns:
Tuple containing the raw scraped DataFrame, the cleaned DataFrame, and
a dictionary of summary statistics.
"""
# --- 1. Download the HTML Content ---
# Use requests.get() to download the page.
# It's good practice to include a 'User-Agent' header to identify your script.
headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
}
session = session or requests.Session()
try:
response = session.get(url, headers=headers, timeout=10)
response.raise_for_status()
except requests.exceptions.Timeout:
raise
except requests.exceptions.ConnectionError:
raise
except requests.exceptions.HTTPError:
raise
except requests.exceptions.RequestException as exc:
raise ScrapingError("Error downloading the page") from exc
# If we get here, the request was successful
return process_book_data(response.content)
def process_book_data(
html_content: bytes,
) -> Tuple[pd.DataFrame, pd.DataFrame, Dict[str, Any]]:
"""
Process the HTML response and extract book data.
Args:
        html_content: The raw HTML bytes of the page to parse
Returns:
Tuple containing the raw scraped DataFrame, the cleaned DataFrame, and
a dictionary of summary statistics.
"""
# --- 2. Create a BeautifulSoup Object ---
# This object parses the HTML content and makes it searchable.
soup = BeautifulSoup(html_content, "html.parser")
# --- 3. Find and Extract Data ---
# We inspected the page and found that book information is within <article> tags with the class 'product_pod'
books = soup.find_all("article", class_="product_pod")
if not books:
raise ValueError("No books found in the provided HTML content")
titles = []
prices = []
# Loop through each book found on the page
for book in books:
# Type check to ensure book is a Tag
if not isinstance(book, bs4.element.Tag):
continue
# The title is in an 'a' tag within an 'h3' tag.
# We access the 'title' attribute of the 'a' tag.
h3_tag = book.find("h3")
if isinstance(h3_tag, bs4.element.Tag):
a_tag = h3_tag.find("a")
if isinstance(a_tag, bs4.element.Tag):
title = a_tag.get("title")
titles.append(str(title) if title else "N/A")
else:
titles.append("N/A")
else:
titles.append("N/A")
# The price is in a 'p' tag with the class 'price_color'
price_tag = book.find("p", attrs={"class": "price_color"})
if isinstance(price_tag, bs4.element.Tag):
price_text = price_tag.get_text(strip=True)
prices.append(price_text)
else:
prices.append("N/A")
# --- 4. Structure the Data in a DataFrame ---
if not titles or not prices or len(titles) != len(prices):
raise ValueError("Mismatch between titles and prices in the HTML content")
book_data = pd.DataFrame({"Title": titles, "Price": prices})
# --- 5. Data Cleaning (Bonus) ---
clean_data = book_data.copy()
clean_data["Price_Float"] = pd.to_numeric(
clean_data["Price"].str.replace("£", "", regex=False), errors="coerce"
)
clean_data = clean_data.dropna(subset=["Price_Float"]).copy()
if clean_data.empty:
return book_data, clean_data, {}
# --- 6. Basic Analysis ---
price_series = clean_data["Price_Float"]
analysis: Dict[str, Any] = {
"average_price": float(price_series.mean()),
"min_price": float(price_series.min()),
"max_price": float(price_series.max()),
"count": int(len(clean_data)),
}
most_expensive = clean_data.loc[price_series.idxmax()]
cheapest = clean_data.loc[price_series.idxmin()]
analysis["most_expensive_title"] = most_expensive["Title"]
analysis["most_expensive_price"] = most_expensive["Price"]
analysis["cheapest_title"] = cheapest["Title"]
analysis["cheapest_price"] = cheapest["Price"]
return book_data, clean_data, analysis
def main():
"""
Main function to demonstrate web scraping workflow.
"""
print("🕸️ Day 30: Web Scraping Demonstration")
print("📚 Scraping book data from books.toscrape.com")
print("=" * 50)
# Add a small delay to be respectful to the server
print("⏳ Starting scraping process...")
time.sleep(1)
# Execute the scraping
try:
print(f"🌐 Connecting to {URL}...")
raw_df, clean_df, analysis = scrape_books(URL)
except requests.exceptions.Timeout:
print("❌ Request timed out. The server might be slow or unresponsive.")
return
except requests.exceptions.ConnectionError:
print("❌ Connection error. Please check your internet connection.")
return
except requests.exceptions.HTTPError as exc:
print(f"❌ HTTP error occurred: {exc}")
return
except ScrapingError as exc:
print(f"❌ {exc}")
print("💡 This could be due to:")
print(" • Network connectivity issues")
print(" • Website being temporarily unavailable")
print(" • Blocked by website's anti-bot protection")
print(" • URL has changed or is incorrect")
return
except ValueError as exc:
print(f"❌ {exc}")
print("💡 The website structure may have changed. Try updating the parser.")
return
print("✅ Successfully downloaded the content!")
print(f"📊 Total books scraped: {len(raw_df)}")
if clean_df.empty:
print("⚠️ No valid price data found for analysis.")
return
print("\n--- Sample of Scraped Book Data ---")
print(raw_df.head(10))
print("\n--- Cleaned Price Data ---")
print(clean_df.head(10))
print("\n📈 Basic Price Analysis:")
print(f" Average price: £{analysis['average_price']:.2f}")
print(f" Minimum price: £{analysis['min_price']:.2f}")
print(f" Maximum price: £{analysis['max_price']:.2f}")
print(f" Number of books: {analysis['count']}")
print(
f"💰 Most expensive: '{analysis['most_expensive_title']}' - {analysis['most_expensive_price']}"
)
print(f"💸 Cheapest: '{analysis['cheapest_title']}' - {analysis['cheapest_price']}")
print("\n💡 Next steps you could take:")
print(" • Save data to CSV: clean_df.to_csv('books.csv', index=False)")
print(" • Filter books by price range")
print(" • Scrape additional pages for more data")
print(" • Add more data fields (ratings, availability, etc.)")
if __name__ == "__main__":
main()
```
???+ example “web_scraping_bu.py” View on GitHub
```python title="web_scraping_bu.py"
"""Scrape Boston University's facts & stats page and save the results to JSON.

Each "facts-wrapper" block on the page becomes one dictionary (a "Category" label
plus its key/value pairs). The parsing below slices the raw HTML strings directly
instead of navigating the parse tree, which is brittle but makes the underlying
markup easy to see.
"""

import json
from pathlib import Path
import requests
from bs4 import BeautifulSoup
url1 = "https://www.bu.edu/president/boston-university-facts-stats/"
response1 = requests.get(url1)
content1 = response1.content
soup1 = BeautifulSoup(content1, "html.parser")
tables = soup1.find_all("div", class_="facts-wrapper")
list_of_tables = []
for i in tables:
keys = []
values = []
temp_dict = {}
i = str(i)
category = i[i.find("<h5>") + 4 : i.find("</h5>")]
temp_dict["Category"] = category
all_key_start_indexes = [x + 7 for x in range(len(i)) if i.startswith('"text">', x)]
all_key_end_indexes = [x for x in range(len(i)) if i.startswith("</p>", x)]
for start_index, end_index in zip(all_key_start_indexes, all_key_end_indexes):
keys.append(i[start_index:end_index])
all_values_start_indexes = []
for v in range(len(i)):
if i.startswith('value">', v):
all_values_start_indexes.append(v + 7)
if i.startswith('value-text">', v):
all_values_start_indexes.append(v + 12)
all_values_end_indexes = [x for x in range(len(i)) if i.startswith("</span>", x)]
for m in range(len(all_values_end_indexes)):
values.append(i[all_values_start_indexes[m] : all_values_end_indexes[m]])
for r in range(len(keys)):
temp_dict[keys[r]] = values[r]
list_of_tables.append(temp_dict)
output_path = Path(__file__).resolve().parent / "scraped_exercise_1.json"
with output_path.open("w", encoding="utf-8") as fp:
json.dump(list_of_tables, fp)
```