We’ve seen lists for ordered data and tuples for immutable data. Now we’ll learn about sets, which are powerful for two main business reasons: ensuring uniqueness and performing membership analysis.
A set is an unordered collection of unique items. Because a set cannot hold the same value twice, converting a list to a set removes any duplicates.
This de-duplication is one of the most common uses for sets in data analysis.
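As a minimal sketch of de-duplication, the snippet below converts a list to a set. The `order_cities` values are borrowed from this lesson's first exercise. Wrapping the result in `sorted()` is an addition made here, used only because a set has no guaranteed order when printed.

```python
# Example list borrowed from Exercise 1 in this lesson.
order_cities = ["New York", "Los Angeles", "Chicago", "New York", "Boston", "Los Angeles"]

# Converting to a set keeps exactly one copy of each value.
unique_cities = set(order_cities)

# Sets are unordered, so sort before displaying to get a stable result.
sorted_cities = sorted(unique_cities)

print(sorted_cities)       # ['Boston', 'Chicago', 'Los Angeles', 'New York']
print(len(unique_cities))  # 4
```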
The true power of sets comes from their mathematical operations, which are invaluable for customer segmentation and cohort analysis.

| Operation | Python Operator | Business Question Answered |
|---|---|---|
| Union | A \| B | What is the total unique audience for two groups? |
| Intersection | A & B | Which customers are in both Group A and Group B? |
| Difference | A - B | Which customers are in Group A but not in Group B? |

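The three operators can be tried on a pair of small sets. This is only an illustrative sketch: the group names and customer names below are hypothetical and do not come from the lesson script.

```python
# Hypothetical customer groups, for illustration only.
group_a = {"alice", "bob", "carol"}
group_b = {"bob", "carol", "dave"}

total_audience = group_a | group_b  # union: everyone in either group
in_both = group_a & group_b         # intersection: members of both groups
only_a = group_a - group_b          # difference: in A but not in B

# sorted() is used only to make the printed order predictable.
print(sorted(total_audience))  # ['alice', 'bob', 'carol', 'dave']
print(sorted(in_both))         # ['bob', 'carol']
print(sorted(only_a))          # ['alice']
```

Note that difference is not symmetric: `group_b - group_a` would give `{'dave'}` instead.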
Before you begin, ensure you have followed the setup instructions in the main `README.md` to set up your virtual environment and install the required libraries.

The script for this lesson, `sets.py`, has been refactored into functions to make the logic for de-duplication and segmentation reusable and testable.

1. Review the code: open `Day_07_Sets/sets.py`. Notice the functions `get_unique_items()`, `analyze_visitor_segments()`, and `upgrade_plan_features()`.
2. Run the script: from the project's root directory (`Coding-For-MBA`), run the script to see the functions in action: `python Day_07_Sets/sets.py`
3. Run the tests: `pytest tests/test_day_07.py`
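
To give a feel for why functions make the logic testable, here is a hypothetical pytest-style test. The contents of `tests/test_day_07.py` are not shown in this lesson, so the test name and inputs below are assumptions, and the function is copied inline from the lesson script so the sketch runs on its own.

```python
def analyze_visitor_segments(set_a, set_b):
    """Copied from the lesson script so this sketch is self-contained."""
    return {
        "intersection": set_a.intersection(set_b),
        "difference_a_b": set_a.difference(set_b),
        "union": set_a.union(set_b),
    }


def test_analyze_visitor_segments():
    # Hypothetical test case; the real tests/test_day_07.py may differ.
    result = analyze_visitor_segments({"user1", "user3"}, {"user3", "user4"})
    assert result["intersection"] == {"user3"}
    assert result["difference_a_b"] == {"user1"}
    assert result["union"] == {"user1", "user3", "user4"}


# pytest discovers functions named test_*; calling one directly also works.
test_analyze_visitor_segments()
```

Because the function takes its inputs as arguments and returns a dictionary, a test can check each segment with plain equality comparisons.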
1. Find Unique Customer Cities:
    - In a new script (`my_solutions_07.py`), you have a list of cities: `order_cities = ["New York", "Los Angeles", "Chicago", "New York", "Boston", "Los Angeles"]`.
    - Use the `get_unique_items` function from the lesson script to find the unique cities.
2. Analyze Website Visitor Activity:
    - You have two sets of visitors: `pricing_visitors = {"user1", "user3", "user5", "user7"}` and `contact_visitors = {"user2", "user3", "user4", "user5"}`.
    - Pass both sets to the `analyze_visitor_segments` function.
    - Use `intersection` and `difference_a_b` from the returned dictionary to find highly engaged users and users who only viewed pricing.
3. Manage Product Features:
    - You have a set of standard features: `standard_features = {"reporting", "data_export", "basic_support"}`.
    - You have a list of new features, `["api_access", "priority_support"]`, for the “Pro Plan”.
    - Use the `upgrade_plan_features` function to create the new feature set for the Pro Plan.

🎉 Well done! Sets are a specialized but incredibly efficient tool. When you need to de-duplicate a list or analyze the overlap between two groups, sets are the best tool for the job.
Run this lesson’s code interactively in your browser:
!!! tip "About JupyterLite"
    JupyterLite runs entirely in your browser using WebAssembly. No installation or server required! Note: First launch may take a moment to load.
???+ example "sets.py"
    ```python title="sets.py"
    """
    Day 7: Using Sets for Unique Data and Segmentation (Refactored)

    This script demonstrates how to use sets to de-duplicate data
    and perform segmentation analysis on business data. This version
    is refactored into functions for better organization and testability.
    """


    def get_unique_items(items_list):
        """Converts a list to a set to get unique items."""
        return set(items_list)


    def analyze_visitor_segments(set_a, set_b):
        """
        Performs intersection, difference, and union operations on two sets.
        Returns a dictionary with the results.
        """
        intersection = set_a.intersection(set_b)
        difference = set_a.difference(set_b)
        union = set_a.union(set_b)
        return {"intersection": intersection, "difference_a_b": difference, "union": union}


    def upgrade_plan_features(base_features, new_features_list):
        """
        Adds new features to a base set of features.
        """
        # Copy first so the caller's base set is not modified.
        upgraded_plan = base_features.copy()
        upgraded_plan.update(new_features_list)
        return upgraded_plan


    if __name__ == "__main__":
        # --- Using a Set to Find Unique Items ---
        print("--- Finding Unique Customer Cities ---")
        order_cities_list = [
            "New York",
            "Los Angeles",
            "Chicago",
            "New York",
            "Boston",
            "Los Angeles",
            "Chicago",
        ]
        print(f"Original list of cities: {order_cities_list}")
        unique_cities_set = get_unique_items(order_cities_list)
        print(f"Unique cities set: {unique_cities_set}")
        print(f"Number of unique cities: {len(unique_cities_set)}")
        print("-" * 20)

        # --- Using Set Operations for Customer Segmentation ---
        print("--- Analyzing Website Visitor Segments ---")
        pricing_page_visitors = {"user1", "user3", "user5", "user7", "user8"}
        contact_page_visitors = {"user2", "user3", "user4", "user5", "user9"}
        segment_analysis = analyze_visitor_segments(
            pricing_page_visitors, contact_page_visitors
        )
        print(
            f"Users who visited Pricing AND Contact pages: {segment_analysis['intersection']}"
        )
        print(
            f"Users who only visited the Pricing page: {segment_analysis['difference_a_b']}"
        )
        print(f"All unique visitors to either page: {segment_analysis['union']}")
        print("-" * 20)

        # --- Modifying Sets to Manage Product Plans ---
        print("--- Managing Product Plan Features ---")
        standard_plan_features = {"reporting", "data_export", "basic_support"}
        print(f"Standard Plan Features: {standard_plan_features}")
        features_to_add_for_pro = ["api_access", "priority_support", "24/7_monitoring"]
        pro_plan_features = upgrade_plan_features(
            standard_plan_features, features_to_add_for_pro
        )
        print(f"Pro Plan Features: {pro_plan_features}")
        pro_only = pro_plan_features.difference(standard_plan_features)
        print(f"Features unique to the Pro Plan: {pro_only}")
        print("-" * 20)
    ```
???+ example "solutions.py"
    ```python title="solutions.py"
    """
    Day 7: Solutions to Exercises
    """

    # --- Exercise 1: Find Unique Customer Cities ---
    print("--- Solution to Exercise 1 ---")
    order_cities = [
        "New York",
        "Los Angeles",
        "Chicago",
        "New York",
        "Boston",
        "Los Angeles",
    ]
    print(f"Original list of cities: {order_cities}")
    # Converting the list to a set automatically removes duplicates
    unique_cities = set(order_cities)
    print(f"Set of unique cities: {unique_cities}")
    print(f"Number of unique cities where orders were placed: {len(unique_cities)}")
    print("-" * 20)

    # --- Exercise 2: Analyze Website Visitor Activity ---
    print("--- Solution to Exercise 2 ---")
    pricing_visitors = {"user1", "user3", "user5", "user7"}
    contact_visitors = {"user2", "user3", "user4", "user5"}
    print(f"Pricing Page Visitors: {pricing_visitors}")
    print(f"Contact Page Visitors: {contact_visitors}")
    # Intersection: users who did both
    both_pages_visitors = pricing_visitors.intersection(contact_visitors)
    print(f"Users who visited BOTH pages: {both_pages_visitors}")
    # Difference: users who visited pricing but not contact
    pricing_only_visitors = pricing_visitors.difference(contact_visitors)
    print(f"Users who visited Pricing but NOT Contact: {pricing_only_visitors}")
    # Union: all unique users who visited either page
    all_visitors = pricing_visitors.union(contact_visitors)
    print(f"All unique visitors to either page: {all_visitors}")
    print("-" * 20)

    # --- Exercise 3: Manage Product Features ---
    print("--- Solution to Exercise 3 ---")
    standard_features = {"reporting", "data_export", "basic_support"}
    print(f"Standard Plan Features: {standard_features}")
    # Create a copy to avoid modifying the original set
    pro_features = standard_features.copy()
    # New features to add
    new_pro_features = ["api_access", "priority_support"]
    pro_features.update(new_pro_features)
    print(f"Pro Plan Features after update: {pro_features}")
    print("-" * 20)
    ```
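
One detail worth noting about the Exercise 3 solution is that the new features arrive as a list, not a set. The sketch below, which is an illustration and not part of the lesson's files, shows that set methods such as `union()` accept any iterable, while the `|` operator requires both operands to be sets. The variable `operator_accepted_list` is a hypothetical name introduced only for this demonstration.

```python
standard_features = {"reporting", "data_export", "basic_support"}
new_pro_features = ["api_access", "priority_support"]  # a list, not a set

# The method form accepts any iterable and returns a new set,
# leaving standard_features untouched.
pro_features = standard_features.union(new_pro_features)

# The operator form raises TypeError when one operand is a list.
try:
    standard_features | new_pro_features
    operator_accepted_list = True
except TypeError:
    operator_accepted_list = False

print(len(pro_features))       # 5
print(operator_accepted_list)  # False
print(len(standard_features))  # 3
```

To use the operator with a list, convert it first, for example `standard_features | set(new_pro_features)`.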