Day 83 – BI Cloud and Modern Data Stack

This lesson is part of the Phase 5 Business Intelligence specialization. Use the Phase 5 overview to see how the developer-roadmap topics align across Days 68–84.

Why it matters

Modern BI teams assemble cloud-native tooling that balances time-to-value, governance, and spend. Understanding how the cloud ecosystem fits together ensures analysts can navigate trade-offs when choosing warehouses, integration layers, and visualization services.

Developer-roadmap alignment

  • Cloud Computing Basics
  • Cloud BI Ecosystem
  • Cloud data warehouses
  • Providers: AWS, GCP, Azure
  • Cloud

Cloud architecture patterns

| Pattern | Components | Feature focus | Cost trade-off |
| --- | --- | --- | --- |
| Centralized warehouse with semantic layer | Serverless warehouse, ELT pipelines, BI semantic model | Curated metrics exposed through governed BI layers | Reserved capacity discounts exchange flexibility for governance licensing costs |
| Lakehouse with streaming ingestion | Object storage, streaming ingestion, open table formats, SQL endpoints | Unified analytics supporting dashboards and ML on the same platform | Streaming autoscale fees must be balanced against freshness SLAs |
| Composable stack with reverse ETL | Cloud warehouse, transformation service, reverse ETL activations | Operationalizes analytics inside SaaS tools without duplicating logic | Connector-based pricing introduces variable spend per downstream system |

Provider evaluation checklist

  • Confirm the managed warehouse option (Redshift, BigQuery, Synapse) and how it scales.
  • Map analytics services (QuickSight, Looker, Power BI) to stakeholder use cases.
  • Align orchestration choices (Step Functions, Cloud Composer, Data Factory) with existing engineering standards.
  • Capture pricing guardrails, including autosuspend, flat-rate commitments, and hybrid benefits.
  • Note governance integrations such as IAM, Dataplex, and Purview for security reviews.
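The checklist above can be turned into a lightweight scoring sheet for vendor shortlists. The criteria weights and 1–5 scores below are illustrative workshop placeholders, not vendor recommendations:

```python
# Illustrative provider scoring sheet. Weights and scores are made-up
# placeholders for a facilitation exercise, not vendor recommendations.
CRITERIA_WEIGHTS = {
    "warehouse_scaling": 0.3,
    "bi_tool_fit": 0.25,
    "orchestration_fit": 0.2,
    "pricing_guardrails": 0.15,
    "governance": 0.1,
}

# Hypothetical 1-5 scores captured during a vendor review.
SCORES = {
    "AWS": {"warehouse_scaling": 4, "bi_tool_fit": 3, "orchestration_fit": 4,
            "pricing_guardrails": 4, "governance": 4},
    "GCP": {"warehouse_scaling": 5, "bi_tool_fit": 4, "orchestration_fit": 4,
            "pricing_guardrails": 3, "governance": 4},
    "Azure": {"warehouse_scaling": 4, "bi_tool_fit": 5, "orchestration_fit": 3,
              "pricing_guardrails": 4, "governance": 5},
}


def weighted_score(scores: dict, weights: dict) -> float:
    """Combine per-criterion scores into a single weighted total."""
    return round(sum(scores[criterion] * weight for criterion, weight in weights.items()), 2)


# Rank providers by weighted total, highest first.
shortlist = sorted(
    ((provider, weighted_score(s, CRITERIA_WEIGHTS)) for provider, s in SCORES.items()),
    key=lambda item: item[1],
    reverse=True,
)
for provider, total in shortlist:
    print(f"{provider}: {total}")
```

Keeping the weights explicit forces the team to debate priorities (governance vs. pricing flexibility) before any vendor demo anchors the discussion.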

Next steps

  • Use the comparison matrix in lesson.py to facilitate vendor shortlists.
  • Draft cost scenarios that highlight egress, autoscaling, and reserved capacity for each provider.
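As a starting point for those cost scenarios, a back-of-the-envelope model keeps the egress, autoscaling, and reserved-capacity assumptions explicit. Every rate below is a hypothetical placeholder; substitute current provider pricing before using the numbers in a real review:

```python
# Back-of-the-envelope monthly cost scenario. Every rate here is a
# hypothetical placeholder -- substitute current provider pricing.
def monthly_cost(
    compute_hours: float,
    on_demand_rate: float,      # $ per compute hour
    reserved_fraction: float,   # share of hours covered by reserved capacity
    reserved_discount: float,   # e.g. 0.35 -> 35% off the on-demand rate
    egress_gb: float,
    egress_rate: float,         # $ per GB leaving the platform
) -> float:
    """Estimate monthly spend, splitting compute into reserved and on-demand."""
    reserved_hours = compute_hours * reserved_fraction
    on_demand_hours = compute_hours - reserved_hours
    compute = (
        reserved_hours * on_demand_rate * (1 - reserved_discount)
        + on_demand_hours * on_demand_rate
    )
    return round(compute + egress_gb * egress_rate, 2)


# Compare an autoscaling-heavy month with a mostly-reserved one.
bursty = monthly_cost(compute_hours=400, on_demand_rate=2.0,
                      reserved_fraction=0.2, reserved_discount=0.35,
                      egress_gb=500, egress_rate=0.09)
steady = monthly_cost(compute_hours=400, on_demand_rate=2.0,
                      reserved_fraction=0.8, reserved_discount=0.35,
                      egress_gb=500, egress_rate=0.09)
print(f"bursty: ${bursty}, steady: ${steady}")
```

Varying `reserved_fraction` alone shows how much of the spread between the two scenarios is driven by commitment level rather than workload size, which is exactly the trade-off the architecture patterns table flags.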

Additional Topic: Career Assets & Credentials


Why it matters

A deliberate career evidence plan turns BI project outcomes into portfolio pieces, certifications, and networking conversations that hiring teams can verify.

Developer-roadmap alignment

  • Building Your Portfolio
  • Job Preparation
  • Certifications
  • Networking

Next steps

  • Draft case studies and notebooks that exercise these roadmap nodes.
  • Update the Phase 5 cheat sheet with the insights you capture here.

Previous: Day 82 – BI ETL and Pipeline Automation • Next: Day 84 – BI Career Development and Capstone

You are on lesson 83 of 108.

Additional Materials

lesson.py
# %%
"""Day 83 – BI Cloud and Modern Data Stack classroom script."""

# %%
from __future__ import annotations

from typing import Mapping

import pandas as pd

from Day_83_BI_Cloud_and_Modern_Data_Stack import (
    build_cloud_topic_dataframe,
    build_provider_comparison_frame,
    group_cloud_topics,
)

# %%
CLOUD_GROUPS = group_cloud_topics()
CLOUD_TOPIC_FRAME = build_cloud_topic_dataframe()
PROVIDER_FRAME = build_provider_comparison_frame()

CLOUD_ARCHITECTURE_PATTERNS: Mapping[str, Mapping[str, str]] = {
    "Centralized warehouse with semantic layer": {
        "components": "Serverless warehouse, ELT pipelines, BI semantic model",
        "strength": "Balances governed data with curated metrics exposed through BI tools.",
        "cost_trade_off": (
            "Reserved capacity lowers compute rates, but semantic modeling requires "
            "licensing for governance layers."
        ),
    },
    "Lakehouse with streaming ingestion": {
        "components": "Object storage, streaming ingestion, open table formats, SQL endpoints",
        "strength": "Enables near real-time dashboards while supporting ML workloads on the same lake.",
        "cost_trade_off": (
            "Storage remains inexpensive, yet streaming autoscale costs must be tracked "
            "against refresh SLAs."
        ),
    },
    "Composable stack with reverse ETL": {
        "components": "Cloud warehouse, transformation service, reverse ETL activations",
        "strength": "Delivers analytics in operational tools without duplicating governance logic.",
        "cost_trade_off": (
            "SaaS integration fees add up, so teams trade platform simplicity for per-connector charges."
        ),
    },
}

COST_OPTIMIZATION_PROMPTS: Mapping[str, str] = {
    "Elastic compute": "How can we use autosuspend and scale-to-zero policies to reduce idle spend?",
    "Storage tiers": "When do we archive historical BI extracts into colder tiers without hurting SLAs?",
    "Data movement": "Which provider-native services offset egress fees through in-platform processing?",
}


# %%
def display_topic_groups(groups: Mapping[str, list]) -> None:
    """Print the grouped roadmap topics for facilitation."""

    print("\nCloud BI roadmap groupings:\n")
    for section, topics in groups.items():
        titles = ", ".join(topic.title for topic in topics)
        print(f"- {section}: {titles}")


# %%
def show_cloud_topic_frame(frame: pd.DataFrame) -> None:
    """Display the topic dataframe with descriptions and trade-offs."""

    print("\nLesson overview matrix:\n")
    print(frame.to_markdown(index=False))


# %%
def explain_architecture_patterns(patterns: Mapping[str, Mapping[str, str]]) -> None:
    """Describe reference architectures and their cost/feature positioning."""

    print("\nCloud architecture patterns and trade-offs:\n")
    for name, metadata in patterns.items():
        components = metadata.get("components", "")
        strength = metadata.get("strength", "")
        cost_trade_off = metadata.get("cost_trade_off", "")
        print(f"* {name}")
        print(f"  - Components: {components}")
        print(f"  - Strength: {strength}")
        print(f"  - Cost trade-off: {cost_trade_off}\n")


# %%
def preview_provider_matrix(frame: pd.DataFrame) -> None:
    """Show the provider comparison matrix across AWS, GCP, and Azure."""

    print("\nProvider capability comparison:\n")
    print(frame.to_markdown(index=False))


# %%
def prompt_cost_reviews(prompts: Mapping[str, str]) -> None:
    """Offer facilitation questions that emphasize ongoing cost reviews."""

    print("\nCost optimization prompts:\n")
    for theme, question in prompts.items():
        print(f"- {theme}: {question}")


# %%
def main() -> None:
    """Run the Day 83 classroom walkthrough."""

    display_topic_groups(CLOUD_GROUPS)
    show_cloud_topic_frame(CLOUD_TOPIC_FRAME)
    explain_architecture_patterns(CLOUD_ARCHITECTURE_PATTERNS)
    preview_provider_matrix(PROVIDER_FRAME)
    prompt_cost_reviews(COST_OPTIMIZATION_PROMPTS)


# %%
if __name__ == "__main__":
    main()
solutions.py
"""Topic helpers for the Day 83 BI Cloud and Modern Data Stack lesson."""

from __future__ import annotations

from typing import Dict, List, Mapping, Sequence

import pandas as pd

from mypackage.bi_curriculum import BiTopic, group_topics_by_titles, topics_by_titles

CLOUD_TITLES: Sequence[str] = (
    "Cloud BI Ecosystem",
    "Cloud Computing Basics",
    "Cloud data warehouses",
    "Providers: AWS, GCP, Azure",
    "Cloud",
)

CLOUD_TOPIC_GROUPS: Mapping[str, Sequence[str]] = {
    "Cloud foundations": (
        "Cloud Computing Basics",
        "Cloud",
    ),
    "Analytics ecosystem": (
        "Cloud BI Ecosystem",
        "Cloud data warehouses",
    ),
    "Provider landscape": ("Providers: AWS, GCP, Azure",),
}

CLOUD_TOPIC_DESCRIPTIONS: Mapping[str, str] = {
    "Cloud Computing Basics": (
        "Baseline students on elasticity, shared responsibility, and on-demand "
        "pricing so BI teams can evaluate managed services."
    ),
    "Cloud": (
        "Frame cloud operating models and the relationship between regions, "
        "availability zones, and compliance domains."
    ),
    "Cloud BI Ecosystem": (
        "Connect ingestion, warehousing, transformation, and visualization "
        "services into an integrated reference architecture."
    ),
    "Cloud data warehouses": (
        "Compare serverless warehouses and managed clusters for scale, query "
        "performance, and workload isolation."
    ),
    "Providers: AWS, GCP, Azure": (
        "Guide students through evaluating vendor strengths, default tooling, and "
        "partner ecosystems."
    ),
}

CLOUD_COST_CONSIDERATIONS: Mapping[str, str] = {
    "Cloud Computing Basics": "Variable compute and storage pricing favors bursty BI workloads.",
    "Cloud": "Networking egress and compliance guardrails become the dominant cost drivers.",
    "Cloud BI Ecosystem": "Managed services reduce admin labor but require budgeting for integration tiers.",
    "Cloud data warehouses": "Scale-to-zero options curb idle spend while reserved capacity lowers steady-state cost.",
    "Providers: AWS, GCP, Azure": "Marketplace commitments can trade flexibility for discounts across the stack.",
}

PROVIDER_COMPARISON: Mapping[str, Mapping[str, str]] = {
    "AWS": {
        "managed_warehouse": "Amazon Redshift Serverless with RA3 scaling tiers",
        "analytics_services": "QuickSight, Athena, Glue, Lake Formation",
        "orchestration": "Managed Airflow, Step Functions, and event-driven Lambda",
        "pricing_highlight": "Granular per-second billing with savings plans for reserved throughput",
        "notable_integration": "Tight coupling with S3 data lake and security via IAM",
    },
    "GCP": {
        "managed_warehouse": "BigQuery with autoscaling slots and data lake federation",
        "analytics_services": "Looker, Data Studio, Dataflow, Dataproc",
        "orchestration": "Cloud Composer, Workflows, and Cloud Functions",
        "pricing_highlight": "Serverless query pricing plus flat-rate commitments for enterprise teams",
        "notable_integration": "Unified governance through Dataplex and Vertex AI integrations",
    },
    "Azure": {
        "managed_warehouse": "Azure Synapse with serverless SQL pools and dedicated nodes",
        "analytics_services": "Power BI, Azure Data Factory, Databricks",
        "orchestration": "Data Factory pipelines, Logic Apps, and Functions",
        "pricing_highlight": "Hybrid benefits with reserved capacity discounts and spot compute tiers",
        "notable_integration": "Deep integration with Microsoft 365 security and Purview governance",
    },
}


def load_cloud_topics(titles: Sequence[str] = CLOUD_TITLES) -> List[BiTopic]:
    """Return the BI roadmap topics for the cloud and modern data stack lesson."""

    return list(topics_by_titles(titles))


def group_cloud_topics(
    groups: Mapping[str, Sequence[str]] = CLOUD_TOPIC_GROUPS,
) -> Dict[str, List[BiTopic]]:
    """Return grouped cloud topics covering foundations, ecosystem, and providers."""

    return dict(group_topics_by_titles(groups))


def build_cloud_topic_dataframe(
    *,
    groups: Mapping[str, Sequence[str]] = CLOUD_TOPIC_GROUPS,
    descriptions: Mapping[str, str] = CLOUD_TOPIC_DESCRIPTIONS,
    cost_notes: Mapping[str, str] = CLOUD_COST_CONSIDERATIONS,
) -> pd.DataFrame:
    """Create a dataframe summarizing lesson sections, descriptions, and trade-offs."""

    grouped = group_cloud_topics(groups=groups)
    records: list[dict[str, str]] = []
    for section, topics in grouped.items():
        for topic in topics:
            records.append(
                {
                    "section": section,
                    "title": topic.title,
                    "description": descriptions.get(topic.title, ""),
                    "cost_trade_off": cost_notes.get(topic.title, ""),
                }
            )
    return pd.DataFrame(
        records,
        columns=["section", "title", "description", "cost_trade_off"],
    )


def build_provider_comparison_frame(
    comparisons: Mapping[str, Mapping[str, str]] = PROVIDER_COMPARISON,
) -> pd.DataFrame:
    """Return a provider feature matrix for AWS, GCP, and Azure offerings."""

    rows: list[dict[str, str]] = []
    columns = [
        "provider",
        "managed_warehouse",
        "analytics_services",
        "orchestration",
        "pricing_highlight",
        "notable_integration",
    ]
    for provider, features in comparisons.items():
        row = {"provider": provider}
        row.update(features)
        rows.append(row)
    frame = pd.DataFrame(rows, columns=columns)
    return frame.sort_values("provider").reset_index(drop=True)


__all__ = [
    "CLOUD_TITLES",
    "build_cloud_topic_dataframe",
    "build_provider_comparison_frame",
    "group_cloud_topics",
    "load_cloud_topics",
]