Fabric Project 1: Scraping The Ideas Site

Part 1 of an end-to-end Fabric project to scrape the Ideas site and build an analytics solution

Motivation

My friend Alex Powers recently shared a Power Query template file to scrape the Microsoft Fabric Ideas site. Until Microsoft Fabric was announced, that would have been the only way to scrape the site for analysis in Power BI. With Fabric, we can not only scrape the data but also create and operationalize analytics solutions on a single platform, something that was not possible before. I strongly believe Fabric will facilitate an array of solutions using machine learning and LLMs that will further augment what we have been building with Power BI alone.

This is a weekend side project for me, so I will share as much as I can, whenever I can. As in real-world projects, I will iterate over many ideas and solutions so others can see how the project evolves and can follow along or contribute.

My Goals:

  • Learn and share what I learn

  • Explore Microsoft Fabric

  • Highlight how Microsoft Fabric allows creating solutions not possible before, especially for Power BI developers

  • Scrape responsibly and develop performant/scalable solutions

  • The final solution will contain notebooks, orchestration pipelines, dimensional modeling, machine learning models with experiments and a Power BI report

  • It is semi-ambitious and will take time but I know it will be worth it

If you are new to Python, I will try to explain as much code as I can. This will be a work in progress and I will label it accordingly. If I feel something is ready, I will label it as complete.

Scraping The Site:

To scrape the ideas site, I will be using requests and BeautifulSoup.

I have updated the script below to scrape pages concurrently. This cut the scraping time roughly in half compared to the previous version.

The Python script below is a multi-threaded web scraper. It retrieves idea submissions, associated metadata (votes, state, description, etc.), and user information from the specified number of pages, processes the results, and saves them into a pandas DataFrame. I like to think in steps and use pseudocode, so if you are new to this, use the pseudo-code to understand what's going on:

Pseudo-code
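In plain terms, the scraper does the following:

```text
FOR each page number from 1 to max_pages (requested in parallel):
    GET the page
    IF the response is not 200 OR no idea elements are found:
        RETURN nothing for this page
    FOR each idea element on the page:
        EXTRACT idea id, title, votes, date, state, user id, like count, description
    RETURN the extracted rows
COLLECT the per-page rows into DataFrames, skipping empty pages
CONCATENATE the DataFrames into one
ADD the experience name and scraped date columns
```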

Code

!pip install beautifulsoup4

import time
import requests
from bs4 import BeautifulSoup
import pandas as pd
import concurrent.futures

class Scraper:
    """
    Author: Sandeep Pawar   | Fabric.Guru   | Jun 19, 2023

    This class is responsible for scraping ideas from the Fabric ideas site

    Warning: Be careful with the value of max_pages. If you set it to 'max', it will
    scrape data from pages until no data is available. This could result in a large
    number of requests to the server, which may not be allowed by the site's policy.
    Always ensure that your scraping activities respect the site's terms of service.
    Use it only for learning purposes and avoid any PII data.

    """
    def __init__(self, url, experience_name, max_pages=None):
        self.base_url = url + "?page="
        self.dfs = []  # List to store dataframes
        if max_pages is None:
            self.max_pages = 3
        elif max_pages == -1:
            self.max_pages = 'max'
        else:
            self.max_pages = max_pages

        self.experience_name = experience_name

    def extract_data(self, idea):
        # Helper function to reduce redundancy
        def get_text(selector):
            element = idea.select_one(selector)
            return element.text.strip() if element else None

        def get_href(selector):
            element = idea.select_one(selector)
            return element.get('href') if element else None

        href = get_href(".mspb-mb-18 > .ms-card-heading-1 > A:nth-child(1):nth-last-child(1)")
        user_id = get_href("a.view-profile")
        user_id = user_id.split('=')[-1] if user_id else None
        idea_id = href.split('/')[-1] if href else None

        return {

            "product_name": get_text('span.mspb-mb-4.ms-card-heading-2'),
            "idea_id": idea_id.split("=")[1] if idea_id else None,
            "idea": get_text(".mspb-mb-18 A"),
            "votes": get_text(".ms-fontWeight-semibold *"),
            "date": get_text(".mspb-my-0 SPAN *"),
            "state": get_text(".ms-tag\-\-lightgreen"),
            "user_id": user_id if user_id else "NA",
            "like_count": get_text(".ms-like-count"),
            "description": get_text(".ms-fontColor-gray150.mspb-mb-12"),
        }

    def get_page_data(self, session, page):
#         time.sleep(1)  # Add a delay here
        current_url = self.base_url + str(page)
        response = session.get(current_url)
        if response.status_code != 200:
            return None
        soup = BeautifulSoup(response.content, "html.parser")
        ideas = soup.select(".mspb-py-sm-24")
        if not ideas:  # If there are no matches
            return None
        return [self.extract_data(idea) for idea in ideas]

    def scrape_data(self):
        # Create a session
        with requests.Session() as session:
            pages_to_scrape = list(range(1, self.max_pages+1)) if self.max_pages != 'max' else range(1, 100)  # Consider a large number for 'max'

            with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:  # Limit the number of workers
                results = executor.map(lambda page: self.get_page_data(session, page), pages_to_scrape)

            for result in results:
                # If the page doesn't contain data, skip
                if not result:
                    continue

                df_page = pd.DataFrame(result)
                self.dfs.append(df_page)

        # Concatenate all the dataframes into one
        df = pd.concat(self.dfs, ignore_index=True)
        df["Experience"] = self.experience_name
        df["scraped_date"] = pd.Timestamp.now().normalize().strftime("%Y-%m-%d")
        return df

url = "https://ideas.fabric.microsoft.com/"
scraper = Scraper(url, "Power BI", max_pages=3)
df_ideas = scraper.scrape_data()

A few things to note about the code:

  • Concurrent Scraping: The code utilizes Python's concurrent.futures library to scrape multiple pages simultaneously. This is much faster than scraping pages one after another.

  • Error Handling: The code includes checks to handle situations where a requested page returns a status code other than 200 (which indicates success), or when no 'idea' elements are found on a page. These checks help the program avoid crashes and missing data.

  • Customizable Scrape Limits: The scraper can be set to scrape a specific number of pages, or to keep scraping until it doesn't find any more data. This is handled in the __init__ and scrape_data methods of the Scraper class.

  • Session Usage: The requests.Session object is used to persist certain parameters across requests. It also provides connection pooling, which means that after a connection has been established, subsequent requests to the same host will reuse the underlying TCP connection, leading to performance improvements.
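One detail that matters here: `executor.map` yields results in the order of its inputs, not in completion order, so the per-page results stay in page order even when requests finish out of order. A minimal stdlib sketch (the `work` function is a hypothetical stand-in for `get_page_data`):

```python
import concurrent.futures
import time

def work(page):
    # Simulate slower responses for earlier pages, so completion
    # order differs from submission order.
    time.sleep(0.05 * (5 - page))
    return f"page-{page}"

with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    # map yields results in input order, regardless of completion order
    results = list(executor.map(work, range(1, 5)))

print(results)  # ['page-1', 'page-2', 'page-3', 'page-4']
```

This is why the concatenated DataFrame above comes out in page order without any explicit sorting.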

The output of the Scraper class will be a pandas DataFrame containing selected attributes of each idea. This will serve as the primary fact table. To avoid collecting personally identifiable information (PII), I am intentionally not scraping the usernames. Remember to scrape responsibly and use this as a learning opportunity.

Ideas Dimension Table

In the class above, we scraped only the idea listings. Now we get the full description of each idea. The logic is the same as above, except that we visit each idea's page and get its details:

Status: completed

import concurrent.futures

from tqdm import tqdm

def get_description(idea_url):
    try:
        with requests.Session() as session:
            response = session.get(idea_url)
            if response.status_code == 200:
                soup = BeautifulSoup(response.content, 'html.parser')
                text_element = soup.select_one('.mspb-mb-18')
                if text_element:
                    desc = text_element.get_text(strip=True)
                    return desc
            else:
                print(f"Page not found. URL: {idea_url}")
    except (requests.RequestException, ValueError, AttributeError) as e:
        print(f"Error processing URL: {idea_url}. {e}")
    return None


df_descriptions = df_ideas[['idea_id']].copy()
df_descriptions.loc[:,'urls'] = "https://ideas.fabric.microsoft.com/ideas/idea/?ideaid=" + df_descriptions['idea_id'].astype(str)

ideas = df_descriptions['urls']

descriptions = []

def process_idea(idea):
    description = get_description(idea)
    return description

with concurrent.futures.ThreadPoolExecutor() as executor:
    results = list(tqdm(executor.map(process_idea, ideas), total=len(ideas), desc="Processing URLs"))

df_descriptions['description'] = results
df_descriptions["scraped_date"] = pd.Timestamp.now().normalize().strftime("%Y-%m-%d")

The output of this is a pandas DataFrame containing the full description of each idea.
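This frame can then be joined back to the ideas table on `idea_id`. A minimal sketch with toy stand-ins for the two frames (a left join keeps ideas whose description page failed to load):

```python
import pandas as pd

# Toy stand-ins for df_ideas and df_descriptions
df_ideas = pd.DataFrame({"idea_id": ["101", "102", "103"],
                         "idea": ["Idea A", "Idea B", "Idea C"]})
df_descriptions = pd.DataFrame({"idea_id": ["101", "102"],
                                "description": ["Full text A", "Full text B"]})

# Left join keeps every idea, with NaN where no description was scraped
df_merged = df_ideas.merge(df_descriptions, on="idea_id", how="left")
print(df_merged.shape)  # (3, 3)
```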

User Dimension Table

As mentioned above, I am intentionally not scraping user names. But for dimensional modeling it will be nice to have names, so I will use the Faker library to create fake names in different languages, along with other user attributes.

!pip install Faker

from faker import Faker
import random
import pandas as pd
import numpy as np
from datetime import datetime, timedelta

def generate_fake_data(df):
    # Copy the dataframe
    user_df = df[['user_id', 'scraped_date']].copy().drop_duplicates()

    # Initialize Faker with multiple locales
    fake = Faker(['it_IT', 'en_US', 'ja_JP', 'en_IN'])

    # Define the start date (6 years ago from today)
    start_date = datetime.today() - timedelta(days=6*365)

    # Create a vectorized function for generating random dates
    vectorized_random_date = np.vectorize(lambda x: start_date + (datetime.today() - start_date) * random.random())

    # Create new columns with fake data
    user_df['name'] = np.array([fake.name() for _ in range(user_df.shape[0])])
    user_df['email'] = np.array([fake.email() for _ in range(user_df.shape[0])])
    user_df['company'] = np.array([fake.company() for _ in range(user_df.shape[0])])
    user_df['title'] = np.array([fake.job() for _ in range(user_df.shape[0])])
    user_df['user_since'] = vectorized_random_date(np.ones(user_df.shape[0]))
    user_df['user_since'] = pd.to_datetime(user_df['user_since']).dt.date

    return user_df

df_users = generate_fake_data(df_ideas)

The output is a pandas DataFrame with user details.
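Stripped of Faker and NumPy, the `user_since` logic above boils down to picking a uniformly random datetime between two fixed bounds. A stdlib-only sketch (`random_date_between` is an illustrative helper, not from the code above):

```python
import random
from datetime import datetime, timedelta

def random_date_between(start: datetime, end: datetime) -> datetime:
    """Pick a uniformly random datetime in [start, end)."""
    return start + (end - start) * random.random()

# Simulate 'user_since': a random signup date in the last 6 years
end = datetime.today()
start = end - timedelta(days=6 * 365)
signup_dates = [random_date_between(start, end) for _ in range(100)]
```

Fixing `start` and `end` once before the loop keeps every generated date inside the same window, which is easier to reason about than recomputing "today" per call.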

Date Dimension Table

I already have a blog post on how to create a date table using Python. I will use that. I don't need all the columns and will drop unnecessary columns later.


import pandas as pd
import numpy as np
from pandas.tseries.offsets import DateOffset, BMonthEnd
from datetime import datetime

#This is for Microsoft Fabric only to optimize the Delta table
spark.conf.set("spark.microsoft.delta.optimizeWrite.enabled", "true")


def date_dimension(start_date: str, end_date: str) -> pd.DataFrame:

    '''
    Author : Sandeep Pawar | fabric.guru

    Version : 2 [06-15-2023]

    This function creates a date dimension table based on the start date and end date.
    start_date and end_date should be in 'yyyy-mm-dd' format, such as '2023-12-31'.
    '''

    try:
        start_date = datetime.strptime(start_date, '%Y-%m-%d')
        end_date = datetime.strptime(end_date, '%Y-%m-%d')
    except ValueError:
        raise ValueError("Invalid date format. Please provide dates in 'yyyy-mm-dd' format.")

    if end_date <= start_date:
        raise ValueError("End date should be after start date.")

    df = pd.DataFrame({"Date": pd.date_range(start_date, end_date)})
    df["DateKey"] = df.Date.dt.strftime('%Y%m%d').astype(int)

    today = pd.Timestamp.now().normalize()
    current_week = today.week
    current_month = today.month
    current_quarter = today.quarter
    current_year = today.year

    df["ISODateName"] = df.Date.dt.strftime('%Y-%m-%d')
    df["AmericanDateName"] = df.Date.dt.strftime('%m/%d/%Y')
    df["DayOfWeekName"] = df.Date.dt.day_name()
    df["DayOfWeekShort"] = df.Date.dt.day_name().str[:3]
    df["MonthName"] = df.Date.dt.month_name()
    df["MonthShort"] = df.Date.dt.month_name().str[:3]
    df["YearWeekName"] = df.Date.dt.strftime('%YW%V')
    df["YearMonthName"] = df.Date.dt.strftime('%Y-%m')
    df["MonthYearName"] = df.Date.dt.strftime('%b %Y')
    df["YearQuarterName"] = df.Date.dt.year.astype(str) + 'Q' + df.Date.dt.quarter.astype(str)
    df["Year"] = df.Date.dt.year
    df["YearWeek"] = df.Date.dt.year*100 + df.Date.dt.isocalendar().week
    df["ISOYearWeekCode"] = df.Date.dt.year*100 + df.Date.dt.isocalendar().week
    df["YearMonth"] = df.Date.dt.year*100 + df.Date.dt.month
    df["YearQuarter"] = df.Date.dt.year*100 + df.Date.dt.quarter
    df["DayOfWeekStartingMonday"] = df.Date.dt.dayofweek + 1
    df["DayOfWeek"] = np.where(df.Date.dt.dayofweek == 6, 1, df.Date.dt.dayofweek + 2)
    df["DayOfMonth"] = df.Date.dt.day
    df["DayOfQuarter"] = (df.Date.dt.dayofyear - 1) % 91 + 1
    df["DayOfYear"] = df.Date.dt.dayofyear
    df["WeekOfQuarter"] = (df.Date.dt.day - 1) // 7 + 1
    df["WeekOfYear"] = df.Date.dt.isocalendar().week.astype('int64')
    df["ISOWeekOfYear"] = df.Date.dt.isocalendar().week.astype('int64')
    df["Month"] = df.Date.dt.month
    df["MonthOfQuarter"] = (df.Date.dt.month - 1) % 3 + 1
    df["Quarter"] = df.Date.dt.quarter
    df["DaysInMonth"] = df.Date.dt.days_in_month
    df["DaysInQuarter"] = (df.Date + pd.offsets.QuarterEnd(1)).dt.day
    df['DaysInYear'] = df['Date'].dt.is_leap_year+365
    df['FirstDayOfMonthFlag'] = (df['Date'].dt.is_month_start).astype(int)
    df['LastDayOfMonthFlag'] = (df['Date'].dt.is_month_end).astype(int)
    df['IsTodayFlag'] = (df['Date'] == today).astype(int)

    df['IsToday'] = np.where(df['Date'] == today, 1, 0)
    df['IsCurrentWeek'] = np.where(df['Date'].dt.isocalendar().week == current_week, 1, 0)
    df['IsCurrentMonth'] = np.where(df['Date'].dt.month == current_month, 1, 0)
    df['IsCurrentYear'] = np.where(df['Date'].dt.year == current_year, 1, 0)
    df['IsCurrentQuarter'] = np.where(df['Date'].dt.quarter == current_quarter, 1, 0)
    df['NextDay'] = df['Date'] + DateOffset(days=1)
    df['PreviousDay'] = df['Date'] - DateOffset(days=1)
    df['PreviousYearDay'] = df['Date'] - DateOffset(years=1)
    df['PreviousMonthDay'] = df['Date'] - DateOffset(months=1)
    df['NextMonthDay'] = df['Date'] + DateOffset(months=1)
    df['NextYearDay'] = df['Date'] + DateOffset(years=1)

    return df

# For testing
start_date = "2023-01-01" 
end_date = "2024-05-25" 

date_table = date_dimension(start_date, end_date)
date_table

The output is a pandas DataFrame with dates and date attributes.
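Quarter-relative columns like `DayOfQuarter` are easy to get subtly wrong, so it is worth sanity-checking them against known quarter boundaries. One way to compute the day of quarter directly, using `to_period('Q')` to find each quarter's start:

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(["2023-01-01", "2023-04-01", "2023-12-31"]))

# Day of quarter = days elapsed since the quarter start, plus one
quarter_start = dates.dt.to_period("Q").dt.start_time
day_of_quarter = (dates - quarter_start).dt.days + 1

print(day_of_quarter.tolist())  # [1, 1, 92]
```

Quarter starts map to 1 and Dec 31 maps to 92 (31 + 30 + 31 days in Q4), which gives a quick spot-check for the date table.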

Next:

Need to create a couple more dimensions, scrape the ideas for each Fabric workload, and clean/transform the data.
