Fabric Project 1 : Scraping The Ideas Site
Part 1 of an end-to-end Fabric project to scrape the Ideas site and create an analytics solution
Motivation
My friend Alex Powers recently shared a Power Query template file to scrape the Microsoft Fabric Ideas site. Until Microsoft Fabric was announced, that would have been the only way to scrape the site for analysis in Power BI. With Fabric, we can not only scrape the data but also create and operationalize analytics solutions that were previously not possible on a single platform. I strongly believe Fabric will facilitate an array of solutions using machine learning and LLMs that further augment what we have been building with Power BI alone.
This is a weekend side project for me, so I will share as much as I can, whenever I can. As in real-world projects, I will iterate over many ideas and solutions so others can see how it evolves and follow along or contribute.
My Goals:
Learn and share what I learn
Explore Microsoft Fabric
Highlight how Microsoft Fabric allows creating solutions not possible before, especially for Power BI developers
Scrape responsibly and develop performant/scalable solutions
The final solution will contain notebooks, orchestration pipelines, dimensional modeling, machine learning models with experiments and a Power BI report
It is semi-ambitious and will take time, but I know it will be worth it.
If you are new to Python, I will try to explain as much of the code as I can. This will be a work in progress, and I will label each part accordingly; when I feel something is ready, I will label it as complete.
Scraping The Site:
To scrape the ideas site, I will be using BeautifulSoup along with requests.
I have updated the script below to scrape pages concurrently. This cut the scraping time roughly in half compared to the previous version.
The Python script below is a multi-threaded web scraper. It retrieves idea submissions, associated metadata (such as votes, state, and description), and user information from the specified number of pages, processes this information, and saves it into a pandas DataFrame. I like to think in steps and use pseudocode, so if you are new to this, use the pseudo-code below to understand what's going on:
Pseudo-code
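At a high level, the class below does the following (this outline is reconstructed from the code, so the step names are only descriptive):

1. Build the base URL for the ideas list from the site URL (url + "?page=").
2. Decide how many pages to scrape: 3 by default, a user-specified number, or 'max' (keep going while pages return data).
3. For each page, request it with a shared requests.Session and parse the HTML with BeautifulSoup.
4. From every idea card on the page, extract the idea id, title, votes, date, state, user id, like count and short description.
5. Run the per-page requests concurrently with a thread pool (5 workers), skipping pages that fail or return no ideas.
6. Collect the per-page results into DataFrames, concatenate them, add the Experience and scraped_date columns, and return the final DataFrame.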
Code
!pip install beautifulsoup4
import time
import requests
from bs4 import BeautifulSoup
import pandas as pd
import concurrent.futures
class Scraper:
    """
    Author: Sandeep Pawar | Fabric.Guru | Jun 19, 2023

    This class is responsible for scraping ideas from the Fabric ideas site

    Warning: Be careful with the value of max_pages. If you set it to 'max', it will
    scrape data from pages until no data is available. This could result in a large
    number of requests to the server, which may not be allowed by the site's policy.
    Always ensure that your scraping activities respect the site's terms of service.
    Use it only for learning purposes and avoid any PII data.
    """

    def __init__(self, url, experience_name, max_pages=None):
        self.base_url = url + "?page="
        self.dfs = []  # List to store dataframes

        if max_pages is None:
            self.max_pages = 3
        elif max_pages == -1:
            self.max_pages = 'max'
        else:
            self.max_pages = max_pages

        self.experience_name = experience_name

    def extract_data(self, idea):
        # Helper functions to reduce redundancy
        def get_text(selector):
            element = idea.select_one(selector)
            return element.text.strip() if element else None

        def get_href(selector):
            element = idea.select_one(selector)
            return element.get('href') if element else None

        href = get_href(".mspb-mb-18 > .ms-card-heading-1 > A:nth-child(1):nth-last-child(1)")
        user_id = get_href("a.view-profile")
        user_id = user_id.split('=')[-1] if user_id else None
        idea_id = href.split('/')[-1] if href else None

        return {
            "product_name": get_text('span.mspb-mb-4.ms-card-heading-2'),
            "idea_id": idea_id.split("=")[1] if idea_id else None,
            "idea": get_text(".mspb-mb-18 A"),
            "votes": get_text(".ms-fontWeight-semibold *"),
            "date": get_text(".mspb-my-0 SPAN *"),
            "state": get_text(".ms-tag--lightgreen"),
            "user_id": user_id if user_id else "NA",
            "like_count": get_text(".ms-like-count"),
            "description": get_text(".ms-fontColor-gray150.mspb-mb-12"),
        }

    def get_page_data(self, session, page):
        # time.sleep(1)  # Add a delay here to scrape more gently
        current_url = self.base_url + str(page)
        response = session.get(current_url)

        if response.status_code != 200:
            return None

        soup = BeautifulSoup(response.content, "html.parser")
        ideas = soup.select(".mspb-py-sm-24")

        if not ideas:  # If there are no matches
            return None

        return [self.extract_data(idea) for idea in ideas]

    def scrape_data(self):
        # Create a session so connections are reused across requests
        with requests.Session() as session:
            pages_to_scrape = list(range(1, self.max_pages + 1)) if self.max_pages != 'max' else range(1, 100)  # Consider a large number for 'max'

            with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:  # Limit the number of workers
                results = executor.map(lambda page: self.get_page_data(session, page), pages_to_scrape)

                for result in results:
                    # If the page doesn't contain data, skip it
                    if not result:
                        continue
                    df_page = pd.DataFrame(result)
                    self.dfs.append(df_page)

        # Concatenate all the dataframes into one
        df = pd.concat(self.dfs, ignore_index=True)
        df["Experience"] = self.experience_name
        df["scraped_date"] = pd.Timestamp.now().normalize().strftime("%Y-%m-%d")

        return df
url = "https://ideas.fabric.microsoft.com/"
scraper = Scraper(url, "Power BI", max_pages=3)
df_ideas = scraper.scrape_data()
A few things to note about the code:

Concurrent Scraping: The code uses Python's concurrent.futures library to scrape multiple pages simultaneously. This is much faster than scraping pages one after another.

Error Handling: The code includes checks to handle situations where a requested page returns a status code other than 200 (which indicates success), or where no 'idea' elements are found on a page. These checks help the program avoid crashes and missing data.

Customizable Scrape Limits: The scraper can be set to scrape a specific number of pages, or to keep scraping until it doesn't find any more data. This is handled in the __init__ and scrape_data methods of the Scraper class (see the sketch after this list for a fully open-ended variant).

Session Usage: The requests.Session object is used to persist certain parameters across requests. It also provides connection pooling, which means that after a connection has been established, subsequent requests to the same host reuse the underlying TCP connection, leading to performance improvements.
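On the scrape-limit point, the class approximates 'max' by trying pages 1 through 99. Below is a minimal sketch of a truly open-ended variant that keeps requesting sequential pages until one comes back empty; this is my illustration, not part of the class above, and it trades the concurrency for a clean stopping condition, so it will be slower.

# Sketch (not from the original post): scrape page by page until an empty page is returned
def scrape_until_empty(scraper):
    rows, page = [], 1
    with requests.Session() as session:
        while True:
            page_data = scraper.get_page_data(session, page)
            if not page_data:  # stop at the first unavailable or empty page
                break
            rows.extend(page_data)
            page += 1
            time.sleep(1)  # be polite to the server
    return pd.DataFrame(rows)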
The output of the Scraper class will be a pandas DataFrame containing selected attributes of each idea. This will serve as the primary fact table. To avoid collecting personally identifiable information (PII), I am intentionally not scraping the usernames. Remember to scrape responsibly and use this as a learning opportunity.
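Since this is a Fabric project, one way to persist this fact table is to convert the pandas DataFrame to a Spark DataFrame and save it as a Delta table in the lakehouse. A minimal sketch, assuming the notebook has a default lakehouse attached; the table name ideas_fact is just a placeholder I chose:

# Sketch: save the scraped ideas as a Delta table in the attached lakehouse
# 'ideas_fact' is a hypothetical table name, not from the original post
spark_df = spark.createDataFrame(df_ideas)
spark_df.write.mode("overwrite").format("delta").saveAsTable("ideas_fact")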
Ideas Dimension Table
The class above only scrapes the idea cards. Now we get the full description of each idea. The logic is the same as above, except that we go to each idea's page and pull the details from there:
Status: completed
import concurrent.futures
from tqdm import tqdm
def get_description(idea_url):
    try:
        with requests.Session() as session:
            response = session.get(idea_url)
            if response.status_code == 200:
                soup = BeautifulSoup(response.content, 'html.parser')
                text_element = soup.select_one('.mspb-mb-18')
                if text_element:
                    desc = text_element.get_text(strip=True)
                    return desc
            else:
                print(f"Page not found. URL: {idea_url}")
    except (requests.RequestException, ValueError, AttributeError) as e:
        print(f"Error processing URL: {idea_url}. {e}")
    return None

df_descriptions = df_ideas[['idea_id']].copy()
df_descriptions.loc[:, 'urls'] = "https://ideas.fabric.microsoft.com/ideas/idea/?ideaid=" + df_descriptions['idea_id'].astype(str)
ideas = df_descriptions['urls']
descriptions = []

def process_idea(idea):
    description = get_description(idea)
    return description

with concurrent.futures.ThreadPoolExecutor() as executor:
    results = list(tqdm(executor.map(process_idea, ideas), total=len(ideas), desc="Processing URLs"))

df_descriptions['description'] = results
df_descriptions["scraped_date"] = pd.Timestamp.now().normalize().strftime("%Y-%m-%d")
The output of this is a pandas dataframe containing the full description of each idea.
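If needed, these descriptions can be joined back to df_ideas on idea_id to enrich the fact table. A minimal sketch, using the DataFrames created above (the suffix keeps the short description from the idea card separate from the full one):

# Sketch: join the full descriptions back to the ideas DataFrame on idea_id
df_ideas_full = df_ideas.merge(
    df_descriptions[['idea_id', 'description']],
    on='idea_id',
    how='left',
    suffixes=('', '_full')
)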
User Dimension Table
As mentioned above, I am intentionally not scraping user names. But for dimensional modeling it will be nice to have user names, so I will use the Faker library to create fake names in different locales along with other user attributes.
!pip install Faker
from faker import Faker
import random
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
def generate_fake_data(df):
    # Copy the dataframe and keep one row per user
    user_df = df[['user_id', 'scraped_date']].copy().drop_duplicates()

    # Initialize Faker with multiple locales
    fake = Faker(['it_IT', 'en_US', 'ja_JP', 'en_IN'])

    # Define the start date (6 years ago from today)
    start_date = datetime.today() - timedelta(days=6*365)

    # Create a vectorized function for generating random dates
    vectorized_random_date = np.vectorize(lambda x: start_date + (datetime.today() - start_date) * random.random())

    # Create new columns with fake data
    user_df['name'] = np.array([fake.name() for _ in range(user_df.shape[0])])
    user_df['email'] = np.array([fake.email() for _ in range(user_df.shape[0])])
    user_df['company'] = np.array([fake.company() for _ in range(user_df.shape[0])])
    user_df['title'] = np.array([fake.job() for _ in range(user_df.shape[0])])
    user_df['user_since'] = vectorized_random_date(np.ones(user_df.shape[0]))
    user_df['user_since'] = pd.to_datetime(user_df['user_since']).dt.date

    return user_df
df_users = generate_fake_data(df_ideas)
The output is a pandas dataframe with user details.
Date Dimension Table
I already have a blog post on how to create a date table using Python. I will use that. I don't need all the columns and will drop unnecessary columns later.
import pandas as pd
import numpy as np
from pandas.tseries.offsets import DateOffset, BMonthEnd
from datetime import datetime
#This is for Microsoft Fabric only to optimize the Delta table
spark.conf.set("spark.microsoft.delta.optimizeWrite.enabled", "true")
def date_dimension(start_date: str, end_date: str) -> pd.DataFrame:
    '''
    Author  : Sandeep Pawar | fabric.guru
    Version : 2 [06-15-2023]

    This function creates a date dimension table based on the start date and end date.
    start_date and end_date should be in 'yyyy-mm-dd' format such as '2023-12-31'
    '''
    try:
        start_date = datetime.strptime(start_date, '%Y-%m-%d')
        end_date = datetime.strptime(end_date, '%Y-%m-%d')
    except ValueError:
        raise ValueError("Invalid date format. Please provide dates in 'yyyy-mm-dd' format.")

    if end_date <= start_date:
        raise ValueError("End date should be after start date.")

    df = pd.DataFrame({"Date": pd.date_range(start_date, end_date)})
    df["DateKey"] = df.Date.dt.strftime('%Y%m%d').astype(int)

    today = pd.Timestamp.now().normalize()
    current_week = today.week
    current_month = today.month
    current_quarter = today.quarter
    current_year = today.year

    df["ISODateName"] = df.Date.dt.strftime('%Y-%m-%d')
    df["AmericanDateName"] = df.Date.dt.strftime('%m/%d/%Y')
    df["DayOfWeekName"] = df.Date.dt.day_name()
    df["DayOfWeekShort"] = df.Date.dt.day_name().str[:3]
    df["MonthName"] = df.Date.dt.month_name()
    df["MonthShort"] = df.Date.dt.month_name().str[:3]
    df["YearWeekName"] = df.Date.dt.strftime('%YW%V')
    df["YearMonthName"] = df.Date.dt.strftime('%Y-%m')
    df["MonthYearName"] = df.Date.dt.strftime('%b %Y')
    df["YearQuarterName"] = df.Date.dt.year.astype(str) + 'Q' + df.Date.dt.quarter.astype(str)
    df["Year"] = df.Date.dt.year
    df["YearWeek"] = df.Date.dt.year*100 + df.Date.dt.isocalendar().week
    df["ISOYearWeekCode"] = df.Date.dt.year*100 + df.Date.dt.isocalendar().week
    df["YearMonth"] = df.Date.dt.year*100 + df.Date.dt.month
    df["YearQuarter"] = df.Date.dt.year*100 + df.Date.dt.quarter
    df["DayOfWeekStartingMonday"] = df.Date.dt.dayofweek + 1
    df["DayOfWeek"] = np.where(df.Date.dt.dayofweek == 6, 1, df.Date.dt.dayofweek + 2)
    df["DayOfMonth"] = df.Date.dt.day
    df["DayOfQuarter"] = (df.Date.dt.dayofyear - 1) % 91 + 1
    df["DayOfYear"] = df.Date.dt.dayofyear
    df["WeekOfQuarter"] = (df.Date.dt.day - 1) // 7 + 1
    df["WeekOfYear"] = df.Date.dt.isocalendar().week.astype('int64')
    df["ISOWeekOfYear"] = df.Date.dt.isocalendar().week.astype('int64')
    df["Month"] = df.Date.dt.month
    df["MonthOfQuarter"] = (df.Date.dt.month - 1) % 3 + 1
    df["Quarter"] = df.Date.dt.quarter
    df["DaysInMonth"] = df.Date.dt.days_in_month
    df["DaysInQuarter"] = (df.Date + pd.offsets.QuarterEnd(1)).dt.day
    df['DaysInYear'] = df['Date'].dt.is_leap_year + 365
    df['FirstDayOfMonthFlag'] = (df['Date'].dt.is_month_start).astype(int)
    df['LastDayOfMonthFlag'] = (df['Date'].dt.is_month_end).astype(int)
    df['IsTodayFlag'] = (df['Date'] == today).astype(int)
    df['IsToday'] = np.where(df['Date'] == today, 1, 0)
    df['IsCurrentWeek'] = np.where(df['Date'].dt.isocalendar().week == current_week, 1, 0)
    df['IsCurrentMonth'] = np.where(df['Date'].dt.month == current_month, 1, 0)
    df['IsCurrentYear'] = np.where(df['Date'].dt.year == current_year, 1, 0)
    df['IsCurrentQuarter'] = np.where(df['Date'].dt.quarter == current_quarter, 1, 0)
    df['NextDay'] = df['Date'] + DateOffset(days=1)
    df['PreviousDay'] = df['Date'] - DateOffset(days=1)
    df['PreviousYearDay'] = df['Date'] - DateOffset(years=1)
    df['PreviousMonthDay'] = df['Date'] - DateOffset(months=1)
    df['NextMonthDay'] = df['Date'] + DateOffset(months=1)
    df['NextYearDay'] = df['Date'] + DateOffset(years=1)

    return df
# For testing
start_date = "2023-01-01"
end_date = "2024-05-25"
date_table = date_dimension(start_date, end_date)
date_table
The output is a pandas dataframe with dates and date attributes.
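As noted earlier, not all of these columns are needed. A minimal sketch of trimming the table to a subset of columns; the list below is only an example, not the final selection:

# Sketch: keep only a subset of date attributes (example list, not the final selection)
columns_to_keep = ["Date", "DateKey", "Year", "Quarter", "Month", "MonthName",
                   "DayOfMonth", "DayOfWeekName", "YearMonth"]
date_dim = date_table[columns_to_keep]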
Next:
I need to create a couple more dimensions, scrape the ideas for each Fabric workload, and clean/transform the data.