Sandeep Pawar | Microsoft Fabric

Exporting Power BI Reports And Sharing With Granular Access Control In Fabric

Sandeep Pawar — Wed, 24 Apr 2024 22:45:35 GMT

There are several different ways to export a Power BI report as PDF, PPTX etc. Users can do it manually, via subscriptions, API, Power Automate and couple of other ways. All work but with Fabric, and especially with Semantic Link, it became very easy. Plus, with the recent announcement of OneLake Data Access Control, you can add a security layer in the Lakehouse to ensure users are able to access only the extracts they have been granted access to.

Before I begin, I want to thank Michael Kovalsky and Aaron Merrill at Microsoft for collaborating and answering my questions. I will be using Michael's fabric_cat_toolslibrary. If you have not checked it out yet, I highly recommend taking a look.

With the proposed solution below, you will be able to :

Export a Power BI report, or a page of a report or a specific visual from any page as a PDF, PNG, PPTX or other supported file formats
Apply report level filters before exporting
Automate the extracts on a schedule
Save the exported reports to specific folders
Grant access to individual folders in the Lakehouse

This does not obviate the need for a Paginated Report since you can't create pagination but I have worked on several projects where users want the PDF extracts or I have had to create Paginated Reports that could have been Power BI reports. Below, I will not share dynamic/interactive filtering. I will cover it in a future blog on how users can specify their own filters and folder location to save the exported files on demand, similar to Paginated Reports.

Exporting The Power BI Report:

You can useExport To File API to export any Power BI report. You can use the PowerBIRestClient in Semantic Link to do so easily. However, Michael has a convenient wrapperexport_report to do it and provides an easy way to pass the filter expressions.

%pip install "https://raw.githubusercontent.com/m-kovalsky/fabric_cat_tools/main/fabric_cat_tools-0.3.1-py3-none-any.whl"#example codeimport fabric_cat_tools as fctfct.export_report(            report = 'AdventureWorks'            ,export_format = 'PDF'            ,page_name = 'ReportSection293847182375'            ,report_filter = "'Product Category'[Color] in ('Blue', 'Orange') and 'Calendar'[CalendarYear] <= 2020"            ,workspace = None            )

In my example, I want to apply filter based on Store column in Store table in the semantic model and save all pages to the respective folders.

In the below highly commented code, I first build the filtering logic using a dictionary. You can implement your own logic, make it as simple or complex as needed. Specify the folder lakehouse location and the file name.

Note the filter string. If you used the API, you will have to specify the filter using a not-so-obvious filter expression such as URL?filter=Table/Field eq 'value' . But Michael has simplified it, you can use 'Table'[Column]="value". The function automatically resolves it to the correct filter expression.

💡

You must have a default lakehouse mounted before running the code. You can use the File API path /lakehouse/default/Folder/ to specify the location.

#build the filter logic. e.g. in the below dictionary the key is the folder where I want to #save the file and the value associated with that key is the filter I want to use#you can create custom filtering options based on your requirements (e.g. year, month, store etc)import sempy.fabric as fabricimport fabric_cat_tools as fctfolder_filter_dict = {"Fama Stores":"Fama", "Contoso Stores":"Contoso", "Leo Stores":"Leo"}#Relative Path where the exported reports should be savedpath = "Files/Exported_Reports_PDF"for folder in folder_filter_dict.keys():     #check if the folder exists, create the folder if it does not exist    if not mssparkutils.fs.exists(f"{path}/{folder}"):        print(f"Folder {f} does not exist at {path}.. creating it")        mssparkutils.fs.mkdirs(f"{path}/{folder}")    try:        print(f"---->Exporting to {folder} folder")        fct.export_report(            #name of the report you want to export            report = 'Sales & Returns',            #name of the workspace where this report exists. If None, current workspace will be used            workspace = 'Sales Workspace',            #file format, e.g. pdf, png, pptx etc.            export_format = 'PDF',            #report level filter to apply using filter expression            #change this based on your requirements            report_filter = f"'Store'[Store] = '{folder_filter_dict[folder]}'",            #path and name of the file. do not add file extension such as .pdf, .png            #you must mount a deault lakehouse            file_name=f'/lakehouse/default/{path}/{folder}/{folder_filter_dict[folder]}',            #name of the report page you want to export. Use the ReportSection#### from the URL            #page_name = None,             #name of the visual. Must specify the page_name if visual_name is specified            #To get the visual name, right click a visual in service > Share > Link to visual > you wil see the visual name            #visual_name=None        )    except:        print("Error exporting to {folder}")

After executing above code, the reports were exported as below:

If you re-run the code, it will overwrite the existing files. You can suffix timestamps to file name if you want to keep a history of all the extracts.

Below is the exported PDF:

Some visual types may not be supported in the exported PDF so be sure to test it.

In a future blog, I will show how to use widgets + Semantic Link + Notebook permissions to mimic Paginated Report like interactive experience. Stay tuned !

Data Access Control

Once exported, I want to give granular access, i.e. only the users from Contoso stores should only have access to Contoso Stores folder etc. We can achieve this using Data Access Control in the Lakehouse. Note that data access control is currently in preview

Grant Read access to the Lakehouse
First step is to give Read access to the Lakehouse by unchecking all the boxes from Lakehouse sharing dialog box. This will enable metadata access to the Lakehouse.

You can read more about it here. A quick overview of different permissions is below. In my case the user does not have a workspace role.

Define Data access role:
- Open the Lakehouse and select Manage OneLake data access
- Create a new role (not the DefaultReader)
- Select the folder to define the scope

Assign role:
Select the newly created role and select Assign. Include users in Read permissions. This virtualizes all users with the checked permissions and includes in them in the role. Note below that the names of the users who have Read access to the Lakehouse automatically showed up with their respective item permissions. I entered user Megan Bowen but her Permissions is shown as None. That's because she does not have access to the Lakehouse. For her to see the folder contents, she must have workspace permissions or the item permissions.

The user will see the shared lakehouse in their Onelake data hub :

When they open the Lakehouse, they will see only the folder they have been given access to and not the other folders. The best part is that with the magic of Onelake Explorer, they will be able to access the exported reports on local machine, just like a OneDrive for business !

💡

I gave a simplified example for demonstration purposes. Follow the best practices and company governance practices of using security groups etc. to define security. I highly recommend reading these best practices for Onelake security.

What About Custom Stylized PDF Reports?

Absolutely ! With Semantic Link we can query the semantic model, transform it and with Python/R libraries create highly customized and stylized PDFs, Excel reports as needed and follow the same process as above to distribute it securely.

In the below example, I am using Semantic Link to query the dataset, great_tables to create a stylized report and Weasyprint to export it as a PDF to a Lakehouse location.

%pip install great_tables --q%pip install WeasyPrint --qimport reimport pandas as pdfrom datetime import datetimefrom weasyprint import HTMLfrom great_tables import GT,mdimport sempy.fabric as fabricdax = """EVALUATE    VAR __filter =         TREATAS({"Abbas",            "Aliqui",            "Barba",            "Contoso"}, 'Store'[Store])    VAR __table =         SUMMARIZECOLUMNS(            'Store'[Store],            'Product'[Category],            'Customer'[Price Range],            __filter,            "Net_Sales", 'Analysis DAX'[Net Sales]        )RETURN__table"""result = fabric.evaluate_dax(dataset="Sales & Returns", workspace="Sales Workspace" , dax_string=dax)result.columns = [re.search(r'\[(.*?)\]', item).group(1) for item in result.columns]table = pd.pivot_table(data=result, columns="Price Range", index=["Store", "Category"], values="Net_Sales").reset_index()table["Total"] = table.select_dtypes('float').sum(axis=1).rename("Total")table = table[["Category","Store","<$40","$40 - $70",">$70", "Total"]]result = (    GT(table, rowname_col="Store", groupname_col="Category")    .tab_header(        title=md("**Net Sales By Product, Store by Price Range**"),         subtitle=md("*Dataset : Sales & Returns | Workspace : Sales*" )        )    .tab_spanner("Price Range", columns = ["<$40","$40 - $70",">$70"])    .fmt_currency(columns=['$40 - $70', '<$40', '>$70', "Total"], decimals=0)      .tab_source_note(md(f"*Data as of {datetime.now().strftime('%Y-%m-%d')} , fabric.guru*"))    .opt_stylize() )#resultHTML(string=result.as_raw_html()).write_pdf('/lakehouse/default/Files/pdf_report.pdf')

Refer to the respective library documentation and API for details. great_tables has .as_pdf option to save the table directly as a PDF but it requires a browser to be installed which Fabric VMs do not have so I exported the HTML and used Weasyprint to create PDF. You know a better way, let me know.

In summary, Fabric notebooks, Semantic Link and the granular data access control open up many interesting use cases and applications to meet reporting and analytics needs while enforcing security. You can access the pbix I used from here.

💡

Side note: Today is the 4th anniversary of blog. Those of you who are ardent cricket fans like me might recognize the date Apr, 24, my hero Sachin Tendulkar's birthday. Happy Birthday Sachin ! You continue to inspire me every single day.

Calculating and Reducing Power BI Dataset Size Using Semantic Link

Sandeep Pawar — Thu, 29 Feb 2024 23:22:12 GMT

Semantic Link v0.6 is out and it has many new exciting additions to its growing list of list_* methods. Highlighted are some of the new methods. Install the latest version and check it out.

Some of the existing methods such as list_columns() have an additional parameter extended which returns more column information such as column cardinality, size, encoding and many more column properties. This allows users to get detailed information about the dataset and the columns.

I used this to create a function that calculates:

overall size of the semantic model based on column size. This includes data, dictionary and hierarchy size
number of tables in the model
%of size by calculated columns in the model. This can be used to identify datasets that heavily use calculated columns
%of size by calculated tables. We know that calculated tables do not compress as well as tables created using Power Query.
%of size by auto Datetime. Auto datetime can increase the size of the model significantly
%of size by floats, i.e. columns that have decimal data type. In Power BI, such columns can significantly increase the size of the model
%of size by column hierarchy. Column hierarchy size can increase for high cardinality columns. Turning off IsAvailableInMDX for such columns can lead to reduced size and faster refreshes. Watch this video by Marco to learn more.

All of this has already been available via VertiPaq Analyzer but one dataset at a time. Now with Semantic Link and Fabric notebooks, we can easily scan the entire tenant and obtain this information to :

Get an overall estimate of size of the model in memory
Identify cause of the bloated model size
Track and monitor change in size
Alert users proactively
Keep the capacity running efficiently

💡

In the function below, I am only using the column size. The total size you see in VertiPaq Analyzer includes user hierarchy, relationship size and other metatada from service but column size is the bulk of the overall size and relevant to what I want to achieve.

Below function:

Finds active capacities in the tenant
Only Premium/Fabric workspaces are included in the analysis
By default, default semantic models are excluded but you can include them by setting drop_default_datasets=False
For default semantic models, since XMLA endpoint is not available, size cannot be calculated

Code

%pip install semantic-link --q #install semantic-link v0.6import sempy.fabric as fabricimport pandas as pdimport datetimedef analyze_model_size(drop_default_datasets=True):    """    Author : Sandeep Pawar |  fabric.guru  |  02-29-2024    This function scans the active premium workspaces in the entire tenant and calculates the total column size    of premium datasets (import and Direct Lake). The purpose of the function is to get the total column size and    identify size reduction opportunities by calculating percentage of size occupied by calculated tables, calculated columns,    auto datetime tables, columns with double data type and column hierarchy size.    Note that user hierarchy and relationship size are not used in memory size calculation.       args: drop_default_datasets=True          set to False to drop default datasets from the final dataframe. Default is True.      """    def get_ds_size(workspace, dataset):        all_columns = fabric.list_columns(workspace=workspace, dataset=dataset, extended=True)        size = sum(all_columns["Total Size"])        calc_cols = sum(all_columns[all_columns.Type == "Calculated"]["Total Size"])        calc_tables = sum(all_columns.query('Type == "CalculatedTableColumn"')["Total Size"])        auto_date = sum(all_columns[all_columns['Table Name'].str.startswith(("DateTableTemplate_", "LocalDateTable_"))]["Total Size"])        float_cols = sum(all_columns.query('`Data Type` == "Double"')["Total Size"])        hierarchy_size = sum(all_columns["Hierarchy Size"]) #column hierarchy size        num_tables = all_columns["Table Name"].nunique()        return size, calc_cols, calc_tables, auto_date, float_cols, hierarchy_size, num_tables    #Get active capacities only    active_capacities = fabric.list_capacities().query('State == "Active"')    #Premium and Fabric workspaces    ws = fabric.list_workspaces().query('`Is On Dedicated Capacity`==True')    premium_workspaces = ws[ws['Capacity Id'].isin(list(active_capacities.Id))]    datasets = pd.concat([fabric.list_datasets(ws).assign(workspace=ws) for ws in premium_workspaces['Name']], ignore_index=True)    col_list = ['total_columnsize_MB', 'pct_size_calculated_cols', 'pct_size_calculated_tables',              'pct_size_autodatetime', 'pct_size_floats', 'pct_hierarchy_size','number_of_tables']        catalog = datasets[["workspace", "Dataset Name"]].copy().assign(date = datetime.date.today())    catalog[col_list] = pd.NA    for i, row in catalog.iterrows():        try:            size, calc_cols, calc_tables, auto_date, float_cols, hierarchy_size, num_tables = get_ds_size(row["workspace"], row["Dataset Name"])            catalog.loc[i, ['total_columnsize_MB', "pct_size_calculated_cols", 'pct_size_calculated_tables',                             'pct_size_autodatetime', 'pct_size_floats', 'pct_hierarchy_size', 'number_of_tables']] = [                round(size/(1024**2), 1), round(100 * (calc_cols / size), 1), round(100 * (calc_tables / size), 1),                round(100 * (auto_date / size), 1), round(100 * (float_cols / size), 1), round(100 * (hierarchy_size / size), 1), int(num_tables)            ]        #Excpetion to handle default datasets which do not have XMLA endpoint        except Exception:             continue    for col in col_list:        catalog[col]=pd.to_numeric(catalog[col], errors='coerce')    if drop_default_datasets:        #default datasets will show NaN values. To include default, set drop_default_datasets=False        catalog.dropna(inplace=True)     catalog.sort_values(by=['total_columnsize_MB'], ascending=False, inplace=True)    catalog.reset_index(drop=True, inplace=True)    return catalogdf = analyze_model_size()df

In the example above, I can right away get the idea about the size of the semantic model and factors that are potentially contributing to the size of the model. There can be other reasons as well but this will help identify the problematic models quickly and direct the optimization efforts.

Summary Report

from IPython.display import display, HTMLdf = analyze_model_size()df2  = (df        .groupby('workspace')        .agg(Number_of_Datasets=('Dataset Name', 'count'),         Estimated_Size_MB=('total_columnsize_MB', 'sum'))       .sort_values('Estimated_Size_MB', ascending=False)        )def summarize_model(dfs_dict):    """    Displays dataframes in tabs in the notebook    Parameters:    - dfs_dict: Dictionary with tab titles as keys and pandas DataFrames as values.    """    # Basic styles for the tabs and tab content    styles = """        """    # JavaScript for tab functionality    script = """        """    # HTML for tabs    tab_html = ''    content_html = ''    for i, (title, df) in enumerate(dfs_dict.items()):        tab_id = f"tab{i}"        tab_html += f''        content_html += f'{tab_id}" class="tabcontent">{title}
{df.to_html()}
'    tab_html += '
'    # Display the tabs, tab contents, and run the script    display(HTML(styles + tab_html + content_html + script))    # Default to open the first tab    display(HTML(""))#define the dictionary with {"Tab name":df}df_dict = {    "Details":df,    "Workspace Summary":df2,    "Top 10 By Column Size": df.nlargest(10, 'total_columnsize_MB'),    "Top 10 By Calculated Columns": df.nlargest(10, 'pct_size_calculated_cols'),    "Top 10 By Calculated Columns": df.nlargest(10, 'pct_size_calculated_tables'),    "Top 10 By Column Hierarchy Size": df.nlargest(10, 'pct_hierarchy_size'),    "Top 10 By Auto DateTime": df.nlargest(10, 'pct_size_autodatetime')        }    summarize_model(df_dict)

The tabbed summary report shows the overall summary as well as the datasets sorted by different criteria to help users get started.

💡

As you can imagine, this is just the beginning. When I first wrote about Semantic Link, I mentioned that you will be able to execute BPA rules and get VertiPaq Analyzer-like summary of the model. Well, stay tuned - more to come!

This is not a replacement of existing methods or tools to track and optimize the semantic models. For example, Measure Killer provides very detailed information about a model. Consider this as another tool in your tool chest to programmatically identify optimization opportunities.

If you think of any additional metrics I should include, please let me know.

I want to thank Markus Cozowicz and Michael Kovalsky at Microsoft for their help.

Synthetic Streaming Data Generation For Real Time Analytics in Fabric

Sandeep Pawar — Tue, 06 Feb 2024 23:08:54 GMT

If you want to learn or demo Real Time Analytics in Microsoft Fabric, you will need a streaming data source. You can use the built-in samples to get started. But there are several data generators which you can use to create custom streaming sample datasets, Azure Stream Analytics data generator being one of them. You can see them here. In this blog, I will show how to set one up to use with Fabric Eventstream.

Create an Evenstream in Fabric Real Time Analytics workload

Click on New source and create a custom app
Select the custom app and copy the Event hub connection string. The connection string will be of the form:
"Endpoint=sb://<>.servicebus.windows.net/;SharedAccessKeyName=;SharedAccessKey=;EntityPath="
Download the Fradulent Call Data generator from here and save it to your local machine.
In the downloaded zip file, you will find telcodatagen.exe.config file. Open it in an editor (Notepad or VS Code)
In the config file, you need to change two app configurations EventHubName and
Microsoft.ServiceBus.ConnectionString values.
EventHubName: You can either get it from the Eventstream app settings under Keys or from the connection string copied in step 3. In the connection string copied in step 3, EventHubName is the value of the EntityPath starting with "es_".
Connection String: Edit the connection string copied in step 3 to delete the EntityPath. Connection string starts with "Endppoint=sb://es<>" . You also need to remove the semi-colon (;) before the EntityPath so that the connection string ends with =". Your final connection string should look like :
"Endpoint=sb://<>.servicebus.windows.net/;SharedAccessKeyName=="
Save the file and browse to the folder location in command prompt cd folder location . To generate 250 records per hour with 2% fraudulent calls for 1 hour, use telcodatagen.exe 250 0.02 1 . This will start streaming the synthetic data to the event hub created above. To cancel the stream, press CNTRL + C.
You should now see data streaming into the event hub and you can save this data to the destination of your choice (Lakehouse, KQL DB, custom app, Reflex) for further analytics !

💡

Eventstream automatically creates a kafka client which you can use in spark streaming as well for demo purposes. To learn more about how to send data from kafka to Eventstream, check this out.

Using runMultiple To Orchastrate Notebook Execution in Microsoft Fabric

Sandeep Pawar — Mon, 29 Jan 2024 23:54:54 GMT

Using runMultiple in MicrosoftFabric notebook isn't new. It was announced, I think, at MS Ignite in November 2023. I tested it and worked beautifully. But the Notebooks team has been quietly adding great features to it and I wanted to show some of those. I want to thank Jene Zhang, Fang Zhang and Yi Lin from the Fabric Notebooks team for answering my questions.

💡

Please note that runMultiple is still under development and these features/API may change. Refer to the official documentation for updates.

The notebook class in mssparkutils has two methods to run notebooks - run and runMultiple . run allows you to trigger a notebook run for one single notebook. Mim wrote a nice blog to show how to use it and its usefulness.

runMultiple , on the other hand, allows you to create a Direct Acyclic Graph (DAG) of notebooks to execute notebooks in parallel and in specified order, similar to a pipeline run except in a notebook. The advantages here are that you can:

programmatically define the execution order
run notebooks in parallel or in squence
define dependencies between notebook runs
efficient use of the compute resources as it will use the compute of the orchestrating notebook
Depending on the F SKU and the node size, you may get TooManyRequestsForCapacity error if you run multiple notebooks at the same time because of concurrency limits. Since runMultiple uses the same compute, you can mitigate this error. Advancing Analytics has an excellent blog that goes in-depth into this. My colleague Will Crayger also has also researched this topic and summarized his findings here.

Here is how you use it:

Create a DAG which is basically a json object with defined parameters. Use mssparkutils.notebook.help("runMultiple") to learn about the API and the arguments
Create a orchtration notebook to run all the DAG
There are quite a few additions to this method but a couple I would like to point out:
- It now plots the DAG to show the execution order and the dependencies
- You can monitor the progress of the execution interactively as the notebooks are being executed
- You can alter the type and size of the DAG graph. As mentioned above, this is still under development and may change at any point. Keep an eye on the updates.

Example:

I have 9 notebooks in my pipeline:

4 extraction notebooks (2 dims and 2 facts for example purposes)
Each notebook has parameters defined (using toggle parameter)
extraction notebooks do not have any dependencies
I create 2 dim tables which depend on the dim extraction notebooks
The 2 facts are loaded after the successful extraction of fact data and after the dim tables are created
A semantic model is dependent on successful execution of all the above notebook and is refreshed after the dims and facts table created

Below is the DAG I created to capture this logic:

DAG = {    "activities": [        {            "name": "extract_customers",             "path": "extract_customers",             "timeoutPerCellInSeconds": 120,            "args": {"rows": 1000},        },        {            "name": "extract_products",             "path": "extract_products",             "timeoutPerCellInSeconds": 120,            "args": {"rows": 5000},        },        {            "name": "extract_offers",             "path": "extract_offers",             "timeoutPerCellInSeconds": 120,            "args": {"rows": 1000},        },        {            "name": "extract_leads",             "path": "extract_leads",             "timeoutPerCellInSeconds": 120,            "args": {"rows": 100000},        },        {            "name": "customer_table",            "path": "customer_table",            "timeoutPerCellInSeconds": 90,            "retry": 1,            "retryIntervalInSeconds": 10,            "dependencies": ["extract_customers"]        },        {            "name": "products_table",            "path": "products_table",            "timeoutPerCellInSeconds": 90,            "retry": 1,            "retryIntervalInSeconds": 10,            "dependencies": ["extract_products"]        },                {            "name": "leads_table",            "path": "leads_table",            "timeoutPerCellInSeconds": 90,            "retry": 1,            "retryIntervalInSeconds": 10,            "dependencies": ["extract_leads","customer_table", "products_table"]        },           {            "name": "offers_table",            "path": "offers_table",            "timeoutPerCellInSeconds": 90,            "retry": 1,            "retryIntervalInSeconds": 10,            "dependencies": ["extract_offers","customer_table", "products_table"]        },                   {            "name": "refresh_dataset",            "path": "refresh_dataset",            "timeoutPerCellInSeconds": 90,            "retry": 1,            "retryIntervalInSeconds": 10,            "dependencies": ["customer_table","products_table","leads_table","offers_table"]        }    ],    "timeoutInSeconds": 3600, # max 1 hour for the entire pipeline    "concurrency": 5 # max 5 notebooks in parallel}

Note that in the above example, I also defined the timeout for the entire pipeline and also the concurrency limit of no more than 5 notebooks can be run in parallel.

To run and visualize the DAG:

#check the API documentation for available DAG layouts, e.g. spring, circular, planar etcmssparkutils.notebook.runMultiple(DAG, {"displayDAGViaGraphviz":True, "DAGLayout":"spectral", "DAGSize":11})

Progress bar :

You can see the interactive progress bar as the notebooks are executed in the specified order, respecting the dependencies.

Although I have 9 notebooks in this pipeline, in the monitoring hub, I see only one that's being executed because all the notebooks share the compute of the orchestrating notebook.

The DAG viz shows the dependencies in a network graph. The arrows show how the notebooks are dependent on each other. While I like this, I created mine the way I like it :

from graphviz import Digraphdef generate_notebook_dependency_graph(json_data, show_legend=False, apply_color=False):    """    Sandeep Pawar  |  fabric.guru  | 1/30/24    Visualize runMultiple DAG using GraphViz.     """    dot = Digraph(comment='Notebook Dependency Graph')    dot.attr(rankdir='TB', size='10,10')    dot.attr('node', shape='box', style='rounded', fontsize='12')    if show_legend:        with dot.subgraph(name='cluster_legend') as legend:            legend.attr(label='Legend', fontsize='12', fontcolor='black', style='dashed')            legend.node('L1', 'No Dependencies', style='filled', fillcolor='lightgreen', rank='max')            legend.node('L2', 'One Dependency', style='filled', fillcolor='lightblue', rank='max')            legend.node('L3', 'Multiple Dependencies', style='filled', fillcolor='lightyellow', rank='max')            legend.node('L4', 'Args Present (in label)', rank='max')    dot.attr('node', rank='', color='black')    for notebook in json_data["activities"]:        notebook_name = notebook['name']        args = notebook.get('args', {})        label = f"{notebook_name}"        if args:            label += f"\nArgs: {', '.join([f'{k}={v}' for k, v in args.items()])}"        if apply_color:            num_dependencies = len(notebook.get('dependencies', []))            color = 'lightgreen' if num_dependencies == 0 else 'lightblue' if num_dependencies == 1 else 'lightyellow'        else:            color = 'white'        dot.node(notebook_name, label, style='filled', fillcolor=color)        for dependency in notebook.get('dependencies', []):            dot.edge(dependency, notebook_name)    display(dot)generate_notebook_dependency_graph(DAG, show_legend=True, apply_color=True)

In the above DAG, I color coded the blocks and added them in a hierarchy to show how they are executed, which notebooks have parameters and which have dependencies. It would be nice to be able to build the DAG visually.

Current Limitations:

This feature is still in development
You can only refer to notebooks from the same workspace
Since the notebooks share the compute, all the notebooks must have the same environment.
You cannot create conditional branching based on exit value
Not so much a limitation but something to be aware of - you do not see the run history of the child notebooks in "All runs". e.g. if you need to see how long the individual notebooks ran and when etc., you cannot see that. You only see the run history of the orchestrating notebook. I am not sure how to use logging in this case.

Bonus Tips:

List of notebooks

When creating the DAG, you will need to see a list of notebooks. You can use below to get a list of notebooks in the workspace:

import sempy.fabric as fabricnotebooks = fabric.list_items().query("Type == 'Notebook'")notebooks

Running notebooks from another workspace:

While mssparkutils "currently" doesn't support cross-workspace notebooks, you can use semantic-link to run notebooks from any workspace:

fabric.run_notebook_job("notebook_id", workspace="workspace_name")

I will update this blog as more features are added. In the meantime, if you have any suggestions/features, I encourage you to share feedback with Jene on Twitter.

Let me know if you would like the notebooks.

One Line Code To Get A List Of Items From All The Fabric/Premium Workspaces

Sandeep Pawar — Wed, 17 Jan 2024 14:52:29 GMT

What the title says, thanks to the latest version (v0.5) of Semantic Link

!pip install --upgrade semantic-link --q #upgrade to semantic-link v0.5import sempy.fabric as fabricimport pandas as pd df = pd.concat([fabric.list_items(workspace=ws) for ws in fabric.list_workspaces().query('`Is On Dedicated Capacity` == True').Id], ignore_index=True)df

💡

If you have a large tenant with thousands of workspaces and items, this will be slow. In that case optimize the code to execute in parallel using ThreadPoolExecutor

Cancelling Dataflow Gen2 Refresh

Sandeep Pawar — Thu, 11 Jan 2024 20:04:11 GMT

Update : Mar 8, 2024

As of March 8, you can cancel the Fabric Dataflow Gen2 refresh in service using UI:

Programmatically cancel the refresh:

DFg2 can be resource intensive, and if the refresh takes longer than expected, it may consume a significant amount of CUs. Thankfully, you can use the Power BI Rest API to cancel it programmatically. My friend Alex Powers already has a PowerShell script that you can use. You can also use the Power BI VS Code extension by Gerhard Brueckl.

But I would like to show you how you can do this using the PowerBIRestClient in the latest version of Semantic-Link (v0.5.0). The API is similar to the FabricRestClient I blogged about in my last blog. The new version has many updates I love and I will write about it soon. But in the meantime, checkout the official documentation for the details.

In the below code:

I first get the list of all the dataflows in a workspace using the Get Dataflows API. You can pass either the workspace name or id. By default, the workspace id of the workspace hosting the notebook is used.
For each dataflow, get the transaction id and the status of the dataflow using Get Dataflow Transactions API.
Cancel the refresh, using the Cancel Dataflow Transaction API

Semantic Link makes calling APIs very easy. Authentication is handled automatically for you so no need to generate a token.

# Install semantic-link >= 0.5.0!pip install semantic-link --qimport pandas as pdimport sempy.fabric as fabricfrom sempy.fabric.exceptions import FabricHTTPException# Instantiate the clientclient = fabric.PowerBIRestClient()def get_dataflow_transactions(workspace_id, dataflow_id):    try:        response = client.get(f"v1.0/myorg/groups/{workspace_id}/dataflows/{dataflow_id}/transactions")        if response.status_code != 200:            raise FabricHTTPException(response)            return pd.DataFrame()        transactions = response.json().get('value', [])        if not transactions:            return pd.DataFrame([{'status': 'No Status', 'transaction': 'No Transaction ID', 'refreshType': 'None'}])        first_transaction = transactions[0]        return pd.DataFrame([{            'status': first_transaction['status'],            'transaction': first_transaction['id'],            'refreshType': first_transaction['refreshType']        }])    except Exception as e:        print("Error getting transactions. Check workspace, dataflow id:", e)        return pd.DataFrame()def get_dataflows(workspace):    try:        workspace_id = fabric.resolve_workspace_id(workspace)    except Exception as e:        print("Check workspace name/id:", e)        return pd.DataFrame()    try:        response = client.get(f"v1.0/myorg/groups/{workspace_id}/dataflows")        if response.status_code != 200:            raise FabricHTTPException(response)            return pd.DataFrame()        dataflows = response.json().get('value', [])        dataflow_df = (pd.json_normalize(dataflows, max_level=1)                       .rename(columns={"name": "dataflowName", "objectId": "dataflowID"})                       .dropna()                       .reset_index(drop=True))        result = dataflow_df.join(dataflow_df.apply(lambda x: get_dataflow_transactions(workspace_id, x['dataflowID']).iloc[0], axis=1))        return result    except Exception as e:        print("Error in get_dataflows:", e)        return pd.DataFrame()def cancel_dataflow(workspace, transaction_id):    """    Cancel a specific dataflow transaction.    """    try:        workspace_id = fabric.resolve_workspace_id(workspace)        response = client.post(f"v1.0/myorg/groups/{workspace_id}/dataflows/transactions/{transaction_id}/cancel")        if response.status_code != 200:            raise FabricHTTPException(response)            return None        return response.json()    except Exception as e:        print("Error:", e)        return None

Example:

Get dataflow transactions:

Cancel the dataflow:

💡

As I mentioned in my last blog, you should use the FabricRestClient for calling the Fabric API endpoints and PowerBIRestClient for the Power BI APIs. Hopefully in the future, both will merge and only one will be needed.

💡

Tip: If you need to run a DFg2 with heavy transformations, run it in a pipeline so you can set the timeout and cancel the pipeline run instead.

This was one example . You can use this client to call any Power BI Rest endpoint and make GET, POST requests. If you are new to Python, note that the actual code is just a couple of lines, majority is for error/exception handling to make it robust. I love Semantic Link, and it's becoming indispensable day by day.

Using FabricRestClient To Make Fabric REST API Calls

Sandeep Pawar — Wed, 20 Dec 2023 21:11:06 GMT

Accessing Fabric REST endpoints in Fabric notebooks was already easy but it became easier and straightforward with semantic-link version 0.4.0. You can use the FabricRestClient class from sempy to set up a REST client and call the APIs. Authentication is automatically managed for you.

💡

Note that the Fabric API is still in preview and the FabricRestClient is experimental.

#sempy version 0.4.0 or higher!pip install semantic-link --q import pandas as pdimport sempy.fabric as fabric#Instantiate the clientclient = fabric.FabricRestClient()

Refer to the official documentation for API details. To call an API, you only need to pass the below portion of the URL.

#example usageurl = "v1/workspaces//items"result = client.get(url)

This approach is not substantially different from using requests to make API calls, but there are a few advantages:

You don't have to generate and pass the authentication token. sempy uses the environment details to authenticate the user automatically.
This approach is similar to using requests.Session() to maintain a persistent connection, keeping the session open longer. This is particularly beneficial if you plan to call multiple REST endpoints within the same session.
sempy has a few nice utilities and functions for exception handling and getting workspace/lakehouse ids.

Example

In this simple example, I will get a list of Fabric items from a workspace. I will show how some of the utilities can help make this even easier and robust.

#instantiate the clientclient = fabric.FabricRestClient()#get a list of items from a workspaceworkspaceId = ""client.get(f"/v1/workspaces/{workspaceId}/items")

In this example, I manually provided the workspace ID. However, we can obtain the notebook's workspace ID automatically without needing to define it.

#instantiate the clientimport pandas as pdclient = fabric.FabricRestClient()#get the default workspace id of the notebookworkspaceId = fabric.get_workspace_id()#get a list of items from a workspaceresponse = client.get(f"/v1/workspaces/{workspaceId}/items")df_items = pd.json_normalize(response.json()['value'])df_items

For production use cases you will need to include exception handling. sempy provides a few nice utilities, see an example below:

import sempy.fabric as fabricfrom sempy.fabric.exceptions import FabricHTTPException, WorkspaceNotFoundExceptionimport pandas as pdclient = fabric.FabricRestClient()def get_workspace_items(workspace=None):    """    Sandeep Pawar | fabric.guru    Get a list of workspace items using FabricRestClient    """    df_items = None    if workspace is None:        workspaceId = fabric.get_workspace_id()    else:        try:            workspaceId = fabric.resolve_workspace_id(workspace)            response = client.get(f"/v1/workspaces/{workspaceId}/items")            # Check the status code of the response for this endpoint            # Use 200 if operation is completed, 201 if item is created            if response.status_code != 200:                raise FabricHTTPException(response)            df_items = pd.json_normalize(response.json()['value'])        except WorkspaceNotFoundException as e:            print("Caught a WorkspaceNotFoundException:", e)        except FabricHTTPException as e:            print("Caught a FabricHTTPException. Check the API endpoint, authentication.")    return df_itemsget_workspace_items("")

In the above example:

workspace is optional. If no workspace name/id is passed, fabric.get_workspace_id() will use the workspace id of the default workspace. Similarly, you can use fabric.get_lakehouse_id() to get the id of the default lakehouse attached to the notebook.
All APIs require using workspace id. With fabric.resolve_workspace_id(workspace) you can get the id of a workspace if the user enters workspace name
WorkspaceNotFoundException, FabricHTTPException help catch workspace and REST endpoint exceptions.

💡

Use the FabricRestClient for calling Fabric APIs only. There are some differences between the Power BI and Fabric API URLs. sempy is expected to cover Power BI APIs as well soon. I will update the blog when it becomes available.

Update: 1/21/2024

Making a POST Request

A redear asked how to make a POST request using the FABRICRestClient. Below is te example where I create a lakehouse using a POST request.

import sempy.fabric as fabricfrom sempy.fabric.exceptions import FabricHTTPException, WorkspaceNotFoundExceptionclient = fabric.FabricRestClient()def create_lakehouse(payload:dict, workspace=None):    """    Sandeep Pawar | fabric.guru    Create a lakehouse using POST request    """    if workspace is None:        workspaceId = fabric.get_workspace_id()    else:        workspaceId = fabric.resolve_workspace_id(workspace)    try:        response = client.post(f"/v1/workspaces/{workspaceId}/items",json= payload)        if response.status_code != 201:            raise FabricHTTPException(response)    except WorkspaceNotFoundException as e:        print("Caught a WorkspaceNotFoundException:", e)    except FabricHTTPException as e:        print("Caught a FabricHTTPException. Check the API endpoint, authentication.")    return#payload/body for the POST request as a dict#example; create a lakehouse named "mylh" in the default workspacepayload = {        "displayName": "mylh",        "type": "Lakehouse"          }#workspace name or id is optional. If None, default workspace is used.create_lakehouse(payload=payload, workspace=None)

Thanks to Markus Cozowicz for answering my questions and for his input.

Refreshing Individual Tables and Partitions With Semantic Link

Sandeep Pawar — Thu, 07 Dec 2023 21:49:25 GMT

Since version 0.4 Semantic Link there have been several methods to refresh individual tables and partitions in a semantic model. In this blog, I will show how to use the .refresh_dataset().execute_tmsl() and TOM to achieve granular refresh.

Prerequisites: You need Microsoft Fabric capacity. The semantic model needs to be in a workspace backed by F/P/PPU capacity/license.

Method 1: Using the Enhanced Refresh API

To refresh the entire semantic model, use fabric.refresh_dataset( dataset, workspace ) . But you can also provide a list with key-value pairs for tables and partitions to refresh. This is particularly helpful for large semantic models and models with incremental refresh defined.

In the below example, I am only refreshing the Order_Details table and Customer-ROW partition from the Customers table.

You can use the REST API method as well, but this is a more convenient way in my opinion.

#!pip install semantic-link --q# Define the dataset and workspaceimport sempy.fabric as fabric dataset = "SL-Refresh"workspace = "Sales"# Objects to refresh, define using a dictionaryobjects_to_refresh = [    {        "table": "Customers",        "partition": "Customers-ROW"    },    {        "table": "Order_Details"    }]# Refresh the datasetfabric.refresh_dataset(    workspace=workspace,    dataset=dataset,     objects=objects_to_refresh,)# List the refresh requestsfabric.list_refresh_requests(dataset=dataset, workspace=workspace)

To confirm the partitions have been refreshed successfully, we can use the .get_tmsl() to get the details on each partition in the table.

import pandas as pdimport jsonimport sempy.fabric as fabric def get_partition_refreshes(dataset, workspace):    """    Sandeep Pawar  |   fabric.guru    Returns a pandas dataframe with three columns - table_name, partition_name, refreshedTime    """    tmsl_data = json.loads(fabric.get_tmsl(dataset=dataset, workspace=workspace))    df = pd.json_normalize(        tmsl_data,         record_path=['model', 'tables', 'partitions'],         meta=[            ['model', 'name'],             ['model', 'tables', 'name']        ],        errors='ignore',        record_prefix='partition_'    )    df = df.rename(columns={'model.tables.name': 'table_name'})    return df[['table_name', 'partition_name', 'partition_refreshedTime']]df = get_partition_refreshes(dataset=dataset, workspace=workspace)df

Sample output:

This method also has a ton of other options to refresh the semantic models in granular details and I highly recommend checking it out.

Update 4/15/2024: You can also use fabric.list_partitions() method to get the partition refresh timestamps.

Method 2: Using TMSL

Alternatively, you can use the .execute_tmsl to refresh specific tables and partitions. This is a far more flexible and powerful method because you can not only refresh the tables, you can alter the table/partition properties as well.

#define TMSL#In the below example, I am refreshing a table called Order_Details and #partition named Customers-ROW from Customers table#database is your semantic model nametmsl_script = {  "refresh": {    "type": "full",    "objects": [      {        "database": "SL-Refresh",        "table": "Order_Details"      },      {        "database": "SL-Refresh",        "table": "Customers",        "partition": "Customers-ROW"      }    ]  }}fabric.execute_tmsl(workspace="", script=tmsl_script)

With Semantic Link, you can define your own custom refresh schedules, make refreshes conditional (e.g. based on data quality checks, arrival of data etc.) and much more.

Method 3 : Using TOM (Semantic Link v>=0.7)

Update : Apr 15, 2024

With Semantic Link >=v0.7, you can update TOM properties. Below I show how you can update all/selected tables and all/selected partitions. I used this blog by Michael Kovalsky (Principal PM, Fabric CAT) as a reference.

Refreshing Selected Tables in a model

%pip install semantic-link --q #skip if semantic-link v >= 0.7import sempy.fabric as fabric from typing import Listdef refresh_selected_tables(workspace_name:str, dataset_name:str, tables:List[str]=None, refresh_type:str="DataOnly"):    """    Sandeep Pawar | Fabric.guru  |  Apr 12, 2024    Function to refresh selected tables from a semantic model based on a refresh policy     - Default refresh type is "DataOnly"     - If no table list is provided, all tables in the model are refreshed    Refer to blog by Michael Kovalsky for details : https://www.elegantbi.com/post/datarefreshintabulareditor    For Refresh types refer to #https://learn.microsoft.com/en-us/dotnet/api/microsoft.analysisservices.tabular.refreshtype?view=analysisservices-dotnet    """    ws = workspace_name or fabric.resolve_workspace_name(fabric.get_workspace_id())    all_tables = fabric.list_tables(dataset=dataset_name, workspace=ws)['Name']    selected_tables = all_tables if tables is None else tables    import Microsoft.AnalysisServices.Tabular as tom    tom_server = fabric.create_tom_server(False, ws)    tom_database = tom_server.Databases.GetByName(dataset_name)    if refresh_type == "DataOnly":        refreshtype = tom.RefreshType.DataOnly    elif refresh_type == "Full":        refreshtype = tom.RefreshType.Full    elif refresh_type == "Calculate":        refreshtype = tom.RefreshType.Calculate    elif refresh_type == "ClearValues":        refreshtype = tom.RefreshType.ClearValues    elif refresh_type == "Defragment":        refreshtype = tom.RefreshType.Defragment    else:        print("Enter valid refresh type, Valid values : DataOnly, Full, Calculate, ClearValues, Defragment")    try:        for table in selected_tables:            print("Refreshing : ",table)            tom_database.Model.Tables[table].RequestRefresh(refreshtype)    except Exception as e:        print("----Refresh Failed----")        print(f"An error occurred: {e}")    tom_database.Model.SaveChanges()      fabric.refresh_tom_cache(workspace=ws)refresh_selected_tables(workspace_name="", dataset_name="")

Refreshing Selected Partitions in a model:

%pip install semantic-link --q #skip if semantic-link v >= 0.7import sempy.fabric as fabric import pandas as pdfrom typing import Dict, Listdef refresh_selected_partitions(workspace_name:str, dataset_name:str, partitions:Dict[str, List]=None, refresh_type:str="DataOnly"):    """    Sandeep Pawar | Fabric.guru  |  Apr 12, 2024    Function to refresh selected partitions from a semantic model based on a refresh policy     - Default refresh type is "DataOnly"     - Partitions should be provided as a dictionary as below       partitions = {"tableA":["partition1","partition2"],"tableB":["partition4","partition6", "partition7"] }     Refer to blog by Michael Kovalsky for details : https://www.elegantbi.com/post/datarefreshintabulareditor    For Refresh types refer to #https://learn.microsoft.com/en-us/dotnet/api/microsoft.analysisservices.tabular.refreshtype?view=analysisservices-dotnet    """    ws = workspace_name or fabric.resolve_workspace_name(fabric.get_workspace_id())    all_partitions = fabric.list_partitions(dataset=dataset_name, workspace=ws)    if partitions:        partition_list = partitions        data_tuples = [(table, partition) for table, partitions in partition_list.items() for partition in partitions]        given_partitions = pd.DataFrame(data_tuples, columns=["Table Name", "Partition Name"])    selected_partitions = all_partitions if partitions is None else given_partitions    import Microsoft.AnalysisServices.Tabular as tom    tom_server = fabric.create_tom_server(False, ws)    tom_database = tom_server.Databases.GetByName(dataset_name)    if refresh_type == "DataOnly":        refreshtype = tom.RefreshType.DataOnly    elif refresh_type == "Full":        refreshtype = tom.RefreshType.Full    elif refresh_type == "Calculate":        refreshtype = tom.RefreshType.Calculate    elif refresh_type == "ClearValues":        refreshtype = tom.RefreshType.ClearValues    elif refresh_type == "Defragment":        refreshtype = tom.RefreshType.Defragment    else:        print("Enter valid refresh type, Valid values : DataOnly, Full, Calculate, ClearValues, Defragment")    try:        for idx, row in selected_partitions.iterrows():            table_name = row['Table Name']            partition_name = row['Partition Name']            print(f"Refreshing Partition: {partition_name} in Table: {table_name}")            tom_database.Model.Tables[table_name].Partitions[partition_name].RequestRefresh(refreshtype)    except Exception as e:        print("----Refresh Failed----")        print(f"An error occurred: {e}")    tom_database.Model.SaveChanges()      fabric.refresh_tom_cache(workspace=ws)

Note that there are five refresh type options : DataOnly, Full,Calculate, ClearValues, Defragment that you can specify.
For selectively refreshing partitions, provide the partitions as a dictionary as : partitions = {"table_nameA":["p1","p2"] , "table_nameA":["p3"]}
For conditional refreshing etc, use fabric.list_partitions(..) to check the last time a partition was refreshed and pass the dictionary to the above function for more granular, conditional refreshes
In refresh history, you will see the Type as "Via XMLA Endpoint"

Michael also has a refresh_semantic_model method in his library fabric_cat_tools . Check it out here.

Method 4 : Using PowerBIRestClient

Above methods work if the dataset you are refreshing is in a Premium/Fabric capacity. But you can use the PowerBIRestClient in Semantic Link to refresh a dataset in a Pro workspace. I have written about PowerBIRestClient before.

import sempy.fabric as fabricclient = fabric.PowerBIRestClient()datasetId = ""client.post(f"/v1.0/myorg/datasets/{datasetId}/refreshes")# if responses code is 202, refresh was successful# to refresh partitions, refer to option 1 on how to pass objects_to_refresh to the post request

The limitation on number refreshes (8 per day) for Pro still apply.

Method 5 : Using Pipelines

As of Apr 15, 2024, you can also use the Fabric pipelines to refresh a semantic model as an activity in a pipeline. Currently, only the entire dataset can be refreshed and not the individual tables and partitions.

Marc Lelijveld, Data Platform MVP, has a fantastic write up on putting it all together to orchestrate the refreshes, I highly recommend reading it.
Tom Oefler also has an excellent blog on how to achieve granular refreshes using Web Activity in Fabric pipelines.

So which method should you use?

If you need to refresh the entire model, refresh semantic model activity is the easiest and cleanest option. You can use the pipeline activity to refresh a Pro dataset as well.
If you need to refresh individual tables and partitions, I like Option 1 using fabric.refresh_dataset because it's an API call so you are not making any structural changes to the model. Plus, since you will be using the notebooks, you can make it more dynamic and conditional. It uses the spark cluster though (until Python kernel becomes available) so I suspect, Web Activity will be a cheaper option (fewer CUs + avoids using spark vCores) if you know which tables/partitions to refresh and the logic is straightforward.

Controlling Direct Lake Fallback Behavior

Sandeep Pawar — Wed, 22 Nov 2023 21:50:01 GMT

In the flurry of announcements at MS Ignite, one significant feature related to Direct Lake got buried. You can now control the fallback behavior of the Direct Lake semantic model. Before we get into how you do that, a quick recap of what is fallback.

What is fallback?

When you create a Direct Lake semantic model, by default it is in Direct Lake mode, i.e. you will directly query the delta table from the lakehouse/warehouse. This is what we want because the query performance will be very much comparable to the import mode. However, under certain circumstances, the DAX query can fallback to DirectQuery if Direct Lake limitations are reached. This was one of the major limitations of Direct Lake because:

While in some cases DirectQuery can be faster on the first read, subsequent performance will be significantly slower
DirectQuery has its own limitations
This fallback behavior is highly dynamic and relies on a number of factors. When developing the model, you must test the queries using DAX Studio or Performance Analyzer to determine whether the query is in Direct Lake or DirectQuery mode. Chris Webb has a series of blogs that cover this in great detail. Even if the query is in DirectLake when you tested it during development, there is no guarantee that it will always be in Direct Lake in service.
If the users experience slow report performance, it will be hard to debug and reproduce given the dynamic nature of the fallback mechanism.
Currently, if you create a semantic model using the delta tables in the web modeling, the table storage mode will show "Direct Lake" despite the fact that the DAX queries could be in DirectQuery (that's because while the storage mode of the partition is Direct Lake, the DAX queries could be DL or DQ). This adds to the confusion.

I should note that fallback doesn't mean, the report will stop working. In case of fallback, the query will switch from Direct Lake to DirectQuery and thus the report performance will be slower than expected.

When does fallback occur?

Below are some of the scenarios when fallback can occur:

The number of files per table, row groups per table, number of rows per table, model size on disk are reached

💡

If the semantic model size exceeds Max memory limit shown below, it will cause the model to be evicted (i.e. paged out) and not fallback. If the model is paged out of memory, it will lead to performance degradation for large models.
Semantic model uses data warehouse views
If the model size on disk exceeds the max size per SKU, the model (not the query) will fall back to DQ
RLS/OLS are defined in the data warehouse
Semantic model is published via XMLA endpoint and has not been reframed.
This is not a fallback criteria rather a known issue/limitation. If you create a Direct Lake model and it is unprocessed, it will fall back to Direct Query. You can read about it here in the official docs. I explained this in the video below:

https://youtu.be/JdNsCHoGeSo?si=dCBhtHBXyc2wyMAi

Always refer to the official documentation for details.

How to control fallback?

If you stay within the above limitations, the semantic model should operate in Direct Lake mode. If your model size is small based on the P/F SKU and if you perform delta table maintenance frequently (read this excellent blog by my colleague Miles Cole), you can avoid fallback.

You can use below Python script to get number rows, rowgroups, number of files, for the default lakehouse mounted in the notebook. Compare these numbers with the limits from the above table based on your SKU. Note that the Delta_Size_MB in the table below is the size of the Delta table and not the semantic model size. I have included it as a reference. To get the size of the model in memory, use Vertipaq Analyzer.

import pandas as pdimport pyarrow.parquet as pqimport numpy as npdef gather_table_details():    """    Sandeep Pawar  |  fabric.guru  |  Nov 25, 2023    Collects details of Delta tables including number of files, rowgroups, rows, size, and last OPTIMIZE and VACUUM timestamps.    This can be used to optimize Direct Lake performance and perform maintenance operations to avoid fallback.    The default Lakehouse mounted in the notebook is used as the database.    Returns:    DataFrame containing the details of each table, or a message indicating no lakehouse is mounted.    """    # Check if a lakehouse is mounted    lakehouse_name = spark.conf.get("trident.lakehouse.name")    if lakehouse_name == "":        return "Add a lakehouse"    def table_details(table_name):        detail_df = spark.sql(f"DESCRIBE DETAIL `{table_name}`").collect()[0]        num_files = detail_df.numFiles        size_in_bytes = detail_df.sizeInBytes        size_in_mb = size_in_bytes / (1024 * 1024)        # Optional, set to False to avoid counting rows as it can be expensive        countrows = True        num_rows = spark.table(table_name).count() if countrows else "Skipped"        delta_table_path = f"Tables/{table_name}"        latest_files = spark.read.format("delta").load(delta_table_path).inputFiles()        file_paths = [f.split("/")[-1] for f in latest_files]        # Handle FileNotFoundError        num_rowgroups = 0        for filename in file_paths:            try:                num_rowgroups += pq.ParquetFile(f"/lakehouse/default/{delta_table_path}/{filename}").num_row_groups            except FileNotFoundError:                continue        history_df = spark.sql(f"DESCRIBE HISTORY `{table_name}`")        optimize_history = history_df.filter(history_df.operation == 'OPTIMIZE').select('timestamp').collect()        last_optimize = optimize_history[0].timestamp if optimize_history else None        vacuum_history = history_df.filter(history_df.operation == 'VACUUM').select('timestamp').collect()        last_vacuum = vacuum_history[0].timestamp if vacuum_history else None        return lakehouse_name, table_name, num_files, num_rowgroups, num_rows, int(round(size_in_mb, 0)), last_optimize, last_vacuum    tables = spark.catalog.listTables()    table_list = [table.name for table in tables]    details = [table_details(t) for t in table_list]    details_df = pd.DataFrame(details, columns=['Lakehouse Name', 'Table Name', 'Num_Files', 'Num_Rowgroups', 'Num_Rows', 'Delta_Size_MB', 'Last OPTIMIZE Timestamp', 'Last VACUUM Timestamp'])    return details_df.sort_values("Delta_Size_MB", ascending=False).reset_index(drop=True)details_df = gather_table_details()details_df

Output:

To get the semantic model size on disk, use workspace settings:

You can explicitly turn off fallback to DirectQuery by changing the TOM properties. There are several ways as below:

Power BI Service:

As of Feb 21, 2024, Web modeling experience in Power BI service has the UI option to change the Direct Lake fallback behavior. Default is Automatic. Read the official blog here.

Tabular Editor:

Upgrade to the latest version of Tabular Editor 2 (v2.21.1) which has the latest AMO/TOM properties. Link to Tabular Editor.
Connect to the Direct Lake semantic model using XMLA endpoint
Select Model > Under Options > Direct Lake Behaviour > Change from Automatic to DirectLakeOnly
Save the model

💡

After updating the model using Tabular Editor, you will not be able to use the web modeling experience. You cannot use this method for an auto-generated default semantic model as it doesn't have XMLA endpoint.

Using C Sharp

Use the below C# script

/*Options:DirectLakeBehavior.AutomaticDirectLakeBehavior.DirectLakeOnlyDirectLakeBehavior.DirectQueryOnly*/Model.DirectLakeBehavior = DirectLakeBehavior.DirectLakeOnly;Model.SaveChanges();

SSMS

Thanks to Daniel Otykier for pointing out that the TMSL script can also be executed in SSMS to update this property. Use SSMS 19.2 to avoid errors.

Add/update "directLakeBehavior":"DirectLakeOnly" Model properties in TMSL.

https://youtu.be/Z2C0MpgAVA0

Semantic Link

You can use the createOrReplace TMSL script and use semantic link >= 0.4.0 to execute the TMSL in the Fabric notebook. If you have many Direct Lake semantic models you want to update, this is a great option.

#install semantic-link 0.4.0 or higher!pip install semantic-link--qimport sempy.fabric as fabricws = "<>" #workspace id or nametmsl_script = """{  "createOrReplace": {    "object": {      "database": "Shortcuttest_DL"    },    .    .        """fabric.execute_tmsl(workspace=ws, script=tmsl_script)

Update 3/15/2024

Semantic Link v0.7

Semantic Link v0.7 can be used to update Tabular Object Model (TOM) properties directly in the Fabric notebook !!!

Use below function to get the list of all the Direct Lake datasets in a workspace and its respective Direct Lake behavior.

%pip install semantic-link --qimport sempy.fabric as fabricfrom sempy.fabric.exceptions import DatasetNotFoundExceptiondef get_directlake_datsets(workspace=None, return_directlake_dataset=True):    """    Sandeep Pawar | fabric.guru    Function to get a list of Direct Lake datasets in a Fabric workspace.    Returns a pandas df    """    #workspace of the notebook irrespective of the lakehouse attached    workspace=workspace or spark.conf.get("trident.artifact.workspace.id")    #clear TOM cache    fabric.refresh_tom_cache(workspace=workspace)    def get_partition_mode(workspace, dataset):        return any(fabric.list_partitions(workspace=workspace, dataset=dataset)["Mode"].isin(["DirectLake"]))    cols = ['Dataset Name', 'Dataset ID', 'Created Timestamp', 'Model Direct Lake Behavior']    df = fabric.list_datasets(workspace=workspace, additional_xmla_properties=["Model.DirectLakeBehavior"])[cols]    df["IsDirectLake"] = False  # Initialize w/ False    for i, row in df.iterrows():        try:            df.loc[i, "IsDirectLake"] = get_partition_mode(workspace=workspace, dataset=row["Dataset Name"])        except DatasetNotFoundException:            df.loc[i, "IsDirectLake"] = "Not Found"    return df.query("`IsDirectLake` == True") if return_directlake_dataset else dfdf = get_directlake_datsets(workspace="workspace")    df

Change Direct Lake Behavior

Use below function to change the behavior of a specific model or all models in a workspace using the dataframe returned by the above function.

#Need semantic-link version 0.7 installedimport sempy.fabric as fabricimport Microsoft.AnalysisServices.Tabular as tom#behavior can be DirectLakeOnly, DirectQueryOnly or Automaticdef change_dl_behavior(workspace, datasetname, dl_behavior = "DirectLakeOnly"):    """    Sandeep Pawar | fabric.guru    This function updates the Direct Lake behavior, and refeshes the model    """    #workspace of the notebook irrespective of the lakehouse attached    workspace=workspace or spark.conf.get("trident.artifact.workspace.id")    tom_server = fabric.create_tom_server(False, workspace)    tom_database = tom_server.Databases.GetByName(datasetname)    behavior = None    if dl_behavior.lower() == "directlakeonly":        behavior = tom.DirectLakeBehavior.DirectLakeOnly    elif dl_behavior.lower() == "directqueryonly":        behavior = tom.DirectLakeBehavior.DirectQueryOnly    elif dl_behavior.lower() == "automatic":        behavior = tom.DirectLakeBehavior.Automatic    else:        print("Direct Lake Behavior is invalid. Valid values are DirectLakeOnly, DirectQueryOnly, Automatic ")        return         try:        tom_database.Model.DirectLakeBehavior = behavior        tom_database.Model.SaveChanges()         fabric.refresh_tom_cache(workspace=workspace)        tom_database.Model.RequestRefresh( tom.RefreshType.Full)    except Exception as e:        print(f"An error occurred: {e}")change_dl_behavior(workspace="workspacename", datasetname="dsname", dl_behavior = "DirectLakeOnly")

To change Direct Lake Behavior of all datasets in a workspace:

workspace = ""df = get_directlake_datsets(workspace=workspace)for ds in df.query("`Model Direct Lake Behavior`!='DirectLakeOnly'")["Dataset Name"]:    try:        change_dl_behavior(workspace=workspace, datasetname=ds, dl_behavior="DirectLakeOnly")    except Exception as e:        print(f"Error with {ds}. Either the dataset is a default dataset or check permission to the dataset")        continue

Note: You will get an error if you try to change the behavior of a default semantic model. Default semantic model is always in Automatic mode and cannot be changed.

How to check if the queries are in Direct Lake mode?

There are several ways to do so. You can use Performance Analyzer to see if the query is in Direct Query or Direct Lake mode. Read this documentation for details.

Let me show you another way to achieve this not covered elsewhere. Thanks to the latest version of Semantic Link, it's now possible to perform query tracing in a Fabric notebook. This is similar to query tracing using SQL Server Profiler or DAX Studio but using sempy in a Fabric notebook. Below is the code:

#Semantic link >= 0.4.0!pip install semantic-link --qimport sempy.fabric as fabricimport timews = "<>" #workspaceds= "<>"  #semantic modeldax=            """                < DAX QUERY >                """def trace_directlake_behavior(workspace, semantic_model, dax_query):    """    Sandeep Pawar | Fabric.guru    Trace DAX queries to determine Direct Lake behavior of a query    Returns a tuple (behavior, dataframe with query trace)    """    BASE_COLUMNS = ["EventClass", "EventSubclass","TextData"]    START_COLUMNS = BASE_COLUMNS     END_COLUMNS = START_COLUMNS     DQ_COLUMNS = ["TextData"]    event_schema = {        "DirectQueryBegin": DQ_COLUMNS,        "DirectQueryEnd": DQ_COLUMNS,        "VertiPaqSEQueryBegin": START_COLUMNS,        "VertiPaqSEQueryEnd": END_COLUMNS,        "VertiPaqSEQueryCacheMatch": BASE_COLUMNS,        "QueryBegin": START_COLUMNS,        "QueryEnd": END_COLUMNS,    }    try:        with fabric.create_trace_connection(workspace=workspace, dataset=semantic_model) as conn:            with conn.create_trace(event_schema, "Trace DL DQ") as trace:                trace.start()                fabric.evaluate_dax(workspace=workspace, dataset=semantic_model, dax_string=dax_query)                time.sleep(5)                  trace_df = trace.stop()        behavior = "Direct Lake query" if  trace_df['EventClass'].str.contains("VertiPaqSE").any() else "DirectQuery"        return behavior, trace_df    except Exception as e:        print("Error: Check workspace, semantic model settings, DAX. Ensure XMLA is enabled and you have access to the semantic model")        raisebehavior, trace_df = trace_directlake_behavior(workspace=ws, semantic_model=ds, dax_query=dax)behavior

Semantic Link >= 0.5.0 (Update Jan 12, 2024)

In the Semantic Link version >= 0.5.0, has a neat way to get additional properties of the model using additional_xmla_properties. This will give you Direct Lake Behavior of all the semantic models in a workspace. This is at the model level and not query.

Update: 2/21/2024

Identifying The Fallback Reason

Eiki Sui mentioned on Twitter that with the December 2023 release of Power BI, you can execute the new INFO DAX functions. Thanks Eiki.

~~Execute~~ INFO.DELTATABLEMETADATASTORAGES() against the DirectLake semantic model in DAX Studio and check the fallback reason. 0 is Automatic, 1 is DirectLake and 2 is DirectQuery. Note that these are the Direct Lake behaviors of the tables and not the query.

Thanks to Michael Kovalsky ( Principal Program Manager at Fabric CAT) for pointing out that the FallbackReason from INFO.DELTATABLEMETADATASTORAGES() or $SYSTEM.TMSCHEMA_DELTA_TABLE_METADATA_STORAGES can be used to identify some of the fallback reasons.

In the example below, I created a Direct Lake semantic model with three delta tables and one view. If I execute the above DMV, I can see that the fallback reason for one of the tables is 2. Since view is not a delta table, it's shown as blank.

Reason Code	Fallback Reason
0	'No reason for fallback'
1	'This table is not framed'
2	'This object is a view in the lakehouse'
3	'The table does not exist in the lakehouse'
4	'Transient error'
5	'Using OLS will result in fallback to DQ'
6	'Using RLS will result in fallback to DQ'

💡

Michael has below python script in his repo to check the fallback reason using Semantic-Link. This is extremely helpful for debugging. I recommend executing this DMV/python script after creating the Direct Lake semantic model as a check before you build rest of the model.

https://github.com/m-kovalsky/Fabric/blob/main/FrameTablesBasedOnFallbackReason.py

Fallback Scenario:

Consider the following scenario where I have two tables in the Direct Lake semantic model. These two tables have a 1-*m relationship between them.

Orders : Delta table
Customer_vw : SQL View

If I check the above DMV, it correctly shows that the Orders table doesn't fall back while the Customer table has fallback reason = 2 i.e. it will be in DirectQuery mode.

I created three measures:

Count of orders
Count of Customers
Count of Orders of customers that are in the automobile segment which will need using the relationship between the two tables

The expectation is Count of orders should be in Direct Lake, Count of customers should be in Direct Query but what about the third measure that uses a delta table and a view?

💡

In this case because the Direct Lake behavior is set to "Automatic" for the model, the third measure will switch to DirectQuery storage mode to combine the queries from the delta table and the view. So it's important to keep in mind that the fallback is evaluated at the query level while the behavior is set at the model level. If I were to set the model behavior to "Direct Lake", the two measures in DirectQuery mode will fail. In general, avoid using views in a Direct Lake semantic model.

Should you change the fallback behavior?

Yes, that would be my recommendation. For a couple of reasons:

You know with certainty that the queries will always be in Direct Lake mode so you can develop and optimize accordingly.
You will likely see performance gains. When the Direct Lake behavior is set to "Automatic", two query plans are generated - one for Direct Lake and another for DirectQuery. If you set the behavior to Direct Lake, only the Direct Lake query plan will be created, hence potentially improving performance, especially for complex queries and large models. e.g. I tested a few queries on a large semantic model (>500M rows), and some of the queries were ~10% faster.
You should limit the number of rows, row groups, size and perform maintenance in addition to changing the fallback behavior.

💡

Fallback is not necessarily negative; consider it akin to insurance. However, in developing a large model, ensuring certainty about its behavior is crucial. This allows for design and optimization, and controlling the fallback aids in achieving this.

What happens if Direct Lake is not supported?

If you set the behavior to Direct Lake only but Direct Lake is not supported (see the list above), you will get an error message. This is good because you can make modeling changes accordingly. For example, if you build a semantic model using views from a warehouse, though in web experience you will see Direct Lake, but the queries are in DirectQuery. I changed it to Direct Lake only and got the below error message:

Thanks to Tamas Polner, Michael Kovalsky, Akshai Mirchandani at Microsoft for answering my questions.

Reference:

Visualizing JSON Structure In Fabric Notebook

Sandeep Pawar — Mon, 20 Nov 2023 20:30:55 GMT

JSON is ubiquitous, particularly when working with APIs and logs. Its unstructured nature makes it highly flexible for handling anything from a simple array to a complex nested structure. However, this can also make it challenging for data analysis. When parsing JSON, it's crucial to understand its structure so you can flatten it and convert it into a tabular format for analysis. Once the structure is identified, you can use pandas or PySpark to explode or normalize it into the desired shape. In this article, I will explain the method I use. While this approach is applicable to any notebook, there is a specific trick to make it work in a Fabric notebook.

I will use the following example. Just looking at it, we can see that it's about employees, their contact details and skills.

{  "employees": [    {      "id": 1,      "name": "John Doe",      "role": "Developer",      "contact": {        "email": "john.doe@example.com",        "phone": "123-456-7890"      },      "skills": ["Python", "JavaScript"]    },    {      "id": 2,      "name": "Jane Smith",      "role": "Designer",      "contact": {        "email": "jane.smith@example.com",        "phone": "098-765-4321"      },      "skills": ["Adobe XD", "Figma"]    }  ]}

If you use .json_normalize() method in pandas to read this json, you will get a dataframe with just one row and column.

To unnest this, we will need to identify the structure and pass extra arguments to .json_normalize() . I like using panel library for that. With panel, you can identify the structure easily and also interact with it to find the nested objects. You just need one trick to make it work in Fabric notebook.

First, install panel and instantiate the extension

  !pip install panel --q  import panel as pn   pn.extension()

Next, use the .pane.JSON() method to load the json
```
  pn.pane.JSON(json_obj, name='JSON', height = 300, width = 300)
```
After executing the code above, you would expect to see the JSON structure, but it doesn't appear. This is the trick. Panel uses Bokeh renderer and hence you need to import Bokeh to output the visual. I wrote about it in one of my blogs.
Fabric runtime already has Bokeh installed so you just need to import it.

  import bokeh   bokeh.io.output_notebook()  pn.pane.JSON(json_obj, name='JSON', height = 300, width = 300)

With this, you should now see the panel visual:

This is fully interactive and you can see all levels of the json and the its nested structure.

From this we can see two things, we have employee records under the key "employees". Each employee record is a dictionary with keys for 'ids', 'name', 'role' and nested dictionaries for 'contact' and a list for 'skills'. To flatten this to a tabular format , we need to flatten the list under 'skills' and extract attributes under the 'contact' dictionary. Lists can be flattened by record_path argument and nested dictionary can be flattened by meta . This will transform the nested and hierarchical json into a table.

df = (pd.json_normalize(json_obj['employees'],                        record_path='skills',                        meta=['id', 'name', 'role', ['contact', 'email'], ['contact', 'phone']],                       record_prefix='skill_')      )df

This was rather a simple example, but you can use a similar approach to plot the structure, interact with it and then flatten it as needed. To recap, the trick to make it work in Fabric notebook was to import bokeh and including bokeh.io.output_notebook() in the code block.

If you are using the Synapse VSCode extension to work with Fabric notebooks, try the JSON crack extension. It plots the json into a fully interactive tree.

Programmatically Creating, Managing Lakehouses in Fabric

Sandeep Pawar — Sun, 19 Nov 2023 02:25:15 GMT

At MS Ignite, Microsoft unveiled a variety of new APIs designed for working with Fabric items, such as workspaces, Spark jobs, lakehouses, warehouses, ML items, and more. You can find detailed information about these APIs here. These APIs will be critical in the automation and CI/CD of Fabric workloads.

With the release of these APIs, a new method has been added to the mssparkutils library to simplify working with lakehouses. In this blog, I will explore the available options and provide examples. Please note that at the time of writing this blog, the information has not been published on the official documentation page, so keep an eye on the documentation for changes.

You can use the lakehouse REST API but I find it easier to use mssparkutils.

Lakehouse Method

To check all the available methods, use .help()

from notebookutils import mssparkutilsmssparkutils.lakehouse.help()

There are five methods create, get, update, delete and list. You can use mssparkutils.lakehouse.help("methodName") to get the description and the arguments.

create()

Use this to create a lakehouse in any of the Fabric workspaces.

create(name: String, description: String = "", workspaceId: String = "")

I created a test lakehouse as below. It created the lakehouse, SQL endpoint and a default semantic model. Note that it takes ~5-10 s for the SQL endpoint/lakehouse to get created so wait for ~10s before adding new items to this newly created lakehouse. Lakehouse names cannot have spaces or special characters like @,#,$,%. Underscores are allowed.

💡

workspaceID is optional. If no workspaceID is provided, the workspace hosting the notebook is used. To get the id of the workspace, use spark.conf.get("trident.workspace.id")

get()

get() returns the details of the lakehouse. You can specify either the lakehouse name or the id. As mentioned above, excluding the workspace id will return the lakehouse of that name from the attached workspace.

update()

update() as the name suggests can be used to change the name or the description of the specified lakehouse. update does not change the lakehouse id, only the name and the description. newName is a required parameter.

list()

Use this to get a list of all the lakehouses in the specified workspace which is great.

Below I used the list method to get a list of all the lakehouses in all the workspaces the user has access to. Note that since these methods are calling APIs, the throttling limits are applicable. In the below code, I have added a 10s wait to overcome that, adjust as needed.

import requestsimport pandas as pdimport timefrom notebookutils.mssparkutils.credentials import getTokenfrom notebookutils.mssparkutils.lakehouse import list as list_lakehousedef get_lakehouse_list():    """    Sandeep Pawar   |    faric.guru    Retrieves a list of all lakehouses in all workspaces accessible to the user.    Only Fabric and Premium workspaces are used.    Refer https://learn.microsoft.com/en-us/rest/api/fabric/articles/throttling for throttling limits    Returns a pandas datafrema    """    base_url = "https://api.powerbi.com/v1.0/myorg"    token = getToken("https://analysis.windows.net/powerbi/api")    headers = {"Authorization": f"Bearer {token}"}    response = requests.get(f"{base_url}/groups", headers=headers)    workspaces_data = response.json().get('value', [])    #Premium/Fabric only    premium_workspaces = pd.DataFrame(workspaces_data).query('isOnDedicatedCapacity == True')[["name", "id"]]    lakehouses = []    for ws_id in premium_workspaces['id'].unique():        lakehouses.append(pd.DataFrame(list_lakehouse(workspaceId=ws_id)))        time.sleep(10)  # Throttling limit for compliance    return pd.concat(lakehouses, ignore_index=True)# Call the functionlakehouse_data = get_lakehouse_list()lakehouse_data

Using the Fabric REST API

Alternatively, you can use the Fabric REST API to get a list of all the lakehouses as below:

import requestsimport pandas as pdfrom requests.exceptions import HTTPErrordef get_lakehouse_list_api():    '''    Sandeep Pawar  |   fabric.guru    This function uses the Fabric REST API to get all the lakehouses in the tenant that the user has access to.    '''    base_url = "https://api.fabric.microsoft.com/v1/admin/items?type=Lakehouse"    token = mssparkutils.credentials.getToken("https://api.fabric.microsoft.com/")    headers = {"Authorization": f"Bearer {token}"}    try:        response = requests.get(base_url, headers=headers)        response.raise_for_status()        data = response.json()        # Check if 'itemEntities' exists, may not exist for all items types        if 'itemEntities' not in data:            raise KeyError("'itemEntities' key not found in the response data")        lakehouses = pd.json_normalize(data['itemEntities'], sep='_')        return lakehouses    except HTTPError as http_err:        print(f"HTTP error occurred: {http_err}")    except Exception as err:        print(f"An error occurred: {err}")get_lakehouse_list_api()

REST API returns a few additional details not provided by mssparkutils.

You may also find my other blog on mounting a lakehouse helpful. As a side note, to get the lakehouse attached to a notebook use:

spark.conf.get("trident.lakehouse.id")spark.conf.get("trident.lakehouse.name")

Using Semantic-Link

Update Jan 12, 2024

With Semantic-Link v 0.5.0, you can use the create_lakehouse method to a create lakehouse in any workspace. By default, it will be created in the workspace attached to the notebook.

#install semantic link >= v0.5.0!pip install --upgrade semantic-link --qimport sempy.fabric as fabricfrom sempy.fabric.exceptions import FabricHTTPExceptionlh_name = "Sales_Lakehouse"ws_name = None #if None use the default workspace else use the specifiedtry:    lh = fabric.create_lakehouse(display_name=lh_name, workspace=ws_name)    print(f"Lakehouse {lh_name} with ID {lh} successfully created")except FabricHTTPException as exc:    print(f"An error occurred: {exc}")

Identifying Semantic Model Storage Mode in Fabric

Sandeep Pawar — Wed, 15 Nov 2023 04:35:38 GMT

With the latest addition of Direct Lake as a storage mode in Power BI, the range of available storage modes continues to expand. If you look at the semantic models in the workspace, except for the Push Semantic Model, it's not possible to identify if a semantic model is in import, DirectQuery, Direct Lake, Dual storage mode. So I wrote the below script to do that.

💡

Since I am using Semantic Link, this code only works for Premium workspaces in a Fabric capacity. This is not a comprehensive list of all semantic models and supported storage modes.

Build Semantic Model Catalog:

  !pip install semantic-link swifter --q  import requests  import pandas as pd  import sempy.fabric as fabric  def build_dataset_catalog():      '''      Sandeep Pawar | Fabric.guru      Build dataset (semnatic model) catalog in Fabric      '''      url = "https://analysis.windows.net/powerbi/api"      token = mssparkutils.credentials.getToken(url)      headers = {"Authorization": "Bearer " + token}      response = requests.get("https://api.powerbi.com/v1.0/myorg/groups", headers=headers)      premium_workspaces = pd.DataFrame(response.json()['value']).query('isOnDedicatedCapacity==True')[["name", "id"]]      # premium_workspaces = pd.DataFrame(response.json()['value'])[["name", "id"]]      dfs = [          fabric.list_datasets(ws).assign(workspace=ws)           for ws in premium_workspaces['name']      ]      catalog = pd.concat(dfs, ignore_index=True)      cols = ['workspace'] + [col for col in catalog.columns if col != 'workspace']      return catalog.reindex(columns=cols)  datasets = build_dataset_catalog()[["workspace","Dataset Name","Dataset ID"]]  datasets

Use Semantic Link To Identify Partition Modes

https://gist.github.com/pawarbi/b48efdc9a8e864d0c4082b851d28d1ad

Note here that the default semantic model is not a storage mode but I am still identifying it because it may be useful. Also, note that, Mode/Type returned here is a Python set of all modes in that semantic model. e.g. a semantic model may have two tables, one in import mode and another in DirectQuery. In that case, Mode/Type will show {'import','directquery'}. {'import'} means all partitions in this semantic model are in import mode.

💡

It is important to note that if a semantic model is in Direct Lake mode, it does not necessarily mean that the queries will be in Direct Lake mode as well. Direct Lake semantic model has a fallback to DirectQuery mechanism in which under certain circumstances, the "query" will be in DirectQuery mode. It's possible to restrict this behavior. Please refer to Direct Lake documentation for more details.

Formatting DAX Expression Returned By SemPy in Fabric

Sandeep Pawar — Mon, 30 Oct 2023 16:43:36 GMT

There is an old Italian saying "If it's not formatted, it is not DAX" 😁

When you get the list of measures from SemPy, it's not formatted and is hard to read and understand. Thankfully, the SQLBI team has made the DAX parser and the formatter available via an API. I wrote a quick function to return the formatted DAX expression of a measure. You can either pass a DAX expression or the FabricDataFrame returned by fabric.list_measures()

Pre Requisites

Microsoft Fabric
Dataset in a Premium (F, P, PPU) workspace in Fabric tenant
semantic-link and beautifulsoup4 installed

Steps:

Install Semantic Link and BeautifulSoup

!pip install beautifulsoup4 semantic-link  --q

Get a list of measures for a dataset

import sempy.fabric as fabricfrom sempy.fabric import FabricDataFrameimport pandas as pdws = "Sales Workspace" #Specify your Premium workspace name or IDds = "Sales & Returns" #Specify dataset name or IDmeasures_df = fabric.list_measures(workspace=ws, dataset=ds)

Function to call the DAX Formatter API

import requestsfrom bs4 import BeautifulSoupdef format_dax(dax_expression=None, measure_df=None, measure_name=None, line='short', region='US'):    """    Author: Sandeep Pawar   |    fabric.guru    You must specify either the DAX for the measure or the measure_df + measure_name    But not both.     Refer to https://www.daxformatter.com/ for DAX Formatter details and options    """    if (dax_expression is None and measure_df is None) or (dax_expression is not None and (measure_df is not None and measure_name is not None)):        return "Error: Provide either a DAX expression or a combination of measure_df and measure_name, but not both."    if dax_expression is not None:        dax_to_format = dax_expression    else:        if measure_df is not None and measure_name is not None:            measure_row = measure_df[measure_df['Measure Name'] == measure_name].iloc[0]            dax_to_format = f"{measure_row['Measure Name']}={measure_row['Measure Expression']}"        else:            return "Error: Provide either a DAX expression or a combination of measure_df and measure_name."    url = "https://www.daxformatter.com"    payload = {        'r': region,        'fx': dax_to_format,        'embed': '1',        's': 'auto',        'l': line    }    try:        response = requests.post(url, data=payload)        response.raise_for_status()        soup = BeautifulSoup(response.text, 'html.parser')        formatted_dax_div = soup.find('div', {'class': 'formatted'})        for br in formatted_dax_div.find_all("br"):            br.replace_with("\n")        formatted_dax_text = formatted_dax_div.get_text(' ').replace('\xa0', ' ')        return print(formatted_dax_text)    except Exception as e:        return f"Error: {e}"

Format a specific measure from the FabricDataFrame:

#Here I want to format a measure called "WIF Units Returned_2"#Note here dax_expression is None. You cannot use both at the same time.format_dax(dax_expression=None, measure_df = measures_df, measure_name = "WIF Units Returned_2")

To format any measure that's not in the above dataframe, pass it to the dax_expression instead. DAX here should always start with the name of the measure.

##Unformatted DAXmeasure_expression = """Sales Amount =        VAR RoundedNetPrices =            ADDCOLUMNS (                SUMMARIZE ( Sales, Sales[Net Price] ),                "@Rounded Net Price", ROUND ( Sales[Net Price], 1 ),                "@Sum Of Quantity", CALCULATE ( SUM ( Sales[Quantity] ) )            )        VAR Result =            SUMX ( RoundedNetPrices, [@Rounded Net Price] * [@Sum Of Quantity] )        RETURN    Result"""# pass unformatted DAXformat_dax(dax_expression=measure_expression, measure_df = None, measure_name = None)

💡

The default is short line but you can change it by line="long"

format_dax(dax_expression=measure_expression, measure_df = None, measure_name = None, line='long')

This API is generously made available by the SQLBI team so be respectful of the limits, terms and conditions. Thank you SQLBI.

Measure Maze: Visualizing Measure Dependencies Using Semantic Link & Network Analysis

Sandeep Pawar — Thu, 26 Oct 2023 21:35:47 GMT

In my previous blog post, I introduced Semantic-Link, discussing its use cases and explained how it enables us to create solutions that were either not possible or not easily achievable before. In this blog post, I would like to present another powerful use case that, although possible in the past, could not be created and used seamlessly in Power BI. Allow me to introduce the MeasureMaze Python library, which helps uncover insights from a complex maze of dependencies in a Power BI semantic model using Semantic-Link and the power of network analysis.

When you are developing complex analytics solutions using Power BI, you will typically have hundreds of measures that are interdependent with each other. If you need to find dependencies among these measures, you have to use external tools (Tabular Editor and DAX Studio) which work great but these tools don't allow you to find insights or show critical measures without additional analysis. MeasureMaze aims to plug that gap using Semantic-Link and network analysis in a Fabric notebook. I have used this approach before in my work and wanted to build on it after the release of Semantic-Link.

Pre-Requisites:

Microsoft Fabric
Premium workspace and dataset in a Fabric tenant

Installation

To get started, install the MeasureMaze library from my github. You can pip install it in a Fabric notebook.

#Install Measure Maze in a Fabric Notebook!pip install https://github.com/pawarbi/MeasureMaze/raw/main/measuremaze-0.0.1-py3-none-any.whl --q

If you want to follow along with me, publish this Profit Analysis pbix to a Premium (F, P or PPU) workspace in a Fabric tenant.

After installing the library, import it and let's get started.

#Import Measure Mazefrom measuremaze.map import get_dependencies, PlotDependencies#Specify Premium workspace and dataset id or namews = "90d7b4f7-9c0f-425c-87f1-xxxxxxxx"ds = "a52cde7b-9497-4cfa-a133-xxxxxxxx"#Returns a FabricDataFrame with dependency mapping for the above datasetdf = get_dependencies(dataset = ds, workspace = ws)df.head()

get_dependencies() uses SemPy under the hood to execute two DMVs to get all the dependencies and return a FabricDataFrame containing 5 columns:

Object Name : Measure or Calculated Column name
Expression: DAX Expression used to define the object
Dependent on : Measure, column or table used to define the object. If the object is a column (regular or calculated), it is always shown in Table[Column Name] format to quickly identify if it's a column or a measure. e.g. in the below table, Net Sales is a measure whereas Sales[Status] is a column.
Type : Whether the Dependent on object is a column, measure or table
Folder: Home table and folder containing the object. Always starts with Table name followed by the folder name. i.e. Table\Folder\Sub-folder\..

You can use this dataframe like any other pandas dataframe and filter/transform as needed. For example, we know that using IF , especially nested IF can cause DAX performance issues which can be optimized by using variables and SWITCH . We can filter the above dataframe to identify all objects that use IF. Using regex patterns can make this very powerful.

This dataframe shows the dependencies in a tabular format which is great for querying but not for understanding the dependencies quickly and visually. To do that, use the PlotDependencies class from MeasureMaze which contains several helpful methods.

PlotDependencies

Instantiate the class by passing the above dependencies dataframe

#Plot all objectsall_objects = PlotDependencies(df)all_objects.show()

Above I passed the entire dependencies dataframe and used the .show() method to get a quick summary of the dependencies in a tabular format. This table contains four columns:

Object : This is measure or calculated column
Upstream Dependencies : Number of objects used to define the measure/calculated column
Downstream Dependencies : Number of objects that use the above measure/calculated column
Centrality : Measure of how central or connected a node is (more on this later). The higher the centrality score, the higher its importance in the network of dependencies

From the table, we can see that Net Sales measure is defined by 2 objects while it is used in 16 different objects. By default, the table is sorted by centrality score to provide a list of objects that are influential.

Using Folders

If you are like me, I always add measures to a folder to keep everything organized and logical, especially on large projects where it's typical to have hundreds of measures. In that case, you may not want to analyze all the measures and only look at a subset of measures in a folder or list of folders. PlotDependencies has an optional argument folder to help with that.

Above I am looking at the measures that are in the TopN folder in Design DAX table. Note here that the folder names must be specified in TableName\folder\sub-folder format. But what if you want to look at all the measures in the Design DAX table? You can specify the table too.

You can pass a list of folders as well. Below I am analyzing measures that are in two folders from two different tables. Using folders can help narrow down the scope.

💡

Quick Tip: Folders use startswith condition in the code so you just need to provide the initial letters of the folder name.

Network Analysis

So far we have only done tabular analysis. However, my main objective with this project is to use network analysis to visualize and understand large dependency network. I have built several methods that can help depending on the use cases based on my experience working on large enterprise projects.

plot

To plot all the dependencies for all the objects that were in the dataframe used in PlotDependencies are plotted using the NetworkX and Graphviz libraries. Both of these libraries are installed during MeasureMaze installation.

This plot visually shows how all objects are related to each other. By default, all objects from the dataframe are used but you can specify an object or list of objects to reduce the scope:

You can pass a list as well:

Nodes (i.e. objects) have colors based on the type of object.

Red: Calculated columns and calculated tables
Green : Regular column
Purple: Measure
Blue : Object under consideration.

💡

My intention with color-coding was to quickly identify the object type and use colors that have high contrast to make them accessible\color blind friendly. Admittedly, this is not working as I expected and I will fix it in future updates. If you have better ideas, please let me know.

In the above example, %Return Rate is a calculated table (because it's red in color and doesn't have Table[Column Name] format. %Return Rate[% Return Rate] is a calculated column. Sales[Unit] is a regular column. By default, when .plot() method is used, it will always show immediate upstream and downstream dependencies to show context.

plot() method is not restricted to measures and calculated columns only. You can pass a regular column to see which objects use that column:

In this example, Sales[Unit] column is used by four different measures. Sales[Unit] is blue because it's the object being plotted (again the colors are inconsistent and will be fixed later).

Plotting the dependencies and visualizing them can help with documentation as well as analysis. It's much easier to understand all the dependencies if you plot them instead of seeing them in a tabular format.

Configuring the plot

If the semantic model has hundreds of objects, the plots can get very crowded and difficult to see. There are three ways you can configure the plots:

layout: By default, the objects are plotted in a vertical layout, i.e. the upstream/downstream dependencies are laid out vertically (above and below the node). You can change that to layout=horizontal which will work better in some cases.
- graph_size : The default size is '20,20' which should work in most cases, but you can adjust that by changing the size
- dpi: To increase resolution, set the dpi to a high value.
  
  All three can be set at the same time to get the desired size and layout. It can still be challenging in a notebook if the number of objects is very large. My recommendation is to use a list of folders or objects to reduce the scope. I am planning on adding a way to export the plots as PNG or HTML for viewing them easily outside of notebooks.

Centrality

Before getting into data analytics, I worked as an engineer in an R&D team to design and analyze heat transfer products. It was my job to analyze test and production data and find insights to design better, stronger, and optimal products. I used network analysis frequently to analyze vast amounts of data that were interconnected with each other. I wanted to see if the same principles can be used on a network of measure dependencies, especially on large projects.

Centrality measures the importance of nodes in a network. A node that is highly connected to other nodes is deemed more important than a node with few connections. In the context of Power BI, it can help identify key objects that use many other objects or are used by other objects for their definition. These central or key objects can potentially:

influence many other measures and columns. Changing these could have a broad impact on the report results and performance
be bottlenecks if defined sub-optimally
be helpful for lineage tracing and auditing. Report authors could prioritize these objects for accuracy and auditing in their DataOps pipeline
be helpful for documentation
help to identify redundancy and complexity

In the above network, node A is more influential than node F or node G because it has more connections.

I recently completed a project with over 400 measures with varying complexity and wished I had a way to highlight to the client how the measures were organized and designed.

💡

Centrality purely measures the importance of a node based on how it is connected to other objects and not on its value to the business, report or by DAX definition.

Imagine, you took over the above project and quickly want to find out which are the key measures. You can do that using the centrality method in Measure Maze.

#Return node(s) with highest centrality scoreall_objects.centrality()

Above, I used the centrality() method which returned Net Sales as the key measure because it has the highest number of connections upstream and downstream. It is used by 16 other measures in its definition. Without centrality, it would have been tough to identify this quickly.

Note that in this case a measure was returned but it could have been a column, a calculated column or a calculated table too if it's used by other objects.

Centrality Level

In all likelihood, you won't just have one key measure. You can return more than one by specifying the level of centrality desired. By default, it's the node with the highest centrality score. If level=2 is used, the top two nodes with the highest centrality are returned - Net Sales and Returns

#Return top 2 central nodesall_objects.centrality(level = 2, graph_size='30,20')

When the nodes are returned, it always shows the upstream and downstream dependencies to provide context. Though the default is level=1 , my recommendation would be to look at the top 5 nodes. In the most examples I have tested, that's sufficient to get a good understanding of how everything is defined and organized. We don't want to see too many isolated nodes which may indicate measures are not being re-used.

The table returned by .show() method also shows the centrality score for each object. I would like to point out that the default centrality method does not distinguish between the upstream and downstream dependencies. Below, ProductR Top N and Units Returned both have a total of 6 connections and hence the same centrality score.

A true key measure cannot be identified just by the number of connections but it is certainly one of the ways and can point in a direction.

Centrality Method

By default, 'degree' centrality method is used to calculate the centrality score. There are many more methods that may not be completely applicable to the type of network we are analyzing.

Degree : Nodes most directly used
Closeness: Nodes that spread information quickly through a network
Betweenness: Nodes that act as bridge between different nodes
Eigenvector: Nodes that are connected to other powerful nodes

All of these four methods are available in the .centrality method. e.g. below I used the closeness method:

Eigenvector, though powerful, needs a large network and is currently unstable in my implementation. I think in Power BI's context, degree makes more sense but I am still exploring to see if and how other methods will help. In future releases, I may drop them.

I highly recommend reading this book if you want to learn more about centrality and network analysis in general.

Depth

Thinking about different use cases, I realized that in some cases, I may want to know if there are measures that have a long dependency depth. e.g. a measure could be defined by a measure and a column. But that upstream measure will have its own dependency and so forth. This tall dependency chain could lead to a bottleneck that may have a cascading effect so I came up with another measure called - depth. Depth looks at the total number of unique upstream dependencies instead of connectedness.

#Returns objects with a long dependency chainall_objects.depth()

In the above example, Profit Indicator measure has 6 levels of upstream dependencies. If any of these objects are changed or defined sub-optimally, it will have a cascading effect on accuracy and performance.

Depth Level

You can specify a level, similar to centrality, to get more than one object. Below level=3 returned 3 measures (Profit Indicator, WIF Profit Difference, WIF Units Returned) with a long chain of dependencies. You can control the size via graph_size, dpi and layout.

Disregard the number of arrows. It's a big that I need to fix.

Find Link

Lastly, during development, you may want to know if there are any dependencies between two objects. This can be found by using find_link() method.

You can specify any two objects and if they share dependencies, they are plotted.

Dependency Analysis in Semantic Link

Semantic-Link also has methods to find functional dependencies between columns in a FabricDataFrame. The difference between the two is, plot_dependencies identifies relationships between columns in a dataframe.

from sempy.dependencies import plot_dependency_metadatadeps = df.find_dependencies()plot_dependency_metadata(deps)

find_dependencies() correctly shows that Expression and Folder are related to the Object Name column while Type is related to the "Dependent

on" column.

This type of analysis is very helpful if you receive a big flat table and want to identify attributes that belong together to create a dimensional model.

Relationships

Semantic Link can also be used to plot relationships between tables.

from sempy import fabricfrom sempy.relationships import plot_relationship_metadatarelationships = fabric.list_relationships(workspace=ws, dataset=ds)plot_relationship_metadata(relationships)

You can also perform dependency violation analysis in SemPy which can be used to identify referential integrity violations and data quality issues. You can read more about it here.

Notes:

While I showed how to use this in a Fabric tenant and after the model has been published to service, MeasureMaze can be used with Pro datasets and locally with Desktop as well. I will write a separate blog post on that.
I am aware of issues with colors and multiple arrows. I will fix that in future updates.
This is the most amount of non-data related Python code (~500 lines) I have written so there are definitely ways to make the library faster and performant in its design. I am not a SE but if you are, happy to share the code with you if you would like to contribute.
If you find any bugs or have suggestions, please feel free to drop them here.
Currently, all the plots are static but can be made dynamic using Pyvis and plotly. I am looking into it and may update in the future.

Fabric Semantic Link and Use Cases

Sandeep Pawar — Thu, 05 Oct 2023 16:22:29 GMT

Fabric Semantic Link has finally been announced and is available for everyone to use. If you know me and have followed my blogs/presentations, you know that I am very passionate about data science, Power BI, notebooks and Python. If you ever meet with product PMs from Microsoft, invariably they will ask you "If you had a magic wand, what feature would you want?" In all of my interactions with them in the last two years, my answer was - something that would allow me to easily access the Power BI semantic model using Python/R/spark in a notebook so I could enrich/augment Power BI reports with data science insights & solutions. Well, they listened, took notes and made that wish into a reality. Semantic Link does all of the above things plus more. Before I explain what Semantic Link is, allow me to set the stage first. I will explain the need, use cases and some of the Semantic Link features. Hopefully, at the end of the blog, you will be as excited as I am.

Overview

If you use the Microsoft data stack, the data scientists primarily use Azure Databricks or Azure Machine Learning to develop and deploy their data science solutions. A typical data analyst or Power BI developer rarely has access to those platforms and has no visibility on what data was used, how the models were created and what assumptions went into creating the ML models. They both operate in their silos and unfortunately, I have experienced the issues it causes firsthand.

Before Fabric

With Fabric

Microsoft solved that challenge by creating one single SaaS product, Microsoft Fabric, where the data scientists, data engineers and Power BI developers can all collaborate and access the same data & items thanks to OneLake. However, it only solved part of the challenge.

Consider the following scenarios:

As a data scientist, you are working on developing a hierarchical sales forecasting model for all product SKUs. You utilize the table from the gold layer, the same as the BI developer, to calculate net sales for all products, create features, and develop an accurate model that the BI developer can incorporate into their report. However, if you have experience with BI projects, you know that calculating the total sales amount is often not so straightforward. In your DAX measure, you will likely include business logic to exclude internal customers, specific departments, and accounts that are not revenue accounts, as well as potentially excluding adjustments and returns. Additionally, you may have visuals where extra filters are applied based on user requirements. If the logic used in the BI report/visual differs from what the data scientist used to train the model, the advanced model will still produce incorrect predictions, regardless of its sophistication. This issue becomes even more problematic if you have a large team with multiple departments and each member defines these measures/features differently.
Let's assume that instead, they collaborate during development and both use the exact same business logic to calculate the sales amount. Now, the data generation process and assumptions align, and everything is good. They move on to their next projects, and six months later, someone takes over this project. Business conditions and requirements have changed, and the new BI developer updates the DAX measures, but the data scientist isn't informed of the change. Now, there is a disconnect between the two, which means the predictions are going to be incorrect again. The BI developer may inform the data scientist colleague about the change in metric definition, but there is no automated process to track if the DAX measure/semantic model has changed and alert data scientists or others of the change.
Let's continue with the example of building a sales forecasting machine learning model. As a Data Scientist, you will create several features to capture the signal and trend in the data. If you are not familiar with features, consider them as numerical representations of the data. For instance, to create a forecasting model for each SKU, you will typically calculate sales from previous days, weeks, months, etc. Data Scientists refer to these as "lags," while BI developers call them time intelligence calculations - WTD, MTD, QTD, Previous Day/Week/Month, etc., which are part of most BI reports. If the BI developer is going to create these measures, shouldn't the data scientist be able to reuse the same measures for consistency and efficiency, at least for rapid prototyping?
I have discussed and presented on several occasions (here and here) how I use notebooks to explore data in my Power BI report development workflow. I may not necessarily build a data science solution for a BI project but the notebooks help with ad-hoc exploration and can help identify insights that otherwise are almost impossible with just Power BI. The primary challenge in this workflow, however, is that I must redefine all the measures and calculations in the notebook using Python. Plus, any insights I find cannot be incorporated in the report.

These challenges arise because there has not been an easy way to use the semantic model defined in the Power BI dataset outside of Power BI. You can use the Power BI REST API or XMLA endpoint (I wrote about it here), but it is too tedious and has several limitations. All the rich business logic captured in the semantic layer through the dimensional model and DAX measures can not be utilized by data scientists and BI developers to build additional solutions - until now ! With Semantic Link, data scientists can now explore the semantic model using Python, PySpark, and SparkSQL in the Fabric Data Science experience. This enables seamless collaboration between data scientists and BI developers, offering full visibility into the semantic layer.

With Fabric Semantic Link

Semantic Link includes sempy Python library which can be used in Fabric notebook to access any Power BI dataset (i.e. semantic model), including all the relationships, data, measures, calculated columns, hierarchies, DMVs and execute DAX against the model in the notebooks using Python or spark. Let's take a simple example of how a data scientist will use this to understand more:

To continue with the previous example of forecasting, imagine I need to create a univariate forecast model and the predictions need to be used in a Power BI report. Instead of defining the revenue logic again, I can query the Power BI dataset directly which has the sales measure, build the training dataset to create the model, and save the predictions to the lakehouse.

List all the tables from Sales & Returns dataset in Sales Workspace:

View relationships between the tables. From the relationship, I can see that we have a sales fact table and product & calendar dim tables. Two tables have many-to-many relationships.
- Next, I need to identify the measures in this dataset. I see "Net Sales" measure with its DAX
- Now, I need to build the training set based on weekly sales so I will get the data I need from the dataset using evaluate_measure:
- 💡
  
  Here I did not have to create a join between Calendar, Store, Product and Sales tables, unlike a pandas dataframe. I created a "FabricDataFrame" that behaves as a pandas dataframe but is aware of the semantic relationship among these tables and hence provides the aggregate based on the context- just like Power BI does. FabricDataFrame supports all pandas operations plus the semantic information, lineage and semantic functions.
- Note in this case, I used the Python API because that's what I am familiar with as a data scientist. But if I knew DAX or the Power BI developer shared the DAX with me, I could have used that as well:
  
  From here I can continue building the training dataset and build the model using pandas, spark, synapseml, sklearn with mlflow etc. just like any other data science project.
I can now evaluate the model, save it to the Fabric model registry and do scoring when required for it to be saved as a delta table for further use in a Power BI dataset.

This is a simple example but I was able to discover the semantic model and fully re-use it in the notebook using Semantic Link. I will share more detailed examples in the coming weeks so stay tuned.

What About Power BI Developers?

Buckle up ! Semantic Link has so many use cases for Power BI developers and data engineers that I think it's going to change our workflows and create many new solutions that were not possible before. Here are a few I thought of with examples:

Dataset Catalog: You can build a catalog of all available datasets as long as the datasets are in a Premium workspace in the same tenant. You can use sempy only in the Fabric notebook but the datasets can be in a Fabric, Premium or PPU workspace in the same tenant.

import requestsimport pandas as pdimport sempy.fabric as fabricdef build_dataset_catalog():    '''    Sandeep Pawar | Fabric.guru    Build dataset (semnatic model) catalog in Fabric    '''    url = "https://analysis.windows.net/powerbi/api"    token = mssparkutils.credentials.getToken(url)    headers = {"Authorization": "Bearer " + token}    response = requests.get("https://api.powerbi.com/v1.0/myorg/groups", headers=headers)    premium_workspaces = pd.DataFrame(response.json()['value']).query('isOnDedicatedCapacity==True')[["name", "id"]]    dfs = [        fabric.list_datasets(ws).assign(workspace=ws)         for ws in premium_workspaces['name']    ]    catalog = pd.concat(dfs, ignore_index=True)    cols = ['workspace'] + [col for col in catalog.columns if col != 'workspace']    return catalog.reindex(columns=cols)datasets = build_dataset_catalog()datasets

TMSL Catalog: Get all the attributes of the dataset including tables, partitions, and column/measure level metadata found in TMSL.
https://gist.github.com/pawarbi/40457eef2a9304fe34784e112a31386a
- Measures Catalog: Use fabric.list_measures(workspace, dataset) to get a list of all the measures from the specified dataset and the workspace. This will be great for documentation.
- Measure Descriptions: By default, sempy doesn't provide measure descriptions. But you can use the DMV to get all the measure properties as below:
- https://gist.github.com/pawarbi/354fc55d10f365fb3b4c154c9d686b17
- Yes, you can execute DMVs as well! Now imagine all the possibilities. You can build a measure catalog for the entire tenant, make it available as a dataset for all users to see, and save measure definitions regularly.
- 💡
  
  I mentioned at the beginning of the blog, that currently there is no way to alert if DAX measures change. Now you can save a snapshot of the measures catalog, compare and use Fabric pipeline to alert if the measure definition changes. Data Scientists can also trigger re-training pipeline if any changes are detected or evaluate if re-training is required. This closes the entire feedback loop and truly provides OneSemantic layer :P
- Integrate Copilot and LLMs:
  I occasionally use ChatGPT-4 to break down complex DAX measures. By using the OpenAI API, you can easily append a column to the above measures table, complete with a natural language explanation of the DAX. This will help both business users and data scientists who may not be well-versed in DAX. Since you have access to the model's details (such as relationships), it's feasible to pass this information to the LLM, allowing for more robust optimization and summarization of the DAX. You could create a chat agent to ask natural language questions about the model (which columns are used by 'sales' measure?Give me a list of all the key columns that are not hidden and have default summarization...).
- BPA: With the ability to query DMVs using Python, you can pretty much create the Best Practice Analyzer using Python which can be run in Fabric notebooks and alert/trigger refresh based on the results.

As an example, below I am executing three DMVs to get:

If the table is hidden
Excluded from refreshes
If it's an auto-date table
Has RI Violations
Total number of columns
Total number of calculated columns

This can easily be extended further with other TOM properties to build BPA

import pandas as pdimport sempy.fabric as fabricdef pbi_table_report( workspace, dataset):    '''    Sandeep Pawar | Fabric.guru    Use sempy to execute DMVs and analyze tables for best practices    Ver: 0.2 | 2/16/2024    Warning : I haven not validated the DMVs yet, only for demo currently. I will update later.     '''    import warnings    warnings.filterwarnings("ignore", category=UserWarning)    table_list = (fabric        .evaluate_dax(            workspace=workspace, dataset=dataset,            dax_string="""select [ID], [Name], [IsHidden] from $SYSTEM.TMSCHEMA_TABLES"""        )    ).assign(IsAutoDate=lambda x: x['Name'].str.startswith(('DateTableTemplate', 'LocalDateTable_'))).set_index('ID')    # Get RI Violation Count    ir = (fabric        .evaluate_dax(            workspace=workspace, dataset=dataset,            dax_string="""select [Name],[RIViolationCount] from $SYSTEM.TMSCHEMA_TABLE_STORAGES"""        )    ).assign(ID=lambda x: x['Name'].str.extract(r'\((\d+)\)$')).set_index('ID')    # Get calculated columns    calc_cols = (fabric        .evaluate_dax(            workspace=workspace, dataset=dataset,            dax_string="""select [TableID],[Expression] from $SYSTEM.TMSCHEMA_COLUMNS"""        )    ).groupby('TableID').agg(        Num_Columns=pd.NamedAgg(column='Expression', aggfunc='size'),        Num_Calc_Columns=pd.NamedAgg(column='Expression', aggfunc='count')    )    calc_cols.index.name = 'ID'    # Join all    final = table_list.join(ir[['RIViolationCount']]).join(calc_cols).reset_index(drop=True)    return finalpbi_table_report( workspace = ws, dataset = ds)

Metric Report: You can query any measure from any Premium dataset you have access to. So you can create delta tables that have KPIs from different datasets and show them on one single report ! Imagine if you could create an executive report that shows revenue trend from revenue dataset, marketing KPIs from another dataset etc., something not currently possible without using Dashboards (or composite models).
Customized CSV Extracts: No matter how good your Power BI report is, business users always want CSV extracts. If you have an existing table in the Power BI report, you can use the underlying DAX measure in the evaluate_measure or evaluate_dax method to retrieve that table and save the CSV file. The best part is unlike the Power BI report, you can change the column headers, add custom indentation to hierarchies, add color formatting however you want and save as pdf, Excel on a schedule to OneLake!
Dataset Quality: Use WAP (Write-Audit-Publish) pattern for business-critical datasets where you first publish a "Control Dataset", audit it using DMVs for RI violations, data types, nulls and other data quality issues and then publish/refresh the actual dataset after the audit has passed. sempy is going to change how we do DataOps for Power BI.
Data Quality/Validation: Similarly, you can use great-expectations or pydeeque to validate datasets against the source or gold tables by following the same "Control Dataset" approach. DAX validation can also be performed in the same way. Read more about it here.
%[https://youtu.be/XZAff4SG_9c?si=OkCgi0qKW2ZE1FHJ]
Pre-Warming Direct Lake Dataset: I alluded to this when I wrote a blog post on how sempy can be used to pre-warm a Direct Lake dataset. Just as I showed above, you can query a set of columns using sempy right after refreshing the dataset in the pipeline so the users can get warm cache import-like performance.
Calculated Columns in Direct Lake: Using the "Control Dataset" approach, you can use the DAX measures to add calculated columns to a direct lake (or import) dataset right in the source, something currently not possible.
Get Dataset Size: As far as I know, currently, there is no programmatic way to get the size of a dataset in memory. You either have to use the metrics app or VertiPaq Analyzer. With sempy you can query the DMVs and get an approximate size of the datasets in all the workspace. I will publish a blog on this.
Metric Store: I am not a metric store expert but I can see a central BI team creating wrappers, functions, APIs to easily access the business KPIs, similar to Malloy. Business users can call these functions to quickly get the KPIs in a notebook and allow ad-hoc analysis without any knowledge of DAX or Power BI. If you have experience with Malloy, I would love to learn more.
Data Wrangler: If you are not familiar with Python, you can use Data Wrangler to work with a FabricDataFrame similar to pandas dataframes to perform pandas operations & methods. Data Wrangler doesn't perform semantic-link operations.
Data Apps: This is not supported and is not a feature. But, I can imagine in the future the data scientists can pull data from the Power BI dataset, enrich it with additional insights and publish a data app using streamlit, panel, Shiny etc. to Power BI Apps for business users to consume. This will open up Fabric to many users who are not familiar with Power BI but want to develop & publish data apps using open-source tools they know.
Here is a list of my blogs related to Semantic Link: https://fabric.guru/tag/semantic-link

These are just a few use cases. I am excited to see how everyone uses it in their workflow. There is more to Semantic Link than just sempy. My goal was to discuss and showcase some of the use cases and what's possible.

Congratulations to the product and engineering team on this awesome feature and I am looking forward to using it in my projects.

Notes

You need Fabric capacity to use Semantic Link and the dataset needs to be in a Premium/Fabric workspace
Remember that the goal is not to extract all data from a Power BI dataset and build another dataset. Your published report should be built using the source data. Semantic Link should be used to enrich/augment to create a "diamond layer" instead.
Unlike Power BI REST API, Semantic Link does not have any row limitation (except Spark SQL connector but that may change)
Semantic Link respects OLS/RLS
Semantic Link is still in public preview and some of the methods may be experimental.

References:

🚀Lightning Fast Copy In Fabric Notebook

Sandeep Pawar — Sat, 23 Sep 2023 17:45:17 GMT

If you are familiar with mssparkutils , you know that it is packed with utilities to perform common notebook tasks such as getting a list of files and folders, mount points, copying files, running notebooks etc. in Fabric. The Fabric notebook team has been adding new tools to mssparkutils to accelerate development. One such recently added method is fastcp which as the name suggests is similar to the existing method cp to copy files but it's orders of magnitude faster.

fastcp is a Python wrapper to azcopy which means you can also use the configurations provided by azcopy with the flexibility of Python.

For demonstration, I am copying 4.5GB folder from one lakehouse to another. I provide the abfss paths for the source and destination folders in the two lakehouses and use cp and fastcp as follows:

source = "abfss://xxxxxx/Files/Generico"cp_dest = "abfss://xxxxxx/Files/cp/Generico"fastcp_dest = "abfss://xxxxxx/Files/fastcp/Generico"#copy using cpmssparkutils.fs.cp(source, cp_dest, recurse=True)# copy using fastcpmssparkutils.fs.fastcp(source, fastcp_dest, recurse=True)

cp took 43 seconds to copy the data, fastcp finished the job in 5 seconds !!! 9x speedup.

To learn more about fastcp , you can get help using mssparkutils.fs.help("fastcp")

You can specify additional copy options using the extraConfig parameter. All options supported by azcopy can be specified in a dictionary. For example, I used mssparkutils.fs.fastcp(source, fastcp_dest, recurse=True, extraConfigs={"flags": "--dry-run=true"}) to simulate a dry-run to make sure the copy runs successfully without actually copying the data. You can pass additional parameters as extraConfigs={"flags": "--dry-run=true --overwrite"}

💡

Note that, unlike cp, fastcp currently only supports abfss paths

Versatility of cp

Although copying files from ADLSg2 to the lakehouse is faster with fastcp , cp is a great method for copying files from a variety of sources and filesystems that fastcp doesn't support. For example,

Blob store

blob_source = 'https://pandemicdatalake.blob.core.windows.net/public/curated/covid-19/bing_covid-19_data/latest/bing_covid-19_data.parquet'lh_dest = "abfss://@onelake.dfs.fabric.microsoft.com/.Lakehouse/Files"mssparkutils.fs.cp(blob_source, lh_dest, recurse=True)

Github

github_source = 'https://media.githubusercontent.com/media/datablist/sample-csv-files/main/files/customers/customers-100.csv'lh_dest = "abfss://@onelake.dfs.fabric.microsoft.com/.Lakehouse/Files"mssparkutils.fs.cp(github_source, lh_dest, recurse=True)

You can pretty much replace wget with cp todownload and copy files to Fabric lakehouse.

Keep an eye on the official documentation to learn more.

I want to thank Jene Zhang, Yi Lin, Fang Zhang and Tiago Rente from Microsoft for the information.

A Quick Comparison Of Fabric Spark Configuration Settings

Sandeep Pawar — Wed, 20 Sep 2023 20:02:27 GMT

I compared the default Spark configurations in the Fabric Spark runtime with those of the standard Spark. I excluded configurations that were identical between the two, as well as those that were irrelevant. I thought sharing this information might be useful to others. Additionally, I have provided brief explanations on a couple of key configurations and their impact on performance. Keep in mind that each Spark job is unique, so it's essential to evaluate the requirements based on the specific job.

💡

Below configs are for Fabric F64 capacity with Starter Pool. I will update the post later with other capacities and pools as well.

Code

This code extracts the default Spark configuration from the Spark documentation and compares it with the Spark config in Fabric. I have tried to make it dynamic so based on the Spark version in Fabric, it will fetch the configs from the corresponding Spark documentation. Note that the below comparison is for Fabric spark runtime 1.1 (Spark 3.3, Delta 2.2). I found 33 configs that are different.

from pyspark import SparkConfimport pandas as pddef compare_sparkconf():    '''    Author : Sandeep Pawar   |   Fabric.guru   | Sept 9, 2023    Compares spark configurations from current Fabric runtime with the    default configurations from spark documentation.     '''    ver = spark.version[:5]    try:        url = f'https://spark.apache.org/docs/{ver}/configuration.html'        dfs = pd.read_html(url)    except Exception as e:        return f"Check the URL. Error: {e}"    table_names = ['Application Properties', 'Runtime Environment', 'Shuffle Behavior', 'Spark UI', 'Spark Streaming', 'Spark MLlib', 'GraphX', 'Cluster Manager', 'Deploy', 'Kubernetes', 'Other']    for idx, df in enumerate(dfs):        df['Table_Name'] = table_names[idx] if idx < len(table_names) else 'Unknown'    final_df = pd.concat(dfs, ignore_index=True)    conf = SparkConf()    spark_configs_df = pd.DataFrame(conf.getAll(), columns=['Property Name', 'Fabric'])    joined_df = pd.merge(final_df, spark_configs_df, on='Property Name', how='inner')    ## Excluding none, paths and credentials     excl = 'Default != "(none)"  and not Default.fillna("").str.contains("/") and not Fabric.fillna("").str.contains("/") and not Default.fillna("").str.contains("password") and not Default == "None"'    filtered_df = joined_df.query( excl, engine='python')    return filtered_df[filtered_df.Default != filtered_df.Fabric][['Property Name', 'Default', 'Fabric']].reset_index(drop=True)compare_sparkconf()

Property Name	Default	Fabric
spark.driver.cores	1	8
spark.driver.maxResultSize	1g	4096m
spark.driver.memory	1g	56g
spark.driver.memoryOverhead	driverMemory * spark.driver.memoryOverheadFactor, with minimum of 384	384
spark.executor.memory	1g	56g
spark.executor.memoryOverhead	executorMemory * spark.executor.memoryOverheadFactor, with minimum of 384	384
spark.shuffle.file.buffer	32k	1m
spark.shuffle.io.backLog	-1	8192
spark.shuffle.service.enabled	false	true
spark.eventLog.enabled	false	true
spark.eventLog.buffer.kb	100k	4k
spark.ui.port	4040	0
spark.io.compression.lz4.blockSize	32k	128kb
spark.kryoserializer.buffer.max	64m	128m
spark.rdd.compress	false	true
spark.serializer	org.apache.spark.serializer. JavaSerializer	org.apache.spark.serializer.KryoSerializer
spark.executor.cores	1 in YARN mode, all the available cores on the worker in standalone and Mesos coarse-grained modes.	8
spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version	1	2
spark.locality.wait	3s	1
spark.scheduler.minRegisteredResourcesRatio	0.8 for KUBERNETES mode; 0.8 for YARN mode; 0.0 for standalone mode and Mesos coarse-grained mode	0.0
spark.scheduler.mode	FIFO	FAIR
spark.dynamicAllocation.enabled	false	true
spark.dynamicAllocation.initialExecutors	spark.dynamicAllocation.minExecutors	1
spark.dynamicAllocation.maxExecutors	infinity	9
spark.dynamicAllocation.minExecutors	0	1
spark.sql.autoBroadcastJoinThreshold	10MB	26214400
spark.sql.cbo.enabled	false	true
spark.sql.cbo.joinReorder.enabled	false	true
spark.sql.execution.arrow.pyspark.enabled	(value of spark.sql.execution.arrow.enabled)	true
spark.sql.execution.arrow.pyspark.fallback.enabled	(value of spark.sql.execution.arrow.fallback.enabled)	true
spark.sql.files.maxPartitionBytes	128MB	134217728
spark.sql.optimizer.runtime.bloomFilter.enabled	false	true
spark.sql.statistics.fallBackToHdfs	false	true
spark.sql.hive.metastore.version	2.3.9	2.3.2

Here are a few that are worth noting:

Cost Based Optimizer:
spark.sql.cbo.enabled is enabled by default. CBO optimizes the query performance by creating more efficient query plans compared to the rule-based optimizer, especially for queries involving multiple joins. However, CBO only works if you gather the statistics using ANALYZE TABLE (which doesn't work with Delta v2 tables). Once statistics are available, optimization is automatic. The plan with the lowest estimated cost (I/O, CPU etc) is selected for execution by Catalyst optimizer. Be sure to update the statistics if your data changes frequently to utilize the optimization. I will write a blog on this.
Broadcast Join Threshold:
spark.sql.autoBroadcastJoinThreshold size is set to 25MB instead of 10MB. When you join two tables with significant skew, you can broadcast the smaller table to avoid data shuffling. If you have AQE on, Spark will automatically switch to broadcast join instead of sort-merge. Although the threshold has been doubled in Fabric, considering the executor memory is 56g, with 8 cores, roughly 25% of total will be used for execution so per core we will have 1.75GB memory. I think you can certainly bump it up to 150MB depending on the data and other tasks. (This is assuming the 56g is available.) -1 turns off broadcast join. If Autotune is used, it will automatically tune this setting. Note the Autotune documentation incorrectly mentions the default as 10MB.
Serializer:
Note that Kryo serializer is used instead of the default Java. Synapse also uses Kryo which is faster and more compact but for some reason the product team never updated the documentation and still mentions Java as the default. I spoke about spark serialization here if you are interested to learn more.
Partition Size:
Perhaps one of the most important settings in spark tuning. But it's not different, it is still 128MB, just shown in bytes instead of MB. This is the max size in bytes in a single partition. Autotune will tune this as well. Typically you want to keep it <=200MB but depends on the data size, cluster size to maximize parallelization & resource utilization.
Bloom Filter
The bloom filter in the config is not the same as the Bloom Filter Indexes in Databricks. As far as I know, Fabric does not offer bloom filter indexing on delta tables.

Other than these nothing else stands out. Other configs are standard and based on the spark cluster you set up in the workspace settings. Fabric also has certain improvements/optimizations introduced in the Spark engine but that would be another blog. I just wanted to quickly identify if any of the standard configs were different in Fabric. Hope this helps.

BZ2, Linux Command Line and Fabric

Sandeep Pawar — Sun, 27 Aug 2023 15:42:20 GMT

Apologies for the title that doesn't make any sense. Like many of you, I hadn't heard of bz2 until a few weeks ago. I've been doing experiments and performance tests on Fabric, which involved working with 300GB of data compressed using bz2 compression. I've used gzip and lzma before, but bz2 was new to me. I downloaded the data to the Fabric lakehouse, but couldn't open it in Spark or with the bzip Python library. Data Factory pipelines can read bz2 files as well, but that didn't work either. The data authors likely compressed it on Linux with a very high compression ratio. I then realized that since Fabric spins up a Linux VM for Spark, I could use the Linux command line utilities (if available) in the notebook. Sure enough, it worked.

!bzcat  | head -n 5

You have to use the File API when providing the path, thanks to blobfuse2. I was able to decompress the file and save it as gzip too using the command line and load it in spark.

I'm fairly certain that 99.999% of you will never need to use this (myself included), but I thought I'd share it just in case it might help someone.

This also means if you have csv, json, text files in your Files section, and if you just want to inspect or search those files, instead of loading it in pandas or spark dataframe, you can use the bash commands like gprep ,cat ,awk , head ,column to quickly load the files.

Below I am using head and awk to view the first 5 columns and rows of the csv file

!head -n 5 /lakehouse/default/Files/sales.csv | awk -F, '{ for (i=1; i<=5; i++) printf "%s,", $i; print "" }' | column -s, -t | less

use json_pp to print, inspect json files and json tree structure.

This is a good reference for some of the commands you can use.

Creating Shortcut To KQL Table In Fabric Lakehouse

Sandeep Pawar — Sun, 27 Aug 2023 00:51:29 GMT

Shortcuts in Microsoft Fabric are one of the best features that allow you to point to any location on OneLake from any engine without moving the original data. It just magically makes it appear virtually via embedded references within OneLake. The official Microsoft documentation walks you through how to create shortcuts in the lakehouse, and KQL database and can be accessed using any Fabric engine. You can create shortcuts for tables as well as structured/unstructured data in the Files section.

Similar to the lakehouse, you can add a table or the files as a shortcut in the KQL database. In fact, in my last blog I showed a use case for doing that. However, what I have not seen in the official documentation yet is how to add a Kusto table as a shortcut in the lakehouse. If you pick KQL database as the source of the shortcut, you will see the table list is empty.

The Solution

The solution is rather easy.

By default, when you create a KQL database, its OneLake folder path is inactive. If you activate the folder path, any new tables that are added to the KQL tables can be added as a shortcut in the lakehouse.

Activate folder:

Now if you browse to the KQL database in the Lakehouse shortcut option, you will be able to pick the table and add it as a shortcut in the Files section.

💡

Note here that only the new tables added to the KQL database after activating the folder option will be available. I am not aware of an option to add existing tables. Also, you cannot add the KQL tables in the managed table section since these are not delta tables. It typically takes 5-10min for the folder path to get activated.

After adding to the Lakehouse as a shortcut, you can use it as any other data in the notebooks ! Query it using any language of your choice - KQL, Python, R, Scala. You should also be able to access this data using the OneLake API.

You can check out my KQL related blogs here.

Reference:

fabric real-time-analytics | Microsoft Learn

Building Insightful Analytical Solutions Using KQLMagic

Sandeep Pawar — Fri, 25 Aug 2023 22:33:43 GMT

A few months ago, I wrote a blog post about querying a KQL database in Fabric notebooks using the Python Kusto SDK. Although it works perfectly, my friend Alex Powers reminded me that kqlmagic can also be used in notebooks for a more ad hoc experience. I had previously used it in Azure Data Studio notebooks but had forgotten about it. In this blog post, I want to share kqlmagic and another Spark-based method for querying the Kusto database in notebooks. More importantly, I will provide a sample use case demonstrating how to build a rich analytical solution using these methods. Let's first examine the how, and then the why.

How

Using `kqlmagic`

kqlmagic allows you to combine Python and KQL for querying and visualizing data in notebooks. While you can use it in any IPython notebook, Fabric makes authentication to KQL cluster very easy. You can run KQL queries against your KQL database in the Python kernel by using the magic commands - %kql and %%kql . The result is a pandas dataframe which can then further be used in your analysis. The setup is very easy. We need to install the kqlmagic library, generate the authentication token and that's it.

Install kqlmagic and load the extension

 !pip install Kqlmagic --no-cache-dir --upgrade  --quiet reload_ext Kqlmagic

Generate authentication token :
Copy the Kustodb name and the cluster uri

   from notebookutils import mssparkutils   ##replace with your uri and dbname   kusto_uri = 'https://.kusto.fabric.microsoft.com'   kusto_dbname = ''   try:       token = mssparkutils.credentials.getToken(kusto_uri)       mssparkutils.credentials.isValidToken(token)       print("valid token")   except:       "Not a valid kusto token"

Connect to the Kusto db

 %kql kusto://code;cluster=kusto_uri;database=kusto_dbname -try_token={"tokenType":"bearer","accessToken":token}

That's it ! We are now ready to query the above Kusto database in the Fabric notebook.

To run a query inline, use %kql and for multiline queries use %%kql
```
 %kql StormEvents | summarize count() by State | sort by count_ | limit 10
```
```
 %%kql StormEvents | summarize count() by State | sort by count_ | limit 10 | render columnchart title='Top 10 States by Storm Event count'
```
This will generate a pandas dataframe that you can use for further analysis.

💡

If you want to save the result of the inline magic to a variable, you can use %kql result << -query kqlquery where kqlquery is the query you want to run and the result will be saved in variable 'result' which you can use it at any time.

KQL makes time series analysis and plotting very easy. We can use kqlmagic to generate fully interactive plotly charts as well if you render the query

kqlmagic is very handy if you want to query the kql database in ad-hoc fashion. But not great if the resulting dataset is very large because because you will get out of memory errors. To overcome that you can use spark connector.

Using spark

We don't need to do anything special. Just use the below template in the Fabric notebook.

kustoDf  = (spark.read             .format("com.microsoft.kusto.spark.synapse.datasource")             .option("kustoCluster",kusto_uri)             .option("kustoDatabase",kusto_dbname)             .option("kustoQuery", q)             .option("accessToken",token)            ).load()

This will create a pyspark dataframe which you can use like any other dataframe. Unlike kqlmagic this won't render the queries though.

Why

This is all good but why should you bother? Good question. Because this allows you to create more insightful and complex solutions that have never been possible before.

Consider a scenario where you have a clean sales data in the Files section of the lakehouse. You are analyzing it and will build a dimension model out of it, save it as delta table and eventually build a Power BI dataset. Typically any insights you want to surface in your reports, will happen in Power BI because :

the traditional warehousing architecture does not have features to extract complex insights
analysts don't have the tools or access to those tools
using different tools often involves moving data around

This is where Fabric truly shines. We have all the tools necessary without copying the data mutliple times/locations and having to piece different tools/platforms together. Let me show you how:

I have the sales data in my lakehouse
I can create a shortcut to the sales data in the kusto database and query it like any other Kusto table
This will allow me to use advanced analytical techniques like clustering, windowing, forecasting, anomaly detection etc on the same data without moving that data to another platform. Sure, we can do the same thing with spark and ML libraries but KQL makes time series analysis far easier and scalable.
I can now run the KQL against the same data in Fabric notebook and integrate that result in the final delta table. When the notebook is refreshed in the pipeline, it will run the KQL query along with the rest of the Python query in the notebook.

For example, in my sales data of 150 million+ rows, I want to identify products that are on an upward trend and surface that insight in the Power BI dataset directly for report authors and users to consume. Below is the KQL query that groups by the sales data for the last 6 months by week and item, averages sales based on 7 day window to smooth out fluctuations, fits a linear model to calculate the slope which I can use to identify top 80th percentile products that are on the upward trend. I can then add a flag in my dimension table that the Power BI developer can use easily. I can even pass Python variables and make the query dynamic based on requirements.

N = 80  q = f"""let topNcutoff = {N};let t = external_table('Sales')| extend sales = Quantity * UnitPrice| extend Items = substring(Item, 0, indexof(Item, ','))| extend WeekStart = startofweek(OrderDate)| summarize TotalSalesRaw=sum(sales) by WeekStart, Items| extend TotalSales = round(TotalSalesRaw, 2)| project WeekStart, Items, TotalSales| order by Items, WeekStart asc;let max_t = toscalar(t | summarize max(WeekStart));let min_t = (max_t-180d); let slopeData = t| make-series WeeklySales=round(avg(TotalSales),2) default=0 on WeekStart from min_t  to max_t step 7d by Items| extend (slope, intercept) = series_fit_line(WeeklySales);let p = toscalar(slopeData | summarize percentile(slope, topNcutoff ));slopeData| where slope >= p| mv-expand WeeklySales to typeof(double), WeekStart to typeof(datetime)| project Items, WeekStart, WeeklySales| order by Items, WeekStart asc // | render timechart title = 'Top {N} percentile products on upward trend'| render timechart title = strcat("Top {N} percentile products on upward trend - [", format_datetime(min_t,"yyyy-MM-dd"), " to ", format_datetime(max_t,"yyyy-MM-dd"), " ]")"""

If I run this using kqlmagic, I get a nice interactive plot that correctly identifies 9 products that show the highest upward trend which can be saved in the dimension table.

💡

Note that in the KQL above, I am using external_table('Sales') because to access a shortcut table you have to use the external_table(..) function.

Before you say you could have done this in DAX using moving average, LINEST function etc., let me tell you this is just one example of what's possible. For example, I could decompose this time series to extract trend, and seasonality and choose to use different methods based on the time series structure, something DAX can't do. KQL has many advanced time series analysis capabilities that you can use to enrich your Power BI datasets and reports.

💡

This is the part that I am most excited about in Microsoft Fabric. These tools and features like KQL and ML have existed for a long time but were not available to most analysts. Fabric provides access to all the features without worrying about the setup time and additional costs. Users can create more insightful solutions that have not been possible with Power BI alone. It's time to move past aggregate and percentage based KPIs and generate insights based on latent patterns in the data.

The obligatory - Fabric is still in public preview and features may change at any point.

Resources:

C# Script To Get A List Of Measures Matching Column Names

Sandeep Pawar — Sat, 19 Aug 2023 19:53:36 GMT

This is not a Fabric related blog; well technically it is because Power BI is part of Fabric :P. It's just easier for me to write the blog here compared to my previous blog at www.pawarbi.com so dropping it here.

I was working on a project that had several measures with the same name as column names in tables which caused some issues. Avoid doing that, especially working with a large model. It's easy to refer to one instead of the other. You cannot create a measure in a table that matches the name of one of the columns in that table but if you use a measures table, there is no such restriction. Use the below script in Tabular Editor to return a list of measures that matches column names. Not sure if it's already part of the BPA, but it should be.

// Sandeep Pawar | fabric.guruvar _cols = new List<string>();var _measures = new List<string>();// Loop through each table in the model to get column namesforeach (var table in Model.Tables){    _cols.AddRange(table.DataColumns.OrderBy(c => c.SourceColumn).Select(c => "[" + c.SourceColumn + "]"));}// Loop through each table in the model to get measure namesforeach (var table in Model.Tables){    _measures.AddRange(table.Measures.OrderBy(m => m.Name).Select(m => "[" + m.Name + "]"));}// Combine the lists and find duplicatesvar duplicates = _cols.Intersect(_measures).ToList();// outputvar outputMessage = "Number of duplicates found: " + duplicates.Count + "\r\n";outputMessage += "==========================\r\n";outputMessage += string.Join(",\r\n", duplicates);// Output the messageoutputMessage.Output();

Recursively Loading Files From Multiple Workspaces And Lakehouses in Fabric

Sandeep Pawar — Sun, 13 Aug 2023 14:53:48 GMT

Imagine a scenario in which you are collaborating with two colleagues who have stored sales data in two separate lakehouses, each within a different workspace and under distinct names but same schema. The data is saved in folders containing numerous parquet files. One colleague saved the parquet files with names starting with "data-", while the other used Spark, which typically saves files with names beginning with "part-". Your task is to read this data in a notebook for further analysis.

One obvious solution is to read the data individually and union the multiple spark dataframes. An easier and more flexible option, without mounting those lakehouses, in the notebook is to use the abfs paths and a couple of spark options to recursively load folders with wildcards.

paths = [    "abfss://@onelake.dfs.fabric.microsoft.com//Files//",    "abfss://@onelake.dfs.fabric.microsoft.com//Files//"]df = (spark.read      .option("pathGlobFilter", "{data*.parquet,part*.parquet}") # only files starting with data or part      .option("recursiveFileLookup", "true") #recursively load from nested sub folders      .parquet(*paths))

In the above code, paths are the abfs paths of the folders where the files are saved. In the pathGlobFilter option, I provided a dictionary to only load files that start with data or part. recursiveFileLookup option looks for files within subfolders under the specified folder. Setting recursiveFileLookup to false will only load files from the top-level folder.

You can also create shortcuts to these folders in a new lakehouse, but it still won't solve the problem of loading files from nested subfolders. By using the approach mentioned above, you can provide relative paths instead of abfs paths in the paths list.

ABFS paths for folders can be found by going to the respective lakehouse > Properties

💡

This is a useful pattern during development or for ad-hoc analysis but for production, you will want more robust error handling, schema evolution/consistency, performance etc. so use it according to your use case.

Calculating Folder Size In The Lakehouse

Sandeep Pawar — Wed, 09 Aug 2023 17:09:11 GMT

This is more of a personal note to self than a blog post. When running various test scenarios, I calculate the size of the lakehouse by calculating the size of its folders (Files and Tables). The code I use is below. I'm sharing it here to save myself the trouble of searching for it and to help others who may find it useful. It's nothing special; I simply use Pool to parallelize calculations for a large number of folders.

import osimport pandas as pdfrom multiprocessing import Pool, cpu_countdef get_size_of_folder(folder_path):    """    fabric.guru  |  08-09-2023    Calculate the total size of all files in the given folder.    Args:    - folder_path (str): Path to the folder.    Returns:    - tuple: (folder_path, size in MB)    """    total_size = sum(        os.path.getsize(os.path.join(dirpath, f))         for dirpath, _, filenames in os.walk(folder_path)         for f in filenames    )    size_in_mb = total_size / (1024 * 1024)    return folder_path, size_in_mbdef get_folder_sizes(base_path):    """    fabric.guru  |  08-09-2023    Get the sizes of all folders in the given base path.    Args:    - base_path (str): Base directory path.    Returns:    - DataFrame: DataFrame with columns 'Folder' and 'Size (MB)'.    """    folders = [        os.path.join(base_path, folder)         for folder in os.listdir(base_path)         if os.path.isdir(os.path.join(base_path, folder))    ]    with Pool(cpu_count()) as p:        sizes = p.map(get_size_of_folder, folders)    df = pd.DataFrame(sizes, columns=['Folder', 'Size (MB)'])    return df## Return a pandas dataframe containing two columns Folder & Size (MB)## This will scan the folders from the lakehouse mounted in the notebook## Use File API path and not the ABFSS or https pathdf = get_folder_sizes("/lakehouse/default/Files")

Boosting Copy Activity Throughput in Fabric

Sandeep Pawar — Mon, 07 Aug 2023 17:21:42 GMT

In all of my Fabric related blog posts, I make it abundantly clear that Fabric is (as of the writing of the post), in public preview and the performance/features may change at any time. I want to discuss one such significant performance improvement in Fabric data factory I discovered that improved the throughput by 22% in my case. The Fabric DI team has enhanced the copy performance by increasing the CPU utilization during the copy process.

Copy Activity

In Fabric, you can use the copy activity to copy the data from a variety of sources (on-premises data sources are not supported yet) to Fabric OneLake. For most small to medium workloads, the out-of-the-box performance is decent. At higher workloads, if you want to boost the performance, there are a few knobs to turn in the "Settings" (see below). You can manually set the "Intelligent throughput optimization", "Degree of copy parallelism" and "Enable staging". Each of them changes different aspects to improve the throughput in various scenarios. You can learn more about them here. My focus in this article is tuning the "Degree of copy parallelism" (DoCP) to boost throughput and potentially reduce capacity units consumed.

When a pipeline is executed, the necessary computing resources are automatically provisioned. The degree of copy parallelism determines how efficiently the compute resources (i.e., CPU) are utilized by parallelizing the execution. It represents the maximum number of threads within the copy activity that simultaneously read from your source or write to your sink data stores. Copy tasks are distributed across the computing resources to enhance overall throughput. The higher the parallelism, the greater the throughput. In theory, as parallelism increases, the compute resources should execute more tasks within the same fixed resource allocation. I will also discuss when that's not the case.

Ref: Image from here

Test data and Setup

I had 291 GB of parquet dataset with 9 billion rows with total 1500 parquet files in OneLake that I wanted to copy to another lakehouse as VORDER'd parquet. In my tests, I first started with the default settings and tweaked the DoCP without changing any other settings.

DoCP: Default settings

With default settings, it took 27 min to copy the data. Notice in the above image that Degree of copy parallelism is blank. Next, I manually changed the settings to test how it affects the throughput.

DoCP : 32

In the dropdown, the max I could set was 32 and ran again without changing any other settings:

The activity took 66 minutes, compared to 27 minutes after manually changing the setting, which was not what I expected. However, upon closely examining the runtime details of the default settings, I discovered that the default DoCP value is actually 128, even though it appears blank in the UI. This is where the Fabric DI team seems to have made significant performance improvements to optimize the use of the provisioned compute with multithreading. The same activity with default settings took much longer a few weeks ago. With a DoCP = 128, 128 files are being read in parallel!

💡

Similar to ADF, it has been possible to change the DoCP in Fabric from day 1 but there have been improvements in the CPU utilization to increase the copy performance for most of the copy pairs. Binary copy already fully utilizes the CPU.

Next, I kept increasing the DoCP. Below are the results:

Parallelism	Duration (min)
32	66
128 (Default)	27
256	25
512	23
1024	21
1500	21

As you can see, increasing DoCP up to 1024 shaved off 6 minutes from the run time that's a 22% improvement in performance! Below are some observations and notes:

Increasing DoCP from 32 to 128 was a significant improvement but after that, the improvements were marginal for my copy pair and data. This may not always be the case and will depend on your scenario.
DoCP is the maximum value, i.e., if you set it at 256 but if there are not enough files and/or the source & target cannot handle the parallel execution, copy activity will automatically scale it down to the optimal number but won't exceed the value you set.
In my case, I had 1500 files so I set it at 1500 but copy activity automatically scaled it down to 1476 because that's all it could handle. Notice that increasing DoCP from 1024 to 1500 did not improve the performance.
In some of my other tests (other copy pair and 1TB data), at higher DoCP, the execution was automatically throttled based on what the source & target could handle and the network bandwidth.
Copy activity will automatically find you the best DoCP based on the copy pair, data and intelligent throughput optimization setting. Based on my tests, my recommendation would be to start with the default, observe the value during execution and then tune the number based on the number of files you have. As always, only tests will determine the optimal value.
Since DoCP increases parallelism for the given compute, it should reduce the capacity units consumed. I need to do more tests to confirm this.
To get the best performance that wont be impacted by other parameters such as cross-region, network, etc, you can :
1. Move the data to a storage that is in the same region as your Fabric tenant to eliminate the impact of cross-region.
2. Do not run the copy job against single storage at the same time, to eliminate the impact of storage throttling due to frequent requests or/and bottlenecks from network bandwidth.
This was just one of the settings. You can also tune throughput optimization and look into staging for certain scenarios.
In other tests I have done with 1TB of data, I saw ~10-30% improvement in throughput which can be attributed to the recent improvement in CPU utilization. If you have run any large copy activities in the last few weeks, I would encourage you to run tests again to compare the performance. YMMV but I think you will see a performance improvement.
A keen eye will notice that V-ORDER optimized Parquet with Snappy compression reduced the data size from 291 GB to 200 GB! Interestingly, at DoCP=32, the compression was slightly worse (by 3 GB), and the reason for this is unclear to me. (Btw, I have a blog in the backlog on V-ORDER and compression, stay tuned!)
Note here that this is parallelizing the copy activity for the given copy pair within that activity.
Copy Activity Assistant does not let you change the settings in the UI. To change the settings, uncheck the below option on the last page of the Assistant and then select the activity > Settings

Overall, I believe the Fabric DI team has done an excellent job in enhancing performance, and hopefully, this will continue to improve in the near future. The documentation on this topic is currently scarce, so I hope they will improve that as well and also work on refining the user interface.

As always, the obligatory - Fabric is currently in public preview and the performance & features may change in the future - hopefully for the better! Performance is always dependent on several factors so run tests for your specific use case and let me know what you find.

I want to thank Jianlei Shen, Tina Hu at Microsoft and my colleague Albert Paulraj for their help.

References:

Downloading ML Models From Fabric Model Registry

Sandeep Pawar — Tue, 01 Aug 2023 20:11:12 GMT

In previous blog posts, I discussed loading models from the Fabric model registry and migrating models to Fabric. To conclude this series, this blog post will demonstrate how to download models from Fabric both programmatically and through the user interface. Downloading models enables you to export them from Fabric for inferencing on other platforms or for local inferencing during development while model endpoints are not yet available in Fabric.

Using UI

In the workspace, select the model you want to download and then select the version you want to download. In the top ribbon "Download model version". Note that this will download the individual model as a zip file.

This is a quick option if you need to download one or two models. To export many models from a workspace, we can use the mlflow API and download programmatically.

Programmatically

Get Registered Models

Use the below code to get a list of all the models that are registered in the current workspace:

import mlflowimport pandas as pdfrom mlflow import MlflowClientclient = MlflowClient()#Get a list of models in the workspacemodels_list = client.search_registered_models()models_dict_list = [dict(model) for model in models_list]# Return a dataframe showing the modelsmodels_df = pd.DataFrame(models_dict_list)models_df

You can use the above pandas dataframe to recursively download the models and save them to the Files section of the lakehouse. Note that the code below will download the entire packaged MLModel. Refer to mlflow API if you want to download just the serialized model. I have also included some error handling.

from pathlib import Pathimport mlflow## Author : Sandeep Pawar  | fabric.guru  | ver:1.0  | Aug 1, 2023## provide the path in the lakehouse where models should be savedbase_path_download = Path("/lakehouse/default/Files/")for model in models_df['name']:    print(model)    # Create the directory, overwrite if it exists    model_path = base_path_download / model    model_path.mkdir(parents=True, exist_ok=True)    try:        # download the latest model        mlflow.artifacts.download_artifacts(artifact_uri=f"models:/{model}/latest", dst_path=str(model_path))        print("Downloaded & saved ", model)    except Exception as e:        print(f">>> Could not download model {model}: {str(e)}")

Output:

The models will be saved to individual folders in the Files section of the lakehouse. You can then download it to your local machine.

💡

Above code will download the latest models from the registry. If you need to download specific versions, change the code accordingly.

Pre-Warming The Direct Lake Dataset For Warm Cache Import-Like Performance

Sandeep Pawar — Mon, 10 Jul 2023 21:53:46 GMT

When using the Direct Lake dataset, query performance primarily depends on whether the columns are cached in memory. If the dataset has not been framed, there will be nothing in the cache, meaning you are encountering a cold cache. In this state, the AS engine sends the query to Delta Lake, transcodes the columns, and loads them into memory. Import mode with Large Dataset Format dataset has an on-demand caching behavior which also has cold and warm cache [1]. The query performance of Direct Lake dataset in a cold cache state can be same or worse than in import mode, depending on factors such as the delta table, dataset size, cardinality, and the specific query. Once the columns are in memory, subsequent query performance becomes significantly faster, similar to warm cache import mode. For instance, I recently published an article demonstrating that for a 600 million row TPCH_SF100 dataset, a cold cache query took around 40 seconds in Direct Lake mode, while the same query in warm-cache mode was completed in under a second.

The first user who queries the report connected to a Direct Lake dataset in a cold cache state may potentially experience a long wait time to obtain the results. This can possibly be improved by pre-warming the Direct Lake dataset, meaning we load the required columns into memory before the first user interacts with the report. I will use a single table (lineitem table from the TPCH_SF100 dataset) as my example. I will provide Direct Lake vs import cold cache performance results in a future blog.

It's important to note that Fabric is still in preview. The options and performance will most definitely change in the future. Neither Fabric nor Direct Lake are ready for production while in public preview so below is more of an academic investigation/discussion of what's possible. I will update the blog as and when new features/updates are available.

Cold Cache:

In cold state, the Direct Lake dataset only stores the metadata and not the data. I framed the dataset to drain any cache and refreshed the visual in Power BI Desktop. It took 47 s to get the results in Direct Lake mode.

Pre-Warming The Cache:

There are many ways to achieve this but I will be using the REST API to query the four columns I need. I queried the required columns using the ExecuteQueries REST API using Python in the Fabric notebook. To generate REST API tokens, I used the code Gerhard Brueckl shared on Twitter.

import requestsfrom notebookutils import mssparkutilsdef execute_powerbi_query(dataset_id, query):    """    Sandeep Pawar   |   Fabric.guru  |   July 9, 2023   | v1    Execute a DAX query against a  Power BI dataset.    Args:        dataset_id (str): The ID of the dataset to query.        query (str): The DAX query to execute.    Returns:        dict: The JSON response from the Power BI API.    """    # Obtain the Power BI API token    # source : https://twitter.com/GBrueckl/status/1673305659624833024?s=20    url = "https://analysis.windows.net/powerbi/api"    token = mssparkutils.credentials.getToken(url)    if not token:        raise ValueError("Could not obtain Power BI API token")    # JSON body for the API request    json_body = {        'queries': [            {                'Query': query,                'QueryType': 'Data'            },        ]    }    # Headers for the API request    headers = {"Authorization": "Bearer " + token}    # API request    try:        response = requests.post(f"https://api.powerbi.com/v1.0/myorg/datasets/{dataset_id}/executeQueries",                                  headers=headers,                                  json=json_body)        # Raise an exception if the request was unsuccessful        response.raise_for_status()    except requests.exceptions.RequestException as e:        print(f"Request failed: {e}")        return None    # Parse the as JSON    content = response.json()    return content

I retrieved the count of string columns and the sum of numeric columns, loading these four columns into memory.

Upon executing the above query in the notebook, the column segments were cached in memory. As a result, querying the same visual in the Power BI report ran faster, taking only 3.4 seconds instead of the initial 47 seconds. This improvement was possible because the columns were already loaded in memory. Any user who interacts with the report after pre-warming the cache will experience lower query latency.

Which Columns To Cache?

I briefly mentioned this in my Direct Lake FAQ blog but Chris Webb has written about it in excellent detail. You can run the DMV to identify which columns are queried frequently using the temperature. You can also use this report Gilbert Quevauvilliers created to identify "hot" columns. You would cache the few most frequently used columns. Note that if you have two tables related to each other, you will have to cache the key columns as well.

Image Source :Power BI Which tables and columns are being used the most in Power BI Premium/Premium Per User - FourMoo | Power BI | Data Analytics

You can also use fabric.list_columns(workspace, dataset, extended=True) to get columns cached in memory.

How To Use This?

For the custom Direct Lake dataset, if auto-sync is on, it will automatically fetch the latest data if the data or the schema changes. It is my understanding, I need to do more tests to confirm, the dataset will be framed and will be in a cold state after the auto-sync happens. If the delta lake is updated very frequently, it won't be practical to use this method. However, if you turn off auto-sync and refresh the dataset manually, you can run the above code in a notebook in the pipeline right after the delta lake has been loaded and the dataset has been refreshed. All we are doing is preemptively querying the dataset so that the user experiences import-like performance. Default dataset cannot be refreshed so this is only applicable to custom datasets.

💡

Please note that the Direct Lake dataset has a fallback to Direct Query behavior, and it is difficult to predict, particularly for large datasets, whether the dataset will be in DQ or DL mode. I assume that it will be in DL mode, but if it is not, none of this will work. As I have mentioned several times, Fabric is in preview, so this will change in the future.

Other Methods:

Dataflow

Instead of using the notebook, you can also query the dataset with the Analysis Services connector in a dataflow. You can opt for the standard Gen 1 dataflow instead of Gen 2, as we are not saving the data to OneLake. Mim has written a blog on how to accomplish this. You can add the dataflow to your pipeline following the dataset refresh step.

XMLA End point

The ExecuteQueries API has some limitations regarding the number of rows, data size, etc. In our case, we only need to query the columns and not run the actual DAX, so it should be fine. However, if you need to run a large query, you can use the XMLA endpoint, as I have described in my blog.

SemPy

SemPy, which is currently in private preview, will also be able to provide this functionality in Fabric notebooks seamlessly without using any API. I can't provide specifics while it's in private preview so I will update the blog post when it becomes available.

Michael Kovalsky (PM, Fabric CAT) shared a script in his repo to identify the columns cached in memory before the Direct Lake semantic model is refreshed and then caches them cack again after the refresh:

https://github.com/m-kovalsky/Fabric/blob/main/WarmDirectLakeCache_IsResident.py

Summary

Hopefully, none of this will be necessary after the GA release - cold cache performance should improve significantly, Fabric will intelligently cache based on usage or do efficient memory swap/pinning etc. However, if it doesn't, you can pre-warm the Direct Lake cache according to usage, offering warm cache import-like performance to users. I presented a simple example based on a single table, but I understand that it will be more challenging for complex models, datasets with numerous live connected reports, and reports with multiple visuals; it's certainly not trivial. If you have better ideas or suggestions, please feel free to share them.

Thank you to Tamas Polner, Christian Wade at Microsoft for reviewing the blog.

References

Change Log:

I updated the blog post because I was reminded that import also has cold and warm cache. For a fair comparison, the performance of a pre-warmed Direct Lake dataset should be compared with that of a warm cache import dataset. In a perfect scenario, cold/warm cache DL performance should match the cold/warm cache import dataset.

Migrating Existing ML Models To Fabric Data Science

Sandeep Pawar — Tue, 04 Jul 2023 15:04:52 GMT

In the last blog, I showed how you can load ML models for scoring in Fabric. The assumption was you already have an ML model registered in the Fabric model registry. What if you have existing serialized models from another platform (e.g. AzureML, Databricks) that you want to migrate to Fabric? Currently, there are no options available yet to do this using the GUI. This will change in the future, but for now, below are two scenarios based on whether you have just the pickle file or the MLmodel format containing the pickle file along with other model artifacts (conda.yaml, model metadata etc.)

The focus of this blog is to migrate the models and the related artifacts, not the entire tracking store with artifacts. If that's what you are looking for, look into mlflow-export-import Python library.

Serialized Model

Below are the steps and the code if you have just the pickle (or any other supported serialized format) file.

Steps:

Save your serialized models to a folder.
Upload the models to a folder in the Files section of the lakehouse. Mount the lakehouse in the notebook.
In the Fabric notebook, use the below function to create MLflow run and register a Scikit-Learn model. You will need to provide three parameters:
1. model_file_path : This is the .pkl location using the Files API /lakehouse/default/Files//.pkl
2. model_name : name of your model
3. artifact_location : This is optional. This is the location under runs where model will be saved. If you don't specify anything, it will be saved under imported_model
Note that the function below assumed a sklearn model. You can register other model flavors the same way. Refer to MLFlow documentation.

           import cloudpickle           import mlflow           from mlflow.tracking import MlflowClient           from pathlib import Path           def register_model(model_file_path: str, model_name: str, artifact_path: str = "imported_model") -> None:               """               Author : Sandeep Pawar   |   fabric.guru   | July 3, 2023               Register a model in the Fabric MLflow model registry.               This function will log the model as an artifact within an MLflow run                and then register the model in the Fabric MLflow model registr with given name.               Parameters               ----------               model_file_path : str                   The path to the file containing the model to be registered.               model_name : str                   The name to be given to the registered model.               artifact_path : str, default "imported_model"                   The path where the model will be stored as an artifact within the run.               Returns               -------               None               Raises               ------               FileNotFoundError                   If the provided model_file_path does not exist.               Examples               --------               >>> register_model("/lakehouse/default/Files//.pkl", "churn_model","marketig_imported_model")               """               model_path = Path(model_file_path)               if not model_path.exists():                   raise FileNotFoundError(f"No file found at provided model path: {model_file_path}")               # Load the model               with model_path.open("rb") as f:                   model = cloudpickle.load(f)               # Start a new MLflow run               with mlflow.start_run() as run:                   # Log the model as an artifact within the run                   mlflow.sklearn.log_model(model, artifact_path=artifact_path)                   # Register the model in the MLflow model registry                   mlflow_model_uri = f"runs:/{run.info.run_id}/{artifact_path}"                   mlflow.register_model(model_uri=mlflow_model_uri, name=model_name)

Example:

If you need to include a signature, register the model and follow this example.
If you have many models, you can create a dictionary with the model name and file path and iterate over it using the above function.
Note that if the model is stored at a remote location but is accessible in Fabric notebook (blob storage, S3 etc.) via API, you can skip step 2 and provide the remote path directly.
Check the workspace or use the function I shared in the last blog to get a list of all the registered models.
Always backup and test. Load the newly registered model and create predictions to ensure it's working as expected.

MLmodel Format

MLflow adopts the MLmodel format as a way to create a contract between the artifacts and what they represent. The MLmodel format stores assets in a folder. Among them, there is a particular file named MLmodel. This file is the single source of truth about how a model can be loaded and used [1]. Below is a screenshot from AzureML.

Steps:

The steps are similar to those above, but instead of uploading a single model, you will upload the existing MLFlow model folder containing:
- MLModel : model metadata file
- model.pkl : serialized model
- conda.yaml : conda environment

Use the below function. Instead of the location of the pickle file, provide the folder path containing the above three artifacts. Note that the folder path is in File API format i.e. "/lakehouse/default/Files//"
After registering to the model registry, it takes a few minutes for it to show up in the workspace.

import cloudpickleimport mlflowfrom mlflow.tracking import MlflowClientfrom pathlib import Pathdef register_model(model_folder_path: str, model_name: str, artifact_path: str = "imported_model") -> None:    """    Author : Sandeep Pawar   |   fabric.guru   | July 4, 2023    Register an mlflow model folder to Fabric model registry.     This function will log the model as an artifact within an MLflow run     and then register the model in the Fabric MLflow model registr with given name.     The specified folder must have model as model.pkl, environment as conda.yaml    and the MLmodel file which contains the model metadata.    Parameters    ----------    model_folder_path : str        The lakehouse File path to the folder containing the mlflow model artifacts.    model_name : str        The name to be given to the registered model.    artifact_path : str, default "imported_model"        The path where the model will be stored as an artifact within the run.    Returns    -------    None    Raises    ------    FileNotFoundError        If the provided model_folder_path does not exist.    Examples    --------    >>> register_model("/lakehouse/default/Files//", model_name="churn_model")    """    model_fpath = Path(model_folder_path)    if not model_fpath.exists():        raise FileNotFoundError(f"Folder not found at provided path: {model_folder_path}. Check if Lakehouse has been mounted.")    model_path   = Path(model_folder_path,"model.pkl")    conda_path   = str(Path(model_folder_path,"conda.yaml"))     mlmodel_path = str(Path(model_folder_path,"MLmodel"))     # Load the model    with model_path.open("rb") as f:         model = cloudpickle.load(f)    with mlflow.start_run() as run:        (mlflow            .sklearn            .log_model(sk_model     = model                    ,  artifact_path = artifact_path                    ,  registered_model_name = model_name                    ,  conda_env = conda_path                    ,  metadata = mlmodel_path                         )        )

Example:

Note that in the code above, I did not log any additional model artifacts (charts, files, signatures etc.). If you have any additional files that you need to log, modify the above code. Also note that after registering the model for the first time, it will be registered with version 1. If you want to change the model version to reflect the correct version, update the model version using this example.

As I mentioned above, uploading a model directly using the UI may be possible in the future as Fabric progresses towards GA but for mass importing and registering the models, above code would still be helpful.

Third scenario, which I won't cover, is custom models. Since Fabric uses MLflow model registry, you need to register the model with MLmodel format. Follow this article for the steps to do so.

In the next blog, I will show how to download models registered in the Fabric model registry.

References

From artifacts to models in MLflow - Azure Machine Learning | Microsoft Learn

Loading ML Models in Fabric Data Science

Sandeep Pawar — Wed, 28 Jun 2023 12:00:39 GMT

This will be a short blog post. One of my colleagues was going through the Fabric Data Science documentation and did not find anything on how to get the model URI. If you are coming from Databricks, you can get the URI directly from the MLFlow model registry. The model hub in Fabric doesn't (yet) show that in the UI nor does it mention it anywhere in the documentation.

If you follow the documentation, you will see that the model is loaded using MLflowTransformer API from SynapseML as below:

from synapse.ml.predict import MLflowTransformer spark.conf.set("spark.synapse.ml.predict.enabled", "true") model = (MLFlowTransformer( inputCols=["x"],             outputCol="prediction",             modelName="sample-sklearn",             modelVersion=1, ))

SynapseML performs model scoring in parallel using the spark cluster. But what if you have existing code or don't want to use SynapseML? The solution is easy, you use the MLFlow API.

I will first get a list of all the registered models.

## This is optional, I am just getting a list of all the modelsimport mlflowimport pandas as pdfrom mlflow import MlflowClient## Instantiate the MLFlow clientclient = MlflowClient()#Get a list of models in the workspacemodels_list = client.search_registered_models()models_dict_list = [dict(model) for model in models_list]# Return a dataframe showing the modelsmodels_df = pd.DataFrame(models_dict_list)models_df

The above code will return a list of models registered in the worskpace.

If you already know the model and the version you want to use, you can directly use that for scoring as:

For sklearn:

For a specific version of the model:

model = mlflow.sklearn.load_model(model_uri="models://")

For the latest version of the model:

model = mlflow.sklearn.load_model(model_uri="models://latest")

Custom Flavor Model:

If you are not using any specific model flavor such as sklearn, pytorch etc., you can use:

model = mlflow.pyfunc.load_model(model_uri="models://")

Spark UDF:

You can also load the model as a spark UDF to parallelize scoring on a spark dataframe. You have to specify the conda environment which will recreate the environment at the time of inferencing.

mlflow.pyfunc.spark_udf(spark,model_uri="models://latest",env_manager="conda")

Using run id:

You can also use the experiment run id to obtain the model URI and load a model from a specific experiment:

model_uri = f"runs:/{run_id}/{model_name}"

Example:

I know, it was too easy but hopefully, I saved you some time until the documentation improves.

Note here that the model URI doesn't have any reference to a workspace. That's because, currently, cross-workspace model URIs aren't available. You can only use the model in the workspace it resides in. This will change in the future. For now, as a workaround, you could save the serialized model to the lakehouse and load it. The drawback of this approach is that it bypasses the model registry, which deviates from best practices.

Stay tuned, I have a whole series of blogs planned for Data Science and ML on Fabric!

Reference

MLflow Models MLflow 2.4.1 documentation

Data science tutorial - train and register machine learning models - Microsoft Fabric | Microsoft Learn

Fabric : Not All Delta Tables Are Created Equally

Sandeep Pawar — Tue, 27 Jun 2023 04:43:06 GMT

Tables in a Microsoft Fabric lakehouse are based on the Delta Lake table format. Delta Lake is an open-source storage layer for Spark that enables relational database capabilities for batch and streaming data. By using Delta Lake, you can implement a lakehouse architecture to support SQL-based data manipulation semantics. All compute engines in Fabric create Delta tables which can be used for analytics. Depending on the skill and experience level of the person using Fabric, they can choose the compute engine they are most familiar with and create the Delta tables. The size, number of files, number of rowgroups, and number of rows per rowgroup play an important role in analytical workload performance. Ideally, all compute engines should create optimized Delta tables that are same/similar. But do they and how does it impact performance?

Setup

I used the Fabric F64 trial capacity with the standard medium node. I used the lineitem table from the TPCH_SF100. This table has 600 million rows and 15 columns. I saved it in the Files section as parquet files, ~19.4 GB in size. To test whether each compute creates same/similar Delta tables, I used the same source data (i.e parquet files) and created the Delta tables as below:

DFg2 : Dataflow Gen 2 with Lakehouse as the destination
PL: Using copy activity in Data Factory
DWH : Datawarehouse
spark: In Fabric, by default optimize Write spark configuration aims to optimize the number of files and file size. I created Delta using three configurations:
- spark_optimizeon : pysaprk with spark.conf.set("spark.microsoft.delta.optimizeWrite.enabled", "true")
- spark_optimizeoff : pysaprk with spark.conf.set("spark.microsoft.delta.optimizeWrite.enabled", "false")
- spark_optimize : OPTIMIZE to reduce the number of files with fewer large files.
Load To Tables : Not available yet for folders so couldn't test

It is important to mention here that Fabric is still in Public Preview and under development. I have been running tests and measuring performance in Fabric since its launch and wanted to understand if it matters which engine was used for creating Delta lake. These results will change based on your data, configurations and updates made to these engines. Run your own tests and experiments to draw conclusions.

Here is the schema of this table:

Delta Write Time:

Below is how long it took for each method. Dataflow Gen2 took 9 hours to ingest and save to the lakehouse! No surprise, we know Dataflow Gen2 is still being developed actively and hopefully it will be as fast as DWH and spark by GA. This will be especially critical for Power BI developers.

Delta Structure

To analyze the Delta tables, I looked at :

size of the Delta table
number of parquet files
average file size
if the Delta was V-Order optimized
number of rowgroups and number of rows per rowgroup (For this I looked at the largest parquet file in the table)

Below are the results:

Observations:

It is evident from the results above, not all Delta tables are created equally
Pipeline created the smallest Delta table (13.4GB) and DWH/DFg2 the largest (16.8GB). All spark configurations created about the same Delta size.
The number of parquet files varies significantly. The number of files in a Delta table directly affects the performance of compute operations. Having a large number of small files can lead to increased overhead during read operations due to the need to open multiple file connections while having too few large files can limit parallelism and result in data skew during processing. Ideally, all engines should create same/similar number of files. But, Dataflow Gen2 created a single monster parquet file that's 16.6 GB in size while DWH created many small files with an average file size of 52 MB.
What's interesting is that I expected optimizeWrite to automatically optimize the number of files and file size but it did not. This spark configuration did not affect the Delta size/structure. It's possible that spark engine already optimized the Delta and there is not much room for further optimization. Hence, I ran another test with OPTIMIZE command.
Running OPTIMIZE reduced the number of files from 150 to 15 with each file being ~1GB in size.
The number of row groups in a Delta table is essential as it affects query performance: too many can increase management overhead, while too few or overly large ones can limit query optimization and cause unnecessary data reading. Balancing the size and number of row groups based on your data and queries is crucial for optimal performance. There is no magic number for the number of rowgroups or the size of the rowgroups. It depends on the engine and the operation. In this case, both vary significantly based on the engine used. We can expect the performance to vary as well. For my purposes, I will focus on Direct Lake and spark query performance.

Query Performance

I ran a simple aggregation query based on two low cardinality columns and aggregated two numeric columns in spark and Direct Lake.

durations = []for t in list(final.name):    start_time = time.time()    df = (spark            .read            .format("delta")            .load("Tables/"+t)            .groupby(["l_returnflag","l_linestatus"])            .sum("l_quantity","l_extendedprice")            .collect()            )    duration = time.time()-start_time    durations.append(math.ceil(duration))    print(t,duration )    final["spark_duration"] = durationsfinal

The lowest durations are highlighted in yellow in the table below. For Direct Lake, all DAX query runtimes are cold cache, meaning after each run I reframed the dataset manually to ensure it's hitting the cold cache.

Observations:

Dataflow Gen2 was the slowest for spark and Direct Lake. This was expected as DFg2 created one giant parquet file.
Except for DFg2, spark performance wasn't affected much for this query
DWH currently doesn't support Direct Lake and hence it was in Direct Query mode (47s). But this gives a good way to benchmark Direct Lake performance.
All spark created tables ran faster than Direct Query. Delta table created with OPTIMIZE ran the fastest, even faster than the other two spark tables. I am not sure if my tests are wrong, data is not conducive for optimizeWrite or optimizeWrite isn't working as expected.
Delta size and structure matter for Direct Lake performance. I don't know what and how to optimize but that's what oprimizeWrite was supposed to provide. More tests on diverse datasets will be needed to understand the implications of Delta table structure on Direct Lake performance.
Definitely pay extra attention to Delta tables created with DFg2 while it is still in development.
I should mention that I also compared warm cache performance and as expected there was no difference in performance. After caching the data in memory, the same queries returned the data in ~300ms, giving import-like performance.
Note that all the Delta tables were V-Order optimized (except for DWH which doesn't show V-order in the logs).

Conclusion & Thoughts

I hate to generalize based on a limited set of tests and especially while Fabric is still in development but for my dataset, the queries I ran and the configurations I considered, spark created Delta tables were most optimal for Direct Lake performance. Optimize the Delta tables with OPTIMIZE and do not rely on optimizeWrite alone at this point. Avoid Dataflow Gen2 for performance comparisons, while it is still in development. Certainly test, experiment with DFg2 but be aware that Delta tables may not be optimal. While spark query performance is not affected much by the underlying Delta table, Direct Lake performance is certainly influenced by it. Not all Delta tables are created equally in Fabric.

I will run these tests again every few weeks to check how Fabric changes and improves. Let me know your thoughts, and any additional tests/configurations I should consider.

Installing Custom Python Packages In Fabric

Sandeep Pawar — Fri, 23 Jun 2023 19:28:07 GMT

Updated : Aug 31, 2023

In Microsoft Fabric, you can install Python libraries from PyPI and conda feeds. You can install at the workspace level or the session level in notebooks. Before I show how to install custom Python packages, a quick recap of installing standard libraries.

Workspace Library Management

Any libraries installed at the workspace level will be available for all notebook sessions in that worskspace.

💡

This is not mentioned in the documentation, but you must have an Admin role in the workspace to install libraries at the workspace level.

You can manually install libraries one at a time using PyPI or conda channels. You can also upload YAML file with the required libraries to install a number of libraries. If you have an existing workspace with a number of Python libraries installed, you can export the YAML file and add it to the new workspace to replicate the environment.

Currently, multiple environments are not supported so you will have only one virtual environment per workspace. This will change in the future. Also, note that installing custom libraries will slow down the spark cluster start-up time. Only install the libraries you need.

To get a list of default libraries installed on the cluster, use the notebook and run !pip list . Unlike AzureML, the list of default libraries is not published in the documentation yet.

Below is a sample YAML file:

In-line Installation

You can also install libraries at the session scope in a notebook by running !pip install or !conda install . You can also use %pip install which will install in the virtual environment (not available yet) whereas !pip install installs the package in the base environment. If you have several libraries you want to install using PyPI, you can create a requirements.txt file, save it to the lakehouse and install. Read this article for instructions to create and use requirements.txt.

Custom Python Packages

What if you have custom Python functions and packages? You can build a wheel package for your Python files and install it at the workspace or session level. Let's say you want to use my check_vorder function in all of your notebooks in a workspace. I am going to call my package needles (needles to build Fabric, get it 😜 ). This package has the above function. Below are the steps to build the wheel file.

Create a folder with the below structure:

setup.py has the information regarding the package and the dependencies:

      ## Sandeep Pawar | fabric.guru ##      ## setup.py      from setuptools import setup, find_packages      setup(          name='needles', #needs to build fabric          version='0.0.1',          url='https://www.fabric.guru',          author='Sandeep Pawar',            author_email='pawarbi@outlook.com',          description='Utilities for Microsoft Fabric',          packages=find_packages(),            install_requires=['pyarrow'],      )

init.py can be blank. I like to add a function to check if it has been imported successfully:

      ## __.init__.py      def function_init():       print('Successfully Imported Init.py')

needles folder has: init and the python file with my function

check_vorder has the below function:

      ## check_vorder function      def get(table_name_path):          '''          Author: Sandeep Pawar | fabric.guru |  Jun 6, 2023          Provide table_name_path as '//lakehouse/default/Tables/'          If the Delta table is V-ordered, returns true; otherwise, false.          You must first mount the lakehouse to use the local filesystem API.          '''          import os           if not os.path.exists(table_name_path):              print(f'{os.path.basename(table_name_path)} does not exist')              result = None  # Initialize the variable with a default value          else:              import pyarrow.dataset as ds              schema = ds.dataset(table_name_path).schema.metadata              is_vorder = any(b'vorder' in key for key in schema.keys())              if is_vorder:                  result = str(schema[b'com.microsoft.parquet.vorder.enabled'])              else:                  result = "Table is not V-ordered"          return result

Now you just need to build the wheel by going to the directory with the setup.py file and running the command python setup.py bdist_wheel . Be sure you have wheel installed (!pip install wheel)

This will package your Python files in a wheel file. You will find it in the dist folder in the same directory.

Installing wheel in the workspace

To install it in the workspace, go to Library Management in the workspace settings. Under Custom libraries, select upload and add the newly created whl file. This can take a while.

There is no way to download a whl file uploaded here. Also, there is no central repository where the tenant admin can upload custom packages to make them available to all/selected workspaces. That would be a great feature.

Once successfully uploaded, the package would be available for use in all future sessions:

Installing wheel in session scope

You can also save the wheel file to a lakehouse just like any other file and install it in-line. Unlike workspace installation, this can be used across workspaces.

Once support for virtual environments becomes available, it will make managing environments and dependencies for multiple projects much easier. Until then, if you need to isolate environments, either create multiple workspaces (not practical ) or install in-line (increases run time).

Update:

While you can use the Files section in the lakehouse to upload the whl and py files, there is now a dedicated location per notebook where you can upload "notebook resources".

It will always stay attached to the notebook and can be referred to in the pipelines too. There is one catch, I am not sure if it's a bug or now. If you use the relative path "/builtin/...", installation will not work. Instead, use

path = mssparkutils.nbResPath + "/builtin/<>package_name>.whl"

!pip install path

💡

Each file size is limited to 50 MB or less. Only certain file types can be used including PY, WHL, JAR, TXT, JSON, YML, XML, CSV, HTML, PNG, JPG, and XLSX. tar.gz is not supported yet.

You can download the above whl packages from here. Note that the process for py, jar and tar.gz (R) files is the same.

Read the official documentation here. For R, read this.

V-Order and Sorting

Sandeep Pawar — Tue, 20 Jun 2023 22:03:15 GMT

I have written about V-order before. A quick summary from official MS Docs:

V-Order is a write time optimization for the parquet file format that enables fast reads under Microsoft Fabric compute engines like Power BI, SQL, Spark, and others. It improves data access times and provides cost efficiency and performance by applying sorting, row group distribution, dictionary encoding, and compression on parquet files.

To check if a table is V-order you can read my blog.

By default, all engines in Microsoft Fabric create V-order optimized tables. If they currently do not, they will be by GA. However, currently, there is one specific instance when tables will not be V-order'd - if you sort the dataframe in spark.

I have an existing table called orders . I can confirm if it's V-order'd by using my script from the above blog:

Sort & Save Table

If I sort the spark dataframe and save the table, the table will not be V-order'd. User will not get any error or warning. In the example below, I sorted the dataframe by year column before saving and this resulted in a table without V-order.

Solution

To force V-order for a sorted dataframe, you have to use option("parquet.vorder.enabled", "force_true") while saving the table, as shown below.

You can also set spark configuration setting spark.conf.set('spark.sql.parquet.vorder.ignorePlan', 'true') to apply V-order in the Spark session.

As the product is in public preview, note that this behavior may change in the future. Always check if the table is V-order enabled. As far as I know, this is only applicable to spark. Refer to the official documentation here in the future for details.

Thank you to Tamas Polner at Microsoft for this information.

Fabric Project 1 : Scraping The Ideas Site

Sandeep Pawar — Mon, 19 Jun 2023 21:53:33 GMT

Motivation

My friend Alex Powers recently shared the PowerQuery template file to scrape the Microsoft Fabric ideas site. Until Microsoft Fabric was announced, this would have been the only way to scrape the site for analysis in Power BI. But with Fabric, we can not only scrape the data but also create and operationalize analytics solutions that were not possible using one single platform. I strongly believe Fabric will facilitate the development of an array of solutions using machine learning, LLMs that will further augment what we have been developing using Power BI alone.

This is a weekend side project for me, so I will attempt to share as much as I can whenever possible. Like real-world projects, I will iterate over many ideas and solutions so others can see how it evolves and also follow along/contribute.

My Goals:

Learn and share what I learn
Explore Microsoft Fabric
Highlight how Microsoft Fabric allows creating solutions not possible before, especially for Power BI developers
Scrape responsibly and develop performant/scalable solutions
The final solution will contain notebooks, orchestration pipelines, dimensional modeling, machine learning models with experiments and a Power BI report
It is semi-ambitious and will take time but I know it will be worth it

https://twitter.com/notaboutthecell/status/1667250335230971905?s=20

If you are new to Python, I will try to explain as much code as I can. This will be a work in progress and I will label it accordingly. If I feel something is ready, I will label it as complete.

Scraping The Site:

To scrape the ideas site, I will be using BeautifulSoup.

I have updated the script below to scrape pages concurrently. This reduced the scraping time in half compared to the previous version.

Below Python script is a multi-threaded web scraper that extracts data. The script retrieves idea submissions, associated metadata (like votes, state, description), and user information from the specified number of pages, processes this information and saves it into a pandas DataFrame. I like to think in steps and use pseudocode so if you are new to this, use this pseudo-code to understand what's going on:

Pseudo-code

https://gist.github.com/pawarbi/5455568d9e10a62d580b9af6e0efe849

Code

!pip install beautifulsoup4import timeimport requestsfrom bs4 import BeautifulSoupimport pandas as pdimport concurrent.futuresclass Scraper:    """    Author: Sandeep Pawar   | Fabric.Guru   | Jun 19, 2023    This class is responsible for scraping ideas from the Fabric ideas site    Warning: Be careful with the value of max_pages. If you set it to 'max', it will    scrape data from pages until no data is available. This could result in a large    number of requests to the server, which may not be allowed by the site's policy.    Always ensure that your scraping activities respect the site's terms of service.    Use it only for learning purposes and avoid any PII data.    """    def __init__(self, url, experience_name, max_pages=None):        self.base_url = url + "?page="        self.dfs = []  # List to store dataframes        if max_pages is None:            self.max_pages = 3        elif max_pages == -1:            self.max_pages = 'max'        else:            self.max_pages = max_pages        self.experience_name = experience_name    def extract_data(self, idea):        # Helper function to reduce redundancy        def get_text(selector):            element = idea.select_one(selector)            return element.text.strip() if element else None        def get_href(selector):            element = idea.select_one(selector)            return element.get('href') if element else None        href = get_href(".mspb-mb-18 > .ms-card-heading-1 > A:nth-child(1):nth-last-child(1)")        user_id = get_href("a.view-profile")        user_id = user_id.split('=')[-1] if user_id else None        idea_id = href.split('/')[-1] if href else None        return {            "product_name": get_text('span.mspb-mb-4.ms-card-heading-2'),            "idea_id": idea_id.split("=")[1] if idea_id else None,            "idea": get_text(".mspb-mb-18 A"),            "votes": get_text(".ms-fontWeight-semibold *"),            "date": get_text(".mspb-my-0 SPAN *"),            "state": get_text(".ms-tag\-\-lightgreen"),            "user_id": user_id if user_id else "NA",            "like_count": get_text(".ms-like-count"),            "description": get_text(".ms-fontColor-gray150.mspb-mb-12"),        }    def get_page_data(self, session, page):#         time.sleep(1)  # Add a delay here        current_url = self.base_url + str(page)        response = session.get(current_url)        if response.status_code != 200:            return None        soup = BeautifulSoup(response.content, "html.parser")        ideas = soup.select(".mspb-py-sm-24")        if not ideas:  # If there are no matches            return None        return [self.extract_data(idea) for idea in ideas]    def scrape_data(self):        # Create a session        with requests.Session() as session:            pages_to_scrape = list(range(1, self.max_pages+1)) if self.max_pages != 'max' else range(1, 100)  # Consider a large number for 'max'            with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:  # Limit the number of workers                results = executor.map(lambda page: self.get_page_data(session, page), pages_to_scrape)            for result in results:                # If the page doesn't contain data, skip                if not result:                    continue                df_page = pd.DataFrame(result)                self.dfs.append(df_page)        # Concatenate all the dataframes into one        df = pd.concat(self.dfs, ignore_index=True)        df["Experience"] = self.experience_name        df["scraped_date"] = pd.Timestamp.now().normalize().strftime("%Y-%m-%d")        return dfurl = "https://ideas.fabric.microsoft.com/"scraper = Scraper(url, "Power BI", max_pages=3)df_ideas = scraper.scrape_data()

Few things to note about the code:

Concurrent Scraping: The code utilizes Python's concurrent.futures library to scrape multiple pages simultaneously. This is much faster than scraping pages one after another.
Error Handling: The code includes checks to handle situations where a requested page returns a status code other than 200 (which indicates success), or when no 'idea' elements are found on a page. These checks help the program avoid crashes and missing data.
Customizable Scrape Limits: The scraper can be set to scrape a specific number of pages, or to keep scraping until it doesn't find any more data. This is handled in the __init__ and scrape_data methods of the Scraper class.
Session Usage: The requests.Session object is used to persist certain parameters across requests. It also provides connection pooling, which means that after a connection has been established, subsequent requests to the same host will reuse the underlying TCP connection, leading to performance improvements.

The output of the Scraper class will be a pandas DataFrame containing selected attributes of each idea. This will serve as the primary fact table. To avoid collecting personally identifiable information (PII), I am intentionally not scraping the usernames. Remember to scrape responsibly and use this as a learning opportunity.

Ideas Dimension Table

In the above class, we only scraped the ideas. Now we get the full description of each idea. The logic is still the same as above, just that we go to each ideas page and get the details about the idea:

Status: completed

import concurrent.futuresfrom tqdm import tqdmdef get_description(idea_url):    try:        with requests.Session() as session:            response = session.get(idea_url)            if response.status_code == 200:                soup = BeautifulSoup(response.content, 'html.parser')                text_element = soup.select_one('.mspb-mb-18')                if text_element:                    desc = text_element.get_text(strip=True)                    return desc            else:                print(f"Page not found. URL: {idea_url}")    except (requests.RequestException, ValueError, AttributeError) as e:        print(f"Error processing URL: {idea_url}. {e}")    return Nonedf_descriptions = df_ideas[['idea_id']].copy()df_descriptions.loc[:,'urls'] = "https://ideas.fabric.microsoft.com/ideas/idea/?ideaid=" + df_descriptions['idea_id'].astype(str)ideas = df_descriptions['urls']descriptions = []def process_idea(idea):    description = get_description(idea)    return descriptionwith concurrent.futures.ThreadPoolExecutor() as executor:    results = list(tqdm(executor.map(process_idea, ideas), total=len(ideas), desc="Processing URLs"))df_descriptions['description'] = resultsdf_descriptions["scraped_date"] = pd.Timestamp.now().normalize().strftime("%Y-%m-%d")

The output of this is a pandas dataframe containing the full description of each idea.

User Dimension Table

As mentioned above I am not scraping the user name intentionally. But for dimensional modeling, it will be nice to have user names so I will use Faker library to create fake names in different languages with user attributes.

!pip install Fakerfrom faker import Fakerimport randomimport pandas as pdimport numpy as npfrom datetime import datetime, timedeltadef generate_fake_data(df):    # Copy the dataframe    user_df = df[['user_id', 'scraped_date']].copy().drop_duplicates()    # Initialize Faker with multiple locales    fake = Faker(['it_IT', 'en_US', 'ja_JP', 'en_IN'])    # Define the start date (6 years ago from today)    start_date = datetime.today() - timedelta(days=6*365)    # Create a vectorized function for generating random dates    vectorized_random_date = np.vectorize(lambda x: start_date + (datetime.today() - start_date) * random.random())    # Create new columns with fake data    user_df['name'] = np.array([fake.name() for _ in range(user_df.shape[0])])    user_df['email'] = np.array([fake.email() for _ in range(user_df.shape[0])])    user_df['company'] = np.array([fake.company() for _ in range(user_df.shape[0])])    user_df['title'] = np.array([fake.job() for _ in range(user_df.shape[0])])    user_df['user_since'] = vectorized_random_date(np.ones(user_df.shape[0]))    user_df['user_since'] = pd.to_datetime(user_df['user_since']).dt.date    return user_dfdf_users = generate_fake_data(df_ideas)

The output is a pandas dataframe with user details.

Date Dimension Table

I already have a blog post on how to create a date table using Python. I will use that. I don't need all the columns and will drop unnecessary columns later.

import pandas as pdimport numpy as npfrom pandas.tseries.offsets import DateOffset, BMonthEndfrom datetime import datetime#This is for Microsoft Fabric only to optimize the Delta tablespark.conf.set("spark.microsoft.delta.optimizeWrite.enabled", "true")def date_dimension(start_date: str, end_date: str) -> pd.DataFrame:    '''    Author : Sandeep Pawar | fabric.guru    Version :2 [06-15-2023]    This functions creates a date dimension table based on the start date and     end date.         start date and end date should be in 'yyyy-mm-dd' format such as '2023-12-31'    '''    try:        start_date = datetime.strptime(start_date, '%Y-%m-%d')        end_date = datetime.strptime(end_date, '%Y-%m-%d')    except ValueError:        raise ValueError("Invalid date format. Please provide dates in 'yyyy-mm-dd' format.")    if end_date <= start_date:        raise ValueError("End date should be after start date.")    df = pd.DataFrame({"Date": pd.date_range(start_date, end_date)})    df["DateKey"] = df.Date.dt.strftime('%Y%m%d').astype(int)    today = pd.Timestamp.now().normalize()    current_week = today.week    current_month = today.month    current_quarter = today.quarter    current_year = today.year    df["ISODateName"] = df.Date.dt.strftime('%Y-%m-%d')    df["AmericanDateName"] = df.Date.dt.strftime('%m/%d/%Y')    df["DayOfWeekName"] = df.Date.dt.day_name()    df["DayOfWeekShort"] = df.Date.dt.day_name().str[:3]    df["MonthName"] = df.Date.dt.month_name()    df["MonthShort"] = df.Date.dt.month_name().str[:3]    df["YearWeekName"] = df.Date.dt.strftime('%YW%V')    df["YearMonthName"] = df.Date.dt.strftime('%Y-%m')    df["MonthYearName"] = df.Date.dt.strftime('%b %Y')    df["YearQuarterName"] = df.Date.dt.year.astype(str) + 'Q' + df.Date.dt.quarter.astype(str)    df["Year"] = df.Date.dt.year    df["YearWeek"] = df.Date.dt.year*100 + df.Date.dt.isocalendar().week    df["ISOYearWeekCode"] = df.Date.dt.year*100 + df.Date.dt.isocalendar().week    df["YearMonth"] = df.Date.dt.year*100 + df.Date.dt.month    df["YearQuarter"] = df.Date.dt.year*100 + df.Date.dt.quarter    df["DayOfWeekStartingMonday"] = df.Date.dt.dayofweek + 1    df["DayOfWeek"] = np.where(df.Date.dt.dayofweek == 6, 1, df.Date.dt.dayofweek + 2)    df["DayOfMonth"] = df.Date.dt.day    df["DayOfQuarter"] = (df.Date.dt.dayofyear - 1) % 91 + 1    df["DayOfYear"] = df.Date.dt.dayofyear    df["WeekOfQuarter"] = (df.Date.dt.day - 1) // 7 + 1    df["WeekOfYear"] = df.Date.dt.isocalendar().week.astype('int64')    df["ISOWeekOfYear"] = df.Date.dt.isocalendar().week.astype('int64')    df["Month"] = df.Date.dt.month    df["MonthOfQuarter"] = (df.Date.dt.month - 1) % 3 + 1    df["Quarter"] = df.Date.dt.quarter    df["DaysInMonth"] = df.Date.dt.days_in_month    df["DaysInQuarter"] = (df.Date + pd.offsets.QuarterEnd(1)).dt.day    df['DaysInYear'] = df['Date'].dt.is_leap_year+365    df['FirstDayOfMonthFlag'] = (df['Date'].dt.is_month_start).astype(int)    df['LastDayOfMonthFlag'] = (df['Date'].dt.is_month_end).astype(int)    df['IsTodayFlag']=(df['Date'] == pd.Timestamp.today().date()).astype(int)    df['IsToday'] = np.where(df['Date'].dt.date == today, 1, 0)    df['IsCurrentWeek'] = np.where(df['Date'].dt.isocalendar().week == current_week, 1, 0)    df['IsCurrentMonth'] = np.where(df['Date'].dt.month == current_month, 1, 0)    df['IsCurrentYear'] = np.where(df['Date'].dt.year == current_year, 1, 0)    df['IsCurrentQuarter'] = np.where(df['Date'].dt.quarter == current_quarter, 1, 0)    df['NextDay'] = df['Date'] + DateOffset(days=1)    df['PreviousDay'] = df['Date'] - DateOffset(days=1)    df['PreviousYearDay'] = df['Date'] - DateOffset(years=1)    df['PreviousMonthDay'] = df['Date'] - DateOffset(months=1)    df['NextMonthDay'] = df['Date'] + DateOffset(months=1)    df['NextYearDay'] = df['Date'] + DateOffset(years=1)    return df# For testingstart_date = "2023-01-01" end_date = "2024-05-25" date_table = date_dimension(start_date, end_date)date_table

Output is a pandas dataframe with dates and date attributes

Need to create a couple of more dimensions, scrape for each Fabric workload, clean/transform.

Using The Lakehouse Connector In Power BI Desktop

Sandeep Pawar — Fri, 16 Jun 2023 13:55:32 GMT

Lakehouse Connector

You can use the Lakehouse connector in Power BI Desktop to connect to the Fabric lakehouse. Depending on your scenario, you can connect to the default dataset in Direct Lake mode or to the SQL endpoint.

Direct Lake Mode

For each lakehouse created in the OneLake, a default dataset and a SQL endpoint are automatically created (this may change in the future).

To connect to the default dataset which is always in Direct Lake mode using "Live" connection, use the Connect option. In the Live mode, you will connect to the semantic model with relationships, measures etc. You cannot edit the model or the data. You can add report-scope measures. If you want to add more data, you can convert it into a Direct Query Over AS (i.e composite model) and add another data source.

If you would like to learn more about Direct Lake datasets, please read my comprehensive blog on Direct Lake datasets.

SQL Endpoint

Instead of the default dataset, if you want to either import the tables or use it in Direct Query mode, select "Connect to SQL endpoint". This is similar to connecting to a SQL server.

Quick note: To connect to the Custom Direct Lake dataset, use the "Power BI datasets" option from the data hub.

Your Fabric capacity must be running for you to connect to the tables. If you are using the PAYG capacity and if it's paused, this will not work.

How About Importing Files

Unlike the Lakehouse connector in Dataflow Gen2, in Desktop, you can only connect to tables. To import files, read my other blog

Querying KQL Database in Fabric Notebook

Sandeep Pawar — Mon, 12 Jun 2023 15:24:27 GMT

In Azure Data Studio, you can install the KQL extension to connect to and query Azure Data Explorer clusters. This allows you to access the KQL database and query/plot right in the notebook. Microsoft Fabric notebook does not have such an extension nor does it support a KQL kernel. But we can still use the Azure Kusto Python SDK to do the same thing. With the SDK you can write the KQL queries and create pandas dataframes for further analysis.

Steps:

In the Fabric notebook, install the Azure Kusto SDK
!pip install azure-kusto-data
!pip install azure-kusto-ingest ## this is optional
Next, we need three things - tenant ID, Kusto cluster URL and the Kusto DB name.
- Tenant ID: You can get tenant ID from the spark configuration
```
  AAD_TENANT_ID  = spark.conf.get("trident.tenant.id")
```
- Kusto Cluter URL and DB name : You can get this from the Kusto DB in Fabric directly.

Copy the Kusto DB name and the URI and assign it to global variables similar to tenant ID.

KUSTO_CLUSTER = "https://.z0.kusto.data.microsoft.com"KUSTO_DATABASE = ""

Connect to the Kusto DB and query the database by writing the KQL directly in the notebook. You will be prompted to authenticate yourself using device authentication, i.e. click on the link provided and paste the code.

from azure.kusto.data import KustoClient, KustoConnectionStringBuilderfrom azure.kusto.data.exceptions import KustoServiceErrorfrom azure.kusto.data.helpers import dataframe_from_result_tableimport pandas as pd## Device authenticationKCSB = KustoConnectionStringBuilder.with_aad_device_authentication( KUSTO_CLUSTER)KCSB.authority_id = AAD_TENANT_IDKUSTO_CLIENT = KustoClient(KCSB)## Write KQL query belowKUSTO_QUERY = """"""## Execute KQLRESPONSE = KUSTO_CLIENT.execute(KUSTO_DATABASE, KUSTO_QUERY)df = dataframe_from_result_table(RESPONSE.primary_results[0])

Note here that, unlike Azure Data Studio KQL extension, there is no KQL intellisense and you will primarily be analyzing the data using pandas.

You can also use username/password and AAD authentication.

## Using user name and password instead of device authenticationKCSB = KustoConnectionStringBuilder.with_aad_user_password_authentication(cluster, username, password, authority_id)

Using AAD app auth:

# AAD applicationKCSB = KustoConnectionStringBuilder.with_aad_application_key_authentication(cluster, client_id, client_secret, authority_id)

You can read more about other authentication types supported and the SDK here.

Additional Resources:

Thoughts On Spaces In Workspace And Column Names in Microsoft Fabric

Sandeep Pawar — Mon, 05 Jun 2023 04:47:59 GMT

If you are coming to Microsoft Fabric from Power BI, you will know that it's very common to have spaces in Workspace, Table and Column names to make them user-friendly. Power BI has no restriction on including spaces in names. For example, if you look at the official MSFT documentation page about Workspaces (see below), the workspace is named New Opportunity Analysis. This makes the name, easy to read and business-friendly. In Microsoft Fabric, there is no restriction on white spaces in workspace names either. However, in Fabric, removing the spaces will help in certain situations.

Similarly, in Power BI datasets, it's a best practice to make table and column names business-friendly for other developers/users to easily query the tables, and build reports in Power BI or Excel. In Fabric, you will first save the table in Delta format and then you will use it in Import, Direct Query or Direct Lake mode. In an ideal scenario, the table and column names in delta tables should be the same as the Power BI dataset to minimize rework. However, doing so has consequences and should be avoided as discussed below.

Workspaces

If you keep the spaces in workspace names, you will have to query the tables and files in spark notebook using the workspace id. As I mentioned in the previous blog, you do not need to mount a Lakeshouse to read or write using spark.

## query a table in a workspace with spaces in names(spark.read.format("delta").load("abfss://@onelake.dfs.fabric.microsoft.com//Tables/customer"))

If the space is removed from the workspace name, i.e instead ofNew Opportunity Analysis we rename it to New_Opportunity_Analysis , we can use the lakehouse name and workspace name directly in the abfss path, instead of the GUIDs. This will make the code readable, and easier to track lineage when the notebook is shared or exported.

## with workspace and lakehouse names in the abfss path(spark.read.format("delta").load("abfss://New_Opportunity_Analysis@onelake.dfs.fabric.microsoft.com/NewLakehouse.Lakehouse/Tables/customer"))

Note that the abfss path is abfss://@onelake.dfs.fabric.microsoft.com/.Lakehouse/Tables/customer . You have to add .Lakehouse after the Lakehouse name. For Datawarehouse items, add .Datawarehouse . This also works with URL paths and not just ABFSS paths.

The disadvantage here is that if the workspace name is changed in the future, it will break the code. Using GUID will always work. So it is up to the developer to assess what works best in their scenario. In most production cases, workspace names rarely change so I would default to using names instead of the GUIDs.

It's not possible to mix and match, i.e. use workspace GUID and Lakehouse name.

Thank you to Qixiao Wang at Microsoft for the information.

Just as a side note, if you run SELECT @@SERVERNAME on the T-SQL endpoint you will see that the server is actually the workspace GUID and Lakehouse is the database. So treat the workspace name as the server name and exclude the spaces.

Column Names

Delta tables by default do not allow special characters, including spaces, in column names. Let's take an example below, where the column names have spaces and are not business-friendly.

In a Power BI dataset, we would rename these columns by removing "_" and making them readable, e.g. instead of c_name we would use Customer Name etc. Let's do that and try saving it as a Delta table. To rename columns with Pyspark, we can use .withColumnRenamed() .

If I try to save this table with spaces in column names, I get the following error message, prompting me to set the column mapping config in spark settings.

Setting the Column Mapping to "name" worked successfully.option('delta.columnMapping.mode' , 'name')

(df2.write.format("delta").option('delta.columnMapping.mode' , 'name').saveAsTable("Customer_Table"))

The new Delta table has column names that are business-friendly similar to Power BI datasets and have spaces in them.

However, there is a problem. If you switch over to the SQL Endpoint view, you will see an error message:

This is because, the SQL Endpoint does not support the column mapping property we defined earlier in spark, i.e. while Delta table supports the spaces, SQL Endpoint does not. This also means that when you try to create a dataset, this newly created table with column mapping won't even show up in the dialog box.

Hopefully, this limitation in Fabric will be removed in the future.

Upgrading protocols is an irreversible process, so be sure to test this on a test dataset before making any changes.

The solution, in my opinion, is instead to make column names lowercase and replace spaces with underscores instead. This will make it readable and also will be easier to change to title case with spaces in Tabular Editor later when XMLA-Write becomes available. It's common practice to keep names and variables lowercase in Python and Pyspark. Below C# can be used in the future to change column_name to Column Name for all tables when XMLA-W becomes available.

// Script to change column_name to Column Name// Author : Sandeep Pawar   |   fabric.guru// Loop through all the tables in the modelforeach (var table in Model.Tables){    // Loop through all columns in the table    foreach (var column in table.Columns)    {        // Replace spaces with underscore in all column names        string newName = column.Name.Replace("_", " ");        // Convert to title case        System.Globalization.CultureInfo cultureInfo = System.Threading.Thread.CurrentThread.CurrentCulture;        System.Globalization.TextInfo textInfo = cultureInfo.TextInfo;        newName = textInfo.ToTitleCase(newName);        // Update the column name        column.Name = newName;    }}

To Recap: Clean the column names in the notebook or Dataflow Gen2 to lowercase with an underscore and then change it to title case without an underscore using Tabular Editor when XMLA Write becomes available. Note that you can rename manually as well in the web modeling experience. Avoid column mapping as it won't work with SQL Endpoint.

Update 6/19/2023: Charles Webb who is the Microsoft PM for DWH confirmed that column mapping is on the roadmap for Fabric.

https://twitter.com/CharlesWebb22/status/1668658487881248770?s=20

Case Sensitivity of Column Names

Unlike Power BI, column names in spark dataframes are not case-sensitive. As you can see below, I have a dataframe with the same column names in different cases. This will not work in Power Query.

Spark environment does not provide a configuration setting to prevent the creation of dataframe columns with the same name but in different cases, even when case sensitivity is turned on using spark.conf.set("spark.sql.caseSensitive","true")

The case sensitivity setting only affects the way Spark SQL queries treat column names, it doesn't prevent you from creating a dataframe with columns whose names only differ by case.

Therefore, it's necessary to implement a manual check before or after creating the dataframe to ensure no two columns have the same name, disregarding case. Below function can be used to achieve that:

# function to check if schema contains duplicate column names (ignoring case)def has_duplicates(schema):    column_names = [field.name.lower() for field in schema]    return len(column_names) != len(set(column_names))

Ideally, you will create Views to alias the column and table names but Direct Lake datasets cannot be created based on Views - yet.

What about Lakehouse and Table Names?

Both Lakehouse and Table names cannot contain spaces. Delta specification does not allow spaces in table names. Similar to the above approach, you can make the table names lowercase with an underscore for spaces and then update them in Tabular Editor later.

I should add that I am not 100% sure yet how to run the above C# scripts in pipelines to make it fully automated. If you have any thoughts, please let me know in the comments below.

Importing Files From Fabric Lakehouse Into Power BI Desktop

Sandeep Pawar — Sun, 04 Jun 2023 07:51:52 GMT

If you have files or folders with files in the Fabric Lakehouse, you can import it into Power BI Desktop. Lakehouse is built on top of OneLake which is nothing but Azure Data Lake Storage Gen2. Hence, you can use the ADLSg2 Connector in Power BI Desktop to import files.

There are several ways, I will list them below:

Using ADLSg2 Connector

This uses the AzureStorage.DataLake() function in PowerQuery

Go to the Lakehouse > Browse to the Folder or File > Properties

Copy the URL Path (not abfss)
- In Power BI Desktop, use the Azure Data Lake Storage Gen 2 connector and paste the URL

This also works for files available via external shortcuts (e.g. files in S3 bucket) which is great because Power BI does not have a native S3 connector. Now you can create a shortcut to S3 files and folders and import them in Power BI directly. Be aware of the egress charges. Thanks to Josh Caplan for pointing this out.

As far as I know, there is no direct way to connect the files and folders in the Lakehouse. If you use the Lakehouse connector, it will connect to the SQL endpoint . Interestingly, the Lakehouse connector in Dataflow Gen2 does allow you to browse to specific file/folder so hopefully it will be available in Desktop soon.

To Access All The Files/Tables:

To access all the files and tables in a lakehouse, use just the lakehouse URL https://onelake.dfs.fabric.microsoft.com// in the ADLSg2 connector to access all the files and tables. Use the Folder Path to identify the table/file and expand the header to get the data. Thanks to my colleague Bryan Campbell for this tip.

Using Lakehouse.Contents

We can use Lakehouse.Contents() to import files and tables directly form the lakehouse. The nice thing about this approach is that you don't need to provide the path, you just authenticate yourself and you will be connected to the Lakehouse and from there you can browse the files, folders and tables.

For Files:

Use below M code. You can also just start with Lakehouse.Contents() and go from there

https://gist.github.com/pawarbi/80aeb25c3772683c4ab2d06486be2238

Extract the Binary for the file you want to import

For Tables

https://gist.github.com/pawarbi/398b11540a62b1c8fae43423e022383c

If you know a direct way instead of using the URL path, please let me know.

How To Mount A Lakehouse and Identify The Mounted Lakehouse in Fabric Notebook

Sandeep Pawar — Wed, 31 May 2023 20:08:49 GMT

Let's say you receive a notebook from someone that reads parquet file from some location in the OneLake. The code used to read the parquet is :

How do you know where this parquet file folder was in the OneLake? Which workspace and which Lakehouse ? The file path shown above is called a relative file path. When you have a Lakehouse that is 'mounted' , i.e. attached to a notebook, you can use such file paths instead of using the full abfss://... file paths of the container/folder location in the lake. The Lakehouse mounted to the notebook can be seen on the left in the Lakehouse explorer section.

But, if you received the notebook from your colleague, the Lakehouse won't be visible and hence you won't be able to figure out where the data was read from or saved to.

We can identify the mounted lakehouses and the respective mount points by using the msspsarkutils library.

from notebookutils import mssparkutilsmssparkutils.fs.mounts()

This will return the list of Lakehouse mounted to the notebook. Mounting simply allows you to use local file system paths.

To get the abfss path of the default Lakehouse, use:

#To get the default lakehouse attached to the notebookfor mp in mssparkutils.fs.mounts():    if mp.mountPoint == '/default':        print(f"Default Lakehouse is: {mp.source}")

The abfss path that is returned is in this format abfss://@onelake.dfs.fabric.microsoft.com/ . This will allow you to identify the mounted Lakehouse and the workspace.

I highly recommend printing the lakehouse, workspace ids at the start of the notebook for traceability.

Note that it is possible to mount additional Lakehouses if the data resides in more than one Lakehouse.

Mounting A Lakehouse

To mount another Lakehouse in the notebook, use the below code. It should be noted that Lakehouses mounted at the runtime like this will not be visible in the lineage view, at least for now.

mssparkutils.fs.mount("abfss://@onelake.dfs.fabric.microsoft.com/", "") #mountPoint such as '/lakehouse/default'

After mounting, I am able to see the newly mounted Lakehouse and the scope as "job" instead of the "default_lh"

Note that you do not need to mount a Lakehouse for reading and writing with spark. You can use the full abfss path. However, mounting is required for pandas as pandas requires local file path. Thanks to Qiaxiao who is the developer for this feature in Fabric.

https://twitter.com/qx_will/status/1664163546685648896?s=20

You can use below code snippet to dynamically mount any lakehouse or warehouse in a notebook and query the files and tables :

import osimport pandas as pdworkspaceID = "<>"lakehouseID = "<>"mount_name = "/temp_mnt"base_path = f"abfss://{workspaceID}@onelake.dfs.fabric.microsoft.com/{lakehouseID}/"mssparkutils.fs.mount(base_path, mount_name)mount_points = mssparkutils.fs.mounts()local_path = next((mp["localPath"] for mp in mount_points if mp["mountPoint"] == mount_name), None)print(local_path)print(os.path.exists(local_path)) #check if location existsprint(os.listdir(local_path + "/Files")) # for filesprint(os.listdir(local_path + "/Tables")) # for tablesdf = pd.read_csv(local_path + "/Files/"+ "")

To learn more about mounting and other patterns, you can refer to this documentation.

Rendering HoloViews, Bokeh Plots in Fabric Notebooks

Sandeep Pawar — Tue, 30 May 2023 18:02:16 GMT

HoloViews and Bokeh are two popular interactive data visualization libraries in Python. HoloViews is an open-source Python library for data analysis and visualization. It is designed to make data analysis and visualization seamless and simple. With HoloViews, you can usually express what you want to do in very few lines of code, letting you focus on what you are trying to explore and convey, not on the process of plotting. Bokeh is another popular interactive data visualization library in Python. It provides elegant, concise construction of versatile graphics with high-performance interactivity over very large or streaming datasets.

In Fabric, Holoviews is not available by default and must be installed in the workspace Library Management or in the notebook by !pip install holoviews .

Once you install it, it is expected to work as expected, however, it does not. You will find that the code is executed successfully but the chart is not rendered. This blog shows you a simple solution to fix this. This is a known bug and will be fixed in future updates. Thank you to Chang Xu at Microsoft for the information.

When you import the HoloViews library, you will first notice that extra output is returned which it should not. Despite this bug, the library is successfully imported and this output can be ignored.

Next, I will create a sample dataframe from HoloView's documentation page for rendering:

macro_df = pd.read_csv('http://assets.holoviews.org/macro.csv', delimiter='\t')key_dimensions   = [('year', 'Year'), ('country', 'Country')]value_dimensions = [('unem', 'Unemployment'), ('capmob', 'Capital Mobility'),                    ('gdp', 'GDP Growth'), ('trade', 'Trade')]macro = hv.Table(macro_df, key_dimensions, value_dimensions)

If now try to create a scatterplot, you will see that the code is executed but the chart is not rendered.

gdp_unem_scatter = macro.to.scatter('Year', ['GDP Growth', 'Unemployment'])overlay = gdp_unem_scatter.overlay('Country')overlay.opts(    opts.Scatter(color=hv.Cycle('Category20'), line_color='k', size=dim('Unemployment')*1.5,                 show_grid=True, width=700, height=400),    opts.NdOverlay(legend_position='left', show_frame=False))

The Solution

The solution to this is to import bokeh with import bokeh.io and force the output by using bokeh.io.output_notebook():

import bokeh.iobokeh.io.output_notebook()

This should plot the visual correctly. Use the same method if you are using hvPlot. Note that it's not enough to run this once. Anytime you have to create a plot you will have to add bokeh.io.output_notebook() to create the chart output successfully. As I mentioned, this is a bug and will likely be fixed soon.

Why Does Fabric Lakehouse Show Unidentified Folder In Tables?

Sandeep Pawar — Sun, 28 May 2023 12:55:42 GMT

Are you puzzled why in the Table section of the Lakehouse has an unidentified folder like below? And what's the solution?

Easy - the Tables are managed section, i.e. the metastore and can't have files in it , only Delta tables are allowed. If you click on the unidentified, you will see the files written to the folder. If you are using spark, you may have used Tables/... instead of Files/... in your file path. The only solution is to delete the files/folder in the Lakehouse and write it to the Files.

I wish Microsoft provided some helpful error messages here about what's happening and what to do about it.

Note that in some cases (e.g. Dataflow Gen2 and Shortcuts), you may see the Unidentified Folder temporarily. It should disappear after a short time. If it doesn't, check the folder and/or your transformations.

Checking If Delta Table in Fabric is V-order Optimized

Sandeep Pawar — Sat, 27 May 2023 00:09:01 GMT

V-Order

V-order is a write time optimization to the parquet file format. When the delta table is created using any of the Fabric engines (Dataflow Gen2, Spark notebooks, Pipelines, DWH), Delta tables are automatically are V-Order'd. This not only helps with size of the table but can significantly improve Direct Lake dataset read performance. While it's not required, Delta tables with V-order are highly recommended for fastest Direct Lake and Delta read performance. For more ob V-order, read this official article.

However, there is no direct way to identify if a table already has V-order or not. There three ways to check but let me show two easy ways. I will cover the third in a future blog post. When a Delt atable is created, the transaction logs have a metadata property related to V-order.

I created two Delta tables in a Fabric lakehouse, one with and the other without the V-order by changing the spark configuration.

Manually

You can manually inspect the transaction logs in the Lakehouse by going to Lakehouse > Table name > Right Click > View Files > _delta_log, inspect the latest .json file. In the transaction logs you will either see VORDER set to false or missing if the table is not V-Order'd. If the Delta table is V-order, the property is set to true .

Programmatically

If you have several Delta tables, many transaction logs or if you need to check it as a part of your validation process for DataOps, you can use pyarrow to check the schema metadata. This only checks the metadata so the table is not required to be read. Below is the Python code I used:

def check_vorder(table_name_path):    '''    Author: Sandeep Pawar | fabric.guru |  Jun 6, 2023    Provide table_name_path as '//lakehouse/default/Tables/'    If the Delta table is V-ordered, returns true; otherwise, false.    You must first mount the lakehouse to use the local filesystem API.    '''    import os     if not os.path.exists(table_name_path):        print(f'{os.path.basename(table_name_path)} does not exist')        result = None  # Initialize the variable with a default value    else:        import pyarrow.dataset as ds        schema = ds.dataset(table_name_path).schema.metadata        is_vorder = any(b'vorder' in key for key in schema.keys())        if is_vorder:            result = str(schema[b'com.microsoft.parquet.vorder.enabled'])        else:            result = "Table is not V-ordered"    return result

There is another robust method using Spark that can be used to detect parquet files in the Delta table that are non V-Order'd but I will cover that in a future blog post.

Tips, Tricks and Shortcuts in Fabric Notebooks

Sandeep Pawar — Fri, 26 May 2023 21:22:50 GMT

Notebooks in Fabric offer a rich development experience with many hidden tricks and features. Below are some of them that I have discovered which can be helpful during development.

Toggle the cell output layout

You can change the layout to side-by-side or standard. Side by side will split the layout vertically.

Freeze A Code Cell

If you want cells to be excluded from execution, you call click on the snowflake in cell options. This is very helpful during development when you don't want certain cells to be executed again when the entire notebook is re-run.

Lockdown A Markdown Cell

Similar to freeze cells but for creating read-only markdown cells, just select the 🔒lock icon in the cell options.

Embed Images From Your PC

Just drag and drop the image from your PC to the markdown cell, it will automatically be embedded and uploaded to the lakehouse.

Global and Local Text Replacement

Global Find and Replace

To replace a text from the entire notebook, Edit > Find and Replace. You can also use the regex patterns for complex find operations

Local (cell level) find and replace
Just highlight the word, right click > Change All Occurrences > start typing

If you need to do regex-based replacements, right click cell > Command Palette > Type Replace > Use Regex pattern

Change Text Case

Highlight the text > right click > lower

Move Cells Up or Down By Dragging

Grab the blue bar on the left and drag and drop

Multiline Cursor

Alt + click wherever you want to insert the cursor

Box Selection

Shift + Alt to make box selections and edits

Add Comments to revisit or bookmark a code block

Currently you can't pin a cell to bookmark it, but as a hack you can add a comment to a highlighted code and then just click on the comment itself to quickly jump to that cell/comment. Really handy during development to leave yourself some notes.

Merge or split cells

Select the cell options (three ellipses), and choose merge or split cells

If you know any more, please let me know!

Comprehensive Date Dimension Table For Power BI Datasets in Fabric

Sandeep Pawar — Thu, 25 May 2023 14:26:01 GMT

Date tables in Power BI for time intelligence calculations are a must for the performance and simplicity of DAX. Traditionally, date tables have been created using M or DAX. But when working in Microsoft Fabric, especially for Direct Lake datasets, DAX is not an option as calculated tables are not supported. You could still use M in Dataflow Gen2 but you cannot easily pass parameters to the dataflow for dynamic date tables. For example, if you need to control the start and end date of the table dynamically based on fact tables, there is no straightforward way. The solution is to use notebooks. You can easily pass parameters to the notebook in the pipeline and create dynamic date tables. Below is the Python function to do that. I have also explicitly set OptimizeWriteto true to ensure optimal Delta table is created for performance.

Create pandas dataframe

import pandas as pdimport numpy as npfrom pandas.tseries.offsets import DateOffset, BMonthEndfrom datetime import datetime#This is for Microsoft Fabric only to optimize the Delta tablespark.conf.set("spark.microsoft.delta.optimizeWrite.enabled", "true")def date_dimension(start_date: str, end_date: str) -> pd.DataFrame:    '''    Author : Sandeep Pawar | fabric.guru    Version :2 [06-15-2023]    This functions creates a date dimension table based on the start date and     end date.         start date and end date should be in 'yyyy-mm-dd' format such as '2023-12-31'    '''    try:        start_date = datetime.strptime(start_date, '%Y-%m-%d')        end_date = datetime.strptime(end_date, '%Y-%m-%d')    except ValueError:        raise ValueError("Invalid date format. Please provide dates in 'yyyy-mm-dd' format.")    if end_date <= start_date:        raise ValueError("End date should be after start date.")    df = pd.DataFrame({"Date": pd.date_range(start_date, end_date)})    df["DateKey"] = df.Date.dt.strftime('%Y%m%d').astype(int)    today = pd.Timestamp.now().normalize()    current_week = today.week    current_month = today.month    current_quarter = today.quarter    current_year = today.year    df["ISODateName"] = df.Date.dt.strftime('%Y-%m-%d')    df["AmericanDateName"] = df.Date.dt.strftime('%m/%d/%Y')    df["DayOfWeekName"] = df.Date.dt.day_name()    df["DayOfWeekShort"] = df.Date.dt.day_name().str[:3]    df["MonthName"] = df.Date.dt.month_name()    df["MonthShort"] = df.Date.dt.month_name().str[:3]    df["YearWeekName"] = df.Date.dt.strftime('%YW%V')    df["YearMonthName"] = df.Date.dt.strftime('%Y-%m')    df["MonthYearName"] = df.Date.dt.strftime('%b %Y')    df["YearQuarterName"] = df.Date.dt.year.astype(str) + 'Q' + df.Date.dt.quarter.astype(str)    df["Year"] = df.Date.dt.year    df["YearWeek"] = df.Date.dt.year*100 + df.Date.dt.isocalendar().week    df["ISOYearWeekCode"] = df.Date.dt.year*100 + df.Date.dt.isocalendar().week    df["YearMonth"] = df.Date.dt.year*100 + df.Date.dt.month    df["YearQuarter"] = df.Date.dt.year*100 + df.Date.dt.quarter    df["DayOfWeekStartingMonday"] = df.Date.dt.dayofweek + 1    df["DayOfWeek"] = np.where(df.Date.dt.dayofweek == 6, 1, df.Date.dt.dayofweek + 2)    df["DayOfMonth"] = df.Date.dt.day    df["DayOfQuarter"] = (df.Date.dt.dayofyear - 1) % 91 + 1    df["DayOfYear"] = df.Date.dt.dayofyear    df["WeekOfQuarter"] = (df.Date.dt.day - 1) // 7 + 1    df["WeekOfYear"] = df.Date.dt.isocalendar().week.astype('int64')    df["ISOWeekOfYear"] = df.Date.dt.isocalendar().week.astype('int64')    df["Month"] = df.Date.dt.month    df["MonthOfQuarter"] = (df.Date.dt.month - 1) % 3 + 1    df["Quarter"] = df.Date.dt.quarter    df["DaysInMonth"] = df.Date.dt.days_in_month    df["DaysInQuarter"] = (df.Date + pd.offsets.QuarterEnd(1)).dt.day    df['DaysInYear'] = df['Date'].dt.is_leap_year+365    df['FirstDayOfMonthFlag'] = (df['Date'].dt.is_month_start).astype(int)    df['LastDayOfMonthFlag'] = (df['Date'].dt.is_month_end).astype(int)    df['IsTodayFlag']=(df['Date'] == pd.Timestamp.today().date()).astype(int)    df['IsToday'] = np.where(df['Date'].dt.date == today, 1, 0)    df['IsCurrentWeek'] = np.where(df['Date'].dt.isocalendar().week == current_week, 1, 0)    df['IsCurrentMonth'] = np.where(df['Date'].dt.month == current_month, 1, 0)    df['IsCurrentYear'] = np.where(df['Date'].dt.year == current_year, 1, 0)    df['IsCurrentQuarter'] = np.where(df['Date'].dt.quarter == current_quarter, 1, 0)    df['NextDay'] = df['Date'] + DateOffset(days=1)    df['PreviousDay'] = df['Date'] - DateOffset(days=1)    df['PreviousYearDay'] = df['Date'] - DateOffset(years=1)    df['PreviousMonthDay'] = df['Date'] - DateOffset(months=1)    df['NextMonthDay'] = df['Date'] + DateOffset(months=1)    df['NextYearDay'] = df['Date'] + DateOffset(years=1)    return df# For testingstart_date = "2023-05-25" end_date = "2024-05-25" df = date_dimension(start_date, end_date)df

Save dataframe as Delta table

(spark.createDataFrame(df).write.mode("overwrite").format("delta").saveAsTable("Date"))

I did not add any fiscal date columns but they could be added easily

[Thanks to Francisco Mussari for identifying a couple of bugs that have been fixed in the above code.]

Getting A List of Folders and Delta Tables in the Fabric Lakehouse

Sandeep Pawar — Wed, 24 May 2023 22:05:37 GMT

When you are working with the files and folders in Fabric, you will often need the list of file folders and tables for parameterization, data validation, DataOps, pipeline orchestration etc. Below is a quick helper function I have been using to get a list of all the folders and the tables as pandas dataframe along with the abfss file paths. Very handy when working on large projects. You can extend further if you need to get to the sub-folders in the Files section.

import pandas as pdfrom notebookutils import mssparkutilsdef get_file_table_list()->pd.DataFrame:    '''    Function to get a list of Folders and Tables to the mounted Lakehouse.     This function will return a pandas dataframe containing names and abfss   paths of each folder in the Files section    and the Tables in the managed section of the mounted or the default Lakehouse the notebook is attached to.    '''    LH_ID = spark.conf.get("trident.lakehouse.id")    WS_ID = spark.conf.get("trident.workspace.id")    base_path = f'abfss://{WS_ID}@onelake.dfs.fabric.microsoft.com/{LH_ID}'    data_types = ['Files', 'Tables']    df = pd.concat([        pd.DataFrame({            'name': [item.name for item in mssparkutils.fs.ls(f'{base_path}/{data_type}/')],            'type': data_type[:-1].lower() ,             'path': [item.path for item in mssparkutils.fs.ls(f'{base_path}/{data_type}/')],        }) for data_type in data_types], ignore_index=True)    return dfget_file_table_list()

Here is the returned dataframe:

Please note that the spark configuration instead of using fabric uses trident which was the code name. This might change in the future, so refer to the official documentation.

There is one more method using spark sql show tables :

It doesn't return the path though.

This will return files and tables in the mounted lakehouse. You can identify the mounted lakehouse by reading this blog. If the notebook is not attached to a lakehouse, the above function will return an error.

Power BI Direct Lake Mode : Frequently Asked Questions

Sandeep Pawar — Wed, 24 May 2023 02:16:26 GMT

Updated on Sept 09 2023

Microsoft announced an innovative dataset storage mode for Power BI called Direct Lake at Microsoft Build. It fundamentally changes how we create and consume BI solutions. It promises to provide the perfect balance between import and Direct Query modes. But does it? Let's dig into it and understand the hows and whats of it. I also share my experience at the bottom of the post. Please share your comments and questions below.

Please note that this blog post is based on my understanding of how Direct Lake works. If you find any inconsistencies or inaccuracies, please let me know. Refer to the official Microsoft documentation here. Watch this official demo by Christian Wade.

Give me a Direct Lake tl;dr
Until the Microsoft Fabric announcement, Power BI dataset had two primary modes of storage - DirectQuery and Import. DirectQuery gives you the freshest data from the supported data sources (e.g. SQL, OData) by directly querying the underlying source without a need for dataset refresh but at the cost of poor query performance. Import, on the other hand, can be used with any data source by importing the data into the dataset, offering the best query performance but you get stale data as the dataset needs to be refreshed frequently. The new Direct Lake mode, exclusively available in Fabric, aims to provide the best of both worlds by giving import-like query performance and the latest data even for large datasets.
How does Direct Lake work?
To understand it better, let's first recap what happens under the hood and how Direct Query and import modes work. When a user interacts with a visual, DAX queries are generated irrespective of the storage mode of the table. Let's take a look at a high level of what happens in each mode:
- Direct Query: If the table is in Direct Query mode, DAX is translated to the native SQL queries which are sent to the relational database. The database runs those queries and sends the resulting data back to Power BI and visuals are rendered. This DAX to SQL translation negatively affects Direct Query performance.
- Import*:* In import mode, the data is imported into the model and saved in the proprietary *.idf files which are columnar storage files. When the user queries a visual, only the data requested by those queries (called vertiscan queries) are loaded on demand and returned to the visual. It's because of these proprietary columnar files that Power BI can significantly compress the data, thanks to the Vertipaq engine. The user has to copy the data from the source on a schedule, which can consume Power BI capacity resources. Any changes in the data must be captured by scheduled refreshes to copy the data again to the import dataset.
- Direct Lake: Direct Lake essentially uses the same concept as the import mode. Because the parquet file is a columnar format, similar to the .idf files, the vertiscan queries are sent directly to the Delta tables and the requiredcolumns are loaded in memory. Delta uses different compression than vertipaq so as the data is fetched by Power BI, it is "transcoded" on the fly into a format that Analysis Services engine can understand. In Direct Lake, instead of the native idf files, the Delta parquet files are used which removes the need for duplicating the data or any translation to native queries. Another innovation Microsoft has introduced to make queries faster is the ability to order the data in the parquet files using V-order algorithm which sorts the data similarly to vertipaq engine for higher compression and querying speed. This makes the query execution almost the same as the import mode. Data is cached in memory for subsequent queries. Any changes in the Delta tables are automatically detected and the Direct Lake dataset is refreshed providing the latest changes from the Delta tables. You will create the relationships, measures etc using web modeling or XMLA write (external tools) to turn the tables into a semantic model for further report development.
How is it different from the Live connection mode?
Live mode is just a connection to a dataset and not a storage mode in itself. When the user generates interactive DAX queries, it's sent to the dataset and depending on the storage mode of the underlying table (DQ or import), uses SQL or vertiscan queries to get the data back. In fact, in most cases, developers will be using the Live connection to a Direct Lake dataset to create the reports.
So the users still have to copy the data from the source to the OneLake on a schedule. How is that different from using the scheduled refresh in the import mode?
The user may or may not have to copy the data from the source depending on what the data source is and where it is stored. If the source data is in non-Delta format and/or not in another data lake, the users can use Spark engines (notebooks, pipelines, Dataflow Gen2, Kusto) to copy and transform the data at scale and save it as Delta tables. Power BI is not an enterprise ETL/ELT tool, especially at a large scale. Spark can load and transform the data at a much higher speed, scale and granularity than Power BI. You have access to the same existing data connectors available in Power BI.
If the data is already in another data lake (ADLSg2, S3 etc.), either in Delta or non-Delta ( csv, json, Excel etc.) format, the user can create shortcuts to those files without having to copy the data to OneLake.
Another huge difference is that Power BI is just another compute in Fabric eco-system using the data from OneLake. Other computes, such as Spark notebooks, data warehouse, Kusto etc. can access the same data as Power BI, offering the flexibility to choose any compute to create the analytics solutions. Thus, the number of refreshes and data movement will be lower compared to traditional approaches.
Refresh (aka framing) of the Direct Lake datasets is a metadata refresh option only, i.e. it doesn't load the entire data in memory again and irrespective of the dataset size usually lasts ~5 seconds, thereby significantly reducing the resource footprint. Import mode refresh loads the data in memory, Direct Lake refresh just syncs the metadata.
Does that mean Direct Query and import are going away?
Absolutely not. In fact, depending on the use case, the Delta tables can be used in Direct Query or import using the Lakehouse SQL endpoint. All the modes serve different purposes and depending on the requirements, data (volume & velocity) and the persona you can choose the right mode.
Is Delta table a hard requirement? Can I create Direct Lake dataset using csv, Excel files etc.?
Yes, Delta tables are a must. You can turn the CSV, Excel into Delta format to create Direct Lake dataset. Note that you can also create parquet/csv tables using spark, which will show up in the Tables section of the lakehouse, but it won't generate the SQL endpoint and hence can't be added to the dataset.
How do I create the Delta tables?
There are four primary ways at the moment depending on your experience and skill set.
- Dataflow Gen2: If you are coming from Power BI background, you can use Dataflow Gen2 to import and transform the data using the familiar Power Query Online experience and save the resulting data to the Lakehouse as Delta tables. As you refresh the dataflow, it will automatically replace/append the data at scale.
- Notebooks and Pipelines: If you are a Data Engineer or a Data Scientist with experience in SQL, ADF, Spark notebooks, python, R etc., you can use Notebooks or pipelines to copy and transform the data. Notebooks and pipelines can be scheduled, similar to dataflows, to create/append Delta tables.
- Datawarehouse and KQL database generate Delta tables that can be used for Direct Lake mode. Views are also not supported for Direct Lake.
- Load To Tables : You can select files in the lakehouse (csv, parquet) and select Load To Tables to convert it to Delta format.
Can I use Shared or Premium capacity to create Direct Lake datasets?
Direct Lake is exclusive to Premium P and Fabric SKUs and as such these datasets must be created in Fabric and reside in Fabric capacity.
Is there a limit on the dataset size?

Yes, the limits are based on the F/P SKU and are posted here. There are two limits - model size on disk and size in memory.
The size limits look similar to import but there is a difference. In import mode, you load the entire dataset, so even if you don't use any columns or if the user never queries certain columns, those are loaded in the memory. If the original data is 50 GB in memory, you will always keep 50 GB in memory (roughly speaking). In Direct Lake, even if the Delta table is 50GB, if the user only queries certain columns, only those columns are loaded so the Direct Lake dataset size can be significantly smaller. If the users query the entire 50GB of data, you will likely run into memory pressures and will cause the model to be evicted from memory. Another distinction is that since refreshes are not needed, more data can be loaded. On-demand loading and paging means virtually larger datasets can be supported. In import mode, you need additional memory for refreshing the data which Direct Lake doesn't need. This also saves capacity resources. Similar to large dataset format, if certain columns are not used, they will be paged out dynamically thus reducing the size of the dataset.
What happens if the dataset size is exceeded?
Depends on which size limit is exceeded. If the size on disk is exceeded, it will fallback to DirectQuery. If the size in memory is exceeded, it will be paged out from memory.

💡

You may be wondering, what's the difference between the two ? - When a Power BI semantic model is loaded in memory, it decompresses the data after loading it in memory. Hence, the size in memory is almost always more than size on disk. Read my blog on Direct Lake fallback mechanism for more details on how to calculate the size.
In that event, will Direct Query limitations be applicable?
Yes, you will see Direct Query experience (potentially slower performance).
Can you create a composite model with Direct Lake?
You cannot mix and match storage modes at the moment in web modeling. You can certainly use Direct Lake dataset in local mode to create DQ over Analysis Services and then bring in additional data for the composite model in Desktop.
Can you create a Direct Lake dataset in Power BI Desktop ?
No. The dataset has to be created in service or using external tools.
Can I use XMLA endpoint to create/edit the model using Tabular Editor?
Yes. XMLA-W became available on Aug 9. You can read more about it here. Note that if you create or edit a Direct Lake dataset using XMLA, you cannot use the web modeling feature. Read this detailed blog post on how to use Tabular Editor to create Direct Lake dataset.

Can you connect to a Direct Lake dataset in Desktop in a Live mode ?
Yes. Once you create the semantic model using the web editing in service, you can connect to the dataset in Live mode in the Desktop to create reports.
Are there any limitations on the DAX functions?
In Direct Lake mode, there are no limitations. You use the same functions as the import mode. However, if the query falls back to Direct Query, then any Direct Query related limitations will be applicable.
Can I create Row Level Security?
You can define RLS using fixed identity using Tabular Editor 3. The web experience in service has not been rolled out yet. Note that RLS defined in the DWH will cause the queries to fall back to Direct Query.
That's a significant limitation.
Yes, OneSecurity will provide unified security across all Fabric item types including semantic models. Currently it is not available.
Will it work on existing Delta tables ?
It depends. You can either upload existing delta tables to the Lakehouse or if the Delta tables are in ADLSg2 or S3, you can create shortcuts to those tables. There is no copying of the files if shortcuts are created. Currently, Fabric does not support Delta protocols with Reader v2 and Writer v5. Delta with higher protocols will not generate the SQL endpoint.
Do the Delta tables need to be in OneLake?
If the shortcut can't be created, then yes.
Can I create Direct Lake dataset using Apache Iceberg/hudi (or any other table formats)?
Not currently. There is a OneTable open source format that allows interop between various table formats without copying the data. Microsoft is one of the contributors this project so hopefully in the future, this will be supported out-of-the box to use other table formats easily in Fabric.
What is thevorderyou mentioned above?
Think of it as a shuffling algorithm that sorts the data in the parquet files before saving it as parquet or Delta parquet. This is similar to what vertipaq does to achieve additional compression and fast query performance. Data is dictionary encoded, bin packed similar to vertipaq. Read more about V-Order here.
How do I createvorder?
All Fabric compute engines (spark notebooks, pipelines, Dataflow Gen2) automatically create Delta tables that are vorder'd. You can also specify the Spark configuration if you want to spark.conf.set("spark.sql.parquet.vorder.enabled", "true") or while saving the Spark dataframe as a Delta table .option("parquet.vorder.enabled","true") . I think vorder is idempotent but I will have to confirm that. KQL tables do not yet apply V-order optimization.
Doesvorderimprove Direct Lake performance?
Yes. In my tests, I saw performance gains in loading the cold-cache data, in some cases warm-cache as well. Delta files were also compressed more and write speed was also 1.5x faster in some instances. I will write a detailed blog later to share the details. Please note that it also depends on the column type and cardinality of the column, so performance may vary.
Is vorder same as zorder in Databricks?
No, zorder is used to produce evenly balanced data which reduces the number of scans to read Delta tables. vorder sorts the data for optimum compression. Both are complementary to each other and will improve performance. It's recommended to Zorder before V-order.
Isvordera must-have requirement for Direct Lake?
No, Direct Lake will work with non-vorder Delta tables too, it just may not perform as optimally. Also, note that only Fabric engines can create vorder Delta tables, it's not an open-source algorithm.
How do I know if a Delta table used for Direct Lake dataset hasvorder?
There are two ways to do that - either inspect the Delta transaction log file or use pyarrow. Please read my blog for details.
Any other restrictions on V-Order?
Yes, sorting a table in spark will invalidate V-order. You will have to force the V-order by a spark config. Read my blog.
Does Direct Lake dataset built on top of shortcut'd Delta tables have a performance impact?
- It may or may not have an impact depending on many factors. In my tests, querying a large (50M rows) Delta table on Amazon S3 bucket via Direct Lake showed a minor performance drop. But it may vary based on geography, network, configuration and many different factors. Be aware of potential egress charges when using shortcuts.
- If you create a shortcut to a Delta table that's outside OneLake, note that you can still run OPTIMIZE on it or use table maintenance API to vorder those tables. I would recommend testing this before applying on production data.
Are partitioned Delta tables supported?
Yes.
If the Delta table is partitioned, will Direct Lake do partition pruning?
No, if you query a column with a filter applied, it will still load the entire column and not just specific partitions. This may change in the future. Only the selected columns are loaded.
Will parquet rowgroup size affect Direct Lake dataset performance?
Yes. Fewer rowgroups with tightly packed columns means vorder will have a chance to apply sorting on large block of rows. This should improve performance and avoid small file problem. But as always, it will depend on many other factors. There is also a limitation on the number of rowgroups but it hasn't been published. I will share my results in future blogs. Read my blog for details.
Does it matter which compute engine I use to write the Delta tables?
Absolutely. At least as of publishing my blog on June 27, 2023, all Fabric engines created different Delta tables. This will affect Direct Lake performance. Read my blog for details. In general, spark will provide the most flexibility and knobs to optimize the tables. I need to do more tests to see how recent the recent improvements to delta affect the performance.
Does the compression codec used to compress the parquet (snappy, gzip, zstd) affect Direct Lake performance?
My tests with snappy and zstd did not show any difference in cold cache performance. Other codecs such as gzip and brotli typically have poor read performance so best to avoid them unless necessary. A detailed blog on this topic will be published soon.
What is Default dataset vs Custom dataset?
This is one of the confusing parts of Direct Lake dataset. Let me explain:
- When you create a Lakehouse with Delta tables in it, a default dataset is created with all the tables in the Lakehouse. This is very similar to the default dataset created in Datamart. You can choose tables to exclude from the default dataset. It uses the TDS endpoint of the Lakehouse and is in Direct Lake/DirectQuery dual mode. It always stays in-sync with the Delta tables for any changes. If the 'Automatically update dataset` option is checked, any schema changes are also captured. However, the big limitation is that you cannot use XMLA-write. There is no refresh option. The Default dataset is also performance optimized. I have seen in some cases, default dataset perform faster in cold cache compared to custom dataset. I am not sure why. Default dataset is the best option when the data model is not complex, measures are simple and you want to use all the tables from that Lakehouse. Default dataset is identified as Dataset (default) in the data hub. I wish there was a way to turn off default dataset to reduce number of artifacts. There is a known limitation of default dataset not getting generated if there is only one table. This is temporary. The Default dataset story is evolving, I will update the post when I have more details.
- Custom dataset, on the other hand, allows you to pick and choose the tables to be included. It also stays in sync with the Delta table if auto-sync is turned on. You can disable auto-sync and refresh using UI, REST API or manually. In the future, XMLA-W will be available for custom datasets. In most cases, it's the custom dataset that should be used for flexibility and pro-development. You can only create custom dataset in the Lakehouse.
Does Direct Lake dataset need to be refreshed?
The Default dataset doesn't need to be refreshed. Custom datasets, if auto-sync in on, will look for changes in the Delta table at a fixed time interval. (I need to confirm the interval used for polling before sharing). If Delta table has changed, it will refresh (i.e framing) the dataset. If you can also turn off auto-sync and refresh with a scheduled interval similar to import dataset.
Can I use existing REST APIs to trigger a Direct Lake dataset refresh ?
Yes
Can I do a selective refresh of a table?
I think so because I was able to process just one table in SSMS. But I need to confirm.
When and how does caching occur?
There are two caching mechanisms. Cold cache and warm-cache. When the framing occurs, Direct Lake dataset does metadata refresh and unloads all the memory, there is no data in the dataset. As soon as the first user interacts with the report, Direct Lake dataset loads data on-demand for the columns involved in the query. This is called cold-cache. If the user runs the same query again, the cached data is used and Delta table is not queried again. This is warm-cache. As the user applies more filters and interacts with different pages, more data is pulled into the cache and made available for subsequent queries. Currently, the cold-cache performance is not great and is very close to DirectQuery performance. If you want true import-like performance, you will need data in warm-cache. As soon as the framing occurs (i.e. dataset is refreshed), it will hit the cold cache again. Read more about it here.
Is it possible to always cache certain columns or tables after refresh?
Yes but not via UI. Read this blog on how to pre-warm the cache so the users will always get fast performance.
Is it possible to see which columns have been cached?
Yes, you can run this DMV in SSMS or DAX Studio SELECT * FROM $SYSTEM.DISCOVER_STORAGE_TABLE_COLUMN_SEGMENTS and look for columns that have been paged in memory. Another easy option is to run Vertipaq Analyzer in DAX Studio.
I have created the Direct Lake dataset but see this error message "ParquetException Unexpected end of stream" in the report visual.

I have seen this error message before and was able to fix it by running OPTIMIZE on the delta table.
Does Direct Lake dataset use M ?
No. You can use Dataflow Gen2 to ingest and transform data and sink it to the Lakehouse. But once the table is created, you cannot edit it using M in the Lakehouse. If you used Spark or pipelines to create the Delta tables, you cannot use Dataflow Gen 2 or M to edit the table.
If it doesn't use M, what about custom functions, and parameters?
You can use parameters inside the Dataflow Gen2 but not in the Delta table.
So no parameters at all?
Nope. If you have existing datasets that use parameters for connection strings or some function or transformation logic, that will need to be refactored. For example, typically for various deployment stages, we reduce the number of rows or use different connection strings etc and change it in service in dataset settings. Direct Lake datasets do not have that option.
What about partitions?
Direct Lake datasets have only one partition and you cannot change it or create additional partitions.
Then what about incremental refresh?
Direct Lake datasets do not have an incremental refresh. You can create incremental loading using Spark, pipelines or Dataflow Gen 2 to append/merge additional data. Blog to come.
Can I create calculated tables?
No. You cannot create calculated tables or columns.
Are all TOM properties currently available in import mode supported in Direct Lake mode?
Most are but not all. I don't have a list of what's available.
Does it support Calculation Groups?
Yes, support for calculation groups was added on Sept 6 2023. Check out this blog post. Note that you still need external tools such as Tabular Editor to create CGs.
Where do you write measures?
In the web modeling experience or using the external tools with XMLA endpoint.
Is it possible to always use the DirectQuery mode as default instead of the Direct Lake mode?
No, you cannot force it to be in one mode vs the other. It's always in Direct Lake by default, for now.
Can I evict a column from memory?
As far as I know, no. It will be evicted automatically either because of memory pressure or if it has not been accessed in a while.
Are the query plans created in Direct Lake the same as import mode?
Technically yes they should be but it may be different in some cases at least in Public Preview.
Can I apply MIP labels to Direct Lake datasets?
Yes.
Can I refresh a Direct Lake dataset in a pipeline?
Yes using the REST API.
Can I publish a Direct Lake dataset to a different workspace than the workspace of the Lakehouse?
No, currently the dataset must be in the same workspace as the Lakehouse
Can I use tables from different workspaces and Lakehouses?
Yes, you can shortcut tables from other Lakehouses
Can I shortcut to another Lakehouse authored by another user?
Yes if you have access to the Lakehouse.
Can I use Deployment pipelines with Direct Lake datasets?
Yes, and more. You can fully integrate with ADO for CI/CD.
What's the best way to migrate an existing DirectQuery or Import dataset to Direct Lake?
No matter what, there will be some refactoring needed. Dataflow Gen2 would be the easiest to refactor as you can use the M scripts, functions and parameters. The challenge will be calculated columns and tables. You will either need to build that logic using M (hard if DAX is complex) or use pyspark which will offer Python's flexibility with recursion and scalability.
If I create a table in Datawarehouse, can it be used for Direct Lake?
Yes, tables created in the Fabric DWH can be used and in fact V-Order optimized as well.
Delta tables have time travel, can I use it in Direct Lake dataset?
Not at the moment.
What happens if a column data type changes, columns are dropped or added?
For schema evolution, first you need to make sure in spark, schema evolution is allowed. Default dataset automatically takes care of changes with schema. Custom dataset will need to be manually updated.
Can I download Direct Lake dataset from service?
No
Can I serialize the Direct Lake dataset as .bim or .tmdl ?
Right now the XMLA is read-only so no. But considering Microsoft is actively supporting TMDL development, safe to assume it will be supported in the future.
There is also support for Power BI projects, a new format. How it works with Direct Lake I am not sure.
Can I use Analyze in Excel with Direct Lake dataset?
Yes
Can I use Direct Lake dataset with Paginated Reports?
I am not 100% sure. I need to check.
High cardinality columns are what kill import dataset compression and performance. Is that the case with Direct Lake dataset as well?
Yes. High cardinality columns are harder to compress. If the dataset is large, it will fail to load the column because of the column size. Also note that import dataset cannot have more than 2B unique values in a column. I am not 100% sure if that applies to Direct Lake but that's a large column and likely won't be supported. In that case, it will fall to DQ.
Is there a limit on number parquet files in a Delta tables used for Direct Lake dataset?
Yes, I think it's based on the SKU. Refer to the documentation for correct numbers. If number of parquet files is exceeded, run OPTIMIZE command on the Delta table to reduce number of files. Notebooks will auto-optimize the Delta tables on creation. Also VACUUM the Delta tables periodically to only kpee the necessary files/data.
Can I use 'Publish to Web' report built on top of Direct Lake dataset?
No, since Direct Lake dataset requires SSO, Publish to web is not supported currently.
Is there any limitation on table names or column names?
In general, hive metastore doesn't support columns and table names with spaces in them. Plus, generally, column and table names are lowercase. After creating the Direct Lake dataset, you can change names in web modeling. Read my blog about this here.
Can I run DMVs on Direct Lake dataset?
Yes. Only certain DMVs can be used with XMLA read.
Can I create Direct Lake dataset using KQL database?
No
Can I use it in QSO enabled workspaces?
Yes, currently you have to sync the dataset manually using below PowerShell
Login-PowerBI Invoke-PowerBIRestMethod -Url 'groups/WorkspaceId/datasets/DatasetId/sync' -Method Post | ConvertFrom-Json | Format-List
How does Query Caching under dataset settings affect caching? Is that same as warm cache?
That's report level caching and may not be supported in the future for Direct Lake dataset because it won't be applicable in DQ mode.
I see Large dataset storage format setting in dataset settings, what does that do?
Not applicable to Direct Lake dataset.
Do I need to set up a gateway for Direct Lake dataset?
No gateways are needed since it's a cloud data source. May be needed for VNET situations but I am not sure.
Can I connect to a Direct Lake dataset which is in Workspace A and publish a report to Workspace B (Premium)?
Yes.
Can I connect to a Direct Lake dataset which is in Workspace A and publish a report to Workspace C (Pro)?
Yes, I think so. The only requirement is that Direct Lake must be in Fabric capacity. I will have to double-check this.
Do Direct Lake dataset haveon-demand loadlike other LDF datasets?
Yes, I talked about on-demand loading above.
Can I use streaming dataset with Direct Lake?
Not the Power BI streaming/push dataset. But you can certainly create structured streaming or use EventHub to create Delta tables which can be used for Direct Lake dataset.
Can I create perspectives?
I am not sure, but I doubt it.
Which Delta version is supported for Direct Lake?
I think only Delta < v2.2 are supported for now, but please check the official documentation.
If my data has a filter clause, let's say sum(sales) where year=2013, will it load data from the entire sales column from the Delta table or just for year 2013?
Since it is column based, the entire sales column will be loaded in memory and then FE will filter the data for year 2013.
Are hybrid tables supported?
No
Aggregation tables?
No
Auto date/time?
Direct Lake dataset does not create auto date/time tables
Dynamic M parameters?
No.
Does automatic page refresh (APR) work with Direct Lake?
Yes. To autorefresh the report in service, set the automatic page refresh so it shows the latest data, assuming the auto-sync in dataset has been tunred on.
Can I apply further transformations in the Direct Lake dataset?
Not after the Delta table has been created.
Can I create a view and use it in Direct Lake dataset?
No. Spark views are available yet. Views created in DWH cannot be used for Direct Lake either.
Can I query the Direct Lake dataset (or any other dataset) in the Notebook?
Using the REST API you can access the measures, semantic model. You can also query the Delta tables used in the dataset. Exciting things happening in this space, stay tuned.
Can I create an external dataset , i.e share it with external users ?
I am not sure, I will have to check.
Where does the data in Direct Lake reside, i.e any data residency implications?
It will be the same as the Fabric tenant geo-location. If you create a Delta table using shortcuts, it may leave the geography. Double-check that.
Does export to Excel work, similar to import?
Yes
Can I use Premium dataflow?
No. It has to be Dataflow Gen2.
If have an existing datamart can I create a Direct Lake dataset out of it ?
Yes. You will need to use the SQL endpoint of the datamart, copy the data using Dataflow Gen 2 to pipelines to create Delta table. So basically duplicate the data. Or use the same M code from the Datamart in Dataflow Gen2.
Does Log Analytics captures Direct Lake queries and activities?
I think so but I will check.
In PQ, I can group queries by creating folders to manage the queries. Can I create folders/sub-folders for the Delta tables?
No, folders are not supported.
If one of the tables has an error, will the framing happen for the rest of the tables or is it all or nothing?
I think it refreshes all or nothing, similar to import. I will test and update.
Why is the dataset size so small in the worskpace settings?
Because it shows the size of the metadata file, not the size in memory. To check the dataset/column size, use Vertipaq Analyzer.
Can I create Delta table using on prem data sources?
Yes, you can use Dataflow Gen 2 or pipelines.
Can Direct Lake dataset make use of parquet table/column statistics etc to speed up queries?
No
This all sounds good in theory. Does it really work in practice?

💡

I have been using, creating and testing the Direct Lake datasets since private preview. It is fast, especially with the warm cache for massive datasets. Cold cache performance still has a lot of room for improvement. As we get closer to GA, I am 100% sure it will get far better. For most use case scenarios with ~25M rows and typical cardinality, best practices etc., users won't even notice any change in performance. Even for very large datasets, if you pre-warm the cache, the performance is very good. The biggest challenge is the RLS. For very large datasets like TPCH-SF100 (600M rows fact, multiple 20M rows Dims), it's impossible to import it into Desktop. Even if you do import it, creating relationships, measures, updates etc. slows down development in the Desktop. Direct Lake is essentially just metadata (like TE) which means there is no downtime and waiting. As soon as the Delta tables are created, you can create the model immediately and start building the report. It almost feels unreal for those massive datasets. The development workflow is another challenge. Since there are no calculated columns, calculated tables, parameters, incremental refresh etc., you have to re-think how to do that in the new world using pyspark and pipelines. Pay close attention to the Delta tables (#files, rowgroups, size etc.). Other than Direct Lake, I am most excited about being able to use Notebooks with Power BI datasets :) If you know me and have followed me, you may know that I use Python and notebooks in Power BI development so this is the best-integrated experience I can ask for. It opens so many options and avenues to develop new solutions using ML, advanced analytics, DataOps etc. Direct Lake may not be ideal in every single scenario, especially for citizen developers as there is some learning curve involved. We still don't know or understand a whole lot about this new mode, especially the Delta table maintenance, optimization and tuning best practices. It is still in Preview and you will run into issues/bugs so I don't recommend using it in production until GA. But truly an innovative feature. Kudos to the PG !

Thank you to Holly Kelly, Kay Unkroth, Tamas Polner, Akshai Mirchandani, Daniel Coelho, Eun Hee Kim, Justyna Lucznik, Bogdan Crivat, Jacob Knightley, Christian Wade from Microsoft for for all the information and my colleagues Simon Nuss and Bryan Campbell for support/testing.

Please refer to official Microsoft documentation for features, guidance and limitations. This product is in Preview and the features may change at any time.