<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Microsoft Fabric | Power BI | Data Analytics & AI]]></title><description><![CDATA[Insights on enterprise-scale data analytics and AI using Microsoft Fabric and Power BI. Curated by Sandeep Pawar, Principal PM at Microsoft]]></description><link>https://fabric.guru</link><generator>RSS for Node</generator><lastBuildDate>Fri, 17 Apr 2026 11:56:30 GMT</lastBuildDate><atom:link href="https://fabric.guru/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[RAG in Fabric Notebook Using Microsoft Harrier Multilingual Text Embedding Model]]></title><description><![CDATA[Last week Microsoft released an open-source text embedding model called Harrier in three sizes- 270M, 0.6B and 27B. I have been testing it in my RAG pipeline and so far it has crushed all my metrics. ]]></description><link>https://fabric.guru/rag-in-fabric-notebook-using-microsoft-harrier-multilingual-text-embedding-model</link><guid isPermaLink="true">https://fabric.guru/rag-in-fabric-notebook-using-microsoft-harrier-multilingual-text-embedding-model</guid><dc:creator><![CDATA[Sandeep Pawar]]></dc:creator><pubDate>Tue, 14 Apr 2026 21:49:57 GMT</pubDate><content:encoded><![CDATA[<p>Last week Microsoft released an open-source text embedding model called <code>Harrier</code> in three sizes- 270M, 0.6B and 27B. I have been testing it in my RAG pipeline and so far it has crushed all my metrics. It's currently the number 1 model on <a href="https://huggingface.co/spaces/mteb/leaderboard">MTEBv2 leaderboard</a>.</p>
<img src="https://cdn.hashnode.com/uploads/covers/619d4cccfa52cd31fe52d25d/3d644ef0-02f9-4725-a492-0d486b68e653.png" alt="" style="display:block;margin:0 auto" />

<p>The 27B model is obviously too big for a Fabric notebook, but the 270M and 0.6B variants, despite being small, are extremely capable and fast even with CPU inferencing in a Fabric Python notebook. For comparison, I have been using the text-embedding-ada-002 model in my pipeline with an R1 score of 0.68; the 270M model topped that easily with an R1 of 0.76 at lower latency. Bonus: no Fabric CU consumption (other than Python execution time). In a Fabric notebook, you can save the model in a lakehouse and load it for inferencing. The initial load, especially for the 0.6B model, may take a while, but the retrieval is fast.</p>
<p>Take a look at this sample notebook on how to operationalize it in Fabric.</p>
<p><a href="https://github.com/pawarbi/snippets/blob/main/harrier-ragbench-techqa-fabricguru.ipynb">snippets/harrier-ragbench-techqa-fabricguru.ipynb at main · pawarbi/snippets</a></p>
<ul>
<li><p>Use <code>%%configure</code> to upgrade the compute and attach a lakehouse</p>
</li>
<li><p>Download the model</p>
</li>
<li><p>Instantiate it</p>
</li>
<li><p>Load the data</p>
</li>
<li><p>Embed and retrieve</p>
</li>
<li><p>Just for the sake of testing, I also added a Harrier + BM25 pipeline to improve retrieval; the R3 score improved to 99%</p>
</li>
</ul>
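<p>To make these steps concrete, below is a minimal sketch. It assumes the 270M variant is published on Hugging Face under a hypothetical ID (cached here at a lakehouse path of your choosing), that it loads with <code>sentence-transformers</code>, and that your notebook type supports the <code>vCores</code> and <code>defaultLakehouse</code> keys in <code>%%configure</code>; verify all of these against the docs.</p>
<pre><code class="language-python">%%configure -f
{
    "vCores": 8,
    "defaultLakehouse": { "name": "mylakehouse" }
}
</code></pre>
<pre><code class="language-python"># Hypothetical sketch: load the cached model from the lakehouse and do a
# simple embed-and-retrieve pass on CPU.
from sentence_transformers import SentenceTransformer, util

MODEL_PATH = "/lakehouse/default/Files/models/harrier-270m"  # lakehouse copy of the model

model = SentenceTransformer(MODEL_PATH, device="cpu")  # CPU inferencing is fine at this size

docs = [
    "Workspace monitoring writes semantic model logs to an Eventhouse.",
    "A lakehouse stores files and Delta tables.",
]
doc_embeddings = model.encode(docs, normalize_embeddings=True)

query_embedding = model.encode("Where are semantic model logs stored?", normalize_embeddings=True)
hits = util.semantic_search(query_embedding, doc_embeddings, top_k=1)[0]
print(docs[hits[0]["corpus_id"]], hits[0]["score"])
</code></pre>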
]]></content:encoded></item><item><title><![CDATA[Programmatically Retrieve Prep Data For AI Configuration of Semantic Models]]></title><description><![CDATA[For Power BI Copilot and Data agents with semantic models, you must use Prep Data for AI configuration to ground the responses in the context added in Prep for AI. In this blog, I will show you how yo]]></description><link>https://fabric.guru/programmatically-retrieve-prep-data-for-ai-configuration-of-semantic-models</link><guid isPermaLink="true">https://fabric.guru/programmatically-retrieve-prep-data-for-ai-configuration-of-semantic-models</guid><category><![CDATA[AI]]></category><category><![CDATA[PowerBI]]></category><category><![CDATA[semantic-model]]></category><category><![CDATA[prep for ai]]></category><category><![CDATA[mcp]]></category><category><![CDATA[remote-mcp]]></category><dc:creator><![CDATA[Sandeep Pawar]]></dc:creator><pubDate>Thu, 02 Apr 2026 23:36:59 GMT</pubDate><content:encoded><![CDATA[<p>For Power BI Copilot and <a href="https://learn.microsoft.com/en-us/fabric/data-science/semantic-model-best-practices#prep-for-ai-make-semantic-model-ai-ready">Data agents with semantic models</a>, you must use <a href="https://learn.microsoft.com/en-us/power-bi/create-reports/copilot-prepare-data-ai">Prep Data for AI configuration</a> to ground the responses in the context added in Prep for AI. In this blog, I will show you how you can use the <a href="https://learn.microsoft.com/en-us/power-bi/developer/mcp/remote-mcp-server-get-started">Power BI remote MCP server</a> to get the configuration.</p>
<img src="https://cdn.hashnode.com/uploads/covers/619d4cccfa52cd31fe52d25d/9ac424c6-3a4b-4340-9058-2e1a3a66cad8.png" alt="" style="display:block;margin:0 auto" />

<h2>Code</h2>
<p>Below I call the <code>GetSemanticModelSchema</code> tool from the hosted MCP server to get the Prep for AI configuration. Run the code below in a Fabric notebook.</p>
<pre><code class="language-python">import httpx
import json

async def get_semantic_model_schema(model_id):
    """Fetch and parse the semantic model prep for ai config from Power BI MCP server.
        Author : Sandeep Pawar | Fabric.guru
    """
    MCP_SERVER_URL = "https://api.fabric.microsoft.com/v1/mcp/powerbi"
    token = notebookutils.credentials.getToken("pbi")  # notebookutils is built into Fabric notebooks
    headers = {
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json"
    }

    payload = {
        "jsonrpc": "2.0",
        "id": 1,
        "method": "tools/call",
        "params": {
            "name": "GetSemanticModelSchema",
            "arguments": {"artifactId": model_id}
        }
    }

    def parse_sse_response(text):
        for line in text.split('\n'):
            if line.startswith('data: '):
                return json.loads(line[6:])
        return {}


    async with httpx.AsyncClient(timeout=120.0) as client:
        response = await client.post(MCP_SERVER_URL, headers=headers, json=payload)
        data = parse_sse_response(response.text)

    parsed = json.loads(data['result']['content'][0]['text'])

    return {
        "name": parsed['semanticModel']['Name'],
        "tables": parsed['schema']['Tables'],
        "relationships": parsed['schema']['ActiveRelationships'],
        "custom_instructions": parsed['schema']['CustomInstructions'],
        "verified_answers": parsed['schema']['VerifiedAnswers']
    }

# retrieve
prep4ai = await get_semantic_model_schema("&lt;semantic_model_guid&gt;")
</code></pre>
<img src="https://cdn.hashnode.com/uploads/covers/619d4cccfa52cd31fe52d25d/b6ff9a43-2a98-4c22-af07-b163d67bca92.png" alt="" style="display:block;margin:0 auto" />

<p>There you have it. Note the <a href="https://learn.microsoft.com/en-us/power-bi/create-reports/copilot-prepare-data-ai-verified-answers">verified answer</a>. This is a common point of confusion: what does a Verified Answer actually store? As you can see, it includes the projections (the tables, columns, measures, and filters used in the visual), not the DAX query of the visual!</p>
<div>
<div>💡</div>
<div>Note that you cannot make changes to Prep for AI programmatically; this only retrieves the configuration.</div>
</div>

<p>You can use this to programmatically check, verify, and monitor the configurations, or make it part of your best-practices analyzer for AI.</p>
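<p>For example, here is a minimal sketch of such a check; the model IDs are placeholders, and it assumes the response shape returned by the function above:</p>
<pre><code class="language-python"># Hypothetical sketch: flag semantic models whose Prep for AI context is empty.
model_ids = ["&lt;model_guid_1&gt;", "&lt;model_guid_2&gt;"]  # models you want to audit

for model_id in model_ids:
    cfg = await get_semantic_model_schema(model_id)
    if not cfg["custom_instructions"] or not cfg["verified_answers"]:
        print(f"{cfg['name']}: missing custom instructions or verified answers")
</code></pre>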
]]></content:encoded></item><item><title><![CDATA[Cross-referencing Notebooks In The Updated Fabric Notebook Copilot]]></title><description><![CDATA[At FabCon Atlanta last week, the updated notebook Copilot for data engineering and data science was announced. It brings agentic capabilities to the Copilot and is much more intelligent and Fabric-awa]]></description><link>https://fabric.guru/cross-referencing-notebooks-in-the-updated-fabric-notebook-copilot</link><guid isPermaLink="true">https://fabric.guru/cross-referencing-notebooks-in-the-updated-fabric-notebook-copilot</guid><category><![CDATA[microsoftfabric]]></category><category><![CDATA[copilot]]></category><category><![CDATA[AI]]></category><category><![CDATA[agentic AI]]></category><category><![CDATA[agents]]></category><category><![CDATA[PySpark]]></category><dc:creator><![CDATA[Sandeep Pawar]]></dc:creator><pubDate>Tue, 24 Mar 2026 02:11:45 GMT</pubDate><content:encoded><![CDATA[<p>At FabCon Atlanta last week, the updated notebook Copilot for data engineering and data science <a href="https://blog.fabric.microsoft.com/en-us/blog/introducing-the-updated-copilot-for-data-engineering-and-data-science-preview/">was announced</a>. It brings agentic capabilities to the Copilot and is much more intelligent and <em>Fabric-aware</em> than the previous version. You can read the documentation <a href="https://learn.microsoft.com/en-us/fabric/data-engineering/copilot-notebooks-overview">here</a>. For example, you can now ask the Copilot to do the following things, which you couldn't previously:</p>
<ul>
<li><p>Use the Copilot without starting a session. Just open the notebook, open Copilot, and start asking questions and making changes. Saves you CUs.</p>
</li>
<li><p>list items in the workspace : <em>list all the lakehouses in this workspace</em></p>
</li>
<li><p>take actions : <em>mount the &lt;lakehouse_1&gt; and &lt;lakehouse_2&gt; lakehouses and make &lt;lakehouse_1&gt; the default lakehouse</em></p>
</li>
<li><p>get ABFSS paths : <em>give me abfs path of lakehouse &lt;lakehouse_1&gt;.</em></p>
</li>
<li><p>create Spark pool configurations using %%configure : <em>add configuration to use 4 cores</em> (see the sketch after this list)</p>
</li>
<li><p>refer to content in a cell by cell # : <em>explain cell 11</em></p>
</li>
</ul>
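<p>For the %%configure request above, a cell like the following is one shape Copilot might produce (a sketch, not Copilot's verbatim output; see the Fabric docs for the full list of supported session keys):</p>
<pre><code class="language-python">%%configure -f
{
    "driverCores": 4,
    "executorCores": 4
}
</code></pre>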
<p>Give it a try and you will be surprised how well it works.</p>
<p>But one of my favorite features is being able to read and refer to other notebooks. For example, I can ask the Copilot to read <em>notebook_1</em> from the same workspace. Think of the implications for a second. Below is one example of how this can be helpful.</p>
<h2>Cross-referencing notebooks</h2>
<ol>
<li>In a Fabric workspace I created a notebook with a markdown cell that includes rules from the <a href="https://www.palantir.com/docs/foundry/transforms-python-spark/pyspark-style-guide">Palantir PySpark style guide</a>. This is an opinionated guide to PySpark code style for common situations and the associated best practices, based on the most frequent recurring topics across PySpark codebases. Below is a summarized version in <a href="https://raw.githubusercontent.com/pawarbi/snippets/refs/heads/main/PALANTIR_STYLE_GUIDE.md">markdown</a>:</li>
</ol>
<blockquote>
<h1>PySpark Style Guide</h1>
<blockquote>
<p><strong>Purpose</strong>: This notebook is a style contract for AI-assisted code generation and review. When referenced from another notebook (e.g., <code>refer to @pyspark_style_guide</code>), you MUST apply every rule below to all PySpark code produced or reviewed in that session.</p>
<p>Adapted from <a href="https://github.com/palantir/pyspark-style-guide">Palantir PySpark Style Guide</a> (MIT License).</p>
</blockquote>
<hr />
<h2>VERSIONS</h2>
<p>Use features and APIs supported by the following versions:</p>
<ul>
<li><p>Spark 3.5</p>
</li>
<li><p>Delta 3.2</p>
</li>
<li><p>Python 3.11</p>
</li>
</ul>
<h2>Enforcement Checklist</h2>
<p>When reviewing or generating PySpark code, walk through each check below <strong>in order</strong>. Flag every violation found. Do not skip checks.</p>
<table>
<thead>
<tr>
<th>#</th>
<th>Check</th>
<th>What to look for</th>
</tr>
</thead>
<tbody><tr>
<td>C1</td>
<td><strong>Imports</strong></td>
<td>Any bare <code>from pyspark.sql.functions import ...</code> or alias other than <code>F</code>, <code>T</code>, <code>W</code></td>
</tr>
<tr>
<td>C2</td>
<td><strong>Column access</strong></td>
<td>Any <code>df.colName</code> dot-access outside of a join <code>on=</code> clause</td>
</tr>
<tr>
<td>C3</td>
<td><strong>String column refs</strong></td>
<td>Any <code>F.col('x')</code> that could just be <code>'x'</code> (Spark 3.0+)</td>
</tr>
<tr>
<td>C4</td>
<td><strong>Variable names</strong></td>
<td>Any single-letter dataframe names (<code>df</code>, <code>o</code>, <code>d</code>, <code>t</code>)</td>
</tr>
<tr>
<td>C5</td>
<td><strong>Magic values</strong></td>
<td>Any literal string, number, or threshold inline in <code>filter</code>, <code>when</code>, <code>withColumn</code>, <code>select</code> that is not a named constant</td>
</tr>
<tr>
<td>C6</td>
<td><strong>Select contract</strong></td>
<td>More than one function per column in a <code>select</code>, or a <code>.when()</code> expression inside a <code>select</code></td>
</tr>
<tr>
<td>C7</td>
<td><strong>withColumnRenamed</strong></td>
<td>Any use. Replace with <code>select</code> + <code>.alias()</code></td>
</tr>
<tr>
<td>C8</td>
<td><strong>Empty columns</strong></td>
<td>Any <code>lit('')</code>, <code>lit('NA')</code>, <code>lit('N/A')</code>. Must be <code>lit(None)</code></td>
</tr>
<tr>
<td>C9</td>
<td><strong>Logical density</strong></td>
<td>More than 3 boolean expressions in a single <code>.filter()</code> or <code>F.when()</code> without named variables</td>
</tr>
<tr>
<td>C10</td>
<td><strong>Chain length</strong></td>
<td>More than 5 chained statements in one block</td>
</tr>
<tr>
<td>C11</td>
<td><strong>Chain mixing</strong></td>
<td>Joins, filters, withColumn, and selects mixed in the same chain</td>
</tr>
<tr>
<td>C12</td>
<td><strong>Join hygiene</strong></td>
<td>Any <code>.join()</code> missing explicit <code>how=</code></td>
</tr>
<tr>
<td>C13</td>
<td><strong>Right joins</strong></td>
<td>Any <code>how='right'</code>. Swap df order, use <code>left</code></td>
</tr>
<tr>
<td>C14</td>
<td><strong>Window frames</strong></td>
<td>Any <code>Window.partitionBy(...).orderBy(...)</code> without explicit <code>.rowsBetween()</code> or <code>.rangeBetween()</code></td>
</tr>
<tr>
<td>C15</td>
<td><strong>Window nulls</strong></td>
<td><code>F.first()</code> or <code>F.last()</code> without <code>ignorenulls=True</code></td>
</tr>
<tr>
<td>C16</td>
<td><strong>Global windows</strong></td>
<td>Empty <code>W.partitionBy()</code> or window without <code>orderBy</code> used for aggregation. Use <code>.agg()</code> instead</td>
</tr>
<tr>
<td>C17</td>
<td><strong>Otherwise fallback</strong></td>
<td><code>.otherwise(&lt;catch-all value&gt;)</code> masking unexpected data. Use <code>None</code> or omit</td>
</tr>
<tr>
<td>C18</td>
<td><strong>Line continuation</strong></td>
<td>Any <code>\</code> for multiline. Wrap in parentheses instead</td>
</tr>
<tr>
<td>C19</td>
<td><strong>UDFs</strong></td>
<td>Any <code>@udf</code> or <code>F.udf()</code>. Rewrite with native functions</td>
</tr>
<tr>
<td>C20</td>
<td><strong>Comments</strong></td>
<td>Comments that describe <em>what</em> code does instead of <em>why</em> a decision was made</td>
</tr>
<tr>
<td>C21</td>
<td><strong>Dead code</strong></td>
<td>Commented-out code blocks. Remove them</td>
</tr>
<tr>
<td>C22</td>
<td><strong>Function size</strong></td>
<td>Functions over ~70 lines or files over ~250 lines</td>
</tr>
</tbody></table>
<hr />
<h2>Anti-Patterns (find and fix these)</h2>
<p>Each pattern below is a regex-like signature. If you see it, it is a violation.</p>
<h3>AP1: Bare function imports</h3>
<pre><code class="language-python"># VIOLATION: any of these
from pyspark.sql.functions import col, when, sum, lit
import pyspark.sql.functions as func

# FIX: always
from pyspark.sql import functions as F
from pyspark.sql import types as T
from pyspark.sql import Window as W
</code></pre>
<h3>AP2: Dot-access column references</h3>
<pre><code class="language-python"># VIOLATION: df.column_name anywhere except join on=
df.select(df.order_id, df.amount)
df.withColumn('x', df.price * df.qty)

# FIX: use string refs
df.select('order_id', 'amount')
df.withColumn('x', F.col('price') * F.col('qty'))
</code></pre>
<h3>AP3: Inline magic values</h3>
<pre><code class="language-python"># VIOLATION: bare literals in logic
df.filter(F.col('amount') &gt; 500)
F.when(F.col('status') == 'shipped', 'In Transit')
df.filter(F.col('days') &lt; 365)

# FIX: named constants at top of cell/function
HIGH_VALUE_THRESHOLD = 500
STATUS_SHIPPED = 'shipped'
LABEL_IN_TRANSIT = 'In Transit'
ONE_YEAR_DAYS = 365

df.filter(F.col('amount') &gt; HIGH_VALUE_THRESHOLD)
F.when(F.col('status') == STATUS_SHIPPED, LABEL_IN_TRANSIT)
df.filter(F.col('days') &lt; ONE_YEAR_DAYS)
</code></pre>
<h3>AP4: Complex logic inside .when() or .filter()</h3>
<pre><code class="language-python"># VIOLATION: more than 3 conditions inline
df.filter(
    (F.col('a') == 'x') &amp; (F.col('b') &gt; 10) &amp; (F.col('c') != 'y')
    &amp; ((F.col('d') == 'online') | (F.col('d') == 'partner'))
)

# FIX: named boolean expressions, max 3 in the final filter
is_valid_type = (F.col('a') == TYPE_X)
above_threshold = (F.col('b') &gt; MIN_THRESHOLD)
not_excluded = (F.col('c') != EXCLUDED_STATUS)
is_target_channel = (F.col('d') == CHANNEL_ONLINE) | (F.col('d') == CHANNEL_PARTNER)

flagged = is_valid_type &amp; above_threshold &amp; not_excluded &amp; is_target_channel
df.filter(flagged)
</code></pre>
<h3>AP5: .when() inside select</h3>
<pre><code class="language-python"># VIOLATION: conditional logic embedded in select
df.select(
    'order_id',
    F.when(F.col('status') == 'shipped', 'In Transit')
     .when(F.col('status') == 'delivered', 'Complete')
     .alias('status_label'),
)

# FIX: select plain columns, then withColumn for derived logic
df = df.select('order_id', 'status')
df = df.withColumn(
    'status_label',
    F.when(F.col('status') == STATUS_SHIPPED, LABEL_IN_TRANSIT)
     .when(F.col('status') == STATUS_DELIVERED, LABEL_COMPLETE)
)
</code></pre>
<h3>AP6: Empty column sentinels</h3>
<pre><code class="language-python"># VIOLATION
df.withColumn('notes', F.lit(''))
df.withColumn('review_date', F.lit('N/A'))

# FIX
df.withColumn('notes', F.lit(None))
df.withColumn('review_date', F.lit(None))
</code></pre>
<h3>AP7: Missing window frame</h3>
<pre><code class="language-python"># VIOLATION: implicit frame
w = W.partitionBy('customer_id').orderBy('order_date')

# FIX: always explicit
w = (W.partitionBy('customer_id')
      .orderBy('order_date')
      .rowsBetween(W.unboundedPreceding, 0))
</code></pre>
<h3>AP8: Blanket .otherwise()</h3>
<pre><code class="language-python"># VIOLATION: masks unexpected values
F.when(..., 'A').when(..., 'B').otherwise('Unknown')

# FIX: omit otherwise (returns null) or use lit(None) explicitly
F.when(..., 'A').when(..., 'B')
</code></pre>
<h3>AP9: Monster chains</h3>
<pre><code class="language-python"># VIOLATION: mixed concerns, too long
df = (df.select(...).filter(...).withColumn(...).join(...).drop(...).withColumn(...))

# FIX: separate by concern, max 5 per block
df = (
    df
    .select(...)
    .filter(...)
)
df = df.withColumn(...)
df = df.join(..., how='inner')
</code></pre>
<h3>AP10: Backslash continuation</h3>
<pre><code class="language-python"># VIOLATION
df = df.filter(F.col('a') == 'x') \
       .filter(F.col('b') &gt; 10)

# FIX: parentheses
df = (
    df
    .filter(F.col('a') == 'x')
    .filter(F.col('b') &gt; 10)
)
</code></pre>
<hr />
<h2>Quick Reference (for code generation)</h2>
<p>When <strong>writing new code</strong>, apply these defaults:</p>
<ul>
<li><p>Imports: <code>F</code>, <code>T</code>, <code>W</code> only</p>
</li>
<li><p>Columns: string refs where possible, <code>F.col()</code> when needed</p>
</li>
<li><p>Descriptive df names: <code>orders_df</code>, <code>active_orders</code>, not <code>df</code>, <code>o</code></p>
</li>
<li><p>Constants: every literal in logic gets a <code>SCREAMING_SNAKE</code> name</p>
</li>
<li><p>Selects: plain columns + one transform each, no <code>.when()</code> inside</p>
</li>
<li><p>Chains: max 5 lines, group by concern (filter/select, then enrich, then join)</p>
</li>
<li><p>Joins: always <code>how=</code>, always <code>left</code> not <code>right</code>, alias for disambiguation</p>
</li>
<li><p>Windows: always explicit frame, always <code>ignorenulls=True</code> on <code>first</code>/<code>last</code></p>
</li>
<li><p>Empty cols: <code>F.lit(None)</code>, never <code>lit('')</code> or <code>lit('NA')</code></p>
</li>
<li><p>No UDFs, no <code>.otherwise()</code> fallbacks, no <code>\</code> continuations</p>
</li>
<li><p>Comments explain <em>why</em>, not <em>what</em>. No commented-out code.</p>
</li>
</ul>
</blockquote>
<ol>
<li><p>I named the notebook PYSPARK_STYLE_GUIDE. It's all caps intentionally (more on this later).</p>
</li>
<li><p>In another notebook, which already has some PySpark code, I opened Copilot.</p>
</li>
<li><p>Asked : <em>List all notebooks in this workspace</em>. I can see the PYSPARK_STYLE_GUIDE notebook:</p>
</li>
</ol>
<img src="https://cdn.hashnode.com/uploads/covers/619d4cccfa52cd31fe52d25d/f6101e6c-24d3-481b-a148-2bb48763bfcf.png" alt="" style="display:block;margin:0 auto" />

<ol>
<li>My notebook has one cell with a large code block (intentional). I prompted Copilot:</li>
</ol>
<blockquote>
<p><strong>refer to @PYSPARK_STYLE_GUIDE and fix the code without losing the function and purpose</strong></p>
</blockquote>
<p><a class="embed-card" href="https://youtu.be/gtj_f7oBeuk">https://youtu.be/gtj_f7oBeuk</a></p>

<div>
<div>💡</div>
<div>As with anything AI, be sure to always back-up, test and verify.</div>
</div>

<p>Copilot read the style notebook and applied the rules to the cells in this notebook. You could also use this to extract code patterns from other notebooks, e.g. <em>how did &lt;notebook_name&gt; ingest the data, use the same library as &lt;notebook_name&gt; to create ML features, etc.</em> Super handy.</p>
<p>Your BI/DE/DS team could also create reference pattern notebooks and refer to them to drive consistency and quality. Note that you can list items in another workspace but can't reference notebooks cross-workspace.</p>
<p>This was for Copilot in Fabric notebook. In an upcoming blog, I will share how I use <strong>Skills for Fabric</strong> for development.</p>
<h2>Reference:</h2>
<ul>
<li><p><a href="https://gist.github.com/pawarbi/2298a263206a3374ed423d3624bc4907">PALANTIR_STYLE_</a><a href="http://GUIDE.md">GUIDE.md</a></p>
</li>
<li><p><a href="https://blog.fabric.microsoft.com/en-us/blog/introducing-the-updated-copilot-for-data-engineering-and-data-science-preview/">Introducing the updated Copilot for data engineering and data science (Preview) | Microsoft Fabric Blog | Microsoft Fabric</a></p>
</li>
<li><p><a href="https://www.palantir.com/docs/foundry/transforms-python-spark/pyspark-style-guide">Python (Spark) • PySpark reference • Style guide • Palantir</a></p>
</li>
</ul>
]]></content:encoded></item><item><title><![CDATA[Programmatically Comparing Draft vs Production Fabric Data Agent Responses]]></title><description><![CDATA[Fabric data agent has a draft and a published mode. This helps the developer test the configurations before publishing it.

You can also use the data agent SDK to test the agent programmatically. You can learn more about it here and notebook samples ...]]></description><link>https://fabric.guru/programmatically-comparing-draft-vs-production-fabric-data-agent-responses</link><guid isPermaLink="true">https://fabric.guru/programmatically-comparing-draft-vs-production-fabric-data-agent-responses</guid><category><![CDATA[Fabric data agent]]></category><category><![CDATA[microsoft fabric]]></category><category><![CDATA[data agent]]></category><category><![CDATA[sdk]]></category><category><![CDATA[AI Data Agents in Microsoft Fabric]]></category><dc:creator><![CDATA[Sandeep Pawar]]></dc:creator><pubDate>Fri, 16 Jan 2026 23:35:03 GMT</pubDate><content:encoded><![CDATA[<p>Fabric data agent has a draft and a published mode. This helps the developer test the configurations before publishing it.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1768605526850/f8079401-e64c-478a-bad7-4990088da3d7.png" alt class="image--center mx-auto" /></p>
<p>You can also use the data agent SDK to test the agent programmatically. You can learn more about it <a target="_blank" href="https://learn.microsoft.com/en-us/fabric/data-science/fabric-data-agent-sdk">here</a> and find notebook samples <a target="_blank" href="https://github.com/microsoft/fabric-samples/tree/main/docs-samples/data-science/data-agent-sdk">in this repo</a>. Let me show you how to compare the data agent responses from the two stages.</p>
<p>Imagine I am testing new instructions:</p>
<ul>
<li><p>In Draft stage, I used agent instruction: <code>Always return amounts rounded to nearest hundred, e.g. 1451 should be 1500, and 45,179 should be 45100</code></p>
</li>
<li><p>For published stage, the instructions are : <code>Always return amounts with $xyz, e.g. $123.4</code></p>
</li>
</ul>
<p>I should get the same answer but formatted differently based on the instructions: a rounded number for the draft and a precise answer with a $ for the production version.</p>
<h4 id="heading-code">Code</h4>
<p>The trick is to set the stage parameter <code>ai_skill_stage</code> to <code>"sandbox"</code> vs <code>"production"</code>:</p>
<pre><code class="lang-python">%pip install fabric-data-agent-sdk --q

<span class="hljs-keyword">import</span> time
<span class="hljs-keyword">from</span> fabric.dataagent.client <span class="hljs-keyword">import</span> FabricOpenAI

DATA_AGENT_NAME = <span class="hljs-string">"&lt;DataAgentName&gt;"</span>
MODEL = <span class="hljs-string">"gpt-4o"</span>

sbx  = FabricOpenAI(artifact_name=DATA_AGENT_NAME, ai_skill_stage=<span class="hljs-string">"sandbox"</span>)
prod = FabricOpenAI(artifact_name=DATA_AGENT_NAME, ai_skill_stage=<span class="hljs-string">"production"</span>)

asst_sbx  = sbx.beta.assistants.create(model=MODEL, instructions=<span class="hljs-string">"You are the DRAFT (sandbox) data agent."</span>).id
asst_prod = prod.beta.assistants.create(model=MODEL, instructions=<span class="hljs-string">"You are the PUBLISHED (production) data agent."</span>).id


<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">ask</span>(<span class="hljs-params">client, assistant_id, q, *, timeout_s=<span class="hljs-number">300</span></span>):</span>
    tid = client.beta.threads.create().id
    client.beta.threads.messages.create(thread_id=tid, role=<span class="hljs-string">"user"</span>, content=q)
    run = client.beta.threads.runs.create(thread_id=tid, assistant_id=assistant_id)

    end = time.time() + timeout_s
    <span class="hljs-keyword">while</span> run.status <span class="hljs-keyword">not</span> <span class="hljs-keyword">in</span> {<span class="hljs-string">"completed"</span>, <span class="hljs-string">"failed"</span>, <span class="hljs-string">"cancelled"</span>, <span class="hljs-string">"expired"</span>, <span class="hljs-string">"incomplete"</span>}:
        <span class="hljs-keyword">if</span> time.time() &gt; end:
            <span class="hljs-keyword">raise</span> TimeoutError(<span class="hljs-string">f"timeout (status=<span class="hljs-subst">{run.status}</span>)"</span>)
        time.sleep(<span class="hljs-number">2</span>)
        run = client.beta.threads.runs.retrieve(thread_id=tid, run_id=run.id)

    <span class="hljs-keyword">if</span> run.status != <span class="hljs-string">"completed"</span>:
        <span class="hljs-keyword">raise</span> RuntimeError(<span class="hljs-string">f"run status=<span class="hljs-subst">{run.status}</span>"</span>)

    <span class="hljs-keyword">for</span> m <span class="hljs-keyword">in</span> client.beta.threads.messages.list(thread_id=tid, order=<span class="hljs-string">"desc"</span>).data:
        <span class="hljs-keyword">if</span> m.role == <span class="hljs-string">"assistant"</span>:
            <span class="hljs-keyword">return</span> m.content[<span class="hljs-number">0</span>].text.value
    <span class="hljs-keyword">return</span> <span class="hljs-string">""</span>


<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">compare</span>(<span class="hljs-params">q</span>):</span>
    <span class="hljs-keyword">return</span> ask(sbx, asst_sbx, q), ask(prod, asst_prod, q)


q = <span class="hljs-string">"what's the total transaction amount"</span>
draft, production = compare(q)

print(<span class="hljs-string">"DRAFT:"</span>, draft)
print(<span class="hljs-string">"\nPRODUCTION:"</span>, production)
</code></pre>
<h4 id="heading-result">Result</h4>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1768606068689/8eb020af-5711-4320-a5d2-27444ed0ffd1.png" alt class="image--center mx-auto" /></p>
<p>This is handy if you want to tune the data agent's performance and compare it against production before publishing.</p>
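<p>For example, a small evaluation loop on top of the <code>compare</code> helper above (a sketch; the questions are placeholders):</p>
<pre><code class="lang-python"># Hypothetical sketch: run a few test questions through both stages side by side.
questions = [
    "what's the total transaction amount",
    "what's the average transaction amount by month",
]

for q in questions:
    draft, production = compare(q)
    print(f"Q: {q}\nDRAFT: {draft}\nPRODUCTION: {production}\n")
</code></pre>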
]]></content:encoded></item><item><title><![CDATA[Monitoring Power BI Modeling MCP Server Usage and Adoption]]></title><description><![CDATA[Power BI Modeling MCP server was launched at Ignite 2025 last month. It has quickly become an indispensable tool for working with the semantic models in Fabric workspaces. You can learn more about it below:

Announcement by Rui Romano : https://power...]]></description><link>https://fabric.guru/monitoring-power-bi-modeling-mcp-server-usage-and-adoption</link><guid isPermaLink="true">https://fabric.guru/monitoring-power-bi-modeling-mcp-server-usage-and-adoption</guid><category><![CDATA[microsoftfabric]]></category><category><![CDATA[PowerBI]]></category><category><![CDATA[mcp]]></category><dc:creator><![CDATA[Sandeep Pawar]]></dc:creator><pubDate>Sun, 14 Dec 2025 22:31:37 GMT</pubDate><content:encoded><![CDATA[<p><img src="https://powerbiblogsfd-ep-aveghkfaexa3e4bx.b02.azurefd.net//wp-content/uploads/2025/11/word-image-31766-29.png" alt /></p>
<p>The Power BI Modeling MCP server was launched at Ignite 2025 last month. It has quickly become an indispensable tool for working with semantic models in Fabric workspaces. You can learn more about it below:</p>
<ul>
<li><p>Announcement by Rui Romano : <a target="_blank" href="https://powerbi.microsoft.com/en-us/blog/power-bi-november-2025-feature-summary/#post-31766-_Toc214035083">https://powerbi.microsoft.com/en-us/blog/power-bi-november-2025-feature-summary/#post-31766-_Toc214035083</a></p>
</li>
<li><p>Documentation : <a target="_blank" href="https://learn.microsoft.com/en-us/power-bi/developer/mcp/">Power BI MCP server documentation - Power BI | Microsoft Learn</a></p>
</li>
<li><p>Blog by Jeffrey Wang : <a target="_blank" href="https://pbidax.wordpress.com/2025/11/25/talk-to-your-data-model-introducing-the-power-bi-modeling-mcp/">Talk to Your Data Model: Introducing the Power BI Modeling MCP – pbidax</a></p>
</li>
</ul>
<p>I was recently asked how admins can monitor usage of the Modeling MCP server. As with any other external tool, if a user has permission to query a semantic model’s XMLA endpoint, you cannot restrict which client or tool they use. However, admins <em>can</em> monitor its usage.</p>
<ol>
<li><p><strong>Enable Workspace Monitoring</strong> : Turn on <a target="_blank" href="https://learn.microsoft.com/en-us/fabric/fundamentals/workspace-monitoring-overview">Workspace Monitoring</a>; you will need to be a Workspace Admin to enable this. I have a <a target="_blank" href="https://fabric.guru/analyzing-semantic-model-logs-using-fabric-workspace-monitoring">blog</a> on how this can be helpful.</p>
</li>
<li><p><strong>Query the logs :</strong> Query the <code>SemanticModelLogs</code> table to identify which semantic models are being queried and by which users. Filter the <code>ApplicationName</code> column to find the application used to query the model.</p>
</li>
</ol>
<pre><code class="lang-python">// Monitor Power BI Modeling Server Usage
SemanticModelLogs
| where ItemKind == <span class="hljs-string">'Dataset'</span> <span class="hljs-keyword">and</span> ApplicationName == <span class="hljs-string">'MCP-PBIModeling'</span>
| summarize sessions = dcount(OperationId) by ItemName, ExecutingUser, ApplicationName, bin(Timestamp, <span class="hljs-number">5</span>m)
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1765751024596/59184754-63da-4426-8546-b748fe67910a.png" alt class="image--center mx-auto" /></p>
]]></content:encoded></item><item><title><![CDATA[Unstructured Data Extraction Using AI Functions in Dataflow Gen2]]></title><description><![CDATA[Three and a half years ago, in the pre-AI age, I wrote a blog post titled “Extracting Matching Words Using A Pre-Defined List in Power Query/M”. I helped a colleague extract certain error codes/phrases from sentences using M. It felt great to wield the M...]]></description><link>https://fabric.guru/unstructured-data-extraction-using-ai-functions-in-dataflow-gen2</link><guid isPermaLink="true">https://fabric.guru/unstructured-data-extraction-using-ai-functions-in-dataflow-gen2</guid><category><![CDATA[microsoftfabric]]></category><category><![CDATA[ai functions]]></category><category><![CDATA[llm]]></category><category><![CDATA[genai]]></category><dc:creator><![CDATA[Sandeep Pawar]]></dc:creator><pubDate>Fri, 05 Dec 2025 21:55:56 GMT</pubDate><content:encoded><![CDATA[<p>Three and a half years ago, in the pre-AI age, I wrote a blog post titled <strong>“Extracting Matching Words Using A Pre-Defined List in Power Query/M”</strong>. I helped a colleague extract certain error codes/phrases from sentences using M. It felt great to wield the M sword. Fast forward to today, you can use the newly announced <a target="_blank" href="https://learn.microsoft.com/en-us/fabric/data-factory/dataflow-gen2-ai-functions">AI Functions feature in Dataflow Gen2</a> to achieve the same result using natural language.</p>
<p>Blog: <a target="_blank" href="https://pawarbi.github.io/blog/powerquery/m/list/2022/03/11/powerquery-M-extracting-words.html">Extracting Matching Words Using A Pre-Defined List in Power Query/M | Sandeep Pawar</a></p>
<p>Following the same example I shared in the blog above, I wanted to extract certain keywords (programming languages) from text. In M, I split the text and did a list lookup to find the matching words.</p>
<p><img src="https://raw.githubusercontent.com/pawarbi/blog/master/images/word13.png" alt /></p>
<h2 id="heading-fabric-ai-prompt">Fabric AI Prompt:</h2>
<p>This shouldn’t need much explanation. Similar to <a target="_blank" href="https://learn.microsoft.com/en-us/fabric/data-science/ai-functions/overview?tabs=pandas-pyspark%2Cpandas">AI Functions in Fabric</a>, you specify the column, enter the prompt and let the LLM do the work.</p>
<p>Here is the prompt I used : <code>Extract any programming, query, or scripting languages mentioned (including those used in BI tools such as M, LOD ). Comma-separate if multiple. NA if none or unsure</code></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764971226777/b5d14a3e-2db9-459a-8672-31eaa36f0da9.png" alt class="image--center mx-auto" /></p>
<p><strong>Result:</strong></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1764971675031/1ea4d254-5703-4bfa-a177-f46360b27c6b.png" alt class="image--center mx-auto" /></p>
<p>Give it a try:</p>
<pre><code class="lang-python">let
    Source = <span class="hljs-comment">#table(</span>
        type table [Respondent = text, Text = text],
        {
            {<span class="hljs-string">"Person1"</span>, <span class="hljs-string">"I like Python"</span>},
            {<span class="hljs-string">"Person2"</span>, <span class="hljs-string">"We use Python and R"</span>},
            {<span class="hljs-string">"Person3"</span>, <span class="hljs-string">"SQL"</span>},
            {<span class="hljs-string">"Person4"</span>, <span class="hljs-string">"My team uses Spark , SQL"</span>},
            {<span class="hljs-string">"Person5"</span>, <span class="hljs-string">"Excel all the way"</span>},
            {<span class="hljs-string">"Person6"</span>, <span class="hljs-string">"I hate DAX"</span>},
            {<span class="hljs-string">"Person7"</span>, <span class="hljs-string">"M is magic"</span>},
            {<span class="hljs-string">"Person8"</span>, <span class="hljs-string">"I don't use anything"</span>},
            {<span class="hljs-string">"Person8"</span>, <span class="hljs-string">"I learned MDX before DAX and still use MDX for our SSAS cubes "</span>}
        }
    ),
  <span class="hljs-comment">#"Renamed columns" = Table.RenameColumns(Source, {{"Text", "quote"}}),</span>
  <span class="hljs-comment">#"Added AI prompt column" = Table.AddColumn(#"Renamed columns", "language", each FabricAI.Prompt("Extract any programming, query, or scripting languages mentioned (including those used in BI tools such as M, LOD ). Comma-separate if multiple. NA if none or unsure", Record.SelectFields(_, {"quote"})), type text)</span>
<span class="hljs-keyword">in</span>
    <span class="hljs-comment">#"Added AI prompt column"</span>
</code></pre>
<p>Instead of the UI, you can also use <a target="_blank" href="https://learn.microsoft.com/en-us/powerquery-m/fabricai-prompt">FabricAI.Prompt</a> M function. Note that this function is not available in Power BI Desktop.</p>
<p>Read the documentation for more details: <a target="_blank" href="https://learn.microsoft.com/en-us/fabric/data-factory/dataflow-gen2-ai-functions">Fabric AI Prompt in Dataflow Gen2 (Preview) - Microsoft Fabric | Microsoft Learn</a></p>
]]></content:encoded></item><item><title><![CDATA[Quick Test : Using Fabric AI Services To Detect Power BI Reports With Errors]]></title><description><![CDATA[If you follow me on LinkedIn or Twitter, you may have seen my posts about using AI to detect Power BI reports with errors, visualizations that lack accessibility etc. All those posts used Gemini models because of their strong multi-modal performance....]]></description><link>https://fabric.guru/quick-test-using-fabric-ai-services-to-detect-power-bi-reports-with-errors</link><guid isPermaLink="true">https://fabric.guru/quick-test-using-fabric-ai-services-to-detect-power-bi-reports-with-errors</guid><category><![CDATA[microsoft fabric]]></category><category><![CDATA[AI]]></category><category><![CDATA[semantic link labs]]></category><category><![CDATA[Power BI]]></category><dc:creator><![CDATA[Sandeep Pawar]]></dc:creator><pubDate>Wed, 05 Nov 2025 23:14:18 GMT</pubDate><content:encoded><![CDATA[<p>If you follow me on LinkedIn or Twitter, you may have seen my posts about using AI to detect Power BI reports with errors, visualizations that lack accessibility, etc. All those posts used Gemini models because of their strong multi-modal performance. Well, Fabric now has strong multi-modal LLMs from OpenAI (GPT-4.1, GPT-5). In this blog, I experimented to see whether I could use those models to detect if a report page has any errors. This is not a comprehensive test by any means, but it shows the capability and the use cases.</p>
<p>You could use this to scan your reports and detect errors before your users find them.</p>
<h3 id="heading-steps">Steps:</h3>
<ul>
<li><p>Use semantic link labs to generate a PDF of the report</p>
</li>
<li><p>Convert each PDF page to an image and encode it as base64</p>
</li>
<li><p>Send the base64 image to the LLM and ask it to detect any errors</p>
</li>
<li><p>Optional : if you have sensitive data on the report, you could add noise to the image or apply a filter (e.g., a Gaussian blur) before encoding it; a minimal sketch follows this list. If you want a fuller treatment, drop a note in the comments below.</p>
</li>
</ul>
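<p>For the optional blur step, here is a minimal sketch using Pillow's <code>GaussianBlur</code> on a rendered page image before encoding; the file names and radius are placeholders to tune for your reports.</p>
<pre><code class="lang-python"># Hypothetical sketch: blur a page image so values are unreadable while
# error banners and icons remain detectable.
from PIL import Image, ImageFilter

img = Image.open("page1.png").convert("RGB")              # a rendered report page (placeholder path)
blurred = img.filter(ImageFilter.GaussianBlur(radius=4))  # tune the radius to taste
blurred.save("page1_blurred.jpg", quality=85)
</code></pre>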
<h2 id="heading-example-report">Example Report :</h2>
<p>In my test, I had a report with four pages. Three pages had errors like the one below, and the last page had no errors.</p>
<p><strong>Page 1: Has error</strong></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1762383991722/636b2f79-3315-401d-b4f9-3a4cee76ea89.png" alt class="image--center mx-auto" /></p>
<p><strong>Page 4: No error</strong></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1762384039661/53b69ccf-aa1a-432e-bb5f-db463559eeda.png" alt class="image--center mx-auto" /></p>
<h2 id="heading-generate-pdf">Generate PDF:</h2>
<p>Attach a lakehouse to the notebook and install <code>semantic-link-labs</code> and <code>pymupdf4llm</code>.</p>
<p>The code below will save the specified report as a PDF to the Files section of a lakehouse.</p>
<pre><code class="lang-python">%pip install semantic-link-labs pymupdf4llm

<span class="hljs-keyword">from</span> sempy_labs.report <span class="hljs-keyword">import</span> export_report

export_report(
    report=<span class="hljs-string">"Sales (1)"</span>, <span class="hljs-comment">#name of your report</span>
    export_format=<span class="hljs-string">"PDF"</span>,
    file_name=<span class="hljs-string">"mySalesreport"</span>,   <span class="hljs-comment"># name of the pdf to save</span>
    workspace=<span class="hljs-string">"d70c1ed7-b6e4-40f1-911f-3300a13145ff"</span>,       <span class="hljs-comment"># workspace name or id where report lives</span>
    lakehouse=<span class="hljs-string">"mylakehouse"</span>,       <span class="hljs-comment"># lakehouse to save file into</span>
    lakehouse_workspace=<span class="hljs-string">"d70c1ed7-b6e4-40f1-911f-3300a13145ff"</span>  <span class="hljs-comment"># workspace for the lakehouse</span>
)
</code></pre>
<h2 id="heading-detect-error">Detect Error:</h2>
<p>Using GPT-4.1 to detect errors on each report page:</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> synapse.ml.fabric.service_discovery <span class="hljs-keyword">import</span> get_fabric_env_config
<span class="hljs-keyword">from</span> synapse.ml.fabric.token_utils <span class="hljs-keyword">import</span> TokenUtils
<span class="hljs-keyword">import</span> fitz 
<span class="hljs-keyword">import</span> base64
<span class="hljs-keyword">from</span> io <span class="hljs-keyword">import</span> BytesIO
<span class="hljs-keyword">from</span> PIL <span class="hljs-keyword">import</span> Image
<span class="hljs-keyword">import</span> requests
<span class="hljs-keyword">import</span> json

<span class="hljs-comment"># Setup Fabric API</span>
fabric_env_config = get_fabric_env_config().fabric_env_config
auth_header = TokenUtils().get_openai_auth_header()
openai_base_host = fabric_env_config.ml_workload_endpoint + <span class="hljs-string">"cognitive/openai/openai/"</span>
deployment_name = <span class="hljs-string">"gpt-4.1"</span>
service_url = openai_base_host + <span class="hljs-string">f"deployments/<span class="hljs-subst">{deployment_name}</span>/chat/completions?api-version=2025-04-01-preview"</span>

auth_headers = {
    <span class="hljs-string">"Authorization"</span>: auth_header,
    <span class="hljs-string">"Content-Type"</span>: <span class="hljs-string">"application/json"</span>
}
prompt= <span class="hljs-string">"""


Analyze the image of a dashboard and determine if it contains any errors, error messages, or failed data loading. 
Look for phrases like 'Error fetching data', 'See details', error icons (X in circles), or visual indicators of failure. 
Respond with 'ERROR DETECTED' if errors are present, or 'NO ERROR' if the page appears normal. Then provide a brief explanation.

"""</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">pdf_page_to_base64</span>(<span class="hljs-params">pdf_path, page_num, dpi=<span class="hljs-number">150</span></span>):</span>
    doc = fitz.open(pdf_path)
    page = doc[page_num]
    pix = page.get_pixmap(dpi=dpi)
    img = Image.frombytes(<span class="hljs-string">"RGB"</span>, [pix.width, pix.height], pix.samples)

    buffer = BytesIO()
    img.save(buffer, format=<span class="hljs-string">"JPEG"</span>, quality=<span class="hljs-number">85</span>)
    doc.close()

    <span class="hljs-keyword">return</span> base64.b64encode(buffer.getvalue()).decode(<span class="hljs-string">'utf-8'</span>)

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">check_page_for_errors</span>(<span class="hljs-params">base64_image</span>):</span>
    payload = {
        <span class="hljs-string">"messages"</span>: [
            {
                <span class="hljs-string">"role"</span>: <span class="hljs-string">"system"</span>,
                <span class="hljs-string">"content"</span>: <span class="hljs-string">"You are an error detection assistant analyzing Power BI dashboard and reports."</span>
            },
            {
                <span class="hljs-string">"role"</span>: <span class="hljs-string">"user"</span>,
                <span class="hljs-string">"content"</span>: [
                    {
                        <span class="hljs-string">"type"</span>: <span class="hljs-string">"text"</span>,
                        <span class="hljs-string">"text"</span>: prompt
                    },
                    {
                        <span class="hljs-string">"type"</span>: <span class="hljs-string">"image_url"</span>,
                        <span class="hljs-string">"image_url"</span>: {
                            <span class="hljs-string">"url"</span>: <span class="hljs-string">f"data:image/jpeg;base64,<span class="hljs-subst">{base64_image}</span>"</span>,
                            <span class="hljs-string">"detail"</span>: <span class="hljs-string">"high"</span>
                        }
                    }
                ]
            }
        ],
        <span class="hljs-string">"max_tokens"</span>: <span class="hljs-number">500</span>,
        <span class="hljs-string">"temperature"</span>: <span class="hljs-number">0.1</span>
    }

    response = requests.post(service_url, headers=auth_headers, json=payload)

    <span class="hljs-keyword">if</span> response.status_code == <span class="hljs-number">200</span>:
        <span class="hljs-keyword">return</span> response.json()[<span class="hljs-string">"choices"</span>][<span class="hljs-number">0</span>][<span class="hljs-string">"message"</span>][<span class="hljs-string">"content"</span>]
    <span class="hljs-keyword">else</span>:
        <span class="hljs-keyword">return</span> <span class="hljs-string">f"API Error: <span class="hljs-subst">{response.status_code}</span>"</span>

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">analyze_pdf_for_errors</span>(<span class="hljs-params">pdf_path</span>):</span>
    doc = fitz.open(pdf_path)
    total_pages = len(doc)
    doc.close()

    results = {}

    <span class="hljs-keyword">for</span> page_num <span class="hljs-keyword">in</span> range(total_pages):
        print(<span class="hljs-string">f"Processing page <span class="hljs-subst">{page_num + <span class="hljs-number">1</span>}</span>/<span class="hljs-subst">{total_pages}</span>..."</span>)

        base64_image = pdf_page_to_base64(pdf_path, page_num)
        result = check_page_for_errors(base64_image)

        results[page_num + <span class="hljs-number">1</span>] = result

        <span class="hljs-keyword">if</span> <span class="hljs-string">"ERROR DETECTED"</span> <span class="hljs-keyword">in</span> result:
            print(<span class="hljs-string">f"  -&gt; ERROR FOUND on page <span class="hljs-subst">{page_num + <span class="hljs-number">1</span>}</span>"</span>)
        <span class="hljs-keyword">else</span>:
            print(<span class="hljs-string">f"  -&gt; Page <span class="hljs-subst">{page_num + <span class="hljs-number">1</span>}</span> OK"</span>)

    <span class="hljs-keyword">return</span> results


pdf_path = <span class="hljs-string">"/lakehouse/default/Files/mySalesreport.pdf"</span>
error_results = analyze_pdf_for_errors(pdf_path)

error_results
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1762384159587/9f5b2150-ce4a-4a98-aa6f-0ca1be8a5965.png" alt class="image--center mx-auto" /></p>
<p>Below is the dictionary returned in <code>error_results</code>:</p>
<pre><code class="lang-python">{<span class="hljs-number">1</span>: <span class="hljs-string">'ERROR DETECTED\n\nExplanation: The dashboard contains an error message at the top center stating "Error fetching data for this visual See details" along with an error icon (X in a circle). This indicates that at least one visual has failed to load data. The rest of the visuals appear to be functioning normally.'</span>,
 <span class="hljs-number">2</span>: <span class="hljs-string">'ERROR DETECTED\n\nExplanation: The dashboard displays the message "Error fetching data for this visual" along with an error icon and a "See details" link, indicating that data failed to load for the visual.'</span>,
 <span class="hljs-number">3</span>: <span class="hljs-string">'ERROR DETECTED\n\nExplanation: The dashboard contains a visual with an error message stating "Error fetching data for this visual See details" along with an error icon (X in a circle), indicating that one of the visuals failed to load data.'</span>,
 <span class="hljs-number">4</span>: <span class="hljs-string">'NO ERROR\n\nExplanation: The dashboard displays all visuals and data without any error messages, icons, or indicators of failed data loading. All charts and figures are rendered correctly.'</span>}
</code></pre>
<p>For my report, the LLM correctly identified the pages with errors and described the errors as well.</p>
<p>I did not test all types of errors, so if you wanted to make this more comprehensive, you could give the model a few-shot example of each error and what to look for.</p>
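<p>A minimal sketch of that extension, appending hypothetical few-shot examples to the <code>prompt</code> defined earlier:</p>
<pre><code class="lang-python"># Hypothetical sketch: describe each failure mode so the model knows what to look for.
FEW_SHOT_EXAMPLES = """
Example 1: a visual shows 'Error fetching data for this visual' with a 'See details' link -&gt; ERROR DETECTED
Example 2: a visual shows an X-in-circle icon where a chart should render -&gt; ERROR DETECTED
Example 3: all visuals render data with no warning text or icons -&gt; NO ERROR
"""

prompt = prompt + FEW_SHOT_EXAMPLES  # reuses the prompt variable from the code above
</code></pre>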
]]></content:encoded></item><item><title><![CDATA[Automate DAX UDF With Semantic Link Labs]]></title><description><![CDATA[DAX User Defined Function announced at FabCon was one of the biggest updates to the DAX language in recent years. You can read more about it on official Microsoft docs and SQLBI. In this blog, I share how you can use Semantic Link Labs (SLL) to autom...]]></description><link>https://fabric.guru/automate-dax-udf-with-semantic-link-labs</link><guid isPermaLink="true">https://fabric.guru/automate-dax-udf-with-semantic-link-labs</guid><category><![CDATA[Power BI]]></category><category><![CDATA[microsoft fabric]]></category><category><![CDATA[semantic link labs]]></category><dc:creator><![CDATA[Sandeep Pawar]]></dc:creator><pubDate>Thu, 16 Oct 2025 21:44:38 GMT</pubDate><content:encoded><![CDATA[<p>DAX <a target="_blank" href="https://learn.microsoft.com/en-us/dax/best-practices/dax-user-defined-functions">User Defined Function</a> announced at FabCon was one of the biggest updates to the DAX language in recent years. You can read more about it on official <a target="_blank" href="https://learn.microsoft.com/en-us/dax/best-practices/dax-user-defined-functions">Microsoft docs</a> and <a target="_blank" href="https://www.sqlbi.com/articles/introducing-user-defined-functions-in-dax/">SQLBI</a>. In this blog, I share how you can use Semantic Link Labs (SLL) to automate the process of defining and centralizing the UDFs. Note DAX UDF is still in preview so read official documentation for all the details and <a target="_blank" href="https://learn.microsoft.com/en-us/dax/best-practices/dax-user-defined-functions#considerations-and-limitations">limitations</a>.</p>
<h2 id="heading-defining-udf">Defining UDF</h2>
<p>The latest SLL <a target="_blank" href="https://github.com/microsoft/semantic-link-labs/releases/tag/0.12.4">version 0.12.4</a>, thanks to my colleague <a target="_blank" href="https://www.elegantbi.com/about">Michael Kovalsky</a>, has a <code>set_user_defined_function</code> method that uses TOM. Here is how you use it:</p>
<pre><code class="lang-python"><span class="hljs-comment">#%pip install semantic-link-labs --q #upgrade to 0.12.4</span>
<span class="hljs-keyword">import</span> sempy_labs <span class="hljs-keyword">as</span> labs
<span class="hljs-keyword">from</span> sempy_labs.tom <span class="hljs-keyword">import</span> connect_semantic_model

dataset = <span class="hljs-string">''</span> <span class="hljs-comment"># Enter the name or ID of your semantic model</span>
workspace = <span class="hljs-literal">None</span> <span class="hljs-comment"># Enter the name or ID of the workspace in which the semantic model resides</span>
name = <span class="hljs-string">'AddTax'</span> <span class="hljs-comment"># Name of the user-defined function</span>
expression = <span class="hljs-string">"(amount : NUMERIC) =&gt; amount * 1.1"</span> <span class="hljs-comment"># Expression logic of the user-defined function</span>

<span class="hljs-keyword">with</span> connect_semantic_model(dataset=dataset, readonly=<span class="hljs-literal">False</span>, workspace=workspace) <span class="hljs-keyword">as</span> tom:
    tom.set_user_defined_function(name=name, expression=expression)

<span class="hljs-comment"># List user-defined functions</span>
df = labs.list_user_defined_functions(dataset=dataset, workspace=workspace)
display(df)
</code></pre>
<h2 id="heading-using-dax-lib">Using DAX Lib</h2>
<p>SQLBI has an open-source collection of UDFs at <a target="_blank" href="https://docs.daxlib.org/what-is">DAX Lib</a>, submitted by users and community members. You can use TMDL to apply any of these UDFs in Desktop or via <a target="_blank" href="https://docs.tabulareditor.com/te3/tutorials/udfs.html">TE3</a>. We can also use SLL to apply any of those functions from DAX Lib’s <a target="_blank" href="https://github.com/daxlib/daxlib">repo</a>. Below I will use the <a target="_blank" href="https://daxlib.org/package/Kolosko.SummaryStats/">Mode function</a> my friend <a target="_blank" href="https://kerrykolosko.com/about/">Kerry</a> submitted.</p>
<div data-node-type="callout">
<div data-node-type="callout-emoji">💡</div>
<div data-node-type="callout-text">DAX Lib has community submitted functions so please acknowledge and attribute to the authors if you use the functions</div>
</div>

<p>In the repo, the functions are defined in TMDL, so (thanks to an LLM) the function below extracts the name and the expression using regex.</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> re
<span class="hljs-keyword">from</span> typing <span class="hljs-keyword">import</span> Tuple, Optional

<span class="hljs-comment">## thanks to LLM for below</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">extract_dax_function</span>(<span class="hljs-params">text: str</span>) -&gt; Tuple[Optional[str], Optional[str]]:</span>
    <span class="hljs-string">"""
    Returns (function_name, expression_block).
    - function_name: between 'function' and '='
    - expression_block: after '=' up to (but not including) the first 'annotation'
    """</span>
    udf_name = re.search(<span class="hljs-string">r"function\s+['\"]?\s*([^'\"=\s]+)\s*['\"]?\s*="</span>, text, re.IGNORECASE)
    <span class="hljs-keyword">if</span> <span class="hljs-keyword">not</span> udf_name:
        <span class="hljs-keyword">return</span> <span class="hljs-literal">None</span>, <span class="hljs-literal">None</span>
    func_name = udf_name.group(<span class="hljs-number">1</span>).strip()
    udf_expr = re.search(<span class="hljs-string">r"=\s*(.*?)\bannotation\b"</span>, text, re.IGNORECASE | re.DOTALL)
    <span class="hljs-keyword">if</span> udf_expr:
        expr = udf_expr.group(<span class="hljs-number">1</span>).rstrip()
    <span class="hljs-keyword">else</span>:
        m_after_eq = re.search(<span class="hljs-string">r"=\s*(.*)$"</span>, text, re.DOTALL)
        expr = m_after_eq.group(<span class="hljs-number">1</span>).rstrip() <span class="hljs-keyword">if</span> m_after_eq <span class="hljs-keyword">else</span> <span class="hljs-literal">None</span>

    <span class="hljs-keyword">return</span> func_name, expr


<span class="hljs-keyword">import</span> requests
mode_udf = (requests.get(<span class="hljs-string">"https://raw.githubusercontent.com/daxlib/daxlib/refs/heads/main/packages/k/kolosko.summarystats/0.1.0/lib/functions.tmdl"</span>).text)
print(extract_dax_function(mode_udf)[<span class="hljs-number">1</span>])
</code></pre>
<p>Next we take that and apply it to a published semantic model.</p>
<pre><code class="lang-python">
name = extract_dax_function(mode_udf)[<span class="hljs-number">0</span>]
expression = extract_dax_function(mode_udf)[<span class="hljs-number">1</span>]
dataset = <span class="hljs-string">"Sales_udf_sandeep"</span> <span class="hljs-comment">#dataset name/id</span>
workspace=<span class="hljs-literal">None</span> <span class="hljs-comment">#if notebook is in the same workspace as the dataset, else workspace name/id</span>

<span class="hljs-keyword">with</span> connect_semantic_model(dataset=dataset, readonly=<span class="hljs-literal">False</span>, workspace=workspace) <span class="hljs-keyword">as</span> tom:
    tom.set_user_defined_function(name=name, expression=expression)

udf_df = labs.list_user_defined_functions(dataset=dataset)
display(udf_df)
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1760638991067/922f4f21-ebd5-48f5-8b86-9f946e04c88d.png" alt class="image--center mx-auto" /></p>
<p>Let’s check if it actually works:</p>
<pre><code class="lang-python">dax = <span class="hljs-string">"""

EVALUATE
{ Kolosko.SummaryStats.MODE(Sales, Sales[Net Price]) }

"""</span>
fabric.evaluate_dax(dax_string=dax, dataset=dataset)
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1760639063467/b41a35f5-499f-4c96-b287-9db0aa2044ef.png" alt class="image--center mx-auto" /></p>
<div data-node-type="callout">
<div data-node-type="callout-emoji">💡</div>
<div data-node-type="callout-text">It should be noted that, DAX UDFs can only be applied to semantic models with 1702 compatibility level. If it’s less than &lt;1702 you will get an error.</div>
</div>

<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1760639336138/032abe6b-34b0-433a-837c-e4602f99647b.png" alt class="image--center mx-auto" /></p>
<h2 id="heading-centralizing-udfs">Centralizing UDFs</h2>
<p>We can extend this further by automating the process. If you define one or more semantic models containing all the UDFs, SLL can read them and apply them to a list of target models.</p>
<p>In the function below:</p>
<ul>
<li><p>It checks for compatibility level first</p>
</li>
<li><p>If it’s &lt;1702, it’s upgraded to 1702</p>
</li>
<li><p>Waits for 5 seconds (for the update to go through)</p>
</li>
<li><p>Applies the UDFs from the source semantic model to the target semantic model(s)</p>
</li>
</ul>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> time

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">set_udf_with_compatibility_check</span>(<span class="hljs-params">tom, name, expression, max_retries=<span class="hljs-number">1</span></span>):</span>
    <span class="hljs-string">"""
    Set UDF with automatic compatibility level upgrade if needed
    """</span>
    <span class="hljs-keyword">for</span> attempt <span class="hljs-keyword">in</span> range(max_retries + <span class="hljs-number">1</span>):
        <span class="hljs-keyword">try</span>:
            tom.set_user_defined_function(name=name, expression=expression)
            print(<span class="hljs-string">f"Successfully set UDF: <span class="hljs-subst">{name}</span>"</span>)
            <span class="hljs-keyword">return</span> <span class="hljs-literal">True</span>

        <span class="hljs-keyword">except</span> ValueError <span class="hljs-keyword">as</span> e:
            <span class="hljs-keyword">if</span> <span class="hljs-string">"compatibility level of at least 1702"</span> <span class="hljs-keyword">in</span> str(e) <span class="hljs-keyword">and</span> attempt == <span class="hljs-number">0</span>:
                print(<span class="hljs-string">f"Compatibility level error for <span class="hljs-subst">{name}</span>. Upgrading to 1702..."</span>)
                tom.set_compatibility_level(<span class="hljs-number">1702</span>)
                time.sleep(<span class="hljs-number">5</span>)  <span class="hljs-comment"># Wait for the model update to go through before retrying</span>
                <span class="hljs-keyword">continue</span>  <span class="hljs-comment"># Retry</span>
            <span class="hljs-keyword">else</span>:
                print(<span class="hljs-string">f"Failed to set UDF <span class="hljs-subst">{name}</span>: <span class="hljs-subst">{e}</span>"</span>)
                <span class="hljs-keyword">return</span> <span class="hljs-literal">False</span>

        <span class="hljs-keyword">except</span> Exception <span class="hljs-keyword">as</span> e:
            print(<span class="hljs-string">f"Unexpected error setting UDF <span class="hljs-subst">{name}</span>: <span class="hljs-subst">{e}</span>"</span>)
            <span class="hljs-keyword">return</span> <span class="hljs-literal">False</span>

    <span class="hljs-keyword">return</span> <span class="hljs-literal">False</span>


<span class="hljs-keyword">for</span> i, row <span class="hljs-keyword">in</span> udf_df.iterrows():
    name = row[<span class="hljs-string">'Function Name'</span>]
    expression = row[<span class="hljs-string">'Expression'</span>]
    dataset = <span class="hljs-string">'order_analysis'</span>
    workspace = <span class="hljs-literal">None</span>

    <span class="hljs-keyword">with</span> connect_semantic_model(dataset=dataset, readonly=<span class="hljs-literal">False</span>, workspace=workspace) <span class="hljs-keyword">as</span> tom:
        set_udf_with_compatibility_check(tom, name, expression)
</code></pre>
<p>This could be a powerful pattern. You could use several approaches to centralize the UDFs:</p>
<ul>
<li><p>Define a set of golden/centralized semantic models in a workspace with UDFs defined.</p>
</li>
<li><p>Bring these models into other models as composite models</p>
</li>
<li><p>Or use notebooks to continuously update and apply the UDFs to downstream models, as shown in the sketch after this list</p>
</li>
</ul>
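<p>Putting the pieces together, here is a minimal sketch of the fan-out: read the UDFs from a golden model and push them to a list of downstream models. The model names are hypothetical, and <code>set_udf_with_compatibility_check</code> is the helper defined above.</p>
<pre><code class="lang-python">import sempy_labs as labs
from sempy_labs.tom import connect_semantic_model

GOLDEN_MODEL = "udf_golden_model"            # hypothetical model holding the approved UDFs
TARGET_MODELS = ["sales", "order_analysis"]  # hypothetical downstream models

# Read every UDF from the golden model once
udf_df = labs.list_user_defined_functions(dataset=GOLDEN_MODEL)

# Push each UDF to each target model, reusing the helper defined above
for target in TARGET_MODELS:
    with connect_semantic_model(dataset=target, readonly=False) as tom:
        for _, row in udf_df.iterrows():
            set_udf_with_compatibility_check(tom, row["Function Name"], row["Expression"])
</code></pre>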
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1760650827086/c1d258ff-7c37-4d15-9984-d8d95df9da74.png" alt class="image--center mx-auto" /></p>
<p>Below is another approach:</p>
<h2 id="heading-using-notebooks-for-cataloging">Using notebooks for cataloging</h2>
<p>I <a target="_blank" href="https://pawarbi.github.io/blog/powerbi/dax/daylightsavings/2023/03/29/powerbi-dax-adjust-daylight-savings.html">wrote a blog</a> a while ago on getting daylight-saving-adjusted US time. I converted that logic into a UDF. The use case here is to make it available to all users so that, based on their chosen time zone, they get the daylight-adjusted US time.</p>
<pre><code class="lang-python">    ( tz: STRING ) =&gt;
    VAR _utcnow   = UTCNOW()
    VAR _todayUTC = TRUNC(_utcnow)
    VAR _y        = YEAR(_todayUTC)

    /* US DST window: <span class="hljs-number">2</span>nd Sun Mar to <span class="hljs-number">1</span>st Sun Nov (UTC-date granularity) */
    VAR _mar1     = DATE(_y, <span class="hljs-number">3</span>, <span class="hljs-number">1</span>)
    VAR _nov1     = DATE(_y,<span class="hljs-number">11</span>, <span class="hljs-number">1</span>)
    VAR _secondSunMar = (_mar1 + MOD(<span class="hljs-number">8</span> - WEEKDAY(_mar1), <span class="hljs-number">7</span>)) + <span class="hljs-number">7</span>
    VAR _firstSunNov  =  _nov1 + MOD(<span class="hljs-number">8</span> - WEEKDAY(_nov1), <span class="hljs-number">7</span>)
    VAR _isDST = _todayUTC &gt;= _secondSunMar &amp;&amp; _todayUTC &lt; _firstSunNov

    /* offsets (hours to ADD to UTC) */
    VAR _std =
        SWITCH(TRUE(),
            tz=<span class="hljs-string">"useastern"</span>,  <span class="hljs-number">-5</span>,
            tz=<span class="hljs-string">"uscentral"</span>,  <span class="hljs-number">-6</span>,
            tz=<span class="hljs-string">"usmountain"</span>, <span class="hljs-number">-7</span>,
            tz=<span class="hljs-string">"usarizona"</span>,  <span class="hljs-number">-7</span>,
            tz=<span class="hljs-string">"uspacific"</span>,  <span class="hljs-number">-8</span>,
            tz=<span class="hljs-string">"usalaska"</span>,   <span class="hljs-number">-9</span>,
            tz=<span class="hljs-string">"ushawaii"</span>,  <span class="hljs-number">-10</span>,
            <span class="hljs-number">-8</span>  // default
        )
    VAR _dst =
        SWITCH(TRUE(),
            tz=<span class="hljs-string">"useastern"</span>,  <span class="hljs-number">-4</span>,
            tz=<span class="hljs-string">"uscentral"</span>,  <span class="hljs-number">-5</span>,
            tz=<span class="hljs-string">"usmountain"</span>, <span class="hljs-number">-6</span>,
            tz=<span class="hljs-string">"usarizona"</span>,  <span class="hljs-number">-7</span>,   /* no DST */
            tz=<span class="hljs-string">"uspacific"</span>,  <span class="hljs-number">-7</span>,
            tz=<span class="hljs-string">"usalaska"</span>,   <span class="hljs-number">-8</span>,
            tz=<span class="hljs-string">"ushawaii"</span>,  <span class="hljs-number">-10</span>,   /* no DST */
            <span class="hljs-number">-7</span>  // default
        )
    VAR _usesDST = NOT (tz IN {<span class="hljs-string">"usarizona"</span>,<span class="hljs-string">"ushawaii"</span>})
    VAR _offsetHours = IF(_usesDST &amp;&amp; _isDST, _dst, _std)

    RETURN
        _utcnow + (_offsetHours / <span class="hljs-number">24.0</span>)

</code></pre>
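<p>To register and smoke-test this UDF, you can reuse the same SLL pattern shown earlier. A minimal sketch; the function name <code>TimeUtils.USLocalNow</code> is one I made up for illustration, and you would paste the full expression above into <code>udf_expression</code>.</p>
<pre><code class="lang-python">import sempy.fabric as fabric
from sempy_labs.tom import connect_semantic_model

udf_name = "TimeUtils.USLocalNow"   # hypothetical name - pick your own namespace
udf_expression = """( tz: STRING ) =&gt; ..."""  # paste the full expression from above

with connect_semantic_model(dataset="Sales_udf_sandeep", readonly=False) as tom:
    tom.set_user_defined_function(name=udf_name, expression=udf_expression)

# Quick smoke test: DST-adjusted US Eastern time
fabric.evaluate_dax(
    dataset="Sales_udf_sandeep",
    dax_string='EVALUATE { TimeUtils.USLocalNow("useastern") }',
)
</code></pre>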
<p>You can use Fabric notebooks for documentation and embed them into an Org App so self-service users can browse the list of available UDFs and copy the ones they need. In the catalog, you can add all the relevant details users need to understand and use the functions. If you want to make it fancy, you can even include working examples, thanks to Semantic Link Labs!</p>
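<p>For building the catalog itself, SLL makes it a one-liner per model. A minimal sketch that collects the UDFs from a list of models (hypothetical names) into a single dataframe you can document and publish:</p>
<pre><code class="lang-python">import pandas as pd
import sempy_labs as labs

models = ["Sales_udf_sandeep", "order_analysis"]  # models to catalog

# Pull the UDFs from each model and tag them with their source model
catalog = pd.concat(
    [labs.list_user_defined_functions(dataset=m).assign(Model=m) for m in models],
    ignore_index=True,
)
display(catalog)
</code></pre>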
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1760649228487/52c2476e-7230-41ee-a10b-18882e3194db.png" alt class="image--center mx-auto" /></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1760650936296/be57b4ee-632b-4530-9c54-847e985d4cf6.png" alt class="image--center mx-auto" /></p>
<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" href="https://youtu.be/yll88IYtS2E">https://youtu.be/yll88IYtS2E</a></div>
<p> </p>
<p>These are just some ideas. DAX UDFs are super powerful, and automating and centralizing them will make them even more useful.</p>
]]></content:encoded></item><item><title><![CDATA[Access Fabric Lakehouse With Onelake SAS]]></title><description><![CDATA[Delegated access to Fabric Onelake using short-lived shared access signature (SAS) was announced last year. It allows secure, short-term, delegated access to files and folders in Onelake. A OneLake SAS can provide temporary access to applications tha...]]></description><link>https://fabric.guru/access-fabric-lakehouse-with-onelake-sas</link><guid isPermaLink="true">https://fabric.guru/access-fabric-lakehouse-with-onelake-sas</guid><category><![CDATA[onelake]]></category><category><![CDATA[microsoftfabric]]></category><category><![CDATA[SAS]]></category><category><![CDATA[lakehouse]]></category><dc:creator><![CDATA[Sandeep Pawar]]></dc:creator><pubDate>Thu, 18 Sep 2025 22:01:46 GMT</pubDate><content:encoded><![CDATA[<p>Delegated access to Fabric Onelake using short-lived shared access signature (SAS) was <a target="_blank" href="https://blog.fabric.microsoft.com/en-us/blog/onelake-shared-access-signatures-sas-now-available-in-public-preview/">announced last year</a>. It allows secure, short-term, delegated access to files and folders in Onelake. A OneLake SAS can provide temporary access to applications that don't support Microsoft Entra. These applications can then load data or serve as proxies between other customer applications or software development companies.[<a target="_blank" href="https://learn.microsoft.com/en-us/fabric/onelake/how-to-create-a-onelake-shared-access-signature">*</a>]. If you are building client-side applications, this is a great way to securely give read/write access to the data. You can learn more about it here: <a target="_blank" href="https://blog.fabric.microsoft.com/en-us/blog/onelake-shared-access-signatures-sas-now-available-in-public-preview/">OneLake shared access signatures (SAS) now available in public preview | Microsoft Fabric Blog | Microsoft Fabric</a></p>
<p>There are three steps to generate the SAS:</p>
<ul>
<li><p>Enable it in the workspace settings:</p>
<p>  <img src="https://dataplatformblogwebfd-d3h9cbawf0h8ecgf.b01.azurefd.net/wp-content/uploads/2024/09/image-51.png" alt /></p>
</li>
<li><p>Generate the User Delegation Key (<a target="_blank" href="https://learn.microsoft.com/en-us/rest/api/storageservices/get-user-delegation-key">UDK</a>) : Read more about it <a target="_blank" href="https://learn.microsoft.com/en-us/rest/api/storageservices/get-user-delegation-key">here</a>.</p>
</li>
<li><p>Build the SAS URL : This can be the tricky part because you have to parse and create XML. <a target="_blank" href="https://learn.microsoft.com/en-us/fabric/onelake/how-to-create-a-onelake-shared-access-signature#construct-a-user-delegation-sas">This documentation</a> provides all the details.</p>
</li>
</ul>
<p>Once you have the SAS URL, you just make a GET request, like with any other API, to access the data.</p>
<p>Below, I show how to do this using a service principal for demonstration purposes. For a user-facing client application, use the appropriate authentication method to generate the token. Also note that you should store secrets in Azure Key Vault; for this demo I am hardcoding them to keep things simple. <a target="_blank" href="https://learn.microsoft.com/en-us/fabric/onelake/onelake-shared-access-signature-overview">Read</a> the best practices before implementing.</p>
<h2 id="heading-data">Data</h2>
<p>I have a <code>finance.csv</code> file in a Fabric Lakehouse that I want to read using a OneLake SAS.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1758231764007/df9d11c1-f1d6-4b3b-a7ed-049a7320a3d2.png" alt class="image--center mx-auto" /></p>
<h2 id="heading-code">Code</h2>
<p>I created a service principal and gave it granular access to the folder.</p>
<div data-node-type="callout">
<div data-node-type="callout-emoji">💡</div>
<div data-node-type="callout-text">For generating the UDK, you have to use the regional Onelake endpoint instead of the global endpoint, i.e. https://{REGION}-onelake.blob.fabric.microsoft.com instead of https://onelake.blob.fabric.microsoft.com</div>
</div>

<pre><code class="lang-python">%pip install azure-identity --q

<span class="hljs-keyword">import</span> base64
<span class="hljs-keyword">import</span> hmac
<span class="hljs-keyword">import</span> hashlib
<span class="hljs-keyword">import</span> datetime <span class="hljs-keyword">as</span> dt
<span class="hljs-keyword">import</span> requests
<span class="hljs-keyword">from</span> urllib.parse <span class="hljs-keyword">import</span> quote
<span class="hljs-keyword">from</span> xml.etree <span class="hljs-keyword">import</span> ElementTree <span class="hljs-keyword">as</span> ET
<span class="hljs-keyword">from</span> azure.identity <span class="hljs-keyword">import</span> ClientSecretCredential
<span class="hljs-keyword">from</span> email.utils <span class="hljs-keyword">import</span> formatdate
<span class="hljs-keyword">import</span> io
<span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
<span class="hljs-keyword">import</span> logging


TENANT_ID     = <span class="hljs-string">""</span>
CLIENT_ID     = <span class="hljs-string">""</span>
CLIENT_SECRET = <span class="hljs-string">""</span>
REGION        = <span class="hljs-string">"centralus"</span> <span class="hljs-comment">#IMPORTANT : use the regional endpoint, i.e. capacity region</span>
WORKSPACE_ID  = <span class="hljs-string">""</span> 
ITEM_ID       = <span class="hljs-string">""</span> <span class="hljs-comment">#lakehouse id</span>
PATH          = <span class="hljs-string">"Files/raw_data/finance.csv"</span> <span class="hljs-comment"># relative path</span>

cred = ClientSecretCredential(TENANT_ID, CLIENT_ID, CLIENT_SECRET)
STORAGE_BEARER = cred.get_token(<span class="hljs-string">"https://storage.azure.com/.default"</span>).token
FABRIC_BEARER  = cred.get_token(<span class="hljs-string">"https://api.fabric.microsoft.com/.default"</span>).token

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">get_udk</span>():</span>

    token_obj = cred.get_token(<span class="hljs-string">"https://storage.azure.com/.default"</span>)
    token_expiry = dt.datetime.fromtimestamp(token_obj.expires_on, dt.timezone.utc)

    now = dt.datetime.now(dt.timezone.utc)
    udk_expiry = now + dt.timedelta(minutes=<span class="hljs-number">55</span>)

    <span class="hljs-keyword">if</span> token_expiry &lt;= udk_expiry:
        <span class="hljs-keyword">raise</span> ValueError(<span class="hljs-string">f"OAuth token expires before UDK. Token: <span class="hljs-subst">{token_expiry}</span>, UDK: <span class="hljs-subst">{udk_expiry}</span>"</span>)


    url = <span class="hljs-string">f"https://<span class="hljs-subst">{REGION}</span>-onelake.blob.fabric.microsoft.com/?restype=service&amp;comp=userdelegationkey"</span>
    st  = (now - dt.timedelta(minutes=<span class="hljs-number">2</span>)).strftime(<span class="hljs-string">"%Y-%m-%dT%H:%M:%SZ"</span>)
    se  = (now + dt.timedelta(minutes=<span class="hljs-number">55</span>)).strftime(<span class="hljs-string">"%Y-%m-%dT%H:%M:%SZ"</span>)
    xml = <span class="hljs-string">f'&lt;?xml version="1.0"?&gt;&lt;KeyInfo&gt;&lt;Start&gt;<span class="hljs-subst">{st}</span>&lt;/Start&gt;&lt;Expiry&gt;<span class="hljs-subst">{se}</span>&lt;/Expiry&gt;&lt;/KeyInfo&gt;'</span>
    hdr = {<span class="hljs-string">"Authorization"</span>: <span class="hljs-string">f"Bearer <span class="hljs-subst">{STORAGE_BEARER}</span>"</span>, <span class="hljs-string">"x-ms-version"</span>:<span class="hljs-string">"2022-11-02"</span>,
           <span class="hljs-string">"x-ms-date"</span>: formatdate(usegmt=<span class="hljs-literal">True</span>), <span class="hljs-string">"Content-Type"</span>:<span class="hljs-string">"application/xml"</span>}
    r = requests.post(url, data=xml, headers=hdr, timeout=<span class="hljs-number">20</span>); r.raise_for_status()
    x = ET.fromstring(r.text)

    logging.info(<span class="hljs-string">f"UDK generated successfully, expires at <span class="hljs-subst">{se}</span>"</span>)
    <span class="hljs-keyword">return</span> {t: x.findtext(t) <span class="hljs-keyword">for</span> t <span class="hljs-keyword">in</span> [<span class="hljs-string">"SignedOid"</span>,<span class="hljs-string">"SignedTid"</span>,<span class="hljs-string">"SignedStart"</span>,<span class="hljs-string">"SignedExpiry"</span>,<span class="hljs-string">"SignedService"</span>,<span class="hljs-string">"SignedVersion"</span>,<span class="hljs-string">"Value"</span>]}

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">build_file_sas_guid</span>(<span class="hljs-params">udk: dict, workspace_id: str, item_id: str, path: str, perms=<span class="hljs-string">"r"</span></span>):</span>

    sv  = udk[<span class="hljs-string">"SignedVersion"</span>]
    sr  = <span class="hljs-string">"b"</span>
    spr = <span class="hljs-string">"https"</span>
    key = base64.b64decode(udk[<span class="hljs-string">"Value"</span>])

    canonical = <span class="hljs-string">f"/blob/onelake/<span class="hljs-subst">{workspace_id}</span>/<span class="hljs-subst">{item_id}</span>/<span class="hljs-subst">{path}</span>"</span>

    now = dt.datetime.now(dt.timezone.utc)

    signed_start = dt.datetime.strptime(udk[<span class="hljs-string">"SignedStart"</span>], <span class="hljs-string">"%Y-%m-%dT%H:%M:%SZ"</span>).replace(tzinfo=dt.timezone.utc)
    signed_expiry = dt.datetime.strptime(udk[<span class="hljs-string">"SignedExpiry"</span>], <span class="hljs-string">"%Y-%m-%dT%H:%M:%SZ"</span>).replace(tzinfo=dt.timezone.utc)

    st = max(now - dt.timedelta(minutes=<span class="hljs-number">2</span>), signed_start)
    se = min(now + dt.timedelta(minutes=<span class="hljs-number">50</span>), signed_expiry)

    st_s = st.replace(tzinfo=<span class="hljs-literal">None</span>).strftime(<span class="hljs-string">"%Y-%m-%dT%H:%M:%SZ"</span>)
    se_s = se.replace(tzinfo=<span class="hljs-literal">None</span>).strftime(<span class="hljs-string">"%Y-%m-%dT%H:%M:%SZ"</span>)

    parts = [
        perms, st_s, se_s, canonical,
        udk[<span class="hljs-string">"SignedOid"</span>], udk[<span class="hljs-string">"SignedTid"</span>], udk[<span class="hljs-string">"SignedStart"</span>], udk[<span class="hljs-string">"SignedExpiry"</span>],
        udk[<span class="hljs-string">"SignedService"</span>], udk[<span class="hljs-string">"SignedVersion"</span>],
        <span class="hljs-string">""</span>, <span class="hljs-string">""</span>, <span class="hljs-string">""</span>,
        <span class="hljs-string">""</span>,
        spr,
        sv,
        sr,
        <span class="hljs-string">""</span>,
        <span class="hljs-string">""</span>,
        <span class="hljs-string">""</span>, <span class="hljs-string">""</span>, <span class="hljs-string">""</span>, <span class="hljs-string">""</span>, <span class="hljs-string">""</span>
    ]
    sig = base64.b64encode(hmac.new(key, <span class="hljs-string">"\n"</span>.join(parts).encode(), hashlib.sha256).digest()).decode()

    enc = <span class="hljs-keyword">lambda</span> v: quote(v, safe=<span class="hljs-string">""</span>)
    qs = (
        <span class="hljs-string">f"sp=<span class="hljs-subst">{perms}</span>&amp;st=<span class="hljs-subst">{enc(st_s)}</span>&amp;se=<span class="hljs-subst">{enc(se_s)}</span>"</span>
        <span class="hljs-string">f"&amp;skoid=<span class="hljs-subst">{enc(udk[<span class="hljs-string">'SignedOid'</span>])}</span>&amp;sktid=<span class="hljs-subst">{enc(udk[<span class="hljs-string">'SignedTid'</span>])}</span>"</span>
        <span class="hljs-string">f"&amp;skt=<span class="hljs-subst">{enc(udk[<span class="hljs-string">'SignedStart'</span>])}</span>&amp;ske=<span class="hljs-subst">{enc(udk[<span class="hljs-string">'SignedExpiry'</span>])}</span>"</span>
        <span class="hljs-string">f"&amp;sks=<span class="hljs-subst">{udk[<span class="hljs-string">'SignedService'</span>]}</span>&amp;skv=<span class="hljs-subst">{udk[<span class="hljs-string">'SignedVersion'</span>]}</span>"</span>
        <span class="hljs-string">f"&amp;sv=<span class="hljs-subst">{sv}</span>&amp;sr=<span class="hljs-subst">{sr}</span>&amp;spr=<span class="hljs-subst">{spr}</span>&amp;sig=<span class="hljs-subst">{enc(sig)}</span>"</span>
    )

    path_url = quote(path, safe=<span class="hljs-string">"/"</span>)
    sas_url = <span class="hljs-string">f"https://onelake.blob.fabric.microsoft.com/<span class="hljs-subst">{workspace_id}</span>/<span class="hljs-subst">{item_id}</span>/<span class="hljs-subst">{path_url}</span>?<span class="hljs-subst">{qs}</span>"</span>

    logging.info(<span class="hljs-string">f"SAS URL generated successfully, expires at <span class="hljs-subst">{se_s}</span>"</span>)
    <span class="hljs-keyword">return</span> sas_url


<span class="hljs-keyword">try</span>:
    udk = get_udk()
    sas_url = build_file_sas_guid(udk, WORKSPACE_ID, ITEM_ID, PATH, perms=<span class="hljs-string">"r"</span>)
    print(<span class="hljs-string">"SAS URL:"</span>, sas_url)

    r = requests.get(sas_url, timeout=<span class="hljs-number">30</span>)
    print(<span class="hljs-string">"GET:"</span>, r.status_code)

    <span class="hljs-keyword">if</span> r.status_code == <span class="hljs-number">200</span>:
        r.raise_for_status()
        df = pd.read_csv(io.BytesIO(r.content))
        df.head(<span class="hljs-number">2</span>)
    <span class="hljs-keyword">else</span>:
        print(<span class="hljs-string">f"Error Status code: <span class="hljs-subst">{r.status_code}</span>"</span>)
        print(<span class="hljs-string">f"Response: <span class="hljs-subst">{r.text}</span>"</span>)

<span class="hljs-keyword">except</span> Exception <span class="hljs-keyword">as</span> e:
    print(<span class="hljs-string">f"Error : <span class="hljs-subst">{e}</span>"</span>)
    <span class="hljs-keyword">import</span> traceback
    traceback.print_exc()
</code></pre>
<h2 id="heading-example">Example:</h2>
<p>I was able to generate the SAS and read the CSV from a Google Colab notebook:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1758232416915/ac35e48c-9bd2-4901-81c6-25263da0eb2e.png" alt class="image--center mx-auto" /></p>
<h2 id="heading-writing-files">Writing Files</h2>
<p>For writing back to the Lakehouse, follow the same steps to generate the SAS and use the PUT method:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> requests
sas_url = build_file_sas_guid(udk, WORKSPACE_ID, ITEM_ID, <span class="hljs-string">"Files/raw_data/my_new_file.csv"</span>, perms=<span class="hljs-string">"cw"</span>)
data_to_write = <span class="hljs-string">"column1,column2,column3\nvalue1,value2,value3"</span>

headers = {
    <span class="hljs-string">'x-ms-blob-type'</span>: <span class="hljs-string">'BlockBlob'</span>,
    <span class="hljs-string">'Content-Type'</span>: <span class="hljs-string">'text/csv'</span>
}

response = requests.put(sas_url, data=data_to_write, headers=headers, timeout=<span class="hljs-number">30</span>)
response.raise_for_status()
print(<span class="hljs-string">"Upload status:"</span>, response.status_code)
<span class="hljs-comment">## 201 if successful</span>
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1758232605348/6dc983d6-42ec-4e5b-b14e-66b38bc62efb.png" alt class="image--center mx-auto" /></p>
<p>A OneLake SAS has a maximum lifetime of one hour.</p>
<h2 id="heading-references">References:</h2>
<ul>
<li><p><a target="_blank" href="https://blog.fabric.microsoft.com/en-us/blog/onelake-shared-access-signatures-sas-now-available-in-public-preview/">OneLake shared access signatures (SAS) now available in public preview | Microsoft Fabric Blog | Microsoft Fabric</a></p>
</li>
<li><p><a target="_blank" href="https://learn.microsoft.com/en-us/fabric/onelake/onelake-shared-access-signature-overview">What is a OneLake shared access signature (SAS) - Microsoft Fabric | Microsoft Learn</a></p>
</li>
<li><p><a target="_blank" href="https://learn.microsoft.com/en-us/rest/api/storageservices/get-user-delegation-key">Get User Delegation Key (REST API) - Azure Storage | Microsoft Learn</a></p>
</li>
</ul>
]]></content:encoded></item><item><title><![CDATA[Direct Lake Incremental Framing Effect]]></title><description><![CDATA[I was writing a blog on Direct Lake incremental framing but my colleague Chris Webb beat me to it and just published an excellent blog. To summarize, with incremental framing, when a Direct Lake semantic model refreshes, it analyzes the Delta log to ...]]></description><link>https://fabric.guru/direct-lake-incremental-framing-effect</link><guid isPermaLink="true">https://fabric.guru/direct-lake-incremental-framing-effect</guid><category><![CDATA[incremental refresh]]></category><category><![CDATA[microsoftfabric]]></category><category><![CDATA[DirectLake]]></category><dc:creator><![CDATA[Sandeep Pawar]]></dc:creator><pubDate>Mon, 01 Sep 2025 17:00:33 GMT</pubDate><content:encoded><![CDATA[<p>I was writing a blog on Direct Lake incremental framing but my colleague Chris Webb beat me to it and just <a target="_blank" href="https://blog.crossjoin.co.uk/2025/08/31/performance-testing-power-bi-direct-lake-models-revisited-ensuring-worst-case-performance/">published an excellent blog</a>. To summarize, with incremental framing, when a Direct Lake semantic model refreshes, it analyzes the Delta log to see what's changed since the last refresh:</p>
<ul>
<li><p>It identifies which parquet files are new, which have been modified, and which have been removed</p>
</li>
<li><p>For unchanged data, it maintains the existing data in memory (preserving dictionaries and other optimizations)</p>
</li>
<li><p>It only reloads the data from new or modified parquet files</p>
</li>
<li><p>It removes from memory any data from deleted parquet files</p>
</li>
</ul>
<p>You can read more about it in the <a target="_blank" href="https://learn.microsoft.com/en-us/fabric/fundamentals/direct-lake-overview#framing">documentation</a>.</p>
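<p>The practical implication is that how you write to the Delta table determines how much incremental framing helps. A minimal sketch contrasting the two write patterns, assuming <code>new_rows</code> and <code>all_rows</code> are Spark DataFrames and <code>sales</code> is the table behind the model:</p>
<pre><code class="lang-python"># new_rows / all_rows are assumed to be Spark DataFrames

# Append: adds new parquet files; data from unchanged files stays
# warm in memory, so only the new files are loaded on reframe
new_rows.write.format("delta").mode("append").saveAsTable("sales")

# Overwrite: replaces every parquet file, so the whole table must be
# reloaded on the next reframe - no incremental framing benefit
all_rows.write.format("delta").mode("overwrite").saveAsTable("sales")
</code></pre>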
<p>I will instead highlight another update based on the work by two of my other colleagues, <a target="_blank" href="https://dax.tips/">Phil Seamark</a> and <a target="_blank" href="https://www.elegantbi.com/about">Michael Kovalsky</a>. Semantic Link Labs’ <code>.delta_analyzer_history()</code> function estimates the incremental framing effect based on the updates to the delta table: 0% means no benefit at all and 100% means highly effective. Note that this is based on the changes to the delta table and does not account for any refreshes/updates made to the semantic model.</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> sempy_labs <span class="hljs-keyword">as</span> labs
labs.delta_analyzer_history(<span class="hljs-string">"sales"</span>).tail(<span class="hljs-number">2</span>) <span class="hljs-comment">#sales is the name of the table in the lakehouse attached to the notebook</span>
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1756745664490/7c5c1625-8e39-4bc6-8297-963dae02f3f1.png" alt class="image--center mx-auto" /></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1756745772272/13d0c3e4-3948-4ae1-b53e-2dc72ef871ae.png" alt class="image--center mx-auto" /></p>
<h2 id="heading-references">References:</h2>
<ul>
<li><p><a target="_blank" href="https://github.com/microsoft/semantic-link-labs">microsoft/semantic-link-labs: Early access to new features for Microsoft Fabric's Semantic Link.</a></p>
</li>
<li><p><a target="_blank" href="https://learn.microsoft.com/en-us/fabric/fundamentals/direct-lake-overview">Direct Lake overview - Microsoft Fabric | Microsoft Learn</a></p>
</li>
<li><p><a target="_blank" href="https://blog.crossjoin.co.uk/2025/08/31/performance-testing-power-bi-direct-lake-models-revisited-ensuring-worst-case-performance/">Performance testing Power BI Direct Lake models revisited: ensuring worst-case performance</a></p>
</li>
<li><p><a target="_blank" href="https://blog.crossjoin.co.uk/2023/07/09/performance-testing-power-bi-direct-lake-mode-datasets-in-fabric/">Chris Webb's BI Blog: Performance Testing Power BI Direct Lake Mode Datasets In Fabric</a></p>
</li>
</ul>
]]></content:encoded></item><item><title><![CDATA[Unstructured To Structured : Using Fabric AI Functions For Contextual Data Quality Check]]></title><description><![CDATA[Earlier this year, I read a fantastic blog/newsletter by Jack Vanlightly about contextual data quality. He shared his thoughts on a research paper “Big data quality framework”. The original authors argue that we often focus on data quality metrics li...]]></description><link>https://fabric.guru/unstructured-to-structured-using-fabric-ai-functions-for-contextual-data-quality-check</link><guid isPermaLink="true">https://fabric.guru/unstructured-to-structured-using-fabric-ai-functions-for-contextual-data-quality-check</guid><category><![CDATA[microsoft fabric]]></category><category><![CDATA[ai functions]]></category><category><![CDATA[llm]]></category><category><![CDATA[data-quality]]></category><dc:creator><![CDATA[Sandeep Pawar]]></dc:creator><pubDate>Fri, 01 Aug 2025 21:55:12 GMT</pubDate><content:encoded><![CDATA[<p>Earlier this year, I read a fantastic blog/newsletter by <a target="_blank" href="https://jack-vanlightly.com/home/">Jack Vanlightly</a> about contextual data quality. He shared his thoughts on a research paper <a target="_blank" href="https://journalofbigdata.springeropen.com/articles/10.1186/s40537-021-00468-0">“Big data quality framework”</a>. The original authors argue that we often focus on data quality metrics like completeness and consistency but overlook contextual data quality issues. For example, in healthcare, a patient visit record might have perfect data quality: a valid patient ID, correct <a target="_blank" href="https://www.cdc.gov/nchs/icd/icd-10-cm/index.html">ICD-10 code</a> format (Z34.90 for pregnancy care), a proper visit date, and complete provider information. However, if that pregnancy care code is assigned to a 67-year-old male patient, there's a contextual data quality issue that traditional validation completely misses. Your EMR system might show green across all data quality dashboards while simultaneously creating impossible medical scenarios that could lead to incorrect treatments, insurance claim rejections, and regulatory reporting errors. This is where LLMs are great: they understand that pregnancy codes and elderly male patients don't make clinical sense together, catching the semantic inconsistencies that rigid rule-based validation never could.</p>
<p>I highly encourage you to read his blog and subscribe to his newsletter "<a target="_blank" href="https://www.hotds.dev/p/humans-of-the-data-sphere-issue-7?utm_source=publication-search">Humans of the Data Sphere</a>.”</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1754074479167/4bcdfd91-102a-4f6f-9fef-7cf59fa795ed.png" alt class="image--center mx-auto" /></p>
<p><em>Ref:</em> <a target="_blank" href="https://journalofbigdata.springeropen.com/articles/10.1186/s40537-021-00468-0"><em>Big data quality framework: a holistic approach to continuous quality management | Journal of Big Data | Full Text</em></a></p>
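<p>To make that gap concrete, here is a minimal sketch of a record that sails through format-level validation while being contextually impossible; the record and checks are invented for illustration:</p>
<pre><code class="lang-python">import re

# Invented visit record: every field is individually "valid"
visit = {"patient_id": "P-1042", "sex": "M", "age": 67,
         "icd10": "Z34.90",  # supervision of normal pregnancy
         "visit_date": "2025-07-14"}

# Traditional rule-based checks: completeness and format only
assert all(v not in (None, "") for v in visit.values())          # completeness: passes
assert re.fullmatch(r"[A-Z]\d{2}(\.\d{1,4})?", visit["icd10"])   # ICD-10 format: passes

# The contextual problem - a pregnancy code on a 67-year-old male -
# is invisible to these checks; that is the gap an LLM can fill
</code></pre>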
<p><a target="_blank" href="https://journalofbigdata.springeropen.com/articles/10.1186/s40537-021-00468-0">In this multi-part series, I will try to explore how AI/LLMs can possibly be used along with the rule-based DQ ch</a>ecks. In this first blog, I will focus on the first pass analysis, getting it set up using <a target="_blank" href="https://learn.microsoft.com/en-us/fabric/data-science/ai-functions/overview?tabs=pandas-pyspark%2Cpandas">AI Functions</a> in Microsoft Fabric and in the following blogs, I will operationalize it to create a robust evaluation framework.</p>
<h2 id="heading-fabric-ai-functions">Fabric AI Functions</h2>
<p>I have written about AI Functions <a target="_blank" href="https://fabric.guru/unstructured-to-structured-extracting-data-from-messy-excel-sheets-using-fabric-ai-function">before</a>. They let you use AI seamlessly in your data engineering applications with a single line of code. In this introductory blog, I will show how we can use AI Functions to check for contextual DQ issues.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1754076933891/e5a29a9e-1f11-45db-ba50-e83910ad3345.png" alt class="image--center mx-auto" /></p>
<h2 id="heading-data">Data</h2>
<p>The <a target="_blank" href="https://www.consumerfinance.gov/complaint/">Consumer Financial Protection Bureau (CFPB)</a> has comprehensive data on consumer complaints about financial products and services. It includes details like the product, sub-product, issue, complaint description, and more. When consumers log complaints, they select the product, sub-product, and issue categories. On the website, trends are shown for each product, sub-product, and issue based on the categories chosen by consumers. However, the website does not provide definitions for these categories, making it easy for consumers to assign their complaints to the wrong category. So, even though the data might be complete, the category assigned may not match the issue described which will also lead to incorrect trend analysis.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1754077329657/cabb8e1d-898b-4aa3-9fca-a3566b750c3e.png" alt class="image--center mx-auto" /></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1754077425920/67a3c723-2bd9-4be9-83a4-36d436383f3f.png" alt class="image--center mx-auto" /></p>
<p>I downloaded the last 6 months of data from the website (you can use the API as well, as sketched below) with complaint narratives (~490K records).</p>
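<p>If you prefer the API route, the CFPB exposes a public complaint-search API. The endpoint and parameters below are my reading of CFPB's public documentation, so treat them as assumptions and verify before use:</p>
<pre><code class="lang-python">import io
import pandas as pd
import requests

# Public CFPB complaint-search endpoint; verify the URL and parameter
# names against CFPB's API docs before relying on this
url = "https://www.consumerfinance.gov/data-research/consumer-complaints/search/api/v1/"
params = {
    "date_received_min": "2025-02-01",  # roughly the last 6 months
    "has_narrative": "true",            # only complaints with a narrative
    "format": "csv",
}
resp = requests.get(url, params=params, timeout=120)
resp.raise_for_status()
df = pd.read_csv(io.BytesIO(resp.content))
</code></pre>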
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1754077892118/b27d64da-a6e1-48e9-9058-68d54659f612.png" alt class="image--center mx-auto" /></p>
<p>To analyze the contextual DQ issues, we will:</p>
<ul>
<li><p>Use an LLM to review the Product, Sub-product, and complaint, and analyze whether the complaint matches the product &amp; sub-product assigned by the user</p>
</li>
<li><p>Return <code>true</code> or <code>false</code> based on the analysis</p>
</li>
<li><p>Return the suggested category</p>
</li>
<li><p>Return issue type (product mismatch, sub product mismatch or both)</p>
</li>
<li><p>Brief explanation</p>
</li>
<li><p>Flag for human evaluation if the LLM is unsure</p>
</li>
</ul>
<p>As I mentioned above, this will be a first pass analysis and in the following blogs I will refine with evaluation harness to improve the process.</p>
<h2 id="heading-prompt">Prompt</h2>
<p>The prompt follows a similar template to the one I have used in previous blogs: instructions, examples, and guidelines wrapped in tags, with the result returned as JSON in a defined schema.</p>
<pre><code class="lang-plaintext">&lt;INSTRUCTIONS&gt;
You are a data quality expert analyzing CFPB consumer complaints for product categorization accuracy.
Your task is to determine if the complaint narrative matches the assigned product/sub-product categories.
&lt;/INSTRUCTIONS&gt;

&lt;VALID_PRODUCTS_SUBPRODUCTS&gt;
Credit reporting or other personal consumer reports:
- Credit reporting
- Other personal consumer report

Debt collection:
- I do not know
- Credit card debt
- Other debt
- Telecommunications debt
- Rental debt
- Medical debt
- Auto debt
- Payday loan debt
- Federal student loan debt
- Private student loan debt
- Mortgage debt

Checking or savings account:
- Checking account
- Other banking product or service
- Savings account
- CD (Certificate of Deposit)

Credit card:
- General-purpose credit card or charge card
- Store credit card

Money transfer, virtual currency, or money service:
- Domestic (US) money transfer
- Mobile or digital wallet
- Virtual currency
- International money transfer
- Money order, traveler's check or cashier's check
- Check cashing service
- Foreign currency exchange

Student loan:
- Federal student loan servicing
- Private student loan

Mortgage:
- Conventional home mortgage
- FHA mortgage
- VA mortgage
- Home equity loan or line of credit (HELOC)
- Other type of mortgage
- USDA mortgage
- Manufactured home loan
- Reverse mortgage

Vehicle loan or lease:
- Loan
- Lease

Payday loan, title loan, personal loan, or advance loan:
- Installment loan
- Payday loan
- Personal line of credit
- Title loan
- Other advances of future income
- Earned wage access
- Pawn loan
- Tax refund anticipation loan or check

Prepaid card:
- General-purpose prepaid card
- Government benefit card
- Gift card
- Payroll card
- Student prepaid card

Debt or credit management:
- Debt settlement
- Credit repair services
- Mortgage modification or foreclosure avoidance
- Student loan debt relief
&lt;/VALID_PRODUCTS_SUBPRODUCTS&gt;

&lt;EXPECTED_OUTPUT&gt;
{
"is_correctly_categorized": true/false,
"issue_type": "correct|product_mismatch|subproduct_mismatch|both_incorrect",
"explanation": "brief explanation",
"suggested_product": "correct product if wrong",
"suggested_subproduct": "correct sub-product if wrong",
"should_review": true/false
}
&lt;/EXPECTED_OUTPUT&gt;

&lt;REVIEW_CRITERIA&gt;
Set should_review to true if:
- The complaint narrative is ambiguous or could fit multiple categories
- The technical/financial terminology is unclear or inconsistent
- Multiple financial products are mentioned making categorization difficult
- The complaint lacks sufficient detail to make a confident assessment
- You are uncertain about the correct categorization
&lt;/REVIEW_CRITERIA&gt;

Analyze the complaint and return ONLY valid JSON.
</code></pre>
<h2 id="heading-code">Code</h2>
<p>In my case, I used a Python notebook. However, if you use a PySpark notebook, you do not need to install any libraries; as announced recently, AI Functions are now part of Fabric Runtime 1.3.</p>
<pre><code class="lang-python">
%pip install -q --force-reinstall openai==<span class="hljs-number">1.30</span> <span class="hljs-number">2</span>&gt;/dev/null

%pip install -q --force-reinstall https://mmlspark.blob.core.windows.net/pip/<span class="hljs-number">1.0</span><span class="hljs-number">.12</span>-spark3<span class="hljs-number">.5</span>/synapseml_core<span class="hljs-number">-1.0</span><span class="hljs-number">.12</span>.dev1-py2.py3-none-any.whl <span class="hljs-number">2</span>&gt;/dev/null

%pip install -q --force-reinstall https://mmlspark.blob.core.windows.net/pip/<span class="hljs-number">1.0</span><span class="hljs-number">.12</span><span class="hljs-number">.2</span>-spark3<span class="hljs-number">.5</span>/synapseml_internal<span class="hljs-number">-1.0</span><span class="hljs-number">.12</span><span class="hljs-number">.2</span>.dev1-py2.py3-none-any.whl <span class="hljs-number">2</span>&gt;/dev/null

<span class="hljs-keyword">import</span> synapse.ml.aifunc <span class="hljs-keyword">as</span> aifunc
<span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
<span class="hljs-keyword">import</span> openai
<span class="hljs-keyword">from</span> synapse.ml.aifunc <span class="hljs-keyword">import</span> Conf

cols = [<span class="hljs-string">'Date received'</span>,<span class="hljs-string">'Product'</span>,<span class="hljs-string">'Sub-product'</span>,<span class="hljs-string">'Issue'</span>,<span class="hljs-string">'Sub-issue'</span>,<span class="hljs-string">'Consumer complaint narrative'</span>, <span class="hljs-string">'Complaint ID'</span>]
df = pd.read_csv(<span class="hljs-string">"/lakehouse/default/Files/dq_aifunction/complaints-2025-08-01_12_34.csv"</span>, usecols =cols)
df[<span class="hljs-string">'Date received'</span>] = pd.to_datetime(df[<span class="hljs-string">'Date received'</span>],errors=<span class="hljs-string">'coerce'</span>)
df = df.sort_values(by = [<span class="hljs-string">'Date received'</span>], ascending=<span class="hljs-literal">False</span>)
</code></pre>
<p>Steps:</p>
<ul>
<li><p>Pre process the data</p>
</li>
<li><p>Analyze the categories using AI functions</p>
</li>
<li><p>Parse JSON</p>
</li>
<li><p>Prepare output df</p>
</li>
</ul>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
<span class="hljs-keyword">import</span> json
<span class="hljs-keyword">import</span> re

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">clean_narrative_text</span>(<span class="hljs-params">text</span>):</span>
 <span class="hljs-keyword">if</span> pd.isna(text):
     <span class="hljs-keyword">return</span> <span class="hljs-string">""</span>
 cleaned = re.sub(<span class="hljs-string">r'XXXX+'</span>, <span class="hljs-string">'XXXX'</span>, str(text))
 cleaned = re.sub(<span class="hljs-string">r'\s+'</span>, <span class="hljs-string">' '</span>, cleaned).strip()
 <span class="hljs-keyword">return</span> cleaned

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">check_qlty_prmpt</span>():</span>
 prompt = <span class="hljs-string">"""
&lt;INSTRUCTIONS&gt;
You are a data quality expert analyzing CFPB consumer complaints for product categorization accuracy.
Your task is to determine if the complaint narrative matches the assigned product/sub-product categories.
&lt;/INSTRUCTIONS&gt;

&lt;VALID_PRODUCTS_SUBPRODUCTS&gt;
Credit reporting or other personal consumer reports:
- Credit reporting
- Other personal consumer report

Debt collection:
- I do not know
- Credit card debt
- Other debt
- Telecommunications debt
- Rental debt
- Medical debt
- Auto debt
- Payday loan debt
- Federal student loan debt
- Private student loan debt
- Mortgage debt

Checking or savings account:
- Checking account
- Other banking product or service
- Savings account
- CD (Certificate of Deposit)

Credit card:
- General-purpose credit card or charge card
- Store credit card

Money transfer, virtual currency, or money service:
- Domestic (US) money transfer
- Mobile or digital wallet
- Virtual currency
- International money transfer
- Money order, traveler's check or cashier's check
- Check cashing service
- Foreign currency exchange

Student loan:
- Federal student loan servicing
- Private student loan

Mortgage:
- Conventional home mortgage
- FHA mortgage
- VA mortgage
- Home equity loan or line of credit (HELOC)
- Other type of mortgage
- USDA mortgage
- Manufactured home loan
- Reverse mortgage

Vehicle loan or lease:
- Loan
- Lease

Payday loan, title loan, personal loan, or advance loan:
- Installment loan
- Payday loan
- Personal line of credit
- Title loan
- Other advances of future income
- Earned wage access
- Pawn loan
- Tax refund anticipation loan or check

Prepaid card:
- General-purpose prepaid card
- Government benefit card
- Gift card
- Payroll card
- Student prepaid card

Debt or credit management:
- Debt settlement
- Credit repair services
- Mortgage modification or foreclosure avoidance
- Student loan debt relief
&lt;/VALID_PRODUCTS_SUBPRODUCTS&gt;

&lt;EXPECTED_OUTPUT&gt;
{
"is_correctly_categorized": true/false,
"issue_type": "correct|product_mismatch|subproduct_mismatch|both_incorrect",
"explanation": "brief explanation",
"suggested_product": "correct product if wrong",
"suggested_subproduct": "correct sub-product if wrong",
"should_review": true/false
}
&lt;/EXPECTED_OUTPUT&gt;

&lt;REVIEW_CRITERIA&gt;
Set should_review to true if:
- The complaint narrative is ambiguous or could fit multiple categories
- The technical/financial terminology is unclear or inconsistent
- Multiple financial products are mentioned making categorization difficult
- The complaint lacks sufficient detail to make a confident assessment
- You are uncertain about the correct categorization
&lt;/REVIEW_CRITERIA&gt;

Analyze the complaint and return ONLY valid JSON.
"""</span>
 <span class="hljs-keyword">return</span> prompt

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">analyze_category</span>(<span class="hljs-params">df, sample_size=None</span>):</span>
 <span class="hljs-keyword">if</span> sample_size:
     sample_df = df.sample(min(sample_size, len(df)), random_state=<span class="hljs-number">42</span>).copy()
 <span class="hljs-keyword">else</span>:
     sample_df = df.copy()

 sample_df[<span class="hljs-string">'cleaned_narrative'</span>] = sample_df[<span class="hljs-string">'Consumer complaint narrative'</span>].apply(clean_narrative_text)
 sample_df = sample_df[sample_df[<span class="hljs-string">'cleaned_narrative'</span>].str.len() &gt; <span class="hljs-number">10</span>].copy()

 sample_df[<span class="hljs-string">'analysis_input'</span>] = sample_df.apply(<span class="hljs-keyword">lambda</span> row: 
     <span class="hljs-string">f"Current Product: <span class="hljs-subst">{row[<span class="hljs-string">'Product'</span>]}</span>\n"</span>
     <span class="hljs-string">f"Current Sub-product: <span class="hljs-subst">{row[<span class="hljs-string">'Sub-product'</span>]}</span>\n"</span> 
     <span class="hljs-string">f"Issue: <span class="hljs-subst">{row[<span class="hljs-string">'Issue'</span>]}</span>\n"</span>
     <span class="hljs-string">f"Complaint: <span class="hljs-subst">{row[<span class="hljs-string">'cleaned_narrative'</span>][:<span class="hljs-number">800</span>]}</span>..."</span>
 , axis=<span class="hljs-number">1</span>)

 prompt = check_qlty_prmpt()
 sample_df[<span class="hljs-string">'llm_analysis'</span>] = sample_df[[<span class="hljs-string">'analysis_input'</span>]].ai.generate_response(prompt, , conf=Conf(seed=<span class="hljs-number">0</span>, max_concurrency=<span class="hljs-number">25</span>))

 <span class="hljs-keyword">return</span> sample_df

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">parse_json</span>(<span class="hljs-params">df</span>):</span>
 <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">parse_json_response</span>(<span class="hljs-params">response_text</span>):</span>
     <span class="hljs-keyword">try</span>:
         json_match = re.search(<span class="hljs-string">r'\{.*\}'</span>, response_text, re.DOTALL)
         <span class="hljs-keyword">if</span> json_match:
             json_str = json_match.group(<span class="hljs-number">0</span>)
             <span class="hljs-keyword">return</span> json.loads(json_str)
         <span class="hljs-keyword">return</span> <span class="hljs-literal">None</span>
     <span class="hljs-keyword">except</span>:
         <span class="hljs-keyword">return</span> <span class="hljs-literal">None</span>

 df[<span class="hljs-string">'parsed_analysis'</span>] = df[<span class="hljs-string">'llm_analysis'</span>].apply(parse_json_response)
 df[<span class="hljs-string">'analysis_valid'</span>] = df[<span class="hljs-string">'parsed_analysis'</span>].notna()

 valid_df = df[df[<span class="hljs-string">'analysis_valid'</span>]].copy()

 <span class="hljs-keyword">if</span> len(valid_df) &gt; <span class="hljs-number">0</span>:
     valid_df[<span class="hljs-string">'is_correctly_categorized'</span>] = valid_df[<span class="hljs-string">'parsed_analysis'</span>].apply(
         <span class="hljs-keyword">lambda</span> x: x.get(<span class="hljs-string">'is_correctly_categorized'</span>, <span class="hljs-literal">None</span>) <span class="hljs-keyword">if</span> x <span class="hljs-keyword">else</span> <span class="hljs-literal">None</span>)
     valid_df[<span class="hljs-string">'issue_type'</span>] = valid_df[<span class="hljs-string">'parsed_analysis'</span>].apply(
         <span class="hljs-keyword">lambda</span> x: x.get(<span class="hljs-string">'issue_type'</span>, <span class="hljs-literal">None</span>) <span class="hljs-keyword">if</span> x <span class="hljs-keyword">else</span> <span class="hljs-literal">None</span>)
     valid_df[<span class="hljs-string">'explanation'</span>] = valid_df[<span class="hljs-string">'parsed_analysis'</span>].apply(
         <span class="hljs-keyword">lambda</span> x: x.get(<span class="hljs-string">'explanation'</span>, <span class="hljs-literal">None</span>) <span class="hljs-keyword">if</span> x <span class="hljs-keyword">else</span> <span class="hljs-literal">None</span>)
     valid_df[<span class="hljs-string">'suggested_product'</span>] = valid_df[<span class="hljs-string">'parsed_analysis'</span>].apply(
         <span class="hljs-keyword">lambda</span> x: x.get(<span class="hljs-string">'suggested_product'</span>, <span class="hljs-literal">None</span>) <span class="hljs-keyword">if</span> x <span class="hljs-keyword">else</span> <span class="hljs-literal">None</span>)
     valid_df[<span class="hljs-string">'suggested_subproduct'</span>] = valid_df[<span class="hljs-string">'parsed_analysis'</span>].apply(
         <span class="hljs-keyword">lambda</span> x: x.get(<span class="hljs-string">'suggested_subproduct'</span>, <span class="hljs-literal">None</span>) <span class="hljs-keyword">if</span> x <span class="hljs-keyword">else</span> <span class="hljs-literal">None</span>)
     valid_df[<span class="hljs-string">'should_review'</span>] = valid_df[<span class="hljs-string">'parsed_analysis'</span>].apply(
         <span class="hljs-keyword">lambda</span> x: x.get(<span class="hljs-string">'should_review'</span>, <span class="hljs-literal">None</span>) <span class="hljs-keyword">if</span> x <span class="hljs-keyword">else</span> <span class="hljs-literal">None</span>)

 <span class="hljs-keyword">return</span> valid_df

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">check_quality</span>(<span class="hljs-params">df, sample_size=None</span>):</span>
 <span class="hljs-keyword">if</span> sample_size:
     print(<span class="hljs-string">f"Analyzing <span class="hljs-subst">{sample_size}</span> complaints for categorization:"</span>)
 <span class="hljs-keyword">else</span>:
     print(<span class="hljs-string">f"Analyzing all <span class="hljs-subst">{len(df)}</span> complaints for categorization:"</span>)

 analyzed_df = analyze_category(df, sample_size)
 results_df = parse_json(analyzed_df)

 output_cols = [
     <span class="hljs-string">'Date received'</span>, <span class="hljs-string">'Product'</span>, <span class="hljs-string">'Sub-product'</span>, <span class="hljs-string">'Issue'</span>, 
     <span class="hljs-string">'Consumer complaint narrative'</span>, <span class="hljs-string">'is_correctly_categorized'</span>,
     <span class="hljs-string">'issue_type'</span>, <span class="hljs-string">'explanation'</span>, <span class="hljs-string">'suggested_product'</span>, <span class="hljs-string">'suggested_subproduct'</span>,
     <span class="hljs-string">'should_review'</span>
 ]

 available_cols = [col <span class="hljs-keyword">for</span> col <span class="hljs-keyword">in</span> output_cols <span class="hljs-keyword">if</span> col <span class="hljs-keyword">in</span> results_df.columns]
 final_df = results_df[available_cols].copy()

 print(<span class="hljs-string">f"<span class="hljs-subst">{len(final_df)}</span> complaints analyzed."</span>)
 print(<span class="hljs-string">f"Incorrectly categorized: <span class="hljs-subst">{(~final_df[<span class="hljs-string">'is_correctly_categorized'</span>]).sum()}</span>"</span>)
 print(<span class="hljs-string">f"Flagged for review: <span class="hljs-subst">{final_df[<span class="hljs-string">'should_review'</span>].sum()}</span>"</span>)

 <span class="hljs-keyword">return</span> final_df

<span class="hljs-comment">## first 250 complaints for demo purposes</span>
results_df = check_quality(df.head(<span class="hljs-number">250</span>)) 
display(results_df)
</code></pre>
<p><strong>Output:</strong></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1754082345958/4217deec-501a-427e-811e-0ec1e0cbac21.png" alt class="image--center mx-auto" /></p>
<p>Out of the 250 sample complaints analyzed, 113 were labeled as incorrectly categorized and 64 were flagged for further human evaluation. So we can “potentially” use AI Functions for catching data quality issues. I say potentially because more evaluation needs to be done (future blog) to use this effectively and reduce errors.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1754084815683/aa78bfc4-1b6b-4d0f-b0e1-f8792aa4a56b.png" alt class="image--center mx-auto" /></p>
<p>Let’s look at the very first example:</p>
<blockquote>
<p><strong>Complaint:</strong></p>
<p>I am filing this new complaint because the debt collector, First Financial XXXX XXXX, and the creditor, XXXX XXXX ( XXXX XXXX, XXXX ), have provided me with debt verification documents that are factually inaccurate. This is about my previous, closed complaint ( Complaint ID # [ XXXX ] ). The 'Vehicle Valuation Report ' sent to me as verification of the debt is based on an incorrect vehicle mileage of XXXX miles. This is false. The official rental agreement ( # XXXX ) confirms that the vehicle 's actual mileage at the time of rental was XXXX miles. This discrepancy of over XXXX miles has artificially and significantly inflated the vehicle 's valuation, which is the entire basis for the debt they claim I owe. By knowingly or negligently providing a valuation based on false information and continuing to demand payment based on it, they are making false representations in an attempt to collect a debt. This is a serious issue that goes beyond a simple dispute over methods. I have sent a formal letter disputing this invalid verification and demand that this illegitimate debt be waived.</p>
</blockquote>
<p>The user submitted this issue under Debt Collection → Other Debt, whereas AI Functions categorized it as Debt Collection → Auto Debt. The complaint is clearly about debt related to a vehicle.</p>
<p>Some complaints were flagged for further review, e.g.</p>
<blockquote>
<p>I canceled a Google Play subscription on XX/XX/XXXX through my XXXX device, but I was still charged on XX/XX/XXXX. I did not receive the service and have not received a refund or the service restored. I attempted to resolve the issue through Google Play support, but my request was denied or unresolved. They sent back several emails saying my card info had to be updated but it has been the same info since I uploaded the card which covers XX/XX/XXXX. Further, they said that restoring the subscription would have to come from the app developer, XXXX XXXX. XXXX XXXX rebutted this and said XXXX handles subscriptions. What I want : I want a full refund for the charge and for my cancellation to be properly, and promptly, honored.</p>
</blockquote>
<p>This was categorized by the user as Money transfer, virtual currency, money service → digital wallet. However, the complaint is about an unauthorized subscription and credit card billing rather than a money transfer or virtual currency. One could argue that Google Play uses Google Wallet and hence digital wallet is the right option. So not all categorizations are straightforward, and the LLM can identify complaints that should be reviewed further, thus improving the DQ.</p>
<p>This was one example, but this approach can also be used for tabular data to catch inconsistencies and errors. One of the very first Power BI reports I ever built 10 years ago was actually for catching errors in an Oracle ERP system. We saw that more than 50% of the codes assigned were incorrect, leading to delayed product shipments, incorrect processing of orders, warranty claims etc. I think using AI Functions for such tasks is a very viable solution.</p>
<h2 id="heading-challenges">Challenges</h2>
<p>As always, we need to remember that, like rule-based checks, LLMs are not completely predictable. We can manage their behavior to some degree and measure their accuracy, but it's important to always verify and validate by creating baselines and evaluation tools that you can improve over time. I will write about this in future blogs.</p>
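<p>As a minimal sketch of what such a baseline could look like (assuming you keep a small human-reviewed sample, with hypothetical <code>ai_verdict</code> and <code>human_verdict</code> columns), a simple agreement check is a reasonable starting point:</p>
<pre><code class="lang-python">import pandas as pd

def agreement_baseline(results_df: pd.DataFrame) -&gt; dict:
    # fraction of complaints where the LLM verdict matches the human label
    matched = results_df["ai_verdict"] == results_df["human_verdict"]
    return {
        "sample_size": len(results_df),
        "agreement_rate": round(matched.mean(), 3),
        "disagreements": results_df.loc[~matched],  # inspect these to refine the prompt
    }
</code></pre>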
<h2 id="heading-references">References</h2>
<p><a target="_blank" href="https://learn.microsoft.com/en-us/fabric/data-science/ai-functions/overview?tabs=pandas-pyspark%2Cpandas">Transform and enrich data seamlessly with AI functions - Microsoft Fabric | Microsoft Learn</a></p>
<p><a target="_blank" href="https://www.consumerfinance.gov/complaint/">Submit a complaint | Consumer Financial Protection Bureau</a></p>
<p><a target="_blank" href="https://jack-vanlightly.com/home/">About Me — Jack Vanlightly</a></p>
<p><a target="_blank" href="https://fabric.guru/unstructured-to-structured-extracting-data-from-messy-excel-sheets-using-fabric-ai-function">Unstructured to Structured : Extracting Data From Messy Excel Sheets Using Fabric AI Function</a></p>
<p><a target="_blank" href="https://journalofbigdata.springeropen.com/articles/10.1186/s40537-021-00468-0">Big data quality framework: a holistic approach to continuous quality management | Journal of Big Data | Full Text</a></p>
]]></content:encoded></item><item><title><![CDATA[How to Check If Your Power BI Report Uses the Default Semantic Model]]></title><description><![CDATA[Default semantic models will be going away : read the announcement here and the details. Starting August 8, 2025, Power BI default semantic models are no longer created automatically when a warehouse, lakehouse, or mirrored item is created. Note that...]]></description><link>https://fabric.guru/how-to-check-if-your-power-bi-report-uses-the-default-semantic-model</link><guid isPermaLink="true">https://fabric.guru/how-to-check-if-your-power-bi-report-uses-the-default-semantic-model</guid><category><![CDATA[default semantic model]]></category><category><![CDATA[microsoftfabric]]></category><category><![CDATA[DirectLake]]></category><dc:creator><![CDATA[Sandeep Pawar]]></dc:creator><pubDate>Wed, 23 Jul 2025 17:37:19 GMT</pubDate><content:encoded><![CDATA[<p><strong>Default semantic models will be going away</strong> : read the announcement <a target="_blank" href="https://blog.fabric.microsoft.com/en-us/blog/sunsetting-default-semantic-models-microsoft-fabric?ft=All">here</a> and the details. Starting August 8, 2025, Power BI <em>default</em> semantic models are no longer created automatically when a warehouse, lakehouse, or mirrored item is created. Note that the change will be implemented in two phases and blogs/docs will be coming in the next few weeks with more details.</p>
<p>But until then, if you want to check whether any reports use the default semantic models, here are two approaches using <a target="_blank" href="https://github.com/microsoft/semantic-link-labs">Semantic Link Labs</a>.</p>
<h2 id="heading-if-any-reports-use-default-semantic-model">If any reports use default semantic model</h2>
<p>Below we get a list of all the reports in a workspace, and the <code>is_default</code> column shows whether the report uses a default semantic model. The key function here is <code>labs.is_default_semantic_model</code>.</p>
<pre><code class="lang-python"><span class="hljs-comment">#%pip install semantic-link-labs -q</span>

<span class="hljs-keyword">import</span> sempy_labs <span class="hljs-keyword">as</span> labs
<span class="hljs-keyword">import</span> sempy.fabric <span class="hljs-keyword">as</span> fabric

<span class="hljs-comment">#only the reports in the workspace where the notebook is hosted will be returned</span>
<span class="hljs-comment">#specify the workspace parameter if you want to search other workspaces &amp; modify accordingly</span>
df = fabric.list_reports().assign(
    is_default=<span class="hljs-keyword">lambda</span> df: df.apply(
        <span class="hljs-keyword">lambda</span> row: labs.is_default_semantic_model(
            fabric.resolve_dataset_name(row[<span class="hljs-string">'Dataset Id'</span>])
        ),
        axis=<span class="hljs-number">1</span>
    )
)

df
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1753291251532/c98190e3-575d-48d4-a103-81fe392e6dae.png" alt class="image--center mx-auto" /></p>
<h2 id="heading-if-any-default-semantic-models-are-used-in-reports">If any default semantic models are used in reports</h2>
<p>Alternatively, you may want to identify whether any default semantic models are used by reports.</p>
<p><code>has_reports</code> column shows if the default model has any reports attached to it. To get the list of reports, use <code>labs.list_reports_using_semantic_model</code> function.</p>
<pre><code class="lang-python"><span class="hljs-comment">#%pip install semantic-link-labs -q</span>

<span class="hljs-keyword">import</span> sempy_labs <span class="hljs-keyword">as</span> labs
<span class="hljs-keyword">import</span> sempy.fabric <span class="hljs-keyword">as</span> fabric
<span class="hljs-comment">## only the models in teh workspace where the notebook is hosted will be scanned. Otherwise, specify workspace param  </span>
df = fabric.list_datasets()
df = df[df[<span class="hljs-string">'Dataset Name'</span>].apply(labs.is_default_semantic_model)]
df[<span class="hljs-string">'has_reports'</span>] = df[<span class="hljs-string">'Dataset Name'</span>].apply(
    <span class="hljs-keyword">lambda</span> name: len(labs.list_reports_using_semantic_model(dataset=name))&gt;<span class="hljs-number">0</span>
)
df = df[df[<span class="hljs-string">'has_reports'</span>]]

df
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1753291669171/ccdb2ade-ca11-490e-b84a-dbad805cfd4f.png" alt class="image--center mx-auto" /></p>
<p>To scan all the workspaces, either use the <code>list_items</code> function or loop over the workspaces you are interested in, as sketched below.</p>
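<p>A sketch of that loop, assuming the <code>workspace</code> parameters mentioned in the comments above (workspace names are hypothetical):</p>
<pre><code class="lang-python">import pandas as pd

workspaces = ["Sales", "Finance"]  # hypothetical workspace names

frames = []
for ws in workspaces:
    reports = fabric.list_reports(workspace=ws)
    if reports.empty:
        continue
    reports["is_default"] = reports.apply(
        lambda row: labs.is_default_semantic_model(
            fabric.resolve_dataset_name(row["Dataset Id"], workspace=ws),
            workspace=ws,
        ),
        axis=1,
    )
    reports["Workspace"] = ws
    frames.append(reports)

# all reports across the selected workspaces that sit on a default semantic model
all_reports = pd.concat(frames, ignore_index=True)
all_reports[all_reports["is_default"]]
</code></pre>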
<p>More on this later !</p>
<h2 id="heading-references">References</h2>
<p><a target="_blank" href="https://learn.microsoft.com/en-us/fabric/data-warehouse/semantic-models">Power BI Semantic Models - Microsoft Fabric | Microsoft Learn</a></p>
<p><a target="_blank" href="https://blog.fabric.microsoft.com/en-US/blog/sunsetting-default-semantic-models-microsoft-fabric/">Sunsetting Default Semantic Models – Microsoft Fabric | Microsoft Fabric Blog | Microsoft Fabric</a></p>
]]></content:encoded></item><item><title><![CDATA[Reading Delta Tables With ColumnMapping Using Polars]]></title><description><![CDATA[There was question a couple of days ago on r/MicrosoftFabric subreddit on reading Data Warehouse tables shortcutted into Lakehouse. You can easily query this using Spark or T-SQL in the notebook, the question was how to do this using Polars since del...]]></description><link>https://fabric.guru/reading-delta-tables-with-columnmapping-using-polars</link><guid isPermaLink="true">https://fabric.guru/reading-delta-tables-with-columnmapping-using-polars</guid><category><![CDATA[microsoftfabric]]></category><category><![CDATA[Polars]]></category><category><![CDATA[duckDB]]></category><dc:creator><![CDATA[Sandeep Pawar]]></dc:creator><pubDate>Fri, 18 Jul 2025 22:28:03 GMT</pubDate><content:encoded><![CDATA[<p>There was <a target="_blank" href="https://www.reddit.com/r/MicrosoftFabric/comments/1m1fuf2/shortcut_tables_are_useless_in_python_notebooks/">question</a> a couple of days ago on <a target="_blank" href="https://www.reddit.com/r/MicrosoftFabric/">r/MicrosoftFabric</a> subreddit on reading Data Warehouse tables shortcutted into Lakehouse. You can easily query this using Spark or T-SQL in the notebook, the question was how to do this using Polars since delta tables created by Datawarehouse have <a target="_blank" href="https://docs.delta.io/latest/delta-column-mapping.html">Column Mapping enabled</a>. Polars is built on Delta-rs which <a target="_blank" href="https://github.com/delta-io/delta-rs/issues/930">does not support</a> reading tables with Column Mapping yet.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1752876651357/bc08d452-e362-4aff-bc99-387583051777.png" alt class="image--center mx-auto" /></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1752876617837/16057557-85c0-404f-acdc-9c677d5dff97.png" alt class="image--center mx-auto" /></p>
<p>Below is a crude approach I came up with to map the logical column names to physical column names.</p>
<div data-node-type="callout">
<div data-node-type="callout-emoji">💡</div>
<div data-node-type="callout-text">Before you proceed, please note that this is a very inefficient solution and comes with many performance limitations. So, unless you have very small data and you can verify the data, I would advise using Spark or T-SQL. Verify and validate.</div>
</div>

<h3 id="heading-the-logic">The logic:</h3>
<ul>
<li><p>Get the logical column names and physical column names to make a dictionary</p>
</li>
<li><p>Get the parquet files from the delta transaction log</p>
</li>
<li><p>Apply column mapping</p>
</li>
<li><p>Read and union</p>
</li>
</ul>
<p>As you can see from above, you lose the parallelization and efficiency in the process.</p>
<h3 id="heading-code">Code</h3>
<pre><code class="lang-python"><span class="hljs-comment">#Python notebook</span>
<span class="hljs-keyword">import</span> polars <span class="hljs-keyword">as</span> pl
<span class="hljs-keyword">from</span> deltalake <span class="hljs-keyword">import</span> DeltaTable
<span class="hljs-keyword">import</span> os

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">scan_delta_cm</span>(<span class="hljs-params">path: str</span>) -&gt; pl.LazyFrame:</span>
   delta_table = DeltaTable(path)

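   <span class="hljs-comment"># build a logical -&gt; physical column name map from the delta.columnMapping metadata</span>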
   colmaps: dict[str, str] = dict()
   <span class="hljs-keyword">for</span> field <span class="hljs-keyword">in</span> delta_table.schema().fields:
       logical_name = field.name
       physical_name = field.metadata.get(<span class="hljs-string">"delta.columnMapping.physicalName"</span>, field.name)
       colmaps[logical_name] = physical_name

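   <span class="hljs-comment"># scan each data file from the delta log and rename physical columns back to logical names</span>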
   all_lazy_frames = []
   <span class="hljs-keyword">for</span> add_action <span class="hljs-keyword">in</span> delta_table.get_add_actions(flatten=<span class="hljs-literal">True</span>).to_pylist():
       file_path = os.path.join(delta_table.table_uri, add_action[<span class="hljs-string">"path"</span>])
       lazy_df = pl.scan_parquet(file_path)

       file_schema = lazy_df.collect_schema()
       available_columns = file_schema.names()

       select_exprs = []
       <span class="hljs-keyword">for</span> logical_name, physical_name <span class="hljs-keyword">in</span> colmaps.items():
           <span class="hljs-keyword">if</span> physical_name <span class="hljs-keyword">in</span> available_columns:
               select_exprs.append(pl.col(physical_name).alias(logical_name))

       <span class="hljs-keyword">if</span> select_exprs:
           lazy_df = lazy_df.select(select_exprs)
           all_lazy_frames.append(lazy_df)

   <span class="hljs-keyword">if</span> all_lazy_frames:
       <span class="hljs-keyword">return</span> pl.concat(all_lazy_frames)
   <span class="hljs-keyword">else</span>:
       <span class="hljs-keyword">raise</span> ValueError(<span class="hljs-string">"No data files found"</span>)

<span class="hljs-comment">#path is abfs path of the table</span>
df = scan_delta_cm(path).collect()
<span class="hljs-comment"># df.head()</span>
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1752877001406/7a6849c6-4aba-4f3f-b7b5-19639ad94529.png" alt class="image--center mx-auto" /></p>
<p>Think of this more as an experiment than a solution; it will only work for limited cases (tables with deletion vectors won't work either, as expected). For any business-critical job, I would advise using Spark in such scenarios.</p>
<p>The other, easier alternative is to use DuckDB, which supports tables with columnMapping.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1752877424357/deef7454-5699-43fe-bbbb-1a2769d2f9b2.png" alt class="image--center mx-auto" /></p>
<p>If you must use Polars, you can <a target="_blank" href="https://duckdb.org/docs/stable/guides/python/polars.html">zero-copy</a> the DuckDB result to a Polars dataframe, as in the sketch below.</p>
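<p>A minimal sketch of that route, assuming DuckDB's <code>delta</code> extension is available and the session is already authenticated to OneLake:</p>
<pre><code class="lang-python">import duckdb

con = duckdb.connect()
con.sql("INSTALL delta; LOAD delta;")  # one-time setup per session

# path is the abfss path of the table, same as above
rel = con.sql(f"SELECT * FROM delta_scan('{path}')")
pl_df = rel.pl()  # hand the result to Polars via Arrow
</code></pre>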
]]></content:encoded></item><item><title><![CDATA[Programmatically Apply Organizational Theme To Multiple Power BI Reports Using Semantic Link Labs]]></title><description><![CDATA[Power BI Core Visuals team led by Miguel Myers published a huge update last week : Organizational Themes. It’s been a long standing ask by the users. It allows Power BI admins to manage and distribute centrally stored themes to all developers in the ...]]></description><link>https://fabric.guru/programmatically-apply-organizational-theme-to-multiple-power-bi-reports-using-semantic-link-labs</link><guid isPermaLink="true">https://fabric.guru/programmatically-apply-organizational-theme-to-multiple-power-bi-reports-using-semantic-link-labs</guid><category><![CDATA[organizational theme]]></category><category><![CDATA[org theme]]></category><category><![CDATA[microsoft fabric]]></category><category><![CDATA[semantic link labs]]></category><category><![CDATA[PowerBI]]></category><category><![CDATA[api]]></category><dc:creator><![CDATA[Sandeep Pawar]]></dc:creator><pubDate>Tue, 24 Jun 2025 15:11:24 GMT</pubDate><content:encoded><![CDATA[<p><a target="_blank" href="https://www.linkedin.com/company/pbicorevisuals/">Power BI Core Visuals</a> team led by <a target="_blank" href="https://www.linkedin.com/in/miguelmyers/">Miguel Myers</a> published a huge update last week : <a target="_blank" href="https://www.linkedin.com/pulse/organizational-themes-preview-pbicorevisuals-j7jxe/?trackingId=LWbxnrre1r5HbEBlbXr%2B%2Fg%3D%3D">Organizational Themes</a>. It’s been a long standing ask by the users. It allows Power BI admins to manage and distribute centrally stored themes to all developers in the organization. Read the blog post for details. Currently, as the above blog explains, org themes don’t update existing reports automatically, which makes sense. But what if you want to bulk update many published reports with the org themes? The latest version of <a target="_blank" href="https://github.com/microsoft/semantic-link-labs">Semantic Link Labs</a> to the rescue (<a target="_blank" href="https://github.com/microsoft/semantic-link-labs/releases/tag/0.11.0">v 0.11.0</a>) !!!</p>
<div data-node-type="callout">
<div data-node-type="callout-emoji">💡</div>
<div data-node-type="callout-text">Before you proceed, please note that Organizational Theme is a <em>Preview</em> feature and is available only to admins. Semantic Link Labs needs the report to be in <em>PBIR</em> format for below function to work. You can read about PBIR, its use cases &amp; limitations <a target="_self" href="https://powerbi.microsoft.com/en-us/blog/power-bi-enhanced-report-format-pbir-in-power-bi-desktop-developer-mode-preview/?cdn=disable">here</a>. Also, Semantic Link Labs currently uses an internal API which will be updated when the public API for org themes is available.</div>
</div>

<h2 id="heading-organizational-theme">Organizational Theme</h2>
<p>I have three report themes published in the tenant.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1750711772838/87c045c3-7d3d-4c03-a610-c0539d360a72.png" alt class="image--center mx-auto" /></p>
<h2 id="heading-semantic-link-labs">Semantic Link Labs</h2>
<p>Using Semantic Link Labs, I want to apply one of the themes (the <code>Sales Dark Theme</code>, for illustration purposes) to all reports in a workspace. To show that this works for reports in Pro workspaces as well, I published three reports to a Pro workspace. I will:</p>
<ul>
<li><p>Install SLL version &gt;= 0.11.0</p>
</li>
<li><p>Get the JSON of the <code>Sales Dark Theme</code> from the org themes</p>
</li>
<li><p>Get a list of all the reports in the workspace</p>
</li>
<li><p>If the report theme is different from the org theme, <a target="_blank" href="https://github.com/microsoft/semantic-link-labs/wiki/Code-Examples#set-the-theme-of-a-report">apply the theme</a> to each report</p>
<ul>
<li>else, skip</li>
</ul>
</li>
</ul>
<p>    <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1750712165035/0333eb1d-856a-4fa8-a2e7-9eb0f566388a.gif" alt class="image--center mx-auto" /></p>
<p>    The base code is very simple - get org theme, apply the theme:</p>
<pre><code class="lang-python">theme_json = theme.get_org_theme_json(theme=<span class="hljs-string">'MyTheme'</span>)
<span class="hljs-keyword">with</span> connect_report(report=report, workspace=workspace, readonly=<span class="hljs-literal">False</span>) <span class="hljs-keyword">as</span> rpt:
    rpt.set_theme(theme_json=theme_json)
</code></pre>
<p>Below, I make it a bit more robust with error handling, catch a few edge cases, and do some comparisons:</p>
<pre><code class="lang-python">%pip install semantic-link-labs --q

<span class="hljs-keyword">from</span> sempy_labs.report <span class="hljs-keyword">import</span> connect_report
<span class="hljs-keyword">from</span> sempy_labs.theme <span class="hljs-keyword">import</span> list_org_themes, get_org_theme_json
<span class="hljs-keyword">import</span> sempy.fabric <span class="hljs-keyword">as</span> fabric
<span class="hljs-keyword">import</span> json

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">theme_content</span>(<span class="hljs-params">theme_dict</span>):</span>
    <span class="hljs-comment"># remove the 'name' field for content comparison</span>
    <span class="hljs-keyword">return</span> {k: v <span class="hljs-keyword">for</span> k, v <span class="hljs-keyword">in</span> theme_dict.items() <span class="hljs-keyword">if</span> k != <span class="hljs-string">"name"</span>}

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">apply_org_theme</span>(<span class="hljs-params">workspace, org_theme_name=None</span>):</span>
    <span class="hljs-string">"""
    For each report in the workspace, applies the org theme if not already set or if only the name is different.

    """</span>
    <span class="hljs-comment"># get org themes and select the desired one</span>
    org_themes = list_org_themes()
    <span class="hljs-keyword">if</span> org_themes.empty:
        <span class="hljs-keyword">raise</span> ValueError(<span class="hljs-string">"org themes nt found."</span>)
    <span class="hljs-keyword">if</span> org_theme_name:
        row = org_themes[org_themes[<span class="hljs-string">"Theme Name"</span>] == org_theme_name]
        <span class="hljs-keyword">if</span> row.empty:
            <span class="hljs-keyword">raise</span> ValueError(<span class="hljs-string">f"Org theme '<span class="hljs-subst">{org_theme_name}</span>' not found."</span>)
        theme_id = row[<span class="hljs-string">"Theme Id"</span>].iloc[<span class="hljs-number">0</span>]
    <span class="hljs-keyword">else</span>:
        theme_id = org_themes[<span class="hljs-string">"Theme Id"</span>].iloc[<span class="hljs-number">0</span>]
        org_theme_name = org_themes[<span class="hljs-string">"Theme Name"</span>].iloc[<span class="hljs-number">0</span>]
    org_theme_json = get_org_theme_json(theme_id)
    <span class="hljs-keyword">if</span> isinstance(org_theme_json, str):
        org_theme_json = json.loads(org_theme_json) <span class="hljs-comment">#if theme returned is not json, just in case</span>

    org_theme_content = theme_content(org_theme_json)
    org_theme_name_val = org_theme_json.get(<span class="hljs-string">"name"</span>, org_theme_name)

    <span class="hljs-comment"># all reports in the workspace; can be Fabric or Pro workspace</span>
    reports = fabric.list_reports(workspace=workspace)
    updated = []

    <span class="hljs-keyword">for</span> ix, row <span class="hljs-keyword">in</span> reports.iterrows():
        report_id = row[<span class="hljs-string">"Id"</span>]
        report_name = row[<span class="hljs-string">"Name"</span>]
        <span class="hljs-keyword">try</span>:
            <span class="hljs-keyword">with</span> connect_report(report=report_id, workspace=workspace, readonly=<span class="hljs-literal">False</span>, show_diffs=<span class="hljs-literal">False</span>) <span class="hljs-keyword">as</span> rpt:
                <span class="hljs-comment"># get custom theme fallback to base theme.</span>
                <span class="hljs-keyword">try</span>:
                    <span class="hljs-comment"># custom theme</span>
                    report_theme = rpt.get_theme(<span class="hljs-string">"customTheme"</span>)
                <span class="hljs-keyword">except</span> Exception:
                    <span class="hljs-keyword">try</span>:
                        report_theme = rpt.get_theme(<span class="hljs-string">"baseTheme"</span>)
                    <span class="hljs-keyword">except</span> Exception:
                        report_theme = {}

                <span class="hljs-keyword">if</span> isinstance(report_theme, str):
                    report_theme = json.loads(report_theme)

                report_theme_content = theme_content(report_theme)
                report_theme_name_val = report_theme.get(<span class="hljs-string">"name"</span>, <span class="hljs-string">""</span>)

                <span class="hljs-comment"># compare different scenarios</span>
                <span class="hljs-comment"># sometime the theme details could be same but name could be different</span>
                content_match = json.dumps(report_theme_content, sort_keys=<span class="hljs-literal">True</span>) == json.dumps(org_theme_content, sort_keys=<span class="hljs-literal">True</span>)
                name_match = report_theme_name_val == org_theme_name_val

                <span class="hljs-keyword">if</span> content_match <span class="hljs-keyword">and</span> name_match:
                    print(<span class="hljs-string">f"Report '<span class="hljs-subst">{report_name}</span>' already has the '<span class="hljs-subst">{org_theme_name}</span>' org theme."</span>)
                    <span class="hljs-keyword">continue</span>
                <span class="hljs-keyword">elif</span> content_match <span class="hljs-keyword">and</span> <span class="hljs-keyword">not</span> name_match:
                    print(<span class="hljs-string">f"Report theme details are the same but the theme name is different. Updated the report with the new theme '<span class="hljs-subst">{org_theme_name}</span>'."</span>)
                    rpt.set_theme(theme_json=org_theme_json)
                    updated.append(report_name)
                <span class="hljs-keyword">else</span>:
                    print(<span class="hljs-string">f"Applying org theme '<span class="hljs-subst">{org_theme_name}</span>' to '<span class="hljs-subst">{report_name}</span>'..."</span>)
                    rpt.set_theme(theme_json=org_theme_json)
                    updated.append(report_name)
        <span class="hljs-keyword">except</span> Exception <span class="hljs-keyword">as</span> err:
            print(<span class="hljs-string">f"failed to update '<span class="hljs-subst">{report_name}</span>': <span class="hljs-subst">{err}</span>"</span>)
    <span class="hljs-keyword">return</span> updated
</code></pre>
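<p>Calling it is then a one-liner (the workspace name below is hypothetical; the theme is the one shown earlier):</p>
<pre><code class="lang-python">updated = apply_org_theme(workspace="Sales Pro", org_theme_name="Sales Dark Theme")
print(updated)  # names of the reports that were updated
</code></pre>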
<p>Example:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1750712601252/08048aed-e3ff-4559-84a6-8c407efa904f.png" alt class="image--center mx-auto" /></p>
<p>There is no API yet to update the org theme, but I am sure it will be made available in the future.</p>
<p>Note again that I executed the notebook in a Fabric workspace but the reports are in a Pro workspace ! <a target="_blank" href="https://www.linkedin.com/in/michaelkovalsky/recent-activity/all/">Michael Kovalsky</a> and <a target="_blank" href="https://github.com/microsoft/semantic-link-labs/graphs/contributors">other contributors</a> have been adding many new functions to the <code>ReportWrapper</code> class, you should <a target="_blank" href="https://github.com/microsoft/semantic-link-labs/blob/main/notebooks/Report%20Analysis.ipynb">take a look at it for all the possibilities</a>.</p>
<div data-node-type="callout">
<div data-node-type="callout-emoji">💡</div>
<div data-node-type="callout-text">If you are a Power BI developer with no prior experience in notebooks and Python, I recommend checking out <a target="_self" href="https://data-goblins.com/">Kurt Buhler</a>’s <a target="_self" href="https://tabulareditor.com/blog/introducing-notebooks-for-power-bi-people">new notebook tutorial</a> to get started. Also bookmark <a target="_self" href="https://github.com/microsoft/semantic-link-labs/tree/main/notebooks">these example notebooks</a>.</div>
</div>]]></content:encoded></item><item><title><![CDATA[Quick Test: Finding Power BI Report Pages With Errors Using Semantic Link Labs and LLM]]></title><description><![CDATA[As the title says, it’s a test. I wanted to experiment with something based on a discussion I had. The user was using Semantic Link Labs’s awesome ReportWrapper to find Power BI report pages with broken visuals. (You can use this notebook to learn mo...]]></description><link>https://fabric.guru/quick-test-finding-power-bi-report-pages-with-errors-using-semantic-link-labs-and-llm</link><guid isPermaLink="true">https://fabric.guru/quick-test-finding-power-bi-report-pages-with-errors-using-semantic-link-labs-and-llm</guid><category><![CDATA[microsoft fabric]]></category><category><![CDATA[semantic link labs]]></category><category><![CDATA[gemini]]></category><category><![CDATA[llm]]></category><category><![CDATA[genai]]></category><category><![CDATA[object detection ]]></category><dc:creator><![CDATA[Sandeep Pawar]]></dc:creator><pubDate>Fri, 13 Jun 2025 22:14:59 GMT</pubDate><content:encoded><![CDATA[<p>As the title says, it’s a test. I wanted to experiment with something based on a discussion I had. The user was using Semantic Link Labs’s awesome <code>ReportWrapper</code> to find Power BI report pages with broken visuals. (You can use <a target="_blank" href="https://github.com/microsoft/semantic-link-labs/blob/main/notebooks/Report%20Analysis.ipynb">this notebook</a> to learn more). However, for ReportWrapper you need the report to be in <code>.pbir</code> format. In this case the reports were in pbix. So, I thought of an alternative approach- using LLM 😁</p>
<h2 id="heading-recipe">Recipe</h2>
<ul>
<li><p>Export the Power BI report as an image to a lakehouse. <a target="_blank" href="https://fabric.guru/exporting-power-bi-reports-and-sharing-with-granular-access-control-in-fabric">I have written</a> about this before.</p>
</li>
<li><p>Extract all the pages of the report as png</p>
</li>
<li><p>Use an LLM to detect broken visuals</p>
</li>
<li><p>Save the pages with errors</p>
</li>
</ul>
<h2 id="heading-using-llm">Using LLM</h2>
<p>There are several ways to do this:</p>
<ul>
<li><p>Use multimodal embedding to convert the report image to an embedding vector and do a similarity search (e.g. I used the Jina 2 CLIP embedding and did a similarity search with <code>this report has an error message with gray background</code>. It worked decently but wasn't very robust; it also can't identify which visual has the error and was prone to false positives.)</p>
</li>
<li><p>Multimodal AI Search: similar to above but, instead of text, perform the similarity search using embeddings of error messages. This also wasn't too accurate.</p>
</li>
<li><p>Use a multimodal model: use a multimodal LLM to query the image. This worked well with several different LLMs; however, gemini-2.5 worked the best without any additional complexity (and it's free for testing :D). Below I show how to do that.</p>
</li>
</ul>
<h2 id="heading-using-google-gemini">Using Google Gemini</h2>
<p>Gemini &gt;1.5 models are multimodal, i.e. they work with text, audio, images etc. Unlike many similar models, it's also a very capable object detection model. You can ask it to return bounding boxes for the objects you are interested in. Below, I use that to identify the error messages. I tested it on several reports, and it worked 100% of the time (on the reports I tested). You will need to generate a Gemini API key (free).</p>
<div data-node-type="callout">
<div data-node-type="callout-emoji">💡</div>
<div data-node-type="callout-text">As always, do your due diligence before sending any sensitive data to AI services providers. In my case, since I am using the free services, I am aware that Google will use it for their training. It’s written in their TOC. Use caution and be aware of the risks.</div>
</div>

<h2 id="heading-extract-report-pages">Extract Report Pages</h2>
<p>You will need to enable the tenant setting that allows exporting reports as images. Attach a lakehouse to the Fabric notebook and specify the report and lakehouse details.</p>
<pre><code class="lang-python">%pip install semantic-link-labs google-genai --q
<span class="hljs-keyword">from</span> sempy_labs.report <span class="hljs-keyword">import</span> export_report

(export_report(
    report = <span class="hljs-string">"Sales and Returns-error"</span>,
    workspace=<span class="hljs-string">"Sales"</span>,
    export_format = <span class="hljs-string">"PNG"</span>,
    lakehouse=<span class="hljs-string">"MyLakehouse"</span>,
    lakehouse_workspace=<span class="hljs-string">"Sales"</span>)
)
</code></pre>
<h2 id="heading-detect-visuals-with-errors">Detect Visuals with Errors</h2>
<p>The prompt is simple:</p>
<div data-node-type="callout">
<div data-node-type="callout-emoji">💡</div>
<div data-node-type="callout-text"><em>'Analyze this image and return a JSON response with: 1) "has_error_messages": true/false if image contains error messages, 2) "coordinates": array of bounding boxes [ymin, xmin, ymax, xmax] for each error message found. Error messages typically have a black x in a circle with a gray box.'</em></div>
</div>

<pre><code class="lang-python"><span class="hljs-keyword">import</span> json
<span class="hljs-keyword">import</span> re
<span class="hljs-keyword">import</span> os
<span class="hljs-keyword">from</span> google <span class="hljs-keyword">import</span> genai
<span class="hljs-keyword">from</span> PIL <span class="hljs-keyword">import</span> Image, ImageDraw

<span class="hljs-keyword">import</span> json
<span class="hljs-keyword">import</span> re
<span class="hljs-keyword">import</span> os
<span class="hljs-keyword">import</span> zipfile
<span class="hljs-keyword">import</span> glob
<span class="hljs-keyword">from</span> google <span class="hljs-keyword">import</span> genai
<span class="hljs-keyword">from</span> PIL <span class="hljs-keyword">import</span> Image, ImageDraw

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">detect_error_messages</span>(<span class="hljs-params">zip_path, api_key, output_dir=None</span>):</span>
    <span class="hljs-keyword">if</span> output_dir <span class="hljs-keyword">is</span> <span class="hljs-literal">None</span>:
        output_dir = os.path.join(os.path.dirname(zip_path), <span class="hljs-string">"extracted"</span>)

    os.makedirs(output_dir, exist_ok=<span class="hljs-literal">True</span>)

    <span class="hljs-keyword">with</span> zipfile.ZipFile(zip_path, <span class="hljs-string">'r'</span>) <span class="hljs-keyword">as</span> zip_ref:
        zip_ref.extractall(output_dir)


    png_files = glob.glob(os.path.join(output_dir, <span class="hljs-string">"*.png"</span>))

    client = genai.Client(api_key=api_key)
    results = []

    <span class="hljs-keyword">for</span> image_path <span class="hljs-keyword">in</span> png_files:
        <span class="hljs-keyword">try</span>:
            image = Image.open(image_path)

            response = client.models.generate_content(
                model=<span class="hljs-string">'gemini-2.5-pro-preview-06-05'</span>,
                contents=[
                    <span class="hljs-string">'Analyze this image and return a JSON response with: 1) "has_error_messages": true/false if image contains error messages, 2) "coordinates": array of bounding boxes [ymin, xmin, ymax, xmax] for each error message found. Error messages typically have a black x in a circle with a gray box.'</span>,
                    image
                ]
            )

            result = _extract_json_result(response.text)
            result[<span class="hljs-string">"image_path"</span>] = image_path
            result[<span class="hljs-string">"filename"</span>] = os.path.basename(image_path)

            <span class="hljs-keyword">if</span> result[<span class="hljs-string">"has_error_messages"</span>] <span class="hljs-keyword">and</span> result[<span class="hljs-string">"coordinates"</span>]:
                base_name = os.path.splitext(image_path)[<span class="hljs-number">0</span>]
                ext = os.path.splitext(image_path)[<span class="hljs-number">1</span>]
                output_path = <span class="hljs-string">f"<span class="hljs-subst">{base_name}</span>-error<span class="hljs-subst">{ext}</span>"</span>

                _draw_bounding_boxes(image_path, result[<span class="hljs-string">"coordinates"</span>], output_path)
                result[<span class="hljs-string">"output_path"</span>] = output_path

            results.append(result)

        <span class="hljs-keyword">except</span> Exception <span class="hljs-keyword">as</span> e:
            results.append({
                <span class="hljs-string">"image_path"</span>: image_path,
                <span class="hljs-string">"filename"</span>: os.path.basename(image_path),
                <span class="hljs-string">"error"</span>: str(e),
                <span class="hljs-string">"has_error_messages"</span>: <span class="hljs-literal">False</span>,
                <span class="hljs-string">"coordinates"</span>: []
            })

    <span class="hljs-keyword">return</span> results

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">_extract_json_result</span>(<span class="hljs-params">text</span>):</span>
    json_pattern = <span class="hljs-string">r'\{[^{}]*"has_error_messages"[^{}]*\}'</span>
    match = re.search(json_pattern, text, re.DOTALL)

    <span class="hljs-keyword">if</span> match:
        <span class="hljs-keyword">try</span>:
            <span class="hljs-keyword">return</span> json.loads(match.group())
        <span class="hljs-keyword">except</span> json.JSONDecodeError:
            <span class="hljs-keyword">pass</span>

    coord_pattern = <span class="hljs-string">r'\[\s*\d+\s*,\s*\d+\s*,\s*\d+\s*,\s*\d+\s*\]'</span>
    coordinates = [json.loads(match) <span class="hljs-keyword">for</span> match <span class="hljs-keyword">in</span> re.findall(coord_pattern, text)]

    <span class="hljs-keyword">return</span> {
        <span class="hljs-string">"has_error_messages"</span>: len(coordinates) &gt; <span class="hljs-number">0</span>,
        <span class="hljs-string">"coordinates"</span>: coordinates
    }

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">_draw_bounding_boxes</span>(<span class="hljs-params">image_path, coordinates, output_path</span>):</span>
    image = Image.open(image_path)
    draw = ImageDraw.Draw(image)
    width, height = image.size

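    <span class="hljs-comment"># Gemini returns bounding boxes on a normalized 0-1000 grid; scale to pixel coordinates</span>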
    <span class="hljs-keyword">for</span> i, (ymin, xmin, ymax, xmax) <span class="hljs-keyword">in</span> enumerate(coordinates):
        x1 = int(xmin * width / <span class="hljs-number">1000</span>)
        y1 = int(ymin * height / <span class="hljs-number">1000</span>)
        x2 = int(xmax * width / <span class="hljs-number">1000</span>)
        y2 = int(ymax * height / <span class="hljs-number">1000</span>)

        dash_length = <span class="hljs-number">10</span>

        <span class="hljs-keyword">for</span> x <span class="hljs-keyword">in</span> range(x1, x2, dash_length * <span class="hljs-number">2</span>):
            draw.line([(x, y1), (min(x + dash_length, x2), y1)], fill=<span class="hljs-string">'red'</span>, width=<span class="hljs-number">5</span>)
            draw.line([(x, y2), (min(x + dash_length, x2), y2)], fill=<span class="hljs-string">'red'</span>, width=<span class="hljs-number">5</span>)

        <span class="hljs-keyword">for</span> y <span class="hljs-keyword">in</span> range(y1, y2, dash_length * <span class="hljs-number">2</span>):
            draw.line([(x1, y), (x1, min(y + dash_length, y2))], fill=<span class="hljs-string">'red'</span>, width=<span class="hljs-number">5</span>)
            draw.line([(x2, y), (x2, min(y + dash_length, y2))], fill=<span class="hljs-string">'red'</span>, width=<span class="hljs-number">5</span>)

    image.save(output_path)



api_key = <span class="hljs-string">"AIzaSyAxxxxxxxxxxxxxxxxxxxxxxxx"</span>
image_path = <span class="hljs-string">"/lakehouse/default/Files/Sales and Returns-error.png"</span>

result = detect_error_messages(image_path, api_key)
</code></pre>
<p>The above function returns a JSON with keys <code>"has_error_messages"</code> and <code>"coordinates"</code>. If a visual with an error is detected, <code>has_error_messages</code> is <code>true</code> and the coordinates of the visual with the error are returned. The function draws a red dotted rectangle around the visuals with errors. I used Claude to refine the function.</p>
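<p>For example, a quick way to summarize the results list the function returns:</p>
<pre><code class="lang-python"># list the pages flagged with errors and where the annotated copies were saved
for r in result:
    if r.get("has_error_messages"):
        print(r["filename"], "-&gt;", r.get("output_path", "no annotated copy"))
</code></pre>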
<h2 id="heading-test">Test</h2>
<ol>
<li>You can see that this report has many pages. Some pages have error like below, some don’t.</li>
</ol>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1749851745097/feac43b5-7be6-4a70-83ca-10468679a7ce.png" alt class="image--center mx-auto" /></p>
<ol start="2">
<li><p>The function saved the report pages as a PNG and extracted the report pages:</p>
<p> <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1749851866573/de1a182f-dc0a-4af3-b5b8-5a3177640041.png" alt class="image--center mx-auto" /></p>
<ol start="3">
<li><p>In the extracted folder, any pages with errors were saved with the bounding boxes</p>
</li>
<li><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1749851919830/71c6b14d-9ab6-4c7b-99d6-4ecdce6797f3.png" alt class="image--center mx-auto" /></p>
<p> Here is what the output looks like:</p>
<p> <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1749851964933/1db770d2-7689-4da4-bb58-456a4298b919.png" alt class="image--center mx-auto" /></p>
<p> <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1749852072876/522e4cb7-6dd3-4bfd-a627-d12158bd02fe.png" alt class="image--center mx-auto" /></p>
<p> <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1749852107994/48646f10-42eb-423e-8625-b37bda366f22.png" alt class="image--center mx-auto" /></p>
<p> You can automate this process and send email alerts if errors are detected - all <a target="_blank" href="https://fabric.guru/email-semantic-model-bpa-report-using-semantic-link-labs">using semantic link labs</a>.</p>
<p> I do not expect anyone to use the given the pre-requisites and risks - but I just wanted to test it for my own sake. It works !</p>
</li>
</ol>
</li>
</ol>
<p>If you have a pbir report, <a target="_blank" href="https://www.kerski.tech/bringing-dataops-to-power-bi-part43/">John Kerski’s solution</a> and <a target="_blank" href="https://data-goblins.com/power-bi/something-is-wrong-with-one-or-more-fields">Kurt’s solution</a> using Semantic Link Labs are more practical.</p>
]]></content:encoded></item><item><title><![CDATA[Fabric Workspace Activity Location Based On IP Address Using KQL]]></title><description><![CDATA[This is a second blog in a row inspired by Edgar Cotte (Sr PM, Fabric CAT). At a recent RTI workshop, he shared a handy KQL function geo_info_from_ip_address which retrieves geolocation based on IP address. You can read more about it here.
A few mont...]]></description><link>https://fabric.guru/geolocation-based-on-ip-address-using-kql</link><guid isPermaLink="true">https://fabric.guru/geolocation-based-on-ip-address-using-kql</guid><category><![CDATA[realtimedashboard]]></category><category><![CDATA[realtimeintelligence]]></category><category><![CDATA[KQL]]></category><category><![CDATA[eventhouse]]></category><category><![CDATA[geolocation]]></category><category><![CDATA[#GeospatialAnalysis]]></category><category><![CDATA[microsoft fabric]]></category><dc:creator><![CDATA[Sandeep Pawar]]></dc:creator><pubDate>Mon, 09 Jun 2025 22:38:41 GMT</pubDate><content:encoded><![CDATA[<p>This is a second blog in a row inspired by <a target="_blank" href="https://www.linkedin.com/in/edgarcotte/">Edgar Cotte</a> (Sr PM, Fabric CAT). At a recent RTI workshop, he shared a handy KQL function <code>geo_info_from_ip_address</code> which retrieves geolocation based on IP address. You can read more about it <a target="_blank" href="https://learn.microsoft.com/en-us/kusto/query/geo-info-from-ip-address-function?view=microsoft-fabric">here</a>.</p>
<p>A few months ago I wrote a blog on <a target="_blank" href="https://fabric.guru/whats-your-most-active-fabric-workspace">getting workspace activities</a> using Semantic Link Labs. I have been using it on one of my workspaces in my personal tenant to generate activity data. I retrieve the logs for each day and save them to a lakehouse. The activity event logs have a <code>Client IP</code> address field, and I wanted to try the above function on it. So I shortcutted the delta table to a KQL table in an Eventhouse and it worked like a charm. Super helpful for auditing &amp; monitoring workspace activities.</p>
<pre><code class="lang-python"><span class="hljs-comment">## GET ACTIVITIES FOR THE LAST 7 DAYS AND SAVE IT TO A LAKEHOUSE USING POLARS</span>
<span class="hljs-comment">## USING PYTHON NOTEBOOK BELOW</span>

<span class="hljs-keyword">from</span> datetime <span class="hljs-keyword">import</span> datetime, timedelta
<span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
<span class="hljs-keyword">import</span> sempy_labs <span class="hljs-keyword">as</span> labs
<span class="hljs-keyword">import</span> polars <span class="hljs-keyword">as</span> pl

<span class="hljs-comment">#last 7 days</span>
N=<span class="hljs-number">7</span>

activities = []
<span class="hljs-keyword">for</span> n <span class="hljs-keyword">in</span> range(N):
    day = datetime.now() - timedelta(days=n)
    start_of_day = day.replace(hour=<span class="hljs-number">0</span>, minute=<span class="hljs-number">0</span>, second=<span class="hljs-number">0</span>, microsecond=<span class="hljs-number">0</span>).strftime(<span class="hljs-string">'%Y-%m-%dT%H:%M:%S'</span>)
    end_of_day = day.replace(hour=<span class="hljs-number">23</span>, minute=<span class="hljs-number">59</span>, second=<span class="hljs-number">59</span>, microsecond=<span class="hljs-number">999999</span>).strftime(<span class="hljs-string">'%Y-%m-%dT%H:%M:%S'</span>)

    df = labs.admin.list_activity_events(
        start_time=start_of_day, 
        end_time=end_of_day
    ) 
    activities.append(df)

final_df = pd.concat(activities)
pl_df = pl.from_pandas(final_df)
<span class="hljs-comment">## change to your abfss</span>
pl_df.write_delta(<span class="hljs-string">"abfss://Sales@onelake.dfs.fabric.microsoft.com/MyLakehouse.Lakehouse/Tables/dbo/ws_activities"</span>, mode=<span class="hljs-string">"overwrite"</span>)
</code></pre>
<p>In the Eventhouse, I added the above table as a shortcut and queried to aggregate by location using <code>geo_info_from_ip_address</code>:</p>
<pre><code class="lang-python">//table name <span class="hljs-string">"ws_activities"</span>

external_table(<span class="hljs-string">"ws_activities"</span>)
| extend Location = geo_info_from_ip_address([<span class="hljs-string">'Client IP'</span>])
| extend LocationString = strcat_delim(<span class="hljs-string">", "</span>, 
    tostring(Location.city), 
    tostring(Location.state), 
    tostring(Location.country))
| extend Latitude = todouble(Location.latitude)
| extend Longitude = todouble(Location.longitude)
| where isnotempty(LocationString) <span class="hljs-keyword">and</span> LocationString != <span class="hljs-string">", , "</span>
| where isnotempty(Latitude) <span class="hljs-keyword">and</span> isnotempty(Longitude)
| summarize EventCount = count() by LocationString, Latitude, Longitude
</code></pre>
<p>I live in Portland, OR and recently travelled to Seattle, WA so this checks out:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1749508228502/596fb9bc-3f6b-4454-b42f-5f4e1a966931.png" alt class="image--center mx-auto" /></p>
<p>Rendered the result as a map which shows Portland and Seattle.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1749508612727/9b97867c-6a56-4c47-a8c0-aaca06121fdb.png" alt class="image--center mx-auto" /></p>
<p>The same map is available in <a target="_blank" href="https://learn.microsoft.com/en-us/fabric/real-time-intelligence/dashboard-real-time-create">Real Time Dashboard</a> so you can track the location of activities in real time and generate alerts if needed!</p>
]]></content:encoded></item><item><title><![CDATA[Using Fabric CLI in Fabric Notebook]]></title><description><![CDATA[First I would like to thank Edgar Cotte (Sr PM Fabric CAT) for the inspiration for this blog. Edgar shared this recently in a workshop so I got curious and explored more. Fabric CLI, as the name suggests, allows you to interact with the Fabric enviro...]]></description><link>https://fabric.guru/using-fabric-cli-in-fabric-notebook</link><guid isPermaLink="true">https://fabric.guru/using-fabric-cli-in-fabric-notebook</guid><category><![CDATA[fabric-cli]]></category><category><![CDATA[fabriccli]]></category><category><![CDATA[microsoftfabric]]></category><category><![CDATA[automation]]></category><dc:creator><![CDATA[Sandeep Pawar]]></dc:creator><pubDate>Fri, 06 Jun 2025 22:14:18 GMT</pubDate><content:encoded><![CDATA[<p>First I would like to thank <a target="_blank" href="https://www.linkedin.com/in/edgarcotte/">Edgar Cotte</a> (Sr PM Fabric CAT) for the inspiration for this blog. Edgar shared this recently in a workshop so I got curious and explored more. Fabric CLI, as the name suggests, allows you to interact with the Fabric environment using command line interface. Whether you are a Fabric admin or a developer, it’s essential for exploration and automation from a convenient interface. But if you work with external clients like I do where there may be restrictions on installing libraries, using Fabric CLI in Fabric notebook may be an easier option especially if you are already familiar with all the commands.</p>
<p>To learn more about Fabric CLI, refer to:</p>
<ul>
<li><p><a target="_blank" href="https://learn.microsoft.com/en-us/rest/api/fabric/articles/fabric-command-line-interface">Fabric command line interface - Microsoft Fabric REST APIs | Microsoft Learn</a></p>
</li>
<li><p><a target="_blank" href="https://microsoft.github.io/fabric-cli/">Microsoft Fabric CLI | fabric-cli</a></p>
</li>
<li><p><a target="_blank" href="https://www.youtube.com/watch?v=y2rWzSFStZ8">Microsoft Fabric CLI: Turbo-charge Ops with Retro Command-Line Power</a></p>
</li>
<li><p><a target="_blank" href="https://www.youtube.com/watch?v=-3OjBT6f_Yw">What's Coming in Fabric Automation and CI/CD | BRK205</a></p>
</li>
</ul>
<div class="embed-wrapper"><div class="embed-loading"><div class="loadingRow"></div><div class="loadingRow"></div></div><a class="embed-card" href="https://www.youtube.com/watch?v=y2rWzSFStZ8">https://www.youtube.com/watch?v=y2rWzSFStZ8</a></div>
<p> </p>
<h2 id="heading-installation">Installation:</h2>
<p>In Fabric python notebook, first install fabric cli :</p>
<pre><code class="lang-python">%pip install ms-fabric-cli --q
</code></pre>
<h2 id="heading-set-up-auth-token">Set up Auth Token:</h2>
<p>This will use the identity of the user executing the notebook. You can also use service principal authentication as <a target="_blank" href="https://github.com/pawarbi/snippets/blob/main/fabcli-sp_token.py">shown here</a>.</p>
<pre><code class="lang-python">token = notebookutils.credentials.getToken(<span class="hljs-string">'pbi'</span>)
os.environ[<span class="hljs-string">'FAB_TOKEN'</span>] = token
os.environ[<span class="hljs-string">'FAB_TOKEN_ONELAKE'</span>] = token
</code></pre>
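<p>As a quick sanity check that the token was picked up, you can check the auth state (assuming your <code>ms-fabric-cli</code> version supports <code>auth status</code>):</p>
<pre><code class="lang-python">!fab auth status
</code></pre>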
<h2 id="heading-commands">Commands:</h2>
<p>There are two types of commands - command line and interactive. In the notebook, you should be able to run all commands supported in command line mode. The documentation does a great job of describing if a command can be used in command line, interactive or both.</p>
<p>There are two ways you can execute the supported commands - using magic or using a Python wrapper for the commands.</p>
<h3 id="heading-using-cell-magic">Using cell magic</h3>
<p>Once installed and auth is set up, use the <code>!fab</code> magic to run shell commands. Below I get a list of all the workspaces along with the attached capacity and capacity id.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1749152405199/3c57aa85-835e-4857-b580-5cb81ffc7faf.png" alt class="image--center mx-auto" /></p>
<p>To get a list of all items in a workspace:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1749152587886/5a8029bf-e566-4e8c-9430-80a45d9dd3cd.png" alt class="image--center mx-auto" /></p>
<p>This works even if you have spaces in item/workspace names:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1749152761203/d713a13f-3ce2-4f0e-8f3e-e06812da53e3.png" alt class="image--center mx-auto" /></p>
<p>You can use <a target="_blank" href="https://jmespath.org/">JMESPath</a> in the query, or you can use shell commands like below to filter the lines that contain the word “Trial” and print the first column (i.e. the workspace names):</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1749153667441/70802d78-4270-4b45-b6c9-7c724b2ae0dc.png" alt class="image--center mx-auto" /></p>
<p>If a command is not available, you can call an API inline as well:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1749153926222/e997c6f1-2a02-4086-8603-c6ab87dcc6b8.png" alt class="image--center mx-auto" /></p>
<p>To download items to an attached default lakehouse:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1749154451855/134d485b-4803-48d9-ac00-53c5171974c0.png" alt class="image--center mx-auto" /></p>
<h3 id="heading-multi-line">Multi-line</h3>
<p><code>!fab</code> allows only one-line commands. To create more complex multi-line commands, you can use the <code>%%sh</code> cell magic like below:</p>
<pre><code class="lang-powershell">%%sh
<span class="hljs-built_in">echo</span> <span class="hljs-string">"=== Exploring Reid Workspace ==="</span>

<span class="hljs-comment"># jump to workspace</span>
fab <span class="hljs-literal">-c</span> <span class="hljs-string">"cd Reid.Workspace"</span> || <span class="hljs-built_in">echo</span> <span class="hljs-string">"Failed to navigate"</span>

<span class="hljs-built_in">echo</span> <span class="hljs-string">"Listing all items:"</span>
fab <span class="hljs-literal">-c</span> <span class="hljs-string">"ls Reid.Workspace -l"</span> || <span class="hljs-built_in">echo</span> <span class="hljs-string">"No items or access denied"</span>

<span class="hljs-built_in">echo</span> <span class="hljs-string">""</span>
<span class="hljs-built_in">echo</span> <span class="hljs-string">"Notebooks only:"</span>
fab <span class="hljs-literal">-c</span> <span class="hljs-string">"ls Reid.Workspace"</span> | grep <span class="hljs-string">"\.Notebook"</span> || <span class="hljs-built_in">echo</span> <span class="hljs-string">"No notebooks found"</span>

<span class="hljs-built_in">echo</span> <span class="hljs-string">""</span>
<span class="hljs-built_in">echo</span> <span class="hljs-string">"Reports only:"</span>
fab <span class="hljs-literal">-c</span> <span class="hljs-string">"ls Reid.Workspace"</span> | grep <span class="hljs-string">"\.Report"</span> || <span class="hljs-built_in">echo</span> <span class="hljs-string">"No reports found"</span>

<span class="hljs-built_in">echo</span> <span class="hljs-string">"Exploration completed"</span>
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1749171066823/c990a193-3ad1-44e9-9aa7-d9e055a951e2.png" alt class="image--center mx-auto" /></p>
<p>You can also pass Python variables to commands, making it very dynamic:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1749171699172/4642f5e8-5d38-4d5d-bc59-4026d6110231.png" alt class="image--center mx-auto" /></p>
<div data-node-type="callout">
<div data-node-type="callout-emoji">💡</div>
<div data-node-type="callout-text">It would be great if <code>fabric-cli</code> library is installed in the default Python runtime with the token set up so users can use <code>!fab</code> easily. That would be super handy.</div>
</div>

<h2 id="heading-using-python">Using Python</h2>
<p>The above method is great for interactive exploration. For automation, you can wrap the commands in Python functions and use them like any other Python function. For example, above I downloaded one notebook; for more complex patterns and automation, I can use <code>subprocess.run()</code> to execute the commands.</p>
<p>To download all notebooks from a workspace to a lakehouse location:</p>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">download_item</span>(<span class="hljs-params">workspace, item_name, local_path=<span class="hljs-string">"/lakehouse/default/Files/tmp"</span>, force=True</span>):</span>
    <span class="hljs-string">"""
    Download a Fabric item to local dir
    """</span>
    <span class="hljs-keyword">try</span>:

        os.makedirs(local_path, exist_ok=<span class="hljs-literal">True</span>)
        cmd = <span class="hljs-string">f"export <span class="hljs-subst">{workspace}</span>.Workspace/<span class="hljs-subst">{item_name}</span> -o <span class="hljs-subst">{local_path}</span>"</span>
        <span class="hljs-keyword">if</span> force:
            cmd += <span class="hljs-string">" -f"</span>
        result = subprocess.run([<span class="hljs-string">"fab"</span>, <span class="hljs-string">"-c"</span>, cmd], capture_output=<span class="hljs-literal">True</span>, text=<span class="hljs-literal">True</span>)

        <span class="hljs-keyword">if</span> result.returncode == <span class="hljs-number">0</span>:
            print(<span class="hljs-string">f"Downloaded <span class="hljs-subst">{item_name}</span> to <span class="hljs-subst">{local_path}</span>"</span>)
            <span class="hljs-keyword">return</span> <span class="hljs-literal">True</span>
        <span class="hljs-keyword">else</span>:
            print(<span class="hljs-string">f"Failed: <span class="hljs-subst">{result.stderr}</span>"</span>)
            <span class="hljs-keyword">return</span> <span class="hljs-literal">False</span>

    <span class="hljs-keyword">except</span> Exception <span class="hljs-keyword">as</span> e:
        print(<span class="hljs-string">f"Error: <span class="hljs-subst">{e}</span>"</span>)
        <span class="hljs-keyword">return</span> <span class="hljs-literal">False</span>
</code></pre>
<p>For loop:</p>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">download_all_notebooks</span>(<span class="hljs-params">workspace, local_path=<span class="hljs-string">"/lakehouse/default/Files/tmp"</span></span>):</span>
    <span class="hljs-string">"""Download all notebooks from workspace"""</span>
    <span class="hljs-keyword">try</span>:

        result = subprocess.run([<span class="hljs-string">"fab"</span>, <span class="hljs-string">"-c"</span>, <span class="hljs-string">f"ls <span class="hljs-subst">{workspace}</span>.Workspace"</span>], capture_output=<span class="hljs-literal">True</span>, text=<span class="hljs-literal">True</span>)
        notebooks = [line.strip() <span class="hljs-keyword">for</span> line <span class="hljs-keyword">in</span> result.stdout.split(<span class="hljs-string">'\n'</span>) <span class="hljs-keyword">if</span> <span class="hljs-string">'.Notebook'</span> <span class="hljs-keyword">in</span> line]

        success_count = <span class="hljs-number">0</span>
        <span class="hljs-keyword">for</span> notebook <span class="hljs-keyword">in</span> notebooks:
            <span class="hljs-keyword">if</span> download_item(workspace, notebook, local_path):
                success_count += <span class="hljs-number">1</span>

        print(<span class="hljs-string">f"Downloaded <span class="hljs-subst">{success_count}</span>/<span class="hljs-subst">{len(notebooks)}</span> notebooks"</span>)
        <span class="hljs-keyword">return</span> success_count

    <span class="hljs-keyword">except</span> Exception <span class="hljs-keyword">as</span> e:
        print(<span class="hljs-string">f"Error downloading: <span class="hljs-subst">{e}</span>"</span>)
        <span class="hljs-keyword">return</span> <span class="hljs-number">0</span>
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1749155367727/adf4e4dd-6666-46de-98f7-7373ce4d12e1.png" alt class="image--center mx-auto" /></p>
<p>You can generalize this to create reusable functions; e.g., the snippet below lists all workspaces:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> subprocess
<span class="hljs-keyword">import</span> json
<span class="hljs-keyword">from</span> typing <span class="hljs-keyword">import</span> Optional, Dict, Any

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">exec_fabcli</span>(<span class="hljs-params">command: str, capture_output: bool = False, silently_continue: bool = False</span>) -&gt; Optional[str]:</span>
    <span class="hljs-string">"""
    Run a Fabric CLI command from within a notebook.

    Args:
        command: The fab command to run (without 'fab' prefix)
        capture_output: Whether to capture and return the output
        silently_continue: Whether to suppress exceptions on errors

    Returns:
        Command output if capture_output=True, otherwise None
    """</span>
    <span class="hljs-keyword">try</span>:
        result = subprocess.run([<span class="hljs-string">"fab"</span>, <span class="hljs-string">"-c"</span>, command], capture_output=<span class="hljs-literal">True</span>, text=<span class="hljs-literal">True</span>, timeout=<span class="hljs-number">60</span>)

        <span class="hljs-keyword">if</span> <span class="hljs-keyword">not</span> silently_continue <span class="hljs-keyword">and</span> result.returncode != <span class="hljs-number">0</span>:
            error_msg = <span class="hljs-string">f"Command failed with exit code <span class="hljs-subst">{result.returncode}</span>\n"</span>
            error_msg += <span class="hljs-string">f"STDOUT: <span class="hljs-subst">{result.stdout}</span>\n"</span>
            error_msg += <span class="hljs-string">f"STDERR: <span class="hljs-subst">{result.stderr}</span>"</span>
            <span class="hljs-keyword">raise</span> Exception(error_msg)

        <span class="hljs-keyword">if</span> capture_output:
            <span class="hljs-keyword">return</span> result.stdout.strip()
        <span class="hljs-keyword">else</span>:
            <span class="hljs-comment"># output</span>
            <span class="hljs-keyword">if</span> result.stdout:
                print(result.stdout)
            <span class="hljs-keyword">if</span> result.stderr:
                print(<span class="hljs-string">f"Warning: <span class="hljs-subst">{result.stderr}</span>"</span>)

    <span class="hljs-keyword">except</span> subprocess.TimeoutExpired:
        <span class="hljs-keyword">raise</span> Exception(<span class="hljs-string">"Command timed out after 60 seconds"</span>)
    <span class="hljs-keyword">except</span> FileNotFoundError:
        <span class="hljs-keyword">raise</span> Exception(<span class="hljs-string">"Fabric CLI not found. Make sure 'fab-cli' is installed"</span>)

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">list_workspaces</span>(<span class="hljs-params">detailed: bool = False, show_hidden: bool = False</span>) -&gt; str:</span>
    <span class="hljs-string">"""List all workspaces"""</span>
    flags = <span class="hljs-string">""</span>
    <span class="hljs-keyword">if</span> detailed:
        flags += <span class="hljs-string">" -l"</span>
    <span class="hljs-keyword">if</span> show_hidden:
        flags += <span class="hljs-string">" -a"</span>
    <span class="hljs-keyword">return</span> exec_fabcli(<span class="hljs-string">f"ls<span class="hljs-subst">{flags}</span>"</span>, capture_output=<span class="hljs-literal">True</span>)

print(list_workspaces(detailed=<span class="hljs-literal">True</span>))
</code></pre>
<h3 id="heading-semantic-linklabs-vs-fabric-cli">Semantic Link/Labs vs Fabric CLI</h3>
<p>Semantic Link and Semantic Link Labs also provide similar capabilities, but if you are already using the CLI, this is a convenient way to reuse what you already know. It is especially helpful for automating workflows. Different tools, different use cases: Fabric CLI can be used for CI/CD with GitHub and ADO automation. <a target="_blank" href="https://www.linkedin.com/in/jacobknightley">Jacob Knightley</a> shared this excellent comparison at FabCon:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1749247421596/fad7661e-dc9f-4e28-bbd0-8161373b77cc.png" alt class="image--center mx-auto" /></p>
<p>Here are additional resources to learn more:</p>
<ul>
<li><p><a target="_blank" href="https://nam02.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2FRuiRomano%2Ffabric-cli-powerbi-cicd-sample&amp;data=05%7C02%7Csandeeppawar%40hitachisolutions.com%7C121210cd6da247bd9cca08dda4d25902%7Ce85feadf11e747bba16043b98dcc96f1%7C0%7C0%7C638847945593115950%7CUnknown%7CTWFpbGZsb3d8eyJFbXB0eU1hcGkiOnRydWUsIlYiOiIwLjAuMDAwMCIsIlAiOiJXaW4zMiIsIkFOIjoiTWFpbCIsIldUIjoyfQ%3D%3D%7C0%7C%7C%7C&amp;sdata=GcXj%2BJPRH6NcUdw4FYIC3Y7o8EFWtMkgzqogULaHq7s%3D&amp;reserved=0">RuiRomano/fabric-cli-powerbi-cicd-sample</a></p>
</li>
<li><p>Fabric CLI to deploy FUAM : <a target="_blank" href="https://github.com/microsoft/fabric-toolbox/blob/main/monitoring/fabric-unified-admin-monitoring/scripts/Deploy_FUAM.ipynb">fabric-toolbox/monitoring/fabric-unified-admin-monitoring/scripts/Deploy_FUAM.ipynb at main · microsoft/fabric-toolbox</a></p>
</li>
<li><p>Demos by <a target="_blank" href="https://murggu.com/">Aitor Murguzur</a>: <a target="_blank" href="https://github.com/murggu/fab-demos">murggu/fab-demos</a></p>
</li>
<li><p>Multi-tenant scenario: <a target="_blank" href="https://github.com/alisonpezzott/pbi-ci-cd-isv-multi-tenant">alisonpezzott/pbi-ci-cd-isv-multi-tenant: CI/CD scenario Multi Tenant for Microsoft Power BI PRO projects by utilizing fabric-cli and GitHub Actions</a></p>
</li>
<li><p><a target="_blank" href="https://nam02.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fecotte%2FFabric-Monitoring-RTI&amp;data=05%7C02%7Csandeeppawar%40hitachisolutions.com%7C121210cd6da247bd9cca08dda4d25902%7Ce85feadf11e747bba16043b98dcc96f1%7C0%7C0%7C638847945593164781%7CUnknown%7CTWFpbGZsb3d8eyJFbXB0eU1hcGkiOnRydWUsIlYiOiIwLjAuMDAwMCIsIlAiOiJXaW4zMiIsIkFOIjoiTWFpbCIsIldUIjoyfQ%3D%3D%7C0%7C%7C%7C&amp;sdata=9fHIIP9qnTvuOnky5IVZlWj0NX5D5jz8PUwbMpQDp1Q%3D&amp;reserved=0">ecotte/Fabric-Monitoring-RTI</a> by Edgar Cotte</p>
</li>
</ul>
<h2 id="heading-experiment-fabgpt">Experiment : <code>fabgpt</code></h2>
<p>I can’t complete the blog without mentioning AI, can I? :D I love how easily Fabric CLI lets you query and automate. I was wondering: what if all of that could be achieved with natural language? In the example below, I used OpenAI + a notebook cell magic to wrap the above Python functions in a cell magic I am calling <code>fabgpt</code> :D It generates Python code like the above and executes it.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1749247691995/3e26df1c-ffcc-4b63-a311-ad4743e56368.png" alt class="image--center mx-auto" /></p>
]]></content:encoded></item><item><title><![CDATA[Bulk Copy Semantic Model Objects and Properties Between Models Using Semantic Link Labs]]></title><description><![CDATA[The new version of Semantic Link Labs is out (v 0.9.11), and as always, Michael Kovalsky has added many new features that make working with Semantic Models in Fabric items much easier. One new method introduced is copy_object(), which, as the name su...]]></description><link>https://fabric.guru/bulk-copy-semantic-model-objects-and-properties-between-models-using-semantic-link-labs</link><guid isPermaLink="true">https://fabric.guru/bulk-copy-semantic-model-objects-and-properties-between-models-using-semantic-link-labs</guid><category><![CDATA[microsoftfabric]]></category><category><![CDATA[semantic link labs]]></category><category><![CDATA[sempy]]></category><category><![CDATA[migration]]></category><category><![CDATA[semantic-model]]></category><category><![CDATA[Power BI]]></category><dc:creator><![CDATA[Sandeep Pawar]]></dc:creator><pubDate>Tue, 27 May 2025 04:21:17 GMT</pubDate><content:encoded><![CDATA[<p>The new version of Semantic Link Labs is out (<a target="_blank" href="https://github.com/microsoft/semantic-link-labs/releases/tag/0.9.11">v 0.9.11</a>), and as always, <a target="_blank" href="https://www.linkedin.com/in/michaelkovalsky/">Michael Kovalsky</a> has added many new features that make working with semantic models in Fabric much easier. One new method introduced is <code>copy_object()</code>, which, as the name suggests, makes copying semantic model objects from one semantic model to another a breeze. Previously, you could do this using TOM, but now it has its own function, so you can skip the boilerplate. You can also use <a target="_blank" href="https://github.com/TabularEditor/TabularEditor">Tabular Editor</a> (either copy/paste manually or use a C# script). I have done many migrations and have always used a C# script I developed; in this case, we use Python. You can also copy/paste between Import &lt;-&gt; Direct Lake models (as long as the objects/properties are supported).</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1748317769333/036ef0cc-7d24-4e88-81b0-321250a39e7d.gif" alt class="image--center mx-auto" /></p>
<h1 id="heading-copyobejct">copy_obejct():</h1>
<p>The method is easy to use:</p>
<pre><code class="lang-python"><span class="hljs-comment">#%pip install semantic_link_labs --q</span>
<span class="hljs-keyword">import</span> sempy.labs <span class="hljs-keyword">as</span> labs
<span class="hljs-keyword">with</span> labs.tom.connect_semantic_model(dataset=<span class="hljs-string">"SourceDataset"</span>, workspace=<span class="hljs-string">"SourceWorkspace"</span>) <span class="hljs-keyword">as</span> tom:
    <span class="hljs-comment"># to copy Sales table from source dataset to target dataset</span>
    table = tom.model.Tables[<span class="hljs-string">"Sales"</span>]
    tom.copy_object(
        object=table,
        target_dataset=<span class="hljs-string">"TargetDataset"</span>,
        target_workspace=<span class="hljs-string">"TargetWorkspace"</span> 
    )
</code></pre>
<h1 id="heading-bulk-copy-objects">Bulk Copy Objects:</h1>
<p>For demo purposes, I created two models, <code>model_1</code> and <code>model_2</code>. <code>model_1</code> has relationships, calculated columns, calculated tables and measures. <code>model_2</code> has only the two import tables, without any relationships, calculated tables/columns or measures. I want to copy from <code>model_1</code> to <code>model_2</code>. I will define the functions:</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> sempy_labs.tom <span class="hljs-keyword">import</span> connect_semantic_model
<span class="hljs-keyword">import</span> sempy.fabric <span class="hljs-keyword">as</span> fabric

<span class="hljs-comment">#if workspaces are None, current workspace is used</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">copy_relationships</span>(<span class="hljs-params">
    source_dataset: str,
    target_dataset: str,
    source_workspace: str = None,
    target_workspace: str = None
</span>):</span>
    <span class="hljs-string">"""
    Copy all relationships from the source semantic model to the target semantic model.
    """</span>
    <span class="hljs-keyword">with</span> connect_semantic_model(dataset=source_dataset, workspace=source_workspace) <span class="hljs-keyword">as</span> src_tom:
        <span class="hljs-keyword">for</span> rel <span class="hljs-keyword">in</span> src_tom.model.Relationships:
            print(<span class="hljs-string">f"Copying relationship: <span class="hljs-subst">{rel.Name}</span>"</span>)
            src_tom.copy_object(
                object=rel,
                target_dataset=target_dataset,
                target_workspace=target_workspace
            )
    print(<span class="hljs-string">"All relationships have been copied."</span>)


<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">copy_calculated_columns</span>(<span class="hljs-params">
    source_dataset: str,
    target_dataset: str,
    source_workspace: str = None,
    target_workspace: str = None
</span>):</span>
    <span class="hljs-string">"""
    Copy all calculated columns from the source semantic model to the target semantic model.
    Uses sempy.fabric.list_columns to reliably identify calculated columns.
    """</span>
    calc_cols_df = fabric.list_columns(source_dataset, workspace=source_workspace)[[<span class="hljs-string">'Table Name'</span>, <span class="hljs-string">'Column Name'</span>, <span class="hljs-string">'Type'</span>]]
    calculated_columns = set(
        tuple(x) <span class="hljs-keyword">for</span> x <span class="hljs-keyword">in</span> calc_cols_df.query(<span class="hljs-string">'Type == "Calculated"'</span>)[[<span class="hljs-string">'Table Name'</span>, <span class="hljs-string">'Column Name'</span>]].values
    )

    <span class="hljs-keyword">with</span> connect_semantic_model(dataset=source_dataset, workspace=source_workspace) <span class="hljs-keyword">as</span> src_tom:
        <span class="hljs-keyword">for</span> table <span class="hljs-keyword">in</span> src_tom.model.Tables:
            <span class="hljs-keyword">for</span> column <span class="hljs-keyword">in</span> table.Columns:
                <span class="hljs-keyword">if</span> (table.Name, column.Name) <span class="hljs-keyword">in</span> calculated_columns:
                    print(<span class="hljs-string">f"Copying calculated column: <span class="hljs-subst">{column.Name}</span> from table: <span class="hljs-subst">{table.Name}</span>"</span>)
                    src_tom.copy_object(
                        object=column,
                        target_dataset=target_dataset,
                        target_workspace=target_workspace
                    )
    print(<span class="hljs-string">"All calculated columns have been copied."</span>)

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">copy_calculated_tables</span>(<span class="hljs-params">
    source_dataset: str,
    target_dataset: str,
    source_workspace: str = None,
    target_workspace: str = None
</span>):</span>
    <span class="hljs-string">"""
    Copy all calculated tables from the source semantic model to the target semantic model.

    """</span>

    calculated_tables = set(
        fabric.list_tables(source_dataset, workspace=source_workspace)
              .query(<span class="hljs-string">'Type == "Calculated Table"'</span>)[<span class="hljs-string">'Name'</span>]
    )

    <span class="hljs-keyword">with</span> connect_semantic_model(dataset=source_dataset, workspace=source_workspace) <span class="hljs-keyword">as</span> src_tom:
        <span class="hljs-keyword">for</span> table <span class="hljs-keyword">in</span> src_tom.model.Tables:
            <span class="hljs-keyword">if</span> table.Name <span class="hljs-keyword">in</span> calculated_tables:
                print(<span class="hljs-string">f"Copying calculated table: <span class="hljs-subst">{table.Name}</span>"</span>)
                src_tom.copy_object(
                    object=table,
                    target_dataset=target_dataset,
                    target_workspace=target_workspace
                )
    print(<span class="hljs-string">"All calculated tables have been copied."</span>)

<span class="hljs-keyword">from</span> sempy_labs.tom <span class="hljs-keyword">import</span> connect_semantic_model

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">copy_all_measures</span>(<span class="hljs-params">
    source_dataset: str,
    target_dataset: str,
    source_workspace: str = None,
    target_workspace: str = None
</span>):</span>
    <span class="hljs-string">"""
    Copy all measures from every table in the source semantic model to the target semantic model.

    """</span>
    <span class="hljs-keyword">with</span> connect_semantic_model(dataset=source_dataset, workspace=source_workspace) <span class="hljs-keyword">as</span> src_tom:
        <span class="hljs-keyword">for</span> table <span class="hljs-keyword">in</span> src_tom.model.Tables:
            <span class="hljs-keyword">for</span> measure <span class="hljs-keyword">in</span> table.Measures:
                print(<span class="hljs-string">f"Copying measure '<span class="hljs-subst">{measure.Name}</span>' from table '<span class="hljs-subst">{table.Name}</span>'..."</span>)
                <span class="hljs-keyword">try</span>:
                    src_tom.copy_object(
                        object=measure,
                        target_dataset=target_dataset,
                        target_workspace=target_workspace)
                <span class="hljs-keyword">except</span>:
                    print(<span class="hljs-string">f"Error with <span class="hljs-subst">{measure.Name}</span>, check again"</span>)
                    <span class="hljs-keyword">continue</span>

    print(<span class="hljs-string">"All measures have been copied."</span>)

copy_relationships(
    source_dataset=<span class="hljs-string">"model_1"</span>,
    target_dataset=<span class="hljs-string">"model_2"</span>,
    source_workspace=<span class="hljs-string">"a79cbb27-3cc-d64bf25ca405"</span>,
    target_workspace=<span class="hljs-string">"a79cbb27-3bf64bf25ca405"</span>
)

copy_calculated_tables(
    source_dataset=<span class="hljs-string">"model_1"</span>,
    target_dataset=<span class="hljs-string">"model_2"</span>,
    source_workspace=<span class="hljs-string">"a79cbb27-3cc-d64bf25ca405"</span>,
    target_workspace=<span class="hljs-string">"a79cbb27-3bf64bf25ca405"</span>
)

copy_calculated_columns(
    source_dataset=<span class="hljs-string">"model_1"</span>,
    target_dataset=<span class="hljs-string">"model_2"</span>,
    source_workspace=<span class="hljs-string">"a79cbb27-3cc-d64bf25ca405"</span>,
    target_workspace=<span class="hljs-string">"a79cbb27-3bf64bf25ca405"</span>
)

copy_all_measures(
    source_dataset=<span class="hljs-string">"model_1"</span>,
    target_dataset=<span class="hljs-string">"model_2"</span>,
    source_workspace=<span class="hljs-string">"a79cbb27-3cc-d64bf25ca405"</span>,
    target_workspace=<span class="hljs-string">"a79cbb27-3bf64bf25ca405"</span>
)
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1748318366157/e6591cc4-8819-4022-b5bf-6db62036cc77.png" alt class="image--center mx-auto" /></p>
<h2 id="heading-before">Before:</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1748318552220/ce82fd65-cbb7-4754-abf1-87079bdff985.png" alt class="image--center mx-auto" /></p>
<h2 id="heading-after">After</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1748318590557/cda64ae4-a302-45ff-a8a1-1d919d2670a3.png" alt class="image--center mx-auto" /></p>
<p>When you copy tables, make sure to update the connection and refresh the target semantic model to reflect the changes and ensure all objects are copied correctly. When you copy objects, all of their properties are copied as well (e.g. formatting, annotations, etc.). You can also copy any TOM object, such as partitions, hierarchies, perspectives, and more.</p>
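<p>For example, copying every perspective follows the same pattern. A sketch, assuming the same <code>model_1</code>/<code>model_2</code> setup as above (pass <code>target_workspace</code> as well if the target lives in a different workspace):</p>
<pre><code class="lang-python">from sempy_labs.tom import connect_semantic_model

# same pattern as the functions above, applied to perspectives
with connect_semantic_model(dataset="model_1") as tom:
    for perspective in tom.model.Perspectives:
        tom.copy_object(
            object=perspective,
            target_dataset="model_2"
        )
</code></pre>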
<p>You need XMLA Write enabled in the tenant and capacity settings. You also need Contributor+ access (or Build permissions) on both semantic models. This works in any Premium/Fabric workspace.</p>
<p>Download notebook from <a target="_blank" href="https://github.com/pawarbi/snippets/blob/main/Copy%20Model%20Objects.ipynb">here.</a></p>
<h1 id="heading-references">References:</h1>
<ul>
<li><p><a target="_blank" href="https://semantic-link-labs.readthedocs.io/en/stable/sempy_labs.tom.html#sempy_labs.tom.TOMWrapper.copy_object">sempy_labs.tom package — semantic-link-labs 0.9.11 documentation</a></p>
</li>
<li><p><a target="_blank" href="https://github.com/TabularEditor/TabularEditor">TabularEditor/TabularEditor: This is the code repository and issue tracker for Tabular Editor 2.X (free, open-source version). This repository is being maintained by Daniel Otykier.</a></p>
</li>
<li><p><a target="_blank" href="https://docs.tabulareditor.com/common/CSharpScripts/csharp-script-library-advanced.html">Advanced C# Scripts | Tabular Editor Documentation</a></p>
</li>
</ul>
]]></content:encoded></item><item><title><![CDATA[Unstructured to Structured : Extracting Data From Messy Excel Sheets Using Fabric AI Function]]></title><description><![CDATA[I have written several blogs about using LLMs, especially in Fabric, to extract structured data from unstructured data. In my last blog on this topic, I explained how to extract structured data from PDF invoices. PDFs are a type of unstructured data ...]]></description><link>https://fabric.guru/unstructured-to-structured-extracting-data-from-messy-excel-sheets-using-fabric-ai-function</link><guid isPermaLink="true">https://fabric.guru/unstructured-to-structured-extracting-data-from-messy-excel-sheets-using-fabric-ai-function</guid><category><![CDATA[microsoftfabric]]></category><category><![CDATA[ai function]]></category><category><![CDATA[ai functions]]></category><category><![CDATA[genai]]></category><category><![CDATA[unstructured data]]></category><dc:creator><![CDATA[Sandeep Pawar]]></dc:creator><pubDate>Fri, 23 May 2025 04:46:39 GMT</pubDate><content:encoded><![CDATA[<p>I have written several blogs about using LLMs, especially in Fabric, to extract structured data from unstructured data. In my last blog on this topic, I explained how to <a target="_blank" href="https://fabric.guru/unstructured-to-structured-using-fabric-ai-functions-to-extract-invoice-data-from-pdfs">extract structured data from PDF invoices</a>. PDFs are a type of unstructured data source, but the data doesn't always have to be in text or PDF format. In this blog, I will demonstrate how we can use the same techniques to extract tabular data from an Excel file where the tables lack a common structure and are poorly formatted. Who hasn't dealt with such Excel data?</p>
<p>I presented this use case at FabCon Vegas.</p>
<h1 id="heading-the-badly-formatted-excel">The Badly Formatted Excel</h1>
<p>I came across <a target="_blank" href="https://community.fabric.microsoft.com/t5/Power-Query/Cleaning-up-unstructured-excel-purchase-orders/td-p/1405785">this post</a> on the Power BI forum. The user wanted to import the Excel sheet below into Power BI and create a model for reporting. Sounds simple, right? If you know Power Query and some M, you can easily turn this into a table for all Excel files on a schedule—<strong><em>as long as all the files have the same structure, the same column names, and the same columns.</em></strong> If there is any difference, you'll either get an error or incorrect results. Plus, imagine if below purchase orders were in different languages from different suppliers!</p>
<p>This is exactly what we will try to fix using Fabric AI Functions.</p>
<p><img src="https://community.fabric.microsoft.com/t5/image/serverpage/image-id/367393iF0139B11B04BAE17/image-size/large?v=v2&amp;px=999" alt="POexample.JPG" /></p>
<p>Another user shared the M code to clean the data but also noted that it will only work if the form doesn’t change.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1747885590819/37945831-fc87-4e5a-a95c-6f0d1d496c74.png" alt class="image--center mx-auto" /></p>
<h1 id="heading-synthetic-data">Synthetic Data</h1>
<p>I don't have the data from this user, so I created an Excel file with four variations of the purchase order data mentioned above. I randomly placed data in different cells, added extra columns and rows, changed the layout, and used different column names to create a messy layout. Although the data is in a table format, the formatting is unstructured and almost impossible to process using Power Query or Python.</p>
<h2 id="heading-sheet-1">Sheet 1:</h2>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1747886349690/d4ce6486-5ebe-41bb-bb2d-8bd9c6519a56.png" alt class="image--center mx-auto" /></p>
<h2 id="heading-sheet-2">Sheet 2:</h2>
<p>Compared to Sheet 1, the following changes were made in Sheet 2:</p>
<ul>
<li><p>Split the PO header details in two columns instead of in one</p>
</li>
<li><p>Excluded JOB ID from PO header and instead added it as a table header in a merged cell (typically merged cells are hard to deal with)</p>
</li>
<li><p>PO date is in a slightly different format (mm-dd-yyyy instead of mm dd yyyy)</p>
</li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1747886737203/f8768247-433d-49a0-8767-e9d3d054eb3c.png" alt class="image--center mx-auto" /></p>
<h2 id="heading-sheet-3">Sheet 3</h2>
<ul>
<li><p>Randomly changed the header layout. Notice that the positions are different, and instead of having the column:value pairs in two columns next to each other, they are now one below the other.</p>
</li>
<li><p>Added an extra <code>NUMBER</code> column</p>
</li>
<li><p>Added a merged cell table header for PO details</p>
</li>
<li><p>Changed column names</p>
</li>
<li><p>PO date is in different format</p>
</li>
<li><p>Missing phone number</p>
<p>  <img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1747893897845/8029e2c9-8fc7-4c1f-b2a1-4ee291f184d4.png" alt class="image--center mx-auto" /></p>
</li>
</ul>
<h2 id="heading-sheet-4">Sheet 4</h2>
<ul>
<li><p>Added some text at the top of the table irrelevant to the data</p>
</li>
<li><p>Changed the column order and column header</p>
</li>
<li><p>Changed date format</p>
</li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1747887709730/f5444fc7-503f-4ab7-898d-e397587fefd2.png" alt class="image--center mx-auto" /></p>
<p>As you can see, all the PO forms contain similar data, but the layouts are totally different. There is no way Power Query or any Python code with regex pattern matching can extract data in a structured form from this Excel.</p>
<p>You can download the Excel from here : <a target="_blank" href="https://github.com/pawarbi/snippets/blob/main/complete_purchase_order.xlsx">snippets/complete_purchase_order.xlsx at main · pawarbi/snippets</a></p>
<h1 id="heading-goals">Goals</h1>
<p>As mentioned above, the main goal is to extract the data, but we also want to:</p>
<ul>
<li><p>evaluate the accuracy of extraction to ensure AI Function can be used reliably.</p>
</li>
<li><p>perform the extraction efficiently with minimal CU consumption</p>
</li>
<li><p>make the process reproducible to ensure we can trace the results and performance in the future</p>
</li>
<li><p>identify risks, errors and solution space</p>
</li>
</ul>
<h1 id="heading-fabric-ai-function">Fabric AI Function</h1>
<p><a target="_blank" href="https://blog.fabric.microsoft.com/en-us/blog/announcing-ai-functions-for-easy-llm-powered-data-enrichment?ft=All">Fabric AI functions</a> use LLM (Azure Open AI deployment) for text summarization, extraction, classification etc with convenient Python functions. It can be used with any F SKU in a Fabric PySpark notebook. We will start with an easy/baseline solution first and progressively work towards the goals mentioned above to achieve the final production-ready solution.</p>
<p>To get started, install AI functions in a Fabric PySpark notebook with runtime 1.3</p>
<pre><code class="lang-python"><span class="hljs-comment">## Fabric PySpark notebook with runtimr 1.3</span>
%pip install -q tiktoken deepdiff tabulate openai==<span class="hljs-number">1.30</span> &gt; /dev/null <span class="hljs-number">2</span>&gt;&amp;<span class="hljs-number">1</span>
%pip install -q --force-reinstall httpx==<span class="hljs-number">0.27</span><span class="hljs-number">.0</span> &gt; /dev/null <span class="hljs-number">2</span>&gt;&amp;<span class="hljs-number">1</span>
%pip install -q --force-reinstall https://mmlspark.blob.core.windows.net/pip/<span class="hljs-number">1.0</span><span class="hljs-number">.9</span>/synapseml_core<span class="hljs-number">-1.0</span><span class="hljs-number">.9</span>-py2.py3-none-any.whl &gt; /dev/null <span class="hljs-number">2</span>&gt;&amp;<span class="hljs-number">1</span>
%pip install -q --force-reinstall https://mmlspark.blob.core.windows.net/pip/<span class="hljs-number">1.0</span><span class="hljs-number">.10</span><span class="hljs-number">.0</span>-spark3<span class="hljs-number">.4</span><span class="hljs-number">-5</span>-a5d50c90-SNAPSHOT/synapseml_internal<span class="hljs-number">-1.0</span><span class="hljs-number">.10</span><span class="hljs-number">.0</span>.dev1-py2.py3-none-any.whl &gt; /dev/null <span class="hljs-number">2</span>&gt;&amp;<span class="hljs-number">1</span>

<span class="hljs-comment">#optional - I will explain later</span>
!wget -O /synfs/nb_resource/builtin/json_diff_extraction_accuracy.py https://raw.githubusercontent.com/pawarbi/snippets/refs/heads/main/json_diff_extraction_accuracy.py
</code></pre>
<p>Since we want to extract the data, we can use the <code>.extract</code> method in AI functions to specify the fields we want to extract. Note that AI functions work on text data, so we need to convert the Excel to some text format. In my previous blogs, I have shared how LLMs love the markdown format, so we will convert the dataframe to markdown before passing it to AI functions.</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> os
<span class="hljs-keyword">import</span> re
<span class="hljs-keyword">import</span> json
<span class="hljs-keyword">import</span> time
<span class="hljs-keyword">from</span> datetime <span class="hljs-keyword">import</span> datetime
<span class="hljs-keyword">import</span> tiktoken 
<span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
<span class="hljs-keyword">from</span> tqdm <span class="hljs-keyword">import</span> tqdm
<span class="hljs-keyword">import</span> openai
<span class="hljs-keyword">from</span> notebookutils <span class="hljs-keyword">import</span> fs
<span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np
<span class="hljs-keyword">from</span> synapse.ml.aifunc <span class="hljs-keyword">import</span> Conf
<span class="hljs-keyword">from</span> deepdiff <span class="hljs-keyword">import</span> DeepDiff
<span class="hljs-keyword">from</span> pydantic <span class="hljs-keyword">import</span> BaseModel, ValidationError, field_validator
<span class="hljs-keyword">import</span> builtin.json_diff_extraction_accuracy <span class="hljs-keyword">as</span> diff

path = <span class="hljs-string">"/lakehouse/default/Files/complete_purchase_order.xlsx"</span>
df = pd.read_excel(path)

<span class="hljs-comment"># convert data to markdown</span>
md = pd.DataFrame({<span class="hljs-string">"data"</span>:[df.to_markdown()]})

<span class="hljs-comment"># define fields to extract</span>
df = md[<span class="hljs-string">"data"</span>].ai.extract(
    <span class="hljs-string">'job_id'</span>,
    <span class="hljs-string">'customer_id'</span>,
    <span class="hljs-string">'po_date'</span>,
    <span class="hljs-string">'name'</span>,
    <span class="hljs-string">'phone_number'</span>,
    <span class="hljs-string">'delivery_type'</span>,
    <span class="hljs-string">'delivery_date'</span>,
    <span class="hljs-string">'delivery_time'</span>,
    <span class="hljs-string">'delivery_address'</span>,
    <span class="hljs-string">'eir_code'</span>,
    <span class="hljs-string">'type'</span>,
    <span class="hljs-string">'material_code'</span>,
    <span class="hljs-string">'material_description'</span>,
    <span class="hljs-string">'quantity'</span>,
    <span class="hljs-string">'uom'</span>
)
display(df)
</code></pre>
<p>And just like that, without any prompts or effort, the AI function extracted the fields correctly from the first sheet.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1747889649608/592d7731-2940-467a-8763-b52615e0feaa.png" alt class="image--center mx-auto" /></p>
<p>We can loop over the other sheets and be done with it. But not so fast. Note that we only got one row back, corresponding to the PO header, and didn’t get the rows in the details table. That’s because <code>.extract()</code> can only return text fields, not complex data types like dictionaries and lists. To achieve that, we will need to write a prompt and use the <code>.generate_response()</code> function. To do so, we will:</p>
<ul>
<li><p>write a prompt which includes the schema of the expected response we want</p>
</li>
<li><p>minimize CU consumption, which is one of our goals. AI functions are billed by the Spark meter but measured in input and output tokens, so we will count the tokens to track this.</p>
</li>
<li><p>process the response to create two dataframes: one for the order details and another for the materials table.</p>
</li>
</ul>
<h3 id="heading-prompt">Prompt</h3>
<p>In the below prompt, I give instructions and define the JSON schema the LLM should use to provide the response. Notice the data types defined as well as <code>enum</code> (options to choose from, any other value would be invalid). This is a fairly straightforward prompt.</p>
<pre><code class="lang-python">prompt = <span class="hljs-string">"""
  Extract the purchase order details and the material list (originally from an Excel sheet) as a valid JSON. Return ONLY the JSON, without any explanation, based on the below schema.
  Ignore extraneous details.
  "response_schema": {
    "purchase_order": {
      "job_id": "string",
      "customer_id": "string",
      "po_date": "string (date format: DD/MM/YYYY)",
      "name": "string",
      "phone_number": "string",
      "delivery_type": "string" (enum: Delivery, Pickup, Shipping)
      "delivery_date": "string (date format: DD/MM/YYYY)",
      "delivery_time": "string",
      "delivery_address": "string",
      "eir_code": "string"
    },
    "materials": [
      {
        "type": "string",
        "material_code": "string",
        "material_description": "string",
        "quantity": "integer",
        "uom": "string" (unit of measure, null if not found)
      }
    ]
  }
"""</span>
</code></pre>
<p>To calculate the number of tokens, we can use the <code>tiktoken</code> library. Currently, AI functions use the <code>gpt-3.5</code> model. If that changes to any other model, be sure to update the model name accordingly below (each model uses a different tokenizer).</p>
<pre><code class="lang-python">model = <span class="hljs-string">"gpt-3.5-turbo"</span>
tokenizer = tiktoken.encoding_for_model(model)
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">get_token_count</span>(<span class="hljs-params">obj</span>):</span>
    <span class="hljs-keyword">if</span> <span class="hljs-keyword">not</span> isinstance(obj, str):
        obj = str(obj)
    <span class="hljs-keyword">return</span> len(tokenizer.encode(obj))
</code></pre>
<p>Extract the data:</p>
<pre><code class="lang-python">path = <span class="hljs-string">"/lakehouse/default/Files/complete_purchase_order.xlsx"</span>


sheet_names = [<span class="hljs-string">'PO1'</span>, <span class="hljs-string">'PO2'</span>, <span class="hljs-string">'PO3'</span>, <span class="hljs-string">'PO4'</span>, <span class="hljs-string">'PO5'</span>]
df_dict = {
    sheet: pd.read_excel(path, sheet_name=sheet).dropna(how=<span class="hljs-string">'all'</span>).dropna(how=<span class="hljs-string">'all'</span>, axis=<span class="hljs-number">1</span>)
    <span class="hljs-keyword">for</span> sheet <span class="hljs-keyword">in</span> sheet_names
}

df_extract = pd.DataFrame({
    <span class="hljs-string">"json_data"</span>: [json.loads(df_dict[<span class="hljs-string">f'PO<span class="hljs-subst">{i}</span>'</span>].to_json()) <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> range(<span class="hljs-number">1</span>, <span class="hljs-number">6</span>)],
    <span class="hljs-string">"string_data"</span>: [df_dict[<span class="hljs-string">f'PO<span class="hljs-subst">{i}</span>'</span>].to_string() <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> range(<span class="hljs-number">1</span>, <span class="hljs-number">6</span>)],
    <span class="hljs-string">"markdwn_data"</span>: [df_dict[<span class="hljs-string">f'PO<span class="hljs-subst">{i}</span>'</span>].to_markdown() <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> range(<span class="hljs-number">1</span>, <span class="hljs-number">6</span>)]
})
col = <span class="hljs-string">"string_data"</span>
df_extract[<span class="hljs-string">'data'</span>] = df_extract[[col]].ai.generate_response(prompt, conf=Conf(seed=<span class="hljs-number">0</span>, max_concurrency=<span class="hljs-number">50</span>))
df_extract[<span class="hljs-string">'input_tokens'</span>] = df_extract[col].apply(get_token_count)
df_extract[<span class="hljs-string">'output_tokens'</span>] = df_extract[<span class="hljs-string">"data"</span>].apply(get_token_count)

<span class="hljs-comment"># df_extract</span>
<span class="hljs-comment">#### Extract orders and materials</span>

order_dfs = []
material_dfs = []

<span class="hljs-keyword">for</span> num <span class="hljs-keyword">in</span> range(<span class="hljs-number">4</span>):
    po = <span class="hljs-string">f"PO<span class="hljs-subst">{num+<span class="hljs-number">1</span>}</span>"</span>
    current_data = json.loads(df_extract[<span class="hljs-string">'data'</span>][num])
    order_df = pd.json_normalize(current_data)
    order_df[<span class="hljs-string">'source_po'</span>] = po
    order_dfs.append(order_df.drop(columns=[<span class="hljs-string">'materials'</span>], errors=<span class="hljs-string">'ignore'</span>))

    materials_df = pd.json_normalize(current_data.get(<span class="hljs-string">'materials'</span>, []))
    materials_df[<span class="hljs-string">'source_po'</span>] = po
    material_dfs.append(materials_df)

order_df = pd.concat(order_dfs, ignore_index=<span class="hljs-literal">True</span>).sort_values(<span class="hljs-string">'purchase_order.customer_id'</span>).reset_index(drop=<span class="hljs-literal">True</span>)
materials_df = pd.concat(material_dfs, ignore_index=<span class="hljs-literal">True</span>).sort_values([<span class="hljs-string">'source_po'</span>,<span class="hljs-string">'type'</span>,<span class="hljs-string">'material_code'</span>]).reset_index(drop=<span class="hljs-literal">True</span>)
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1747890675319/18608613-c796-44a6-a5d0-1702a878d637.png" alt class="image--center mx-auto" /></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1747890695629/c87e00eb-8d4f-49b1-9e98-0e749d6954e8.png" alt class="image--center mx-auto" /></p>
<p>To get the number of tokens, use the <code>.ai.stats</code> method:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1747890811639/ed3aef3e-0f03-49be-88c1-8106a995cd72.png" alt class="image--center mx-auto" /></p>
<p>In this case, we used 3761 input tokens (prompt + input data) and 3773 output tokens (the JSON returned).</p>
<p>Next, we still need to clean up the above dataframes a bit more to make them consumption-ready: cleaning the column names, fixing the date formats and data types, etc.</p>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">clean_order_df</span>(<span class="hljs-params">df</span>):</span>
    <span class="hljs-string">"""
    Clean the order dataframe with robust error handling
    """</span>
    cleaned_df = df.copy()

    <span class="hljs-comment"># Remove prefix from column names</span>
    new_columns = {col: col.replace(<span class="hljs-string">'purchase_order.'</span>, <span class="hljs-string">''</span>) <span class="hljs-keyword">for</span> col <span class="hljs-keyword">in</span> cleaned_df.columns <span class="hljs-keyword">if</span> col.startswith(<span class="hljs-string">'purchase_order.'</span>)}
    cleaned_df = cleaned_df.rename(columns=new_columns)

    <span class="hljs-comment"># Convert customer_id to integer with error handling</span>
    <span class="hljs-keyword">if</span> <span class="hljs-string">'customer_id'</span> <span class="hljs-keyword">in</span> cleaned_df.columns:
        <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">safe_int_convert</span>(<span class="hljs-params">x</span>):</span>
            <span class="hljs-keyword">if</span> pd.isna(x) <span class="hljs-keyword">or</span> x == <span class="hljs-string">''</span>:
                <span class="hljs-keyword">return</span> <span class="hljs-literal">None</span>
            <span class="hljs-keyword">try</span>:
                <span class="hljs-keyword">return</span> int(x)
            <span class="hljs-keyword">except</span> (ValueError, TypeError):
                <span class="hljs-keyword">return</span> <span class="hljs-literal">None</span>

        cleaned_df[<span class="hljs-string">'customer_id'</span>] = cleaned_df[<span class="hljs-string">'customer_id'</span>].apply(safe_int_convert)

    <span class="hljs-comment"># Convert date columns with error handling</span>
    date_columns = [<span class="hljs-string">'po_date'</span>, <span class="hljs-string">'delivery_date'</span>]
    <span class="hljs-keyword">for</span> date_col <span class="hljs-keyword">in</span> date_columns:
        <span class="hljs-keyword">if</span> date_col <span class="hljs-keyword">in</span> cleaned_df.columns:
            <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">safe_date_convert</span>(<span class="hljs-params">x</span>):</span>
                <span class="hljs-keyword">if</span> pd.isna(x) <span class="hljs-keyword">or</span> x == <span class="hljs-string">''</span>:
                    <span class="hljs-keyword">return</span> <span class="hljs-literal">None</span>
                <span class="hljs-keyword">try</span>:
                    <span class="hljs-keyword">return</span> pd.to_datetime(x, dayfirst=<span class="hljs-literal">True</span>).strftime(<span class="hljs-string">'%d-%m-%Y'</span>)
                <span class="hljs-keyword">except</span> (ValueError, TypeError):
                    <span class="hljs-keyword">return</span> <span class="hljs-literal">None</span>

            cleaned_df[date_col] = cleaned_df[date_col].apply(safe_date_convert)

    <span class="hljs-comment"># Convert text columns with error handling</span>
    text_columns = [<span class="hljs-string">'name'</span>, <span class="hljs-string">'delivery_type'</span>, <span class="hljs-string">'delivery_time'</span>, <span class="hljs-string">'delivery_address'</span>]
    <span class="hljs-keyword">for</span> text_col <span class="hljs-keyword">in</span> text_columns:
        <span class="hljs-keyword">if</span> text_col <span class="hljs-keyword">in</span> cleaned_df.columns:
            <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">safe_text_convert</span>(<span class="hljs-params">x</span>):</span>
                <span class="hljs-keyword">if</span> pd.isna(x) <span class="hljs-keyword">or</span> x == <span class="hljs-string">''</span>:
                    <span class="hljs-keyword">return</span> <span class="hljs-literal">None</span>
                <span class="hljs-keyword">try</span>:
                    <span class="hljs-keyword">return</span> str(x).title()
                <span class="hljs-keyword">except</span> (ValueError, TypeError):
                    <span class="hljs-keyword">return</span> <span class="hljs-literal">None</span>

            cleaned_df[text_col] = cleaned_df[text_col].apply(safe_text_convert)

    <span class="hljs-comment"># Sort and reset index</span>
    <span class="hljs-keyword">try</span>:
        cleaned_df = cleaned_df.sort_values(<span class="hljs-string">'customer_id'</span>, na_position=<span class="hljs-string">'last'</span>).reset_index(drop=<span class="hljs-literal">True</span>)
    <span class="hljs-keyword">except</span> Exception <span class="hljs-keyword">as</span> e:
        print(<span class="hljs-string">f"Warning: Sorting failed with error: <span class="hljs-subst">{e}</span>"</span>)
        cleaned_df = cleaned_df.reset_index(drop=<span class="hljs-literal">True</span>)

    <span class="hljs-keyword">return</span> cleaned_df

final_order_df = clean_order_df(order_df)
display(final_order_df)
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1747891085184/b034541c-4e06-43b1-aea9-9b2e4c509e3e.png" alt class="image--center mx-auto" /></p>
<p>For materials data:</p>
<pre><code class="lang-python"><span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">clean_materials_df</span>(<span class="hljs-params">df</span>):</span>
    df[<span class="hljs-string">'quantity'</span>] = df[<span class="hljs-string">'quantity'</span>].astype(<span class="hljs-string">'float'</span>)
    df[<span class="hljs-string">'uom'</span>] = df[<span class="hljs-string">'uom'</span>].str.lower()
    <span class="hljs-keyword">return</span> df.sort_values([<span class="hljs-string">'source_po'</span>,<span class="hljs-string">'type'</span>,<span class="hljs-string">'material_code'</span>]).reset_index(drop=<span class="hljs-literal">True</span>)

final_materials_df = clean_materials_df(materials_df)
display(final_materials_df)
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1747891135018/483089b5-7116-418d-92dd-124108c8675f.png" alt class="image--center mx-auto" /></p>
<h2 id="heading-validation">Validation</h2>
<p>We got the result we wanted, but is it accurate? Let’s find out. To measure accuracy, we need to compare the final results with the actual data. There are many accuracy metrics, but we will keep it simple: I will convert the dataframes to JSON and then compare each field with the JSON of the original Excel sheet, which I manually created for validation purposes.</p>
<p>Create a JSON of the above dfs:</p>
<pre><code class="lang-python">purchase_order = {
    <span class="hljs-string">"job_id"</span>: final_order_df[<span class="hljs-string">"job_id"</span>].iloc[<span class="hljs-number">0</span>] <span class="hljs-keyword">if</span> <span class="hljs-string">"job_id"</span> <span class="hljs-keyword">in</span> final_order_df.columns <span class="hljs-keyword">else</span> <span class="hljs-string">""</span>,
    <span class="hljs-string">"customer_id"</span>: final_order_df[<span class="hljs-string">"customer_id"</span>].iloc[<span class="hljs-number">0</span>] <span class="hljs-keyword">if</span> <span class="hljs-string">"customer_id"</span> <span class="hljs-keyword">in</span> final_order_df.columns <span class="hljs-keyword">else</span> <span class="hljs-string">""</span>,
    <span class="hljs-string">"po_date"</span>: final_order_df[<span class="hljs-string">"po_date"</span>].iloc[<span class="hljs-number">0</span>] <span class="hljs-keyword">if</span> <span class="hljs-string">"po_date"</span> <span class="hljs-keyword">in</span> final_order_df.columns <span class="hljs-keyword">else</span> <span class="hljs-string">""</span>,
    <span class="hljs-string">"name"</span>: final_order_df[<span class="hljs-string">"name"</span>].iloc[<span class="hljs-number">0</span>] <span class="hljs-keyword">if</span> <span class="hljs-string">"name"</span> <span class="hljs-keyword">in</span> final_order_df.columns <span class="hljs-keyword">else</span> <span class="hljs-string">""</span>,
    <span class="hljs-string">"phone_number"</span>: final_order_df[<span class="hljs-string">"phone_number"</span>].iloc[<span class="hljs-number">0</span>] <span class="hljs-keyword">if</span> <span class="hljs-string">"phone_number"</span> <span class="hljs-keyword">in</span> final_order_df.columns <span class="hljs-keyword">else</span> <span class="hljs-string">""</span>,
    <span class="hljs-string">"delivery_type"</span>: final_order_df[<span class="hljs-string">"delivery_type"</span>].iloc[<span class="hljs-number">0</span>] <span class="hljs-keyword">if</span> <span class="hljs-string">"delivery_type"</span> <span class="hljs-keyword">in</span> final_order_df.columns <span class="hljs-keyword">else</span> <span class="hljs-string">""</span>,
    <span class="hljs-string">"delivery_date"</span>: final_order_df[<span class="hljs-string">"delivery_date"</span>].iloc[<span class="hljs-number">0</span>] <span class="hljs-keyword">if</span> <span class="hljs-string">"delivery_date"</span> <span class="hljs-keyword">in</span> final_order_df.columns <span class="hljs-keyword">else</span> <span class="hljs-string">""</span>,
    <span class="hljs-string">"delivery_time"</span>: final_order_df[<span class="hljs-string">"delivery_time"</span>].iloc[<span class="hljs-number">0</span>] <span class="hljs-keyword">if</span> <span class="hljs-string">"delivery_time"</span> <span class="hljs-keyword">in</span> final_order_df.columns <span class="hljs-keyword">else</span> <span class="hljs-string">""</span>,
    <span class="hljs-string">"delivery_address"</span>: final_order_df[<span class="hljs-string">"delivery_address"</span>].iloc[<span class="hljs-number">0</span>] <span class="hljs-keyword">if</span> <span class="hljs-string">"delivery_address"</span> <span class="hljs-keyword">in</span> final_order_df.columns <span class="hljs-keyword">else</span> <span class="hljs-string">""</span>,
    <span class="hljs-string">"eir_code"</span>: final_order_df[<span class="hljs-string">"eir_code"</span>].iloc[<span class="hljs-number">0</span>] <span class="hljs-keyword">if</span> <span class="hljs-string">"eir_code"</span> <span class="hljs-keyword">in</span> final_order_df.columns <span class="hljs-keyword">else</span> <span class="hljs-string">""</span>
}


materials = final_materials_df.to_dict(orient=<span class="hljs-string">'records'</span>)

order_json = {<span class="hljs-string">"purchase_order"</span>: purchase_order}
materials_json = {<span class="hljs-string">"materials"</span>: materials}
</code></pre>
<p>Fix the data types so they conform to JSON and load the ground truth as JSON.</p>
<pre><code class="lang-python">
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">convert_numpy_types</span>(<span class="hljs-params">obj</span>):</span>
    <span class="hljs-keyword">if</span> isinstance(obj, dict):
        <span class="hljs-keyword">return</span> {k: convert_numpy_types(v) <span class="hljs-keyword">for</span> k, v <span class="hljs-keyword">in</span> obj.items()}
    <span class="hljs-keyword">elif</span> isinstance(obj, list):
        <span class="hljs-keyword">return</span> [convert_numpy_types(i) <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> obj]
    <span class="hljs-keyword">elif</span> isinstance(obj, np.integer):
        <span class="hljs-keyword">return</span> int(obj)
    <span class="hljs-keyword">elif</span> isinstance(obj, np.floating):
        <span class="hljs-keyword">return</span> float(obj)
    <span class="hljs-keyword">elif</span> isinstance(obj, np.ndarray):
        <span class="hljs-keyword">return</span> obj.tolist()
    <span class="hljs-keyword">else</span>:
        <span class="hljs-keyword">return</span> obj


order_json = convert_numpy_types(order_json)
materials_json = convert_numpy_types(materials_json)

<span class="hljs-keyword">with</span> open(<span class="hljs-string">"/lakehouse/default/Files/ground_truth/order_ground_truth.json"</span>, <span class="hljs-string">'r'</span>) <span class="hljs-keyword">as</span> f:
    order_groundtruth = json.load(f)

<span class="hljs-comment">#load materials</span>
<span class="hljs-keyword">with</span> open(<span class="hljs-string">"/lakehouse/default/Files/ground_truth/materials_ground_truth.json"</span>, <span class="hljs-string">'r'</span>) <span class="hljs-keyword">as</span> f:
    material_groundtruth  = json.load(f)
</code></pre>
<p>To calculate accuracy by comparing each field, I wrote a Python function using <code>deepdiff</code>, which you can get from <a target="_blank" href="https://raw.githubusercontent.com/pawarbi/snippets/refs/heads/main/json_diff_extraction_accuracy.py">my repo</a>.</p>
<pre><code class="lang-python">result = diff.calculate_json_accuracy( material_groundtruth, materials_json)
accuracy_score = result[<span class="hljs-string">'score'</span>]
total_fields = result[<span class="hljs-string">'total_fields'</span>]
diff_stats = result[<span class="hljs-string">'json_diff_stats'</span>]

print(<span class="hljs-string">f"Accuracy score: <span class="hljs-subst">{result[<span class="hljs-string">'score'</span>]}</span>"</span>)
print(<span class="hljs-string">f"Total fields: <span class="hljs-subst">{result[<span class="hljs-string">'total_fields'</span>]}</span>"</span>)
print(<span class="hljs-string">f"Diff stats: <span class="hljs-subst">{result[<span class="hljs-string">'json_diff_stats'</span>]}</span>"</span>)
</code></pre>
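<p>Under the hood, <code>deepdiff</code> reports field-level differences between the extracted JSON and the ground truth. A toy example (values made up) of the kind of difference it flags:</p>
<pre><code class="lang-python">from deepdiff import DeepDiff

expected  = {"quantity": 10, "uom": "kg"}
extracted = {"quantity": 10, "uom": "KG"}

print(DeepDiff(expected, extracted))
# {'values_changed': {"root['uom']": {'new_value': 'KG', 'old_value': 'kg'}}}
</code></pre>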
<div data-node-type="callout">
<div data-node-type="callout-emoji">💡</div>
<div data-node-type="callout-text">You can save this diff function as a Fabric UDF to use it in various projects</div>
</div>

<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1747891653718/5e5c786c-d81d-4948-a8c9-827ccbd1f068.png" alt class="image--center mx-auto" /></p>
<p>In the materials dataframe, we compared a total of 324 fields, and all of them are valid and match the actual data. Keep in mind that this is despite missing values, different column names, column orders, etc. The LLM was able to understand the context and overall structure of the text to extract the fields correctly.</p>
<h2 id="heading-other-goals">Other Goals</h2>
<p>We achieved one of the goals - to extract the data accurately. But what about minimizing the cost, i.e. CU consumption?</p>
<p>I intentionally omitted one step above to keep things simple. In the above code, I converted the data to a string before sending it to the LLM via the AI function. I could have converted the data to other formats as well, like markdown (as I did in the first baseline) or JSON. Let’s see what happens if I do that.</p>
<pre><code class="lang-python"><span class="hljs-comment">#using string</span>
display(df_extract)
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1747892264794/ac511e69-5fb5-4014-9947-b6e4ed88daff.png" alt class="image--center mx-auto" /></p>
<p>Use markdown instead:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1747892341288/973e41c4-4718-42da-a667-3895045b7ada.png" alt class="image--center mx-auto" /></p>
<p>As you can see above, markdown uses about 10-20% more input tokens! So using string instead of markdown is better in this case, without sacrificing accuracy.</p>
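<p>As a quick check, you can compare the payload size of each format before settling on one, reusing the <code>get_token_count</code> helper and the format columns defined earlier (including the <code>markdwn_data</code> column, spelled as in the code above):</p>
<pre><code class="lang-python"># total input tokens per serialization format, summed across all sheets
for col in ["string_data", "markdwn_data", "json_data"]:
    total = df_extract[col].apply(get_token_count).sum()
    print(f"{col}: {total} tokens")
</code></pre>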
<h2 id="heading-aiops">AIOps</h2>
<p>In data science, you usually run many experiments to find the right hyperparameters, configurations, and features (including feature engineering) before deciding on the best setup to meet your goals. LLM responses are stochastic, so they should be handled like any other data science project, applying the same principles for reproducibility and accuracy. Fabric has built-in MLflow integration, allowing us to run all the experiments and finalize our solution for production. This blog is already lengthy, so I won't cover all the details here, but at a high level: I set up an MLflow experiment, instrument it with the metrics I want to capture (inputs, tokens, responses, accuracy, etc.), vary the formatting used, log the runs, and compare the results. In a real-world project, you would run many experiments with different prompts, temperatures, seeds, model configurations, etc., but I am skipping that here.</p>
<p>Putting it all together:</p>
<pre><code class="lang-python"><span class="hljs-keyword">import</span> os
<span class="hljs-keyword">import</span> re
<span class="hljs-keyword">import</span> json
<span class="hljs-keyword">import</span> time
<span class="hljs-keyword">import</span> mlflow
<span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
<span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np
<span class="hljs-keyword">import</span> tiktoken
<span class="hljs-keyword">import</span> openai
<span class="hljs-keyword">from</span> tqdm <span class="hljs-keyword">import</span> tqdm
<span class="hljs-keyword">from</span> datetime <span class="hljs-keyword">import</span> datetime
<span class="hljs-keyword">from</span> synapse.ml.aifunc <span class="hljs-keyword">import</span> Conf
<span class="hljs-keyword">from</span> deepdiff <span class="hljs-keyword">import</span> DeepDiff
<span class="hljs-keyword">import</span> builtin.json_diff_extraction_accuracy <span class="hljs-keyword">as</span> diff
<span class="hljs-keyword">import</span> warnings
warnings.simplefilter(action=<span class="hljs-string">'ignore'</span>, category=FutureWarning)

experiment_name = <span class="hljs-string">"purchase_order_extraction_format_test"</span>
mlflow.set_experiment(experiment_name)

prompt = <span class="hljs-string">"""
  Extract purchase order details and the material list as valid JSON from the input, which originally comes from an Excel sheet. Return ONLY the JSON, without any explanation or details, based on the schema below.
  Ignore extraneous details.
  "response_schema": {
    "purchase_order": {
      "job_id": "string",
      "customer_id": "string",
      "po_date": "string (date format: DD/MM/YYYY)",
      "name": "string",
      "phone_number": "string",
      "delivery_type": "string" (enum: Delivery, Pickup, Shipping)
      "delivery_date": "string (date format: DD/MM/YYYY)",
      "delivery_time": "string",
      "delivery_address": "string",
      "eir_code": "string"
    },
    "materials": [
      {
        "type": "string",
        "material_code": "string",
        "material_description": "string",
        "quantity": "integer",
        "uom": "string" (unit of measure, null if not found)
      }
    ]
  }
"""</span>


model = <span class="hljs-string">"gpt-3.5-turbo"</span>
tokenizer = tiktoken.encoding_for_model(model)

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">get_token_count</span>(<span class="hljs-params">obj</span>):</span>
    <span class="hljs-keyword">if</span> <span class="hljs-keyword">not</span> isinstance(obj, str):
        obj = str(obj)
    <span class="hljs-keyword">return</span> len(tokenizer.encode(obj))

<span class="hljs-comment">## attach a lakehouse and upload the ground truth to teh below folder</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">load_groundtruth</span>():</span>
    <span class="hljs-keyword">with</span> open(<span class="hljs-string">"/lakehouse/default/Files/ground_truth/order_ground_truth.json"</span>, <span class="hljs-string">'r'</span>) <span class="hljs-keyword">as</span> f:
        order_groundtruth = json.load(f)

    <span class="hljs-keyword">with</span> open(<span class="hljs-string">"/lakehouse/default/Files/ground_truth/materials_ground_truth.json"</span>, <span class="hljs-string">'r'</span>) <span class="hljs-keyword">as</span> f:
        material_groundtruth = json.load(f)

    <span class="hljs-keyword">return</span> order_groundtruth, material_groundtruth

<span class="hljs-comment">## this is for converting to int for deepdiff, otherwise not needed</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">convert_numpy_types</span>(<span class="hljs-params">obj</span>):</span>
    <span class="hljs-keyword">if</span> isinstance(obj, dict):
        <span class="hljs-keyword">return</span> {k: convert_numpy_types(v) <span class="hljs-keyword">for</span> k, v <span class="hljs-keyword">in</span> obj.items()}
    <span class="hljs-keyword">elif</span> isinstance(obj, list):
        <span class="hljs-keyword">return</span> [convert_numpy_types(i) <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> obj]
    <span class="hljs-keyword">elif</span> isinstance(obj, np.integer):
        <span class="hljs-keyword">return</span> int(obj)
    <span class="hljs-keyword">elif</span> isinstance(obj, np.floating):
        <span class="hljs-keyword">return</span> float(obj)
    <span class="hljs-keyword">elif</span> isinstance(obj, np.ndarray):
        <span class="hljs-keyword">return</span> obj.tolist()
    <span class="hljs-keyword">else</span>:
        <span class="hljs-keyword">return</span> obj


<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">clean_order_df</span>(<span class="hljs-params">df</span>):</span>
    cleaned_df = df.copy()
    new_columns = {col: col.replace(<span class="hljs-string">'purchase_order.'</span>, <span class="hljs-string">''</span>) <span class="hljs-keyword">for</span> col <span class="hljs-keyword">in</span> cleaned_df.columns <span class="hljs-keyword">if</span> col.startswith(<span class="hljs-string">'purchase_order.'</span>)}
    cleaned_df = cleaned_df.rename(columns=new_columns)

    <span class="hljs-keyword">if</span> <span class="hljs-string">'customer_id'</span> <span class="hljs-keyword">in</span> cleaned_df.columns:
        <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">safe_int_convert</span>(<span class="hljs-params">x</span>):</span>
            <span class="hljs-keyword">if</span> pd.isna(x) <span class="hljs-keyword">or</span> x == <span class="hljs-string">''</span>:
                <span class="hljs-keyword">return</span> <span class="hljs-literal">None</span>
            <span class="hljs-keyword">try</span>:
                <span class="hljs-keyword">return</span> int(x)
            <span class="hljs-keyword">except</span> (ValueError, TypeError):
                <span class="hljs-keyword">return</span> <span class="hljs-literal">None</span>

        cleaned_df[<span class="hljs-string">'customer_id'</span>] = cleaned_df[<span class="hljs-string">'customer_id'</span>].apply(safe_int_convert)


    date_columns = [<span class="hljs-string">'po_date'</span>, <span class="hljs-string">'delivery_date'</span>]
    <span class="hljs-keyword">for</span> date_col <span class="hljs-keyword">in</span> date_columns:
        <span class="hljs-keyword">if</span> date_col <span class="hljs-keyword">in</span> cleaned_df.columns:
            <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">safe_date_convert</span>(<span class="hljs-params">x</span>):</span>
                <span class="hljs-keyword">if</span> pd.isna(x) <span class="hljs-keyword">or</span> x == <span class="hljs-string">''</span>:
                    <span class="hljs-keyword">return</span> <span class="hljs-literal">None</span>
                <span class="hljs-keyword">try</span>:
                    <span class="hljs-keyword">return</span> pd.to_datetime(x, dayfirst=<span class="hljs-literal">True</span>).strftime(<span class="hljs-string">'%d-%m-%Y'</span>)
                <span class="hljs-keyword">except</span> (ValueError, TypeError):
                    <span class="hljs-keyword">return</span> <span class="hljs-literal">None</span>

            cleaned_df[date_col] = cleaned_df[date_col].apply(safe_date_convert)


    text_columns = [<span class="hljs-string">'name'</span>, <span class="hljs-string">'delivery_type'</span>, <span class="hljs-string">'delivery_time'</span>, <span class="hljs-string">'delivery_address'</span>]
    <span class="hljs-keyword">for</span> text_col <span class="hljs-keyword">in</span> text_columns:
        <span class="hljs-keyword">if</span> text_col <span class="hljs-keyword">in</span> cleaned_df.columns:
            <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">safe_text_convert</span>(<span class="hljs-params">x</span>):</span>
                <span class="hljs-keyword">if</span> pd.isna(x) <span class="hljs-keyword">or</span> x == <span class="hljs-string">''</span>:
                    <span class="hljs-keyword">return</span> <span class="hljs-literal">None</span>
                <span class="hljs-keyword">try</span>:
                    <span class="hljs-keyword">return</span> str(x).title()
                <span class="hljs-keyword">except</span> (ValueError, TypeError):
                    <span class="hljs-keyword">return</span> <span class="hljs-literal">None</span>

            cleaned_df[text_col] = cleaned_df[text_col].apply(safe_text_convert)

    <span class="hljs-keyword">try</span>:
        cleaned_df = cleaned_df.sort_values(<span class="hljs-string">'customer_id'</span>, na_position=<span class="hljs-string">'last'</span>).reset_index(drop=<span class="hljs-literal">True</span>)
    <span class="hljs-keyword">except</span> Exception <span class="hljs-keyword">as</span> e:
        print(<span class="hljs-string">f"Warning: Sorting failed with error: <span class="hljs-subst">{e}</span>"</span>)
        cleaned_df = cleaned_df.reset_index(drop=<span class="hljs-literal">True</span>)

    <span class="hljs-keyword">return</span> cleaned_df

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">clean_materials_df</span>(<span class="hljs-params">df</span>):</span>
    df[<span class="hljs-string">'quantity'</span>] = df[<span class="hljs-string">'quantity'</span>].astype(<span class="hljs-string">'float'</span>)
    df[<span class="hljs-string">'uom'</span>] = df[<span class="hljs-string">'uom'</span>].str.lower()
    <span class="hljs-keyword">return</span> df.sort_values([<span class="hljs-string">'source_po'</span>,<span class="hljs-string">'type'</span>,<span class="hljs-string">'material_code'</span>]).reset_index(drop=<span class="hljs-literal">True</span>)

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">process_format_test</span>(<span class="hljs-params">format_col, df_extract, mlflow_run_name, temp=<span class="hljs-number">0.0</span></span>):</span>
    order_groundtruth, material_groundtruth = load_groundtruth()

    <span class="hljs-keyword">with</span> mlflow.start_run(run_name=mlflow_run_name):
        mlflow.log_param(<span class="hljs-string">"model"</span>, model)
        mlflow.log_param(<span class="hljs-string">"format"</span>, format_col)
        mlflow.log_param(<span class="hljs-string">"prompt"</span>, prompt)
        mlflow.log_param(<span class="hljs-string">"timestamp"</span>, datetime.now().strftime(<span class="hljs-string">"%Y-%m-%d %H:%M:%S"</span>))

        start_time = time.time()
        df_extract[<span class="hljs-string">'data'</span>] = df_extract[[format_col]].ai.generate_response(prompt, conf=Conf(seed=<span class="hljs-number">0</span>, max_concurrency=<span class="hljs-number">50</span>, temperature=temp))
        execution_time = time.time() - start_time

        stats = df_extract.ai.stats
        mlflow.log_metric(<span class="hljs-string">"num_successful"</span>, stats[<span class="hljs-string">'num_successful'</span>])
        mlflow.log_metric(<span class="hljs-string">"num_exceptions"</span>, stats[<span class="hljs-string">'num_exceptions'</span>])
        mlflow.log_metric(<span class="hljs-string">"num_unevaluated"</span>, stats[<span class="hljs-string">'num_unevaluated'</span>])
        mlflow.log_metric(<span class="hljs-string">"prompt_tokens"</span>, stats[<span class="hljs-string">'prompt_tokens'</span>])
        mlflow.log_metric(<span class="hljs-string">"completion_tokens"</span>, stats[<span class="hljs-string">'completion_tokens'</span>])
        mlflow.log_metric(<span class="hljs-string">"execution_time"</span>, execution_time)

        df_extract[<span class="hljs-string">'input_tokens'</span>] = df_extract[format_col].apply(get_token_count)
        df_extract[<span class="hljs-string">'output_tokens'</span>] = df_extract[<span class="hljs-string">"data"</span>].apply(get_token_count)

        mlflow.log_metric(<span class="hljs-string">"avg_input_tokens"</span>, df_extract[<span class="hljs-string">'input_tokens'</span>].mean())
        mlflow.log_metric(<span class="hljs-string">"avg_output_tokens"</span>, df_extract[<span class="hljs-string">'output_tokens'</span>].mean())

        order_dfs = []
        material_dfs = []

        <span class="hljs-keyword">for</span> num <span class="hljs-keyword">in</span> range(min(<span class="hljs-number">4</span>, len(df_extract))):
            po = <span class="hljs-string">f"PO<span class="hljs-subst">{num+<span class="hljs-number">1</span>}</span>"</span>
            <span class="hljs-keyword">try</span>:
                current_data = json.loads(df_extract[<span class="hljs-string">'data'</span>][num])

                <span class="hljs-comment"># order data</span>
                order_df = pd.json_normalize(current_data)
                order_df[<span class="hljs-string">'source_po'</span>] = po
                order_dfs.append(order_df.drop(columns=[<span class="hljs-string">'materials'</span>], errors=<span class="hljs-string">'ignore'</span>))

                <span class="hljs-comment"># materials data</span>
                materials_df = pd.json_normalize(current_data.get(<span class="hljs-string">'materials'</span>, []))
                materials_df[<span class="hljs-string">'source_po'</span>] = po
                material_dfs.append(materials_df)
            <span class="hljs-keyword">except</span> Exception <span class="hljs-keyword">as</span> e:
                mlflow.log_param(<span class="hljs-string">f"error_po<span class="hljs-subst">{num+<span class="hljs-number">1</span>}</span>"</span>, str(e))
                <span class="hljs-keyword">continue</span>

        <span class="hljs-keyword">if</span> <span class="hljs-keyword">not</span> order_dfs <span class="hljs-keyword">or</span> <span class="hljs-keyword">not</span> material_dfs:
            mlflow.log_metric(<span class="hljs-string">"accuracy_score"</span>, <span class="hljs-number">0.0</span>)
            <span class="hljs-keyword">return</span>

        order_df = pd.concat(order_dfs, ignore_index=<span class="hljs-literal">True</span>)
        materials_df = pd.concat(material_dfs, ignore_index=<span class="hljs-literal">True</span>)

        final_order_df = clean_order_df(order_df)
        final_materials_df = clean_materials_df(materials_df)

        <span class="hljs-comment"># this is shaping the data for deepdiff, to convert into a record format</span>
        purchase_order = {
            <span class="hljs-string">"job_id"</span>: final_order_df[<span class="hljs-string">"job_id"</span>].iloc[<span class="hljs-number">0</span>] <span class="hljs-keyword">if</span> <span class="hljs-string">"job_id"</span> <span class="hljs-keyword">in</span> final_order_df.columns <span class="hljs-keyword">and</span> len(final_order_df) &gt; <span class="hljs-number">0</span> <span class="hljs-keyword">else</span> <span class="hljs-string">""</span>,
            <span class="hljs-string">"customer_id"</span>: final_order_df[<span class="hljs-string">"customer_id"</span>].iloc[<span class="hljs-number">0</span>] <span class="hljs-keyword">if</span> <span class="hljs-string">"customer_id"</span> <span class="hljs-keyword">in</span> final_order_df.columns <span class="hljs-keyword">and</span> len(final_order_df) &gt; <span class="hljs-number">0</span> <span class="hljs-keyword">else</span> <span class="hljs-string">""</span>,
            <span class="hljs-string">"po_date"</span>: final_order_df[<span class="hljs-string">"po_date"</span>].iloc[<span class="hljs-number">0</span>] <span class="hljs-keyword">if</span> <span class="hljs-string">"po_date"</span> <span class="hljs-keyword">in</span> final_order_df.columns <span class="hljs-keyword">and</span> len(final_order_df) &gt; <span class="hljs-number">0</span> <span class="hljs-keyword">else</span> <span class="hljs-string">""</span>,
            <span class="hljs-string">"name"</span>: final_order_df[<span class="hljs-string">"name"</span>].iloc[<span class="hljs-number">0</span>] <span class="hljs-keyword">if</span> <span class="hljs-string">"name"</span> <span class="hljs-keyword">in</span> final_order_df.columns <span class="hljs-keyword">and</span> len(final_order_df) &gt; <span class="hljs-number">0</span> <span class="hljs-keyword">else</span> <span class="hljs-string">""</span>,
            <span class="hljs-string">"phone_number"</span>: final_order_df[<span class="hljs-string">"phone_number"</span>].iloc[<span class="hljs-number">0</span>] <span class="hljs-keyword">if</span> <span class="hljs-string">"phone_number"</span> <span class="hljs-keyword">in</span> final_order_df.columns <span class="hljs-keyword">and</span> len(final_order_df) &gt; <span class="hljs-number">0</span> <span class="hljs-keyword">else</span> <span class="hljs-string">""</span>,
            <span class="hljs-string">"delivery_type"</span>: final_order_df[<span class="hljs-string">"delivery_type"</span>].iloc[<span class="hljs-number">0</span>] <span class="hljs-keyword">if</span> <span class="hljs-string">"delivery_type"</span> <span class="hljs-keyword">in</span> final_order_df.columns <span class="hljs-keyword">and</span> len(final_order_df) &gt; <span class="hljs-number">0</span> <span class="hljs-keyword">else</span> <span class="hljs-string">""</span>,
            <span class="hljs-string">"delivery_date"</span>: final_order_df[<span class="hljs-string">"delivery_date"</span>].iloc[<span class="hljs-number">0</span>] <span class="hljs-keyword">if</span> <span class="hljs-string">"delivery_date"</span> <span class="hljs-keyword">in</span> final_order_df.columns <span class="hljs-keyword">and</span> len(final_order_df) &gt; <span class="hljs-number">0</span> <span class="hljs-keyword">else</span> <span class="hljs-string">""</span>,
            <span class="hljs-string">"delivery_time"</span>: final_order_df[<span class="hljs-string">"delivery_time"</span>].iloc[<span class="hljs-number">0</span>] <span class="hljs-keyword">if</span> <span class="hljs-string">"delivery_time"</span> <span class="hljs-keyword">in</span> final_order_df.columns <span class="hljs-keyword">and</span> len(final_order_df) &gt; <span class="hljs-number">0</span> <span class="hljs-keyword">else</span> <span class="hljs-string">""</span>,
            <span class="hljs-string">"delivery_address"</span>: final_order_df[<span class="hljs-string">"delivery_address"</span>].iloc[<span class="hljs-number">0</span>] <span class="hljs-keyword">if</span> <span class="hljs-string">"delivery_address"</span> <span class="hljs-keyword">in</span> final_order_df.columns <span class="hljs-keyword">and</span> len(final_order_df) &gt; <span class="hljs-number">0</span> <span class="hljs-keyword">else</span> <span class="hljs-string">""</span>,
            <span class="hljs-string">"eir_code"</span>: final_order_df[<span class="hljs-string">"eir_code"</span>].iloc[<span class="hljs-number">0</span>] <span class="hljs-keyword">if</span> <span class="hljs-string">"eir_code"</span> <span class="hljs-keyword">in</span> final_order_df.columns <span class="hljs-keyword">and</span> len(final_order_df) &gt; <span class="hljs-number">0</span> <span class="hljs-keyword">else</span> <span class="hljs-string">""</span>
        }

        materials = final_materials_df.to_dict(orient=<span class="hljs-string">'records'</span>)

        order_json = {<span class="hljs-string">"purchase_order"</span>: purchase_order}
        materials_json = {<span class="hljs-string">"materials"</span>: materials}

        order_json = convert_numpy_types(order_json)
        materials_json = convert_numpy_types(materials_json)

        <span class="hljs-comment"># claculate accuracy</span>
        result = diff.calculate_json_accuracy(material_groundtruth, materials_json)
        accuracy_score = result[<span class="hljs-string">'score'</span>]
        total_fields = result[<span class="hljs-string">'total_fields'</span>]
        diff_stats = result[<span class="hljs-string">'json_diff_stats'</span>]

        <span class="hljs-comment"># Log metrics</span>
        mlflow.log_metric(<span class="hljs-string">"accuracy_score"</span>, accuracy_score)
        mlflow.log_metric(<span class="hljs-string">"total_fields"</span>, total_fields)
        mlflow.log_metric(<span class="hljs-string">"diff_additions"</span>, diff_stats[<span class="hljs-string">'additions'</span>])
        mlflow.log_metric(<span class="hljs-string">"diff_deletions"</span>, diff_stats[<span class="hljs-string">'deletions'</span>])
        mlflow.log_metric(<span class="hljs-string">"diff_modifications"</span>, diff_stats[<span class="hljs-string">'modifications'</span>])
        mlflow.log_metric(<span class="hljs-string">"diff_total"</span>, diff_stats[<span class="hljs-string">'total'</span>])

        <span class="hljs-keyword">with</span> open(<span class="hljs-string">"order_extracted.json"</span>, <span class="hljs-string">"w"</span>) <span class="hljs-keyword">as</span> f:
            json.dump(order_json, f, indent=<span class="hljs-number">2</span>)

        <span class="hljs-keyword">with</span> open(<span class="hljs-string">"materials_extracted.json"</span>, <span class="hljs-string">"w"</span>) <span class="hljs-keyword">as</span> f:
            json.dump(materials_json, f, indent=<span class="hljs-number">2</span>)

        mlflow.log_artifact(<span class="hljs-string">"order_extracted.json"</span>)
        mlflow.log_artifact(<span class="hljs-string">"materials_extracted.json"</span>)

        <span class="hljs-comment"># optional i you want to log the data for tracing       </span>
        <span class="hljs-comment"># mlflow.log_artifact("order_dataframe.csv")</span>
        <span class="hljs-comment"># mlflow.log_artifact("materials_dataframe.csv")</span>

        <span class="hljs-keyword">return</span> {
            <span class="hljs-string">"format"</span>: format_col,
            <span class="hljs-string">"accuracy"</span>: accuracy_score,
            <span class="hljs-string">"execution_time"</span>: execution_time,
            <span class="hljs-string">"diff_total"</span>: diff_stats[<span class="hljs-string">'total'</span>]
        }
</code></pre>
<p>Run the MLflow experiments:</p>
<pre><code class="lang-python">
path = <span class="hljs-string">"/lakehouse/default/Files/complete_purchase_order.xlsx"</span>
sheet_names = [<span class="hljs-string">'PO1'</span>, <span class="hljs-string">'PO2'</span>, <span class="hljs-string">'PO3'</span>, <span class="hljs-string">'PO4'</span>, <span class="hljs-string">'PO5'</span>]

<span class="hljs-comment"># Load Excel data</span>
df_dict = {
    sheet: pd.read_excel(path, sheet_name=sheet).dropna(how=<span class="hljs-string">'all'</span>).dropna(how=<span class="hljs-string">'all'</span>, axis=<span class="hljs-number">1</span>)
    <span class="hljs-keyword">for</span> sheet <span class="hljs-keyword">in</span> sheet_names
}


df_extract = pd.DataFrame({
    <span class="hljs-string">"json_data"</span>: [json.loads(df_dict[<span class="hljs-string">f'PO<span class="hljs-subst">{i}</span>'</span>].to_json()) <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> range(<span class="hljs-number">1</span>, <span class="hljs-number">6</span>)],
    <span class="hljs-string">"string_data"</span>: [df_dict[<span class="hljs-string">f'PO<span class="hljs-subst">{i}</span>'</span>].to_string() <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> range(<span class="hljs-number">1</span>, <span class="hljs-number">6</span>)],
    <span class="hljs-string">"markdwn_data"</span>: [df_dict[<span class="hljs-string">f'PO<span class="hljs-subst">{i}</span>'</span>].to_markdown() <span class="hljs-keyword">for</span> i <span class="hljs-keyword">in</span> range(<span class="hljs-number">1</span>, <span class="hljs-number">6</span>)]
})


formats = [<span class="hljs-string">"string_data"</span>, <span class="hljs-string">"json_data"</span>, <span class="hljs-string">"markdown_data"</span>]
results = []

<span class="hljs-keyword">for</span> fmt <span class="hljs-keyword">in</span> formats:
    result = process_format_test(fmt, df_extract.copy(), <span class="hljs-string">f"format_test_<span class="hljs-subst">{fmt}</span>"</span>)
    <span class="hljs-keyword">if</span> result:
        results.append(result)

<span class="hljs-keyword">if</span> results:
    results_df = pd.DataFrame(results)
    results_df.to_csv(<span class="hljs-string">"format_comparison.csv"</span>, index=<span class="hljs-literal">False</span>)

    <span class="hljs-keyword">with</span> mlflow.start_run(run_name=<span class="hljs-string">"summary"</span>):
        mlflow.log_artifact(<span class="hljs-string">"format_comparison.csv"</span>)
        best_format = results_df.loc[results_df[<span class="hljs-string">'accuracy'</span>].idxmax(), <span class="hljs-string">'format'</span>]
        mlflow.log_param(<span class="hljs-string">"best_format"</span>, best_format)

        <span class="hljs-keyword">for</span> col <span class="hljs-keyword">in</span> results_df.columns:
            <span class="hljs-keyword">if</span> col != <span class="hljs-string">'format'</span>:
                mlflow.log_metric(<span class="hljs-string">f"avg_<span class="hljs-subst">{col}</span>"</span>, results_df[col].mean())
</code></pre>
<p>MLflow inline run comparison:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1747893473159/dc1de40e-f326-422e-ac9f-a84e6e57ed2d.gif" alt class="image--center mx-auto" /></p>
<p>In the run comparison, we can see that both string and markdown resulted in 100% accuracy; however, markdown generated more tokens. JSON, on the other hand, was less than 100% accurate.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1747893678744/c5be31ac-77da-4af1-afd7-b0b944419729.png" alt class="image--center mx-auto" /></p>
<p>Why does this matter? Because as you receive more (and more varied) data, and as the LLMs behind AI functions change, you need to be able to trace results back and reproduce/benchmark them (as much as possible) so you can productionize the solution confidently. Using AI is easy; building evals is hard but necessary.</p>
<h1 id="heading-risks-limitations">Risks, Limitations</h1>
<ul>
<li><p>I created four layout examples. In the real world, you might encounter a layout different from the four I used. We can address this in a few ways: extract features from the incoming forms (Azure Doc Intelligence, for example, also provides the layout), train a machine learning model on many examples to create a classifier, or use a large language model as a classifier for incoming forms (e.g. a layout drift model).</p>
</li>
<li><p>Rate limits and context size: GPT-3.5 has a context window of only 16K tokens (including both input and output) and a maximum output of less than 4K tokens. This means it can't handle large Excel tables. In the example above, the average output was 1000 tokens, so for GPT-3.5 the data can't be more than about four times what I used. However, this is temporary: it was announced at FabCon that the model will be updated to newer GPT models with a larger context window (128K instead of 16K). AI functions also have a rate limit of 1000 TPM. For long documents, we can split the data into chunks to work around these limits, but that requires extra effort (see the sketch after this list).</p>
</li>
<li><p>LLMs are bound to hallucinate and are subject to prompt injection. For example, what if the Excel sheet contains text that intentionally or unintentionally instructs the model to behave differently and fudge the numbers?</p>
</li>
<li><p>The eval harness I created checked for two things: schema and values. But we could build more evaluation metrics for specific cases to ensure the accuracy risks are mitigated.</p>
</li>
<li><p>If Microsoft changes the underlying model, it may affect your results. You can always use an Azure OpenAI custom deployment to control for this.</p>
</li>
<li><p>Always be aware of risks associated with sending sensitive data to LLMs and AI services.</p>
</li>
</ul>
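<p>As a rough illustration of the chunking idea mentioned in the list above, here is a minimal sketch that greedily batches dataframe rows so each chunk's string form stays under a token budget. The helper and the budget are my own assumptions, not part of AI functions; you would then run the AI function per chunk and merge the extracted JSON.</p>
<pre><code class="lang-python">import pandas as pd
import tiktoken

tokenizer = tiktoken.encoding_for_model("gpt-3.5-turbo")

def chunk_dataframe(df: pd.DataFrame, max_tokens: int = 3000) -&gt; list:
    """Greedily batch rows so each chunk's string form stays under max_tokens.
    A single oversized row still becomes its own (oversized) chunk."""
    chunks, batch = [], []
    for _, row in df.iterrows():
        candidate = pd.DataFrame(batch + [row])
        if batch and len(tokenizer.encode(candidate.to_string())) &gt; max_tokens:
            chunks.append(pd.DataFrame(batch))
            batch = [row]
        else:
            batch.append(row)
    if batch:
        chunks.append(pd.DataFrame(batch))
    return chunks
</code></pre>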
<p>To estimate cost, you can use my <a target="_blank" href="https://fabric.guru/microsoft-fabric-copilot-and-ai-workload-cu-calculator">Fabric AI &amp; Copilot cost calculator.</a></p>
<h1 id="heading-references">References</h1>
<ul>
<li><p><a target="_blank" href="https://learn.microsoft.com/en-us/fabric/data-science/ai-functions/overview?tabs=pandas">Transform and enrich data seamlessly with AI functions - Microsoft Fabric | Microsoft Learn</a></p>
</li>
<li><p><a target="_blank" href="https://docs.pydantic.dev/latest/">Welcome to Pydantic - Pydantic</a></p>
</li>
<li><p><a target="_blank" href="https://fabric.guru/unstructured-to-structured-using-fabric-ai-functions-to-extract-invoice-data-from-pdfs">Unstructured To Structured : Using Fabric AI Functions To Extract Invoice Data From PDFs</a></p>
</li>
<li><p><a target="_blank" href="https://learn.microsoft.com/en-us/fabric/data-science/ai-functions/configuration">Customize the configuration of AI functions - Microsoft Fabric | Microsoft Learn</a></p>
</li>
</ul>
]]></content:encoded></item><item><title><![CDATA[Extracting Spark Event Logs in Fabric for Monitoring & Optimization]]></title><description><![CDATA[I wrote a blog a couple of weeks ago about extracting Spark driver logs using REST API. In this blog, I will share how to call APIs to get the spark event logs, parse the logs for spark performance metrics. This can be used for debugging, optimizing ...]]></description><link>https://fabric.guru/extracting-spark-event-logs-in-fabric-for-monitoring-and-optimization</link><guid isPermaLink="true">https://fabric.guru/extracting-spark-event-logs-in-fabric-for-monitoring-and-optimization</guid><category><![CDATA[spark metrics]]></category><category><![CDATA[microsoftfabric]]></category><category><![CDATA[spark]]></category><category><![CDATA[instrumentation]]></category><dc:creator><![CDATA[Sandeep Pawar]]></dc:creator><pubDate>Mon, 19 May 2025 07:00:00 GMT</pubDate><content:encoded><![CDATA[<p>I wrote <a target="_blank" href="https://fabric.guru/extracting-fabric-spark-driver-logs-using-api">a blog a couple of weeks ago</a> about extracting Spark driver logs using REST API. In this blog, I will share how to call APIs to get the spark event logs, parse the logs for spark performance metrics. This can be used for debugging, optimizing &amp; monitoring spark applications in Fabric. Note that in this case I am retrieving the logs for an executed application. If you want to do real time monitoring, use the <a target="_blank" href="https://blog.fabric.microsoft.com/en-us/blog/announcing-the-fabric-apache-spark-diagnostic-emitter-collect-logs-and-metrics/">spark emitter</a> + Eventstream instead. <a target="_blank" href="https://learn.microsoft.com/en-us/fabric/fundamentals/workspace-monitoring-overview">Workspace Monitoring</a> in Fabric does not <em>yet</em> include spark logs.</p>
<h3 id="heading-steps">Steps:</h3>
<ul>
<li><p>Get application id and livy id of the application (notebook or SJD)</p>
</li>
<li><p>Retrieve event log zip and save it to a lakehouse</p>
</li>
<li><p>Extract the event log JSON and save it to a lakehouse</p>
</li>
<li><p>Parse the event log for spark metrics</p>
</li>
</ul>
<h2 id="heading-get-session-info">Get Session Info</h2>
<p>Use the <code>get_latest_session_info()</code> function from my <a target="_blank" href="https://fabric.guru/extracting-fabric-spark-driver-logs-using-api">previous blog</a>. This will give you the <strong>latest</strong> application id and the Livy id, which are required for further API calls. Note that if you want to get ids for all previous sessions, modify the function accordingly.</p>
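<p>If you don't have the previous post handy, a minimal sketch of such a helper is shown below, built on the <a target="_blank" href="https://learn.microsoft.com/en-us/rest/api/fabric/spark/livy-sessions">Livy sessions REST API</a>. Treat the response field names and ordering as assumptions and verify them against the actual API payload:</p>
<pre><code class="lang-python">import sempy.fabric as fabric

def get_latest_session_info(notebook_id, workspace_id):
    """Sketch: list Livy sessions for a notebook and return the most recent one."""
    client = fabric.FabricRestClient()
    endpoint = f"https://api.fabric.microsoft.com/v1/workspaces/{workspace_id}/notebooks/{notebook_id}/livySessions"
    response = client.get(endpoint)
    response.raise_for_status()
    sessions = response.json().get("value", [])
    # assumes the first item is the latest; sort by submission time if needed
    return sessions[0] if sessions else None
</code></pre>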
<pre><code class="lang-python"><span class="hljs-comment">## Fabric Python or Pyspark notebook</span>
<span class="hljs-comment">## Function from https://fabric.guru/extracting-fabric-spark-driver-logs-using-api</span>

notebook_id = <span class="hljs-string">"996b5d64-xxxxxxxx-ff2ba76d0fc8"</span>
workspace_id = fabric.get_notebook_workspace_id() <span class="hljs-comment">#or replace with your workspace id</span>
session = get_latest_session_info(notebook_id, workspace_id)
livy_id = session[<span class="hljs-string">'livyId'</span>]
app_id = session[<span class="hljs-string">'applicationId'</span>]
</code></pre>
<h2 id="heading-retrieve-spark-event-log">Retrieve Spark Event Log</h2>
<p>The Spark event log can be several hundred MB or even several GB, so attach a default lakehouse and save the log zip file there.</p>
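<p>If your notebook doesn't have a default lakehouse attached yet, you can attach one at session start with the <code>%%configure</code> magic. A sketch; replace the name with your own lakehouse:</p>
<pre><code class="lang-python">%%configure
{
    "defaultLakehouse": {
        "name": "YourLakehouseName"
    }
}
</code></pre>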
<pre><code class="lang-python"><span class="hljs-keyword">import</span> os
<span class="hljs-keyword">import</span> zipfile
<span class="hljs-keyword">import</span> glob
<span class="hljs-keyword">import</span> sempy.fabric <span class="hljs-keyword">as</span> fabric

<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">get_spark_event_log</span>(<span class="hljs-params">notebook_id, workspace_id, output_path, livy_id=None, app_id=None</span>):</span>
    <span class="hljs-string">"""
    Sandeep Pawar | fabric.guru | May 19,2025
    Gets Spark event logs and saves to specified path in an attached lakehouse.
    """</span>
    <span class="hljs-keyword">if</span> <span class="hljs-keyword">not</span> livy_id <span class="hljs-keyword">or</span> <span class="hljs-keyword">not</span> app_id:
        session_info = get_latest_session_info(notebook_id, workspace_id)
        <span class="hljs-keyword">if</span> <span class="hljs-keyword">not</span> session_info:
            <span class="hljs-keyword">return</span> <span class="hljs-string">"Error: Could not retrieve session info"</span>
        livy_id = session_info.get(<span class="hljs-string">'livyId'</span>)
        app_id = session_info.get(<span class="hljs-string">'applicationId'</span>)

    <span class="hljs-keyword">if</span> os.path.isdir(output_path) <span class="hljs-keyword">or</span> output_path.endswith(<span class="hljs-string">'/'</span>):
        os.makedirs(output_path, exist_ok=<span class="hljs-literal">True</span>)
        output_path = os.path.join(output_path, <span class="hljs-string">f"spark_log_<span class="hljs-subst">{app_id}</span>.zip"</span>)
    <span class="hljs-keyword">else</span>:
        os.makedirs(os.path.dirname(output_path), exist_ok=<span class="hljs-literal">True</span>)

    client = fabric.FabricRestClient()
    <span class="hljs-comment">#refer to https://learn.microsoft.com/en-us/rest/api/fabric/spark/livy-sessions for API details</span>
    <span class="hljs-comment">#/1/ below is for the first try</span>
    endpoint = <span class="hljs-string">f"https://api.fabric.microsoft.com/v1/workspaces/<span class="hljs-subst">{workspace_id}</span>/notebooks/<span class="hljs-subst">{notebook_id}</span>/livySessions/<span class="hljs-subst">{livy_id}</span>/applications/<span class="hljs-subst">{app_id}</span>/1/logs"</span>

    <span class="hljs-keyword">try</span>:
        print(<span class="hljs-string">f"Retrieving logs from endpoint: <span class="hljs-subst">{endpoint}</span>"</span>)
        response = client.get(endpoint)

        <span class="hljs-keyword">if</span> response.status_code != <span class="hljs-number">200</span>:
            <span class="hljs-keyword">return</span> <span class="hljs-string">f"Error: Received status code <span class="hljs-subst">{response.status_code}</span>. Response: <span class="hljs-subst">{response.text}</span>"</span>

        <span class="hljs-keyword">with</span> open(output_path, <span class="hljs-string">"wb"</span>) <span class="hljs-keyword">as</span> f:
            f.write(response.content)

        file_size = len(response.content)
        <span class="hljs-keyword">return</span> <span class="hljs-string">f"Successfully saved event logs (<span class="hljs-subst">{file_size/<span class="hljs-number">1024</span>/<span class="hljs-number">1024</span>:<span class="hljs-number">.2</span>f}</span> MB) to <span class="hljs-subst">{output_path}</span>"</span>

    <span class="hljs-keyword">except</span> Exception <span class="hljs-keyword">as</span> e:
        <span class="hljs-keyword">return</span> <span class="hljs-string">f"Error retrieving logs: <span class="hljs-subst">{str(e)}</span>"</span>
</code></pre>
<h2 id="heading-extract-event-log">Extract Event Log</h2>
<p>The zip file contains the event log, which needs to be extracted. The function below unzips the file and saves it to the specified path in the lakehouse.</p>
<pre><code class="lang-python"><span class="hljs-comment">## Be sure to attach a default lakehouse</span>
<span class="hljs-comment"># I assumed it's .zip, adjust if other zip compression in the future</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">unzip_spark_log</span>(<span class="hljs-params">zip_path, extract_path</span>):</span>
    <span class="hljs-string">"""
    Sandeep Pawar | fabric.guru | May 19, 2025
    Extracts Spark event log zip to specified directory. 

    """</span>
    <span class="hljs-keyword">if</span> os.path.isdir(zip_path):
        zip_files = glob.glob(os.path.join(zip_path, <span class="hljs-string">"*.zip"</span>))
        <span class="hljs-keyword">if</span> <span class="hljs-keyword">not</span> zip_files:
            <span class="hljs-keyword">return</span> <span class="hljs-string">f"Error: <span class="hljs-subst">{zip_path}</span> is a directory and no zip files were found within it"</span>
        zip_path = zip_files[<span class="hljs-number">0</span>]
        print(<span class="hljs-string">f"Using zip file: <span class="hljs-subst">{zip_path}</span>"</span>)

    <span class="hljs-keyword">if</span> <span class="hljs-keyword">not</span> os.path.exists(zip_path):
        <span class="hljs-keyword">return</span> <span class="hljs-string">f"Error: Zip file <span class="hljs-subst">{zip_path}</span> doesn't exist"</span>

    os.makedirs(extract_path, exist_ok=<span class="hljs-literal">True</span>)

    <span class="hljs-keyword">try</span>:
        <span class="hljs-keyword">with</span> zipfile.ZipFile(zip_path, <span class="hljs-string">'r'</span>) <span class="hljs-keyword">as</span> zip_ref:
            file_list = zip_ref.namelist()
            zip_ref.extractall(extract_path)

        <span class="hljs-keyword">return</span> <span class="hljs-string">f"Extracted <span class="hljs-subst">{len(file_list)}</span> files to <span class="hljs-subst">{extract_path}</span>: <span class="hljs-subst">{<span class="hljs-string">', '</span>.join(file_list)}</span>"</span>

    <span class="hljs-keyword">except</span> Exception <span class="hljs-keyword">as</span> e:
        <span class="hljs-keyword">return</span> <span class="hljs-string">f"Error extracting logs: <span class="hljs-subst">{str(e)}</span>"</span>
</code></pre>
<p>Example:</p>
<pre><code class="lang-python"><span class="hljs-comment">##python or pyspark notebook</span>
output_path = <span class="hljs-string">"/lakehouse/default/Files/rawlogs"</span>
get_spark_event_log(notebook_id, workspace_id,output_path , livy_id=livy_id, app_id=app_id) <span class="hljs-comment">#saves the zip file</span>
unzip_spark_log(output_path, output_path) <span class="hljs-comment">#unzips the zip file</span>
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1747715798186/c4ff546e-6e8c-4a11-af7c-cbfe6cb8b2c1.png" alt class="image--center mx-auto" /></p>
<h2 id="heading-parse-log-for-spark-metrics">Parse log for spark metrics:</h2>
<p>Parse the event log JSON to get the performance metrics. Below I am extracting the stage metrics across all the stages. From this you can get metrics like data spill, GC time, shuffle read/write, CPU time, idle time, memory used, etc., which can help you optimize the Spark application (more on this later).</p>
<pre><code class="lang-python"><span class="hljs-comment">##pyspark notebook</span>
%%pyspark
df = spark.read.json(<span class="hljs-string">f"Files/rawlogs/application_xxxxxxx"</span>)
df1 = df.filter(<span class="hljs-string">"Event='SparkListenerStageCompleted'"</span>).select(<span class="hljs-string">"`Stage Info`.*"</span>)
df1.createOrReplaceTempView(<span class="hljs-string">"t2"</span>)
df2 = spark.sql(<span class="hljs-string">"select 'Submission Time','Completion Time', 'Number of Tasks', 'Stage ID', t3.col.* from t2 lateral view explode(Accumulables) t3"</span>)
df2.createOrReplaceTempView(<span class="hljs-string">"t4"</span>)
result = spark.sql(<span class="hljs-string">"select Name, sum(Value) as value from t4 group by Name order by Name"</span>)
display(result)
</code></pre>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1747716080783/3257cd75-a60e-49b7-9451-9c462c8ce7ac.png" alt class="image--center mx-auto" /></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1747716130137/2bcb4dbb-c5ca-477c-8b51-ad6cfc2d0220.png" alt class="image--center mx-auto" /></p>
<p>Similar to stage-level metrics, you can also get task-level metrics, or aggregate metrics for all jobs in a workspace, etc. There are a couple of other interesting APIs which I will cover in future blogs.</p>
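<p>Task-level metrics follow the same pattern. Below is a sketch that aggregates a few fields from the <code>SparkListenerTaskEnd</code> events, reusing the <code>df</code> loaded above; the field names follow the standard Spark event log schema, so verify them against your log:</p>
<pre><code class="lang-python">tasks = df.filter("Event = 'SparkListenerTaskEnd'").select("`Stage ID`", "`Task Metrics`.*")
tasks.createOrReplaceTempView("tasks")
task_summary = spark.sql("""
    select `Stage ID`,
           count(*)                    as num_tasks,
           sum(`Executor Run Time`)    as run_time_ms,
           sum(`JVM GC Time`)          as gc_time_ms,
           sum(`Memory Bytes Spilled`) as memory_spilled,
           sum(`Disk Bytes Spilled`)   as disk_spilled
    from tasks
    group by `Stage ID`
    order by `Stage ID`
""")
display(task_summary)
</code></pre>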
<p>I will share more on how to interpret these metrics and how they can provide insights into the application. Note that Fabric offers many built-in <a target="_blank" href="https://learn.microsoft.com/en-us/fabric/data-engineering/spark-monitoring-overview">monitoring capabilities</a>. APIs give you the ability to access detailed metrics and create additional custom metrics.</p>
<p>You can download the event log manually as well from the spark history server:</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1747755139148/8793f06c-74ae-4dfc-a56a-c4ffb912e6a5.png" alt class="image--center mx-auto" /></p>
<p>I would like to thank <a target="_blank" href="https://www.linkedin.com/in/jenny-jiang-8b57036/">Jenny Jiang from Microsoft</a> for answering my questions.</p>
<h2 id="heading-references">References:</h2>
<ul>
<li><p><a target="_blank" href="https://fabric.guru/extracting-fabric-spark-driver-logs-using-api">Extracting Fabric Spark Driver Logs Using API</a></p>
</li>
<li><p><a target="_blank" href="https://blog.fabric.microsoft.com/en-us/blog/announcing-the-fabric-apache-spark-diagnostic-emitter-collect-logs-and-metrics/">Announcing the Fabric Apache Spark Diagnostic Emitter: Collect Logs and Metrics | Microsoft Fabric Blog | Microsoft Fabric</a></p>
</li>
<li><p><a target="_blank" href="https://blog.fabric.microsoft.com/en-us/blog/announcing-the-fabric-apache-spark-diagnostic-emitter-collect-logs-and-metrics/">Livy Sessions - REST API (Spark) | Microsoft Learn</a></p>
</li>
<li><p><a target="_blank" href="https://blog.fabric.microsoft.com/en-us/blog/announcing-the-fabric-apache-spark-diagnostic-emitter-collect-logs-and-metrics/">Apache Spark monitoring overview - Microsoft Fabric | Microsoft L</a><a target="_blank" href="https://learn.microsoft.com/en-us/fabric/data-engineering/spark-monitoring-overview">earn</a></p>
</li>
<li><p><a target="_blank" href="https://github.com/LucaCanali/sparkMeasure">https://github.com/LucaCanali/sparkMeasure</a></p>
</li>
<li><p><a target="_blank" href="https://spark.apache.org/docs/latest/monitoring.html">Monitoring and Instrumentation - Spark 3.5.5 Documentation</a></p>
</li>
<li><p><a target="_blank" href="https://kb.databricks.com/metrics/explore-spark-metrics">How to explore Apache Spark metrics with Spark listeners - Databricks</a></p>
</li>
<li><p><a target="_blank" href="https://github.com/groupon/spark-metrics">groupon/spark-metrics: A library to expose more of Apache Spark's metrics system</a></p>
</li>
<li><p><a target="_blank" href="https://learn.microsoft.com/en-us/fabric/fundamentals/workspace-monitoring-overview">Workspace monitoring overview - Microsoft Fabric | Microsoft Learn</a></p>
</li>
</ul>
]]></content:encoded></item></channel></rss>