Skip to main content

Command Palette

Search for a command to run...

Cross-referencing Notebooks In The Updated Fabric Notebook Copilot

Published

At FabCon Atlanta last week, the updated notebook Copilot for data engineering and data science was announced. It brings agentic capabilities to the Copilot and is much more intelligent and Fabric-aware than the previous version. You can read the documentation here. For example, you can now do and ask the Copilot following things which you couldn't previously:

  • Use the Copilot without starting a session. Just open the notebook, Copilot and start asking questions and making changes. Saves you CUs.

  • list items in the workspace : list all the lakehouses in this workspace

  • take actions : mount the and lakehouses and make the default lakehouse

  • get ABFSS paths : give me abfs path of lakehouse <lakehouse_1>.

  • create spark pool configurations using %%configure : add configuration to use 4 cores

  • refer to content in a cell by cell # : explain cell 11,

Give it a try and you will be surprised how well it works.

But one of my most favorite features is being able to read and refer to other notebooks. For example, I can ask the Copilot to read notebook_1 from the same workspace. Think of the implications for a second. Below is one example, how this can be helpful.

Cross-referencing notebooks

  1. In a Fabric workspace I created a notebook with a markdown that includes rules from Palantir PySpark style guide. This style guide is an opinionated guide to PySpark code style for common situations and the associated best practices based on the most frequent recurring topics across the PySpark. Below is a summarized version in a markdown:

PySpark Style Guide

Purpose: This notebook is a style contract for AI-assisted code generation and review. When referenced from another notebook (e.g., refer to @pyspark_style_guide), you MUST apply every rule below to all PySpark code produced or reviewed in that session.

Adapted from Palantir PySpark Style Guide (MIT License).


VERSIONS

Use features and API supported by following versions:

  • Spark 3.5

  • Delta 3.2

  • Python 3.11

Enforcement Checklist

When reviewing or generating PySpark code, walk through each check below in order. Flag every violation found. Do not skip checks.

# Check What to look for
C1 Imports Any bare from pyspark.sql.functions import ... or alias other than F, T, W
C2 Column access Any df.colName dot-access outside of a join on= clause
C3 String column refs Any F.col('x') that could just be 'x' (Spark 3.0+)
C4 Variable names Any single-letter dataframe names (df, o, d, t)
C5 Magic values Any literal string, number, or threshold inline in filter, when, withColumn, select that is not a named constant
C6 Select contract More than one function per column in a select, or a .when() expression inside a select
C7 withColumnRenamed Any use. Replace with select + .alias()
C8 Empty columns Any lit(''), lit('NA'), lit('N/A'). Must be lit(None)
C9 Logical density More than 3 boolean expressions in a single .filter() or F.when() without named variables
C10 Chain length More than 5 chained statements in one block
C11 Chain mixing Joins, filters, withColumn, and selects mixed in the same chain
C12 Join hygiene Any .join() missing explicit how=
C13 Right joins Any how='right'. Swap df order, use left
C14 Window frames Any Window.partitionBy(...).orderBy(...) without explicit .rowsBetween() or .rangeBetween()
C15 Window nulls F.first() or F.last() without ignorenulls=True
C16 Global windows Empty W.partitionBy() or window without orderBy used for aggregation. Use .agg() instead
C17 Otherwise fallback .otherwise(<catch-all value>) masking unexpected data. Use None or omit
C18 Line continuation Any \ for multiline. Wrap in parentheses instead
C19 UDFs Any @udf or F.udf(). Rewrite with native functions
C20 Comments Comments that describe what code does instead of why a decision was made
C21 Dead code Commented-out code blocks. Remove them
C22 Function size Functions over ~70 lines or files over ~250 lines

Anti-Patterns (find and fix these)

Each pattern below is a regex-like signature. If you see it, it is a violation.

AP1: Bare function imports

# VIOLATION: any of these
from pyspark.sql.functions import col, when, sum, lit
import pyspark.sql.functions as func

# FIX: always
from pyspark.sql import functions as F
from pyspark.sql import types as T
from pyspark.sql import Window as W

AP2: Dot-access column references

# VIOLATION: df.column_name anywhere except join on=
df.select(df.order_id, df.amount)
df.withColumn('x', df.price * df.qty)

# FIX: use string refs
df.select('order_id', 'amount')
df.withColumn('x', F.col('price') * F.col('qty'))

AP3: Inline magic values

# VIOLATION: bare literals in logic
df.filter(F.col('amount') > 500)
F.when(F.col('status') == 'shipped', 'In Transit')
df.filter(F.col('days') < 365)

# FIX: named constants at top of cell/function
HIGH_VALUE_THRESHOLD = 500
STATUS_SHIPPED = 'shipped'
LABEL_IN_TRANSIT = 'In Transit'
ONE_YEAR_DAYS = 365

df.filter(F.col('amount') > HIGH_VALUE_THRESHOLD)
F.when(F.col('status') == STATUS_SHIPPED, LABEL_IN_TRANSIT)
df.filter(F.col('days') < ONE_YEAR_DAYS)

AP4: Complex logic inside .when() or .filter()

# VIOLATION: more than 3 conditions inline
df.filter(
    (F.col('a') == 'x') & (F.col('b') > 10) & (F.col('c') != 'y')
    & ((F.col('d') == 'online') | (F.col('d') == 'partner'))
)

# FIX: named boolean expressions, max 3 in the final filter
is_valid_type = (F.col('a') == TYPE_X)
above_threshold = (F.col('b') > MIN_THRESHOLD)
not_excluded = (F.col('c') != EXCLUDED_STATUS)
is_target_channel = (F.col('d') == CHANNEL_ONLINE) | (F.col('d') == CHANNEL_PARTNER)

flagged = is_valid_type & above_threshold & not_excluded & is_target_channel
df.filter(flagged)

AP5: .when() inside select

# VIOLATION: conditional logic embedded in select
df.select(
    'order_id',
    F.when(F.col('status') == 'shipped', 'In Transit')
     .when(F.col('status') == 'delivered', 'Complete')
     .alias('status_label'),
)

# FIX: select plain columns, then withColumn for derived logic
df = df.select('order_id', 'status')
df = df.withColumn(
    'status_label',
    F.when(F.col('status') == STATUS_SHIPPED, LABEL_IN_TRANSIT)
     .when(F.col('status') == STATUS_DELIVERED, LABEL_COMPLETE)
)

AP6: Empty column sentinels

# VIOLATION
df.withColumn('notes', F.lit(''))
df.withColumn('review_date', F.lit('N/A'))

# FIX
df.withColumn('notes', F.lit(None))
df.withColumn('review_date', F.lit(None))

AP7: Missing window frame

# VIOLATION: implicit frame
w = W.partitionBy('customer_id').orderBy('order_date')

# FIX: always explicit
w = (W.partitionBy('customer_id')
      .orderBy('order_date')
      .rowsBetween(W.unboundedPreceding, 0))

AP8: Blanket .otherwise()

# VIOLATION: masks unexpected values
F.when(..., 'A').when(..., 'B').otherwise('Unknown')

# FIX: omit otherwise (returns null) or use lit(None) explicitly
F.when(..., 'A').when(..., 'B')

AP9: Monster chains

# VIOLATION: mixed concerns, too long
df = (df.select(...).filter(...).withColumn(...).join(...).drop(...).withColumn(...))

# FIX: separate by concern, max 5 per block
df = (
    df
    .select(...)
    .filter(...)
)
df = df.withColumn(...)
df = df.join(..., how='inner')

AP10: Backslash continuation

# VIOLATION
df = df.filter(F.col('a') == 'x') \
       .filter(F.col('b') > 10)

# FIX: parentheses
df = (
    df
    .filter(F.col('a') == 'x')
    .filter(F.col('b') > 10)
)

Quick Reference (for code generation)

When writing new code, apply these defaults:

  • Imports: F, T, W only

  • Columns: string refs where possible, F.col() when needed

  • Descriptive df names: orders_df, active_orders, not df, o

  • Constants: every literal in logic gets a SCREAMING_SNAKE name

  • Selects: plain columns + one transform each, no .when() inside

  • Chains: max 5 lines, group by concern (filter/select, then enrich, then join)

  • Joins: always how=, always left not right, alias for disambiguation

  • Windows: always explicit frame, always ignorenulls=True on first/last

  • Empty cols: F.lit(None), never lit('') or lit('NA')

  • No UDFs, no .otherwise() fallbacks, no \ continuations

  • Comments explain why, not what. No commented-out code.

  1. I named the notebook PYSPARK_STYLE_GUIDE. It's all caps intentionally (more on this later).

  2. In another notebook, which already has some PySpark code , I opened Copilot.

  3. Asked : List all notebooks in this workspace. I can see the PYSPARK_STYLE_GUIDE notebook:

  1. My notebook has one cell with large code block (intentional). I prompted Copilot :

refer to @PYSPARK_STYLE_GUIDE and fix the code without losing the function and purpose

https://youtu.be/gtj_f7oBeuk

💡
As with anything AI, be sure to always back-up, test and verify.

Copilot read the style notebook and applied the rules to the cells in this notebook. You could also use this to extract code patterns from other notebooks, e.g. how did <notebook_name> ingested the data, use the same library as <notebook_name> to create ML features etc. Super handy.

Your BI/DE/DS team could also create reference pattern notebooks, and refer them for driving consistency and quality. Note that you can list items in another workspace but can't refer cross-workspace.

This was for Copilot in Fabric notebook. In an upcoming blog, I will share how I use Skills for Fabric for development.

Reference: