Cross-referencing Notebooks In The Updated Fabric Notebook Copilot
At FabCon Atlanta last week, the updated notebook Copilot for data engineering and data science was announced. It brings agentic capabilities to Copilot and is much more intelligent and Fabric-aware than the previous version. You can read the documentation here. For example, you can now ask Copilot to do the following things, which you couldn't previously:
- Use Copilot without starting a session: just open the notebook, open Copilot, and start asking questions and making changes. This saves you CUs.
- List items in the workspace: list all the lakehouses in this workspace
- Take actions: mount the <lakehouse_1> and <lakehouse_2> lakehouses and make <lakehouse_1> the default lakehouse
- Get ABFSS paths: give me the abfss path of lakehouse <lakehouse_1>
- Create Spark pool configurations using %%configure: add configuration to use 4 cores
- Refer to content in a cell by cell number: explain cell 11
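As an illustration of the `%%configure` case, a prompt like *add configuration to use 4 cores* would typically result in a cell along these lines. This is a sketch, not Copilot's literal output; the `driverCores`/`executorCores` keys are assumptions based on the standard `%%configure` JSON used in Fabric and Synapse notebooks:

```
%%configure -f
{
    "driverCores": 4,
    "executorCores": 4
}
```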
Give it a try and you will be surprised how well it works.
But one of my favorite features is being able to read and refer to other notebooks. For example, I can ask Copilot to read notebook_1 from the same workspace. Think of the implications for a second. Below is one example of how this can be helpful.
Cross-referencing notebooks
- In a Fabric workspace, I created a notebook with a markdown cell that includes rules from the Palantir PySpark style guide. This style guide is an opinionated guide to PySpark code style for common situations, with the associated best practices, based on the most frequently recurring topics in PySpark code. Below is a summarized version in markdown:
PySpark Style Guide
Purpose: This notebook is a style contract for AI-assisted code generation and review. When referenced from another notebook (e.g., refer to @pyspark_style_guide), you MUST apply every rule below to all PySpark code produced or reviewed in that session. Adapted from the Palantir PySpark Style Guide (MIT License).
VERSIONS
Use features and APIs supported by the following versions:
Spark 3.5
Delta 3.2
Python 3.11
Enforcement Checklist
When reviewing or generating PySpark code, walk through each check below in order. Flag every violation found. Do not skip checks.
| # | Check | What to look for |
|---|-------|------------------|
| C1 | Imports | Any bare `from pyspark.sql.functions import ...` or alias other than `F`, `T`, `W` |
| C2 | Column access | Any `df.colName` dot-access outside of a join `on=` clause |
| C3 | String column refs | Any `F.col('x')` that could just be `'x'` (Spark 3.0+) |
| C4 | Variable names | Any single-letter dataframe names (`df`, `o`, `d`, `t`) |
| C5 | Magic values | Any literal string, number, or threshold inline in `filter`, `when`, `withColumn`, `select` that is not a named constant |
| C6 | Select contract | More than one function per column in a `select`, or a `.when()` expression inside a `select` |
| C7 | withColumnRenamed | Any use. Replace with `select` + `.alias()` |
| C8 | Empty columns | Any `lit('')`, `lit('NA')`, `lit('N/A')`. Must be `lit(None)` |
| C9 | Logical density | More than 3 boolean expressions in a single `.filter()` or `F.when()` without named variables |
| C10 | Chain length | More than 5 chained statements in one block |
| C11 | Chain mixing | Joins, filters, `withColumn`, and selects mixed in the same chain |
| C12 | Join hygiene | Any `.join()` missing explicit `how=` |
| C13 | Right joins | Any `how='right'`. Swap df order, use `left` |
| C14 | Window frames | Any `Window.partitionBy(...).orderBy(...)` without explicit `.rowsBetween()` or `.rangeBetween()` |
| C15 | Window nulls | `F.first()` or `F.last()` without `ignorenulls=True` |
| C16 | Global windows | Empty `W.partitionBy()` or window without `orderBy` used for aggregation. Use `.agg()` instead |
| C17 | Otherwise fallback | `.otherwise(<catch-all value>)` masking unexpected data. Use `None` or omit |
| C18 | Line continuation | Any `\` for multiline. Wrap in parentheses instead |
| C19 | UDFs | Any `@udf` or `F.udf()`. Rewrite with native functions |
| C20 | Comments | Comments that describe what code does instead of why a decision was made |
| C21 | Dead code | Commented-out code blocks. Remove them |
| C22 | Function size | Functions over ~70 lines or files over ~250 lines |
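As a side note (not part of the guide itself), a few of the simpler checks in the checklist are mechanical enough that you could approximate them with plain regexes over a cell's source. This is a hypothetical mini-linter sketch covering only C1 and C18; the function and patterns are my own illustration, not anything Copilot or the style guide provides:

```python
import re

# Hypothetical sketch: approximate checks C1 (bare function imports)
# and C18 (backslash line continuation) with simple regexes.
CHECKS = {
    "C1": re.compile(r"^\s*from\s+pyspark\.sql\.functions\s+import\s"),
    "C18": re.compile(r"\\\s*$"),
}

def flag_violations(cell_source: str) -> list[str]:
    """Return the IDs of checks violated by any line of a notebook cell."""
    flagged = []
    for check_id, pattern in CHECKS.items():
        if any(pattern.search(line) for line in cell_source.splitlines()):
            flagged.append(check_id)
    return flagged

cell = (
    "from pyspark.sql.functions import col, when\n"
    "df = df.filter(col('a') == 'x') \\\n"
    "    .select('a')"
)
print(flag_violations(cell))  # ['C1', 'C18']
```

Most of the remaining checks (chain mixing, logical density, magic values) need real parsing or an LLM reviewer, which is exactly where the Copilot cross-referencing below comes in.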
Anti-Patterns (find and fix these)
Each pattern below is a regex-like signature. If you see it, it is a violation.
AP1: Bare function imports

```python
# VIOLATION: any of these
from pyspark.sql.functions import col, when, sum, lit
import pyspark.sql.functions as func

# FIX: always
from pyspark.sql import functions as F
from pyspark.sql import types as T
from pyspark.sql import Window as W
```

AP2: Dot-access column references

```python
# VIOLATION: df.column_name anywhere except join on=
df.select(df.order_id, df.amount)
df.withColumn('x', df.price * df.qty)

# FIX: use string refs
df.select('order_id', 'amount')
df.withColumn('x', F.col('price') * F.col('qty'))
```

AP3: Inline magic values

```python
# VIOLATION: bare literals in logic
df.filter(F.col('amount') > 500)
F.when(F.col('status') == 'shipped', 'In Transit')
df.filter(F.col('days') < 365)

# FIX: named constants at top of cell/function
HIGH_VALUE_THRESHOLD = 500
STATUS_SHIPPED = 'shipped'
LABEL_IN_TRANSIT = 'In Transit'
ONE_YEAR_DAYS = 365

df.filter(F.col('amount') > HIGH_VALUE_THRESHOLD)
F.when(F.col('status') == STATUS_SHIPPED, LABEL_IN_TRANSIT)
df.filter(F.col('days') < ONE_YEAR_DAYS)
```

AP4: Complex logic inside .when() or .filter()

```python
# VIOLATION: more than 3 conditions inline
df.filter(
    (F.col('a') == 'x')
    & (F.col('b') > 10)
    & (F.col('c') != 'y')
    & ((F.col('d') == 'online') | (F.col('d') == 'partner'))
)

# FIX: named boolean expressions, max 3 in the final filter
is_valid_type = (F.col('a') == TYPE_X)
above_threshold = (F.col('b') > MIN_THRESHOLD)
not_excluded = (F.col('c') != EXCLUDED_STATUS)
is_target_channel = (F.col('d') == CHANNEL_ONLINE) | (F.col('d') == CHANNEL_PARTNER)

flagged = is_valid_type & above_threshold & not_excluded & is_target_channel
df.filter(flagged)
```

AP5: .when() inside select

```python
# VIOLATION: conditional logic embedded in select
df.select(
    'order_id',
    F.when(F.col('status') == 'shipped', 'In Transit')
     .when(F.col('status') == 'delivered', 'Complete')
     .alias('status_label'),
)

# FIX: select plain columns, then withColumn for derived logic
df = df.select('order_id', 'status')
df = df.withColumn(
    'status_label',
    F.when(F.col('status') == STATUS_SHIPPED, LABEL_IN_TRANSIT)
     .when(F.col('status') == STATUS_DELIVERED, LABEL_COMPLETE),
)
```

AP6: Empty column sentinels

```python
# VIOLATION
df.withColumn('notes', F.lit(''))
df.withColumn('review_date', F.lit('N/A'))

# FIX
df.withColumn('notes', F.lit(None))
df.withColumn('review_date', F.lit(None))
```

AP7: Missing window frame

```python
# VIOLATION: implicit frame
w = W.partitionBy('customer_id').orderBy('order_date')

# FIX: always explicit
w = (
    W.partitionBy('customer_id')
    .orderBy('order_date')
    .rowsBetween(W.unboundedPreceding, 0)
)
```

AP8: Blanket .otherwise()

```python
# VIOLATION: masks unexpected values
F.when(..., 'A').when(..., 'B').otherwise('Unknown')

# FIX: omit otherwise (returns null) or use lit(None) explicitly
F.when(..., 'A').when(..., 'B')
```

AP9: Monster chains

```python
# VIOLATION: mixed concerns, too long
df = (df.select(...).filter(...).withColumn(...).join(...).drop(...).withColumn(...))

# FIX: separate by concern, max 5 per block
df = (
    df
    .select(...)
    .filter(...)
)
df = df.withColumn(...)
df = df.join(..., how='inner')
```

AP10: Backslash continuation

```python
# VIOLATION
df = df.filter(F.col('a') == 'x') \
       .filter(F.col('b') > 10)

# FIX: parentheses
df = (
    df
    .filter(F.col('a') == 'x')
    .filter(F.col('b') > 10)
)
```
Quick Reference (for code generation)
When writing new code, apply these defaults:
- Imports: `F`, `T`, `W` only
- Columns: string refs where possible, `F.col()` when needed
- Descriptive df names: `orders_df`, `active_orders`, not `df`, `o`
- Constants: every literal in logic gets a `SCREAMING_SNAKE` name
- Selects: plain columns + one transform each, no `.when()` inside
- Chains: max 5 lines, group by concern (filter/select, then enrich, then join)
- Joins: always `how=`, always `left` not `right`, alias for disambiguation
- Windows: always explicit frame, always `ignorenulls=True` on `first`/`last`
- Empty cols: `F.lit(None)`, never `lit('')` or `lit('NA')`
- No UDFs, no `.otherwise()` fallbacks, no `\` continuations
- Comments explain why, not what. No commented-out code.
I named the notebook PYSPARK_STYLE_GUIDE. It's all caps intentionally (more on this later).
In another notebook, which already has some PySpark code, I opened Copilot.
I asked: List all notebooks in this workspace. I can see the PYSPARK_STYLE_GUIDE notebook:
- My notebook has one cell with a large code block (intentional). I prompted Copilot:
refer to @PYSPARK_STYLE_GUIDE and fix the code without losing the function and purpose
Copilot read the style notebook and applied the rules to the cells in this notebook. You could also use this to extract code patterns from other notebooks, e.g. how did <notebook_name> ingest the data, or use the same library as <notebook_name> to create ML features. Super handy.
Your BI/DE/DS team could also create reference pattern notebooks and refer to them to drive consistency and quality. Note that you can list items in another workspace, but you can't reference notebooks across workspaces.
This was for Copilot in Fabric notebooks. In an upcoming blog, I will share how I use Skills for Fabric for development.