Unstructured to Structured 1: Extracting a List of Fabric APIs Using an LLM
Using an LLM to build a list of Fabric APIs and their details
There is hardly a day when I am not using Fabric APIs, whether through Semantic Link, Semantic Link Labs, or plain requests. The number of available APIs grows every single week, and it's hard to keep track of which APIs exist, their limitations, whether they support SPN, and so on. All the APIs are documented on MS Learn, but they aren't tabulated anywhere that I am aware of. In this blog, I will use an LLM to extract the APIs, their descriptions, limitations, example requests, and which identities they support. I will show it using the Gemini 1.5 Flash model, but in the next blog I will use the AI Services available in Fabric to do the same thing.

Recipe

1. Any MS Learn documentation can be downloaded as a PDF, and I will scrape this PDF to extract the text. Here is the link: rest api fabric | Microsoft Learn
2. The PDF is 1,300+ pages when downloaded, which is ~450,000 tokens if passed to an LLM as is. While Gemini Flash can handle a 1M-token context, the output is limited to ~8K tokens. To overcome that, I parse the PDF to extract each API service and pass the chunked text to the LLM, with instructions to return the data as JSON under specific constraints and with examples provided (more on this later).
3. Loop over each API service and combine all the results into a dataframe.
Get the data
Each Fabric API has its own documentation page, and it would be impractical, if not impossible, to dynamically scrape each page. Instead, I will download the entire documentation as a PDF and convert it to text. Roman Klimenko on LinkedIn rightly pointed out that I could also scrape it from GitHub, but the challenge is that almost all of the API documentation pages are in a private repo and cannot be accessed. In the code below, I use the PyPDF2 library to extract the text from the PDF. The output will be completely unstructured text from all pages, and our goal is to extract structured data from it.
%pip install PyPDF2 --q
%pip install google-generativeai --q

import requests
import PyPDF2
from io import BytesIO

# Download the full Fabric REST API documentation as a single PDF
url = "https://learn.microsoft.com/pdf?url=https%3A%2F%2Flearn.microsoft.com%2Fen-us%2Frest%2Fapi%2Ffabric%2Ftoc.json"
response = requests.get(url)
pdf_file = BytesIO(response.content)
pdf_reader = PyPDF2.PdfReader(pdf_file)

# Concatenate the extracted text from every page
all_apis = ""
for page in pdf_reader.pages:
    all_apis += page.extract_text()
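A quick sanity check on what we just extracted; the ~4 characters-per-token figure below is only a rough rule of thumb:

# Page count and a rough token estimate (~4 characters per token)
print(f"Pages: {len(pdf_reader.pages)}")
print(f"Characters: {len(all_apis):,} (~{len(all_apis) // 4:,} tokens)")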
Parse Text
As mentioned above, the text is very long (~450,000 tokens). While some LLMs, like Google Gemini, can handle this, we are still limited by the output tokens, and as the input token count grows, output quality generally degrades. To manage the large text, we need to break it into chunks, and the most logical split is per API so we don't lose any context. I noticed in the PDF that each API carries a "Service:" marker, so in the code below I split the text at each "Service:" line and group the APIs by service. This keeps the chunks semantically grouped, which also helps the LLM.

def parse_api_by_service(text):
    services = {}
    current_api = ""
    current_service = None
    lines = text.split('\n')
    for i, line in enumerate(lines):
        if 'Service:' in line:
            service_name = line.split('Service:')[1].strip()
            # Save the previous API before starting a new one
            if current_api and current_service:
                services.setdefault(current_service, []).append(current_api)
            # The API name sits two lines above the "Service:" line in the PDF text
            api_name = lines[i - 2].strip()
            current_api = api_name + '\n'
            current_service = service_name
        elif current_service:
            current_api += line + '\n'
    # Save the final API once the loop ends
    if current_api and current_service:
        services.setdefault(current_service, []).append(current_api)
    return services
services = parse_api_by_service(all_apis)
print(services.keys())
The result is a dictionary of API text chunks keyed by service name.
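Before calling the model, it is worth checking how much text each service carries, since a huge chunk could still push the response past the ~8K output-token cap. A small check I added:

# Total characters per service, largest first
sizes = {s: sum(len(a) for a in apis) for s, apis in services.items()}
for service, size in sorted(sizes.items(), key=lambda kv: -kv[1])[:5]:
    print(f"{service}: {size:,} characters")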

Prompt
The prompt does three things: it pins the output to a fixed JSON schema, it tells the model to leave fields blank rather than guess (which curbs hallucinated values), and it includes a one-shot example so the format stays consistent across chunks. The {curlyBraces} constraint keeps path variables like {workspaceId} in a predictable form.
instructions = """Extract API details in the following structured format:
{
  apiName: "name of the API endpoint (leave blank if not found)",
  description: "concise description (leave blank if not found)",
  limitations: "any limits/constraints (leave blank if not found)",
  sampleRequest: "API request format (leave blank if not found)",
  supportedIdentities: {
    user: "Yes/No/blank if not found",
    servicePrincipal: "Yes/No/blank if not found",
    managedIdentities: "Yes/No/blank if not found"
  }
}
from the following text:
{{text}}
Constraints:
- Extract only API documentation content
- Include all endpoint variations
- Use {curlyBraces} for variables
- Leave fields blank if info not found
Example Output:
{
  "apiName": "Items - Get Item",
  "description": "Returns properties of the specified item",
  "limitations": "To create a non-PowerBI Fabric item the workspace must be on a supported Fabric capacity; 200 requests per hour",
  "sampleRequest": "GET https://api.fabric.microsoft.com/v1/workspaces/{workspaceId}/items/{itemId}",
  "supportedIdentities": {
    "user": "Yes",
    "servicePrincipal": "Yes",
    "managedIdentities": ""
  }
}
"""
Large Language Model
I use the gemini-1.5-flash-002 model with a low temperature (0.3) for more deterministic extraction, and response_mime_type set to application/json so the response comes back as parseable JSON. The instructions above are passed as the system instruction; each text chunk is then sent as the chat message, which effectively fills the {{text}} placeholder.
import os
import google.generativeai as genai
import json

## Get API key from Google AI Studio
genai.configure(api_key="<key>")

# Create the model
generation_config = {
    "temperature": 0.3,                        # low temperature for consistent extraction
    "max_output_tokens": 8192,                 # output cap for the flash model
    "response_mime_type": "application/json",  # ask for parseable JSON back
}
model = genai.GenerativeModel(
    model_name="gemini-1.5-flash-002",
    generation_config=generation_config,
    system_instruction=instructions,
)
chat_session = model.start_chat(history=[])
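Before processing every service, a quick smoke test on one chunk confirms the key, model name, and JSON mode all work (this snippet is my addition):

# Smoke test: send the first service's text and eyeball the JSON that comes back
first_service = next(iter(services))
test_response = model.start_chat(history=[]).send_message(services[first_service])
print(test_response.candidates[0].content.parts[0].text[:500])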
Get LLM Response
For each service, I start a fresh chat session, send the chunked text, and parse the JSON response. Failures are logged and the section is recorded as empty rather than aborting the whole run, and a short pause between calls keeps us within the rate limits.
import time

def process_api_sections(sections, model, generation_config):
    results = {}
    for section in sections.keys():
        try:
            print(f"Extracting: {section}, chunks: {len(sections[section])}")
            # Fresh chat session per service so chunks don't share context
            chat_session = model.start_chat(history=[])
            response = chat_session.send_message(sections[section])
            results[section] = json.loads(response.candidates[0].content.parts[0].text)
            print(f"Extracted APIs: {len(results[section])}")
        except Exception as e:
            print(f"Error processing {section}: {e}")
            results[section] = []
        # Brief pause between calls to stay within rate limits
        time.sleep(5)
    return results

result = process_api_sections(services, model, generation_config)
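Free-tier Gemini calls are rate limited, so if the fixed 5-second pause is not enough, a retry helper with exponential backoff is a sensible hardening step. send_with_retry below is a sketch of mine, not part of the original code:

import json
import time

def send_with_retry(model, text, max_retries=3, base_delay=10):
    # Retry a chunk with exponentially growing delays between attempts
    for attempt in range(max_retries):
        try:
            session = model.start_chat(history=[])
            response = session.send_message(text)
            return json.loads(response.candidates[0].content.parts[0].text)
        except Exception as e:
            if attempt == max_retries - 1:
                raise
            delay = base_delay * (2 ** attempt)
            print(f"Attempt {attempt + 1} failed ({e}); retrying in {delay}s")
            time.sleep(delay)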
Create a dataframe
Each response is a list of JSON objects, so pd.json_normalize flattens it (including the nested supportedIdentities fields) into rows, a service column records where each API came from, and everything is concatenated into one dataframe.
import pandas as pd

df = pd.DataFrame()
for key, content in result.items():
    _df = pd.json_normalize(content)  # flattens nested supportedIdentities.* fields into columns
    _df['service'] = key              # record which service each API came from
    df = pd.concat([df, _df], ignore_index=True)
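A couple of quick checks on the combined table; the column names come straight from the prompt's schema, and the CSV export is optional:

# How many APIs did we end up with, and from which services?
print(df.shape)
print(df['service'].value_counts().head(10))
df.to_csv("fabric_apis.csv", index=False)  # optional local export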
Result:
There you have it, we just extracted 200 APIs.

Power BI Report
Next, I will save this table to a lakehouse and build a Power BI report, published to the web, for everyone to use.
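As a sketch of the lakehouse step, assuming this runs in a Fabric notebook with a default lakehouse attached (the table name fabric_apis is my choice):

# Delta tables don't allow dots in column names, so flatten the json_normalize names first
df.columns = [c.replace('.', '_') for c in df.columns]
# 'spark' is predefined in Fabric notebooks; cast to string to avoid mixed-type columns
spark.createDataFrame(df.astype(str)).write.mode("overwrite").saveAsTable("fabric_apis")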