Unstructured to Structured 1: Extracting a List of Fabric APIs Using an LLM
Using an LLM to build a list of Fabric APIs and their details
There is hardly a day when I am not using Fabric APIs, whether through Semantic Link, Semantic Link Labs, or plain requests. The number of available APIs grows every single week, and it's hard to keep track of which APIs exist, their limitations, whether they support SPN, and so on. All the APIs are documented on MS Learn, but they aren't tabulated anywhere that I am aware of. In this blog, I will use an LLM to extract the APIs, their descriptions, limitations, example requests, and which identities they support. I will show it using the Gemini 1.5 Flash model, but in the next blog I will use the AI Services available in Fabric to do the same thing.

Recipe

1. Any MS Learn documentation can be downloaded as a PDF, and I will scrape this PDF to extract the text. Here is the link: rest api fabric | Microsoft Learn
2. The PDF is 1,300+ pages when downloaded, which is ~450,000 tokens if passed to an LLM as is. While Gemini Flash can handle a 1M-token context, the output is limited to ~8K tokens. To overcome that, I parse the PDF to extract each API service and pass the chunked text to the LLM, with instructions to return the data as JSON under specific constraints and with examples provided (more on this later).
3. Loop over each API service and combine all the results into a dataframe.
Get the data
Each Fabric API has its own documentation page, and it would be impractical, if not impossible, to dynamically scrape each page. Instead, I will download the entire documentation as a PDF and convert it to text. Roman Klimenko on LinkedIn rightly pointed out that I could also scrape it from GitHub, but the challenge is that almost all of the API documentation pages are in a private repo and cannot be accessed. In the code below, I use the PyPDF2 library to extract the text from the PDF. The output will be completely unstructured text from all pages, and our goal is to extract structured data from it.
%pip install PyPDF2 --q
%pip install google-generativeai --q

import requests
import PyPDF2
from io import BytesIO

# Download the full Fabric REST API documentation as a single PDF
url = "https://learn.microsoft.com/pdf?url=https%3A%2F%2Flearn.microsoft.com%2Fen-us%2Frest%2Fapi%2Ffabric%2Ftoc.json"
response = requests.get(url)
pdf_file = BytesIO(response.content)
pdf_reader = PyPDF2.PdfReader(pdf_file)

# Concatenate the extracted text from every page
all_apis = ""
for page in pdf_reader.pages:
    all_apis += page.extract_text()
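A quick sanity check on what we just extracted; the ~4 characters-per-token figure below is only a rough rule of thumb:

# Page count and a rough token estimate (~4 characters per token)
print(f"Pages: {len(pdf_reader.pages)}")
print(f"Characters: {len(all_apis):,} (~{len(all_apis) // 4:,} tokens)")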
Parse Text
As mentioned above, the text is very long (~450,000 tokens). While some LLMs, like Google Gemini, can handle this, we are still limited by the output tokens, and as the input token count grows, output quality generally degrades. To manage the large text, we need to break it into chunks, and the most logical split is per API so we don't lose any context. I noticed in the PDF that each API carries a "Service:" marker, so in the code below I split the text at each "Service:" line and group the APIs by service. This keeps the chunks semantically grouped, which also helps the LLM.

def parse_api_by_service(text):
    services = {}
    current_api = ""
    current_service = None
    lines = text.split('\n')
    for i, line in enumerate(lines):
        if 'Service:' in line:
            service_name = line.split('Service:')[1].strip()
            # Save the previous API before starting a new one
            if current_api and current_service:
                services.setdefault(current_service, []).append(current_api)
            # The API name sits two lines above the "Service:" line in the PDF text
            api_name = lines[i - 2].strip()
            current_api = api_name + '\n'
            current_service = service_name
        elif current_service:
            current_api += line + '\n'
    # Save the final API once the loop ends
    if current_api and current_service:
        services.setdefault(current_service, []).append(current_api)
    return services
services = parse_api_by_service(all_apis)
print(services.keys())
The result is a dictionary of API text chunks keyed by service name.
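Before calling the model, it is worth checking how much text each service carries, since a huge chunk could still push the response past the ~8K output-token cap. A small check I added:

# Total characters per service, largest first
sizes = {s: sum(len(a) for a in apis) for s, apis in services.items()}
for service, size in sorted(sizes.items(), key=lambda kv: -kv[1])[:5]:
    print(f"{service}: {size:,} characters")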

Prompt
The prompt does three things: it pins the output to a fixed JSON schema, it tells the model to leave fields blank rather than guess (which curbs hallucinated values), and it includes a one-shot example so the format stays consistent across chunks. The {curlyBraces} constraint keeps path variables like {workspaceId} in a predictable form.
instructions = """Extract API details in the following structured format:
{
  apiName: "name of the API endpoint (leave blank if not found)",
  description: "concise description (leave blank if not found)",
  limitations: "any limits/constraints (leave blank if not found)",
  sampleRequest: "API request format (leave blank if not found)",
  supportedIdentities: {
    user: "Yes/No/blank if not found",
    servicePrincipal: "Yes/No/blank if not found",
    managedIdentities: "Yes/No/blank if not found"
  }
}
from the following text:
{{text}}
Constraints:
- Extract only API documentation content
- Include all endpoint variations
- Use {curlyBraces} for variables
- Leave fields blank if info not found
Example Output:
{
  "apiName": "Items - Get Item",
  "description": "Returns properties of the specified item",
  "limitations": "To create a non-PowerBI Fabric item the workspace must be on a supported Fabric capacity; 200 requests per hour",
  "sampleRequest": "GET https://api.fabric.microsoft.com/v1/workspaces/{workspaceId}/items/{itemId}",
  "supportedIdentities": {
    "user": "Yes",
    "servicePrincipal": "Yes",
    "managedIdentities": ""
  }
}
"""
Large Language Model
I use the gemini-1.5-flash-002 model with a low temperature (0.3) for more deterministic extraction, and response_mime_type set to application/json so the response comes back as parseable JSON. The instructions above are passed as the system instruction; each text chunk is then sent as the chat message, which effectively fills the {{text}} placeholder.
import os
import google.generativeai as genai
import json

## Get API key from Google AI Studio
genai.configure(api_key="<key>")

# Create the model
generation_config = {
    "temperature": 0.3,                        # low temperature for consistent extraction
    "max_output_tokens": 8192,                 # output cap for the flash model
    "response_mime_type": "application/json",  # ask for parseable JSON back
}
model = genai.GenerativeModel(
    model_name="gemini-1.5-flash-002",
    generation_config=generation_config,
    system_instruction=instructions,
)
chat_session = model.start_chat(history=[])
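Before processing every service, a quick smoke test on one chunk confirms the key, model name, and JSON mode all work (this snippet is my addition):

# Smoke test: send the first service's text and eyeball the JSON that comes back
first_service = next(iter(services))
test_response = model.start_chat(history=[]).send_message(services[first_service])
print(test_response.candidates[0].content.parts[0].text[:500])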
Get LLM Response
For each service, I start a fresh chat session, send the chunked text, and parse the JSON response. Failures are logged and the section is recorded as empty rather than aborting the whole run, and a short pause between calls keeps us within the rate limits.
import time

def process_api_sections(sections, model, generation_config):
    results = {}
    for section in sections.keys():
        try:
            print(f"Extracting: {section}, chunks: {len(sections[section])}")
            # Fresh chat session per service so chunks don't share context
            chat_session = model.start_chat(history=[])
            response = chat_session.send_message(sections[section])
            results[section] = json.loads(response.candidates[0].content.parts[0].text)
            print(f"Extracted APIs: {len(results[section])}")
        except Exception as e:
            print(f"Error processing {section}: {e}")
            results[section] = []
        # Brief pause between calls to stay within rate limits
        time.sleep(5)
    return results

result = process_api_sections(services, model, generation_config)
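Free-tier Gemini calls are rate limited, so if the fixed 5-second pause is not enough, a retry helper with exponential backoff is a sensible hardening step. send_with_retry below is a sketch of mine, not part of the original code:

import json
import time

def send_with_retry(model, text, max_retries=3, base_delay=10):
    # Retry a chunk with exponentially growing delays between attempts
    for attempt in range(max_retries):
        try:
            session = model.start_chat(history=[])
            response = session.send_message(text)
            return json.loads(response.candidates[0].content.parts[0].text)
        except Exception as e:
            if attempt == max_retries - 1:
                raise
            delay = base_delay * (2 ** attempt)
            print(f"Attempt {attempt + 1} failed ({e}); retrying in {delay}s")
            time.sleep(delay)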
Create a dataframe
Each response is a list of JSON objects, so pd.json_normalize flattens it (including the nested supportedIdentities fields) into rows, a service column records where each API came from, and everything is concatenated into one dataframe.
import pandas as pd

df = pd.DataFrame()
for key, content in result.items():
    _df = pd.json_normalize(content)  # flattens nested supportedIdentities.* fields into columns
    _df['service'] = key              # record which service each API came from
    df = pd.concat([df, _df], ignore_index=True)
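A couple of quick checks on the combined table; the column names come straight from the prompt's schema, and the CSV export is optional:

# How many APIs did we end up with, and from which services?
print(df.shape)
print(df['service'].value_counts().head(10))
df.to_csv("fabric_apis.csv", index=False)  # optional local export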
Result:
There you have it, we just extracted 200 APIs.

Power BI Report
Next, I will save this table to a lakehouse and build a Power BI report, published to the web, for everyone to use.
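As a sketch of the lakehouse step, assuming this runs in a Fabric notebook with a default lakehouse attached (the table name fabric_apis is my choice):

# Delta tables don't allow dots in column names, so flatten the json_normalize names first
df.columns = [c.replace('.', '_') for c in df.columns]
# 'spark' is predefined in Fabric notebooks; cast to string to avoid mixed-type columns
spark.createDataFrame(df.astype(str)).write.mode("overwrite").saveAsTable("fabric_apis")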