
Unstructured To Structured 1 : Extracting List Of Fabric APIs Using LLM

Using LLM to make a list of Fabric APIs and their details


Principal Program Manager, Microsoft Fabric CAT, helping users and organizations build scalable, insightful, secure solutions. Blogs and opinions are my own and do not represent my employer.

💡
📢 I am planning to write a series of blogs on extracting structured output from unstructured data, especially in Fabric. There is a lot to test, learn, and write, and I am short on time. So, to keep things moving, I will share the code first and write the explanation and details as I find time. This blog is a work in progress, so feel free to visit again in a week or two.

There is hardly a day when I am not using Fabric APIs, whether via Semantic Link, Semantic Link Labs, or plain requests. The number of available APIs grows every single week, and it's hard to keep track of which APIs exist, their limitations, whether they support SPNs, etc. All APIs are published to MS Learn, but they are not tabulated anywhere that I am aware of. In this blog, I will use an LLM to extract the APIs, their descriptions, limitations, example requests, and which identities they support. I will demonstrate it with the Gemini 1.5 Flash model, but in the next blog I will use the AI services available in Fabric to do the same thing.

Recipe

  1. Any MS Learn documentation can be downloaded as a PDF. I will be scraping this PDF to extract the text. Here is the link: rest api fabric | Microsoft Learn

  2. This PDF is 1300+ pages if you download it. Passed to an LLM as is, it's ~450,000 tokens. While Gemini Flash can handle 1M input tokens, the output is limited to ~8K tokens. To work around that, I parse the PDF to extract each API service and pass the chunked text to the LLM.

  3. The model is instructed to extract the data as JSON, with specific constraints and examples provided (more on this later).

  4. Loop over each API service and combine all the results into a dataframe.
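The token math in step 2 can be sanity-checked with the common rough heuristic of ~4 characters per token (illustrative only; actual counts depend on the model's tokenizer):

```python
def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token for English text.
    # Actual counts depend on the model's tokenizer.
    return len(text) // 4

# A 1300+ page PDF with ~1.8M characters of text lands near the
# ~450,000 tokens mentioned above.
print(estimate_tokens("x" * 1_800_000))  # 450000
```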

Get the data

Each Fabric API has its own documentation page, and it would be impractical, if not impossible, to dynamically scrape each page. Instead, I will download the entire documentation as a PDF and convert that to text. Roman Klimenko on LinkedIn rightly pointed out that I could also scrape it from GitHub, but the challenge is that almost all the API documentation pages are in a private repo and cannot be accessed. In the code below, I use the PyPDF2 library to extract the text from the PDF. The output will be totally unstructured text from all pages, and our goal is to extract structured data from it.

%pip install PyPDF2 --q
%pip install google-generativeai --q

import requests
import PyPDF2
from io import BytesIO

# Download the entire Fabric REST API documentation as a single PDF
url = "https://learn.microsoft.com/pdf?url=https%3A%2F%2Flearn.microsoft.com%2Fen-us%2Frest%2Fapi%2Ffabric%2Ftoc.json"
response = requests.get(url)
response.raise_for_status()  # fail fast on a bad HTTP status
pdf_file = BytesIO(response.content)

# Extract the raw text from every page into one long string
pdf_reader = PyPDF2.PdfReader(pdf_file)
all_apis = ""
for page in pdf_reader.pages:
    all_apis += page.extract_text()

Parse Text

As mentioned above, the text is very long (450,000 tokens). While some LLMs, like Google Gemini, can handle this, we are still limited by the output tokens. Additionally, as the number of tokens increases, the output quality generally decreases. To manage the large text, we need to break it into chunks. The most logical way to do this is by extracting each API so we don't lose any context. I noticed in the PDF that each API has a "Service" category, so in the code below, I split the text by "Service" and save each API by "Service." This will keep the text chunks semantically grouped, which will also help the LLM.

def parse_api_by_service(text):
    """Split the raw PDF text into chunks keyed by the 'Service:' category."""
    services = {}
    current_api = ""
    current_service = None

    lines = text.split('\n')

    for i, line in enumerate(lines):
        if 'Service:' in line:
            service_name = line.split('Service:')[1].strip()

            # save the previous API chunk if one is in progress
            if current_api and current_service:
                services.setdefault(current_service, []).append(current_api)

            # the API name appears two lines above the 'Service:' line;
            # use the loop index here (lines.index(line) would return the
            # first matching line and break on duplicate lines)
            api_name = lines[i - 2].strip()
            current_api = api_name + '\n'
            current_service = service_name

        elif current_service:
            current_api += line + '\n'

    # save the final API chunk
    if current_api and current_service:
        services.setdefault(current_service, []).append(current_api)

    return services

services = parse_api_by_service(all_apis)

print(services.keys())
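Before sending anything to the model, it helps to eyeball the chunk sizes per service; a minimal sketch (using a toy `services` dict in place of the real parsed one):

```python
# Toy stand-in for the parsed dict: service name -> list of API text chunks
services = {
    "Items": ["Items - Get Item\nService: Items\nGET .../items/{itemId}"],
    "Workspaces": ["Workspaces - List\nService: Workspaces\nGET .../workspaces"],
}

# Characters per service; divide by ~4 for a rough token estimate
sizes = {name: sum(len(chunk) for chunk in chunks) for name, chunks in services.items()}
for name, total in sorted(sizes.items(), key=lambda kv: -kv[1]):
    print(f"{name}: ~{total // 4} tokens")
```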

The result is:

Prompt

To do: ⚒ Add explanation (I have a lot to say about this)

instructions= """Extract API details in the following structured format:
{
 apiName: "name of the API endpoint (leave blank if not found)",
 description: "concise description (leave blank if not found)", 
 limitations: "any limits/constraints (leave blank if not found)",
 sampleRequest: "API request format (leave blank if not found)",
 supportedIdentities: {
   user: "Yes/No/blank if not found",
   servicePrincipal: "Yes/No/blank if not found", 
   managedIdentities: "Yes/No/blank if not found"
 }
}
from the following text:
{{text}}

Constraints:
- Extract only API documentation content
- Include all endpoint variations
- Use {curlyBraces} for variables 
- Leave fields blank if info not found


Example Output:
{
 "apiName": "Items - Get Item",
 "description": "Returns properties of the specified item",
 "limitations": "To create a non-PowerBI Fabric item the workspace must be on a supported Fabric capacity, "200 requests per hour",
 "sampleRequest": "GET https://api.fabric.microsoft.com/v1/workspaces/{workspaceId}/items/{itemId}",
 "supportedIdentities": {
   "user": "Yes",
   "servicePrincipal": "Yes",
   "managedIdentities": ""
 }
}
"""

Large Language Model

To do: ⚒ Add explanation


import os
import google.generativeai as genai
import json

## Get API key from Google AI Studio
genai.configure(api_key="<key>")

# Create the model
generation_config = {
  "temperature": 0.3,
  "max_output_tokens": 8192,
  "response_mime_type": "application/json",
}

model = genai.GenerativeModel(
  model_name="gemini-1.5-flash-002",
  generation_config=generation_config,
  system_instruction=str(instructions),
)

chat_session = model.start_chat( history=[ ] )

Get LLM Response

To do: ⚒ Add explanation

import time

def process_api_sections(sections, model):
    """Send each service's text chunk to the model and collect the parsed JSON."""
    results = {}
    for section in sections.keys():
        try:
            print(f"Extracting: {section}, text_size: {len(sections[section])}")
            # start a fresh chat per section so chunks don't share context
            chat_session = model.start_chat(history=[])
            response = chat_session.send_message(sections[section])
            results[section] = json.loads(response.candidates[0].content.parts[0].text)
            print(f"Extracted APIs: {len(results[section])}")
        except Exception as e:
            print(f"Error processing {section}: {e}")
            results[section] = []

        time.sleep(5)  # pause between calls to stay under the rate limit

    return results

result = process_api_sections(services, model)

Create a dataframe

To do: ⚒ Add explanation

import pandas as pd

# Flatten each service's JSON into a frame, then combine once at the end
frames = []
for key, content in result.items():
    _df = pd.json_normalize(content)
    _df['service'] = key
    frames.append(_df)

df = pd.concat(frames, ignore_index=True)
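`pd.json_normalize` flattens the nested `supportedIdentities` object into dotted column names; a toy record shows the resulting shape:

```python
import pandas as pd

# Toy record shaped like one entry of the LLM output
record = {
    "apiName": "Items - Get Item",
    "supportedIdentities": {"user": "Yes", "servicePrincipal": "Yes"},
}
flat = pd.json_normalize(record)
print(list(flat.columns))
# ['apiName', 'supportedIdentities.user', 'supportedIdentities.servicePrincipal']
```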

Result:

There you have it: we just extracted 200 APIs.

💡
As mentioned above, I will add more details as time permits, but until then feel free to test this and let me know your thoughts.

Power BI Report

To do: I will save this to a lakehouse and build a Power BI report with Publish to web for everyone to use.
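Until then, a lightweight way to persist the table from a notebook is a plain file write (a sketch with toy data; in a Fabric notebook the attached lakehouse is typically mounted under /lakehouse/default/Files, but the local path here lets the snippet run anywhere):

```python
import pandas as pd

# Toy dataframe standing in for the extracted API table
df = pd.DataFrame([
    {"apiName": "Items - Get Item", "service": "Items"},
    {"apiName": "Workspaces - List Workspaces", "service": "Workspaces"},
])

# In a Fabric notebook, a path like /lakehouse/default/Files/fabric_apis.csv
# would land the file in the attached lakehouse instead.
df.to_csv("fabric_apis.csv", index=False)
```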
