Azure Data
3 TopicsBringing Generative AI to Your Data: Semantic Operators in Azure Database for PostgreSQL
by: Arda Aytekin, PhD, Principal Software Engineer; Binnur Gorer, PhD, Senior Software Engineer; Serdar Mumcu, Principal Software Engineer While vector search has been a crucial component in Generative AI (GenAI) applications and agents, it only scratches the surface of what is possible. Vector search often fails to capture the intricate semantic relationships within enterprise data, leading to a significant loss of valuable knowledge at the foundational level. This limitation restricts the agent’s ability to fully understand and reason about the data, as the semantic connections remain hidden and this loss propagates upward. However, with the increasing power, affordability, and accessibility of Large Language Models (LLMs), we now have the tools to uncover these semantic relationships and deliver more accurate results. Although Azure AI Foundry provides powerful APIs for working with these advanced AI models, integrating them into data-centric applications often requires orchestration layers and external services, which can add operational overhead. To eliminate this friction, we are introducing Semantic Operators—is_true, extract, generate, and rank—within the azure_ai extension for Azure Database for PostgreSQL. These operators leverage the capabilities of LLMs directly from SQL workflows, enabling truth evaluation, high-level data extraction, intelligent text generation, and document reranking, respectively. This innovation empowers you to unlock deeper insights and elevate the performance of your AI applications. Meet the Operators We introduce four core semantic operators, each designed to address distinct categories of semantic analysis challenges. Their real power comes from the flexible input options and the variety of supported LLMs, enabling our users to explore the potential of LLMs for their specific use cases and challenges. Truth Evaluation azure_ai.is_true evaluates whether a natural language statement is likely true. It is ideal for filtering noisy or off-topic content. Below is a simple example of how to use the operator: WITH statements(stmt) AS ( VALUES ('Surface Pro is made with recycled content, including 100% recycled aluminum alloy in the enclosure of 12-inch and 13-inch devices.'), ('Surface Pro is a product of an EU-based startup.'), ('My laptop has a great battery life.') ) SELECT stmt, azure_ai.is_true(stmt) AS truthy FROM statements; which gives the following result set (subject to variability of LLMs’ responses): stmt truthy Surface Pro is made with recycled content, including 100% recycled aluminum alloy in the enclosure of 12-inch and 13-inch devices. True Surface Pro is a product of an EU-based startup. False My laptop has a great battery life. NULL As we can observe from the result set, azure_ai.is_true outputs NULL when the truthiness of the value cannot be determined and issues a warning with a detailed message (intentionally left out for brevity). High-Level Data Extraction azure_ai.extract extracts structured high-level data (e.g., features, entities) from unstructured text. It is particularly useful for identifying named entities, actions, or other key elements within a text, making it easier to organize and analyze information. Below is a simple example from a review processing task: WITH reviews(review) AS ( VALUES ('The product has a great battery life.'), ('Noise cancellation does not work as advertised. Avoid this product.'), ('The product has a good design, but it is a bit heavy. Not recommended for travel.'), ('Music quality is good but call quality could have been better.') ) SELECT review, azure_ai.extract( 'Concisely extract features from the review, leave n/a or 0 when unsure: ' || review, ARRAY[ 'battery: string - battery life of the product', 'design: string - design features of the product', 'sound: string - sound quality (e.g., music, call, noise cancellation) of the product', 'sentiment: number - sentiment score of the review; 1 (lowest) to 5 (highest)' ] ) AS data FROM reviews; This query yields the following result set (again, subject to variability of LLMs’ responses): review data The product has a great battery life. {"sound": "n/a", "design": "n/a", "battery": "great battery life", "sentiment": 5} Noise cancellation does not work as advertised. Avoid this product. {"sound": "Noise cancellation does not work as advertised.", "design": "n/a", "battery": "n/a", "sentiment": 1} The product has a good design, but it is a bit heavy. Not recommended for travel. {"sound": "n/a", "design": "good design, but a bit heavy", "battery": "n/a", "sentiment": 3} Music quality is good but call quality could have been better. {"sound": "Music quality is good, call quality could have been better.", "design": "n/a", "battery": "n/a", "sentiment": 3} Intelligent Text Generation azure_ai.generate generates freeform text using an LLM based on the input and system prompts. It can be used to summarize, translate, rewrite, or respond to content, providing users with the ability to create coherent and contextually appropriate text effortlessly. Additionally, its capabilities extend beyond these functions, as users can craft different prompts and provide a JSON schema to instruct the LLM to generate structured information for their specific use cases. Revisiting the previous review processing task, we can get a one-sentence summary of the reviews by employing the freeform version of this operator: WITH reviews(review) AS ( VALUES ('The product has a great battery life.'), ('Noise cancellation does not work as advertised. Avoid this product.'), ('The product has a good design, but it is a bit heavy. Not recommended for travel.'), ('Music quality is good but call quality could have been better.') ) SELECT azure_ai.generate( string_agg('Review: ' || review, '\n'), system_prompt => 'Summarize the reviews in a single sentence concisely.' ) AS summary FROM reviews; We get the following response from the LLM: summary The product has good battery life, design, and music quality, but falls short on noise cancellation, call quality, and portability. Document Reranking azure_ai.rank re-ranks a list of documents based on their relevance to a query. It empowers businesses to substantially improve the accuracy and relevance of information retrieved in search and retrieval-augmented generation (RAG) systems. In addition to supporting LLMs for reranking, the operator also accommodates cross-encoder models specifically designed for this function. To understand this operator’s power, let us revisit the previous review processing task and ask ourselves the question: “Which review best answers the customer query: ‘clear calling capability that blocks out background noise’?” Here is an example SQL query with azure_ai.rank: WITH reviews(id, review) AS ( VALUES (1, 'The product has a great battery life.'), (2, 'Noise cancellation does not work as advertised. Avoid this product.'), (3, 'The product has a good design, but it is a bit heavy. Not recommended for travel.'), (4, 'Music quality is good but call quality could have been better.') ) SELECT rank, id, review FROM azure_ai.rank( 'clear calling capability that blocks out background noise', ARRAY(SELECT review FROM reviews ORDER BY id ASC), ARRAY(SELECT id FROM reviews ORDER BY id ASC) ) rr LEFT JOIN reviews r USING (id) ORDER BY rank ASC; The above query yields the following reranked reviews (lower rank means higher relevance): rank id review 1 2 Noise cancellation does not work as advertised. Avoid this product. 2 4 Music quality is good but call quality could have been better. 3 3 The product has a good design, but it is a bit heavy. Not recommended for travel. 4 1 The product has a great battery life. Let us explore how these operators work in a real-world scenario. Use Case: Processing Product Descriptions and Reviews Imagine you're managing a platform that collects thousands of user reviews for various products daily. With such a vast amount of freeform, user-generated content, efficiently analyzing and surfacing the most meaningful insights become crucial. By leveraging automation and AI-driven techniques, you can streamline the review processing workflow to improve searchability, relevance, and moderation efficiency. For this use case, we will work on the Agentic Shop dataset, which contains product descriptions and the corresponding user reviews for three categories of products: headphones, smartwatches and tablets. The product table has IDs, names, categories and freeform text descriptions of the products, as well as the descriptions’ embeddings. Similarly, the review table has review IDs, product IDs and freeform review text, as well as the reviews’ embeddings. Both tables have indexes on the embedding columns to support vector (similarity) search. Step by step, we will perform the following operations: Do semantic similarity search on product descriptions and reviews, Improve the semantic search using reranking, Filter out the false positives using azure_ai.is_true, Extract high-level data to build a graph on top of products and reviews, and, Answer complex questions using the graph and semantic operators. With the help of semantic operators, you can do all of this in SQL! Semantic Similarity Search With the tables, embeddings and indexes at hand, let us try to list the Top 10 headphones with “good clear calling capability and good reviews about battery life.” By using semantic similarity search alone, you can try listing the Top 10 related products first, and then get the Top 10 semantically most relevant reviews for them: WITH potential_headphones AS ( SELECT id AS product_id, name, description FROM products WHERE category = 'headphones' ORDER BY description_emb <=> azure_openai.create_embeddings( 'text-embedding-3-small', 'good clear calling' )::vector ASC LIMIT 10 ) SELECT p.product_id, r.id AS review_id, p.name AS product_name, p.description, r.review_text FROM potential_headphones p LEFT JOIN reviews r USING (product_id) ORDER BY review_text_emb <=> azure_openai.create_embeddings( 'text-embedding-3-small', 'good battery life' )::vector ASC LIMIT 10; The query yields the following result set (kept to the Top 5 results and redacted for brevity): product_id review_id product_name description review_text 11 247 [redacted] [redacted] [redacted] 11 244 [redacted] [redacted] [redacted] 15 339 [redacted] [redacted] [redacted] 10 220 [redacted] [redacted] [redacted] 5 114 [redacted] [redacted] [redacted] When we inspect the result set carefully, we notice, for instance, that even though the product description for product_id=15 contains product features around “superior noise cancellation” and “high-quality sound performance,” it does not specifically mention “good calling capability.” This is of course expected due to the nature of semantic similarity search and our choice of first retrieving the top products and then fetching the topmost relevant reviews. Reranking the Results To alleviate the previous issue and improve the semantic similarity search results, we can benefit from azure_ai.rank. Let us try to improve the previous query by fetching the Top 50 relevant reviews first, and then reranking the results by using a cross-encoder: WITH potential_headphones AS ( SELECT id AS product_id, name, description FROM product WHERE category = 'headphones' ORDER BY description_emb <=> azure_openai.create_embeddings( 'text-embedding-3-small', 'good clear calling' )::vector ASC LIMIT 10 ), potential_headphones_reviews AS ( SELECT p.product_id, r.id AS review_id, p.name AS product_name, p.description, r.review_text, ROW_NUMBER() OVER () AS row_id FROM potential_headphones p LEFT JOIN review r USING (product_id) ORDER BY review_text_emb <=> azure_openai.create_embeddings( 'text-embedding-3-small', 'good battery life' )::vector ASC LIMIT 50 ), reranked_results AS ( SELECT id AS row_id, rank FROM azure_ai.rank( 'good clear calling capability and good battery life', ARRAY( SELECT 'Product Description: ' || description || '. Review: ' || review_text FROM potential_headphones_reviews ORDER BY row_id ASC ), ARRAY( SELECT row_id FROM potential_headphones_reviews ORDER BY row_id ASC ) ) ) SELECT phr.product_id, phr.review_id, phr.product_name, phr.description, phr.review_text FROM potential_headphones_reviews phr LEFT JOIN reranked_results rr USING (row_id) ORDER BY rr.rank ASC LIMIT 10; Now, this improved query yields the following result set: product_id review_id product_name description review_text 14 316 [redacted] [redacted] [redacted] 14 319 [redacted] [redacted] [redacted] 8 184 [redacted] [redacted] [redacted] 19 440 [redacted] [redacted] [redacted] 8 176 [redacted] [redacted] [redacted] As you can see, we have received different, and semantically more relevant, results than before. Specifically, we do not see product_id=15 in the Top 5 reranked results, anymore. This time, however, when we inspect the results carefully, we notice, for example, that the review text for review_id=319 is as follows: I enjoy these headphones, but they aren't without their flaws. The audio clarity is splendid, and I do find them quite comfortable, even after extended use. However, the battery life is not as long as advertised; I often find myself charging them more frequently than I would like. Still, they do provide solid sound for an overall great listening experience, especially for the price. The issue here is that the result set now has reviews that have better (semantic) relevance scores than before when compared to the provided natural language query, but the model is not able to differentiate negative reviews from the positive ones. Filtering out the False Positives To get rid of the false positives that appear in the reranked reviews, we leverage azure_ai.is_true as a filtering/where condition on the reranked results: WITH potential_headphones AS ( -- as before ), potential_headphones_reviews AS ( -- as before ), reranked_results AS ( -- as before ) SELECT phr.product_id, phr.review_id, phr.product_name, phr.description, phr.review_text FROM potential_headphones_reviews phr LEFT JOIN reranked_results rr USING (row_id) WHERE azure_ai.is_true( 'The following product description and review satisfies the following customer ask: ' || 'good clear calling capability and good battery life' || '. Product Description: ' || phr.description || '. Review: ' || phr.review_text ) ORDER BY rr.rank ASC LIMIT 10; Finally, we get the following result set, which excludes the false positives from the reranked results of the previous query: product_id review_id product_name description review_text 14 316 [redacted] [redacted] [redacted] 8 184 [redacted] [redacted] [redacted] 19 440 [redacted] [redacted] [redacted] 19 435 [redacted] [redacted] [redacted] 8 186 [redacted] [redacted] [redacted] Extracting High-Level Data to Build a Graph Extracting high-level data from product descriptions and reviews provides structured insights into customer experiences. By using azure_ai.extract, you can identify key aspects of the product being discussed and the overall sentiment of the review by configuring the description of the data you want to extract. After obtaining the required high-level data in a structured format, you can build even more complex relationships on your datasets by using a graph representation of data. To this end, you can leverage the Apache AGE extension in Azure Database for PostgreSQL. For our use case, let us assume that the following key aspects of the products are important for our customers: design (e.g., dimensions, shape, weight, material), sound quality (e.g., sound/music quality, bass, treble), and water resistance. As such, we need to extract these high-level data from product descriptions and store them as features of the product in, say, JSONB format. Similarly, we can extract the same set of features, if mentioned at all, from the reviews, together with the sentiments as inferred from the review text by the LLMs. By using azure_ai.extract, we can achieve this in two steps… one for the product descriptions: SELECT id, azure_ai.extract( description, ARRAY[ 'design: string - comma separated list of design features of the product (e.g., dimensions, shape, weight, material); empty if not mentioned', 'soundQuality: string - comma separated list of sound quality features of the product (e.g., sound/music quality, bass, treble); empty if not mentioned', 'waterResistance: string - comma separated list of water resistance features of the product; empty if not mentioned' ] ) AS features FROM product; and another one for the reviews: SELECT id, azure_ai.extract( review_text, ARRAY[ 'design: string - comma separated list of design features of the product (e.g., dimensions, shape, weight, material); empty if not mentioned', 'designSentiment: string - overall sentiment of the design features of the product, e.g., positive, negative, neutral, or n/a if not applicable', 'soundQuality: string - comma separated list of sound quality features of the product (e.g., sound/music quality, bass, treble); empty if not mentioned', 'soundQualitySentiment: string - overall sentiment of the sound quality features of the product, e.g., positive, negative, neutral, or n/a if not applicable', 'waterResistance: string - comma separated list of water resistance features of the product; empty if not mentioned', 'waterResistanceSentiment: string - overall sentiment of the water resistance features of the product, e.g., positive, negative, neutral, or n/a if not applicable', 'overallSentiment: string - overall sentiment of the review, e.g., positive, negative, neutral, or n/a if not applicable' ] ) AS features FROM review; The queries above help extract the high-level data we need to populate a knowledge graph with the following relationship diagram: Relationship diagram among products, their features, and the reviews that mention these features. Products have a "has-a" relationship with both features and reviews, and reviews have a "mentions-a" relationship with the features. Here, product nodes can contain some convenient data in addition to the product IDs (e.g., name, category), feature nodes contain the feature names (e.g., design), and review nodes contain the review IDs. Similarly, the directed edges from product nodes to feature nodes contain feature data (e.g., extracted data from product descriptions for the design feature), edges from product nodes to review nodes contain the overall sentiment of the review, and edges from review nodes to feature nodes contain both the feature data (e.g., extracted data from product reviews for the design feature) and the corresponding feature sentiment (e.g., extracted sentiment from product reviews specifically for the design feature, if at all). The graph representation depicted in Figure 1 can be implemented easily with the help of the Apache AGE extension in Azure Database for PostgreSQL. Answering Complex Questions Finally, after we build the graph representation of the relationship among products, reviews and the features of interest, we are ready to answer complex questions in SQL. Let us imagine that we would like to answer the following challenging question: “What are the Top 3 lightweight and waterproof headphones with many good reviews about music quality?” To answer this question, we can start with the following graph query: SELECT * FROM cypher( 'products_reviews_features_graph', $$ MATCH (p:product {category: 'headphones'})-[:HAS_FEATURE]->(:feature {name: 'design'}) WITH p MATCH (p)-[:HAS_FEATURE]->(:feature {name: 'waterResistance'}) WITH p MATCH (p)-[:HAS_REVIEW]->(r:review)-[:MENTIONS_FEATURE {sentiment: 'positive'}]->(:feature {name: 'soundQuality'}) RETURN p.id, r.id $$ ) AS result(product_id integer, review_id integer); This query lists all the ID pairs of products that are in the headphones category and have design and water resistance features specified, and those reviews that have positive sentiment about the specific product’s sound quality features. However, our featurization process in the previous section implies that the design feature of a product specification contains not only the weight of the product but also other design elements such as dimensions, shape and material. As such, we need to join the graph query with a common table expression that filters out those product descriptions that do not explicitly specify the weight of the product. Finally, let us also summarize the design and water resistance features of the product, as well as all the review elements considering the sound quality of the product. The final query looks like this: WITH graph_query AS ( SELECT * FROM cypher( 'shop_graph', $$ MATCH (p:product {category: 'headphones'})-[:HAS_FEATURE]->(:feature {name: 'design'}) WITH p MATCH (p)-[:HAS_FEATURE]->(:feature {name: 'waterResistance'}) WITH p MATCH (p)-[:HAS_REVIEW]->(r:review)-[:MENTIONS_FEATURE {sentiment: 'positive'}]->(:feature {name: 'soundQuality'}) RETURN p.id, r.id $$ ) AS result(product_id integer, review_id integer) ), detailed_data AS ( SELECT g.product_id, g.review_id, p.name, p.features->>'design' AS design, p.features->>'waterResistance' AS water_resistance, r.features->>'soundQuality' AS sound_quality FROM graph_query g JOIN product p ON g.product_id = p.id JOIN review r ON g.review_id = r.id ), filtered_products AS ( SELECT DISTINCT ON (product_id) product_id FROM detailed_data WHERE azure_ai.is_true('Product spec specifically mentions lightweight design: ' || design) ) SELECT f.product_id AS id, d.name, azure_ai.generate( 'Design features: ' || d.design || '. Water resistance features: ' || d.water_resistance, system_prompt => 'Summarize the features of the product concisely.' ) AS product_summary, azure_ai.generate( string_agg('Review: ' || d.sound_quality, ','), system_prompt => 'Summarize the reviews concisely.' ) AS review_summary, COUNT(d.review_id) AS review_count FROM filtered_products f JOIN detailed_data d ON f.product_id = d.product_id GROUP BY f.product_id, d.name, d.design, d.water_resistance ORDER BY review_count DESC LIMIT 3; We get the following result set: id name product_summary review_summary review_count 6 [redacted] [redacted] [redacted] 16 8 [redacted] [redacted] [redacted] 13 7 [redacted] [redacted] [redacted] 12 There we have it – Top 3 products with the desired features and the most positive reviews about sound quality! Let us examine the top match for example: ID. 6 Name. Nexvo Wireless Earbuds – 48H Playback, LED Display, Earhooks, Mic Product Summary. The product features a sport-focused design with three pairs of ear caps and flexible earhooks for a secure fit during outdoor activities. It includes a charging case with a dual LED display and offers water resistance suitable for use in rainy conditions. Review Summary. The reviews consistently praise the product's sound quality, highlighting it as crystal clear, rich, full, and immersive with good bass response and outstanding clarity. Review Count. 16 Final Thoughts Semantic operators in Azure PostgreSQL are more than just wrappers around LLMs—they represent a paradigm shift in how we think about AI in data systems. By embedding generative AI directly into SQL, we are not only simplifying the integration process but also empowering developers to build smarter, faster, and more maintainable applications. This innovative approach allows for seamless interaction with AI capabilities, enabling developers to leverage the full potential of LLMs to solve complex problems, enhance data analysis, enrich available data with high-level features, and create more intuitive user experiences. The flexibility and power of these operators open new possibilities for innovation, making it easier to develop cutting-edge applications that can adapt to evolving needs and challenges.Innovating with PostgreSQL @Build
At this year's Microsoft Build, we're excited to share the latest updates and innovations in Microsoft Azure Database for PostgreSQL. Whether you're building AI powered apps and agents or just looking to uplevel your PostgreSQL experience, we've got sessions packed with insights and tools tailored for developers and technical leaders. As a fully managed, AI-ready open source relational database that offers 58% cost savings over an on-premises PostgreSQL database – Azure Database for PostgreSQL enhances your security, scalability, and management of enterprise workloads. Check out what’s happening with Postgres at Build — in Seattle and online: 🔍 Breakout Sessions BRK211: Building Advanced Agentic Apps with PostgreSQL on Azure What benefits do agentic architectures bring compared to traditional RAG patterns? Find out how we answer this question by exploring advanced agentic capabilities offered by popular GenAI frameworks (LangChain, LlamaIndex, Semantic Kernel) and how they transform RAG applications built on Azure Database for PostgreSQL. Learn how to further improve agentic apps by integrating advanced RAG techniques, making vector search faster with the DiskANN vector search algorithm, and more accurate with Semantic Ranking and GraphRAG. BRK204: What’s New in Microsoft Databases: Empowering AI-Driven App Dev Explore advanced applications powered by Microsoft databases on-premises, on Azure and in Microsoft Fabric. Uncover innovative approaches to scalability and learn about intelligent data processing with AI-driven insights and agentic integrations. See new features with engaging demos across all databases including Azure Database for PostgreSQL. 💻 Demo Session DEM564: Boost Your Development Workflows with PostgreSQL Discover how to transform your development workflow on PostgreSQL. Whether you're building AI-powered apps, managing complex datasets, or just looking to streamline your PostgreSQL experience, this demo will show you how to level up your productivity with PostgreSQL on Azure. 🧪 Hands-On Labs LAB360: Build an Agentic App with PostgreSQL, GraphRAG, and Semantic Kernel Sign up to get hands-on experience building an agent-driven, RAG-based application with Azure Database for PostgreSQL and VS Code. Explore coding and architectural concepts while using DiskANN Index for Vector Search, and integrating Apache AGE for PostgreSQL to extend into a GraphRAG pattern leveraging the Semantic Kernel Agent Framework. 💬 Meet the Experts Have questions? Looking to talk open source, AI agentic apps, or migration? Visit us in the Expert Meetup Zone to connect with the Postgres product teams, engineers, and architects. 🔎 How to find it: Log into the Microsoft Build 2025 website or use the official event mobile app to view the venue map and session schedule. 📍 To find the Expert Meetup zone, check out the official MS Build Event Guide for a venue maps and other logistical information. 🐘Get Started with Azure Database for PostgreSQL Want to try it out firsthand? 🚀 Start building 📘 Explore the documentation Let's connect, code, and grow together at Build 2025!