Talk
Many projects build knowledge graphs with custom schemas but struggle to align them with standard hubs like Wikidata. Manual mapping is tedious and error-prone, while fully automated methods often lack accuracy. This talk introduces `wikidata-mapper`, a Python tool leveraging Large Language Models (LLMs via `DSPy`) to suggest semantic mappings between simple YAML ontology schemas and Wikidata identifiers (QIDs/PIDs). We demonstrate its interactive workflow, including confidence-based auto-acceptance, batch suggestion/review modes for scalability, and a novel hierarchy suggestion feature. Learn how this tool combines LLM power with human oversight to efficiently ground custom knowledge representations in Wikidata, using libraries like `inquirer`, `tenacity`, and `platformdirs`. Ideal for KG practitioners, data engineers, and anyone needing to integrate custom schemas with public knowledge bases.
* **Intermediate Python Proficiency:** Attendees should be comfortable reading and understanding Python code, including object-oriented concepts (classes, methods), functions, standard data structures (dictionaries, lists, sets), and basic exception handling.
* **Conceptual Understanding of Knowledge Graphs:** A basic familiarity with the concept of knowledge graphs – representing entities (nodes) and their relationships (edges) – and their general purpose (e.g., data integration, question answering, RAG) is expected. Deep expertise in graph theory or specific graph databases is not required.
* **Familiarity with Data Schemas/Ontologies:** Understanding the purpose of defining structured schemas or lightweight ontologies (e.g., defining types like 'Person', 'Organization', and relations like 'worksFor') to organize data is beneficial. No experience with formal ontology languages like OWL/RDF is necessary.
* **Awareness of Large Language Models (LLMs):** A basic understanding of what LLMs are and their capability for semantic understanding and text generation based on prompts is helpful. Deep knowledge of LLM architectures or training is not needed.
* **(Helpful, Not Required):** Prior awareness of what Wikidata is (a large, collaborative knowledge base) and the basic concept of interacting with web APIs will provide useful context, but both will be briefly touched upon in the talk.
(1) The Challenge: Isolated Knowledge & the Need for Grounding
Many data science and knowledge-based projects naturally involve creating custom schemas or lightweight ontologies – defining terms like `PROJECT_LEAD`, `SALES_REGION`, or `COMPONENT_FAILURE_MODE` that make sense within a specific context (e.g., in simple YAML files). While valuable locally, these schemas often become isolated knowledge silos. Without alignment to a common standard, integrating data across projects becomes difficult, querying consistently is a challenge, and opportunities to leverage vast amounts of external, structured knowledge are missed. How can we connect these custom concepts to the wider world?
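To make this concrete, here is a minimal, hypothetical example of the kind of simple YAML schema in question; the file layout and field names (`classes`, `relations`, `description`) are illustrative assumptions, not the tool's required format.

```python
import yaml

# Hypothetical schema snippet; structure and field names are illustrative only.
SCHEMA_YAML = """
classes:
  PROJECT_LEAD:
    description: Person accountable for delivering a project
  SALES_REGION:
    description: Geographic area a sales team is responsible for
relations:
  COMPONENT_FAILURE_MODE:
    description: Links a component to the way it typically fails
"""

schema = yaml.safe_load(SCHEMA_YAML)
print(sorted(schema["classes"]))  # ['PROJECT_LEAD', 'SALES_REGION']
```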
This is where Wikidata comes in. Why map to Wikidata specifically? Its shared QIDs and PIDs give custom terms a stable, globally understood anchor: a custom `SALES_REGION`, for instance, can be aligned with a standard Wikidata geo-identifier (QID).

However, the mapping process itself is a major bottleneck. Manually finding the correct Wikidata QID/PID for potentially hundreds of custom terms is extremely tedious, requires niche expertise, and is error-prone. Fully automated alignment tools often struggle with the semantic ambiguity inherent in custom schemas (e.g., does 'position' mean job title P39 or coordinate P625?).
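To show where that ambiguity comes from, here is a minimal sketch that queries Wikidata's public `wbsearchentities` endpoint with `requests` and prints candidate entities with their descriptions; the helper name and parameter choices are my own illustration, not the tool's actual code.

```python
import requests

WIKIDATA_API = "https://www.wikidata.org/w/api.php"

def search_wikidata(term: str, entity_type: str = "property", limit: int = 5) -> list[dict]:
    """Return candidate Wikidata entities (ID, label, description) for a search term."""
    params = {
        "action": "wbsearchentities",
        "search": term,
        "language": "en",
        "type": entity_type,   # "item" for QIDs, "property" for PIDs
        "limit": limit,
        "format": "json",
    }
    resp = requests.get(WIKIDATA_API, params=params, timeout=10)
    resp.raise_for_status()
    return [
        {"id": hit["id"], "label": hit.get("label", ""), "description": hit.get("description", "")}
        for hit in resp.json().get("search", [])
    ]

# An ambiguous term typically returns several plausible candidates,
# which is exactly the disambiguation work we want to hand to an LLM plus a human reviewer.
for candidate in search_wikidata("position"):
    print(candidate["id"], "-", candidate["label"], "-", candidate["description"])
```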
(2) Our Approach: LLM-Assisted Interactive Mapping
This talk introduces a practical Python tool and workflow designed to dramatically accelerate the process of mapping custom schema elements (from formats like YAML) to Wikidata identifiers, while maintaining high accuracy through efficient human oversight. We combine the semantic reasoning of Large Language Models (LLMs) with an interactive workflow.
(3) Workflow Walkthrough:
We will walk through the tool's workflow:
* **LLM suggestion:** Uses `dspy` to prompt an LLM with source term details and candidates, generating structured JSON suggestions (an `LLMMappingSuggestion` Pydantic model) with confidence scores and reasoning (see the sketch after this list).
* **Interactive review:** An `inquirer`-based CLI for quickly accepting/overriding/skipping LLM suggestions, aided by confidence thresholding for auto-acceptance.
* **Batch modes:** `--batch-suggest` and `--batch-review` modes for handling larger schemas efficiently.
* **Hierarchy suggestion:** Suggests `subclass_of` links to the user's own schema, helping refine its structure.
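As a rough illustration of the suggestion step, here is a minimal sketch of how a `dspy` signature could be paired with a Pydantic model like `LLMMappingSuggestion`; the field names, prompt wording, and model choice are assumptions for illustration, not the tool's exact implementation.

```python
import dspy
from pydantic import BaseModel, Field

# Hypothetical shape of the structured suggestion; field names are illustrative.
class LLMMappingSuggestion(BaseModel):
    wikidata_id: str = Field(description="Suggested QID or PID, e.g. 'P39'")
    confidence: float = Field(ge=0.0, le=1.0)
    reasoning: str

class SuggestMapping(dspy.Signature):
    """Choose the best Wikidata candidate for a custom schema term."""
    term_name: str = dspy.InputField()
    term_description: str = dspy.InputField()
    candidates_json: str = dspy.InputField(desc="JSON list of candidate IDs, labels, descriptions")
    suggestion_json: str = dspy.OutputField(desc="JSON object matching the LLMMappingSuggestion schema")

# Model name is a placeholder; configure whichever LM backend you actually use.
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))

def suggest_mapping(term_name: str, term_description: str, candidates_json: str) -> LLMMappingSuggestion:
    result = dspy.Predict(SuggestMapping)(
        term_name=term_name,
        term_description=term_description,
        candidates_json=candidates_json,
    )
    # Validation failures (e.g. malformed JSON) would be retried or surfaced for manual review.
    return LLMMappingSuggestion.model_validate_json(result.suggestion_json)
```

Confidence-based auto-acceptance then reduces to a simple threshold check on `suggestion.confidence`, with anything below the cutoff routed to the interactive `inquirer` review.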
(4) Technical & Engineering Highlights:
Key libraries in the tool's modern Python stack: `dspy`, `requests`, `tenacity`, `inquirer`, `pydantic`, `platformdirs`, and `yaml`.
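As one example of the engineering details covered here, the sketch below shows how Wikidata API calls could be made robust with `tenacity` retries and cached on disk under a `platformdirs` cache directory; the function, app name, and cache layout are illustrative assumptions, not the tool's actual code.

```python
import json
from pathlib import Path

import requests
from platformdirs import user_cache_dir
from tenacity import retry, stop_after_attempt, wait_exponential

# Illustrative cache location; the real tool's app name may differ.
CACHE_DIR = Path(user_cache_dir("wikidata-mapper"))
CACHE_DIR.mkdir(parents=True, exist_ok=True)

@retry(wait=wait_exponential(multiplier=1, max=30), stop=stop_after_attempt(5))
def fetch_search_results(term: str) -> dict:
    """Query Wikidata with retries, caching raw responses to avoid repeat calls."""
    # For brevity, the term is assumed to be filesystem-safe here.
    cache_file = CACHE_DIR / f"search_{term}.json"
    if cache_file.exists():
        return json.loads(cache_file.read_text())

    resp = requests.get(
        "https://www.wikidata.org/w/api.php",
        params={"action": "wbsearchentities", "search": term,
                "language": "en", "format": "json"},
        timeout=10,
    )
    resp.raise_for_status()  # raise so tenacity retries transient failures
    cache_file.write_text(resp.text)
    return resp.json()
```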
(5) Demo: A brief demo showcasing the interactive mapping and hierarchy suggestion.
(6) Benefits & Conclusion:
This approach provides a pragmatic solution to a common data integration bottleneck. By intelligently combining LLM suggestions with efficient human validation, it allows teams to:
* Significantly speed up mapping custom schemas to Wikidata.
* Improve mapping consistency and accuracy.
* Enrich internal knowledge graphs by linking them to a global standard.
* Leverage Wikidata's structure to refine custom schemas.
Attendees will learn about a practical workflow applying LLMs to KG alignment and gain insights into building robust, interactive data tools with the modern Python stack.
Machine Learning Engineer
As a Staff MLE, Sankalp gets fired up by complex technical challenges, diving deep into time series, constrained optimization, and high-performance computing. He's currently exploring the practical frontier of Generative AI, applying LLMs and multimodal techniques to improve how knowledge graphs are built from diverse sources. This talk focuses on a crucial component of that work: efficiently mapping and aligning extracted concepts to standard knowledge bases like Wikidata. Off-duty, his adventures shift from algorithmic to atmospheric (skydiving) and aquatic (scuba diving), often accompanied by his adventure-loving dog.