RDF / Turtle · Schema.org · Embeddings on Zenodo · CC-BY-NC

The first knowledge graph to combine a culturally rich domain with comprehensive coverage of text, images, video, and audio. Engineered on schema.org for semantic interoperability, with pre-computed image, video, audio, text, and KG (RotatE) embeddings released on Zenodo.

1.80M RDF Triples
376 Seed Movies
5,484 Artists
94.41% Quad-Modal

Four Modalities, One Graph

IMDB4M overcomes the bimodal bottleneck of existing knowledge graphs by integrating text, images, video, and audio as first-class semantic objects. All numbers below are derived directly from data/kg/imdb_kg_cleaned.ttl on the 376 seed movies.

Modality Coverage Content Triples / seed movie
📝 Text 100% Plots, reviews, keywords, captions, genres 18.58
🖼️ Images 100% Stills, posters with captions & entity links 6.91
🎬 Video 99.20% Trailers with thumbnails, duration, dates 0.99
🎵 Audio 94.95% Soundtracks with performers, composers, lyricists 12.02

355 / 376 seed movies (94.41%) carry all four modalities simultaneously in the cleaned KG.
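The quad-modal coverage figure amounts to counting movies whose modality set contains all four entries. A minimal pure-Python sketch, using a toy per-movie modality map in place of what a query over data/kg/imdb_kg_cleaned.ttl would actually return (the movie IDs and sets below are invented for illustration):

```python
# Count movies that carry all four modalities simultaneously.
# `movies` is hypothetical sample data, not the real KG contents.
REQUIRED = {"text", "image", "video", "audio"}

movies = {
    "tt0000001": {"text", "image", "video", "audio"},
    "tt0000002": {"text", "image", "video", "audio"},
    "tt0000003": {"text", "image", "video"},  # no soundtrack triples
}

quad = [m for m, mods in movies.items() if REQUIRED <= mods]
coverage = len(quad) / len(movies)
print(f"{len(quad)}/{len(movies)} quad-modal ({coverage:.2%})")
```

On the real data this computation yields the 355 / 376 (94.41%) figure quoted above.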

By the Numbers

A comprehensive resource comprising over 1.8 million RDF triples describing movies, artists, and their multimodal content.

Metric Value
RDF Triples 1,800,490
Unique RDF Nodes (URIs + literals + bnodes) 656,121
URIRef Entities (released as KG embeddings) 139,465
Distinct Predicates 58
Seed Movies (fully annotated) 376
Total Movies (schema:Movie) 50,756
Artists Analyzed (actors, directors, composers) 5,484
schema:PerformanceRole Instances 232,492
schema:ImageObject Instances 34,039
schema:VideoObject Instances 3,981
schema:Person Instances 16,994
schema:MusicRecording Instances 4,521
schema:MusicComposition Instances 3,970
schema:AggregateRating Instances 734
schema:Review Instances 563
Wikidata Alignments (owl:sameAs) 4,284 actors + 376 movies (4,660 triples)

Comparison with Related Work

Dataset Text Image Video Audio #Entity #Relation
MKG-W 14,123 14,463 — — 15,000 169
MKG-Y 12,305 14,244 — — 15,000 28
TIVA-KG 11,858 11,636 10,269 2,441 11,858 16
KVC16K 14,822 14,822 14,822 14,822 16,015 4
IMDB4M 390,747 34,039 3,981 4,521 656,121 58

Schema.org Foundation

Built on widely adopted vocabulary standards for semantic interoperability and Web-scale discoverability.

IMDB4M Knowledge Graph Schema

PerformanceRole Pattern

Actor participation captures actor, movie, and character name together, preserving identity separately from fictional roles.

N-ary Structures

Typed blank nodes with xsd:date, xsd:dateTime, xsd:duration, xsd:integer, and xsd:decimal for machine-interpretable values.

Two-level Audio

schema:MusicRecording for performed audio artifacts and schema:MusicComposition for the underlying musical work.
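The three patterns above can be sketched in a short Turtle fragment. The entity IRIs, names, and literal values here are invented for illustration (the ex: namespace is hypothetical); the predicates follow the schema.org terms named above:

```turtle
@prefix schema: <https://schema.org/> .
@prefix xsd:    <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:     <https://example.org/imdb4m/> .  # hypothetical namespace

# PerformanceRole: the role node ties actor, movie, and character name
# together while keeping the performer's identity separate.
ex:movie_tt0000001 schema:actor [
        a schema:PerformanceRole ;
        schema:actor ex:person_nm0000001 ;
        schema:characterName "Example Character"
    ] ;
    schema:datePublished "1994-09-23"^^xsd:date ;
    schema:duration "PT2H22M"^^xsd:duration .

# Two-level audio: the performed recording points to the underlying work.
ex:recording_001 a schema:MusicRecording ;
    schema:recordingOf [ a schema:MusicComposition ;
                         schema:composer ex:person_nm0000002 ] .
```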

Multi-Modal & KG Embeddings

IMDB4M ships pre-computed embeddings for every released modality plus knowledge-graph embeddings trained with PyKEEN’s RotatE. All vectors are L2-normalised and aligned one-for-one to the KG via imdb4m:hasEmbedding records.

Where to get them: the embedding files are not stored in this git repository. They are released as a Zenodo deposit at 10.5281/zenodo.20057840. After cloning the repo, copy the Zenodo files into the empty embeddings_output/ directory at the project root and the rest of the project will pick them up.
Note on entity counts (rdflib vs PyKEEN). All KG-size figures elsewhere on this page (e.g. 656,121 unique RDF nodes, 263,343 literal nodes) are computed by parsing data/kg/imdb_kg_cleaned.ttl with rdflib. The PyKEEN entity table that backs the released KG embeddings has 656,003 rows instead, 118 fewer than rdflib reports, because PyKEEN dedupes literals more aggressively: literals with identical lexical form but different datatypes or language tags are collapsed into one entity. The 118-node delta lies entirely in the literal block (rdflib: 263,343 distinct literals; PyKEEN: 263,225).

Concretely, the released bundle contains 139,465 rows in the URIRef KG-embedding table (the table most users consume, unaffected by the dedup difference) and ~656,003 rows in the optional /kg_pykeen_entities/ HDF5 group. Both numbers describe the same KG; they differ only in how the literal namespace is collapsed.
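The L2-normalisation applied to every released vector is the standard unit-length scaling; a minimal pure-Python sketch with a toy vector (the real files store higher-dimensional float32 arrays):

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit Euclidean length, as applied to all
    released embedding rows. Zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return vec if norm == 0 else [x / norm for x in vec]

v = l2_normalize([3.0, 4.0])  # norm 5 -> components divided by 5
```

After this step, cosine similarity between two rows reduces to a plain dot product, which is why the released vectors can be compared cheaply.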

Released Vectors

Modality Rows Dim Model
Image 33,247 768 openai/clip-vit-large-patch14
Video 4,350 512 microsoft/xclip-base-patch32
Audio 4,034 512 laion/larger_clap_music_and_speech
Text 4,216 1,024 BAAI/bge-large-en-v1.5
KG (RotatE) 139,465 512 PyKEEN RotatE (256-d complex / 512-d real)

Held-out KG Variants

Additional RotatE models are trained with specific label predicates removed from the training triples, supporting clean classification benchmarks. The all-labels run held out 66,838 triples (1,731,988 used for training).

Variant Held-out Predicate(s) Purpose
full none KG retrieval, projections, fusion
genre schema:genre Clean genre classification
rating schema:contentRating Clean rating classification
decade schema:datePublished Clean decade classification
language schema:inLanguage Clean language classification
all-labels the four above, jointly Stricter robustness setting
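Conceptually, each held-out variant is a predicate filter applied to the triple list before RotatE training. A hedged sketch (the triples below are toy stand-ins for the 1.8M-triple KG; the predicate IRIs follow the table above, assuming the https://schema.org/ namespace):

```python
# Predicates withheld from training in each variant (from the table above).
HELD_OUT = {
    "genre":    {"https://schema.org/genre"},
    "rating":   {"https://schema.org/contentRating"},
    "decade":   {"https://schema.org/datePublished"},
    "language": {"https://schema.org/inLanguage"},
}
HELD_OUT["all-labels"] = set().union(*HELD_OUT.values())

def split_for_variant(triples, variant):
    """Partition (s, p, o) triples into (training, held_out) for a variant."""
    banned = HELD_OUT.get(variant, set())
    train = [t for t in triples if t[1] not in banned]
    held = [t for t in triples if t[1] in banned]
    return train, held

# Toy data: one label triple per category would normally be filtered out.
triples = [
    ("ex:m1", "https://schema.org/genre", "Drama"),
    ("ex:m1", "https://schema.org/actor", "ex:p1"),
    ("ex:m1", "https://schema.org/inLanguage", "en"),
]
train, held = split_for_variant(triples, "all-labels")
```

Training on `train` while evaluating classifiers on the withheld labels prevents the embedding from having seen the answer.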

Embedding Quality (Leave-one-out NCC accuracy)

How well each modality (and several fusion strategies) recovers KG-derived labels for the 352-movie multimodal intersection. Headline numbers from plots/out/tab_fusion_metrics.tex.

Modality / Fusion Decade (5) Rating (4) Genre (9)
Random baseline 20.0% 25.0% 11.1%
Poster (CLIP ViT-L/14) 77.6% 61.2% 37.5%
Trailer (X-CLIP) 44.6% 56.9% 30.1%
Soundtrack (CLAP) 34.9% 29.5% 13.8%
KG (RotatE, label-held-out) 38.4% 48.0% 25.8%
Text balanced avg. (BGE) 40.3% 64.3% 50.1%
Fused, late, poster-weighted (2:1:1:1:1) 65.6% 63.7% 42.7%
Fused, supervised CV late 80.0% ± 4.2 61.6% ± 2.2 57.4% ± 6.0
Alignment guarantee: every row in the released parquet files has a matching subject in data/kg/imdb_kg_cleaned.ttl. Each modality is delivered as a Parquet (zstd) file and an HDF5 (gzip-4) group with shared row order, so parquet_row == hdf5_index.
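The leave-one-out nearest-class-centroid (NCC) protocol behind the table can be sketched in pure Python. The toy 2-D embeddings below are illustrative; the real evaluation runs over the 352-movie intersection with the KG-derived labels:

```python
import math

def ncc_loo_accuracy(X, y):
    """Leave-one-out nearest-class-centroid accuracy: each point is
    classified by the closest class mean, computed with that point
    held out of its own class."""
    correct = 0
    for i, (xi, yi) in enumerate(zip(X, y)):
        centroids = {}
        for label in set(y):
            pts = [x for j, (x, l) in enumerate(zip(X, y))
                   if l == label and j != i]
            if pts:  # skip a class emptied by the hold-out
                dim = len(pts[0])
                centroids[label] = [sum(p[d] for p in pts) / len(pts)
                                    for d in range(dim)]
        pred = min(centroids, key=lambda l: math.dist(xi, centroids[l]))
        if pred == yi:
            correct += 1
    return correct / len(X)

# Two well-separated toy classes.
X = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
y = ["a", "a", "b", "b"]
acc = ncc_loo_accuracy(X, y)
```

NCC with unit-norm embeddings is a deliberately simple probe: accuracy reflects how linearly separable the labels are in each modality's space, not the ceiling of a tuned classifier.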

Validated & Verified

Systematic validation through SPARQL-based question answering, KG-wide competency-question coverage, and human-validated link verification.

Metric Value
F1 Score 98.72%
Precision 99.35%
Recall 98.09%
Exact Match Rate 94.72%
Query Success (KG-wide) 99.29%
YouTube Link Accuracy 87.16%

Evaluated against 18 competency questions formalised as SPARQL queries, covering directors, writers, actors, ratings, plots, trailers, soundtracks, images, and more. Query success is re-run over all 376 seed movies (6,720 / 6,768 instances return at least one answer); YouTube-link accuracy is the human-validated agreement rate (129 / 148 sampled links).
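A competency question of this kind might be formalised roughly as follows. The exact queries ship with the repository; this reconstruction is an assumption based on the schema.org predicates named on this page, and the movie title is a placeholder:

```python
# A director competency question as SPARQL over the schema.org vocabulary.
# Executing it needs a SPARQL engine (e.g. rdflib, not imported here to
# keep the sketch dependency-free).
CQ_DIRECTOR = """
PREFIX schema: <https://schema.org/>
SELECT ?director WHERE {
    ?movie a schema:Movie ;
           schema:name "Example Movie Title" ;
           schema:director ?director .
}
"""
```

A question "succeeds" for a seed movie when the query returns at least one binding, which is how the 6,720 / 6,768 figure is computed.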

Research Applications

IMDB4M enables research across multiple domains in the Semantic Web and Multimedia communities.

🎥

Movie Recommendation

Content-based recommendation using visual style of posters and acoustic features of soundtracks.

  • Audio embeddings from linked YouTube videos
  • Visual style analysis of movie posters
  • Temporal features from trailers
🔍

Multimodal QA

Knowledge Graph Question Answering (KGQA) with perceptual grounding and RAG systems.

  • Joint reasoning over symbolic and perceptual modalities
  • Complex queries with reified relations
  • Multimodal RAG benchmarking
🧩

KG Completion

Multimodal Knowledge Graph Completion including link prediction and entity alignment.

  • Infer schema:genre from poster and plot
  • Cross-platform entity alignment via imdb4m:hasEmbedding
  • Released RotatE vectors + held-out variants for clean evaluation

Cite This Work

If you use IMDB4M in your research, please cite our paper. The embedding bundle is archived separately on Zenodo and should be cited via its dataset DOI.

@inproceedings{imdb4m2026,
  title  = {{IMDB4M}: A Large-Scale Multi-Modal Knowledge Graph of Movies},
  author = {Reklos, Ioannis and de Berardinis, Jacopo and Simperl, Elena and Mero{\~n}o-Pe{\~n}uela, Albert},
  year   = {2026},
  note   = {Under review}
}

@dataset{imdb4m_embeddings_2026,
  title  = {{IMDB4M} Multi-Modal and KG Embeddings (v1)},
  author = {Reklos, Ioannis and de Berardinis, Jacopo and Simperl, Elena and Mero{\~n}o-Pe{\~n}uela, Albert},
  year   = {2026},
  doi    = {10.5281/zenodo.20057840}
}