RDF / Turtle · Schema.org · Embeddings on Zenodo · CC-BY-NC

The first knowledge graph to combine a culturally rich domain with comprehensive coverage of text, images, video, and audio. Engineered on schema.org for semantic interoperability, with pre-computed image, video, audio, text, and KG (RotatE) embeddings released on Zenodo.

1.80M RDF Triples
376 Seed Movies
5,484 Artists
94.41% Quad-Modal

Four Modalities, One Graph

IMDB4M overcomes the bimodal bottleneck of existing knowledge graphs by integrating text, images, video, and audio as first-class semantic objects. All numbers below are derived directly from data/kg/imdb_kg_cleaned.ttl on the 376 seed movies.

Modality Coverage Content Triples / seed movie
📝 Text 100% Plots, reviews, keywords, captions, genres 18.58
🖼️ Images 100% Stills, posters with captions & entity links 6.91
🎬 Video 99.20% Trailers with thumbnails, duration, dates 0.99
🎵 Audio 94.95% Soundtracks with performers, composers, lyricists 12.02

355 / 376 seed movies (94.41%) carry all four modalities simultaneously in the cleaned KG.
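The quad-modal coverage figure amounts to counting movies whose modality set contains all four entries. A minimal pure-Python sketch, using a toy per-movie modality map in place of what a query over data/kg/imdb_kg_cleaned.ttl would actually return (the movie IDs and sets below are invented for illustration):

```python
# Count movies that carry all four modalities simultaneously.
# `movies` is hypothetical sample data, not the real KG contents.
REQUIRED = {"text", "image", "video", "audio"}

movies = {
    "tt0000001": {"text", "image", "video", "audio"},
    "tt0000002": {"text", "image", "video", "audio"},
    "tt0000003": {"text", "image", "video"},  # no soundtrack triples
}

quad = [m for m, mods in movies.items() if REQUIRED <= mods]
coverage = len(quad) / len(movies)
print(f"{len(quad)}/{len(movies)} quad-modal ({coverage:.2%})")
```

On the real data this computation yields the 355 / 376 (94.41%) figure quoted above.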

By the Numbers

A comprehensive resource comprising over 1.8 million RDF triples describing movies, artists, and their multimodal content.

Metric Value
RDF Triples 1,800,490
Unique RDF Nodes (URIs + literals + bnodes) 656,121
URIRef Entities (released as KG embeddings) 139,465
Distinct Predicates 58
Seed Movies (fully annotated) 376
Total Movies (schema:Movie) 50,756
Artists Analyzed (actors, directors, composers) 5,484
schema:PerformanceRole Instances 232,492
schema:ImageObject Instances 34,039
schema:VideoObject Instances 3,981
schema:Person Instances 16,994
schema:MusicRecording Instances 4,521
schema:MusicComposition Instances 3,970
schema:AggregateRating Instances 734
schema:Review Instances 563
Wikidata Alignments (owl:sameAs) 4,284 actors + 376 movies (4,660 triples)

Comparison with Related Work

Dataset Text Image Video Audio #Entity #Relation
MKG-W 14,123 14,463 — — 15,000 169
MKG-Y 12,305 14,244 — — 15,000 28
TIVA-KG 11,858 11,636 10,269 2,441 11,858 16
KVC16K 14,822 14,822 14,822 14,822 16,015 4
IMDB4M 390,747 34,039 3,981 4,521 656,121 58

Schema.org Foundation

Built on widely adopted vocabulary standards for semantic interoperability and Web-scale discoverability.

IMDB4M Knowledge Graph Schema

PerformanceRole Pattern

Actor participation captures actor, movie, and character name together, preserving identity separately from fictional roles.

N-ary Structures

Typed blank nodes with xsd:date, xsd:dateTime, xsd:duration, xsd:integer, and xsd:decimal for machine-interpretable values.

Two-level Audio

schema:MusicRecording for performed audio artifacts and schema:MusicComposition for the underlying musical work.
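The three patterns above can be sketched in a short Turtle fragment. The entity IRIs, names, and literal values here are invented for illustration (the ex: namespace is hypothetical); the predicates follow the schema.org terms named above:

```turtle
@prefix schema: <https://schema.org/> .
@prefix xsd:    <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:     <https://example.org/imdb4m/> .  # hypothetical namespace

# PerformanceRole: the role node ties actor, movie, and character name
# together while keeping the performer's identity separate.
ex:movie_tt0000001 schema:actor [
        a schema:PerformanceRole ;
        schema:actor ex:person_nm0000001 ;
        schema:characterName "Example Character"
    ] ;
    schema:datePublished "1994-09-23"^^xsd:date ;
    schema:duration "PT2H22M"^^xsd:duration .

# Two-level audio: the performed recording points to the underlying work.
ex:recording_001 a schema:MusicRecording ;
    schema:recordingOf [ a schema:MusicComposition ;
                         schema:composer ex:person_nm0000002 ] .
```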

Multi-Modal & KG Embeddings

IMDB4M ships pre-computed embeddings for every released modality plus knowledge-graph embeddings trained with PyKEEN’s RotatE. All vectors are L2-normalised and aligned one-for-one to the KG via imdb4m:hasEmbedding records.

Where to get them: the embedding files are not stored in this git repository. They are released as a Zenodo deposit at 10.5281/zenodo.20057840. After cloning the repo, copy the Zenodo files into the empty embeddings_output/ directory at the project root and the rest of the project will pick them up.
Note on entity counts (rdflib vs PyKEEN). All KG-size figures elsewhere on this page (e.g. 656,121 unique RDF nodes, 263,343 literal nodes) are computed by parsing data/kg/imdb_kg_cleaned.ttl with rdflib. The PyKEEN entity table that backs the released KG embeddings has 656,003 rows instead, 118 fewer than rdflib reports, because PyKEEN dedupes literals more aggressively: literals with identical lexical form but different datatypes or language tags are collapsed into one entity. The 118-node delta lies entirely in the literal block (rdflib: 263,343 distinct literals; PyKEEN: 263,225).

Concretely, the released bundle contains 139,465 rows in the URIRef KG-embedding table (the table most users consume, unaffected by the dedup difference) and ~656,003 rows in the optional /kg_pykeen_entities/ HDF5 group. Both numbers describe the same KG; they differ only in how the literal namespace is collapsed.
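The L2-normalisation applied to every released vector is the standard unit-length scaling; a minimal pure-Python sketch with a toy vector (the real files store higher-dimensional float32 arrays):

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit Euclidean length, as applied to all
    released embedding rows. Zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return vec if norm == 0 else [x / norm for x in vec]

v = l2_normalize([3.0, 4.0])  # norm 5 -> components divided by 5
```

After this step, cosine similarity between two rows reduces to a plain dot product, which is why the released vectors can be compared cheaply.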

Released Vectors

Modality Rows Dim Model
Image 33,247 768 openai/clip-vit-large-patch14
Video 4,350 512 microsoft/xclip-base-patch32
Audio 4,034 512 laion/larger_clap_music_and_speech
Text 4,216 1,024 BAAI/bge-large-en-v1.5
KG (RotatE) 139,465 512 PyKEEN RotatE (256-d complex / 512-d real)

Held-out KG Variants

Additional RotatE models are trained with specific label predicates removed from the training triples, supporting clean classification benchmarks. The all-labels run held out 66,838 triples (1,731,988 used for training).

Variant Held-out Predicate(s) Purpose
full none KG retrieval, projections, fusion
genre schema:genre Clean genre classification
rating schema:contentRating Clean rating classification
decade schema:datePublished Clean decade classification
language schema:inLanguage Clean language classification
all-labels the four above, jointly Stricter robustness setting
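Conceptually, each held-out variant is a predicate filter applied to the triple list before RotatE training. A hedged sketch (the triples below are toy stand-ins for the 1.8M-triple KG; the predicate IRIs follow the table above, assuming the https://schema.org/ namespace):

```python
# Predicates withheld from training in each variant (from the table above).
HELD_OUT = {
    "genre":    {"https://schema.org/genre"},
    "rating":   {"https://schema.org/contentRating"},
    "decade":   {"https://schema.org/datePublished"},
    "language": {"https://schema.org/inLanguage"},
}
HELD_OUT["all-labels"] = set().union(*HELD_OUT.values())

def split_for_variant(triples, variant):
    """Partition (s, p, o) triples into (training, held_out) for a variant."""
    banned = HELD_OUT.get(variant, set())
    train = [t for t in triples if t[1] not in banned]
    held = [t for t in triples if t[1] in banned]
    return train, held

# Toy data: one label triple per category would normally be filtered out.
triples = [
    ("ex:m1", "https://schema.org/genre", "Drama"),
    ("ex:m1", "https://schema.org/actor", "ex:p1"),
    ("ex:m1", "https://schema.org/inLanguage", "en"),
]
train, held = split_for_variant(triples, "all-labels")
```

Training on `train` while evaluating classifiers on the withheld labels prevents the embedding from having seen the answer.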

Embedding Quality (Leave-one-out NCC accuracy)

How well each modality (and several fusion strategies) recovers KG-derived labels for the 352-movie multimodal intersection. Headline numbers from plots/out/tab_fusion_metrics.tex.

Modality / Fusion Decade (5) Rating (4) Genre (9)
Random baseline 20.0% 25.0% 11.1%
Poster (CLIP ViT-L/14) 77.6% 61.2% 37.5%
Trailer (X-CLIP) 44.6% 56.9% 30.1%
Soundtrack (CLAP) 34.9% 29.5% 13.8%
KG (RotatE, label-held-out) 38.4% 48.0% 25.8%
Text balanced avg. (BGE) 40.3% 64.3% 50.1%
Fused, late, poster-weighted (2:1:1:1:1) 65.6% 63.7% 42.7%
Fused, supervised CV late 80.0% ± 4.2 61.6% ± 2.2 57.4% ± 6.0
Alignment guarantee: every row in the released parquet files has a matching subject in data/kg/imdb_kg_cleaned.ttl. Each modality is delivered as a Parquet (zstd) file and an HDF5 (gzip-4) group with shared row order, so parquet_row == hdf5_index.
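The leave-one-out nearest-class-centroid (NCC) protocol behind the table can be sketched in pure Python. The toy 2-D embeddings below are illustrative; the real evaluation runs over the 352-movie intersection with the KG-derived labels:

```python
import math

def ncc_loo_accuracy(X, y):
    """Leave-one-out nearest-class-centroid accuracy: each point is
    classified by the closest class mean, computed with that point
    held out of its own class."""
    correct = 0
    for i, (xi, yi) in enumerate(zip(X, y)):
        centroids = {}
        for label in set(y):
            pts = [x for j, (x, l) in enumerate(zip(X, y))
                   if l == label and j != i]
            if pts:  # skip a class emptied by the hold-out
                dim = len(pts[0])
                centroids[label] = [sum(p[d] for p in pts) / len(pts)
                                    for d in range(dim)]
        pred = min(centroids, key=lambda l: math.dist(xi, centroids[l]))
        if pred == yi:
            correct += 1
    return correct / len(X)

# Two well-separated toy classes.
X = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
y = ["a", "a", "b", "b"]
acc = ncc_loo_accuracy(X, y)
```

NCC with unit-norm embeddings is a deliberately simple probe: accuracy reflects how linearly separable the labels are in each modality's space, not the ceiling of a tuned classifier.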

Validated & Verified

Systematic validation through SPARQL-based question answering, KG-wide competency-question coverage, and human-validated link verification.

Metric Value
F1 Score 98.72%
Precision 99.35%
Recall 98.09%
Exact Match Rate 94.72%
Query Success (KG-wide) 99.29%
YouTube Link Accuracy 87.16%

Evaluated against 18 competency questions formalised as SPARQL queries, covering directors, writers, actors, ratings, plots, trailers, soundtracks, images, and more. Query success is re-run over all 376 seed movies (6,720 / 6,768 instances return at least one answer); YouTube-link accuracy is the human-validated agreement rate (129 / 148 sampled links).
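A competency question of this kind might be formalised roughly as follows. The exact queries ship with the repository; this reconstruction is an assumption based on the schema.org predicates named on this page, and the movie title is a placeholder:

```python
# A director competency question as SPARQL over the schema.org vocabulary.
# Executing it needs a SPARQL engine (e.g. rdflib, not imported here to
# keep the sketch dependency-free).
CQ_DIRECTOR = """
PREFIX schema: <https://schema.org/>
SELECT ?director WHERE {
    ?movie a schema:Movie ;
           schema:name "Example Movie Title" ;
           schema:director ?director .
}
"""
```

A question "succeeds" for a seed movie when the query returns at least one binding, which is how the 6,720 / 6,768 figure is computed.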

Research Applications

IMDB4M enables research across multiple domains in the Semantic Web and Multimedia communities.

🎥

Movie Recommendation

Content-based recommendation using visual style of posters and acoustic features of soundtracks.

  • Audio embeddings from linked YouTube videos
  • Visual style analysis of movie posters
  • Temporal features from trailers
🔍

Multimodal QA

Knowledge Graph Question Answering (KGQA) with perceptual grounding and RAG systems.

  • Joint reasoning over symbolic and perceptual modalities
  • Complex queries with reified relations
  • Multimodal RAG benchmarking
🧩

KG Completion

Multimodal Knowledge Graph Completion including link prediction and entity alignment.

  • Infer schema:genre from poster and plot
  • Cross-platform entity alignment via imdb4m:hasEmbedding
  • Released RotatE vectors + held-out variants for clean evaluation

Cite This Work

If you use IMDB4M in your research, please cite our paper. The embedding bundle is archived separately on Zenodo and should be cited via its dataset DOI.

@inproceedings{imdb4m2026,
  title  = {{IMDB4M}: A Large-Scale Multi-Modal Knowledge Graph of Movies},
  author = {Reklos, Ioannis and de Berardinis, Jacopo and Simperl, Elena and Mero{\~n}o-Pe{\~n}uela, Albert},
  year   = {2026},
  note   = {Under review}
}

@dataset{imdb4m_embeddings_2026,
  title  = {{IMDB4M} Multi-Modal and KG Embeddings (v1)},
  author = {Reklos, Ioannis and de Berardinis, Jacopo and Simperl, Elena and Mero{\~n}o-Pe{\~n}uela, Albert},
  year   = {2026},
  doi    = {10.5281/zenodo.20057840}
}