Initial integration of QTI and IMSCP import by rtibbles · Pull Request #468 · learningequality/ricecooker

rtibbles · 2024-02-19T05:31:09Z

Does an initial integration of IMSCP import into ricecooker.
Adds some simple utilities for mapping IMSCP/SCORM metadata fields to our metadata
Allows proper mapping of embedded webcontent files to the entry option
Writes tests to cover this behaviour

Fixes #332
Fixes #337

rtibbles · 2024-02-19T17:14:35Z

Looks like the way I am doing the zipfile testing is not compatible with Windows, open file handles etc - will need to update for that.

jredrejo · 2024-02-20T20:14:07Z

+    # Update the node with the general metadata
+    node.description = metadata_dict.get("description") or node.description
+    if metadata_dict.get("language"):
+        node.set_language(metadata_dict.get("language"))


I had two problems ithe these few lines of code:
language was en-US
metadata.dict was this one:

(Pdb) metadata_dict {'title': 'Work with computers', 'description': 'This learning path will introduce you to the different parts and types of the computer and their functions. You will also learn the difference between operating systems and applications and their functions. Peripherals and portable storage devices will be discussed as well.', 'language': 'en-US', 'keyword': 'Default'}

It worked with these changes (I ignore if the keyword change will work with other scorm files)

if metadata_dict.get("language"): node.set_language(metadata_dict.get("language").lower().split("-")[0]) keyword = metadata_dict.get("keyword") if keyword: node.tags = node.tags + [keyword,]

Updated - hopefully has a similar effect, add some additional validation to make sure it's a language we know.

jredrejo · 2024-02-21T19:16:25Z

+    is_primary = True
+
+    def get_preset(self):
+        return self.preset or format_presets.IMSCP_ZIP


This is producing a server error in Studio https://learningequality.sentry.io/issues/4997106816/?project=1252819

jredrejo · 2024-04-11T18:36:02Z

I've been playing with the latest code from this PR:

Ricecooker works
Uploading to Studio unstable works
The channel appears correctly (preview doesn't work, but that's not a ricecooker problem) https://unstable.studio.learningequality.org/es-es/channels/6877a75b029e5259a02c93046ee6dd80 , token: kokom-fitih
Using kolibri v0.16.1, I can see the channel, the structure is a bit weird, with more levels than expected: 1 single scorm file has up to four levels to access to the resource:
After importing it, it does not work correctly, this is the error message:

jredrejo

Left a detailed comment with the two issues found

rtibblesbot

Solid initial integration of IMSCP import into ricecooker — the manifest parsing, SCORM metadata mapping, and pipeline handler structure are well-designed. Two blocking issues around license propagation and dynamic child processing in nested structures need fixing before merge.

CI passing. All tests pass across Python 3.9-3.13 on all platforms.

Blocking:

License not propagated through nested topics — leaf ContentNodes under TopicNodes get license=None (see inline)
process_files dynamic children loop doesn't recurse into TopicNode children's descendants (see inline)
assert used for runtime validation of untrusted XML input in imscp.py — disabled under python -O (see inline)

Suggestions:

QTIZipFile added to files.py but no corresponding QTI handler or tests — consider deferring to a follow-up PR or adding a comment (see inline)
_extract_lom_text can return mixed types (string, list, or None) which could surprise callers (see inline)

rtibblesbot · 2026-02-11T05:44:06Z

+                node = ContentNode(
+                    source_id,
+                    title,
+                    self.license,


blocking: self.license is None when this method is called recursively on a TopicNode (created at line 790). TopicNode inherits from TreeNode where license defaults to None. So leaf ContentNodes created under topic nodes will have license=None, which will fail ContentNode.validate() at runtime (line 924: "ContentNode must have a license").

Consider propagating the license explicitly, e.g. adding a license parameter to _build_children_from_metadata similar to how parent_files is propagated:

def _build_children_from_metadata(self, children_list, parent_files=None, license=None): license = license or self.license ... node = ContentNode(source_id, title, license, ...) ... node._build_children_from_metadata(nested_children, parent_files=parent_files, license=license)

rtibblesbot · 2026-02-11T05:44:06Z

+                for f in child.files:
+                    if f.filename:
+                        child_filenames.append(f.filename)
+                filenames.extend(child_filenames)


blocking: This loop only processes direct children (self.children). For TopicNode children that contain their own leaf ContentNode grandchildren, those grandchildren are never validated and their filenames are never collected. With a structure like ContentNode -> TopicNode -> ContentNode(leaf), the leaf's files won't appear in filenames.

This needs to recurse into TopicNode children, e.g.:

def _collect_dynamic_filenames(self, node): filenames = [] for child in node.children: if isinstance(child, ContentNode) and not child._files_processed: child._files_processed = True child.validate() for f in child.files: if f.filename: filenames.append(f.filename) if isinstance(child, TopicNode): filenames.extend(self._collect_dynamic_filenames(child)) return filenames

rtibblesbot · 2026-02-11T05:44:06Z

+                text += child.text
+            if child.tail:
+                text += child.tail
+        assert text.strip(), "Title element has no title: {}".format(


blocking: assert is disabled when Python runs with -O (optimized mode). Since this validates untrusted XML input from content packages, it should use a proper exception:

if not text.strip(): raise ManifestParseError( "Title element has no title: {}".format( etree.tostring(title_elem, pretty_print=True) ) )

rtibblesbot · 2026-02-11T05:44:06Z

+        return self.preset or format_presets.IMSCP_ZIP
+
+
+class QTIZipFile(DownloadFile):


suggestion: QTIZipFile is added here but there's no corresponding QTIConversionHandler or QTIMetadataExtractor in the pipeline, and no tests for QTI handling. The PR title mentions "QTI and IMSCP" but only IMSCP is implemented. Consider either removing QTIZipFile from this PR (deferring to a follow-up) or adding a code comment noting it's a placeholder for upcoming QTI work.

rtibblesbot · 2026-02-11T05:44:06Z

+                    return s.text
+        if len(strings) == 1:
+            return strings[0].text
+        return [s.text for s in strings]


suggestion: _extract_lom_text returns a str (single string match), list[str] (multiple strings), or None — three different types. Callers in _collect_field pass the result through without normalization, so downstream code (e.g. metadata_dict_to_content_node_fields) needs to handle all three. This works for fields like keyword where _normalize_keywords handles both, but for title or description a list would be unexpected. Consider always returning a string for single-valued fields, or documenting the contract clearly.

rtibblesbot

Solid initial integration of IMSCP and QTI import into ricecooker. The architecture is well-considered — standalone manifest parsing, metadata mapping utilities, pipeline handler integration, and nested child node building all fit cleanly into the existing codebase patterns.

CI passing (linting). Test coverage is thorough across all new modules.

suggestion (1): Consider handler ordering for edge cases (convert.py)
suggestion (2): Potential AttributeError when TreeNode._build_children_from_metadata is called on a TopicNode with no files attribute default (nodes.py)
suggestion (3): _apply_content_node_metadata unconditionally overwrites list fields from pipeline (nodes.py)
nitpick (1): Metadata field application list is duplicated between two methods (nodes.py)
nitpick (2): _recursive_update could be a dict in-place update note (context.py)

rtibblesbot · 2026-02-11T23:44:38Z

+        """
+        if parent_files is None:
+            parent_files = list(self.files)
+        license = license or getattr(self, "license", None)


suggestion: license = license or getattr(self, "license", None) — if self.license is a falsy value (e.g. an empty string was somehow set), this would fall through to None. In practice this is likely fine since ContentNode.__init__ requires a license object, but worth noting that getattr with a default of None combined with or means the fallback is only used when the attribute is missing or falsy, not just missing.

rtibblesbot · 2026-02-11T23:44:38Z

+                node._files_processed = True
+
+            # Apply additional metadata fields
+            metadata_fields = [


nitpick: This metadata_fields list overlaps significantly with _METADATA_APPLY_FIELDS (defined on ContentNode at line 998). The two lists serve the same purpose (applying metadata to a node) but differ slightly (this one includes tags but omits thumbnail, author, aggregator, provider, role, etc.). Consider extracting a shared constant, or documenting why the child-building path intentionally uses a narrower set of fields.

rtibblesbot · 2026-02-11T23:44:38Z

+
+        See _METADATA_APPLY_FIELDS for the allowlist and exclusion rationale.
+        """
+        for field_name in self._METADATA_APPLY_FIELDS:


suggestion: _apply_content_node_metadata unconditionally overwrites list fields like learning_activities, resource_types, grade_levels, etc. via setattr. If the node had pre-existing values set in the constructor (e.g. by the sushi chef), the pipeline metadata would silently replace them rather than merge. Contrast with the tags handling two lines below (line 1061) which correctly appends. This may be intentional (pipeline metadata should take precedence for these fields) but it's worth a comment documenting the design choice, or making the merge-vs-replace behavior consistent.

rtibblesbot · 2026-02-11T23:44:38Z

+        children = []
+        for item in items:
+            fields = metadata_dict_to_content_node_fields(item.get("metadata", {}))
+            fields["title"] = item.get("title", fields.get("title"))


suggestion: fields["title"] = item.get("title", fields.get("title")) — if the item has no "title" key and metadata_dict_to_content_node_fields also produced no title, fields["title"] will be set to None. Downstream in _build_children_from_metadata (nodes.py line 801), the fallback child_meta.title or "Untitled" handles this, but _content_node_metadata_from_dict will store None as the title. This is fine but slightly surprising — a missing title propagates as None through the metadata layer and only gets defaulted at the node-building layer.

rtibblesbot · 2026-02-11T23:44:38Z

+                item[key_stripped] = value
+
+        resource_type = resource_elem.get("type", "")
+        if resource_type == "webcontent" or is_qti_resource(resource_type):


suggestion: if resource_type == "webcontent" or is_qti_resource(resource_type): — any other resource types (e.g. "associatedcontent/imscc_xmlv1p1/learning-application-resource") won't get files populated. This means resources of unrecognized types are silently dropped. Consider logging a warning for unrecognized resource types that have file references, similar to how dangling identifierref is logged on line 301.

rtibblesbot · 2026-02-11T23:44:38Z

+    # Handle XML files that are marked as UTF-8 but have non-UTF-8 chars.
+    # Detect the real encoding with chardet and re-encode as UTF-8.
+    try:
+        f = zf.open("imsmanifest.xml", "r")


nitpick: The chardet fallback opens the file in a non-context-manager style (f = zf.open(...); data = f.read(); f.close()). Using a with statement would be marginally cleaner:

with zf.open("imsmanifest.xml", "r") as f: data = f.read()

Add utilities to map SCORM metadata fields (from IMS Content Packages) to Kolibri content node fields: title, description, language, tags, learning_activities, resource_types, and learner_needs. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Add ricecooker/utils/imscp.py with functions for parsing imsmanifest.xml from IMSCP zip files. Includes recursive organization tree building, resource linking, external metadata resolution, encoding fallback via chardet, and single-child topic chain flattening to reduce unnecessary nesting levels. Adds xmltodict dependency to setup.py. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…tadata Extend ContentNode._process_uri() to build child TopicNode/ContentNode trees when the pipeline produces nested ContentNodeMetadata (e.g. from IMSCP manifests with organization hierarchies). Add children and tags fields to ContentNodeMetadata. Simplify IMSCPZipFile and QTIZipFile to plain DownloadFile subclasses with only a preset override, since all parsing now happens in the pipeline. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Add IMSCPConversionHandler and IMSCPMetadataExtractor to the file pipeline, ordered before their HTML5 counterparts so IMSCP zips (detected by presence of imsmanifest.xml) are handled correctly. The metadata extractor uses parse_imscp_manifest() and metadata_dict_to_content_node_fields() to produce nested ContentNodeMetadata trees, which ContentNode._process_uri() then expands into TopicNode/ContentNode children. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Extend IMSCP manifest parsing to recognize QTI resources (imsqti_* prefix) alongside webcontent. QTI leaf items are mapped to exercise kind with QTI_ZIP preset, while webcontent items retain html5/IMSCP_ZIP. Mixed packages route to IMSCP_ZIP at the file level with per-child preset overrides via shallow copy. Also hardens XML parsing (safe parser, null guards), migrates child metadata from raw dicts to typed ContentNodeMetadata objects, and extracts shared test helpers. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

rtibblesbot

Prior findings

5 prior findings from round 1 all resolved (3 blocking, 2 suggestion — resolved in 1b191bc). Well done on the thorough resolution.

0/6 second-round findings resolved. All 6 re-raised below (no code changes since round 2).

CI now failing — ruff-format reformatted 5 files when run locally: ricecooker/utils/SCORM_metadata.py, ricecooker/utils/imscp.py, tests/helpers.py, tests/test_imscp.py, tests/test_imscp_pipeline.py. Round 2 noted "CI passing (linting)" — this appears to be a formatting regression, likely from a prek/ruff version bump on main since February. Fix: run uvx prek run --all-files and commit the reformatted files.

Blocking:

ruff-format failing on 5 files — run uvx prek run --all-files and commit

Suggestions and nitpicks are all re-raises from round 2 (UNADDRESSED — no replies, no code changes since then):

License fallback has a false-falsy edge case (suggestion, nodes.py)
_apply_content_node_metadata overwrites list fields rather than merging (suggestion, nodes.py)
fields["title"] can be stored as None (suggestion, extract_metadata.py)
Unrecognized resource types silently drop their files (suggestion, imscp.py)
metadata_fields duplicates a subset of _METADATA_APPLY_FIELDS (nitpick, nodes.py)
chardet fallback opens file without a with statement (nitpick, imscp.py)

@rtibblesbot's comments are generated by an LLM, and should be evaluated accordingly

How was this generated?

Compared the current PR state against findings from a prior review:

Retrieved prior bot reviews deterministically via the GitHub API
Classified each prior finding as RESOLVED, UNADDRESSED, ACKNOWLEDGED, or CONTESTED
Only raised NEW findings for newly introduced code
Reviewed the pull request diff checking for correctness, design, architecture, testing, completeness, and adherence to DRY/SRP principles
Checked CI status and linked issue acceptance criteria

rtibblesbot · 2026-06-05T05:56:36Z

+            return format_presets.QTI_ZIP
+        # IMSCP for: pure webcontent, mixed (QTI + webcontent), or
+        # unknown resource types (safe default for any IMS package)
+        return format_presets.IMSCP_ZIP


praise: Defaulting to IMSCP_ZIP for mixed packages and unknown resource types is the right call — it handles the open-ended case (e.g. proprietary LMS resource types, future IMS resource types) gracefully without silently dropping content or crashing. The comment spelling out the covered cases makes the intent clear.

rtibblesbot · 2026-06-05T05:56:36Z

+        """
+        if parent_files is None:
+            parent_files = list(self.files)
+        license = license or getattr(self, "license", None)


suggestion (re-raise): license = license or getattr(self, "license", None) — if self.license is somehow a falsy value (empty string, zero), the or would fall through to None, which would then fail validation when the leaf ContentNode is created. The docstring correctly notes this method is suited for ContentNode callers, but a guard like if license is None: license = getattr(self, "license", None) would make the None-only fallback intent explicit and immune to falsy values.

rtibblesbot · 2026-06-05T05:56:36Z

+                node._files_processed = True
+
+            # Apply additional metadata fields
+            metadata_fields = [


nitpick (re-raise): metadata_fields here overlaps significantly with _METADATA_APPLY_FIELDS (line 1015) — both serve the purpose of "apply these metadata fields to a node" but differ slightly (this one includes tags and learner_needs; _METADATA_APPLY_FIELDS adds title, thumbnail, author, etc.). Extracting a shared constant, or at least a comment explaining why these two lists differ, would reduce the maintenance surface.

rtibblesbot · 2026-06-05T05:56:36Z

+        for field_name in self._METADATA_APPLY_FIELDS:
+            value = getattr(content_node_metadata, field_name, None)
+            if value is not None:
+                setattr(self, field_name, value)


suggestion (re-raise): setattr(self, field_name, value) unconditionally replaces list fields like learning_activities, resource_types, etc. when the pipeline provides them. Contrast with _build_children_from_metadata (line 877: existing + val) which merges lists. If a sushi chef pre-populates these fields in the constructor and the pipeline also provides values, the chef's values are silently dropped here. Either document that pipeline metadata takes precedence over constructor values for these fields, or apply the same merge logic used in _build_children_from_metadata. (Note: tags is correctly merged two lines below.)

rtibblesbot · 2026-06-05T05:56:36Z

+        children = []
+        for item in items:
+            fields = metadata_dict_to_content_node_fields(item.get("metadata", {}))
+            fields["title"] = item.get("title", fields.get("title"))


suggestion (re-raise): fields["title"] = item.get("title", fields.get("title")) — if neither item nor the LOM metadata produced a title, this stores None as the title key. Downstream in _build_children_from_metadata the fallback child_meta.title or "Untitled" handles it, but it means ContentNodeMetadata.title can be explicitly None (from a key that was set to None) rather than absent. Not broken, but slightly surprising — worth a comment or a guard like or fields.pop("title", None) to keep None out of the dict.

rtibblesbot · 2026-06-05T05:56:36Z

+                item[key_stripped] = value
+
+        resource_type = resource_elem.get("type", "")
+        if resource_type == "webcontent" or is_qti_resource(resource_type):


suggestion (re-raise): Resources whose type is neither webcontent nor a imsqti_* prefix (e.g. associatedcontent/imscc_xmlv1p1/learning-application-resource) fall through with no files key set. This silently drops their file references. A logger.warning for unrecognized types that have child file elements would help chefs diagnose unexpected empty packages, similar to the dangling-identifierref warning on line 301.

rtibblesbot · 2026-06-05T05:56:36Z

+    # Handle XML files that are marked as UTF-8 but have non-UTF-8 chars.
+    # Detect the real encoding with chardet and re-encode as UTF-8.
+    try:
+        f = zf.open("imsmanifest.xml", "r")


nitpick (re-raise): The chardet fallback uses f = zf.open(...); data = f.read(); f.close() rather than a with block. If f.read() raises, f.close() is skipped. A with zf.open("imsmanifest.xml", "r") as f: is the safe pattern here.

rtibbles added this to the 0.8 milestone Feb 19, 2024

rtibbles requested a review from jredrejo February 19, 2024 05:31

jredrejo reviewed Feb 20, 2024

View reviewed changes

jredrejo reviewed Feb 21, 2024

View reviewed changes

rtibbles force-pushed the qti branch from 105d5bd to b324108 Compare February 25, 2024 01:14

rtibbles force-pushed the qti branch from b324108 to affca16 Compare March 9, 2024 23:55

rtibbles assigned jredrejo Mar 26, 2024

jredrejo suggested changes Apr 11, 2024

View reviewed changes

rtibbles mentioned this pull request Feb 20, 2025

Pipeline refactor #573

Merged

rtibbles modified the milestones: 0.8, 0.9 Folder Import Handling Jun 24, 2025

rtibbles unassigned jredrejo Sep 3, 2025

marcellamaki self-assigned this Sep 23, 2025

rtibbles force-pushed the qti branch 3 times, most recently from f4eadef to a994a18 Compare February 11, 2026 01:44

rtibbles requested a review from rtibblesbot February 11, 2026 02:30

rtibblesbot suggested changes Feb 11, 2026

View reviewed changes

rtibbles force-pushed the qti branch from a994a18 to d0e995b Compare February 11, 2026 23:40

rtibblesbot reviewed Feb 11, 2026

View reviewed changes

rtibbles and others added 5 commits June 4, 2026 22:39

rtibbles force-pushed the qti branch from d0e995b to 1b191bc Compare June 5, 2026 05:41

rtibblesbot suggested changes Jun 5, 2026

View reviewed changes

		return self.preset or format_presets.IMSCP_ZIP


		class QTIZipFile(DownloadFile):

Uh oh!

Conversation

rtibbles commented Feb 19, 2024

Uh oh!

rtibbles commented Feb 19, 2024

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jredrejo commented Apr 11, 2024

Uh oh!

jredrejo left a comment

Choose a reason for hiding this comment

Uh oh!

rtibblesbot left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

rtibblesbot left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

rtibblesbot left a comment

Choose a reason for hiding this comment

Prior findings

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants