Semantic Join #1293
Open: dimitri-yatsenko wants to merge 15 commits into claude/spec-issue-1243-YvqmF from claude/semantic-matching-joins-CgWnO
+1,430
−56
This spec defines the semantic matching behavior for DataJoint 2.0 joins:
- Introduces lineage tracking for attributes (origin table/column)
- Defines homologous vs non-homologous namesakes terminology
- Specifies matching rules: only homologous namesakes used for joins
- Documents ~lineage table architecture for lineage storage
- Includes dependency graph fallback for non-DJ schemas
- Covers all affected operations: *, &, -, aggr
- Removes @ operator in favor of .join(semantic_check=False)
- Updates dj.U behavior (& replaces * for table extension)
- Includes migration guide and testing strategy
Lineage strings now use plain dot notation: schema.table.attribute instead of quoted format: `schema`.`table`.`attribute`
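As a rough illustration of the plain dot notation, the sketch below formats and parses three-part lineage strings. The `Lineage` type and `parse_lineage` helper are hypothetical names introduced for this example only; they are not part of the PR's `lineage.py`, and the tolerance for the older backtick-quoted form is an assumption.

```python
from typing import NamedTuple


class Lineage(NamedTuple):
    """Hypothetical container for an attribute's origin (not the PR's actual type)."""

    schema: str
    table: str  # database table name, e.g. "session", not the Python class name
    attribute: str

    def __str__(self) -> str:
        # Plain dot notation: schema.table.attribute
        return f"{self.schema}.{self.table}.{self.attribute}"


def parse_lineage(text: str) -> Lineage:
    """Parse 'schema.table.attribute'.

    Assumption: the older quoted form `schema`.`table`.`attribute` is accepted
    by stripping backticks from each part.
    """
    parts = [part.strip("`") for part in text.split(".")]
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected schema.table.attribute, got {text!r}")
    return Lineage(*parts)


plain = parse_lineage("lab.session.session_id")
quoted = parse_lineage("`lab`.`session`.`session_id`")
print(str(plain), plain == quoted)
```

Both spellings normalize to the same value, so comparing lineages reduces to comparing plain strings.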
- Only store attributes WITH lineage (native secondary attrs have no entry)
- Make lineage column NOT NULL
- Add cleanup on table creation (remove leftover entries)
- Add cleanup on table drop
- Fix dj.U syntax: dj.U() or dj.U('attr1', 'attr2', ...)
- Add dj.U() & A example for empty U
- Clarify lineage uses database table names (snake_case with tier prefixes), not Python class names (CamelCase)
- Update all lineage examples to use proper database table names
- Remove cross-database lineage section (three-part lineage is sufficient)
- Change "ERD" to "DataJoint diagrams"
This implements semantic matching for DataJoint 2.0 joins as specified in docs/src/design/semantic-matching-spec.md.

Key changes:
1. Lineage tracking:
   - Add `lineage` field to Attribute class (heading.py)
   - Create lineage.py module for ~lineage table management
   - Populate lineage at table declaration time
   - Clean up lineage entries when tables are dropped
   - Load lineage from database when fetching headings
2. Semantic matching in joins:
   - Update assert_join_compatibility() to check for non-homologous namesakes
   - Update join() to only match on homologous namesakes (same name AND lineage)
   - Lineage is preserved through projections and renames
3. API changes:
   - Remove @ operator (raises error directing to .join(semantic_check=False))
   - dj.U * table raises deprecation error (use dj.U & table instead)
   - dj.U - table raises error (infinite set)
   - dj.U is always compatible (contains all possible lineages)
4. Tests:
   - Add comprehensive tests for lineage tracking
   - Test homologous and non-homologous namesake handling
   - Test deprecated operator errors
   - Test dj.U operations with semantic matching
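The matching rule described above can be sketched in plain Python, without DataJoint or a database. Here a heading is modeled as a dict from attribute name to lineage string (or `None`). The function name `join_attributes`, the `ValueError`, and the choice to treat namesakes without lineage as non-homologous are assumptions of this sketch, not the PR's actual `assert_join_compatibility()` code.

```python
from typing import Dict, List, Optional

Heading = Dict[str, Optional[str]]  # attribute name -> lineage string or None


def join_attributes(left: Heading, right: Heading, semantic_check: bool = True) -> List[str]:
    """Return the attribute names a join would match on.

    With semantic_check=True, every namesake must be homologous (same name AND
    same lineage); otherwise an error is raised. With semantic_check=False,
    all namesakes are matched by name alone.
    """
    namesakes = [name for name in left if name in right]
    if not semantic_check:
        return namesakes
    # Assumption: a namesake lacking lineage on either side is non-homologous.
    non_homologous = [
        name
        for name in namesakes
        if left[name] is None or left[name] != right[name]
    ]
    if non_homologous:
        raise ValueError(
            f"Non-homologous namesakes {non_homologous}: rename them with a "
            "projection, or use .join(semantic_check=False) to match by name only."
        )
    return namesakes


session = {"subject_id": "lab.subject.subject_id", "id": "lab.session.id"}
scan = {"subject_id": "lab.subject.subject_id", "scan_idx": "lab.scan.scan_idx"}
print(join_attributes(session, scan))
```

In this toy model, joining two headings that both carry an `id` attribute with different origins raises, while passing `semantic_check=False` falls back to name-only matching.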
- dj.U * table now raises a deprecation error instead of redirecting to dj.U & table (they are different operations)
- Remove unused variables in lineage.py
- Remove lineage_table_exists function
- Catch exceptions in get_lineage, get_all_lineages, delete_lineage_entries
- Single round trip to database, no race conditions
Also:
- Remove redundant U check for second operand in assert_join_compatibility
- Suggest .join(semantic_check=False) in error message for non-homologous namesakes
- Update spec document to match actual implementation:
  - assert_join_compatibility only checks if expr1 is U (not expr2)
  - Error message includes .join(semantic_check=False) suggestion
  - Use create_lineage_table function name
- Add complete parameter documentation to all lineage.py functions
In A.aggr(B, ...), ensures every entry in B matches exactly one entry in A:
- B must have all of A's primary key attributes
- Primary key attributes must be homologous (same lineage)
- Clear error messages for missing attributes or non-homologous lineage

Updated docstrings for:
- Aggregation.create()
- QueryExpression.aggr()
- U.aggr()

Updated spec document with:
- Functional dependency requirements
- Error message examples
- Additional test cases

Co-authored-by: dimitri-yatsenko <dimitri@datajoint.com>
In A.aggr(B), B must have all of A's primary key attributes, but these can be either primary or secondary attributes in B.

Co-authored-by: dimitri-yatsenko <dimitri@datajoint.com>
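The functional dependency requirement for `A.aggr(B, ...)` can be illustrated with a small stand-alone check. The sketch assumes headings are plain dicts from attribute name to lineage string; `check_aggr_dependency` and its error messages are hypothetical and do not reproduce the wording used in `Aggregation.create()`.

```python
from typing import Dict

Lineages = Dict[str, str]  # attribute name -> lineage string


def check_aggr_dependency(a_primary_key: Lineages, b_attributes: Lineages) -> None:
    """Validate that every entry in B can match exactly one entry in A.

    a_primary_key: lineages of A's primary key attributes.
    b_attributes:  lineages of ALL of B's attributes, primary or secondary.
    """
    missing = [name for name in a_primary_key if name not in b_attributes]
    if missing:
        raise ValueError(f"B is missing A's primary key attributes: {missing}")
    mismatched = [
        name for name in a_primary_key if a_primary_key[name] != b_attributes[name]
    ]
    if mismatched:
        raise ValueError(f"Primary key attributes are not homologous: {mismatched}")


subject_pk = {"subject_id": "lab.subject.subject_id"}
session_attrs = {
    "session_id": "lab.session.session_id",
    "subject_id": "lab.subject.subject_id",  # may be secondary in B
}
check_aggr_dependency(subject_pk, session_attrs)  # passes silently
```

Because the check looks at all of B's attributes, a key of A that appears in B only as a secondary attribute still satisfies the requirement.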
- Add semantic_check parameter to restrict() method (default True)
- When semantic_check=False, wraps restriction in PromiscuousOperand
- Deprecate ^ operator with error directing to .restrict(semantic_check=False)
- Update spec document with .restrict() method and ^ deprecation
- Add migration guide entry for ^ operator replacement

Co-authored-by: dimitri-yatsenko <dimitri@datajoint.com>
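One way to retire an operator while pointing users to its replacement is to keep the operator method and have it raise. The toy class below illustrates that pattern for `@` and `^`. The `Expression` class, the `OperatorRemovedError` exception, and the tuple return values are hypothetical stand-ins for this sketch; the exception type and messages DataJoint actually raises may differ.

```python
class OperatorRemovedError(Exception):
    """Hypothetical error type used only in this sketch."""


class Expression:
    """Toy stand-in for a query expression; it records operations as tuples."""

    def __init__(self, name: str) -> None:
        self.name = name

    def join(self, other: "Expression", semantic_check: bool = True) -> tuple:
        return ("join", self.name, other.name, semantic_check)

    def restrict(self, other: "Expression", semantic_check: bool = True) -> tuple:
        return ("restrict", self.name, other.name, semantic_check)

    def __mul__(self, other: "Expression") -> tuple:
        # A * B keeps working and uses the semantic check by default.
        return self.join(other)

    def __and__(self, other: "Expression") -> tuple:
        # A & B keeps working and uses the semantic check by default.
        return self.restrict(other)

    def __matmul__(self, other: "Expression") -> tuple:
        raise OperatorRemovedError(
            "The @ operator has been removed; use .join(other, semantic_check=False)."
        )

    def __xor__(self, other: "Expression") -> tuple:
        raise OperatorRemovedError(
            "The ^ operator is deprecated; use .restrict(other, semantic_check=False)."
        )


a, b = Expression("A"), Expression("B")
print(a * b)
print(a.join(b, semantic_check=False))
```

Raising from the operator method, rather than deleting it, lets existing code fail with an actionable message instead of a generic unsupported-operand `TypeError`.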
Summary
This PR implements semantic matching for joins in DataJoint 2.0, replacing name-based matching with lineage-aware matching. Attributes are only matched in joins when they share both the same name AND the same lineage (origin).
Key Changes
New Lineage Tracking System
- ~lineage table per schema to store attribute lineage information
- Lineage stored as schema.table.attribute (using actual database table names)
Semantic Matching in Joins
- A * B now matches only on homologous namesakes (same name AND same lineage)
- semantic_check parameter on the .join() method for opt-out: A.join(B, semantic_check=False)  # bypass semantic check
Aggregation Functional Dependency
- A.aggr(B, ...) now validates that B has all of A's primary key attributes
Deprecated Operators
- @ operator removed: use .join(semantic_check=False) instead
- dj.U * table deprecated
- dj.U - table raises error (produces infinite set)