@dimitri-yatsenko dimitri-yatsenko commented Dec 21, 2025

Implements AttributeType per Spec 2.0.

Implements #1260, #1251

Merge #1288 first.

Summary

This PR modernizes DataJoint's custom type system with a new AttributeType base class and introduces <djblob> as the explicit type for blob serialization. Plain longblob columns now store/return raw bytes without serialization.

Breaking Change

longblob columns no longer auto-serialize. Existing schemas must migrate to <djblob>:

# Before (implicit serialization - no longer works)
data : longblob

# After (explicit serialization)
data : <djblob>

Migration

A migration utility is provided:

dj.migrate.migrate_blob_columns(schema, dry_run=False)

Key Changes

New AttributeType Base Class

  • Introduces encode(value, key=None) -> stored and decode(stored, key=None) -> value methods
  • Supports type chaining via dtype property (can reference another custom type)
  • Discoverable via datajoint.attribute_types entry point for distributable packages
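
The encode/decode contract can be illustrated with a self-contained sketch. The base class below is a stand-in written for this example and does not import DataJoint; `AttributeTypeSketch`, `JsonListType`, and the chosen `type_name` and `dtype` values are hypothetical. Only the `encode(value, key=None)` and `decode(stored, key=None)` signatures are taken from the description above.

```python
import json
from abc import ABC, abstractmethod


class AttributeTypeSketch(ABC):
    """Stand-in for the AttributeType base class (assumption, not DataJoint code)."""

    type_name: str  # name used inside angle brackets, e.g. <json_list>
    dtype: str      # storage type; may itself name another custom type

    @abstractmethod
    def encode(self, value, key=None):
        """Convert a Python value into the stored representation."""

    @abstractmethod
    def decode(self, stored, key=None):
        """Convert the stored representation back into a Python value."""


class JsonListType(AttributeTypeSketch):
    """Hypothetical type that stores a list as UTF-8 JSON bytes."""

    type_name = "json_list"
    dtype = "longblob"

    def encode(self, value, key=None):
        return json.dumps(list(value)).encode("utf-8")

    def decode(self, stored, key=None):
        return json.loads(stored.decode("utf-8"))


codec = JsonListType()
stored = codec.encode([1, 2, 3])   # bytes written to the column
restored = codec.decode(stored)    # list returned on fetch
```

The `key` argument is unused here; a context-aware type could use it to vary the encoding per row.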

Built-in <djblob> Type

  • Explicit type for DataJoint's mYm/dj0 blob serialization
  • Use attr: <djblob> in table definitions where serialization is needed
  • Replaces implicit longblob serialization behavior

Raw Blob Behavior

  • Plain longblob columns now store and return raw bytes (no automatic serialization)
  • Only <djblob> attributes are serialized/deserialized
  • Cleaner separation between raw binary storage and serialized Python objects

Backward Compatibility

  • Legacy AttributeAdapter subclasses continue to work unchanged
  • Adapters with blob dtypes handle pack/unpack internally
  • Deprecation warnings guide migration to new API

Migration Utilities

  • dj.migrate.analyze_blob_columns() - identify columns needing migration
  • dj.migrate.generate_migration_sql() - preview migration SQL
  • dj.migrate.migrate_blob_columns() - execute migration
  • Migration is metadata-only (updates column comments, no data changes)

This commit introduces a modern, extensible custom type system for DataJoint:

**New Features:**
- AttributeType base class with encode()/decode() methods
- Global type registry with @register_type decorator
- Entry point discovery for third-party type packages (datajoint.types)
- Type chaining: dtype can reference another custom type
- Automatic validation via validate() method before encoding
- resolve_dtype() for resolving chained types

**API Changes:**
- New: dj.AttributeType, dj.register_type, dj.list_types
- AttributeAdapter is now deprecated (backward-compatible wrapper)
- Feature flag DJ_SUPPORT_ADAPTED_TYPES is no longer required

**Entry Point Specification:**
Third-party packages can declare types in pyproject.toml:
  [project.entry-points."datajoint.types"]
  zarr_array = "dj_zarr:ZarrArrayType"

**Migration Path:**
Old AttributeAdapter subclasses continue to work but emit
DeprecationWarning. Migrate to AttributeType with encode/decode.
- Rewrite customtype.md with comprehensive documentation:
  - Overview of encode/decode pattern
  - Required components (type_name, dtype, encode, decode)
  - Type registration with @dj.register_type decorator
  - Validation with validate() method
  - Storage types (dtype options)
  - Type chaining for composable types
  - Key parameter for context-aware encoding
  - Entry point packages for distribution
  - Complete neuroscience example
  - Migration guide from AttributeAdapter
  - Best practices

- Update attributes.md to reference custom types
@dimitri-yatsenko dimitri-yatsenko added this to the DataJoint 2.0 milestone Dec 21, 2025
@github-actions github-actions bot added enhancement Indicates new improvements documentation Issues related to documentation labels Dec 21, 2025
Introduces `<djblob>` as an explicit AttributeType for DataJoint's
native blob serialization, allowing users to be explicit about
serialization behavior in table definitions.

Key changes:
- Add DJBlobType class with `serializes=True` flag to indicate
  it handles its own serialization (avoiding double pack/unpack)
- Update table.py and fetch.py to respect the `serializes` flag,
  skipping blob.pack/unpack when adapter handles serialization
- Add `dj.migrate` module with utilities for migrating existing
  schemas to use explicit `<djblob>` type declarations
- Add tests for DJBlobType functionality
- Document `<djblob>` type and migration procedure

The migration is metadata-only - blob data format is unchanged.
Existing `longblob` columns continue to work with implicit
serialization for backward compatibility.
Simplified design:
- Plain longblob columns store/return raw bytes (no serialization)
- <djblob> type handles serialization via encode/decode
- Legacy AttributeAdapter handles blob pack/unpack internally
  for backward compatibility

This eliminates the need for the serializes flag by making
blob serialization the responsibility of the adapter/type,
not the framework. Migration to <djblob> is now required
for existing schemas that rely on implicit serialization.
@dimitri-yatsenko dimitri-yatsenko added breaking Not backward compatible changes feature Indicates new features and removed enhancement Indicates new improvements labels Dec 21, 2025
@github-actions github-actions bot added enhancement Indicates new improvements and removed breaking Not backward compatible changes feature Indicates new features labels Dec 21, 2025
@dimitri-yatsenko dimitri-yatsenko changed the base branch from claude/add-file-column-type-LtXQt to pre/v2.0 December 22, 2025 16:18
@dimitri-yatsenko dimitri-yatsenko changed the base branch from pre/v2.0 to claude/add-file-column-type-LtXQt December 24, 2025 19:31
Base automatically changed from claude/add-file-column-type-LtXQt to pre/v2.0 December 24, 2025 20:09
Design document for reimplementing blob, attach, filepath, and object
types as a coherent AttributeType system. Separates storage location
(@store) from encoding behavior.
claude and others added 10 commits December 25, 2025 05:35
- <djblob@store> returns Python object (fetched and deserialized)
- <attach@store> returns local file path (downloaded automatically)
- Only object@store returns ObjectRef for explicit lazy access
- External storage is transparent: @store only affects where data is stored, not how it is encoded

Co-authored-by: dimitri-yatsenko <dimitri@datajoint.com>
Three-layer architecture:
1. MySQL types: longblob, varchar, etc.
2. Core DataJoint types: object, content (and @store variants)
3. AttributeTypes: <djblob>, <xblob>, <attach>, <xattach>

New core type `content` for content-addressed storage:
- Accepts bytes, returns bytes
- Handles hashing, deduplication, and GC registration
- AttributeTypes like <xblob> build serialization on top

Naming convention:
- <djblob> = internal serialized (database)
- <xblob> = external serialized (content-addressed)
- <attach> = internal file
- <xattach> = external file

Co-authored-by: dimitri-yatsenko <dimitri@datajoint.com>
- content type is single-blob only (no folders)
- Parameterized syntax: <type@param> passes param to dtype
- Add content vs object comparison table
- Clarify when to use each type

Co-authored-by: dimitri-yatsenko <dimitri@datajoint.com>
- Content-addressed storage is now per-project (not per-schema)
- Deduplication works across all schemas in a project
- ContentRegistry is project-level (e.g., {project}_content database)
- GC scans all schemas in project for references
- Add migration utility for legacy ~external_* per-schema stores
- Document migration from binary(16) UUID to char(64) SHA256 hash

Co-authored-by: dimitri-yatsenko <dimitri@datajoint.com>
Three OAS storage regions:
1. object: {schema}/{table}/{pk}/ - PK-addressed, DataJoint controls
2. content: _content/{hash} - content-addressed, deduplicated
3. filepath: _files/{user-path} - user-addressed, user controls

Upgraded filepath@store:
- Returns ObjectRef (lazy) instead of copying files
- Supports streaming via ref.open()
- Supports folders (like object)
- Stores checksum in JSON column for verification
- No more automatic copy to local stage

Co-authored-by: dimitri-yatsenko <dimitri@datajoint.com>
filepath changes:
- No longer an OAS region - tracks external URIs anywhere
- Supports any fsspec-compatible URI (s3://, https://, gs://, etc.)
- Returns ObjectRef for lazy access via fsspec
- No integrity guarantees (external resources may change)
- Uses json core type for storage

json core type:
- Cross-database compatible (MySQL JSON, PostgreSQL JSONB)
- Used by filepath and object types

Two OAS regions remain:
- object: PK-addressed, DataJoint controlled
- content: hash-addressed, deduplicated

Co-authored-by: dimitri-yatsenko <dimitri@datajoint.com>
- Remove general URI tracker concept from filepath
- filepath@store now requires a store parameter and uses relative paths
- Key benefit: portability across environments by changing store config
- For arbitrary URLs, recommend using varchar (simpler, more transparent)
- Add comparison table for filepath@store vs varchar use cases
- Update all diagrams and tables to reflect the change

Co-authored-by: dimitri-yatsenko <dimitri@datajoint.com>
- Remove "core types" concept - all storage types are now AttributeTypes
- Built-in AttributeTypes (object, content, filepath@store) use json dtype
- JSON stores metadata: path, hash, store name, size, etc.
- User-defined AttributeTypes can compose built-in ones (e.g., <xblob> uses content)
- Clearer separation: database types (json, longblob) vs AttributeTypes (encode/decode)

Co-authored-by: dimitri-yatsenko <dimitri@datajoint.com>
Layer 1: Native database types (FLOAT, TINYINT, etc.) - backend-specific, discouraged
Layer 2: Core DataJoint types (float32, uint8, bool, json) - standardized, scientist-friendly
Layer 3: AttributeTypes (object, content, <djblob>, etc.) - encode/decode, composable

Core types provide:
- Consistent interface across MySQL and PostgreSQL
- Scientist-friendly names (float32 vs FLOAT, uint8 vs TINYINT UNSIGNED)
- Automatic backend translation

Co-authored-by: dimitri-yatsenko <dimitri@datajoint.com>
All AttributeTypes (Layer 3) now use angle bracket syntax in table definitions:
- Core types (Layer 2): int32, float64, varchar(255) - no brackets
- AttributeTypes (Layer 3): <object>, <djblob>, <filepath@main> - angle brackets

This clear visual distinction helps users immediately identify:
- Core types: direct database mapping
- AttributeTypes: encode/decode transformation

Co-authored-by: dimitri-yatsenko <dimitri@datajoint.com>
@dimitri-yatsenko dimitri-yatsenko marked this pull request as draft December 25, 2025 06:39
claude and others added 19 commits December 25, 2025 06:46
Seven-phase implementation plan covering:
- Phase 1: Core type system foundation (type mappings, store parameters)
- Phase 2: Content-addressed storage (<content> type, ContentRegistry)
- Phase 3: User-defined AttributeTypes (<xblob>, <attach>, <xattach>, <filepath>)
- Phase 4: Insert and fetch integration (type composition)
- Phase 5: Garbage collection (project-wide GC scanner)
- Phase 6: Migration utilities (legacy external stores)
- Phase 7: Documentation and testing

Estimated effort: 24-32 days across all phases

Co-authored-by: dimitri-yatsenko <dimitri@datajoint.com>
Phase 1.1 - Core type mappings already complete in declare.py

Phase 1.2 - Enhanced AttributeType with store parameter support:
- Added parse_type_spec() to parse "<type@store>" into (type_name, store_name)
- Updated get_type() to handle parameterized types
- Updated is_type_registered() to ignore store parameters
- Updated resolve_dtype() to propagate store through type chains
  - Returns (final_dtype, type_chain, store_name) tuple
  - Store from outer type overrides inner type's store

Phase 1.3 - Updated heading and declaration parsing:
- Updated get_adapter() to return (adapter, store_name) tuple
- Updated substitute_special_type() to capture store from ADAPTED types
- Store parameter is now properly passed through type resolution
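
The parsing and resolution steps above can be sketched as follows. This is an approximation under stated assumptions: the `DTYPES` table is a hypothetical stand-in for the registry, the chain is returned as a list of type names (the real function may return type objects), and the function bodies are not taken from the PR.

```python
import re

# "<name>" or "<name@store>"
_SPEC = re.compile(r"^<(?P<name>\w+)(?:@(?P<store>\w*))?>$")


def parse_type_spec(spec):
    """Split "<type@store>" into (type_name, store_name); store is None if absent."""
    match = _SPEC.match(spec.strip())
    if match is None:
        raise ValueError(f"not a custom type spec: {spec!r}")
    return match.group("name"), match.group("store") or None


# Hypothetical chain table: type name -> dtype it is stored as.
DTYPES = {"xblob": "<content>", "content": "json", "djblob": "longblob"}


def resolve_dtype(spec):
    """Follow dtype links until a non-bracketed type is reached.

    Returns (final_dtype, type_chain, store_name). A store named on an outer
    type takes precedence over one named further down the chain.
    """
    chain, store, seen = [], None, set()
    while spec.startswith("<"):
        name, inner_store = parse_type_spec(spec)
        if name in seen:
            raise ValueError(f"circular type reference at <{name}>")
        seen.add(name)
        chain.append(name)
        if store is None:
            store = inner_store
        spec = DTYPES[name]
    return spec, chain, store
```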

Co-authored-by: dimitri-yatsenko <dimitri@datajoint.com>
- Remove AttributeAdapter class and context-based lookup from attribute_adapter.py
- Simplify attribute_adapter.py to compatibility shim that re-exports from attribute_type
- Remove AttributeAdapter from package exports in __init__.py
- Update tests/schema_adapted.py to use @dj.register_type decorator
- Update tests/test_adapted_attributes.py to work with globally registered types
- Remove test_attribute_adapter_deprecated test from test_attribute_type.py

Types are now registered globally via @dj.register_type decorator, eliminating
the need for context-based adapter lookup.

Co-authored-by: dimitri-yatsenko <dimitri@datajoint.com>
…ntics

Core types (uuid, json, blob) now map directly to native database types
without any implicit serialization. Serialization is handled by AttributeTypes
like <djblob> via encode()/decode() methods.

Changes:
- Rename SERIALIZED_TYPES to BINARY_TYPES in declare.py (clearer naming)
- Update check for default values in compile_attribute()
- Clarify in spec that core blob types store raw bytes

Co-authored-by: dimitri-yatsenko <dimitri@datajoint.com>
Major simplification of the type system to two categories:
1. Core DataJoint types (no brackets): float32, uuid, bool, json, blob, etc.
2. AttributeTypes (angle brackets): <djblob>, <object>, <attach>, etc.

Changes:
- declare.py: Remove EXTERNAL_TYPES, BINARY_TYPES; simplify to CORE_TYPE_ALIASES + ADAPTED
- heading.py: Remove is_attachment, is_filepath, is_object, is_external flags
- fetch.py: Simplify _get() to only handle uuid, json, blob, and adapters
- table.py: Simplify __make_placeholder() to only handle uuid, json, blob, numeric
- preview.py: Remove special object field handling (will be AttributeType)
- staged_insert.py: Update object type check to use adapter

All special handling (attach, filepath, object, external storage) will be
implemented as built-in AttributeTypes in subsequent phases.

Co-authored-by: dimitri-yatsenko <dimitri@datajoint.com>
Core DataJoint types (fully supported, recorded in :type: comments):
- Numeric: float32, float64, int64, uint64, int32, uint32, int16, uint16, int8, uint8
- Boolean: bool
- UUID: uuid → binary(16)
- JSON: json
- Binary: blob → longblob
- Temporal: date, datetime
- String: char(n), varchar(n)
- Enumeration: enum(...)

Changes:
- declare.py: Define CORE_TYPES with (pattern, sql_mapping) pairs
- declare.py: Add warning for non-standard native type usage
- heading.py: Update to use CORE_TYPE_NAMES
- storage-types-spec.md: Update documentation to reflect core types

Native database types (text, mediumint, etc.) pass through with a warning
about non-standard usage.
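
A minimal sketch of the (pattern, sql_mapping) idea with a pass-through warning. The table is deliberately partial: the `uuid`, `blob`, `float32`, and `uint8` rows follow mappings stated in this conversation, while the other rows, the `to_sql` helper, and the warning text are assumptions for illustration.

```python
import re
import warnings

# (pattern, MySQL mapping) pairs; None means "keep the declared text as is".
CORE_TYPES = [
    (re.compile(r"float32", re.I), "float"),
    (re.compile(r"float64", re.I), "double"),            # assumed mapping
    (re.compile(r"uint8", re.I), "tinyint unsigned"),
    (re.compile(r"int32", re.I), "int"),                 # assumed mapping
    (re.compile(r"uuid", re.I), "binary(16)"),
    (re.compile(r"blob", re.I), "longblob"),
    (re.compile(r"(var)?char\(\d+\)", re.I), None),
]


def to_sql(declared):
    """Translate a declared type to SQL; warn when it is not a core type."""
    declared = declared.strip()
    for pattern, sql in CORE_TYPES:
        if pattern.fullmatch(declared):
            return sql if sql is not None else declared
    warnings.warn(
        f"{declared!r} is a native database type, not a core DataJoint type",
        stacklevel=2,
    )
    return declared
```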

Co-authored-by: dimitri-yatsenko <dimitri@datajoint.com>
Add content-addressed storage with deduplication for the <content> and
<xblob> AttributeTypes.

New files:
- content_registry.py: Content storage utilities
  - compute_content_hash(): SHA256 hashing
  - build_content_path(): Hierarchical path generation (_content/xx/yy/hash)
  - put_content(): Store with deduplication
  - get_content(): Retrieve with hash verification
  - content_exists(), delete_content(), get_content_size()
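
The hashing, path layout, and deduplication described above can be sketched with an in-memory dict standing in for the storage backend. The function names mirror the list, but the bodies and signatures are assumptions; in particular, reading `xx/yy` as the first two pairs of hex characters of the hash is a guess at the layout.

```python
import hashlib


def compute_content_hash(data: bytes) -> str:
    """SHA256 hex digest (64 characters)."""
    return hashlib.sha256(data).hexdigest()


def build_content_path(content_hash: str) -> str:
    """Assumed layout: _content/<first 2 hex>/<next 2 hex>/<full hash>."""
    return f"_content/{content_hash[:2]}/{content_hash[2:4]}/{content_hash}"


def put_content(store: dict, data: bytes) -> dict:
    """Write data unless identical content is already stored; return metadata."""
    content_hash = compute_content_hash(data)
    path = build_content_path(content_hash)
    if path not in store:  # deduplication: same bytes map to the same path
        store[path] = data
    return {"hash": content_hash, "size": len(data)}


def get_content(store: dict, content_hash: str) -> bytes:
    """Read data back and verify that it still matches its hash."""
    data = store[build_content_path(content_hash)]
    if compute_content_hash(data) != content_hash:
        raise ValueError("stored content does not match its hash")
    return data
```

Inserting the same bytes twice leaves a single stored copy, which is the property the `<content>` and `<xblob>` types rely on.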

New built-in AttributeTypes in attribute_type.py:
- ContentType (<content>): Content-addressed storage for raw bytes
  - dtype = "json" (stores metadata: hash, store, size)
  - Automatic deduplication via SHA256 hashing
- XBlobType (<xblob>): Serialized blobs with external storage
  - dtype = "<content>" (composition with ContentType)
  - Combines djblob serialization with content-addressed storage

Updated insert/fetch for type chain support:
- table.py: Apply encoder chain from outermost to innermost
- fetch.py: Apply decoder chain from innermost to outermost
- Both pass store_name through the chain for external storage
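
The ordering rule can be shown with two toy layers. This is a sketch, not code from `table.py` or `fetch.py`: `JsonBytes` and `Compressed` are invented stand-ins for an outer serializing type and an inner storage type, and passing `store_name` as a keyword argument is an assumption about how the store is threaded through.

```python
import json
import zlib


class JsonBytes:
    """Outer layer: Python object <-> bytes."""

    def encode(self, value, key=None, store_name=None):
        return json.dumps(value).encode("utf-8")

    def decode(self, stored, key=None, store_name=None):
        return json.loads(stored.decode("utf-8"))


class Compressed:
    """Inner layer: bytes <-> compressed bytes."""

    def encode(self, value, key=None, store_name=None):
        return zlib.compress(value)

    def decode(self, stored, key=None, store_name=None):
        return zlib.decompress(stored)


def encode_chain(chain, value, key=None, store_name=None):
    """Insert path: apply encoders from outermost to innermost."""
    for layer in chain:
        value = layer.encode(value, key=key, store_name=store_name)
    return value


def decode_chain(chain, stored, key=None, store_name=None):
    """Fetch path: apply decoders from innermost to outermost."""
    for layer in reversed(chain):
        stored = layer.decode(stored, key=key, store_name=store_name)
    return stored
```

Reversing the order on fetch is what makes the round trip work: the innermost layer must undo its transformation before the outer layer can interpret the bytes.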

Example usage:
  data : <content@mystore>    # Raw bytes, deduplicated
  array : <xblob@mystore>     # Serialized objects, deduplicated

Co-authored-by: dimitri-yatsenko <dimitri@datajoint.com>
Co-authored-by: dimitri-yatsenko <dimitri@datajoint.com>
…lization

Breaking changes:
- Remove attribute_adapter.py entirely (hard deprecate)
- Remove bypass_serialization flag from blob.py - blobs always serialize now
- Remove unused 'database' field from Attribute in heading.py

Import get_adapter from attribute_type instead of attribute_adapter.

Co-authored-by: dimitri-yatsenko <dimitri@datajoint.com>
- Document function-based content storage (not registry class)
- Add implementation status table
- Explain design decision: functions vs database table
- Update Phase 5 GC design for scanning approach
- Document removed/deprecated items

Co-authored-by: dimitri-yatsenko <dimitri@datajoint.com>
- Create builtin_types.py with DJBlobType, ContentType, XBlobType
- Types serve as examples for users creating custom types
- Module docstring includes example of defining a custom GraphType
- Add get_adapter() function to attribute_type.py for compatibility
- Auto-register built-in types via import at module load

Co-authored-by: dimitri-yatsenko <dimitri@datajoint.com>
Add <object> type for files and folders (Zarr, HDF5, etc.):
- Path derived from primary key: {schema}/{table}/objects/{pk}/{field}_{token}
- Supports bytes, files, and directories
- Returns ObjectRef for lazy fsspec-based access
- No deduplication (unlike <content>)

Update implementation plan with Phase 2b documenting ObjectType.

Co-authored-by: dimitri-yatsenko <dimitri@datajoint.com>
Migration utilities are out of scope for now. This is a breaking-change
release: users will need to recreate tables with the new types.

Co-authored-by: dimitri-yatsenko <dimitri@datajoint.com>
- Document staged_insert.py for direct object storage writes
- Add flow comparison: normal insert vs staged insert
- Include staged_insert.py in critical files summary

Co-authored-by: dimitri-yatsenko <dimitri@datajoint.com>
Add remaining built-in AttributeTypes:
- <attach>: Internal file attachment stored in longblob
- <xattach>: External file attachment via <content> with deduplication
- <filepath@store>: Reference to existing file (no copy, returns ObjectRef)

Update implementation plan to mark Phase 3 complete.

Co-authored-by: dimitri-yatsenko <dimitri@datajoint.com>
Add garbage collection module (gc.py) for content-addressed storage:
- scan_references() to find content hashes in schemas
- list_stored_content() to enumerate _content/ directory
- scan() for orphan detection without deletion
- collect() for orphan removal with dry_run option
- format_stats() for human-readable output
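
The scan-then-collect flow amounts to a set difference between what is stored and what is referenced. The sketch below uses a plain dict keyed by content hash in place of real storage; the signatures, the statistics keys, and the `dry_run=True` default are assumptions, not the `gc.py` API.

```python
def scan(referenced, stored):
    """Orphan detection only: nothing is deleted."""
    referenced = set(referenced)
    stored = set(stored)
    orphans = stored - referenced
    return {
        "referenced": len(referenced),
        "stored": len(stored),
        "orphaned": len(orphans),
        "orphans": sorted(orphans),
    }


def collect(referenced, store, dry_run=True):
    """Remove orphaned entries from `store` unless dry_run is set."""
    stats = scan(referenced, store.keys())
    stats["deleted"] = 0
    if not dry_run:
        for content_hash in stats["orphans"]:
            del store[content_hash]
            stats["deleted"] += 1
    return stats
```

A dry run reports the same orphan list that a real run would delete, so the output can be reviewed before anything is removed.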

Add test files:
- test_content_storage.py for content_registry.py functions
- test_type_composition.py for type chain encoding/decoding
- test_gc.py for garbage collection

Update implementation plan to mark all phases complete.

Co-authored-by: dimitri-yatsenko <dimitri@datajoint.com>
Extend gc.py to handle both storage patterns:
- Content-addressed storage: <content>, <xblob>, <xattach>
- Path-addressed storage: <object>

New functions added:
- _uses_object_storage() - detect object type attributes
- _extract_object_refs() - extract path refs from JSON
- scan_object_references() - scan schemas for object paths
- list_stored_objects() - list all objects in storage
- delete_object() - delete object directory tree

Updated scan() and collect() to handle both storage types,
with combined and per-type statistics in the output.

Updated tests for new statistics format.

Co-authored-by: dimitri-yatsenko <dimitri@datajoint.com>
External tables are deprecated in favor of the new storage type system.
Move the constant to external.py where it's used, keeping declare.py clean.

Co-authored-by: dimitri-yatsenko <dimitri@datajoint.com>
External tables (~external_*) are deprecated in favor of the new
AttributeType-based storage system. The new types (<xblob>, <content>,
<object>) store data directly to storage via StorageBackend without
tracking tables.

- Remove src/datajoint/external.py entirely
- Remove ExternalMapping from schemas.py
- Remove external table pre-declaration from table.py

Co-authored-by: dimitri-yatsenko <dimitri@datajoint.com>
Labels

documentation Issues related to documentation enhancement Indicates new improvements

3 participants