HnswDocumentIndex treats document IDs as string, they can be str, int, ID #1850

@oytuntez

Initial Checks

  • I have read and followed the docs and still think this is a bug

Description

I noticed this behavior when I wanted to access multiple documents in the index:

    @requests(on='/find')
    def find(self, docs: DocList[QuoteFile], **_) -> DocList[QuoteFile]:
        # docs.id is the list of IDs from the request; it is used to look up several documents at once
        return self._cache_di[docs.id]

And when I issue POST /find with body {"data":[{"id":"300055"}]}, this code yields:

      File "/Users/oytuntez/motaword/jina-documents/venv/lib/py…", line 544, in _get_docs_sqlite_doc_id
        hashed_ids = tuple(self._to_hashed_id(id_) for id_ in doc_ids)
      File "/Users/oytuntez/motaword/jina-documents/venv/lib/py…", line 544, in <genexpr>
        hashed_ids = tuple(self._to_hashed_id(id_) for id_ in doc_ids)
      File "/Users/oytuntez/motaword/jina-documents/venv/lib/py…", line 445, in _to_hashed_id
        return int(hashlib.sha256(doc_id.encode('utf-8')).hexdigest… 16) % 10**18
    AttributeError: 'int' object has no attribute 'encode'

Upon investigation, I saw that most of HnswDocumentIndex treats IDs as str. However, it is my understanding that IDs can be int, see this type definition:

class ID(str, AbstractType):
    """
    Represent an unique ID
    """

    @classmethod
    def _docarray_validate(
        cls: Type[T],
        value: Union[str, int, UUID],
...

I think ID values should be cast to str wherever a string is required, as is the case in _to_hashed_id.
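To illustrate the suggested cast, here is a minimal standalone sketch modeled on the line shown in the traceback. The function name `to_hashed_id` and its signature are my own for illustration, and the completed `hexdigest()` call is an assumption about the truncated line; this is not the actual method in HnswDocumentIndex.

```python
import hashlib
from typing import Union
from uuid import UUID


def to_hashed_id(doc_id: Union[str, int, UUID]) -> int:
    """Hypothetical stand-in for _to_hashed_id that tolerates non-str IDs."""
    # Cast first, so an int or UUID ID hashes exactly like its string form
    # instead of failing on the missing .encode() attribute.
    doc_id_str = str(doc_id)
    digest = hashlib.sha256(doc_id_str.encode('utf-8')).hexdigest()
    # Same reduction as in the traceback: keep the value below 10**18.
    return int(digest, 16) % 10**18
```

With a cast like this, `to_hashed_id(300055)` and `to_hashed_id("300055")` produce the same hash, so a document indexed under a string ID could still be found when the lookup passes an int.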

Example Code

No response

Python, DocArray & OS Version

Python 3.8.12
docarray==0.40.0

Affected Components
