index

The index job vectorizes and indexes your data for efficient retrieval and analysis. It can check content for redundancy and relevance, so that the indexed content is both unique and pertinent. It also supports custom vectorization, allowing you to index with your own proprietary models. During vectorization, the uncertainty of each vector is computed and stored, but only for archives vectorized with a custom fine-tuned vectorizer.

Required Account Privileges: "read-write"

Request JSON ["inputs"]:

   "archive":
      string (3 <= len <= 30) unique in account 
      null NOT allowed
      A unique string identifier for the archive within your account.

   "check_for_redundancy_against_archived":
      bool
      null NOT allowed
      A boolean indicating whether to check for redundancy against already archived content. (Requires similarity calibration of the archive)
   
   "archive_content_ids_subset":
      list of ints
      null allowed
      Optional. A list of integers representing the IDs of the specific contents to consider for filtering. If not provided, all contents in the archive will be considered.
   
   "check_for_redundancy_within_batch":
      bool
      null NOT allowed
      A boolean indicating whether to check for redundancy within the current batch. (Requires similarity calibration of the archive)

   "check_for_relevance":
      bool
      null NOT allowed
      A boolean indicating whether to check for the relevance of the content. (Requires relevance calibration of the archive)
      
   "custom_vectorizer_name":
      string (3 <= len <= 30)
      null NOT allowed
      An optional string representing the name of the custom vectorizer to be used.
   
   "file_urls":
      list of strings
      null allowed
      An optional list of strings containing the URLs of files to be downloaded.
      
   "download_from_batch_cloud_folder":
      bool
      null NOT allowed
      A boolean indicating whether to download files from the batch cloud folder.
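
As an illustration of the fields above, here is a minimal Python sketch that assembles and validates the "inputs" object. The helper names (build_index_inputs, _check_name), the archive and vectorizer names, and the example.com URL are hypothetical and not part of the API. The sketch also assumes that the request body wraps the fields under an "inputs" key, that "custom_vectorizer_name" must always be a string (following its "null NOT allowed" line), and that at least one file source must be given. How the request is sent is not shown, since the endpoint is not described in this section.

```python
import json


def _check_name(value, field):
    # Both string fields are documented as 3 <= len <= 30.
    if not isinstance(value, str) or not 3 <= len(value) <= 30:
        raise ValueError(f"{field} must be a string of 3 to 30 characters")
    return value


def build_index_inputs(archive, custom_vectorizer_name, file_urls=None,
                       archive_content_ids_subset=None,
                       check_for_redundancy_against_archived=False,
                       check_for_redundancy_within_batch=False,
                       check_for_relevance=False,
                       download_from_batch_cloud_folder=False):
    """Hypothetical helper: returns the "inputs" dict for an index job."""
    # Assumption: the job needs files from at least one of the two sources.
    if not file_urls and not download_from_batch_cloud_folder:
        raise ValueError("provide file_urls or set download_from_batch_cloud_folder")
    return {
        "archive": _check_name(archive, "archive"),
        "check_for_redundancy_against_archived": bool(check_for_redundancy_against_archived),
        "archive_content_ids_subset": archive_content_ids_subset,  # null allowed
        "check_for_redundancy_within_batch": bool(check_for_redundancy_within_batch),
        "check_for_relevance": bool(check_for_relevance),
        "custom_vectorizer_name": _check_name(custom_vectorizer_name,
                                              "custom_vectorizer_name"),
        "file_urls": file_urls,  # null allowed
        "download_from_batch_cloud_folder": bool(download_from_batch_cloud_folder),
    }


# Placeholder values for illustration only.
payload = {"inputs": build_index_inputs(
    "my_archive",
    "my_vectorizer",
    file_urls=["https://example.com/report.pdf"],
)}
print(json.dumps(payload, indent=2))
```

Validating the length limits on the client side only saves a round trip; the service remains the authority on what it accepts.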

Response JSON ["results"]:

   "name_to_indexed_content_id":
      dict
      {"name":int}

   "name_to_uncertainty":
      dict
      {"name":float}
            
   "exact_duplicate_file_names":
      list of strings
      
   "failed_vectorization_names":
      list of strings
      
   "redundant_content_names":
      list of strings
      
   "irrelevant_content_names":
      list of strings
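
One way to consume these fields is to merge the two dicts into a single per-file record and to collect the four name lists into a "why was this skipped" map. The Python sketch below assumes the "results" object has already been parsed into a dict. The function name summarize_results, the reason labels, and the sample values are all made up for illustration and do not come from a real response.

```python
# Maps each documented list field to a short, hypothetical reason label.
REASONS = {
    "exact_duplicate_file_names": "exact duplicate",
    "failed_vectorization_names": "vectorization failed",
    "redundant_content_names": "redundant",
    "irrelevant_content_names": "irrelevant",
}


def summarize_results(results):
    """Hypothetical helper: returns (indexed, rejected) dicts keyed by name."""
    ids = results.get("name_to_indexed_content_id") or {}
    # Uncertainty may be absent when no custom fine-tuned vectorizer was used.
    uncertainties = results.get("name_to_uncertainty") or {}
    indexed = {
        name: {"content_id": content_id, "uncertainty": uncertainties.get(name)}
        for name, content_id in ids.items()
    }
    rejected = {}
    for field, reason in REASONS.items():
        for name in results.get(field) or []:
            rejected[name] = reason
    return indexed, rejected


# Made-up sample shaped like the fields documented above.
sample_results = {
    "name_to_indexed_content_id": {"a.pdf": 101},
    "name_to_uncertainty": {"a.pdf": 0.12},
    "exact_duplicate_file_names": ["b.pdf"],
    "failed_vectorization_names": [],
    "redundant_content_names": ["c.pdf"],
    "irrelevant_content_names": [],
}
indexed, rejected = summarize_results(sample_results)
print(indexed)
print(rejected)
```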

File Requirements

Files must be sent via FTP to the batch cloud folder or listed in "file_urls".
You can pass either data files or vectors (torch safetensors ".pt" files, one-dimensional, any length).
If passing vectors, you can also pass the uncertainty of each vector as a "{name}_uncertainty.json" file containing the float value.
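
The uncertainty file convention can be sketched in Python as follows. This assumes that {name} is the vector file's name without its ".pt" extension and that the JSON file holds nothing but a bare float. Both are readings of the line above, not confirmed behavior. The function name write_uncertainty_file and the sample file name are hypothetical, and creating the ".pt" vector itself is not shown.

```python
import json
import tempfile
from pathlib import Path


def write_uncertainty_file(vector_path, uncertainty, out_dir=None):
    """Hypothetical helper: writes "{name}_uncertainty.json" next to a vector."""
    vector_path = Path(vector_path)
    target_dir = Path(out_dir) if out_dir is not None else vector_path.parent
    # Assumption: {name} is the vector file name without the ".pt" suffix.
    target = target_dir / f"{vector_path.stem}_uncertainty.json"
    # Assumption: the file body is a single JSON float, e.g. 0.12
    target.write_text(json.dumps(float(uncertainty)))
    return target


# Write into a temporary folder so the example leaves nothing behind.
with tempfile.TemporaryDirectory() as tmp:
    path = write_uncertainty_file("doc_001.pt", 0.12, out_dir=tmp)
    print(path.name, path.read_text())  # doc_001_uncertainty.json 0.12
```

The resulting file would then be uploaded alongside its vector by whichever route you use for the batch.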