index
The job vectorizes and indexes your data for efficient retrieval and analysis. It includes checks for redundancy and relevance, so that the indexed content is both unique and pertinent. The job also supports custom vectorization, which lets you index with your own proprietary models. During vectorization, the uncertainty of each vector is also computed and stored. Uncertainty is only computed for archives vectorized with a custom fine-tuned vectorizer.
Required Account Privileges: "read-write"
Request JSON ["inputs"]:
"archive": string (3 <= len <= 30), unique in account, null NOT allowed
    A unique string identifier for the archive within your account.
"check_for_redundancy_against_archived": bool, null NOT allowed
    A boolean indicating whether to check for redundancy against already archived content. (Requires similarity calibration of the archive)
"archive_content_ids_subset": list of ints, null allowed
    Optional. A list of integers representing the IDs of the specific contents to consider for filtering. If not provided, all contents in the archive will be considered.
"check_for_redundancy_within_batch": bool, null NOT allowed
    A boolean indicating whether to check for redundancy within the current batch. (Requires similarity calibration of the archive)
"check_for_relevance": bool, null NOT allowed
    A boolean indicating whether to check for the relevance of the content. (Requires relevance calibration of the archive)
"custom_vectorizer_name": string (3 <= len <= 30), null NOT allowed
    An optional string representing the name of the custom vectorizer to be used.
"file_urls": list of strings, null allowed
    An optional list of strings containing the URLs of files to be downloaded.
"download_from_batch_cloud_folder": bool, null NOT allowed
    A boolean indicating whether to download files from the batch cloud folder.
Response JSON ["results"]:
"name_to_indexed_content_id": dict {"name":int}
"name_to_uncertainty": dict {"name":float}
"exact_duplicate_file_names": list of strings
"failed_vectorization_names": list of strings
"redundant_content_names": list of strings
"irrelevant_content_names": list of strings
File Requirements
Files must be sent via FTP to the batch cloud folder or listed in "file_urls".
You can pass either data files or vectors (torch safetensors ".pt", one-dimensional, any length).
If passing vectors, you can also pass the uncertainty as a "{name}_uncertainty.json" file containing the float value.
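The uncertainty sidecar file could be produced along the following lines. This is a sketch under two assumptions that this page does not confirm: that "{name}" is the vector file's name without its extension, and that the file body is a bare JSON number rather than an object. The helper name write_uncertainty_file is hypothetical.

```python
import json
from pathlib import Path


def write_uncertainty_file(folder: str, name: str, uncertainty: float) -> Path:
    """Hypothetical helper: write "{name}_uncertainty.json" next to the vector file."""
    path = Path(folder) / f"{name}_uncertainty.json"
    # Assumption: the file holds a single bare JSON float.
    path.write_text(json.dumps(float(uncertainty)))
    return path
```

For a vector stored as "doc1.pt", calling write_uncertainty_file(folder, "doc1", 0.25) would create "doc1_uncertainty.json" in the same folder, ready to be uploaded alongside the vector.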