calibrate_similarity
The job fine-tunes the similarity metrics used in similarity-based search, clustering, and redundancy filtering.
You will need a small similarity calibration dataset for the fine-tuning process. The custom model is accessible only from your account and is linked to an archive.
You can use both real and synthetic (AI-generated) data for your examples.
Required Account Privileges: "read-write"
Request JSON ["inputs"]:
"archive": string (3 <= len <= 30), unique in account, null NOT allowed
A unique string identifier for the archive within your account.

"custom_vectorizer_name": string (3 <= len <= 30), null NOT allowed
The name of the new custom vectorizer being created.

"file_urls": list of strings, null allowed
An optional list of URLs of files to be downloaded.

"download_from_batch_cloud_folder": bool, null NOT allowed
Whether to download files from the batch cloud folder.
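A request body built from the fields above might look like the following sketch. The field names and constraints come from this spec; the concrete values, and how the payload is actually submitted (endpoint URL, authentication), are illustrative assumptions not defined here.

```python
import json

# Hypothetical inputs for the calibrate_similarity job.
# Values are placeholders; constraints follow the spec above.
inputs = {
    "archive": "my-archive",                    # 3 <= len <= 30, unique in account
    "custom_vectorizer_name": "my-vectorizer",  # 3 <= len <= 30
    "file_urls": None,                          # optional list of URLs, null allowed
    "download_from_batch_cloud_folder": True,   # read files from the batch cloud folder
}

# Basic client-side validation of the length constraints.
for key in ("archive", "custom_vectorizer_name"):
    assert 3 <= len(inputs[key]) <= 30, f"{key} must be 3-30 characters"

payload = json.dumps({"inputs": inputs})
print(payload)
```

Validating the length constraints client-side avoids a round trip for requests the service would reject anyway.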
Response JSON ["results"]
The job does not return results in the response JSON.
File Requirements
Files must be sent via FTP to the cloud batch folder or listed in file_urls. You can pass either data files or vectors (torch safetensors ".pt", one-dimensional, any length).

Similarity calibration is the step that trains the redundancy filter and similarity-based clustering. The similarity dataset must contain between 200 and 10000 pairs of examples that are similar according to your criteria.

To assemble the similarity dataset, we recommend grouping your data into clusters, one for each fine-tuning label, then extracting at least 2 pairs from each cluster.

File names within a pair must start with a prefix that is the id of the pair. Example:

1_file_1.ext
1_file_2.ext
2_file_3.ext
2_file_4.ext
3_file_5.ext
3_file_6.ext
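The pair-naming convention above can be sketched as follows: starting from files grouped by cluster label, emit pair-prefixed file names where the prefix is the pair id. The cluster labels and file names here are invented for illustration, and a real dataset would need the 200 to 10000 pairs the job requires.

```python
import itertools

# Illustrative clusters: one list of files per fine-tuning label.
clusters = {
    "label_a": ["a1.txt", "a2.txt", "a3.txt"],
    "label_b": ["b1.txt", "b2.txt"],
}

pairs = []
pair_id = 0
for label, files in clusters.items():
    # Pair up every combination of two files in the cluster; each
    # cluster with 2+ files yields at least one pair (the spec asks
    # for at least 2 pairs per cluster, so real clusters need 3+ files).
    for f1, f2 in itertools.combinations(files, 2):
        pair_id += 1
        # Both file names in a pair share the pair-id prefix.
        pairs.append((f"{pair_id}_{f1}", f"{pair_id}_{f2}"))

for p in pairs:
    print(*p)
```

A real assembly script would then upload the renamed files via FTP to the cloud batch folder, or host them at the URLs passed in file_urls.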