calibrate_similarity

This job fine-tunes the similarity metric used in similarity-based search, clustering, and redundancy filtering.

Fine-tuning requires a small similarity calibration dataset. The resulting custom model is accessible only from your account and is linked to an archive.

Both real data and synthetic, AI-generated data can be used for your examples.

Required Account Privileges: "read-write"

Request JSON ["inputs"]:

   "archive":
      string (3 <= len <= 30) unique in account 
      null NOT allowed
      A unique string identifier for the archive within your account.
   
   "custom_vectorizer_name":
      string (3 <= len <= 30)
      null NOT allowed
      A string indicating the name of the new custom vectorizer being created.
   
   "file_urls":
      list of strings
      null allowed
      An optional list of strings containing the URLs of files to be downloaded.
      
   "download_from_batch_cloud_folder":
      bool
      null NOT allowed
      A boolean indicating whether to download files from the batch cloud folder.
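Putting the fields above together, a request payload might look like the following. This is a minimal sketch using Python's standard json module; the archive name, vectorizer name, and URLs are placeholder values, not real identifiers.

```python
import json

# Hypothetical example payload for calibrate_similarity.
# "products_v1", "products_sim", and the URLs are placeholders.
inputs = {
    "archive": "products_v1",                   # 3-30 chars, unique in account
    "custom_vectorizer_name": "products_sim",   # 3-30 chars
    "file_urls": [                              # optional; null allowed
        "https://example.com/data/1_file_1.ext",
        "https://example.com/data/1_file_2.ext",
    ],
    "download_from_batch_cloud_folder": False,  # required boolean
}

request_body = json.dumps({"inputs": inputs}, indent=2)
print(request_body)
```

Since `file_urls` is provided here, `download_from_batch_cloud_folder` is set to False; set it to True instead if the files were uploaded via FTP.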

Response JSON ["results"]

The job does not return results in the response JSON.

File Requirements

Files must be sent via FTP to the batch cloud folder or listed in file_urls.
You can pass either data files or precomputed vectors (torch safetensors ".pt" files, one-dimensional, any length).

Similarity calibration trains the redundancy filter and similarity-based clustering.
The similarity dataset must contain between 200 and 10,000 pairs of examples that are similar according to your criteria.
To assemble the similarity dataset, we recommend gathering your data into clusters, one per fine-tuning label,
then extracting at least 2 pairs from each cluster.
File names within a pair must start with a prefix that is the ID of the pair, e.g.:
1_file_1.ext
1_file_2.ext
2_file_3.ext
2_file_4.ext
3_file_5.ext
3_file_6.ext
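One way to produce pair-prefixed file names like those above is to enumerate pairs drawn from each cluster. The sketch below assumes three placeholder clusters of similar files; real clusters would come from your own labeling.

```python
# Sketch: build pair-prefixed file names from clusters of similar items.
# Cluster contents are placeholders; in practice each cluster holds files
# that are similar according to your criteria.
clusters = [
    ["file_1.ext", "file_2.ext"],  # label A
    ["file_3.ext", "file_4.ext"],  # label B
    ["file_5.ext", "file_6.ext"],  # label C
]

pair_id = 0
renamed = []
for cluster in clusters:
    # Pair the cluster's files off two at a time, giving each pair a
    # shared numeric ID used as the file-name prefix.
    for i in range(0, len(cluster) - 1, 2):
        pair_id += 1
        for name in (cluster[i], cluster[i + 1]):
            renamed.append(f"{pair_id}_{name}")

print(renamed)
# -> ['1_file_1.ext', '1_file_2.ext', '2_file_3.ext',
#     '2_file_4.ext', '3_file_5.ext', '3_file_6.ext']
```

A real dataset would need enough clusters to yield between 200 and 10,000 pairs in total.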