Trial our curation for free on a sample dataset, no API required

The purpose of a trial is to explore the results of vectorizing a client's data and running filters, searches, clustering, and distribution analysis on it.

The process can include calibration/fine-tuning steps or simply use our pretrained models.

Calibration is recommended for accurate results.

The trial is also used to determine the labeled dataset size required for good fine-tuning results.

Initial evaluation: Run a trial without fine-tuning to assess the pretrained model's performance.

Iterative improvement: If necessary, conduct multiple fine-tuning trials, gradually increasing the dataset size until satisfactory results are achieved.


The trial requires a zip folder to be sent, with the appropriate subfolders depending on which steps will be included.

All data files must belong to the same content type. The supported file extensions are:

  •     Image: "jpg", "jpeg", "png", "bmp", "gif", "tiff", "tif"

  •     Video: "mp4", "avi", "mov", "wmv"

  •     Sound: "mp3", "wav"

  •     Point_Cloud: "xyz", "xyzn", "xyzrgb", "pts", "pcd", "ply", "stl", "obj", "off", "gltf"

  •     Text: "txt"

In the indexing process, the content is automatically subsampled if it is longer than the supported length:

  •     Video is subsampled either to single frames or to 2 seconds (64 frames at 30 fps)

  •     Sound is subsampled to 10 seconds

  •     Text is subsampled to 1024 characters

  •     Image and Point_Cloud are not subsampled
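
Before assembling the zip, it can help to verify that every file uses one of the supported extensions for the chosen content type. Below is a minimal Python sketch based on the lists above; the folder path and content type passed to it are placeholders.

import pathlib

# Supported extensions per content type, copied from the lists above.
SUPPORTED_EXTENSIONS = {
    "Image": {"jpg", "jpeg", "png", "bmp", "gif", "tiff", "tif"},
    "Video": {"mp4", "avi", "mov", "wmv"},
    "Sound": {"mp3", "wav"},
    "Point_Cloud": {"xyz", "xyzn", "xyzrgb", "pts", "pcd", "ply", "stl", "obj", "off", "gltf"},
    "Text": {"txt"},
}

def unsupported_files(folder, content_type):
    """Return the files in `folder` whose extension is not supported for `content_type`."""
    allowed = SUPPORTED_EXTENSIONS[content_type]
    return [p for p in pathlib.Path(folder).iterdir()
            if p.is_file() and p.suffix.lower().lstrip(".") not in allowed]

# Example usage (the folder name is a placeholder):
# print(unsupported_files("my_dataset", "Image"))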

0 - If your data is of type “Video” or “Sound” and you want to test auto trimming, place your files inside a “Trimming” folder in the zip.

1 - Create a folder named "Raw_Data" in the zip containing the data that will be indexed.

We recommend indexing at least 100 and no more than 10,000 examples. Diversity is important, and the examples should be representative.

You can include files whose names contain “redundancy_check_only”.

They won’t be indexed but will be used by the redundancy filter.
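
If it is convenient, the "Raw_Data" folder can be assembled with a short script. The sketch below copies one indexed file and one redundancy-check-only file into it; all source paths and file names are placeholders.

import pathlib, shutil

raw_dir = pathlib.Path("trial_zip/Raw_Data")  # placeholder zip working folder
raw_dir.mkdir(parents=True, exist_ok=True)

# A file that will be indexed.
shutil.copy("source/example_001.jpg", raw_dir / "example_001.jpg")

# A file used only by the redundancy filter: keep "redundancy_check_only" in its name.
shutil.copy("source/example_002.jpg", raw_dir / "redundancy_check_only_example_002.jpg")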

2 - Fine-tuning is a classification-based training step that forces the model to pay attention to the important features and to ignore the irrelevant ones.

If you can contribute a labeled dataset for fine-tuning, include a folder named "Fine_Tuning_Data".

Inside it, add the examples that will be used for fine-tuning.

We recommend at least 100 and no more than 10,000 examples for fine-tuning.

You can have up to 100 labels in your fine-tuning dataset.

The labels can be any class that describes the content. Each file can have multiple labels.

Open a text editor and add the labels of each file following the format:

{
    "file_name_1.ext": ["label_1", "label_2", ...],
    "file_name_2.ext": ["label_2"],
    "file_name_3.ext": ["label_1", "label_3"]
}

Save the file as "example_to_labels.json" and place it in the "Fine_Tuning_Data" folder.
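
The mapping can also be written and checked programmatically. The sketch below builds "example_to_labels.json" from a Python dictionary and verifies the two constraints above (at most 100 distinct labels, every key naming a file that exists in the folder); the folder path, file names, and labels are placeholders.

import json, pathlib

fine_tuning_dir = pathlib.Path("trial_zip/Fine_Tuning_Data")  # placeholder path

# Placeholder mapping: each file can carry one or more labels.
example_to_labels = {
    "file_name_1.ext": ["label_1", "label_2"],
    "file_name_2.ext": ["label_2"],
}

distinct_labels = {label for labels in example_to_labels.values() for label in labels}
assert len(distinct_labels) <= 100, "at most 100 labels are allowed"
missing = [name for name in example_to_labels if not (fine_tuning_dir / name).exists()]
assert not missing, f"labeled files missing from Fine_Tuning_Data: {missing}"

with open(fine_tuning_dir / "example_to_labels.json", "w") as f:
    json.dump(example_to_labels, f, indent=4)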

3 - Similarity calibration is used to train the redundancy filter and the clustering by similarity.

If you can contribute a labeled dataset for similarity calibration, include a folder named "Similarity_Calibration_Data".

If your data is video, sound, or text, you can use our automated similarity dataset generation job.

The similarity dataset must contain at least 200 and at most 10,000 pairs of examples that are similar according to the client's criteria.

To assemble the similarity dataset, we recommend gathering your data into clusters, one for each fine-tuning label (with each cluster containing the items carrying that label), and then extracting at least 2 pairs from each cluster.

The file names inside each pair must start with the prefix “{id}_cluster_”, where {id} is the id of the pair. The files inside the folder should look like this:

1_cluster_file_1.ext
1_cluster_file_2.ext
2_cluster_file_3.ext
2_cluster_file_4.ext
3_cluster_file_5.ext
3_cluster_file_6.ext
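
One way to produce these names, following the recommendation above, is to start from a list of similar pairs and prepend the pair id automatically. In the sketch below the pair list, source paths, and destination folder are placeholders.

import pathlib, shutil

calibration_dir = pathlib.Path("trial_zip/Similarity_Calibration_Data")  # placeholder path
calibration_dir.mkdir(parents=True, exist_ok=True)

# Placeholder pairs of similar files, e.g. at least 2 pairs per fine-tuning label.
similar_pairs = [
    ("source/file_1.ext", "source/file_2.ext"),
    ("source/file_3.ext", "source/file_4.ext"),
]

for pair_id, pair in enumerate(similar_pairs, start=1):
    for path in pair:
        name = pathlib.Path(path).name
        # Prefix each file of the pair with "{id}_cluster_".
        shutil.copy(path, calibration_dir / f"{pair_id}_cluster_{name}")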

If your data is of type Video (to be indexed as video or image), Sound, or Text, you can obtain an automatically generated similarity calibration dataset by placing your files in a folder called “Pre_Extraction_Similarity_Calibration_Data”.

For better results, also include an "example_to_labels.json" in this folder, along with a list of relevant keywords saved as "relevant_labels.json", to obtain a more balanced automatic similarity calibration dataset.
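
Assuming "relevant_labels.json" is a plain JSON list of keyword strings (an assumption based on the description above), it could look like this, with placeholder keywords:

[
    "keyword_1",
    "keyword_2",
    "keyword_3"
]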

4 - Relevance calibration is used to train the relevance filter before indexing.

If you can contribute a dataset for relevance calibration, include a folder named "Relevance_Calibration_Data".

There are no special conditions; just include the subset of files that are the most representative. We recommend at least 10 and no more than 100 examples for relevance calibration.

5 - Reverse search is used to sort the indexed archive by similarity to example contents.

If you want to run searches, include a “Search_Data” folder.

Inside the “Search_Data” folder, create a folder for each data type you will provide examples for, e.g. [“Image”, “Video”, “Sound”, “Text”, “Point_Cloud”].

Inside each content type folder, create a subfolder for each theme you want to test reverse search on (with one or more files in each); the names of these subfolders will be used in the results.

There are no special conditions; just include the subset of files that are the most representative.

We recommend at least 1 example and no more than 20 examples per search.

If you add search subfolders with “essential” or “forbidden” in their names, their search results will also be used to enhance the data balancing recommendations.
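
The expected layout can be created with a few lines of Python. The sketch below builds a "Search_Data" tree with one content type folder and two theme subfolders, one of them flagged as essential so its results also feed the balancing recommendations; all folder names and example files are placeholders.

import pathlib, shutil

search_dir = pathlib.Path("trial_zip/Search_Data")  # placeholder path

# One folder per content type, and inside it one subfolder per search theme.
# The subfolder names are reused in the results.
themes = {
    "Image/animals": ["source/animal_query_1.jpg"],                # placeholder theme
    "Image/essential_safety_signs": ["source/sign_query_1.jpg"],   # "essential" also drives balancing
}

for theme, files in themes.items():
    theme_dir = search_dir / theme
    theme_dir.mkdir(parents=True, exist_ok=True)
    for path in files:
        shutil.copy(path, theme_dir / pathlib.Path(path).name)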

6 - Fine-tune a data translator.

If you want to fine-tune a data type translator, so that you can search an archive of one data type with examples from another data type, add a folder called “Translator_Fine_Tuning” and place the following inside it:

examples of files from the input data type and from the output data type.

an "input_output_content_types_description.json" file with the followinf dictionary inside:

{
    "input_content_type": one of ["Image", "Video", "Sound", "Text", "Point_Cloud"],
    "output_content_type": one of ["Image", "Video", "Sound", "Text", "Point_Cloud"]
}

a training mappings file called "training_mappings.json"

optionally, a validation mappings file called "validation_mappings.json" (otherwise 15% of the training pairs are selected for validation)

The mapping files have the following structure:

[
    (input file name 1, output file name 1),
    (input file name 2, output file name 2),
]
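
Both JSON files can be written with a short script. The sketch below produces the content type description and a training mapping for an example Image-to-Text translator; the mappings are written as two-element lists because JSON has no tuple type, and all file names are placeholders.

import json, pathlib

translator_dir = pathlib.Path("trial_zip/Translator_Fine_Tuning")  # placeholder path
translator_dir.mkdir(parents=True, exist_ok=True)

# Declare the input and output content types of the translator.
description = {
    "input_content_type": "Image",   # placeholder choice
    "output_content_type": "Text",
}
with open(translator_dir / "input_output_content_types_description.json", "w") as f:
    json.dump(description, f, indent=4)

# Each entry maps an input file name to the corresponding output file name.
training_mappings = [
    ["photo_1.jpg", "caption_1.txt"],   # placeholder pairs
    ["photo_2.jpg", "caption_2.txt"],
]
with open(translator_dir / "training_mappings.json", "w") as f:
    json.dump(training_mappings, f, indent=4)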

7 - Zip the folder and use any FTP client to send the data.
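
Zipping and uploading can also be scripted. The sketch below archives the prepared folder with the Python standard library and uploads it over FTP; the folder name, host, user, and password are placeholders for the credentials we send you by email.

import shutil
from ftplib import FTP

# Zip the prepared trial folder ("trial_zip" is a placeholder name).
archive_path = shutil.make_archive("trial_zip", "zip", root_dir="trial_zip")

# Any FTP client works; ftplib is shown here with placeholder credentials.
with FTP("ftp.example.com") as ftp:
    ftp.login(user="trial_user", passwd="trial_password")
    with open(archive_path, "rb") as f:
        ftp.storbinary("STOR trial_zip.zip", f)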

In order for the client to evaluate the accuracy and usefulness of the process, our trial will create an archive, calibrate models, sample, filter, and index the data, then run and return the results of the following jobs:

  • Data considered irrelevant (if provided)

  • Data considered redundant (if provided)

  • Data highlights and clusters with an auto-discovered number of clusters

  • Data highlights and clusters by calibrated similarity (if provided)

  • Data sorted into inliers and outliers

  • Data sorted by similarity to essential examples (if provided)

  • Data sorted by similarity to forbidden examples (if provided)

  • Over-represented data sorted by prioritized removal for balancing

  • Under-represented data sorted by prioritized sourcing for balancing

Send us an email requesting a trial so we can set up an FTP user for you and send you the trial credentials and non-disclosure agreement.