General Guides and Common Workflows for Machine Learning and Content Archives

1. General

1.1 Account Creation

We will create an account and provide you with the authentication credentials:

- username_or_email

- password

You will also have a dedicated FTP folder for jobs that require large data transfers:

- ftp_host

- ftp_username = username_or_email

- ftp_password = password

1.2 Archive Creation

You can create one or more archives with the "create_archive" job type, using the following inputs:

- name: a unique name for the archive.

- content_type: the type of content in the archive (e.g. "Image", "Video", "Sound", "Text", "Point_Cloud").

- description: a description of the archive (optional).

- use_default_similarity_calibration (bool, optional): set this if you don't need a custom similarity calibration (a request sketch follows this list).
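As an illustration, here is a minimal sketch of submitting a "create_archive" job. The submit_job helper, the endpoint URL, and the exact payload shape are assumptions for illustration only, not the confirmed API:

```python
import requests

API_URL = "https://api.example.com/jobs"  # hypothetical job submission endpoint

def submit_job(job_type, **params):
    """Hypothetical helper: POST a job request and return its JSON result."""
    response = requests.post(
        API_URL,
        json={"job_type": job_type, **params},
        auth=("username_or_email", "password"),  # credentials from section 1.1
    )
    response.raise_for_status()
    return response.json()

# Create a video archive that uses the default similarity calibration.
submit_job(
    "create_archive",
    name="product_videos",
    content_type="Video",
    description="Master copies of product videos",
    use_default_similarity_calibration=True,
)
```

The later sketches in this guide reuse this submit_job helper.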

In your database's "contents" table, we recommend creating a dedicated column for each archive that has content indexed, e.g. "{archive_name}_vectorized_id".

You can index a single piece of content, such as a video, in multiple archives (for example a "Video" archive, a "Sound" archive, and a "Coverframe" archive). This column will store the vectorized id returned by the "index" job type.

During operations such as search and clustering, you can use this column to reference the indexed content.

You can use the "update_parameters" job type to update the description and the nr_similar_allowed for an archive.

When your archive is calibrated for similarity, nr_similar_allowed is the maximum number of detected similar items before indexing starts rejecting content as redundant.

1.3 Calibrations (optional)

There are 4 types of optional calibration/fine-tuning operations that can make your archives more useful:

Vectorizer Model Fine-Tuning

Although we provide pre-trained vectorizer models, in some cases the accuracy of the vector representation may require fine-tuning on your data.

You can fine-tune them to better fit your content with the "fine_tune_vectorizer" job type.

You can name the custom vectorizer model; it will be stored for use in jobs on archives of the same content type.

You can start the fine-tuning from one of the pre-trained vectorizer models or from one of your existing custom vectorizer models.

The fine-tuning requires a classification dataset of files and labels.

You must provide a dataset of files and an "example_to_labels.json" file that maps the file names to the labels. You can use both real and AI-generated examples.
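For example, a minimal sketch of preparing the "example_to_labels.json" mapping and submitting the fine-tuning job; the file names, labels, and all parameter names other than the job type are hypothetical:

```python
import json

from archive_client import submit_job  # hypothetical helper from section 1.2

# Map each dataset file name to its classification label.
example_to_labels = {
    "cat_001.jpg": "cat",
    "dog_001.jpg": "dog",
    "synthetic_cat_042.jpg": "cat",  # AI-generated examples are allowed too
}

with open("example_to_labels.json", "w") as f:
    json.dump(example_to_labels, f, indent=2)

submit_job(
    "fine_tune_vectorizer",
    content_type="Image",
    custom_vectorizer_name="my_image_vectorizer_v1",  # assumed parameter name
    dataset_files=list(example_to_labels),            # assumed parameter name
    labels_file="example_to_labels.json",             # assumed parameter name
)
```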

Similarity Calibration

You can calibrate the archive for similarity with the "calibrate_similarity" job type. Archives with calibrated similarity can:

- Detect similar content and reject redundant content during indexing.

- Cluster content based on the calibrated similarity (rather than by number of clusters alone).

You must provide a dataset of pairs of similar content files. If you are working with a Video, Sound, or Text archive, you can use the "extract_similarity_dataset" job type to extract the pairs of similar content files. You can use both real and AI-generated examples.
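A sketch of a similarity calibration call under the same assumptions (the similar_pairs parameter name is a guess):

```python
from archive_client import submit_job  # hypothetical helper from section 1.2

# Pairs of files known to show the same content; real or AI-generated.
similar_pairs = [
    ("goal_angle_a.mp4", "goal_angle_b.mp4"),
    ("interview_take1.mp4", "interview_take2.mp4"),
]

# For Video, Sound, or Text archives, "extract_similarity_dataset" can
# produce such pairs automatically instead.
submit_job(
    "calibrate_similarity",
    archive_name="product_videos",
    similar_pairs=similar_pairs,  # assumed parameter name
)
```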

Relevance Calibration

You can calibrate the archive for relevance with the "calibrate_relevance" job type.

Archives with calibrated relevance can:

- Detect irrelevant content and reject it during indexing.

You must first index a subset of guaranteed relevant content and then calibrate the archive for relevance. You can use both real and AI-generated examples.

After calibration, you can remove all content from the archive with the "remove_content" job type and then index the rest of the content with the relevance filter enabled.
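The full relevance workflow might look like this sketch (the remove_all flag and the file lists are assumptions):

```python
from archive_client import submit_job  # hypothetical helper from section 1.2

relevant_seed_files = ["known_good_1.mp4", "known_good_2.mp4"]  # hypothetical
all_files = ["clip_a.mp4", "clip_b.mp4", "clip_c.mp4"]          # hypothetical

# 1. Index only guaranteed-relevant content first.
submit_job("index", archive_name="product_videos", files=relevant_seed_files)

# 2. Calibrate relevance on what is now in the archive.
submit_job("calibrate_relevance", archive_name="product_videos")

# 3. Clear the archive, then index everything with the relevance filter on.
submit_job("remove_content", archive_name="product_videos", remove_all=True)  # assumed flag
submit_job("index", archive_name="product_videos", files=all_files,
           check_for_relevance=True)
```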

Translator Training

You can index content of one data type and search for it using examples of another data type.

Fine-tune a translator model with the "fine_tune_translator" job type.

You can name the custom translator model and it will be stored to be used in search jobs.

You can use both real and AI-generated examples.

1.4 Sampling and Trimming

The vectorizer models have specific size requirements for the input data.

You can use the sample_data job type to sample your data to fit the model's requirements for the various content types.

You can either select highlights to automatically trim the most important parts of the content or select time intervals to sample the content at regular intervals.

You can use the "trim_by_highlights" job type to extract trims of a desired length centered on highlights.
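A sketch of both operations (the mode names and the trim-length parameter are assumptions):

```python
from archive_client import submit_job  # hypothetical helper from section 1.2

# Sample a long video down to the model's input size using highlights...
submit_job(
    "sample_data",
    content_type="Video",
    files=["full_match.mp4"],
    mode="highlights",  # assumed: "highlights" or "time_intervals"
)

# ...or extract fixed-length trims centered on the detected highlights.
submit_job(
    "trim_by_highlights",
    files=["full_match.mp4"],
    trim_length_seconds=30,  # assumed parameter name
)
```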

1.5 Indexing

You can index the content in an archive with the "index" job type.

The indexing process will (a request sketch follows this list):

- Use a custom vectorizer name if specified. In this case the uncertainties will also be stored.

- Sample the content to the right size by highlights automatically.

- Vectorize the content if the inputs are not already vectorized on the client side.

- Check for redundancy against the archived content if the archive is calibrated for similarity and check_for_redundancy_against_archived is True.

- Check for redundancy within the batch if the archive is calibrated for similarity and check_for_redundancy_within_batch is True.

- If using a previous dataset to filter a current dataset, pass the previous files with "redundancy_check_only" in their file names. They won't be indexed but will be used for redundancy detection.

- Check for relevance if the archive is calibrated for relevance and the check_for_relevance is True.

- Reject content that is redundant or irrelevant.

- Return the indexed content ids and the rejected content names.
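Putting it together, an indexing request might look like this sketch (the response field names are assumptions):

```python
from archive_client import submit_job  # hypothetical helper from section 1.2

result = submit_job(
    "index",
    archive_name="product_videos",
    files=[
        "clip_001.mp4",
        "clip_002.mp4",
        "old_clip_redundancy_check_only.mp4",  # filtered against, never indexed
    ],
    custom_vectorizer_name="my_video_vectorizer_v1",  # optional, assumed name
    check_for_redundancy_against_archived=True,
    check_for_redundancy_within_batch=True,
    check_for_relevance=True,
)

# Persist the ids, e.g. in the "{archive_name}_vectorized_id" column (see 1.2).
indexed_ids = result["indexed_content_ids"]    # assumed response field
rejected = result["rejected_content_names"]    # assumed response field
```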

1.6 Reverse Search

You can reverse search by examples (without metadata) for content in an archive with the "search" job type.

You can use one or more real or AI-generated examples of the data types [Image, Video, Sound, Text, Point_Cloud].

You can use your own search to narrow down the ids first and then use "archive_content_ids_subset" to search only in the narrowed down ids.

You can restrict the ids by filtering on your database or on metadata retrieved with the "get_metadata" job type.

The search process will:

- Vectorize the examples if the inputs are not already vectorized on the client side.

- Translate vectors between data types if the archive data type is different from the input data type and you passed the translator_name.

- If multiple examples are provided, they will be averaged or weighted (if preferences are provided) for concept arithmetic and automatic focus on core concepts.

- Find the most similar content in the archive.

- Return the indexed content ids and the distances to the examples.

To determine the threshold where sorted results no longer align with your search criteria, you can create a loop that retrieves the content with the "get_url" job type and has it reviewed either by a human or by an API call to the multimodal LLM of your choice.
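A sketch of such a threshold-finding loop, with a stand-in review step (the response fields and the review function are assumptions):

```python
from archive_client import submit_job  # hypothetical helper from section 1.2

def review(url):
    """Placeholder: replace with a human check or a multimodal LLM API call."""
    return input(f"Does {url} match the query? [y/n] ").lower() == "y"

results = submit_job("search", archive_name="product_videos",
                     examples=["query_clip.mp4"])

# Walk the sorted results until the first non-matching item; its distance
# becomes the cutoff threshold for this search criterion.
threshold = None
for content_id, distance in zip(results["content_ids"], results["distances"]):
    url = submit_job("get_url", content_id=content_id)["url"]  # assumed field
    if not review(url):
        threshold = distance
        break
```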

1.7 Threshold Reverse Search Classifier

Perform reverse searches against calibration examples; since you know the ids of the indexed items that match each example, you can collect the similarity values and derive an average similarity threshold for the task (a sketch follows at the end of this section).

You can use one or more real or AI-generated examples of the data types [Image, Video, Sound, Text, Point_Cloud].

You can use your own search to narrow down the ids first and then use "archive_content_ids_subset" to search only in the narrowed down ids.

The search process will:

- Vectorize the examples if the inputs are not already vectorized on the client side.

- Translate vectors between data types if the archive data type is different from the input data type and you passed the translator_name.

- If multiple examples are provided, they will be averaged or weighted (if preferences are provided) for concept arithmetic and automatic focus on core concepts.

- Find the most similar content in the archive.

- Return the indexed content ids and the distances to the examples.

- Apply the calibrated similarity threshold to filter the search results. A single value above the threshold confirms that the example used in the search contains the right data and should be classified as matching.
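A sketch of deriving and applying such a threshold (the calibration mapping and the response fields are assumptions):

```python
from statistics import mean

from archive_client import submit_job  # hypothetical helper from section 1.2

# Calibration examples and the indexed item each one is known to match.
calibration = {"known_cat.jpg": "id_123", "known_dog.jpg": "id_456"}  # hypothetical ids

scores = []
for example_file, matching_id in calibration.items():
    results = submit_job("search", archive_name="animals", examples=[example_file])
    idx = results["content_ids"].index(matching_id)  # assumed response field
    scores.append(results["distances"][idx])         # assumed response field

threshold = mean(scores)

def classify(example_file):
    """A single result above the threshold counts as matching; flip the
    comparison if your scores are distances rather than similarities."""
    results = submit_job("search", archive_name="animals", examples=[example_file])
    return any(score >= threshold for score in results["distances"])
```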

1.8 Clustering and Highlighting

You can cluster the content in an archive with the "cluster_by_number_of_clusters" job type.

You can use your own search to narrow down the ids first and then use "archive_content_ids_subset" to search only in the narrowed down ids.

You can cluster the content based on a fixed number of clusters or based on cluster quality metrics (which discovers the optimal number of clusters).

You can also cluster the content based on the calibrated similarity with the "cluster_by_calibrated_similarity" job type if the archive is calibrated for similarity.

The job returns the clustered content ids and the average distance between the cluster elements and the cluster center.

The ids inside each cluster are sorted by distance to the cluster center, so you can use the first elements of each cluster as the most representative highlights of the cluster.
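A clustering sketch (the response shape is an assumption):

```python
from archive_client import submit_job  # hypothetical helper from section 1.2

clusters = submit_job(
    "cluster_by_number_of_clusters",
    archive_name="product_videos",
    nr_of_clusters=None,  # None lets the job discover the optimal number (see 1.10)
    sort_content_ids_by_distance_to_cluster_center=True,
)

# The first id in each cluster is its most representative highlight.
highlights = [c["content_ids"][0] for c in clusters["clusters"]]  # assumed fields
```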

1.9 Inliers and Outliers

You can sort the content in an archive by inliers and outliers with the "inliers_outliers" job type.

You can use your own search to narrow down the ids first and then use "archive_content_ids_subset" to search only in the narrowed down ids.

The inliers are the content that is most similar to the rest of the content in the archive.

The outliers are the content that is most different from the rest of the content in the archive.

The job returns the content ids sorted from inliers to outliers, along with the mean distance between each item and the rest of the content in the archive.

1.10 Data Balancing

You can get data balancing recommendations for the content in an archive with the "data_balance" job type.

The data balance process requires that you first run a clustering job type to get the clustered content ids.

You can cluster with the "cluster_by_number_of_clusters" job type passing the nr_of_clusters as None to discover the optimal number of clusters.

If the archive is calibrated for similarity you can also cluster with the "cluster_by_calibrated_similarity" job type.

You can improve the balance of the content by also performing the following operations:

- run "inliers_outliers" to get the inliers and outliers.

- run "search" against "essential" and "forbidden" examples to get the ids sorted by essential and forbidden examples.

The job returns the prioritized over-represented ids to remove and the prioritized under-represented ids to guide sourcing.
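An end-to-end balancing sketch under the same assumptions (the parameter and response names are guesses):

```python
from archive_client import submit_job  # hypothetical helper from section 1.2

# 1. Cluster first; the balance job consumes the clustered ids.
clusters = submit_job("cluster_by_number_of_clusters",
                      archive_name="product_videos", nr_of_clusters=None)

# 2. Ask for balancing recommendations.
balance = submit_job(
    "data_balance",
    archive_name="product_videos",
    clustered_content_ids=clusters["clusters"],  # assumed parameter/response shape
)

ids_to_remove = balance["over_represented_ids"]   # assumed field: prune these
ids_to_source = balance["under_represented_ids"]  # assumed field: source more like these
```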

1.11 PCA visualization

You can reduce the dimensionality of the content vectors with PCA and visualize them in any dimension with the "pca_vector_dim_reduction" job type.

You can use your own search to narrow down the ids first and then use "archive_content_ids_subset" to search only in the narrowed down ids.

The PCA process will:

- Reduce the dimensionality of the content vectors.

- Return the reduced vectors with the identifier of the content so you can visualize them in any software that supports embedding visualization (e.g. the TensorBoard Embedding Projector; an export sketch follows this list).
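A sketch of exporting the reduced vectors as TSV files that the TensorBoard Embedding Projector loads directly (the response shape is an assumption):

```python
from archive_client import submit_job  # hypothetical helper from section 1.2

reduced = submit_job(
    "pca_vector_dim_reduction",
    archive_name="product_videos",
    nr_of_dimensions=3,  # assumed parameter name
)

# vectors.tsv holds one embedding per row; metadata.tsv holds the matching ids.
with open("vectors.tsv", "w") as vecs, open("metadata.tsv", "w") as meta:
    for content_id, vector in reduced["id_to_vector"].items():  # assumed field
        vecs.write("\t".join(str(v) for v in vector) + "\n")
        meta.write(content_id + "\n")
```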

2. Common Machine Learning Workflows

2.1 Review a dataset efficiently

Follow Archive Creation and Calibrations for optional vectorizer fine-tuning and similarity calibration before indexing.

Follow Indexing.

Follow Clustering and Highlighting. Run the clustering job with "sort_content_ids_by_distance_to_cluster_center" enabled.

Review the dataset quickly by inspecting the first elements of each cluster as the dataset's high-impact highlights.

Follow Inliers and Outliers.

Follow PCA visualization.

2.2 Detect forbidden content in a dataset and confirm the presence of essential content

Follow Archive Creation and Calibrations for optional vectorizer fine-tuning and translator calibration before indexing.

Follow Indexing.

Follow Reverse Search, with the examples of the forbidden content to detect their presence in the dataset.

Follow Reverse Search, with the examples of the essential content to confirm their presence in the dataset.

2.3 Identify and remove redundant content from a dataset

Follow Archive Creation and Calibrations for optional vectorizer fine-tuning.

Follow Calibrations for similarity calibration.

Follow Indexing, enable the redundancy filtering. If using a previous dataset to filter the current dataset, pass the previous files with "redundancy_check_only" in the file name.

Take the results from the indexing job and remove the redundant content from the dataset.

2.4 Identify and remove irrelevant content from a dataset

Follow Archive Creation and Calibrations for optional vectorizer fine-tuning.

Follow Calibrations for relevance calibration.

Follow Indexing, enable the relevance filtering.

Take the results from the indexing job and remove the irrelevant content from the dataset.

2.5 Balance the dataset

Follow Calibrations for optional vectorizer fine-tuning, similarity, and translator calibration before indexing.

Follow Indexing.

Follow Clustering and Highlighting.

For better results, follow Reverse Search with the examples of the forbidden content to detect their presence in the dataset.

For better results, follow Reverse Search with the examples of the essential content to confirm their presence in the dataset.

For better results, follow Inliers and Outliers.

Follow Data Balancing.

2.6 Prioritize high-impact label review and propagate corrections automatically

Follow Archive Creation and Calibrations for optional vectorizer fine-tuning, similarity and translator calibration before indexing.

Follow Indexing.

Follow Clustering and Highlighting. Run the clustering job with "sort_content_ids_by_distance_to_cluster_center" enabled.

Send the high-impact highlights to the labeling team for review.

Follow Reverse Search, with the examples of the high-impact labels to identify the data examples to auto-propagate the label changes to.

2.7 Non-random, distribution-based split of training, validation and testing datasets for better generalization

Follow Archive Creation and Calibrations for optional vectorizer fine-tuning and similarity calibration before indexing.

Follow Indexing.

Follow Clustering and Highlighting. Run the clustering job with "sort_content_ids_by_distance_to_cluster_center" enabled.

Create a loop that draws an example from each cluster and iteratively fills the training, validation, and testing datasets according to the ratios you want.

This ensures that the training, validation, and testing datasets are representative of the entire dataset, rather than a random split, which could lead to overfitting.
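A sketch of such a split loop over hypothetical clustering output:

```python
import itertools

# Hypothetical clustering output (see 1.8): ids sorted by distance to center.
clusters = [["a1", "a2", "a3"], ["b1", "b2"], ["c1", "c2", "c3", "c4"]]

# Interleave the clusters so every split samples the whole distribution.
interleaved = [cid for group in itertools.zip_longest(*clusters)
               for cid in group if cid is not None]

ratios = [("train", 0.8), ("val", 0.1), ("test", 0.1)]
splits, start = {}, 0
for name, ratio in ratios:
    end = start + round(ratio * len(interleaved))
    splits[name] = interleaved[start:end]
    start = end
splits["train"] += interleaved[start:]  # rounding leftovers go to training
```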

2.8 Control Dataset Augmentation with Synthetic AI-Generated/Simulator Data

Follow Archive Creation and Calibrations for optional vectorizer fine-tuning.

Follow Calibrations for relevance calibration.

Set up a pipeline that takes each incoming synthetic data piece and attempts to index it into a relevance calibrated archive.

Follow Indexing, enable the relevance filtering.

Take the results from the indexing job and remove the irrelevant content from the synthetic dataset.

3. Common Content/Media Archive/Data Lake Workflows

3.1 Automate content pre-processing pipeline

Follow Sampling and Trimming. Use the "sample_data" job type to extract the intra-content highlight piece that is most representative of the content.

Automate trimming of lengthy content using the "trim_by_highlights" job type.

3.2 Automate content onboarding, filtering, indexing and clustering pipeline

Follow Archive Creation and Calibrations for optional vectorizer fine-tuning, similarity, and relevance calibration before indexing.

Follow Indexing, at the end of your existing pipeline add the "index" job type call with the relevance and redundancy filters enabled.

Store the returned indexed ids in a column in your database for easy access.

Follow Clustering and Highlighting, run the clustering job on the onboarding batch and create cluster_ids in your database for easy access.

Use the cluster ids to improve the efficiency of your browsing interface by showing the most representative content of each cluster.

3.3 Offer reverse search as a service in your frontend for discoverability

Implement an example ingestion feature in your front end to allow users to search for similar content in the archive.

Follow Reverse Search, display the content to the user based on the returned sorted ids and filter by your distance threshold.

3.4 Retrieval Augmented Generation (RAG)

Use a vectorized archive to retrieve the relevant content to serve as context when prompting generative AI with a limited context window.

Follow Archive Creation and Calibrations for optional vectorizer fine-tuning, similarity, and relevance calibration before indexing.

Follow Indexing, at the end of your existing pipeline add the "index" job type call with the relevance and redundancy filters enabled.

Store the returned indexed ids in a column in your database for easy access.

Optionally follow Clustering and Highlighting, run the clustering job on the onboarding batch and create cluster_ids in your database for easy access.

Take the prompt you are about to feed the AI and follow Reverse Search to retrieve sorted contents, then extract the top ones that fit the AI context window.

If you have performed clustering, you can instead follow Reverse Search but pass it only the ids of the cluster highlights to fill the context window with more diverse content.
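A retrieval sketch for the non-clustered case (the response fields and the context budget are assumptions):

```python
from archive_client import submit_job  # hypothetical helper from section 1.2

prompt = "Summarize our refund policy for enterprise customers."

# Reverse search with the prompt itself as the (Text) example.
results = submit_job("search", archive_name="policy_docs", examples=[prompt])

# Keep the top hits that fit the model's context window, then fetch them.
top_ids = results["content_ids"][:5]  # assumed field; 5 is an arbitrary budget
context_urls = [submit_job("get_url", content_id=cid)["url"]  # assumed field
                for cid in top_ids]
# Download these, prepend them to the prompt, and send it to the generator.
```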

3.5 Increase the productivity of your moderation process

For each onboarding batch run the following jobs:

Follow Clustering and Highlighting. Run the clustering job with "sort_content_ids_by_distance_to_cluster_center" enabled.

Send the high-impact highlights to the labeling team for review.

Follow Reverse Search, with the examples of the high-impact labels to identify the data examples to auto-propagate the label changes to, according to the learned similarity threshold.

3.6 Get archive distribution insights

Create a backoffice job that runs the following jobs on demand or periodically:

Follow Clustering and Highlighting. Run the clustering job with "sort_content_ids_by_distance_to_cluster_center" enabled.

Store the first elements of each cluster as the dataset's high-impact highlights for visual inspection.

Follow Inliers and Outliers and add the results to the report. If you collect human preference weights (-10 to 10), pass the ids and weights in a dictionary called "id_to_preference_weight" to get inliers and outliers based on your own preferences.
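A sketch of the preference-weighted call (the parameter is named in the text above; the ids and values are hypothetical):

```python
from archive_client import submit_job  # hypothetical helper from section 1.2

# Human preference weights from -10 (disliked) to 10 (liked).
id_to_preference_weight = {"id_123": 8, "id_456": -4, "id_789": 2}

report = submit_job(
    "inliers_outliers",
    archive_name="product_videos",
    id_to_preference_weight=id_to_preference_weight,
)
```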

Follow PCA visualization and add the mapped data to the report.

Plot the distribution of the inliers and outliers and the cluster centers to get insights on the archive distribution.

3.7 Increase the accuracy of your recommendation engines

Gather your user interaction data and run the following jobs:

Follow Reverse Search, with the threshold from the similarity calibration process.

Propagate the user's preferences to the user interaction matrix by similarity.

Retrain your collaborative filter by feeding it the populated user interaction matrix.
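A propagation sketch under the same assumptions (treating indexed ids as search examples, the response fields, and the threshold scale are all guesses):

```python
from archive_client import submit_job  # hypothetical helper from section 1.2

SIMILARITY_THRESHOLD = 0.82  # from the similarity calibration; assumed scale

# user_interactions: user_id -> {content_id: rating}
user_interactions = {"user_1": {"id_123": 5.0}}

for user_id, ratings in user_interactions.items():
    for content_id, rating in list(ratings.items()):
        # Reverse search using the rated content as the example (see 1.6).
        results = submit_job("search", archive_name="catalog",
                             examples=[content_id])  # assumed: ids usable as examples
        for similar_id, score in zip(results["content_ids"],
                                     results["distances"]):  # assumed fields
            if score >= SIMILARITY_THRESHOLD and similar_id not in ratings:
                ratings[similar_id] = rating  # propagate the preference

# Retrain the collaborative filter on the densified user interaction matrix.
```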