data_balance

The job recommends adjustments to your dataset to give it balanced representation. It identifies over-represented and under-represented items based on their diversity and their similarity to essential and forbidden examples, which supports a more evenly distributed dataset for improved model training, efficient archiving, and relevant discovery.

Required Account Privileges: "read"

Request JSON ["inputs"]:

   "clustered_content_ids_sorted_by_decreasing_diversity_with_contents_sorted_by_distance_to_centroid":
      list of lists of ints
      null NOT allowed
      A list of lists of integers, where each sublist holds the content IDs of one cluster. Clusters are ordered by decreasing diversity, and the IDs within each cluster are sorted by their distance to the cluster centroid.

   "ids_sorted_from_inliers_to_outliers":
      list of ints
      null allowed
      An optional list of integers representing content IDs sorted from inliers to outliers.

   "ids_sorted_by_essential_examples":
      list of ints
      null allowed
      An optional list of integers representing content IDs ranked by their similarity to essential examples.

   "ids_sorted_by_forbidden_examples":
      list of ints
      null allowed
      An optional list of integers representing content IDs ranked by their similarity to forbidden examples.
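
The type and null rules above can be checked on the client before a request is sent. The following is a minimal sketch, not part of the service: the helper name build_data_balance_inputs and the sample IDs are hypothetical, and it assumes the fields sit under an "inputs" key as the section heading suggests. Any other keys the full request may need (job name, credentials) are not covered here.

```python
import json
from typing import Optional

# Key name copied verbatim from the field list above.
CLUSTERS_KEY = (
    "clustered_content_ids_sorted_by_decreasing_diversity"
    "_with_contents_sorted_by_distance_to_centroid"
)


def _is_int_list(value) -> bool:
    """True if value is a list whose items are all ints (bools excluded)."""
    return isinstance(value, list) and all(
        isinstance(v, int) and not isinstance(v, bool) for v in value
    )


def build_data_balance_inputs(
    clusters,
    inliers_to_outliers: Optional[list] = None,
    essential: Optional[list] = None,
    forbidden: Optional[list] = None,
) -> dict:
    """Hypothetical helper: validate the fields and return the "inputs" dict."""
    # The clustered list is the only field where null is NOT allowed.
    if not isinstance(clusters, list) or not all(_is_int_list(c) for c in clusters):
        raise ValueError("clusters must be a non-null list of lists of ints")

    optional = {
        "ids_sorted_from_inliers_to_outliers": inliers_to_outliers,
        "ids_sorted_by_essential_examples": essential,
        "ids_sorted_by_forbidden_examples": forbidden,
    }
    for key, value in optional.items():
        # None is serialized as JSON null, which these fields accept.
        if value is not None and not _is_int_list(value):
            raise ValueError(f"{key} must be a list of ints or null")

    return {CLUSTERS_KEY: clusters, **optional}


# Made-up IDs, for illustration only.
inputs = build_data_balance_inputs(
    clusters=[[11, 12, 13], [21, 22], [31]],
    essential=[12, 21],
)
print(json.dumps({"inputs": inputs}, indent=2))
```

Optional fields that are left out are sent as explicit nulls in this sketch. Whether the service also accepts them being omitted entirely is not stated above.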

Response JSON ["results"]:

   "prioritized_over_represented_ids_to_remove":
      list of ints

   "prioritized_under_represented_ids_to_source":
      list of ints
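
One way to consume the two result lists is sketched below. It is an illustration under stated assumptions: the sample response values are made up, the helper name read_data_balance_results is hypothetical, and it assumes each list is ordered with the highest-priority ID first, which the field names suggest but the reference does not state.

```python
def read_data_balance_results(response: dict) -> tuple:
    """Hypothetical helper: pull both ID lists out of the "results" object."""
    results = response["results"]
    to_remove = list(results["prioritized_over_represented_ids_to_remove"])
    to_source = list(results["prioritized_under_represented_ids_to_source"])
    return to_remove, to_source


# Made-up response body, for illustration only.
sample_response = {
    "results": {
        "prioritized_over_represented_ids_to_remove": [13, 12],
        "prioritized_under_represented_ids_to_source": [31],
    }
}

to_remove, to_source = read_data_balance_results(sample_response)

# Apply only the first removal (assumption: earlier entries have higher priority).
removal_budget = 1
drop = set(to_remove[:removal_budget])
dataset_ids = [11, 12, 13, 21, 22, 31]
kept_ids = [cid for cid in dataset_ids if cid not in drop]

print("kept:", kept_ids)
print("find more items similar to:", to_source)
```

Capping removals with a budget keeps the change to the dataset incremental, so the job can be rerun on the trimmed set before anything else is dropped.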