Project Summary
A program to help me manage the 80,000+ images I have from my OMATA One project. It uses machine learning to analyze and categorize images based on semantics and heuristics, making it easier to find and utilize specific images within the dataset. Developed in collaboration with GitHub Copilot, it creates a sparse GIF animation for sequences of images (and videos), and generates a well-structured JSON data file with paths to all of the images for easy reference.
Project Semantic Tags
AI, IMAGE CATALOGING, ML, PROJECT MANAGEMENT, PYTHON, PYTORCH, SOFTWARE
The Outcomes
The Cluster Sequence Cataloger has significantly improved my ability to manage and locate images within the vast dataset. By utilizing machine learning techniques, I can now quickly categorize and retrieve images semantically, saving tons of time and effort and keeping important work from disappearing into the ether. The generated JSON data file allows for easy integration into various workflows, visual catalyzers, and infinite canvases, all while enhancing productivity and streamlining the cataloging process. It also serves as a key tool for the new OMATA owners, letting them make use of existing assets without having to reproduce them.
As you probably know, I founded a successful hardware product company back at the very end of 2014. I sold it in 2022. I had a lot of fun building and growing the brand, figuring out how to make a complicated product like the OMATA One. And I also had a lot of fun creating pretty much all the content for the company. I ended up learning 3D rendering and animation, which I used to show aspects of the product that would otherwise be super difficult to show.
The result is I seem to have nearly 90,000 images and over 600 videos, near as I can tell from a brute-force walk through the NAS.
It had always been on my mind to create some kind of catalog for all of that that would make sense. In part that was on my mind because it had become incredibly easy to have an image in mind — but have no idea precisely where it may be, even with my somewhat reasonable file and folder hierarchy. I knew that having a visual catalog was really what I needed as I also didn’t want any of the work I had done to disappear or get inextricably lost.
This all started when I determined that I wanted to learn how to make content for the company internally, or at a minimum direct any videos and the like. But I also knew that I would need renders and animations to share on social, as well as put up on the web and so forth. Originally, my terrible co-founder, rather than helping or learning, hired other people using money we barely had. And I’ve always been much more inclined to do it myself. So, I got a 45-day license for Cinema 4D and started learning to use it. My friend James encouraged me and very shortly after I started, I could get reasonable renders and just kept building from there.
This code was put together in a few hours. While I consider myself a competent programmer (I mean, I did write the code for the OMATA iOS App), the outlines of a project are often the easiest part for me, while the implementation takes one into a land of sordid APIs, all manner of gotchas, and tools that are not always the best at debugging. Now, though, with things like GitHub Copilot, walking through these various minefields during implementation is less burdensome, although it does help if one knows how to code. I don’t think you can just bang keys and hope.
Python Code

import os
import argparse
import json
import datetime
import time
import re
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import DBSCAN
from difflib import SequenceMatcher
from collections import defaultdict, Counter
from pathlib import Path
import cv2
import tempfile

try:
    # Try to import sentence-transformers for embedding-based naming
    from sentence_transformers import SentenceTransformer
    HAVE_TRANSFORMERS = True
except ImportError:
    HAVE_TRANSFORMERS = False
    print("Note: sentence-transformers not found. Install with 'pip install sentence-transformers' for enhanced sequence naming.")

try:
    from PIL import Image
    HAVE_PIL = True
except ImportError:
    HAVE_PIL = False
    print("Note: PIL/Pillow not found. Install with 'pip install Pillow' for sequence preview GIFs.")

# Define common image file extensions
IMAGE_EXTENSIONS = {
    '.jpg', '.jpeg', '.png', '.tiff', '.tif', '.gif', '.bmp', '.webp',
    '.heic', '.heif', '.raw', '.svg', '.psd', '.cr2', '.crw', '.nef',
    '.arw', '.dng', '.orf', '.c4d',
    '.mp4', '.mov', '.mpg', '.mpeg', '.avi'  # Add video formats
}

def find_image_files(directory):
    """
    Recursively find all image files in the given directory.

    Args:
        directory: Directory path to search

    Returns:
        List of image file paths
    """
    image_files = []

    for root, _, files in os.walk(directory):
        for file in files:
            ext = os.path.splitext(file.lower())[1]
            if ext in IMAGE_EXTENSIONS:
                image_files.append(os.path.join(root, file))

    return image_files

def get_file_metadata(filepath):
    """Get metadata for a file including size and timestamps."""
    stats = os.stat(filepath)
    return {
        "filename": os.path.basename(filepath),
        "filepath": filepath,
        "filesize": stats.st_size,
        # Note: st_ctime is the creation time on Windows, but on Unix it is
        # the time of the last metadata change.
        "created": datetime.datetime.fromtimestamp(stats.st_ctime).isoformat(),
        "modified": datetime.datetime.fromtimestamp(stats.st_mtime).isoformat()
    }

def group_files_by_directory(files):
    """Group files by their parent directory."""
    directory_groups = defaultdict(list)

    for file in files:
        parent_dir = os.path.dirname(file)
        directory_groups[parent_dir].append(file)

    return directory_groups

def cluster_filenames_tfidf(filenames, eps=0.7, min_samples=3):
    """
    Cluster filenames using character-level TF-IDF and DBSCAN.
    """
    # Use only the basename of the files for clustering
    basenames = [os.path.basename(f) for f in filenames]

    # Skip clustering if there are too few files
    if len(basenames) < min_samples:
        return {}, filenames

    vectorizer = TfidfVectorizer(analyzer='char_wb', ngram_range=(2, 4))
    vectors = vectorizer.fit_transform(basenames)

    clustering = DBSCAN(eps=eps, min_samples=min_samples, metric='cosine')
    labels = clustering.fit_predict(vectors)

    clusters = {}
    unclustered = []

    for label, filename in zip(labels, filenames):
        if label == -1:
            # DBSCAN labels noise points as -1
            unclustered.append(filename)
            continue
        clusters.setdefault(label, []).append(filename)

    return clusters, unclustered

def similarity_ratio(str1, str2):
    """Calculate string similarity ratio using SequenceMatcher."""
    return SequenceMatcher(None, str1, str2).ratio()


def identify_sequences_by_similarity(clusters):
    """
    Identify potential sequences within clusters based on similarity patterns.
    """
    sequences = {}
    regular_clusters = {}

    for cluster_id, files in clusters.items():
        if len(files) < 3:  # Need at least 3 files to form a meaningful sequence
            regular_clusters[cluster_id] = files
            continue

        # Sort files by basename - this often naturally orders sequences
        files_sorted = sorted(files, key=os.path.basename)
        basenames = [os.path.basename(f) for f in files_sorted]

        # Calculate similarity between consecutive files
        consecutive_similarities = []
        for i in range(1, len(basenames)):
            sim = similarity_ratio(basenames[i-1], basenames[i])
            consecutive_similarities.append(sim)

        # Check if similarities are consistent (a sign of sequences)
        if len(consecutive_similarities) >= 2:
            similarities_array = np.array(consecutive_similarities)

            # Measure consistency: standard deviation should be low for sequences
            similarity_std = np.std(similarities_array)
            similarity_mean = np.mean(similarities_array)

            # Sequences typically have high mean similarity and low standard deviation
            is_likely_sequence = (similarity_mean > 0.7 and similarity_std < 0.1)

            if is_likely_sequence:
                sequences[f"seq_{cluster_id}"] = files_sorted
                continue

        # If not identified as a sequence, keep as regular cluster
        regular_clusters[cluster_id] = files

    return sequences, regular_clusters

def derive_sequence_name(filenames):
    """
    Derive a meaningful semantic name from sequence filenames.
    Returns a descriptive name based on common patterns in the filenames.
    """
    if not filenames:
        return "unnamed_sequence"

    # Extract basenames without extensions and directory paths
    basenames = [os.path.splitext(os.path.basename(f))[0] for f in filenames]

    # Remove numeric parts (typically sequence numbers)
    non_numeric_parts = []
    for name in basenames:
        # Replace digits with spaces
        cleaned = re.sub(r'[0-9]+', ' ', name)
        # Remove common separators and replace with spaces
        cleaned = re.sub(r'[_\-\.]', ' ', cleaned)
        # Normalize spaces
        cleaned = re.sub(r'\s+', ' ', cleaned).strip()

        if cleaned:
            non_numeric_parts.append(cleaned)

    if not non_numeric_parts:
        return "unnamed_sequence"

    # Find common words across filenames
    word_lists = [name.split() for name in non_numeric_parts]

    # Flatten the list of words and count frequencies
    all_words = [word for word_list in word_lists for word in word_list if len(word) > 2]
    word_counts = Counter(all_words)

    # Get most common words that appear in most filenames
    common_words = [word for word, count in word_counts.most_common(3)
                    if count >= len(filenames) * 0.8]  # Word appears in at least 80% of files

    if common_words:
        # Join common words to form a descriptive name
        sequence_name = "_".join(common_words)
    else:
        # If no common words, try finding the longest common substring
        longest_common = non_numeric_parts[0]
        for name in non_numeric_parts[1:]:
            matcher = SequenceMatcher(None, longest_common, name)
            match = matcher.find_longest_match(0, len(longest_common), 0, len(name))
            if match.size > 0:
                longest_common = longest_common[match.a:match.a + match.size].strip()
            else:
                longest_common = ""
                break

        if longest_common and len(longest_common) > 3:
            sequence_name = longest_common.replace(" ", "_")
        else:
            # Fallback: use the most frequent word if available
            if word_counts:
                sequence_name = word_counts.most_common(1)[0][0]
            else:
                # Last resort: use part of the first filename
                first_name = os.path.splitext(os.path.basename(filenames[0]))[0]
                alpha_only = re.sub(r'[^a-zA-Z]', '', first_name)
                sequence_name = alpha_only[:15] if alpha_only else "unnamed_sequence"

    # Clean up final name: remove any remaining special characters and normalize
    sequence_name = re.sub(r'[^a-zA-Z0-9_]', '', sequence_name)
    sequence_name = re.sub(r'_+', '_', sequence_name)

    return sequence_name.lower() if sequence_name else "unnamed_sequence"

def derive_sequence_name_ai(filenames, directory=None):
    """
    Generate a meaningful sequence name using AI techniques.

    Args:
        filenames: List of files in the sequence
        directory: Directory containing the files (optional)

    Returns:
        A human-readable name for the sequence
    """
    if not filenames:
        return "unnamed_sequence"

    # 1. Extract path components for context
    first_file = filenames[0]
    path_parts = Path(os.path.dirname(first_file)).parts
    significant_path_parts = [p for p in path_parts[-3:] if not p.startswith('.')]

    # 2. Extract filename components (without numbers)
    basenames = [os.path.splitext(os.path.basename(f))[0] for f in filenames]
    cleaned_names = []
    for name in basenames:
        # Replace digits with spaces
        cleaned = re.sub(r'[0-9]+', ' ', name)
        # Remove common separators and replace with spaces
        cleaned = re.sub(r'[_\-\.]', ' ', cleaned)
        # Normalize spaces
        cleaned = re.sub(r'\s+', ' ', cleaned).strip()
        if cleaned:
            cleaned_names.append(cleaned)

    # 3. Extract common words and tokens
    all_tokens = []

    # Add significant path tokens
    for part in significant_path_parts:
        tokens = re.split(r'[_\-\s]', part)
        all_tokens.extend([t for t in tokens if len(t) > 2])

    # Add filename tokens
    for name in cleaned_names:
        words = name.split()
        all_tokens.extend([w for w in words if len(w) > 2])

    # 4. Clean and normalize tokens
    cleaned_tokens = []
    for token in all_tokens:
        # Keep only alphabetic characters
        clean_token = re.sub(r'[^a-zA-Z]', '', token).lower()
        if clean_token and len(clean_token) > 2:
            cleaned_tokens.append(clean_token)

    # 5. Use AI approach if available
    if HAVE_TRANSFORMERS:
        try:
            return derive_name_with_embeddings(cleaned_tokens, cleaned_names)
        except Exception as e:
            print(f"Error using transformer model: {e}")
            # Fall back to simpler method

    # 6. Fall back to frequency-based approach
    token_counts = Counter(cleaned_tokens)

    # Extract most common meaningful tokens
    common_tokens = [token for token, count in token_counts.most_common(3)
                     if count >= max(2, len(filenames) * 0.2)]

    if common_tokens:
        return "_".join(common_tokens)

    # Last resort - use existing method
    return derive_sequence_name(filenames)

def derive_name_with_embeddings(tokens, cleaned_names, model_name="all-MiniLM-L6-v2"):
    """Use embeddings to find the most representative terms."""
    # Skip if no tokens or names
    if not tokens or not cleaned_names:
        return "unnamed_sequence"

    # Deduplicate tokens first
    unique_tokens = list(set(tokens))

    # Load the model. The weights are cached on disk after the first download,
    # but the model object itself is rebuilt on every call to this function.
    model = SentenceTransformer(model_name)

    # Combine tokens and full names for comparison
    all_terms = unique_tokens + cleaned_names

    if len(all_terms) < 2:
        return all_terms[0] if all_terms else "unnamed_sequence"

    # Generate embeddings
    embeddings = model.encode(all_terms)

    # Find centroid (center of mass of all embeddings)
    centroid = np.mean(embeddings, axis=0)

    # Calculate cosine similarity of each term to the centroid
    similarities = np.dot(embeddings, centroid) / (
        np.linalg.norm(embeddings, axis=1) * np.linalg.norm(centroid)
    )

    # Split indices into tokens (first) and full names (after)
    token_indices = list(range(len(unique_tokens)))  # Use unique_tokens length
    name_indices = list(range(len(unique_tokens), len(all_terms)))

    # Sort tokens by similarity to centroid
    sorted_token_indices = sorted(token_indices, key=lambda i: similarities[i], reverse=True)

    # Get most representative tokens (up to 3)
    top_tokens = []
    for idx in sorted_token_indices:
        token = all_terms[idx]
        if token not in top_tokens and len(top_tokens) < 3:
            top_tokens.append(token)

    if not top_tokens:
        # Try names if tokens didn't work
        sorted_name_indices = sorted(name_indices, key=lambda i: similarities[i], reverse=True)
        if sorted_name_indices:
            most_central_name = all_terms[sorted_name_indices[0]]
            words = most_central_name.split()
            top_tokens = [w for w in words if len(w) > 2][:3]

    if top_tokens:
        return "_".join(top_tokens)

    return "unnamed_sequence"

def extract_frames_from_video(video_path, max_frames=60):
    """
    Extract frames from a video file for GIF creation.

    Args:
        video_path: Path to the video file
        max_frames: Maximum number of frames to extract

    Returns:
        List of PIL Image objects
    """
    if not HAVE_PIL:
        print(f"Cannot extract frames from {video_path}: PIL/Pillow library not available")
        return []

    try:
        # Open the video file
        cap = cv2.VideoCapture(video_path)

        # Check if video opened successfully
        if not cap.isOpened():
            print(f"Error: Could not open video {video_path}")
            return []

        # Get video properties
        frame_count = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))

        # Calculate step size to evenly sample frames
        step = max(1, frame_count // max_frames)

        # Extract frames
        frames = []
        for i in range(0, frame_count, step):
            if len(frames) >= max_frames:
                break

            # Set the frame position
            cap.set(cv2.CAP_PROP_POS_FRAMES, i)
            ret, frame = cap.read()

            if ret:
                # Convert BGR to RGB (PIL uses RGB)
                frame_rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)

                # Convert to PIL Image
                pil_img = Image.fromarray(frame_rgb)

                # Resize if needed
                pil_img.thumbnail((300, 300))

                frames.append(pil_img)

        # Release the video capture object
        cap.release()

        return frames

    except Exception as e:
        print(f"Error extracting frames from {video_path}: {e}")
        return []

def create_preview_gif(file_path, output_name, sequence_id):
    """
    Create a preview GIF for a file (either a video or part of an image sequence).

    Args:
        file_path: Path to the file (video or image)
        output_name: Base name for the output GIF
        sequence_id: Unique ID of the sequence

    Returns:
        Path to the created GIF file or None if creation failed
    """
    if not HAVE_PIL:
        print("Cannot create GIF: PIL/Pillow library not available")
        return None

    # Create output directory
    unique_dir_name = f"{output_name}_{sequence_id}"
    output_dir = os.path.join("seq_gifs", unique_dir_name)
    os.makedirs(output_dir, exist_ok=True)

    # Determine file type
    ext = os.path.splitext(file_path.lower())[1]
    is_video = ext in {'.mp4', '.mov', '.mpg', '.mpeg', '.avi'}

    frames = []
    try:
        if is_video:
            # For video files, extract frames
            frames = extract_frames_from_video(file_path)
        else:
            # For single images, just load the image
            img = Image.open(file_path)
            img.thumbnail((300, 300))

            # Convert to RGB if necessary
            if img.mode == 'RGBA':
                img = img.convert('RGB')

            frames = [img]

        if not frames:
            print(f"No frames generated for {file_path}")
            return None

        # Save as GIF
        output_path = os.path.join(output_dir, f"{output_name}_{sequence_id}.gif")

        # Save with a reasonable frame duration (100ms)
        frames[0].save(
            output_path,
            format='GIF',
            append_images=frames[1:],
            save_all=True,
            duration=100,  # milliseconds per frame
            loop=0  # 0 means loop forever
        )
        print(f"Saved preview GIF to {output_path}")
        return output_path
    except Exception as e:
        print(f"Error creating GIF from {file_path}: {e}")
        return None

def create_sequence_preview_gif(sequence_files, sequence_name, sequence_id,
                                max_frames=60, min_frames=20, max_dimension=300):
    """
    Create a preview GIF for a sequence with a large number of files.

    Args:
        sequence_files: List of image files in the sequence
        sequence_name: Semantic name of the sequence
        sequence_id: Unique ID of the sequence
        max_frames: Maximum number of frames to include in the GIF
        min_frames: Minimum number of frames to include in the GIF
        max_dimension: Maximum width or height in pixels

    Returns:
        Path to the created GIF file or None if creation failed
    """
    if not HAVE_PIL:
        print(f"Cannot create GIF for {sequence_name}: PIL/Pillow library not available")
        return None

    # Create a unique directory name that includes both semantic name and sequence ID
    unique_dir_name = f"{sequence_name}_{sequence_id}"

    # Create output directory if it doesn't exist
    output_dir = os.path.join("seq_gifs", unique_dir_name)
    os.makedirs(output_dir, exist_ok=True)

    # Determine number of frames to use
    num_files = len(sequence_files)
    num_frames = min(max_frames, max(min_frames, num_files))

    # Sample files evenly across the sequence
    if num_frames >= num_files:
        selected_files = sequence_files
    else:
        indices = np.linspace(0, num_files - 1, num_frames, dtype=int)
        selected_files = [sequence_files[i] for i in indices]

    print(f"Creating GIF for {sequence_name} using {len(selected_files)} frames")

    # Process images and create GIF
    frames = []
    for file_path in selected_files:
        try:
            ext = os.path.splitext(file_path.lower())[1]
            is_video = ext in {'.mp4', '.mov', '.mpg', '.mpeg', '.avi'}

            if is_video:
                video_frames = extract_frames_from_video(file_path, max_frames=max_frames)
                frames.extend(video_frames)
            else:
                img = Image.open(file_path)

                # Resize while maintaining aspect ratio
                img.thumbnail((max_dimension, max_dimension))

                # Convert to RGB if necessary (for formats like PNG with alpha)
                if img.mode == 'RGBA':
                    img = img.convert('RGB')

                frames.append(img)
        except Exception as e:
            print(f"Error processing {file_path}: {e}")

    if not frames:
        print(f"No valid frames found for {sequence_name}")
        return None

    # Save as GIF with the unique name
    output_path = os.path.join(output_dir, f"{sequence_name}_{sequence_id}.gif")

    try:
        # Save with a reasonable frame duration (100ms)
        frames[0].save(
            output_path,
            format='GIF',
            append_images=frames[1:],
            save_all=True,
            duration=100,  # milliseconds per frame
            loop=0  # 0 means loop forever
        )
        print(f"Saved preview GIF to {output_path}")
        return output_path
    except Exception as e:
        print(f"Error saving GIF for {sequence_name}: {e}")
        return None

def main():
    parser = argparse.ArgumentParser(description='Cluster image files in a directory based on filename similarity')
    parser.add_argument('--directory', '-d', type=str, default='.',
                        help='Directory to search for image files (default: current directory)')
    parser.add_argument('--output', '-o', type=str,
                        help='Output JSON file path (default: auto-generated)')
    args = parser.parse_args()

    # Start tracking overall execution time
    start_total_time = time.time()

    directory = os.path.abspath(args.directory)
    print(f"Searching for files in: {directory}")

    # Track file discovery time
    start_discovery_time = time.time()
    image_files = find_image_files(directory)
    discovery_time = time.time() - start_discovery_time
    print(f"Found {len(image_files)} files in {discovery_time:.2f} seconds")

    if not image_files:
        print("No files found.")
        return

    # Create GIFs for video files
    video_gifs = {}
    video_files = []

    for filepath in image_files:
        ext = os.path.splitext(filepath.lower())[1]
        if ext in {'.mp4', '.mov', '.mpg', '.mpeg', '.avi'}:
            video_files.append(filepath)

    print(f"Found {len(video_files)} video files for GIF creation")

    for video_path in video_files:
        # Create a safe filename version of the video name
        basename = os.path.basename(video_path)
        name_part = os.path.splitext(basename)[0]
        safe_name = re.sub(r'[^a-zA-Z0-9_]', '_', name_part)

        # Use a hash of the full path as the sequence ID.
        # Note: by default Python salts str hashes per process, so this ID
        # changes between runs, and the modulo means collisions are possible.
        sequence_id = f"video_{abs(hash(video_path)) % 10000}"

        # Create GIF
        gif_path = create_preview_gif(video_path, safe_name, sequence_id)

        if gif_path:
            video_gifs[video_path] = gif_path

    # Track directory grouping time
    start_grouping_time = time.time()
    directory_groups = group_files_by_directory(image_files)
    grouping_time = time.time() - start_grouping_time
    print(f"Found {len(directory_groups)} directories with files in {grouping_time:.2f} seconds")

    # Track clustering and sequence identification time
    start_clustering_time = time.time()
    all_sequences = {}
    all_clusters = {}
    all_unclustered = []

    for dir_path, files in directory_groups.items():
        print(f"Processing directory: {dir_path} ({len(files)} files)")

        # Cluster files in this directory
        dir_clusters, dir_unclustered = cluster_filenames_tfidf(files)

        # Identify sequences in this directory's clusters
        dir_sequences, dir_regular_clusters = identify_sequences_by_similarity(dir_clusters)

        # Add unique identifiers to incorporate directory information.
        # Note: only the last path component is used, so two directories with
        # the same name produce the same IDs and overwrite each other's entries.
        dir_id = os.path.basename(dir_path)

        # Add to global collections with unique IDs
        for seq_id, seq_files in dir_sequences.items():
            all_sequences[f"{dir_id}_{seq_id}"] = seq_files

        for cluster_id, cluster_files in dir_regular_clusters.items():
            all_clusters[f"{dir_id}_{cluster_id}"] = cluster_files

        all_unclustered.extend(dir_unclustered)

    clustering_time = time.time() - start_clustering_time
    print(f"Clustering and sequence identification completed in {clustering_time:.2f} seconds")

    # Display summary results to console
    print("\n=== SUMMARY ===")
    print(f"Total sequences: {len(all_sequences)}")
    print(f"Total non-sequence clusters: {len(all_clusters)}")

    total_in_sequences = sum(len(s) for s in all_sequences.values())
    total_in_clusters = sum(len(c) for c in all_clusters.values())
    print(f"Files in sequences: {total_in_sequences}")
    print(f"Files in other clusters: {total_in_clusters}")
    print(f"Unclustered files: {len(all_unclustered)}")

    # Semantic name generation for sequences
    start_naming_time = time.time()
    sequence_names = {}
    for seq_id, files in all_sequences.items():
        sequence_names[seq_id] = derive_sequence_name_ai(files)
        print(f"Derived sequence name: {sequence_names[seq_id]} for {len(files)} files")
    naming_time = time.time() - start_naming_time
    print(f"Sequence naming completed in {naming_time:.2f} seconds")

    # GIF creation as a separate step
    start_gif_time = time.time()
    preview_gifs = {}

    for seq_id, files in all_sequences.items():
        if len(files) > 30 and HAVE_PIL:
            # Create a safe filename version of the sequence name
            semantic_name = sequence_names[seq_id]
            safe_name = re.sub(r'[^a-zA-Z0-9_]', '_', semantic_name)

            # Pass both the semantic name and the sequence ID
            gif_path = create_sequence_preview_gif(
                files,
                safe_name,
                seq_id.replace("/", "_").replace("\\", "_")
            )

            if gif_path:
                preview_gifs[seq_id] = gif_path

    gif_time = time.time() - start_gif_time
    print(f"GIF creation completed in {gif_time:.2f} seconds")

    # Now prepare JSON (without GIF creation)
    start_json_time = time.time()

    # Create result dictionary
    json_data = {
        "metadata": {
            "root_path": directory,
            "scan_date": datetime.datetime.now().isoformat(),
            "total_images": len(image_files),
            "total_clusters": len(all_clusters),
            "total_sequences": len(all_sequences),
            "unclustered_count": len(all_unclustered),
            "timing": {
                "total_seconds": 0,  # Will be updated at the end
                "file_discovery_seconds": discovery_time,
                "directory_grouping_seconds": grouping_time,
                "clustering_seconds": clustering_time,
                "sequence_naming_seconds": naming_time,
                "gif_creation_seconds": gif_time,
                "json_preparation_seconds": 0  # Will be updated after JSON preparation
            }
        },
        "sequences": [],
        "clusters": [],
        "unclustered": []
    }

    # Add sequences with AI-generated semantic names
    for seq_id, files in all_sequences.items():
        seq_data = {
            "id": seq_id,
            "name": sequence_names[seq_id],
            "directory": os.path.dirname(files[0]) if files else "",
            "count": len(files),
            "files": [get_file_metadata(f) for f in files]
        }

        # Add preview GIF path if available
        if seq_id in preview_gifs:
            seq_data["preview_gif"] = preview_gifs[seq_id]

        json_data["sequences"].append(seq_data)

    # Add clusters
    for cluster_id, files in all_clusters.items():
        cluster_data = {
            "id": str(cluster_id),
            "directory": os.path.dirname(files[0]) if files else "",
            "count": len(files),
            "files": [get_file_metadata(f) for f in files]
        }
        json_data["clusters"].append(cluster_data)

    # Add unclustered files
    json_data["unclustered"] = [get_file_metadata(f) for f in all_unclustered]

    # Add video GIFs to the appropriate sections
    for filepath, gif_path in video_gifs.items():
        video_data = {
            "id": f"video_{abs(hash(filepath)) % 10000}",
            "name": os.path.splitext(os.path.basename(filepath))[0],
            "directory": os.path.dirname(filepath),
            "filepath": filepath,
            "preview_gif": gif_path,
            "metadata": get_file_metadata(filepath)
        }

        if "videos" not in json_data:
            json_data["videos"] = []
        json_data["videos"].append(video_data)

    json_time = time.time() - start_json_time

    # Update JSON preparation time in timing data
    json_data["metadata"]["timing"]["json_preparation_seconds"] = json_time

    # Update total time
    total_time = time.time() - start_total_time
    json_data["metadata"]["timing"]["total_seconds"] = total_time

    print(f"JSON preparation completed in {json_time:.2f} seconds")

    # Generate output filename with timestamp
    timestamp = datetime.datetime.now().strftime("%m%d%Y_%H%M%S")
    output_file = args.output if args.output else f"files_clustering_{timestamp}.json"

    # Save JSON output
    with open(output_file, 'w', encoding='utf-8') as f:
        json.dump(json_data, f, indent=2)

    print(f"\nResults saved to: {output_file}")
    print(f"Total execution time: {total_time:.2f} seconds")


if __name__ == "__main__":
    main()

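To get a feel for how the sequence heuristic in identify_sequences_by_similarity behaves, here is a minimal standard-library sketch run on made-up filenames. The helper name looks_like_sequence and the sample paths are assumptions for illustration, not part of the script above; statistics.pstdev stands in for np.std, since both compute the population standard deviation.

```python
import os
from difflib import SequenceMatcher
from statistics import mean, pstdev


def looks_like_sequence(paths, min_mean=0.7, max_std=0.1):
    """Hypothetical helper that mirrors the thresholds used in the script."""
    names = sorted(os.path.basename(p) for p in paths)
    if len(names) < 3:
        # Fewer than 3 files never counts as a sequence
        return False
    # Similarity between each pair of neighbouring filenames
    sims = [
        SequenceMatcher(None, a, b).ratio()
        for a, b in zip(names, names[1:])
    ]
    # High average similarity with little variation suggests numbered frames
    return mean(sims) > min_mean and pstdev(sims) < max_std


# Made-up inputs: five numbered render frames, and four unrelated files
frames = [f"renders/spin/render_{i:04d}.png" for i in range(1, 6)]
mixed = ["logo.png", "hero_shot_final.jpg", "IMG_4432.jpeg", "x.tif"]

print(looks_like_sequence(frames))  # True
print(looks_like_sequence(mixed))   # False
```

Numbered frames differ by a character or two, so every consecutive pair scores about the same high ratio. Unrelated names score low and unevenly, which is why both the mean and the spread are checked.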
interface FilesClusteringData {
  metadata: {
    root_path: string;                    // Base path for the scanned files
    scan_date: string;                    // ISO timestamp of when scan was performed
    total_images: number;                 // Total number of images found
    total_clusters: number;               // Total number of file clusters
    total_sequences: number;              // Total number of sequences identified
    unclustered_count: number;            // Count of files not in any sequence/cluster
    timing: {
      total_seconds: number;              // Total processing time
      file_discovery_seconds: number;     // Time spent finding files
      directory_grouping_seconds: number; // Time spent grouping by directory
      clustering_seconds: number;         // Time spent clustering files
      sequence_naming_seconds: number;    // Time spent naming sequences
      gif_creation_seconds: number;       // Time spent creating GIFs
      json_preparation_seconds: number;   // Time spent preparing JSON output
    }
  };
  sequences: Sequence[];  // Array of file sequences
  clusters: Cluster[];    // Array of file clusters
  unclustered: File[];    // Array of files not belonging to any group
  videos?: Video[];       // Array of video files with preview GIFs (only present if any were created)
}

interface Sequence {
  id: string;            // Unique identifier for the sequence
  name: string;          // Human-readable name
  directory: string;     // Common directory path
  count: number;         // Number of files in sequence
  files: File[];         // Array of files in sequence
  preview_gif?: string;  // Path to preview GIF (only present if one was created)
}

interface Cluster {
  id: string;         // Unique identifier for the cluster
  directory: string;  // Common directory path
  count: number;      // Number of files in cluster
  files: File[];      // Array of files in cluster
}

interface File {
  filename: string;  // Filename without path
  filepath: string;  // Full path to file
  filesize: number;  // File size in bytes
  created: string;   // ISO 8601 timestamp for creation date
  modified: string;  // ISO 8601 timestamp for last modification
}

interface Video {
  id: string;           // Unique video identifier
  name: string;         // Human-readable name
  directory: string;    // Directory containing the video
  filepath: string;     // Full path to video file
  preview_gif: string;  // Path to preview GIF
  metadata: File;       // File metadata for the video
}
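Since the whole point of the JSON file is to find things again later, a small consumer sketch may help. It assumes the structure described by the interfaces above. The function name find_sequences and the sample catalog, including its paths and names, are made up for illustration; in practice the dict would come from json.load on the generated file.

```python
import json


def find_sequences(catalog, term, min_count=1):
    """Return (name, directory, count) for sequences whose name contains term."""
    term = term.lower()
    return [
        (seq["name"], seq["directory"], seq["count"])
        for seq in catalog.get("sequences", [])
        if term in seq["name"].lower() and seq["count"] >= min_count
    ]


# Tiny hand-written stand-in for the generated file (hypothetical data)
sample = json.loads("""
{
  "metadata": {"root_path": "/nas/omata", "total_sequences": 2},
  "sequences": [
    {"id": "spin_seq_0", "name": "dial_spin", "directory": "/nas/omata/spin",
     "count": 120, "files": []},
    {"id": "box_seq_1", "name": "packaging_turntable",
     "directory": "/nas/omata/box", "count": 48, "files": []}
  ],
  "clusters": [],
  "unclustered": []
}
""")

for name, directory, count in find_sequences(sample, "spin"):
    print(f"{name}: {count} files in {directory}")
```

The same pattern extends to the clusters and unclustered arrays. Because videos is only written when video previews exist, reading it with catalog.get("videos", []) avoids a KeyError.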