In large-scale CMS projects like CmsIv, content auditing is a continuous challenge. As editorial teams evolve, older blog posts often lack standardized SEO metadata such as categorized topics and hashtags. Manually updating hundreds or thousands of legacy pages is impractical. This post details how to automate the process in Optimizely CMS 12 using a scheduled job that scans content and calls an AI service to generate the missing metadata.
Prerequisites: Content Model Preparation
Before implementing the job, ensure your target content type—in our case, the BlogItemPage—is configured to store the necessary AI-generated metadata. We assume a simple structure defined in CmsIv.Model/Pages/BlogItemPage.cs.
using EPiServer.Core;
using EPiServer.DataAbstraction;
using EPiServer.DataAnnotations;
using System.ComponentModel.DataAnnotations;

[ContentType(
    DisplayName = "Blog Item Page",
    GUID = "...",
    Description = "A specific article or blog post."
)]
public class BlogItemPage : PageData
{
    [CultureSpecific]
    [Display(Name = "Main Content Body", GroupName = SystemTabNames.Content)]
    public virtual XhtmlString MainBody { get; set; }

    // Target properties for AI updates
    [Display(Name = "Categories", GroupName = SystemTabNames.Content)]
    public virtual CategoryList Categories { get; set; }

    [Display(Name = "AI Generated Tags (CSV)", GroupName = SystemTabNames.Content)]
    public virtual string AiTags { get; set; }
}
Designing the AI Service Abstraction
To keep the CMS logic clean and testable, the actual interaction with external AI APIs (like OpenAI, Azure AI Services, or Cohere) should be encapsulated in a dedicated service layer. This service takes the raw content and returns structured metadata.
AI Metadata Generator Interface
Define the contract for the service in CmsIv.Web/Services/IMetadataGeneratorService.cs:
using EPiServer.Core;
using System.Threading.Tasks;

public interface IMetadataGeneratorService
{
    /// <summary>
    /// Generates structured metadata (tags and categories) from content.
    /// </summary>
    /// <param name="contentBodyText">The plain text content of the article.</param>
    /// <returns>A tuple containing generated tags (CSV string) and categories.</returns>
    Task<(string Tags, CategoryList Categories)> GenerateMetadataAsync(string contentBodyText);
}
Note: The implementation of MetadataGeneratorService involves API calls, secure key handling, and robust JSON parsing, which is omitted here for brevity. For Optimizely compatibility, the AI must return category IDs or names that map correctly to your defined categories within the CMS.
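Whatever that implementation ends up looking like, it must be registered in the DI container so the scheduled job below can receive it via constructor injection. A minimal registration in `Startup.ConfigureServices` (assuming a concrete class named `MetadataGeneratorService`, which is a hypothetical name for your implementation) might look like:

```csharp
public void ConfigureServices(IServiceCollection services)
{
    // Register the AI service behind its interface so jobs and controllers
    // depend only on IMetadataGeneratorService, keeping the API details swappable.
    services.AddTransient<IMetadataGeneratorService, MetadataGeneratorService>();
}
```

Transient lifetime is a reasonable default here; if your implementation holds an HttpClient, prefer registering it via IHttpClientFactory instead.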
Implementing the Scheduled Cleanup Job
The core logic resides in an Optimizely Scheduled Job. This job iterates through the content tree, identifies pages missing metadata, fetches the required data using the AI service, and publishes the updated content.
The Scheduled Job Definition
Create the job class in CmsIv.Web/Jobs/AiMetadataCleanupJob.cs:
using System;
using System.Linq;
using EPiServer;
using EPiServer.Core;
using EPiServer.DataAbstraction;
using EPiServer.DataAccess;
using EPiServer.Logging;
using EPiServer.PlugIn;
using EPiServer.Scheduler;
using EPiServer.Security;
using CmsIv.Web.Services;
using CmsIv.Model.Pages;

[ScheduledPlugIn(
    DisplayName = "AI Metadata Cleanup Job",
    DefaultEnabled = false,
    IntervalLength = 1,
    IntervalType = ScheduledIntervalType.Hours, // Run hourly (for example)
    SortIndex = 1000
)]
public class AiMetadataCleanupJob : ScheduledJobBase
{
    private readonly IContentLoader _contentLoader;
    private readonly IContentRepository _contentRepository;
    private readonly IMetadataGeneratorService _metadataGenerator;
    private static readonly ILogger Logger = LogManager.GetLogger();
    private bool _stopSignaled;

    public AiMetadataCleanupJob(
        IContentLoader contentLoader,
        IContentRepository contentRepository,
        IMetadataGeneratorService metadataGenerator)
    {
        _contentLoader = contentLoader;
        _contentRepository = contentRepository;
        _metadataGenerator = metadataGenerator;
        IsStoppable = true;
        // Optionally inject CategoryRepository if needed for category ID mapping
    }

    public override void Stop()
    {
        _stopSignaled = true;
    }

    public override string Execute()
    {
        OnStatusChanged("Starting AI metadata cleanup...");
        var processedCount = 0;

        // 1. Define where to start the search (e.g., the root of the Blog section)
        var blogRoot = new ContentReference(100); // Replace 100 with your actual Blog landing page ID
        var descendants = _contentLoader.GetDescendents(blogRoot)
            .Select(id => _contentLoader.Get<IContent>(id))
            .OfType<BlogItemPage>();

        foreach (var blogPage in descendants)
        {
            if (_stopSignaled)
            {
                return $"Job stopped manually. Processed {processedCount} pages.";
            }

            // 2. Check if metadata is missing (or incomplete)
            bool metadataMissing = string.IsNullOrWhiteSpace(blogPage.AiTags) ||
                blogPage.Categories == null || blogPage.Categories.Count == 0;

            if (metadataMissing)
            {
                OnStatusChanged($"Processing: {blogPage.Name} (ID: {blogPage.ContentLink.ID})");

                // Get the content text for AI input (handle XhtmlString conversion, guarding against null)
                var contentText = (blogPage.MainBody?.ToHtmlString() ?? string.Empty).StripHtml(); // Extension method required

                try
                {
                    // 3. Call the AI Service (scheduled jobs are synchronous, so block on the task)
                    var (tags, categories) = _metadataGenerator.GenerateMetadataAsync(contentText).GetAwaiter().GetResult();

                    if (!string.IsNullOrEmpty(tags) || (categories != null && categories.Count > 0))
                    {
                        // 4. Create a writable clone and update
                        var clone = (BlogItemPage)blogPage.CreateWritableClone();
                        clone.AiTags = tags;
                        clone.Categories = categories;
                        _contentRepository.Save(clone, SaveAction.Publish, AccessLevel.NoAccess);
                        processedCount++;
                    }
                }
                catch (Exception ex)
                {
                    Logger.Error($"Failed to process page {blogPage.Name}: {ex.Message}", ex);
                }
            }
        }

        return $"AI Metadata Cleanup Job finished successfully. Updated {processedCount} blog posts.";
    }
}
Important Optimization: Stripping HTML
The AI service generally requires clean text input. You must ensure the XhtmlString content is converted to plain text before passing it to the AI API. A simple extension method can achieve this:
public static class XhtmlStringExtensions
{
    public static string StripHtml(this string html)
    {
        // Use regex or a library like HtmlAgilityPack for robust stripping
        if (string.IsNullOrEmpty(html)) return string.Empty;
        var stripped = System.Text.RegularExpressions.Regex.Replace(html, "<[^>]*>", string.Empty);
        return System.Net.WebUtility.HtmlDecode(stripped).Trim();
    }
}
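The regex approach can mangle edge cases such as script blocks and malformed markup. If the HtmlAgilityPack NuGet package is available in your solution, a more robust variant of the same helper might look like this (a sketch, assuming the package is installed; `StripHtmlRobust` is a name chosen for illustration):

```csharp
using System.Linq;
using HtmlAgilityPack;

public static class HtmlTextExtensions
{
    public static string StripHtmlRobust(this string html)
    {
        if (string.IsNullOrEmpty(html)) return string.Empty;

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Remove script/style nodes first, since InnerText would otherwise include their contents.
        var nonText = doc.DocumentNode.SelectNodes("//script|//style");
        if (nonText != null)
        {
            foreach (var node in nonText.ToList())
            {
                node.Remove();
            }
        }

        // InnerText drops tags but keeps encoded entities, so decode afterwards.
        return System.Net.WebUtility.HtmlDecode(doc.DocumentNode.InnerText).Trim();
    }
}
```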
Security and Permissions Considerations
When running scheduled jobs that modify content, the job executes under the context of the user running the job (or the service account defined for the application pool if run automatically). Since the job uses IContentRepository.Save(..., SaveAction.Publish), the account needs sufficient access rights (Write and Publish permissions) on the content nodes it is updating. Using AccessLevel.NoAccess bypasses explicit security checks during the save, which is acceptable for internal maintenance jobs if the service account is trusted.
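If you would rather not bypass security checks, you can instead require the executing principal to hold the relevant rights by passing a stricter access level to the save call. A minimal variant of the save line from the job above:

```csharp
// Require the executing principal to hold Publish rights on each page,
// instead of bypassing security checks with AccessLevel.NoAccess.
// Save throws an AccessDeniedException if the right is missing.
_contentRepository.Save(clone, SaveAction.Publish, AccessLevel.Publish);
```

This trades convenience for an audit-friendly guarantee that the job can never publish to nodes the service account was not granted.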
Troubleshooting Common Issues
Issue 1: Performance Degradation During Scan
Cause: Using IContentLoader.GetDescendents retrieves *all* content references under the root, which can easily exceed tens of thousands of items in large projects, leading to high memory usage and slow execution.
Solution: Instead of fetching all descendants and then filtering with .OfType<BlogItemPage>(), utilize the Optimizely Search functionality (like EPiServer.Find or the built-in Content Search API) to perform the initial filtering and retrieval in optimized batches. This offloads the heavy lifting from the database and application memory.
// Conceptual Find implementation for batching
var results = _client.Search<BlogItemPage>()
    .Filter(x => x.Ancestors().Match(blogRoot.ToString())) // Restrict to the blog subtree
    .Filter(x => !x.AiTags.Exists())                       // Pages missing the AiTags field
    .Take(100)                                             // Process in batches of 100
    .GetResult();
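Since Take(100) only returns the first batch, the job would typically loop until the index reports no further hits. A conceptual sketch of that loop, still assuming an injected Find IClient named _client:

```csharp
var processed = 0;
while (true)
{
    // Re-run the query each pass; updated pages drop out of the
    // "missing tags" filter once they have been re-indexed.
    var batch = _client.Search<BlogItemPage>()
        .Filter(x => !x.AiTags.Exists())
        .Take(100)
        .GetContentResult();

    if (!batch.Any())
    {
        break; // No pages left without tags
    }

    foreach (var page in batch)
    {
        // ... same per-page AI call and publish logic as in Execute() ...
        processed++;
    }
}
```

Note that the Find index is eventually consistent: a just-published page may still match the query for a short while. In practice you may want to track processed content IDs within the run to avoid reprocessing pages whose index entries have not yet refreshed.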
Issue 2: Category Mapping Errors
Cause: The AI service returns category names (e.g., "Technology"), but Optimizely requires the internal Category object, which relies on a specific ID defined in the Admin view.
Solution: The IMetadataGeneratorService implementation must include a lookup mechanism. After receiving the suggested category names from the AI, inject ICategoryRepository and map the names back to the CMS's official Category objects, ensuring the resulting CategoryList contains valid, existing category IDs.
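A minimal sketch of that lookup, assuming the AI returns plain category names and using an injected CategoryRepository (the CategoryMapper class name is chosen for illustration; names not defined in Admin are simply skipped):

```csharp
using System.Collections.Generic;
using EPiServer.Core;
using EPiServer.DataAbstraction;

public class CategoryMapper
{
    private readonly CategoryRepository _categoryRepository;

    public CategoryMapper(CategoryRepository categoryRepository)
    {
        _categoryRepository = categoryRepository;
    }

    public CategoryList MapNames(IEnumerable<string> suggestedNames)
    {
        var result = new CategoryList();
        foreach (var name in suggestedNames)
        {
            // Get(string) resolves a category by its unique name; returns null if undefined.
            var category = _categoryRepository.Get(name);
            if (category != null)
            {
                result.Add(category.ID);
            }
        }
        return result;
    }
}
```

Because category names are unique in Optimizely, the name-based lookup is deterministic; consider normalizing casing before the lookup if the AI output is inconsistent.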
Automating content hygiene using AI is a significant step towards maintaining high SEO standards across large Optimizely installations. By leveraging the robustness of scheduled jobs in .NET 8, the CmsIv project can ensure that all legacy content remains discoverable and properly categorized.