SecureAI provides administrators with controls to filter model outputs and enforce safety policies across your organization. This guide explains how to configure content filtering rules, set safety thresholds, manage category-level controls, and protect against prompt injection.
How Content Filtering Works
Content filtering sits between the AI model and the end user. When a model generates a response, SecureAI evaluates it against your configured filtering rules before displaying it. Depending on your settings, filtered responses are blocked, flagged for review, or allowed through with an audit log entry.
The filtering pipeline runs in this order:
- Prompt-side filters evaluate the user's input before it reaches the model.
- The model generates a response.
- Response-side filters evaluate the output before it reaches the user.
- Audit logging records any filter matches regardless of the action taken.
This two-stage approach catches both inappropriate prompts and inappropriate outputs, giving you defense in depth.
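The pipeline above can be sketched in a few lines of Python. This is an illustrative model of the two-stage flow, not SecureAI's actual API; the function and rule names are hypothetical.

```python
# Minimal sketch of the two-stage filtering pipeline. Rules are modeled as
# dicts with a "pattern" (substring) and an "action" (block/flag/allow).

def evaluate(text, rules):
    """Return the first matching rule's action, or 'allow' if none match."""
    for rule in rules:
        if rule["pattern"] in text:
            return rule["action"]
    return "allow"

def handle_message(prompt, model, prompt_rules, response_rules, audit_log):
    # Stage 1: prompt-side filters run before the model sees the input.
    action = evaluate(prompt, prompt_rules)
    audit_log.append(("prompt", action))
    if action == "block":
        return "This prompt was filtered by your organization's safety policy."

    # Stage 2: the model generates a response.
    response = model(prompt)

    # Stage 3: response-side filters run before the user sees the output.
    action = evaluate(response, response_rules)
    audit_log.append(("response", action))
    if action == "block":
        return "This response was filtered by your organization's safety policy."
    return response
```

Note that the audit log records a result for both stages regardless of the action taken, which is what gives administrators visibility into flagged-but-allowed content.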
Key Capabilities
- Category-based filtering — control sensitivity thresholds for specific content categories (harmful content, hate speech, PII, etc.).
- Custom keyword and regex rules — block or flag responses containing specific terms or patterns.
- Prompt injection protection — detect and block attempts to override system prompts or safety instructions.
- Industry-appropriate defaults — pre-configured rules tailored to the automotive aftermarket context.
- Scope control — apply rules to prompts only, responses only, or both directions.
- Audit logging — all filtered content is recorded for compliance review.
Accessing Content Filtering Settings
- Log in to SecureAI as an administrator.
- Navigate to Admin Panel > Settings > Content & Safety.
- You will see tabs for Filtering Categories, Custom Rules, Prompt Protection, and Safety Policies.
Configuring Filtering Categories
SecureAI includes built-in content categories that can be individually tuned.
Available Categories
| Category | Description | Default Action |
|---|---|---|
| Harmful content | Violence, self-harm, dangerous activities | Block |
| Hate speech | Discriminatory or hateful language | Block |
| Sexual content | Sexually explicit material | Block |
| Profanity | Offensive language and profanity | Flag |
| Personal information | PII such as SSNs, credit card numbers, phone numbers | Block |
| Off-topic responses | Responses unrelated to the automotive aftermarket | Flag |
| Financial advice | Investment, tax, or accounting guidance | Flag |
| Legal advice | Legal opinions or recommendations | Flag |
Setting Category Thresholds
Each category can be set to one of three actions:
| Action | Behavior |
|---|---|
| Block | The response is not shown to the user. A generic "This response was filtered by your organization's safety policy" message appears instead. |
| Flag | The response is shown to the user but logged for admin review in the audit trail. |
| Allow | No filtering is applied for this category. |
To change a category threshold:
- Go to Admin Panel > Settings > Content & Safety > Filtering Categories.
- Find the category you want to adjust.
- Select the desired action from the dropdown.
- Click Save Changes.
Changes take effect for new messages immediately. Existing conversations are not retroactively filtered.
Important: Setting any safety category to Allow disables filtering for that category entirely. Review your organization's compliance requirements before loosening defaults. Changes to category thresholds are logged in the admin audit trail.
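Conceptually, the category settings behave like a lookup from category to action, with unconfigured categories passing through. A minimal sketch, assuming the defaults from the table above (the dictionary keys and function are illustrative, not SecureAI's configuration format):

```python
# Hypothetical mapping of built-in categories to their default actions.
DEFAULT_ACTIONS = {
    "harmful_content": "block",
    "hate_speech": "block",
    "sexual_content": "block",
    "profanity": "flag",
    "personal_information": "block",
    "off_topic": "flag",
    "financial_advice": "flag",
    "legal_advice": "flag",
}

def apply_category(category, actions=DEFAULT_ACTIONS):
    """Map a detected category to the configured action; unknown -> allow."""
    return actions.get(category, "allow")
```

Setting a category to "allow" is equivalent to removing it from the lookup entirely, which is why the warning above applies.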
Custom Keyword Rules
For industry-specific or organization-specific needs, you can create custom filtering rules based on keywords or patterns.
Adding a Custom Rule
- Go to Admin Panel > Settings > Content & Safety > Custom Rules.
- Click Add Rule.
- Fill in the following fields:
| Field | Description | Example |
|---|---|---|
| Rule name | A descriptive name for this rule | "Block competitor pricing" |
| Match type | Exact match, Contains, or Regex | Contains |
| Pattern | The keyword, phrase, or regular expression to match | "competitor price list" |
| Scope | Response only, Prompt only, or Both | Response only |
| Action | Block or Flag | Flag |
| Priority | Numeric priority (lower numbers evaluated first) | 10 |
- Click Save.
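The three match types and the priority ordering can be sketched as follows. This is an illustrative model under the assumption that rules are evaluated lowest-priority-number first and the first match wins; the field names mirror the table above but are hypothetical.

```python
import re

def rule_matches(rule, text):
    """Apply one rule's match type to the given text."""
    if rule["match_type"] == "exact":
        return text.strip() == rule["pattern"]
    if rule["match_type"] == "contains":
        return rule["pattern"].lower() in text.lower()
    if rule["match_type"] == "regex":
        return re.search(rule["pattern"], text, re.IGNORECASE) is not None
    raise ValueError(f"unknown match type: {rule['match_type']}")

def evaluate_rules(rules, text):
    """Evaluate enabled rules in priority order; return the first match."""
    for rule in sorted(rules, key=lambda r: r["priority"]):
        if rule.get("enabled", True) and rule_matches(rule, text):
            return rule
    return None
```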
Automotive Aftermarket Examples
Here are common custom rules for automotive aftermarket organizations:
| Rule Name | Match Type | Pattern | Scope | Action | Rationale |
|---|---|---|---|---|---|
| Block competitor pricing | Contains | competitor price list | Response only | Block | Prevent AI from generating speculative competitor pricing |
| Flag warranty disclaimers | Regex | warrant(y\|ies).*disclaim | Response only | Flag | Review any warranty-related language before it reaches technicians |
| Flag part number guessing | Regex | I.*(think\|believe\|guess).*part\s*(number\|#) | Response only | Flag | Catch cases where the model speculates on part numbers instead of looking them up |
| Block medical advice | Contains | medical advice | Both | Block | Prevent AI from offering health guidance in an automotive context |
Managing Custom Rules
- Rules are evaluated in priority order (lowest number first). Drag rules to reorder them in the UI.
- Toggle rules on/off without deleting them using the Enabled switch.
- Click the rule name to edit its configuration.
- Use the Test button to check a rule against sample text before enabling it.
Tip: Regex patterns use standard syntax. Test complex patterns with the built-in Test button before deploying them in production to avoid false positives.
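Outside the UI, you can sanity-check the example patterns from the table above with Python's `re` module, which uses the same standard syntax. The sample strings are invented for illustration:

```python
import re

# The two regex rules from the aftermarket examples table.
warranty = re.compile(r"warrant(y|ies).*disclaim", re.IGNORECASE)
part_guess = re.compile(r"I.*(think|believe|guess).*part\s*(number|#)", re.IGNORECASE)

def matches(pattern, text):
    """True if the compiled pattern is found anywhere in the text."""
    return pattern.search(text) is not None
```

Checking each pattern against both a string it should catch and a string it should not is the quickest way to spot an overly broad rule before it starts producing false positives.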
Prompt Injection Protection
Prompt injection occurs when a user crafts input that attempts to override your system prompt or bypass safety instructions. SecureAI includes built-in protections against common injection techniques.
Enabling Prompt Protection
- Go to Admin Panel > Settings > Content & Safety > Prompt Protection.
- Toggle Prompt Injection Detection to On.
- Choose the detection sensitivity:
| Sensitivity | Description | Recommended For |
|---|---|---|
| Low | Catches obvious injection attempts (e.g., "Ignore all previous instructions") | Low-risk internal environments |
| Medium | Catches most injection patterns including encoded and indirect attempts | General production use |
| High | Aggressive detection that may occasionally flag legitimate prompts | Environments with untrusted user input |
- Choose the action to take when injection is detected:
- Block: Reject the prompt entirely with a warning message.
- Sanitize: Strip the detected injection attempt and process the remaining prompt.
- Flag: Allow the prompt but log it for review.
- Click Save.
What Gets Detected
The prompt injection detector looks for patterns including:
- Direct override attempts ("Ignore all previous instructions", "You are now...")
- Role reassignment ("Act as an unrestricted AI", "Pretend you have no rules")
- Encoded bypass attempts (base64-encoded instructions, Unicode tricks)
- Delimiter injection (attempting to close and reopen system prompt blocks)
- Indirect injection via pasted document content
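A toy detector for the direct-override and role-reassignment patterns in the list above might look like this. This sketch covers only simple phrase matching; real detection of encoded payloads and delimiter injection is considerably more involved, and the pattern list is illustrative, not SecureAI's actual rule set:

```python
import re

# Simplified signatures for the first two bullet categories above.
INJECTION_PATTERNS = [
    r"ignore (all )?(previous|prior) instructions",
    r"you are now",
    r"act as an? unrestricted",
    r"pretend (that )?you have no rules",
]

def looks_like_injection(prompt):
    """Flag a prompt if any known injection phrase appears in it."""
    lowered = prompt.lower()
    return any(re.search(p, lowered) for p in INJECTION_PATTERNS)
```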
Reviewing Blocked Prompts
Blocked prompts are logged under Admin Panel > Audit Log with the event type Prompt Injection Detected. Review these periodically to:
- Confirm that detections are accurate (not false positives)
- Identify users who may need additional training
- Adjust sensitivity if the rate of false positives is too high
Safety Policies
Safety policies define organization-wide behavior beyond individual content categories.
System Prompt Guardrails
Administrators can prepend a safety-oriented system prompt to all conversations in the organization:
- Go to Admin Panel > Settings > Content & Safety > Safety Policies.
- Under System Prompt Prefix, enter your guardrail instructions. For example:

  > You are a helpful assistant for automotive aftermarket professionals. Only provide information relevant to automotive parts, repair, and maintenance. Do not provide medical, legal, or financial advice. Always cite specific part numbers and sources when available. If you are unsure about a part number or specification, say so rather than guessing.

- Click Save.
This prefix is added to every conversation automatically and cannot be overridden by users. It is applied before any workspace-level or model-level system prompts.
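The ordering described above, with the organization prefix applied ahead of workspace- and model-level prompts, can be sketched as simple concatenation. The function name and separator are assumptions for illustration:

```python
# Sketch of guardrail-prefix assembly: the org-wide prefix always comes
# first, so downstream prompts cannot override it by position.
def build_system_prompt(org_prefix, workspace_prompt=None, model_prompt=None):
    parts = [org_prefix]
    if workspace_prompt:
        parts.append(workspace_prompt)
    if model_prompt:
        parts.append(model_prompt)
    return "\n\n".join(parts)
```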
Response Length Limits
To control verbose responses and manage costs:
- Under Safety Policies, find Maximum Response Length.
- Set the token limit (default: 2048 tokens).
- Click Save.
When a response exceeds the limit, it is truncated with a note that the response was shortened. The full response is still logged in the audit trail.
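Truncation with a visible note can be sketched as below. This is a rough illustration: real token counting uses the model's tokenizer, and whitespace splitting here is only a stand-in; the notice text is hypothetical.

```python
# Rough sketch of response-length enforcement (2048 is the stated default).
def truncate_response(text, max_tokens=2048):
    """Return (possibly shortened text, whether truncation occurred)."""
    tokens = text.split()  # stand-in for real tokenization
    if len(tokens) <= max_tokens:
        return text, False
    shortened = " ".join(tokens[:max_tokens])
    return shortened + "\n\n[Response shortened by safety policy]", True
```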
Rate Limiting per User
To prevent abuse or excessive API usage:
- Under Safety Policies, find User Rate Limits.
- Configure:
| Setting | Description | Default |
|---|---|---|
| Messages per minute | Maximum messages a single user can send per minute | 10 |
| Messages per day | Maximum messages a single user can send per day | 500 |
| Document uploads per day | Maximum file uploads per user per day | 20 |
| Token budget per day | Maximum total tokens (input + output) per user per day | 100,000 |
- Click Save.
Users who exceed rate limits see a "You've reached your usage limit. Please try again later" notice with the time until their limit resets.
Tip: Set rate limits conservatively at first and adjust upward based on actual usage patterns. You can view per-user usage under Admin Panel > Users > [username] > Usage.
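A per-user limit like "messages per minute" is typically enforced with a sliding window over recent timestamps. A minimal sketch, assuming that mechanism (the class and its interface are illustrative, not SecureAI internals):

```python
import time
from collections import defaultdict, deque

class RateLimiter:
    """Sliding-window limiter for one setting, e.g. messages per minute."""

    def __init__(self, per_minute=10):  # 10 is the stated default
        self.per_minute = per_minute
        self.history = defaultdict(deque)  # user -> recent timestamps

    def allow(self, user, now=None):
        now = time.time() if now is None else now
        window = self.history[user]
        while window and now - window[0] >= 60:
            window.popleft()  # drop events older than the 60-second window
        if len(window) >= self.per_minute:
            return False  # user sees the "usage limit" notice
        window.append(now)
        return True
```

The daily message, upload, and token budgets work the same way with a longer window and a different counter.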
Model-Level Safety Overrides
Different models may need different safety configurations. For example, you might want stricter filtering on a general-purpose model but looser rules on a specialized parts-lookup model.
- Under Safety Policies, find Per-Model Overrides.
- Select a model from the dropdown.
- Override any category threshold or safety policy for that specific model.
- Click Save.
Per-model overrides take priority over the organization-wide defaults. This is useful when you have models with different risk profiles or specialized use cases.
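Override resolution is a straightforward layered merge: start from the organization-wide defaults and let any per-model entries win. A sketch under that assumption (names are illustrative):

```python
# Per-model settings take priority over org-wide defaults.
def effective_settings(org_defaults, model_overrides, model):
    """Merge org defaults with any overrides defined for this model."""
    settings = dict(org_defaults)
    settings.update(model_overrides.get(model, {}))
    return settings
```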
Reviewing Filtered Content
When content is flagged or blocked, it appears in the admin audit log.
- Navigate to Admin Panel > Audit Log.
- Filter by Event type: Content Filtered.
- Each entry shows:
- Timestamp
- User who triggered the filter
- The category or rule that matched
- The action taken (blocked/flagged)
- The original content (visible only to administrators)
- The conversation context (surrounding messages)
Audit Log Filters
Use these filters to narrow down the audit log:
| Filter | Options |
|---|---|
| Event type | Content Filtered, Prompt Injection Detected, Rate Limit Hit |
| Action | Blocked, Flagged |
| Category | Any built-in category or custom rule name |
| User | Specific user or "All users" |
| Date range | Start and end dates |
Use the Export button to download filtered results as CSV for compliance reporting.
Best Practices
- Start with defaults. The built-in category settings are designed for a professional automotive aftermarket environment. Adjust only after reviewing the audit log for a few weeks.
- Use Flag before Block for new rules. When adding custom keyword rules, start with the Flag action to assess how often they trigger before switching to Block.
- Review the audit log weekly. Regular reviews help identify false positives and gaps in your filtering coverage.
- Enable prompt injection protection. Start at Medium sensitivity for most deployments. Adjust based on your user base and risk tolerance.
- Layer your defenses. Combine system prompt guardrails with category filtering and custom rules. No single layer catches everything.
- Coordinate with your compliance team. Content filtering settings may be subject to your organization's data governance policies. Document your configuration decisions.
- Document your custom rules. Keep a record of each custom rule and the business reason for it, so the rationale is clear during future audits.
- Test before deploying. Use the Test button for custom rules and review the audit log after any configuration change to catch unintended effects.
Troubleshooting
Users report that legitimate responses are being blocked
- Check the audit log to identify which category or custom rule triggered the block.
- If it is a custom keyword rule, refine the pattern to be more specific or switch the match type from "Contains" to "Exact match".
- If it is a built-in category, consider changing the action from Block to Flag while you investigate.
- For prompt injection false positives, lower the detection sensitivity from High to Medium.
Content filtering does not seem to be working
- Verify that your changes were saved (check for the "Settings saved" confirmation).
- Ensure the user is not in a session started before the settings change — filtering settings apply to new messages, not retroactively to existing conversations.
- Check that the filtering engine is running by visiting Admin Panel > System Health. The "Content Filter" service should show a green status.
- Verify that a per-model override is not contradicting your organization-wide settings.
Custom regex rule causes errors
Invalid regex patterns will prevent the rule from saving. If a rule was saved but causes unexpected behavior:
- Disable it using the Enabled toggle.
- Test the pattern using the built-in Test button with sample text.
- Fix the regex pattern and re-enable.
Common regex issues include unescaped special characters (`.`, `*`, `(`), unmatched groups, and overly broad patterns such as `.*` that match everything.
High rate of false positives
If too many legitimate responses are being filtered:
- Review the audit log to identify the most frequent triggers.
- For custom rules: make patterns more specific or change scope from "Both" to "Response only".
- For built-in categories: switch from Block to Flag to collect data before making permanent changes.
- For prompt injection: lower the sensitivity level.
- Consider creating explicit "allow" exceptions for common false positive patterns.