SecureAI provides administrators with controls to filter model outputs and enforce safety policies across your organization. This guide explains how to configure content filtering rules, set safety thresholds, manage category-level controls, and protect against prompt injection.
How Content Filtering Works
Content filtering sits between the AI model and the end user. When a model generates a response, SecureAI evaluates it against your configured filtering rules before displaying it. Depending on your settings, filtered responses are blocked, flagged for review, or allowed through with an audit log entry.
The filtering pipeline runs in this order:
- Prompt-side filters evaluate the user's input before it reaches the model.
- The model generates a response.
- Response-side filters evaluate the output before it reaches the user.
- Audit logging records any filter matches regardless of the action taken.
This two-stage approach catches both inappropriate prompts and inappropriate outputs, giving you defense in depth.
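The pipeline above can be sketched in a few lines of Python. This is an illustrative model of the two-stage flow, not SecureAI's actual API; the function and rule names are hypothetical.

```python
# Minimal sketch of the two-stage filtering pipeline. Rules are modeled as
# dicts with a "pattern" (substring) and an "action" (block/flag/allow).

def evaluate(text, rules):
    """Return the first matching rule's action, or 'allow' if none match."""
    for rule in rules:
        if rule["pattern"] in text:
            return rule["action"]
    return "allow"

def handle_message(prompt, model, prompt_rules, response_rules, audit_log):
    # Stage 1: prompt-side filters run before the model sees the input.
    action = evaluate(prompt, prompt_rules)
    audit_log.append(("prompt", action))
    if action == "block":
        return "This prompt was filtered by your organization's safety policy."

    # Stage 2: the model generates a response.
    response = model(prompt)

    # Stage 3: response-side filters run before the user sees the output.
    action = evaluate(response, response_rules)
    audit_log.append(("response", action))
    if action == "block":
        return "This response was filtered by your organization's safety policy."
    return response
```

Note that the audit log records a result for both stages regardless of the action taken, which is what gives administrators visibility into flagged-but-allowed content.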
Key Capabilities
- Category-based filtering — control sensitivity thresholds for specific content categories (harmful content, hate speech, PII, etc.).
- Custom keyword and regex rules — block or flag responses containing specific terms or patterns.
- Prompt injection protection — detect and block attempts to override system prompts or safety instructions.
- Industry-appropriate defaults — pre-configured rules tailored to the automotive aftermarket context.
- Scope control — apply rules to prompts only, responses only, or both directions.
- Audit logging — all filtered content is recorded for compliance review.
Accessing Content Filtering Settings
- Log in to SecureAI as an administrator.
- Navigate to Admin Panel > Settings > Content & Safety.
- You will see tabs for Filtering Categories, Custom Rules, Prompt Protection, and Safety Policies.
Configuring Filtering Categories
SecureAI includes built-in content categories that can be individually tuned.
Available Categories
| Category | Description | Default Action |
|---|---|---|
| Harmful content | Violence, self-harm, dangerous activities | Block |
| Hate speech | Discriminatory or hateful language | Block |
| Sexual content | Sexually explicit material | Block |
| Profanity | Offensive language and profanity | Flag |
| Personal information | PII such as SSNs, credit card numbers, phone numbers | Block |
| Off-topic responses | Responses unrelated to the automotive aftermarket | Flag |
| Financial advice | Investment, tax, or accounting guidance | Flag |
| Legal advice | Legal opinions or recommendations | Flag |
Setting Category Thresholds
Each category can be set to one of three actions:
| Action | Behavior |
|---|---|
| Block | The response is not shown to the user. A generic "This response was filtered by your organization's safety policy" message appears instead. |
| Flag | The response is shown to the user but logged for admin review in the audit trail. |
| Allow | No filtering is applied for this category. |
To change a category threshold:
- Go to Admin Panel > Settings > Content & Safety > Filtering Categories.
- Find the category you want to adjust.
- Select the desired action from the dropdown.
- Click Save Changes.
Changes take effect for new messages immediately. Existing conversations are not retroactively filtered.
Important: Setting any safety category to Allow disables filtering for that category entirely. Review your organization's compliance requirements before loosening defaults. Changes to category thresholds are logged in the admin audit trail.
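Conceptually, the category settings behave like a lookup from category to action, with unconfigured categories passing through. A minimal sketch, assuming the defaults from the table above (the dictionary keys and function are illustrative, not SecureAI's configuration format):

```python
# Hypothetical mapping of built-in categories to their default actions.
DEFAULT_ACTIONS = {
    "harmful_content": "block",
    "hate_speech": "block",
    "sexual_content": "block",
    "profanity": "flag",
    "personal_information": "block",
    "off_topic": "flag",
    "financial_advice": "flag",
    "legal_advice": "flag",
}

def apply_category(category, actions=DEFAULT_ACTIONS):
    """Map a detected category to the configured action; unknown -> allow."""
    return actions.get(category, "allow")
```

Setting a category to "allow" is equivalent to removing it from the lookup entirely, which is why the warning above applies.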
Custom Keyword Rules
For industry-specific or organization-specific needs, you can create custom filtering rules based on keywords or patterns.
Adding a Custom Rule
- Go to Admin Panel > Settings > Content & Safety > Custom Rules.
- Click Add Rule.
- Fill in the following fields:
| Field | Description | Example |
|---|---|---|
| Rule name | A descriptive name for this rule | "Block competitor pricing" |
| Match type | Exact match, Contains, or Regex | Contains |
| Pattern | The keyword, phrase, or regular expression to match | "competitor price list" |
| Scope | Response only, Prompt only, or Both | Response only |
| Action | Block or Flag | Flag |
| Priority | Numeric priority (lower numbers evaluated first) | 10 |
- Click Save.
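The three match types and the priority ordering can be sketched as follows. This is an illustrative model under the assumption that rules are evaluated lowest-priority-number first and the first match wins; the field names mirror the table above but are hypothetical.

```python
import re

def rule_matches(rule, text):
    """Apply one rule's match type to the given text."""
    if rule["match_type"] == "exact":
        return text.strip() == rule["pattern"]
    if rule["match_type"] == "contains":
        return rule["pattern"].lower() in text.lower()
    if rule["match_type"] == "regex":
        return re.search(rule["pattern"], text, re.IGNORECASE) is not None
    raise ValueError(f"unknown match type: {rule['match_type']}")

def evaluate_rules(rules, text):
    """Evaluate enabled rules in priority order; return the first match."""
    for rule in sorted(rules, key=lambda r: r["priority"]):
        if rule.get("enabled", True) and rule_matches(rule, text):
            return rule
    return None
```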
Automotive Aftermarket Examples
Here are common custom rules for automotive aftermarket organizations:
| Rule Name | Match Type | Pattern | Scope | Action | Rationale |
|---|---|---|---|---|---|
| Block competitor pricing | Contains | competitor price list | Response only | Block | Prevent AI from generating speculative competitor pricing |
| Flag warranty disclaimers | Regex | warrant(y\|ies).*disclaim | Response only | Flag | Review any warranty-related language before it reaches technicians |
| Flag part number guessing | Regex | I.*(think\|believe\|guess).*part\s*(number\|#) | Response only | Flag | Catch cases where the model speculates on part numbers instead of looking them up |
| Block medical advice | Contains | medical advice | Both | Block | Prevent AI from offering health guidance in an automotive context |
Managing Custom Rules
- Rules are evaluated in priority order (lowest number first). Drag rules to reorder them in the UI.
- Toggle rules on/off without deleting them using the Enabled switch.
- Click the rule name to edit its configuration.
- Use the Test button to check a rule against sample text before enabling it.
Tip: Regex patterns use standard syntax. Test complex patterns with the built-in Test button before deploying them in production to avoid false positives.
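Outside the UI, you can sanity-check the example patterns from the table above with Python's `re` module, which uses the same standard syntax. The sample strings are invented for illustration:

```python
import re

# The two regex rules from the aftermarket examples table.
warranty = re.compile(r"warrant(y|ies).*disclaim", re.IGNORECASE)
part_guess = re.compile(r"I.*(think|believe|guess).*part\s*(number|#)", re.IGNORECASE)

def matches(pattern, text):
    """True if the compiled pattern is found anywhere in the text."""
    return pattern.search(text) is not None
```

Checking each pattern against both a string it should catch and a string it should not is the quickest way to spot an overly broad rule before it starts producing false positives.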
Prompt Injection Protection
Prompt injection occurs when a user crafts input that attempts to override your system prompt or bypass safety instructions. SecureAI includes built-in protections against common injection techniques.
Enabling Prompt Protection
- Go to Admin Panel > Settings > Content & Safety > Prompt Protection.
- Toggle Prompt Injection Detection to On.
- Choose the detection sensitivity:
| Sensitivity | Description | Recommended For |
|---|---|---|
| Low | Catches obvious injection attempts (e.g., "Ignore all previous instructions") | Low-risk internal environments |
| Medium | Catches most injection patterns including encoded and indirect attempts | General production use |
| High | Aggressive detection that may occasionally flag legitimate prompts | Environments with untrusted user input |
- Choose the action to take when injection is detected:
- Block: Reject the prompt entirely with a warning message.
- Sanitize: Strip the detected injection attempt and process the remaining prompt.
- Flag: Allow the prompt but log it for review.
- Click Save.
What Gets Detected
The prompt injection detector looks for patterns including:
- Direct override attempts ("Ignore all previous instructions", "You are now...")
- Role reassignment ("Act as an unrestricted AI", "Pretend you have no rules")
- Encoded bypass attempts (base64-encoded instructions, Unicode tricks)
- Delimiter injection (attempting to close and reopen system prompt blocks)
- Indirect injection via pasted document content
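A toy detector for the direct-override and role-reassignment patterns in the list above might look like this. This sketch covers only simple phrase matching; real detection of encoded payloads and delimiter injection is considerably more involved, and the pattern list is illustrative, not SecureAI's actual rule set:

```python
import re

# Simplified signatures for the first two bullet categories above.
INJECTION_PATTERNS = [
    r"ignore (all )?(previous|prior) instructions",
    r"you are now",
    r"act as an? unrestricted",
    r"pretend (that )?you have no rules",
]

def looks_like_injection(prompt):
    """Flag a prompt if any known injection phrase appears in it."""
    lowered = prompt.lower()
    return any(re.search(p, lowered) for p in INJECTION_PATTERNS)
```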
Reviewing Blocked Prompts
Blocked prompts are logged under Admin Panel > Audit Log with the event type Prompt Injection Detected. Review these periodically to:
- Confirm that detections are accurate (not false positives)
- Identify users who may need additional training
- Adjust sensitivity if the rate of false positives is too high
Safety Policies
Safety policies define organization-wide behavior beyond individual content categories.
System Prompt Guardrails
Administrators can prepend a safety-oriented system prompt to all conversations in the organization:
- Go to Admin Panel > Settings > Content & Safety > Safety Policies.
- Under System Prompt Prefix, enter your guardrail instructions. For example:

  > You are a helpful assistant for automotive aftermarket professionals. Only provide information relevant to automotive parts, repair, and maintenance. Do not provide medical, legal, or financial advice. Always cite specific part numbers and sources when available. If you are unsure about a part number or specification, say so rather than guessing.

- Click Save.
This prefix is added to every conversation automatically and cannot be overridden by users. It is applied before any workspace-level or model-level system prompts.
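The ordering described above, with the organization prefix applied ahead of workspace- and model-level prompts, can be sketched as simple concatenation. The function name and separator are assumptions for illustration:

```python
# Sketch of guardrail-prefix assembly: the org-wide prefix always comes
# first, so downstream prompts cannot override it by position.
def build_system_prompt(org_prefix, workspace_prompt=None, model_prompt=None):
    parts = [org_prefix]
    if workspace_prompt:
        parts.append(workspace_prompt)
    if model_prompt:
        parts.append(model_prompt)
    return "\n\n".join(parts)
```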
Response Length Limits
To control verbose responses and manage costs:
- Under Safety Policies, find Maximum Response Length.
- Set the token limit (default: 2048 tokens).
- Click Save.
When a response exceeds the limit, it is truncated with a note that the response was shortened. The full response is still logged in the audit trail.
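Truncation with a visible note can be sketched as below. This is a rough illustration: real token counting uses the model's tokenizer, and whitespace splitting here is only a stand-in; the notice text is hypothetical.

```python
# Rough sketch of response-length enforcement (2048 is the stated default).
def truncate_response(text, max_tokens=2048):
    """Return (possibly shortened text, whether truncation occurred)."""
    tokens = text.split()  # stand-in for real tokenization
    if len(tokens) <= max_tokens:
        return text, False
    shortened = " ".join(tokens[:max_tokens])
    return shortened + "\n\n[Response shortened by safety policy]", True
```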
Rate Limiting per User
To prevent abuse or excessive API usage:
- Under Safety Policies, find User Rate Limits.
- Configure:
| Setting | Description | Default |
|---|---|---|
| Messages per minute | Maximum messages a single user can send per minute | 10 |
| Messages per day | Maximum messages a single user can send per day | 500 |
| Document uploads per day | Maximum file uploads per user per day | 20 |
| Token budget per day | Maximum total tokens (input + output) per user per day | 100,000 |
- Click Save.
Users who exceed rate limits see a "You've reached your usage limit. Please try again later" notice with the time until their limit resets.
Tip: Set rate limits conservatively at first and adjust upward based on actual usage patterns. You can view per-user usage under Admin Panel > Users > [username] > Usage.
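A per-user limit like "messages per minute" is typically enforced with a sliding window over recent timestamps. A minimal sketch, assuming that mechanism (the class and its interface are illustrative, not SecureAI internals):

```python
import time
from collections import defaultdict, deque

class RateLimiter:
    """Sliding-window limiter for one setting, e.g. messages per minute."""

    def __init__(self, per_minute=10):  # 10 is the stated default
        self.per_minute = per_minute
        self.history = defaultdict(deque)  # user -> recent timestamps

    def allow(self, user, now=None):
        now = time.time() if now is None else now
        window = self.history[user]
        while window and now - window[0] >= 60:
            window.popleft()  # drop events older than the 60-second window
        if len(window) >= self.per_minute:
            return False  # user sees the "usage limit" notice
        window.append(now)
        return True
```

The daily message, upload, and token budgets work the same way with a longer window and a different counter.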
Model-Level Safety Overrides
Different models may need different safety configurations. For example, you might want stricter filtering on a general-purpose model but looser rules on a specialized parts-lookup model.
- Under Safety Policies, find Per-Model Overrides.
- Select a model from the dropdown.
- Override any category threshold or safety policy for that specific model.
- Click Save.
Per-model overrides take priority over the organization-wide defaults. This is useful when you have models with different risk profiles or specialized use cases.
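Override resolution is a straightforward layered merge: start from the organization-wide defaults and let any per-model entries win. A sketch under that assumption (names are illustrative):

```python
# Per-model settings take priority over org-wide defaults.
def effective_settings(org_defaults, model_overrides, model):
    """Merge org defaults with any overrides defined for this model."""
    settings = dict(org_defaults)
    settings.update(model_overrides.get(model, {}))
    return settings
```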
Reviewing Filtered Content
When content is flagged or blocked, it appears in the admin audit log.
- Navigate to Admin Panel > Audit Log.
- Filter by Event type: Content Filtered.
- Each entry shows:
- Timestamp
- User who triggered the filter
- The category or rule that matched
- The action taken (blocked/flagged)
- The original content (visible only to administrators)
- The conversation context (surrounding messages)
Audit Log Filters
Use these filters to narrow down the audit log:
| Filter | Options |
|---|---|
| Event type | Content Filtered, Prompt Injection Detected, Rate Limit Hit |
| Action | Blocked, Flagged |
| Category | Any built-in category or custom rule name |
| User | Specific user or "All users" |
| Date range | Start and end dates |
Use the Export button to download filtered results as CSV for compliance reporting.
Best Practices
- Start with defaults. The built-in category settings are designed for a professional automotive aftermarket environment. Adjust only after reviewing the audit log for a few weeks.
- Use Flag before Block for new rules. When adding custom keyword rules, start with the Flag action to assess how often they trigger before switching to Block.
- Review the audit log weekly. Regular reviews help identify false positives and gaps in your filtering coverage.
- Enable prompt injection protection. Start at Medium sensitivity for most deployments. Adjust based on your user base and risk tolerance.
- Layer your defenses. Combine system prompt guardrails with category filtering and custom rules. No single layer catches everything.
- Coordinate with your compliance team. Content filtering settings may be subject to your organization's data governance policies. Document your configuration decisions.
- Document your custom rules. Keep a record of each custom rule and the business reason for it, so the rationale is clear during future audits.
- Test before deploying. Use the Test button for custom rules and review the audit log after any configuration change to catch unintended effects.
Troubleshooting
Users report that legitimate responses are being blocked
- Check the audit log to identify which category or custom rule triggered the block.
- If it is a custom keyword rule, refine the pattern to be more specific or switch the match type from "Contains" to "Exact match".
- If it is a built-in category, consider changing the action from Block to Flag while you investigate.
- For prompt injection false positives, lower the detection sensitivity from High to Medium.
Content filtering does not seem to be working
- Verify that your changes were saved (check for the "Settings saved" confirmation).
- Ensure the user is not in a session started before the settings change — filtering settings apply to new messages, not retroactively to existing conversations.
- Check that the filtering engine is running by visiting Admin Panel > System Health. The "Content Filter" service should show a green status.
- Verify that a per-model override is not contradicting your organization-wide settings.
Custom regex rule causes errors
Invalid regex patterns will prevent the rule from saving. If a rule was saved but causes unexpected behavior:
- Disable it using the Enabled toggle.
- Test the pattern using the built-in Test button with sample text.
- Fix the regex pattern and re-enable.
Common regex issues include unescaped special characters (`.`, `*`, `(`), unmatched groups, and overly broad patterns such as `.*` that match everything.
High rate of false positives
If too many legitimate responses are being filtered:
- Review the audit log to identify the most frequent triggers.
- For custom rules: make patterns more specific or change scope from "Both" to "Response only".
- For built-in categories: switch from Block to Flag to collect data before making permanent changes.
- For prompt injection: lower the sensitivity level.
- Consider creating explicit "allow" exceptions for common false positive patterns.