Advanced Query Development
Build powerful, complex queries to extract precise insights from your data
Advanced Query Development
While Infactory can automatically generate simple queries for your data, developing advanced custom queries allows you to unlock deeper insights and create more sophisticated applications. This guide explores techniques for building powerful, reusable queries.
Query Development Fundamentals
Understanding the Query Lifecycle
Query Selection
Infactory matches a user’s natural language question to the most appropriate query.
Parameter Extraction
The system extracts parameters from the question to customize the query.
Query Execution
The query is executed against your data source with the extracted parameters.
Response Formatting
The query results are structured into a consistent response format.
Query Structure
All Infactory queries follow a consistent structure:
// Query structure
{
name: "unique_query_name",
description: "Human-readable description of what this query does",
examples: [
"Example question that should trigger this query?",
"Another example question with different phrasing?"
],
parameters: [
{
name: "parameter_name",
description: "What this parameter represents",
type: "string" // or number, date, boolean, etc.
}
],
slots: {
query: "Your database query with parameter placeholders: {{ parameter_name }}"
},
output_template: {
// Standard structure for returned data
}
}
Building Complex Queries
Parameter Types and Validation
Infactory supports various parameter types with validation:
String
Text values with optional validation patterns
Number
Numeric values with optional range validation
Date
Date values with flexible formats
Boolean
True/false values
Enum
Selection from predefined options
Array
Lists of values with item validation
Parameter Definition Examples
// String parameter with validation
{
name: "product_category",
description: "The category of products to analyze",
type: "string",
required: true,
validation: {
pattern: "^[A-Za-z0-9\\s\\-]+$",
message: "Category must contain only letters, numbers, spaces, and hyphens"
}
}
// Number parameter with range validation
{
name: "limit",
description: "Maximum number of results to return",
type: "number",
required: false,
default: 10,
validation: {
min: 1,
max: 100,
message: "Limit must be between 1 and 100"
}
}
// Date parameter with format options
{
name: "start_date",
description: "The start date for the analysis period",
type: "date",
required: true,
validation: {
format: "YYYY-MM-DD",
min: "2020-01-01",
message: "Start date must be after January 1, 2020"
}
}
// Enum parameter with fixed options
{
name: "sort_order",
description: "The sort direction for results",
type: "enum",
required: false,
default: "desc",
options: ["asc", "desc"],
validation: {
message: "Sort order must be either 'asc' or 'desc'"
}
}
Dynamic SQL Generation
Create dynamic SQL queries that adapt based on parameters:
{
name: "sales_by_dimension",
description: "Analyze sales data grouped by a chosen dimension",
parameters: [
{
name: "dimension",
description: "The dimension to group by",
type: "enum",
options: ["product_category", "region", "customer_segment", "sales_channel"],
required: true
},
{
name: "time_period",
description: "The time period to analyze",
type: "enum",
options: ["day", "week", "month", "quarter", "year"],
required: false,
default: "month"
},
{
name: "start_date",
description: "The start date for analysis",
type: "date",
required: false
},
{
name: "end_date",
description: "The end date for analysis",
type: "date",
required: false
}
],
slots: {
query: `
SELECT
{% if time_period == 'day' %}
DATE(created_at) as time_period,
{% elif time_period == 'week' %}
DATE_TRUNC('week', created_at) as time_period,
{% elif time_period == 'month' %}
DATE_TRUNC('month', created_at) as time_period,
{% elif time_period == 'quarter' %}
DATE_TRUNC('quarter', created_at) as time_period,
{% else %}
DATE_TRUNC('year', created_at) as time_period,
{% endif %}
{{ dimension }} as dimension,
SUM(amount) as total_sales,
COUNT(DISTINCT order_id) as order_count,
COUNT(DISTINCT customer_id) as customer_count
FROM sales
WHERE 1=1
{% if start_date %}
AND created_at >= '{{ start_date }}'
{% endif %}
{% if end_date %}
AND created_at <= '{{ end_date }}'
{% endif %}
GROUP BY time_period, dimension
ORDER BY time_period, total_sales DESC
`
}
}
Handling Multiple Data Sources
For queries that need to access multiple tables or data sources:
{
name: "customer_order_analysis",
description: "Analyze customer purchasing patterns with demographic information",
parameters: [
{
name: "segment",
description: "Customer segment to analyze",
type: "string",
required: false
},
{
name: "min_purchase_count",
description: "Minimum number of purchases",
type: "number",
required: false,
default: 2
}
],
slots: {
query: `
WITH customer_orders AS (
SELECT
customer_id,
COUNT(order_id) as order_count,
SUM(amount) as total_spent,
AVG(amount) as avg_order_value,
MIN(created_at) as first_purchase,
MAX(created_at) as last_purchase
FROM orders
GROUP BY customer_id
HAVING COUNT(order_id) >= {{ min_purchase_count }}
)
SELECT
c.id,
c.name,
c.email,
c.segment,
co.order_count,
co.total_spent,
co.avg_order_value,
co.first_purchase,
co.last_purchase,
DATEDIFF(day, co.first_purchase, co.last_purchase) as customer_lifetime_days
FROM customers c
JOIN customer_orders co ON c.id = co.customer_id
WHERE 1=1
{% if segment %}
AND c.segment = '{{ segment }}'
{% endif %}
ORDER BY co.total_spent DESC
`
}
}
Advanced Techniques
Parameterized Aggregations
Create queries that dynamically change aggregation methods:
{
name: "metric_by_dimension",
description: "Analyze a specific metric aggregated by a dimension",
parameters: [
{
name: "metric",
description: "The metric to analyze",
type: "enum",
options: ["revenue", "order_count", "items_sold", "discount_amount"],
required: true
},
{
name: "aggregation",
description: "How to aggregate the metric",
type: "enum",
options: ["sum", "avg", "min", "max", "count"],
required: false,
default: "sum"
},
{
name: "dimension",
description: "The dimension to group by",
type: "enum",
options: ["product_category", "region", "customer_segment", "sales_channel"],
required: true
}
],
slots: {
query: `
SELECT
{{ dimension }} as dimension,
{% if aggregation == 'sum' %}
{% if metric == 'revenue' %}
SUM(amount) as value
{% elif metric == 'order_count' %}
COUNT(DISTINCT order_id) as value
{% elif metric == 'items_sold' %}
SUM(quantity) as value
{% elif metric == 'discount_amount' %}
SUM(discount) as value
{% endif %}
{% elif aggregation == 'avg' %}
{% if metric == 'revenue' %}
AVG(amount) as value
{% elif metric == 'order_count' %}
AVG(orders_per_customer) as value
{% elif metric == 'items_sold' %}
AVG(quantity) as value
{% elif metric == 'discount_amount' %}
AVG(discount) as value
{% endif %}
{% elif aggregation == 'min' %}
{% if metric == 'revenue' %}
MIN(amount) as value
{% elif metric == 'order_count' %}
MIN(orders_per_customer) as value
{% elif metric == 'items_sold' %}
MIN(quantity) as value
{% elif metric == 'discount_amount' %}
MIN(discount) as value
{% endif %}
{% elif aggregation == 'max' %}
{% if metric == 'revenue' %}
MAX(amount) as value
{% elif metric == 'order_count' %}
MAX(orders_per_customer) as value
{% elif metric == 'items_sold' %}
MAX(quantity) as value
{% elif metric == 'discount_amount' %}
MAX(discount) as value
{% endif %}
{% endif %}
FROM sales_data
GROUP BY {{ dimension }}
ORDER BY value DESC
`
}
}
Temporal Analysis Patterns
Queries for time-based analysis:
{
name: "period_over_period_comparison",
description: "Compare metrics between two time periods",
parameters: [
{
name: "metric",
description: "The metric to compare",
type: "enum",
options: ["revenue", "orders", "customers"],
required: true
},
{
name: "current_period_start",
description: "Start date of current period",
type: "date",
required: true
},
{
name: "current_period_end",
description: "End date of current period",
type: "date",
required: true
},
{
name: "comparison_type",
description: "How to determine the comparison period",
type: "enum",
options: ["previous_period", "same_period_last_year"],
required: false,
default: "previous_period"
}
],
slots: {
query: `
WITH current_period AS (
SELECT
{% if metric == 'revenue' %}
SUM(amount) as metric_value
{% elif metric == 'orders' %}
COUNT(DISTINCT order_id) as metric_value
{% elif metric == 'customers' %}
COUNT(DISTINCT customer_id) as metric_value
{% endif %}
FROM orders
WHERE created_at BETWEEN '{{ current_period_start }}' AND '{{ current_period_end }}'
),
comparison_period AS (
SELECT
{% if metric == 'revenue' %}
SUM(amount) as metric_value
{% elif metric == 'orders' %}
COUNT(DISTINCT order_id) as metric_value
{% elif metric == 'customers' %}
COUNT(DISTINCT customer_id) as metric_value
{% endif %}
FROM orders
WHERE
{% if comparison_type == 'previous_period' %}
-- Calculate previous period of same duration
created_at BETWEEN
DATE_SUB('{{ current_period_start }}', INTERVAL DATEDIFF(DAY, '{{ current_period_start }}', '{{ current_period_end }}') + 1 DAY)
AND DATE_SUB('{{ current_period_end }}', INTERVAL DATEDIFF(DAY, '{{ current_period_start }}', '{{ current_period_end }}') + 1 DAY)
{% elif comparison_type == 'same_period_last_year' %}
created_at BETWEEN
DATE_SUB('{{ current_period_start }}', INTERVAL 1 YEAR)
AND DATE_SUB('{{ current_period_end }}', INTERVAL 1 YEAR)
{% endif %}
)
SELECT
cp.metric_value as current_value,
comp.metric_value as comparison_value,
cp.metric_value - comp.metric_value as absolute_change,
CASE
WHEN comp.metric_value = 0 THEN NULL
ELSE (cp.metric_value - comp.metric_value) / comp.metric_value * 100
END as percentage_change
FROM current_period cp, comparison_period comp
`
}
}
Cohort Analysis
Create queries for analyzing user cohorts:
{
name: "customer_cohort_retention",
description: "Analyze customer retention by acquisition cohort",
parameters: [
{
name: "cohort_period",
description: "Time period for cohort grouping",
type: "enum",
options: ["week", "month", "quarter"],
required: false,
default: "month"
},
{
name: "max_periods",
description: "Maximum number of periods to analyze",
type: "number",
required: false,
default: 12
}
],
slots: {
query: `
WITH
-- Identify first purchase date for each customer
first_purchases AS (
SELECT
customer_id,
MIN(created_at) as first_purchase_date,
DATE_TRUNC('{{ cohort_period }}', MIN(created_at)) as cohort_date
FROM orders
GROUP BY customer_id
),
-- Get all purchases with cohort information
cohort_purchases AS (
SELECT
o.customer_id,
o.order_id,
o.created_at as purchase_date,
fp.cohort_date,
DATE_TRUNC('{{ cohort_period }}', o.created_at) as activity_date,
DATEDIFF('{{ cohort_period }}', fp.cohort_date, DATE_TRUNC('{{ cohort_period }}', o.created_at)) as period_number
FROM orders o
JOIN first_purchases fp ON o.customer_id = fp.customer_id
WHERE DATEDIFF('{{ cohort_period }}', fp.cohort_date, DATE_TRUNC('{{ cohort_period }}', o.created_at)) <= {{ max_periods }}
),
-- Calculate cohort sizes
cohort_sizes AS (
SELECT
cohort_date,
COUNT(DISTINCT customer_id) as customers
FROM first_purchases
GROUP BY cohort_date
),
-- Calculate retention for each cohort and period
cohort_retention AS (
SELECT
cp.cohort_date,
cp.period_number,
COUNT(DISTINCT cp.customer_id) as active_customers
FROM cohort_purchases cp
GROUP BY cp.cohort_date, cp.period_number
)
-- Final retention percentages
SELECT
cr.cohort_date,
cs.customers as cohort_size,
cr.period_number,
cr.active_customers,
ROUND(cr.active_customers * 100.0 / cs.customers, 2) as retention_percentage
FROM cohort_retention cr
JOIN cohort_sizes cs ON cr.cohort_date = cs.cohort_date
ORDER BY cr.cohort_date, cr.period_number
`
}
}
Funnel Analysis
Create queries for analyzing conversion funnels:
{
name: "conversion_funnel",
description: "Analyze the conversion funnel between specified events",
parameters: [
{
name: "start_event",
description: "The first event in the funnel",
type: "string",
required: true
},
{
name: "middle_events",
description: "Ordered list of intermediate events",
type: "array",
items: {
type: "string"
},
required: false,
default: []
},
{
name: "end_event",
description: "The final conversion event",
type: "string",
required: true
},
{
name: "time_window_days",
description: "Maximum days between first and last event",
type: "number",
required: false,
default: 30
}
],
slots: {
query: `
WITH funnel_events AS (
SELECT
user_id,
event_name,
event_time,
CASE
WHEN event_name = '{{ start_event }}' THEN 1
{% for event in middle_events %}
WHEN event_name = '{{ event }}' THEN {{ loop.index + 1 }}
{% endfor %}
WHEN event_name = '{{ end_event }}' THEN {{ middle_events|length + 2 }}
END as step_number
FROM events
WHERE event_name IN ('{{ start_event }}'
{% for event in middle_events %}
, '{{ event }}'
{% endfor %}
, '{{ end_event }}')
),
funnel_users AS (
SELECT
user_id,
MAX(CASE WHEN step_number = 1 THEN 1 ELSE 0 END) as reached_step_1,
{% for i in range(2, middle_events|length + 2) %}
MAX(CASE WHEN step_number = {{ i }} THEN 1 ELSE 0 END) as reached_step_{{ i }},
{% endfor %}
MAX(CASE WHEN step_number = {{ middle_events|length + 2 }} THEN 1 ELSE 0 END) as reached_final_step
FROM funnel_events
GROUP BY user_id
HAVING reached_step_1 = 1
),
total_by_step AS (
SELECT
SUM(reached_step_1) as step_1_count,
{% for i in range(2, middle_events|length + 2) %}
SUM(reached_step_{{ i }}) as step_{{ i }}_count,
{% endfor %}
SUM(reached_final_step) as final_step_count
FROM funnel_users
)
SELECT
'{{ start_event }}' as step_name,
step_1_count as user_count,
100 as percentage_of_previous,
100 as percentage_of_total
{% for i in range(2, middle_events|length + 2) %}
UNION ALL
SELECT
{% if i <= middle_events|length %}
'{{ middle_events[i-2] }}' as step_name,
{% else %}
'{{ end_event }}' as step_name,
{% endif %}
step_{{ i }}_count as user_count,
CASE
WHEN step_{{ i-1 }}_count = 0 THEN 0
ELSE ROUND(step_{{ i }}_count * 100.0 / step_{{ i-1 }}_count, 2)
END as percentage_of_previous,
CASE
WHEN step_1_count = 0 THEN 0
ELSE ROUND(step_{{ i }}_count * 100.0 / step_1_count, 2)
END as percentage_of_total
{% endfor %}
FROM total_by_step
ORDER BY step_name
`
}
}
Query Optimization Techniques
Performance-Focused Queries
Optimize query performance with these techniques:
Use Appropriate Indexes
Ensure your database has indexes on frequently queried columns
Limit Result Sets
Add LIMIT clauses to prevent returning too many rows
Filter Early
Apply WHERE clauses before expensive operations like joins
Avoid SELECT *
Specify only the columns you need rather than selecting all
Use CTEs
Common Table Expressions improve readability and optimization
Consider Materialized Views
For frequent complex queries, consider using materialized views
Query Example with Optimization
{
name: "optimized_sales_analytics",
description: "Optimized query for sales analytics with proper indexing hints",
parameters: [
{
name: "date_range",
description: "Time period to analyze",
type: "enum",
options: ["last_7_days", "last_30_days", "last_90_days", "last_365_days"],
required: false,
default: "last_30_days"
},
{
name: "category",
description: "Product category to filter by",
type: "string",
required: false
}
],
slots: {
query: `
/* Use index on created_at and category columns */
WITH filtered_sales AS (
SELECT
s.id,
s.product_id,
s.amount,
s.created_at
FROM sales s
WHERE
/* Apply date filter first to limit data set */
s.created_at >= CASE
WHEN '{{ date_range }}' = 'last_7_days' THEN DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY)
WHEN '{{ date_range }}' = 'last_30_days' THEN DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
WHEN '{{ date_range }}' = 'last_90_days' THEN DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY)
WHEN '{{ date_range }}' = 'last_365_days' THEN DATE_SUB(CURRENT_DATE(), INTERVAL 365 DAY)
END
{% if category %}
/* Only join products table if category filter is applied */
AND s.product_id IN (
SELECT id FROM products WHERE category = '{{ category }}'
)
{% endif %}
/* Limit to reasonable amount even in analysis queries */
LIMIT 1000000
),
/* Pre-aggregate data to reduce rows before final grouping */
daily_sales AS (
SELECT
DATE(created_at) as sale_date,
SUM(amount) as daily_amount,
COUNT(*) as transaction_count
FROM filtered_sales
GROUP BY DATE(created_at)
)
/* Final aggregation on much smaller dataset */
SELECT
sale_date,
daily_amount,
transaction_count,
SUM(daily_amount) OVER (ORDER BY sale_date) as cumulative_amount
FROM daily_sales
ORDER BY sale_date
`
}
}
Testing and Troubleshooting
Query Testing Strategies
Test your query with both valid and invalid parameter values to ensure validation works correctly:
// Test with valid parameters
{
"dimension": "product_category",
"time_period": "month"
}
// Test with invalid parameters
{
"dimension": "invalid_column",
"time_period": "decade" // not in allowed options
}
Test with edge cases that might cause issues:
// Empty result sets
{
"start_date": "2099-01-01", // Future date with no data
"metric": "revenue"
}
// Extremely large result sets
{
"date_range": "all_time",
"granularity": "day"
}
// Special characters in string parameters
{
"product_name": "Product's Name; With Special -- Characters"
}
Test query performance with realistic data volumes:
- Execute the query against a production-sized dataset
- Monitor execution time and resource usage
- Use EXPLAIN ANALYZE to identify bottlenecks
- Test with different parameter combinations that might affect performance
Common Query Issues
Problem: Infactory fails to extract parameters from natural language questions.
Solutions:
- Improve example questions: Add more diverse examples to your query definition
- Parameter descriptions: Make parameter descriptions more detailed
- Default values: Provide sensible defaults for non-critical parameters
- Validation messages: Set clear validation messages that explain proper format
Problem: Queries are slow to execute, especially with certain parameters.
Solutions:
- Review execution plan: Use EXPLAIN ANALYZE to identify bottlenecks
- Add appropriate indexes: Ensure columns used in WHERE, JOIN, and GROUP BY have indexes
- Limit result sets: Add LIMIT clauses to prevent returning too many rows
- Pre-aggregate data: Use CTEs to pre-aggregate data before final processing
- Parameter-specific optimization: Add conditional logic for better execution paths based on parameters
Problem: The wrong query is selected for certain questions.
Solutions:
- Improve query description: Make the description more specific to its purpose
- Add diverse examples: Include variations of questions that should trigger this query
- Review similar queries: Check for other queries with overlapping purposes
- Parameter differentiation: Make parameter names and descriptions more distinct
Best Practices for Query Development
Consistent Naming
Use consistent naming conventions for queries, parameters, and output fields
Clear Documentation
Document your queries with detailed descriptions and usage examples
Parameterize Everything
Make queries flexible by parameterizing filters, limits, and groupings
Error Handling
Include error handling for edge cases and invalid inputs
Performance Testing
Test queries with realistic data volumes and monitor performance
Version Control
Maintain version control for your queries to track changes
Next Steps
After mastering advanced query development, explore:
- Building complex search interfaces with your queries
- Implementing security best practices for sensitive data
- Optimizing performance for high-volume applications
- Creating intelligent dashboards powered by your advanced queries
Was this page helpful?
- Advanced Query Development
- Query Development Fundamentals
- Understanding the Query Lifecycle
- Query Structure
- Building Complex Queries
- Parameter Types and Validation
- Parameter Definition Examples
- Dynamic SQL Generation
- Handling Multiple Data Sources
- Advanced Techniques
- Parameterized Aggregations
- Temporal Analysis Patterns
- Cohort Analysis
- Funnel Analysis
- Query Optimization Techniques
- Performance-Focused Queries
- Query Example with Optimization
- Testing and Troubleshooting
- Query Testing Strategies
- Common Query Issues
- Best Practices for Query Development
- Next Steps