Advanced Query Development

While Infactory can automatically generate simple queries for your data, developing advanced custom queries allows you to unlock deeper insights and create more sophisticated applications. This guide explores techniques for building powerful, reusable queries.

Query Development Fundamentals

Understanding the Query Lifecycle

1

Query Selection

Infactory matches a user’s natural language question to the most appropriate query.

2

Parameter Extraction

The system extracts parameters from the question to customize the query.

3

Query Execution

The query is executed against your data source with the extracted parameters.

4

Response Formatting

The query results are structured into a consistent response format.

Query Structure

All Infactory queries follow a consistent structure:

// Query structure
{
  name: "unique_query_name",
  description: "Human-readable description of what this query does",
  examples: [
    "Example question that should trigger this query?",
    "Another example question with different phrasing?"
  ],
  parameters: [
    {
      name: "parameter_name",
      description: "What this parameter represents",
      type: "string" // or number, date, boolean, etc.
    }
  ],
  slots: {
    query: "Your database query with parameter placeholders: {{ parameter_name }}"
  },
  output_template: {
    // Standard structure for returned data
  }
}

Building Complex Queries

Parameter Types and Validation

Infactory supports various parameter types with validation:

String

Text values with optional validation patterns

Number

Numeric values with optional range validation

Date

Date values with flexible formats

Boolean

True/false values

Enum

Selection from predefined options

Array

Lists of values with item validation

Parameter Definition Examples

// String parameter with validation
{
  name: "product_category",
  description: "The category of products to analyze",
  type: "string",
  required: true,
  validation: {
    pattern: "^[A-Za-z0-9\\s\\-]+$",
    message: "Category must contain only letters, numbers, spaces, and hyphens"
  }
}

// Number parameter with range validation
{
  name: "limit",
  description: "Maximum number of results to return",
  type: "number",
  required: false,
  default: 10,
  validation: {
    min: 1,
    max: 100,
    message: "Limit must be between 1 and 100"
  }
}

// Date parameter with format options
{
  name: "start_date",
  description: "The start date for the analysis period",
  type: "date",
  required: true,
  validation: {
    format: "YYYY-MM-DD",
    min: "2020-01-01",
    message: "Start date must be after January 1, 2020"
  }
}

// Enum parameter with fixed options
{
  name: "sort_order",
  description: "The sort direction for results",
  type: "enum",
  required: false,
  default: "desc",
  options: ["asc", "desc"],
  validation: {
    message: "Sort order must be either 'asc' or 'desc'"
  }
}

Dynamic SQL Generation

Create dynamic SQL queries that adapt based on parameters:

{
  name: "sales_by_dimension",
  description: "Analyze sales data grouped by a chosen dimension",
  parameters: [
    {
      name: "dimension",
      description: "The dimension to group by",
      type: "enum",
      options: ["product_category", "region", "customer_segment", "sales_channel"],
      required: true
    },
    {
      name: "time_period",
      description: "The time period to analyze",
      type: "enum",
      options: ["day", "week", "month", "quarter", "year"],
      required: false,
      default: "month"
    },
    {
      name: "start_date",
      description: "The start date for analysis",
      type: "date",
      required: false
    },
    {
      name: "end_date",
      description: "The end date for analysis",
      type: "date",
      required: false
    }
  ],
  slots: {
    query: `
      SELECT 
        {% if time_period == 'day' %}
          DATE(created_at) as time_period,
        {% elif time_period == 'week' %}
          DATE_TRUNC('week', created_at) as time_period,
        {% elif time_period == 'month' %}
          DATE_TRUNC('month', created_at) as time_period,
        {% elif time_period == 'quarter' %}
          DATE_TRUNC('quarter', created_at) as time_period,
        {% else %}
          DATE_TRUNC('year', created_at) as time_period,
        {% endif %}
        {{ dimension }} as dimension,
        SUM(amount) as total_sales,
        COUNT(DISTINCT order_id) as order_count,
        COUNT(DISTINCT customer_id) as customer_count
      FROM sales
      WHERE 1=1
        {% if start_date %}
          AND created_at >= '{{ start_date }}'
        {% endif %}
        {% if end_date %}
          AND created_at <= '{{ end_date }}'
        {% endif %}
      GROUP BY time_period, dimension
      ORDER BY time_period, total_sales DESC
    `
  }
}

Handling Multiple Data Sources

For queries that need to access multiple tables or data sources:

{
  name: "customer_order_analysis",
  description: "Analyze customer purchasing patterns with demographic information",
  parameters: [
    {
      name: "segment",
      description: "Customer segment to analyze",
      type: "string",
      required: false
    },
    {
      name: "min_purchase_count",
      description: "Minimum number of purchases",
      type: "number",
      required: false,
      default: 2
    }
  ],
  slots: {
    query: `
      WITH customer_orders AS (
        SELECT 
          customer_id,
          COUNT(order_id) as order_count,
          SUM(amount) as total_spent,
          AVG(amount) as avg_order_value,
          MIN(created_at) as first_purchase,
          MAX(created_at) as last_purchase
        FROM orders
        GROUP BY customer_id
        HAVING COUNT(order_id) >= {{ min_purchase_count }}
      )
      
      SELECT 
        c.id,
        c.name,
        c.email,
        c.segment,
        co.order_count,
        co.total_spent,
        co.avg_order_value,
        co.first_purchase,
        co.last_purchase,
        DATEDIFF(day, co.first_purchase, co.last_purchase) as customer_lifetime_days
      FROM customers c
      JOIN customer_orders co ON c.id = co.customer_id
      WHERE 1=1
        {% if segment %}
          AND c.segment = '{{ segment }}'
        {% endif %}
      ORDER BY co.total_spent DESC
    `
  }
}

Advanced Techniques

Parameterized Aggregations

Create queries that dynamically change aggregation methods:

{
  name: "metric_by_dimension",
  description: "Analyze a specific metric aggregated by a dimension",
  parameters: [
    {
      name: "metric",
      description: "The metric to analyze",
      type: "enum",
      options: ["revenue", "order_count", "items_sold", "discount_amount"],
      required: true
    },
    {
      name: "aggregation",
      description: "How to aggregate the metric",
      type: "enum",
      options: ["sum", "avg", "min", "max", "count"],
      required: false,
      default: "sum"
    },
    {
      name: "dimension",
      description: "The dimension to group by",
      type: "enum",
      options: ["product_category", "region", "customer_segment", "sales_channel"],
      required: true
    }
  ],
  slots: {
    query: `
      SELECT 
        {{ dimension }} as dimension,
        {% if aggregation == 'sum' %}
          {% if metric == 'revenue' %}
            SUM(amount) as value
          {% elif metric == 'order_count' %}
            COUNT(DISTINCT order_id) as value
          {% elif metric == 'items_sold' %}
            SUM(quantity) as value
          {% elif metric == 'discount_amount' %}
            SUM(discount) as value
          {% endif %}
        {% elif aggregation == 'avg' %}
          {% if metric == 'revenue' %}
            AVG(amount) as value
          {% elif metric == 'order_count' %}
            AVG(orders_per_customer) as value
          {% elif metric == 'items_sold' %}
            AVG(quantity) as value
          {% elif metric == 'discount_amount' %}
            AVG(discount) as value
          {% endif %}
        {% elif aggregation == 'min' %}
          {% if metric == 'revenue' %}
            MIN(amount) as value
          {% elif metric == 'order_count' %}
            MIN(orders_per_customer) as value
          {% elif metric == 'items_sold' %}
            MIN(quantity) as value
          {% elif metric == 'discount_amount' %}
            MIN(discount) as value
          {% endif %}
        {% elif aggregation == 'max' %}
          {% if metric == 'revenue' %}
            MAX(amount) as value
          {% elif metric == 'order_count' %}
            MAX(orders_per_customer) as value
          {% elif metric == 'items_sold' %}
            MAX(quantity) as value
          {% elif metric == 'discount_amount' %}
            MAX(discount) as value
          {% endif %}
        {% endif %}
      FROM sales_data
      GROUP BY {{ dimension }}
      ORDER BY value DESC
    `
  }
}

Temporal Analysis Patterns

Queries for time-based analysis:

{
  name: "period_over_period_comparison",
  description: "Compare metrics between two time periods",
  parameters: [
    {
      name: "metric",
      description: "The metric to compare",
      type: "enum",
      options: ["revenue", "orders", "customers"],
      required: true
    },
    {
      name: "current_period_start",
      description: "Start date of current period",
      type: "date",
      required: true
    },
    {
      name: "current_period_end",
      description: "End date of current period",
      type: "date",
      required: true
    },
    {
      name: "comparison_type",
      description: "How to determine the comparison period",
      type: "enum",
      options: ["previous_period", "same_period_last_year"],
      required: false,
      default: "previous_period"
    }
  ],
  slots: {
    query: `
      WITH current_period AS (
        SELECT 
          {% if metric == 'revenue' %}
            SUM(amount) as metric_value
          {% elif metric == 'orders' %}
            COUNT(DISTINCT order_id) as metric_value
          {% elif metric == 'customers' %}
            COUNT(DISTINCT customer_id) as metric_value
          {% endif %}
        FROM orders
        WHERE created_at BETWEEN '{{ current_period_start }}' AND '{{ current_period_end }}'
      ),
      
      comparison_period AS (
        SELECT 
          {% if metric == 'revenue' %}
            SUM(amount) as metric_value
          {% elif metric == 'orders' %}
            COUNT(DISTINCT order_id) as metric_value
          {% elif metric == 'customers' %}
            COUNT(DISTINCT customer_id) as metric_value
          {% endif %}
        FROM orders
        WHERE 
          {% if comparison_type == 'previous_period' %}
            -- Calculate previous period of same duration
            created_at BETWEEN 
              DATE_SUB('{{ current_period_start }}', INTERVAL DATEDIFF(DAY, '{{ current_period_start }}', '{{ current_period_end }}') + 1 DAY) 
              AND DATE_SUB('{{ current_period_end }}', INTERVAL DATEDIFF(DAY, '{{ current_period_start }}', '{{ current_period_end }}') + 1 DAY)
          {% elif comparison_type == 'same_period_last_year' %}
            created_at BETWEEN 
              DATE_SUB('{{ current_period_start }}', INTERVAL 1 YEAR) 
              AND DATE_SUB('{{ current_period_end }}', INTERVAL 1 YEAR)
          {% endif %}
      )
      
      SELECT
        cp.metric_value as current_value,
        comp.metric_value as comparison_value,
        cp.metric_value - comp.metric_value as absolute_change,
        CASE 
          WHEN comp.metric_value = 0 THEN NULL
          ELSE (cp.metric_value - comp.metric_value) / comp.metric_value * 100 
        END as percentage_change
      FROM current_period cp, comparison_period comp
    `
  }
}

Cohort Analysis

Create queries for analyzing user cohorts:

{
  name: "customer_cohort_retention",
  description: "Analyze customer retention by acquisition cohort",
  parameters: [
    {
      name: "cohort_period",
      description: "Time period for cohort grouping",
      type: "enum",
      options: ["week", "month", "quarter"],
      required: false,
      default: "month"
    },
    {
      name: "max_periods",
      description: "Maximum number of periods to analyze",
      type: "number",
      required: false,
      default: 12
    }
  ],
  slots: {
    query: `
      WITH 
      -- Identify first purchase date for each customer
      first_purchases AS (
        SELECT 
          customer_id,
          MIN(created_at) as first_purchase_date,
          DATE_TRUNC('{{ cohort_period }}', MIN(created_at)) as cohort_date
        FROM orders
        GROUP BY customer_id
      ),
      
      -- Get all purchases with cohort information
      cohort_purchases AS (
        SELECT
          o.customer_id,
          o.order_id,
          o.created_at as purchase_date,
          fp.cohort_date,
          DATE_TRUNC('{{ cohort_period }}', o.created_at) as activity_date,
          DATEDIFF('{{ cohort_period }}', fp.cohort_date, DATE_TRUNC('{{ cohort_period }}', o.created_at)) as period_number
        FROM orders o
        JOIN first_purchases fp ON o.customer_id = fp.customer_id
        WHERE DATEDIFF('{{ cohort_period }}', fp.cohort_date, DATE_TRUNC('{{ cohort_period }}', o.created_at)) <= {{ max_periods }}
      ),
      
      -- Calculate cohort sizes
      cohort_sizes AS (
        SELECT 
          cohort_date,
          COUNT(DISTINCT customer_id) as customers
        FROM first_purchases
        GROUP BY cohort_date
      ),
      
      -- Calculate retention for each cohort and period
      cohort_retention AS (
        SELECT
          cp.cohort_date,
          cp.period_number,
          COUNT(DISTINCT cp.customer_id) as active_customers
        FROM cohort_purchases cp
        GROUP BY cp.cohort_date, cp.period_number
      )
      
      -- Final retention percentages
      SELECT
        cr.cohort_date,
        cs.customers as cohort_size,
        cr.period_number,
        cr.active_customers,
        ROUND(cr.active_customers * 100.0 / cs.customers, 2) as retention_percentage
      FROM cohort_retention cr
      JOIN cohort_sizes cs ON cr.cohort_date = cs.cohort_date
      ORDER BY cr.cohort_date, cr.period_number
    `
  }
}

Funnel Analysis

Create queries for analyzing conversion funnels:

{
  name: "conversion_funnel",
  description: "Analyze the conversion funnel between specified events",
  parameters: [
    {
      name: "start_event",
      description: "The first event in the funnel",
      type: "string",
      required: true
    },
    {
      name: "middle_events",
      description: "Ordered list of intermediate events",
      type: "array",
      items: {
        type: "string"
      },
      required: false,
      default: []
    },
    {
      name: "end_event",
      description: "The final conversion event",
      type: "string",
      required: true
    },
    {
      name: "time_window_days",
      description: "Maximum days between first and last event",
      type: "number",
      required: false,
      default: 30
    }
  ],
  slots: {
    query: `
      WITH funnel_events AS (
        SELECT
          user_id,
          event_name,
          event_time,
          CASE 
            WHEN event_name = '{{ start_event }}' THEN 1
            {% for event in middle_events %}
            WHEN event_name = '{{ event }}' THEN {{ loop.index + 1 }}
            {% endfor %}
            WHEN event_name = '{{ end_event }}' THEN {{ middle_events|length + 2 }}
          END as step_number
        FROM events
        WHERE event_name IN ('{{ start_event }}'
          {% for event in middle_events %}
          , '{{ event }}'
          {% endfor %}
          , '{{ end_event }}')
      ),
      
      funnel_users AS (
        SELECT
          user_id,
          MAX(CASE WHEN step_number = 1 THEN 1 ELSE 0 END) as reached_step_1,
          {% for i in range(2, middle_events|length + 2) %}
          MAX(CASE WHEN step_number = {{ i }} THEN 1 ELSE 0 END) as reached_step_{{ i }},
          {% endfor %}
          MAX(CASE WHEN step_number = {{ middle_events|length + 2 }} THEN 1 ELSE 0 END) as reached_final_step
        FROM funnel_events
        GROUP BY user_id
        HAVING reached_step_1 = 1
      ),
      
      total_by_step AS (
        SELECT
          SUM(reached_step_1) as step_1_count,
          {% for i in range(2, middle_events|length + 2) %}
          SUM(reached_step_{{ i }}) as step_{{ i }}_count,
          {% endfor %}
          SUM(reached_final_step) as final_step_count
        FROM funnel_users
      )
      
      SELECT
        '{{ start_event }}' as step_name,
        step_1_count as user_count,
        100 as percentage_of_previous,
        100 as percentage_of_total
        
        {% for i in range(2, middle_events|length + 2) %}
        UNION ALL
        SELECT
          {% if i <= middle_events|length %}
          '{{ middle_events[i-2] }}' as step_name,
          {% else %}
          '{{ end_event }}' as step_name,
          {% endif %}
          step_{{ i }}_count as user_count,
          CASE 
            WHEN step_{{ i-1 }}_count = 0 THEN 0 
            ELSE ROUND(step_{{ i }}_count * 100.0 / step_{{ i-1 }}_count, 2) 
          END as percentage_of_previous,
          CASE 
            WHEN step_1_count = 0 THEN 0 
            ELSE ROUND(step_{{ i }}_count * 100.0 / step_1_count, 2) 
          END as percentage_of_total
        {% endfor %}
      FROM total_by_step
      ORDER BY step_name
    `
  }
}

Query Optimization Techniques

Performance-Focused Queries

Optimize query performance with these techniques:

Use Appropriate Indexes

Ensure your database has indexes on frequently queried columns

Limit Result Sets

Add LIMIT clauses to prevent returning too many rows

Filter Early

Apply WHERE clauses before expensive operations like joins

Avoid SELECT *

Specify only the columns you need rather than selecting all

Use CTEs

Common Table Expressions improve readability and optimization

Consider Materialized Views

For frequent complex queries, consider using materialized views

Query Example with Optimization

{
  name: "optimized_sales_analytics",
  description: "Optimized query for sales analytics with proper indexing hints",
  parameters: [
    {
      name: "date_range",
      description: "Time period to analyze",
      type: "enum",
      options: ["last_7_days", "last_30_days", "last_90_days", "last_365_days"],
      required: false,
      default: "last_30_days"
    },
    {
      name: "category",
      description: "Product category to filter by",
      type: "string",
      required: false
    }
  ],
  slots: {
    query: `
      /* Use index on created_at and category columns */
      WITH filtered_sales AS (
        SELECT 
          s.id,
          s.product_id,
          s.amount,
          s.created_at
        FROM sales s
        WHERE 
          /* Apply date filter first to limit data set */
          s.created_at >= CASE
            WHEN '{{ date_range }}' = 'last_7_days' THEN DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY)
            WHEN '{{ date_range }}' = 'last_30_days' THEN DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
            WHEN '{{ date_range }}' = 'last_90_days' THEN DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY)
            WHEN '{{ date_range }}' = 'last_365_days' THEN DATE_SUB(CURRENT_DATE(), INTERVAL 365 DAY)
          END
          {% if category %}
          /* Only join products table if category filter is applied */
          AND s.product_id IN (
            SELECT id FROM products WHERE category = '{{ category }}'
          )
          {% endif %}
        /* Limit to reasonable amount even in analysis queries */
        LIMIT 1000000
      ),
      
      /* Pre-aggregate data to reduce rows before final grouping */
      daily_sales AS (
        SELECT 
          DATE(created_at) as sale_date,
          SUM(amount) as daily_amount,
          COUNT(*) as transaction_count
        FROM filtered_sales
        GROUP BY DATE(created_at)
      )
      
      /* Final aggregation on much smaller dataset */
      SELECT 
        sale_date,
        daily_amount,
        transaction_count,
        SUM(daily_amount) OVER (ORDER BY sale_date) as cumulative_amount
      FROM daily_sales
      ORDER BY sale_date
    `
  }
}

Testing and Troubleshooting

Query Testing Strategies

Common Query Issues

Best Practices for Query Development

Consistent Naming

Use consistent naming conventions for queries, parameters, and output fields

Clear Documentation

Document your queries with detailed descriptions and usage examples

Parameterize Everything

Make queries flexible by parameterizing filters, limits, and groupings

Error Handling

Include error handling for edge cases and invalid inputs

Performance Testing

Test queries with realistic data volumes and monitor performance

Version Control

Maintain version control for your queries to track changes

Next Steps

After mastering advanced query development, explore: