Blog | Tech Catalogue Blogs

Data Modeling: A short guide with examples

October 14, 2025 · 26 min read

Introduction to Data Modeling
Types of Data Models
Entity Modeling
Relationship Modeling
Database Modeling
Dimensional Modeling
Object-Oriented Data Modeling
NoSQL Data Modeling
Graph Data Modeling
Time-Series Data Modeling
Event-Driven Data Modeling
Data Modeling Methodologies
Best Practices and Design Patterns
Tools and Technologies

Introduction to Data Modeling

Data modeling is the process of creating a conceptual representation of data objects, their relationships, and the rules that govern them. It serves as a blueprint for database design and helps ensure data integrity, consistency, and efficient access patterns.

Why Data Modeling Matters

Communication: Provides a common language between business and technical teams
Documentation: Serves as living documentation of data structures
Quality: Ensures data consistency and integrity
Performance: Optimizes data access patterns and query performance
Maintenance: Simplifies future modifications and enhancements

Data Modeling Levels

Conceptual Model (What) → Logical Model (How) → Physical Model (Where)
     ↓                        ↓                      ↓
Business View            Technical View        Implementation View

Types of Data Models

Overview of Data Modeling Approaches

Model Type	Primary Focus	Use Cases	Key Characteristics
Entity Modeling	Business objects	Requirements gathering	High-level, business-focused
Relationship Modeling	Object connections	Process design	Focus on interactions
Database Modeling	Storage structure	Implementation	Technical, performance-oriented
Dimensional Modeling	Analytics	Data warehousing	Star/snowflake schemas
Object-Oriented	Code alignment	OOP applications	Inheritance, encapsulation
NoSQL Modeling	Flexible schemas	Modern applications	Document, key-value, column
Graph Modeling	Network relationships	Social networks, fraud	Nodes and edges
Time-Series	Temporal data	IoT, monitoring	Time-indexed data
Event-Driven	State changes	Event sourcing	Immutable events

Entity Modeling

Definition and Purpose

Entity modeling focuses on identifying and defining the core business objects (entities) that an organization needs to track. It's the foundation of conceptual data modeling.

Key Components

1. Entity Definition

Entity: A thing of significance to the business about which data must be stored
Examples: Customer, Product, Order, Employee, Account

2. Entity Characteristics

Tangible vs Intangible: Physical objects vs concepts
Independent vs Dependent: Can exist alone vs requires parent entity
Strong vs Weak: Has its own identifier vs borrows identifier

Entity Modeling Process

Step 1: Entity Identification

Business Requirements Analysis:
1. Review business processes
2. Identify nouns in requirements
3. Determine what needs to be tracked
4. Validate with stakeholders

Example: E-commerce System
- Customer (who buys)
- Product (what is sold)
- Order (purchase transaction)
- Category (product grouping)
- Supplier (product source)

Step 2: Entity Classification

Classification Framework:

Core Entities (Business Critical):
├── Customer
├── Product  
└── Order

Supporting Entities (Operational):
├── Category
├── Supplier
└── Address

Reference Entities (Master Data):
├── Country
├── Currency
└── Status

Step 3: Entity Attributes Definition

Entity: Customer
├── Identifying Attributes
│   └── CustomerID (Primary Key)
├── Descriptive Attributes
│   ├── FirstName
│   ├── LastName
│   ├── Email
│   └── PhoneNumber
└── Administrative Attributes
    ├── CreatedDate
    ├── LastModified
    └── IsActive

Entity Modeling Example: Library System

Library Management System Entities:

1. BOOK
   - BookID (PK)
   - ISBN
   - Title
   - Author
   - Publisher
   - PublicationYear
   - Genre
   - Available

2. MEMBER
   - MemberID (PK)
   - FirstName
   - LastName
   - Email
   - PhoneNumber
   - Address
   - MembershipType
   - JoinDate

3. LOAN
   - LoanID (PK)
   - BookID (FK)
   - MemberID (FK)
   - LoanDate
   - DueDate
   - ReturnDate
   - Status

4. AUTHOR
   - AuthorID (PK)
   - FirstName
   - LastName
   - Biography
   - Nationality

5. RESERVATION
   - ReservationID (PK)
   - BookID (FK)
   - MemberID (FK)
   - ReservationDate
   - ExpiryDate
   - Status

Relationship Modeling

Definition and Scope

Relationship modeling focuses on defining how entities interact with each other, capturing the business rules that govern these interactions.

Relationship Types

1. Cardinality Classifications

One-to-One (1:1):
├── Person ←→ Passport
├── Employee ←→ Desk Assignment
└── Country ←→ Capital City

One-to-Many (1:M):
├── Customer → Orders
├── Department → Employees
└── Category → Products

Many-to-Many (M:M):
├── Students ←→ Courses
├── Books ←→ Authors
└── Employees ←→ Projects

2. Participation Constraints

Mandatory vs Optional:

Mandatory (Total Participation):
- Every Order MUST have a Customer
- Every Employee MUST belong to a Department

Optional (Partial Participation):
- A Customer MAY have Orders
- An Employee MAY have a Manager

Advanced Relationship Concepts

1. Recursive Relationships

Employee Management Hierarchy:
Employee ──manages──→ Employee

Examples:
- Manager manages Employees
- Category has Parent Category
- Product has Related Products

2. Ternary Relationships

Three-way relationship:
Supplier ──supplies──→ Product ──to──→ Project

Business Rule:
"A supplier supplies specific products to specific projects"

3. Identifying vs Non-Identifying Relationships

Identifying Relationship:
Customer ──has──→ Order
(Order cannot exist without Customer)

Non-Identifying Relationship:
Order ──assigned_to──→ Salesperson
(Order exists independently of Salesperson assignment)

Social Media Relationships:

1. USER follows USER (M:M, Recursive)
   - FollowerID
   - FollowingID
   - FollowDate
   - NotificationEnabled

2. USER creates POST (1:M)
   - UserID
   - PostID
   - CreationDate

3. USER likes POST (M:M)
   - UserID
   - PostID
   - LikeDate
   - LikeType (like, love, laugh, etc.)

4. POST belongs_to GROUP (M:1)
   - PostID
   - GroupID
   - PostDate

5. USER joins GROUP (M:M)
   - UserID
   - GroupID
   - JoinDate
   - Role (member, admin, moderator)

Database Modeling

Definition and Focus

Database modeling translates conceptual and logical models into physical database structures, focusing on implementation details, performance optimization, and storage efficiency.

Database Model Types

1. Relational Database Modeling

Characteristics:

Tables with rows and columns
Primary and foreign key relationships
ACID properties
SQL-based querying

Design Process:

Normalization Process:
1NF → 2NF → 3NF → BCNF
├── Eliminate repeating groups
├── Remove partial dependencies
├── Remove transitive dependencies
└── Remove remaining anomalies

Example: E-commerce Relational Model

-- Customers Table
CREATE TABLE Customers (
    CustomerID INT PRIMARY KEY,
    FirstName VARCHAR(50) NOT NULL,
    LastName VARCHAR(50) NOT NULL,
    Email VARCHAR(100) UNIQUE NOT NULL,
    PhoneNumber VARCHAR(15),
    CreatedDate TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    INDEX idx_email (Email)
);

-- Products Table
CREATE TABLE Products (
    ProductID INT PRIMARY KEY,
    ProductName VARCHAR(200) NOT NULL,
    CategoryID INT,
    Price DECIMAL(10,2) NOT NULL,
    StockQuantity INT DEFAULT 0,
    Description TEXT,
    IsActive BOOLEAN DEFAULT TRUE,
    FOREIGN KEY (CategoryID) REFERENCES Categories(CategoryID),
    INDEX idx_category (CategoryID),
    INDEX idx_active (IsActive)
);

-- Orders Table
CREATE TABLE Orders (
    OrderID INT PRIMARY KEY,
    CustomerID INT NOT NULL,
    OrderDate TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    TotalAmount DECIMAL(12,2) NOT NULL,
    OrderStatus ENUM('pending', 'processing', 'shipped', 'delivered', 'cancelled'),
    FOREIGN KEY (CustomerID) REFERENCES Customers(CustomerID),
    INDEX idx_customer (CustomerID),
    INDEX idx_date (OrderDate),
    INDEX idx_status (OrderStatus)
);

-- Order Items Junction Table
CREATE TABLE OrderItems (
    OrderID INT,
    ProductID INT,
    Quantity INT NOT NULL,
    UnitPrice DECIMAL(10,2) NOT NULL,
    PRIMARY KEY (OrderID, ProductID),
    FOREIGN KEY (OrderID) REFERENCES Orders(OrderID),
    FOREIGN KEY (ProductID) REFERENCES Products(ProductID)
);

2. Denormalization for Performance

-- Denormalized Customer Order Summary
CREATE TABLE CustomerOrderSummary (
    CustomerID INT PRIMARY KEY,
    CustomerName VARCHAR(100),
    TotalOrders INT DEFAULT 0,
    TotalSpent DECIMAL(12,2) DEFAULT 0.00,
    LastOrderDate TIMESTAMP,
    AverageOrderValue DECIMAL(10,2),
    PreferredCategory VARCHAR(50),
    -- Denormalized for reporting performance
    FOREIGN KEY (CustomerID) REFERENCES Customers(CustomerID)
);

Physical Design Considerations

1. Indexing Strategy

-- Performance Indexes
CREATE INDEX idx_orders_customer_date ON Orders(CustomerID, OrderDate);
CREATE INDEX idx_products_category_price ON Products(CategoryID, Price);
CREATE INDEX idx_customers_email_active ON Customers(Email, IsActive);

-- Composite Index for Common Query Patterns
CREATE INDEX idx_order_items_product_order ON OrderItems(ProductID, OrderID);

2. Partitioning

-- Range Partitioning by Date
CREATE TABLE Orders (
    OrderID INT,
    CustomerID INT,
    OrderDate DATE,
    ...
) PARTITION BY RANGE (YEAR(OrderDate)) (
    PARTITION p2022 VALUES LESS THAN (2023),
    PARTITION p2023 VALUES LESS THAN (2024),
    PARTITION p2024 VALUES LESS THAN (2025),
    PARTITION pFuture VALUES LESS THAN MAXVALUE
);

Dimensional Modeling

Definition and Purpose

Dimensional modeling is designed specifically for analytical workloads and data warehousing, optimizing for query performance and business intelligence.

Core Concepts

1. Star Schema

Fact Table (Center):
├── Sales Fact
│   ├── SalesID (PK)
│   ├── ProductKey (FK)
│   ├── CustomerKey (FK)
│   ├── DateKey (FK)
│   ├── StoreKey (FK)
│   ├── Quantity
│   ├── Revenue
│   └── Profit

Dimension Tables (Points):
├── Product Dimension
├── Customer Dimension  
├── Date Dimension
└── Store Dimension

2. Snowflake Schema

Normalized Dimensions:
Product Dimension:
├── Product
│   ├── ProductKey
│   ├── ProductName
│   └── CategoryKey (FK)
└── Category
    ├── CategoryKey
    ├── CategoryName
    └── DepartmentKey (FK)
    └── Department
        ├── DepartmentKey
        └── DepartmentName

Dimensional Modeling Example: Retail Analytics

-- Date Dimension (Type 0 - Never Changes)
CREATE TABLE DimDate (
    DateKey INT PRIMARY KEY,
    FullDate DATE,
    DayOfWeek INT,
    DayName VARCHAR(10),
    Month INT,
    MonthName VARCHAR(10),
    Quarter INT,
    Year INT,
    IsWeekend BOOLEAN,
    IsHoliday BOOLEAN
);

-- Product Dimension (Type 2 - Track History)
CREATE TABLE DimProduct (
    ProductKey INT PRIMARY KEY,
    ProductID INT,  -- Business Key
    ProductName VARCHAR(200),
    Category VARCHAR(100),
    Brand VARCHAR(100),
    Price DECIMAL(10,2),
    EffectiveDate DATE,
    ExpiryDate DATE,
    IsCurrent BOOLEAN
);

-- Customer Dimension (Type 1 - Overwrite)
CREATE TABLE DimCustomer (
    CustomerKey INT PRIMARY KEY,
    CustomerID INT,  -- Business Key
    CustomerName VARCHAR(100),
    CustomerSegment VARCHAR(50),
    City VARCHAR(100),
    State VARCHAR(50),
    Country VARCHAR(50)
);

-- Sales Fact Table
CREATE TABLE FactSales (
    SalesKey INT PRIMARY KEY,
    ProductKey INT,
    CustomerKey INT,
    DateKey INT,
    Quantity INT,
    UnitPrice DECIMAL(10,2),
    Revenue DECIMAL(12,2),
    Cost DECIMAL(12,2),
    Profit DECIMAL(12,2),
    FOREIGN KEY (ProductKey) REFERENCES DimProduct(ProductKey),
    FOREIGN KEY (CustomerKey) REFERENCES DimCustomer(CustomerKey),
    FOREIGN KEY (DateKey) REFERENCES DimDate(DateKey)
);

Slowly Changing Dimensions (SCD)

Type 1: Overwrite (No History)

-- Update customer address - lose history
UPDATE DimCustomer 
SET Address = 'New Address', City = 'New City'
WHERE CustomerID = 12345;

Type 2: Add New Record (Keep History)

-- End current record
UPDATE DimProduct 
SET ExpiryDate = '2024-01-31', IsCurrent = FALSE
WHERE ProductID = 456 AND IsCurrent = TRUE;

-- Insert new record with changes
INSERT INTO DimProduct (ProductID, ProductName, Price, EffectiveDate, IsCurrent)
VALUES (456, 'Updated Product Name', 29.99, '2024-02-01', TRUE);

Type 3: Add New Column (Limited History)

ALTER TABLE DimCustomer 
ADD COLUMN PreviousSegment VARCHAR(50),
ADD COLUMN SegmentChangeDate DATE;

UPDATE DimCustomer 
SET PreviousSegment = CustomerSegment,
    CustomerSegment = 'Premium',
    SegmentChangeDate = CURRENT_DATE
WHERE CustomerID = 789;

Object-Oriented Data Modeling

Definition and Characteristics

Object-oriented data modeling aligns data structures with object-oriented programming paradigms, emphasizing encapsulation, inheritance, and polymorphism.

Core OO Concepts in Data Modeling

1. Inheritance Hierarchies

Vehicle (Superclass)
├── Car (Subclass)
│   ├── Sedan
│   └── SUV
├── Motorcycle (Subclass)
└── Truck (Subclass)
    ├── PickupTruck
    └── DeliveryTruck

2. Implementation Strategies

Table Per Hierarchy (TPH):

CREATE TABLE Vehicle (
    VehicleID INT PRIMARY KEY,
    VehicleType VARCHAR(20), -- Discriminator
    Brand VARCHAR(50),
    Model VARCHAR(50),
    Year INT,
    -- Car specific
    NumberOfDoors INT,
    -- Motorcycle specific  
    HasSidecar BOOLEAN,
    -- Truck specific
    LoadCapacity DECIMAL(10,2)
);

Table Per Type (TPT):

-- Base table
CREATE TABLE Vehicle (
    VehicleID INT PRIMARY KEY,
    Brand VARCHAR(50),
    Model VARCHAR(50),
    Year INT
);

-- Derived tables
CREATE TABLE Car (
    VehicleID INT PRIMARY KEY,
    NumberOfDoors INT,
    FOREIGN KEY (VehicleID) REFERENCES Vehicle(VehicleID)
);

CREATE TABLE Motorcycle (
    VehicleID INT PRIMARY KEY,
    HasSidecar BOOLEAN,
    FOREIGN KEY (VehicleID) REFERENCES Vehicle(VehicleID)
);

Table Per Concrete Class (TPC):

CREATE TABLE Car (
    CarID INT PRIMARY KEY,
    Brand VARCHAR(50),
    Model VARCHAR(50),
    Year INT,
    NumberOfDoors INT
);

CREATE TABLE Motorcycle (
    MotorcycleID INT PRIMARY KEY,
    Brand VARCHAR(50),
    Model VARCHAR(50),
    Year INT,
    HasSidecar BOOLEAN
);

OO Modeling Example: Content Management System

-- Base Content Entity
CREATE TABLE Content (
    ContentID INT PRIMARY KEY,
    Title VARCHAR(200),
    Author VARCHAR(100),
    CreatedDate TIMESTAMP,
    ContentType VARCHAR(50), -- Discriminator
    Status ENUM('draft', 'published', 'archived')
);

-- Article (inherits from Content)
CREATE TABLE Article (
    ContentID INT PRIMARY KEY,
    Body TEXT,
    WordCount INT,
    ReadingTime INT,
    FOREIGN KEY (ContentID) REFERENCES Content(ContentID)
);

-- Video (inherits from Content)  
CREATE TABLE Video (
    ContentID INT PRIMARY KEY,
    Duration INT, -- seconds
    Resolution VARCHAR(20),
    FileSize BIGINT,
    VideoURL VARCHAR(500),
    FOREIGN KEY (ContentID) REFERENCES Content(ContentID)
);

-- Podcast (inherits from Content)
CREATE TABLE Podcast (
    ContentID INT PRIMARY KEY,
    Duration INT, -- seconds
    EpisodeNumber INT,
    AudioURL VARCHAR(500),
    FOREIGN KEY (ContentID) REFERENCES Content(ContentID)
);

-- Content Tags (Many-to-Many)
CREATE TABLE ContentTags (
    ContentID INT,
    TagID INT,
    PRIMARY KEY (ContentID, TagID),
    FOREIGN KEY (ContentID) REFERENCES Content(ContentID),
    FOREIGN KEY (TagID) REFERENCES Tags(TagID)
);

NoSQL Data Modeling

Document-Based Modeling (MongoDB)

Document modeling focuses on flexible, schema-less data storage optimized for application data access patterns.

Design Principles:

Embed Related Data when accessed together
Reference Large or Frequently Updated Data
Optimize for Application Queries
Consider Document Size Limits

Example: E-commerce Product Catalog

// Embedded Document Model
{
  "_id": "product_123",
  "name": "Wireless Headphones",
  "brand": "TechBrand",
  "price": 99.99,
  "category": {
    "id": "electronics",
    "name": "Electronics",
    "parent": "consumer_goods"
  },
  "specifications": {
    "battery_life": "20 hours",
    "connectivity": ["Bluetooth 5.0", "USB-C"],
    "weight": "250g"
  },
  "reviews": [
    {
      "user_id": "user_456",
      "username": "techreviewer",
      "rating": 4,
      "comment": "Great sound quality!",
      "date": "2024-01-15"
    }
  ],
  "inventory": {
    "stock_count": 150,
    "warehouse_locations": ["WH01", "WH03"],
    "reorder_level": 20
  },
  "created_date": "2024-01-01T00:00:00Z",
  "last_updated": "2024-01-15T10:30:00Z"
}

Reference Pattern for Large Collections:

// Product Document (without embedded reviews)
{
  "_id": "product_123",
  "name": "Wireless Headphones",
  "brand": "TechBrand",
  "price": 99.99,
  "review_count": 1250,
  "average_rating": 4.2
}

// Separate Reviews Collection
{
  "_id": "review_789",
  "product_id": "product_123",
  "user_id": "user_456",
  "rating": 4,
  "comment": "Great sound quality!",
  "helpful_votes": 15,
  "date": "2024-01-15"
}

Key-Value Modeling (Redis)

Key-value modeling focuses on simple, fast data retrieval using unique keys.

Design Patterns:

1. Simple Key-Value Storage:

# User session storage
SET user:session:abc123 "{"user_id": 456, "login_time": "2024-01-15T10:00:00Z"}"
EXPIRE user:session:abc123 3600

# Product cache
SET product:123 "{"name": "Headphones", "price": 99.99}"

2. Hash Storage:

# User profile
HSET user:456 name "John Doe"
HSET user:456 email "john@example.com"  
HSET user:456 last_login "2024-01-15"

# Product details
HSET product:123 name "Headphones"
HSET product:123 price "99.99"
HSET product:123 stock "150"

3. List Storage:

# User activity feed
LPUSH user:456:activities "liked_product:123"
LPUSH user:456:activities "viewed_product:789"
LTRIM user:456:activities 0 99  # Keep last 100 activities

4. Set Storage:

# User's favorite products
SADD user:456:favorites product:123
SADD user:456:favorites product:456
SADD user:456:favorites product:789

# Product tags
SADD product:123:tags "electronics"
SADD product:123:tags "audio"
SADD product:123:tags "wireless"

Column-Family Modeling (Cassandra)

Column-family modeling optimizes for high-volume, distributed data storage with focus on query patterns.

Design Example: Time-Series Sensor Data

-- Sensor readings table
CREATE TABLE sensor_readings (
    sensor_id UUID,
    reading_date DATE,
    timestamp TIMESTAMP,
    temperature DECIMAL,
    humidity DECIMAL,
    pressure DECIMAL,
    PRIMARY KEY ((sensor_id, reading_date), timestamp)
) WITH CLUSTERING ORDER BY (timestamp DESC);

-- User activity tracking  
CREATE TABLE user_activity (
    user_id UUID,
    activity_date DATE,
    timestamp TIMESTAMP,
    activity_type TEXT,
    activity_data MAP<TEXT, TEXT>,
    PRIMARY KEY ((user_id, activity_date), timestamp)
) WITH CLUSTERING ORDER BY (timestamp DESC);

-- Product catalog with denormalization
CREATE TABLE products_by_category (
    category_id UUID,
    product_id UUID,
    product_name TEXT,
    price DECIMAL,
    brand TEXT,
    rating DECIMAL,
    created_date TIMESTAMP,
    PRIMARY KEY (category_id, price, product_id)
) WITH CLUSTERING ORDER BY (price ASC);

Graph Data Modeling

Definition and Use Cases

Graph modeling represents data as nodes (entities) and edges (relationships), ideal for highly connected data and complex relationship queries.

Graph Model Components

1. Nodes (Vertices)

Node Types:
├── Person
├── Company  
├── Product
├── Location
└── Event

2. Relationships (Edges)

Relationship Types:
├── WORKS_FOR (Person → Company)
├── FRIENDS_WITH (Person → Person)
├── LOCATED_IN (Company → Location)
├── PURCHASED (Person → Product)
└── SIMILAR_TO (Product → Product)

3. Properties

Node Properties:
Person: {name, age, email, created_date}
Product: {title, price, category, rating}

Edge Properties:
WORKS_FOR: {start_date, position, salary}
PURCHASED: {date, quantity, total_amount}

// Create Person nodes
CREATE (alice:Person {
  name: 'Alice Johnson',
  age: 30,
  email: 'alice@email.com',
  location: 'New York'
})

CREATE (bob:Person {
  name: 'Bob Smith', 
  age: 25,
  email: 'bob@email.com',
  location: 'San Francisco'
})

// Create Company node
CREATE (tech_corp:Company {
  name: 'TechCorp',
  industry: 'Technology',
  founded: 2010,
  location: 'San Francisco'
})

// Create relationships
CREATE (alice)-[:WORKS_FOR {
  position: 'Software Engineer',
  start_date: '2022-01-15',
  salary: 120000
}]->(tech_corp)

CREATE (alice)-[:FRIENDS_WITH {
  since: '2020-05-10',
  connection_strength: 0.8
}]->(bob)

// Create skill nodes and relationships
CREATE (python:Skill {name: 'Python', category: 'Programming'})
CREATE (alice)-[:HAS_SKILL {proficiency: 'Expert', years: 5}]->(python)

// Query examples
// Find Alice's colleagues
MATCH (alice:Person {name: 'Alice Johnson'})-[:WORKS_FOR]->(company)
      <-[:WORKS_FOR]-(colleague:Person)
RETURN colleague.name, colleague.email

// Find friends of friends
MATCH (alice:Person {name: 'Alice Johnson'})-[:FRIENDS_WITH]->(friend)
      -[:FRIENDS_WITH]->(friend_of_friend:Person)
WHERE friend_of_friend <> alice
RETURN friend_of_friend.name, friend.name as mutual_friend

Advanced Graph Patterns

1. Recommendation Engine Model

// Product recommendation based on purchase patterns
MATCH (user:Person {name: 'Alice'})-[:PURCHASED]->(product:Product)
      <-[:PURCHASED]-(other_user:Person)-[:PURCHASED]->(recommended:Product)
WHERE NOT (user)-[:PURCHASED]->(recommended)
RETURN recommended.name, 
       COUNT(other_user) as recommendation_score
ORDER BY recommendation_score DESC
LIMIT 10

2. Fraud Detection Model

// Find suspicious patterns - multiple people using same device/location
MATCH (person1:Person)-[:USES_DEVICE]->(device:Device)
      <-[:USES_DEVICE]-(person2:Person)
WHERE person1 <> person2
WITH person1, person2, device, 
     COUNT(*) as shared_sessions
WHERE shared_sessions > 5
RETURN person1.name, person2.name, 
       device.id, shared_sessions
ORDER BY shared_sessions DESC

Time-Series Data Modeling

Definition and Characteristics

Time-series modeling focuses on data points indexed by time, optimized for temporal queries, aggregations, and trend analysis.

Design Principles

Time as Primary Index: Optimize for time-based queries
Immutable Data: Most time-series data doesn't change
High Volume: Handle millions of data points
Aggregation Friendly: Support downsampling and rollups
Retention Policies: Automatic data lifecycle management

Time-Series Schema Patterns

1. Single Metric Per Row (InfluxDB Style)

-- IoT Sensor Measurements
CREATE TABLE sensor_measurements (
    time TIMESTAMP,
    sensor_id STRING,
    location STRING,
    measurement_type STRING, -- temperature, humidity, pressure
    value FLOAT,
    unit STRING,
    tags MAP<STRING, STRING>, -- Additional metadata
    PRIMARY KEY (time, sensor_id, measurement_type)
);

-- Sample Data
INSERT INTO sensor_measurements VALUES
('2024-01-15 10:00:00', 'sensor_001', 'warehouse_a', 'temperature', 23.5, 'celsius', {'floor': '1', 'zone': 'A'}),
('2024-01-15 10:00:00', 'sensor_001', 'warehouse_a', 'humidity', 65.2, 'percent', {'floor': '1', 'zone': 'A'}),
('2024-01-15 10:01:00', 'sensor_001', 'warehouse_a', 'temperature', 23.7, 'celsius', {'floor': '1', 'zone': 'A'});

2. Multiple Metrics Per Row (Wide Table)

-- Application Performance Metrics
CREATE TABLE app_metrics (
    timestamp TIMESTAMP,
    server_id STRING,
    cpu_usage_percent FLOAT,
    memory_usage_mb FLOAT,
    disk_io_read_mb FLOAT,
    disk_io_write_mb FLOAT,
    network_in_mb FLOAT,
    network_out_mb FLOAT,
    active_connections INT,
    response_time_ms FLOAT,
    error_rate_percent FLOAT,
    PRIMARY KEY (timestamp, server_id)
);

3. Event-Based Time Series

-- User Activity Events
CREATE TABLE user_events (
    event_time TIMESTAMP,
    user_id UUID,
    session_id STRING,
    event_type STRING, -- page_view, click, purchase, logout
    event_data JSON, -- Flexible event-specific data
    user_agent STRING,
    ip_address STRING,
    referrer STRING,
    PRIMARY KEY (event_time, user_id, session_id)
) PARTITION BY RANGE (event_time);

Time-Series Optimization Strategies

1. Partitioning by Time

-- Monthly partitions for large datasets
CREATE TABLE metrics_2024_01 PARTITION OF metrics
FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');

CREATE TABLE metrics_2024_02 PARTITION OF metrics  
FOR VALUES FROM ('2024-02-01') TO ('2024-03-01');

2. Downsampling and Rollups

-- Create aggregated tables for different time granularities
CREATE TABLE hourly_metrics AS
SELECT 
    DATE_TRUNC('hour', timestamp) as hour,
    server_id,
    AVG(cpu_usage_percent) as avg_cpu,
    MAX(cpu_usage_percent) as max_cpu,
    AVG(memory_usage_mb) as avg_memory,
    COUNT(*) as data_points
FROM app_metrics
GROUP BY hour, server_id;

CREATE TABLE daily_metrics AS
SELECT 
    DATE_TRUNC('day', hour) as day,
    server_id,
    AVG(avg_cpu) as avg_cpu,
    MAX(max_cpu) as peak_cpu,
    AVG(avg_memory) as avg_memory
FROM hourly_metrics
GROUP BY day, server_id;

3. Retention Policies

-- Automatic data lifecycle management
DELETE FROM app_metrics 
WHERE timestamp < NOW() - INTERVAL '90 days';

DELETE FROM hourly_metrics 
WHERE hour < NOW() - INTERVAL '1 year';

-- Keep daily aggregates for 5 years
DELETE FROM daily_metrics 
WHERE day < NOW() - INTERVAL '5 years';

Event-Driven Data Modeling

Definition and Purpose

Event-driven modeling captures state changes as immutable events, enabling event sourcing, audit trails, and eventual consistency patterns.

Core Concepts

1. Event Structure

{
  "event_id": "evt_123456789",
  "event_type": "OrderPlaced",
  "aggregate_id": "order_987654",
  "aggregate_type": "Order",
  "event_version": 1,
  "timestamp": "2024-01-15T10:30:00Z",
  "user_id": "user_456",
  "correlation_id": "session_abc123",
  "causation_id": "command_xyz789",
  "data": {
    "customer_id": "cust_123",
    "items": [
      {
        "product_id": "prod_456",
        "quantity": 2,
        "unit_price": 29.99
      }
    ],
    "total_amount": 59.98,
    "shipping_address": {
      "street": "123 Main St",
      "city": "New York",
      "state": "NY",
      "zip": "10001"
    }
  },
  "metadata": {
    "source": "web_application",
    "user_agent": "Mozilla/5.0...",
    "ip_address": "192.168.1.1"
  }
}

2. Event Store Schema

-- Event Store Table
CREATE TABLE event_store (
    event_id UUID PRIMARY KEY,
    aggregate_id UUID NOT NULL,
    aggregate_type VARCHAR(100) NOT NULL,
    event_type VARCHAR(100) NOT NULL,
    event_version INTEGER NOT NULL,
    event_data JSONB NOT NULL,
    metadata JSONB,
    timestamp TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
    correlation_id UUID,
    causation_id UUID,
    
    -- Ensure event ordering within aggregate
    UNIQUE(aggregate_id, event_version),
    
    -- Indexes for common query patterns
    INDEX idx_aggregate (aggregate_id, event_version),
    INDEX idx_event_type (event_type, timestamp),
    INDEX idx_correlation (correlation_id),
    INDEX idx_timestamp (timestamp)
);

Event-Driven Modeling Examples

1. E-commerce Order Lifecycle

-- Event Types for Order Aggregate
INSERT INTO event_store (event_id, aggregate_id, aggregate_type, event_type, event_version, event_data) VALUES

-- Order Placed
('evt_001', 'order_123', 'Order', 'OrderPlaced', 1, '{
  "customer_id": "cust_456", 
  "items": [{"product_id": "prod_789", "quantity": 2}],
  "total": 99.98
}'),

-- Payment Processed  
('evt_002', 'order_123', 'Order', 'PaymentProcessed', 2, '{
  "payment_method": "credit_card",
  "amount": 99.98,
  "transaction_id": "txn_abc123"
}'),

-- Order Shipped
('evt_003', 'order_123', 'Order', 'OrderShipped', 3, '{
  "tracking_number": "TRK123456789",
  "carrier": "UPS",
  "estimated_delivery": "2024-01-20"
}'),

-- Order Delivered
('evt_004', 'order_123', 'Order', 'OrderDelivered', 4, '{
  "delivered_at": "2024-01-19T14:30:00Z",
  "signature": "J.Smith"
}');

2. Projection/Read Model Generation

-- Order Summary Projection
CREATE TABLE order_projections (
    order_id UUID PRIMARY KEY,
    customer_id UUID,
    order_status VARCHAR(50),
    total_amount DECIMAL(10,2),
    created_at TIMESTAMP,
    shipped_at TIMESTAMP,
    delivered_at TIMESTAMP,
    last_updated TIMESTAMP
);

-- Projection Update Function
CREATE OR REPLACE FUNCTION update_order_projection()
RETURNS TRIGGER AS $$
BEGIN
    CASE NEW.event_type
        WHEN 'OrderPlaced' THEN
            INSERT INTO order_projections (
                order_id, customer_id, order_status, 
                total_amount, created_at, last_updated
            ) VALUES (
                NEW.aggregate_id,
                (NEW.event_data->>'customer_id')::UUID,
                'placed',
                (NEW.event_data->>'total')::DECIMAL,
                NEW.timestamp,
                NEW.timestamp
            );
            
        WHEN 'PaymentProcessed' THEN
            UPDATE order_projections 
            SET order_status = 'paid', last_updated = NEW.timestamp
            WHERE order_id = NEW.aggregate_id;
            
        WHEN 'OrderShipped' THEN
            UPDATE order_projections 
            SET order_status = 'shipped', 
                shipped_at = NEW.timestamp,
                last_updated = NEW.timestamp
            WHERE order_id = NEW.aggregate_id;
            
        WHEN 'OrderDelivered' THEN
            UPDATE order_projections 
            SET order_status = 'delivered',
                delivered_at = (NEW.event_data->>'delivered_at')::TIMESTAMP,
                last_updated = NEW.timestamp
            WHERE order_id = NEW.aggregate_id;
    END CASE;
    
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

-- Trigger to maintain projections
CREATE TRIGGER order_projection_trigger
    AFTER INSERT ON event_store
    FOR EACH ROW
    EXECUTE FUNCTION update_order_projection();

Event Sourcing Benefits and Challenges

Benefits:

Complete Audit Trail: Every state change is recorded
Time Travel: Recreate state at any point in time
Debugging: Understand exactly what happened and when
Analytics: Rich data for business intelligence
Flexibility: Easy to create new projections

Challenges:

Complexity: More complex than CRUD operations
Storage: Events accumulate over time
Eventual Consistency: Projections may be temporarily out of sync
Schema Evolution: Handling event structure changes

Data Modeling Methodologies

1. Top-Down Approach

Process Flow:

Business Requirements → Conceptual Model → Logical Model → Physical Model

Characteristics:
├── Starts with business needs
├── Focuses on entities and relationships
├── Gradually adds technical details
└── Good for new systems

Example: Library Management System

Step 1: Business Requirements
- Track books and members
- Manage loans and returns
- Handle reservations
- Generate reports

Step 2: Conceptual Model
- Entities: Book, Member, Loan, Reservation
- Relationships: Member borrows Book, Member reserves Book

Step 3: Logical Model  
- Add attributes, keys, constraints
- Define cardinalities and participation
- Normalize relationships

Step 4: Physical Model
- Choose database technology
- Define indexes and partitions
- Optimize for performance

2. Bottom-Up Approach

Process Flow:

Existing Data → Normalize → Group Related Data → Create Logical Model

Characteristics:
├── Starts with existing data sources
├── Focuses on data analysis and normalization
├── Good for legacy system modernization
└── Data-driven rather than requirements-driven

3. Agile Data Modeling

Iterative Process:

Initial Model → Implement → Get Feedback → Refine Model → Repeat

Characteristics:
├── Evolutionary design
├── Just enough modeling
├── Continuous refinement
├── Close collaboration with development team
└── Embrace change

Agile Modeling Principles:

Model with a Purpose: Only model what's necessary
Multiple Models: Use different models for different purposes
Travel Light: Keep models simple and maintainable
Content Over Representation: Focus on meaning, not notation
Know Your Tools: Use appropriate modeling tools
Adapt to Specific Needs: No one-size-fits-all approach

4. Domain-Driven Design (DDD)

Core Concepts:

Bounded Context → Aggregates → Entities → Value Objects

Domain Model Components:
├── Entities (Identity-based objects)
├── Value Objects (Attribute-based objects)
├── Aggregates (Consistency boundaries)
├── Domain Services (Domain logic not fitting entities)
└── Repositories (Data access abstraction)

DDD Modeling Example:

E-commerce Bounded Contexts:

1. Order Management Context:
   ├── Order Aggregate
   │   ├── Order (Entity - Root)
   │   ├── OrderItem (Entity)
   │   └── Address (Value Object)
   └── Customer (Entity)

2. Inventory Context:
   ├── Product Aggregate
   │   ├── Product (Entity - Root)
   │   ├── ProductVariant (Entity)
   │   └── Price (Value Object)
   └── Category (Entity)

3. Payment Context:
   ├── Payment Aggregate
   │   ├── Payment (Entity - Root)
   │   ├── PaymentMethod (Value Object)
   │   └── Transaction (Entity)

Best Practices and Design Patterns

1. Universal Design Principles

Data Quality Principles

ACCURATE: Data should be correct and error-free
COMPLETE: All required data should be present
CONSISTENT: Data should be consistent across systems
TIMELY: Data should be available when needed
VALID: Data should conform to defined formats
UNIQUE: No unnecessary duplication

Naming Conventions

-- Table Naming
customers              -- Plural nouns
order_items           -- Underscore for compound names
customer_addresses    -- Clear relationship indication

-- Column Naming  
customer_id           -- Descriptive and consistent
first_name           -- Readable format
created_timestamp    -- Include data type hint
is_active           -- Boolean prefix
total_amount        -- Clear business meaning

-- Index Naming
idx_customers_email                    -- idx_table_columns
idx_orders_customer_id_created_date   -- Composite indexes
pk_customers                          -- Primary key
fk_orders_customer_id                 -- Foreign key

2. Performance Design Patterns

1. Read Replica Pattern

-- Master Database (Write Operations)
CREATE TABLE orders_master (
    order_id UUID PRIMARY KEY,
    customer_id UUID,
    order_date TIMESTAMP,
    total_amount DECIMAL(10,2),
    status VARCHAR(20)
);

-- Read Replica Configuration
-- Automatically synchronized from master
-- Used for reporting and analytics queries
-- Reduces load on master database

2. Materialized View Pattern

-- Complex aggregation query
CREATE MATERIALIZED VIEW customer_order_summary AS
SELECT 
    c.customer_id,
    c.first_name,
    c.last_name,
    COUNT(o.order_id) as total_orders,
    SUM(o.total_amount) as total_spent,
    AVG(o.total_amount) as avg_order_value,
    MAX(o.order_date) as last_order_date
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
GROUP BY c.customer_id, c.first_name, c.last_name;

-- Refresh strategy
REFRESH MATERIALIZED VIEW customer_order_summary;

3. Sharding Pattern

-- Horizontal partitioning by customer region
CREATE TABLE orders_us (
    LIKE orders_template INCLUDING ALL,
    CONSTRAINT orders_us_region_check CHECK (region = 'US')
);

CREATE TABLE orders_eu (
    LIKE orders_template INCLUDING ALL,
    CONSTRAINT orders_eu_region_check CHECK (region = 'EU')
);

-- Application-level routing
-- Route US customers to orders_us
-- Route EU customers to orders_eu

3. Data Integrity Patterns

1. Audit Trail Pattern

-- Audit table for tracking changes
CREATE TABLE customer_audit (
    audit_id SERIAL PRIMARY KEY,
    customer_id UUID NOT NULL,
    operation_type VARCHAR(10) NOT NULL, -- INSERT, UPDATE, DELETE
    old_values JSONB,
    new_values JSONB,
    changed_by UUID,
    changed_at TIMESTAMP DEFAULT NOW(),
    change_reason TEXT
);

-- Trigger function for automatic auditing
CREATE OR REPLACE FUNCTION audit_customer_changes()
RETURNS TRIGGER AS $$
BEGIN
    IF TG_OP = 'DELETE' THEN
        INSERT INTO customer_audit (customer_id, operation_type, old_values, changed_by)
        VALUES (OLD.customer_id, 'DELETE', row_to_json(OLD), USER);
        RETURN OLD;
    ELSIF TG_OP = 'UPDATE' THEN
        INSERT INTO customer_audit (customer_id, operation_type, old_values, new_values, changed_by)
        VALUES (OLD.customer_id, 'UPDATE', row_to_json(OLD), row_to_json(NEW), USER);
        RETURN NEW;
    ELSIF TG_OP = 'INSERT' THEN
        INSERT INTO customer_audit (customer_id, operation_type, new_values, changed_by)
        VALUES (NEW.customer_id, 'INSERT', row_to_json(NEW), USER);
        RETURN NEW;
    END IF;
    RETURN NULL;
END;
$$ LANGUAGE plpgsql;

2. Soft Delete Pattern

-- Add deletion tracking columns
ALTER TABLE customers 
ADD COLUMN is_deleted BOOLEAN DEFAULT FALSE,
ADD COLUMN deleted_at TIMESTAMP,
ADD COLUMN deleted_by UUID;

-- Soft delete function
UPDATE customers 
SET is_deleted = TRUE,
    deleted_at = NOW(),
    deleted_by = :user_id
WHERE customer_id = :customer_id;

-- Views for active records only
CREATE VIEW active_customers AS
SELECT * FROM customers 
WHERE is_deleted = FALSE OR is_deleted IS NULL;

3. Optimistic Locking Pattern

-- Add version column for concurrency control
ALTER TABLE products
ADD COLUMN version INTEGER DEFAULT 1;

-- Update with version check
UPDATE products 
SET name = :new_name,
    price = :new_price,
    version = version + 1
WHERE product_id = :product_id 
  AND version = :expected_version;

-- Application should check affected rows
-- If 0 rows affected, another user modified the record

Tools and Technologies

Data Modeling Tools

1. Enterprise Tools

ERwin Data Modeler:
├── Comprehensive enterprise modeling
├── Forward/reverse engineering
├── Supports multiple databases
└── Team collaboration features

PowerDesigner:
├── Unified modeling (data, process, enterprise)
├── Model validation and impact analysis
├── Extensive database support
└── Integration with development tools

ER/Studio:
├── Data architecture and modeling
├── Data lineage and impact analysis
├── Collaborative modeling
└── Agile modeling support

2. Open Source Tools

MySQL Workbench:
├── Visual database design
├── SQL development
├── Database administration
└── Free with MySQL

pgAdmin/pgModeler:
├── PostgreSQL-focused
├── Visual schema design
├── SQL script generation
└── Open source

DbSchema:
├── Universal database designer
├── Visual query builder
├── Documentation generation
└── Multiple database support

3. Cloud-Based Tools

Lucidchart:
├── Web-based diagramming
├── Real-time collaboration
├── Template library
└── Integration with other tools

Draw.io (now Diagrams.net):
├── Free web-based tool
├── Extensive shape libraries
├── Export to multiple formats
└── No registration required

Vertabelo:
├── Online database modeler
├── Team collaboration
├── Version control
└── SQL script generation

Database Technologies by Use Case

OLTP (Online Transaction Processing)

PostgreSQL:
├── Advanced SQL features
├── JSON/JSONB support
├── Strong consistency
└── Extensible architecture

MySQL:
├── High performance
├── Wide adoption
├── Good tooling ecosystem
└── Cloud provider support

SQL Server:
├── Enterprise features
├── Integration with Microsoft stack
├── Advanced analytics
└── Hybrid cloud support

OLAP (Online Analytical Processing)

Snowflake:
├── Cloud-native architecture
├── Separate compute and storage
├── Zero-copy cloning
└── Time travel features

Amazon Redshift:
├── Columnar storage
├── Massively parallel processing
├── AWS ecosystem integration
└── Cost-effective scaling

Google BigQuery:
├── Serverless architecture
├── Machine learning integration
├── Real-time analytics
└── Automatic scaling

NoSQL Databases

Document Stores:
├── MongoDB (General purpose)
├── Amazon DocumentDB (AWS managed)
├── Azure Cosmos DB (Multi-model)
└── CouchDB (Offline-first)

Key-Value Stores:
├── Redis (In-memory, caching)
├── Amazon DynamoDB (Managed)
├── Apache Cassandra (Distributed)
└── Riak (Distributed, available)

Graph Databases:
├── Neo4j (Property graph)
├── Amazon Neptune (Multi-model)
├── ArangoDB (Multi-model)
└── OrientDB (Multi-model)

Modern Data Stack Components

Data Integration

ETL/ELT Tools:
├── Apache Airflow (Workflow orchestration)
├── dbt (Data transformation)
├── Fivetran (Managed ETL)
├── Stitch (Simple data integration)
└── Talend (Enterprise data integration)

Change Data Capture:
├── Debezium (Open source CDC)
├── AWS DMS (Database migration service)
├── Confluent Platform (Kafka-based)
└── Oracle GoldenGate (Enterprise CDC)

Data Quality and Governance

Data Catalogs:
├── Apache Atlas (Open source)
├── AWS Glue Data Catalog
├── Google Cloud Data Catalog
├── Collibra (Enterprise)
└── Alation (Enterprise)

Data Quality Tools:
├── Great Expectations (Python-based)
├── Deequ (Scala/Spark-based)
├── Apache Griffin (Big data quality)
├── Talend Data Quality
└── Informatica Data Quality

Suffix Trees: Comprehensive Guide

October 9, 2025 · 9 min read

Introduction
What is a Suffix Tree?
Construction Algorithms
Properties and Characteristics
Common Applications
Implementation Examples
Problem-Solving Patterns
Comparison with Other Data Structures
Problem Reference Table

Introduction

A Suffix Tree is a compressed trie containing all suffixes of a given string. It's one of the most powerful data structures in string processing, enabling efficient solutions to numerous string problems. Originally developed by Weiner (1973), McCreight (1976), and later optimized by Ukkonen (1995).

What is a Suffix Tree?

Definition

For a string S of length n, a suffix tree is a rooted tree with exactly n leaves numbered 1 to n. Each path from root to leaf represents one suffix of S.

Key Properties

Compressed: No node has exactly one child (except root in some cases)
Complete: Contains all suffixes as root-to-leaf paths
Unique: Each suffix corresponds to exactly one leaf
Space Efficient: O(n) nodes and edges despite representing O(n²) characters

Example: Suffix Tree for "banana$"

Suffixes:
0: banana$
1: anana$
2: nana$
3: ana$
4: na$
5: a$
6: $

Suffix Tree Structure:
        root
       /  |  \
      $   a   b
     (6) / \   \
        na$ ana$  anana$
       (4)  |     (0)
           na$
           (2,3,5)

Construction Algorithms

1. Naive Construction - O(n²)

Build suffix tree by inserting each suffix one by one into a trie, then compress.

def naive_construction(s):
    # Insert all suffixes into trie
    # Compress nodes with single children
    # Time: O(n²), Space: O(n²) worst case

2. McCreight's Algorithm - O(n)

Builds suffix tree in linear time using suffix links.

Key Concepts:

Suffix Links: Fast navigation between related nodes
Incremental Construction: Build tree suffix by suffix efficiently
Active Point: Current position in tree construction

3. Ukkonen's Algorithm - O(n) ⭐

Most popular linear-time construction algorithm.

Core Ideas:

Online Algorithm: Can process string character by character
Implicit Suffix Tree: Handles non-explicit suffixes efficiently
Global End: All leaf edges extend automatically
Active Point Trick: Maintains construction state efficiently

Ukkonen's Algorithm Steps:

Extension Phase: For each character, extend all suffixes
Rule 1: Path ends at leaf → do nothing (implicit extension)
Rule 2: Path continues with different character → add new branch
Rule 3: Path continues with same character → do nothing
Suffix Links: Jump to related suffix for efficiency

class UkkonenNode:
    def __init__(self):
        self.children = {}
        self.suffix_link = None
        self.start = None
        self.end = None
        
def ukkonen_construction(s):
    # Phase i: extend all suffixes ending at position i
    # Use global end, active point, and suffix links
    # Time: O(n), Space: O(n)

Properties and Characteristics

Space Complexity

Nodes: At most 2n - 1 nodes for string of length n
Edges: At most 2n - 1 edges
Total Space: O(n) with proper implementation

Time Complexities

Operation	Naive	McCreight	Ukkonen
Construction	O(n²)	O(n)	O(n)
Pattern Search	O(m + occ)	O(m + occ)	O(m + occ)
Suffix Query	O(log n)	O(1)	O(1)

Edge Compression

Instead of storing individual characters on edges, store:

Start Index: Beginning position in original string
End Index: Ending position (can be global end)

Common Applications

1. Pattern Matching

Problem: Find all occurrences of pattern P in text T

Solution:

Build suffix tree for T
Follow path for pattern P
All leaves in subtree are occurrences
Time: O(|T|) preprocessing, O(|P| + occ) query

2. Longest Repeated Substring

Problem: Find longest substring that appears ≥ 2 times Solution:

Build suffix tree
Find deepest internal node
Path from root to this node is the answer
Time: O(n)

3. Longest Common Substring

Problem: Find LCS of multiple strings Solution:

Build generalized suffix tree for all strings
Find deepest node with leaves from all strings
Time: O(sum of string lengths)

4. Suffix Array Construction

Problem: Build suffix array efficiently Solution:

Build suffix tree in O(n)
DFS to get lexicographically sorted suffixes
Time: O(n) vs O(n log n) direct sorting

5. String Compression

Problem: Find repeated patterns for compression Solution:

Build suffix tree
Internal nodes represent repeated substrings
Choose optimal set for maximum compression

Implementation Examples

Basic Suffix Tree Node

class SuffixTreeNode:
    def __init__(self):
        self.children = {}
        self.suffix_index = -1  # -1 for internal nodes
        self.start = None
        self.end = None
        self.suffix_link = None
    
    def edge_length(self):
        return self.end - self.start + 1 if self.end else 0

Pattern Search Implementation

def pattern_search(root, text, pattern):
    """Find all occurrences of pattern in text using suffix tree"""
    current = root
    i = 0
    
    # Navigate to end of pattern
    while i < len(pattern):
        if pattern[i] in current.children:
            child = current.children[pattern[i]]
            edge_start = child.start
            edge_end = min(child.end, edge_start + len(pattern) - i - 1)
            
            # Check if pattern matches edge
            for j in range(edge_start, edge_end + 1):
                if text[j] != pattern[i]:
                    return []  # No match
                i += 1
            
            current = child
        else:
            return []  # No match
    
    # Collect all suffix indices in subtree
    return collect_suffix_indices(current)

Longest Repeated Substring

def longest_repeated_substring(root, text):
    """Find longest repeated substring using suffix tree"""
    max_depth = 0
    result_node = None
    
    def dfs(node, depth):
        nonlocal max_depth, result_node
        
        if node.suffix_index == -1:  # Internal node
            if depth > max_depth:
                max_depth = depth
                result_node = node
        
        for child in node.children.values():
            dfs(child, depth + child.edge_length())
    
    dfs(root, 0)
    
    if result_node:
        # Reconstruct string from root to result_node
        return reconstruct_path(root, result_node, text)
    return ""

Problem-Solving Patterns

Pattern 1: Depth-Based Problems

Longest Repeated Substring: Deepest internal node
Longest Common Extension: Path depth between nodes
K-th Order Statistics: Depth-based tree traversal

Pattern 2: Leaf-Based Problems

Pattern Matching: Collect leaves in subtree
Suffix Array: DFS order of leaves
Document Retrieval: Group leaves by document

Pattern 3: Generalized Suffix Tree

Multiple String Problems: Build GST with separators
Longest Common Substring: Find nodes with leaves from all strings
String Distance: Compare paths in GST

Pattern 4: Path Compression Applications

Space Optimization: Edge compression reduces memory
Cache Efficiency: Fewer nodes improve performance
Practical Implementation: Handle large alphabets efficiently

Comparison with Other Data Structures

Feature	Suffix Tree	Suffix Array	Trie	KMP
Construction	O(n)	O(n log n)	O(Σm)	O(m)
Space	O(n)	O(n)	O(Σm)	O(m)
Pattern Search	O(m + occ)	O(m log n + occ)	O(m)	O(n + m)
LCP Queries	O(1)	O(log n)	-	-
Memory Access	Random	Sequential	Random	Sequential
Cache Performance	Poor	Good	Poor	Good
Implementation	Complex	Simple	Simple	Simple

When to Use Each:

Suffix Tree: Complex string queries, multiple pattern types
Suffix Array: Simple queries, memory constraints, competitive programming
Trie: Dictionary operations, prefix queries
KMP: Single pattern matching, streaming data

Problem Reference Table

Problem Category	Problem Name	Additional Algorithms/Components	Time Complexity	Space
Pattern Matching	Multiple Pattern Search	Aho-Corasick integration	O(n + Σm + occ)	O(n + Σm)
	Approximate Pattern Matching	Edit distance DP	O(nm²)	O(nm)
	Pattern with wildcards	Suffix tree + backtracking	O(nm)	O(n)
Substring Problems	Longest Repeated Substring	DFS traversal	O(n)	O(n)
	Longest Common Substring (2 strings)	Generalized suffix tree	O(n + m)	O(n + m)
	Longest Common Substring (k strings)	Generalized ST + coloring	O(Σn_i)	O(Σn_i)
	K most frequent substrings	Suffix tree + heap	O(n log k)	O(n)
	All palindromic substrings	Manacher's + suffix tree	O(n)	O(n)
String Queries	Range LCP queries	Suffix tree + LCA (Sparse Table)	O(1)	O(n log n)
	Substring frequency	Suffix tree + subtree size	O(m + occ)	O(n)
	Lexicographic substring queries	Suffix tree + DFS ordering	O(m + k)	O(n)
	Distinct substrings count	Suffix tree traversal	O(n)	O(n)
Advanced Applications	Text compression (LZ77)	Suffix tree + greedy	O(n)	O(n)
	Burrows-Wheeler Transform	Suffix tree + DFS	O(n)	O(n)
	Document retrieval	Inverted index + suffix tree	O(m + occ)	O(n)
	Plagiarism detection	GST + similarity metrics	O(n + m)	O(n + m)
Computational Biology	DNA sequence alignment	Suffix tree + DP	O(nm)	O(n + m)
	Tandem repeat detection	Suffix tree + period analysis	O(n log n)	O(n)
	Phylogenetic tree construction	GST + clustering	O(nk)	O(nk)
	Motif discovery	Suffix tree + statistical analysis	O(n²)	O(n)
Graph Theory Integration	Shortest superstring	Suffix tree + TSP	O(n!2ⁿ)	O(n2ⁿ)
	String graph construction	Suffix tree + overlap detection	O(n²)	O(n²)
Data Structures	Suffix array construction	DFS traversal	O(n)	O(n)
	LCP array construction	Suffix tree + parent-child	O(n)	O(n)
	Compressed suffix tree	Heavy-light decomposition	O(log n)	O(n)
Dynamic Problems	Online suffix tree	Ukkonen's algorithm	O(1) amortized	O(n)
	Sliding window queries	Suffix tree + deque	O(n)	O(k)
	Incremental pattern matching	Dynamic suffix tree	O(m) per update	O(n)

Legend:

n, m: String lengths
k: Number of strings/patterns
occ: Number of occurrences
Σ: Alphabet size
GST: Generalized Suffix Tree
LCA: Lowest Common Ancestor
DP: Dynamic Programming

Complexity Notes:

Construction time assumes linear-time algorithms (McCreight/Ukkonen)
Query times assume suffix tree is already built
Space complexity is typically O(n) for suffix tree + additional structures
Some problems require combining suffix trees with other advanced algorithms

Using Azure Services for Developing Solutions

June 22, 2024 · 2 min read

Repository for knowing various Azure services for developers. Core concept of azure are compute,storage,databases,integration,networking. this is a developer only guide so we are not touching other azure services but only requried by developers.

What Azure Offers

Services for Running applications also known as compute
1. IaaS: VM, Container Services, Container Instances, Web App for Containers, Batch, Service Fabric
2. PaaS: Cloud Services, Mobile app, Web App, Service fabric
3. LaaS: Logic apps, Functions
Services for Storing data
1. Relational Data: Azure SQL db/MYSQL db/PostgreSQL db/Maria db
2. Document or Unstructured Data: Cosmos DB / Azure Storage ( files, blobs, Queue, Table, Disk etc)
3. Data Analytics : Data warehouse / Data Lake Store
Services for Securing applications and data
1. Authentication : AD (Azure Active Directory)
2. Encryption ( protect data at rest) : Encryption features provided.
3. Store User Identities : Azure Active Directory.
4. Store Keys/Secrets : Key valut.
5. Provide TLS certificate : App Service Certificate.
Services for Analysing data
1. Azure Data Factory
2. Azure Analysis Service
3. Azure Data Lake Analytics
4. Azure Stream Analysis
5. Azure Time Series Insight
6. Azure Data Bricks
7. Azure HD Insights
Services for Monitoring your applications
1. Azure Application Insights
2. Azure Log Analytics
3. Azure Network Watcher
4. Azure Monitor
5. Azure Security Center
6. Azure Advisor
7. visual studio app center
Services for Adding Intelligence to your applications
1. Machine Learning
2. Congitive Services
3. BOT Framework
Services for Event/Messaging between applications
1. command : azure storage queues, azure service bus , azure servie bus topics , azure IOT hubs, Azure Logic Apps
2. facts : discreet - azure event Grids , streaming - azure event hubs
3. mass notification : azure notification hubs
Services for Speeding and improving performance of the applications
1. Redis Cache
2. Traffic Manager
3. CDN
Services for Building/Testing/Deploying
1. VS Team Foundation Services
2. VS App Center 3 Azure devops Projects

Network and Protocols

May 28, 2024 · 2 min read

Ashish Kumar

maintainer - Tech

There are various network protocols and they works at different layers of OSI model. The digaram below gives a quick reference of protocols and layers at which they work. There are more than these protocols but we hae just mentioned the common ones.

Advice is to have an understanding of these protocols.

network layers

Networking Devices

Hub	Multiport repeater, Half Duplex, No intelligence, connect Devices together
Switch	Forward traffic to right destination
Router	Route Tables, forwarding based on IP Table , connected multiple type of networks
Access Point	Wireless router or WAP , Bridge to connect, wired or wireless
DSL
Repeater	Regenrates signals
FireWall	Filters traffic , routers or dedicated hardware
Bridge
Patch Panels
CloudBased network controller	Central way to
EOP ethernet over power lines
Power over ethernet

Most Used common protocols

Scenario	Protocol		Ports
Files	FTP		20/21
	AFP		548
	SMB	TCP	445

Emails	SMTP	25	25
	POP3		110
	IMAP		143

Web	HTTPS		443
	HTTP		80

Management	Telnet		23
	SSH		22
	SNMP	TCP/UDP	161/162
	RDP		3389

DNS			53
DHCP			67/68
SLP(Service Location Protocol)			427
NetBios		TCP/UDP	137/138/139
LDAP( Lightweight Directory Access Protocol)		TCP/UDP	389
Network time Protocol
ARP( Address Resolution Protocol)

Table of Contents​

Introduction to Data Modeling​

Why Data Modeling Matters​

Data Modeling Levels​

Types of Data Models​

Overview of Data Modeling Approaches​

Entity Modeling​

Definition and Purpose​

Key Components​

1. Entity Definition​

2. Entity Characteristics​

Entity Modeling Process​

Step 1: Entity Identification​

Step 2: Entity Classification​

Step 3: Entity Attributes Definition​

Entity Modeling Example: Library System​

Relationship Modeling​

Definition and Scope​

Relationship Types​

1. Cardinality Classifications​

2. Participation Constraints​

Advanced Relationship Concepts​

1. Recursive Relationships​

2. Ternary Relationships​

3. Identifying vs Non-Identifying Relationships​

Relationship Modeling Example: Social Media Platform​

Database Modeling​

Definition and Focus​

Database Model Types​

1. Relational Database Modeling​

2. Denormalization for Performance​

Physical Design Considerations​

1. Indexing Strategy​

2. Partitioning​

Dimensional Modeling​

Definition and Purpose​

Core Concepts​

1. Star Schema​

2. Snowflake Schema​

Dimensional Modeling Example: Retail Analytics​

Slowly Changing Dimensions (SCD)​

Type 1: Overwrite (No History)​

Type 2: Add New Record (Keep History)​

Type 3: Add New Column (Limited History)​

Object-Oriented Data Modeling​

Definition and Characteristics​

Core OO Concepts in Data Modeling​

1. Inheritance Hierarchies​

2. Implementation Strategies​

OO Modeling Example: Content Management System​

NoSQL Data Modeling​

Document-Based Modeling (MongoDB)​

Design Principles:​

Example: E-commerce Product Catalog​

Reference Pattern for Large Collections:​

Key-Value Modeling (Redis)​

Design Patterns:​

Column-Family Modeling (Cassandra)​

Design Example: Time-Series Sensor Data​

Graph Data Modeling​

Definition and Use Cases​

Graph Model Components​

1. Nodes (Vertices)​

2. Relationships (Edges)​

3. Properties​

Graph Modeling Example: Social Network (Neo4j)​

Advanced Graph Patterns​

1. Recommendation Engine Model​

2. Fraud Detection Model​

Time-Series Data Modeling​

Definition and Characteristics​

Design Principles​

Time-Series Schema Patterns​

1. Single Metric Per Row (InfluxDB Style)​

2. Multiple Metrics Per Row (Wide Table)​

3. Event-Based Time Series​

Time-Series Optimization Strategies​

1. Partitioning by Time​

2. Downsampling and Rollups​

3. Retention Policies​

Table of Contents

Introduction to Data Modeling

Why Data Modeling Matters

Data Modeling Levels

Types of Data Models

Overview of Data Modeling Approaches

Entity Modeling

Definition and Purpose

Key Components

1. Entity Definition

2. Entity Characteristics

Entity Modeling Process

Step 1: Entity Identification

Step 2: Entity Classification

Step 3: Entity Attributes Definition

Entity Modeling Example: Library System

Relationship Modeling

Definition and Scope

Relationship Types

1. Cardinality Classifications

2. Participation Constraints

Advanced Relationship Concepts

1. Recursive Relationships

2. Ternary Relationships

3. Identifying vs Non-Identifying Relationships

Relationship Modeling Example: Social Media Platform

Database Modeling

Definition and Focus

Database Model Types

1. Relational Database Modeling

2. Denormalization for Performance

Physical Design Considerations

1. Indexing Strategy

2. Partitioning

Dimensional Modeling

Definition and Purpose

Core Concepts

1. Star Schema

2. Snowflake Schema

Dimensional Modeling Example: Retail Analytics

Slowly Changing Dimensions (SCD)

Type 1: Overwrite (No History)

Type 2: Add New Record (Keep History)

Type 3: Add New Column (Limited History)

Object-Oriented Data Modeling

Definition and Characteristics

Core OO Concepts in Data Modeling

1. Inheritance Hierarchies

2. Implementation Strategies

OO Modeling Example: Content Management System

NoSQL Data Modeling

Document-Based Modeling (MongoDB)

Design Principles:

Example: E-commerce Product Catalog

Reference Pattern for Large Collections:

Key-Value Modeling (Redis)

Design Patterns:

Column-Family Modeling (Cassandra)

Design Example: Time-Series Sensor Data

Graph Data Modeling

Definition and Use Cases

Graph Model Components

1. Nodes (Vertices)

2. Relationships (Edges)

3. Properties

Graph Modeling Example: Social Network (Neo4j)

Advanced Graph Patterns

1. Recommendation Engine Model

2. Fraud Detection Model

Time-Series Data Modeling

Definition and Characteristics

Design Principles

Time-Series Schema Patterns

1. Single Metric Per Row (InfluxDB Style)

2. Multiple Metrics Per Row (Wide Table)

3. Event-Based Time Series

Time-Series Optimization Strategies

1. Partitioning by Time

2. Downsampling and Rollups

3. Retention Policies