Skip to main content

Data Modeling: A short guide with examples

· 26 min read

Table of Contents

  1. Introduction to Data Modeling
  2. Types of Data Models
  3. Entity Modeling
  4. Relationship Modeling
  5. Database Modeling
  6. Dimensional Modeling
  7. Object-Oriented Data Modeling
  8. NoSQL Data Modeling
  9. Graph Data Modeling
  10. Time-Series Data Modeling
  11. Event-Driven Data Modeling
  12. Data Modeling Methodologies
  13. Best Practices and Design Patterns
  14. Tools and Technologies

Introduction to Data Modeling

Data modeling is the process of creating a conceptual representation of data objects, their relationships, and the rules that govern them. It serves as a blueprint for database design and helps ensure data integrity, consistency, and efficient access patterns.

Why Data Modeling Matters

  • Communication: Provides a common language between business and technical teams
  • Documentation: Serves as living documentation of data structures
  • Quality: Ensures data consistency and integrity
  • Performance: Optimizes data access patterns and query performance
  • Maintenance: Simplifies future modifications and enhancements

Data Modeling Levels

Conceptual Model (What) → Logical Model (How) → Physical Model (Where)
↓ ↓ ↓
Business View Technical View Implementation View

Types of Data Models

Overview of Data Modeling Approaches

Model TypePrimary FocusUse CasesKey Characteristics
Entity ModelingBusiness objectsRequirements gatheringHigh-level, business-focused
Relationship ModelingObject connectionsProcess designFocus on interactions
Database ModelingStorage structureImplementationTechnical, performance-oriented
Dimensional ModelingAnalyticsData warehousingStar/snowflake schemas
Object-OrientedCode alignmentOOP applicationsInheritance, encapsulation
NoSQL ModelingFlexible schemasModern applicationsDocument, key-value, column
Graph ModelingNetwork relationshipsSocial networks, fraudNodes and edges
Time-SeriesTemporal dataIoT, monitoringTime-indexed data
Event-DrivenState changesEvent sourcingImmutable events

Entity Modeling

Definition and Purpose

Entity modeling focuses on identifying and defining the core business objects (entities) that an organization needs to track. It's the foundation of conceptual data modeling.

Key Components

1. Entity Definition

Entity: A thing of significance to the business about which data must be stored
Examples: Customer, Product, Order, Employee, Account

2. Entity Characteristics

  • Tangible vs Intangible: Physical objects vs concepts
  • Independent vs Dependent: Can exist alone vs requires parent entity
  • Strong vs Weak: Has its own identifier vs borrows identifier

Entity Modeling Process

Step 1: Entity Identification

Business Requirements Analysis:
1. Review business processes
2. Identify nouns in requirements
3. Determine what needs to be tracked
4. Validate with stakeholders

Example: E-commerce System
- Customer (who buys)
- Product (what is sold)
- Order (purchase transaction)
- Category (product grouping)
- Supplier (product source)

Step 2: Entity Classification

Classification Framework:

Core Entities (Business Critical):
├── Customer
├── Product
└── Order

Supporting Entities (Operational):
├── Category
├── Supplier
└── Address

Reference Entities (Master Data):
├── Country
├── Currency
└── Status

Step 3: Entity Attributes Definition

Entity: Customer
├── Identifying Attributes
│ └── CustomerID (Primary Key)
├── Descriptive Attributes
│ ├── FirstName
│ ├── LastName
│ ├── Email
│ └── PhoneNumber
└── Administrative Attributes
├── CreatedDate
├── LastModified
└── IsActive

Entity Modeling Example: Library System

Library Management System Entities:

1. BOOK
- BookID (PK)
- ISBN
- Title
- Author
- Publisher
- PublicationYear
- Genre
- Available

2. MEMBER
- MemberID (PK)
- FirstName
- LastName
- Email
- PhoneNumber
- Address
- MembershipType
- JoinDate

3. LOAN
- LoanID (PK)
- BookID (FK)
- MemberID (FK)
- LoanDate
- DueDate
- ReturnDate
- Status

4. AUTHOR
- AuthorID (PK)
- FirstName
- LastName
- Biography
- Nationality

5. RESERVATION
- ReservationID (PK)
- BookID (FK)
- MemberID (FK)
- ReservationDate
- ExpiryDate
- Status

Relationship Modeling

Definition and Scope

Relationship modeling focuses on defining how entities interact with each other, capturing the business rules that govern these interactions.

Relationship Types

1. Cardinality Classifications

One-to-One (1:1):
├── Person ←→ Passport
├── Employee ←→ Desk Assignment
└── Country ←→ Capital City

One-to-Many (1:M):
├── Customer → Orders
├── Department → Employees
└── Category → Products

Many-to-Many (M:M):
├── Students ←→ Courses
├── Books ←→ Authors
└── Employees ←→ Projects

2. Participation Constraints

Mandatory vs Optional:

Mandatory (Total Participation):
- Every Order MUST have a Customer
- Every Employee MUST belong to a Department

Optional (Partial Participation):
- A Customer MAY have Orders
- An Employee MAY have a Manager

Advanced Relationship Concepts

1. Recursive Relationships

Employee Management Hierarchy:
Employee ──manages──→ Employee

Examples:
- Manager manages Employees
- Category has Parent Category
- Product has Related Products

2. Ternary Relationships

Three-way relationship:
Supplier ──supplies──→ Product ──to──→ Project

Business Rule:
"A supplier supplies specific products to specific projects"

3. Identifying vs Non-Identifying Relationships

Identifying Relationship:
Customer ──has──→ Order
(Order cannot exist without Customer)

Non-Identifying Relationship:
Order ──assigned_to──→ Salesperson
(Order exists independently of Salesperson assignment)

Relationship Modeling Example: Social Media Platform

Social Media Relationships:

1. USER follows USER (M:M, Recursive)
- FollowerID
- FollowingID
- FollowDate
- NotificationEnabled

2. USER creates POST (1:M)
- UserID
- PostID
- CreationDate

3. USER likes POST (M:M)
- UserID
- PostID
- LikeDate
- LikeType (like, love, laugh, etc.)

4. POST belongs_to GROUP (M:1)
- PostID
- GroupID
- PostDate

5. USER joins GROUP (M:M)
- UserID
- GroupID
- JoinDate
- Role (member, admin, moderator)

Database Modeling

Definition and Focus

Database modeling translates conceptual and logical models into physical database structures, focusing on implementation details, performance optimization, and storage efficiency.

Database Model Types

1. Relational Database Modeling

Characteristics:

  • Tables with rows and columns
  • Primary and foreign key relationships
  • ACID properties
  • SQL-based querying

Design Process:

Normalization Process:
1NF → 2NF → 3NF → BCNF
├── Eliminate repeating groups
├── Remove partial dependencies
├── Remove transitive dependencies
└── Remove remaining anomalies

Example: E-commerce Relational Model

-- Customers Table
CREATE TABLE Customers (
CustomerID INT PRIMARY KEY,
FirstName VARCHAR(50) NOT NULL,
LastName VARCHAR(50) NOT NULL,
Email VARCHAR(100) UNIQUE NOT NULL,
PhoneNumber VARCHAR(15),
CreatedDate TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
INDEX idx_email (Email)
);

-- Products Table
CREATE TABLE Products (
ProductID INT PRIMARY KEY,
ProductName VARCHAR(200) NOT NULL,
CategoryID INT,
Price DECIMAL(10,2) NOT NULL,
StockQuantity INT DEFAULT 0,
Description TEXT,
IsActive BOOLEAN DEFAULT TRUE,
FOREIGN KEY (CategoryID) REFERENCES Categories(CategoryID),
INDEX idx_category (CategoryID),
INDEX idx_active (IsActive)
);

-- Orders Table
CREATE TABLE Orders (
OrderID INT PRIMARY KEY,
CustomerID INT NOT NULL,
OrderDate TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
TotalAmount DECIMAL(12,2) NOT NULL,
OrderStatus ENUM('pending', 'processing', 'shipped', 'delivered', 'cancelled'),
FOREIGN KEY (CustomerID) REFERENCES Customers(CustomerID),
INDEX idx_customer (CustomerID),
INDEX idx_date (OrderDate),
INDEX idx_status (OrderStatus)
);

-- Order Items Junction Table
CREATE TABLE OrderItems (
OrderID INT,
ProductID INT,
Quantity INT NOT NULL,
UnitPrice DECIMAL(10,2) NOT NULL,
PRIMARY KEY (OrderID, ProductID),
FOREIGN KEY (OrderID) REFERENCES Orders(OrderID),
FOREIGN KEY (ProductID) REFERENCES Products(ProductID)
);

2. Denormalization for Performance

-- Denormalized Customer Order Summary
CREATE TABLE CustomerOrderSummary (
CustomerID INT PRIMARY KEY,
CustomerName VARCHAR(100),
TotalOrders INT DEFAULT 0,
TotalSpent DECIMAL(12,2) DEFAULT 0.00,
LastOrderDate TIMESTAMP,
AverageOrderValue DECIMAL(10,2),
PreferredCategory VARCHAR(50),
-- Denormalized for reporting performance
FOREIGN KEY (CustomerID) REFERENCES Customers(CustomerID)
);

Physical Design Considerations

1. Indexing Strategy

-- Performance Indexes
CREATE INDEX idx_orders_customer_date ON Orders(CustomerID, OrderDate);
CREATE INDEX idx_products_category_price ON Products(CategoryID, Price);
CREATE INDEX idx_customers_email_active ON Customers(Email, IsActive);

-- Composite Index for Common Query Patterns
CREATE INDEX idx_order_items_product_order ON OrderItems(ProductID, OrderID);

2. Partitioning

-- Range Partitioning by Date
CREATE TABLE Orders (
OrderID INT,
CustomerID INT,
OrderDate DATE,
...
) PARTITION BY RANGE (YEAR(OrderDate)) (
PARTITION p2022 VALUES LESS THAN (2023),
PARTITION p2023 VALUES LESS THAN (2024),
PARTITION p2024 VALUES LESS THAN (2025),
PARTITION pFuture VALUES LESS THAN MAXVALUE
);

Dimensional Modeling

Definition and Purpose

Dimensional modeling is designed specifically for analytical workloads and data warehousing, optimizing for query performance and business intelligence.

Core Concepts

1. Star Schema

Fact Table (Center):
├── Sales Fact
│ ├── SalesID (PK)
│ ├── ProductKey (FK)
│ ├── CustomerKey (FK)
│ ├── DateKey (FK)
│ ├── StoreKey (FK)
│ ├── Quantity
│ ├── Revenue
│ └── Profit

Dimension Tables (Points):
├── Product Dimension
├── Customer Dimension
├── Date Dimension
└── Store Dimension

2. Snowflake Schema

Normalized Dimensions:
Product Dimension:
├── Product
│ ├── ProductKey
│ ├── ProductName
│ └── CategoryKey (FK)
└── Category
├── CategoryKey
├── CategoryName
└── DepartmentKey (FK)
└── Department
├── DepartmentKey
└── DepartmentName

Dimensional Modeling Example: Retail Analytics

-- Date Dimension (Type 0 - Never Changes)
CREATE TABLE DimDate (
DateKey INT PRIMARY KEY,
FullDate DATE,
DayOfWeek INT,
DayName VARCHAR(10),
Month INT,
MonthName VARCHAR(10),
Quarter INT,
Year INT,
IsWeekend BOOLEAN,
IsHoliday BOOLEAN
);

-- Product Dimension (Type 2 - Track History)
CREATE TABLE DimProduct (
ProductKey INT PRIMARY KEY,
ProductID INT, -- Business Key
ProductName VARCHAR(200),
Category VARCHAR(100),
Brand VARCHAR(100),
Price DECIMAL(10,2),
EffectiveDate DATE,
ExpiryDate DATE,
IsCurrent BOOLEAN
);

-- Customer Dimension (Type 1 - Overwrite)
CREATE TABLE DimCustomer (
CustomerKey INT PRIMARY KEY,
CustomerID INT, -- Business Key
CustomerName VARCHAR(100),
CustomerSegment VARCHAR(50),
City VARCHAR(100),
State VARCHAR(50),
Country VARCHAR(50)
);

-- Sales Fact Table
CREATE TABLE FactSales (
SalesKey INT PRIMARY KEY,
ProductKey INT,
CustomerKey INT,
DateKey INT,
Quantity INT,
UnitPrice DECIMAL(10,2),
Revenue DECIMAL(12,2),
Cost DECIMAL(12,2),
Profit DECIMAL(12,2),
FOREIGN KEY (ProductKey) REFERENCES DimProduct(ProductKey),
FOREIGN KEY (CustomerKey) REFERENCES DimCustomer(CustomerKey),
FOREIGN KEY (DateKey) REFERENCES DimDate(DateKey)
);

Slowly Changing Dimensions (SCD)

Type 1: Overwrite (No History)

-- Update customer address - lose history
UPDATE DimCustomer
SET Address = 'New Address', City = 'New City'
WHERE CustomerID = 12345;

Type 2: Add New Record (Keep History)

-- End current record
UPDATE DimProduct
SET ExpiryDate = '2024-01-31', IsCurrent = FALSE
WHERE ProductID = 456 AND IsCurrent = TRUE;

-- Insert new record with changes
INSERT INTO DimProduct (ProductID, ProductName, Price, EffectiveDate, IsCurrent)
VALUES (456, 'Updated Product Name', 29.99, '2024-02-01', TRUE);

Type 3: Add New Column (Limited History)

ALTER TABLE DimCustomer 
ADD COLUMN PreviousSegment VARCHAR(50),
ADD COLUMN SegmentChangeDate DATE;

UPDATE DimCustomer
SET PreviousSegment = CustomerSegment,
CustomerSegment = 'Premium',
SegmentChangeDate = CURRENT_DATE
WHERE CustomerID = 789;

Object-Oriented Data Modeling

Definition and Characteristics

Object-oriented data modeling aligns data structures with object-oriented programming paradigms, emphasizing encapsulation, inheritance, and polymorphism.

Core OO Concepts in Data Modeling

1. Inheritance Hierarchies

Vehicle (Superclass)
├── Car (Subclass)
│ ├── Sedan
│ └── SUV
├── Motorcycle (Subclass)
└── Truck (Subclass)
├── PickupTruck
└── DeliveryTruck

2. Implementation Strategies

Table Per Hierarchy (TPH):

CREATE TABLE Vehicle (
VehicleID INT PRIMARY KEY,
VehicleType VARCHAR(20), -- Discriminator
Brand VARCHAR(50),
Model VARCHAR(50),
Year INT,
-- Car specific
NumberOfDoors INT,
-- Motorcycle specific
HasSidecar BOOLEAN,
-- Truck specific
LoadCapacity DECIMAL(10,2)
);

Table Per Type (TPT):

-- Base table
CREATE TABLE Vehicle (
VehicleID INT PRIMARY KEY,
Brand VARCHAR(50),
Model VARCHAR(50),
Year INT
);

-- Derived tables
CREATE TABLE Car (
VehicleID INT PRIMARY KEY,
NumberOfDoors INT,
FOREIGN KEY (VehicleID) REFERENCES Vehicle(VehicleID)
);

CREATE TABLE Motorcycle (
VehicleID INT PRIMARY KEY,
HasSidecar BOOLEAN,
FOREIGN KEY (VehicleID) REFERENCES Vehicle(VehicleID)
);

Table Per Concrete Class (TPC):

CREATE TABLE Car (
CarID INT PRIMARY KEY,
Brand VARCHAR(50),
Model VARCHAR(50),
Year INT,
NumberOfDoors INT
);

CREATE TABLE Motorcycle (
MotorcycleID INT PRIMARY KEY,
Brand VARCHAR(50),
Model VARCHAR(50),
Year INT,
HasSidecar BOOLEAN
);

OO Modeling Example: Content Management System

-- Base Content Entity
CREATE TABLE Content (
ContentID INT PRIMARY KEY,
Title VARCHAR(200),
Author VARCHAR(100),
CreatedDate TIMESTAMP,
ContentType VARCHAR(50), -- Discriminator
Status ENUM('draft', 'published', 'archived')
);

-- Article (inherits from Content)
CREATE TABLE Article (
ContentID INT PRIMARY KEY,
Body TEXT,
WordCount INT,
ReadingTime INT,
FOREIGN KEY (ContentID) REFERENCES Content(ContentID)
);

-- Video (inherits from Content)
CREATE TABLE Video (
ContentID INT PRIMARY KEY,
Duration INT, -- seconds
Resolution VARCHAR(20),
FileSize BIGINT,
VideoURL VARCHAR(500),
FOREIGN KEY (ContentID) REFERENCES Content(ContentID)
);

-- Podcast (inherits from Content)
CREATE TABLE Podcast (
ContentID INT PRIMARY KEY,
Duration INT, -- seconds
EpisodeNumber INT,
AudioURL VARCHAR(500),
FOREIGN KEY (ContentID) REFERENCES Content(ContentID)
);

-- Content Tags (Many-to-Many)
CREATE TABLE ContentTags (
ContentID INT,
TagID INT,
PRIMARY KEY (ContentID, TagID),
FOREIGN KEY (ContentID) REFERENCES Content(ContentID),
FOREIGN KEY (TagID) REFERENCES Tags(TagID)
);

NoSQL Data Modeling

Document-Based Modeling (MongoDB)

Document modeling focuses on flexible, schema-less data storage optimized for application data access patterns.

Design Principles:

  1. Embed Related Data when accessed together
  2. Reference Large or Frequently Updated Data
  3. Optimize for Application Queries
  4. Consider Document Size Limits

Example: E-commerce Product Catalog

// Embedded Document Model
{
"_id": "product_123",
"name": "Wireless Headphones",
"brand": "TechBrand",
"price": 99.99,
"category": {
"id": "electronics",
"name": "Electronics",
"parent": "consumer_goods"
},
"specifications": {
"battery_life": "20 hours",
"connectivity": ["Bluetooth 5.0", "USB-C"],
"weight": "250g"
},
"reviews": [
{
"user_id": "user_456",
"username": "techreviewer",
"rating": 4,
"comment": "Great sound quality!",
"date": "2024-01-15"
}
],
"inventory": {
"stock_count": 150,
"warehouse_locations": ["WH01", "WH03"],
"reorder_level": 20
},
"created_date": "2024-01-01T00:00:00Z",
"last_updated": "2024-01-15T10:30:00Z"
}

Reference Pattern for Large Collections:

// Product Document (without embedded reviews)
{
"_id": "product_123",
"name": "Wireless Headphones",
"brand": "TechBrand",
"price": 99.99,
"review_count": 1250,
"average_rating": 4.2
}

// Separate Reviews Collection
{
"_id": "review_789",
"product_id": "product_123",
"user_id": "user_456",
"rating": 4,
"comment": "Great sound quality!",
"helpful_votes": 15,
"date": "2024-01-15"
}

Key-Value Modeling (Redis)

Key-value modeling focuses on simple, fast data retrieval using unique keys.

Design Patterns:

1. Simple Key-Value Storage:

# User session storage
SET user:session:abc123 "{"user_id": 456, "login_time": "2024-01-15T10:00:00Z"}"
EXPIRE user:session:abc123 3600

# Product cache
SET product:123 "{"name": "Headphones", "price": 99.99}"

2. Hash Storage:

# User profile
HSET user:456 name "John Doe"
HSET user:456 email "john@example.com"
HSET user:456 last_login "2024-01-15"

# Product details
HSET product:123 name "Headphones"
HSET product:123 price "99.99"
HSET product:123 stock "150"

3. List Storage:

# User activity feed
LPUSH user:456:activities "liked_product:123"
LPUSH user:456:activities "viewed_product:789"
LTRIM user:456:activities 0 99 # Keep last 100 activities

4. Set Storage:

# User's favorite products
SADD user:456:favorites product:123
SADD user:456:favorites product:456
SADD user:456:favorites product:789

# Product tags
SADD product:123:tags "electronics"
SADD product:123:tags "audio"
SADD product:123:tags "wireless"

Column-Family Modeling (Cassandra)

Column-family modeling optimizes for high-volume, distributed data storage with focus on query patterns.

Design Example: Time-Series Sensor Data

-- Sensor readings table
CREATE TABLE sensor_readings (
sensor_id UUID,
reading_date DATE,
timestamp TIMESTAMP,
temperature DECIMAL,
humidity DECIMAL,
pressure DECIMAL,
PRIMARY KEY ((sensor_id, reading_date), timestamp)
) WITH CLUSTERING ORDER BY (timestamp DESC);

-- User activity tracking
CREATE TABLE user_activity (
user_id UUID,
activity_date DATE,
timestamp TIMESTAMP,
activity_type TEXT,
activity_data MAP<TEXT, TEXT>,
PRIMARY KEY ((user_id, activity_date), timestamp)
) WITH CLUSTERING ORDER BY (timestamp DESC);

-- Product catalog with denormalization
CREATE TABLE products_by_category (
category_id UUID,
product_id UUID,
product_name TEXT,
price DECIMAL,
brand TEXT,
rating DECIMAL,
created_date TIMESTAMP,
PRIMARY KEY (category_id, price, product_id)
) WITH CLUSTERING ORDER BY (price ASC);

Graph Data Modeling

Definition and Use Cases

Graph modeling represents data as nodes (entities) and edges (relationships), ideal for highly connected data and complex relationship queries.

Graph Model Components

1. Nodes (Vertices)

Node Types:
├── Person
├── Company
├── Product
├── Location
└── Event

2. Relationships (Edges)

Relationship Types:
├── WORKS_FOR (Person → Company)
├── FRIENDS_WITH (Person → Person)
├── LOCATED_IN (Company → Location)
├── PURCHASED (Person → Product)
└── SIMILAR_TO (Product → Product)

3. Properties

Node Properties:
Person: {name, age, email, created_date}
Product: {title, price, category, rating}

Edge Properties:
WORKS_FOR: {start_date, position, salary}
PURCHASED: {date, quantity, total_amount}

Graph Modeling Example: Social Network (Neo4j)

// Create Person nodes
CREATE (alice:Person {
name: 'Alice Johnson',
age: 30,
email: 'alice@email.com',
location: 'New York'
})

CREATE (bob:Person {
name: 'Bob Smith',
age: 25,
email: 'bob@email.com',
location: 'San Francisco'
})

// Create Company node
CREATE (tech_corp:Company {
name: 'TechCorp',
industry: 'Technology',
founded: 2010,
location: 'San Francisco'
})

// Create relationships
CREATE (alice)-[:WORKS_FOR {
position: 'Software Engineer',
start_date: '2022-01-15',
salary: 120000
}]->(tech_corp)

CREATE (alice)-[:FRIENDS_WITH {
since: '2020-05-10',
connection_strength: 0.8
}]->(bob)

// Create skill nodes and relationships
CREATE (python:Skill {name: 'Python', category: 'Programming'})
CREATE (alice)-[:HAS_SKILL {proficiency: 'Expert', years: 5}]->(python)

// Query examples
// Find Alice's colleagues
MATCH (alice:Person {name: 'Alice Johnson'})-[:WORKS_FOR]->(company)
<-[:WORKS_FOR]-(colleague:Person)
RETURN colleague.name, colleague.email

// Find friends of friends
MATCH (alice:Person {name: 'Alice Johnson'})-[:FRIENDS_WITH]->(friend)
-[:FRIENDS_WITH]->(friend_of_friend:Person)
WHERE friend_of_friend <> alice
RETURN friend_of_friend.name, friend.name as mutual_friend

Advanced Graph Patterns

1. Recommendation Engine Model

// Product recommendation based on purchase patterns
MATCH (user:Person {name: 'Alice'})-[:PURCHASED]->(product:Product)
<-[:PURCHASED]-(other_user:Person)-[:PURCHASED]->(recommended:Product)
WHERE NOT (user)-[:PURCHASED]->(recommended)
RETURN recommended.name,
COUNT(other_user) as recommendation_score
ORDER BY recommendation_score DESC
LIMIT 10

2. Fraud Detection Model

// Find suspicious patterns - multiple people using same device/location
MATCH (person1:Person)-[:USES_DEVICE]->(device:Device)
<-[:USES_DEVICE]-(person2:Person)
WHERE person1 <> person2
WITH person1, person2, device,
COUNT(*) as shared_sessions
WHERE shared_sessions > 5
RETURN person1.name, person2.name,
device.id, shared_sessions
ORDER BY shared_sessions DESC

Time-Series Data Modeling

Definition and Characteristics

Time-series modeling focuses on data points indexed by time, optimized for temporal queries, aggregations, and trend analysis.

Design Principles

  1. Time as Primary Index: Optimize for time-based queries
  2. Immutable Data: Most time-series data doesn't change
  3. High Volume: Handle millions of data points
  4. Aggregation Friendly: Support downsampling and rollups
  5. Retention Policies: Automatic data lifecycle management

Time-Series Schema Patterns

1. Single Metric Per Row (InfluxDB Style)

-- IoT Sensor Measurements
CREATE TABLE sensor_measurements (
time TIMESTAMP,
sensor_id STRING,
location STRING,
measurement_type STRING, -- temperature, humidity, pressure
value FLOAT,
unit STRING,
tags MAP<STRING, STRING>, -- Additional metadata
PRIMARY KEY (time, sensor_id, measurement_type)
);

-- Sample Data
INSERT INTO sensor_measurements VALUES
('2024-01-15 10:00:00', 'sensor_001', 'warehouse_a', 'temperature', 23.5, 'celsius', {'floor': '1', 'zone': 'A'}),
('2024-01-15 10:00:00', 'sensor_001', 'warehouse_a', 'humidity', 65.2, 'percent', {'floor': '1', 'zone': 'A'}),
('2024-01-15 10:01:00', 'sensor_001', 'warehouse_a', 'temperature', 23.7, 'celsius', {'floor': '1', 'zone': 'A'});

2. Multiple Metrics Per Row (Wide Table)

-- Application Performance Metrics
CREATE TABLE app_metrics (
timestamp TIMESTAMP,
server_id STRING,
cpu_usage_percent FLOAT,
memory_usage_mb FLOAT,
disk_io_read_mb FLOAT,
disk_io_write_mb FLOAT,
network_in_mb FLOAT,
network_out_mb FLOAT,
active_connections INT,
response_time_ms FLOAT,
error_rate_percent FLOAT,
PRIMARY KEY (timestamp, server_id)
);

3. Event-Based Time Series

-- User Activity Events
CREATE TABLE user_events (
event_time TIMESTAMP,
user_id UUID,
session_id STRING,
event_type STRING, -- page_view, click, purchase, logout
event_data JSON, -- Flexible event-specific data
user_agent STRING,
ip_address STRING,
referrer STRING,
PRIMARY KEY (event_time, user_id, session_id)
) PARTITION BY RANGE (event_time);

Time-Series Optimization Strategies

1. Partitioning by Time

-- Monthly partitions for large datasets
CREATE TABLE metrics_2024_01 PARTITION OF metrics
FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');

CREATE TABLE metrics_2024_02 PARTITION OF metrics
FOR VALUES FROM ('2024-02-01') TO ('2024-03-01');

2. Downsampling and Rollups

-- Create aggregated tables for different time granularities
CREATE TABLE hourly_metrics AS
SELECT
DATE_TRUNC('hour', timestamp) as hour,
server_id,
AVG(cpu_usage_percent) as avg_cpu,
MAX(cpu_usage_percent) as max_cpu,
AVG(memory_usage_mb) as avg_memory,
COUNT(*) as data_points
FROM app_metrics
GROUP BY hour, server_id;

CREATE TABLE daily_metrics AS
SELECT
DATE_TRUNC('day', hour) as day,
server_id,
AVG(avg_cpu) as avg_cpu,
MAX(max_cpu) as peak_cpu,
AVG(avg_memory) as avg_memory
FROM hourly_metrics
GROUP BY day, server_id;

3. Retention Policies

-- Automatic data lifecycle management
DELETE FROM app_metrics
WHERE timestamp < NOW() - INTERVAL '90 days';

DELETE FROM hourly_metrics
WHERE hour < NOW() - INTERVAL '1 year';

-- Keep daily aggregates for 5 years
DELETE FROM daily_metrics
WHERE day < NOW() - INTERVAL '5 years';

Event-Driven Data Modeling

Definition and Purpose

Event-driven modeling captures state changes as immutable events, enabling event sourcing, audit trails, and eventual consistency patterns.

Core Concepts

1. Event Structure

{
"event_id": "evt_123456789",
"event_type": "OrderPlaced",
"aggregate_id": "order_987654",
"aggregate_type": "Order",
"event_version": 1,
"timestamp": "2024-01-15T10:30:00Z",
"user_id": "user_456",
"correlation_id": "session_abc123",
"causation_id": "command_xyz789",
"data": {
"customer_id": "cust_123",
"items": [
{
"product_id": "prod_456",
"quantity": 2,
"unit_price": 29.99
}
],
"total_amount": 59.98,
"shipping_address": {
"street": "123 Main St",
"city": "New York",
"state": "NY",
"zip": "10001"
}
},
"metadata": {
"source": "web_application",
"user_agent": "Mozilla/5.0...",
"ip_address": "192.168.1.1"
}
}

2. Event Store Schema

-- Event Store Table
CREATE TABLE event_store (
event_id UUID PRIMARY KEY,
aggregate_id UUID NOT NULL,
aggregate_type VARCHAR(100) NOT NULL,
event_type VARCHAR(100) NOT NULL,
event_version INTEGER NOT NULL,
event_data JSONB NOT NULL,
metadata JSONB,
timestamp TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
correlation_id UUID,
causation_id UUID,

-- Ensure event ordering within aggregate
UNIQUE(aggregate_id, event_version),

-- Indexes for common query patterns
INDEX idx_aggregate (aggregate_id, event_version),
INDEX idx_event_type (event_type, timestamp),
INDEX idx_correlation (correlation_id),
INDEX idx_timestamp (timestamp)
);

Event-Driven Modeling Examples

1. E-commerce Order Lifecycle

-- Event Types for Order Aggregate
INSERT INTO event_store (event_id, aggregate_id, aggregate_type, event_type, event_version, event_data) VALUES

-- Order Placed
('evt_001', 'order_123', 'Order', 'OrderPlaced', 1, '{
"customer_id": "cust_456",
"items": [{"product_id": "prod_789", "quantity": 2}],
"total": 99.98
}'),

-- Payment Processed
('evt_002', 'order_123', 'Order', 'PaymentProcessed', 2, '{
"payment_method": "credit_card",
"amount": 99.98,
"transaction_id": "txn_abc123"
}'),

-- Order Shipped
('evt_003', 'order_123', 'Order', 'OrderShipped', 3, '{
"tracking_number": "TRK123456789",
"carrier": "UPS",
"estimated_delivery": "2024-01-20"
}'),

-- Order Delivered
('evt_004', 'order_123', 'Order', 'OrderDelivered', 4, '{
"delivered_at": "2024-01-19T14:30:00Z",
"signature": "J.Smith"
}');

2. Projection/Read Model Generation

-- Order Summary Projection
CREATE TABLE order_projections (
order_id UUID PRIMARY KEY,
customer_id UUID,
order_status VARCHAR(50),
total_amount DECIMAL(10,2),
created_at TIMESTAMP,
shipped_at TIMESTAMP,
delivered_at TIMESTAMP,
last_updated TIMESTAMP
);

-- Projection Update Function
CREATE OR REPLACE FUNCTION update_order_projection()
RETURNS TRIGGER AS $$
BEGIN
CASE NEW.event_type
WHEN 'OrderPlaced' THEN
INSERT INTO order_projections (
order_id, customer_id, order_status,
total_amount, created_at, last_updated
) VALUES (
NEW.aggregate_id,
(NEW.event_data->>'customer_id')::UUID,
'placed',
(NEW.event_data->>'total')::DECIMAL,
NEW.timestamp,
NEW.timestamp
);

WHEN 'PaymentProcessed' THEN
UPDATE order_projections
SET order_status = 'paid', last_updated = NEW.timestamp
WHERE order_id = NEW.aggregate_id;

WHEN 'OrderShipped' THEN
UPDATE order_projections
SET order_status = 'shipped',
shipped_at = NEW.timestamp,
last_updated = NEW.timestamp
WHERE order_id = NEW.aggregate_id;

WHEN 'OrderDelivered' THEN
UPDATE order_projections
SET order_status = 'delivered',
delivered_at = (NEW.event_data->>'delivered_at')::TIMESTAMP,
last_updated = NEW.timestamp
WHERE order_id = NEW.aggregate_id;
END CASE;

RETURN NEW;
END;
$$ LANGUAGE plpgsql;

-- Trigger to maintain projections
CREATE TRIGGER order_projection_trigger
AFTER INSERT ON event_store
FOR EACH ROW
EXECUTE FUNCTION update_order_projection();

Event Sourcing Benefits and Challenges

Benefits:

  • Complete Audit Trail: Every state change is recorded
  • Time Travel: Recreate state at any point in time
  • Debugging: Understand exactly what happened and when
  • Analytics: Rich data for business intelligence
  • Flexibility: Easy to create new projections

Challenges:

  • Complexity: More complex than CRUD operations
  • Storage: Events accumulate over time
  • Eventual Consistency: Projections may be temporarily out of sync
  • Schema Evolution: Handling event structure changes

Data Modeling Methodologies

1. Top-Down Approach

Process Flow:

Business Requirements → Conceptual Model → Logical Model → Physical Model

Characteristics:
├── Starts with business needs
├── Focuses on entities and relationships
├── Gradually adds technical details
└── Good for new systems

Example: Library Management System

Step 1: Business Requirements
- Track books and members
- Manage loans and returns
- Handle reservations
- Generate reports

Step 2: Conceptual Model
- Entities: Book, Member, Loan, Reservation
- Relationships: Member borrows Book, Member reserves Book

Step 3: Logical Model
- Add attributes, keys, constraints
- Define cardinalities and participation
- Normalize relationships

Step 4: Physical Model
- Choose database technology
- Define indexes and partitions
- Optimize for performance

2. Bottom-Up Approach

Process Flow:

Existing Data → Normalize → Group Related Data → Create Logical Model

Characteristics:
├── Starts with existing data sources
├── Focuses on data analysis and normalization
├── Good for legacy system modernization
└── Data-driven rather than requirements-driven

3. Agile Data Modeling

Iterative Process:

Initial Model → Implement → Get Feedback → Refine Model → Repeat

Characteristics:
├── Evolutionary design
├── Just enough modeling
├── Continuous refinement
├── Close collaboration with development team
└── Embrace change

Agile Modeling Principles:

  • Model with a Purpose: Only model what's necessary
  • Multiple Models: Use different models for different purposes
  • Travel Light: Keep models simple and maintainable
  • Content Over Representation: Focus on meaning, not notation
  • Know Your Tools: Use appropriate modeling tools
  • Adapt to Specific Needs: No one-size-fits-all approach

4. Domain-Driven Design (DDD)

Core Concepts:

Bounded Context → Aggregates → Entities → Value Objects

Domain Model Components:
├── Entities (Identity-based objects)
├── Value Objects (Attribute-based objects)
├── Aggregates (Consistency boundaries)
├── Domain Services (Domain logic not fitting entities)
└── Repositories (Data access abstraction)

DDD Modeling Example:

E-commerce Bounded Contexts:

1. Order Management Context:
├── Order Aggregate
│ ├── Order (Entity - Root)
│ ├── OrderItem (Entity)
│ └── Address (Value Object)
└── Customer (Entity)

2. Inventory Context:
├── Product Aggregate
│ ├── Product (Entity - Root)
│ ├── ProductVariant (Entity)
│ └── Price (Value Object)
└── Category (Entity)

3. Payment Context:
├── Payment Aggregate
│ ├── Payment (Entity - Root)
│ ├── PaymentMethod (Value Object)
│ └── Transaction (Entity)

Best Practices and Design Patterns

1. Universal Design Principles

Data Quality Principles

ACCURATE: Data should be correct and error-free
COMPLETE: All required data should be present
CONSISTENT: Data should be consistent across systems
TIMELY: Data should be available when needed
VALID: Data should conform to defined formats
UNIQUE: No unnecessary duplication

Naming Conventions

-- Table Naming
customers -- Plural nouns
order_items -- Underscore for compound names
customer_addresses -- Clear relationship indication

-- Column Naming
customer_id -- Descriptive and consistent
first_name -- Readable format
created_timestamp -- Include data type hint
is_active -- Boolean prefix
total_amount -- Clear business meaning

-- Index Naming
idx_customers_email -- idx_table_columns
idx_orders_customer_id_created_date -- Composite indexes
pk_customers -- Primary key
fk_orders_customer_id -- Foreign key

2. Performance Design Patterns

1. Read Replica Pattern

-- Master Database (Write Operations)
CREATE TABLE orders_master (
order_id UUID PRIMARY KEY,
customer_id UUID,
order_date TIMESTAMP,
total_amount DECIMAL(10,2),
status VARCHAR(20)
);

-- Read Replica Configuration
-- Automatically synchronized from master
-- Used for reporting and analytics queries
-- Reduces load on master database

2. Materialized View Pattern

-- Complex aggregation query
CREATE MATERIALIZED VIEW customer_order_summary AS
SELECT
c.customer_id,
c.first_name,
c.last_name,
COUNT(o.order_id) as total_orders,
SUM(o.total_amount) as total_spent,
AVG(o.total_amount) as avg_order_value,
MAX(o.order_date) as last_order_date
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
GROUP BY c.customer_id, c.first_name, c.last_name;

-- Refresh strategy
REFRESH MATERIALIZED VIEW customer_order_summary;

3. Sharding Pattern

-- Horizontal partitioning by customer region
CREATE TABLE orders_us (
LIKE orders_template INCLUDING ALL,
CONSTRAINT orders_us_region_check CHECK (region = 'US')
);

CREATE TABLE orders_eu (
LIKE orders_template INCLUDING ALL,
CONSTRAINT orders_eu_region_check CHECK (region = 'EU')
);

-- Application-level routing
-- Route US customers to orders_us
-- Route EU customers to orders_eu

3. Data Integrity Patterns

1. Audit Trail Pattern

-- Audit table for tracking changes
CREATE TABLE customer_audit (
audit_id SERIAL PRIMARY KEY,
customer_id UUID NOT NULL,
operation_type VARCHAR(10) NOT NULL, -- INSERT, UPDATE, DELETE
old_values JSONB,
new_values JSONB,
changed_by UUID,
changed_at TIMESTAMP DEFAULT NOW(),
change_reason TEXT
);

-- Trigger function for automatic auditing
CREATE OR REPLACE FUNCTION audit_customer_changes()
RETURNS TRIGGER AS $$
BEGIN
IF TG_OP = 'DELETE' THEN
INSERT INTO customer_audit (customer_id, operation_type, old_values, changed_by)
VALUES (OLD.customer_id, 'DELETE', row_to_json(OLD), USER);
RETURN OLD;
ELSIF TG_OP = 'UPDATE' THEN
INSERT INTO customer_audit (customer_id, operation_type, old_values, new_values, changed_by)
VALUES (OLD.customer_id, 'UPDATE', row_to_json(OLD), row_to_json(NEW), USER);
RETURN NEW;
ELSIF TG_OP = 'INSERT' THEN
INSERT INTO customer_audit (customer_id, operation_type, new_values, changed_by)
VALUES (NEW.customer_id, 'INSERT', row_to_json(NEW), USER);
RETURN NEW;
END IF;
RETURN NULL;
END;
$$ LANGUAGE plpgsql;

2. Soft Delete Pattern

-- Add deletion tracking columns
ALTER TABLE customers
ADD COLUMN is_deleted BOOLEAN DEFAULT FALSE,
ADD COLUMN deleted_at TIMESTAMP,
ADD COLUMN deleted_by UUID;

-- Soft delete function
UPDATE customers
SET is_deleted = TRUE,
deleted_at = NOW(),
deleted_by = :user_id
WHERE customer_id = :customer_id;

-- Views for active records only
CREATE VIEW active_customers AS
SELECT * FROM customers
WHERE is_deleted = FALSE OR is_deleted IS NULL;

3. Optimistic Locking Pattern

-- Add version column for concurrency control
ALTER TABLE products
ADD COLUMN version INTEGER DEFAULT 1;

-- Update with version check
UPDATE products
SET name = :new_name,
price = :new_price,
version = version + 1
WHERE product_id = :product_id
AND version = :expected_version;

-- Application should check affected rows
-- If 0 rows affected, another user modified the record

Tools and Technologies

Data Modeling Tools

1. Enterprise Tools

ERwin Data Modeler:
├── Comprehensive enterprise modeling
├── Forward/reverse engineering
├── Supports multiple databases
└── Team collaboration features

PowerDesigner:
├── Unified modeling (data, process, enterprise)
├── Model validation and impact analysis
├── Extensive database support
└── Integration with development tools

ER/Studio:
├── Data architecture and modeling
├── Data lineage and impact analysis
├── Collaborative modeling
└── Agile modeling support

2. Open Source Tools

MySQL Workbench:
├── Visual database design
├── SQL development
├── Database administration
└── Free with MySQL

pgAdmin/pgModeler:
├── PostgreSQL-focused
├── Visual schema design
├── SQL script generation
└── Open source

DbSchema:
├── Universal database designer
├── Visual query builder
├── Documentation generation
└── Multiple database support

3. Cloud-Based Tools

Lucidchart:
├── Web-based diagramming
├── Real-time collaboration
├── Template library
└── Integration with other tools

Draw.io (now Diagrams.net):
├── Free web-based tool
├── Extensive shape libraries
├── Export to multiple formats
└── No registration required

Vertabelo:
├── Online database modeler
├── Team collaboration
├── Version control
└── SQL script generation

Database Technologies by Use Case

OLTP (Online Transaction Processing)

PostgreSQL:
├── Advanced SQL features
├── JSON/JSONB support
├── Strong consistency
└── Extensible architecture

MySQL:
├── High performance
├── Wide adoption
├── Good tooling ecosystem
└── Cloud provider support

SQL Server:
├── Enterprise features
├── Integration with Microsoft stack
├── Advanced analytics
└── Hybrid cloud support

OLAP (Online Analytical Processing)

Snowflake:
├── Cloud-native architecture
├── Separate compute and storage
├── Zero-copy cloning
└── Time travel features

Amazon Redshift:
├── Columnar storage
├── Massively parallel processing
├── AWS ecosystem integration
└── Cost-effective scaling

Google BigQuery:
├── Serverless architecture
├── Machine learning integration
├── Real-time analytics
└── Automatic scaling

NoSQL Databases

Document Stores:
├── MongoDB (General purpose)
├── Amazon DocumentDB (AWS managed)
├── Azure Cosmos DB (Multi-model)
└── CouchDB (Offline-first)

Key-Value Stores:
├── Redis (In-memory, caching)
├── Amazon DynamoDB (Managed)
├── Apache Cassandra (Distributed)
└── Riak (Distributed, available)

Graph Databases:
├── Neo4j (Property graph)
├── Amazon Neptune (Multi-model)
├── ArangoDB (Multi-model)
└── OrientDB (Multi-model)

Modern Data Stack Components

Data Integration

ETL/ELT Tools:
├── Apache Airflow (Workflow orchestration)
├── dbt (Data transformation)
├── Fivetran (Managed ETL)
├── Stitch (Simple data integration)
└── Talend (Enterprise data integration)

Change Data Capture:
├── Debezium (Open source CDC)
├── AWS DMS (Database migration service)
├── Confluent Platform (Kafka-based)
└── Oracle GoldenGate (Enterprise CDC)

Data Quality and Governance

Data Catalogs:
├── Apache Atlas (Open source)
├── AWS Glue Data Catalog
├── Google Cloud Data Catalog
├── Collibra (Enterprise)
└── Alation (Enterprise)

Data Quality Tools:
├── Great Expectations (Python-based)
├── Deequ (Scala/Spark-based)
├── Apache Griffin (Big data quality)
├── Talend Data Quality
└── Informatica Data Quality

Suffix Trees: Comprehensive Guide

· 9 min read

Table of Contents

  1. Introduction
  2. What is a Suffix Tree?
  3. Construction Algorithms
  4. Properties and Characteristics
  5. Common Applications
  6. Implementation Examples
  7. Problem-Solving Patterns
  8. Comparison with Other Data Structures
  9. Problem Reference Table

Introduction

A Suffix Tree is a compressed trie containing all suffixes of a given string. It's one of the most powerful data structures in string processing, enabling efficient solutions to numerous string problems. Originally developed by Weiner (1973), McCreight (1976), and later optimized by Ukkonen (1995).

What is a Suffix Tree?

Definition

For a string S of length n, a suffix tree is a rooted tree with exactly n leaves numbered 1 to n. Each path from root to leaf represents one suffix of S.

Key Properties

  • Compressed: No node has exactly one child (except root in some cases)
  • Complete: Contains all suffixes as root-to-leaf paths
  • Unique: Each suffix corresponds to exactly one leaf
  • Space Efficient: O(n) nodes and edges despite representing O(n²) characters

Example: Suffix Tree for "banana$"

Suffixes:
0: banana$
1: anana$
2: nana$
3: ana$
4: na$
5: a$
6: $

Suffix Tree Structure:
root
/ | \
$ a b
(6) / \ \
na$ ana$ anana$
(4) | (0)
na$
(2,3,5)

Construction Algorithms

1. Naive Construction - O(n²)

Build suffix tree by inserting each suffix one by one into a trie, then compress.

def naive_construction(s):
# Insert all suffixes into trie
# Compress nodes with single children
# Time: O(n²), Space: O(n²) worst case

2. McCreight's Algorithm - O(n)

Builds suffix tree in linear time using suffix links.

Key Concepts:

  • Suffix Links: Fast navigation between related nodes
  • Incremental Construction: Build tree suffix by suffix efficiently
  • Active Point: Current position in tree construction

3. Ukkonen's Algorithm - O(n) ⭐

Most popular linear-time construction algorithm.

Core Ideas:

  • Online Algorithm: Can process string character by character
  • Implicit Suffix Tree: Handles non-explicit suffixes efficiently
  • Global End: All leaf edges extend automatically
  • Active Point Trick: Maintains construction state efficiently

Ukkonen's Algorithm Steps:

  1. Extension Phase: For each character, extend all suffixes
  2. Rule 1: Path ends at leaf → do nothing (implicit extension)
  3. Rule 2: Path continues with different character → add new branch
  4. Rule 3: Path continues with same character → do nothing
  5. Suffix Links: Jump to related suffix for efficiency
class UkkonenNode:
def __init__(self):
self.children = {}
self.suffix_link = None
self.start = None
self.end = None

def ukkonen_construction(s):
# Phase i: extend all suffixes ending at position i
# Use global end, active point, and suffix links
# Time: O(n), Space: O(n)

Properties and Characteristics

Space Complexity

  • Nodes: At most 2n - 1 nodes for string of length n
  • Edges: At most 2n - 1 edges
  • Total Space: O(n) with proper implementation

Time Complexities

OperationNaiveMcCreightUkkonen
ConstructionO(n²)O(n)O(n)
Pattern SearchO(m + occ)O(m + occ)O(m + occ)
Suffix QueryO(log n)O(1)O(1)

Edge Compression

Instead of storing individual characters on edges, store:

  • Start Index: Beginning position in original string
  • End Index: Ending position (can be global end)

Common Applications

1. Pattern Matching

Problem: Find all occurrences of pattern P in text T

Solution:

  • Build suffix tree for T
  • Follow path for pattern P
  • All leaves in subtree are occurrences
  • Time: O(|T|) preprocessing, O(|P| + occ) query

2. Longest Repeated Substring

Problem: Find longest substring that appears ≥ 2 times Solution:

  • Build suffix tree
  • Find deepest internal node
  • Path from root to this node is the answer
  • Time: O(n)

3. Longest Common Substring

Problem: Find LCS of multiple strings Solution:

  • Build generalized suffix tree for all strings
  • Find deepest node with leaves from all strings
  • Time: O(sum of string lengths)

4. Suffix Array Construction

Problem: Build suffix array efficiently Solution:

  • Build suffix tree in O(n)
  • DFS to get lexicographically sorted suffixes
  • Time: O(n) vs O(n log n) direct sorting

5. String Compression

Problem: Find repeated patterns for compression Solution:

  • Build suffix tree
  • Internal nodes represent repeated substrings
  • Choose optimal set for maximum compression

Implementation Examples

Basic Suffix Tree Node

class SuffixTreeNode:
def __init__(self):
self.children = {}
self.suffix_index = -1 # -1 for internal nodes
self.start = None
self.end = None
self.suffix_link = None

def edge_length(self):
return self.end - self.start + 1 if self.end else 0

Pattern Search Implementation

def pattern_search(root, text, pattern):
"""Find all occurrences of pattern in text using suffix tree"""
current = root
i = 0

# Navigate to end of pattern
while i < len(pattern):
if pattern[i] in current.children:
child = current.children[pattern[i]]
edge_start = child.start
edge_end = min(child.end, edge_start + len(pattern) - i - 1)

# Check if pattern matches edge
for j in range(edge_start, edge_end + 1):
if text[j] != pattern[i]:
return [] # No match
i += 1

current = child
else:
return [] # No match

# Collect all suffix indices in subtree
return collect_suffix_indices(current)

Longest Repeated Substring

def longest_repeated_substring(root, text):
"""Find longest repeated substring using suffix tree"""
max_depth = 0
result_node = None

def dfs(node, depth):
nonlocal max_depth, result_node

if node.suffix_index == -1: # Internal node
if depth > max_depth:
max_depth = depth
result_node = node

for child in node.children.values():
dfs(child, depth + child.edge_length())

dfs(root, 0)

if result_node:
# Reconstruct string from root to result_node
return reconstruct_path(root, result_node, text)
return ""

Problem-Solving Patterns

Pattern 1: Depth-Based Problems

  • Longest Repeated Substring: Deepest internal node
  • Longest Common Extension: Path depth between nodes
  • K-th Order Statistics: Depth-based tree traversal

Pattern 2: Leaf-Based Problems

  • Pattern Matching: Collect leaves in subtree
  • Suffix Array: DFS order of leaves
  • Document Retrieval: Group leaves by document

Pattern 3: Generalized Suffix Tree

  • Multiple String Problems: Build GST with separators
  • Longest Common Substring: Find nodes with leaves from all strings
  • String Distance: Compare paths in GST

Pattern 4: Path Compression Applications

  • Space Optimization: Edge compression reduces memory
  • Cache Efficiency: Fewer nodes improve performance
  • Practical Implementation: Handle large alphabets efficiently

Comparison with Other Data Structures

FeatureSuffix TreeSuffix ArrayTrieKMP
ConstructionO(n)O(n log n)O(Σm)O(m)
SpaceO(n)O(n)O(Σm)O(m)
Pattern SearchO(m + occ)O(m log n + occ)O(m)O(n + m)
LCP QueriesO(1)O(log n)--
Memory AccessRandomSequentialRandomSequential
Cache PerformancePoorGoodPoorGood
ImplementationComplexSimpleSimpleSimple

When to Use Each:

  • Suffix Tree: Complex string queries, multiple pattern types
  • Suffix Array: Simple queries, memory constraints, competitive programming
  • Trie: Dictionary operations, prefix queries
  • KMP: Single pattern matching, streaming data

Problem Reference Table

Problem CategoryProblem NameAdditional Algorithms/ComponentsTime ComplexitySpace
Pattern MatchingMultiple Pattern SearchAho-Corasick integrationO(n + Σm + occ)O(n + Σm)
Approximate Pattern MatchingEdit distance DPO(nm²)O(nm)
Pattern with wildcardsSuffix tree + backtrackingO(nm)O(n)
Substring ProblemsLongest Repeated SubstringDFS traversalO(n)O(n)
Longest Common Substring (2 strings)Generalized suffix treeO(n + m)O(n + m)
Longest Common Substring (k strings)Generalized ST + coloringO(Σn_i)O(Σn_i)
K most frequent substringsSuffix tree + heapO(n log k)O(n)
All palindromic substringsManacher's + suffix treeO(n)O(n)
String QueriesRange LCP queriesSuffix tree + LCA (Sparse Table)O(1)O(n log n)
Substring frequencySuffix tree + subtree sizeO(m + occ)O(n)
Lexicographic substring queriesSuffix tree + DFS orderingO(m + k)O(n)
Distinct substrings countSuffix tree traversalO(n)O(n)
Advanced ApplicationsText compression (LZ77)Suffix tree + greedyO(n)O(n)
Burrows-Wheeler TransformSuffix tree + DFSO(n)O(n)
Document retrievalInverted index + suffix treeO(m + occ)O(n)
Plagiarism detectionGST + similarity metricsO(n + m)O(n + m)
Computational BiologyDNA sequence alignmentSuffix tree + DPO(nm)O(n + m)
Tandem repeat detectionSuffix tree + period analysisO(n log n)O(n)
Phylogenetic tree constructionGST + clusteringO(nk)O(nk)
Motif discoverySuffix tree + statistical analysisO(n²)O(n)
Graph Theory IntegrationShortest superstringSuffix tree + TSPO(n!2ⁿ)O(n2ⁿ)
String graph constructionSuffix tree + overlap detectionO(n²)O(n²)
Data StructuresSuffix array constructionDFS traversalO(n)O(n)
LCP array constructionSuffix tree + parent-childO(n)O(n)
Compressed suffix treeHeavy-light decompositionO(log n)O(n)
Dynamic ProblemsOnline suffix treeUkkonen's algorithmO(1) amortizedO(n)
Sliding window queriesSuffix tree + dequeO(n)O(k)
Incremental pattern matchingDynamic suffix treeO(m) per updateO(n)

Legend:

  • n, m: String lengths
  • k: Number of strings/patterns
  • occ: Number of occurrences
  • Σ: Alphabet size
  • GST: Generalized Suffix Tree
  • LCA: Lowest Common Ancestor
  • DP: Dynamic Programming

Complexity Notes:

  • Construction time assumes linear-time algorithms (McCreight/Ukkonen)
  • Query times assume suffix tree is already built
  • Space complexity is typically O(n) for suffix tree + additional structures
  • Some problems require combining suffix trees with other advanced algorithms

Using Azure Services for Developing Solutions

· 2 min read

Repository for knowing various Azure services for developers. Core concept of azure are compute,storage,databases,integration,networking. this is a developer only guide so we are not touching other azure services but only requried by developers.

What Azure Offers

  • Services for Running applications also known as compute
    1. IaaS: VM, Container Services, Container Instances, Web App for Containers, Batch, Service Fabric
    2. PaaS: Cloud Services, Mobile app, Web App, Service fabric
    3. LaaS: Logic apps, Functions Compute Options
  • Services for Storing data
    1. Relational Data: Azure SQL db/MYSQL db/PostgreSQL db/Maria db
    2. Document or Unstructured Data: Cosmos DB / Azure Storage ( files, blobs, Queue, Table, Disk etc)
    3. Data Analytics : Data warehouse / Data Lake Store
  • Services for Securing applications and data
    1. Authentication : AD (Azure Active Directory)
    2. Encryption ( protect data at rest) : Encryption features provided.
    3. Store User Identities : Azure Active Directory.
    4. Store Keys/Secrets : Key valut.
    5. Provide TLS certificate : App Service Certificate.
  • Services for Analysing data
    1. Azure Data Factory
    2. Azure Analysis Service
    3. Azure Data Lake Analytics
    4. Azure Stream Analysis
    5. Azure Time Series Insight
    6. Azure Data Bricks
    7. Azure HD Insights
  • Services for Monitoring your applications
    1. Azure Application Insights
    2. Azure Log Analytics
    3. Azure Network Watcher
    4. Azure Monitor
    5. Azure Security Center
    6. Azure Advisor
    7. visual studio app center
  • Services for Adding Intelligence to your applications
    1. Machine Learning
    2. Congitive Services
    3. BOT Framework
  • Services for Event/Messaging between applications
    1. command : azure storage queues, azure service bus , azure servie bus topics , azure IOT hubs, Azure Logic Apps
    2. facts : discreet - azure event Grids , streaming - azure event hubs
    3. mass notification : azure notification hubs
  • Services for Speeding and improving performance of the applications
    1. Redis Cache
    2. Traffic Manager
    3. CDN
  • Services for Building/Testing/Deploying
    1. VS Team Foundation Services
    2. VS App Center 3 Azure devops Projects

Network and Protocols

· 2 min read
Ashish Kumar
maintainer - Tech

There are various network protocols and they works at different layers of OSI model. The digaram below gives a quick reference of protocols and layers at which they work. There are more than these protocols but we hae just mentioned the common ones.

Advice is to have an understanding of these protocols.

network layers

Networking Devices

HubMultiport repeater, Half Duplex, No intelligence, connect Devices together
SwitchForward traffic to right destination
RouterRoute Tables, forwarding based on IP Table , connected multiple type of networks
Access PointWireless router or WAP , Bridge to connect, wired or wireless
DSL 
RepeaterRegenrates signals
FireWallFilters traffic , routers or dedicated hardware
Bridge 
Patch Panels 
CloudBased network controllerCentral way to
EOP ethernet over power lines 
Power over ethernet 

Most Used common protocols

ScenarioProtocol PortsDetails
FilesFTP 20/21 
 AFP 548 
 SMBTCP445 
     
EmailsSMTP2525 
 POP3 110 
 IMAP 143 
     
WebHTTPS 443 
 HTTP 80 
     
ManagementTelnet 23 
 SSH 22 
 SNMPTCP/UDP161/162 
 RDP 3389 
     
DNS  53 
DHCP  67/68 
SLP(Service Location Protocol)  427 
NetBios TCP/UDP137/138/139 
LDAP( Lightweight Directory Access Protocol) TCP/UDP389 
Network time Protocol    
ARP( Address Resolution Protocol)