Notes for DataGPT

A collection of technical notes taken while building tracuu.info.

Performance and Optimization

See the other sections as well, for example Database!

Questions raised while designing the database (mostly asked by Thiện):

Is pgvector on Postgres good enough as a vector database? Would it be too slow? ← Qdrant?

For query accuracy, fine-grained similarity, etc., which algorithm and which database should we choose?

Is uuidv4 too heavy compared with nanoid for the id column of the tables? Does it matter much?

Permission Architecture

Role-Based Access Control (RBAC) challenge

When using Clerk, the authentication flow is as follows:

Frontend → Sends the request with a JWT to the backend

Backend → Verifies the JWT with Clerk

Backend → Looks up the user in the database by clerk_id

Backend → Checks permissions and plan limits

Backend → Allows / rejects the request
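The backend steps above can be sketched in plain Python. This is only an illustration under stated assumptions: the token table, the user records, the role names and the `plan_limit` field are all hypothetical stand-ins, and real code would verify the JWT signature against Clerk instead of looking the token up in a dict.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class User:
    clerk_id: str
    role: str
    plan_limit: int       # max documents allowed by the plan (hypothetical field)
    documents_used: int


# Hypothetical stand-ins for Clerk verification and the users/roles tables.
VALID_TOKENS = {"token-alice": "clerk_alice", "token-bob": "clerk_bob"}
USERS = {
    "clerk_alice": User("clerk_alice", "editor", plan_limit=10, documents_used=3),
    "clerk_bob": User("clerk_bob", "viewer", plan_limit=10, documents_used=0),
}
ROLE_PERMISSIONS = {
    "editor": {"documents:read", "documents:create"},
    "viewer": {"documents:read"},
}


def verify_token(token: str) -> Optional[str]:
    """Return the clerk_id for a known token, or None (stand-in for JWT verification)."""
    return VALID_TOKENS.get(token)


def authorize(token: str, permission: str) -> tuple:
    """Verify the token, find the user, check permission and plan limit."""
    clerk_id = verify_token(token)
    if clerk_id is None:
        return 401, "invalid token"
    user = USERS.get(clerk_id)  # look up the user by clerk_id
    if user is None:
        return 403, "user not found"
    if permission not in ROLE_PERMISSIONS.get(user.role, set()):
        return 403, "permission denied"
    if permission == "documents:create" and user.documents_used >= user.plan_limit:
        return 403, "plan limit reached"
    return 200, "ok"
```

The function returns an (HTTP status, reason) pair so the caller can decide how to build the response.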

Flow

graph TD
    A[User in Next.js] --> B[Clerk Session Token]
    B --> C[API Request + JWT]
    C --> D[FastAPI Middleware]
    D --> E[Token Verification]
    E --> F[Database RLS Context]
    F --> G[Query Execution]


FastAPI

Depends() is used for dependency injection.

Async database connection lifecycle when running multiple async test cases at the same time.

NameError: name 'Pattern' is not defined

# change from this
name: constr = Field(..., max_length=100)

# to this (Pydantic v2)
from typing import Annotated
from pydantic import Field, StringConstraints
name: Annotated[str, StringConstraints(max_length=100)] = Field(...)

Chunking + vector database + integrate

A typical flow looks like this:

📄 Document Upload
       ↓
🔍 Text Extraction (PyMuPDF, python-docx, etc.)
       ↓
✂️ Text Chunking (LangChain RecursiveCharacterTextSplitter)
       ↓
🧠 Generate Embeddings (Voyage AI / OpenAI API)
       ↓
💾 Store in Qdrant (vectors + metadata)
       ↓
📊 Update PostgreSQL (status, chunk count)
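To make the chunking step concrete, here is a minimal sketch of fixed-size chunking with overlap. It is a simplified stand-in, not LangChain's RecursiveCharacterTextSplitter (which also tries to split on separators such as paragraphs and sentences); the function name and the default sizes are assumptions chosen for illustration.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 40) -> list:
    """Split text into windows of chunk_size characters that share `overlap` characters."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks
```

The overlap keeps a sentence that straddles a boundary visible in both neighbouring chunks, at the cost of storing some text twice.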

Integrate Qdrant in the current version

Current Architecture

[User Query] → [PostgreSQL] → [Results]
     ↓
[Simple keyword search only]

New Hybrid Architecture

[User Query]
     ↓
[Generate Embedding]
     ↓
[Qdrant Vector Search] + [PostgreSQL Metadata]
     ↓                     ↓
[Similar Chunks]     + [Document Info, Permissions]
     ↓
[Combine & Rank Results]
     ↓
[AI Response with Citations]
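The "combine and rank" step of the hybrid architecture can be sketched with in-memory data. Everything here is hypothetical: `CHUNKS` plays the role of Qdrant, `DOCUMENTS` plays the role of the PostgreSQL metadata, and the 2-dimensional vectors are made up so the ranking is easy to follow by hand.

```python
import math

# Hypothetical stand-in for the vector store (chunk text + embedding + owning document).
CHUNKS = [
    {"doc_id": 1, "text": "refund policy", "vector": [1.0, 0.0]},
    {"doc_id": 2, "text": "internal salary table", "vector": [0.9, 0.1]},
    {"doc_id": 1, "text": "shipping times", "vector": [0.0, 1.0]},
]
# Hypothetical stand-in for the relational metadata (document info + permissions).
DOCUMENTS = {
    1: {"title": "Public FAQ", "allowed_users": {"alice", "bob"}},
    2: {"title": "HR handbook", "allowed_users": {"alice"}},
}


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(query_vector, user, top_k=2):
    """Rank chunks by similarity, then keep only chunks the user may read."""
    ranked = sorted(CHUNKS, key=lambda c: cosine(query_vector, c["vector"]), reverse=True)
    results = []
    for chunk in ranked:
        doc = DOCUMENTS[chunk["doc_id"]]
        if user not in doc["allowed_users"]:
            continue  # permission check comes from the metadata side
        results.append({"title": doc["title"], "text": chunk["text"]})
        if len(results) == top_k:
            break
    return results
```

Returning the document title with each chunk is what later allows the AI response to cite its sources.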

Database

Qdrant

Postgres + pgvector is not well optimized for chunks and speed → use Qdrant.

Qdrant: uses the HNSW (Hierarchical Navigable Small World) algorithm, optimized for high-dimensional vectors

pgvector: with the IVFFlat index, performance is significantly worse on large datasets (HNSW is also supported since pgvector 0.5.0)

Benchmark (unsourced): Qdrant is roughly 3-5x faster in similarity search on datasets > 100K vectors

Qdrant has built-in hybrid search (semantic + keyword), whereas with pgvector you have to implement it yourself!

Disk-based storage: Qdrant can handle datasets larger than RAM, whereas pgvector performs best when its index fits in memory, so it scales less well for large datasets!

Qdrant is only storage (a vector database), designed for searching vector embeddings, semantic search and similarity matching.

Database Design

1NF, 2NF, 3NF ????

ACID Compliance

Your normalized approach maintains consistency
Redundant data breaks consistency guarantees

Tables “documents” and “tags” should be linked through a “document_tags” table → do not put a “tag_name” column inside “documents” ← that violates 1NF, SSOT (single source of truth), DRY (don’t repeat yourself), etc.

Tables “roles” and “permissions” are linked through “role_permissions” → do not put “role_name” and “permission_name” columns in “role_permissions” ← that violates SSOT and DRY, and the redundant copies can drift out of sync (see ACID Compliance above)
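A small sketch of the junction-table design, using Python's built-in sqlite3 only so that it runs without a server (the project itself uses PostgreSQL). The column names and sample rows are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE documents (id INTEGER PRIMARY KEY, title TEXT NOT NULL);
CREATE TABLE tags (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
-- The junction table stores only ids, never a copy of the tag name.
CREATE TABLE document_tags (
    document_id INTEGER NOT NULL REFERENCES documents(id) ON DELETE CASCADE,
    tag_id      INTEGER NOT NULL REFERENCES tags(id) ON DELETE CASCADE,
    PRIMARY KEY (document_id, tag_id)
);
INSERT INTO documents VALUES (1, 'Contract A'), (2, 'Contract B');
INSERT INTO tags VALUES (1, 'legal');
INSERT INTO document_tags VALUES (1, 1), (2, 1);
""")

# Renaming the tag touches exactly one row, because the name lives in one place.
conn.execute("UPDATE tags SET name = 'law' WHERE id = 1")

rows = conn.execute("""
    SELECT d.title, t.name
    FROM documents d
    JOIN document_tags dt ON dt.document_id = d.id
    JOIN tags t ON t.id = dt.tag_id
    ORDER BY d.id
""").fetchall()
```

With a `tag_name` column copied into `documents`, the same rename would have to update every tagged document, and any missed row would leave the data inconsistent.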

General

The init.sql file is mounted into /docker-entrypoint-initdb.d in docker-compose.yml (the “volumes” property) ← the PostgreSQL container automatically runs every .sql file in this folder, but only when the database is empty!

→ ☝️ If you use Alembic, let Alembic handle everything; do not use init.sql this way, it easily leads to conflicts!

IVFFlat = Inverted File with Flat (uncompressed) vectors → an indexing algorithm for vector similarity search. It splits the vectors into clusters (around centroids) and searches only the relevant clusters → faster
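The IVFFlat idea can be shown with a toy index in pure Python. This is a sketch of the principle, not pgvector's implementation: the centroids are fixed by hand (a real index learns them, typically with k-means), and the names `lists` and `probes` are borrowed from pgvector's parameter names for readability.

```python
import math


def build_index(vectors, centroids):
    """Assign every vector id to the list of its nearest centroid."""
    lists = {i: [] for i in range(len(centroids))}
    for vec_id, vec in enumerate(vectors):
        nearest = min(range(len(centroids)), key=lambda i: math.dist(vec, centroids[i]))
        lists[nearest].append(vec_id)
    return lists


def search(query, vectors, centroids, lists, probes=1):
    """Scan only the `probes` closest lists, then rank those candidates exactly."""
    closest = sorted(range(len(centroids)), key=lambda i: math.dist(query, centroids[i]))[:probes]
    candidates = [vec_id for i in closest for vec_id in lists[i]]
    return min(candidates, key=lambda vec_id: math.dist(query, vectors[vec_id]), default=None)


vectors = [(0.0, 0.1), (0.2, 0.0), (5.0, 5.1), (5.2, 4.9)]
centroids = [(0.0, 0.0), (5.0, 5.0)]  # hand-picked for the example
lists = build_index(vectors, centroids)
nearest_id = search((4.8, 5.0), vectors, centroids, lists, probes=1)
```

With `probes=1` only half of the vectors are compared against the query. Raising `probes` scans more lists, which improves recall and costs speed.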

HNSW (pgvector 0.5.0+), Hierarchical Navigable Small World: better speed-recall trade-off than IVFFlat at query time, but slower build time and higher memory use.

Core idea: build a multi-layer graph so the search can quickly navigate to the nearest vector.

Layer 2: [A] ---------> [B]        (few nodes, long jumps)
         |               |
Layer 1: [A]--[C]--[D]--[B]--[E]    (medium nodes, medium jumps)
         |  |  |  |  |  |  |
Layer 0: [A][C][D][B][E][F][G]...   (all nodes, short jumps)

→ Both algorithms above are available in the pgvector/pgvector:pg17 image

Qdrant/Pinecone (dedicated vector DB) ← Best performance at scale. Advanced features (filtering, metadata). Additional infrastructure cost.

UUIDv4 is fully random → rows land at random locations on disk as well → consider UUIDv7 (The Time-Sortable Identifier for Modern Databases) instead

-- With UUID v4 - documents scattered across disk
documents: [a1b2c3d4, f5e6d7c8, 2a3b4c5d] → Random disk locations
chunks:    [x9y8z7w6, k2l3m4n5, p6q7r8s9] → More random locations

-- With UUID v7 - time-ordered, clustered storage
documents: [01H2ABC..., 01H2ABD..., 01H2ABE...] → Sequential disk locations
chunks:    [01H2ABF..., 01H2ABG..., 01H2ABH...] → Also sequential
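Why UUIDv7 values sort by creation time becomes clear from the bit layout: the first 48 bits are a Unix timestamp in milliseconds, followed by the version, the variant and random bits. Below is a minimal hand-rolled sketch of that layout; the function name `uuid7` and its optional `timestamp_ms` argument are assumptions for illustration, and it does not guarantee ordering between ids created within the same millisecond.

```python
import os
import time
import uuid
from typing import Optional


def uuid7(timestamp_ms: Optional[int] = None) -> uuid.UUID:
    """Build a version 7 UUID: 48-bit ms timestamp, version, variant, random bits."""
    if timestamp_ms is None:
        timestamp_ms = time.time_ns() // 1_000_000
    rand = int.from_bytes(os.urandom(10), "big")   # 80 random bits
    rand_a = rand >> 68                            # top 12 bits
    rand_b = rand & ((1 << 62) - 1)                # low 62 bits
    value = (timestamp_ms & ((1 << 48) - 1)) << 80  # bits 127..80: timestamp
    value |= 0x7 << 76                              # bits 79..76: version 7
    value |= rand_a << 64                           # bits 75..64: random
    value |= 0b10 << 62                             # bits 63..62: RFC variant
    value |= rand_b                                 # bits 61..0: random
    return uuid.UUID(int=value)


earlier = uuid7(1_700_000_000_000)
later = uuid7(1_700_000_000_001)
```

Because the timestamp occupies the most significant bits, ids generated later compare greater, so new rows are appended near each other in a B-tree index instead of being scattered.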

Alembic & SQLAlchemy

db.commit() is used to persist changes from memory to the database.

db.add(): only adds the object to the session (in memory)
db.commit(): saves the changes to the database
db.refresh(): reloads the object from the database (to get auto-generated fields such as the ID)
db.rollback(): discards all changes in the current transaction

Things that cannot be created/defined in models:

Create extension vector

Enable RLS and create policies for tables

Create special indexes (like ivfflat for embeddings)

Insert default plans

In that case, run alembic revision --autogenerate -m "Initial schema" first, then edit the generated .py file to add the steps above, and apply the changes with alembic upgrade head.

Alembic does not automatically "scan" the models/ folder. You have to tell it where to look in alembic/env.py, by importing the models there and pointing target_metadata at their metadata.

Testing

There are two philosophies:

"True Unit Tests" (purist approach): no database, no external dependencies
"Integration Tests" (pragmatic approach): use a real database (but never touch real production data)

→ In practice both kinds of tests are needed! ← 80% integration tests, 20% pure unit tests in this SaaS project

         /\
        /  \          Few, Slow, Expensive
       /E2E \         End-to-End Tests
      /______\
     /        \
    /Integration\
   /_____________\     Some, Medium Speed
  /               \
 /    Unit Tests   \   Many, Fast, Cheap
/___________________\

There is one problem with integration tests:

If you choose SQLite (in memory), it does not support the JSONB type (from PostgreSQL), so the main functions end up with two separate branches, one for tests (SQLite) and one for prod (PostgreSQL), which is not acceptable.

→ Recommended: PostgreSQL + tmpfs (keeps the database in RAM)

SQLite cannot work with

Vector Extensions (Critical for RAG)

Row-Level Security (RLS) Policies

JSONB Operations

UUID Generation

Next.js and Frontend

Classic chicken-and-egg problem: the Dockerfile needs package-lock.json, but when the project is first initialized this file does not exist yet. If you later just clone the repo and run it, the file is already there. Ways to solve it:

Run npm i once for this initial setup (a single time, when creating the project).

Add fallback logic for the case where there is no lock file:

RUN \
  if [ -f yarn.lock ]; then yarn --frozen-lockfile; \
  elif [ -f package-lock.json ]; then npm ci; \
  elif [ -f pnpm-lock.yaml ]; then corepack enable pnpm && pnpm i --frozen-lockfile; \
  else echo "No lockfile found, running npm install" && npm install; \
  fi

npm i --legacy-peer-deps → When there are version conflicts (some packages require specific versions of other packages), this flag falls back to the npm v6 behavior of ignoring those peer dependency conflicts. The install may succeed, but it is risky for production because of possible unwanted side effects.

Prefer ^ over ~ for version ranges; it is a standard best practice ← you get both bug fixes and new features, without breaking changes.
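The difference between the two range operators can be sketched as a small checker. It is written in Python only to stay consistent with the other examples in these notes, it is not npm's actual resolver, and it assumes plain x.y.z versions with major >= 1 (npm treats ^0.x ranges more strictly).

```python
def parse(version: str) -> tuple:
    """Turn "1.2.3" into (1, 2, 3)."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def satisfies(version: str, spec: str) -> bool:
    """Check a version against a "^x.y.z" or "~x.y.z" range (major >= 1 assumed)."""
    operator, base = spec[0], parse(spec[1:])
    candidate = parse(version)
    if candidate < base:
        return False  # never go below the stated version
    if operator == "^":
        return candidate[0] == base[0]    # same major: minor and patch may grow
    if operator == "~":
        return candidate[:2] == base[:2]  # same major.minor: only patch may grow
    raise ValueError("unsupported operator: " + operator)
```

For example, 1.4.0 is accepted by ^1.2.3 (a new feature release) but rejected by ~1.2.3, which only lets patch releases through.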

Python related

Run a single test in a test file

python -m pytest tests/test_tags.py::test_create_tag_success

Use Ruff to format and lint code.

Differences between formatter and linter → to get the equivalent of ruff check --fix, open the Command Palette (ctrl+shift+p) and choose “Ruff: Fix all auto-fixable problems”. To format code (like Black), choose “Format Document…” (with Ruff as the default formatter)

Formatter: handles style (indentation, line length, quotes)
Linter: handles code quality (unused imports, unused variables, syntax issues)

Docker

Note: Docker

Images tagged -alpine ← refers to Alpine Linux, a lightweight, security-focused Linux distribution.

A .env file (required) in the same (root) folder as docker-compose.yml: its variables are substituted into docker-compose.yml at:

environment:
  - DATABASE_URL=${DATABASE_URL}

If there is a subfolder with its own .env file inside, then in docker-compose.yml:

services:
  backend:
    env_file:
      - ./backend/.env

If you only change the environment or env_file properties, there is no need to rebuild the image; recreating the container is enough. They apply to the container at runtime, not at build time. ← only rebuild the Docker image when the Dockerfile (or the files it copies into the image) changes! ← Check inside the container with echo $GEMINI_API_KEY.

Logs

A user can exist in Clerk but not in our database, for example when the webhook from Clerk to the backend fails. ← JIT User Creation = "Create the user in our database the moment we realize they're missing"
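A minimal sketch of JIT user creation, using Python's built-in sqlite3 so it runs standalone. The table layout, the function name `get_or_create_user` and the sample clerk_id are assumptions for illustration; in PostgreSQL the equivalent statement is INSERT ... ON CONFLICT DO NOTHING.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, clerk_id TEXT NOT NULL UNIQUE, email TEXT)"
)


def get_or_create_user(conn, clerk_id, email=None):
    """Return the user's row id, creating the row the first time the clerk_id is seen."""
    # INSERT OR IGNORE does nothing when the clerk_id already exists,
    # so calling this on every authenticated request is safe.
    conn.execute(
        "INSERT OR IGNORE INTO users (clerk_id, email) VALUES (?, ?)",
        (clerk_id, email),
    )
    conn.commit()
    row = conn.execute("SELECT id FROM users WHERE clerk_id = ?", (clerk_id,)).fetchone()
    return row[0]


first_id = get_or_create_user(conn, "user_123", "a@example.com")
second_id = get_or_create_user(conn, "user_123")
```

Relying on the UNIQUE constraint rather than a separate "does it exist?" query avoids creating duplicates when two requests for the same new user arrive at once.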

