# A dummy server tries to be reliable - in the way of a telecom switch #
## 1. Background ##
In the past, I had the luck to work in the telecom industry developing switch control software. A telecom switch is a distributed system of considerable size: a backbone switch can have 2-3 frames; each frame contains 8 or 16 shelves; each shelf contains 16 or 32 slots; and into each slot we can insert a card/board containing the circuits for application logic: IO ports, switch fabric, control, etc. Normally a card/board has a CPU, and switch/shelf controllers use two boards/CPUs, one active and one standby.
In the various projects I worked on, there were two kinds of software designs. One uses layers of OO frameworks and design patterns, with CORBA for inter-controller communications; the resulting systems are complicated. The other uses plain old C and lightweight message passing, following a few simple rules as described below; the resulting systems are relatively simple and scalable.
The simple rules for a message-passing based design are as follows:
- partition application functionalities into distinct tasks (threads or processes)
- define the interactions between tasks as messages (message ids and message data structs); in other words, define a task's public interface as the set of messages it will send and the set of messages it will receive
- a task can interact with another task only by sending messages to it; a task's state is private (normally one part of the application state), and no other task can change it directly without going through the messaging interface
In my understanding, the above rules are simply another interpretation of the principle Rob Pike and other Go developers have advocated: "Do not communicate by sharing memory; instead, share memory by communicating."
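As a minimal sketch of these rules in plain Go (the message ids, structs and task below are made up for illustration, not taken from the sample code), a task is just a goroutine whose private state is reachable only through its message channel:

```go
package main

import "fmt"

// Message ids and data structs: the entire public interface of a task.
type MsgId int

const (
	MsgRequest MsgId = iota
	MsgStop
)

type Msg struct {
	Id   MsgId
	Data string
}

// counterTask owns its state (count); no other goroutine can read or
// change it except by sending messages on the task's channel.
func counterTask(in <-chan Msg, done chan<- struct{}) {
	count := 0 // private state, touched only inside this goroutine
	for m := range in {
		switch m.Id {
		case MsgRequest:
			count++
			fmt.Printf("handled %q, count=%d\n", m.Data, count)
		case MsgStop:
			close(done)
			return
		}
	}
}

func main() {
	in := make(chan Msg)
	done := make(chan struct{})
	go counterTask(in, done)
	in <- Msg{MsgRequest, "hello"}
	in <- Msg{MsgStop, ""}
	<-done
}
```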
##2. Design##
As an exercise, we'll try to implement a dummy server following the above message-based design, using a goroutine for each application task.
Normally, we'll partition the id space into sections, each of which covers one application functionality and is handled by one task or a group of tasks. For example, if we use integers as ids, we can divide the id space as follows (see the sketch after the list):
- 101-200 for system management
- 201-300 for performance management
- 301-400 for fault management
- ...
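For instance, such an integer id space could be laid out as Go constants like this (a hypothetical layout for illustration; the dummy server itself uses path names instead, as described next):

```go
package main

import "fmt"

// An illustrative layout of an integer id space; the names and exact
// ids are made up for this sketch, not taken from the sample code.
const (
	SysMgmtBase   = 101 // 101-200: system management
	PerfMgmtBase  = 201 // 201-300: performance management
	FaultMgmtBase = 301 // 301-400: fault management
)

// ids carved out of the system-management section
const (
	SysHeartbeat    = SysMgmtBase + iota // 101
	SysCommand                           // 102
	SysOutOfService                      // 103
)

func main() {
	fmt.Println(SysHeartbeat, SysCommand, SysOutOfService) // 101 102 103
}
```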
### 2.1 System partition ###
For our dummy server, I'll use path names as ids and partition the system as follows:
#### 2.1.1 SystemManager Task ####
To make our server reliable, let's have two server instances running at the same time, one active (serving the requests) and one standby (ready to replace the active one). The SystemManager at the standby server monitors the heartbeats from the active server; if two heartbeats are missed in a row, the standby will become active and start serving user requests. The SystemManager at the active server will send heartbeats to the standby, and send commands to other tasks to drive their life cycle.
- messages sent
  - /Sys/Ctrl/Heartbeat: active servant sends heartbeats to standby servant
  - /Sys/Command: commands sent to subordinate tasks
    - commands to control tasks' life cycle: Start, Stop, Shutdown
    - commands to manage app services: AddService, DelService
- messages recved
  - /Sys/Ctrl/Heartbeat: standby servant monitors heartbeats from active servant
  - /Sys/OutOfService: fault manager will send an OOS msg to system manager when the system has a fault and becomes unusable
  - SysId(PubId/UnPubId): system manager will subscribe to these two system ids to detect the join/leave of clients. To simplify our sample, when a client connects and subscribes to a service, system manager will grab the service name from the subscription and start a service with that name, so the server starts with no app services. In the real world, the server would probably start with a set of app services.
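The core of the failover is the standby's heartbeat watch. Here is a minimal sketch using plain Go channels (the real sysmgrtask.go works through the router and its message ids; the interval, channel types and function names here are made up for illustration):

```go
package main

import (
	"fmt"
	"time"
)

// monitorHeartbeats is an illustrative standby loop: if two heartbeats
// in a row fail to arrive within the expected interval, the standby
// servant promotes itself.
func monitorHeartbeats(heartbeat <-chan struct{}, interval time.Duration, becomeActive func()) {
	misses := 0
	for {
		select {
		case <-heartbeat:
			misses = 0 // active servant is alive, reset the counter
		case <-time.After(interval):
			misses++
			if misses >= 2 {
				becomeActive() // two misses in a row: take over
				return
			}
		}
	}
}

func main() {
	hb := make(chan struct{})
	go func() {
		// fake active servant: beat three times, then die
		for i := 0; i < 3; i++ {
			hb <- struct{}{}
			time.Sleep(100 * time.Millisecond)
		}
	}()
	monitorHeartbeats(hb, 150*time.Millisecond, func() {
		fmt.Println("standby servant comes up in service")
	})
}
```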
#### 2.1.2 Service Task ####
A dummy application service task, which just bounces back the request id with a transaction number. There should be one task per service, differentiated by a unique "ServiceName". A real server could provide multiple services and thus have multiple ServiceTasks running.
- messages sent
  - /App/ServiceName/Response: send responses to client requests received at "/App/ServiceName/Request"
  - /Fault/AppService/Exception: randomly send fake app exceptions to fault manager
  - /DB/Request: randomly send fake DB requests to DbTask
- messages recved
  - /App/ServiceName/Request: receive client requests
  - /App/ServiceName/Command: system tasks can send commands to control the service task (Start, Stop, Reset); currently only fault manager will send Reset, when it receives an exception from this service task
  - /App/ServiceName/DB/Response: receive responses from DbTask for DB requests sent
  - /Sys/Command: receive commands from system manager, mostly for life cycle management, such as Start, Stop
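A sketch of such a service loop, with plain channels standing in for the router's message ids (the Request type and channel wiring are illustrative, not the real svctask.go):

```go
package main

import "fmt"

// Request pairs a client request with a reply channel; in the real
// svctask.go the reply instead goes out on /App/ServiceName/Response.
type Request struct {
	Body  string
	Reply chan<- string
}

// svcLoop sketches a per-service task: it bounces each request back
// with a private transaction number, as the dummy service does.
func svcLoop(name string, reqs <-chan Request, stop <-chan struct{}) {
	txn := 0
	for {
		select {
		case r := <-reqs:
			txn++
			r.Reply <- fmt.Sprintf("[%s] is processed at [%s] : transaction_id [%d]", r.Body, name, txn)
		case <-stop:
			fmt.Printf("App Service [ %s ] is stopped\n", name)
			return
		}
	}
}

func main() {
	reqs := make(chan Request)
	stop := make(chan struct{})
	done := make(chan struct{})
	go func() { svcLoop("news", reqs, stop); close(done) }()
	reply := make(chan string)
	reqs <- Request{"request 1", reply}
	fmt.Println(<-reply)
	close(stop)
	<-done
}
```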
#### 2.1.3 Database Task ####
A database task manages database connections and carries out db transactions on behalf of service tasks. Currently it does nothing besides sending back empty db responses and randomly generating exceptions.
- messages sent
  - /App/ServiceName/DB/Response: send db transaction results back to the service task with name "ServiceName"
  - /Fault/DB/Exception: randomly raise db exceptions to fault manager
- messages recved
  - /DB/Request: receive db requests from service tasks
  - /Sys/Command: receive commands from system manager, mostly for life cycle management
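In sketch form (the 10% fault rate and function names are made up for illustration; the real dbtask.go talks through the router):

```go
package main

import (
	"fmt"
	"math/rand"
)

// dbHandle sketches the dummy DbTask described above: return an empty
// db response and, once in a while, raise a fake db exception.
func dbHandle(svc string, faults chan<- string) (response string) {
	if rand.Intn(10) == 0 { // arbitrary ~10% fault rate for the sketch
		faults <- "/Fault/DB/Exception raised while serving " + svc
	}
	return "" // the dummy db task always returns empty responses
}

func main() {
	faults := make(chan string, 32) // buffered so dbHandle never blocks
	for i := 0; i < 20; i++ {
		dbHandle("stock", faults)
	}
	close(faults)
	for f := range faults {
		fmt.Println(f)
	}
}
```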
#### 2.1.4 FaultManager Task ####
A fault manager receives FaultRecords from other tasks and performs fault correlation and handling. Normally, for a reasonably sized system, we can have a hierarchy of fault managers: low-level fault managers handle local faults for a specific functionality or app module, and propagate faults to upper-level managers when they cannot be handled locally.
- messages sent
  - /Sys/OutOfService: send a msg to system manager to notify that the system has become unusable because of some fault (for our simple sample, a fault from DbTask will do this)
  - /App/ServiceName/Command: for our simple sample, fault manager sends "Reset" to a service task when receiving a fault from it
- messages recved
  - /Fault/DB/Exception: receive db exceptions
  - /Fault/AppService/Exception: receive app service exceptions
  - /Sys/Command: receive commands from system manager, mostly for life cycle management
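The dummy correlation policy above fits in a few lines; this sketch uses plain channels and a callback in place of the router's command messages (all names here are illustrative):

```go
package main

import "fmt"

// faultLoop sketches the dummy policy: app-service exceptions trigger
// a Reset of the faulty service, while a db exception takes the whole
// servant out of service.
func faultLoop(appFaults, dbFaults <-chan string, resetSvc func(string), oos chan<- string) {
	for {
		select {
		case svc := <-appFaults:
			resetSvc(svc) // would send "Reset" on /App/<svc>/Command
		case <-dbFaults:
			oos <- "/Sys/OutOfService" // notify the system manager
			return
		}
	}
}

func main() {
	appFaults := make(chan string)
	dbFaults := make(chan string)
	oos := make(chan string, 1)
	go faultLoop(appFaults, dbFaults, func(svc string) {
		fmt.Println("Reset sent to service", svc)
	}, oos)
	appFaults <- "stock"
	dbFaults <- "db exception"
	fmt.Println(<-oos)
}
```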
### 2.2 Task ###
Each task is a running goroutine which goes through the following standard stages of its life cycle:
- init
- start
- stop
- shutdown
The following are the jobs performed at each stage (a minimal run-loop sketch follows the list):
- init
  - create channels and attach channels to ids in router
  - possibly other jobs:
    - load config data
    - open conn to database
    - open conn to backend legacy server
- start: actively perform service, handle user requests, send responses
- stop: pause active service
- shutdown
  - detach chans from router
  - close conn to database
  - close conn to other servers
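Here is the promised run-loop sketch: a single goroutine whose life cycle is driven entirely by commands arriving as messages (the Command values and channel wiring are illustrative; the router attach/detach jobs are elided as comments):

```go
package main

import "fmt"

type Command int

const (
	Start Command = iota
	Stop
	Shutdown
)

// run sketches the standard life cycle described above: idle until
// Start, serve while started, pause on Stop, clean up on Shutdown.
func run(cmds <-chan Command, reqs <-chan string) {
	// init: create channels and attach them to ids in the router, ...
	active := false
	for {
		select {
		case c := <-cmds:
			switch c {
			case Start:
				active = true // begin serving requests
			case Stop:
				active = false // pause active service
			case Shutdown:
				// detach chans from router, close conns, ...
				return
			}
		case r := <-reqs:
			if active {
				fmt.Println("serving", r)
			} // requests that arrive while stopped are dropped here
		}
	}
}

func main() {
	cmds := make(chan Command)
	reqs := make(chan string)
	done := make(chan struct{})
	go func() { run(cmds, reqs); close(done) }()
	cmds <- Start
	reqs <- "request 1" // unbuffered: consumed before Shutdown is sent
	cmds <- Shutdown
	<-done
}
```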
This may look similar to the life cycle of Java applets. However, they are really different:
- Java applets' life cycle methods are called by the browser/JVM at the proper moments; they expose a call-back interface to the runtime framework.
- A task is active, with its own goroutine. A task's life cycle is driven by messages from the SystemManager task. A task's public interface is solely the set of messages it sends and the set of messages it receives; a task exposes no public methods or public data.
### 2.3 Program structure ###
The main() function is simple:
- create routers and connect them as needed
- create tasks and connect them to routers
To simplify our code, the dummy server code is organized as follows (a structural sketch follows the list):
- create a "Servant" struct which contains a router listening at some socket addr/port for incoming client connections
- inside a Servant, create instances of above tasks attached to the router. in real world, for load balance or reliability, we could configure the tasks of a Servant running distributedly with two routers on two machines. these two routers can be connected thru sockets and tasks at each machine connected to its local router. Tasks' code do not need change in new configuration.
- for simplicity, in main() function of dummy server, we create two instances of Servant, one active and the other standby. in real world, we may deploy these two instances of servant as two processes, or two machines for more reliability.
- connect the routers of these two servants with filters defined at proxies to allow only heartbeat messages passed between them
- when clients connect to dummy server, it will connect to both servant instances, although at any moment, only the active servant instance is providing the service and answering client requests
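Structurally, main() then boils down to something like the following sketch, where Servant, NewServant and ConnectHeartbeats are hypothetical stand-ins for servant.go and the router/proxy plumbing, not the real API:

```go
package main

import "fmt"

type Servant struct{ name string }

// NewServant stands in for servant.go: create a router listening on a
// socket, then create and attach the SystemManager/Service/Db/
// FaultManager tasks to it.
func NewServant(name string, active bool) *Servant {
	return &Servant{name}
}

// ConnectHeartbeats stands in for connecting the two servants' routers
// with a filter that lets only /Sys/Ctrl/Heartbeat messages through.
func ConnectHeartbeats(a, b *Servant) {
	fmt.Printf("connect %s <-> %s (heartbeats only)\n", a.name, b.name)
}

func main() {
	s1 := NewServant("servant1", true)  // active
	s2 := NewServant("servant2", false) // standby
	ConnectHeartbeats(s1, s2)
	// ... block here until the server is shut down
}
```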
## 3. Code ##
- code is under samples/dummyserver.
- tasks: sysmgrtask.go svctask.go dbtask.go faultmgrtask.go
- server: servant.go server.go
- client: client.go
## 4. How to run it ##
- in one console, run "./server" to start the server
- in a 2nd console, run "./client news 123456" to start a client that talks to the "news" service in the server 123456 times
- in a 3rd console, run "./client stock 123456" to start a client that talks to the "stock" service in the server 123456 times
- observe the trace messages in the server console, and see how the standby servant comes up automatically when the active servant goes out of service:
```
App Service [ news ] at [ servant1 ] process req: request 1616
App Service [ stock ] at [ servant1 ] process req: request 1822
App Service [ stock ] at [ servant1 ] process req: request 1823
DbTask at [ servant1 ] handles req from : stock
DbTask at [ servant1 ] report fault
fault manager at [ servant1 ] report OOS
xxxx Servant [ servant1 ] will take a break and standby ...
App Service [ stock ] at [ servant1 ] is stopped
servant1 enter monitor heartbeat
App Service [ news ] at [ servant1 ] is stopped
servant1 exit send heartbeat
servant2 exit monitor heartbeat
!!!! Servant [ servant2 ] come up in service ...
servant2 enter send heartbeat
App Service [ news ] at [ servant2 ] is activated
App Service [ stock ] at [ servant2 ] is activated
App Service [ stock ] at [ servant2 ] process req: request 1824
App Service [ stock ] at [ servant2 ] process req: request 1825
```
- observe the trace messages in the client console, and see that when a servant fail-over/switch-over happens, the client may have one request time out; after that the responses keep coming back, but from a different servant:
```
client sent request [request 1822] to serivce [stock]
client recv response ( [request 1822] is processed at [servant1] : transaction_id [4] )
client sent request [request 1823] to serivce [stock]
client recv response ( [request 1823] is processed at [servant1] : transaction_id [5] )
client sent request [request 1824] to serivce [stock]
time out for reqest [request 1824]
client sent request [request 1824] to serivce [stock]
client recv response ( [request 1824] is processed at [servant2] : transaction_id [4] )
client sent request [request 1825] to serivce [stock]
client recv response ( [request 1825] is processed at [servant2] : transaction_id [5] )
```