
#A dummy server tries to be reliable - in the way of a telecom switch#

##1. Background##

In the past, I had the luck to work in the telecom industry developing switch control software. A telecom switch is a distributed system of considerable size: a backbone switch can have 2-3 frames, each frame contains 8 or 16 shelves, each shelf contains 16 or 32 slots, and into each slot we can insert a card/board that contains the circuits for application logic: io ports, switch fabric, control, etc. Normally a card/board has a CPU; switch/shelf controllers use two boards/CPUs, one active and one standby.

In the various projects I worked on, there were two kinds of software designs. One used layers of OO frameworks and design patterns, with CORBA for inter-controller communication; the resulting systems were complicated. The other used plain old C, lightweight message passing, and a few simple rules as described below; the resulting systems were relatively simple and scalable.

The simple rules for a message-passing based design are as follows:

  1. partition application functionality into distinct tasks (threads or processes)
  2. define the interactions between tasks as messages (message ids and message data structs); in other words, define the public interface of a task as the set of messages it will send and the set of messages it will receive
  3. a task can interact with another task only by sending messages to it; a task's state (normally one part of the application state) is private, and no other task can change it directly without going through the messaging interface

In my understanding, the above rules are simply another interpretation of the principle Rob and other Go developers have advocated: "Do not communicate by sharing memory; instead, share memory by communicating."
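
To make the rules concrete, here is a minimal sketch in plain Go (no router involved yet; the message and task names are made up for this illustration): a task owns its state and exposes nothing but the messages it accepts and the messages it sends back.

    package main

    import "fmt"

    // Messages are plain structs; a task's public interface is just the set of
    // message types it sends and receives.
    type CounterReq struct {
        Delta int
        Reply chan CounterResp // where the response should be sent
    }

    type CounterResp struct {
        Total int
    }

    // counterTask owns its state (total); no other goroutine can touch it
    // directly -- only via messages on the request channel.
    func counterTask(reqs <-chan CounterReq) {
        total := 0 // private state
        for req := range reqs {
            total += req.Delta
            req.Reply <- CounterResp{Total: total}
        }
    }

    func main() {
        reqs := make(chan CounterReq)
        go counterTask(reqs)

        reply := make(chan CounterResp)
        reqs <- CounterReq{Delta: 5, Reply: reply}
        fmt.Println((<-reply).Total) // prints 5
        close(reqs)
    }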

##2. Design##

As an exercise, we'll try to implement a dummy server following the above message-based design, using a goroutine for each application task.

Normally, we partition the id space into sections, each of which is for one application functionality handled by one task or a group of tasks. For example, if we use integers as ids, we could divide the id space as follows (see the small sketch after the list):

  • 101-200 for system management
  • 201-300 for performance management
  • 301-400 for fault management
  • ...
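
A purely illustrative way to express such a partition as Go constants (the ranges mirror the list above; the names are not from the sample code):

    // Illustrative integer id ranges, one block per functional area.
    const (
        SysMgmtBase   = 101 // 101-200: system management
        PerfMgmtBase  = 201 // 201-300: performance management
        FaultMgmtBase = 301 // 301-400: fault management
    )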

###2.1 System partition###

For our dummy server, I'll use path names as ids and partition the system as follows:

####2.1.1 SystemManager Task####

To make our server reliable, let's have two server instances running at the same time, one active (serving the requests) and one standby (ready to replace the active one). The SystemManager at the standby server monitors the heartbeats from the active server; if 2 heartbeats are missed in a row, the standby becomes active and starts serving user requests (see the sketch after the message list below). The SystemManager at the active server sends heartbeats to the standby and sends commands to other tasks to drive their life cycles.

  • messages sent

    /Sys/Ctrl/Heartbeat   
    
        active servant sends heartbeats to standby servant
    
    /Sys/Command   
    
        * commands sent to control subordinate tasks' life cycle: Start, Stop, Shutdown
        * commands sent to manage app services: AddService, DelService
    
  • messages received

    /Sys/Ctrl/Heartbeat   
    
        standby servant monitors heartbeats from active servant
    
    /Sys/OutOfService   
    
        the fault manager will send an OOS msg to the system manager when the system
        has a fault and becomes unusable
    
    SysId(PubId/UnPubId) 
    
       the system manager will subscribe to these two system ids to detect
       the join/leave of clients. to simplify our sample, when a client connects
       and subscribes to a service, the system manager will grab the service name
       from the subscription and start a service with that name. so the server will
       start with no app services. in the real world, the server would probably
       start with a set of app services.
    
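
Below is a hedged sketch of the standby-side heartbeat watchdog described above, written with a plain channel and timer instead of the sample's router ids (the names heartbeat, period and becomeActive are assumptions of this sketch): two heartbeats missed in a row trigger the switch to active.

    package main

    import (
        "fmt"
        "time"
    )

    // monitorHeartbeat is what the standby SystemManager conceptually does:
    // wait for heartbeats from the active servant and, after two beats are
    // missed in a row, promote itself to active.
    func monitorHeartbeat(heartbeat <-chan struct{}, period time.Duration, becomeActive func()) {
        missed := 0
        for {
            select {
            case <-heartbeat:
                missed = 0 // active servant is alive
            case <-time.After(period):
                missed++
                if missed >= 2 {
                    becomeActive()
                    return
                }
            }
        }
    }

    func main() {
        hb := make(chan struct{})
        go monitorHeartbeat(hb, 100*time.Millisecond, func() {
            fmt.Println("standby becomes active")
        })

        // simulate a few heartbeats from the active servant, then stop sending
        for i := 0; i < 3; i++ {
            hb <- struct{}{}
            time.Sleep(50 * time.Millisecond)
        }
        time.Sleep(300 * time.Millisecond) // two beats missed -> switch-over
    }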

####2.1.2 Service Task####

A dummy application service task that just bounces back the request id with a transaction number. There should be one task per service, differentiated by a unique "ServiceName". A real server could provide multiple services and thus have multiple ServiceTasks running (a sketch of the request/response loop follows the message list below).

  • messages sent

    /App/ServiceName/Response
    
        send response to client's request received at "/App/ServiceName/Request"
    
    /Fault/AppService/Exception
    
        randomly send fake app exception to fault manager
    
    /DB/Request
    
        randomly send fake DB requests to DbTask
    
  • messages received

    /App/ServiceName/Request
    
        receive client requests
    
    /App/ServiceName/Command
    
        system tasks can send commands to control service task: Start, Stop, Reset.
        currently only fault manager will send Reset when it receives an exception 
        from this service task
    
    /App/ServiceName/DB/Response
    
        receive response from DbTask for DB requests sent
    
    /Sys/Command
    
        receive commands from system manager, mostly for life cycle management, 
        such as Start, Stop
    
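
A minimal sketch of the request/response loop described above, again with plain channels standing in for router ids such as /App/ServiceName/Request (the type and function names are illustrative): the task bounces each request id back with an incrementing transaction number.

    package main

    import "fmt"

    type Request struct{ ID string }

    type Response struct {
        ReqID string
        Txn   int
    }

    // serviceTask answers each request with the request id plus an incrementing
    // transaction number, mirroring what the dummy ServiceTask does.
    func serviceTask(name string, reqs <-chan Request, resps chan<- Response) {
        txn := 0
        for req := range reqs {
            txn++
            fmt.Printf("App Service [ %s ] process req: %s\n", name, req.ID)
            resps <- Response{ReqID: req.ID, Txn: txn}
        }
        close(resps)
    }

    func main() {
        reqs := make(chan Request)
        resps := make(chan Response)
        go serviceTask("news", reqs, resps)

        go func() {
            reqs <- Request{ID: "request 1"}
            close(reqs)
        }()
        for r := range resps {
            fmt.Printf("[%s] processed: transaction_id [%d]\n", r.ReqID, r.Txn)
        }
    }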

####2.1.3 Database Task####

A database task manages database connections and carries out db transactions on behalf of the service tasks. Currently it does nothing besides sending back empty db responses and randomly generating exceptions (see the sketch after the message list below).

  • messages sent

    /App/ServiceName/DB/Response
    
        send db transaction result back to service task with name "ServiceName"
    
    /Fault/DB/Exception
    
        randomly raise db exception to fault manager
    
  • messages received

    /DB/Request
    
        receive db requests from service tasks
    
    /Sys/Command
    
        receive commands from system manager, mostly for life cycle management
    
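
A hedged sketch of this behaviour (channel and type names are illustrative, not the sample's): every db request gets an empty response, and a small fraction of requests also raise a fake fault towards the fault manager.

    package main

    import (
        "fmt"
        "math/rand"
    )

    type DBRequest struct{ Service string }
    type DBResponse struct{} // the dummy DbTask returns an empty result

    type FaultRecord struct{ Source string }

    // dbTask handles db requests from service tasks and randomly reports
    // a fake exception to the fault manager.
    func dbTask(reqs <-chan DBRequest, resps chan<- DBResponse, faults chan<- FaultRecord) {
        for req := range reqs {
            fmt.Println("DbTask handles req from:", req.Service)
            resps <- DBResponse{}
            if rand.Intn(10) == 0 { // roughly 1 in 10 requests raises a fault
                faults <- FaultRecord{Source: "DB"}
            }
        }
    }

    func main() {
        reqs := make(chan DBRequest)
        resps := make(chan DBResponse)
        faults := make(chan FaultRecord, 1) // buffered so the sketch never blocks on it
        go dbTask(reqs, resps, faults)

        reqs <- DBRequest{Service: "stock"}
        <-resps // empty response comes back
        close(reqs)
    }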

####2.1.4 FaultManager Task####

A fault manager receives FaultRecords from other tasks and performs fault correlation and handling. Normally, for a reasonably sized system, we can have a hierarchy of fault managers: low-level fault managers handle local faults for a specific functionality or app module, and propagate faults that cannot be handled locally to upper-level managers. (The simple policy used here is sketched after the message list below.)

  • messages sent

    /Sys/OutOfService
    
        send msg to system manager to notify that system becomes unusable because of 
        some faults (for our simple sample, a fault from DbTask will do this)
    
    /App/ServiceName/Command
    
        for our simple sample, fault manager sends "Reset" to service task when 
        receiving fault from it
    
  • messages received

    /Fault/DB/Exception
    
          receive db exception
    
    /Fault/AppService/Exception
    
          receive app service exception
    
    /Sys/Command
    
         receive commands from system manager, mostly for life cycle management
    
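
The correlation policy spelled out above can be sketched as follows, under the same caveat that plain channels stand in for the router ids: an app-service exception triggers a Reset command back to the offending service, while a db exception is escalated to the system manager as OutOfService.

    package main

    import "fmt"

    type FaultRecord struct {
        Source  string // "DB" or the name of an app service
        Service string // set when Source is an app service
    }

    type Command struct {
        To  string // target service name, empty for a system-wide command
        Cmd string // "Reset", "OutOfService", ...
    }

    // faultMgrTask applies the dummy server's simple correlation policy:
    // a db fault takes the whole servant out of service, an app-service
    // fault just resets the offending service task.
    func faultMgrTask(faults <-chan FaultRecord, toSysMgr, toService chan<- Command) {
        for f := range faults {
            switch f.Source {
            case "DB":
                toSysMgr <- Command{Cmd: "OutOfService"}
            default:
                toService <- Command{To: f.Service, Cmd: "Reset"}
            }
        }
    }

    func main() {
        faults := make(chan FaultRecord)
        toSys := make(chan Command, 1)
        toSvc := make(chan Command, 1)
        go faultMgrTask(faults, toSys, toSvc)

        faults <- FaultRecord{Source: "stock", Service: "stock"}
        fmt.Println(<-toSvc) // {stock Reset}
        faults <- FaultRecord{Source: "DB"}
        fmt.Println(<-toSys) // { OutOfService}
        close(faults)
    }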

###2.2 Task###

Each task is a running goroutine which goes through the following standard stages of its life cycle:

  • init
  • start
  • stop
  • shutdown

The following are the jobs performed at each stage (a sketch of a command-driven task loop follows the list):

  • init
    • create channels and attach channels to ids in router
    • possible other jobs:
      • load config data
      • open conn to database
      • open conn to backend legacy server
  • start: actively perform the service, handle user requests, send responses
  • stop: pause the active service
  • shutdown
    • detach chans from router
    • close conn to database
    • close conn to other servers
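
Here is a sketch of how such a life cycle can be driven by /Sys/Command messages, with a plain channel standing in for the router id and the init/shutdown jobs reduced to comments:

    package main

    import "fmt"

    type SysCommand string

    const (
        Start    SysCommand = "Start"
        Stop     SysCommand = "Stop"
        Shutdown SysCommand = "Shutdown"
    )

    // taskLoop runs the standard life cycle: init once, then serve requests
    // only while started, until Shutdown arrives.
    func taskLoop(cmds <-chan SysCommand, reqs <-chan string) {
        // init: attach channels, load config, open connections, ...
        active := false
        for {
            select {
            case cmd := <-cmds:
                switch cmd {
                case Start:
                    active = true
                case Stop:
                    active = false
                case Shutdown:
                    // shutdown: detach channels, close connections, ...
                    fmt.Println("task shut down")
                    return
                }
            case req := <-reqs:
                if active {
                    fmt.Println("handling", req)
                } // requests are ignored while stopped
            }
        }
    }

    func main() {
        cmds := make(chan SysCommand)
        reqs := make(chan string)
        go func() {
            cmds <- Start
            reqs <- "request 1"
            cmds <- Stop
            cmds <- Shutdown
        }()
        taskLoop(cmds, reqs)
    }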

This may look similar to the life cycle of a Java applet. However, they are really different:

  • A Java applet's life cycle methods are called by the browser/JVM at the proper moments; it exposes a call-back interface to the runtime framework.
  • A task is active, with its own goroutine. A task's life cycle is driven by messages from the SystemManager task. A task's public interface is solely the set of messages it sends and the set of messages it receives. A task exposes no public methods or public data.

###2.3 Program structure###

The main() function is simple:

  • create routers and connect them as needed
  • create Tasks and connect them to routers

To simplify our code, the dummy server is organized as follows (a hypothetical sketch of this wiring follows the list):

  • create a "Servant" struct which contains a router listening at some socket addr/port for incoming client connections
  • inside a Servant, create instances of above tasks attached to the router. in real world, for load balance or reliability, we could configure the tasks of a Servant running distributedly with two routers on two machines. these two routers can be connected thru sockets and tasks at each machine connected to its local router. Tasks' code do not need change in new configuration.
  • for simplicity, in main() function of dummy server, we create two instances of Servant, one active and the other standby. in real world, we may deploy these two instances of servant as two processes, or two machines for more reliability.
  • connect the routers of these two servants with filters defined at proxies to allow only heartbeat messages passed between them
  • when clients connect to dummy server, it will connect to both servant instances, although at any moment, only the active servant instance is providing the service and answering client requests
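
A hypothetical sketch of the shape of this wiring (Servant, NewServant and connectHeartbeatOnly are placeholders invented for this sketch, not the actual API in servant.go/server.go):

    package main

    // A much-simplified, hypothetical shape of the dummy server's main();
    // Servant and the helpers here are placeholders, not the sample's API.

    type Servant struct{ name string }

    // NewServant would create a router listening on a socket address and the
    // SystemManager, Service, Db and FaultManager tasks attached to that router.
    func NewServant(name string, active bool) *Servant {
        return &Servant{name: name}
    }

    // connectHeartbeatOnly would connect the two servants' routers through
    // proxies whose filters let only /Sys/Ctrl/Heartbeat messages pass.
    func connectHeartbeatOnly(a, b *Servant) {}

    func main() {
        active := NewServant("servant1", true)
        standby := NewServant("servant2", false)
        connectHeartbeatOnly(active, standby)
        // ... then serve client connections until the process is killed
    }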

##3. Code##

  • code is under samples/dummyserver.
  • tasks: sysmgrtask.go svctask.go dbtask.go faultmgrtask.go
  • server: servant.go server.go
  • client: client.go

##4. How to run it##

  • in one console, run "./server" to start the server

  • in a 2nd console, run "./client news 123456" to start a client that talks to the "news" service on the server 123456 times

  • in a 3rd console, run "./client stock 123456" to start a client that talks to the "stock" service on the server 123456 times

  • observe the trace messages in the server console and see how the standby servant comes up automatically when the active servant goes out of service:

      App Service [ news ] at [ servant1 ] process req:  request 1616
      App Service [ stock ] at [ servant1 ] process req:  request 1822
      App Service [ stock ] at [ servant1 ] process req:  request 1823
      DbTask at [ servant1 ] handles req from :  stock
      DbTask at [ servant1 ] report fault
      fault manager at [ servant1 ] report OOS
      xxxx Servant [ servant1 ] will take a break and standby ...
      App Service [ stock ] at [ servant1 ] is stopped
      servant1  enter monitor heartbeat
      App Service [ news ] at [ servant1 ] is stopped
      servant1  exit send heartbeat
      servant2  exit monitor heartbeat
      !!!! Servant [ servant2 ] come up in service ...
      servant2  enter send heartbeat
      App Service [ news ] at [ servant2 ] is activated
      App Service [ stock ] at [ servant2 ] is activated
      App Service [ stock ] at [ servant2 ] process req:  request 1824
      App Service [ stock ] at [ servant2 ] process req:  request 1825
    
  • observe the trace messages in the client console: when servant fail-over/switch-over happens, the client may have one request time out, and then the responses keep coming back, though from a different servant:

      client sent request [request 1822] to serivce [stock]
      client recv response ( [request 1822] is processed at [servant1] : transaction_id [4] )
      client sent request [request 1823] to serivce [stock]
      client recv response ( [request 1823] is processed at [servant1] : transaction_id [5] )
      client sent request [request 1824] to serivce [stock]
      time out for reqest [request 1824]
      client sent request [request 1824] to serivce [stock]
      client recv response ( [request 1824] is processed at [servant2] : transaction_id [4] )
      client sent request [request 1825] to serivce [stock]
      client recv response ( [request 1825] is processed at [servant2] : transaction_id [5] )
    