4. SYSTEM ARCHITECTURE AND SOFTWARE DESIGN
Enabling near-real-time user-to-user communication on a high-traffic Web site requires fast, lightweight, low-load CGI software. At the time of this writing, bianca receives an average of 1.4 million hits a day from just under 50,000 unique hosts. Most of these hits are not low-load static GET requests but high-load dynamic CGI requests. Using only one SGI Indy and one Sun Sparc4, bianca has evolved to its current state through intelligent, conservative software design and by continuously adapting to its growing traffic load.
4.1 Design Goals
Though bianca's chat software has seen many revisions, its design goals have remained the same. They fall into two categories: internal design and external design. External design covers the parts of the software the user can 'see' or interact with. Internal design covers the parts the user neither sees nor interacts with but that are essential to the software's operation.
Internally, the software had to be easy to maintain, maximize the number of users, minimize the system load, maintain a log of all interactions, and have a relatively high degree of scalability.
Externally, the software had to be easy to use, display the number of users in a room and who they were, enable users to ignore and group with each other, filter out potentially destructive or abusive HTML, and check all other HTML for completeness.
4.2 Logical Design and Overall Architecture
In keeping with the virtual house metaphor, the chat spaces were logically divided by room, with only one chat 'channel' per room. Unlike IRC or telnet chat environments, where users can split off into different 'channels' when a particular area of the chat space becomes crowded, bianca allows users to communicate in only one channel per room. This was a conscious effort to keep bianca's chat spaces as close to real life as possible. In the real world there is only one 'channel' of communication: if there are 10 people in one room, they must all communicate within the same shared air space and accommodate one another accordingly. The same holds in bianca.
The overall architecture supporting the logical design has four components: a C client and Perl server combination to post messages, a C program to display messages, information and status files, and logs of users' conversations. Each chat area of the site has an associated chat daemon resident in memory. When a user POSTs a message to the chat room, the request starts a small C client, called uclient, which connects to the specified room's daemon (for instance, chatd.altar); the daemon then handles posting the message to the room. When a user wants to view the messages posted since their last update, they execute a C program, called displaychat, which displays the most recent messages in the room. In essence, chatting in a room is accomplished through a combination of the room's chat daemon and displaychat. The chat daemons and displaychat use URLs and the file system to maintain a consistent image of each room's current status. At any given moment, two files, called chatinfo and currentchat, contain the most current information about the room: the number of users in the room, the handles of all users in the room, the last message posted in the room, and the 25 most recent messages posted in the room. The file system is also used to continuously log all conversations in every room, and these log files let users download previous chat activity.
bianca's chat system has been revised and improved many times; here, however, we discuss the implementation of the current system. This section is divided into two parts. The first part briefly discusses how each design goal is implemented. The second part is a detailed walk-through of the uclient and chat daemon combination and of displaychat.
Part 1: Brief Explanation
The first internal design goal was an easy-to-maintain system. To help with maintenance and ease revision control, bianca uses the CVS source code control system. Another design goal was to log all conversations. Logging is easily accomplished by having each room's chat daemon keep a file handle to a log file open and continuously append each user's posted message to it. The next three internal design goals (maximizing the number of users, minimizing the system load, and achieving a high degree of scalability) are not accomplished by any one implementation technique; rather, the sum of the parts of bianca's chat system has enabled bianca to meet them. Conservative software engineering techniques, such as using as few conditional statements and as few disk accesses as possible, have maximized the number of users and minimized the system load, which in turn has given the system a good degree of scalability and enabled bianca to support its current traffic load.
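As a minimal sketch of the logging scheme, assuming a C rendering of what bianca's Perl daemons do, the room's log file can be opened once at start-up and kept open, with each message appended as it arrives. The function names and the tab-separated line format are hypothetical; the text does not give bianca's log format.

```c
#include <stdio.h>

/* Hypothetical sketch: the room's log stays open for the daemon's lifetime. */
static FILE *room_log = NULL;

/* Open (or create) the room's log once, at daemon start-up. */
int open_room_log(const char *path)
{
    room_log = fopen(path, "a");   /* "a": every write lands at end of file */
    return room_log != NULL ? 0 : -1;
}

/* Append one posted message as a tab-separated line (format assumed). */
int log_message(long posted_at, const char *handle, const char *text)
{
    if (room_log == NULL)
        return -1;
    if (fprintf(room_log, "%ld\t%s\t%s\n", posted_at, handle, text) < 0)
        return -1;
    /* Hand the line to the kernel now so it is not lost if the daemon dies. */
    return fflush(room_log) == 0 ? 0 : -1;
}
```

Opening the file once avoids an open and close per POST, in keeping with the goal of using as few disk accesses as possible.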
Part 2: Detailed Explanation
As briefly mentioned earlier, posting a message is accomplished by a client/server combination. When a user presses the "Post Message" button on the room's chat form (an example of which can be seen in Figure 3), the form's URL invokes a program called uclient. Every chat space uses the same uclient to connect to its own room's chat daemon. The URL's 'PATH_INFO' and 'QUERY_STRING' are used to maintain state between requests and to pass information from client to server. A typical URL might be: http://chat.bianca.com/cgi-bin/uclient/shack/altar/chat?id=7403262343239845+h=freeform+hcolor=brown+hsize=4+hitalics=on. The 'PATH_INFO' (here /shack/altar/chat) contains room-specific information, and the 'QUERY_STRING' contains user-specific information. uclient parses the 'PATH_INFO' to determine which chat daemon to connect to. It accepts the form submitted by the browser, does some minimal pre-processing of it, packages the submitted form information with some additional httpd environment variables, opens a connection to the appropriate chat daemon, and sends the package to it. uclient keeps the connection open while it waits for the chat daemon to process the POST. When the chat daemon has processed the POST, it returns a new dynamic Web page, containing a new chat form and the 25 most recent messages, to uclient, which then displays the new page on the user's screen.
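The sample URL separates its QUERY_STRING pairs with '+' rather than the more common '&'. The following C sketch, under that assumption, looks up one key in such a string; `query_get` is a hypothetical helper, not part of bianca's code, and it performs no URL decoding.

```c
#include <stddef.h>
#include <string.h>

/* Look up KEY in a '+'-separated QUERY_STRING such as "id=74+h=freeform".
   Copies the value into OUT (truncated to OUTLEN-1 chars) and returns 1
   if the key is present, 0 otherwise. */
int query_get(const char *query, const char *key, char *out, size_t outlen)
{
    size_t keylen = strlen(key);
    const char *p = query;

    while (p != NULL && *p != '\0') {
        const char *end = strchr(p, '+');            /* end of this pair */
        size_t pairlen = end ? (size_t)(end - p) : strlen(p);

        /* Match "key=" exactly, so "h" does not match "hcolor=...". */
        if (pairlen > keylen && strncmp(p, key, keylen) == 0 && p[keylen] == '=') {
            size_t vlen = pairlen - keylen - 1;
            if (outlen == 0)
                return 0;
            if (vlen >= outlen)
                vlen = outlen - 1;                   /* truncate rather than overflow */
            memcpy(out, p + keylen + 1, vlen);
            out[vlen] = '\0';
            return 1;
        }
        p = end ? end + 1 : NULL;                    /* advance to the next pair */
    }
    return 0;
}
```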
uclient is a bare-bones C client. When executed, it first verifies that it was invoked by an httpd request by checking for the existence of an environment variable called 'REQUEST_METHOD'. If 'REQUEST_METHOD' is defined, uclient checks for the existence of the 'PATH_INFO' environment variable. If 'PATH_INFO' is defined, uclient parses it to determine which chat daemon to connect to by comparing it against a list of every 'PATH_INFO' it should know about; this list maps each 'PATH_INFO' to the appropriate chat daemon. A 'PATH_INFO' of '/shack/altar/chat', for example, tells uclient to connect to the chat daemon that services the altar. The 'PATH_INFO' also tells the chat daemon where to find a particular room's information files, as explained below in the discussion of the chat daemons. After uclient has determined which chat daemon to connect to, it collects the user's message and wraps it, together with the other httpd environment variables, in a 'package' that the chat daemon understands. uclient then attempts to open a connection to that chat daemon. If the connection succeeds, uclient passes the package to the chat daemon, keeps the connection open, and waits for the chat daemon to process the POST.
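A minimal C sketch of uclient's first steps follows, assuming the two environment values are passed in from `getenv`. Only the altar entry and the name chatd.altar come from the text; the bedroom path, its daemon name, and the function name `select_daemon` are illustrative. The text does not say how uclient actually addresses a daemon once it has chosen one, so that step is omitted.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical PATH_INFO-to-daemon list. */
struct route {
    const char *path_info;
    const char *daemon;
};

static const struct route routes[] = {
    { "/shack/altar/chat",   "chatd.altar"   },
    { "/shack/bedroom/chat", "chatd.bedroom" },   /* illustrative entry */
};

/* Return the daemon for this request, or NULL if uclient was not run by
   httpd (no REQUEST_METHOD), PATH_INFO is missing, or it is not listed. */
const char *select_daemon(const char *request_method, const char *path_info)
{
    size_t i;

    if (request_method == NULL || path_info == NULL)
        return NULL;
    for (i = 0; i < sizeof routes / sizeof routes[0]; i++)
        if (strcmp(path_info, routes[i].path_info) == 0)
            return routes[i].daemon;
    return NULL;
}
```

A caller would use `select_daemon(getenv("REQUEST_METHOD"), getenv("PATH_INFO"))`; a NULL result means uclient should stop without contacting any daemon.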
After the chat daemon has accepted the connection from uclient and received the list of environment variables, it first parses the 'PATH_INFO' to determine which information files to use. As mentioned earlier, the 'PATH_INFO' contains the location of the room's information files. There are three information files for each room: pageinfo, chatinfo, and currentchat. The pageinfo file contains the name and description of the room, the header and footer information for the page, and which navigational links should appear on the page. The chatinfo file contains the current estimated number of users in the room, the handles of those users, and the last posted message. The currentchat file contains the 25 most recent posted messages.
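One way the daemon might turn 'PATH_INFO' into the three file locations is to treat it as a directory under a fixed document root. The C sketch below assumes that layout; `DOCROOT` and the function names are hypothetical. It also rejects '..' so that a crafted 'PATH_INFO' cannot point outside the root, a precaution the text does not mention.

```c
#include <stdio.h>
#include <string.h>

/* Assumed document root; bianca's real directory layout is not given. */
#define DOCROOT "/usr/local/bianca"

struct room_files {
    char pageinfo[256];
    char chatinfo[256];
    char currentchat[256];
};

/* Build "<DOCROOT><PATH_INFO>/<name>"; 0 on success, -1 if it does not fit. */
static int build_path(char *out, size_t outlen, const char *path_info, const char *name)
{
    int n = snprintf(out, outlen, "%s%s/%s", DOCROOT, path_info, name);
    return (n < 0 || (size_t)n >= outlen) ? -1 : 0;
}

/* Fill RF with the three information-file paths for the room in PATH_INFO. */
int room_files_from_path_info(const char *path_info, struct room_files *rf)
{
    if (strstr(path_info, "..") != NULL)   /* refuse paths that climb out of DOCROOT */
        return -1;
    if (build_path(rf->pageinfo, sizeof rf->pageinfo, path_info, "pageinfo") != 0
        || build_path(rf->chatinfo, sizeof rf->chatinfo, path_info, "chatinfo") != 0
        || build_path(rf->currentchat, sizeof rf->currentchat, path_info, "currentchat") != 0)
        return -1;
    return 0;
}
```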
After setting the information files, the chat daemon parses the 'QUERY_STRING' to determine user-specific information: the user's handle; the handle's color and size; whether the handle should be displayed in italics; whom the user is ignoring; whom the user is grouped with; and the user's unique identification tag.
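The handle settings in the sample URL (`h=freeform`, `hcolor=brown`, `hsize=4`, `hitalics=on`) suggest how a handle might be rendered. The C sketch below assumes they become a `<font>` tag with an optional `<i>`; the exact markup bianca emitted is not given, and `struct handle_style` and `render_handle` are hypothetical.

```c
#include <stdio.h>

/* Hypothetical per-user display settings decoded from QUERY_STRING. */
struct handle_style {
    const char *handle;   /* from h=        */
    const char *color;    /* from hcolor=   */
    int size;             /* from hsize=    */
    int italics;          /* hitalics=on -> 1, otherwise 0 */
};

/* Write the styled handle as HTML into OUT. Returns snprintf's result,
   i.e. the length the full string would need (compare with OUTLEN). */
int render_handle(const struct handle_style *s, char *out, size_t outlen)
{
    return snprintf(out, outlen, "<font color=\"%s\" size=\"%d\">%s%s%s</font>",
                    s->color, s->size,
                    s->italics ? "<i>" : "",
                    s->handle,
                    s->italics ? "</i>" : "");
}
```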
After determining the user-specific information, the chat daemon parses the submitted message itself. The submitted form consists of three fields: 'From', 'To', and 'Comments'. The From field lets users change their handle: if it contains a handle different from the one given by the 'QUERY_STRING', the handle in the From field becomes the user's new handle. The To field contains the handle of another user the current user may be conversing with. The Comments field contains the text of the user's message. All three fields are checked for HTML completeness. As described in later chapters, bianca allows users to use HTML in their messages. A system that allows HTML to be posted should check it for completeness before posting it to the room, because incomplete HTML placed on a Web page may alter the appearance of everything below it on the page. If the submitted message contains incomplete HTML, for instance a missing </b> or an unclosed tag such as "<font size=2", the message is not posted to the room, and the user is told that the message contained "Bad HTML" and should be posted again.
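The completeness check can be sketched with a stack of open tags: every '<' must be closed by '>' before the next '<', and every tag that needs a partner must be closed in reverse order of opening. This C version is an illustration, not bianca's checker; the list of paired tags is an assumption, and it does not perform the separate filtering of destructive or abusive HTML.

```c
#include <ctype.h>
#include <string.h>

#define MAX_DEPTH 32   /* deepest tag nesting we accept */
#define MAX_NAME  16   /* longest tag name we record */

/* Tags that must be explicitly closed. This list is an assumption. */
static int needs_close(const char *name)
{
    static const char *const paired[] = { "a", "b", "i", "u", "font", "center", "blink" };
    size_t i;

    for (i = 0; i < sizeof paired / sizeof paired[0]; i++)
        if (strcmp(name, paired[i]) == 0)
            return 1;
    return 0;
}

/* Return 1 if MSG's HTML is complete, 0 for "Bad HTML". */
int html_complete(const char *msg)
{
    char stack[MAX_DEPTH][MAX_NAME];
    int depth = 0;
    const char *p = msg;

    while ((p = strchr(p, '<')) != NULL) {
        const char *gt = strchr(p, '>');
        const char *next = strchr(p + 1, '<');
        char name[MAX_NAME];
        size_t n = 0;
        int is_end;

        if (gt == NULL || (next != NULL && next < gt))
            return 0;                       /* tag never closed, e.g. "<font size=2" */
        p++;
        is_end = (*p == '/');
        if (is_end)
            p++;
        while (p < gt && isalpha((unsigned char)*p) && n < MAX_NAME - 1)
            name[n++] = (char)tolower((unsigned char)*p++);   /* case-insensitive name */
        name[n] = '\0';

        if (is_end) {
            if (depth == 0 || strcmp(stack[depth - 1], name) != 0)
                return 0;                   /* stray or out-of-order close tag */
            depth--;
        } else if (needs_close(name)) {
            if (depth == MAX_DEPTH)
                return 0;                   /* nesting too deep: reject */
            strcpy(stack[depth++], name);
        }
        p = gt + 1;
    }
    return depth == 0;                      /* e.g. "<b>" without "</b>" fails here */
}
```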
After the user's message has been parsed and verified, the chat daemon appends it to the continuous log file and then updates its user information table with the current user's information. The user information table is a Perl 5 associative array, often called a hash. The hash contains an entry for every user in the chat room who has made a POST in the last two minutes. Each entry is indexed by the user's unique identification tag and holds the time of the user's last POST to the room, the user's handle, and the user's handle configuration. With each POST, the entire hash is enumerated and every user's last POST time is compared against the current time. If a user's last POST was more than two minutes ago, their entry is removed from the hash; the chat daemon no longer considers them to be 'in' the room, and they are not included in the estimated number of users in the room.
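The user information table and its two-minute expiry can be sketched in C with a fixed-size array standing in for the Perl hash. The capacity, field sizes, and the name `record_post` are assumptions; as in the text, every POST rescans the whole table and drops entries whose last POST is more than 120 seconds old.

```c
#include <stdio.h>
#include <string.h>
#include <time.h>

#define MAX_USERS   128    /* assumed table capacity */
#define EXPIRE_SECS 120.0  /* the two-minute window described in the text */

struct user_entry {
    char id[24];           /* unique identification tag (the table key) */
    char handle[32];
    time_t last_post;
};

static struct user_entry users[MAX_USERS];
static int nusers = 0;

/* Record a POST from ID at time NOW, expire stale entries, and return the
   estimated number of users in the room. */
int record_post(const char *id, const char *handle, time_t now)
{
    int i, kept = 0;

    for (i = 0; i < nusers; i++)
        if (strcmp(users[i].id, id) == 0)
            break;
    if (i == nusers && nusers < MAX_USERS) {         /* first POST from this user */
        snprintf(users[i].id, sizeof users[i].id, "%s", id);
        nusers++;
    }
    if (i < nusers) {                                /* refresh handle and time */
        snprintf(users[i].handle, sizeof users[i].handle, "%s", handle);
        users[i].last_post = now;
    }

    /* Enumerate the whole table; keep only users who posted within the window. */
    for (i = 0; i < nusers; i++)
        if (difftime(now, users[i].last_post) <= EXPIRE_SECS)
            users[kept++] = users[i];
    nusers = kept;
    return nusers;
}
```

The return value is the estimated user count that the daemon would then write to chatinfo.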
After the user information table has been updated, the chat daemon adds the user's message to the room's currentchat file. The currentchat file is opened and read into a Perl 5 associative array keyed by time of post, and the user's new message is added to this hash. Because the new message is the most recent one, it comes first when the hash is enumerated from newest to oldest. The hash is enumerated in that order and each message is written to the currentchat file until 25 messages have been written, after which the file is closed. The file therefore always holds the 25 most recent messages posted to the chat room.
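Keeping only the 25 newest messages, newest first, amounts to inserting at the front of a bounded list and letting the oldest entry fall off. The C sketch below assumes a fixed maximum message length and uses the hypothetical names `recent_add` and `recent_write`; the Perl daemon did the equivalent with a hash keyed by posting time.

```c
#include <stdio.h>
#include <string.h>

#define KEEP    25    /* messages kept in currentchat */
#define MSG_LEN 512   /* assumed maximum message length, including '\0' */

struct recent {
    char msg[KEEP][MSG_LEN];   /* msg[0] is the newest message */
    int count;
};

/* Insert MSG at the front; once KEEP messages are held, the oldest drops off. */
void recent_add(struct recent *r, const char *msg)
{
    int survivors = r->count < KEEP ? r->count : KEEP - 1;

    memmove(r->msg[1], r->msg[0], (size_t)survivors * MSG_LEN);
    snprintf(r->msg[0], MSG_LEN, "%s", msg);
    if (r->count < KEEP)
        r->count++;
}

/* Write the messages to FP, newest first, one per line. Returns 0 or -1. */
int recent_write(const struct recent *r, FILE *fp)
{
    int i;

    for (i = 0; i < r->count; i++)
        if (fprintf(fp, "%s\n", r->msg[i]) < 0)
            return -1;
    return 0;
}
```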
After updating the currentchat file, the chat daemon updates the room's chatinfo file. The chatinfo file contains the estimated number of users in the room, the handle of every user determined to be 'in' the room, and the last posted message. The first two come from the user information table: the number of entries in the table is the estimated number of users, and the handles stored in those entries are their handles. The last posted message is the current user's message, since it is the most recent message posted.
After updating the chatinfo file, the chat daemon is ready to build the new Web page that the user will see. The page consists of the information obtained from the pageinfo file, an HTML form, the estimated number of users, and the 25 most recent messages, including the user's newly posted message. The HTML for this page is assembled and sent to the waiting uclient through the connected socket.
Displaying new messages is accomplished by the displaychat C program. Like uclient, displaychat is used by every room. displaychat also uses the 'PATH_INFO' to learn which room to display new messages for and where to find those messages. The URL for a call to displaychat has the same form as one to uclient: http://chat.bianca.com/cgi-bin/displaychat/PATH_INFO?QUERY_STRING. displaychat uses the PATH_INFO to find the location of the room's information files, and then uses those files to construct a new Web page.
Like uclient, displaychat first checks, when executed, that it was indeed called by httpd. Its next step is to parse the 'PATH_INFO' to determine the location of the three information files: pageinfo, chatinfo, and currentchat. displaychat then parses the 'QUERY_STRING' to determine the user-specific information, the most important of which are the user's unique identification tag and whom the user is ignoring and grouped with as of that request. Once the 'PATH_INFO' and 'QUERY_STRING' have been parsed, displaychat is ready to build the new Web page that the user will see. Just as with the chat daemons, the page consists of the information obtained from the information files, an HTML form, the estimated number of users, and the 25 most recent messages. The HTML for this page is assembled and then displayed to the user.
If the user is ignoring or grouped with other users, both displaychat and the individual chat room daemons filter the 25 most recent messages before they are displayed to the user. bianca associates every message with the user who posted it by embedding the user's id tag in the message itself, so each message's sender id can be compared with the current user's id and with the list of ids the user is ignoring or grouped with. If the user is ignoring other users, the display procedures of both displaychat and the chat daemons omit messages whose id tags match the current user's list of ignored id tags. Grouping works similarly: a user who is grouped with other users sees only messages whose id tags match the id tags in their grouping list.
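The ignore and group rules can be sketched as a single predicate applied to each of the 25 messages. In this C version the sender's id is a separate field rather than embedded in the message text, and we assume the viewer always sees their own messages, since the text says each sender id is also compared with the current user's id. `struct message` and `visible` are hypothetical names.

```c
#include <string.h>

/* Hypothetical message record; bianca embeds the id tag in the message itself. */
struct message {
    const char *sender_id;
    const char *text;
};

static int in_list(const char *id, const char *const *list, int n)
{
    int i;

    for (i = 0; i < n; i++)
        if (strcmp(id, list[i]) == 0)
            return 1;
    return 0;
}

/* Return 1 if the viewer should see M, 0 if it is filtered out. */
int visible(const struct message *m, const char *viewer_id,
            const char *const *ignored, int nignored,
            const char *const *group, int ngroup)
{
    if (strcmp(m->sender_id, viewer_id) == 0)
        return 1;                            /* assumed: own posts always shown */
    if (in_list(m->sender_id, ignored, nignored))
        return 0;                            /* sender is on the ignore list */
    if (ngroup > 0)
        return in_list(m->sender_id, group, ngroup);   /* grouped: members only */
    return 1;
}
```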
Web chat at bianca is thus accomplished through a combination of state saving and intelligent URLs, a client/server solution to handle POSTs, displaychat to display new messages, and information files kept in the shared space of the file system.
4.4 Evolution of The System
Like the site itself, the implementation of bianca's design goals has been a continuous evolution. From a single Perl script executed every time a POST or GET was made to the current system, bianca's chat system has evolved in reaction to, and in anticipation of, user demand and system constraints. In this section we examine some of the experimental studies that have led to the current design.
Initially, bianca's chat system was implemented as a single static Perl script that had to be loaded and interpreted every time it was executed. For a low-traffic site this approach is fine, and it is the easiest to implement. However, as bianca became more popular and attracted more and more users, the script was being executed and interpreted many times a second. Executing and interpreting a Perl script imposes a high system load. With so many interpreted Perl scripts executing every second, each making four or five disk accesses, bianca experienced loads greater than 140 at peak traffic. Under such a high load, all running processes would grind to a halt, and more often than not the system would crash or completely lock up.
To combat this problem, after much thought, it was decided to pursue a client/server approach to lower the machine's load. By turning the existing Perl script into a daemon that always stayed resident in the machine's memory, the repeated execution and interpretation of Perl scripts was eliminated. A chat daemon was created for each room, uclient was developed, and the new system was deployed. The client/server system not only eliminated the continuous interpretation of Perl scripts but also eliminated most disk writes, the exception being the continuous log of every user's conversation. Because the chat daemons were always resident in memory, the list of users and the 25 most recent messages could be kept inside the chat daemon and therefore did not need to be written to disk. By eliminating the continuous interpretation of Perl scripts and all but one of the disk writes, the machine's load was lowered from an average of 80 or 90 to an average of six or seven. Of all the gains made during bianca's evolution, moving to a client/server implementation for Web chat was the single largest.
However, as would be the case with every one of bianca's renovations, the improvements to the system not only increased its speed and lowered the machine's load but also let more users in. With more users came slower response times and an increased load. In an effort to lower the load and shorten response times, it was decided to rewrite the Perl daemons using Perl 5's highly touted object-oriented techniques. Every part of the daemon was 'objectified': every user was an object, as was every message, and there were objects to control and maintain the user objects and the message objects. Unfortunately, when the new system was released, the machine's load only increased; 'objectifying' the code seemed only to increase the time needed to process a POST. Within a week the older, non-'objectified' code was put back in place, and another approach had to be taken to lower the load.
By closely observing the behavior of bianca's users, it was determined that most users request new messages five times as often as they POST new ones. It was also noted that the chat daemons were a major bottleneck in the chatting process, because the daemons deliberately serialize their accesses to eliminate contention problems, possible deadlocks, and race conditions. Serializing the accesses is necessary so that if two or more users POST at the same time, their POSTs do not overwrite each other. Reading messages, however, does not require serialized access, as two or more users can read the same thing at the same time. Knowing that users display messages far more often than they POST, and that all requests, be they POSTs or GETs, went through the chat daemon and created an unnecessary bottleneck, it was decided to try separating posting and displaying messages into two separate processes. This meant that the two processes would have to communicate so that both would display the same information to the user. To accomplish this communication, the file system was used to store the information. Using the file system meant reintroducing disk writes, a load-increasing procedure. Even though the number of disk writes per access would increase, it was decided to try out this new system of separate programs for posting and displaying messages. Fortunately, the new system was a win: not only was the machine's load slightly lowered, but, more importantly, response times were faster and more users could gain access.
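Splitting posting from displaying means displaychat may read currentchat or chatinfo while a daemon is rewriting it. The text does not say how bianca handled this. One common technique, shown here as a C sketch, is to write a temporary file and then rename() it over the original; on POSIX systems the rename is atomic within one file system, so a reader opens either the old file or the new one, never a partial one. The function name and the '.tmp' suffix are hypothetical, and the fixed suffix is safe only because each room has a single, serialized writer.

```c
#include <stdio.h>

/* Replace PATH with CONTENTS via a temporary file and rename().
   Returns 0 on success, -1 on any failure (the old file is left intact). */
int replace_file_atomically(const char *path, const char *contents)
{
    char tmp[512];
    FILE *fp;
    int n = snprintf(tmp, sizeof tmp, "%s.tmp", path);

    if (n < 0 || n >= (int)sizeof tmp)
        return -1;
    fp = fopen(tmp, "w");
    if (fp == NULL)
        return -1;
    if (fputs(contents, fp) == EOF) {
        fclose(fp);
        remove(tmp);
        return -1;
    }
    if (fclose(fp) == EOF) {            /* fclose flushes; a failure means data loss */
        remove(tmp);
        return -1;
    }
    /* POSIX rename() replaces PATH in one step. */
    return rename(tmp, path) == 0 ? 0 : -1;
}
```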
Again, though, the increased speed let more users access the site, and once again the load on bianca's system became too heavy to support the number of users attempting to chat. Out of ideas for further streamlining the chat software, efforts turned to streamlining the httpd software itself. By eliminating many of the unused modules of the httpd software and recompiling, the resident memory size of every running httpd was cut by a third. With more memory available, more uclients could run, and thus more users could access the system.
With all software streamlining options exhausted, and the load steadily approaching the point of system breakdown as the number of users grew, it was decided to pursue a modified load shedding technique to keep bianca's machines from halting under overload. Load shedding prevents a system from overloading to the point of failure by refusing access to new users, or refusing to start new programs, once a certain threshold has been reached. Because HTTP is a stateless, multi-connection medium, it is extremely difficult, if not impossible, to limit access in this way. Instead, to keep bianca's machines from becoming so bogged down with requests that they failed, it was decided to deny all requests whenever the system began to approach the breakdown point. By continuously monitoring the system load, memory usage, resource allocations, and the number of running processes, the point at which the system would halt due to overload could be predicted. Whenever that threshold was approached, all incoming HTTP requests were denied for a brief period, and all running processes on the machine were killed and restarted. Although this process successfully enabled the machines to recover from near breakdowns, it briefly disrupted user communication. That is better, however, than letting the machines overrun to the point of breakdown, which would prevent users from communicating for even longer while the machines rebooted and reinitialized themselves.
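The shedding decision itself can be sketched as a threshold test plus a hold-off window during which every request is denied. The thresholds, the 30-second window, and the names below are all assumptions, since the text does not give bianca's limits; on many Unix systems the load figure could come from getloadavg(), and the kill-and-restart step is left out.

```c
#include <time.h>

#define HOLD_SECS 30   /* assumed length of the refusal window, in seconds */

struct sys_snapshot {
    double load;            /* e.g. 1-minute load average */
    double mem_used_frac;   /* fraction of memory in use, 0.0 to 1.0 */
    int nprocs;             /* number of running processes */
};

struct limits {
    double max_load;
    double max_mem_frac;
    int max_procs;
};

static time_t refuse_until = 0;   /* all requests are denied until this time */

/* Is the machine close enough to its predicted halt point to shed load? */
static int near_halt(const struct sys_snapshot *s, const struct limits *lim)
{
    return s->load >= lim->max_load
        || s->mem_used_frac >= lim->max_mem_frac
        || s->nprocs >= lim->max_procs;
}

/* Return 1 to serve a request arriving at NOW, 0 to deny it. Crossing any
   threshold starts a HOLD_SECS window during which everything is denied. */
int admit_request(const struct sys_snapshot *s, const struct limits *lim, time_t now)
{
    if (now < refuse_until)
        return 0;
    if (near_halt(s, lim)) {
        refuse_until = now + HOLD_SECS;
        return 0;
    }
    return 1;
}
```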
One evolutionary step in the software design that did not involve reacting to system load was determining the current estimated number of users in a room. Because HTTP is a stateless, multi-connection medium, it is difficult to know when a user has 'left' a chat room. Therefore, a table of every user in the room and the time at which they last accessed the system must be maintained; this table is the means by which the estimated number of users in a room is obtained. Every room's chat daemon maintains such a table, called the user information hash table. The problem then becomes when to decide that a user has 'left' the room. A time frame between user accesses is needed to determine when to expire a user's entry from the hash table. Based on bianca's system memory constraints, and after tests with 30-second, one-minute, and five-minute time frames proved unsuccessful, a two-minute time frame between user accesses was found to best represent the estimated number of users in a room.
It should be noted that, to consolidate information, improve overall performance, and reduce memory usage, the hash table used to determine the estimated number of users in a room is also used to control handle impersonation. Impersonation is one kind of deviant behavior in computer-mediated communication, a topic covered in depth in chapter eight. Briefly, impersonation is an attempt by a user to use another user's handle.
Because the hash table serves two purposes, controlling impersonation and estimating the number of users in a room, a compromise had to be made on how long a user could go between accesses before their entry was removed from the hash table. To fully control impersonation, a five-minute time frame would have been ideal. However, such a time frame greatly exaggerated the estimated number of users in a room. For example, tests with a five-minute time frame in high-traffic transitional rooms such as the bathroom or bedroom often reported as many as 70 or 80 estimated users. Since bianca allows only 80 httpd processes to run simultaneously, and other rooms were reporting 10, 20, and 30 users each, the numbers were clearly exaggerated. Also, during high-traffic periods the memory required to maintain a five-minute hash table in each room's chat daemon caused bianca's machine to run out of memory, bringing the system to a halt.
Tests with 30-second and one-minute time frames found them too short: users did not have time to read and respond to comments before their handle was removed from the hash, leaving them open targets for impersonation. With the one-minute time frame, for example, a user who spent more than one minute reading comments and formulating a reply before posting had their handle removed from the hash, so "slow" users remained open to impersonation.
Although a five-minute time frame gave users ample time to read and respond to comments, it often misrepresented the estimated number of users in a room. A two-minute time frame between user accesses was therefore found to be best for bianca in terms of representing the estimated number of users in a room, controlling impersonation, and memory usage.
4.5 Lessons Learned
Over the course of its existence, bianca's designers have learned that adaptive and conservative software design makes it possible to run a high-traffic World Wide Web server on mid-range hardware. Along the way, they have learned a few lessons that may help future designers of virtual community software.
Probably the single most important lesson is that a high-traffic Web-based virtual community will invariably require rapid prototyping based on system performance modeling. A designer should be patient, flexible, and willing to adapt quickly to whatever system demands arise. No matter how much time a designer has put into crafting the 'perfect' system, they should be willing to scrap it all and start over once it is released. Although bianca's software has had only one complete rewrite, the sheer number of quick fixes and the continuous tweaking the system has required should command any designer's respect. Always allow for system design changes, and above all be willing to adapt.
One approach that may reduce a designer's need to change a design after it has been released is to start with conservative software design. If a way can be found to eliminate a disk write, take it early on; do not wait until the system is released and the machine is bogging down in disk I/O to make the change. Always look for ways to lower a system's load, even if it means a complete redesign to save one disk write or conditional statement.
Finally, a lesson that should be fairly obvious from bianca's evolutionary path is that every improvement in speed brings more users, who will eventually slow the system down again. Be prepared to keep improving the system; it's a never-ending battle.