Using semantic knowledge to improve compression on log files

Otten, F.J. (2009) Using semantic knowledge to improve compression on log files. Masters thesis, Rhodes University.




With the move towards global and multi-national companies, information technology infrastructure requirements are increasing. As the size of these computer networks increases, it becomes more and more difficult to monitor, control, and secure them. Networks consist of a number of diverse devices, sensors, and gateways which are often spread over large geographical areas. Each of these devices produce log files which need to be analysed and monitored to provide network security and satisfy regulations. Data compression programs such as gzip and bzip2 are commonly used to reduce the quantity of data for archival purposes after the log files have been rotated. However, there are many other compression programs which exist - each with their own advantages and disadvantages. These programs each use a different amount of memory and take different compression and decompression times to achieve different compression ratios. System log files also contain redundancy which is not necessarily exploited by standard compression programs. Log messages usually use a similar format with a defined syntax. In the log files, all the ASCII characters are not used and the messages contain certain "phrases" which often repeated. This thesis investigates the use of compression as a means of data reduction and how the use of semantic knowledge can improve data compression (also applying results to different scenarios that can occur in a distributed computing environment). It presents the results of a series of tests performed on different log files. It also examines the semantic knowledge which exists in maillog files and how it can be exploited to improve the compression results. The results from a series of text preprocessors which exploit this knowledge are presented and evaluated. These preprocessors include: one which replaces the timestamps and IP addresses with their binary equivalents and one which replaces words from a dictionary with unused ASCII characters. In this thesis, data compression is shown to be an effective method of data reduction producing up to 98 percent reduction in filesize on a corpus of log files. The use of preprocessors which exploit semantic knowledge results in up to 56 percent improvement in overall compression time and up to 32 percent reduction in compressed size.

Item Type:Thesis (Masters)
Uncontrolled Keywords:data compression, semantic knowledge
Subjects:Q Science > QA Mathematics > QA75 Electronic computers. Computer science
Divisions:Faculty > Faculty of Science > Computer Science
Supervisors:Irwin, B.
ID Code:1660
Deposited By: Rhodes Library Archive Administrator
Deposited On:29 Apr 2010 13:19
Last Modified:06 Jan 2012 16:21
311 full-text download(s) since 29 Apr 2010 13:19
120 full-text download(s) in the past 12 months
More statistics...

Repository Staff Only: item control page