Dr. Nazli Goharian
Information Retrieval


Misuse Detection in Information Retrieval (on-going)

Most computer crime has traditionally been an "insider" problem. In fact, after viruses (i.e., malicious code), insider abuse, called misuse, is the second most threatening attack. We focus on the misuse of search systems. Misuse is an attack on a system by an authorized user who abuses their privileges. Prior work on misuse detection has mainly relied on logs and user profiles. Profile-based detection systems audit the deviation of user activities from normal user profiles: a user's command history is reviewed based on the percentage of commands used over a specific period of time, and logs are mined. We developed algorithms and implemented a misuse detection system that compares user behavior to a user interest profile learned through clustering, relevance feedback, and finally a fusion of the results of these methods. We evaluated our system with both an automatic evaluation and a manual evaluation (four human evaluators) and showed a detection rate of over 95%.
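As a rough illustration of comparing a query against a user interest profile, the sketch below flags queries whose term vector has low cosine similarity to a profile. It is not the system described above: building the profile from past query terms, the function names, and the 0.1 threshold are all assumptions made for this example, standing in for the clustering and relevance feedback methods.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def build_profile(past_queries):
    """Hypothetical profile: term frequencies over a user's past queries."""
    profile = Counter()
    for query in past_queries:
        profile.update(query.lower().split())
    return profile


def is_potential_misuse(profile, query, threshold=0.1):
    """Flag a query whose similarity to the profile is below the threshold.

    The threshold value is an arbitrary assumption for illustration.
    """
    return cosine(profile, Counter(query.lower().split())) < threshold


profile = build_profile([
    "sparse matrix retrieval",
    "parallel retrieval engine",
    "index compression",
])
on_topic = is_potential_misuse(profile, "parallel sparse retrieval")
off_topic = is_potential_misuse(profile, "employee salary records")
```

Here the first query shares terms with the profile and is not flagged, while the second shares none and is flagged.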

R. Cathey, L. Ma, N. Goharian, D. Grossman, "Misuse Detection for Information Retrieval Systems", ACM 12th Conference on Information and Knowledge Management (CIKM), November 2003.
L. Ma and N. Goharian, "Using Relevance Feedback to Detect Misuse in Information Retrieval Systems", ACM 13th Conference on Information and Knowledge Management (CIKM), November 2004.
N. Goharian, L. Ma, "Query Length Impact on Misuse Detection in Information Retrieval Systems", ACM 20th Symposium on Applied Computing (SAC), March 2005.
N. Goharian, L. Ma, C. Meyer, "Detecting Misuse of Information Retrieval Systems Using Data Mining Techniques", IEEE International Conference on Intelligence and Security Informatics, May 2005.
N. Goharian, L. Ma, "Off-Topic Access Detection In Information Systems", ACM 14th Conference on Information and Knowledge Management (CIKM), November 2005.
N. Goharian, A. Platt, "Detection Using Clustering Query Results", IEEE International Conference on Intelligence and Security Informatics (ISI), May 2006 (to appear).
N. Goharian, A. Platt, O. Frieder, "On Off-Topic Web Browsing", IEEE International Conference on Intelligence and Security Informatics (ISI), May 2006 (to appear).

Text Extraction

A significant portion of the data on the World Wide Web is in the form of HTML pages. Since content, navigational information, advertisements, and formatting have no clear separation in HTML, conventional information retrieval systems have the additional task of dealing with noisy data when providing full-text search. A problem that is not well studied is the negative effect of such noisy data on the results of user queries. Removing these data improves the effectiveness of search by reducing the number of irrelevant results. Furthermore, we argue that irrelevant results, even when they make up only a small fraction of the retrieved results, have a "restaurant effect": users are less likely to return to or use the search service after a bad experience. This matters all the more given that an average of 26.8% of each page is formatting data and advertisements. We developed an algorithm and implemented the system. Our experimental results demonstrate that extraction reduces the irrelevant results for queries that generate "bad" results. They also show that using a system with extracted text instead of non-extracted text can yield a 100% improvement on the irrelevant results retrieved by the engine based on non-extracted text. In our experiment with the cnnfn collection and AOL user queries, we improved all bad results.

L. Ma, N. Goharian, A. Chowdhury, M. Chung, "Extracting Unstructured Data From Template Generated Web Documents", ACM 12th Conference on Information and Knowledge Management (CIKM), November 2003.

Sparse Matrix Information Retrieval System

With large volumes of data, query processing to identify relevant documents is significantly time consuming. The information explosion demands scalable retrieval systems, motivating the selection of indexing and search algorithms that support both the effectiveness and the efficiency of search. By representing an inverted index as a sparse matrix, matrix-vector multiplication algorithms can be used to query the index. As many parallel sparse matrix multiplication algorithms exist, such an information retrieval approach lends itself to parallelism. This enables us to attack the problem of parallel information retrieval, which has long resisted good scalability. To improve accuracy, we developed a novel matrix-based relevance feedback technique as well as a proximity search algorithm. We developed a parallel implementation of a sparse matrix information retrieval engine on a Beowulf cluster of 16 computers and achieved substantial efficiency.
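The idea of querying an inverted index through a sparse matrix product can be sketched as follows. This is a minimal, sequential illustration and not the engine described above: it assumes a term-by-document matrix stored in compressed sparse row (CSR) form, raw term frequencies as weights, and hypothetical function names.

```python
from collections import Counter


def build_term_doc_csr(docs):
    """Build a term-by-document matrix in CSR form.

    Each row holds one term's postings list: `indices` stores document ids
    and `data` stores the matching term frequencies.
    """
    postings = {}  # term -> list of (doc_id, term_frequency)
    for doc_id, tokens in enumerate(docs):
        for term, tf in Counter(tokens).items():
            postings.setdefault(term, []).append((doc_id, tf))

    vocab, data, indices, indptr = {}, [], [], [0]
    for row, term in enumerate(sorted(postings)):
        vocab[term] = row
        for doc_id, tf in postings[term]:
            indices.append(doc_id)
            data.append(tf)
        indptr.append(len(indices))
    return vocab, data, indices, indptr


def score_query(index, query_tokens, num_docs):
    """Compute the query vector times the matrix, giving one score per document.

    Only rows for terms that occur in the query are visited, which mirrors
    a postings-list traversal of an inverted index.
    """
    vocab, data, indices, indptr = index
    scores = [0.0] * num_docs
    for term, weight in Counter(query_tokens).items():
        row = vocab.get(term)
        if row is None:
            continue  # term does not occur in the collection
        for k in range(indptr[row], indptr[row + 1]):
            scores[indices[k]] += weight * data[k]
    return scores


docs = [
    ["sparse", "matrix", "retrieval"],
    ["parallel", "retrieval", "retrieval"],
    ["medical", "diagnosis"],
]
index = build_term_doc_csr(docs)
scores = score_query(index, ["parallel", "retrieval"], len(docs))
print(scores)  # [1.0, 3.0, 0.0]
```

Because each row's contribution to the score vector is independent of the others, the rows can in principle be partitioned across processors, which is the property that makes this formulation amenable to parallel sparse matrix algorithms.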

N. Goharian, T. El-Ghazawi, D. Grossman, "Enterprise Text Processing: A Sparse Matrix Approach", IEEE International Conference on Information Techniques on: Coding & Computing (ITCC), 2001.
A. Jain, N. Goharian, "On Parallel Implementation of Sparse Matrix Information Retrieval Engine", The 2002 International Multi-conferences in Computer Science: on Information and Knowledge Engineering (IKE), 2002.
S. Stein, N. Goharian, "On the Mapping of Index Compression Techniques on CSR Information Retrieval", IEEE International Conference on Information Techniques on: Coding & Computing (ITCC), 2003.
N. Goharian, A. Jain, Q. Sun, "Comparative Analysis of Sparse Matrix Algorithms for Information Retrieval", Journal of Systemics, Cybernetics and Informatics, 2003.


Medical Informatics


A joint project with the Urology department of Northwestern Medical School. This effort includes two projects in computer-assisted diagnosis systems. Data collection for clinical research is quite fragmented in the field of medicine. A recent report outlined the complex process of defining clinical research projects and then collecting data in the various phases of a clinical study. Administrative hurdles associated with multi-center studies, error-free data collection, automated analysis, and increased collaboration among different medical research centers are all of concern. Database and data mining techniques allow for error-free and automated analysis. We designed and developed a computer-assisted medicine application that captures the data needed to study the effectiveness of the treatment of Urinary Tract Infections (UTIs) and provides diagnosis and treatment. Furthermore, using our system, patient data at Northwestern Medical School were analyzed, and based on the findings, recommendations on the application of the LithoTron® lithotripter were given.

P. Jain, N. Goharian, G. Kora, R. Nadler, S. Kim, A. Weiser, "Computer-Assisted Medicine in the Treatment of Nephrolithiasis", IEEE International Conference on Advanced Science and Technology (ICAST 2001), Chicago, Illinois, 2001.
N. Goharian, P. Jain, G. Kora, A. Jain, "A Web-Based Medical Diagnosis and Treatment System for Urinary Tract Infections", The 2002 International Conference on Mathematics and Engineering Techniques in Medicine and Biological Sciences (METMBS'02), Las Vegas, Nevada, June 2002.
P. Jain, N. Goharian, A. Weiser, S. Kimm, S. Kim, J. Stern, J. Pazona, C. Wambi, R. Yap, L. Blunt, and R. Nadler, "Efficiency and Safety of the Healthtronics LithoTron® Lithotripter", Journal of EndoUrology, Vol 18, No 1, January/February 2004.