Using Wikipedia to improve document classification/clustering
ACKNOWLEDGEMENT
I, (Your name), assure that all the matter in this report is my own work unless specifically referenced and given proper credit. I would like to extend my appreciation and gratitude to my instructor, who has been a source of inspiration and a true guide, helping me through the process of conducting this research. The research was made possible by the help of a number of people who directly or indirectly encouraged and guided me to finish this work to an acceptable level.
Signature: _______________________________
Date: __________________________
DECLARATION
I declare that, to the best of my knowledge and effort, this report is my own work and does not use the material of others unless proper credit is given. I have taken all possible steps to avoid any misuse of information or other such mishaps. Any issues that may arise are accidental and not intentional in any way.
ABSTRACT
The intention of this research is to understand how document classification/clustering can be improved using Wikipedia. A range of literature has been reviewed to gain a better understanding of the topic and to study the details of how Wikipedia functions and the effect it can have on document classification. Spoken and written communication further facilitated our technological advancement. Starting with the early Greeks, the encyclopedia emerged as a modern form of culture and knowledge transmission and has served as an important tool for collecting and archiving knowledge, allowing future generations to build on prior developments rather than continually rediscover them. The popularity of Wikipedia has also grown, and as of October 2011 it ranks fifth in overall global web traffic (“Alexa Top 500 Global Sites,” n.d.). Web users looking for information on any topic will likely come across a Wikipedia article fairly quickly. Keeping this in mind, we fail to reject H1 = document classification/clustering can be improved using Wikipedia, H3 = algorithm models are the factors that contribute to document classification/clustering, and H6 = the current position of document classification/clustering is good.
TABLE OF CONTENTS
ACKNOWLEDGEMENT
DECLARATION
ABSTRACT
CHAPTER 01: INTRODUCTION
Background of the research
Research on text categorization
Document classification
Classification of documents
Classification
Importance of classification
Process of classification
Research Aim
Research Objectives
Research Questions
Hypothesis
CHAPTER 02: LITERATURE REVIEW
Mean Free Path Based Categorization
Data Quality
Audit Trail
Replication Recipes
Attribution
Analysis
Developing software tools
Tertiary Source
Presidential Committee on Information Literacy
Collaborative nature
History of the Encyclopedia
Encyclopedia Britannica
Editing process
Research on Wikipedia
Legitimacy
Supporting Evidence
Quantity of contributions
Authorship
CHAPTER 03: METHODOLOGY
Overview
Research Design
Descriptive Research
Exploratory Research
Data Collection Methods
Quality research
Threats to Validity
CHAPTER 04: RESULT AND FINDINGS
Wikipedia
Features
Updating of information
Computational linguistics
Like tf-idf to choose keywords to classify documents
Overview
A naive Bayes model
Summary of the existing research based on the semiotics framework
Semantic similarity
Wikipedia as Taxonomy
Mapping of Concepts
CHAPTER 05: LIMITATIONS
CHAPTER 06: CONCLUSION
Wikipedia database
CHAPTER 01: INTRODUCTION
The intention of this research is to understand how document classification/clustering can be improved using Wikipedia. A range of literature has been reviewed to gain a better understanding of the topic and to study the details of how Wikipedia functions and the effect it can have on document classification. This chapter will shed light on some basic information so that the reader can have a better understanding of some details so ...