Using Wikipedia To Improve Document Classification/Clustering

ACKNOWLEDGEMENT

I, (Your name), affirm that all the material in this report is my own work unless specifically referenced and given proper credit. I would like to extend my appreciation and gratitude to my instructor, who has been a source of inspiration and a true guide, helping me through the process of conducting this research. The research was made possible by the help of a number of people who directly or indirectly encouraged and guided me to complete this work to an acceptable standard.

Signature: _______________________________

Date: __________________________

DECLARATION

I declare that, to the best of my knowledge, this report is my own work and does not use the material of others unless proper credit is given. I have taken all possible steps to avoid any misuse of information or other such mishaps. Any issues that may arise are accidental and not intentional in any way.

Signed: __________________. Date: _________________.

Abstract

The intention of this research is to understand how document classification/clustering can be improved using Wikipedia. A range of literature has been reviewed to gain a better understanding of the topic and to study the functions of Wikipedia and the effect it can have on document classification. Spoken and written communication have facilitated our technological advancement. Starting with the early Greeks, the encyclopedia emerged as a form of cultural and knowledge transmission and has served as an important tool for collecting and archiving knowledge, allowing future generations to build on prior developments rather than continually rediscover them. Wikipedia's popularity has also grown; as of October 2011 it ranks fifth in overall global web traffic (“Alexa Top 500 Global Sites,” n.d.). Web users looking for information on any topic will likely come across a Wikipedia article fairly quickly. Keeping this in mind, we fail to reject the following hypotheses: H1 = document classification/clustering can be improved using Wikipedia, H3 = algorithm models are the factors that contribute to document classification/clustering, and H6 = the current position of document classification/clustering is good.

TABLE OF CONTENTS

ACKNOWLEDGEMENT

DECLARATION

ABSTRACT

CHAPTER 01: INTRODUCTION

Background of the research

Research on text categorization

Document classification

Classification of documents

Classification

Importance of classification

Process of classification

Research Aim

Research Objectives

Research Questions

Hypothesis

CHAPTER 02: LITERATURE REVIEW

Mean Free Path Based Categorization

Data Quality

Audit Trail

Replication Recipes

Attribution

Analysis

Developing software tools

Tertiary Source

Presidential Committee on Information Literacy

Collaborative nature

History of the Encyclopedia

Encyclopedia Britannica

Editing process

Research on Wikipedia

Legitimacy

Supporting Evidence

Quantity of contributions

Authorship

CHAPTER 03: METHODOLOGY

Overview

Research Design

Descriptive Research

Exploratory Research

Data Collection Methods

Quality research

Threats to Validity

CHAPTER 04: RESULT AND FINDINGS

Wikipedia

Features

Updating of information

Computational linguistics

Like tf-idf to choose keywords to classify documents

Overview

A naive Bayes model

Summary of the existing research based on the semiotics framework

Semantic similarity

Wikipedia as Taxonomy

Mapping of Concepts

CHAPTER 05: LIMITATIONS

CHAPTER 06: CONCLUSION

Wikipedia database

CHAPTER 01: INTRODUCTION

The intention of this research is to understand how document classification/clustering can be improved using Wikipedia. A range of literature has been reviewed to gain a better understanding of the topic and to study the functions of Wikipedia and the effect it can have on document classification. This chapter sheds light on some basic information so that the reader can have a better understanding of some details so ...
