International Journal of Computational Intelligence Research (IJCIR)

Volume 2, Number 2 (2006)


Using a self organizing feature map for extracting representative web pages from a web site

Sebastián A. Ríos, Hiroshi Yasuda, Terumasa Aoki
Research Center for Advanced Science and Technology, University of Tokyo, 4-6-1 Komaba, Meguroku, Tokyo, Japan.

Juan D. Velásquez
Department of Industrial Engineering, University of Chile, República 701, Santiago, Chile


We introduce a method for improving the content of a web site by identifying its most representative web pages. The process begins by transforming the text content of each web page into a feature vector, using the vector space model for documents. Next, a Self Organizing Feature Map (SOFM) receives these vectors as input and generates a set of clusters whose centroids contain the most representative text content for a topic in the site.
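As a rough illustration of the vectorization step only, the sketch below maps a few pages to vectors over a shared vocabulary. The TF-IDF weighting, the whitespace tokenizer, the function name `tfidf_vectors`, and the sample pages are all assumptions made for this example; the abstract does not state which weighting scheme is used.

```python
import math
from collections import Counter


def tfidf_vectors(pages):
    """Map each page (a string) to a TF-IDF vector over a shared vocabulary.

    Assumption: weight = term count * log(number of pages / document frequency).
    """
    # Naive tokenizer (assumption): lowercase and split on whitespace.
    tokenized = [page.lower().split() for page in pages]
    vocabulary = sorted({word for words in tokenized for word in words})
    n_pages = len(pages)
    # Document frequency: number of pages in which each word appears.
    doc_freq = Counter(word for words in tokenized for word in set(words))
    vectors = []
    for words in tokenized:
        counts = Counter(words)
        vectors.append([
            counts[word] * math.log(n_pages / doc_freq[word])
            for word in vocabulary
        ])
    return vocabulary, vectors


# Hypothetical page texts, used only to exercise the function.
pages = [
    "credit card offers",
    "credit loan rates",
    "contact us",
]
vocabulary, vectors = tfidf_vectors(pages)
```

Each page becomes one row of `vectors`, with one component per vocabulary word; vectors of this kind would be the input to the SOFM.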

In the vectorial representation of a web page, the text content is transformed into a set of numeric values. After the SOFM has operated, the clusters therefore contain vectors whose relation to the pages of the web site is not clear. By applying a Reverse Cluster Analysis (RCA), it is possible to identify which pages are represented in each cluster. The RCA consists of comparing the vectors in each cluster with the vector representations of the pages. The pages whose vectorial representation is near the cluster's centroid are then extracted.
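The comparison between centroids and page vectors might be sketched as follows. This is a minimal illustration, not the procedure of the paper: the use of cosine similarity as the notion of "near", the `threshold` value, the function names, and the sample vectors are assumptions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def reverse_cluster_analysis(centroids, page_vectors, threshold=0.8):
    """For each centroid, list the indices of the pages that lie near it.

    Assumption: a page is "near" a centroid when their cosine similarity
    is at least `threshold`. Pages are listed most similar first.
    """
    result = []
    for centroid in centroids:
        scored = [
            (cosine_similarity(centroid, vector), index)
            for index, vector in enumerate(page_vectors)
        ]
        close = [pair for pair in scored if pair[0] >= threshold]
        close.sort(key=lambda pair: -pair[0])
        result.append([index for _, index in close])
    return result


# Hypothetical page vectors and cluster centroids.
page_vectors = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 0.0, 1.0]]
centroids = [[1.0, 0.05, 0.0], [0.0, 0.1, 1.0]]
representative = reverse_cluster_analysis(centroids, page_vectors)
```

Here `representative[k]` holds the indices of the pages represented by cluster `k`, which is how the numeric centroids are tied back to actual pages of the site.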

This approach was tested on a real web site in order to show its effectiveness. The results indicate that it is possible to identify the representative web pages in a web site and, in this way, improve the site's text content.

Key words
Neural Networks, Web Content Mining, Self Organizing Maps.