International Journal of Scientific & Technology Research



IJSTR >> Volume 6 - Issue 5, May 2017 Edition


Website: http://www.ijstr.org

ISSN 2277-8616

Search Engine For Ebook Portal




Prashant Kanade, Aishwarya Sadasivan, Komal Dhuri, Manaswini Muralidaran, Meghna Mohan



Keywords: similarity modeling, clustering, Elasticsearch, vector space model, term frequency-inverse document frequency (tf-idf) matrix, language modeling, spell checking, text segmentation.



The purpose of this paper is to describe the textual analytics involved in developing a search engine for an ebook portal. We extracted our dataset from Project Gutenberg using a robot harvester. Textual analytics is used to make search retrieval efficient. The entire dataset is represented using the Vector Space Model, in which each document is a vector in the vector space. For computational purposes, we further represent the dataset as a Term Frequency-Inverse Document Frequency (tf-idf) matrix. The first step is to obtain the most coherent sequence of words from the search query entered by the user. The query is processed by front-end algorithms: Spell Checker, Text Segmentation, and Language Modeling. Back-end processing comprises Similarity Modeling, Clustering, Indexing, and Retrieval. The relationship between documents and words is established using the cosine similarity measured between documents and words in the vector space. Clustering is used to suggest books similar to the search query entered by the user. Lastly, the Lucene-based Elasticsearch engine is used to index the documents, which allows faster retrieval of data. Elasticsearch returns a dictionary and creates a tf-idf matrix. The processed query is compared with this dictionary, and the tf-idf matrix is used to calculate a score for each match so that the most relevant results are returned.
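As a rough illustration of the tf-idf and cosine-similarity scoring described above, the following is a minimal sketch in plain Python. It is not the authors' implementation and does not use Elasticsearch; the whitespace tokenizer, the smoothed idf formula, the function names, and the three toy book descriptions are all assumptions made for the example.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization; a real system would
    # also strip punctuation and apply stemming.
    return text.lower().split()


def vectorize(tokens, idf):
    # Sparse tf-idf vector: relative term frequency times idf.
    # Terms missing from the idf table (unseen in the corpus) are ignored.
    counts = Counter(t for t in tokens if t in idf)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {t: (c / total) * idf[t] for t, c in counts.items()}


def build_tfidf(docs):
    # Returns the idf table and one sparse tf-idf vector per document.
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))
    # Assumption: smoothed idf, one of several common variants.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0
           for t, df in doc_freq.items()}
    return idf, [vectorize(tokens, idf) for tokens in tokenized]


def cosine(u, v):
    # Cosine similarity between two sparse vectors stored as dicts.
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(query, docs):
    # Score every document against the query, highest score first.
    idf, vectors = build_tfidf(docs)
    q = vectorize(tokenize(query), idf)
    scores = [(cosine(q, v), i) for i, v in enumerate(vectors)]
    return sorted(scores, reverse=True)


# Hypothetical toy corpus standing in for harvested book descriptions.
DOCS = [
    "the adventures of sherlock holmes detective stories",
    "pride and prejudice a novel of manners",
    "a study in scarlet sherlock holmes detective novel",
]

for score, index in rank("sherlock holmes detective", DOCS):
    print(round(score, 3), DOCS[index])
```

Documents that share no terms with the query receive a score of zero and sort last; the relative order of the matching documents depends on the idf variant and the length normalization chosen.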



[1] P. D. Turney and P. Pantel (2010), "From Frequency to Meaning: Vector Space Models of Semantics", Volume 37, pages 141-188.

[2] List of IPython (Jupyter) Notebooks by Peter Norvig: How to do things with words or Statistical Natural Language Processing in Python.

[3] Programming Collective Intelligence: Building Smart Web 2.0 Applications, by Toby Segaran, O'Reilly Media (Chapter 3).

[4] http://lucene.apache.org/java/docs/.

[5] http://snowball.tartarus.org/algorithms/lovins/stemmer.html.

[6] https://www.elastic.co/guide/en/elasticsearch/reference/current/getting-started.html

[7] http://bl.ocks.org/AndrewRP/7468330