
Algorithmseer: A System for Extracting and Searching for Algorithms in Scholarly Big Data

A Kavitha, R Ranjana, M Nandhini, D Kiruthika

Abstract


Identifying and extracting informative entities from scholarly digital documents is an active area of research. For algorithm discovery in digital documents, prior work described a method for automatically detecting pseudo-codes (PCs) in computer science publications. That method assumes that each PC is accompanied by a caption, so a PC can be identified with a set of regular expressions that capture the presence of the accompanying caption.
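As an illustration of caption-based detection, the following is a minimal Python sketch. The pattern `CAPTION_RE`, the helper `find_pc_captions`, and the sample lines are assumptions made for this example; the abstract does not list the actual regular expressions used.

```python
import re

# Hypothetical caption pattern (an assumption, not the expressions from the cited method):
# a keyword, a number, then a colon or period, e.g. "Algorithm 1:" or "Pseudocode 2."
CAPTION_RE = re.compile(
    r"^\s*(?:algorithm|pseudo-?code|procedure)\s+\d+\s*[:.]",
    re.IGNORECASE,
)


def find_pc_captions(lines):
    """Return (index, line) pairs for lines that look like pseudo-code captions."""
    return [(i, line) for i, line in enumerate(lines) if CAPTION_RE.match(line)]


sample = [
    "Algorithm 1: Breadth-first search",
    "We now describe the procedure in prose.",
    "Pseudocode 2. Merge step",
    "for each node v in Q do",
]
captions = find_pc_captions(sample)
```

Only the two captioned lines are matched. A pseudo-code block without a caption, or one whose caption follows a different convention, would be missed by such a pattern.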
However, this approach has limited coverage because it relies on the presence of PC captions and because writing styles vary widely across journals and authors. Although PCs are commonly used in scientific documents to represent algorithms, a majority of algorithms are also represented using algorithmic procedures (APs). An algorithmic procedure is a set of descriptive algorithmic instructions and differs from a PC in two ways: writing style and location in the document.
Because algorithms represented in documents do not conform to specific styles and are written in arbitrary formats, identifying and extracting them effectively is a challenge. We propose a novel methodology based on ensemble machine learning to discover algorithm representations such as PCs and APs automatically. Moreover, we observe that two or more algorithm representations may be used to describe the same algorithm. Hence, we also propose a simple heuristic that links the different algorithm representations that together constitute an algorithm.
Automatic discovery and extraction of these algorithm representations will be useful for applications in digital libraries and document engineering. This project also presents an automatic annotation method that first aligns the data units on a result page into groups such that the data units in the same group have the same semantics.
Then, for each group, we annotate it from different aspects and aggregate the different annotations to predict a final annotation label. An annotation wrapper for the search site is automatically constructed and can be used to annotate new result pages from the same web database. The project proposes a clustering-based shifting technique to align data units into groups so that the data units inside the same group have the same semantics.
Instead of using only the DOM tree or other HTML tag tree structures of the search result records (SRRs) to align the data units, as most current methods do, the approach also considers other important features shared among data units, such as their data types (DT), data contents (DC), presentation styles (PS), and adjacency (AD) information. The experiments indicate that the proposed approach is highly effective.
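To make the role of the four features concrete, here is a minimal Python sketch that combines DT, DC, PS, and AD into one similarity score and groups data units with it. The `DataUnit` fields, the weights, the threshold, and the greedy grouping loop are all assumptions made for illustration; the greedy loop is a simple stand-in and is not the clustering-based shifting technique itself, which the abstract does not specify.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataUnit:
    text: str       # DC: the content of the data unit
    data_type: str  # DT: e.g. "price", "date", "text" (hypothetical labels)
    style: str      # PS: e.g. a font/colour signature (hypothetical encoding)
    prev_text: str  # AD: text of the unit that precedes this one


def jaccard(a, b):
    """Token-level Jaccard overlap; two empty strings count as identical."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


# Assumed weights; the abstract gives no values.
WEIGHTS = {"dt": 0.3, "dc": 0.3, "ps": 0.2, "ad": 0.2}


def similarity(u, v):
    """Weighted sum of the four feature similarities, in [0, 1]."""
    return (
        WEIGHTS["dt"] * (u.data_type == v.data_type)
        + WEIGHTS["dc"] * jaccard(u.text, v.text)
        + WEIGHTS["ps"] * (u.style == v.style)
        + WEIGHTS["ad"] * jaccard(u.prev_text, v.prev_text)
    )


def group_units(units, threshold=0.5):
    """Greedy grouping: a unit joins the first group whose first member is similar enough."""
    groups = []
    for u in units:
        for g in groups:
            if similarity(u, g[0]) >= threshold:
                g.append(u)
                break
        else:
            groups.append([u])
    return groups


units = [
    DataUnit("$12.99", "price", "bold-red", "Price:"),
    DataUnit("Data Mining Concepts", "text", "link-blue", ""),
    DataUnit("$8.50", "price", "bold-red", "Price:"),
    DataUnit("Mining Massive Datasets", "text", "link-blue", ""),
]
groups = group_units(units)
```

In this toy input the two prices fall into one group and the two titles into another, even though the price strings share no tokens, because data type, presentation style, and adjacency agree.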



