Journal of Contemporary Management
Submitted on 25/10/2015
Article ID: 1929-0128-2016-01-27-09
ISSNs: 1929-0128 (Print); 1929-0136 (Online), Academic Research Centre of Canada
Salha Alzahrani, Naomie Salim, and Vasile Palade

Framework for Plagiarism Detection Using Logical Tree-Structure Features and Multi-Layer Clustering

Dr. Salha Alzahrani (Corresponding Author)
College of Computers and IT, Taif University
Hawiah, 21944 Taif, Saudi Arabia
E-mail: s.zahrani@tu.edu.sa
Homepage: www.c2learn.com

Prof. Naomie Salim
Faculty of Computing, Universiti Teknologi Malaysia
Skudai, 81310 Johor, Malaysia
E-mail: naomie@utm.com
Homepage: comp.utm.my/naomie/

Prof. Vasile Palade
Faculty of Engineering and Computing, Coventry University
CV1 5FB Priory Street, Coventry, United Kingdom
E-mail: vasile.palade@coventry.ac.uk
Homepage: www.cs.ox.ac.uk/vasile.palade/

Abstract: Different practices of scientific misconduct have appeared recently, which imposes the need for more sophisticated solutions. Logical tree-structure features describe the topology of scientific publications in terms of meaningful parts such as title, abstract, background, methods, results, and references. This paper presents a methodology proposed to uncover plagiarism in scientific publications using structural document features and multi-layer clustering. Logical tree-structure features are extracted as generic classes. Structural components such as paragraphs are organised under these generic classes. Instead of using traditional flat-based plagiarism detection methods, a layer-based clustering approach is proposed to find similar clusters and perform candidate retrieval using the top layer features. The bottom layer features are used to cluster structural components and to detect plagiarism. The suggested framework can be more efficient and reliable in detecting plagiarism in scholarly articles than existing approaches.

Keywords: logical organisation, tree-structure features, clustering, plagiarism detection

JEL Classifications: C00, C82, C890

1. Introduction

The problem of plagiarism in the academic world has increased recently with the gigantic amount of digital resources and open access journals available on the Internet. Universities, publishers and individuals tend to use automatic plagiarism checkers to ensure the integrity of scholarly works. However, there are many ways to enhance the process of plagiarism detection in scientific publications in comparison with the current anti-plagiarism software.

Scientific publications tend to have a consistent structure with subsequent parts. Several studies on information extraction have addressed the structure of scientific publications (Burget, 2007; Hagen et al., 2004; Lee et al., 2003; Li and Ng, 2004; Wang et al., 2005; Witt and Metzing, 2010; Zhang et al., 2006). Segmentation of scholarly documents takes into consideration that the content structure is presented by visual or physical elements, e.g. location, position, punctuation, length, font size or type, etc. Segmentation may also depend on some keywords, e.g. chapter, introduction, etc., to label a specific content. Several studies have defined the logical structure of documents using different terminologies, such as text-type structures (Hagen et al., 2004; Siddharthan and Teufel, 2007; Teufel and Moens, 2002) and generic classes in scholarly papers (Luong et al., 2010). Different components extracted from the document can be generalised under these types/classes. For example, Teufel and Moens (2002) define seven types of text, or argumentative zones according to a so-called rhetorical status, namely Own, Other, Background, Textual, Aim, Basis, and Contrast.

Clustering is the process of grouping together objects or components that tend to have the same or similar features (Manning et al., 2009). Each group of objects is called a cluster. Clustering differs from classification in that we have no prior knowledge of the labels (i.e. names of features) of the resulting clusters, whereas in classification we have a set of specific labels or categories and want to assign each object to one of them. Text clustering aims to discover documents, terms, passages, websites, or any textual elements which share some textual similarity (Bhatia and Deogun, 1998; Manning et al., 2009; Shehata et al., 2010). The similarity perspective of texts can be defined in various ways. For instance, a car and a horse differ physically but are similar in their functionality. Examples of text clustering include clustering of big data collections into smaller sub-collections, term clustering to find shared themes or concepts in a data set, clustering of sentences from larger text objects about a certain topic, and clustering of websites and search results. As an essential technique in text mining and knowledge discovery, text clustering is very useful for exploratory text analysis.
Thus, it can be applied to detect plagiarism, or in other words, to get a sense of highly similar textual elements and duplicates.

This paper addresses the problem of plagiarism in academic publications such as journal articles and conference papers. The contributions of this paper are twofold: (i) the use of logical tree-structure features for document segmentation and representation, and (ii) the use of a clustering-based approach at different layers for plagiarism detection.

The rest of this paper is organised as follows. Section 2 discusses the literature related to textual features and plagiarism detection techniques. Section 3 describes the logical tree-structure feature extraction method. Section 4 describes the suggested algorithms for multi-layer clustering and plagiarism detection. Finally, in Section 5, we give concluding remarks and outline the future work needed to complete the experimental work and accomplish this study.

2. Related Research

State-of-the-art research has addressed textual data features and the techniques applied for plagiarism detection (Alzahrani et al., 2012b; Clough, 2000, 2003). In this section, textual data features are classified into two types: flat features and structural features. Tree-structure features of documents are described in depth. Then, we briefly summarise different plagiarism retrieval tasks and detection approaches. The relationship between tree-structure features and clustering techniques is discussed to bridge a gap that remains highly problematic in academic plagiarism.

2.1 Text features

Feature representation of textual documents can be classified into flat and structural features. Flat features refer to the lexical, syntactic and semantic properties of the text without considering the orientation of these features throughout the document (Alzahrani et al., 2012b). Examples of these features include character/word n-grams, phrases, sentences, part-of-speech (POS) tags, and others.
Journal of Contemporary Management, Vol. 5, No. 1

Structural features, on the other hand, represent the text as a tree with a root node and child nodes distributed in different layers (at least two layers). For example, a document (root node) can be divided into sections and sections into paragraphs (child nodes). Such a representation exhibits better organisation of scientific publications, as they are highly structured. Structural features also represent the semantics of the content better than flat features.

Structural feature extraction can be divided into block-specific and content-specific (Alzahrani et al., 2012a). Block-specific tree-structure feature representation refers to the use of specific markers such as tags or word counters to represent the tree regardless of the sections in the document. A three-level block-specific tree representation was extracted in (Rahman et al., 2007), as shown in Fig. 1 [a]. In (Chow and Rahman, 2009; Rahman and Chow, 2010), a hierarchical document organisation similar to (Rahman et al., 2007) was used but with different dimensions for the feature vectors, as shown in Fig. 1 [b]. Nonetheless, block-specific features could be semantically insufficient to represent topically related content in the document. Therefore, extraction of content-specific tree-structure features would substantially improve the document representation. For example, scientific documents can be partitioned into sections and sections into paragraphs (Alzahrani et al., 2012b). Tree representations such as document-sections-paragraphs or document-concepts-chunks would characterise well semi-structured documents such as books, theses, journal articles and conference papers.
However, content-specific trees impose some challenges: (i) sections have variable length in comparison to, for instance, fixed-length pages in block-specific trees, and (ii) different sections/concepts could have different degrees of importance, which can be exploited for different purposes such as improving document retrieval and plagiarism detection.

Fig. 1. Block-specific tree-structure feature representation of a document (Rahman and Chow, 2010)

2.2 Plagiarism detection

Several research works on plagiarism detection have investigated the development and evaluation of computerised techniques that address this offence. These techniques generally work by scanning two textual documents, computing the degree of similarity, and highlighting highly similar segments as plagiarism. Most plagiarism detection techniques have utilised flat features to represent the textual data (Alzahrani et al., 2012b). Few studies, on the other hand, used structural features for plagiarism detection. For example, a coarse-to-fine framework for plagiarism detection which implements a document-paragraphs-sentences tree for a collection of web documents was proposed in (Zhang and Chow, 2011). In this regard, matching sentences in the bottom layer obtained better precision in the plagiarism detection results compared with the approach in (Rahman et al., 2007). Additionally, structural information has been investigated to detect significant plagiarism cases in scientific publications (Alzahrani et al., 2012a).

A multilayer self-organising map (MLSOM) was used for retrieval of a set of documents similar to a suspected document and for plagiarism detection (Chow and Rahman, 2009). The top layer performs document clustering and retrieval, and the bottom layer plays an important role in detecting similar, potentially plagiarised, paragraphs. Given a query document $d_q$, a tree-structure document partitioning approach was firstly used to construct the tree document-pages-paragraphs. Secondly, feature vectors of the documents were constructed using a vocabulary table and a PCA projection matrix, and used as the input vector $x_i$. Thirdly, neurons in the upper level were matched with $x_i$ to find the most similar neurons, i.e. documents, using Euclidean distance. A set of documents $D_x$ was marked as having global similarity with $d_q$ and used in the next step. Fourthly, the associated nodes of $d_x \in D_x$ in the bottom layer were compared in depth with the third-level nodes of $d_q$ using a paragraph-to-paragraph similarity metric, where the most similar paragraph is the one with the smallest difference.

2.3 Bridging the gap

To sum up, textual features vary from simple lexical features to comprehensive structural features. Two documents having similar word histograms at their root nodes may be completely different in terms of semantics and context. This is because the same set of words can be oriented differently throughout the document, which is reflected by the discriminative lower parts of the tree data. Thus, a tree-structure representation can help to achieve better analysis of documents and better plagiarism detection.

Existing techniques applied to the problem of plagiarism detection do not consider content-specific tree-structure features and multi-layer clustering. In addition, the scope of the current methods (Chow and Rahman, 2009; Rahman et al., 2007) that use block-specific features is limited to literal plagiarism. This research work aims to bridge this gap by using a content-specific tree-structure feature representation that is better than the block-specific one used in (Chow and Rahman, 2009). For this aim, we propose the use of logical feature extraction from scientific documents and multi-layer clustering (i.e. the use of clustering at different layers).
Clustering the root nodes will perform source document retrieval, and clustering at the bottom layers will guide in-depth analysis and plagiarism detection.

3. Logical Tree-Structure Document Model

Scientific publications have a common structure that begins with a title, authors, abstract, keywords, and the body, which splits into several parts/components including headers, paragraphs, lists, tables, captions, quotes, references and so on. In contrast to the bag-of-words-based features used by existing methods (Barrón-Cedeño and Rosso, 2009; Grozea et al., 2009; Kasprzak et al., 2009; Lackes et al., 2009), this work implements a feature extraction method that combines structural information and term information from scientific articles. The following sections discuss the segmentation of scientific articles into structural components, the extraction of the logical tree-structure features, the weighting algorithm for structural components, and the construction of the vocabulary lists. A complete algorithm for the proposed tree-structure feature extraction method (TFEM) used in this study is presented in the last section to sum up the whole approach.

3.1 Component-based segmentation

One of the goals of this study is to capture the semantic organisational features of scientific publications. In this work, we propose a tool and a method for structural component extraction based on the visual layout of the document and the raw text (Luong et al., 2010). The tool works by extracting structural components using visual descriptors and keyword indicators. It can extract different constructs, namely Title, Author, Address, Affiliation, Keywords, and Body. The body contains Equations, Figures, Figure captions, Footnotes, List items, Notes, References, Section headers, Subsection headers, Sub-subsection headers, Tables, and Table captions.

3.2 Logical tree-structure extraction

The use of the tree-structure feature representation facilitates the analysis of scientific articles in a hierarchical, rather than a flat, manner.
As mentioned in Section 2, block-specific tree-structure features such as document-pages-paragraphs (Chow and Rahman, 2009) and document-paragraphs-sentences (Zhang and Chow, 2011) are not sufficient to represent the semantic organisation of scholarly documents. Therefore, we aim to employ a content-specific tree-structure organisation wherein scientific articles are represented in a logical hierarchical tree, namely

document -> generic classes -> structural components

By generic class, we mean a section or a group of sections that serves a unique purpose. We believe that classes convey more semantically related components than pages. To reflect the scientific topology of scholarly documents, we propose the following generic classes:

G = {Title, Owner, Abstract, Introduction, Literature review, Methodology, Evaluation, Conclusions, Acknowledgments and References}

3.3 Component-based weighting

A component weight $\omega_{C,d}$ for a structural component $C$ in a document $d$ can be defined as a quantitative function which measures the weight of $C$ based on the relevance between the terms in $C$ and other structural components (de Moura et al., 2010). In this regard, $\omega_{C,d}$ defines a qualitative importance of a component $C$ in scholarly documents, which can be assigned manually by an expert during the indexing phase of documents. Some methods have been developed (Bounhas and Slimani, 2010; de Moura et al., 2010; Marques Pereira et al., 2005; Marteau et al., 2006) that use typical TF-IDF weighting but with the structural components of documents taken into consideration. In this paper, we use the approach proposed in (Alzahrani et al., 2012a) to compute $\omega_{C,d}$ automatically. Two statistical measures, namely Depth and Spread (Alzahrani et al., 2012a), are adapted, as below.
Spread of a term $t$ in a scholarly document $d$ is the number of structural components in $d$ that contain $t$:

$$\mathrm{Spread}(t,d) = \sum_{C_i \in d} \delta_i, \quad \text{where } \delta_i = \begin{cases} 1 & \text{if } t \in C_i \\ 0 & \text{otherwise} \end{cases} \tag{1}$$

Depth of a term $t$ in a generic class G refers to the frequency of $t$ in G normalized by the maximum frequency in G, such that we do not underestimate classes with few components.

$$\mathrm{Depth}(t,g) = \frac{tf_{t,g}}{MAX_{t',g}} \tag{2}$$

where $tf_{t,g}$ is the term frequency in generic class G (written $g$ in the subscripts), and $MAX_{t',g}$ is the maximum frequency gained by a term $t'$ in G.

Spread-based and Depth-based component-weight factors are defined at component level, as follows:

$$\omega_{C,d} = \frac{\sum_{t \in C} \mathrm{Spread}(t,d)}{|C|} \tag{3}$$

$$\omega_{C,d} = \frac{\sum_{t \in C} \mathrm{Depth}(t,g)}{|C|} \tag{4}$$

where $t$ refers to index terms in a component $C$, $d$ is the article that has $C$, and $|C|$ is the size of $C$. Finally, we combine Depth and Spread into a single factor.

$$\omega_{C,d} = \frac{\sum_{t \in C} \mathrm{Depth}(t,g)\,\mathrm{Spread}(t,d)}{|C|} \tag{5}$$

3.4 Vocabulary building

To build the vocabulary list, three steps need to be done. First is to construct the term frequency table which contains the terms and their occurrence information in structural components in each $d$, as follows:
$$tf'_{t,C} = tf_{t,C}\,\omega_{C,d} \tag{6}$$

$$tf'_{t,d} = \sum_{G \in d} \sum_{C \in G} tf'_{t,C} \tag{7}$$

where $tf_{t,C}$ is the frequency of a term $t$ in a structural component $C$, $\omega_{C,d}$ is the combined component-weight factor given by formula (5), and $tf'_{t,C}$ and $tf'_{t,d}$ are the new frequency measures of terms in $C$ and $d$, combined with the structural information taken from the document.

We construct the term weighting table using the frequency table in a way similar to the VSM model, as follows:

$$w_{t,d} = tf'_{t,d} \cdot \log \frac{|D|}{|\{d \in D : t \in d\}|} \tag{8}$$

where $|D|$ is the total number of documents in the dataset, and $|D_t| = |\{d \in D : t \in d\}|$ is the number of documents in the collection that contain $t$. Then, the vocabulary table T is built, which includes the terms that obtain the top weights. For document features, we will consider 100 terms, while for generic classes G and structural components C, 150 and 200 top-frequency terms will be used, respectively.

3.5 Tree-structure feature extraction

The proposed algorithm for feature extraction is shown in Fig. 2. For all structural components $C$ in each $d$, we will construct the feature vector $f_C$ from the term frequencies computed in (6), as stated in formula (9). Then, the feature vectors for generic classes, called $f_G$, can be obtained as in equation (10).

$$f_C = [\,tf'_{t_1,C},\ tf'_{t_2,C},\ \ldots,\ tf'_{t_n,C}\,] \tag{9}$$

$$f_G = \sum_{C \in G} f_C \tag{10}$$

On the other hand, the document feature vector $f_d$ is constructed by using the weights computed in formula (8), as below.

$$f_d = [\,w_{t_1,d},\ w_{t_2,d},\ \ldots,\ w_{t_n,d}\,] \tag{11}$$

where $n$ is the selected number of top terms to represent the feature vectors in each layer.

Fig. 2. Tree-structure feature extraction method (TFEM)
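The weighting scheme of Section 3.3 and the weighted frequency of equation (6) can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the toy document, the function names, the choice to sum over distinct index terms (so that |C| is the number of distinct terms in C), and the reading of Depth and Spread in (5) as a product are all assumptions made for the example.

```python
from collections import Counter

# Hypothetical toy document: generic class -> list of structural components,
# each component being a list of already-tokenised index terms.
doc = {
    "Abstract": [["plagiarism", "detection", "tree"]],
    "Methodology": [
        ["tree", "features", "clustering"],
        ["clustering", "clustering", "layers"],
    ],
}


def spread(term, doc):
    """Equation (1): number of structural components in the document that contain term."""
    return sum(1 for comps in doc.values() for comp in comps if term in comp)


def depth(term, comps):
    """Equation (2): frequency of term in a generic class over the class maximum."""
    counts = Counter(t for comp in comps for t in comp)
    return counts[term] / max(counts.values())


def component_weight(comp, comps, doc):
    """Equation (5), assuming a product of Depth and Spread averaged over distinct terms."""
    terms = set(comp)  # assumption: |C| counts distinct index terms
    return sum(depth(t, comps) * spread(t, doc) for t in terms) / len(terms)


def weighted_tf(comp, comps, doc):
    """Equation (6): raw term frequency in C scaled by the component weight."""
    weight = component_weight(comp, comps, doc)
    return {t: n * weight for t, n in Counter(comp).items()}


methodology = doc["Methodology"]
second_paragraph = methodology[1]
print(component_weight(second_paragraph, methodology, doc))
print(weighted_tf(second_paragraph, methodology, doc))
```

In this toy example the second Methodology paragraph receives a weight of roughly 1.17, because "clustering" is both frequent within its class (high Depth) and present in two components (Spread of 2). Summing such weighted frequencies over components and classes would give the document-level frequency of equation (7).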
4. Multi-Layer Clustering and Plagiarism Detection

In plagiarism detection research, we deal with two sets of documents: a source collection D and query documents Q. In this study, both sets are represented as content-specific tree-structure features. The proposed framework includes three main steps:

Step 1: Clustering at the top layer. Clustering is performed at the top layer based on the document features $f_d$. The aim of this step is to find, for each query document $d_q \in D_q$ (where $D_q$ denotes the query set Q), a subset of the document collection $D_x \subset D$ which is considerably smaller than D.

Step 2: Clustering at the middle layer. For each query document $d_q$, we use only the set of relatively similar documents $D_x$ obtained from Step 1. Then, clustering on $D_x$ is performed at the middle layer based on the generic class features $f_G$. The aim of this step is to find similar sections or subjects between documents (i.e. generic classes), and mark them for further analysis.

Step 3: Clustering at the bottom layer and plagiarism detection. This step aims to find all suspicious components $C_q$ in $d_q$, $d_q \in D_q$, which are plagiarised from structural components $C_x$ in $d_x$, $d_x \in D_x$, using the structural component-based comparison algorithm explained below.

Clustering in the top and middle layers can be achieved using general text clustering techniques such as generative probabilistic models, agglomerative hierarchical clustering (Bhatia and Deogun, 1998), and K-means clustering algorithms (Manning et al., 2009). Then, to find the cluster that is most likely to contain the set of source documents, we will use the cosine similarity. The similarity between the centre $\mu_j$ of each cluster $j$ and the query $d_q$ can be calculated as follows:

$$\mathrm{Sim}(\mu_j, d_q) = \frac{\mu_j \cdot d_q}{|\mu_j|\,|d_q|} = \frac{\sum_{i=1}^{n} w_{t_i,j}\, w_{t_i,q}}{\sqrt{\sum_{i=1}^{n} w_{t_i,j}^{2}}\ \sqrt{\sum_{i=1}^{n} w_{t_i,q}^{2}}} \tag{12}$$

In the last step, detailed analysis and similarity calculation are performed to find the structural components that are highly similar. Further analysis by humans may be needed to distinguish plagiarised components from properly cited ones.
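The cluster-selection rule of equation (12) can be illustrated with a short Python sketch. It assumes that clustering has already been carried out by some method and that each cluster is simply a list of document feature vectors; the function names, the use of the arithmetic mean as the cluster centre, and the toy vectors are hypothetical and serve only to show how the query is matched against cluster centres.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors, as in equation (12)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


def centroid(vectors):
    """Cluster centre, assumed here to be the component-wise mean of the members."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def closest_cluster(clusters, query):
    """Name of the cluster whose centre is most similar to the query vector."""
    return max(clusters, key=lambda name: cosine(centroid(clusters[name]), query))


# Hypothetical clusters of document feature vectors f_d and one query vector.
clusters = {
    "a": [[1.0, 0.0, 0.0], [0.8, 0.2, 0.0]],
    "b": [[0.0, 1.0, 1.0], [0.0, 0.9, 1.1]],
}
query = [0.0, 1.0, 0.9]
print(closest_cluster(clusters, query))
```

The documents of the selected cluster would then play the role of $D_x$ for that query, so that only they are carried forward to the middle-layer and bottom-layer comparisons.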
For this comparison, the associated nodes of $d_x \in D_x$ in the bottom layer will be compared component-to-component with the feature vectors of the third layer of $d_q$. By components we generally mean paragraphs. The similarity between the feature vectors of structural components can be calculated using the vector difference. The most similar paragraph is the one with the smallest difference, as stated by the equation below.

$$PD(d_q, d_x) = \Big( \min_{C_x \in d_x} \lVert f_{C_q} - f_{C_x} \rVert \Big), \quad \forall\, C_q \in d_q \tag{13}$$

where $f_{C_q}$ and $f_{C_x}$ are the paragraph features for documents $d_q$ and $d_x$.

5. Conclusion and Future Work

Plagiarism in scientific publications is addressed in this paper. We propose a rough-to-fine framework for feature extraction, namely logical content-specific tree-structure features, wherein structural components are organised under generic classes. Clustering is suggested at different layers to achieve document retrieval and plagiarism detection. The suggested methods and algorithms are intended to support a better understanding of the semantic content and exploratory analysis of scientific publications. Future work includes the construction of a ground-truth dataset of scientific documents, taking into account an accurate XML tree representation. Experiments should be performed on the dataset to evaluate the proposed framework. More in-depth analysis of structural components should be performed, and information visualization methods can be used for highlighting plagiarism in a way that is different from other types of documents.
References

[1] Alzahrani, S., et al. (2012a). "Using structural information and citation evidence to detect significant plagiarism cases". Journal of the American Society for Information Science and Technology (JASIST), 63(2): 286-312.
[2] Alzahrani, S. M., Salim, N., and Abraham, A. (2012b). "Understanding Plagiarism Linguistic Patterns, Textual Features and Detection Methods". IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, 42(2): 133-149.
[3] Barrón-Cedeño, A., and Rosso, P. (2009). "On automatic plagiarism detection based on n-grams comparison". Advances in Information Retrieval (pp. 696-700). DOI: 10.1007/978-3-642-00958-7_69.
[4] Bhatia, S. K., and Deogun, J. S. (1998). "Conceptual clustering in information retrieval". IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, 28(3): 427-436.
[5] Bounhas, I., and Slimani, Y. (2010). "A hierarchical approach for semi-structured document indexing and terminology extraction". Paper presented at the International Conference on Information Retrieval and Knowledge Management, CAMP'10, Selangor, Malaysia.
[6] Burget, R. (2007). "Automatic Document Structure Detection for Data Integration". In: W. Abramowicz (Ed.), Business Information Systems (Vol. 4439, pp. 391-397): Springer Berlin / Heidelberg.
[7] Chow, T. W. S., and Rahman, M. K. M. (2009). "Multilayer SOM with tree-structured data for efficient document retrieval and plagiarism detection". IEEE Transactions on Neural Networks, 20(9): 1385-1402.
[8] Clough, P. (2000). "Plagiarism in Natural and Programming Languages: An Overview of Current Tools and Technologies". Department of Computer Science, University of Sheffield, UK, Technical Report CS-00-05.
[9] Clough, P. (2003). "Old and new challenges in automatic plagiarism detection". National UK Plagiarism Advisory Service [Online] Available at http://ir.shef.ac.uk/cloughie/papers/pas_plagiarism.pdf.
[10] de Moura, E. S., et al. (2010). "Using structural information to improve search in Web collections". Journal of the American Society for Information Science and Technology, 61(12): 2503-2513. DOI: 10.1002/asi.21436.
[11] Grozea, C., Gehl, C., and Popescu, M. (2009). "ENCOPLOT: Pairwise sequence matching in linear time applied to plagiarism detection". Paper presented at the 25th Conference of the Spanish Society for Natural Language Processing, SEPLN'09, Donostia, Spain.
[12] Hagen, L., Harald, L., and Petra Saskia, B. (2004). "Text type structure and logical document structure". Paper presented at the ACL Workshop on Discourse Annotation, Barcelona, Spain.
[13] Kasprzak, J., Brandejs, M., and Křipač, M. (2009). "Finding Plagiarism by Evaluating Document Similarities". Paper presented at the 25th Conference of the Spanish Society for Natural Language Processing, SEPLN'09, Donostia, Spain.
[14] Lackes, R., Bartels, J., Berndt, E., and Frank, E. (2009). "A word-frequency based method for detecting plagiarism in documents". Paper presented at the International Conference on Information Reuse and Integration, IRI'09, Las Vegas, NV.
[15] Lee, K. H., Choy, Y. C., and Cho, S. B. (2003). "Logical structure analysis and generation for structured documents: A syntactic approach". IEEE Transactions on Knowledge and Data Engineering, 15(5): 1277-1294.
[16] Li, Z., and Ng, W. K. (2004). "WICCAP: From semi-structured data to structured data". Paper presented at the 11th IEEE International Conference and Workshop on the Engineering of Computer-Based Systems, ECBS'04, Brno, Czech Republic.
[17] Luong, M.-T., Nguyen, T. D., and Kan, M.-Y. (2010). "Logical structure recovery in scholarly articles with rich document features". International Journal of Digital Library Systems (IJDLS), 1(4): 1-23.
[18] Manning, C. D., Raghavan, P., and Schütze, H. (2009). "Flat Clustering". In: Introduction to Information Retrieval (pp. 350-374): Cambridge University Press.
[19] Marques Pereira, R. A., Molinari, A., and Pasi, G. (2005). "Contextual weighted representations and indexing models for the retrieval of HTML documents". Soft Computing, 9(7): 481-492.
[20] Marteau, P.-F., Ménier, G., and Popovici, E. (2006). "Weighted Naïve Bayes model for semi-structured document categorization". Paper presented at the 1st International Conference on Multidisciplinary Information Sciences and Technologies, InSciT2006, Merida, Espagne.
[21] Rahman, M. K. M., and Chow, T. W. S. (2010). "Content-based hierarchical document organization using multi-layer hybrid network and tree-structured features". Expert Systems with Applications, 37(4): 2874-2881.
[22] Rahman, M. K. M., Yang, W. P., Chow, T. W. S., and Wu, S. (2007). "A flexible multi-layer self-organizing map for generic processing of tree-structured data". Pattern Recognition, 40(5): 1406-1424.
[23] Shehata, S., Karray, F., and Kamel, M. (2010). "An efficient concept-based mining model for enhancing text clustering". IEEE Transactions on Knowledge and Data Engineering, 22(10): 1360-1371.
[24] Siddharthan, A., and Teufel, S. (2007). "Whose idea was this, and why does it matter? Attributing scientific work to citations". Paper presented at the Conference of the North American Chapter of the Association for Computational Linguistics (NAACL HLT 2007), New York, USA.
[25] Teufel, S., and Moens, M. (2002). "Summarizing scientific articles: Experiments with relevance and rhetorical status". Computational Linguistics, 28(4): 409-445.
[26] Wang, Z. Q., Wang, Y. C., and Gao, K. (2005). "A New Model of Document Structure Analysis". In: Fuzzy Systems and Knowledge Discovery (Vol. 3614, pp. 658-666): Springer Berlin, Heidelberg.
[27] Witt, A., and Metzing, D. (2010). "Discourse Relations and Document Structure". In: N. Ide, J. Véronis, H. Baayen, K. W. Church, J. Klavans, D. T. Barnard, D. Tufis, J. Llisterri, S. Johansson & J. Mariani (Eds.), Linguistic Modeling of Information and Markup Languages (Vol. 40, pp. 97-123): Springer Netherlands.
[28] Zhang, H., and Chow, T. W. S. (2011). "A coarse-to-fine framework to efficiently thwart plagiarism". Pattern Recognition, 44(2): 471-487.
[29] Zhang, K., Wu, G., and Li, J. (2006). "Logical structure based semantic relationship extraction from semi-structured documents". Paper presented at the 15th International Conference on World Wide Web, Edinburgh, Scotland.