Categorical and geographical separation in science
Janusz A. Hołyst, Julian Sienkiewicz, Krzysztof Soja, Peter M. A. Sloot
Faculty of Physics, Warsaw University of Technology
Computational Science, University of Amsterdam, Amsterdam, The Netherlands
9th November 2016
Motivation & Data

Motivation
This study tries to answer two basic questions:
1. Can one indicate a category (i.e., a scientific discipline) that has the greatest impact on the rank of a university?
2. Do the best universities collaborate only with the best ones?

Data description
- Web of Science (WoS) data from the years 2004–2010,
- around 1,200,000 reliable records (articles),
- for each paper, information about the authors' affiliations and the Web of Science category,
- categories have been aggregated (e.g., "Physics, applied" → "Physics").
Correlation analysis
For each of the 180 WoS categories we calculate the Spearman rank correlation ρ between the rank r_i of the university (100 best universities from the QS 2009 ranking) and the number of papers N published by this university in the given category.

Category                                   N        ρ
Agriculture                             6681    0.043
Biochemistry and Molecular Biology     59297   -0.456 ***
Chemistry                             156631   -0.383 ***
Classics                                1245   -0.182 .
Computer Science                       69375   -0.309 **
Geography                               7055   -0.083
Geology                                 2912   -0.118
Mathematics                            28228   -0.444 ***
Multidisciplinary Sciences             25490   -0.593 ***
Physics                               184647   -0.463 ***

Significance codes: *** p < 0.001, ** 0.001 < p < 0.01, * 0.01 < p < 0.05, . 0.05 < p < 0.1
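As an illustration of the quantity in the table, the sketch below computes a Spearman coefficient as the Pearson correlation of average ranks. It uses only the Python standard library; the function names and the toy numbers are assumptions made for this example and are not taken from the study.

```python
def average_ranks(values):
    """Rank values from 1..n; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the block while the next value is tied with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1  # positions are 0-based, ranks 1-based
        for idx in order[pos:end + 1]:
            ranks[idx] = mean_rank
        pos = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical toy data: ranking positions and paper counts in one category.
university_rank = [1, 2, 3, 4, 5]
paper_counts = [900, 700, 750, 300, 100]
rho = spearman_rho(university_rank, paper_counts)
```

A negative value is the expected outcome here, because the best universities carry the smallest rank numbers. The significance levels in the table would need an additional test, which this sketch leaves out.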
Correlation analysis
Correlation vs category size
- For the majority of categories we observe the (expected) negative correlation: better universities have smaller rank numbers and publish more papers,
- there are, however, categories with a significant number of papers that show a negligible correlation ρ.
[Figure: correlation ρ versus number of papers N (logarithmic N axis, 10^2 to 10^5); points are marked by statistical significance: 0.1 < p < 1, 0.05 < p < 0.1, 0.01 < p < 0.05, 0.001 < p < 0.01, p < 0.001.]
Principal components analysis (PCA)
How to identify key categories and their correlations? One option is to perform Principal Component Analysis (PCA):
- PCA finds new directions in space (new coordinates) that maximize the variance of the projected observations,
- the new directions are called components,
- in certain cases it allows for dimension reduction, if the first few components explain a significant part of the variance,
- PCA gives some options for the interpretation of the components.
[Figure: two-dimensional example (Test 1 vs Test 2) with original axes x_1, x_2 and rotated axes y_1, y_2; panels "Data", "Variance" and "Components" show the variances of Comp.1 and Comp.2 and the data in the coordinates of the 1st and 2nd component.]
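The two-dimensional case on this slide can be worked out by hand, since a 2x2 covariance matrix has closed-form eigenvalues. The sketch below is only an illustration under that assumption: `pca_2d` and the sample points are hypothetical, and an analysis with 10 categories would normally rely on a numerical linear algebra library instead.

```python
import math


def pca_2d(points):
    """PCA for 2-D observations via the closed-form eigen-decomposition
    of the 2x2 sample covariance matrix [[a, b], [b, c]]."""
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    a = sum((p[0] - mean_x) ** 2 for p in points) / (n - 1)
    c = sum((p[1] - mean_y) ** 2 for p in points) / (n - 1)
    b = sum((p[0] - mean_x) * (p[1] - mean_y) for p in points) / (n - 1)
    half_trace = (a + c) / 2
    radius = math.hypot((a - c) / 2, b)
    # Eigenvalues = variances along the 1st and 2nd component.
    variances = (half_trace + radius, half_trace - radius)
    # Eigenvector belonging to the larger eigenvalue.
    if b != 0:
        vx, vy = b, variances[0] - a
    elif a >= c:
        vx, vy = 1.0, 0.0
    else:
        vx, vy = 0.0, 1.0
    norm = math.hypot(vx, vy)
    first = (vx / norm, vy / norm)
    second = (-first[1], first[0])  # orthogonal to the first component
    return variances, first, second


# Hypothetical, strongly correlated observations (e.g. two test scores).
points = [(-2.0, -1.9), (-1.0, -1.1), (0.0, 0.1), (1.0, 0.9), (2.0, 2.0)]
variances, first, second = pca_2d(points)
explained = variances[0] / (variances[0] + variances[1])
```

For these points almost all of the variance falls on the first component, which is the situation in which dropping the second component loses little information.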
Principal components analysis (PCA)
- The PCA can be restricted to the first 3 components (around 90% of the variance explained; we use the 10 largest categories),
- separate directions can be connected to principal components.
[Figure: cumulative standard deviation versus component number; projections of the universities and of the 10 categories (Biochemistry and Molecular Biology, Biology, Chemistry, Physics, Materials Science, Neurosciences, Medicine, Engineering, Computer Science, Psychology) onto the 1st, 2nd and 3rd component; annotated directions: "Technical sciences", "Medicine sciences", "Fundamental sciences".]
Network representation
Network construction
- One can also analyze direct connections between any pair of universities i and j,
- a convenient way is to use the collaboration matrix C_ij: each element gives the number of common publications of i and j.

Algorithm of C_ij construction
1. Take the 100 highest ranked universities (u_1, u_2, ..., u_100).
2. For u_1, find its publications p^1_1, p^1_2, ..., p^1_M.
3. If any co-author of p^1_1 comes from one of the u_j (j = 2, ..., 100), set w_1j = 1.
4. Increase the weight w_1j by one each time u_j is found among the publications of u_1.
5. Move on to u_2, return to step 2, and so on.
6. The outcome is a fully connected graph with weights w_ij.
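The construction above can be sketched as a single pass over the papers rather than over the universities, which should yield the same symmetric counts. This is a minimal illustration, not the authors' code: the record format (one list of affiliation names per paper), the function name and the toy data are all assumptions.

```python
from collections import defaultdict
from itertools import combinations


def collaboration_weights(papers, universities):
    """Count, for every pair of selected universities, the papers they share.

    `papers` is an iterable of affiliation lists (one list per paper).
    Returns a dict mapping frozenset({u_i, u_j}) -> w_ij.
    """
    selected = set(universities)
    weights = defaultdict(int)
    for affiliations in papers:
        # Each selected university counts once per paper, however many
        # of its authors appear on it.
        present = sorted(selected.intersection(affiliations))
        for u_i, u_j in combinations(present, 2):
            weights[frozenset((u_i, u_j))] += 1
    return dict(weights)


# Hypothetical records: each paper is the list of its authors' affiliations.
papers = [
    ["A", "B"],
    ["A", "B", "C"],
    ["B", "D"],       # "D" is not among the selected universities
    ["A", "A", "C"],  # two authors from "A" still count as one paper
]
weights = collaboration_weights(papers, ["A", "B", "C"])
```

Pairs without any common paper simply do not appear in the dictionary, which corresponds to w_ij = 0.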
Network representation
Threshold concept
- To analyze the community structure among the universities we use the concept of a weight threshold,
- we keep only the connections with a weight higher than a certain threshold w_T,
- this procedure can be performed for an arbitrary value of w_T, thus creating a set of unweighted networks.
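A minimal sketch of the thresholding step, assuming the weights are stored in a dictionary keyed by university pairs. The weights are invented for the illustration; the thresholds are the four values that appear on the geographical slides.

```python
def threshold_network(weights, w_t):
    """Return the unweighted edge set: pairs whose weight exceeds w_t."""
    return {pair for pair, w in weights.items() if w > w_t}


# Hypothetical weights w_ij for four universities.
weights = {
    ("A", "B"): 1300,
    ("A", "C"): 620,
    ("B", "C"): 240,
    ("C", "D"): 900,
}

# One unweighted network per threshold value.
networks = {w_t: threshold_network(weights, w_t) for w_t in (250, 600, 800, 1200)}
```

Raising w_T can only remove edges, so the networks form a nested sequence.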
Network representation
Graphs and their properties
- A graph (or a network) is a structure that consists of a set of nodes (vertices) connected by edges (links),
- here we consider undirected edges,
- the most fundamental feature of a node is its degree, i.e., the number of its neighbors.
[Figure: panels "Nodes & edges" and "Node degree k_i" for a five-node example graph with k_1 = 3, k_2 = 1, k_3 = 2, k_4 = 2, k_5 = 0.]
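The degrees in the example figure can be reproduced with a few lines of code. The edge list below is an assumption: it is one set of edges consistent with the degree values printed on the slide, since the drawing itself is not available in the text.

```python
def node_degrees(nodes, edges):
    """Return {node: degree} for an undirected graph given as an edge list."""
    degree = {node: 0 for node in nodes}
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return degree


nodes = [1, 2, 3, 4, 5]
# Assumed edge list, chosen to match the degrees shown on the slide.
edges = [(1, 2), (1, 3), (1, 4), (3, 4)]
degrees = node_degrees(nodes, edges)
```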
Network representation
Graphs and their properties
- Each pair of nodes that belong to the same connected part of the graph is linked by a shortest path (i.e., the minimal number of hops needed to get from one node to the other),
- the clustering coefficient gives the ratio of the number of existing edges among the neighbors of node i to the total possible number of edges among them,
- averaging these values over all nodes (or all pairs of nodes) yields global measures that characterize the whole network.
[Figure: panels "Shortest path l_ij" (example: l_23 = 2) and "Clustering coefficient C_i" (example: C_1 = 1/3, with existing and possible edges marked).]
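Both quantities can be checked on the five-node example. The sketch below uses breadth-first search for the shortest path and a direct count for the clustering coefficient; the edge list is again an assumption chosen to agree with the values l_23 = 2 and C_1 = 1/3 on the slide, and the helper names are hypothetical.

```python
from collections import deque
from itertools import combinations


def build_adjacency(nodes, edges):
    """Adjacency sets of an undirected graph."""
    adj = {node: set() for node in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def shortest_path_length(adj, source, target):
    """Number of hops on a shortest path, or None if target is unreachable."""
    distance = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            return distance[node]
        for neighbor in adj[node]:
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    return None


def clustering_coefficient(adj, node):
    """Existing edges among the neighbors divided by the possible ones."""
    neighbors = sorted(adj[node])
    k = len(neighbors)
    if k < 2:
        return 0.0  # convention used here for nodes with fewer than 2 neighbors
    existing = sum(1 for u, v in combinations(neighbors, 2) if v in adj[u])
    return existing / (k * (k - 1) / 2)


# Assumed edge list, consistent with the values shown on the slide.
adj = build_adjacency([1, 2, 3, 4, 5], [(1, 2), (1, 3), (1, 4), (3, 4)])
l_23 = shortest_path_length(adj, 2, 3)
c_1 = clustering_coefficient(adj, 1)
```

Node 5 is isolated in this example, so `shortest_path_length(adj, 1, 5)` returns `None`; averages of the path length are then usually taken over reachable pairs only.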
Network representation
Graphs and their properties
- To see the diversity of node degrees one can use the entropy S of the node degree distribution (a global measure); very low entropy means that almost all nodes have the same degree, while high entropy indicates a broad spread of degrees,
- to check how the nodes connect (high degree with high degree, or low degree with high degree) one uses the assortativity coefficient r (the Pearson correlation coefficient of the degrees at the two ends of an edge, taken over all edges).

Entropy
$$S = -\sum_{k=k_{\min}}^{k_{\max}} p(k) \ln p(k)$$

Assortativity
$$r = \frac{\frac{1}{E}\sum_i j_i k_i - \left[\frac{1}{2E}\sum_i (j_i + k_i)\right]^2}{\frac{1}{2E}\sum_i \left(j_i^2 + k_i^2\right) - \left[\frac{1}{2E}\sum_i (j_i + k_i)\right]^2}$$

Here p(k) is the fraction of nodes with degree k, E is the number of edges, and j_i, k_i are the degrees of the two nodes joined by edge i.
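The two formulas translate into code fairly directly. The following sketch assumes an undirected graph given as an edge list and a separate list of node degrees; the function names and the small test graphs are illustrative and not part of the original analysis.

```python
import math
from collections import Counter


def degree_entropy(degrees):
    """Entropy S = -sum_k p(k) ln p(k) of the degree distribution."""
    counts = Counter(degrees)
    n = len(degrees)
    return -sum((c / n) * math.log(c / n) for c in counts.values())


def assortativity(edges):
    """Assortativity r: Pearson correlation of the degrees at edge ends."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    e = len(edges)
    sum_jk = sum(degree[u] * degree[v] for u, v in edges)
    sum_j_plus_k = sum(degree[u] + degree[v] for u, v in edges)
    sum_squares = sum(degree[u] ** 2 + degree[v] ** 2 for u, v in edges)
    mean_squared = (sum_j_plus_k / (2 * e)) ** 2
    denominator = sum_squares / (2 * e) - mean_squared
    if denominator == 0:
        return float("nan")  # undefined when all edge ends have equal degree
    return (sum_jk / e - mean_squared) / denominator


# A star (one hub, three leaves) is maximally disassortative.
r_star = assortativity([(0, 1), (0, 2), (0, 3)])

# Hypothetical five-node graph with degrees 3, 1, 2, 2 and one isolated node.
r_example = assortativity([(1, 2), (1, 3), (1, 4), (3, 4)])
s_example = degree_entropy([3, 1, 2, 2, 0])
```

In the star every edge joins the hub to a leaf, so high-degree nodes connect only to low-degree ones and r reaches its lower limit.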
Network representation
Graphs and their properties
- Owing to the weight threshold we can observe how the network properties change with increasing w_T.
[Figure: six panels showing, as functions of the threshold w_T (from 0 to 1400): N(w_T), E(w_T) (logarithmic axis), clustering coefficient C(w_T), assortativity r(w_T), normalized entropy S/S_max(w_T) and average shortest path <l>(w_T).]
Geographical representation
Moving from abstract to real world: threshold w_T = 250
[Figure: two panels, "Network" (Pajek drawing) and "Geographical projection"; the nodes are labelled with university names.]
Geographical representation
Moving from abstract to real world: threshold w_T = 600
[Figure: two panels, "Network" (Pajek drawing) and "Geographical projection"; the nodes are labelled with university names.]
Geographical representation
Moving from abstract to real world: threshold w_T = 800
[Figure: two panels, "Network" (Pajek drawing) and "Geographical projection"; the nodes are labelled with university names.]
Geographical representation
Moving from abstract to real world: threshold w_T = 1200
[Figure: two panels, "Network" (Pajek drawing) and "Geographical projection"; the nodes are labelled with university names.]
Geographical representation
- Geographical distance seems to be a key factor,
- this is confirmed by looking at the number of common publications as a function of distance.
[Figure: number of common publications w_AB versus distance d_AB between universities A and B, on logarithmic axes (w_AB and d_AB both from 10^0 to 10^4).]
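The slides do not state how the distance d_AB was measured. One common choice, assumed here purely for illustration, is the great-circle distance computed from latitude and longitude with the haversine formula; the function name and the test coordinates are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def great_circle_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    h = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    # min() guards against rounding slightly above 1 for antipodal points.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(h)))


# Two points on the equator, a quarter of the way around the globe apart.
quarter = great_circle_km(0.0, 0.0, 0.0, 90.0)
```

Pairing such distances with the weights w_AB for all university pairs would give the scatter of points shown in the figure.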
Conclusions & Outlook
- The studied data show some signs of separation with respect to scientific category; however, one needs to be careful, as WoS is strongly biased towards the sciences and probably underrepresents the humanities,
- in some categories there is a higher (anti)correlation between the rank of the university and the number of published papers,
- universities tend to foster collaboration on a national level (or simply by means of geographical proximity).

The preprint can be found at arXiv:1307.0788