regex - Parse Wikipedia Infobox with Go?


I am trying to parse the infobox of Wikipedia articles and cannot seem to figure it out. I have downloaded the files, and for Albert Einstein my attempt to parse the infobox looks like this:

package main

import (
    "log"
    "regexp"
)

func main() {
    st := `{{redirect|einstein|other uses|albert einstein (disambiguation)|and|einstein (disambiguation)}}
        {{pp-semi-indef}}
        {{pp-move-indef}}
        {{good article}}
        {{infobox scientist
        | name       = albert einstein
        | image       = einstein 1921 f schmutzer - restoration.jpg
        | caption     = albert einstein in 1921
        | birth_date  = {{birth date|df=yes|1879|3|14}}
        | birth_place = [[ulm]], [[kingdom of württemberg]], [[german empire]]
        | death_date  = {{death date , age|df=yes|1955|4|18|1879|3|14}}
        | death_place = {{nowrap|[[princeton, new jersey]], u.s.}}
        | children    = [[lieserl einstein|"lieserl"]] (1902–1903?)<br />[[hans albert einstein|hans albert]] (1904–1973)<br />[[eduard einstein|eduard "tete"]] (1910–1965)
        | spouse      = [[mileva marić]]&nbsp;(1903–1919)<br />{{nowrap|[[elsa löwenthal]]&nbsp;(1919–1936)}}
        | residence   = germany, italy, switzerland, austria (today: [[czech republic]]), belgium, united states
        | citizenship = {{plainlist|
        * [[kingdom of württemberg]] (1879–1896)
        * [[statelessness|stateless]] (1896–1901)
        * [[switzerland]] (1901–1955)
        * austria of [[austro-hungarian empire]] (1911–1912)
        * germany (1914–1933)
        * united states (1940–1955)
        }}
        | ethnicity  = jewish
        | fields    = [[physics]], [[philosophy]]
        | workplaces = {{plainlist|
        * [[swiss patent office]] ([[bern]]) (1902–1909)
        * [[university of bern]] (1908–1909)
        * [[university of zurich]] (1909–1911)
        * [[karl-ferdinands-universität|charles university in prague]] (1911–1912)
        * [[eth zurich]] (1912–1914)
        * [[prussian academy of sciences]] (1914–1933)
        * [[humboldt university of berlin]] (1914–1917)
        * [[kaiser wilhelm institute]] (director, 1917–1933)
        * [[german physical society]] (president, 1916–1918)
        * [[leiden university]] (visits, 1920–)
        * [[institute advanced study]] (1933–1955)
        * [[caltech]] (visits, 1931–1933)
        }}
        | alma_mater = {{plainlist|
        * [[eth zurich|swiss federal polytechnic]] (1896–1900; b.a., 1900)
        * [[university of zurich]] (ph.d., 1905)
        }}
        | doctoral_advisor  = [[alfred kleiner]]
        | thesis_title      = eine neue bestimmung der moleküldimensionen (a new determination of molecular dimensions)
        | thesis_url        = http://e-collection.library.ethz.ch/eserv/eth:30378/eth-30378-01.pdf
        | thesis_year       = 1905
        | academic_advisors = [[heinrich friedrich weber]]
        | influenced  = {{plainlist|
        * [[ernst g. straus]]
        * [[nathan rosen]]
        * [[leó szilárd]]
        }}
        | known_for = {{plainlist|
        * [[general relativity]] , [[special relativity]]
        * [[photoelectric effect]]
        * ''[[mass–energy equivalence|e=mc<sup>2</sup>]]''
        * theory of [[brownian motion]]
        * [[einstein field equations]]
        * [[bose–einstein statistics]]
        * [[bose–einstein condensate]]
        * [[gravitational wave]]
        * [[cosmological constant]]
        * [[classical unified field theories|unified field theory]]
        * [[epr paradox]]
        }}
        | awards = {{plainlist|
        * [[barnard medal meritorious service science|barnard medal]] (1920)
        * [[nobel prize in physics]] (1921)
        * [[matteucci medal]] (1921)
        * [[formemrs]] (1921)<ref name="frs" />
        * [[copley medal]] (1925)<ref name="frs" />
        * [[max planck medal]] (1929)
        * [[time 100: important people of century|''time'' person of century]] (1999)
        }}
        | signature = albert einstein signature 1934.svg
    }}
    '''albert einstein''' ({{ipac-en|ˈ|aɪ|n|s|t|aɪ|n}};<ref>{{cite book|last=wells|first=john|authorlink=john c. wells|title=longman pronunciation dictionary|publisher=pearson longman|edition=3rd|date=april 3, 2008|isbn=1-4058-8118-6}}</ref> {{ipa-de|ˈalbɛɐ̯t ˈaɪnʃtaɪn|lang|albert einstein german.ogg}}; 14 march 1879&nbsp;– 18 april 1955) german-born<!-- please not change this—see talk page , many archives.-->
    [[theoretical physicist]]. developed [[general theory of relativity]], 1 of 2 pillars of [[modern physics]] (alongside [[quantum mechanics]]).<ref name=frs>{{cite journal | last1 = whittaker | first1 = e. | authorlink = e. t. whittaker| doi = 10.1098/rsbm.1955.0005 | title = albert einstein. 1879–1955 | journal = [[biographical memoirs of fellows of royal society]] | volume = 1 | pages = 37–67 | date = 1 november 1955| jstor = 769242}}</ref><ref name="yanghamilton2010">{{cite book|author1=fujia yang|author2=joseph h. hamilton|title=modern atomic , nuclear physics|date=2010|publisher=world scientific|isbn=978-981-4277-16-7}}</ref>{{rp|274}} einstein's work known influence on [[philosophy of science]].<ref>{{citation |title=einstein's philosophy of science |url=http://plato.stanford.edu/entries/einstein-philscience/#intwaseinepiopp |we......
    `

    re := regexp.MustCompile(`{{infobox(?s:.*?)}}`)
    log.Println(re.FindAllStringSubmatch(st, -1))
}

I am trying to put each of the items in the infobox into a struct or a map:

m["name"] = "albert einstein"
m["image"] = "einstein...."
...
...
m["death_date"] = "{{death date , age|df=yes|1955|4|18|1879|3|14}}"
...
...

I can't even seem to isolate the infobox. I get:

[[{{infobox scientist
        | name       = albert einstein
        | image       = einstein 1921 f schmutzer - restoration.jpg
        | caption     = albert einstein in 1921
        | birth_date  = {{birth date|df=yes|1879|3|14}}]]

The Albert Einstein entry in the API can be found at:

https://en.wikipedia.org/w/api.php?action=query&titles=albert%20einstein&prop=revisions&rvprop=content&format=json 

Edit:

Based on the accepted answer to another question, I tried the following regex:

(?=\{infobox)(\{([^{}]|(?1))*\}) 

But I get:

panic: regexp: Compile(`(?=\{infobox)(\{([^{}]|(?1))*\})`): error parsing regexp: invalid or unsupported Perl syntax: `(?=`

Edit #2: If there's a way to extract this information via the API, I'll take that. I've been reading through the docs and can't find it.
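The panic above arises because Go's regexp package uses RE2 syntax, which has neither lookahead (`(?=...)`) nor recursive subpatterns (`(?1)`). One way to isolate the infobox without them is to count brace depth by hand. The following is a minimal sketch, not a full wikitext parser: `extractInfobox` is a hypothetical helper name, and it assumes the lowercase `{{infobox` marker from the sample text above and that templates only ever open with `{{` and close with `}}`.

```go
package main

import (
	"fmt"
	"strings"
)

// extractInfobox (hypothetical name) returns the text from "{{infobox"
// through its matching "}}", or "" if no balanced infobox is found.
// Assumption: the marker is lowercase, as in the sample text.
func extractInfobox(text string) string {
	start := strings.Index(text, "{{infobox")
	if start < 0 {
		return ""
	}
	depth := 0
	for i := start; i < len(text)-1; i++ {
		switch text[i : i+2] {
		case "{{":
			depth++
			i++ // skip the second brace of the pair
		case "}}":
			depth--
			i++ // i now points at the second closing brace
			if depth == 0 {
				return text[start : i+1]
			}
		}
	}
	return "" // braces never balanced
}

func main() {
	st := `{{good article}}
{{infobox scientist
| name       = albert einstein
| birth_date  = {{birth date|df=yes|1879|3|14}}
}}
'''albert einstein''' ...`
	fmt.Println(extractInfobox(st))
}
```

Unlike the lazy `.*?` pattern, which stops at the first `}}` it sees, the depth counter only returns once the braces opened by `{{infobox` are closed again.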

I made a regex that might work for you:

^\s*\|\s*([^\s]+)\s*=\s*(\{\{plainlist\|(?:\n\s*\*.*)*|.*)

Explanation

  • This part: ^\s*\|\s*([^\s]+)\s*=\s* matches the start of lines like:

        | <the_label> =  
  • Continuing on the same line, this part: (\{\{plainlist\|(?:\n\s*\*.*)*|.*) will match lists:

        {{plainlist|
        * [[ernst g. straus]]
        * [[nathan rosen]]
        * [[leó szilárd]]

(Note that it may omit the final }}. Oh well.)

  • If there is no list, it matches until the end of the line.
