regex - Parse Wikipedia Infobox with Go?
I am trying to parse the infobox from Wikipedia articles and cannot seem to figure it out. I have downloaded the files, and for Albert Einstein my attempt to parse the infobox looks like this:
package main

import (
	"log"
	"regexp"
)

func main() {
	// Raw wikitext of the article (truncated).
	st := `{{redirect|einstein|other uses|albert einstein (disambiguation)|and|einstein (disambiguation)}}
{{pp-semi-indef}}
{{pp-move-indef}}
{{good article}}
{{infobox scientist
| name = albert einstein
| image = einstein 1921 f schmutzer - restoration.jpg
| caption = albert einstein in 1921
| birth_date = {{birth date|df=yes|1879|3|14}}
| birth_place = [[ulm]], [[kingdom of württemberg]], [[german empire]]
| death_date = {{death date , age|df=yes|1955|4|18|1879|3|14}}
| death_place = {{nowrap|[[princeton, new jersey]], u.s.}}
| children = [[lieserl einstein|"lieserl"]] (1902–1903?)<br />[[hans albert einstein|hans albert]] (1904–1973)<br />[[eduard einstein|eduard "tete"]] (1910–1965)
| spouse = [[mileva marić]] (1903–1919)<br />{{nowrap|[[elsa löwenthal]] (1919–1936)}}
| residence = germany, italy, switzerland, austria (today: [[czech republic]]), belgium, united states
| citizenship = {{plainlist|
* [[kingdom of württemberg]] (1879–1896)
* [[statelessness|stateless]] (1896–1901)
* [[switzerland]] (1901–1955)
* austria of [[austro-hungarian empire]] (1911–1912)
* germany (1914–1933)
* united states (1940–1955)
}}
| ethnicity = jewish
| fields = [[physics]], [[philosophy]]
| workplaces = {{plainlist|
* [[swiss patent office]] ([[bern]]) (1902–1909)
* [[university of bern]] (1908–1909)
* [[university of zurich]] (1909–1911)
* [[karl-ferdinands-universität|charles university in prague]] (1911–1912)
* [[eth zurich]] (1912–1914)
* [[prussian academy of sciences]] (1914–1933)
* [[humboldt university of berlin]] (1914–1917)
* [[kaiser wilhelm institute]] (director, 1917–1933)
* [[german physical society]] (president, 1916–1918)
* [[leiden university]] (visits, 1920–)
* [[institute advanced study]] (1933–1955)
* [[caltech]] (visits, 1931–1933)
}}
| alma_mater = {{plainlist|
* [[eth zurich|swiss federal polytechnic]] (1896–1900; b.a., 1900)
* [[university of zurich]] (ph.d., 1905)
}}
| doctoral_advisor = [[alfred kleiner]]
| thesis_title = eine neue bestimmung der moleküldimensionen (a new determination of molecular dimensions)
| thesis_url = http://e-collection.library.ethz.ch/eserv/eth:30378/eth-30378-01.pdf
| thesis_year = 1905
| academic_advisors = [[heinrich friedrich weber]]
| influenced = {{plainlist|
* [[ernst g. straus]]
* [[nathan rosen]]
* [[leó szilárd]]
}}
| known_for = {{plainlist|
* [[general relativity]] , [[special relativity]]
* [[photoelectric effect]]
* ''[[mass–energy equivalence|e=mc<sup>2</sup>]]''
* theory of [[brownian motion]]
* [[einstein field equations]]
* [[bose–einstein statistics]]
* [[bose–einstein condensate]]
* [[gravitational wave]]
* [[cosmological constant]]
* [[classical unified field theories|unified field theory]]
* [[epr paradox]]
}}
| awards = {{plainlist|
* [[barnard medal meritorious service science|barnard medal]] (1920)
* [[nobel prize in physics]] (1921)
* [[matteucci medal]] (1921)
* [[formemrs]] (1921)<ref name="frs" />
* [[copley medal]] (1925)<ref name="frs" />
* [[max planck medal]] (1929)
* [[time 100: important people of century|''time'' person of century]] (1999)
}}
| signature = albert einstein signature 1934.svg
}}
'''albert einstein''' ({{ipac-en|ˈ|aɪ|n|s|t|aɪ|n}};<ref>{{cite book|last=wells|first=john|authorlink=john c. wells|title=longman pronunciation dictionary|publisher=pearson longman|edition=3rd|date=april 3, 2008|isbn=1-4058-8118-6}}</ref> {{ipa-de|ˈalbɛɐ̯t ˈaɪnʃtaɪn|lang|albert einstein german.ogg}}; 14 march 1879 – 18 april 1955) german-born<!-- please not change this—see talk page , many archives.--> [[theoretical physicist]]. developed [[general theory of relativity]], 1 of 2 pillars of [[modern physics]] (alongside [[quantum mechanics]]).<ref name=frs>{{cite journal | last1 = whittaker | first1 = e. | authorlink = e. t. whittaker| doi = 10.1098/rsbm.1955.0005 | title = albert einstein. 1879–1955 | journal = [[biographical memoirs of fellows of royal society]] | volume = 1 | pages = 37–67 | date = 1 november 1955| jstor = 769242}}</ref><ref name="yanghamilton2010">{{cite book|author1=fujia yang|author2=joseph h. hamilton|title=modern atomic , nuclear physics|date=2010|publisher=world scientific|isbn=978-981-4277-16-7}}</ref>{{rp|274}} einstein's work known influence on [[philosophy of science]].<ref>{{citation |title=einstein's philosophy of science |url=http://plato.stanford.edu/entries/einstein-philscience/#intwaseinepiopp |we......
`
	// Non-greedy match from "{{infobox" to the first "}}", with . matching newlines.
	re := regexp.MustCompile(`{{infobox(?s:.*?)}}`)
	log.Println(re.FindAllStringSubmatch(st, -1))
}
I am trying to put each of the items in the infobox into a struct or map:
m["name"] = "albert einstein"
m["image"] = "einstein...."
...
m["death_date"] = "{{death date , age|df=yes|1955|4|18|1879|3|14}}"
...
I can't seem to isolate the infobox. I get:
[[{{infobox scientist | name = albert einstein | image = einstein 1921 f schmutzer - restoration.jpg | caption = albert einstein in 1921 | birth_date = {{birth date|df=yes|1879|3|14}}]]
The Albert Einstein entry in the API can be found at:
https://en.wikipedia.org/w/api.php?action=query&titles=albert%20einstein&prop=revisions&rvprop=content&format=json
Edit:
Based on the accepted answer to a related question, I tried the following regex:
(?=\{infobox)(\{([^{}]|(?1))*\})
But I get:
panic: regexp: Compile(`(?=\{infobox)(\{([^{}]|(?1))*\})`): error parsing regexp: invalid or unsupported Perl syntax: `(?=`
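The panic is expected: Go's regexp package implements RE2 syntax, which supports neither lookahead (`(?=`) nor recursive subpatterns (`(?1)`), so a pattern that balances nested braces cannot be compiled. One alternative is to find where the infobox starts and then count `{{` and `}}` pairs by hand. The following is a minimal sketch of that idea, not a full wikitext parser; the function name `extractInfobox` is hypothetical, and the sketch assumes that braces always come in pairs (it ignores `{{{param}}}` triple braces, comments, and `<nowiki>` sections).

```go
package main

import (
	"fmt"
	"regexp"
)

// startRe locates the opening of the infobox, ignoring letter case.
var startRe = regexp.MustCompile(`(?i)\{\{infobox`)

// extractInfobox (hypothetical helper) returns the first infobox template in
// text, including any templates nested inside it. It tracks nesting depth by
// counting "{{" and "}}" pairs instead of relying on a regular expression.
func extractInfobox(text string) (string, bool) {
	loc := startRe.FindStringIndex(text)
	if loc == nil {
		return "", false
	}
	start := loc[0]
	depth := 0
	for i := start; i < len(text)-1; i++ {
		switch text[i : i+2] {
		case "{{":
			depth++
			i++ // skip the second brace of the pair
		case "}}":
			depth--
			i++ // skip the second brace of the pair
			if depth == 0 {
				return text[start : i+1], true
			}
		}
	}
	// The template was never closed.
	return "", false
}

func main() {
	sample := "{{good article}}\n{{infobox scientist\n| name = albert einstein\n| birth_date = {{birth date|df=yes|1879|3|14}}\n}}\n'''albert einstein''' ..."
	if box, ok := extractInfobox(sample); ok {
		fmt.Println(box)
	}
}
```

Unlike the non-greedy pattern in the question, this does not stop at the `}}` that closes `{{birth date|...}}`, because that pair only brings the depth back to 1.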
Edit #2: If there's a way to extract this information via the API, I'll take that. I've been reading through the docs and can't find it.
I made a regex that might work for you:
^\s*\|\s*([^\s]+)\s*=\s*(\{\{plainlist\|(?:\n\s*\*.*)*|.*)
Explanation
This part:
^\s*\|\s*([^\s]+)\s*=\s*
matches the start of lines like:
| <the_label> =
Continuing on the same line, this part:
(\{\{plainlist\|(?:\n\s*\*.*)*|.*)
will match lists such as:
{{plainlist|
* [[ernst g. straus]]
* [[nathan rosen]]
* [[leó szilárd]]
(Note that the match may omit the final }}. Oh well.)
If there is no list, it matches until the end of the line.
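As a rough sketch, the pattern above can be applied in Go to build the map the question asks for. Two things here are assumptions rather than part of the answer: the `(?m)` flag, prepended so that `^` matches at the start of every line, and the helper name `parseFields`. The sketch also assumes the infobox text has one `| label = value` pair per line.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// fieldRe is the pattern from the answer with (?m) prepended (an assumption)
// so that ^ anchors at each line start rather than only at the start of text.
var fieldRe = regexp.MustCompile(`(?m)^\s*\|\s*([^\s]+)\s*=\s*(\{\{plainlist\|(?:\n\s*\*.*)*|.*)`)

// parseFields (hypothetical helper) maps each infobox label to its raw,
// unparsed wikitext value.
func parseFields(infobox string) map[string]string {
	m := make(map[string]string)
	for _, match := range fieldRe.FindAllStringSubmatch(infobox, -1) {
		// match[1] is the label, match[2] is the value.
		m[match[1]] = strings.TrimSpace(match[2])
	}
	return m
}

func main() {
	infobox := `{{infobox scientist
| name = albert einstein
| birth_date = {{birth date|df=yes|1879|3|14}}
| influenced = {{plainlist|
* [[ernst g. straus]]
* [[nathan rosen]]
}}
| ethnicity = jewish
}}`
	for label, value := range parseFields(infobox) {
		fmt.Printf("%s => %q\n", label, value)
	}
}
```

As the answer notes, a list value stops before its closing `}}`. A label with an empty value is another edge case: the `\s*` after `=` can run onto the next line, so such fields would need separate handling.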