DEMO: Programmatic access to PubMed

In this example we look at how to search the PubMed database through programmatic access.

BioPython

We will use BioPython, which provides tools for working with biological data (sequences, structures, etc.). More detailed instructions are available on its home page. The Biopython Tutorial and Cookbook is especially useful, since it contains a range of usage examples with explanations (the only problem is that some cookbook examples no longer work, because they have not been updated for newer BioPython versions). This will not be our only use of BioPython: we will also use it in other exercises.

The part of BioPython that lets us search databases on the NCBI servers, PubMed among them, is called Bio.Entrez; its documentation is available as part of BioPython.

The Bio.Entrez module

The Bio.Entrez module offers several functions; a few are listed here (the rest are described in the documentation):

  • esearch (Entrez Search) - searches one or more databases and returns the primary keys (IDs), which are then used to download individual records, e.g. with efetch;

  • efetch (Entrez Fetch) - downloads records (given their IDs as input) in the requested format;

  • esummary (Entrez Summary) - downloads short summaries of the records with the given IDs;

  • read - parses the XML results returned by the functions listed above;

  • egquery - for a search term, returns the number of hits in each of the databases accessible through Entrez;

  • einfo - returns metadata such as the list of available databases or, for a given database, its field names.

The Bio.Entrez module is in fact built on the Entrez API (application programming interface), for which official documentation is available on the NCBI servers.

Warnings:

  • When accessing Entrez programmatically you must identify yourself with an e-mail address (if you run the examples below, set Entrez.email to your own address).

  • If you send queries at a high frequency (e.g. a very large number of queries per minute), the server may block you temporarily. The same applies to frequent queries that download large amounts of data (e.g. a large number of full records).
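
The second warning can be respected with a small client-side pause between requests. The sketch below is only one possible approach and is not part of Bio.Entrez: the `Throttle` class is a hypothetical helper, and the 0.34-second default is an assumption based on NCBI's guideline of at most three requests per second without an API key.

```python
import time


class Throttle:
    """Ensure that consecutive calls are at least `min_interval` seconds apart."""

    def __init__(self, min_interval=0.34):
        self.min_interval = min_interval
        self._last_call = None  # monotonic time of the previous call

    def wait(self):
        """Sleep just long enough to keep the configured spacing."""
        now = time.monotonic()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                time.sleep(remaining)
        self._last_call = time.monotonic()


throttle = Throttle(min_interval=0.34)
# Intended use (hypothetical): call throttle.wait() right before each request,
#   throttle.wait()
#   handle = Entrez.efetch(db='pubmed', id=pmid, retmode='xml')
```

The first call to `wait()` returns immediately; later calls sleep only for the part of the interval that has not yet elapsed.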


Examples

The following examples demonstrate querying Entrez through programmatic access.

List of databases

First, let us look at which databases are available through this interface.

from Bio import Entrez
Entrez.email = 'your.name@example.org'   # identify yourself: put your own e-mail address here
handle = Entrez.einfo()
record = Entrez.read(handle)
handle.close()
print(record)   # prints the value, which is a dictionary
{'DbList': ['pubmed', 'protein', 'nuccore', 'ipg', 'nucleotide', 'structure', 'genome', 'annotinfo', 'assembly', 'bioproject', 'biosample', 'blastdbinfo', 'books', 'cdd', 'clinvar', 'gap', 'gapplus', 'grasp', 'dbvar', 'gene', 'gds', 'geoprofiles', 'homologene', 'medgen', 'mesh', 'ncbisearch', 'nlmcatalog', 'omim', 'orgtrack', 'pmc', 'popset', 'proteinclusters', 'pcassay', 'protfam', 'biosystems', 'pccompound', 'pcsubstance', 'seqannot', 'snp', 'sra', 'taxonomy', 'biocollections', 'gtr']}

We can see that record is a dictionary; the value of the DbList key (Database List, which is in fact a list of databases) is obtained like this:

print(record['DbList'])   # prints the value of the DbList (Database List) key
['pubmed', 'protein', 'nuccore', 'ipg', 'nucleotide', 'structure', 'genome', 'annotinfo', 'assembly', 'bioproject', 'biosample', 'blastdbinfo', 'books', 'cdd', 'clinvar', 'gap', 'gapplus', 'grasp', 'dbvar', 'gene', 'gds', 'geoprofiles', 'homologene', 'medgen', 'mesh', 'ncbisearch', 'nlmcatalog', 'omim', 'orgtrack', 'pmc', 'popset', 'proteinclusters', 'pcassay', 'protfam', 'biosystems', 'pccompound', 'pcsubstance', 'seqannot', 'snp', 'sra', 'taxonomy', 'biocollections', 'gtr']

So we have the pubmed database (literature), databases of amino-acid (protein) and nucleotide (nucleotide) sequences, structures (structure), and so on.

Searching

Here we focus on searching the PubMed database, so we set the value of db (database) to pubmed accordingly. We also have to define the search term (search_term in the example below). It can simply be one or more words to search PubMed for, or it can use the more advanced syntax, in the same way as the advanced web search interface for this database, which we looked at in the previous exercise. For a more complex query, the easiest way to build the full search term is to go to the PubMed Advanced Search Builder, build the search there, and copy the contents of the query box into the value of search_term.

Besides the database and the search term, a search also needs a retrieval mode, retmode (and sometimes a retrieval type, rettype); the possible values are listed in the help.

Below is one such example, in which we search by article title: we are interested only in articles whose title (Title) contains coronavirus and which also contain the word Tasmania anywhere (the search is not case-sensitive). The result is a dictionary containing the number of hits (Count), the list of PMIDs of the matching records or articles (IdList), and the search term as used (as interpreted by PubMed, including MeSH).

# counting records
# from Bio import Entrez
# Entrez.email = 'your.name@example.org' # identify yourself
search_term = '(coronavirus[Title]) AND Tasmania' # define the search
handle = Entrez.esearch(db='pubmed', term=search_term, retmode='xml')
records = Entrez.read(handle)
handle.close()
print(records)
{'Count': '4', 'RetMax': '4', 'RetStart': '0', 'IdList': ['33254559', '33195319', '32826150', '32244852'], 'TranslationSet': [{'From': 'Tasmania', 'To': '"tasmania"[MeSH Terms] OR "tasmania"[All Fields]'}], 'TranslationStack': [{'Term': 'coronavirus[Title]', 'Field': 'Title', 'Count': '17407', 'Explode': 'N'}, {'Term': '"tasmania"[MeSH Terms]', 'Field': 'MeSH Terms', 'Count': '1518', 'Explode': 'Y'}, {'Term': '"tasmania"[All Fields]', 'Field': 'All Fields', 'Count': '14210', 'Explode': 'N'}, 'OR', 'GROUP', 'AND'], 'QueryTranslation': 'coronavirus[Title] AND ("tasmania"[MeSH Terms] OR "tasmania"[All Fields])'}
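
For simple cases, the search term used above can also be assembled in code rather than copied from the Advanced Search Builder. The following is a minimal sketch: `build_term` is a hypothetical helper (not part of Bio.Entrez) that only joins field-tagged words and free-text words with AND; it does not cover OR, NOT, or nested groups.

```python
def build_term(required, anywhere=()):
    """Combine field-tagged terms and free-text words with AND.

    `required` is a sequence of (word, field) pairs, e.g. ('coronavirus', 'Title');
    `anywhere` is a sequence of words that may appear in any field.
    """
    parts = [f'({word}[{field}])' for word, field in required]
    parts.extend(anywhere)
    return ' AND '.join(parts)


search_term = build_term([('coronavirus', 'Title')], anywhere=['Tasmania'])
print(search_term)   # (coronavirus[Title]) AND Tasmania
```

The resulting string can be passed as the term argument of Entrez.esearch exactly like the hand-written one.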

If we are interested in the number of hits, we can simply print that value. We can also check that the code works correctly by running the same search in a browser.

count = int(records['Count']) # convert Count to an integer
print(count) # print the number of hits
4
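
When Count is larger than the number of IDs returned in a single response (RetMax in the output above), the remaining IDs have to be requested page by page. The sketch below only computes the offsets; `page_offsets` is a hypothetical helper, and the page size of 20 is an assumption chosen to match the usual esearch default.

```python
def page_offsets(count, page_size=20):
    """Return the (retstart, retmax) pairs needed to cover `count` hits."""
    return [(start, min(page_size, count - start))
            for start in range(0, count, page_size)]


print(page_offsets(4))        # [(0, 4)]
print(page_offsets(45, 20))   # [(0, 20), (20, 20), (40, 5)]
```

Each pair could then be passed to Entrez.esearch as the retstart and retmax keyword arguments, with a pause between requests.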

Downloading records

As we have seen, we search a database with esearch; to download records we use efetch, together with the list of PMIDs (in IdList).

As an example, let us look at how one of the records is structured. The code below takes the first PMID in IdList, then downloads and prints the record:

first_record = records['IdList'][0]   # first record in the list
print('The first record is: ' + first_record)   # print the PMID of the first record in the list
one_pubmed_entry = Entrez.efetch(db='pubmed', id=first_record, retmode='xml')
single_record = Entrez.read(one_pubmed_entry)
one_pubmed_entry.close()   # close the handle once the record has been parsed
print(single_record)   # print the record
The first record is: 33254559
{'PubmedBookArticle': [], 'PubmedArticle': [{'MedlineCitation': DictElement({'OtherAbstract': [], 'SpaceFlightMission': [], 'KeywordList': [ListElement([StringElement('COVID-19', attributes={'MajorTopicYN': 'N'}), StringElement('Drug delivery', attributes={'MajorTopicYN': 'N'}), StringElement('Nanoparticles', attributes={'MajorTopicYN': 'N'})], attributes={'Owner': 'NOTNLM'})], 'CitationSubset': ['IM'], 'GeneralNote': [], 'OtherID': [], 'PMID': StringElement('33254559', attributes={'Version': '1'}), 'DateCompleted': {'Year': '2021', 'Month': '01', 'Day': '05'}, 'DateRevised': {'Year': '2021', 'Month': '01', 'Day': '05'}, 'Article': DictElement({'ArticleDate': [DictElement({'Year': '2020', 'Month': '09', 'Day': '10'}, attributes={'DateType': 'Electronic'})], 'ELocationID': [StringElement('S0306-9877(20)32396-3', attributes={'EIdType': 'pii', 'ValidYN': 'Y'}), StringElement('10.1016/j.mehy.2020.110254', attributes={'EIdType': 'doi', 'ValidYN': 'Y'})], 'Language': ['eng'], 'Journal': {'ISSN': StringElement('1532-2777', attributes={'IssnType': 'Electronic'}), 'JournalIssue': DictElement({'Volume': '144', 'PubDate': {'Year': '2020', 'Month': 'Nov'}}, attributes={'CitedMedium': 'Internet'}), 'Title': 'Medical hypotheses', 'ISOAbbreviation': 'Med Hypotheses'}, 'ArticleTitle': 'Advanced drug delivery systems can assist in targeting coronavirus disease (COVID-19): A hypothesis.', 'Pagination': {'MedlinePgn': '110254'}, 'Abstract': {'AbstractText': ['The highly contagious coronavirus, which had already affected more than 2 million people in 210 countries, triggered a colossal economic crisis consequently resulting from measures adopted by various goverments to limit transmission. This has placed the lives of many people infected worldwide at great risk. Currently there are no established or validated treatments for COVID-19, that is approved worldwide. 
Nanocarriers may offer a wide range of applications that could be developed into risk-free approaches for successful therapeutic strategies that may lead to immunisation against the severe acute respiratory syndrome coronavirus 2 (SARS‑CoV‑2) which is the primary causative organism that had led to the current COVID-19 pandemic. We address existing as well as emerging therapeutic and prophylactic approaches that may enable us to effectively combat this pandemic, and also may help to identify the key areas where nano-scientists can step in.'], 'CopyrightInformation': 'Copyright © 2020 Elsevier Ltd. All rights reserved.'}, 'AuthorList': ListElement([DictElement({'AffiliationInfo': [{'Identifier': [], 'Affiliation': 'Discipline of Pharmacy, Graduate School of Health, University of Technology Sydney, Ultimo, NSW 2007, Australia; School of Pharmaceutical Sciences, Lovely Professional University, Phagwara 144411, Punjab, India.'}], 'Identifier': [], 'LastName': 'Mehta', 'ForeName': 'Meenu', 'Initials': 'M'}, attributes={'ValidYN': 'Y'}), DictElement({'AffiliationInfo': [{'Identifier': [], 'Affiliation': 'Department of Chemistry, University of Petroleum & Energy Studies, Dehradun 248007, India.'}], 'Identifier': [], 'LastName': 'Prasher', 'ForeName': 'Parteek', 'Initials': 'P'}, attributes={'ValidYN': 'Y'}), DictElement({'AffiliationInfo': [{'Identifier': [], 'Affiliation': 'Department of Chemistry, Uttaranchal University, Dehradun 248007, India.'}], 'Identifier': [], 'LastName': 'Sharma', 'ForeName': 'Mousmee', 'Initials': 'M'}, attributes={'ValidYN': 'Y'}), DictElement({'AffiliationInfo': [{'Identifier': [], 'Affiliation': 'School of Health Sciences, College of Health and Medicine, University of Tasmania, Launceston, TAS 7250, Australia.'}], 'Identifier': [], 'LastName': 'Shastri', 'ForeName': 'Madhur D', 'Initials': 'MD'}, attributes={'ValidYN': 'Y'}), DictElement({'AffiliationInfo': [{'Identifier': [], 'Affiliation': 'School of Pharmaceutical Sciences, Lovely 
Professional University, Phagwara 144411, Punjab, India.'}], 'Identifier': [], 'LastName': 'Khurana', 'ForeName': 'Navneet', 'Initials': 'N'}, attributes={'ValidYN': 'Y'}), DictElement({'AffiliationInfo': [{'Identifier': [], 'Affiliation': 'School of Pharmaceutical Sciences, Lovely Professional University, Phagwara 144411, Punjab, India.'}], 'Identifier': [], 'LastName': 'Vyas', 'ForeName': 'Manish', 'Initials': 'M'}, attributes={'ValidYN': 'Y'}), DictElement({'AffiliationInfo': [{'Identifier': [], 'Affiliation': 'Faculty of Pharmaceutical Sciences, Maharshi Dayanand University, Rohtak, India.'}], 'Identifier': [], 'LastName': 'Dureja', 'ForeName': 'Harish', 'Initials': 'H'}, attributes={'ValidYN': 'Y'}), DictElement({'AffiliationInfo': [{'Identifier': [], 'Affiliation': 'School of Pharmacy, Suresh Gyan Vihar University, Jagatpura 302017, Mahal Road, Jaipur, India.'}], 'Identifier': [], 'LastName': 'Gupta', 'ForeName': 'Gaurav', 'Initials': 'G'}, attributes={'ValidYN': 'Y'}), DictElement({'AffiliationInfo': [{'Identifier': [], 'Affiliation': 'Department of Chemical Pathology, School of Pathology, Faculty of Health Sciences and National Health Laboratory Service, University of the Free State, Bloemfontein, South Africa.'}], 'Identifier': [], 'LastName': 'Anand', 'ForeName': 'Krishnan', 'Initials': 'K'}, attributes={'ValidYN': 'Y'}), DictElement({'AffiliationInfo': [{'Identifier': [], 'Affiliation': 'Discipline of Pharmacy, Graduate School of Health, University of Technology Sydney, Ultimo, NSW 2007, Australia; School of Pharmaceutical Sciences, Lovely Professional University, Phagwara 144411, Punjab, India.'}], 'Identifier': [], 'LastName': 'Satija', 'ForeName': 'Saurabh', 'Initials': 'S'}, attributes={'ValidYN': 'Y'}), DictElement({'AffiliationInfo': [{'Identifier': [], 'Affiliation': 'Department of Life Sciences, School of Pharmacy, International Medical University, Bukit Jalil, 57000 Kuala Lumpur, Malaysia.'}], 'Identifier': [], 'LastName': 'Chellappan', 
'ForeName': 'Dinesh Kumar', 'Initials': 'DK'}, attributes={'ValidYN': 'Y'}), DictElement({'AffiliationInfo': [{'Identifier': [], 'Affiliation': 'Discipline of Pharmacy, Graduate School of Health, University of Technology Sydney, Ultimo, NSW 2007, Australia; Priority Research Centre for Healthy Lungs, Hunter Medical Research Institute (HMRI) & School of Biomedical Sciences and Pharmacy, The University of Newcastle (UoN), Callaghan, NSW 2308, Australia; School of Pharmaceutical Sciences, Shoolini University, Bajhol, Sultanpur, Solan, Himachal Pradesh 173 229, India. Electronic address: Kamal.Dua@uts.edu.au.'}], 'Identifier': [], 'LastName': 'Dua', 'ForeName': 'Kamal', 'Initials': 'K'}, attributes={'ValidYN': 'Y'})], attributes={'CompleteYN': 'Y'}), 'PublicationTypeList': [StringElement('Journal Article', attributes={'UI': 'D016428'})]}, attributes={'PubModel': 'Print-Electronic'}), 'MedlineJournalInfo': {'Country': 'United States', 'MedlineTA': 'Med Hypotheses', 'NlmUniqueID': '7505668', 'ISSNLinking': '0306-9877'}, 'ChemicalList': [{'RegistryNumber': '0', 'NameOfSubstance': StringElement('Antiviral Agents', attributes={'UI': 'D000998'})}, {'RegistryNumber': '0', 'NameOfSubstance': StringElement('Drug Carriers', attributes={'UI': 'D004337'})}, {'RegistryNumber': '0', 'NameOfSubstance': StringElement('Plant Preparations', attributes={'UI': 'D028321'})}, {'RegistryNumber': '0', 'NameOfSubstance': StringElement('Polysaccharides', attributes={'UI': 'D011134'})}], 'SupplMeshList': [StringElement('COVID-19 drug treatment', attributes={'Type': 'Protocol', 'UI': 'C000705127'})], 'MeshHeadingList': [{'QualifierName': [StringElement('therapeutic use', attributes={'UI': 'Q000627', 'MajorTopicYN': 'N'})], 'DescriptorName': StringElement('Antiviral Agents', attributes={'UI': 'D000998', 'MajorTopicYN': 'N'})}, {'QualifierName': [StringElement('drug therapy', attributes={'UI': 'Q000188', 'MajorTopicYN': 'Y'})], 'DescriptorName': StringElement('COVID-19', attributes={'UI': 
'D000086382', 'MajorTopicYN': 'N'})}, {'QualifierName': [], 'DescriptorName': StringElement('Drug Carriers', attributes={'UI': 'D004337', 'MajorTopicYN': 'N'})}, {'QualifierName': [], 'DescriptorName': StringElement('Drug Delivery Systems', attributes={'UI': 'D016503', 'MajorTopicYN': 'Y'})}, {'QualifierName': [], 'DescriptorName': StringElement('Humans', attributes={'UI': 'D006801', 'MajorTopicYN': 'N'})}, {'QualifierName': [], 'DescriptorName': StringElement('Models, Theoretical', attributes={'UI': 'D008962', 'MajorTopicYN': 'N'})}, {'QualifierName': [StringElement('chemistry', attributes={'UI': 'Q000737', 'MajorTopicYN': 'N'})], 'DescriptorName': StringElement('Nanoparticles', attributes={'UI': 'D053758', 'MajorTopicYN': 'N'})}, {'QualifierName': [], 'DescriptorName': StringElement('Nanotechnology', attributes={'UI': 'D036103', 'MajorTopicYN': 'N'})}, {'QualifierName': [], 'DescriptorName': StringElement('Plant Preparations', attributes={'UI': 'D028321', 'MajorTopicYN': 'N'})}, {'QualifierName': [StringElement('chemistry', attributes={'UI': 'Q000737', 'MajorTopicYN': 'N'})], 'DescriptorName': StringElement('Polysaccharides', attributes={'UI': 'D011134', 'MajorTopicYN': 'N'})}, {'QualifierName': [], 'DescriptorName': StringElement('Precision Medicine', attributes={'UI': 'D057285', 'MajorTopicYN': 'N'})}]}, attributes={'Status': 'MEDLINE', 'Owner': 'NLM'}), 'PubmedData': {'ReferenceList': [{'ReferenceList': [], 'Reference': [{'Citation': 'Drug Dev Res. 2020 Sep;81(6):647-649', 'ArticleIdList': [StringElement('32329083', attributes={'IdType': 'pubmed'})]}, {'Citation': 'Nanoscale. 2016 Feb 7;8(5):3040-8', 'ArticleIdList': [StringElement('26781043', attributes={'IdType': 'pubmed'})]}, {'Citation': 'Virology. 2015 Nov;485:330-9', 'ArticleIdList': [StringElement('26331680', attributes={'IdType': 'pubmed'})]}, {'Citation': 'Int J Environ Res Public Health. 
2016 Apr 19;13(4):430', 'ArticleIdList': [StringElement('27104546', attributes={'IdType': 'pubmed'})]}, {'Citation': 'Theranostics. 2020 May 1;10(13):5932-5942', 'ArticleIdList': [StringElement('32483428', attributes={'IdType': 'pubmed'})]}, {'Citation': 'Nat Microbiol. 2020 Apr;5(4):562-569', 'ArticleIdList': [StringElement('32094589', attributes={'IdType': 'pubmed'})]}, {'Citation': 'J Rheumatol. 1998 Jun;25(6):1221-5', 'ArticleIdList': [StringElement('9632091', attributes={'IdType': 'pubmed'})]}, {'Citation': 'Antiviral Res. 2020 Apr;176:104742', 'ArticleIdList': [StringElement('32057769', attributes={'IdType': 'pubmed'})]}, {'Citation': 'Future Med Chem. 2020 Sep;12(18):1607-1609', 'ArticleIdList': [StringElement('32589055', attributes={'IdType': 'pubmed'})]}, {'Citation': 'Respir Med. 2020 Jun;167:105987', 'ArticleIdList': [StringElement('32421541', attributes={'IdType': 'pubmed'})]}, {'Citation': 'Cell. 2020 Apr 16;181(2):271-280.e8', 'ArticleIdList': [StringElement('32142651', attributes={'IdType': 'pubmed'})]}, {'Citation': 'Nat Rev Immunol. 2020 Jul;20(7):399-400', 'ArticleIdList': [StringElement('32499636', attributes={'IdType': 'pubmed'})]}, {'Citation': 'Drug Discov Today. 2020 Sep;25(9):1556-1558', 'ArticleIdList': [StringElement('32592866', attributes={'IdType': 'pubmed'})]}, {'Citation': 'Protein Sci. 2020 Jul;29(7):1596-1605', 'ArticleIdList': [StringElement('32304108', attributes={'IdType': 'pubmed'})]}, {'Citation': 'Science. 2020 Jun 19;368(6497):1331-1335', 'ArticleIdList': [StringElement('32321856', attributes={'IdType': 'pubmed'})]}, {'Citation': 'PLoS Negl Trop Dis. 2020 Jul 29;14(7):e0008548', 'ArticleIdList': [StringElement('32726304', attributes={'IdType': 'pubmed'})]}, {'Citation': 'PLoS One. 2013 Dec 26;8(12):e82648', 'ArticleIdList': [StringElement('24386106', attributes={'IdType': 'pubmed'})]}, {'Citation': 'Lancet. 
2020 May 30;395(10238):1689-1690', 'ArticleIdList': [StringElement('32422123', attributes={'IdType': 'pubmed'})]}, {'Citation': 'Antiviral Res. 2014 Sep;109:97-109', 'ArticleIdList': [StringElement('24995382', attributes={'IdType': 'pubmed'})]}, {'Citation': 'Hypertension. 2020 Jul;76(1):16-22', 'ArticleIdList': [StringElement('32367746', attributes={'IdType': 'pubmed'})]}, {'Citation': 'Nat Rev Drug Discov. 2016 May;15(5):327-47', 'ArticleIdList': [StringElement('26868298', attributes={'IdType': 'pubmed'})]}, {'Citation': 'Cell. 2020 Apr 16;181(2):281-292.e6', 'ArticleIdList': [StringElement('32155444', attributes={'IdType': 'pubmed'})]}, {'Citation': 'Acc Chem Res. 2013 Mar 19;46(3):792-801', 'ArticleIdList': [StringElement('23387478', attributes={'IdType': 'pubmed'})]}, {'Citation': 'Chem Biol Interact. 2020 Jul 1;325:109125', 'ArticleIdList': [StringElement('32376238', attributes={'IdType': 'pubmed'})]}, {'Citation': 'Science. 2020 Jul 3;369(6499):77-81', 'ArticleIdList': [StringElement('32376603', attributes={'IdType': 'pubmed'})]}, {'Citation': 'Virol J. 2019 May 27;16(1):69', 'ArticleIdList': [StringElement('31133031', attributes={'IdType': 'pubmed'})]}, {'Citation': 'Nat Rev Immunol. 2018 May;18(5):297-308', 'ArticleIdList': [StringElement('29379211', attributes={'IdType': 'pubmed'})]}, {'Citation': 'Virology. 2005 Feb 20;332(2):498-510', 'ArticleIdList': [StringElement('15680415', attributes={'IdType': 'pubmed'})]}, {'Citation': 'Appl Environ Microbiol. 2014 Apr;80(8):2343-50', 'ArticleIdList': [StringElement('24487537', attributes={'IdType': 'pubmed'})]}, {'Citation': 'ACS Appl Mater Interfaces. 2019 Nov 20;11(46):42964-42974', 'ArticleIdList': [StringElement('31633330', attributes={'IdType': 'pubmed'})]}, {'Citation': 'Respir Res. 2005 Jan 20;6:8', 'ArticleIdList': [StringElement('15661082', attributes={'IdType': 'pubmed'})]}, {'Citation': 'J Antimicrob Chemother. 
2005 Feb;55(2):280-1', 'ArticleIdList': [StringElement('15650005', attributes={'IdType': 'pubmed'})]}, {'Citation': 'Dermatol Ther. 2020 Jul;33(4):e13501', 'ArticleIdList': [StringElement('32359088', attributes={'IdType': 'pubmed'})]}, {'Citation': 'Front Immunol. 2019 Mar 27;10:594', 'ArticleIdList': [StringElement('30972078', attributes={'IdType': 'pubmed'})]}, {'Citation': 'J Control Release. 2018 Jan 10;269:258-265', 'ArticleIdList': [StringElement('29170138', attributes={'IdType': 'pubmed'})]}]}], 'History': [DictElement({'Year': '2020', 'Month': '07', 'Day': '26'}, attributes={'PubStatus': 'received'}), DictElement({'Year': '2020', 'Month': '08', 'Day': '18'}, attributes={'PubStatus': 'revised'}), DictElement({'Year': '2020', 'Month': '09', 'Day': '04'}, attributes={'PubStatus': 'accepted'}), DictElement({'Year': '2020', 'Month': '12', 'Day': '1', 'Hour': '1', 'Minute': '2'}, attributes={'PubStatus': 'entrez'}), DictElement({'Year': '2020', 'Month': '12', 'Day': '2', 'Hour': '6', 'Minute': '0'}, attributes={'PubStatus': 'pubmed'}), DictElement({'Year': '2021', 'Month': '1', 'Day': '6', 'Hour': '6', 'Minute': '0'}, attributes={'PubStatus': 'medline'})], 'PublicationStatus': 'ppublish', 'ArticleIdList': [StringElement('33254559', attributes={'IdType': 'pubmed'}), StringElement('S0306-9877(20)32396-3', attributes={'IdType': 'pii'}), StringElement('10.1016/j.mehy.2020.110254', attributes={'IdType': 'doi'}), StringElement('PMC7481067', attributes={'IdType': 'pmc'})]}}]}

Printing part of a record

We can see that the record is quite a jumble, so it is hard to find one's way around it. The code below prints, for each PMID, the abstract if one is available (here we again take all hits of the previous search), which is already somewhat nicer.

for identifier in records['IdList']:
    pubmed_entry = Entrez.efetch(db='pubmed', id=identifier, retmode='xml')
    result = Entrez.read(pubmed_entry)
    article = result['PubmedArticle'][0]['MedlineCitation']['Article']
    if 'Abstract' in article:
        print(article['Abstract']['AbstractText'][0])
The highly contagious coronavirus, which had already affected more than 2 million people in 210 countries, triggered a colossal economic crisis consequently resulting from measures adopted by various goverments to limit transmission. This has placed the lives of many people infected worldwide at great risk. Currently there are no established or validated treatments for COVID-19, that is approved worldwide. Nanocarriers may offer a wide range of applications that could be developed into risk-free approaches for successful therapeutic strategies that may lead to immunisation against the severe acute respiratory syndrome coronavirus 2 (SARS‑CoV‑2) which is the primary causative organism that had led to the current COVID-19 pandemic. We address existing as well as emerging therapeutic and prophylactic approaches that may enable us to effectively combat this pandemic, and also may help to identify the key areas where nano-scientists can step in.
Coronavirus disease 2019 (COVID-19) is a rapidly evolving, highly transmissible, and potentially lethal pandemic caused by a novel coronavirus, severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). As of June 11 2020, more than 7,000,000 COVID-19 cases have been reported worldwide, and more than 400,000 patients have died, affecting at least 188 countries. While literature on the disease is rapidly accumulating, an integrated, multinational perspective on clinical manifestations, immunological effects, diagnosis, prevention, and treatment of COVID-19 can be of global benefit. We aimed to synthesize the most relevant literature and experiences in different parts of the world through our global consortium of experts to provide a consensus-based document at this early stage of the pandemic.
The coronavirus disease 2019 (COVID-19) pandemic is challenging healthcare systems worldwide, none more so than critical and intensive care settings. Significant attention has been paid to the capacity of Australian intensive care unit (ICUs) to respond to a COVID-19 surge, particularly in relation to beds, ventilators, staffing, personal protective equipment, and unparalleled increase in deaths in ICUs associated with COVID-19 seen internationally. While death is not uncommon in critical care, the international experience demonstrates that restrictions to family presence at the end of life result in significant distress for families and clinicians. As a result, the Australian College of Critical Care Nurses and the Australasian College for Infection Prevention and Control supported the development of a position statement to provide critical care nurses with specific guidance and recommendations for practice for this emerging priority area. Where possible, position statements are founded on high-quality evidence. However, the short time period since the first recognition of a cluster of pneumonia-like cases in China in January, 2020, meant that an integrative approach was required to expedite timely development of this position statement in preparation for a COVID-19 surge in Australia. This position statement is intended to provide practical guidance to critical care nurses in facilitating next-of-kin presence for patients dying from COVID-19 in the ICU.
The epicenter of the original outbreak in China has high male smoking rates of around 50%, and early reported death rates have an emphasis on older males, therefore the likelihood of smokers being overrepresented in fatalities is high. In Iran, China, Italy, and South Korea, female smoking rates are much lower than males. Fewer females have contracted the virus. If this analysis is correct, then Indonesia would be expected to begin experiencing high rates of Covid-19 because its male smoking rate is over 60% (Tobacco Atlas). Smokers are vulnerable to respiratory viruses. Smoking can upregulate angiotensin-converting enzyme-2 (ACE2) receptor, the known receptor for both the severe acute respiratory syndrome (SARS)-coronavirus (SARS-CoV) and the human respiratory coronavirus NL638. This could also be true for new electronic smoking devices such as electronic cigarettes and "heat-not-burn" IQOS devices. ACE2 could be a novel adhesion molecule for SARS-CoV-2 causing Covid-19 and a potential therapeutic target for the prevention of fatal microbial infections, and therefore it should be fast tracked and prioritized for research and investigation. Data on smoking status should be collected on all identified cases of Covid-19.
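
The loop above prints only the first element of AbstractText. Some abstracts are split into several parts (for example background, methods, results), in which case the remaining parts would be lost. The sketch below joins all parts; `extract_abstract` is a hypothetical helper, and it assumes the parsed result behaves like the nested dictionaries and lists shown in the printed record. It is demonstrated here on a hand-made dictionary rather than on a live download.

```python
def extract_abstract(result):
    """Return the full abstract of the first article in `result`, or None.

    `result` is expected to follow the layout of the printed record:
    PubmedArticle -> MedlineCitation -> Article -> Abstract -> AbstractText.
    """
    articles = result.get('PubmedArticle', [])
    if not articles:
        return None
    article = articles[0]['MedlineCitation']['Article']
    if 'Abstract' not in article:
        return None
    # Join all parts so that multi-section abstracts are not truncated.
    return ' '.join(str(part) for part in article['Abstract']['AbstractText'])


# Hand-made stand-in for a parsed record (illustrative data only).
example = {'PubmedArticle': [{'MedlineCitation': {'Article': {
    'Abstract': {'AbstractText': ['First part.', 'Second part.']}}}}]}
print(extract_abstract(example))   # First part. Second part.
```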

Saving results to CSV

Results can also be saved locally, for example as a CSV file (Comma Separated Values; a short explanation is given among the miscellaneous notes). What would we actually do with locally stored data? For example, we could run text mining to analyse the co-occurrence of particular words (e.g. protein or gene names), co-authorships, the geographical distribution of research (author entries include their affiliation), and so on. For more extensive analyses it is best to copy the entire MEDLINE database, but that is another story.

The example below illustrates the use of the Pandas library; specifically, it:

  • defines the dictionary articles_with_abstracts,

  • for each PMID, downloads the record and adds its abstract to the dictionary under a key equal to the PMID,

  • converts the dictionary to a Pandas dataframe,

  • saves the dataframe as a CSV file.

import os
import pandas as pd
articles_with_abstracts = {}   # create the dictionary

# the code below is similar to the code above, except that it does not print the abstracts
for identifier in records['IdList']:
    pubmed_entry = Entrez.efetch(db='pubmed', id=identifier, retmode='xml')
    results = Entrez.read(pubmed_entry)
    article = results['PubmedArticle'][0]['MedlineCitation']['Article']
    if 'Abstract' in article:
        articles_with_abstracts[identifier] = article['Abstract']['AbstractText'][0]

# print(articles_with_abstracts)   # this would print the dictionary
df = pd.DataFrame(list(articles_with_abstracts.items()), columns=['PMID', 'Abstract'])
print(df)   # print the dataframe
os.makedirs('izhod', exist_ok=True)   # create the output directory if it does not exist yet
df.to_csv('izhod/pubmed-search_output.csv')   # save as a CSV file
       PMID                                           Abstract
0  33254559  The highly contagious coronavirus, which had a...
1  33195319  Coronavirus disease 2019 (COVID-19) is a rapid...
2  32826150  The coronavirus disease 2019 (COVID-19) pandem...
3  32244852  The epicenter of the original outbreak in Chin...
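
If Pandas is not available, a dictionary of the same shape can be written with Python's standard csv module. This is only a sketch of an alternative, not the approach used above: `save_abstracts` is a hypothetical helper, and the file name in the usage note is an example.

```python
import csv
from pathlib import Path


def save_abstracts(abstracts, path):
    """Write a {PMID: abstract} mapping to `path` as a two-column CSV file."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)   # create missing folders
    with path.open('w', newline='', encoding='utf-8') as handle:
        writer = csv.writer(handle)
        writer.writerow(['PMID', 'Abstract'])   # header row
        writer.writerows(abstracts.items())     # one row per article
    return path
```

A call such as save_abstracts(articles_with_abstracts, 'izhod/pubmed-abstracts.csv') would then write one row per PMID; unlike the dataframe export, this file has no index column.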

Additional information

Other ways of programmatic access

Searching PubMed with BioPython can be, as you have seen above, somewhat complicated. One example of a simpler interface is PyMed, which was used to analyse the scientific literature on COVID-19 in the article COVID-19: A scholarly production dataset report for research analysis. How the data were obtained can be seen in the GitHub repository breno-madruga/dib-covid-datased, which you can download and inspect.

Another example is Entrezpy, described in the article Entrezpy: a Python library to dynamically interact with the NCBI Entrez databases. Relevant links: code in the repository, documentation.

Both PyMed and Entrezpy, mentioned above, are available through the pip package manager. It is advisable to install programs that you may need only for a test into a virtual environment, which is briefly described here.