Quickstart
Peptidoform
Peptidoform accepts peptidoforms (a combination of peptide sequence, modifications,
and, optionally, charge state) in ProForma 2.0 notation and supports several
peptide-related operations, for example:
>>> from psm_utils import Peptidoform, PSM, PSMList
>>> peptidoform = Peptidoform("ACDEK/2")
>>> peptidoform.theoretical_mass
564.2213546837
>>> peptidoform.composition
Composition({'H': 36, 'C': 21, 'O': 10, 'N': 6, 'S': 1})
>>> peptidoform.sequential_composition
[Composition({'H': 1}),
Composition({'H': 5, 'C': 3, 'O': 1, 'N': 1}),
Composition({'H': 5, 'C': 3, 'S': 1, 'O': 1, 'N': 1}),
Composition({'H': 5, 'C': 4, 'O': 3, 'N': 1}),
Composition({'H': 7, 'C': 5, 'O': 3, 'N': 1}),
Composition({'H': 12, 'C': 6, 'N': 2, 'O': 1}),
Composition({'H': 1, 'O': 1})]
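To make the relation between the composition and the theoretical mass concrete, the following minimal sketch sums monoisotopic element masses over the composition shown above. It uses only the standard library. The `MONOISOTOPIC_MASS` table and the `mass_from_composition` helper are illustrative names, not part of psm_utils, and the element masses are commonly tabulated values that may differ from the library's constants in the last decimals.

```python
# Illustrative only: psm_utils computes the mass internally.
# Assumed monoisotopic masses (in Da) of the most abundant isotopes.
MONOISOTOPIC_MASS = {
    "H": 1.00782503207,
    "C": 12.0,
    "N": 14.0030740048,
    "O": 15.99491461956,
    "S": 31.97207100,
}


def mass_from_composition(composition):
    """Sum element masses weighted by their atom counts."""
    return sum(
        MONOISOTOPIC_MASS[element] * count
        for element, count in composition.items()
    )


# Composition of ACDEK as reported by `peptidoform.composition` above.
acdek = {"H": 36, "C": 21, "O": 10, "N": 6, "S": 1}
mass = mass_from_composition(acdek)
print(round(mass, 3))
```

The result agrees with the `theoretical_mass` reported above to within the precision of the assumed element masses.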
PSM
PSM links a Peptidoform to a specific spectrum where it was (putatively) identified.
A PSM therefore contains the peptidoform, spectrum (meta)data, and peptide-spectrum
match information:
>>> psm = PSM(
...     peptidoform=Peptidoform("VLHPLEGAVVIIFK/2"),
...     spectrum_id=17555,
...     run="Adult_Frontalcortex_bRP_Elite_85_f09",
...     collection="PXD000561",
...     is_decoy=False,
...     precursor_mz=767.9714,
... )
>>> psm.get_usi()
mzspec:PXD000561:Adult_Frontalcortex_bRP_Elite_85_f09:scan:17555:VLHPLEGAVVIIFK/2
The spectrum can be retrieved by its USI through the ProteomeXchange USI aggregator:
http://proteomecentral.proteomexchange.org/usi/?usi=mzspec:PXD000561:Adult_Frontalcortex_bRP_Elite_85_f09:scan:17555:VLHPLEGAVVIIFK/2
Note that this is only possible because the spectrum has been fully indexed in one of the ProteomeXchange partner repositories (in this case both MassIVE and PeptideAtlas).
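The structure of the USI and of the aggregator link can be sketched with the standard library alone. The helper names `build_usi` and `aggregator_url` are hypothetical and do not mirror how psm_utils implements `get_usi()`. The sketch only covers scan-indexed USIs of the form shown above, and the base URL is taken from the link in the text.

```python
from urllib.parse import quote

# Base URL as it appears in the aggregator link above.
USI_AGGREGATOR = "http://proteomecentral.proteomexchange.org/usi/"


def build_usi(collection, run, scan, peptidoform):
    """Join the parts of a scan-indexed USI with colons."""
    return f"mzspec:{collection}:{run}:scan:{scan}:{peptidoform}"


def aggregator_url(usi):
    """Percent-encode the USI and pass it as the `usi` query parameter."""
    return f"{USI_AGGREGATOR}?usi={quote(usi, safe=':')}"


usi = build_usi(
    collection="PXD000561",
    run="Adult_Frontalcortex_bRP_Elite_85_f09",
    scan=17555,
    peptidoform="VLHPLEGAVVIIFK/2",
)
print(usi)
print(aggregator_url(usi))
```

Percent-encoding matters mostly for peptidoforms that carry modifications, because square brackets and slashes are not safe in a query string.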
PSMList
PSMList is a simple list-like object that represents a group of PSMs from one or more
mass spectrometry runs or collections. This simple, Pythonic data structure can be
used flexibly in various contexts.
>>> psm_list = PSMList(psm_list=[
...     PSM(peptidoform="ACDK", spectrum_id=1, score=140.2, retention_time=600.2),
...     PSM(peptidoform="CDEFR", spectrum_id=2, score=132.9, retention_time=1225.4),
...     PSM(peptidoform="DEM[Oxidation]K", spectrum_id=3, score=55.7, retention_time=3389.1),
... ])
PSMList directly supports iteration:
>>> for psm in psm_list:
...     print(psm.score)
140.2
132.9
55.7
PSM properties can be accessed as a single NumPy array:
>>> psm_list["score"]
array([140.2, 132.9, 55.7], dtype=object)
PSMList supports indexing and slicing:
>>> psm_list_subset = psm_list[0:2]
>>> psm_list_subset["score"]
array([140.2, 132.9], dtype=object)
>>> psm_list_subset = psm_list[0, 2]
>>> psm_list_subset["score"]
array([140.2, 55.7], dtype=object)
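The access patterns above (integer index, slice, property name, and a tuple of indices) can all be served by a single `__getitem__` that dispatches on the type of its argument. The sketch below is a simplified, hypothetical stand-in named `MiniPSMList`. It is not the library's implementation: it stores PSMs as plain dictionaries and returns Python lists instead of NumPy arrays.

```python
class MiniPSMList:
    """Hypothetical, simplified stand-in for PSMList (not the library class)."""

    def __init__(self, psms):
        self.psms = list(psms)  # each PSM is a plain dict in this sketch

    def __iter__(self):
        return iter(self.psms)

    def __len__(self):
        return len(self.psms)

    def __getitem__(self, item):
        if isinstance(item, int):
            # Single PSM by position
            return self.psms[item]
        if isinstance(item, slice):
            # New list holding a contiguous range of PSMs
            return MiniPSMList(self.psms[item])
        if isinstance(item, str):
            # One property collected across all PSMs
            return [psm[item] for psm in self.psms]
        if isinstance(item, (tuple, list)):
            # New list holding the PSMs at the selected positions
            return MiniPSMList([self.psms[i] for i in item])
        raise TypeError(f"Unsupported index type: {type(item).__name__}")


psms = MiniPSMList([
    {"peptidoform": "ACDK", "score": 140.2},
    {"peptidoform": "CDEFR", "score": 132.9},
    {"peptidoform": "DEM[Oxidation]K", "score": 55.7},
])
print(psms["score"])        # [140.2, 132.9, 55.7]
print(psms[0:2]["score"])   # [140.2, 132.9]
print(psms[0, 2]["score"])  # [140.2, 55.7]
```

Note that `psms[0, 2]` passes the tuple `(0, 2)` to `__getitem__`, which is why it selects the first and third PSM rather than a two-dimensional element.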
For more advanced and efficient vectorized access, converting the PSMList to a
pandas DataFrame is highly recommended:
>>> psm_df = psm_list.to_dataframe()
>>> psm_df[(psm_df["retention_time"] < 2000) & (psm_df["score"] > 10)]
   peptidoform  spectrum_id   run  collection  spectrum  is_decoy  score  qvalue   pep  precursor_mz  retention_time  protein_list  rank  source  provenance_data  metadata  rescoring_features
0         ACDK            1  None        None      None      None  140.2    None  None          None           600.2          None  None    None             None      None                None
1        CDEFR            2  None        None      None      None  132.9    None  None          None          1225.4          None  None    None             None      None                None
psm_utils.io
The psm_utils.io subpackage contains readers and writers for various PSM file formats
(see Supported file formats). Each reader parses its specific PSM file format into a
unified PSMList object, with peptidoforms parsed into ProForma notation. Use the
high-level psm_utils.io.read_file(), psm_utils.io.write_file(), and
psm_utils.io.convert() functions to easily read, write, and convert PSM files:
>>> from psm_utils.io import read_file
>>> psm_list = read_file("data/QExHF04054_tandem.idXML", filetype="idxml")
>>> psm_list[0]
PSM(
    peptidoform=Peptidoform('QSGD[Ammonium]E[Ammonium]SYC[Carbamidomethyl]E[Ammonium]R/2'),
    spectrum_id='controllerType=0 controllerNumber=1 scan=4941',
    run=None,
    collection=None,
    spectrum=None,
    is_decoy=True,
    score=17.1,
    precursor_mz=624.252254215645,
    retention_time=1197.74208,
    protein_list=['sP06800'],
    source='idXML',
    provenance_data=None,
    metadata={
        'idxml:score_type': 'XTandem',
        'idxml:higher_score_better': 'True',
        'idxml:significance_threshold': '0.0'
    },
    rescoring_features=None
)
Alternatively, the lower-level, file format-specific reader and writer classes can be
used. Each reader has a read_file() method:
>>> from psm_utils.io.mzid import MzidReader
>>> psm_list = MzidReader("psms.mzid").read_file()
>>> psm_list[0].peptidoform
Peptidoform('GLTEGLHGFHVHEFGDNTAGC[Carbamidomethyl]TSAGPHFNPLSR/4')
And all readers support iteration over PSMs:
>>> for psm in MzidReader("psms.mzid"):
...     print(psm.peptidoform.proforma)
ACDEK
AC[Carbamidomethyl]DEFGR
[Acetyl]-AC[Carbamidomethyl]DEFGHIK
[...]
Similarly, writers can write single PSMs to a file:
>>> from psm_utils.io.tsv import TSVWriter
>>> with TSVWriter("psm_list.tsv", example_psm=psm_list[0]) as writer:
...     writer.write_psm(psm_list[0])
And writers can write entire PSM lists at once:
>>> with TSVWriter("psm_list.tsv", example_psm=psm_list[0]) as writer:
...     writer.write_file(psm_list)
Take a look at the Python API Reference for details, more examples, and additional information on the supported file formats.
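The two reader access styles described above, reading everything at once or iterating PSM by PSM, can share one code path. The sketch below illustrates that pattern with a hypothetical `MiniTSVReader` built on the standard library `csv` module. It is not one of the psm_utils readers, and the column names are assumptions chosen for the example.

```python
import csv
import io


class MiniTSVReader:
    """Hypothetical reader: iterate PSM rows lazily, or read them all."""

    def __init__(self, handle):
        self.handle = handle

    def __iter__(self):
        # Yield one PSM (as a dict) per row without loading the whole file.
        self.handle.seek(0)
        for row in csv.DictReader(self.handle, delimiter="\t"):
            row["score"] = float(row["score"])
            yield row

    def read_file(self):
        # Collect all PSMs at once by reusing the iterator.
        return list(self)


# An in-memory file keeps the example self-contained.
tsv = io.StringIO(
    "peptidoform\tspectrum_id\tscore\n"
    "ACDK\t1\t140.2\n"
    "CDEFR\t2\t132.9\n"
)
reader = MiniTSVReader(tsv)
for psm in reader:
    print(psm["peptidoform"])

all_psms = reader.read_file()
print(len(all_psms))
```

Iteration keeps memory use flat for large files, while `read_file()` is convenient when the full list is needed for downstream filtering.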
Handling peptide modifications
Supported notations
Peptidoform accepts all supported ProForma 2.0 modification types and notations,
through the pyteomics.proforma module. However, for some functionality, such as the
composition and theoretical_mass properties, the modification composition and mass,
respectively, must be resolvable. This can be achieved in multiple ways:
Using a controlled vocabulary identifier or name, such as PSI-MOD or Unimod:
>>> Peptidoform("AC[UNIMOD:4]DEK").theoretical_mass
621.24282637892
>>> Peptidoform("AC[U:4]DEK").theoretical_mass
621.24282637892
>>> Peptidoform("AC[U:Carbamidomethyl]DEK").theoretical_mass
621.24282637892
Using a molecular formula or mass shift:
>>> Peptidoform("AC[Formula:H3C2NO]DEK/2").theoretical_mass
621.24282637892
>>> Peptidoform("AC[+57.021464]DEK/2").theoretical_mass
621.24282637892
A drawback of using the mass shift is that the composition is not resolvable:
>>> Peptidoform("AC[+57.021464]DEK/2").composition
[...]
ModificationException: Cannot resolve composition for modification 57.021464.
Renaming modifications
Search engines often use their own, arbitrary names for modifications. In that case, properties such as mass or composition cannot be resolved:
>>> from psm_utils.io import read_file
>>> psm_list = read_file("msms.txt")
>>> psm_list["peptidoform"]
array([Peptidoform('AAAAAAALQAK/2'),
Peptidoform('[ac]-AAAAAEQQQFYLLLGNLLSPDNVVR/3'),
Peptidoform('[ac]-AAAAAEQQQFYLLLGNLLSPDNVVRK/3'), ...,
Peptidoform('YYYLPLVSN[de]PK/2'),
Peptidoform('YYYLTNVERLEELESDLK/3'), Peptidoform('YYYNGFYLLWI/3')],
dtype=object)
To address this issue, modifications can be renamed:
>>> psm_list.rename_modifications({
...     "ac": "U:Acetylation",
...     "ox": "U:Oxidation",
...     "de": "U:Deamidation",
...     "gl": "U:Gln->pyro-Glu",
... })
>>> psm_list["peptidoform"]
array([Peptidoform('AAAAAAALQAK/2'),
Peptidoform('[UNIMOD:Acetylation]-AAAAAEQQQFYLLLGNLLSPDNVVR/3'),
Peptidoform('[UNIMOD:Acetylation]-AAAAAEQQQFYLLLGNLLSPDNVVRK/3'),
..., Peptidoform('YYYLPLVSN[UNIMOD:Deamidation]PK/2'),
Peptidoform('YYYLTNVERLEELESDLK/3'), Peptidoform('YYYNGFYLLWI/3')],
dtype=object)
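Conceptually, renaming amounts to swapping each bracketed label for its mapped name. The sketch below shows this on plain ProForma strings with a regular expression. The `rename_labels` helper is hypothetical and much simpler than the library's `rename_modifications()` method: it works on strings rather than parsed peptidoforms and does not expand the `U:` prefix to `UNIMOD:` as the output above does.

```python
import re


def rename_labels(proforma, mapping):
    """Replace bracketed modification labels that appear in `mapping`."""

    def swap(match):
        label = match.group(1)
        # Labels without a mapping are left unchanged.
        return f"[{mapping.get(label, label)}]"

    return re.sub(r"\[([^\]]+)\]", swap, proforma)


mapping = {"ac": "U:Acetylation", "de": "U:Deamidation"}
print(rename_labels("[ac]-AAAAAEQQQFYLLLGNLLSPDNVVR/3", mapping))
print(rename_labels("YYYLPLVSN[de]PK/2", mapping))
```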
Handling fixed modifications
Additionally, fixed modifications that are not already part of the search engine output can be added and applied across the sequence:
>>> psm_list[19].peptidoform
Peptidoform('AAAPAPEEEMDECEQALAAEPK/2')
>>> psm_list.add_fixed_modifications([("Carbamidomethyl", ["C"])])
>>> psm_list[19].peptidoform
Peptidoform('<[Carbamidomethyl]@C>AAAPAPEEEMDECEQALAAEPK/2')
>>> psm_list.apply_fixed_modifications()
>>> psm_list[19].peptidoform
Peptidoform('AAAPAPEEEMDEC[Carbamidomethyl]EQALAAEPK/2')
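The effect of applying a fixed modification can be sketched as inserting the label after every matching residue. The `apply_fixed` helper below is hypothetical and only handles the simple case of a sequence without existing modifications and with an optional `/charge` suffix. The library operates on parsed peptidoforms and is not limited in this way.

```python
def apply_fixed(proforma, fixed_mods):
    """Insert a `[label]` after every residue listed in `fixed_mods`.

    Assumes an unmodified sequence with an optional `/charge` suffix.
    """
    sequence, sep, charge = proforma.partition("/")
    parts = []
    for residue in sequence:
        parts.append(residue)
        if residue in fixed_mods:
            parts.append(f"[{fixed_mods[residue]}]")
    return "".join(parts) + sep + charge


fixed_mods = {"C": "Carbamidomethyl"}  # residue -> modification label
print(apply_fixed("AAAPAPEEEMDECEQALAAEPK/2", fixed_mods))
```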