Quickstart

Peptidoform

Peptidoform accepts peptidoforms (combination of peptide, modifications, and — optionally — charge state) in ProForma 2.0 notation and supports several peptide-related operations, e.g.:

>>> from psm_utils import Peptidoform, PSM, PSMList
>>> peptidoform = Peptidoform("ACDEK/2")
>>> peptidoform.theoretical_mass
564.2213546837

>>> peptidoform.composition
Composition({'H': 36, 'C': 21, 'O': 10, 'N': 6, 'S': 1})

>>> peptidoform.sequential_composition
[Composition({'H': 1}),
Composition({'H': 5, 'C': 3, 'O': 1, 'N': 1}),
Composition({'H': 5, 'C': 3, 'S': 1, 'O': 1, 'N': 1}),
Composition({'H': 5, 'C': 4, 'O': 3, 'N': 1}),
Composition({'H': 7, 'C': 5, 'O': 3, 'N': 1}),
Composition({'H': 12, 'C': 6, 'N': 2, 'O': 1}),
Composition({'H': 1, 'O': 1})]

PSM

PSM links a Peptidoform to a specific spectrum where it was (putatively) identified. A PSM therefore contains the peptidoform, spectrum (meta)data, and peptide-spectrum match information:

>>> psm = PSM(
...     peptidoform=Peptidoform("VLHPLEGAVVIIFK/2"),
...     spectrum_id=17555,
...     run="Adult_Frontalcortex_bRP_Elite_85_f09",
...     collection="PXD000561",
...     is_decoy=False,
...     precursor_mz=767.9714,
... )
>>> psm.get_usi()
mzspec:PXD000561:Adult_Frontalcortex_bRP_Elite_85_f09:scan:17555:VLHPLEGAVVIIFK/2

The spectrum can be retrieved by the USI through the ProteomeXchange USI aggregator: http://proteomecentral.proteomexchange.org/usi/?usi=mzspec:PXD000561:Adult_Frontalcortex_bRP_Elite_85_f09:scan:17555:VLHPLEGAVVIIFK/2 Note that this is only possible because the spectrum has been fully indexed in one of the ProteomeXchange partner repositories (in this case both MassIVE and PeptideAtlas).

PSMList

PSMList is a simple list-like object that represents a group of PSMs, from one or more mass spectrometry runs or collections. This simple, Pythonic data structure can be flexibly implemented in various contexts.

>>> psm_list = PSMList(psm_list=[
...     PSM(peptidoform="ACDK", spectrum_id=1, score=140.2, retention_time=600.2),
...     PSM(peptidoform="CDEFR", spectrum_id=2, score=132.9, retention_time=1225.4),
...     PSM(peptidoform="DEM[Oxidation]K", spectrum_id=3, score=55.7, retention_time=3389.1),
... ])

PSMList directly supports iteration:

>>> for psm in psm_list:
...     print(psm.peptidoform.score)
140.2
132.9
55.7

PSM properties can be accessed as a single Numpy array:

>>> psm_list["score"]
array([140.2, 132.9, 55.7], dtype=object)

PSMList supports indexing and slicing:

>>> psm_list_subset = psm_list[0:2]
>>> psm_list_subset["score"]
array([140.2, 132.9], dtype=object)

>>> psm_list_subset = psm_list[0, 2]
>>> psm_list_subset["score"]
array([140.2, 55.7], dtype=object)

For more advanced and efficient vectorized access, converting the PSMList to a Pandas DataFrame is highly recommended:

>>> psm_df = psm_list.to_dataframe()
>>> psm_df[(psm_df["retention_time"] < 2000) & (psm_df["score"] > 10)]
   peptidoform  spectrum_id   run collection spectrum is_decoy  score qvalue   pep precursor_mz  retention_time protein_list  rank source provenance_data metadata rescoring_features
0        ACDK            1  None       None     None     None  140.2   None  None         None           600.0         None  None   None            None     None               None
1       CDEFR            2  None       None     None     None  132.9   None  None         None          1225.0         None  None   None            None     None               None

psm_utils.io

The psm_utils.io subpackage contains readers and writers for various PSM file formats (see Supported file formats). Each reader parses the specific PSM file format into a unified PSMList object, with peptidoforms parsed into the ProForma notation. Use the high-level psm_utils.io.read_file(), psm_utils.io.write_file(), and psm_utils.io.convert() functions to easily read, write, and convert PSM files:

>>> from psm_utils.io import read_file
>>> psm_list = read_file("data/QExHF04054_tandem.idXML", filetype="idxml")
>>> psm_list[0]
PSM(
   peptidoform=Peptidoform('QSGD[Ammonium]E[Ammonium]SYC[Carbamidomethyl]E[Ammonium]R/2'),
   spectrum_id='controllerType=0 controllerNumber=1 scan=4941',
   run=None,
   collection=None,
   spectrum=None,
   is_decoy=True,
   score=17.1,
   precursor_mz=624.252254215645,
   retention_time=1197.74208,
   protein_list=['sP06800'],
   source='idXML',
   provenance_data=None,
   metadata={
      'idxml:score_type': 'XTandem',
      'idxml:higher_score_better': 'True',
      'idxml:significance_threshold': '0.0'
   },
   rescoring_features=None
)

Alternatively, the more low-level file format-specific reader and writer classes can be used. Each reader has a read_file() function:

>>> from psm_utils.io.mzid import MzidReader
>>> psm_list = MzidReader("psms.mzid").read_file()
>>> psm_list[0].peptidoform
Peptidoform('GLTEGLHGFHVHEFGDNTAGC[Carbamidomethyl]TSAGPHFNPLSR/4')

And all readers support iteration over PSMs:

>>> for psm in MzidReader("psms.mzid"):
...     print(psm.peptidoform.proforma)
ACDEK
AC[Carbamidomethyl]DEFGR
[Acetyl]-AC[Carbamidomethyl]DEFGHIK
[...]

Similarly, writers can write single PSMs to a file:

>>> from psm_utils.io.tsv import TSVWriter
>>> with TSVWriter("psm_list.tsv", example_psm=psm_list[0]) as writer:
...     writer.write_psm(psm_list[0])

And writers can write entire PSM lists at once:

>>> with TSVWriter("psm_list.tsv", example_psm=psm_list[0]) as writer:
...     writer.write_file(psm_list)

Take a look at the Python API Reference for details, more examples, and additional information on the supported file formats.

Handling peptide modifications

Supported notations

Peptidoform accepts all supported ProForma 2.0 modification types and notations, through the pyteomics.proforma module. However, for some functionality, such as the composition and mass properties, the modification composition and mass, respectively, should be resolvable. This can be achieved in multiple ways:

Using a controlled vocabulary identifier or name, such as PSI-MOD or Unimod:

>>> Peptidoform("AC[UNIMOD:4]DEK").theoretical_mass
621.24282637892

>>> Peptidoform("AC[U:4]DEK").theoretical_mass
621.24282637892

>>> Peptidoform("AC[U:Carbamidomethyl]DEK").theoretical_mass
621.24282637892

Using a molecular formula or mass shift:

>>> Peptidoform("AC[Formula:H3C2NO]DEK/2").theoretical_mass
621.24282637892

>>> Peptidoform("AC[+57.021464]DEK/2").theoretical_mass
621.24282637892

A drawback of using the mass shift is that the composition is not resolvable:

>>> Peptidoform("AC[+57.021464]DEK/2").composition
[...]
ModificationException: Cannot resolve composition for modification 57.021464.

Renaming modifications

Often search engines use specific, arbitrary names for modifications. In that case, properties such as their mass or composition will not be resolvable.

>>> from psm_utils.io import read_file
>>> psm_list = read_file("msms.txt")
>>> psm_list["peptidoform"]
array([Peptidoform('AAAAAAALQAK/2'),
       Peptidoform('[ac]-AAAAAEQQQFYLLLGNLLSPDNVVR/3'),
       Peptidoform('[ac]-AAAAAEQQQFYLLLGNLLSPDNVVRK/3'), ...,
       Peptidoform('YYYLPLVSN[de]PK/2'),
       Peptidoform('YYYLTNVERLEELESDLK/3'), Peptidoform('YYYNGFYLLWI/3')],
      dtype=object)

To address this issue, modifications can be renamed:

>>> psm_list.rename_modifications({
    "ac": "U:Acetylation",
    "ox": "U:Oxidation",
    "de": "U:Deamidation",
    "gl": "U:Gln->pyro-Glu",
})
>>> psm_list["peptidoform"]
array([Peptidoform('AAAAAAALQAK/2'),
       Peptidoform('[UNIMOD:Acetylation]-AAAAAEQQQFYLLLGNLLSPDNVVR/3'),
       Peptidoform('[UNIMOD:Acetylation]-AAAAAEQQQFYLLLGNLLSPDNVVRK/3'),
       ..., Peptidoform('YYYLPLVSN[UNIMOD:Deamidation]PK/2'),
       Peptidoform('YYYLTNVERLEELESDLK/3'), Peptidoform('YYYNGFYLLWI/3')],
      dtype=object)

Handling fixed modifications

Additionally, fixed modifications that are not already part of the search engine output can be added and applied across the sequence:

>>> psm_list[19].peptidoform
Peptidoform('AAAPAPEEEMDECEQALAAEPK/2')

>>> psm_list.add_fixed_modifications([("Carbamidomethyl", ["C"])])
>>> psm_list[19].peptidoform
Peptidoform('<[Carbamidomethyl]@C>AAAPAPEEEMDECEQALAAEPK/2')

>>> psm_list.apply_fixed_modifications()
>>> psm_list[19].peptidoform
Peptidoform('AAAPAPEEEMDEC[Carbamidomethyl]EQALAAEPK/2')