4.7. Experiment

class asari.experiment.ext_Experiment(sample_registry, parameters)[source]

Similar to metDataModel.core.Experiment with preprocessing methods.

This encapsulates a set of LC-MS files using the same experimental method (chromatography and ionization) to be processed together.

E.g., data from positive ESI and negative ESI should not be in the same ext_Experiment instance.

This class has annotation and export functions.

The default asari workflow is in ext_Experiment.process_all.

annotate()[source]

Annotate features via JMS (jms.dbStructures) and khipu. The pre-annotation step is khipu-based empirical compound construction, followed by three steps of annotation:

1. Search the known compound database, via the neutral mass inferred by khipu.
2. Search singletons for a formula match.
3. Encapsulate remaining features in empCpd format, so that all are exported consistently.

Export Feature_annotation as tsv, and Annotated_empiricalCompounds in both JSON and pickle. Reference databases can be pre-loaded. Measured m/z values are calibrated to database-based values (db_mass_calibrate).

Note

This produces the default asari annotation, but one can redo annotation on the features afterwards, using a method of choice. With JMS/khipu, one can also pass custom adducts/isotopes to EED.adduct_patterns etc. See ExperimentalEcpdDatabase.get_isotope_adduct_patterns().

append_orphans_to_epmCpds(dict_empCpds)[source]

This is the third step of feature annotation in self.annotate: it encapsulates features without annotation in empCpd format. It takes dict_empCpds as input and returns the updated dict_empCpds. See also: annotate

Parameters: dict_empCpds (dict) – a dictionary of empirical compounds in empCpd format
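
The encapsulation step can be illustrated with a minimal sketch. This is not asari's own code: the helper name wrap_orphans and the keys MS1_pseudo_Spectra, id_number, interim_id, neutral_formula_mass and neutral_formula are assumptions made for illustration.

```python
def wrap_orphans(dict_empCpds, feature_list):
    """Return a copy of dict_empCpds with one new entry per unannotated feature.

    All key names here are assumed for illustration, not taken from asari.
    """
    # Collect the ids of features already assigned to an empirical compound.
    seen = {
        peak["id_number"]
        for cpd in dict_empCpds.values()
        for peak in cpd["MS1_pseudo_Spectra"]
    }
    updated = dict(dict_empCpds)
    next_id = len(updated)
    for peak in feature_list:
        if peak["id_number"] in seen:
            continue
        # Skip over any key that is already taken.
        while next_id in updated:
            next_id += 1
        # Each orphan becomes a single-feature empirical compound.
        updated[next_id] = {
            "interim_id": next_id,
            "neutral_formula_mass": None,
            "neutral_formula": None,
            "MS1_pseudo_Spectra": [peak],
        }
    return updated
```

With this shape, every feature ends up inside some empCpd entry, so a single export routine can handle annotated and unannotated features alike.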

db_mass_calibrate(required_calibrate_threshold=2e-06)[source]

Use KCD.evaluate_mass_accuracy_ratio to check systematic mass shift, which is calculated as the average ppm difference between measured m/z and theoretical values. If greater than required_calibrate_threshold (default 2 ppm), calibrate m/z values for the whole experiment by updating self.CMAP.FeatureList.

Parameters: required_calibrate_threshold (float, optional, default: 0.000002) – if the mass shift exceeds this value, mass correction will be applied.

Note

Data format in good_reference_landmark_peaks: [{'ref_id_num': 99, 'apex': 211, 'height': 999999}, …], where ref_id_num is the index number of the mass track in MassGrid.
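
The threshold check in db_mass_calibrate can be sketched as follows. This is a simplified illustration, not the KCD.evaluate_mass_accuracy_ratio implementation: it assumes the shift is expressed as a relative ratio, (measured - theoretical) / theoretical, so that 2e-06 corresponds to 2 ppm, and that correction divides each m/z by (1 + shift). The function names are hypothetical.

```python
def mean_mass_shift(measured, theoretical):
    """Average relative difference between measured and theoretical m/z values."""
    ratios = [(m - t) / t for m, t in zip(measured, theoretical)]
    return sum(ratios) / len(ratios)


def calibrate_if_needed(mz_values, shift, threshold=2e-6):
    """Correct m/z values only when the systematic shift exceeds the threshold."""
    if abs(shift) <= threshold:
        # Shift is within tolerance: leave the values unchanged.
        return list(mz_values)
    # Assumed correction: remove the relative shift from every value.
    return [mz / (1 + shift) for mz in mz_values]
```

For example, a systematic shift of 5 ppm (5e-06) exceeds the default threshold and would trigger correction, while a shift of 1 ppm would leave the values untouched.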

export_CMAP_pickle()[source]

Export main CMAP data and MassGrid to pickle, which can be used for visual data exploration.

Included in exported pickle:

    {
        '_number_of_samples_': self.CMAP._number_of_samples_,
        'rt_length': self.CMAP.rt_length,
        'rt_reference_landmarks': [p['apex'] for p in self.CMAP.good_reference_landmark_peaks],
        'rt_records': [sample.get_rt_calibration_records() for sample in self.all_samples],
        'dict_scan_rtime': self.CMAP.dict_scan_rtime,
        'list_mass_tracks': self.CMAP.composite_mass_tracks,
        'MassGrid': dict(self.CMAP.MassGrid),
    }

rt_records includes for each sample:

    {
        'sample_id': self.sample_id,
        'name': self.name,
        'rt_landmarks': self.rt_landmarks,
        'reverse_rt_cal_dict': self.reverse_rt_cal_dict,
    }

Note

RT calibration is exported to include sample.reverse_rt_cal_dict, i.e. {key=reference scan number, value=sample specific scan number}. The file cmap.pickle can get big; a data throttle may be added in the future.
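
For data exploration, the exported pickle can be read back with the standard library. The sketch below relies only on the keys documented above; the helper names are hypothetical, and the fallback in sample_scan_for (returning the reference scan number when no mapping entry exists) is an assumption of this sketch rather than documented behavior.

```python
import pickle


def load_cmap(path):
    """Load the exported CMAP dictionary from a pickle file."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


def summarize_cmap(cmap):
    """Collect a few headline numbers from the documented top-level keys."""
    return {
        "number_of_samples": cmap["_number_of_samples_"],
        "rt_length": cmap["rt_length"],
        "number_of_landmarks": len(cmap["rt_reference_landmarks"]),
        "number_of_mass_tracks": len(cmap["list_mass_tracks"]),
        "sample_names": [rec["name"] for rec in cmap["rt_records"]],
    }


def sample_scan_for(rt_record, reference_scan):
    """Map a reference scan number to the sample-specific scan number.

    Assumption: scans without an entry are treated as unchanged.
    """
    return rt_record["reverse_rt_cal_dict"].get(reference_scan, reference_scan)
```

Only unpickle files that come from a trusted source, since loading a pickle can execute arbitrary code.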

export_all(anno=True)[source]

Export all files. Annotation of features to empirical compounds is done here.

Parameters: anno (bool, optional, default: True) – if true, generate annotation files, export CMAP pickle and do QC plot; else skip annotating.

export_feature_tables(_snr=2, _peak_shape=0.7, _cSelectivity=0.7)[source]

Export multiple feature tables. The filtering parameters (_snr, _peak_shape, _cSelectivity) apply only to the preferred table and the unique_compound_table. The full table is filtered only by the initial peak detection parameters, which use lower values of snr and gaussian_shape.

Parameters:

_snr (float, optional, default: 2) – signal to noise ratio, peaks must have SNR above this value to be a preferred feature
_peak_shape (float, optional, default: 0.7) – the goodness of fit for a peak must be above this value to be a preferred feature
_cSelectivity (float, optional, default: 0.7) – the cSelectivity of a peak must be above this value to be a preferred feature

Outputs

Multiple feature tables under the output directory:

1. preferred table under outdir, after quality filtering by SNR, peak shape and chromatographic selectivity.
2. full table under outdir/export/
3. unique compound table under outdir/export/
4. dependent on target extract option, a targeted_extraction table under outdir.
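
The quality filter behind the preferred table can be sketched as three threshold checks applied together. This is an illustration only: the feature keys snr, goodness_fitting and cSelectivity and the function name select_preferred are assumptions, and features are represented as plain dictionaries.

```python
def select_preferred(features, snr=2, peak_shape=0.7, c_selectivity=0.7):
    """Keep features that are strictly above all three quality thresholds.

    The dictionary keys are assumed for illustration.
    """
    return [
        f
        for f in features
        if f["snr"] > snr
        and f["goodness_fitting"] > peak_shape
        and f["cSelectivity"] > c_selectivity
    ]
```

A feature failing any one of the three checks is dropped from the preferred table but, per the description above, can still appear in the full table.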

export_log()[source]: Export project parameters to project.json, which is also used by asari viz.

export_peak_annotation(dict_empCpds, KCD, export_file_name_prefix)[source]

Export feature annotation to tab delimited tsv file, where interim_id is empCpd id.

Parameters:

dict_empCpds (dict) – dictionary of empirical compounds, using interim_id as key, as seen in JMS.
KCD (KnownCompoundDatabase instance) – the known compound database that was used in annotating the empirical compounds.
export_file_name_prefix (str) – to be used in the output file name.

export_readme()[source]: Export a README.txt file with simple instructions for end users.

generate_qc_plot_pdf(outfile='qc_plot.pdf')[source]: Generates a PDF figure with a combined plot of feature quality metrics. Used only when --anno True (default). Skipped if matplotlib is missing.

get_max_scan_number(sample_registry)[source]

Return max scan number among samples, or None if no valid sample.

Parameters: sample_registry (dict) – a dict that maps sample IDs to sample data

get_reference_sample_id()[source]: Get the reference sample id, either by user specification or by using the sample with the most number_anchor_mz_pairs. This assumes that the sample with the most good m/z values has good coverage of features.
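
The selection rule can be sketched as below. The function name pick_reference_sample, the user_choice argument, and the layout of sample_registry (a dict whose values carry a number_anchor_mz_pairs entry) are assumptions for illustration; falling back to automatic selection when the user's choice is not in the registry is also an assumption.

```python
def pick_reference_sample(sample_registry, user_choice=None):
    """Return the user-specified sample id, else the id with the most anchor pairs."""
    if user_choice is not None and user_choice in sample_registry:
        return user_choice
    if not sample_registry:
        # Nothing to choose from.
        return None
    return max(
        sample_registry,
        key=lambda sid: sample_registry[sid]["number_anchor_mz_pairs"],
    )
```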

get_valid_sample_ids()[source]: Get valid sample ids, as some samples may not be extracted successfully.

load_annotation_db(src='hmdb4')[source]

Load the database of known compounds using jms.dbStructures.knownCompoundDatabase. The compound tree is a precomputed index. The src parameter is not used now, but is a placeholder for adding more options later.

Parameters: src (str, optional, default: hmdb4) – not used now, but may in the future dictate which database is used to generate annotations

process_all()[source]

This is the default asari workflow.

1. Build MassGrid, using either the pairwise (small study) or the clustering method. Choose one reference from all samples for the largest number of landmark m/z tracks.
2. RT alignment via a LOWESS function, using selective landmark peaks.
3. Build composite elution profile (composite_mass_tracks), by cumulative sum of mass tracks from all samples after RT correction.
4. Global peak detection is performed on each composite massTrack.
5. Mapping global peaks (i.e. features) back to all samples and extract sample specific peak areas. This completes the FeatureTable.

Updates: self.CMAP as an instance of CompositeMap, with the MassGrid, composite map and features within.
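
The composite elution profile step can be illustrated with a heavily simplified sketch. It is not asari's implementation: the LOWESS fit is not shown, RT correction is reduced to a precomputed scan-number mapping per sample, and the function name and argument layout are hypothetical.

```python
def build_composite_track(sample_tracks, rt_maps, rt_length):
    """Sum intensities from all samples on the reference scan axis.

    sample_tracks: one list of intensities per sample, indexed by sample scan.
    rt_maps: one dict per sample, {sample scan number: reference scan number};
             scans absent from the dict are assumed to be already aligned.
    rt_length: number of scans on the reference axis.
    """
    composite = [0.0] * rt_length
    for track, rt_map in zip(sample_tracks, rt_maps):
        for scan, intensity in enumerate(track):
            ref_scan = rt_map.get(scan, scan)
            # Ignore scans that map outside the reference axis.
            if 0 <= ref_scan < rt_length:
                composite[ref_scan] += intensity
    return composite
```

Because peaks from different samples land on the same reference scans after correction, peak detection can then run once on the summed track instead of once per sample.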

select_unique_compound_features(dict_empCpds)[source]

Get the unique feature with the highest composite peak area per empirical compound. One may consider alternatives to select the peak representing an empirical compound, e.g. by SNR or M+H, M-H ions. This can be done separately on the exported files.

Parameters: dict_empCpds (dict) – dictionary of empirical compounds, using interim_id as key, as seen in JMS.
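
Selecting one representative feature per empirical compound can be sketched as follows. The keys MS1_pseudo_Spectra, peak_area and id_number, and the function name, are assumptions for illustration rather than the method's actual internals.

```python
def pick_representative_features(dict_empCpds):
    """Map each interim_id to the id of its feature with the largest peak area.

    Key names are assumed for illustration.
    """
    chosen = {}
    for interim_id, cpd in dict_empCpds.items():
        peaks = cpd.get("MS1_pseudo_Spectra", [])
        if peaks:
            best = max(peaks, key=lambda p: p["peak_area"])
            chosen[interim_id] = best["id_number"]
    return chosen
```

Swapping the key function (for example to rank by SNR) gives the alternative selection strategies mentioned above.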