HAD Spreadsheet Data Format:
Column A: Primary GI. The primary GI found in the SFLD for a given unique sequence.
Column B: All GIs. These are all the GIs with sequences identical to Primary GI.
Column C: EFD ID. Internal SFLD identifier.
Column D: Seq Length. The length of the sequence.
Column E: URL to EFD ID. A clickable link that references the SFLD page for the
appropriate enzyme.
Column F: hasStructure. This is a binary list stating whether there are PDB structures
associated with a node (useful for coloring, etc., in Cytoscape).
Column G: PDBs. This is a list of all PDBs associated with a given sequence.
Column H: isCharacterized. This is a binary list (Yes/No) stating whether this sequence
has been functionally characterized. Sequences are identified as characterized from
a) SFLD evidence codes (CFM, IES) or b) an entry in Swiss-Prot.
Column I: FamilyAssignment. This identifies the function (SFLD family) to which this
sequence has been assigned. This column will not have the same number of entries as
column H (isCharacterized), because it also includes sequences that human annotators
have assigned to functional families in the SFLD and does not include Swiss-Prot
entries not assigned to SFLD families.
Column J: Swiss-Prot ID. The Swiss-Prot ID for the sequence if it exists.
Column K: isTarget. Binary list (Yes/No) indicating whether the sequence is/was a target
of Lilly, P01 or EFI (LabDB).
Column L: targetStatus. The current status of the Lilly/P01/EFI target.
Column M: LillyTarget ID. Indicates the identifiers for the target(s) from Lilly.
Column N: P01Target ID. Indicates the identifiers for the target(s) from P01.
Column O: EFITarget ID. Indicates the identifiers for target(s) from EFI. Note: Some
targets didn't have exact matches in the SFLD HAD sequence set. In these cases, the
percent ID for the match of the target sequence to the SFLD sequence is indicated in
parentheses.
Column P: LabDB Url. If the sequence is in LabDB, the link to that entry.
Column Q: Type of life. As defined by the division record of the organisms from which
the enzymes are derived.
Column R: Species. A list of all species that correspond to the node.
Column S: Taxon ID. If it exists, the taxonomy ID for the species listed in column R.
Column T: Genome GI. The GI for the genomic DNA for this sequence.
Column U: DNAAvailable. This is a binary list (Yes/No) that indicates whether any of the
species in column R are in either MTDF or ATCC.
Column V: inMTDF. This is a binary list that indicates whether any of the species in
column R are in the MTDF list.
Column W: inATCC. This is a binary list that indicates whether any of the species in
column R are in the ATCC list.
Column X: Subgroup. This is the SFLD subgroup that this sequence is assigned to.
Column Y: Cytoscape ID. The Cytoscape identifier this sequence is associated with in the
representative network.
Column Z: Potential Targets. Yes indicates that all of the following are true: there are
no structures for this sequence, the sequence is not experimentally characterized (as
specified in column H), the sequence is not a target (as specified in column K), and the
species is in MTDF and/or ATCC.
Column AA: spAnnot. Annotation from Swiss-Prot.
Column AB: nonHADPfamDom. Indicates the non-HAD-clan Pfam domain with the
most significant match (within the gathering cutoff specified by Pfam) to the sequence.
(Potentially useful for identifying multidomain proteins.)
Column AC: goldStd. Indicates the gold standard protein represented by the sequence
(each gold standard protein was mapped to the single closest match in the HAD SFLD
superfamily set) and the percent ID for the match.
Column AD: enzyme ID. Internal SFLD identifier.
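
The rule behind the Potential Targets column and the parenthesized percent ID in the EFITarget ID column can be sketched in Python as follows. This is only an illustration under stated assumptions: each row is assumed to be a dictionary keyed by the column names given above, hasStructure is assumed to use the same Yes/No encoding as the other binary columns, and the entry layout "identifier (percent)" and the example identifier are hypothetical, since the exact cell formatting is not specified here.

```python
import re


def is_yes(value):
    """Treat a cell as true when it reads 'Yes' (case-insensitive). Assumed encoding."""
    return str(value).strip().lower() == "yes"


def is_potential_target(row):
    """Re-derive the Potential Targets flag from the columns it is defined on.

    `row` is assumed to be a dict keyed by column name (hypothetical layout).
    """
    return (
        not is_yes(row["hasStructure"])         # no structures for this sequence
        and not is_yes(row["isCharacterized"])  # not experimentally characterized
        and not is_yes(row["isTarget"])         # not already a target
        and (is_yes(row["inMTDF"]) or is_yes(row["inATCC"]))  # species in MTDF and/or ATCC
    )


# Assumed entry layout: an identifier, optionally followed by "(percent ID)".
_EFI_ENTRY = re.compile(r"\s*(.+?)\s*(?:\(\s*(\d+(?:\.\d+)?)\s*%?\s*\))?\s*")


def parse_efi_target(entry):
    """Split one EFITarget ID entry into (identifier, percent ID or None)."""
    match = _EFI_ENTRY.fullmatch(entry)
    if match is None:
        # Empty or unexpected cell: return it unchanged, with no percent ID.
        return entry.strip(), None
    identifier, percent = match.groups()
    return identifier, float(percent) if percent is not None else None


example_row = {
    "hasStructure": "No",
    "isCharacterized": "No",
    "isTarget": "No",
    "inMTDF": "Yes",
    "inATCC": "No",
}
print(is_potential_target(example_row))
print(parse_efi_target("EFI-500123 (97.5)"))
```

If the spreadsheet is exported as CSV with a header row that matches these column names, rows produced by `csv.DictReader` could be passed to `is_potential_target` directly. Comparing the result with the stored Potential Targets value is one way to check a file for consistency.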