Quantifying DNA-Protein Matches
FACULTY ADVISOR: David Benedetto PROJECT SPONSOR: Anthony Westbrook, Hubbard Center for Genome Studies
Mallorie Biron, Sarah Hall, Mitchell Hersey
Department of Computer Science, University of New Hampshire, Durham, NH

In the field of molecular biology, determining the exact makeup of a sample is only part of the battle. What comes next is making sense of the thousands to millions of strings of A, T, C, and G, the nucleotides that make up an organism's DNA. Various products use protein alignment to profile DNA by comparing that sequence against a library of known genetic structures. Currently, the most popular protein alignment software is BLAST. However, BLAST is slow and takes an estimated 22 years to do the same number of reads that this product, PALADIN, does in 31 hours. Unfortunately, BLAST uses the industry standard output format (BLAST tabular output), whereas PALADIN can only output in SAM format. In order for PALADIN to see more widespread use, it will need to conform to the industry standard. This project adds an option for PALADIN to output genetic data in BLAST tabular format, without replacing or eliminating the existing SAM format output PALADIN uses. The project also creates scripts to compare PALADIN against its competitors in multiple scenarios, shows through statistical analysis how PALADIN performs at classifying different types of DNA, and analyzes the resulting data to represent it in a meaningful way.
Introduction

Protein alignment software profiles DNA by comparing a sequence against a library of known genetic structures. Currently, the most popular protein alignment software is BLAST. However, BLAST is slow: it takes an estimated 22 years to do the 240 million reads that our sponsor's product, PALADIN, does in 31 hours. BLAST uses the industry standard output format (BLAST tabular output), whereas PALADIN can only output in SAM format, and these differing file formats are what cause people to choose one program over the other. The two outputs contain different information, and some of the fields cannot be easily converted between them. To help PALADIN see more widespread use, we added an option for PALADIN to output data in BLAST tabular format, and we show how PALADIN performs at classifying different DNA samples through statistical analysis.
Goals

• Add a command line option to PALADIN to output data in BLAST tabular format
• BLAST output values should be virtually identical to the output of a native BLAST system
• Bit-score and e-value must be calculated
• The BLAST command must be able to output to stdout (for real-time analysis by secondary programs) and to a user-designated file
• The BLAST features must not affect the existing PALADIN capabilities and should have no effect on performance
• Test PALADIN output for validity against BLAST's output
Design

The command line option to produce BLAST tabular output was added directly to the PALADIN source. Ten of the twelve fields could be obtained from information already produced for the SAM format; the two fields that required additional research were the e-value and the bit-score.
The bit-score provides information about how similar the query sequence and the sequence it matched in the database are; the higher the bit-score, the more similar the sequences. It is calculated as

    bitscore = (λ × raw score − ln K) / ln 2

The raw score is provided by the SAM extended field AS, the alignment score, which indicates how well the sequences match. λ and K are constants; for gapped alignments, these values are determined by the gap extension and gap opening penalties, and can be located in the paper written by Stephen Altschul in Methods in Enzymology (vol. 266, p. 474).

The e-value is affected by the size of the database. It provides the number of matches with the same score that can occur by chance given a random database; smaller e-values are better. It is calculated as

    evalue = m × n × 2^(−bitscore)

where m is the length of the query sequence and n is the length of the reference database.
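Under the definitions above, both statistics can be sketched in a few lines of Python. This is an illustrative sketch, not PALADIN's implementation; the default λ and K values shown are example gapped-alignment constants and must be replaced with the constants matching the actual gap opening/extension penalties.

```python
import math

def bitscore(raw_score, lam=0.267, k=0.041):
    """Convert a raw alignment score (the SAM AS field) to a bit-score.

    lam (lambda) and k are Karlin-Altschul constants; these defaults are
    illustrative values for one gapped-alignment parameter set.
    """
    return (lam * raw_score - math.log(k)) / math.log(2)

def evalue(bit_score, query_len, db_len):
    """Expected number of chance hits with at least this score,
    given query length m and reference database length n."""
    return query_len * db_len * 2 ** (-bit_score)
```

A higher raw score always yields a higher bit-score, and a higher bit-score drives the e-value toward zero, matching the "lower e-value = better match" rule above.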
SAM vs. BLAST

SAM stands for Sequence Alignment Map and is used widely within the bioinformatics community. As the tables below show, the fields of the two formats differ. For BLAST tabular, the e-value and bit-score are important statistical figures; these cannot be calculated directly from SAM and require figures PALADIN does not output.
SAM FORMAT

Col  Field  Type    Brief Description
1    QNAME  String  Query template NAME
2    FLAG   Int     Bitwise FLAG
3    RNAME  String  Reference sequence NAME
4    POS    Int     1-based leftmost mapping POSition
5    MAPQ   Int     MAPping Quality
6    CIGAR  String  CIGAR string
7    RNEXT  String  Ref. name of the mate/NEXT read
8    PNEXT  Int     Position of the mate/NEXT read
9    TLEN   Int     Observed Template LENgth
10   SEQ    String  Segment SEQuence
11   QUAL   String  ASCII of Phred-scaled base QUALity+33
BLAST TABULAR

Col  Field     Type    Brief Description
1    qseqid    String  query (e.g., gene) sequence id
2    sseqid    String  subject (e.g., reference genome) sequence id
3    pident    Float   percentage of identical matches
4    length    Int     alignment length
5    mismatch  Int     number of mismatches
6    gapopen   Int     number of gap openings
7    qstart    Int     start of alignment in query
8    qend      Int     end of alignment in query
9    sstart    Int     start of alignment in subject
10   send      Int     end of alignment in subject
11   evalue    Float   expected value: the number of expected hits of similar score that could be found by chance. Lower evalue = better match
12   bitscore  Float   bit score: the required size of a sequence database in which the current match could be found just by chance. Higher bit score = better sequence similarity
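As a rough illustration of the mapping between the two formats, the sketch below derives a handful of BLAST tabular fields from one SAM line. This is not PALADIN's actual implementation: it assumes the optional AS (alignment score) and NM (edit distance) tags are present, uses illustrative λ/K constants, and omits several columns (gapopen and the alignment coordinates) for brevity.

```python
import math
import re

def sam_to_blast_tab(sam_line, db_len, lam=0.267, k=0.041):
    """Derive a partial BLAST tabular row from one SAM alignment line.

    Assumes AS and NM optional tags are present; lam and k are
    illustrative Karlin-Altschul constants, not PALADIN's values.
    """
    cols = sam_line.rstrip("\n").split("\t")
    qname, rname, cigar = cols[0], cols[2], cols[5]
    tags = dict(t.split(":", 2)[::2] for t in cols[11:])  # tag -> value

    # Alignment length = sum of aligned CIGAR operations (M/I/D/=/X)
    aln_len = sum(int(n) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)
                  if op in "MID=X")
    mismatches = int(tags.get("NM", 0))
    pident = 100.0 * (aln_len - mismatches) / aln_len if aln_len else 0.0

    # Bit-score and e-value from the formulas in the Design section
    raw = int(tags.get("AS", 0))
    bits = (lam * raw - math.log(k)) / math.log(2)
    seq_len = len(cols[9]) if cols[9] != "*" else aln_len
    e = seq_len * db_len * 2 ** (-bits)

    return "\t".join(map(str, (qname, rname, round(pident, 2), aln_len,
                               mismatches, e, round(bits, 1))))
```

Note how everything except the last two fields comes straight from mandatory SAM columns, while the e-value additionally needs the database length, which is not recorded anywhere in a SAM line.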
Results

Per-field differences between PALADIN and BLAST output over matching alignments (Avg. Diff = average difference; Std. Dev = standard deviation of the differences):

Species        Row            pident      length     mismatch   gapopen    qstart      qend        sstart       send         evalue     bitscore
C.albicans     Avg. Diff      3.2595750   2.8434667  1.0257286  0.0154093  50.4960954  48.2315577  16.4209315   16.5396040   0.0968493  13.5958109
               Std. Dev       7.2856898   4.8655845  2.2758560  0.1396287  46.4923501  42.3380774  63.2226428   63.3066551   0.6573271  9.1565305
C.elegans      Avg. Diff      3.0210508   2.3793869  0.7799665  0.0341666  48.5370874  44.2045918  10.5039623   10.4351764   0.1471118  13.2467464
               Std. Dev       6.6836567   4.4171659  1.7850386  0.2028576  44.2158653  40.2511900  131.2518809  131.2652380  0.8245354  8.9829315
M.luteus       Avg. Diff      7.7158336   4.8094471  2.4492754  0.0327429  46.4739667  45.3896940  17.5217391   17.5185185   0.0809434  10.8273752
               Std. Dev       10.6755533  5.7613361  3.3255962  0.2111331  45.1989901  37.1774787  56.4983716   56.3604286   0.5130537  8.1565076
C.difficile    Avg. Diff      4.7634605   3.0958702  1.5129056  0.0136431  48.9771386  47.0335546  13.0634218   12.9634956   0.0712061  12.8809255
               Std. Dev       9.6609350   5.2636527  3.0385175  0.1364773  46.4321218  41.5537940  49.9200951   49.5425380   0.5680238  7.8069468
F.oxysporum    Avg. Diff      9.9493971   5.6601190  3.0894048  0.0595238  48.2278571  46.1882143  32.1050000   32.0997619   0.1105817  8.2891310
               Std. Dev       11.4411700  5.6999491  3.5215706  0.2659907  44.8225137  34.7429481  135.7312297  135.7270336  0.6995300  6.8482552
P.aeruginosa   Avg. Diff      2.0137792   1.8498409  0.6336516  0.0053699  49.4714598  47.2488067  8.4763325    8.5314240    0.0674373  14.1401452
               Std. Dev       5.9984083   4.1609257  1.8851238  0.0844499  46.7020879  43.7120403  93.3220207   93.2058949   0.5596276  8.0819474
All 6 Species  Avg. Diff      5.1205161   3.4396885  1.5818221  0.0268093  48.6972675  46.3827365  16.3485646   16.3479967   0.0956883  12.1633557
               Avg Std. Dev.  2.2658460   0.6627435  0.7535796  0.0652388  1.0366124   3.3857952   38.0342931   38.1217044   0.1145422  0.8391452
Analysis of Results

To verify the implementation of BLAST tabular output, PALADIN was tested against six (6) different species. The DNA reads of these species were synthetically generated using a program named ART, which simulates the kind of DNA reads PALADIN takes as input. Different species were used since the programs behave differently on different species. The differences for each field were compared for each matching alignment. A matching alignment is one where the qseqid and the sseqid are the same, meaning that a read mapped to the same reference in the protein database. Since the programs use different algorithms, different alignments are to be expected; the goal is not to drive the differences to zero, since the two programs will not produce identical alignments. The alignments should be similar most of the time, so the differences should be close in value. For the raw score calculation, BLAST uses substitution matrices. PALADIN does not currently support this, but it will in the future; some discrepancies in the bit-score and e-value are expected, and these will be resolved once raw scores are computed from substitution matrices.

Conclusion

As seen in the results, BLAST tabular output has been properly implemented. With more testing, PALADIN can now be used as a replacement for BLAST. With the addition of this output, PALADIN can reach a wider audience as a protein alignment tool.
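The comparison described above can be sketched as follows. This is a simplified stand-in for the project's comparison scripts, assuming plain 12-column tabular files; the function names are illustrative.

```python
import statistics

# Fields 3-12 of BLAST tabular output, in order
FIELDS = ["pident", "length", "mismatch", "gapopen", "qstart",
          "qend", "sstart", "send", "evalue", "bitscore"]

def load_tab(path):
    """Map (qseqid, sseqid) -> the ten numeric fields of a tabular file."""
    rows = {}
    with open(path) as fh:
        for line in fh:
            cols = line.rstrip("\n").split("\t")
            rows[(cols[0], cols[1])] = [float(c) for c in cols[2:12]]
    return rows

def field_differences(blast_rows, paladin_rows):
    """Average absolute difference and standard deviation per field,
    over matching alignments (same qseqid and sseqid in both outputs)."""
    diffs = {f: [] for f in FIELDS}
    for key in blast_rows.keys() & paladin_rows.keys():
        for f, b, p in zip(FIELDS, blast_rows[key], paladin_rows[key]):
            diffs[f].append(abs(b - p))
    return {f: (statistics.mean(d),
                statistics.stdev(d) if len(d) > 1 else 0.0)
            for f, d in diffs.items() if d}
```

Restricting the comparison to the intersection of (qseqid, sseqid) keys is what implements the "matching alignment" rule: reads that the two tools mapped to different references are excluded rather than counted as disagreements.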
Future Work

• Test PALADIN against other tools (DIAMOND, etc.)
• Test the program with larger genomes and samples to further prove validity
• Update the lambda and K values to the most recent table
• Implement BLOSUM and PAM substitution scores
Terms

• PALADIN: Protein Alignment And Detection Interface, a protein sequence alignment tool designed for the accurate functional characterization of genetic material from environmental samples.
• BLAST: Basic Local Alignment Search Tool, which compares biological sequence information to a library of sequences in order to identify it. BLAST is PALADIN's competitor and is much slower than PALADIN.
R.E.D.S.W.A.R.M.S. - Resilient Enterprise Deployment System Which Automates tRaining Modules Securely
AUTHORS: Maximiliano DelRio, Benjamin Patton, Dylan Wheeler
ADVISOR: Kenneth Graf
As the world becomes more digital, cybersecurity becomes ever more vital. To anticipate this demand in talent, the University of New Hampshire has a cybersecurity club and team that fosters growth in students by training them to face real-world scenarios through competitions. However, building these mock networks for training is currently manual, time-consuming, and error-prone. REDSWARMS is a system designed to facilitate the automatic creation and configuration of small-scale enterprise networks to more quickly perform those security training scenarios and competitions. It features virtual infrastructure using VMware's platform for managing a network of virtual machines, as well as an automated deployment pipeline with Ansible and Ansible AWX. With REDSWARMS, the cybersecurity team can stand up brand new networks for training in a matter of minutes, allowing students to spend less time configuring and more time learning.

Max Del Rio, Ben Patton, & Dylan Wheeler
Sponsor: Ken Graf + Cybersecurity Club, Department of Computer Science
https://bitbucket.org/unhredswarms/maincode/
Introduction

Context: UNH's Cybersecurity Club trains students to tackle tomorrow's biggest cybersecurity challenges. The Club provides a wealth of resources for students to practice and grow by sponsoring Cyber Defense/Offense competitions, labs, and security research.

Problem: Standing up networks to train on is time-consuming and error-prone. Students often spend entire meetings cloning and configuring virtual machines.

Solution: By automating this process, students can spend less time configuring and more time learning.
Ansible Workflow

Ansible AWX manages and orchestrates network deployments. A team network is deployed via a workflow composed of smaller deployment tasks as templates. Each template contains the necessary steps to configure each host; each component could be deploying a VM template or configuring an application/computer.

Host information, such as IP addresses, users, and the target VM template file, is maintained as an inventory. Similar hosts can be grouped using a tagging system, which allows us to run certain tasks on specific hosts; for example, Linux configuration tasks can run independently of Windows tasks. This also simplifies the process of adding new hosts to a deployment.

With this pipeline in place, UNH will also be in prime position to host cybersecurity competitions and events.
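The tag-based host selection described above can be illustrated with a minimal sketch. The host names, addresses, and tags here are hypothetical examples, not the project's actual AWX inventory.

```python
# Minimal sketch of tag-based host selection, in the spirit of the
# Ansible inventory groups described above. All hosts/tags are
# hypothetical examples.
INVENTORY = {
    "web01":  {"ip": "10.0.1.10", "tags": {"linux", "nginx"}},
    "dc01":   {"ip": "10.0.1.20", "tags": {"windows", "ad"}},
    "kiosk1": {"ip": "10.0.1.30", "tags": {"linux", "desktop"}},
}

def hosts_with(*tags):
    """Return host names whose tag set contains all requested tags,
    so e.g. Linux configuration tasks target only Linux hosts."""
    wanted = set(tags)
    return sorted(h for h, meta in INVENTORY.items()
                  if wanted <= meta["tags"])
```

Grouping by tags rather than by fixed host lists is what makes adding a new host cheap: tag it once, and every matching task picks it up automatically.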
Goals

• A network/scenario can be deployed and destroyed automatically using an Ansible module for VMware
• Ansible pipeline must be at least twice as fast as the manual equivalent (and scale with large teams)
• System must be packaged and used to deploy team networks for a UNH-hosted competition
• Each team's network must be isolated from the other teams and from the competition infrastructure
• Produce necessary competition documents for a future UNH-hosted cybersecurity event
• Allow networks to be easily customized with operating systems, devices, and software
Team Scenario Components

Various components of the team network are meant to simulate systems that students might see in a mock enterprise network or competition scenario.

Technologies:
• Palo Alto virtual router isolating and controlling access from each team to the internet
• pfSense firewall/router appliance
• Redmine issue tracker
• Ubuntu Desktop for Linux workstations
• Nginx web server for a "customer facing" service
• Windows Server with AD service for user/policy management

Features:
• Customizable networks and devices, where a tagging system is used to identify device roles and locations in a deployment
• Intentionally vulnerable systems, such as old or unconfigured versions of software, to allow students to practice finding and exploiting vulnerabilities

Results

• Completed one-touch automated deployment of a competition scenario
• Authored detailed technical documentation for future students to expand
• Built a modular framework to support additional scenarios with more complex networks or vulnerabilities
• Project version control managed by Bitbucket
• With only minor changes, REDSWARMS can deploy a Cyber Defense competition in its current state
Future Work

• Additional scenarios to be created for different competition styles (including new software or operating systems)
• Introducing intentionally vulnerable components for remediation training
• Hosting a competition with our new technology
• Ensuring the lab's infrastructure can support a competition
Winning Project
2020 INTERDISCIPLINARY SCIENCE AND ENGINEERING SYMPOSIUM
COMPUTER SCIENCE-SYSTEMS
AUTHORS: Mallorie Biron, Sarah Hall, Mitchell Hersey