HWbash
- Preparatory steps and submitting
- Introduction to tasks A-C
- Task A (counting proteins)
- Task B (counting matches)
- Task C (joining information)
- Task D (passwords)
Lecture on Perl, Lecture on command-line tools
- In this set of tasks, use command-line tools or one-liners in Perl, awk or sed. Do not write any scripts or programs.
- Your commands should also work for other input files with the same format (do not try to generalize them too much, but also do not rely on very specific properties of a particular input, such as the number of lines).
Preparatory steps and submitting
# create a folder for this homework
mkdir bash
# move to the new folder
cd bash
# link input files to the current folder
ln -s /tasks/bash/human.fa /tasks/bash/dog.fa /tasks/bash/matches.tsv /tasks/bash/names.txt .
# copy protocol to the current folder
cp -i /tasks/bash/protocol.txt .
- Now you can open protocol.txt in your favorite editor and start working.
- Command ln created symbolic links (shortcuts) to the input files, so you can use them under names such as human.fa rather than full paths such as /tasks/bash/human.fa.
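To see what a symbolic link looks like, here is a small sketch that works entirely in a temporary directory. The file toy.fa and the directory names are made up for this illustration and are not part of the homework data.

```shell
# Toy demonstration; all paths below are hypothetical.
tmp=$(mktemp -d)
mkdir "$tmp/data" "$tmp/work"
printf '>P1_TOY\nMKV\n' > "$tmp/data/toy.fa"

cd "$tmp/work"
# create a link named toy.fa in the current folder, pointing to the original
ln -s "$tmp/data/toy.fa" .

# ls -l shows the link together with its target
ls -l toy.fa
# readlink prints only the target path
target=$(readlink toy.fa)
# reading through the link gives the content of the original file
first_line=$(head -n 1 toy.fa)
echo "$first_line"
```

The link takes almost no disk space, and removing it does not remove the original file.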
When you are done, you can submit all required files as follows (substitute your username):
cp -ipv protocol.txt human.txt pairs.txt similar.tsv best.txt function.txt passwords.csv /submit/bash/your_username
# check what was submitted
ls -l /submit/bash/your_username
Introduction to tasks A-C
- In these tasks we will again process bioinformatics data. We have two files of sequences in the FASTA format. This time the sequences represent proteins, not DNA, and therefore they use 20 different letters representing different amino acids. Lines starting with ‘>’ contain the identifier of a protein and potentially an additional description. This is followed by the sequence of this protein, which will not be needed in this task. This data comes from the Uniprot database.
- File /tasks/bash/dog.fa is a FASTA file containing about 10% of randomly selected dog proteins. Each protein is identified in the FASTA file only by its ID such as A0A8I3MJS8_CANLF.
- File /tasks/bash/human.fa is a FASTA file with all human proteins. Each ID is followed by a description of the biological function of the protein.
- These two sets of proteins were compared by the bioinformatics tool called BLAST, which finds proteins with similar sequences. The results of BLAST are in file /tasks/bash/matches.tsv. This file contains a section for each dog protein. This section starts with several comments, i.e. lines starting with the # symbol. This is followed by a table with the found matches in the TSV format, i.e., several values delimited by tab characters \t. We will be interested in the first two columns representing the IDs of the dog and human proteins, respectively.
Task A (counting proteins)
Steps (1) and (2)
- Use files human.fa and dog.fa to find out how many proteins are in each. Each protein starts with a line starting with the > symbol, so it is sufficient to count those.
- Beware that the > symbol means redirect in bash. Therefore you have to enclose it in single quotation marks '>' so that it is taken literally.
- For each file write a single command or a pipeline of several commands that will produce the number with the answer. Write the commands and the resulting protein counts to the appropriate sections of your protocol.
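The quoting issue can be tried out on a toy FASTA file. The sketch below is one possible way to count header lines; the file toy.fa and its protein IDs are invented for this example.

```shell
# Toy FASTA file with three made-up proteins.
tmp=$(mktemp -d)
printf '>P1_TOY first protein\nMKVL\nAAGT\n>P2_TOY second protein\nMSS\n>P3_TOY\nMA\n' > "$tmp/toy.fa"

# The pattern is in single quotes, so bash does not treat > as a redirect.
# ^ anchors the match to the start of the line; -c prints the number of matching lines.
count=$(grep -c '^>' "$tmp/toy.fa")
echo "$count"   # prints 3
```

An equivalent pipeline would be grep '^>' followed by wc -l. Without the quotes, bash would create a file whose name is the first word after the > symbol instead of searching for it.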
Step (3)
- Create file human.txt which contains sequence IDs and descriptions extracted from human.fa. This file will be used in Task C.
- The leading > should be removed. The OS= field and any text after it should also be removed from the description.
- This file should be sorted alphabetically.
- The file should start as follows:
1433B_HUMAN 14-3-3 protein beta/alpha
1433E_HUMAN 14-3-3 protein epsilon
1433F_HUMAN 14-3-3 protein eta
1433G_HUMAN 14-3-3 protein gamma
1433S_HUMAN 14-3-3 protein sigma
- Submit file human.txt and write your commands to the protocol.
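The clean-up described in this step could be sketched with grep, sed and sort as follows. The header layout in the toy file (ID, description, then OS= and further fields) is an assumption based on the example output above, and setting LC_ALL=C is a choice made here only to get a reproducible sort order.

```shell
# Toy FASTA file; IDs and descriptions are made up, header layout is assumed.
tmp=$(mktemp -d)
printf '>B_TOY Beta protein OS=Toy species OX=1\nMKV\n>A_TOY Alpha protein OS=Toy species OX=1\nMSS\n' > "$tmp/toy.fa"

# 1. keep only header lines
# 2. delete the leading >
# 3. delete " OS=" and everything after it
# 4. sort alphabetically
grep '^>' "$tmp/toy.fa" \
  | sed -e 's/^>//' -e 's/ OS=.*$//' \
  | LC_ALL=C sort > "$tmp/toy.txt"

cat "$tmp/toy.txt"
# A_TOY Alpha protein
# B_TOY Beta protein
```

The same substitutions can also be written as a Perl one-liner with perl -pe.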
Task B (counting matches)
Step (1)
- From file matches.tsv extract pairs of similar proteins and store them in file pairs.txt.
- Each line of the file should contain a pair of protein IDs extracted from the first two columns of the matches.tsv file.
- These IDs should be separated by a single space and the file should be sorted alphabetically.
- Do not forget to omit lines with comments.
- Each pair from the input should be listed only once in the output.
- Commands grep, sort and uniq would be helpful. To select only some columns, you can use commands cut, awk or a Perl one-liner.
- The file pairs.txt should have 18939 lines (verify using command wc) and it should start as follows:
A0A1Y6D565_CANLF P_HUMAN
A0A1Y6D565_CANLF S13A4_HUMAN
A0A222YTD8_CANLF S39A4_HUMAN
- Submit file pairs.txt and write your commands to the protocol.
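The combination of grep, cut, sort and uniq suggested above might look like the following sketch. The toy file imitates the assumed structure of matches.tsv (comment lines and tab-separated rows) with invented IDs; sort -u is used here as a shorthand for sort followed by uniq.

```shell
# Toy BLAST-like table; all IDs and numbers are made up.
tmp=$(mktemp -d)
{
  printf '# toy comment line\n'
  printf 'D2_TOY\tH9_TOY\t55.0\n'
  printf 'D2_TOY\tH9_TOY\t40.0\n'
  printf '# toy comment line\n'
  printf 'D1_TOY\tH3_TOY\t91.5\n'
} > "$tmp/toy.tsv"

# drop comments, keep columns 1 and 2, turn the tab into a space,
# then sort and remove duplicate lines
grep -v '^#' "$tmp/toy.tsv" \
  | cut -f 1,2 \
  | tr '\t' ' ' \
  | LC_ALL=C sort -u > "$tmp/toy_pairs.txt"

cat "$tmp/toy_pairs.txt"
# D1_TOY H3_TOY
# D2_TOY H9_TOY
```

The pair D2_TOY H9_TOY occurs twice in the toy input but only once in the output.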
Step (2)
- Find out how many dog proteins have at least one similarity found in matches.tsv. This can be done by counting distinct values in the first column of your pairs.txt file from step (1).
- We suggest commands cut/awk/perl, sort, uniq, wc.
- The result of your commands should be an output consisting of a single number (and the end-of-line character).
- Write your answer and commands to the protocol. What percentage is this number out of all dog proteins found in Task A(2)? (You may compute the percentage manually with a calculator.)
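Counting distinct values in one column can be sketched on a toy file as follows. The file content and the total of 8 proteins used for the percentage are invented numbers for illustration only.

```shell
# Toy file in the format of pairs.txt; IDs are made up.
tmp=$(mktemp -d)
printf 'D1_TOY H3_TOY\nD2_TOY H8_TOY\nD2_TOY H9_TOY\n' > "$tmp/toy_pairs.txt"

# take the first space-separated column, collapse duplicates, count lines
# (tr removes the padding that some versions of wc put before the number)
distinct=$(cut -d ' ' -f 1 "$tmp/toy_pairs.txt" | sort | uniq | wc -l | tr -d ' ')
echo "$distinct"   # prints 2

# percentage out of a hypothetical total of 8 proteins
percent=$(awk -v n="$distinct" -v total=8 'BEGIN { printf "%.1f\n", 100 * n / total }')
echo "$percent"    # prints 25.0
```

Note that uniq only merges adjacent duplicates, which is why the sort before it matters.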
Step (3)
- The third column of matches.tsv (the first numerical column after the two protein IDs) contains a number between 0 and 100. This is the percentage of amino acids that are the same between the two proteins.
- Create a file which contains only those lines of matches.tsv that have this number at least 90, i.e. where at most 10% of amino acids mutated to something else. Also omit comment lines. Leave the order of lines as in the original file.
- Write the result to similar.tsv.
- The file similar.tsv should have 1056 lines and it should start as follows:
A0A8I3MMG4_CANLF FKBP3_HUMAN 97.321 224 6 0 1 224 1 224 4.83e-163 448
A0A8I3NRH9_CANLF KIF15_HUMAN 94.045 403 24 0 7 409 7 409 0.0 818
A0A8I3P3E9_CANLF TRAK1_HUMAN 97.065 954 27 1 1 954 1 953 0.0 1900
- Submit file similar.tsv and write your commands to the protocol.
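A numeric filter on one column is a natural fit for awk. The sketch below shows one way to do it on a toy table with invented values; a Perl one-liner with the -a option would work similarly.

```shell
# Toy table; IDs and percentages are made up.
tmp=$(mktemp -d)
printf '# toy comment\nD1\tH1\t97.321\nD1\tH2\t89.999\n# toy comment\nD2\tH3\t90.000\n' > "$tmp/toy.tsv"

# -F '\t' splits lines on tabs; a line is printed when it is not a comment
# and its third column is numerically at least 90
awk -F '\t' '!/^#/ && $3 >= 90' "$tmp/toy.tsv" > "$tmp/toy_similar.tsv"

cat "$tmp/toy_similar.tsv"
```

In this toy input the rows with H1 (97.321) and H3 (90.000) are kept, while the row with H2 (89.999) and both comment lines are dropped. The order of the remaining lines is unchanged because awk processes the input line by line.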
Task C (joining information)
Step (1)
- For each dog protein, the first (top) match in matches.tsv represents the strongest similarity.
- In this step, we want to extract such strongest match for each dog protein which has at least one match.
- The result should be a file best.txt listing the two IDs separated by a space. The file should be sorted by the second column (human protein ID).
- The file should start as follows:
A0A8I3MVQ9_CANLF 1433E_HUMAN
A0A8I3MT06_CANLF 1433G_HUMAN
A0A8I3QRX3_CANLF 1433T_HUMAN
- This task can be done by printing the lines that are not comments but follow a comment line starting with #.
- In a Perl one-liner, you can create a state variable which will remember if the previous line was a comment and based on that you decide if you print the current line.
- Instead of using Perl, you can play with grep. Option -A 1 prints the matching lines as well as one line after each match and separates blocks of consecutive lines by --.
- Submit file best.txt with the result and write your commands to the protocol.
Step (2)
- Now we want to extend file best.txt with a description of each human protein.
- Since similar proteins often have similar functions, this will allow somebody studying dog proteins to learn something about their possible functions based on similarity to well-studied human proteins.
- To achieve this, we join together file best.txt from step 1 and human.txt created in Task A(3). Conveniently, they are both sorted by the ID of the human protein.
- Use command join to join these files.
- Use option -1 2 to use the second column of best.txt as a key for joining.
- The output of join may start as follows:
1433E_HUMAN A0A8I3MVQ9_CANLF 14-3-3 protein epsilon
1433G_HUMAN A0A8I3MT06_CANLF 14-3-3 protein gamma
1433T_HUMAN A0A8I3QRX3_CANLF 14-3-3 protein theta
- Further reformat the output so that the dog ID goes first (e.g. A0A8I3MVQ9_CANLF), followed by human protein ID (e.g. 1433E_HUMAN), followed by the rest of the text.
- Sort by dog protein ID, store in file function.txt.
- The output should start as follows:
A0A1Y6D565_CANLF P_HUMAN P protein
A0A222YTD8_CANLF S39AA_HUMAN Zinc transporter ZIP10
A0A5F4CW23_CANLF ELA_HUMAN Apelin receptor early endogenous ligand
- Files best.txt and function.txt should have the same number of lines.
- Which human protein is the best match for the dog protein A0A8I3RVG4_CANLF and what is its function?
- Submit file function.txt. Write your commands and the answer to the question above to your protocol.
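The join and the column swap can be sketched on two toy files as follows. All IDs and descriptions are invented; LC_ALL=C is set here so that sort and join agree on the ordering, which is an assumption about the environment rather than a requirement stated in the task.

```shell
export LC_ALL=C
tmp=$(mktemp -d)
# toy version of best.txt, sorted by the second column
printf 'D2_TOY H2_TOY\nD1_TOY H5_TOY\n' > "$tmp/toy_best.txt"
# toy version of human.txt, sorted by the first column
printf 'H2_TOY Toy kinase\nH5_TOY Toy zinc transporter\nH9_TOY Unmatched toy protein\n' > "$tmp/toy_human.txt"

# -1 2 uses column 2 of the first file as the key (column 1 of the second
# file is the default); the key is printed first, then the remaining columns.
# awk then swaps the first two columns and sort orders lines by dog ID.
join -1 2 "$tmp/toy_best.txt" "$tmp/toy_human.txt" \
  | awk '{ t = $1; $1 = $2; $2 = t; print }' \
  | sort > "$tmp/toy_function.txt"

cat "$tmp/toy_function.txt"
# D1_TOY H5_TOY Toy zinc transporter
# D2_TOY H2_TOY Toy kinase
```

The line for H9_TOY does not appear because join by default prints only keys present in both files.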
Task D (passwords)
- The file /tasks/bash/names.txt contains data about several people, one per line.
- Each line consists of given name(s), surname and email separated by spaces.
- Each person can have multiple given names (at least 1), but exactly one surname and one email. Email is always of the form username@uniba.sk.
- The task is to generate file passwords.csv which contains a randomly generated password for each of these users.
- The output file should have columns separated by commas ‘,’.
- The first column should contain username extracted from email address, the second column surname, the third column all given names and the fourth column the randomly generated password.
- Submit file passwords.csv with the result of your commands. Write your commands to the protocol.
Example line from input:
Pavol Orszagh Hviezdoslav hviezdoslav32@uniba.sk
Example line from output (password will differ):
hviezdoslav32,Hviezdoslav,Pavol Orszagh,3T3Pu3un
Hints:
- Passwords can be generated using pwgen (e.g. pwgen -N 10 -1 prints 10 passwords, one per line).
- We also recommend using perl, wc, paste (check option -d in paste) and backtick operator ``.
- In Perl, function pop may be useful for manipulating @F and function join for connecting strings with a separator.