HWpar
See also the lecture
For both tasks, submit your source code and the result obtained by running it on
the whole dataset (/tasks/par/dataset). The code is expected to use the
Apache Beam framework presented in the lecture.
Ideally, run the code locally on your computer; you may also try a
cloud provider if you want.
- Submit directory is /submit/par/
- Outline of the protocol is in /tasks/par/protocol.txt
- Submit files task_a.py and task_b.py with your programs, task_a.out with your output for task A
- Write the output for task B to the protocol.
- Also include a short description of your algorithm for both tasks in the protocol.
Task A
Count the number of occurrences of each 4-mer in the provided data.
The provided data are in Fastq format. They contain reads from genomic sequencing. By a read we mean a fragment of DNA. In Fastq format, each read takes up 4 lines. The first line starts with @ and contains the read name. The second line contains the actual read (this is the important part for you). The third line contains + and the read name again. The fourth line contains a quality score for each base (you should ignore this).
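The four-line layout described above can be illustrated with a minimal sketch in plain Python. The function name `read_sequences` and the sample records are made up for illustration. The sketch relies on line order, which holds when reading a file sequentially; an Apache Beam text source hands out lines as an unordered collection, so the actual solution needs a different way to pick out the read lines. Note that filtering on the first character alone is unreliable, because a quality line may also start with @.

```python
from typing import Iterable, Iterator


def read_sequences(lines: Iterable[str]) -> Iterator[str]:
    """Yield the read (2nd line of every 4-line record) from Fastq lines.

    Assumes the input is a well-formed, sequentially ordered Fastq file.
    """
    for index, line in enumerate(lines):
        if index % 4 == 1:  # lines 1, 5, 9, ... (0-based) hold the bases
            yield line.strip()


# Hypothetical two-record input, not taken from the real dataset.
example = [
    "@read1", "ACGGCTA", "+read1", "IIIIIII",
    "@read2", "TTGACCA", "+read2", "IIIIIII",
]
sequences = list(read_sequences(example))
print(sequences)
```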
By a k-mer we mean any substring of k consecutive bases in a read. For example, in a read “ACGGCTA” the 4-mers are: “ACGG”, “CGGC”, “GGCT”, “GCTA”.
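As a small illustration of the definition, the following plain-Python sketch lists the k-mers of a single read and counts them. The helper name `kmers` is hypothetical; in a Beam pipeline the same function could serve as the body of a FlatMap step, with the counting done by a combiner rather than by `Counter`.

```python
from collections import Counter


def kmers(read: str, k: int) -> list:
    """Return all substrings of length k, in left-to-right order."""
    # A read shorter than k has no k-mers: range() is empty in that case.
    return [read[i:i + k] for i in range(len(read) - k + 1)]


four_mers = kmers("ACGGCTA", 4)
counts = Counter(four_mers)
print(four_mers)
```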
Task B
Count the number of pairs of reads that overlap in exactly 30 bases (the end of one read overlaps the beginning of the second read). For bioinformaticians: You can ignore the reverse complement.
One more clarification: If you have two reads (and say we are counting 4-base overlaps): AAAAxxxxxCCCC and CCCCxxxxxAAAA, this counts as two overlaps.
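To make the counting rule concrete, here is a brute-force reference sketch in plain Python. It compares every ordered pair of reads, so it is only usable for checking results on tiny inputs, not on the whole dataset. The function name `count_overlaps_naive` is made up, and the sketch assumes that a read is never paired with itself and that an overlap means the last k bases of one read equal the first k bases of the other.

```python
def count_overlaps_naive(reads, k):
    """Count ordered pairs (a, b) of different reads where the last k
    bases of a equal the first k bases of b. Quadratic in len(reads)."""
    total = 0
    for i, a in enumerate(reads):
        for j, b in enumerate(reads):
            if i == j:
                continue  # assumption: a read does not overlap itself
            if len(a) >= k and len(b) >= k and a[-k:] == b[:k]:
                total += 1
    return total


# The example from the clarification above, with 4-base overlaps.
example_count = count_overlaps_naive(["AAAAxxxxxCCCC", "CCCCxxxxxAAAA"], 4)
print(example_count)
```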
Hints:
- Try counting pairs for each 30-mer first.
- You can yield something structured from a Map/ParDo operation (e.g. a tuple).
- You can have another Map/ParDo after CombinePerKey.
- Run code on small data (subset of all files or even one file or even your small custom data) to quickly iterate and test :)
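One way to read the first three hints is sketched below in plain Python, without Beam, so that the logic can be checked on small custom data. Each read emits a structured value keyed by its first and last k bases, the values are summed per key, and a final step turns each per-key summary into a number of pairs. All names here are hypothetical, and the handling of a read whose own prefix equals its own suffix (not counted as a pair) is an assumption, not something the task statement specifies. In a Beam pipeline the three stages would roughly correspond to a FlatMap, a CombinePerKey, and a further Map.

```python
from collections import defaultdict


def emit_keyed(read, k):
    """'Map' stage: yield (k-mer, (is_prefix, is_suffix, is_both)) tuples."""
    if len(read) < k:
        return
    prefix, suffix = read[:k], read[-k:]
    if prefix == suffix:
        yield prefix, (1, 1, 1)  # same read contributes to both sides
    else:
        yield prefix, (1, 0, 0)
        yield suffix, (0, 1, 0)


def count_overlaps_by_key(reads, k):
    # 'Combine per key' stage: element-wise sum of the emitted tuples.
    sums = defaultdict(lambda: [0, 0, 0])
    for read in reads:
        for key, value in emit_keyed(read, k):
            for position, amount in enumerate(value):
                sums[key][position] += amount
    # Final 'Map' stage: prefixes * suffixes pairs per k-mer, minus the
    # pairs that would match a read with itself.
    return sum(p * s - both for p, s, both in sums.values())


result = count_overlaps_by_key(["AAAAxxxxxCCCC", "CCCCxxxxxAAAA"], 4)
print(result)
```

For the real task k would be 30; the small k above only keeps the example readable.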