· Grades from marked homeworks are on the server in file /grades/userid.txt
· We offer a chance to write a remedial midterm with question types similar to last time. It will take place on May 14 during the lecture. You must register one week in advance. If you register for the second midterm, you lose your points from the first midterm. We will not lower the maximum number of points on the second midterm, as was done on the first.
· Dates of project submission and oral exams:
Early: submit the project by Tuesday May 19, 22:00; oral exams are on Thursday May 21 during class (for students who need or want to finish exams early). Register for the early exam by Monday May 18, 22:00 in AIS.
Regular: project deadline June 9, 22:00 and oral exams on June 15.
Remedial exams will take place in the last weeks of the exam period. Beware, there will not be much time to prepare a better project.
· Rust homework is due May 21, 9:00am.
· Projects should be submitted as homeworks to /submit/project.

HWpar

Task A
Task B

See also the lecture

For both tasks, submit your source code and the result obtained by running it on the whole dataset (/tasks/par/dataset). The code is expected to use the Apache Beam framework presented in the lecture. Ideally, run the code locally on your computer; you may also try a cloud provider if you want.

Submit directory is /submit/par/
Outline of the protocol is in /tasks/par/protocol.txt
Submit files task_a.py and task_b.py with your programs, task_a.out with your output for task A
Write the output for task B to the protocol.
Also include in the protocol a short description of your algorithm for both tasks.

Task A

Count the number of occurrences of each 4-mer in the provided data.

The provided data are in FASTQ format. They contain reads from some genomic sequencing. By a read we mean a part of some DNA. In FASTQ format each read takes 4 lines. The first line starts with @ and contains the read name. The second line contains the actual read (this is the important part for you). The third line contains + and the read name again. The fourth line contains a quality score for each base (you should ignore this).
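The record layout described above can be sketched in plain Python as follows. This is only an illustration, not part of the required Beam solution: the function name read_sequences and the two example records are made up, and the sketch assumes every record occupies exactly four lines, as stated above.

```python
from typing import Iterable, Iterator


def read_sequences(lines: Iterable[str]) -> Iterator[str]:
    """Yield the sequence line (2nd of every 4 lines) of each FASTQ record."""
    for index, line in enumerate(lines):
        # Lines 0, 4, 8, ... are names; lines 1, 5, 9, ... are the reads.
        if index % 4 == 1:
            yield line.strip()


# Two made-up records, used only for illustration.
example = [
    "@read1",
    "ACGGCTA",
    "+read1",
    "IIIIIII",
    "@read2",
    "TTGACA",
    "+read2",
    "IIIIII",
]

sequences = list(read_sequences(example))
print(sequences)  # ['ACGGCTA', 'TTGACA']
```

The sketch selects lines by position rather than by their first character, because a quality line may itself begin with @.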

By a k-mer we mean any substring of k consecutive bases in a read. For example, in a read “ACGGCTA” the 4-mers are: “ACGG”, “CGGC”, “GGCT”, “GCTA”.
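A minimal sketch of this definition in plain Python (without Beam), reproducing the example above. The helper name kmers and the second read "CGGCTAC" are made up for illustration.

```python
from collections import Counter


def kmers(read: str, k: int) -> list:
    """Return all substrings of length k, in order of their start position."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


print(kmers("ACGGCTA", 4))  # ['ACGG', 'CGGC', 'GGCT', 'GCTA']

# Counting occurrences over several reads (the second read is invented).
counts = Counter()
for read in ["ACGGCTA", "CGGCTAC"]:
    counts.update(kmers(read, 4))

print(counts["GGCT"])  # 2
```

A read shorter than k yields an empty list, so short reads need no special handling.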

Task B

Count the number of pairs of reads that overlap in exactly 30 bases (the end of one read overlaps the beginning of the second read). For bioinformaticians: You can ignore the reverse complement.

One more clarification: If you have two reads (and say we are counting 4-base overlaps) AAAAxxxxxCCCC and CCCCxxxxxAAAA, this counts as two overlaps.
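The clarification can be checked with a brute-force sketch in plain Python. It is meant only for tiny test inputs, since it compares every ordered pair of reads. The function name count_overlaps is made up, and the sketch assumes that a read is never paired with itself and that "overlap in k bases" means the last k bases of one read equal the first k bases of the other.

```python
def count_overlaps(reads: list, k: int) -> int:
    """Count ordered pairs (first, second) where the last k bases of
    `first` equal the first k bases of `second`."""
    count = 0
    for i, first in enumerate(reads):
        for j, second in enumerate(reads):
            if i == j:
                continue  # assumption: a read does not overlap itself
            if len(first) >= k and len(second) >= k and first[-k:] == second[:k]:
                count += 1
    return count


reads = ["AAAAxxxxxCCCC", "CCCCxxxxxAAAA"]
print(count_overlaps(reads, 4))  # 2
```

Both orderings of the pair match here, which is why the example counts as two overlaps.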

Hints:

Try counting pairs for each 30-mer first.
You can yield something structured (e.g. a tuple) from a Map/ParDo operation.
You can have another Map/ParDo after CombinePerKey.
Run code on small data (subset of all files or even one file or even your small custom data) to quickly iterate and test :)
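The first hint can be illustrated in plain Python on small custom data, which is also handy for checking a Beam pipeline against a known answer. This is a sketch under assumptions, not the expected solution: the function name count_overlaps_by_kmer is made up, a read is assumed not to pair with itself, and identical reads are counted as distinct reads. In Beam, the per-k-mer grouping done here with a dictionary would presumably correspond to keying elements by k-mer before CombinePerKey.

```python
from collections import defaultdict


def count_overlaps_by_kmer(reads: list, k: int) -> int:
    """Count overlapping ordered pairs by first tallying, for each k-mer,
    how many reads start with it and how many reads end with it."""
    # per_kmer[kmer] = [reads starting with kmer, reads ending with kmer]
    per_kmer = defaultdict(lambda: [0, 0])
    self_overlaps = 0
    for read in reads:
        if len(read) < k:
            continue  # too short to take part in a k-base overlap
        prefix, suffix = read[:k], read[-k:]
        per_kmer[prefix][0] += 1
        per_kmer[suffix][1] += 1
        if prefix == suffix:
            self_overlaps += 1  # would otherwise be paired with itself
    # Every read ending with a k-mer pairs with every read starting with it.
    total = sum(starts * ends for starts, ends in per_kmer.values())
    return total - self_overlaps


print(count_overlaps_by_kmer(["AAAAxxxxxCCCC", "CCCCxxxxxAAAA"], 4))  # 2
```

Only the first and last k bases of each read matter, so the work grows with the number of reads rather than with the number of read pairs.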