2022 Functional Genomics Lab Notebook
2022
2022-01 January
2022-01-21 Friday
Setting up new server
A few issues: there's no power cable for the SSD, and I don't have internet access on the server yet. There's also no deb for megacli to download, so I had to use alien to create one. But alien isn't packaged for NixOS, so I built a Docker image, ran alien in there, copied the resulting deb to a USB drive, and moved it to the server.
Basic overview: https://gist.github.com/fxkraus/595ab82e07cd6f8e057d31bc0bc5e779
Thought I was going to have to order a power cable for the loose SSD, but I found one on the power supply.
2022-02 February
2022-02-19 Saturday Goals and Planning
Accomplishments
2021
nextflow groseq rewrite and extension
First biotech internship
Got a pre-print published
Mentored 3 undergrads
Helped write and teach first course
TODO Previous 5 years
Passed core courses
Got heavily involved in nf-core
Research Goals
2022
Finish QE
Publish pipeline paper
Finish infection GRO-Seq paper
Start on GRO-seq mining
Scale Suhana's Protein analysis
Help Alyssa with WGS viral analysis
TODO Next 3 years
?
Figure out direction
Get some data sequenced by Element (a rich dataset like the GRO-seq)
Professional & Personal Goals
2022
Learn Julia
Become an nf-core core member
Iterate on Applied Genomics
Look at GitHub Classroom(to automate feedback)
Pick datasets
Rework second half to expand projects
Next 3 years
Come up with a reproducible "data science" workflow (nextflow-level ease of use)
Can an undergrad reproduce it in a week?
Can an undergrad pick it up in a semester?
More "flexible" than nextflow
Graduate
Explore industry jobs(I'm really loving the "Genomics Applications" team)
2022-02-21 Monday
Goals and Planning Meeting
GRO-cap might be another type of dataset to look at.
Question to answer is what happens to the cell?
Three parts to my project
Resource Generation (analysis DB)
Toolkit (nf-core pipeline)
Dynamics of eRNAs
How do they shutoff?
Negative feedback loop, product knocks down the TF
Random thoughts
Peng looked for things going up, and IFNB was going down
Only found chromatin genes
Reinfection data showed that IMR had a "memory" and didn't have a huge viral response, whereas GM didn't and had the exact same viral response
scATAC-seq would be a good dataset of this to have to mirror the timecourse
HiC does a first pass of sequencing to check for quality at a low sequencing depth, and then sequences the libraries with good quality at high-depth
The lack of an exon junction causes the eRNAs to be degraded. The exon junction is usually a QC check in the nucleus to make sure bad RNAs don't get exported. The eRNAs are hijacking that system to get broken down.
Applied Genomics
3 different "modules", that are 3 weeks each.
Different Groups can run the pipelines with different params(aligners, reference genome, GTF) to demonstrate the pros and cons of different methods, reproducibility issues, and just how big of an effect it can have.
Cover QC concepts, basic alignment, and gene counts
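One way the per-group parameter assignment could be sketched, assuming each group gets one distinct combination. The option names below are placeholders, since the notes only list the parameter types (aligner, reference genome, GTF), and `assign_runs` is a made-up helper:

```python
from itertools import product

# Hypothetical option lists: placeholders, not actual course choices.
aligners = ["aligner_a", "aligner_b"]
genomes = ["genome_x", "genome_y"]
gtfs = ["annotation_1", "annotation_2"]


def assign_runs(groups, aligners, genomes, gtfs):
    """Give each group one distinct (aligner, genome, gtf) combination."""
    combos = list(product(aligners, genomes, gtfs))
    if len(groups) > len(combos):
        raise ValueError("more groups than parameter combinations")
    return {
        group: {"aligner": a, "genome": g, "gtf": t}
        for group, (a, g, t) in zip(groups, combos)
    }


runs = assign_runs(["group1", "group2", "group3"], aligners, genomes, gtfs)
for group, params in runs.items():
    print(group, params)
```

Groups could then compare results across combinations to see how much each choice moves the gene counts.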
2022-02-26 Saturday
Setting up raid controller and Julia
The issue with not being able to use megacli was that I wasn't running it with sudo
sudo ln -s /opt/MegaRAID/MegaCli/MegaCli64 /usr/local/bin/megacli
sudo megacli PDList -aALL
To use the drives connected to the 2108/2208 controller, a RAID must be created. If using drives like in JBOD, each single drive must be created as a RAID 0 individually
Once again confirmed, going to need to make each disk a RAID 0
Installed storcli
wget https://downloadmirror.intel.com/685225/StorCLI_007.1704.0000.0000.zip
unzip StorCLI_007.1704.0000.0000.zip
sudo dpkg -i StorCLI_007.1704.0000.0000/Ubuntu/storcli_007.1704.0000.0000_all.deb
sudo ln -s /opt/MegaRAID/storcli/storcli64 /usr/local/sbin/storcli
Used storcli to setup disks
MegaRAID StorCLI — ASERGO Knowledge Base
# sudo -i
storcli /c0 set jbod=on
# Controller does not support JBOD
# :(
storcli /call/dall show all
storcli /c0 add vd r0 drives=245:0
for i in {4..12}; do
storcli /c0 add vd r0 drives=245:$i
done
storcli /call/vall show
ZFS pool creation
Going for mirrored drives and then ssd caching
sudo zpool create tank mirror /dev/sdd /dev/sde
Not sure if they should be Cache devices or SLOGs?
Porque no los dos?
Going to have a mirrored SLOG (to prevent data loss) and just throw both of the others at the L2ARC cache, since it's just for reference and not for actual data. Might use them as a second SLOG
sudo zpool add -f tank log mirror /dev/sdj /dev/sdk
sudo zpool add -f tank cache /dev/sdl /dev/sdm
Made a scratch dataset
sudo zfs create tank/scratch
sudo zfs set mountpoint=/scratch tank/scratch
sudo chown -R emiller:users /scratch/
2022-03 March
2022-03-07 Monday
Figure out nature of Transcripts
With Tae Hoon Kim
Talked about PINTS
Build a model for nascent RNAs to pull out the important pieces of eRNAs
We need to get the actual eRNA transcripts, the TREs they provide are just identified enhancers.
Finding the ends of the eRNAs (We know the TSS)
Two ways, the "pausing" at the end, or the Poly A tail of RNA-seq
What the workflow will look like
Pull in TREs
Find 5' and 3' from GRO-seq
Intersect the BAM and TRE; the TRE is possibly going to be too long, with the eRNA stopping before the end of the enhancer
Interesting eRNAs will be:
Highly Transcribed
Long
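A rough sketch of the intersect step, assuming reads and TREs are already reduced to plain coordinates on one chromosome. It ignores strand and coverage gaps, and `Interval` and `erna_extent` are illustrative names rather than anything from PINTS or the pipeline:

```python
from dataclasses import dataclass


@dataclass
class Interval:
    start: int  # 0-based, inclusive
    end: int    # 0-based, exclusive


def erna_extent(tre, reads):
    """Span covered by reads overlapping the TRE, clipped to the TRE.

    Returns (Interval, read_count), or None when no read overlaps.
    Simplification: strand and gaps between reads are ignored.
    """
    overlapping = [r for r in reads if r.start < tre.end and r.end > tre.start]
    if not overlapping:
        return None
    start = max(tre.start, min(r.start for r in overlapping))
    end = min(tre.end, max(r.end for r in overlapping))
    return Interval(start, end), len(overlapping)


tre = Interval(100, 500)
reads = [Interval(120, 170), Interval(150, 200), Interval(300, 350), Interval(600, 650)]
extent, n_reads = erna_extent(tre, reads)
print(extent, n_reads, extent.end - extent.start)
```

The read count and the extent length map onto the "highly transcribed" and "long" criteria, so candidates could be sorted on those two values.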
Useful LSI MegaCli64 Commands – Help Center
2022-03-20 Sunday
Ran groseq transcripts against TREs
There are still a ton of reads, so yay!
2022-04 April
2022-04-04 Monday
Meeting with Tae
Updates
Possible collaboration as an nf-core pipeline for viral integration in human NGS data
Playing around with TRE
Planning on starting Applied Genomics next week
Notes
Possible collaboration with his friend from Yale looking at nascent data in cancers. Looking specifically at Hox gene expression (The Role of HOX Transcription Factors in Cancer Predisposition and Progressio...) and the nascent transcripts around those genes, for cancers that have lost their positional inactivity.
Write about PINTS and why it's better
2022-04-10 Sunday
Funnel Plot
import plotly.express as px
import pandas as pd
stages = [
"All unannotated transcripts",
"Transcripts that overlap with H3K4me1 and H3K27ac histone modifications",
"Transcripts that satisfy cutoff for amplitude index and continuity index",
]
# One DataFrame per cell line; both share the same filtering stages
df_gm = pd.DataFrame(dict(number=[127249, 21966, 10695], stage=stages))
df_gm["Cell"] = "GM"
df_imr = pd.DataFrame(dict(number=[88540, 34543, 14775], stage=stages))
df_imr["Cell"] = "IMR"
df = pd.concat([df_gm, df_imr], axis=0)
fig = px.funnel(df, x="number", y="stage", color="Cell")
fig.show()
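The funnel counts can also be read as the fraction of unannotated transcripts surviving each filter. A minimal sketch using the same numbers; `retention` is just an illustrative helper name:

```python
# Same counts as the funnel plot, keyed by cell line.
counts = {
    "GM": [127249, 21966, 10695],
    "IMR": [88540, 34543, 14775],
}


def retention(numbers):
    """Fraction of the first-stage count remaining at each stage."""
    first = numbers[0]
    return [n / first for n in numbers]


for cell, numbers in counts.items():
    fractions = retention(numbers)
    print(cell, [f"{f:.1%}" for f in fractions])
```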
2022-04-18 Monday
Meeting with Tae
Feedback on Current Research Presentations
Good on the less technical bit on presentation
How aligners might perform
I should look into other aligner benchmarking
Look into STAR aligner paper on using an individualized genome to improve performance
The issue is repeats
Ancient vs recent repeats
Recent are more difficult because they haven't picked up their own unique small differences
150bp reads are an issue because there are too many repeats
ML model to fill in the transcripts
There might be transcripts on either side of a repeat; take those multi-mapped regions and see if there was a transcript in the middle
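Before any ML, a simple heuristic could shortlist repeats with a transcript on both sides. This is only a candidate finder, not the model itself. It assumes half-open `(start, end)` tuples on a single chromosome and strand, and `flanked_repeats` and `max_gap` are hypothetical names:

```python
def flanked_repeats(repeats, transcripts, max_gap=0):
    """Return repeats with one transcript ending within max_gap bases
    upstream of the repeat and another starting within max_gap bases
    downstream of it."""
    hits = []
    for rep_start, rep_end in repeats:
        left = any(0 <= rep_start - t_end <= max_gap for _, t_end in transcripts)
        right = any(0 <= t_start - rep_end <= max_gap for t_start, _ in transcripts)
        if left and right:
            hits.append((rep_start, rep_end))
    return hits


# Toy coordinates for illustration only.
repeats = [(1000, 1300), (5000, 5300)]
transcripts = [(700, 1000), (1300, 1600), (4500, 4990)]
candidates = flanked_repeats(repeats, transcripts, max_gap=10)
print(candidates)
```

Repeats that pass would be the regions where multi-mapped reads are worth a closer look.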
PINTS subworkflow/workflow of nascent?
We take RNA-Seq best practices for granted
RNA polymerase (RNAP) is intimately linked to regulation
What happens to the RNAs that get spliced?
Applied Genomics
3 Encapsulated modules
Hybrid?