Biography

I am currently working at Amazon as a Data Engineer. I have been programming since high school and am proficient in Python, R, and SQL, with data skills in PySpark, Hive, and AWS cloud computing.

Interests

  • Big Data
  • Cloud Computing
  • Software Development
  • Biochemistry

Education

  • PhD in Chemistry, spec. Quantitative Biology, 2019

    Brandeis University

  • BSc in Materials Chemistry, 2012

    Nankai University

  • BSc in Finance, 2012

    Nankai University

Skills

Programming

Python, Django, SQL, Java, Git, Bash, R, MATLAB, Lisp, JavaScript

AWS

IAM, EC2, Redshift, Glue, Athena, EMR, Lambda, S3, RDS, DynamoDB

Distributed computing

Hadoop, Hive, Spark, Kafka, Airflow

Experience

Data Engineer II

Amazon.com Services LLC.

Feb 2020 – Present Seattle, WA
  • Design and build an end-to-end platform that measures product adoption metrics and serves the org's historical and real-time reporting and predictive analytics needs. Cradle, Datanet, EDX
  • Develop a request collection system and write automated data checks to reduce data quality issues. Python, Lambda, Datanet
  • Maintain a web service that hosts 13 data products in a single location, giving customers easy access to data. TypeScript, Lambda, DynamoDB
  • Lead the data migration process and reduce its impact across orgs. SQL, Datanet, Cradle

Data Engineer

Rescale, Inc.

Oct 2019 – Feb 2020 San Francisco, CA
  • Developed ETL pipelines and built a data metrics collection system that cut cross-platform analysis time from 1 day to 10 seconds. Redshift, Glue, Spark
  • Built a bot to ingest, aggregate, and detect error events, and provided developers with a view for error triage. Python
  • Implemented new features for the Rescale platform to explicitly track platform activity. Python, Django, Java
  • Wrote complex SQL queries for multiple teams to aggregate platform usage data for performance monitoring, reporting and decision making. SQL, Athena

Data Engineer Fellow

Insight Data Science

Jan 2019 – Oct 2019 New York, NY
  • Developed an ETL pipeline on the AWS cloud to extract, integrate, and transform prescription data from multiple providers, enabling nationwide queries on prescription drug usage. Python, AWS, Airflow
  • Validated and combined publicly available Medicaid and Medicare datasets with NIH, FDA, and NPPES sources into a SQL-queryable database in Redshift, visualized on a website. Redshift, Tableau, JavaScript, CSS
  • Implemented a custom connector to Redshift/PostgreSQL that is 20 times more efficient. Python
  • Built a real-time pipeline that monitors IoT sensor data and latencies for data center management and handles 10,000+ events per second. Spark Streaming, Kafka

Research Assistant

Brandeis University

Oct 2012 – Dec 2018 Waltham, MA
  • Developed programs to parse large volumes of experimental data for structured analysis and visualization. R/RStudio
  • Built models to analyze metrics and simulate enzyme kinetic mechanisms. MATLAB
  • Developed a Python-based pipeline that extracted 100,000 gene sequences encoding the protein of interest from the 200 million gene sequences in the 114 GB GenBank database via a RESTful API. Python
  • Managed 24 graduate and undergraduate students in a biochemistry research lab and a chemistry teaching lab.
  • Collaborated with 3 multidisciplinary, multicultural teams.

Posts

PostgreSQL server did not shut down correctly

I set up PostgreSQL as the database for Django and it works like a charm. However, today I got an error message when I tried to migrate my …

My first Django app

This is my implementation of the Django tutorial.

Add structure template in Org-mode

The org-mode code blocks can be used for literate programming and creating executable snippets. There is also a quick way to insert …

Integrate Mac version Emacs with Rime input engine

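Because I usually write my blog in Chinese, after moving my blogging into Emacs as well I gradually began to notice the shortcomings of pyim. So today I looked into how to make pyim use Rime's dictionaries.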

Hacking the Data Transformation Interview

I am currently (still) seeking a job in the data/software engineering area, and I am preparing for all kinds of technical interviews, …

Projects

CSV2Bean

• Provide a smooth transition from an accounting app to plain-text accounting tools.

• Convert .csv file exported from Sui accounting app …

Gene Variation Analysis of Stomach Cancer

• Extracted cancer research data from the Cancer Genome Atlas Network ®.

• Applied GISTIC clustering analysis to patient-indexed …

Multithread Web Crawler

• Write a class to handle multithreaded website crawling within a given domain.

• Feature a breadth-first search algorithm and a …

NavPage

• Provide a personalized website collection and navigation page.

• Built with JavaScript, CSS, and npm.

RxMiner

• Developed an ETL pipeline to extract, integrate, and transform prescription data from multiple providers on the AWS cloud to …