Data Engineer

  • Boston, MA, USA
  • Full-time

Job Description

Our Data Engineering team builds and maintains a secure, scalable, flexible, and user-friendly analytics hub that allows us to make informed, data-driven decisions. The team also constructs and curates business-critical datasets that allow us to realize the value of all the data we collect.

A Data Engineer takes a multidisciplinary approach to providing ETL solutions for the business, combining technical, analytical, and domain knowledge. The ideal candidate for this role has strong development skills; experience transforming and profiling data to assess the risks associated with proposed analytics solutions; a willingness to collaborate continually with analysts to determine the optimal approach; and an eagerness to explore data sources to understand the availability, utility, and integrity of our data.

What you'll own:

Data pipeline / ETL development:

  • Building and enhancing data curation pipelines using SQL, Python, Spark, AWS Glue, and other AWS technologies
  • Curating data on top of data lake data to produce trusted datasets for analytics teams

Data Curation:

  • Processing and cleansing data from a variety of sources to transform collected data into an accessible, curated state for Analysts and Data Scientists
  • Migrating self-serve data pipelines to centrally managed ETL pipelines
  • Advanced SQL development and performance tuning
  • Some exposure to Spark, Glue, or other distributed processing frameworks helpful
  • Working with business data stewards and the analytics team to research and identify data quality issues to be resolved in the curation process

Data Modeling:

  • Designing and building master dimensions to support analytic data requirements
  • Replacing legacy data structures with new datasets sourced from streaming data feeds from the core product and other operational systems
  • Designing, building, and supporting pipelines that deliver business-critical datasets
  • Resolving complex data design issues and providing optimal solutions that meet business requirements and benefit system performance

Query Engine Expertise & Performance Tuning:

  • Assisting analytics teams with query tuning efforts
  • Designing curated datasets for performance

Orchestration:

  • Managing job scheduling
  • Mapping and supporting job dependencies
  • Documenting issue resolution procedures

Data Access:

  • Design and management of data access controls mapped to curated datasets

Leveraging DevOps best practices, such as infrastructure as code (IaC) and CI/CD, to build upon a scalable and extensible data environment


Experience you'll need:

  • Strong experience designing and building end-to-end data pipelines
  • Extensive SQL development experience
  • Knowledge of data management fundamentals and data storage principles

Data modeling:

  • Normalization
  • Dimensional/OLAP design and data warehousing
  • Master data management patterns
  • Modeling trade-offs that impact data management and processing/query performance
  • Knowledge of distributed systems as they pertain to data storage, data processing, and querying
  • Extensive experience in ETL and database performance tuning
  • Hands-on experience with a scripting language (Python, bash, etc.)
  • Some experience with Hadoop, Spark, Kafka, Impala, or other big data technologies helpful

Familiarity with the technology stacks available for:

  • Metadata management: Data Governance, Data Quality, MDM, Lineage, Data Catalog etc.
  • Data management, data processing, and curation: Postgres, Hadoop, Hive, Impala, Presto, Spark, Glue, etc.

Experience in data modeling for batch processing and streaming data feeds, and for both structured and unstructured data


Experience in data security / access management, data cataloging and overall data environment management
 

Experience with cloud services such as AWS and APIs helpful

You’d be a great fit if your current track record looks like this:

  • 5+ years of progressive experience in data engineering and data warehousing
  • Experience with a variety of data management platforms (e.g. RDBMS (Postgres), Hadoop (CDH, EMR))
  • Experience with high-performance query engines (Hive, Impala, Presto, Athena, and MPP engines like Redshift)
  • Strong capability to manipulate and analyze complex, high-volume data from a variety of sources
  • Effective communication skills with technical team members as well as business partners, with the ability to distill complex ideas into straightforward language
  • Ability to problem solve independently and prioritize work based on the anticipated business value

Additional Information

All your information will be kept confidential according to EEO guidelines.
