BEGIN:VCALENDAR
VERSION:2.0
PRODID:-//pretalx//nyc2024.pydata.org//X8NXE3
BEGIN:VTIMEZONE
TZID:US/Eastern
BEGIN:STANDARD
DTSTART:20001029T020000
RRULE:FREQ=YEARLY;BYDAY=-1SU;BYMONTH=10;UNTIL=20061029T060000Z
TZNAME:EST
TZOFFSETFROM:-0400
TZOFFSETTO:-0500
END:STANDARD
BEGIN:STANDARD
DTSTART:20071104T020000
RRULE:FREQ=YEARLY;BYDAY=1SU;BYMONTH=11
TZNAME:EST
TZOFFSETFROM:-0400
TZOFFSETTO:-0500
END:STANDARD
BEGIN:DAYLIGHT
DTSTART:20000402T020000
RRULE:FREQ=YEARLY;BYDAY=1SU;BYMONTH=4;UNTIL=20060402T070000Z
TZNAME:EDT
TZOFFSETFROM:-0500
TZOFFSETTO:-0400
END:DAYLIGHT
BEGIN:DAYLIGHT
DTSTART:20070311T020000
RRULE:FREQ=YEARLY;BYDAY=2SU;BYMONTH=3
TZNAME:EDT
TZOFFSETFROM:-0500
TZOFFSETTO:-0400
END:DAYLIGHT
END:VTIMEZONE
BEGIN:VEVENT
UID:pretalx-cfp-X8NXE3@nyc2024.pydata.org
DTSTART;TZID=US/Eastern:20241108T114000
DTEND;TZID=US/Eastern:20241108T122000
DESCRIPTION:Preparing data for LLM pretraining is the most challenging an
 d time-consuming task. Pretraining data is usually scraped from the inter
 net\, which is full of duplicates and undesired content such as hate\, ab
 use\, and profanity.\n\nTo produce a quality model\, the collected data n
 eeds to go through a series of transformations that improve its qualit
 y\, add attributes (e.g. detect language)\, make it safe (e.g. remove spa
 m)\, and put it in the common format expected by the training module.\n\n
 Data Prep Kit (https://github.com/IBM/data-prep-kit)\, aka DPK\, is an op
 en source project created by IBM to transform input data collected from t
 he internet (https://commoncrawl.org/) into data ready for training.
DTSTAMP:20250709T215032Z
LOCATION:Music Box
SUMMARY:Preparing data for LLM training with Data Prep Kit - Santosh Borse
URL:https://nyc2024.pydata.org/cfp/talk/X8NXE3/
END:VEVENT
END:VCALENDAR